
Repeating the LLM vs Advent of Code experiment

Last year I did an experiment with ChatGPT and Advent of Code. I'm thinking of repeating it, and since last year I was criticized for my choice of model and prompt, I'm going to crowdsource them: which LLM should I use, and which one is best at writing code? What prompt should I give it?


Make sure to use o1; it's by far the best at complex reasoning and should be competitive on the later, hardest puzzles, which are the most interesting.

So, I had a little more success than you last year, and you can see my transcript here. Part of the reason is that I didn't give it a minimal prompt. Try to give it full context for what it's doing - this is an LLM, not a Google search, and brevity hurts it. And don't "help" it by removing the story from the problem - after all, English comprehension is its strength. Tell it, up front, exactly how you're going to interact with it: it can "think step by step", it can try some experiments on its own, but you won't help it in any way. The only thing you'll do is run the code it gives you, or submit an answer to the site, telling it the (exact) error message that AoC generates.

To reiterate, give it all the information a human solving AoC would have. That's the fairest test.
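
For concreteness, here's roughly what that kind of prompt could look like as a system message. The wording below is just an illustration of the interaction rules described above, not a tested prompt:

```python
# Illustrative only: a system prompt spelling out the interaction rules above.
SYSTEM_PROMPT = """You are solving an Advent of Code puzzle. I will paste the
full problem statement, story and all, followed by my puzzle input.

Rules of engagement:
- You may think step by step and try experiments on your own.
- I will not help you in any way. The only things I will do are:
  1. run the code you give me and paste back its exact output, or
  2. submit an answer to the site and paste back the exact message
     Advent of Code returns.

Write a complete, runnable program that reads the input and prints the answer."""
```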

My prediction is that o1 will do better (of course), maybe solving a few in the day 10-20 range. However, I think it'll still struggle with certain puzzles, and with debugging, especially when text output (or a diagram in the input) needs to be parsed character by character. This is a fundamental problem with LLMs: textual output that looks well-formatted and readable to us is fed into the LLM as a gobbledegook mixture of tokens, and it just has no clue how to process it (but, sadly, pretends that it can). This is related to how they have trouble with anagrams or spelling questions (e.g. how many Rs are in "strawberry"). I wonder if there's some way we could process text output so it tokenizes properly.
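
One cheap way to see (and partially work around) the tokenization issue, assuming the tiktoken package, is to compare how an ASCII grid tokenizes as-is versus with each character forced into its own token by spacing:

```python
# Assumes OpenAI's tiktoken tokenizer library is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

grid = "#..#\n.##.\n#..#"   # a small ASCII diagram of the kind AoC loves
spaced = " ".join(grid)     # insert spaces so each character tends to get its own token

print(enc.encode(grid))     # a few opaque multi-character tokens
print(enc.encode(spaced))   # roughly one token per character, easier for the model to "see"
```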

The new Qwen 2.5 32B dropped, and people are saying it's roughly as good as the newest Sonnet for coding. I don't know how easy it is to get access to Qwen, but it's Chinese so it should be cheap, and it's open source...

Might be good to try it out as a comparison, to see if it's really that good or if they've just been benchmark hacking. But Sonnet is the most obvious pick IMO.

I have been working on my own reasoning-with-code framework. If I get a proof of concept up, I'll try it on this year's Advent of Code!

There is an AI track in the Meta Hacker Cup this year. I don't know exactly how it works, but it might be helpful to check which techniques the more successful participants used.

I second the recommendation of Anthropic's 3.5 Sonnet; it's much better than OpenAI's models. For the prompts, I would be interested in 0-shot, instructions-as-written, and also in what results you get if you follow up any output that doesn't work once with "That didn't work, [I get this error: "..."]/[the result doesn't match the instructions]. Analyze what went wrong and suggest improvements."

In my experience, doing that follow-up once fixes quite a few problems, but there are diminishing returns after the first time. If there are persistent problems, I have to stop and think about what could be wrong, then direct Sonnet accordingly to get it to progress.
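
As a rough sketch of that loop (assuming the anthropic Python SDK; the model alias, file names, and single retry are placeholder choices, not a recommendation):

```python
# Sketch only: generate code with Sonnet, run it, and send one follow-up on failure.
import subprocess, sys
import anthropic  # assumes the official anthropic SDK and ANTHROPIC_API_KEY in the env

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # assumed alias; substitute whatever version is current

messages = [{"role": "user", "content": open("puzzle.txt").read()}]  # hypothetical input file

for attempt in range(2):  # initial attempt plus one follow-up, per the comment above
    reply = client.messages.create(model=MODEL, max_tokens=4096, messages=messages)
    code = reply.content[0].text  # assumes the reply is just a program, no surrounding prose
    open("solution.py", "w").write(code)
    run = subprocess.run([sys.executable, "solution.py"],
                         capture_output=True, text=True)
    if run.returncode == 0:
        print(run.stdout)
        break
    messages += [
        {"role": "assistant", "content": code},
        {"role": "user", "content": f'That didn\'t work, I get this error: "{run.stderr}". '
                                    'Analyze what went wrong and suggest improvements.'},
    ]
```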

I'd say do at least 3.5 Sonnet and whichever o1 model is out by then. Sonnet is the best "classical" code LLM (IMO!), though you may have to prompt it pretty hard to get it to attempt a one-shot. But o1 is designed for one-shots and is the only one that may be a paradigm shift in AI design. It's been worse than Sonnet at some tasks, but this may play to its strengths. Also, if you're adding a Python interpreter, implore the models to add timeouts. :)
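
Whether you ask the model to add timeouts to its own code or enforce one in the harness, the harness-side version is only a few lines. A minimal sketch (the 60-second limit and file name are arbitrary):

```python
import subprocess, sys

try:
    result = subprocess.run(
        [sys.executable, "solution.py"],   # hypothetical file containing the model's code
        capture_output=True, text=True, timeout=60,
    )
    print(result.stdout or result.stderr)
except subprocess.TimeoutExpired:
    print("Timed out after 60 seconds -- report that back to the model.")
```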