Last year I did an experiment with ChatGPT and Advent of Code. I was thinking of repeating it and since last year I was criticized for choice of model and prompt I'm going to crowdsource them: which LLM should I use, which one is best at writing code? What prompt should I give it?
What is this place?
This website is a place for people who want to move past shady thinking and test their ideas in a
court of people who don't all share the same biases. Our goal is to
optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.
The weekly Culture War threads host the most
controversial topics and are the most visible aspect of The Motte. However, many other topics are
appropriate here. We encourage people to post anything related to science, politics, or philosophy;
if in doubt, post!
Check out The Vault for an archive of old quality posts.
You are encouraged to crosspost these elsewhere.
Why are you called The Motte?
A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently,
it's an element in a rhetorical move called a "Motte-and-Bailey",
originally identified by
philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial
but high value claim to a defensible but less exciting one upon any resistance to the former. He likens
this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for
the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired
propositions to which one retreats when hard pressed."
On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.
New post guidelines
If you're posting something that isn't related to the culture war, we encourage you to post a thread for it.
A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts
such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a
submission statement. A submission statement is required for non-text sources (videos, podcasts, images).
Culture war posts go in the culture war thread; all links must either include a submission statement or
significant commentary. Bare links without those will be removed.
If in doubt, please post it!
Rules
- Courtesy
- Content
- Engagement
- When disagreeing with someone, state your objections explicitly.
- Proactively provide evidence in proportion to how partisan and inflammatory your claim might be.
- Accept temporary bans as a time-out, and don't attempt to rejoin the conversation until it's lifted.
- Don't attempt to build consensus or enforce ideological conformity.
- Write like everyone is reading and you want them to be included in the discussion.
- The Wildcard Rule
- The Metarule
Jump in the discussion.
No email address required.
Notes -
make sure to use o1, it's by far the best at complex reasoning and should be competitive at the later hardest ones, which are the most interesting
More options
Context Copy link
So, I had a little more success than you last year, and you can see my transcript here. Part of the reason is that I didn't give it a minimal prompt. Try to give it full context for what it's doing - this is an LLM, not a Google search, and brevity hurts it. And don't "help" it by removing the story from the problem - after all, English comprehension is its strength. Tell it, up front, exactly how you're going to interact with it: it can "think step by step", it can try some experiments on its own, but you won't help it in any way. The only thing you'll do is run the code it gives you, or submit an answer to the site, telling it the (exact) error message that AoC generates.
To reiterate, give it all the information a human solving AoC would have. That's the fairest test.
My prediction is that o1 will do better (of course), maybe solving a few in the day 10-20 range. However, I think it'll still have problems with certain problems, and with debugging, especially when text output (or a diagram in the input) needs to be parsed character-by-character. This is a fundamental problem with LLMs: textual output that looks well-formatted and readable to us is fed into the LLM as a gobbledegook mixture of tokens, and it just has no clue how to process it (but, sadly, pretends that it can). This is related to how they have trouble with anagrams or spelling questions (e.g. how many Rs are in "strawberry"). I wonder if there's some way we could process text output so it tokenizes properly.
More options
Context Copy link
The new Qwen 2.5 32B dropped, people are saying it's roughly as good as the newest Sonnet for coding. I don't know how easy it is to get access to Qwen but it is Chinese and should be cheap, it is open source...
Might be good to try out as a comparison, see if it's really that good or if they've just been benchmark hacking? But Sonnet is the most obvious pick IMO.
More options
Context Copy link
I have been working on my own reasoning-with-code framework. If I get a proof of concept up I'll try it on this year's advent of code!
More options
Context Copy link
There is an AI track in the Meta Hacker Cup this year. I don't know exactly how it works, but it might be helpful to check which techniques the more successful participants used.
More options
Context Copy link
I second the recommendation of Anthropic's 3.5 sonnet, it's much better than OpenAI's models. For the prompts, I would be interested in 0-shot instructions-as-written, and also what results you get if you follow up any output that doesn't work once with "That didn't work, [I get this error: "..."]/[the result doesn't match instructions]. Analyze what went wrong and suggest improvements."
In my experience, doing that follow-up once fixes quite a few problems, but there are diminishing returns after the first time. If there are persistent problems, I have to stop and think on what could be wrong and direct sonnet accordingly to get it to progress.
More options
Context Copy link
I'd say do at least 3.5 Sonnet and whichever model of o1 is out by then. Sonnet is the best "classical" code llm (imo!), though you may have to prompt it pretty hard to get it to try a oneshot. But o1 is designed for oneshots and is the only one that may be a paradigm shift in ai design. It's been worse than sonnet at some tasks, but this may play to its strengths. Also if adding a Python interpreter, implore the models to add timeouts. :)
More options
Context Copy link