Repeating the LLM vs Advent of Code experiment

Last year I did an experiment with ChatGPT and Advent of Code. I'm thinking of repeating it, and since I was criticized last year for my choice of model and prompt, I'm going to crowdsource them: which LLM should I use, i.e. which one is best at writing code? And what prompt should I give it?

Make sure to use o1; it's by far the best at complex reasoning and should be competitive on the later, harder puzzles, which are the most interesting.

So, I had a little more success than you last year, and you can see my transcript here. Part of the reason is that I didn't give it a minimal prompt. Try to give it full context for what it's doing - this is an LLM, not a Google search, and brevity hurts it. And don't "help" it by removing the story from the problem - after all, English comprehension is its strength. Tell it, up front, exactly how you're going to interact with it: it can "think step by step", it can try some experiments on its own, but you won't help it in any way. The only thing you'll do is run the code it gives you, or submit an answer to the site, telling it the (exact) error message that AoC generates.

To reiterate, give it all the information a human solving AoC would have. That's the fairest test.
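For illustration, here's a rough sketch of what that kind of full-context prompt could look like. The wording and the `SYSTEM_PROMPT`/`build_prompt` names are illustrative, not the commenter's actual prompt:

```python
# A rough sketch of the kind of full-context prompt described above.
# The exact wording is illustrative, not a tested recipe.
SYSTEM_PROMPT = """\
You are solving an Advent of Code puzzle. I will paste the full problem
statement, story and all, exactly as it appears on the site.

Rules of engagement:
- You may think step by step and run small experiments of your own.
- I will not give you hints or fix your code.
- The only things I will do are: run the program you give me and paste
  its output back, or submit your answer and paste the site's exact
  error message.

Reply with a complete, runnable program that reads the puzzle input
from a file named "input.txt" and prints the answer.
"""

def build_prompt(problem_statement: str) -> str:
    """Combine the interaction rules with the untrimmed problem text."""
    return f"{SYSTEM_PROMPT}\n--- PROBLEM ---\n{problem_statement}"
```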

My prediction is that o1 will do better (of course), maybe solving a few in the day 10-20 range. However, I think it'll still struggle with certain problems, and with debugging, especially when text output (or a diagram in the input) needs to be parsed character-by-character. This is a fundamental problem with LLMs: textual output that looks well-formatted and readable to us is fed into the LLM as a gobbledegook mixture of tokens, and it just has no clue how to process it (but, sadly, pretends that it can). This is related to how they have trouble with anagrams or spelling questions (e.g. how many Rs are in "strawberry"). I wonder if there's some way we could process text output so it tokenizes properly.
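One speculative workaround, sketched below: re-encode grid-like text so each cell becomes its own token-friendly unit, either by space-separating the characters or by flattening the grid into explicit coordinates. The function names here are just illustrative:

```python
def respace_grid(text: str) -> str:
    """Insert spaces between characters so each grid cell is more likely
    to become its own token instead of merging with its neighbours."""
    return "\n".join(" ".join(line) for line in text.splitlines())

def grid_to_coords(text: str) -> list[tuple[int, int, str]]:
    """Alternatively, flatten a character grid into explicit
    (row, col, char) triples, skipping '.' background cells."""
    return [
        (r, c, ch)
        for r, line in enumerate(text.splitlines())
        for c, ch in enumerate(line)
        if ch != "."
    ]

# Example:
# respace_grid("#.#\n.#.")  ->  "# . #\n. # ."
# grid_to_coords("#.#\n.#.") ->  [(0, 0, '#'), (0, 2, '#'), (1, 1, '#')]
```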

The new Qwen 2.5 Coder 32B just dropped; people are saying it's roughly as good as the newest Sonnet for coding. I don't know how easy it is to get access to Qwen, but it's Chinese so it should be cheap, and it is open source...

Might be good to try it out as a comparison and see if it's really that good, or if they've just been benchmark hacking. But Sonnet is the most obvious pick IMO.

I have been working on my own reasoning-with-code framework. If I get a proof of concept up, I'll try it on this year's Advent of Code!

There is an AI track in the Meta Hacker Cup this year. I don't know exactly how it works, but it might be helpful to check which techniques the more successful participants used.

I second the recommendation of Anthropic's 3.5 Sonnet; it's much better than OpenAI's models. For the prompts, I would be interested in 0-shot instructions-as-written, and also in what results you get if you follow up any output that doesn't work, once, with: "That didn't work, [I get this error: "..."] / [the result doesn't match the instructions]. Analyze what went wrong and suggest improvements."

In my experience, doing that follow-up once fixes quite a few problems, but there are diminishing returns after the first time. If there are persistent problems, I have to stop and think about what could be wrong and direct Sonnet accordingly to get it to progress.
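A rough sketch of that single-retry loop. Here `call_model` and `run_candidate` are hypothetical placeholders for whatever API client and sandboxed runner you actually use, and the follow-up wording is adapted from the comment above:

```python
# Sketch of the single-retry strategy described above. `call_model` and
# `run_candidate` are hypothetical placeholders, not a real API.
def solve_with_one_retry(problem: str, call_model, run_candidate):
    messages = [{"role": "user", "content": problem}]
    reply = call_model(messages)
    ok, error = run_candidate(reply)
    if ok:
        return reply

    # One follow-up with the concrete failure, then stop: further blind
    # retries tend to hit diminishing returns.
    messages += [
        {"role": "assistant", "content": reply},
        {"role": "user", "content":
            f"That didn't work, I get this error: {error}. "
            "Analyze what went wrong and suggest improvements."},
    ]
    return call_model(messages)
```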

I'd say do at least 3.5 Sonnet and whichever model of o1 is out by then. Sonnet is the best "classical" code LLM (IMO!), though you may have to prompt it pretty hard to get it to attempt a one-shot solution. But o1 is designed for one-shots and is the only one that may represent a paradigm shift in AI design. It's been worse than Sonnet at some tasks, but this may play to its strengths. Also, if you're adding a Python interpreter, implore the models to add timeouts. :)
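As a belt-and-braces measure, the harness itself can enforce a timeout whether or not the model remembers to add one. A minimal sketch using `subprocess`; the `run_generated_code` name and the 30-second limit are arbitrary choices:

```python
import subprocess
import sys

def run_generated_code(path: str, timeout_s: int = 30) -> str:
    """Run a model-generated script with a hard wall-clock timeout, so an
    accidental infinite loop can't hang the whole experiment."""
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return f"TIMEOUT: script exceeded {timeout_s}s"
```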