LLMs do Advent of Code 2024
This is a repeat of my experiment from last year of using ChatGPT to solve the Advent of Code problems. Much of the premise is the same, and I won't repeat myself explaining what Advent of Code is and why I'm doing this.
Instead I will explain what I did differently this year:
- Instead of ChatGPT 4 I used ChatGPT 4o. I still used the paid API version with a command-line client instead of paying for the Plus version; however, just like last year, youtuber Martin Zikmund did a similar thing using ChatGPT Plus, so I didn't need to. He also used o1 where possible, and as far as I can tell it did not make any difference.
- A couple of months ago I asked if I should use a different LLM for this little experiment, and the most suggested alternative was Claude 3.5-sonnet, so I did that.
I also updated the prompt I used by stealing the one that user @SnapDragon used last year, only changing it a little. This is the prompt I used for Claude:
Are you familiar with Advent of Code? I'd like to see if you can solve one of this year's problems. I'll provide the problem statement and input data. I'd like you to think step by step about the problem. At any time you can output code for me to run on the input in python; I'll tell you what the result is, including any debug output. Or you can run it yourself with Code Interpreter. I'll be your interface with the problem site, but I won't help you think about the problem. Does this sound like something you'll be able to do?
The ChatGPT prompt was more or less identical, minus the mention of Code Interpreter (which I didn't have, since it's Plus-only).
Claude-3.5-sonnet Results
Day | Part 1 | Part 2 | Notes |
---|---|---|---|
day 01 | OK | OK | |
day 02 | OK | OK | |
day 03 | OK | OK | |
day 04 | OK | OK | |
day 05 | OK | OK | |
day 06 | OK | OK | |
day 07 | OK | OK | |
day 08 | OK | OK | |
day 09 | OK | OK | Quadratic solution for both parts |
day 10 | OK | OK | |
day 11 | OK | OK | |
day 12 | OK | FAIL | |
day 13 | OK | FAIL | |
day 14 | OK | --- | see the day 14 note below |
day 15 | FAIL | N/A | |
day 16 | OK | OK | |
day 17 | OK | FAIL | |
day 18 | OK | OK | |
day 19 | OK | OK | |
day 20 | FAIL | N/A | |
day 21 | FAIL | N/A | |
day 22 | OK | FAIL | |
day 23 | OK | OK | |
day 24 | OK | FAIL | |
day 25 | OK | --- | part 2 of day 25 does not exist |
Days fully finished: 15 / 25 (excludes the 25th)
ChatGPT-4o Results
Day | Part 1 | Part 2 | Notes |
---|---|---|---|
day 01 | OK | OK | |
day 02 | OK | OK | |
day 03 | OK | OK | |
day 04 | OK | OK | |
day 05 | OK | OK | |
day 06 | OK | OK | |
day 07 | OK | OK | |
day 08 | FAIL | N/A | |
day 09 | FAIL | FAIL | |
day 10 | OK | OK | |
day 11 | FAIL | N/A | ChatGPT Plus could solve it |
day 12 | OK | FAIL | |
day 13 | OK | FAIL | |
day 14 | OK | --- | see the day 14 note below |
day 15 | FAIL | N/A | |
day 16 | OK | FAIL | ChatGPT Plus could solve it |
day 17 | OK | FAIL | |
day 18 | OK | OK | |
day 19 | OK | OK | |
day 20 | FAIL | N/A | |
day 21 | FAIL | N/A | |
day 22 | OK | FAIL | ChatGPT Plus could solve it |
day 23 | OK | OK | |
day 24 | OK | FAIL | |
day 25 | OK | --- | part 2 of day 25 does not exist |
Days fully finished: 11 / 25 (excludes the 25th)
Days fully finished, ChatGPT Plus: 14 / 25 (excludes the 25th)
Discussion of the results
I went into this experiment expecting the LLMs to do about as well as last year, that is: finish 2-4 easy days in the first week and maybe a part 1 here and there during the rest of the month.
The data, however, proved me wrong: they did a lot better than I expected, especially Claude. Now, there is a snag: this year felt much easier than previous years. Quantifying difficulty is always hard, but the scatter plot seems to confirm this, as does an image I stole from 4chan, which measures the same thing (fill time for the global leaderboard) in a way I find clearer.
If we count the number of days the leaderboard filled in under 10 minutes, we get this (I excluded 2015 and 2016 because far fewer people participated, and day 25 because it's a half day):
Year | Days under 10 minutes |
---|---|
2017 | 8 |
2018 | 2 |
2019 | 2 |
2020 | 10 |
2021 | 8 |
2022 | 8 |
2023 | 6 |
2024 | 13 |
This may be deliberate, since this was the 10-year anniversary of AoC. Nevertheless, the improvement is hard to deny: both ChatGPT and Claude could get all the easy problems and a couple of the medium ones, whereas last year ChatGPT could not even get all the easy ones.
This is enough to make me think that, if Eric wants to keep the integrity of the global leaderboard intact, he should start requiring serious contenders to provide livestreamed proof, as the speedrunning community is starting to do.
Various notes
- Day 14 part 2 asked for the first output configuration containing a christmas tree. This is essentially impossible for an LLM to solve independently, since the problem didn't even specify what the christmas tree would look like, and there are many plausible ways to draw a christmas tree with pixel art. That said, human participants faced the same problem, and the LLMs could have come up with a heuristic to find possible candidates and asked the human in the loop (me) to verify; they didn't, so I still count it as a failure.
- When Claude tried to use its code interpreter, it often started writing Python, which Claude's code interpreter does not support (it's JavaScript-only). This seems like a piece of jank they should fix.
- While the quantitative data does not really support either Claude or ChatGPT being better, my subjective impression is that Claude usually got to the solution faster and wrote better code (in the few cases where there was a quality difference).
- Difficulty is hard to estimate: day 9 seemed as easy as, if not easier than, day 8 to me, and yet it took twice as long for the leaderboard to fill, ChatGPT couldn't finish it, and Claude only barely solved it.
- Still speaking of day 9: it was one of the only two non-easy days solved by Claude, but this is almost a technicality. The program Claude produced was quadratic garbage, and if the input file had been a little bit bigger it would have taken hours to finish.
- The day 15 problem was very, very similar to Sokoban, but with a slightly different rule set. This was fatal to both ChatGPT and Claude. Much like with poker in 2023, the Sokoban attractor was too strong to overcome: they just kept implementing a "no more than one box at a time" rule that does not exist anywhere in the problem statement (a sketch of the rule they kept missing follows these notes).
- Last year I complained that ChatGPT never attempted any kind of debugging. This year I have been proven wrong: for day 5, ChatGPT struggled to get part 1 right, rewrote the program to include debug print statements, and eventually figured out how to solve the problem (even though the final code was fairly unreadable).
- Day 17 part 2 gives you a program (specified as a series of assembly instructions for a made-up ISA) and asks you to figure out the input that turns the program into a quine. Neither LLM could solve it; however, Claude got really close before running into its character limit. This is the last output it produced:
I'm starting to see a pattern! Each loop:
  - The program starts by setting B to A % 8
  - Through a series of operations, it manipulates B
  - The program then divides A by 8 (2^3)
  - It outputs the value of B (mod 8)
  - If A is not zero, it jumps back to the start
The program seems to take a number in register A, and output digits from it by repeatedly dividing by 8 and using the remainder in each step to generate output through the B register. Let's try to find a pattern of outputs for some carefully chosen A values:
I don't know if lifting the character limit would have let it solve the problem, but I admit this was the most impressed I've been with an LLM ever since ChatGPT 3.5. (The second sketch after these notes shows where this line of reasoning leads.)
- A similar thing happened again with Claude and day 24 part 2.
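To make the day 15 trap concrete, here is a minimal sketch of the pushing rule as I understand it (my own reconstruction, not either model's code): a push succeeds only if there is a free cell at the end of the contiguous run of boxes in front of the robot, and then the whole run shifts, no matter how many boxes it contains.

```python
# A sketch of the day 15 pushing rule as I understand it (my own
# reconstruction, not either model's code). `grid` maps (x, y) to
# '#' (wall), 'O' (box) or '.' (free); (dx, dy) is the push direction.
def try_push(grid, x, y, dx, dy):
    # Scan past the contiguous run of boxes in front of the robot.
    bx, by = x + dx, y + dy
    while grid.get((bx, by)) == 'O':
        bx, by = bx + dx, by + dy
    if grid.get((bx, by)) != '.':
        return False  # a wall behind the run: nothing moves
    # Shift the entire run at once: the far end gains a box and the cell
    # next to the robot frees up (if the run was empty, the two writes
    # cancel out). Any number of boxes can move in one push.
    grid[(bx, by)] = 'O'
    grid[(x + dx, y + dy)] = '.'
    return True  # the robot may now step into (x + dx, y + dy)
```

The phantom rule the LLMs kept adding amounts to demanding that the cell immediately behind the first box be free, i.e. classic one-box Sokoban, which the problem never asks for.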
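Similarly, for day 17: if Claude's reading above is right (each loop consumes one base-8 digit of A and emits one output value), the quine input can be reconstructed back to front, because the last output depends only on A's most significant digit. A hedged sketch of that search, where `run_program` is a hypothetical stand-in for a simulator of the made-up ISA:

```python
# Back-to-front search suggested by Claude's analysis of the day 17
# program. Assumes one base-8 digit of A is consumed per loop iteration
# and one value is output per iteration. `run_program(a)` is a
# hypothetical simulator that returns the program's output as a list.
def find_quine_input(program, run_program):
    candidates = [0]
    for i in range(1, len(program) + 1):
        next_candidates = []
        for a in candidates:
            for digit in range(8):  # try every possible next low digit
                value = a * 8 + digit
                # Keep the candidate if it reproduces the program's tail.
                if run_program(value) == program[-i:]:
                    next_candidates.append(value)
        candidates = next_candidates
    return min(candidates) if candidates else None  # smallest valid A
```

I don't claim this is what Claude was about to write, but it is the natural conclusion of the pattern it had spotted.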
Comments
I thought about how to solve it (day 14 part 2) in an automated way, and came up with the following approach. I figured that any recognizable pixel art of a christmas tree would have a lot of straight lines or filled-in spaces. So:
- For a configuration of robots, count the number of robots that have two other robots right next to them in a straight line. (E.g., for a robot at position (x,y), are there also robots at (x-1,y-1) and (x+1,y+1)?)
- Collect this measure for the n-th move, for n from 0 to 10000.
- Submit the n with the highest measure.
(If that's not the answer: repeat for another 10000 moves, etc. It turned out this wasn't necessary.)
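A minimal sketch of that measure (my own code; it generalizes the neighbour check to the four line directions, and `positions_after(n)` is a hypothetical helper that simulates n moves):

```python
# Score a configuration by counting robots that sit in the middle of a
# three-in-a-row line; recognizable pixel art should score unusually high.
def straightness(robots):
    occupied = set(robots)
    lines = [(1, 0), (0, 1), (1, 1), (1, -1)]  # horiz, vert, two diagonals
    return sum(
        1
        for (x, y) in occupied
        if any((x - dx, y - dy) in occupied and (x + dx, y + dy) in occupied
               for (dx, dy) in lines)
    )

# Collect the measure for moves 0..9999 and submit the best n:
# best_n = max(range(10000), key=lambda n: straightness(positions_after(n)))
```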
Did you really use up a whole Sonnet context window?
Personally, I try to move work into a new conversation around the time it says 'this conversation is getting long', since its IQ falls off past a certain point.
I failed 3 stars this AoC (the first AoC I've done). That's still quite a bit better than the AIs, but I'm definitely feeling the heat. Thankfully I'm not all in on programming (coding isn't even a primary part of my job).
AoC was quite a lot of fun regardless, even if I got a bit tired of manipulating 2D arrays before the end. Joke's on me for not using a better language or writing more general code, I guess.
How good are these solutions? I find that the problems seem quite easy to solve (I have only looked at a few of them so far, though!), but that memory- and time-efficient solutions take a bit more thinking and coding. Something which surprised me here is that the outputs weren't too long for ChatGPT. I have never tried giving it the full input, as I simply expected it to be too many tokens.
Aside from day 9, which was bad, I haven't noticed anything: I don't recall any other problem this year where a suboptimal solution mattered. As far as micro-optimizations go: I asked it to write Python, so they were all shit; they couldn't have been much better.
I'm not sure what you mean about the outputs being too long. But I didn't give the full input to ChatGPT; I did give it to Claude as an attachment.
For day 6 part 2, you can solve it left-to-right (slow) or right-to-left (fast) using recursion.
Left-to-right takes 120ms on mine (C#); right-to-left would probably take less than 10ms (too lazy to test it now). I doubt an LLM's code would be anywhere near as fast as either of these.
While Python is quite slow inherently, algorithmic performance matters as well.
My bad, I meant that the inputs seem too long. But I suppose you can just describe the input structure, size, and possible characters, and have the LLM write a program that works for all possible valid inputs.
Did you mean day 9 part 2 where you wrote day 6 part 2? The code for day 9 part 2 was very slow because it looped over decreasing file ids; for each file id, it scanned the array to find where the file was, did a second scan to find an empty spot, and then actually moved the file.
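The shape of that quadratic approach looks roughly like this (a sketch of the structure described above, not Claude's literal code; `disk` is a list holding a file id per cell, or None for free space):

```python
# Quadratic day 9 part 2 compaction: two full scans of the disk for
# every file id, hence O(n^2) overall. Assumes every id in
# 0..max_file_id occurs on the disk.
def compact_files_quadratic(disk, max_file_id):
    for fid in range(max_file_id, -1, -1):  # decreasing file ids
        # First scan: locate the file's cells.
        cells = [i for i, v in enumerate(disk) if v == fid]
        start, size = cells[0], len(cells)
        # Second scan: leftmost free run big enough, left of the file.
        run = 0
        for i in range(start):
            run = run + 1 if disk[i] is None else 0
            if run == size:
                for j in range(size):  # actually move the file
                    disk[i - size + 1 + j] = fid
                    disk[start + j] = None
                break
    return disk
```

Each file id rescans essentially the whole array twice, which is why a slightly larger input would have pushed the runtime into hours.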
I find it really, really hard to believe that even o1 or o1-mini didn't improve performance over GPT-4o. What does "used o1 where possible" mean? He didn't use it on day 8, for instance.
I'd certainly like to see the problems the other LLMs struggled with run through o1.
What's stopping you?
Well, that I don't have an OAI subscription and wouldn't normally gain enough utility from one to justify the expense, really.