LLMs vs Advent of Code 2024

LLMs do Advent of Code 2024

This is a repeat of my experiment from last year of using ChatGPT to solve the Advent of Code problems. Much of the premise is the same, and I'm not going to repeat myself explaining what Advent of Code is and why I'm doing this.

Instead I will explain what I did differently this year:

  1. Instead of ChatGPT 4 I used ChatGPT 4o. I still used the paid API version with a command line client instead of paying for the Plus version. However, just like last year, youtuber Martin Zikmund did a similar thing using ChatGPT Plus, so I didn't need to. He also used o1 where possible and it did not make any difference as far as I can tell.

  2. A couple of months ago I asked if I should use a different LLM to run this little experiment, and the most suggested alternative was Claude 3.5-sonnet, so I did that.

I also updated the prompt by stealing the one that user @SnapDragon used last year, changing it only a little. This was the prompt I used for Claude:

Are you familiar with Advent of Code? I'd like to see if you can solve one of this year's problems. I'll provide the problem statement and input data. I'd like you think step by step about the problem. At any time you can output code for me to run on the input in python; I'll tell you what the result is, including any debut output. Or you can run it yourself with Code Interpreter. I'll be your interface with the problem site, but I won't help you think about the problem. Does this sound like something you'll be able to do?

The ChatGPT prompt was more or less identical, just without the mention of the Code Interpreter (which I didn't have because it is Plus only).

Claude-3.5-sonnet Results

        Part 1  Part 2  Notes
day 01  OK      OK
day 02  OK      OK
day 03  OK      OK
day 04  OK      OK
day 05  OK      OK
day 06  OK      OK
day 07  OK      OK
day 08  OK      OK
day 09  OK      OK      Quadratic solution for both parts
day 10  OK      OK
day 11  OK      OK
day 12  OK      FAIL
day 13  OK      FAIL
day 14  OK      ---
day 15  FAIL    N/A
day 16  OK      OK
day 17  OK      FAIL
day 18  OK      OK
day 19  OK      OK
day 20  FAIL    N/A
day 21  FAIL    N/A
day 22  OK      FAIL
day 23  OK      OK
day 24  OK      FAIL
day 25  OK              (note that part 2 of day 25 does not exist)

Days fully finished: 15 / 25 (excludes the 25th)

ChatGPT-4o Results

        Part 1  Part 2  Notes
day 01  OK      OK
day 02  OK      OK
day 03  OK      OK
day 04  OK      OK
day 05  OK      OK
day 06  OK      OK
day 07  OK      OK
day 08  FAIL    N/A
day 09  FAIL    FAIL
day 10  OK      OK
day 11  FAIL    N/A     ChatGPT Plus could solve it
day 12  OK      FAIL
day 13  OK      FAIL
day 14  OK      N/A
day 15  FAIL    N/A
day 16  OK      FAIL    ChatGPT Plus could solve it
day 17  OK      FAIL
day 18  OK      OK
day 19  OK      OK
day 20  FAIL    N/A
day 21  FAIL    N/A
day 22  OK      FAIL    ChatGPT Plus could solve it
day 23  OK      OK
day 24  OK      FAIL
day 25  OK              (note that part 2 of day 25 does not exist)

Days fully finished: 11 / 25 (excludes the 25th)

Days fully finished, ChatGPT Plus: 14 / 25 (excludes the 25th)

Discussion of the results

I went into this experiment expecting the LLMs to do about as well as last year, that is: finish 2 to 4 easy days in the first week and maybe a part 1 here and there during the rest of the month.

The data, however, proved me wrong: they did a lot better than I expected, especially Claude. Now, there is a snag to this: this year felt much easier than previous years. Quantifying difficulty is always hard, but the scatter plot seems to confirm this, as does this image I stole from 4chan, which measures the same thing (fill time for the global leaderboard) but in a way that I find clearer.

If we count the number of days on which the leaderboard filled in under 10 minutes we get this (I excluded 2015 and 2016 because far fewer people participated, and day 25 because it's a half day):

Year  Days under 10 min
2017  8
2018  2
2019  2
2020  10
2021  8
2022  8
2023  6
2024  13

This may be deliberate, since this was the 10-year anniversary of AoC. Nevertheless, the improvement is hard to deny: both ChatGPT and Claude could get all the easy problems and a couple of the medium ones, where last year ChatGPT could not even get all the easy ones.

This is enough that it makes me think that if Eric wants to keep the integrity of the global leaderboard intact he should start requiring serious participants to provide livestreamed proof, like they are starting to do in the speedrunning community.

Various notes

  • Day 14 part 2 asked to find the first output configuration containing a Christmas tree. This is essentially impossible for an LLM to solve independently, since the problem didn't even specify what the Christmas tree would look like and there are many plausible ways to draw a Christmas tree in pixel art. That said, human participants faced the same problem, and the LLMs could have come up with a heuristic to find possible candidates and asked the human in the loop (me) to verify. They didn't, so I still count it as a failure.
  • When Claude tried to use its code interpreter it often started doing it in Python, which Claude's code interpreter does not support (it's JavaScript only). This seems like a piece of jank they should be fixing.
  • While the quantitative data does not really support either Claude or ChatGPT being better, my subjective impression is that Claude usually got to the solution faster and wrote better code (in the few cases where there was a quality difference).
  • Difficulty is hard to estimate. Day 9 seemed as easy as day 8 to me, if not easier, and yet it took twice as long for the leaderboard to fill, ChatGPT couldn't finish it and Claude only barely solved it.
  • Still speaking of day 9, it was one of only two non-easy days solved by Claude, but this is almost a technicality: the program produced by Claude was quadratic garbage, and if the input file had been a little bit bigger it would have taken hours for the program to finish.
  • The day 15 problem was very similar to an implementation of Sokoban but with a slightly different rule set. This was fatal to both ChatGPT and Claude. Much like with poker in 2023, the Sokoban attractor was too strong to overcome: they just kept implementing the "no more than one box at a time" rule that does not exist anywhere in the problem's prompt.
  • Last year I complained that ChatGPT never attempted to do any kind of debugging. This year I have been proven wrong: for day 5 ChatGPT struggled to get part 1 right, rewrote the program to include debug print statements and eventually figured out how to solve the problem (even though the final code was fairly unreadable).
  • Day 17 part 2 gives you a program (specified as a series of assembly instructions for a made-up ISA) and asks you to figure out the input that turns the program into a quine. Neither LLM could solve it; however, Claude got really close before running into its character limit. This is the last output it produced:

I'm starting to see a pattern! Each loop:

  1. The program starts by setting B to A % 8
  2. Through a series of operations, it manipulates B
  3. The program then divides A by 8 (2^3)
  4. It outputs the value of B (mod 8)
  5. If A is not zero, it jumps back to the start

The program seems to take a number in register A, and output digits from it by repeatedly dividing by 8 and using the remainder in each step to generate output through the B register. Let's try to find a pattern of outputs for some carefully chosen A values:

I don't know if lifting the character limit would make it solve the problem but I admit that this was the most impressed I've been with an LLM ever since ChatGPT 3.5.

  • A similar thing happened again with Claude and day 24 part 2.
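
The loop structure Claude described for day 17 (one output digit per iteration, A divided by 8 each time) suggests a backward search: build A three bits at a time, starting from the last digit of the target. What follows is only a minimal sketch of that idea, not anything either LLM produced. `toy_step` is a made-up stand-in for the "series of operations" on B, since those are specific to each puzzle input, and `run` and `find_input` are hypothetical helper names.

```python
from typing import Callable, List, Optional


def toy_step(a: int) -> int:
    """Hypothetical loop body: derives one output digit from register A."""
    b = a % 8
    b ^= 5
    c = a >> b
    b ^= c
    b ^= 6
    return b % 8


def run(a: int, step: Callable[[int], int]) -> List[int]:
    """Emit one digit per iteration, dividing A by 8 until it reaches zero."""
    out = []
    while True:
        out.append(step(a))
        a //= 8
        if a == 0:
            return out


def find_input(target: List[int], step: Callable[[int], int]) -> Optional[int]:
    """Build A three bits at a time, matching the target from its last digit."""

    def search(a: int, idx: int) -> Optional[int]:
        if idx < 0:
            return a
        for low_bits in range(8):
            candidate = a * 8 + low_bits
            if candidate == 0:
                continue  # a leading zero chunk would shorten the output
            if step(candidate) == target[idx]:
                found = search(candidate, idx - 1)
                if found is not None:
                    return found
        return None  # dead end: backtrack to the previous chunk

    return search(0, len(target) - 1)


# Round trip on the toy program: recover an input that reproduces the output.
known = 117440
target = run(known, toy_step)
found = find_input(target, toy_step)
```

This works because the digit emitted in each iteration depends only on the current value of A, whose higher bits are already fixed by the time a chunk is chosen. Trying the low bits in ascending order means the first hit is the smallest A that reproduces the target.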

For day 6 part 2, you can solve it left-to-right (slow) or right-to-left (fast) using recursion. Left takes 120ms on mine (C#); right would probably take less than 10ms (too lazy to test it now). I doubt an LLM's code will be anywhere near as fast as either of these. While Python is inherently quite slow, algorithmic performance matters as well.

I'm not sure what you mean that the outputs are too long

My bad, I meant that the inputs seem too long. But I suppose you can just describe the input structure, size and possible characters and have the LLM write a program that works for all possible valid inputs.

Did you mean day 9 part 2 where you wrote day 6 part 2? The code for day 9 part 2 was very slow because it looped over decreasing file ids and, for each file id, scanned the array to find where the file was, did a second scan to find an empty spot, and then actually moved the file.
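
For comparison, here is a rough sketch of one way to avoid those repeated scans. It is not the code Claude wrote. It assumes the day 9 part 2 rules as I understand them (each file is tried once, in decreasing id order, and moves to the leftmost free span to its left that fits) and that file lengths are single non-zero digits. `compact_checksum` is a hypothetical name. The idea is to record every file's start up front and keep one min-heap of free span starts per span size.

```python
import heapq


def compact_checksum(disk_map: str) -> int:
    files = []                      # files[file_id] = (start, length)
    free = [[] for _ in range(10)]  # free[size] = min-heap of span starts
    pos = 0
    for i, ch in enumerate(disk_map.strip()):
        n = int(ch)
        if i % 2 == 0:
            files.append((pos, n))
        elif n > 0:
            heapq.heappush(free[n], pos)
        pos += n

    checksum = 0
    for file_id in range(len(files) - 1, -1, -1):
        start, length = files[file_id]
        # Among the span sizes that fit, pick the one whose leftmost span
        # starts earliest, as long as it lies to the left of the file.
        best = None
        for size in range(length, 10):
            heap = free[size]
            if heap and heap[0] < start:
                if best is None or heap[0] < free[best][0]:
                    best = size
        if best is not None:
            start = heapq.heappop(free[best])
            if best > length:
                # Whatever is left of the span goes back as a smaller span.
                heapq.heappush(free[best - length], start + length)
        # Sum of block positions start .. start + length - 1, times the id.
        checksum += file_id * (start * length + length * (length - 1) // 2)
    return checksum


# Example disk map from the puzzle statement.
print(compact_checksum("2333133121414131402"))  # 2858
```

The space a file vacates never has to be returned to the heaps, because every file processed afterwards sits to its left and may only move further left. Each file then costs at most nine heap peeks plus one pop and one push, instead of two scans over the whole array.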