LLMs vs Advent of Code 2024

11 comments - 987 thread views 5mo ago by aaa (text post)

LLMs do Advent of Code 2024

This is a repeat of my experiment from last year with using ChatGPT to solve the Advent of Code problems. Much of the premise is the same and I'm not going to repeat myself explaining what's Advent of Code and why I'm doing this.

Instead I will explain what I did differently this year:

Instead of using ChatGPT 4 I used ChatGPT 4o, I still used the paid API version with a command line client instead of paying for the Plus version however, just like last year, youtuber Martin Zikmund did a similar thing using ChatGPT plus so I didn't need to. He also used o1 where possible and it did not make any difference as far as I can tell.
A couple of months ago I asked if I should use a different LLM to run this little experiment and I the most suggested alternative was Claude 3.5-sonnet so I did that.

I also stole update the prompt that I used by stealing the one that user @SnapDragon used last year, only changing it a little. This was the prompt I used for Claude:

Are you familiar with Advent of Code? I'd like to see if you can solve one of this year's problems. I'll provide the problem statement and input data. I'd like you think step by step about the problem. At any time you can output code for me to run on the input in python; I'll tell you what the result is, including any debut output. Or you can run it yourself with Code Interpreter. I'll be your interface with the problem site, but I won't help you think about the problem. Does this sound like something you'll be able to do?

The ChatGPT prompt was more or less identical, just with the mention of the Code Interpreter (which I didn't have because it is Plus only).

Claude-3.5-sonnet Results

	Part 1	Part 2	notes
day 01	OK	OK
day 02	OK	OK
day 03	OK	OK
day 04	OK	OK
day 05	OK	OK
day 06	OK	OK
day 07	OK	OK
day 08	OK	OK
day 09	OK	OK	Quadratic solution for both parts
day 10	OK	OK
day 11	OK	OK
day 12	OK	FAIL
day 13	OK	FAIL
day 14	OK	---
day 15	FAIL	N/A
day 16	OK	OK
day 17	OK	FAIL
day 18	OK	OK
day 19	OK	OK
day 20	FAIL	N/A
day 21	FAIL	N/A
day 22	OK	FAIL
day 23	OK	OK
day 24	OK	FAIL
day 25	OK		(note that part 2 of day 25 does not exist)

Days fully finished: 15 / 25 (excludes the 25th)

ChatGPT-4o Results

	Part 1	Part 2	notes
day 01	OK	OK
day 02	OK	OK
day 03	OK	OK
day 04	OK	OK
day 05	OK	OK
day 06	OK	OK
day 07	OK	OK
day 08	FAIL	N/A
day 09	FAIL	FAIL
day 10	OK	OK
day 11	FAIL	N/A	ChatGPT Plus could solve it
day 12	OK	FAIL
day 13	OK	FAIL
day 14	OK	N/A
day 15	FAIL	N/A
day 16	OK	FAIL	ChatGPT plus could solve it
day 17	OK	FAIL
day 18	OK	OK
day 19	OK	OK
day 20	FAIL	N/A
day 21	FAIL	N/A
day 22	OK	FAIL	ChatGPT plus could solve it
day 23	OK	OK
day 24	OK	FAIL
day 25	OK		(note that part 2 of day 25 does not exist)

Days fully finished: 11 / 25 (excludes the 25th) Days fully finished, ChatGPT Plus: 14 / 25 (excludes the 25th)

Discussion of the results

I went into this experiment expecting the LLMs to do about as well as last year, that is: finish a 2~4 easy days on the first week and maybe a part 1 here and there during the rest of the month.

The data however proved me wrong, they did a lot better than I expected, especially Claude. Now, there is a snag to this: this year felt much easier than previous years. Quantifying difficulty is always hard, the scatter plot seems to confirm this, as does this image I stole from 4chan which measures the same thing (fill time for the global leaderboard) but in a way that I find clearer.

If we count the number of days the leaderboard filled in under 10 minutes we get this (I excluded 2015 and 2016 because many fewer people participated and day 25th because it's a half day):

2017	8
2018	2
2019	2
2020	10
2021	8
2022	8
2023	6
2024	13

This may be deliberate since this was a 10 year anniversary of AoC. Nevertheless, the improvement is hard to deny: both ChatGPT and Claude could get all the easy problems and a couple of the medium ones, where last year ChatGPT could not even get all the easy ones.

This is enough that it makes me think that if Eric wants to keep the integrity of the global leaderboard intact he should start requiring serious participants to provide livestreamed proof, like they are starting to do in the speedrunning community.

Various notes

Day 14 part 2 asked to find the first output configuration containing a christmas tree. This is essentially impossible to solve independently for an LLM since the problem didn't even specify what the christmas tree would look like and there are many plausible ways to draw a christmas tree with pixel art. That said human participants faced the same problem and LLMs could have come up with an heuristic to find possible candidates and asked the human in the loop (me) to verify, but they couldn't so I still count it as a failure.
When Claude tried to use its code interpreter it often started doing it in python, which Claude's code interpreter does not support (it's javascript only). This seems like a piece of jank they should be fixing.
While the quantitative data does not really support either Claude or ChatGPT being better my subjective impression is that Claude usually got to the solution faster and writing better code (in the few cases where there was a quality difference).
Difficulty is hard to estimate, day 9 seemed as easy if not easier than day 8 to me and yet it took twice as long for the leaderboard to fill, ChatGPT couldn't finish it and Claude only barely solved it.
Still speaking of day 9, it was one of the only two non-easy days solved by Claude but this almost a technicality, the program produced by Claude was quadratic garbage and if the input file had been a little bit bigger it would have taken hours for the program to finish.
Day 15 problem was very very similar to an implementation of Sokoban but with a slightly different rule set. This was fatal to both ChatGPT and Claude. Much like with poker in 2023, the Sokoban attractor was too strong to overcome, they just kept implementing the "no more than one box at a time" rule that does not exist anywhere in the problem's prompt.
Last year I complained that ChatGPT never attempted to do any kind of debugging. This year I have been proven wrong, for day 5 ChatGPT struggled to get part 1 right and rewrote the program to include debug print statements and eventually figured out how to solve the problem (even though the final code was fairly unreadable).
Day 17 part 2 required gives you a program (specified as a series of assembly instructions for a made up ISA) and asks you to figure out the input that turns the program into a Quine. Neither LLM could solve however Claude got really close before running into its character limit, this is the last output it produced:

I'm starting to see a pattern! Each loop:

The program starts by setting B to A % 8 Through a series of operations, it manipulates B The program then divides A by 8 (2^3) It outputs the value of B (mod 8) If A is not zero, it jumps back to the start

The program seems to take a number in register A, and output digits from it by repeatedly dividing by 8 and using the remainder in each step to generate output through the B register. Let's try to find a pattern of outputs for some carefully chosen A values:

I don't know if lifting the character limit would make it solve the problem but I admit that this was the most impressed I've been with an LLM ever since ChatGPT 3.5.

A similar thing happened again with Claude and day 24 part 2.

Repeating the LLM vs Advent of Code experiment

7 comments - 1004 thread views 7mo ago by aaa (text post)

ChatGPT vs Advent of Code

37 comments - 65914 thread views 1yr ago by aaa (text post)

ChatGPT does Advent of Code 2023

LLM are all the rage and people are worried that they will soon replace programmers (or, indeed, every possible office job) so I decided to do an experiment to see how well ChatGPT-4 does against Advent of Code 2023.

What is Advent of Code

Advent of Code (henceforth AoC) is an annual programming "event", held by Eric Wastl, that takes place during the first 25 days of december. Each day at midnight a problem unlocks, consisting of an input file and a description of the required solution (either a number or a sequence of letters and numbers) to be determined by processing the input file. To solve the problem you have to submit to the website the correct solution. Once you do part 2 of the problem unlocks, usually a harder version of the problem in part 1. You don't have to submit any code so, in theory, you could solve everything by hand, however, usually, this is intractable and writing a program to do the work for you is the only easy way to solve the problem.

There's also a leaderboard where participants are scored based on how fast they submitted a solution.

Problems start very easy on day 1 (sometimes as easy as just asking for a program that sums all numbers in the input) and progress towards more difficult ones, but they never get very hard: a CS graduate should be able to solve all problems, except maybe 1 or 2, in a couple of hours each.

Prior history

This isn't the first time ChatGPT (or LLMs) was used to participate in Advent of Code. In fact last year (2022) it was big news that users of ChatGPT were able, in multiple days, to reach the top of the global leaderboard. And this was enough of a concern that Eric explicitly banned ChatGPT users from submitting solutions before the global leaderboard was full (of course he also doesn't have any way to actually enforce this ban). Some people even expected GPT-4 to finish the whole year.

A lot of noise was made of GPT-3.5 performance in AoC last year but the actual results were quite modest and LLM enthusiasts behaved in a very unscientific way, by boasting successes but becoming very quiet when it started to fail. In fact ChatGPT struggled to get through day 3 and 5 and probably couldn't solve anything after day 5.

Why do AoC with GPT?

I think it's as close to the perfect benchmark as you can get. The problems are roughly in order of increasing difficulty so you can see where it stops being able to solve. Since almost all of the problems in any given year are solvable by a CS graduate in a couple of hours is a good benchmark for AGI. And since all of the problems are novel the solutions can't come from overfitting.

Also around release people tried GPT-4 on AoC 2022 and found that it performed better so it would be interesting to see how much of the improvement was overfitting vs actual improvement

Methodology

I don't pay for ChatGPT Plus, I only have a paid API key so I used instead a command line client, chatgpt-cli and manually ran the output programs. The prompt I used for part 1 was:

Write a python program to solve the following problem, the program should read its input from a file passed as argument on the command line:

followed by the copypasted text of the problem. I manually removed from the prompt all the story fluff that Eric wrote, which constitutes a small amount of help for ChatGPT. If the output had trivial syntax mistakes I fixed them manually.

I gave up on a solution if it didn't terminate within 15 minutes, and let ChatGPT fail 3 times before giving up. A failure constitutes either an invalid program or a program that runs to completion but returns the wrong output value.

If the program ran to completion with the wrong answer I used the following prompt:

There seems to be a bug can you add some debug output to the program so we can find what the bug is?

If the program ran into an error I would say so and copy the error message.

If the first part was solved correctly the prompt for the second part would be:

Very good, now I want you to write another python program, that still reads input from a command line argument, same input as before, and solves this additional problem:

I decided I would stop the experiment after 4 consecutive days where ChatGPT was unable to solve part 1.

ChatGPT Plus

Because I was aware of the possibility that ChatGPT Plus would be better I supplemented my experiment with two other sources. The first one is the Youtube channel of Martin Zikmund (hencefort "youtuber") who did videos on how to solve the problems in C# as well as trying to solve them using ChatGPT (with a Plus account).

The second one was the blog of a ChatGPT enthusiast "Advent of AI" (henceforth enthusiast) who tried to solve the problems using ChatGPT Plus and then also wrote a blog about it using ChatGPT Plus. Since the blog is generated by ChatGPT it's absolute shit and potentially contains hallucinations, however the github repo with the transcripts is valuable.

The enthusiast turned out to be completely useless: it resorted often to babystepping ChatGPT through to the result and he stopped on day 6 anyway.

The youtuber was much more informative, for the most part he stuck to letting ChatGPT solve the problem on its own. However he did give it, on a few occasions, some big hints, either by debugging ChatGPT's solution for it or explaining it how to solve the problem. I have noted this in the results.

Results

	part 1	part 2	notes
day 1	OK	FAIL
day 2	OK	OK
day 3	FAIL	N/A
day 4	OK	OK	Uses brute force solution for part 2
day 5	OK	FAIL
day 6	FAIL	N/A	ChatGPT Plus solves both parts
day 7	FAIL	N/A
day 8	OK	FAIL	ChatGPT Plus solves part 2 if you tell it what the solution is
day 9	FAIL	N/A	ChatGPT Plus solves both parts
day 10	FAIL	N/A
day 11	FAIL	N/A	ChatGPT Plus could solve part 1 with a big hint
day 12	FAIL	N/A

The perofrmance of GPT-4 this year was a bit worse than GPT-3.5 last year. Last year GPT-3.5 could solve 3 days on its own (1, 2 and 4) while GPT-4 this year could only solve 2 full days (2 and 4).

ChatGPT Plus however did a bit better, solving on its own 4 days (2, 4, 6 and 9). This is probably down to its ability to see the problem input (as an attachment), rather than just the problem prompt and the example input to better sytem prompts and to just being able to do more round-trips through the code interpreter (I gave up after 3~4 errors / wrong outputs).

One shouldn't read too much on its ability to solve day 9, the problem difficulty doesn't increase monotonically and day 9 just happened to be very easy.

Conclusions

Overall my subjective impression is that not much has changed, it can't solve anything that requires something more complicated than just following instructions and its bad at following instructions unless they are very simple.

It could be that LLMs have reached their plateau. Or maybe Q* or Bard Ultra or Grok Extra will wipe the floor next year, like GPT-4 was supposed to do this year. It's hard not to feel jaded about the hype cycle.

I have a bunch of observations about the performance of ChatGPT on AoC which I will report here in no particular order.

Debugging / world models

Most humans are incapable of solving AoC problems on the first try without making mistakes so I wouldn't expect a human-level AI to be able to do it either (if it could it would be by definition super-human).

Some of my prompting strategy went into the direction of trying to get ChatGPT to debug its flawed solution. I was asking it to add debug prints to figure out where the logic of the solution went wrong.

ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.

This is drastically different from what programmers.

This is interesting because debugging techniques aren't really taught. By and large programming textbooks teach you to program, not how to fix errors you wrote. And yet people do pick up debugging skills, implicitly.

ChatGPT has the same access to programming textbooks that humans have and yet it does not learn to debug. I think this points to the fact that ChatGPT hasn't really learned to program, that it doesn't have a "world model", a logical understanding of what it is doing when it's programming.

The bruteforce way to get ChatGPT to learn debugging I think would be to scrape hundreds of hours of programming livestreams from twitch and feed it to the training program after doing OCR on the videos and speech-to-text on the audio. That's the only source of massive amounts of worked out debugging examples that I can think of.

Difficulty

Could it be that this year of AoC was just harder than last year's and that's why GPT-4 didn't do well? Maybe.

Difficulty is very hard to gauge objectively. There's scatter plots for leaderboard fill-up time but time-to-complete isn't necessarily equivalent difficulty and the difference between this year and last year isn't big anyway (note: the scatter plots aren't to scale unfortunately).

My own subjective impression is also that this year (so far) was not harder.

The best evidence for an increase in difficulty is day 1 part 2, which contained a small trap in which both human participants and ChatGPT fell.

I think this points to a problem with this AIs trained with enormous amounts of training data: you can't really tell how much better they are. Ideally you would just test GPT-4 on AoC 2022, but GPT-4 training set contains many copies of AoC 2022's solutions so it's not really a good benchmark anymore.

Normally you would take out a portion of the training set to use as test set but with massive training set this is impossible, nobody knows what's in them and so nobody knows how many times each individual training example is replicated in them.

I wonder if OpenAI has a secret test dataset that they don't put on the internet anywhere to avoid training set contamination.

Some people have even speculated that the problems this year were deliberately formulated to foil ChatGPT, but Eric actually denied that this is the case.

Overfitting

GPT 4 is 10x larger than GPT 3.5 and it does much better on a bunch of standard tests, for example the bar exam.

Why did it not do much better on AoC? If it isn't difficulty it could be overfitting. It has simply memorized the answers to a bunch of standardized tests.

Is this the case? My experience with AoC day 7 points towards this. The problem asks to write a custom string ordering function, the strings in questions represent hands of cards (A25JQ is ace, 2, 5 jack and queen) and the order it asks for is similar to Poker scoring. However it is not Poker.

This is a really simple day and I expected ChatGPT would be able to solve it without problems, since you just have to follow instructions. And yet it couldn't it was inesorably pulled towards writing a solution for Poker rather than for this problem.

My guess is that this is an example of overfitting in action. It's seen too many examples of poker in its training set to be able to solve this quasi-poker thing.

What is this place?

Why are you called The Motte?

New post guidelines

Rules

Recommended Posts And Communities

Recommended Realtime Chats

aaa

aaa