
ChatGPT vs Advent of Code

ChatGPT does Advent of Code 2023

LLMs are all the rage, and people are worried that they will soon replace programmers (or, indeed, every possible office job), so I decided to run an experiment to see how well GPT-4 does against Advent of Code 2023.

What is Advent of Code

Advent of Code (henceforth AoC) is an annual programming "event", run by Eric Wastl, that takes place during the first 25 days of December. Each day at midnight a problem unlocks, consisting of an input file and a description of the required solution (either a number or a sequence of letters and numbers) to be determined by processing the input file. To solve the problem you submit the correct solution to the website. Once you do, part 2 of the problem unlocks, usually a harder version of part 1. You don't have to submit any code, so in theory you could solve everything by hand; in practice this is usually intractable, and writing a program to do the work for you is the only easy way to solve the problem.

There's also a leaderboard where participants are scored based on how fast they submitted a solution.

Problems start very easy on day 1 (sometimes as easy as just asking for a program that sums all numbers in the input) and progress towards more difficult ones, but they never get very hard: a CS graduate should be able to solve all problems, except maybe 1 or 2, in a couple of hours each.

Prior history

This isn't the first time ChatGPT (or LLMs in general) has been used to participate in Advent of Code. In fact, last year (2022) it was big news that ChatGPT users were able, on multiple days, to reach the top of the global leaderboard. This was enough of a concern that Eric explicitly banned ChatGPT users from submitting solutions before the global leaderboard was full (of course he has no way to actually enforce this ban). Some people even expected GPT-4 to finish the whole year.

A lot of noise was made about GPT-3.5's performance in AoC last year, but the actual results were quite modest, and LLM enthusiasts behaved in a very unscientific way, boasting about successes but going quiet when it started to fail. In fact ChatGPT struggled to get through days 3 and 5, and probably couldn't solve anything after day 5.

Why do AoC with GPT?

I think it's as close to the perfect benchmark as you can get. The problems are roughly in order of increasing difficulty, so you can see where it stops being able to solve them. Since almost all of the problems in any given year are solvable by a CS graduate in a couple of hours each, it's a good benchmark for AGI. And since all of the problems are novel, the solutions can't come from overfitting.

Also, around its release people tried GPT-4 on AoC 2022 and found that it performed better, so it would be interesting to see how much of that improvement was overfitting versus actual improvement.

Methodology

I don't pay for ChatGPT Plus, I only have a paid API key, so I instead used a command line client, chatgpt-cli, and ran the output programs manually. The prompt I used for part 1 was:

Write a python program to solve the following problem, the program should read its input from a file passed as argument on the command line:

followed by the copy-pasted text of the problem. I manually removed from the prompt all the story fluff that Eric wrote, which constitutes a small amount of help for ChatGPT. If the output had trivial syntax mistakes I fixed them manually.

I gave up on a solution if it didn't terminate within 15 minutes, and let ChatGPT fail 3 times before giving up. A failure constitutes either an invalid program or a program that runs to completion but returns the wrong output value.
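The run-and-check loop amounts to something like this (a sketch of the methodology described above, not the exact harness I used; file names are illustrative):

```python
import subprocess
import sys

# Sketch of the evaluation loop: run a generated solution with a
# 15-minute timeout; a timeout, crash, or wrong answer counts as a failure.
def run_solution(script, input_file, timeout=15 * 60):
    try:
        proc = subprocess.run(
            [sys.executable, script, input_file],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # didn't terminate within 15 minutes
    if proc.returncode != 0:
        return None  # invalid program (crash / syntax error)
    return proc.stdout.strip()

def attempt(scripts, input_file, expected):
    # Allow up to 3 generated programs before giving up on the day
    for script in scripts[:3]:
        if run_solution(script, input_file) == expected:
            return True
    return False
```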

If the program ran to completion with the wrong answer I used the following prompt:

There seems to be a bug can you add some debug output to the program so we can find what the bug is?

If the program ran into an error I would say so and copy the error message.

If the first part was solved correctly the prompt for the second part would be:

Very good, now I want you to write another python program, that still reads input from a command line argument, same input as before, and solves this additional problem:

I decided I would stop the experiment after 4 consecutive days where ChatGPT was unable to solve part 1.

ChatGPT Plus

Because I was aware of the possibility that ChatGPT Plus would be better, I supplemented my experiment with two other sources. The first is the YouTube channel of Martin Zikmund (henceforth "the youtuber"), who made videos on how to solve the problems in C# as well as trying to solve them using ChatGPT (with a Plus account).

The second was the blog of a ChatGPT enthusiast, "Advent of AI" (henceforth "the enthusiast"), who tried to solve the problems using ChatGPT Plus and then also wrote a blog about it using ChatGPT Plus. Since the blog is generated by ChatGPT it's absolute shit and potentially contains hallucinations; however, the GitHub repo with the transcripts is valuable.

The enthusiast turned out to be completely useless: he often resorted to babystepping ChatGPT through to the result, and he stopped on day 6 anyway.

The youtuber was much more informative; for the most part he stuck to letting ChatGPT solve the problem on its own. However, on a few occasions he did give it some big hints, either by debugging ChatGPT's solution for it or by explaining to it how to solve the problem. I have noted this in the results.

Results

        part 1  part 2  notes
day 1   OK      FAIL
day 2   OK      OK
day 3   FAIL    N/A
day 4   OK      OK      Uses brute force solution for part 2
day 5   OK      FAIL
day 6   FAIL    N/A     ChatGPT Plus solves both parts
day 7   FAIL    N/A
day 8   OK      FAIL    ChatGPT Plus solves part 2 if you tell it what the solution is
day 9   FAIL    N/A     ChatGPT Plus solves both parts
day 10  FAIL    N/A
day 11  FAIL    N/A     ChatGPT Plus could solve part 1 with a big hint
day 12  FAIL    N/A

The performance of GPT-4 this year was a bit worse than GPT-3.5's last year. Last year GPT-3.5 could solve 3 days on its own (1, 2 and 4), while GPT-4 this year could only solve 2 full days (2 and 4).

ChatGPT Plus did a bit better, however, solving 4 days on its own (2, 4, 6 and 9). This is probably down to its ability to see the problem input (as an attachment) rather than just the problem prompt and the example input, to better system prompts, and to simply being able to do more round-trips through the code interpreter (I gave up after 3~4 errors / wrong outputs).

One shouldn't read too much into its ability to solve day 9: the problem difficulty doesn't increase monotonically, and day 9 just happened to be very easy.

Conclusions

Overall my subjective impression is that not much has changed: it can't solve anything that requires something more complicated than just following instructions, and it's bad at following instructions unless they are very simple.

It could be that LLMs have reached their plateau. Or maybe Q* or Bard Ultra or Grok Extra will wipe the floor next year, like GPT-4 was supposed to do this year. It's hard not to feel jaded about the hype cycle.

I have a bunch of observations about the performance of ChatGPT on AoC which I will report here in no particular order.

Debugging / world models

Most humans are incapable of solving AoC problems on the first try without making mistakes so I wouldn't expect a human-level AI to be able to do it either (if it could it would be by definition super-human).

Some of my prompting strategy went into the direction of trying to get ChatGPT to debug its flawed solution. I was asking it to add debug prints to figure out where the logic of the solution went wrong.

ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.

This is drastically different from what human programmers do.

This is interesting because debugging techniques aren't really taught. By and large programming textbooks teach you to program, not how to fix errors you wrote. And yet people do pick up debugging skills, implicitly.

ChatGPT has the same access to programming textbooks that humans have and yet it does not learn to debug. I think this points to the fact that ChatGPT hasn't really learned to program, that it doesn't have a "world model", a logical understanding of what it is doing when it's programming.

The bruteforce way to get ChatGPT to learn debugging I think would be to scrape hundreds of hours of programming livestreams from twitch and feed it to the training program after doing OCR on the videos and speech-to-text on the audio. That's the only source of massive amounts of worked out debugging examples that I can think of.

Difficulty

Could it be that this year of AoC was just harder than last year's and that's why GPT-4 didn't do well? Maybe.

Difficulty is very hard to gauge objectively. There's scatter plots for leaderboard fill-up time but time-to-complete isn't necessarily equivalent difficulty and the difference between this year and last year isn't big anyway (note: the scatter plots aren't to scale unfortunately).

My own subjective impression is also that this year (so far) was not harder.

The best evidence for an increase in difficulty is day 1 part 2, which contained a small trap into which both human participants and ChatGPT fell.
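For the curious, the trap (assuming it's the one I have in mind, the overlapping-words gotcha): spelled-out digits can overlap, so a naive left-to-right scan of "oneight" finds "one" and then skips past the "eight". A zero-width regex lookahead catches both:

```python
import re

# Overlap-safe digit extraction for lines mixing digits and digit words.
WORDS = ["one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine"]
VALUE = {w: str(i + 1) for i, w in enumerate(WORDS)}

def calibration(line):
    # The lookahead matches without consuming characters, so
    # overlapping words like "oneight" yield both "one" and "eight".
    pattern = "(?=(" + "|".join(WORDS) + r"|\d))"
    digits = [VALUE.get(m, m) for m in re.findall(pattern, line)]
    return int(digits[0] + digits[-1])
```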

I think this points to a problem with AIs trained on enormous amounts of data: you can't really tell how much better they are. Ideally you would just test GPT-4 on AoC 2022, but GPT-4's training set contains many copies of AoC 2022's solutions, so it's not really a good benchmark anymore.

Normally you would hold out a portion of the training set to use as a test set, but with massive training sets this is impossible: nobody knows what's in them, and so nobody knows how many times each individual training example is replicated in them.

I wonder if OpenAI has a secret test dataset that they don't put on the internet anywhere to avoid training set contamination.

Some people have even speculated that the problems this year were deliberately formulated to foil ChatGPT, but Eric actually denied that this is the case.

Overfitting

GPT-4 is 10x larger than GPT-3.5 and it does much better on a bunch of standard tests, for example the bar exam.

Why did it not do much better on AoC? If it isn't difficulty it could be overfitting. It has simply memorized the answers to a bunch of standardized tests.

Is this the case? My experience with AoC day 7 points in that direction. The problem asks you to write a custom string ordering function; the strings in question represent hands of cards (A25JQ is ace, 2, 5, jack and queen) and the ordering it asks for is similar to Poker scoring. However, it is not Poker.

This is a really simple day and I expected ChatGPT would be able to solve it without problems, since you just have to follow instructions. And yet it couldn't: it was inexorably pulled towards writing a solution for Poker rather than for this problem.

My guess is that this is an example of overfitting in action. It's seen too many examples of poker in its training set to be able to solve this quasi-poker thing.
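To make the rule concrete, here is a minimal sketch of the ordering the problem actually asks for (my own illustration, not ChatGPT's output; the hands are the ones from the problem's example): rank by hand type first, then break ties card by card, left to right, not by poker's "strongest card" rule.

```python
from collections import Counter

# AoC 2023 day 7 part 1 ordering: hand type is the multiset of card
# counts (e.g. full house -> [3, 2]); ties are broken by comparing
# cards position by position, first card first.
CARD_ORDER = "23456789TJQKA"  # weakest to strongest

def hand_key(hand):
    counts = sorted(Counter(hand).values(), reverse=True)
    return (counts, [CARD_ORDER.index(c) for c in hand])

hands = ["32T3K", "T55J5", "KK677", "KTJJT", "QQQJA"]
ranked = sorted(hands, key=hand_key)  # weakest hand first
```

Note the tie-break: KTJJT ranks below KK677 because their first cards tie and T < K in the second position, even though KTJJT contains higher pairs than a poker player might weigh.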


So, I gave this a bit of a try myself on Day 3, which ChatGPT failed in your test and on YouTube. While I appreciate that you framed this as a scientific experiment with unvarying prompts and strict objective rules, you're handicapping it compared to a human who has more freedom to play around. Given this, I think your conclusion that it can't debug is a bit too strong.

I wanted to give it more of the flexibility of a human programmer solving AoC, so I made it clear up front that it should brainstorm (I used the magic "think step by step" phrase) and iterate, only using me to try to submit solutions to the site. Then I followed its instructions as it tried to solve the tasks. This is subjective and still pretty awkward, and there was confusion over whether it or I should be running the code; I'm sure there's a better way to give it the proper AoC solving experience. But it was good enough for one test. :) I'd call it a partial success: it thought through possible issues and figured out the two things it was doing wrong on Day 3 Part 1, and got the correct answer on the third try (and then got Part 2 with no issues). The failure, though, is that it never seemed to realize it could use the example in the problem statement to help debug its solution (and I didn't tell it).

Anyway, the transcript's here, if you want to see ChatGPT4 troubleshooting its solution. It didn't use debug output, but it did "think" (whatever that means) about possible mistakes it might have made and alter its code to fix those mistakes, eventually getting it right. That sure seems like debugging to me.

Remember, it's actually kind of difficult to pin down GPT4's capabilities. There are two reasons it might not be using debug output like you want: a) it's incapable, or b) you're not prompting it right. LLMs are strange, fickle beasts.

Then I tried it on Day 7 (adjusting the prompt slightly and letting it just use Code Interpreter on its own). It figured out what it was doing wrong on Part 1 and got it on the second try. Then it did proceed to try a bunch of different things (including some diagnostic output!) and spin and fail on Part 2 without ever finding its bug. Still, this is better than your result, and the things it was trying sure look like "debugging" to me. More evidence that it could do better with different prompting and the right environment.

EDIT: Heh, I added a bit more to the transcript, prodding ChatGPT to see if we could debug together. It produced some test cases to try, but failed pretty hilariously at analyzing the test cases manually. It weakens my argument a bit, but it's interesting enough to include anyway.

I tried on Day 10 and it failed. I want to avoid publication bias, though, so I'm posting the transcript anyway. :) Note that it IS using debug output to try to figure out its error, but I think it's analyzing it incorrectly.

Thanks for doing this and posting about it, very interesting. The results are roughly what I presumed - these LLM-based "AI"s are pretty good at regurgitating and mixing and matching things they've already seen, but have no real ability to reason and fall flat fast when asked to do anything unusual or unexpected.

The impression I get from a lot of the discourse surrounding machine learning and LLMs/GPT in particular is that the people who are the most bullish about its prospects often have the least understanding of what's actually going on "under the hood". A regression engine does not "reason" in the classical sense that humans or even some birds might be said to. What it does is provide a response that is derived probabilistically from the input (i.e. the prompt) based on its training data.

The crux of the issue is that, contra Bayes and a good chunk of the posters on LessWrong, a "probable" response given some dataset does not necessarily equate to an "accurate" or "true" response. This is why it sucks at following all but the simplest of instructions (i.e. those that don't require reasoning to follow) and why issues like the so-called "hallucination problem" (where GPT4 will cook up nonexistent court cases when asked for a legal citation) are unlikely to be resolved outside a ground-up rework of the underlying architecture.

Hi, bullish ML developer here, who is very familiar with what's going on "under the hood". Maybe try not calling the many, many people who disagree with you idiots? It certainly does not "suck at following all but the simplest of instructions", unless you've raised this subjective metric so high that much of the human race would fail your criterion. And while I agree that the hallucination problem is fundamental to the architecture, it has nothing to do with GPT4's reasoning capabilities or lack thereof. If you actually had a "deep understanding" of what's going on under the hood, you'd be aware of this. It's because GPT4 (the model) and ChatGPT (the intelligent oracle it's trying to predict) are distinct entities which do not match perfectly. GPT4 might reasonably guess that ChatGPT would start a response with "the answer is..." even if GPT4 itself doesn't know the answer ... and then the algorithm picks the next word from GPT4's probability distribution anyway, causing a hallucination. Tuning can help reduce the disparity between these entities, but it seems unlikely that we'll ever get it to work perfectly. A new idea will be needed (like, perhaps, an algorithm that does a directed search on response phrases rather than greedily picking unchangeable words one by one).

To be honest, it sounds like you don't have much experience with ChatGPT4 yourself, and think that the amusing failures you read about on blogs (selected because they are amusing) are representative. Let me try to push back on your selection bias with some fairly typical conversations I've had with it (asking for coding help): 1, 2. These aren't selected to be amusing; ChatGPT4 doesn't get everything right, nor does it fail spectacularly. But it does keep up its end of a detailed, unprecedented conversation with no trouble at all.

It bears mentioning that so far ChatGPT has shown no ability to discern factual correctness in its responses. No capability appears to exist in the architecture to be able to differentiate between a reasonable response and utter horseradish.

Any appearance of correctness is because the model happens to contain content that correlates with the input text vector. The machine is simply returning a bunch of tokens from its model that match the input with high probability, with some random variance. The machine cannot problem solve, because problem solving requires predicting whether a change will make a solution more or less correct. The machine may be able to predict an outcome, but it chooses randomly, not based on correctness.

While it can appear to perform adequately for subjective tasks (blogs, marketing, social media), the illusion crumbles when presented a novel challenge that requires some degree of correctness. Humans, dogs, rats, and yes even birds, seem to be able to adapt to novel conditions estimating their competence and adjusting their behaviour. The machine appears to have no clue which change is better or worse and so it just picks randomly, by design.

With this in mind, it's amusing when users and developers attempt to instruct the model, especially so when it is expected to restrict some behaviour, as with ChatGPT's system prompts. The machine has no concept of whether its output meets the expectations of its prompt.

If code generated by ChatGPT happens to perform competently at a task, it's because the human has engineered a vector that points to a region in the model that contains tokens from an existing example that solves the user's problem. Any problem solving ability is still firmly encapsulated in the human operator's brain.

I didn't call anyone in particular "an idiot"; I simply observed that there seems to be a profound disconnect between the rhetoric and the reality. You say you're a "bullish ML developer"; that's cool, what have you developed and for whom?

Edit: I honestly can't tell if your linked examples are supposed to be helping your case or mine.

I'm on Apple's AI/ML team, but I can't really go into details.

...and I am on the "algorithms team" for a big-name defense contractor. Point being that unlike a lot of users here I'm not playing with this stuff in my off-time for entertainment or as some sort of intellectual exercise. I have specific use-cases in mind and with specific targets to be met.

Furthermore, I think it's rather silly of you to accuse me of setting my "subjective metric so high that much of the human race would fail your criterion" when my core point is that we are discussing a level of "reason" already demonstrated as being within the operational capabilities of a literal bird-brain.

Consider the possibility that it is not my bar that has been set unreasonably high, but rather yours that is unreasonably low. Consider that keeping those specific target metrics (both objective and subjective) in mind makes the superficial improvements generated by throwing more CPU cycles and bigger datasets at the problem less impressive.

Wow, you're really doubling down on that link to a video of a bird fishing with bread. And in your mind, this is somehow comparable to holding a complex conversation and solving Advent of Code problems. I honestly don't know what to say to that.

Really, the only metric that I need is that ChatGPT makes me more productive in my job and personal projects. If you think that's "unreasonably low", well, I hope that our eventual AI Overlords can hope to meet your stringent requirements. The rest of the human race won't care.

Wow, you're really doubling down on that link to a video of a bird fishing with bread.

Yes, I am, and I will explain why.

You know what I see when I watch that video? I see the real-time generation and processing of multiple multi-path conditional states. I see this process being run in parallel with a complex kinematic control algorithm on hardware that is arguably "highly bandwidth-limited".

Fishing, as a behavior, is not only comparable to holding a rudimentary conversation or solving (some) Advent of Code problems, it is orders of magnitude more complex for all the reasons Moravec and Searle lay out. I know the OpenAI fanboys get pissy when "the Chinese Room" is brought up but I don't think I've ever seen a more succinct and apt illustration of the difference between "reason" and "vocabulary" said thought experiment is intended to explore, than I have in the calculating eyes of that bird.

Really, the only metric that I need is that ChatGPT makes me more productive in my job and personal projects. If you think that's "unreasonably low", well, I hope that our eventual AI Overlords can hope to meet your stringent requirements. The rest of the human race won't care.

My copy of Dover's Differential Equations also makes me more productive in my job and personal projects. That doesn't mean I believe that a paperback book possesses a consciousness or the ability to reason.

Like I said, consider the possibility that your bar for what constitutes intelligence may be set unreasonably low.

Ugh, what a ridiculous take. The ability to move a body and process senses and learn behaviour that generates food is miraculous, yes. We can't build machines that come close to this yet. It's amazing that birds can do it! And humans! And cats, dogs, pigs, mice, ants, mosquitos, and 80 million other species too. Gosh, wow, I'm so agog at the numinous wondrousness of nature.

That doesn't make it intelligence. Humans are special. Intelligence is special. Until transformers and LLMs, every single story, coherent conversation, and, yes, Advent of Code solution was the creation of a human being. Even if all development stops here, even if LLMs never get smarter and these chatbots continue to have weird failure modes for you to sneer at, something fundamental has changed in the world.

Do you think you're being super deep by redefining intelligence as "doing what birds can do?" I'd expect that from a stoner, not from a long-standing mottizen. Words MEAN things, you know. If you'd rather change your vocabulary than your mind, I don't think we have anything more to discuss.

I don't think we have anything more to discuss.

No offense, but that's the worst sentence I keep reading on the motte. It occurs frequently and signals failure and comes off as ostensibly-objective but actually-one-sided. Maybe it's just a figure of speech that comes naturally to native speakers, but to me it sounds very schoolmarmy. If anything, a severe inferential gulf implies that there is more to discuss, as seems to me most of the point of this place, and not that the discussion must be shut down because we consider each other impossible to communicate with and would like to go nearer back to where we only hear what we want to hear.

Rant over, relevance to actual post low, please ignore.

I don't think I'm being "deep" at all, and I don't think that you finding my take "ridiculous" makes it any less true.

Instead, I would suggest that the fact that playing chess is a relatively trivial behavior to replicate programmatically, while fishing is not, should cause you to reconsider which of these tasks is actually the more computationally complex and intensive of the two.

Likewise, if you're going to play the whole "words mean things" card, perhaps you should take a moment and actually spell out what you think the word "Intelligence" means because if we are going to be working off of your metric of "makes me more productive in my job and personal projects" I can name any number of books, cheat-sheets, software tools, and select inanimate objects that are more "intelligent" than most of the interns and comp-sci grads I've worked with in the last year.

ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.

Is this how you define debugging? Surely a more straightforward definition is 'making non-functional code work'. When I used GPT-4 a few months back to supplement my negligible coding skills I could get it to debug stuff. First it'd give me code that didn't work, then I'd give it the error code and tell it to fix it. We'd go through a few iterations and it would often work.

It wasn't perfect at debugging but it could do it some of the time. If you read the literature or even the top results from a search engine, they explicitly note debugging as something GPT-4 can do.

I'm thinking more of debugging logical problems than fixing syntax errors.

It seems worth mentioning that although trying to have general-purpose LLMs one-shot code might well be a handy benchmark of how close those LLMs are to AGI, it's a far cry from the state of the art in AI code generation. AlphaCode 2 performs at the 85th percentile vs. humans competitors despite using a base model inferior to GPT-4, by using a fine-tuned variant of that model in combination with scaffolding to help it break down problems into smaller parts and generate and select among many candidate solutions.

Very interesting claims, we'll see if they hold up.

Go to a blackboard and write this 100 times like Bart Simpson. "LLMs don't need to be able to do the job of anyone 100%. Even making some workers marginally more productive in the absence of equal increase in demand means you have aggregate disemployment."

We employ more *senior* programmers than ever. The labor market is shifting so fast that what was true 2 months ago isn't anymore. Ask anyone on the market right now how difficult it is at the moment.

So AI replaces the "outsourced to an Indian call-center" work that was already going to need to be redone by a proper coder at some point anyway. What changes?

No, AI replaced the college grad out of the US.

color me skeptical

Why wouldn’t there be an increase in demand?

Well, it depends on what you measure. Low end 2D art makers are finished and dead as a job.

But if you are a programmer, or panicking about superhuman AI, then such a post is quite a useful bit of info.

I find it a bit puzzling that the LLM is expected to do things correctly with minimal or no guidance, which is a bit like expecting a riderless horse to stay on track and win a race. Maybe it can sometimes, but with a code jockey, it can be so much better.

That probably looks something like noticing that it's overfitting on poker, translating the question to avoid that, and seeing if it does any better. Eg. not calling the symbols "cards" or "faces" or "suits". ROT13-ing the letters so they don't look like a poker hand, or whatever.
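The letter remapping is a one-liner in Python's standard library (a sketch of the disguising idea; `codecs`'s built-in rot13 transform shifts letters and leaves digits alone):

```python
import codecs

# Disguise the card letters so the hand no longer pattern-matches
# poker notation: rot13 maps A->N, J->W, Q->D and keeps digits as-is.
disguised = codecs.encode("A25JQ", "rot13")
```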

That is what working as a dev is like. You have a large codebase that interacts with many systems and has many quirks. Client gives vague goals that don't really fit the reality on the ground. Developer has to figure out what is meant and create something that fulfills several often contradicting demands. ChatGPT is great if the assignment is "find the prime numbers in this array and sort them by their digit sum". Unfortunately most development isn't much like leetcode at all and writing a prompt to explain the problem to an AI would involve writing a wall of text.

AI won't be useful in most cases unless it is trained on my codebase, my emails, and my meetings.

I suggest you try to get it to solve that problem correctly, it isn't easy. The prompt is already giving clear and simple instructions and even reminding it not to use the highest value card for the second part of the ordering doesn't work.

Even so, for problems this simple once you start trying to read and find out bugs in the code it produces you have already lost, it would have been faster to write the implementation yourself.

If there's a sizeable group that thinks riderless horses can stay on track and win a race, then the best way to test that is to put a riderless horse on the racetrack. Considering we're explicitly testing LLM capabilities here, fudging and adding human intervention can only harm the experiment.

Ability to navigate that kind of confusion and stay on task is a pretty good shorthand for intelligence, though. And it helps keep the skill of the jockey out of any benchmark.

Nice writeup. Unfortunately not a lot of discussion yet so let me add some random comments:

And since all of the problems are novel the solutions can't come from overfitting.

Depends on what you call "novel". A lot of the problems are based on well-known algorithms like path finding, Josephus problems, etc. And there is quite a bit of repetition of concepts between years as well. So I think LLMs and humans alike benefit from having the previous problems in their data set.

There is also something that makes Advent of Code relatively harder for LLMs (and new competitors): on some days, the stated problem is generally much harder than the actual input file. In that case, careful inspection of the input data is required to figure out what the problem is actually asking, which I assume ChatGPT has no way of doing or even asking for.

(This year's Day 8 was an example of this, but this has happened pretty much every year.)

ChatGPT never did this: its debugging skills are completely non-existent. If it encounters an error it will simply rewrite entire functions, or more often the entire program, from scratch.

True, and it's consistent with it being a language model. It mostly sees completed code snippets (of varying quality) written by humans. How could it know how humans construct solutions like this?

It's probably the same reason why ChatGPT does so poorly at writing longform fiction. It has no idea how to construct an overarching narrative because the planning, rewriting and editing necessary is invisible to ChatGPT; it only sees the finished output.

I think coding assistants (like GitHub Copilot) will be able to fill this gap by observing how humans actually develop code.

Difficulty is very hard to gauge objectively. There's scatter plots for leaderboard fill-up time but time-to-complete isn't necessarily equivalent difficulty and the difference between this year and last year isn't big anyway (note: the scatter plots aren't to scale unfortunately).

True, and I agree with your subjective assessment that the problems aren't any harder this year, but I'd add also that the leaderboard is not really representative of the overall participant base. People on the leaderboard are the top 1% of all solvers (let alone participants), and they have their own specific strengths and weaknesses. For example, a problem that requires dynamic programming is easy for them (but hard for most casual programmers), while the top 1% still need more time on problems that require lots of careful reading, convoluted input parsing, tricky edge cases, etc.

I don't pay for ChatGPT Plus; I only have a paid API key, so I instead used a command-line client, chatgpt-cli, and manually ran the output programs.

Please explain the logic here because this is baffling to me. You were willing to invest the time to solve every single AoC problem this year with ChatGPT and you wrote up this summary of it, which together must have taken hours, but you couldn't fork over the $20 needed for a month-long pro subscription, which would make your results an order of magnitude more interesting? How do you value your time such that this makes sense?

There is also something that makes Advent of Code relatively harder for LLMs (and new competitors): on some days, the problem as stated is, in full generality, much harder than what the actual input file requires. In that case, careful inspection of the input data is needed to figure out what the problem is actually asking, which I assume ChatGPT has no way of doing, or even asking for.

This is actually not a factor for ChatGPT Plus: you can attach the input file to the request and it will examine it. It doesn't seem to help at all, but it's a thing.

(This year's Day 8 was an example of this, but this has happened pretty much every year.)

Day 8 is a bit of a bad example; the general solution is the Chinese remainder theorem, which isn't much harder anyway.

True, and it's consistent with it being a language model. It mostly sees completed code snippets (of varying quality) written by humans. How could it know how humans construct solutions like this?

How could other humans learn how to construct those solutions? They read the same textbooks that are in the training set of ChatGPT (a minuscule fraction of it) and they understand their contents.

Please explain the logic here because this is baffling to me. You were willing to invest the time to solve every single AoC problem this year with ChatGPT and you wrote up this summary of it, which together must have taken hours, but you couldn't fork over the $20 needed for a month-long pro subscription, which would make your results an order of magnitude more interesting? How do you value your time such that this makes sense?

I wouldn't call it an order of magnitude: it's the same model, but with a different prompt and the ability to run code on its own. Anyway, the logic is this: I had fun doing this, but it's a silly project and I didn't want to spend $20 on it. Plus I didn't have to, because a YouTuber did it for me.

You can also look at this as a question of whether ChatGPT Plus is worth it in general: it did better than straight API calls, but I spent $2 on API calls vs $20 for Plus, and it isn't 10 times better.

Day 8 is a bit of a bad example; the general solution is the Chinese remainder theorem, which isn't much harder anyway.

Have you tried to solve the general problem yourself? It's absolutely much harder than the version contestants had to solve.

First, the Chinese remainder theorem is genuinely a lot harder than simply calculating the least common multiple. Second, the problem statement allows much more complicated input than that. For example, the problem statement allows loops with multiple end states; I don't even know how you'd deal with that efficiently, I doubt you know off the top of your head, and I certainly wouldn't fault ChatGPT for not knowing it either.

If you post your code I can probably come up with a test case that breaks it.
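To make the gap concrete, here's a minimal Python sketch (my own illustration, not code from the thread): with the real inputs every ghost hits its end state at multiples of its cycle length, so the answer is just the lcm of the cycle lengths, while the general single-end-state case already needs a Chinese-remainder-style merge of congruences, one pair at a time.

```python
# Hypothetical illustration: with the real AoC Day 8 inputs each ghost ends
# at multiples of its cycle length n_i, so the answer is lcm(n1, n2, ...).
# In the general case each ghost ends at step a_i (mod n_i), and those
# congruences must be merged with the Chinese remainder theorem instead.
from math import gcd

def crt_pair(a1, n1, a2, n2):
    """Smallest x >= 0 with x % n1 == a1 % n1 and x % n2 == a2 % n2, or None."""
    g = gcd(n1, n2)
    if (a2 - a1) % g:
        return None  # incompatible congruences: no common step exists
    period = n1 // g * n2  # lcm(n1, n2)
    # Brute-force merge, stepping by n1 (fine for a sketch; the efficient
    # version uses the extended Euclidean algorithm).
    x = a1 % n1
    while x % n2 != a2 % n2:
        x += n1
    return x % period

print(crt_pair(2, 3, 3, 5))  # -> 8: first step that is 2 mod 3 and 3 mod 5
print(crt_pair(1, 4, 2, 6))  # -> None: parity mismatch, no such step
```

Note that the incompatible case (`None`) can't even arise in the lcm version, which is part of why the general problem is harder than the one contestants actually faced.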

How could other humans learn how to construct those solutions? They read the same textbooks that are in the training set of ChatGPT (a minuscule fraction of it) and they understand their contents.

No, that's absolutely not how humans learn to code. Or at least it's not how I learned, or how anyone I know that's good at solving AoC-style problems learned to solve them. Reading textbooks is the absolute minimum time investment. The majority of time is spent thinking about the problem, writing code, noticing it doesn't work, trying to find the flaw by reading through it, stepping through the execution with a debugger, or maybe adding printf() statements to get insight into the internal state, and so on.

It's a very interactive process. But the intermediate code, with all the printf() statements for debugging, isn't something that usually gets committed. That's why ChatGPT doesn't know how to debug code that way: it has never even seen someone do it. It might have read about printf() debugging on Wikipedia, but it has never done it itself, or if it did (at a user's request), it keeps no memory of it.
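As a concrete (made-up) example of that loop: you suspect a function misbehaves, drop in a temporary print to watch its internal state, find the flaw, and delete the print before committing; that deleted intermediate version is exactly what never reaches the training data.

```python
# Hypothetical sketch of the printf-style debugging pass described above.
def first_duplicate(items):
    """Return the first value that appears twice, or None."""
    seen = set()
    for i, x in enumerate(items):
        # Temporary debug print: watch how `seen` grows at each step.
        # Lines like this get deleted once the bug is found, so they
        # rarely survive into committed (i.e. training-set) code.
        print(f"step {i}: x={x!r}, seen={sorted(seen)}")
        if x in seen:
            return x
        seen.add(x)
    return None

print(first_duplicate([3, 1, 4, 1, 5]))  # -> 1
```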

You can also look at this as a question of whether ChatGPT Plus is worth it in general: it did better than straight API calls, but I spent $2 on API calls vs $20 for Plus, and it isn't 10 times better.

I don't think this comparison makes sense. You're treating it as a comparison of efficiency: as if Model A is solving problems at a rate of X/day and Model B at a rate of 2X/day, so Model B is only twice as valuable as Model A. But that's not what's happening: Model B is solving problems that apparently Model A cannot solve at all. If Einstein can prove only 1% more theorems than the average physics major, does that mean he should be paid only 1% more?

I don't pay for ChatGPT Plus, I only have a paid API key

you couldn't fork over the $20 needed for a month-long pro subscription, which would make your results an order of magnitude more interesting?

I think you might be confused about this: you can access GPT-4 via the API. I haven't seen anything suggesting that the versions of GPT-4 used in the ChatGPT Plus interface are smarter than the versions you can access via the API (modulo dubious rumours of secret tests of more advanced models, which in any case would be uncontrollable in OP's experiment).

It's likely that I misunderstood something; I'm not very familiar with the various offerings. I was going by OP's own admission that they didn't pay for the top model, and that their version was only able to solve 7 (sub)problems vs 13(ish) for ChatGPT Plus, which seemed to imply that the latter is a stronger problem solver.