Culture War Roundup for the week of March 3, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Claude AI playing Pokemon shows AGI is still a long ways off

(Read this on Substack for some funny pictures)

Evaluating AI is hard. One of the big goals of AI is to create something that can functionally act like a human, commonly known as "Artificial General Intelligence" (AGI). The problem with testing AIs is that their intelligence is often "spiky": really good in some areas but really bad in others, so any single test is likely to be woefully inadequate. Computers have always been very good at math, and even something as simple as a calculator can easily trounce humans at simple arithmetic. This has been true for decades, if not over a century. But calculators obviously aren't AGI. They can do one thing at a superhuman level, but are useless for practically anything else.

LLMs like ChatGPT and Claude are more like calculators than AI hype-meisters like to let on. When they burst onto the scene in late 2022, they certainly seemed impressively general. You could ask them a question on almost any topic, and they'd usually give a coherent answer so long as you excused the occasional hallucinations. They also performed quite well on human measurements of intelligence, such as college-level exams, the SAT, and IQ tests. If LLMs could do well on the definitive tests of human intelligence, then certainly AGI was only months or even weeks away, right? The problem is that LLMs are still missing quite a lot of things that would make them practically useful for most tasks. In the words of Microsoft's CEO, they're "generating basically no value". There's some controversy over whether the relative lack of current applications is a short-term problem that will be solved soon, or whether it's indicative of larger issues. Claude's performance playing Pokemon Red points quite heavily toward the latter explanation.

First, the glass-half-full view: that Claude can play Pokemon at all is impressive at baseline. If we just wanted any computer algorithm to play games, TAS speedruns have existed for a while, but that would be missing the point. While AI playing a children's video game isn't exactly Kasparov vs. Deep Blue, the fact that it's built on something as general as an LLM is remarkable. It has rudimentary vision to see the screen and respond to events as they come into view. It interacts with the game through a bespoke button-entering system built by the developer. It interprets a coordinate system to plan movement to different squares on the screen. It accomplishes basic tasks like battling and rudimentary navigation in ways that are vastly superior to random noise; it's much better than monkeys randomly plugging away at typewriters. This diagram by the dev shows how it works.
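To make the architecture concrete, the loop is conceptually something like the sketch below. To be clear, this is my own hypothetical reconstruction, not the dev's actual code; the emulator wrapper, model call, and parsing are all stand-ins:

```python
import random

# Hypothetical sketch of an LLM game-playing harness loop. The emulator
# binding and model client here are stubs, not the real Claude Plays
# Pokemon scaffold.

VALID_BUTTONS = ["a", "b", "up", "down", "left", "right", "start", "select"]

class StubEmulator:
    """Stands in for a real Game Boy emulator binding."""
    def screenshot(self) -> bytes:
        return b""  # a real harness returns the rendered frame here

    def press(self, button: str) -> None:
        print(f"pressed {button}")

def ask_model(prompt: str, image: bytes) -> str:
    # Stand-in for an API call to a vision-capable LLM.
    return f"I will continue exploring. press: {random.choice(VALID_BUTTONS)}"

def parse_button(reply: str) -> str | None:
    # The harness has to coerce free-form model text into a legal input.
    for b in VALID_BUTTONS:
        if f"press: {b}" in reply:
            return b
    return None

def play_step(emu: StubEmulator, history: list[str]) -> None:
    frame = emu.screenshot()  # the model's only "eyes" on the game
    prompt = ("You are playing Pokemon Red. The screen has a coordinate "
              "grid overlay. Decide the next button press.\n"
              f"Recent actions: {history[-10:]}")
    button = parse_button(ask_model(prompt, frame))
    if button is not None:
        emu.press(button)
        history.append(button)

if __name__ == "__main__":
    emu, history = StubEmulator(), []
    for _ in range(3):
        play_step(emu, history)  # one round of "thinking" per button press
```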

I have a few critiques that likely aren't feasible for a single developer to address, but would still be good to keep in mind when/if capabilities improve. The goal should be to play the game the way a player would, so it shouldn't be able to read directly from the RAM; it should rely only on what it can see on the screen. It also shouldn't need a bespoke button-entering system at all, and should instead send inputs through something like ChatGPT's Operator. There should be absolutely no game-specific hints given, and ideally its training data wouldn't include Pokemon Red (or anything Pokemon-related at all). That said, this current iteration is still a major step forward.

Oh God it’s so bad

Now the glass-half-empty view: it sucks. It's decent enough at the battles, which have very few degrees of freedom, but it's enormously buffoonish at nearly everything else. There's an absurdist comedy element to an uncanny-valley AI that's good enough to seem like it's almost playing the game as a human would, but bad enough that it seems severely psychotic and nonsensical, in ways reminiscent of early LLMs writing goofy Harry Potter fanfiction. Some of the best moments range from it erroneously thinking it was stuck and writing a letter to Anthropic employees demanding they reset the game, to it developing an innovative new tactic for faster navigation called the "blackout strategy", where it tries to commit suicide as quickly as possible to reset to the most recently visited Pokemon Center... and then repeats this in the same spot over and over again. This insanity also infects its moment-to-moment thinking, from hallucinating that any rock could be a Geodude in disguise (pictured at the top of this article) to thinking it could judge a Jigglypuff's level solely by its girth.

All these attempts are streamed on Twitch, and they could make for hilarious viewing if it weren't so gosh darn slow. There's a big lag between actions as the agent does each round of thinking. Something as simple as running from a random encounter, which would take a human no more than a few seconds, can last up to a full minute as Claude slowly thinks about pressing 'A' for the introductory text "A wild Zubat has appeared!", then thinks again about moving its cursor to the right, then thinks again about moving its cursor down, and then thinks one last time about pressing 'A' again to run from the battle. Even in the best of times, everything is covered in molasses. The most likely reaction to watching this is boredom once the novelty wears off after a few minutes. As such, the best way to "watch" this insanity is on a second monitor, or to just hear the good parts second-hand from people who watched it themselves.

Is there an AI that can watch dozens of hours of boring footage and only pick out the funny parts?

By far the worst aspect, though, is Claude's inability to navigate. It gets trapped in loops very easily, and is needlessly distracted by any objects it sees. The worst example of this so far has been its time in Mount Moon, a fairly (though not entirely) straightforward level that most kids probably beat in 15-30 minutes. Claude got trapped there for literal days, its typical loop being: go down a ladder, wander around a bit, find the ladder again, go back up the ladder, wander around a bit, find the ladder, go back down again, repeat. It's like watching a sitcom about a man with a 7-second memory.

There's supposed to be a second AI (Critique Claude) to help evaluate actions from time to time, but it's mostly useless: LLMs are inherently yes-men, so when it's talking to the very deluded and hyperfixated main Claude, it just goes along with it. Even when it disagrees, main Claude acts like a belligerent drunk and simply ignores it.

In the latest iteration, the dev created a tool for storing long-term memories. I'm guessing the hope was that Claude would write down that certain ladders were dead ends and thus should be ignored, which would have gone a long way toward fixing the navigation issues. However, it appears to have backfired: while Claude does record some information about dead ends, it has a tendency to delete those entries fairly quickly, which renders them pointless. Worse, it seems to have made Claude remember that its "blackout strategy" "succeeded" in getting out of Mount Moon, prompting it to double, triple, and quadruple down on it. I'm sure there's some dark metaphor in the development of long-term memory leading to Claude chaining suicides.
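From the stream, the tool appears to behave like a free-text note store the model can append to and prune, which is exactly what makes this failure mode possible. A minimal sketch, assuming such an interface (the real tool isn't public, so every name here is hypothetical):

```python
# Hypothetical sketch of a long-term memory tool of the kind described
# above. It illustrates the failure mode, not the dev's actual code.

class NoteMemory:
    def __init__(self) -> None:
        self.notes: list[str] = []

    def add(self, text: str) -> None:
        self.notes.append(text)

    def delete(self, index: int) -> None:
        self.notes.pop(index)

    def render(self) -> str:
        # What gets pasted back into the prompt each turn.
        return "\n".join(f"{i}: {n}" for i, n in enumerate(self.notes))

mem = NoteMemory()
mem.add("Ladder at B1F (12, 4) is a dead end; do not take it again.")
mem.add("Blackout strategy succeeded in leaving Mount Moon.")  # a false belief
mem.delete(0)  # the useful dead-end note is the one that gets pruned
print(mem.render())  # only the false belief survives into the next prompt
```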

What does this mean for AGI predictions?

Watching this trainwreck has been one of the most lucid negative advertisements for LLMs I've seen. A lot of the perceptions about when AGI might arrive are based on the vibes people get from watching what AI can do. LLMs can seem genuinely godlike when they spin up a full-stack web app in <15 seconds, but the vibes come crashing back down to Earth when people see Claude bumbling around in circles for days in a simplistic video game made for children.

The "strawberry" test was a frequent stumbling block for early LLMs, which often claimed the word contains only 2 R's. The problem has mostly been fixed by now, but there are questions to be asked about how. Was it resolved by LLMs genuinely becoming smarter, or did the people making LLMs cheat a bit by hardcoding special logic for these types of questions? If it's the latter, then problems will tend to arise when the AI encounters the issue in a novel format, as Gary Marcus recently showed. But of course, the obvious followup question is "does this matter?" So what if LLMs can't do the extremely specific task of counting letters, if they can do almost everything else? It might be indicative of some greater issue... or it might not.

But it's a lot harder to argue that game playing is an irrelevant metric. Pokemon Red is a pretty generous test for many reasons: there's no punishment for long delays between actions; it's a children's game, so it's not very hard; and the creator is using a coloring mod to make the screen easier to see (this is why Jigglypuff's eyes look a bit screwy in the picture above). Yet despite all this, Claude still sucks. If it can't even play a basic game, how could anyone expect LLMs to do regular office work for, say, $20,000 a month? The long-term memory and planning just isn't there yet, and that's not exactly a trivial problem to solve.

It's possible that Claude will beat Pokemon this year, probably through some combination of brute force and knowledge overfit to the game at hand. However, I find it fairly unlikely (<50% chance) that by the end of 2025 there will be an AI that can 1) play Pokemon at the level of a human child, i.e. beat the game, handle basic navigation, and avoid huge lag between trivial actions, and 2) be genuinely general (putting the G in AGI) rather than just overfit to Pokemon, with evidence coming from achieving similar results in similar games like Fire Emblem, Dragon Quest, early Final Fantasy titles, or whatever else.

LLMs are pretty good right now at a narrow slice of tasks, but they’re missing a big chunk of the human brain that would allow them to accomplish most tasks. Perhaps this can be remedied through additional “scaffolding”, and I expect “scaffolding” of various types to be a big part of what gives AI more mainstream appeal over the next few years (think stuff like Deep Research). Perhaps scaffolding alone is insufficient and we need a much bigger breakthrough to make AI reasonably agentic. In any case, there will probably be a generic game-playing AI at some point in the next decade… just don’t expect it to be done by the end of the year. This is the type of thing that will take some time to play out.

From the context, it seems that Claude was not really trained (in the NN sense) to play Pokemon.

Deployed LLMs are limited by the length of their context window. Moreover, they have been trained to generate text, not to be especially good at being mesa-trainable at deployment time.

It is a bit like complaining that a chip which was designed to decode MP3 makes a terrible FPGA.

From the context, it seems that Claude was not really trained (in the NN sense) to play Pokemon.

I've seen this defense a lot but IMO it holds no water. The whole point of an AGI would be that you don't have to train it for specific tasks. The G stands for "general", after all. If you have to train it specially to play Pokemon, then it fails the test.

I have no idea why people think it would even be valuable to train a model to play Pokemon. We have had purpose-built computer programs playing games like chess, Go, and even Starcraft for decades now. Doing that with Pokemon is obviously achievable, but it also isn't impressive.

However, I find it fairly unlikely (<50% chance) that by the end of 2025 there will be an AI that can 1) play Pokemon at the level of a human child, i.e. beat the game, handle basic navigation, and avoid huge lag between trivial actions, and 2) be genuinely general (putting the G in AGI) rather than just overfit to Pokemon, with evidence coming from achieving similar results in similar games like Fire Emblem, Dragon Quest, early Final Fantasy titles, or whatever else.

Back in 2020, DeepMind published MuZero, which could play not only chess and Go but also Atari games. It wasn't an LLM, but it was a deep RL system, and I'm pretty confident it's capable of learning to play Pokemon well past the level of the average child.

LLMs playing Pokemon badly is like teaching your dog perfect English and then observing that in the process he picked up mediocre French, despite never being taught it. Would we criticize the dog for that? What's impressive about Claude is that it can kind of play at all, despite not being trained to.

Now, is there a significant gap between Claude and MuZero? Who knows; MuZero did use (lots of) task-specific training, and maybe with some, Claude could match it (not interesting to me; I only care about Pokemon as a test of how well a model can one-shot it). But 3 years ago, transformer-based models were very limited in what they could do compared to today; 3 months ago, they couldn't play Pokemon at all; today, they play Pokemon as badly as a 90-year-old grandma with dementia (though badly in a very alien way). I'd be surprised if 3 months from now any LLM could play Pokemon as well as a child, but I'd be even more surprised if 3 years from now they didn't play flawlessly. And although getting there from here is highly nontrivial, it's also not some vast unknown that researchers and engineers have no idea how to approach.

Would we criticize the dog for that?

Of course we would. Or rather, we'd call people idiots if they claimed dogs were going to evolve Real Soon Now (tm) into superintelligent beings.

LLMs are OK as text predictors. That's also what they are: text predictors. And that's what they'll remain without a significant revamp of the entire structure that gives them their learning capability.

If the task of taking over the world is harder than playing Pokemon, then I suppose a deployment of the current iteration of Claude will not take over the world (unless it has more reason to get good at taking over the world during training than it has to get good at Pokemon).

I think that, given recent developments, anyone who is very confident about whether we will have ASI by 2050 is overconfident.

I'm reminded of a joke from the idle incremental game community: 'Ah, I see we've reached 1% of our target. Splendid, almost done!'

Seems like about 2 years between 'near the bottom of the human ability range' and 'clearly exceeding 99th-percentile human output in any particular domain', given current levels of investment. So if the definition of AGI is 'can beat the Elite Four deathless', we're on track for ~2027 as expected.

I don't understand people who can see the current state of AI and the trendline and not at least see where things are headed unless we break trend. You guys know that a few years ago our state of the art could hardly complete coherent paragraphs, right? Chain-of-thought models are literally a couple of months old. How could you possibly be this confident we've hit a stumbling block because one developer's somewhat janky implementation has hiccups? And one of the criticisms is speed, which is something you can just throw more compute at and scale linearly.

I'm suspicious of these kinds of extrapolation arguments. Advances aren't magic; people have to find and implement them. Sometimes you just hit a wall. So far most of what we've been doing is milking transformers, which was a great discovery, but I think this playthrough is strong evidence that transformers alone are not enough to make a real general intelligence.

One of the reasons hype is so strong is that these models are optimized to produce plausible, intelligent-sounding bullshit. (That's not to say they aren't useful. Often the best way to bullshit intelligence is to say true things.) If you're used to seeing LLMs perform at small, one-shot tasks and riddles, you might overestimate their intelligence.

You have to interact with a model on a long-form task to see its limitations. Right now, the people who are doing that are largely programmers and /g/ gooners, and those are the most likely people to have realistic appraisals of where we are. But this Pokemon thing is an entertaining way to show the layman how dumb these models can be. It's even better at this because LLMs tend to stealthily "absorb" intelligence from humans by getting gently steered by hints they leave in their prompts, whereas this game forces the model to rely on its own output, leading to hilarious ideas like the blackout strategy.

To put the obvious counterpoint out there, Claude was never actually designed to play video games at all, and has gotten decent at doing so in a couple of months. The drawbacks are still there: navigation sucks, it's kinda slow, it likes to suicide, etc. But even then, the system was not designed to play games at all.

To me, this is a success, as it's demonstrating the use of information in its memory to make an informed decision about outcomes. It can meet a monster, read its name, know its stats, and think about whether its own stats are good enough to take it on. This is applied knowledge. Applied knowledge is one of the hallmarks of general understanding. If I can only apply a procedure when told to do so, I don't understand it. If I can use that procedure in the context of solving a problem, I do understand it. Claude at minimum understands the meaning of the stats it sees (level, HP, stamina, strength, etc.), understands that the ratio between the monster's stats and its own is important, and understands that if the monster has better stats than the player, the player will lose. That's thinking strategically based on the information at hand.

Claude didn't "get decent at playing" games in a couple of months. A human wrote a scaffold to let a very expensive text prediction model, along with a vision model, attempt to play a video game. A human constructed a memory system and knowledge transfer system, and wired up ways for the model to influence the emulator, read relevant RAM states, wedge all that stuff into its prompt, etc. So far this is mostly a construct of human engineering, which still collapses the moment it gets left to its own devices.

When you say it's "understanding" and "thinking strategically", what you really mean is that it's generating plausible-looking text that, in the small, resembles human reasoning. That's what these models are designed to do. But if you hide the text window and judge it by how it's behaving, how intelligent does it look, really? This is what makes it so funny: the model is slowly blundering around in dumb loops while producing volumes of eloquent, optimistic narrative about its plans and how much progress it's making.

I'm not saying there isn't something there, but we live in a world where it's claimed that programmers will be obsolete in 2 years, people are fretting about superintelligent AI killing us all, OpenAI is planning to rent "PhD-level" AI agent "employees" to companies for large sums, etc. Maybe this is a sign that we should back up a bit.

What I mean by thinking strategically is exactly what makes the thing interesting. It's not just creating plausible text; it understands how the game works. It understands that losing HP means losing a life, and thus whether the enemy's HP and STR are too high for it to handle at its current level. In other words, it can contextualize that information and use it not only to understand, but to work toward a goal.

I'm not saying this is the highest standard. It's about what a 3-4 year old can understand about a game of that complexity. And as a proof of concept, I think it shows that AI can reason a bit. Give this thing 10 years and a decent research budget, and I think it could probably take on something like Morrowind. It's slow, but given what it can do now, I'm pretty optimistic that AI will be able to make data-driven decisions in a fairly short timeframe.

Advances aren't magic; people have to find and implement them.

And the silicon to do so has to be there. The silicon that notably hasn't been able to increase exponentially for two decades now without a corresponding exponential increase in investment.

Almost all improvements since late 2022 have been marginal. GPT-4.5 was a disappointment outside of perhaps having a better personality. There's no robust solution to hallucinations, or to any of the myriad other problems LLMs have. I imagine there will be some decent products in the same vein as Deep Research, but AGI is highly unlikely in the short term.

I made my prediction 2 years ago and am increasingly sure it will come true.

Would you care to make a prediction as to when (or if!) LLMs will be able to reliably clear games on the level of Pokemon (i.e., 2D & turn-based)? If a new model could do it, would you believe that represents an improvement that isn't just "marginal"? Assume it has to generalize beyond specific games, so it should be able to clear early Final Fantasy titles (and similar) too, to cover the case where "Pokemon ability" becomes a benchmark and gets Goodharted.

Personally I don't think it's necessarily impossible for current models with the proper tooling and prompting, but it might be prohibitively difficult. The problem is a little underspecified; for it to be "fair" I'd want to limit the LLMs to screenshot input and tools/extensions they write themselves (for navigation, parsing data, long-term planning, memory, etc.). I don't think 2 years is too optimistic, though.

(I asked 3.7 Sonnet, which suggested "3-5 years", o3-mini which said "a decade or more", DeepSeek said "5-7 years", and Grok "5-10 years".)

On the other hand, have you seen old non-computer people trying to play video games? They make a lot of mistakes that sound very similar (due to a lack of "gamer common sense" about what parts of the UI and stage design matter and what sort of objectives there are), and that's with vision much less scuffed than whatever vision model has been bolted onto the LLM here. I wouldn't be surprised if this turned out to be yet another thing where some token amount of 8xA100 finetuning on actual successful playthrough transcripts for a few games results in the "play arbitrary games by chain-of-thought" barrier falling faster than Substack AI doomers can prepare the next goalpost article (unless they get an LLM to help with the writing).

due to a lack of "gamer common sense" about what parts of the UI and stage design matter and what sort of objectives there are

It's actually quite remarkable, though a bit sad, that I've started to experience the same thing from time to time. Sure, I can bitch about discoverability and all that all day long, and counsel people that yes, things are (to a point) laid out logically, but at the end of the day, if the guesses aren't good enough, they aren't good enough, and that's the way it is.

LLMs have intelligence; what they don't have is advanced spatial skills and visual comprehension. Claude Sonnet 3.7 is designed first and foremost to code, and secondarily as a writer/conversationalist. Game-playing and coding cathedrals in Minecraft (which Sonnet does quite well) are tertiary capabilities that they didn't even aim for but are testing anyway. They didn't try to make it good at this; unsurprisingly, it's not that great at it.

I had Sonnet play through a Civ 4 game where I implemented its strategy and tactics. It was perfectly capable of reacting to text inputs but didn't really understand the pictures, where units were in relation to each other. If you give it inputs in the medium it understands best, text, it's pretty capable. When these AIs struggle with "strawberry", that's because their tokenizer can't properly count letters: they never see a single letter.
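You can see the mechanism for yourself with OpenAI's open-source tiktoken tokenizer (Claude's tokenizer is different, but it chunks text the same way): the model is handed a few multi-character tokens rather than letters, while the count is trivial for any program that actually sees characters.

```python
import tiktoken  # pip install tiktoken

# A BPE tokenizer hands the model multi-character chunks, never letters.
enc = tiktoken.get_encoding("cl100k_base")
for tok in enc.encode("strawberry"):
    print(tok, enc.decode_single_token_bytes(tok))
# Prints a few token IDs and their byte chunks; none of them is a
# single letter, so "count the r's" is not a natural question to ask
# of the model's input.

print("strawberry".count("r"))  # 3: trivial when you can see the letters
```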

Have a look at what they do with code in Minecraft: https://youtube.com/watch?v=FCnQvdypW_I

It's a long video, but you can take a general look at what they build. Can you build a house by coding it in? OK, they're not that great at stairs. But they are directly coding things in. I bet 98% of the planet couldn't do this specific niche skill that humans have no aptitude for: coding architecture into Minecraft. It's not a tool in our arsenal. It doesn't discredit our intelligence that we're bad at things we're not supposed to do, in mediums unintuitive to us.

Now consider AlphaGo. You can't beat it; nobody can. A few people can beat Google's 7-year-old Starcraft AI pro, though not very many (and that's the version downgraded to non-vastly-superhuman speed).

That Sonnet can kind of play Pokemon is proof in my mind that AGI is imminent. It proves that Sonnet's intelligence generalizes out into domains it wasn't remotely designed for, even crippled by the visual input barrier. Combined with the specialized bots, we have both 'extremely broad' and 'extremely capable'. All that remains is marrying the strengths of both approaches together, scaling up, working on visuals and tweaking.

Consider that just today another AI agent arrived, and it seems pretty capable: https://x.com/LamarDealMaker/status/1898454061277458498

Would you be willing to state here what your predictions are for when you think we're likely to have AGI? It might also be good to have a definition of AGI, or whatever metric that could be used to judge success. For instance, "I predict we'll have AGI by 2028, and it will lead to >5% annual GDP growth or >10% unemployment thereafter"... something like that?

AI-boosters seem to think any piece of evidence points to AGI being imminent. I just can't imagine how someone would look at Claude's performance playing Pokemon and see it as a good sign.

I think by 2027. Executives at Anthropic and OpenAI seem to give that as their date.

Defining AGI is tricky. I think the key element is take-off. AI right now is best at coding, and you use code to make AI. Recursive self-improvement is the name of the game: AGI will be when you have bots performing the tasks that AI researchers do now in collating data and training models (not in the partial sense of how synthetic data is used today, but in a holistic sense where the main intellectual work is done by AIs). And AGI will be ephemeral, because superintelligence will happen immediately afterwards and things will get very crazy very quickly; the world power structure will change fundamentally. Nobody will be asking 'what does this mean for unemployment', because that will be the least of our concerns. We will know AGI when we see it.

Little things like mapping the relations between objects in space don't disprove intelligence. If you presented Pokemon like a text adventure game, Claude would have no problem winning. The intelligence is there; that's what they're working on. Advanced vision isn't there, but people don't particularly need AIs to play Pokemon; they're needed for writing code.

And yes I am heavily, heavily invested in AI companies, so I have some skin in the game.

If you presented Pokemon like a text adventure game, Claude would have no problem winning.

Pretty much doubt that, given that with chess it loses track of the game state very quickly.

If you presented Pokemon like a text adventure game, Claude would have no problem winning.

Text adventure games exist. Has anyone tried pitting Claude against one?

I tried with Colossal Cave, but it's in the training set.

AI right now is best at coding, and you use code to make AI. Recursive self-improvement is the name of the game: AGI will be when you have bots performing the tasks that AI researchers do now in collating data and training models (not in the partial sense of how synthetic data is used today, but in a holistic sense where the main intellectual work is done by AIs).

I'm struggling to come up with a way to contribute something valuable to the conversation; I just want to put my name down as predicting that nothing remotely close to this will end up happening in two years.

We already have AI models improving kernels (for AI deployments) today. It's not a big leap of logic that they'll do more and more, achieving takeoff.

https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/

And yes I am heavily, heavily invested in AI companies, so I have some skin in the game.

Do you have any interesting recommendations? It always seemed like apart from Google, the most interesting ones are not publicly traded. MSFT for a while seemed like a way to get exposure to OpenAI, but now there are rumours that they may want to divest.

I saw a clip of Neuro being amazingly good at "Where's Waldo." Like "he's by the boat with yellow sails at the top right" level of identifying and understanding images. I wonder if that sort of skill is done with separate models that talk to the main one, or if it's integrated.

In some sense I think this is even more damning than it may seem to a naive viewer.

Like, that scratchpad memory a la Memento strategy should absolutely work, assuming you have an intelligent agent in the first place. The fact that it works so poorly is a sign there are serious problems with the thinking part of SotA LLMs. It's not merely a memory problem.

My reaction was that this is just about like most other advances in ML - simultaneously really cool/impressive and hilariously bad. It is genuinely really cool and impressive that it's done as well as it has. Someone on YouTube took a more pure RL approach a few years back, and it failed suuuuuper hilariously badly (in beautifully hilarious ways). Claude has definitely done better, and that's pretty legit, given that the core of it was trained to be an LLM, not to play video games. But one of the most true statements about ML still seems to hold true: "It's great when you want to model something where you don't know how to describe the underlying structure... and you're okay with it being hilariously wrong some percentage of the time." Some might think that the percentage of time that it's hilariously wrong is just a little bit too high, and it won't even need to drop that much before it works out pretty decently.

It's not surprising that it needs some scaffolding. The Bitter Lesson Believers will always believe in their hearts that they can eventually drop the scaffolding, and maybe they'll be able to, sometimes. But most of the big advances we've had in ML are because we've exploited some sort of structure in the world. And the most killer applications are where we have very good feedback in a very structured fashion (e.g., tree structures in board games, math/coding engines, etc.).

It definitely puts a damper on any predictions that AI is going to ingeniously conquer the world later this year, but as you project further and further into the future, it's always a matter of, "It's difficult to make predictions, especially about the future."

Someone on YouTube took a more pure RL approach a few years back, and it failed suuuuuper hilariously badly (in beautifully hilarious ways). Claude has definitely done better, and that's pretty legit, given that the core of it was trained to be an LLM, not to play video games.

Funnily enough, according to the GitHub repo, that RL approach also gets to Cerulean City, the same point as the LLM.

Contrary to OP's interpretation, I find it very significant that LLMs are competing with SOTA reinforcement learning on a control task like this, and an indicator that we are on the precipice of AGI. Having done RL, I can say it's super difficult to do well and to keep stable on a complex task (or even an ostensibly simple one...). The author of that RL project you mentioned spent years on it, dedicating thousands of hours of training time and hand-tailoring reward functions to complete this one task.

I find it incredible that one day soon it may be possible to just create an RL environment and use an LLM to solve it, rather than traditional RL methods. AGI is here when we reach that point, IMO. LLMs still seem limited by not correctly calibrating the exploration/exploitation tradeoff, which is a pitfall of RL as well.
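For anyone who hasn't done RL: the exploration/exploitation tradeoff is the dial classical methods tune explicitly, with epsilon-greedy action selection as the textbook example. A toy sketch (illustrative only; nothing like this is wired into the Claude harness):

```python
import random

def epsilon_greedy(value_estimates: list[float], epsilon: float) -> int:
    """Pick an action: explore at random with probability epsilon,
    otherwise exploit the current best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))  # explore
    # exploit: arm with the highest estimated value
    return max(range(len(value_estimates)), key=lambda i: value_estimates[i])

# An agent that keeps replaying its "blackout strategy" is, in these terms,
# stuck exploiting a badly misjudged arm with epsilon effectively at zero.
print(epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1))
```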

It makes sense that they get stuck at Cerulean City: the only way to leave the city and progress is to go on a side track to rescue Bill, then you need to go into an NPC's house and out their back door. Most NPC houses don't have back doors, and this is the only time you need to go through a house to progress instead of going into a house and then leaving the way you came in.

according to the GitHub repo, that RL approach also gets to Cerulean City

Huh. I hadn't looked at it in a while. What I recall from when I last looked was that they were completely stuck on Mt. Moon. Looks like the repo itself doesn't contain the latest; they claim to have actually finished the game as of last month. Of course, there are a lot of caveats. They significantly simplified the game, are using a bunch of scripts to provide extra guardrails, super-detailed reward shaping (sometimes an ad hoc reward for a particular level), etc. The one that stood out to me originally is still there: they simply skip the entire first part of the game, because it was just failing, conceptually, to even get its first Pokemon or deliver the parcel to Prof. Oak or something (I don't remember exactly). It couldn't even get off the ground, so they literally just skip it.
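For readers who haven't seen reward shaping in practice, it means hand-writing bonuses and penalties to steer the agent. A sketch of the general style (the field names and weights are my own illustration, not taken from the repo):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    tiles_seen: int     # unique map tiles visited so far
    badges: int         # gym badges earned
    party_levels: int   # sum of party Pokemon levels
    blacked_out: bool   # whether the player just blacked out

def shaped_reward(prev: Snapshot, cur: Snapshot) -> float:
    r = 0.0
    r += 0.01 * (cur.tiles_seen - prev.tiles_seen)      # exploration bonus
    r += 1.00 * (cur.badges - prev.badges)              # big milestone reward
    r += 0.10 * (cur.party_levels - prev.party_levels)  # leveling reward
    if cur.blacked_out and not prev.blacked_out:
        r -= 1.0                                        # discourage blackouts
    return r

before = Snapshot(tiles_seen=100, badges=0, party_levels=15, blacked_out=False)
after = Snapshot(tiles_seen=130, badges=1, party_levels=17, blacked_out=False)
print(shaped_reward(before, after))  # 0.3 + 1.0 + 0.2 = 1.5
```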

These sorts of exercises are always delicate to comprehend, because it's almost never easy to figure out how to think about the unique ways they're inserting human expertise. Does it matter that mayyyybe it only needs a liiiiittle help on something that is bone-stupid easy for a human, but incomprehensibly hard for whatever iteration of AI we have? Will that get papered over suddenly in two months by some other method? Or is it just a limited tool that, with plenty of human expertise, guidance, and delicacy, can still do pretty phenomenal things? Who knows?

We are seeing a lot of pieces, and it's easy to imagine that if we just glue the right general-purpose version of those pieces together in the right way, it'll really soar. But there are still a lot of ways to be skeptical, and they happen to be concepts that are kind of hard to think about, given that we lack a really appropriate theory structure.

By far the worst aspect, though, is Claude's inability to navigate. It gets trapped in loops very easily, and is needlessly distracted by any objects it sees. The worst example of this so far has been its time in Mount Moon, a fairly (though not entirely) straightforward level that most kids probably beat in 15-30 minutes. Claude got trapped there for literal days, its typical loop being: go down a ladder, wander around a bit, find the ladder again, go back up the ladder, wander around a bit, find the ladder, go back down again, repeat. It's like watching a sitcom about a man with a 7-second memory.

If you think this is bad, wait for Silph Co., the level that so fundamentally broke Twitch Plays Pokemon that the managers had to change the play process. Either the LLM is going to ace it insanely fast, or it is going to be very, very slow.