Claude AI playing Pokemon shows AGI is still a long ways off
(Read this on Substack for some funny pictures)
Evaluating AI is hard. One of the big goals of AI is to create something that could functionally act like a human -- this is commonly known as “Artificial General Intelligence” (AGI). The problem with testing AIs is that their intelligence is often “spiky”, i.e. really good in some areas but really bad in others, so any single test is likely to be woefully inadequate. Computers have always been very good at math, and even something as simple as a calculator could easily trounce humans when it comes to doing simple arithmetic. This has been true for decades if not over a century. But calculators obviously aren’t AGI. They can do one thing at a superhuman level, but are useless for practically anything else.
LLMs like ChatGPT and Claude are more like calculators than AI hype-meisters would like to let on. When they burst onto the scene in late 2022, they certainly seemed impressively general. You could ask them a question on almost any topic, and they’d usually give a coherent answer so long as you excused the occasional hallucinations. They also performed quite well on human measurements of intelligence, such as college level exams, the SAT, and IQ tests. If LLMs could do well on the definitive tests of human intelligence, then certainly AGI was only months or even weeks away, right? The problem is that LLMs are still missing quite a lot of things that would make them practically useful for most tasks. In the words of Microsoft’s CEO, they’re “generating basically no value”. There’s some controversy over whether the relative lack of current applications is a short-term problem that will be solved soon, or if it’s indicative of larger issues. Claude’s performance playing Pokemon Red points quite heavily toward the latter explanation.
First, the glass-half-full view: Claude’s ability to play Pokemon at all is highly impressive at baseline. If we were just looking for any computer algorithm to play games, then TAS speedruns have existed for a while, but that would be missing the point. While AI playing a children’s video game isn’t exactly Kasparov vs Deep Blue, the fact it’s built off of something as general as an LLM is remarkable. It has rudimentary vision to see the screen and respond to events as they come into the field of view. It interacts with the game through a bespoke button-entering system built by the developer. It interprets a coordinate system to plan moves to different squares on the screen. It accomplishes basic tasks like battling and rudimentary navigation in ways that are vastly superior to random noise. It’s much better than monkeys randomly plugging away at typewriters. This diagram by the dev shows how it works.
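For readers who want a more concrete picture of the harness, here’s a minimal sketch of the kind of loop the diagram describes. Every name in it (the Memory class, ask_claude, the emulator methods) is a hypothetical stand-in of mine, not the developer’s actual code or Anthropic’s API:

```python
# Minimal sketch of an "LLM plays Pokemon" agent loop. All names here are
# hypothetical stand-ins, not the developer's actual harness.
from dataclasses import dataclass, field


@dataclass
class Memory:
    notes: list[str] = field(default_factory=list)  # long-term notes the model can edit

    def read(self) -> str:
        return "\n".join(self.notes)


def play_step(emulator, ask_claude, memory: Memory) -> None:
    # 1. "Vision": grab the current frame, annotated with a coordinate grid
    #    so the model can plan moves to specific tiles.
    frame = emulator.screenshot_with_grid()

    # 2. One round of "thinking": the model sees the frame plus its notes
    #    and returns which buttons to press next.
    decision = ask_claude(image=frame, notes=memory.read())

    # 3. Bespoke button-entering: translate the decision into emulator inputs.
    for button in decision["buttons"]:  # e.g. ["UP", "UP", "A"]
        emulator.press(button)

    # 4. Optionally let the model append to its long-term notes.
    if decision.get("note"):
        memory.notes.append(decision["note"])


def play_game(emulator, ask_claude, memory: Memory) -> None:
    # Each iteration is a full round of thinking, which is why even trivial
    # actions (like running from a wild Zubat) take so long in real time.
    while not emulator.game_beaten():
        play_step(emulator, ask_claude, memory)
```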
I have a few critiques that likely aren’t possible for a single developer, but would still be good to keep in mind when/if capabilities improve. The goal should be to play the game like a player would, so it shouldn’t be able to read directly from the RAM, and instead it should only rely on what it can see on the screen. It also shouldn’t need to have a bespoke button-entering system designed at all and should instead do this using something like ChatGPT’s Operator. There should be absolutely no game-specific hints given, and ideally its training data wouldn’t have Pokemon Red (or even anything Pokemon-related) included. That said, though, this current iteration is still a major step forward.
Oh God it’s so bad
Now the glass-half-empty view: It sucks. It’s decent enough at the battles, which have very few degrees of freedom, but it’s enormously buffoonish at nearly everything else. There’s an absurdist comedy element to the uncanny valley AI that’s good enough to seem like it’s almost playing the game as a human would, but bad enough that it seems severely psychotic and nonsensical, in ways similar to early LLMs writing goofy Harry Potter fanfiction. Some of the best moments range from it erroneously thinking it was stuck and writing a letter to Anthropic employees demanding they reset the game, to developing an innovative new tactic for faster navigation called the “blackout strategy”, where it tries to commit suicide as quickly as possible to reset to the most recently visited Pokemon center… and then repeats this in the same spot over and over again. This insanity also infects its moment-to-moment thinking, from hallucinating that any rock could be a Geodude in disguise (pictured at the top of this article), to thinking it could judge a Jigglypuff’s level solely by its girth.
All these attempts are streamed on Twitch, and they could make for hilarious viewing if it wasn’t so gosh darn slow. There’s a big lag between its actions as the agent does each round of thinking. Something as simple as running from a random encounter, which would take a human no more than a few seconds, can last up to a full minute as Claude slowly thinks about pressing ‘A’ for the introductory text “A wild Zubat has appeared!”, then thinks again about moving its cursor to the right, then thinks again about moving its cursor down, and then thinks one last time about pressing ‘A’ again to run from the battle. Even in the best of times, everything is covered in molasses. The most likely reaction to watching this is boredom once the novelty wears off after a few minutes. As such, the best way to “watch” this insanity is on a second monitor, or to just hear the good parts second-hand from people who watched it themselves.
Is there an AI that can watch dozens of hours of boring footage and only pick out the funny parts?
By far the worst aspect, though, is Claude’s inability to navigate. It gets trapped in loops very easily, and is needlessly distracted by any objects it sees. The worst example of this so far has been its time in Mount Moon, which is a fairly (though not entirely) straightforward level that most kids probably beat in 15-30 minutes. Claude got trapped there for literal days, with its typical loop being going down a ladder, wandering around a bit, finding the ladder again, going back up the ladder, wandering around a bit, finding the ladder, going back down again, repeat. It’s like watching a sitcom about a man with a 7-second memory.
There’s supposed to be a second AI (Critique Claude) to help evaluate actions from time to time, but it’s mostly useless since LLMs are inherently yes-men: when he’s talking to the very deluded and hyperfixated main Claude, he mostly just goes along with it. Even when he does disagree, main Claude acts like a belligerent drunk and simply ignores him.
In the latest iteration, the dev created a tool for storing long-term memories. I’m guessing the hope was that Claude would write down that certain ladders were dead-ends and thus should be ignored, which would have gone a long way towards fixing the navigation issues. However, it appears to have backfired: while Claude does indeed record some information about dead-ends, he has a tendency to delete those entries fairly quickly, which renders them pointless. Worse, it seems to have made Claude remember that his “blackout strategy” “succeeded” in getting out of Mount Moon, prompting it to double, triple, and quadruple down on it. I’m sure there’s some dark metaphor in the development of long-term memory leading to Claude chaining suicides.
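To make the failure mode concrete, here’s a hypothetical sketch of what such a memory tool might look like; the class and method names are my own guesses, not the dev’s actual implementation:

```python
# Hypothetical long-term memory tool exposed to the model as tool calls.
# The names and behavior below are guesses for illustration only.
class NotePad:
    def __init__(self) -> None:
        self.entries: dict[int, str] = {}
        self._next_id = 0

    def add_note(self, text: str) -> int:
        """Tool call, e.g. add_note('Ladder in NE corner of B2F is a dead end')."""
        note_id = self._next_id
        self.entries[note_id] = text
        self._next_id += 1
        return note_id

    def delete_note(self, note_id: int) -> None:
        """Also exposed to the model -- and this is where things go wrong:
        Claude tends to delete its dead-end notes soon after writing them,
        so the information never survives long enough to help navigation."""
        self.entries.pop(note_id, None)

    def dump(self) -> str:
        """Injected into the prompt each turn as the model's 'memory'."""
        return "\n".join(self.entries.values())
```

If this guess is roughly right, the design choice that bites is giving the model unrestricted delete rights: the useful dead-end notes get pruned almost immediately, while bad “lessons” like the blackout strategy “working” are the ones that stick around.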
What does this mean for AGI predictions?
Watching this trainwreck has been one of the most lucid negative advertisements for LLMs I’ve seen. A lot of the perceptions about when AGI might arrive are based on the vibes people get by watching what AI can do. LLMs can seem genuinely godlike when they spin up a full stack web app in <15 seconds, but the vibes come crashing back down to Earth when people see Claude bumbling around in circles for days in a simplistic video game made for children.
The “strawberry” test was a frequent stumbling block for early LLMs, which often claimed the word only contained 2 R’s. The problem has been mostly fixed by now, but there are questions to be asked about how this was done. Was it resolved by LLMs genuinely becoming smarter, or did the people making LLMs cheat a bit by hardcoding special logic for these types of questions? If it’s the latter, then problems would tend to arise when the AI encounters the issue in a novel format, as Gary Marcus recently showed. But of course, the obvious follow-up question is “does this matter”? So what if LLMs can’t do the extremely specific task of counting letters if they can do almost everything else? It might be indicative of some greater issue… or it might not.
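For what it’s worth, the task itself is trivial for ordinary code, which is part of why it became such a popular gotcha. A quick sketch (the token split shown is just an illustrative guess; it varies by tokenizer):

```python
# The letter-counting task is trivial for ordinary code:
word = "strawberry"
print(word.count("r"))  # 3 -- early LLMs frequently insisted on 2

# One common explanation for the failure: the model never sees individual
# letters, only subword tokens (something like ["str", "aw", "berry"],
# depending on the tokenizer), so character-level questions fall outside
# its native view of the text.
```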
But it’s a lot harder to argue that game playing is an irrelevant metric. Pokemon Red is a pretty generous test for many reasons: There’s no punishment for long delays between actions. It’s a children’s game, so it’s not very hard. The creator is using a mod for coloring to make it easier to see (this is why Jigglypuff’s eyes look a bit screwy in the picture above). Yet despite all this, Claude still sucks. If it can’t even play a basic game, how could anyone expect LLMs to do regular office work, for, say, $20,000 a month? The long-term memory and planning just isn’t there yet, and that’s not exactly a trivial problem to solve.
It’s possible that Claude will beat Pokemon this year, probably through some combination of brute-force and overfitting knowledge to the game at hand. However, I find it fairly unlikely (<50% chance) that by the end of 2025 there will be an AI that can 1) play Pokemon at the level of a human child, i.e. beat the game, handle basic navigation, and not have tons of lag between trivial actions, and 2) be genuinely general (putting the G in AGI) and not just overfit to Pokemon, with evidence coming from being able to achieve similar results in similar games like Fire Emblem, Dragon Quest, early Final Fantasy titles, or whatever else.
LLMs are pretty good right now at a narrow slice of tasks, but they’re missing a big chunk of the human brain that would allow them to accomplish most tasks. Perhaps this can be remedied through additional “scaffolding”, and I expect “scaffolding” of various types to be a big part of what gives AI more mainstream appeal over the next few years (think stuff like Deep Research). Perhaps scaffolding alone is insufficient and we need a much bigger breakthrough to make AI reasonably agentic. In any case, there will probably be a generic game-playing AI at some point in the next decade… just don’t expect it to be done by the end of the year. This is the type of thing that will take some time to play out.
Agreed. Something like half of the federal budget goes towards elder care, which populists usually recoil at when you talk about touching. But then they turn around and talk about the "bloated budget" and how ostensibly easy it would be to cut it, like it's 99% subsidies to transsexuals or something.
This is the best reply in this thread I've read.
People like to wishcast world events as actually being about their pet causes. While I'd like to believe American reluctance is from Europe not taking the conflict seriously even after 3 years and being lapped in artillery shells sent by freaking North Korea, that's not actually the case.
The reality is that very few people care about foreign policy, while plenty of people care about culture and vibes and dunking on their outgroup. This means leaders get to effectively decide foreign policy, and the voters will mostly follow like sheep since they want to support their ingroup. I can practically guarantee that if Trump said we're now going big on booting Russia out of Ukraine by whatever means necessary, the Catturds of the right would flip (like they perennially do on Israel) and say jingoism is actually the best thing ever now -- "AMERICA IS BACK BABY". Really, the only thing you need to do to understand contemporary American politics is learn about negative partisanship. Learn about the frothing, searing hatred the two wings of the country have for each other, and everything else will follow naturally.
I also agree that America is a fundamentally untrustworthy ally. With the Legislative branch effectively defunct, the President has become more and more like an elected, absolute monarch. And you simply can't trust a country that's willing to elect a Mad King every so often.
In realpolitik terms, there was no realistic scenario where better relations with Russia would make much of a difference in a US-China conflict. Such a war would be dominated by sea + air power, which Russia is anemic in. Russia would be helpful in terms of sending raw materials to China, so having them embargo China during a conflict would indeed be useful for the US, but there was never a realistic chance for US-Russia relations to be good enough to where Russia would consider that rather than simply profiting and staying neutral while continuing to trade. Even if Russia joins China relatively explicitly, how much of a difference would that make? It might help China with marginal things like initial missile stockpiles and intelligence gathering. Those aren't nothing, but they'd be highly unlikely to turn the tables. And they'd be well worth the trouble if it meant the US had a stronger European contingent of allies to call on, even if they're mostly limited to just economic sanctions against China.
Certainly NATO was stronger before Trump's election in 2024 than it was in 2020. That's really not a very high bar since Trump was trashing NATO in his first term too. The fact you can't even begin to see how this could be possible is indicative that you're either using some weird scorecard in terms of "stronger", or something else similarly strange is going on. I don't think I've seen any serious piece of analysis claim NATO got weaker from Trump --> Biden.
Further, if you don't think negative partisanship is the absolute most critical factor driving basically every voter in the US for the past decade, you're quite wrong. This applies to both sides for what it's worth. There are a few principled ideologues out there, but the id of both sides' voterbase looks a lot closer to Catturd's twitter feed than it does to a coherent list of policy positions.
You're right that it seems we're probably too far apart to have a productive discussion.
An alliance with Russia would be basically impossible if they were gobbling up democratic European states, and even if the US ignored what they were doing I don't see why they wouldn't just become hostile to the US again once they reassembled the borders of the USSR. Putin's Russia is still highly ideologically opposed to the US just like the USSR was, but instead of Communism it's built on hostility towards democracy and the hallucination that the CIA has a 100% effective anti-Russian brainwashing technique in the form of "color revolutions".
Even just having Poland on the US's side is a great deal because they're a fantastic foil for tinpot dictators. It's not inaccurate to think that Ukrainians looked at how Poland was doing, and how Belarus was doing, and said "I think I'll take some of the former, thanks".
NATO was stronger because of the Ukraine war, but now it's weaker because Trump is trashing both the organization and US allies. Simple.
A larger NATO spreads the cost of defense over more countries. It also gives the US the diplomatic leverage to do stuff like enact the chips ban on China, for which critical machine tools were manufactured only in Europe.
Sending weapons to Ukraine has given the US some ability to rebuild its shattered defense-industrial base, trading out old stock leftover from the Cold War for more modern kit. The notion that the US has "emptied its armory" is egregiously wrong. The US apparently never had the political will to part with enough stuff for Ukraine to get a decisive advantage. The notion that the US doesn't have any tanks or planes or ships because they were all sent to Ukraine is just goofy.
On the money aspect, the US has sent about $110 billion to Ukraine over 3 years, although even that number is probably too high since much of the value "lost" was due for disposal anyways and is being replaced by more modern kit as I said above. Even taking the $110 billion number at face value, it's still tiny in comparison to America's other priorities. It's like a week's worth of spending on SS + Medicare, the two largest welfare programs for old people. The Afghan war wasted $2,300 billion on a war that was genuinely unwinnable (and that Trump was more than happy to can-kick on for the 4 years of his first term) since we were never going to be up for the ethnic cleansings required to bring long-term stability.
american conservatives don't "hate" Ukraine and NATO because of liberals, they want US wealth to be focused on the US
This makes me wonder if you genuinely interact with American conservatives. Maybe some small fraction are genuinely principled, hardcore isolationists, but I highly doubt that's the genuine plurality position. As always, Catturd serves as a good barometer of the modern US conservative movement. He uses the monetary cost as an argument, sure, but he goes much further in seeming to genuinely hate Zelensky. There's also this weird quirk where the monetary cost only matters in relation to Ukraine, but it mattered a lot less when it came to getting out of Afghanistan early, or for aid to Israel, etc.
To the common layperson, LLMs haven't really advanced that much since 2022 or 2023. Sure, each new model might have fancy graphs that show it's better than ever before, but it always feels disappointingly iterative when normal people get their hands on it. The few big leaps have come from infrastructure surrounding the models that lets us use them in novel ways, e.g. Deep Research is pretty good from what I've heard. DR isn't revolutionary or anything, it just takes what we already had, gives it more processor cycles, and has it produce something with lots of citations, which is genuinely useful for some things. I expect further developments will be like that. It's like how electricity was sort of a flop in industry until we figured out things like the assembly line.
"AGI" is basically a meme at this point. Nobody can agree on a definition, so we might have had it back in 2022... or we might never have it, based on whatever definition you use. It's a silly point of reference.
This war was great for NATO no matter what. Whether Trump destroys NATO himself is a different matter that's more related to domestic negative partisanship. The war has:
- Added Finland and Sweden to the alliance.
- Shown the world Russia's true colors, that it was always interested in dominating Eastern Europe.
- Driven a likely permanent cultural wedge between Russia and Ukraine, regardless of the ultimate outcome.
- Given the West a chance to rebuild its shattered defense-industrial base for likely future conflicts.
- Given NATO the opportunity to rally around the US (at least while Biden was president), and to direct more ire towards China.
etc.
Did you read Kulak's post? His general idea is that allowing for discussion just legitimizes evil people who think things like that it's OK for people to rape white girls.
Kulak is a particularly blatant example but plenty of people here are working off the same template.
NATO's decline is almost entirely unrelated to Ukraine, and if anything Ukraine helped to rally + expand NATO. It got Sweden and Finland to join, remember?
NATO's decline, or really America's waning interest, is mostly caused by a combination of China's rise and negative partisanship where modern US conservatives hate Ukraine mostly just because US liberals like it.
Adjust those spending figures to PPP and they become quite a bit closer. Europe is spending lots of nominal dollars (or Euros), but those dollars don't go nearly as far in Europe as they would in Russia.
The problem is that European spending is being allocated wastefully and that European strategy is muddled. Raising defence spending won't fix anything, what's needed is a plan to achieve specific capabilities and integrate them into a broader political strategy.
This is certainly another piece of the puzzle. If we could wave a magic wand and make Europe a single country like the USA, then a lot of these issues would be fixed. Reducing duplication and having a clear strategy would be great force-multipliers, but in absence of someone having that magic wand, increasing spending is a much more plausible solution in the short and medium term.
There were attempts to build better relations between Russia and the West during the 90s, then briefly during Obama's first term. They never came to much because Russia never gave up the dream of dominating Eastern Europe.
Refusing to grant NATO entry to countries east of Berlin would have just made them easy targets when Russia regained its strength. The Baltics would almost certainly have been either invaded, or pressured into becoming de facto Russian client states by this point.
It's a good primer on why Russia is so obsessed with pushing as far west as possible, and therefore why friendly relations are unlikely with any nations holding power in Eastern Europe.
There was little chance rapprochement would have ever worked. Russia has always really, really wanted to dominate Eastern Europe.
This assumes Russia wouldn't have just invaded those countries anyways, which was almost guaranteed to happen. Russia right now is like Germany after WW1: a revanchist power that's seething in resentment. It hasn't had its face smashed against the concrete like WW2 Germany or Japan did in a way that would convince the populace that war wasn't the answer. The only options were to actually do the smashing, which would be very problematic given its nuclear stockpiles, or to contain it. For the containment strategy, abandoning Eastern Europe would have just drawn the line in a less advantageous position.
NATO expansion to the east was a great move in hindsight.
Russia was always going to be hostile to any nation that tried to project power east of Berlin, so the only options were to either kick Russia while it was down or stand by and let it reassemble the borders of the USSR, then fight it on much more equal terms.
Why is everyone so obsessed with military spending, especially as a % of GDP?
This is a joke, right?
Dollars spent isn't the only determinant of a nation's fighting power, but it's the ground-truth for a lot of important factors. How do you think the Allies won WW2? It was by having more tanks + planes + ships (and also oil).
The problem with Europe's defense against Russia is that the countries don't really want to raise defense spending at all, which limits their political appetite for defending their neighbors. Russia wouldn't need to invade the entirety of Europe all at once, they'd just salami-slice e.g. the Baltics and hope other European countries don't get their act together to oppose them. Each European country basically treats all the other countries to their east as buffer states.
Nah, they have the fourth largest economy in PPP terms, which is the measure that really matters for most international comparisons.
America sabotaging its alliances is a horrendous own-goal. China has a larger economy than the US in PPP terms (the only terms that matter), and its manufacturing base is like 4-5x ahead. The only way the US can compete at this point is through alliances. It's how we managed to get the chip embargo together, which IIRC was mostly enabled through a Dutch company that manufactures some important machine tools.
Most people in the US hate Putin. He has like a 10% approval rating here. Some on the DR like him, but they're goofballs that fall for the "based and trad Russia" meme/psyop. Most of the Ukraine-bashing is done out of reflexive negative partisanship, i.e. people on the left like Ukraine, and it would be great to see the left suffer, so let's hate Ukraine by proxy. Trump dislikes Ukraine partially because they didn't carry water for him in 2020, and partially because he's desperate for some sort of "deal", and since the Russians aren't budging he knows he has more leverage over Ukraine, so he's going after them instead.
Yes, they're fully in the tank for conflict theory. Look at a post like this and try to disagree.
Indeed, it's quite disappointing what this place has become. Good posters like TracingWoodgrains have been banned or moved on. Shitposters from CultureWarRoundup have moved back in, telling us constantly how we have to hate the outgroup with every fiber of our being, and any notion that we should try understanding them is akin to betrayal. The mods are apparently asleep at the wheel. Zorba, the original creator of the site, hasn't posted in 3 months, and hasn't really participated that much in nearly a year.
Musk is a pretty good example here. He claimed to be a "free speech absolutist", but then he started censoring a bunch of things he didn't like once he took over Twitter.
Looking for old articles is pretty hard on blogs, but I found several from Noah going back to 2021, e.g. him arguing against woke, arguing against motivated leftist science, and this one arguing against decolonization narratives.
The failure-state most people expected from DOGE was that it would mostly be limited to symbolic cuts and focus on PR "wins". If it ever tried to go after important things like the massive elder care apparatus (which comprises something like 2/3rds of federal spending) then people would scream and Trump wouldn't back Musk. I had thought the egos of Trump and Musk destined them to a cataclysmic falling-out at some point, but Hanania has persuaded me that's definitely not guaranteed, and we could end up with Musk having the resiliency of Jared Kushner from Trump's first term.
Still, it's looking more and more likely that Musk will have fairly limited power to actually do anything beyond symbolic trivialities, and once that becomes clear I think there's a good chance he'll move on of his own accord.
Probably for the best, given that some of the stuff he did cut like cancer research probably should have been kept.