This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
I don't understand people who can see the current state of AI and the trendline and not at least see where things are headed unless we break trend. You guys know that a few years ago our state of the art could hardly complete coherent paragraphs, right? Chain-of-thought models are literally a couple of months old. How could you possibly be this confident we've hit a stumbling block because one developer's somewhat janky implementation has hiccups? And one of the criticisms is speed, which is something you can just throw more compute at and scale linearly.
I'm suspicious of these kinds of extrapolation arguments. Advances aren't magic; people have to find and implement them. Sometimes you just hit a wall. So far most of what we've been doing is milking transformers. That's a great discovery, but I think this playthrough is strong evidence that transformers alone are not enough to make a real general intelligence.
One of the reasons hype is so strong is that these models are optimized to produce plausible, intelligent-sounding bullshit. (That's not to say they aren't useful. Often the best way to bullshit intelligence is to say true things.) If you're used to seeing LLMs perform at small, one-shot tasks and riddles, you might overestimate their intelligence.
You have to interact with a model on a long-form task to see its limitations. Right now, the people doing that are largely programmers and /g/ gooners, and those are the people most likely to have realistic appraisals of where we are. But this Pokemon thing is an entertaining way to show the layman how dumb these models can be. It's even better for that purpose because LLMs tend to stealthily "absorb" intelligence from humans, getting gently steered by hints people leave in their prompts. This game forces the model to rely on its own output, leading to hilarious ideas like the blackout strategy.
To put the obvious counterpoint out there, Claude was never actually designed to play video games, and it has gotten decent at doing so in a couple of months. The drawbacks are still there: navigation sucks, it's kinda slow, it likes to suicide, etc., but even then, this is a system that was never designed to play games at all.
To me, this is a success, as it demonstrates using information in its memory to make an informed decision about outcomes. It can meet a monster, read its name, know its stats, and think about whether its own stats are good enough to take it on. This is applied knowledge, and applied knowledge is one of the hallmarks of general understanding. If I can only apply a procedure when told to do so, I don't understand it. If I can use that procedure in the context of solving a problem, I do understand it. Claude at minimum understands the meaning of the stats it sees: level, HP, stamina, strength, etc. It understands that the ratio between the monster's stats and its own is important, and that if the monster has better stats than the player, the player will lose. That's thinking strategically based on the information at hand.
Claude didn't "get decent at playing" games in a couple of months. A human wrote a scaffold to let a very expensive text prediction model, along with a vision model, attempt to play a video game. A human constructed a memory system and knowledge transfer system, and wired up ways for the model to influence the emulator, read relevant RAM states, wedge all that stuff into its prompt, etc. So far this is mostly a construct of human engineering, which still collapses the moment it gets left to its own devices.
When you say it's "understanding" and "thinking strategically", what you really mean is that it's generating plausible-looking text that, in the small, resembles human reasoning. That's what these models are designed to do. But if you hide the text window and judge it purely by how it behaves, how intelligent does it look, really? That's what makes it so funny: the model slowly blunders around in dumb loops while producing volumes of eloquent, optimistic narrative about its plans and how much progress it's making.
I'm not saying there isn't something there, but we live in a world where it's claimed that programmers will be obsolete in 2 years, people are fretting about superintelligent AI killing us all, OpenAI is planning to rent "PhD-level" AI agent "employees" to companies for large sums, etc. Maybe this is a sign that we should back up a bit.
This is something I don't understand. The LLM generates text that goes in the 'thinking' box, which purports to explain its 'thought' process. Why does anybody take that as actually granting insight into anything? Isn't that just the LLM doing what it does all the time by default, i.e., making up text to fill a prompt? Surely it's just as much meaningless gobbledygook as any other text an LLM produces? I would expect that box to faithfully explain what's actually going on inside the model about as well as an LLM can faithfully describe the outside world, i.e., not at all.
What I mean by thinking strategically is exactly what makes this interesting. It's not just creating plausible text; it understands how the game works. It understands that losing HP means losing a life, and thus whether the enemy's HP and STR are too high for it to handle at its current level. In other words, it can contextualize that information and use it not only to understand, but to work toward a goal.
I'm not saying this is the highest standard. It's about what a 3-4 year old can understand about a game of that complexity. But as a proof of concept, I think it shows that AI can reason a bit. Give this thing 10 years and a decent research budget, and I think it could probably take on something like Morrowind. It's slow, but given what it can do now, I'm pretty optimistic that an AI will be making data-driven decisions in a fairly short timeframe.
What makes things interesting is that the line between "creating plausible texts" and "understanding" is so fuzzy. For example, a sentence describing a Pokemon with 125 HP getting hit will be much more plausible if the continuation gives a number smaller than 125; "138" would be unlikely to be found in its training set. So in that sense, yes, it understands that attacks cause it to lose HP, that a Pokemon losing all its HP causes it to faint, etc. However, "work towards a goal" is where this seems to break down. These bits of disconnected knowledge have difficulty coming together into coherent behavior or goal-chasing. Instead you get something distinctly alien, which I've heard called "token pachinko": a model sampling from a distribution that encodes intelligence, but without an underlying mind and agency behind it. I honestly don't know if I'd call it reasoning or not.
It is very interesting, and I suspect that with no constraints on model size or data, you could get indistinguishable-from-intelligent behavior out of these models. But in practice, this is probably going to be seen as horrendously and impractically inefficient, once we figure out how actual reasoning works. Personally, I doubt ten years with this approach is going to get to AGI, and in fact, it looks like these models have been hitting a wall for a while now.
And the silicon to do so has to be there: the silicon that, notably, hasn't been able to increase exponentially for two decades now without a corresponding exponential increase in investment.
Almost all improvements since late 2022 have been marginal. GPT-4.5 was a disappointment, outside of perhaps having a better personality. There's no robust solution to hallucinations, or to any of the other myriad problems LLMs have. I imagine there will be some decent products in the same vein as Deep Research, but AGI is highly unlikely in the short term.
I made my prediction 2 years ago and am increasingly sure it will come true.
Would you care to make a prediction as to when (or if!) LLMs will be able to reliably clear games on the level of Pokemon (i.e., 2D & turn-based)? If a new model could do it, would you believe that represents an improvement that isn't just "marginal"? Assume it has to generalize beyond specific games, so it should be able to clear early Final Fantasy titles (and similar) too, to cover the case where "Pokemon ability" becomes a benchmark and gets Goodharted.
Personally I don't think it's necessarily impossible for current models with the proper tooling and prompting, but it might be prohibitively difficult; the problem is a little underspecified. For it to be "fair" I'd want to limit the LLMs to screenshot input and to tools/extensions they write themselves (for navigation, parsing data, long-term planning, memory, etc.). I don't think 2 years is too optimistic, though.
(I asked 3.7 Sonnet, which suggested "3-5 years"; o3-mini said "a decade or more"; DeepSeek said "5-7 years"; and Grok "5-10 years".)
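To make the proposed constraints concrete, here is a hypothetical sketch of what a "fair" screenshot-only harness might look like. None of these function names come from any real benchmark; capture_screenshot, call_vision_model, and the tool-writing hook are illustrative stubs under the assumptions above (pixels in, self-written tools only).

```python
# Hypothetical sketch of the stricter setup proposed above: the model only
# ever sees a screenshot, and any navigation/memory helpers must be code it
# wrote for itself. All names are illustrative stubs, not a real harness.

import base64


def capture_screenshot() -> bytes:
    # Stand-in for grabbing the emulator's framebuffer as a PNG.
    return b"\x89PNG..."


def call_vision_model(image_png: bytes, scratchpad: str) -> dict:
    # Stand-in for a multimodal LLM call. A real harness would send the image
    # plus the scratchpad and parse a structured reply; here we fake one.
    _ = base64.b64encode(image_png)
    return {"button": "UP", "scratchpad": scratchpad + "Moved up.\n", "tool_code": None}


def press_button(button: str) -> None:
    # Stand-in for sending input to the emulator.
    print(f"pressing {button}")


def run_episode(max_turns: int = 3) -> None:
    scratchpad = ""  # the only persistent state the model gets for free
    for _ in range(max_turns):
        reply = call_vision_model(capture_screenshot(), scratchpad)
        if reply["tool_code"]:
            # Under these rules, mapping/pathfinding helpers must be code the
            # model wrote itself; the harness just executes it (sandboxed!).
            exec(reply["tool_code"], {})
        press_button(reply["button"])
        scratchpad = reply["scratchpad"]


if __name__ == "__main__":
    run_episode()
```

The contrast with the RAM-reading scaffold earlier in the thread is the point: here the model gets pixels and its own notes, nothing more.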