This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
China has a new AI system taking the world by storm: Manus. It's an autonomous AI agent that, according to Forbes, changes everything.
I've seen a LOT of hype so far about AI agents, but the claims about this one actually seem pretty impressive, if true. Forbes says:
Manus uses the by now common "stack" of AI models, where there's a master-slave relationship between a head model that looks at the problem and sub-models that are more specialized and go do specific tasks. I can't quite tell from a quick search what key breakthrough Manus made, as my understanding is that other agent AIs already have a similar setup. There is talk about async cloud work, but again, I didn't think that was an entirely new thing.
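For readers unfamiliar with the pattern, the head-model/sub-model stack described above can be sketched in a few lines. This is a purely illustrative toy, not Manus internals: the worker names, routing logic, and return strings are all hypothetical stand-ins for LLM calls.

```python
# Toy sketch of an "orchestrator + specialized sub-models" agent stack.
# Each worker function stands in for a specialized sub-model; the
# orchestrator stands in for the head model that decomposes the goal
# and routes subtasks. All names and logic here are hypothetical.

def research_worker(task: str) -> str:
    # Stand-in for a sub-model specialized in web research.
    return f"[research] findings for: {task}"

def code_worker(task: str) -> str:
    # Stand-in for a sub-model specialized in writing code.
    return f"[code] script for: {task}"

def orchestrator(goal: str) -> list[str]:
    # The head model breaks the goal into subtasks, routes each one
    # to the most suitable specialist, and collects the results.
    subtasks = [
        ("research", f"gather background on {goal}"),
        ("code", f"automate the report for {goal}"),
    ]
    workers = {"research": research_worker, "code": code_worker}
    return [workers[kind](task) for kind, task in subtasks]

results = orchestrator("EV market trends")
```

In a real agent system each worker would be an LLM call with its own prompt and tools, and the orchestrator's decomposition would itself be produced by a model rather than hard-coded.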
Either way, similar to the DeepSeek R1 reveal, there are a lot of breathless articles coming out about China "taking the lead" in the AI race. I agree that this is a concerning development for the U.S., given that we now have two Chinese labs that have seemingly joined the leading edge out of nowhere. Of course, it remains to be seen whether this press blitz actually reflects seriously impressive new ground, or is just a good hype campaign.
Anyone who has used this or looked more into the details - what are your thoughts about Manus so far?
Manus is a generic thin wrapper over a strong Western model (Sonnet 3.7), if a bit better executed than most, and I am quite unhappy about this squandering of DeepSeek's cultural victory. The developers are not deeply technical and have instead invested a lot into hype, with invites to influencers and creating a secondary invite market, cherrypicked demos aimed at low value add SEO-style content creation (eg “write a course on how to grow your audience on X”) and pretty UX. Its performance on GAIA is already almost replicated by this opensource repo. This is the China we know and don't very much love, the non-DeepSeek baseline: tacky self-promotion, jumping on trends, rent-seeking, mystification. In my tests it hallucinates a lot – even in tasks where naked Sonnet can catch those same hallucinations.
The real analogy to DeepSeek is that, just as R1 was the first time laymen accustomed to 4o-mini or 3.5-turbo level slop got a glimpse of a SoTA reasoner in a free app, this is the first time laymen have been exposed to a strong-ish agent system, integrating all the features that make sense at this stage – web browsing, pdf parsing, code sandbox, multi-document editing… but ultimately it's just a wrapper bringing out some lower bound of the underlying LLM's latent capability. Accordingly it has no moat, and does not benefit China particularly.
Ah well. R2 will wash all that away.
It is interesting, however, that people seemed to have no clue just how good and useful LLMs already are, probably due to lack of imagination. They are not really chatbot machines; they can execute sophisticated operations on any token sequences, if you just give them the chance to do so.
100% agree. I think even most commenters here seem fairly oblivious to all you can get out of LLMs. A lot of people try some use case X, it doesn't work, and they conclude that LLMs can't do X, when in fact it's a skill issue. There is a surprisingly steep learning curve with LLMs and unless you're putting in at least a couple of hours a week tinkering then you're going to miss their full capabilities.
Can you expand on this? Which hidden capabilities are you referring to? I've been a daily user of LLMs since early ChatGPT came out, and I'm not so sure what you mean here.
Specific examples may sound underwhelming, because I'm mainly talking about metis in the sense of James C Scott, i.e. habitual patterns of understanding and behaviour that are acquired through repeated experience. For example, I almost never run into hallucinations these days, but that's because I've internalised what kinds of queries are most likely to generate hallucinated answers, so I don't ask them in the first place (just like we all know you don't do a Google search for "when is my Aunt Linda's birthday"). But I realise that sounds like a cop-out, so here are some examples -
To give a real-world example of the latter two ideas in action, I recently had a very complicated situation at work involving a dozen different colleagues, lots of institutional rules and politics, and a long history. My goal output was a 4 page strategy document spelling out how we were going to handle the situation. However, typing out the full background would be a big hassle. So instead I had a 60 minute voice conversation with ChatGPT while on a long walk, in which I sketched the situation and told it to keep asking follow up questions until it really had a good handle on the history and key players and relevant dynamics. So we did that, and then I asked it to produce the strategy document. However, I didn't completely love the style, and I thought Claude might do a better job. So instead, I asked ChatGPT to produce a 10 page detailed summary of our entire conversation. I then copypasted that into Claude and told it to turn it into a strategy doc. It did a perfect job.
So, a relatively simple example, but illustrates how voice mode and model switching can work well.
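The voice-interview → summary → model-switch workflow above can be expressed as a simple pipeline. The stub functions below are pure placeholders for the two models (nothing here is a real ChatGPT or Claude API call); the point is just the shape of the handoff: one model accumulates and condenses context, a second model drafts from the condensed brief.

```python
# Toy sketch of the "interview with model A, summarize, hand off to
# model B" workflow. All functions are hypothetical stand-ins for
# LLM calls, not real API usage.

def interview(answers: list[str]) -> str:
    # Model A builds up context via follow-up questions; here we just
    # concatenate the user's answers into one context string.
    return " ".join(answers)

def summarize(context: str) -> str:
    # Model A condenses the whole conversation into a portable brief
    # that can be pasted into a different model.
    return f"SUMMARY: {context}"

def draft_strategy(summary: str) -> str:
    # Model B (the preferred stylist) turns the brief into the final
    # strategy document.
    return f"STRATEGY DOC based on ({summary})"

answers = ["dozen colleagues involved", "long history of rule disputes"]
doc = draft_strategy(summarize(interview(answers)))
```

The design point is that the intermediate summary, not the raw conversation, is what crosses the model boundary, which keeps the handoff cheap and lets you pick whichever model writes best for the final step.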
“Manos…the hands of fate!”
I was rather suspicious about Manus when I saw a bunch of Twitter accounts who were this close to being unfollowed or muted on account of being breathless hypemen lauding it as revolutionary. A relatively unknown company? No previous releases? Little to no information about the system?
Then it turned out that Manus is a thin wrapper over Claude 3.7 Sonnet.
Anything good about it is almost entirely down to Sonnet. Which is a great model!
Well, at least it isn't a ChatGPT wrapper. This demonstrates slightly better taste, albeit execution that's a joke.
Claude 3.7 reasoning has been very disappointing for me. It tends to be a paragraph that boils down to
{{user}} has requested a reply, I should give him one
Huh. I haven't used it, because Anthropic doesn't remotely have the capacity to cater to what paying users expect. Does it do this for tasks that actually require reasoning?
I'm quite happy with every other reasoning model I've tried.
I’m not doing maths or anything. I use it for schedule management, task list prioritisation etc. I would expect them to benefit from reasoning - indeed, that’s exactly what I want the reasoning for. Being able to do maths is nice but nothing to do with the majority of my use cases.
The contrast with, say, R1, which will produce a full set of reasoning for whatever you ask (and then go off in the wrong direction), is notable. Perhaps the truncated reasoning is meant to produce more tractable behaviour; fair enough if so, but not an upgrade from 3.5.
Speaking of taste, lately 4o has very much been passing the vibe check, and 3.7 Sonnet very much hasn't been.
I'm now using Claude almost exclusively as a workhorse and ChatGPT as more of a conversation partner, when it used to be the other way around. 4.5 is even better.
Over the past year, especially towards the tail end, I've found myself becoming increasingly agnostic or apathetic between most leading models.
Grok 3? DeepSeek R1? Gemini 2.0 Pro? GPT-4o? o3 (mini)? Claude 3.5/3.7?
Almost any of them meets my needs. I occasionally want a reasoning model, or perhaps the Deep Research option, but in all honesty I just end up using whatever is handy. Grok 3 is probably the smartest model I can use for free, including reasoning and deep research. Beyond that, I wouldn't particularly care these days and just use whatever is easy. Which is often ChatGPT, or Gemini 2.0 Pro.
I've been pissed off enough by Anthropic's ridiculously low usage limits (and the negative experience of paid users to boot), that I don't usually bother.
You are correct, at least in my opinion, about GPT-4o recently getting a far more pleasant personality, to the point that Claude 3.7 doesn't strike me as glaringly superior. I don't really need to use them to write code unless I'm doing it for the hell of it, so your mileage might vary.
(This is a great situation! LLMs are being commoditized, and tokens are being given away practically for free. I'm being showered with a level of competency that would once have had you paying hundreds of dollars in API fees a year or more ago.)
Very similar feeling for me.
I stick with ChatGPT for any research-type tasks, I use Grok for image generation and current events, and Claude covers anything else if I happen to be in the mood.
But I'm almost entirely agnostic as to which model is in front of me at any given moment, I feel like I know what I can get from them and when I should be careful about double-checking their outputs. They all hallucinate about the same amount, I'd say.
Which is crazy to think. Part of the promise of AI was that any company that obtained a significant lead in the space could in theory run away with the contest by leveraging their AI for higher efficiency.
Instead we've got all the major tech companies and even some second-tier players putting out approximately peer-level models and releasing improvements at approximately the same pace.
I am still prepared for one company to achieve a game-stealing breakthrough all at once, mind. But I am not going to pretend to guess which one. If you asked me ~2 years ago I'd have bet on OpenAI all the way.
Yeah, I think that's accurate, and why I find the default personality to be more important now than ever. I can have any of them do the thing, but which one is going to format it nicely, explain at the verbosity level I prefer, match my informal tone without being cringe, etc.?
This is one place where ChatGPT makes things very easy with their memory feature. Claude has an equivalent, at least a style guide you can use, but it's more finicky in my limited experience.
If Grok and Gemini have an equivalent, I haven't found it, though I'm quite happy with their default responses.
Yeah that's fair. I'm also suspicious because there are strong incentives for US AI firms to make China seem like they're ahead.