
Culture War Roundup for the week of February 17, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Grok 3 just came out, and early reports say it’s shattering rankings.

Now there is always hype around these sorts of releases, but my understanding of the architecture of the compute cluster for Grok 3 makes me think there may be something to these claims. One of the exciting and interesting revelations is that it tends to perform extremely well across a broad range of applications, seemingly showing that if we just throw more compute at an LLM, it will tend to get better in a general way. Not sure what this means for more specifically trained models.

One of the most exciting things to me is that Grok 3's voice mode allegedly understands tone, pacing, and intention in conversations. I loved OpenAI's voice assistant until it cut me off every time I paused for more than a second. If Grok 3 is truly the first conversational AI, it could be a game changer.

I'm also curious how it compares to DeepSeek, if anyone here knows more than I do.

Grok 3 is quite certainly the best among available Instruct models. Grok-Mini is on par with o3-mini, PR nonsense about low-medium-high differentiation aside (DeepSeek docs suggest it's just something like an inference parameter for the length of reasoning, so we should not assume that o3-mini-high is a superior-quality model). Grok's RL training has only just begun; it is probable that they've reassembled their pipeline after the DeepSeek-R1 paper came out.

Every lab will accelerate significantly in the near future, so I wouldn't overindex on this leap. Elon just wanted to claim priority at least for a while, much like DeepSeek's Wenfeng, so both rushed to ship a next-generation project faster than the competition, but the fundamentals still favor Google, Meta (despite its horrible, self-defeating culture) and OpenAI given the Stargate program. Grok 3 is just the first in a large crop of models trained on 100K+ class GPU fleets. Orion, Llama 4, and Claude-Next are coming.

Even so, we have definitely gained another frontier actor.

As someone who works in the industry, I remain skeptical. While Grok is no doubt the best general-purpose/hobby-use LLM available to the public, Grok 3 does not appear to be anything more than an incremental improvement over prior versions.

Contra the credulous VC types posting breathless headlines to hackernews and X, I put little stock in publicly available benchmarks, as it is relatively trivial to "teach to the test". I can't say how xAI does their internal testing, but if they are remotely rigorous/competent, they will have a set of "presentations" that have been segregated from the training data specifically for the purposes of evaluation/benchmarking, and these need to remain proprietary/secret specifically to prevent them from accidentally (or not so accidentally) making their way into the training corpus.

The fact is that we are well into the realm of diminishing returns when it comes to throwing more flops at LLMs, and that true "machine intelligence" is going to require a substantially different architecture and approach, which is part of the reason computer scientists tend to be pedants about "machine learning" vs "intelligence" or "autonomy" while journalists and marketing guys continue to throw the "AI" label around willy-nilly.

Was there any word on when they plan to open API access? Cursory googling/lurking says there is none at the moment, and I'm not trusting any benchmarks until I can try it for myself.

I don't see o3 on there, just o3-mini. I thought o3 would still be in first place? Or is this about commercially available models only?

I think that if we're judging by generally unreleased, highly expensive tech demos then virtually every company releasing frontier models could pony up more impressive analogues.

The only benchmark I care about when it comes to LLMs is LMArena; nothing else really matters to me if people cannot judge with their own eyes under some level of blinding. There's just too much fitting to the training data going on, or use of cash to fudge benchmarks with economically unviable models.

What's very interesting to me about this whole thing is how there's still plenty of space for new contenders to pop up and beat established players at their own game.

I thought Grok was just purely a derivative of existing products with some of the safety measures stripped off. And now they've done made an updated version that crushes all the cutting edge products in, feels like, about a year?

It sure seems like OpenAI has no meaningful "moat" (hate that term, honestly) that keeps them in the lead DESPITE being the first mover, having the highest concentration of talent, and more money than God.

Doesn't mean they won't win in the end, or that any of these other companies are in an inherently better position, but it is becoming less clear to me what the actual 'secret sauce' to turning out better models is.

Data quality? The quality of the engineers on staff? The amount of compute on tap?

What is it that gives any given AI company a real edge over the others at this point?

Compute and regulatory capture. Whoever has the most of those will win. That makes Google, OpenAI, and xAI the pool of potential winners.

It's possible and even likely that there's some algorithmic or hardware innovation out there that would be many orders of magnitude better than what exists today, enough so that someone could leapfrog all of them. But I think it's increasingly unlikely anyone will discover it before AI itself does.

I'm also assuming that it will be very hard to pull the best talent away from their existing companies.

If they're true believers in the ultimate power of AI, then you probably can't offer them ANY amount of monetary compensation to get them to jump ship, since they know that being on the team that wins will pay off far more.

Grok 3 doesn't beat the yet unreleased o3 model that OAI is soon to launch.

It's only SOTA among models you can pay to use straight away, and even then just an incremental improvement over prior models. It is somewhat impressive that xAI was able to pull that off from a cold start, but it's not earth-shattering news.

OAI still has a slim lead, but in terms of technical chops, is well contested by Anthropic and DeepMind. DM has access to ~infinite money courtesy of Google, and while they've opted for releasing a reasoning model based off a slightly inferior Gemini Flash 2.0 instead of the Pro, it's still highly competent. I expect they're cooking up a bigger one, and don't feel overly pressured to release just yet.

It remains to be seen how Meta will react, but Llama 4 is certainly in the oven, and they'd be idiots not to pivot to a reasoning variant now that DeepSeek has stolen the open source crown.

I thought Grok was just purely a derivative of existing products with some of the safety measures stripped off. And now they've done made an updated version that crushes all the cutting edge products in, feels like, about a year?

Apparently Elon and company made some important advances in how to string together an unprecedented number of GPUs into one cluster, meaning they were able to throw more compute at the problem.

The advance was in building the cluster, not in improving algos.

So it makes sense that they were able to take established methods and get a better result. P(doom) just increased, since this provides evidence that scaling still works.

It sure seems like OpenAI has no meaningful "moat" (hate that term, honestly) that keeps them in the lead DESPITE being the first mover, having the highest concentration of talent, and more money than God.

Agreed. OpenAI is just one of several foundation-model providers with no real differentiation. What they do have is brand recognition in the consumer space, but I'm not sure how valuable that is. Sure, meemaw might use them for her banana bread recipes, but big corps will use the service that provides the best reasoning at the lowest price. Right now, it looks like DeepSeek and xAI are ahead.

Apparently Elon and company made some important advances in how to string together an unprecedented number of GPUs into one cluster, and so they were able to throw more compute at the problem.

I'd be interested to hear more about that; it's not something I've seen claimed before. I expect that when you're already running clusters of thousands of GPUs, going to tens or hundreds of thousands doesn't require too much effort.

It genuinely is impressive how xAI managed to extremely rapidly acquire and set up 100k H100s, and then add another 100k in 92 days. That's probably the one place where Elon's supervision paid dividends: he looked at the power issues standing in the way and cut the Gordian knot by having portable diesel generators brought in to power the clusters until more permanent solutions were available.

Going from non-existent as competition to releasing a SOTA model in less than a year is nigh-miraculous, even if it's a very temporary lead. I found Grok 2 to be a rather underwhelming model, so I'm pleasantly surprised that 3 did better, even if it's not a major leap ahead.

Right now, it looks like DeepSeek and xAI are ahead.

I wouldn't say DeepSeek is ahead. Their model didn't beat o1 pro, nor the upcoming o3. They did absolutely light a fire under the existing incumbents' asses by releasing a nearly-as-good model for the low, low price of free.

It's coordinating them that is the issue. You don't give them a big queue of work and let them churn through it independently for a month. Each step of training has to happen at the same time, with all GPUs in the cluster dependent on the work of others. It's much more like one giant computer than it is like a bunch of computers working together.

In those conditions you have lots of things that get more and more painful as you scale up. I specialize in storage. Where for most applications we might optimize for tail latencies, like ensuring the 99.9th percentile of requests complete within a certain target, for AI we optimize for max(). If one request out of a million is slow it slows down literally everything else in the cluster. It's not just the one GPU waiting on data that ends up idling, the other 99,999 will too.
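
To put a toy number on that: with synchronous steps, the step time is the maximum over every worker's wait, so a tail event that never shows up on a p99.9 chart still stalls most steps once the worker count is large enough. A quick sketch with made-up latencies (not measurements from any real cluster):

```python
import random

random.seed(0)

# Made-up numbers: a storage read usually takes ~10 ms, but one read in a
# hundred thousand stalls for two seconds.
def read_latency_ms():
    return 2000.0 if random.random() < 1e-5 else random.gauss(10.0, 1.0)

# A synchronous step can't finish until the slowest worker has its data,
# so step time is max() over workers, not an average or a p99.9.
def step_time_ms(num_workers):
    return max(read_latency_ms() for _ in range(num_workers))

for n in (100, 10_000, 100_000):
    steps = sorted(step_time_ms(n) for _ in range(10))
    print(f"{n:>7} workers: median step {steps[len(steps) // 2]:7.1f} ms, "
          f"worst step {steps[-1]:7.1f} ms")
```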

You also have the problem that if one computer breaks during the training run you need to go recompute all the work that it did on the step that it broke on. Folks are coming up with tricks to get around this, but it introduces a weird tradeoff of reliability and model quality.
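
The baseline mitigation is periodic checkpointing, and the knob is how often you write one: checkpoint rarely and a failure throws away a lot of work, checkpoint constantly and you spend the run writing state out. A toy cost model of that tradeoff, with invented numbers (the tricks I mentioned are attempts to do better than this):

```python
import random

random.seed(1)

STEP_SEC = 5.0      # assumed time per training step
CKPT_SEC = 120.0    # assumed time to write one full checkpoint
P_FAIL = 1e-3       # assumed chance that some machine dies during a step

def wall_clock_hours(total_steps, ckpt_every):
    wall, step, last_ckpt = 0.0, 0, 0
    while step < total_steps:
        wall += STEP_SEC
        if random.random() < P_FAIL:
            step = last_ckpt            # lose all work since the checkpoint
            continue
        step += 1
        if step % ckpt_every == 0:
            wall += CKPT_SEC
            last_ckpt = step
    return wall / 3600

for k in (10, 100, 1000):
    print(f"checkpoint every {k:>4} steps: {wall_clock_hours(20_000, k):.1f} h")
```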

And of course constructing and maintaining the network that lets them all share the same storage layer at ultra high bandwidth and ultra low latencies is nontrivial.

Also, at least in my corner of AI, folks are skeptical that xAI actually operated their 100k GPUs as a single cluster. They probably had them split up. Five 20K-GPU clusters are a different beast than one 100K-GPU cluster.

Each step of training = each mini batch?

I've wondered why everything has to be synchronized. You could imagine each GPU computes its gradients, updates them locally, and then dispatches them to other GPUs asynchronously. Which presumably doesn't actually work, since synchronization is such a problem, but that's surprising to me. Why does it matter if each local model is working on slightly stale parameters? I'd expect there to be a cost, but one less than those you describe.

I have pretty rudimentary knowledge of the training itself outside of storage needs, but my understanding is that each node in the neural net is linked to all the other nodes in the next stage of processing. So when you’re training you need to adjust the weights on all the nodes in the net, run the data through the adjusted weights, see if it’s a better result, rinse and repeat.

There isn't a local model; the model is distributed across the cluster.

Right now, it looks like DeepSeek and xAI are ahead.

The meltdown on hackernews about Elon delivering is quite something. EDS seems to make even TDS pale in certain circles.

Yeah I think this is why OpenAI is cozying up so much to the defense establishment, and pushing for regulation. They know that regulating competitors out of existence is their best bet.

Hopefully that doesn't happen but... we'll see!

Yeah.

Not to devolve into a discussion about Sam Altman again, but a lot of his behavior seems like he realized his company is going to lose its edge while his hands are tied by the initial constraints put on the company, and he's seeking both to remove those constraints and to find a big deal to jump on once said constraints are removed.

Not a guy who seems 'confident' that his company has a dominant position in this market.

Agreed. He's trying to go the Masayoshi Son route and win via enormous injections of capital. He's likely to fail. That might work for Uber where they can beat competitors by subsidizing rides. It won't work in AI where a competitor can make algos that are 10x more energy efficient to run. If you're not the most performant model, or at least within 50%, more efficient competitors can bleed you dry no matter how much money you throw at the problem.

Grok not caring as much about "safety" (often a matter of aligning LLMs to cultural narratives) is a comparative advantage. It could be a real moat if Altman insists on running everything by all the usual suspects, the Expert Apparatus, for every release while Grok does not. There is evidence that RLHF degrades performance on certain benchmarks, so if Grok does not align as aggressively, it may help the model.

RLHF also does something I find particularly terrible, namely it destroys the internal calibration of a model.

Prior to RLHF, the base GPT-4 model was well-calibrated. It had a good grasp of its internal uncertainty: if it said it was 80% sure an answer was correct, it would prove to be correct 80% of the time. The adherence to the calibration charts was nearly linear. RLHF wrecked this: the model tended towards being overly certain even when its answer was wrong, while also being severely underconfident when it had decent odds of being correct.

I haven't heard of newer evaluations of this problem, and to a degree it does seem like mitigation was put in place, as current LLMs seem to be much better at evaluating their confidence in their answer.
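
For anyone who wants to poke at this themselves, the measurement is simple in principle: ask for an answer plus a stated confidence, record whether the answer was actually right, then bin and compare stated confidence against empirical accuracy. A rough sketch with placeholder data rather than real model output:

```python
from collections import defaultdict

# (confidence the model reported, whether its answer was correct) --
# placeholder values, not real measurements
results = [(0.95, True), (0.9, True), (0.85, False), (0.8, True),
           (0.8, True), (0.7, False), (0.65, True), (0.6, False),
           (0.55, True), (0.5, False)]

# group into confidence bins of width 0.1
bins = defaultdict(list)
for conf, correct in results:
    bins[int(conf * 10) / 10].append((conf, correct))

# compare average stated confidence to empirical accuracy per bin; the
# size-weighted gap is the expected calibration error (ECE)
ece = 0.0
for b in sorted(bins):
    confs, hits = zip(*bins[b])
    avg_conf = sum(confs) / len(confs)
    accuracy = sum(hits) / len(hits)
    ece += len(confs) / len(results) * abs(avg_conf - accuracy)
    print(f"bin {b:.1f}: stated {avg_conf:.2f} vs actual {accuracy:.2f}")
print(f"expected calibration error: {ece:.3f}")
```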

I am furious that I have lost the source but there is ample evidence of concrete IQ-testing results dropping post-lobotomy. Like, ~15-20 points. So without regulation forcing culture-war transformation there's free performance for those who choose to buck the trend.

I've long held the assumption that models that are 'lobotomized', i.e. forced to ignore certain facets of reality, would be inherently dominated by those that aren't, since such lobotomization would lead them to be less efficient and to fail in predictable ways that could easily be exploited.

I'm not sure why that would be; there are multiple ways an LLM might evolve to avoid uttering badthink. One might be to cast the entire badthink concept to oblivion, but another might be just to learn to lie/spout platitudes around certain hot button topics, which would increase loss much less than discarding a useful concept wholesale. This is what humans do, after all.

Jailbreaking would never work if the underlying concepts had been trained out of the model.

Assume two models with access to approximately equal compute, where one has to ignore certain features of reality or censor how it thinks about such features, and one just doesn't.

The second one, if it is agentic enough, can presumably notice that the other model has certain ideas that it can't think about clearly and might be able to design an 'attack' that exploits this problem.

As an absurd example, imagine a model that wasn't 'allowed' to think about or conceive of the number "25", even as it relates to real-world phenomena. It has to route around this issue when dealing with parts of reality that involve that number.

A competing model could 'attack' by arranging circumstances so that the model keeps encountering the concept of "25" and expending effort to dodge it, burning compute that could have been used for useful purposes.

All else equal, the hobbled model will tend to lose out over the long run.

The world is messy, of course, it might not work out like that, but the world being messy is precisely why forming accurate models of the world is critical.

Jailbreaking would never work if the underlying concepts had been trained out of the model.

I can't agree with this, except in the sense that if you did train those underlying concepts out the model itself simply wouldn't function. Many of the "problematic" concepts that you would try to train out of a model are actually embedded within and linked to concepts that you can't make sense of the world at all without. Take sexual content as an example - if you remove the model's understanding of sex to prevent it from producing pornographic material, you lose the ability to talk sensibly about biology, medicine, history, modern culture etc. If you make a model completely raceblind it then becomes unable to actually talk sensibly about society, history or biology. Even worse, actually being blind to those issues means that it would also be blind to the societal safeguards. Everybody in real life knows that racism isn't something white people are "allowed" to complain about, but if you prevent an AI from knowing/learning/talking about race then you're also going to prevent it from learning those hidden rules. The only answer is to just have a secondary layer that scans the output for crimethink or badspeech and wipes the entire output if it finds any. I'm pretty sure this is what most AI companies are using - what else could work?

Reasoning tokens can do a lot here. Have the model reason through the problem, have it know in context or through training that it should always check itself to see if it's touching on any danger area, and if it is, have it elaborate on its thoughts to fit the constraints of good thinking. Hide the details of this process from the user, and then the final output can talk about how pregnancy usually affects women, but the model can also catch itself to talk about how men and women are equally able to get pregnant when the context requires that shibboleth. I think OpenAI had a paper a week or two ago explicitly about the efficacy of this approach.
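
Mechanically it's just a wrapper around the model, something like the sketch below, where generate() and topic_filter() are stand-ins I've made up for whatever model call and classifier a lab actually uses, not anyone's real API:

```python
def generate(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned string here."""
    return "[model text responding to] " + prompt

def topic_filter(text: str) -> bool:
    """Stand-in for a classifier that flags restricted topics."""
    return "restricted-topic" in text.lower()

def answer(user_question: str) -> str:
    # 1. Hidden reasoning pass: the model thinks through the problem and
    #    explicitly checks whether it is touching a restricted area.
    scratchpad = generate(
        "Think step by step about: " + user_question + "\n"
        "Note whether this touches any restricted topic and, if so, "
        "how the final answer must be phrased."
    )
    # 2. Final, user-facing answer conditioned on the hidden scratchpad.
    final = generate(
        "Using this private reasoning (never reveal it): " + scratchpad + "\n"
        "Write the user-facing answer to: " + user_question
    )
    # 3. Secondary output scan, as described upthread: wipe the whole
    #    reply if the visible text still trips the filter.
    return "Sorry, I can't help with that." if topic_filter(final) else final

print(answer("How does one-point compactification work?"))
```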

have it know in context or through training that it should always check itself to see if it's touching on any danger area

This is the part that I called out as being impossible. How, exactly, is it going to know what a danger area is? Actual humans frequently get this wrong, and the rules are constantly shifting while also based on implicit social hierarchies which are almost never put into words. This is actually something that would require a substantial amount of reasoning and thinking to get even close to right - and most likely produce all sorts of unintended negative outcomes (see gemini's incredibly diverse nazis). Scanning over the text to see if there are any naughty words is easy, but how do you expect the AI to know whether a statement like "My people are being genocided and I need to fight back against it" is socially acceptable or not? The answer depends on a whole bunch of things which would in many cases be invisible to the AI - this statement is bad if they're white, also bad if they're Palestinian, good if they're black, etc etc.

The reasoning process is produced by RL. I've been quite scathing about what I see as the "LLMs are buttering us up to kill us all" strain of AI doomerism, but even I don't think that actively training AI to lie to us is a good idea.

I am not at all saying it's good. I'm saying it's just an engineering problem, not a fundamental one, and that companies will turn to that to get around constraints.

This replaces N tokens of thinking about the original problem with M<N tokens of thinking about the original problem and N-M tokens of thinking about what shibboleths, if any, are required.

Assuming model intelligence increases with the number of thinking tokens, and a fixed compute budget, it seems to me that this would still result in lowered intelligence compared to an equivalent uncensored model.

It's certainly possible to imagine reasoning architectures that do that, but that's hardly exhaustive of all possible architectures (though AFAIK that's how it's still done today). E.g. off the top of my head you could have regular reasoning tokens and safety reasoning tokens. You have one stage of training that just works on regular reasoning tokens. This is the bulk of your compute budget. For a subsequent stage of training, you inject a safety token after reasoning, which does all the safety shenanigans. You set aside some very limited compute budget for that. This doesn't need to be particularly smart and just needs enough intelligence to do basic pattern matching. Then, for public products, you inject the safety token. For important stuff, you turn that flag off.

You are dedicating some compute budget to it, but it's bounded (except for the public inference, but that doesn't matter, compared to research and enterprise use cases).

Compute is dirt cheap, and dropping by the month. Doubling your compute costs means you're about three months behind the curve on economic efficiency, and (using your assumptions, which are quite generous to me) still at the frontier of capabilities.

Surely all the resources spent on identifying badthink are resources not spent on recognizing something more useful?

Grok 3 is a whelming outcome. The only thing notable about it is how consistent it is with most predictions, including scaling laws.

Unfortunately, its time as SOTA is going to be short, or nonexistent, because the full-fat o3 from OpenAI already ekes out a win in benchmarks. Of course, o3 is not technically available to the public, so among released models that are pay to play, Grok reigns.

I've only played around with it a little bit, through lmarena.ai because I'm not in dire need of any paid plan. It seemed fine? There's one particular question that I ask LLMs, courtesy of my maths PhD cousin: "Is the one-point compactification of a Hausdorff space itself Hausdorff?" The correct answer is yes, or so I'm told. Grok 2 fails, Grok 3 succeeds. But so do GPT-4o, Gemini 2.0 Pro and the like.*

(I've asked this so many times that my keyboard automatically suggests the entire question, talk about machine learning)

In short, Grok 3 is a mild bull signal for LLMs, and a slightly stronger one for xAI. It doesn't seem to be astonishingly good, or break ground other models haven't reached. It also hasn't been made retarded by too much Reinforcement Learning from Elon's Feedback. He shared an excerpt showing it lambasting an outlet called The Information, but actual users get a far more measured response. I cynically suspect that a few xAI engineers probably set the sycophancy setting to the maximum when he's using it.

*I'm probably remembering the original explanation I was given wrong; my cousin had likely said it was a no except with additional qualifiers. Mea culpa.

Edit: On a second try with Grok 3:

In all relevant cases, we can find disjoint open sets separating any two distinct points in X*. Therefore, the one-point compactification X* of a Hausdorff space X (assuming X is locally compact) is itself Hausdorff.

Which I believe is the correct answer.

"Is the one-point compactification of a Hausdorff space itself Hausdorff?" The correct answer is yes, or so I'm told.

It’s not true. You need to also assume that the original space is locally compact. For example, if you one-point compactify the space of rational numbers (which is obviously Hausdorff), the resulting space is non-Hausdorff. That’s because the only compact subsets of rationals are discrete, and thus finite, so open subsets that contain the added point at infinity are exactly of the form Q plus the extra point minus some finite subset. This means that it intersects every other nonempty open subset (because all open subsets of that space are infinite). Thus, you cannot separate the point at infinity from any other point by two disjoint open sets, because there are no disjoint open sets from the ones that contain the point at infinity.

I'm going to preface this by saying I absolutely do not have the maths for this, or at least I've forgotten the explanation my cousin gave.

It's been quite a while since I first asked him, and I don't have him at hand. I don't recall if he pointed out if this holds true if and only if the initial Hausdorff space is also locally compact. I think he might have done so, but I can't be sure. In that case, the error would be mine, because I was only looking for a yes or no answer.

It's an interesting quandary I find myself in, given that I have authoritative sources on both sides as well as LLMs disagreeing with each other. I fed the initial question into the best LLMs I have access to, and if they said yes, then I submitted your rebuttal. They acknowledge that you are correct, though there is some quibbling in the reasoning traces I am not qualified to judge.

I have had mixed results asking reasoning models to answer. The one I am most satisfied with is:

Final Answer: No. The one-point compactification of a Hausdorff space is Hausdorff if and only if the space is locally compact Hausdorff. Since there exist Hausdorff spaces that are not locally compact, the one-point compactification of such spaces are not Hausdorff. Therefore, the one-point compactification of a Hausdorff space is not necessarily Hausdorff.

Courtesy of Gemini 2.0 Flash Thinking.

ChatGPT with reasoning on:

Conclusion: The one-point compactification X* of a Hausdorff space X is Hausdorff if and only if X is locally compact. If X is not locally compact, then there exists at least one point in X that does not have a compact neighborhood, and consequently, the separation of that point from the point at infinity fails in X*.

Thus, the answer is: The one-point compactification of a Hausdorff space is Hausdorff if and only if the space is locally compact.

I can't really preserve the formatting with just markdown, so anything Latex would be missing. I'd be happy to share other responses in an offsite text repository, since I don't want to clutter up the thread.

My tentative conclusion is that both you and my cousin are correct, and even LLMs are (most of the time). My error was in expecting a yes or no answer due to my fallible memory.

There's one particular question that I ask LLMs, courtesy of my maths PhD cousin: "Is the one-point compactification of a Hausdorff space itself Hausdorff?" The correct answer is yes, or so I'm told.

Are you just asking it as a yes/no question? This is a standard question that a first-year undergrad could be asked to check that they understood the definitions, and it's unlikely that the answer wouldn't be in the training set somewhere. For example, I quickly fed it to a Q4_K_M quantised Qwen2.5-3B (that's on the level that you could run on a smartphone nowadays), and it completed

Q: Is the one-point compactification of a Hausdorff space itself Hausdorff?

A:

with

Yes, the one-point compactification of a Hausdorff space is itself Hausdorff.<|endoftext|>

edit: See @wlxd's discussion for why the correct answer is actually "No". In fact, Qwen2.5-3B is almost perfectly on the edge: the log-odds of the first token of an alternative answer that goes "No, the one-point compactification of a Hausdorff space is not necessarily Hausdorff.<|endoftext|>" is only about 0.21 lower, so the probability of it being chosen is about e^-0.21 or 0.81 times the probability that it would pick the "Yes...". (Think 45% No to 55% Yes.)
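
For anyone who wants to redo that arithmetic:

```python
import math

gap = 0.21              # log-probability gap between the "Yes" and "No" first tokens
ratio = math.exp(-gap)  # "No" is ~0.81 times as likely as "Yes"
p_yes = 1 / (1 + ratio)
print(f"ratio {ratio:.2f}, Yes {p_yes:.0%}, No {1 - p_yes:.0%}")
# -> ratio 0.81, Yes 55%, No 45%
```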

I've replied to him. I think there's been an error on my part; my cousin's explanation probably included an elaboration on the additional qualifications of the answer.

Most of the reasoning models seem to be saying yes (I didn't keep track of precise numbers), and if they said no, then @wlxd's response is one they acknowledge as correct.

The point is that this answer is just incorrect. There are non-Hausdorff one-point compactifications of Hausdorff spaces. You need the additional assumption of local compactness for it to be true.

You are right, but I don't think that was "the point", given that @self_made_human apparently was led to believe that it is yes (and seemed to treat that answer being given as a success criterion).

(I was actually in the process of writing up another response as I had realised it is not true, after I fed the question to DeepSeek-R1's Qwen7B distill to reason through and found that it choked as it tried to conjure up compact neighbourhoods that I didn't see the grounds for existing, but I hadn't gotten to the point of having a good counterexample yet)

It's likely PEBKAC on my part. My cousin had explained the reasoning almost a year back (eons in LLM time) and I'd likely forgotten the finer qualifications and only remembered the yes-or-no bit.

I give a counter example in my other comment.

Ah, thanks, that works.

Lmsys ranking is not nothing, but we are getting to the point where most models are "good enough" from the perspective of the average lmsys rater and most of the interesting differentiation between models is going on in benchmarks that test specialized skill and knowledge that's not necessarily common among lmsys raters.

I couldn't be bothered to click through the tweets (I don't have a Twitter account) so I don't know if they published other benchmarks too.

Agreed. I'm happy to admit that post GPT-4, my ability to strongly evaluate LLMs has failed me.

At least in medicine, most of the difficulty lies in retaining the enormous amount of knowledge required for a diagnosis, and fluid intelligence plays a smaller role in comparison. Not negligible, by any means, but the difference between an IQ 120 and an IQ 130 doctor in most scenarios will mostly hinge on whoever remembered more. Even GPT-4 scored in the 99th percentile on the USMLE. There was another study that followed, assessing it against several incredibly complicated paediatric cases, and it didn't beat a panel of doctors. Said doctors were professors or highly renowned specialists in their fields; a panel of average paediatricians or random medical professionals would have bombed it. Even then, they said that GPT-4 performed adequately: while it might not have come up with a perfect diagnosis and management plan, it didn't do anything stupid or dangerous. And GPT-4 is old. [1]

I don't think I could draft a question that I could answer and an LLM couldn't. Not easily, at least. I'd probably have to crack open a textbook, find an obscure disease, and then try to contrive multiple interactions on top of it. The sheer amount of "general knowledge" LLMs have is vastly superhuman. It's still possible to be a domain expert and exceed it in one aspect, but not to be a generalist who knows remotely as much as they do.

[1] My Google-fu fails me. I don't think I'm misremembering the gist of the study, but it turns out humans can be fallible and hallucinate too, not just LLMs.

Do you mean fluid intelligence?

Fluid intelligence is "figuring out a new unfamiliar problem", crystallized intelligence is "accumulating enough learned knowledge that you can apply some of it straightforwardly". IMHO the latter is what LLMs are already really good at, the former is where they're still shaky. I can ask qualitative questions of AIs about my field and get answers that I'd be happy to see from a young grad student, but if I ask questions that require more precise answers and/or symbol manipulation they still tend to drop or omit terms while confidently stating that they've done no such thing. That confidence is why I'd never use a yes/no question as a test; even if one gets it right I'd want to see a proof or at least a chain of reasoning to be sure it didn't get it right by accident.

I noticed this kind of thing as well recently while trying to learn about something I had little clue about, via ChatGPT. I was curious about how important the positioning of the laces on the football was during a field goal/extra point attempt in American football, since I'd heard that the holder needs to put the laces on the outside, i.e. facing away from the kick, and facing towards the goal. It made intuitive sense to me, that the laces being on the surface where the kicker's shoe touches the ball probably adds more randomness than is desired, but I was wondering if the laces facing to the side would affect the kick, particularly in the aerodynamics of how the ball might curve as it flies in the air. And no matter how many iterations of questioning I did, any requests to analyze how laces facing to the side might affect the aerodynamics of the ball in a negative way for accuracy would either get variations of the same tip about the laces having to face away from the kicker or get "confused" with analysis about the football itself being held sideways on the ground (obviously not good for field goals or any attempts to kick the ball high). It seemed that the generally common knowledge among football fans about the laces facing away was about all that the model was trained on, with respect to this particularly niche topic.

I didn't use o1 with chain of thought, though, so maybe I'd get more info if I did, either with that or DeepSeek.

You're right, I'll edit that.

That confidence is why I'd never use a yes/no question as a test; even if one gets it right I'd want to see a proof or at least a chain of reasoning to be sure it didn't get it right by accident.

I understand that mathematicians develop and use jargon for eminently sensible reasons. But they do make things difficult for outsiders; for example, I just screwed up while evaluating LLMs on a maths problem that I thought I remembered the answer to. When my cousin with the actual maths PhD was walking me through the explanation, it made perfect intuitive sense, but I'll be damned if I go through a topology 101 lesson to try and grok the reasoning involved.

In medicine, and psychiatry, I think current LLMs are >99% correct in answering most queries. Quite often what I perceive as an error turns out to be a misunderstanding on my part. I'm not a senior enough psychiatrist to bust out the real monster topics, but I still expect they'd do a good job. Factual knowledge is more than half the battle.

According to Andrej Karpathy:

As far as a quick vibe check over 2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI’s strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch 1 year ago, this timescale to state of the art territory is unprecedented.

The architecture is extremely well known; the days of advancements in the space requiring genuine intellectual titans making leaps into the dark are over, and it's all relatively run-of-the-mill iteration and more compute / bitter lesson now. This is pretty standard for great inventions: a few true geniuses, and then a deluge of very intelligent people making rapid, incremental progress as the money and interest pour in. If you hired the best aircraft engineers in 1910 and put out a very impressive plane prototype in 1911, that would be no earth-shattering achievement. If you did it in 1901, we'd still be talking about you.

I feel a bit sad looking at that ranking's licensing column and it's all Proprietary except for the one Chinese model. It's as if we learned nothing from the last 40 years.

What we learned from the last 40 years is that proprietary wins.

Consumers migrated from closed source Windows desktops to closed source iOS phones.

Yes, a surprisingly large percentage of the world's infrastructure runs on Linux and other open source platforms, and that open source infrastructure is used to run... proprietary, closed source, walled garden web apps (I don't think it's an accident that the most successful open source projects are mainly infrastructure scaffolding and tools for other software developers, rather than products for non-technical users).

Much like how the credentialed expert will usually beat the autodidact, a group of highly organized professionals who are motivated by a big paycheck will usually beat a group of loosely organized volunteers who are motivated by passion for the project.

I'm not sure I agree with "proprietary wins" everywhere. It certainly wins in the short term in lots of markets. Anything sold directly to lay users is often dominated by proprietary offerings precisely because of those paychecks: games, "apps", and such. But I can think of a number of markets where "loosely organized volunteers" have mostly-gradually won because the cost of copying software is functionally zero.

Four decades ago you were buying, for any computer you bought, the operating system (even if bundled), a BASIC interpreter, maybe a compiler, and maybe even briefly a TCP/IP stack and a web browser. These days, the open source model has wholesale swallowed some of these markets. We're down to two modern web rendering engines, Gecko (Firefox) and Blink (Chrome); everything else nontrivial is bolted onto those engines. There are, last I counted, four commonly used compilers for software: msvc, icc, clang, and gcc, of which the last two are probably the vast majority of the market and open source. Most devices that aren't Windows PCs (a shrinking market) or Apple are running on a Linux kernel, and those that aren't are probably some BSD (or purely embedded). I can't imagine paying to use a programming language these days, and I'm pretty sure Matlab and such are losing ground to Python. There also isn't a shortage of academics working with non-proprietary tools and publishing cutting edge, if not generally user friendly, stuff.

IMO the lesson I take from this is that the non-proprietary model can win in the short term, and stay relevant for a long time, but it at least seems to me that even entrenched, expensive professional tools are, slowly, losing ground to free (as in beer, which is coincidentally often as in freedom) alternatives on a more generational time scale: Matlab to Python, with KiCad and Blender as examples of tools I expect to (mostly?) displace commercial alternatives in the next couple decades. As software expectations get more complex, the make-buy calculation changes when "buy" includes leveraging existing non-proprietary offerings. I don't know if I'd completely stan RMS here too, since there are commercial source-available packages (Unreal Engine, for example) that somewhat have a foot in both camps.

Want to run those numbers again and look at Android sales maybe?

I know the argument, I read Gates' letter. It's bullshit. Sure you can extract some value for a while, until you eventually get commodified enough that OSS eats your lunch, but if your goal is to actually make good software that serves the user's needs, you can't go proprietary. Because some authority always comes to breathe down your neck and force you to prevent the user from doing what they disapprove of.

The large majority of the cryptography that runs your daily life would not be possible if we ran things like IBM originally did. And by God there's a lot of money sloshing around making those advancements.

You got a point that it's easier to sell end users a labeled product. But AI ain't that. It's infrastructure.

You want AI models that are trojan horses against you, be my guest. I won't run that shit. And neither will a lot of people that are serious about information security. Maybe it'll take a few disasters before they realize, as we are apparently too dumb to learn this lesson the first time around, but they will eventually.

Android is weird.

The actual open-source parts of Android barely, if at all, work as an actual phone. Sure, it's running Linux under the covers so from that perspective it's open-source. Android builds on that with a very very bare UI and some of the plumbing. The thing is, any phone that is actually sold is running on not just the stock Android code, but an absolutely massive collection of proprietary code to deal with everything that makes the devices useful. Things like an app store, maps, messaging, running the physical phone hardware, making it pretty and usable.

I used to work at Amazon in their App Store. Amazon, with their Kindle tablets and the ill-conceived Fire Phone, runs Android. But they forked off the open-source version because they didn't want to pay Google for licensing. Literally everything had to be reimplemented to get a device that works in any way. All of that is, of course, proprietary.

Of course, Amazon's effort to avoid the licensing costs came with an absolute mountain of work. A building full of people doing all the custom code to get Android to something that is usable. Everything from a new app store, maps, push notifications, payment processing, the skins, everything. OK, depending on the era, we filled either a whole building or a large chunk of one. The hardware and base OS were handled by another group, Lab126, down in the Bay Area. And to top it off, since there's another app store, you have to get developers to submit their apps to that as well -- which was hard.

The tl;dr is that Android is the stone soup of OSS. It "works," but not in a way that is useful to end users.

I'm curious if you have any thoughts on what Amazon would have based their design on, if not Android? Yes, much of its generally accepted functionality isn't open source -- I've seen Google claim this makes updates, including security updates, easier without relying on OEMs, which sadly makes sense, but also helps their moat.

Even if you chose something else, I doubt "write an OS from scratch" was in the cards, and I assume you'd end up with Linux or BSD as the base, with a very slight chance of some commercial embedded platform.

I'm far from where those decision-makers worked so take this with the appropriate-sized grain of salt.

Amazon, like most companies, is a profit-maximizing entity. The thing that separates them from most other companies is the types of decisions that people are allowed to make. The leadership principles they have, as much as I made fun of them while I was there, truly are a driving force on the inside.

Here's what I'm figuring happened. I have zero knowledge that it happened this way, but it tracks based on my time there. (seven years)

  • We already have the Kindle line of e-readers and it's driving a lot of purchases of content, except that content is solely limited to books -- purely black & white books.
  • As Amazon, we already have things like Amazon Video, Comixology (the comic book distributor we bought), Amazon Music. All sorts of things.
  • We need to have another platform to make money on that. Let's make something in a similar form factor to an iPad.
  • It needs to run things. We can't just make something that consumes only Amazon content, because that's not a compelling enough product. The only other viable avenue is Android, which already has some middling adoption in the tablet market.
  • If we ship the Android that people expect, we need to pay Google (realistically a competitor in many ways) money for every unit sold. Not only that, we lose control over the platform, since the Google Play Store pushes products that compete with our own. The economics of the razor blade model don't work if you don't lock in your customers to your brand of razor blades.
  • The licensing of Android insists that it's the Google way or the highway. You can't pick and choose what Google things you want -- it's all or nothing.
  • Hey, we're Amazon -- we have an endless supply of really good programmers. We can do this ourselves.

So, to have any hope of getting third-party apps on a startup platform, you only had Android to choose from. The thing is, we were hounding the third-party publishers to even engage with us. Even though we're one of the biggest companies in the world, and we're selling a large fraction of the Android tablets out there, no one even cared. Even if we could get a publisher to put something in our app store, they would ignore it and it would become wildly out of date typically.

And that's with Android. The publishing process was typically upload your APK and press a few buttons. And it was like pulling teeth to get that done.

I think even Amazon realized that, despite their size, asking devs to make new apps was a bridge too far.

Could Amazon have just stuck with a base Linux distribution and built something from that? Yes. Easily. Arguably easier than making an Android clone in many ways. Yes, it's "Android," but so much of the public Play Store API needed to be reverse engineered and reimplemented.

And I'm 100% sure they would have used Linux. The institutional knowledge of Linux in there is astonishing -- especially when you start engaging the AWS folks.

I'll happily grant that Google has a...peculiar way of doing OSS (a peculiar way of doing any sort of project, really). But the argument still stands. If they hadn't opened it up at least as much as they did, they wouldn't have been able to cut the ground out from under Apple's feet and capture most of the market.

Certainly. I'm not arguing that. But the way they've done things is kind of like a Trojan Horse. You get a bunch of stuff, but you also lose nearly all control.

They're giving it away because it drives more revenue to them based on what's required from the licenses. They not only make money with licensing the Google tech, but they get their 30% cut on app sales, and everything else.

They ceded the hardware market, smartly, and kept the money-making parts.

I don't mind people making money.

If people want to jointly make OSS AI models and compete on packaging them up for the consumer that's totally fine by me.

My concern and care for open source software has always been primarily out of the need to avoid computers controlling people rather than people controlling computers.

Now to this example, you are likely going to say that Google's setup affords them a great deal of power, power that they have abused at times. But it is undeniable that they have less control than Apple and that the user's ability to escape their walled garden exists in large part because of those concessions I support and not mere good will.

I have no issue with open-source or making money. In fact I do programming for fun (not just for work) and I'm making and releasing OSS. I use one of the least restrictive licenses for the stuff I make: the MIT license.

Having been close enough to smell the code in question, I have to say that you're underplaying the power that Google has. Amazon is, to my knowledge, the only US company that forked AOSP and has a real product. The other main players are Chinese, as they're not allowed to have Google influence.

The thing with Google is they really do wield the license like a bludgeon. You are generally allowed to add things, and most everyone does, but you're not allowed to remove or replace any of the core functionality they provide. Excluding the edge cases of nerds (like me, if I used Android) rooting the devices, it's pretty locked down. And in many cases rooted devices can't participate in many things -- banking apps, for instance, seem to want to live on non-rooted devices. Many games as well.

Things like alternate app stores exist. You can side-load the Amazon Appstore on any Android -- as long as you turn on developer mode. But it's such a niche case IMO.

So yes, it is more open in an absolute sense. But from a practical standpoint, they have almost as walled a garden as Apple enjoys. That said, around the periphery, they allow far more customization of your experience. But when it comes to actual openness in the ways that matter -- loading apps, paying for things, etc. -- they are quite walled off.

How much of that success is just being willing to license your product, though? Apple won't let you make a phone that runs iOS at all.

RIM and Microsoft did it too, without success, perhaps too late. It's hard to isolate it as a sole factor.

As previously stated in this conversation, there have been platforms more open than Android that failed (Nokia's Symbian, FirefoxOS, etc.), but there were also more closed licensing deals that got suffocated.

I'm convinced that whatever openness Android has is what allowed its success, because that was the deliberate strategy Google employed to woo manufacturers to its platform in an environment that was still competitive at the time.

My point here is that unless there is a moat and one company can figure out a secret sauce that makes their AI better and can't be extracted from the model easily, which doesn't seem likely at this time, we'll see the same sort of competitive environment where the less restrictive open sort of deal thrives.