Culture War Roundup for the week of March 31, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


I've also noticed this when plugging grad-level QM questions into Gemini/ChatGPT. No matter how many times I tell it that it's wrong, it will repeatedly apologize and make the same mistake, usually copied from some online textbook or solution set without being able to adapt the previous solution to the new context.

Could you confirm the exact models used? Both Gemini and ChatGPT, through the standard consumer interface, offer a rather confusing list of options that's even broader if you're paying for them.

I just used the free public-facing ones (Gemini 2.0 Flash, GPT-4o). You can try asking for the decay time of the 3p-1s transition in hydrogen. It can do the 2p-1s transition, since that question is answered in lots of places, but it struggles to extrapolate.

I will note that Gemini 2.0 Flash and GPT-4o are significantly behind the SOTA! The latter got a very recent update that made it the second best model on LM Arena, but they're both decidedly inferior in reasoning tasks compared to o1, o3 or Gemini 2.5 Pro Thinking. (Many caveats apply, since o1 and o3 have different sub-models and reasoning levels)

I asked two instances of Gemini 2.5 Pro:

Number 1:

What is the decay time for the 3p-1s transition in hydrogen? Make sure you are certain about your answer, after doing the relevant calculations.

Final answer: 5.27 ns

Second iteration:

What is the decay time for the 3p-1s transition in hydrogen? Make sure you are certain about your answer, after doing the relevant calculations. I have enabled code execution, if you think that would help with the maths.

Final answer: 5.28 ns

I wasn't lying to it, I'd enabled its ability to generate and execute code. Neither instance had access to Google Search, which is an option I could toggle. I made sure it was off. If you read the traces closely, you see mention of "searching the NIST values", but on being challenged, the model says that it wasn't looking it up, but trying to jog its own memory. This is almost certainly true.

I've linked to dumps of the entire reasoning trace and "final" answer:

First instance: https://rentry.org/cqty47r2

Second instance: https://rentry.org/2oyx24sa

I certainly don't know the answer myself, so I used GPT-4o with search enabled to evaluate the correctness of the answers. It claimed that both were excellent, and that the correct value is around 5.4 ns according to experimental results (the decay time of the hydrogen 3p state).
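For what it's worth, the ~5.4 ns experimental figure and the ~5.27 ns answers can both be right at once, since the total lifetime of the 3p state includes the 3p→2s channel on top of 3p→1s. Here's a minimal sketch of the arithmetic, assuming decay rates that I'm quoting from memory as NIST ASD values, so double-check them:

```python
# Relating the partial 3p -> 1s decay time to the total 3p lifetime.
# Decay rates (Einstein A coefficients) quoted from memory as NIST ASD
# values -- verify against the database before trusting them.
A_3p_1s = 1.6725e8   # s^-1, 3p -> 1s (Lyman-beta)
A_3p_2s = 2.245e7    # s^-1, 3p -> 2s (H-alpha component)

tau_partial = 1 / A_3p_1s              # lifetime against 3p -> 1s alone
tau_total = 1 / (A_3p_1s + A_3p_2s)    # actual lifetime of the 3p state

print(f"partial (3p -> 1s): {tau_partial * 1e9:.2f} ns")  # ~5.98 ns
print(f"total 3p lifetime:  {tau_total * 1e9:.2f} ns")    # ~5.27 ns
```

If those rates are right, the 5.27/5.28 ns answers correspond to the total 3p lifetime, so which number counts as "the" answer depends on whether the prompt is read as asking about the 3p-1s channel alone or the 3p state as a whole.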

I also used plain old Google, but didn't find a clear answer. There might be one in: https://link.springer.com/article/10.1007/s12043-018-1648-4?

But it's paywalled. I don't know if GPT-4o was able to access it despite this impediment.

Edit:

DeepSeek R1 without search claimed 1.2e-10 seconds. o3-mini without search claimed 21 ns.

The correct answer is about 5.98 ns when applying the spontaneous emission formula, so Gemini 2.5 Pro got it correct, although it had to reference NIST when its original formula didn't work. It looks like it copies the correct formula, so I'm not sure where the erroneous factor of 4 comes from.
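For anyone wanting to reproduce the calculation, it's short enough to script. This is my own reconstruction, not the model's: the radial wavefunctions follow Griffiths' conventions, the 1/3 is the standard angular factor for a p → s transition, and the constants come from scipy, so treat it as a sketch rather than a vetted reference.

```python
import sympy as sp
from scipy.constants import physical_constants, e, epsilon_0, hbar, c, pi

# Radial dipole matrix element <1s| r |3p>, done symbolically.
r, a = sp.symbols('r a', positive=True)
R10 = 2 * a**sp.Rational(-3, 2) * sp.exp(-r / a)              # 1s radial wavefunction
R31 = (sp.Rational(8, 27) / sp.sqrt(6)) * a**sp.Rational(-3, 2) \
    * (r / a) * (1 - r / (6 * a)) * sp.exp(-r / (3 * a))      # 3p radial wavefunction

D = sp.simplify(sp.integrate(R10 * r * R31 * r**2, (r, 0, sp.oo)) / a)
# D ~ 0.5167 in units of the Bohr radius

# Einstein A coefficient for spontaneous emission on 3p -> 1s.
a0 = physical_constants['Bohr radius'][0]
Ry = physical_constants['Rydberg constant times hc in J'][0]

omega = Ry * (1 - 1 / 9) / hbar            # transition angular frequency
d2 = (float(D) * a0) ** 2 / 3              # angular-averaged |<1s|r|3p>|^2
A = omega**3 * e**2 * d2 / (3 * pi * epsilon_0 * hbar * c**3)
tau = 1 / A

print(f"A   = {A:.3e} s^-1")          # ~1.67e8
print(f"tau = {tau * 1e9:.2f} ns")    # ~5.98 ns
```

Note also that multiplying this A by 4 gives roughly 6.7 × 10⁸ s⁻¹, close to the stray first-principles value in the trace, which fits the factor-of-4 suspicion.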

Thank you.

What do you mean by "reference NIST"? I think I've already mentioned that despite its internal chain of thought claiming to reference NIST or "look up" sources, it's not actually doing that. It had no access to the internet. I bet that's an artifact of the way it was trained, and regardless, the CoT, while useful, isn't a perfect rendition of inner cognition. When challenged, it apologizes for misleading the user, and says that it was a loose way of saying that it was wracking its brains and trying to find the answer in the enormous amount of latent knowledge it possesses.

I also find it very interesting that the model that couldn't use code to run its calculations got a very similar answer. It did an enormous amount of algebra and arithmetic, and there was every opportunity for hallucinations or errors to kick in.

For the first calculation dump at least, it comes up with a value of 6.63 × 10⁸ s⁻¹, then compares it to the expected value from the NIST Atomic Spectra Database, 1.6725 × 10⁸ s⁻¹, then spends half the page trying to reconcile the difference, before giving up and proceeding with the ASD value.

Hmm. I think that's likely because my prompt heavily encouraged it to reason and calculate from first principles. It's a good thing that it noted that those attempts didn't align with pre-existing knowledge, and accurately recalled the relevant values, which must be a nigh-negligible amount of the training data.

At the end of the day, what matters is whether the model outputs the correct answer. It doesn't particularly matter to the end user if it came up with everything de novo, remembered the correct answer, or looked it up. I'm not saying this can't matter at all, but if you asked me or 99.999% of the population to start off trying to answer this problem from memory, we'd be rather screwed.

Thanks for the suggestion and for looking through the answer. I've personally run up against the limits of my own competence; there are few things I can ask an LLM to do that I can't do myself while still verifying the answer.

At the end of the day, that's not really what matters, because nobody is going to need to solve a problem in physics with a known solution. A good portion of tests that I had as an undergraduate and in graduate school were open book, because simply knowing a formula or being able to look up a particular value wasn't sufficient to be able to answer the problem. If I want a value from NIST, I can look it up. The important part is being able to correctly engage in the type of problem solving needed to answer questions that haven’t ever been answered before.

I've had some thoughts about what it actually means to be able to do "research level" physics, which I'm still convinced no LLM can actually do. I've thought about posing a question as a top level post, but I'm not really an active enough user of this forum to do that and don't want to become one.

Finally, I want to say that for the past 18 months, I've continually been getting solicitations on LinkedIn to solve physics problems to train LLMs. The rate they offer isn't close to enough to make it worth it for me, even if I had any interest, but it would probably seem great to a grad student. I wouldn't be surprised if these models have been trained on more specific problems than we realize.

They quote the full model names in Appendix A.1, but it's really such a short paper that it's worth at least scrolling through before discussing.

  • o3-mini (high)
  • o1-pro (high)
  • DeepSeek R1
  • QwQ-32B
  • Gemini-2.0-Flash-Thinking-Exp
  • Claude-3.7-Sonnet-Thinking

While the performance is surprisingly poor, it's not entirely out of line with my own experience experimenting with this class of models. They do seem to hallucinate at a very high rate on problems requiring subtle but extremely tight reasoning.

Thank you for listing out the models in the paper, but I was more concerned with the ones you've personally used. If you say they're in the same tier, then I would assume you mean o3 (high) and o1-pro, but not Claude 3.7 Sonnet Thinking (since you didn't mention Anthropic). I will note that R1, QwQ and Flash 2.0 Thinking are worse than those two, even if they're still competent models.

The best that Gemini has to offer is Gemini 2.5 Pro Thinking, which is the state of the art at present (in most domains). Is that the one you've tried? If you're not paying, you're not getting it on the app. I use it through AI Studio, where it's free. For ChatGPT, what was the best model you tried?

If you don't want to go to the trouble of signing up to AI Studio yourself (note that it's very easy), feel free to share a prompt and I'll try it myself and report back. I obviously can't judge the quality of the answer on its own merits, so I'll have to ask you.

Ah, I'm not OP. I've tried o3 (high), o1-pro, and QwQ. For the paper, they have the prompts and grading scheme on the corresponding GitHub. USAMO questions are hard enough that you definitely need some expertise to grade them accurately, and I'm far from being capable of judging them myself.

Very qualitatively, the current crop of LLMs impresses me with the huge breadth of topics they can talk about. But "talking" to them does not give the impression they are better at reasoning than anyone I know who has scored >50% on the USAMO, IMO, or the Putnam.

But "talking" to them does not give the impression they are better at reasoning than anyone I know who has scored >50% on the USAMO, IMO, or the Putnam.

They are still improving very quickly, and I don't see the rate of improvement leveling off. Gemini 2.5 recently answered with ease a test question of mine that Gemini 2.0 (and, honestly, everything prior to Claude 3.5) had been utterly confused by. But I admit that they're definitely lacking in reasoning skills still; they're much better at retrieval and basic synthesis of knowledge than they are at extrapolating it to anything too greatly removed from standard problems that I'd expect were in their training data sets.

Still, can we take a step back and look at the big picture here? The USAMO is an invitation-only math competition where they pick the top few hundred students from a bigger invitation-only competition winnowed from an open math competition, and the median score on it is still sub-50%. The Putnam has thousands of competitors, but they're typically the most dedicated undergrad math majors, and yet the median score on it is often zero! How far have we moved the goalposts, to get to this point? It's the "Can a robot write a symphony?" "Can you?" movie scene made manifest.

I don't think I know anyone who:

has scored >50% on the USAMO, IMO, or the Putnam.

I think my younger cousin was an IMO competitor, but he didn't win AFAIK, even if he's now in a very reputable maths program.

I'm personally quite restricted in my ability to evaluate pure reasoning capability, since I'm not a programmer or mathematician. I know they're great at medicine, even tricky problems, but what makes medicine challenging is far more the ability to retain an enormous amount of information in your head than an unusually onerous demand on fluid intelligence. You can probably be a good doctor with an IQ of 120, if you have a very broad understanding of relevant medicine, but you're unlikely to be a good mathematician producing novel insights.

I did for all three, but it was many years ago, and I think I'd struggle with most IMO problems nowadays. Pretty sure I'm still better at proofs than the frontier CoT models, but for more mechanical applied computations (say, computing an annoying function's derivative) they're a lot better than me at churning through the work without making a dumb mistake. Which isn't that impressive, TBH, because Wolfram Alpha could do that too, a decade ago. But you have to consciously phrase things correctly for WA, whereas LLMs always correctly understand what you're asking (even if they get the answer wrong).