
Culture War Roundup for the week of March 24, 2025

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


Confusion sets in when you spend most of your life not doing anything real. Metrics and statistics were supposed to be a tool that would aid in the interpretation of reality, not supersede it. Just because a salesman with some metrics claims that these models are better than butter does not make it true. Even if they manage to convince every single human alive.

I just tried out GPT 4.5, asking some questions about the game Old School Runescape (because every metric like math has been gamed to hell and back). This game has the best wiki ever created, effectively documenting everything there is to know about the game in unnecessary detail. Spoiler: the answers are completely incoherent. It makes up item names and locations and misunderstands basic concepts like what type of gear is useful where. Asking it for a gear setup for a specific boss produces horrible results, despite the fact that it could literally just have copied the wiki (which has some faults like overdoing min-maxing, but it's generally coherent). The net utility of this answer was negative given the incorrect answer, the time it took me to read it, and the cost of generating it (which is quite high; I wonder what happens when these companies want to make money).

Same thing happens when asking questions about programming that are not student-level (student-level questions just return an answer copied from a textbook. Did you know you can solve a leetcode question in 10 seconds by just copying the answer someone else wrote down? Holy shit!). The idea that these models will soon (especially given the plateau they seem to be hitting) replace real work is absurd. They will make programming faster, which means we'll build more shit (that's probably not needed, but that's another argument). They currently make me about 50% faster and make programming more fun since it's a fantastic search tool as long as it's used with care. But it's also destroying the knowledge of students. Test scores are going up, but understanding is dropping like a stone.

I'm sure it will keep getting better; eventually a large enough model with a gigantic dataset will get better at fooling people, but it's still just a poor image of reality. Of course this is how humans also function to some degree, copying other people more competent than us. However, most of us combine that with real knowledge (creating a coherent model of something that manages to predict something new accurately). Without that part it's just a race to the bottom.

But a lot of people are like you, so these models will start to get used everywhere, destroying quality like never before. For example, I tried contacting a company regarding a missing order a few weeks ago. Their first-line support had been replaced by an AI. Completely useless. It kept responding to the question it thought I asked, instead of the one I actually asked. Then asking me to double-check things I had told it I had already checked. The funny thing is that a better automated support could have been created 20 years ago with some basic scripting (looking for an order number and responding with details if one was included). Or having an intern spend 30 seconds copy-pasting data into an email. But here we are, at the AI revolution, doing things we have always been able to do, now in a shittier and more costly way. With some added pictures to make it seem useful. Fits right in with the finance world, I guess?

I can however imagine a future workflow where these models do basic tasks (answer emails, business operations, programming tickets) overseen by someone that can intervene if it messes up. But this won't end capitalism. If you stopped LARPing on this forum/twitter you would barely even notice it. Though it is a shame that graphic design and similar fields will be hurt more than they should.

If you're really an SWE, I must presume that you're not speaking in good faith here.

Asking it for a gear setup for a specific boss produces horrible results, despite the fact that it could literally just have copied the wiki (which has some faults like overdoing min-maxing, but it's generally coherent). The net utility of this answer was negative given the incorrect answer, the time it took me to read it, and the cost of generating it (which is quite high; I wonder what happens when these companies want to make money).

You must know that GPT 4.5 is pretty mid as far as instruction models of this generation go. DeepSeek's latest is close in performance and literally 100-200x cheaper. More importantly, what do you think would be a random college-educated human's score on Runescape questions? It is so trivial to grant these systems access to tools for web browsing as to not be worth talking about.

The rest of your comment is in the same style. What is amazing and terrifying about LLMs is not their knowledge retrieval but their generality and in-context learning. At sufficient context length, and trained to appropriately leverage existing tools, there is nothing in the realm of pure cognitive work they cannot do at a human level. This is not hard to understand. So tell me: what are you going for? Just trying to assuage your own worries?

If you're really an SWE, I must presume that you're not speaking in good faith here.

I am also an SWE and have the same experience. The smartest models essentially work as good search engines, an interface between me and the API or language I am working with. No matter the prompt engineering or context window, they are utterly incapable of producing either reliable or good solutions to any moderately complex problem.

Please understand that I (and @Coolguy1337) have every incentive to leverage AI tools as much as possible. I use them daily for help with coding. If they could actually do my job I'd gladly sit back and let them do it--I already let them do as much of my job as they can.

At sufficient context length, and trained to appropriately leverage existing tools, there is nothing in the realm of pure cognitive work they cannot do at a human level

You must know this isn't true or we'd have already lost our jobs.

If you're not convinced yet, let me outline my general coding process.

  1. Get specifications
  2. Adapt them to something that makes sense--I am usually the owner of my projects, so what people ask me to do is usually not really what they want.
  3. Research appropriate frameworks/solutions to problems on a general level, such as by looking at similar existing products, if any
  4. Design basic architecture of solution
  5. Begin to implement
  6. Research appropriate solutions to specific problems, implement them
  7. Adjust and iterate according to feedback from client

I'll tell you with confidence that AI can't do a single one of these steps. I know this because I use AI at every step along the way, and while it works ok as a search engine (for example it's great at finding similar existing implementations) it simply does not work at all as an actual problem solver. Not even for any individual step, let alone all the steps together, no matter how much prompt engineering is used.

Seriously, I mean, if you were actually right, I could just retire and give an AI agent my job. At least for the year or so it will take for my industry to catch on. That time could be used to relax or find an AI-proof job. But I'm not worried at all about my job (at least not from AI agents) because I have extensive direct firsthand experience with them and they are still extremely limited.

If I had a dollar for every time someone used the current state of AI as their primary justification for claiming it won't get noticeably better, I wouldn't need UBI.

I just tried out GPT 4.5, asking some questions about the game Old School Runescape (because every metric like math has been gamed to hell and back). This game has the best wiki ever created, effectively documenting everything there is to know about the game in unnecessary detail. Spoiler: the answers are completely incoherent. It makes up item names and locations and misunderstands basic concepts like what type of gear is useful where. Asking it for a gear setup for a specific boss produces horrible results, despite the fact that it could literally just have copied the wiki (which has some faults like overdoing min-maxing, but it's generally coherent). The net utility of this answer was negative given the incorrect answer, the time it took me to read it, and the cost of generating it (which is quite high; I wonder what happens when these companies want to make money).

I just used Gemini 2.5 to reproduce, from memory, the NICE CKS guidance for the diagnosis and management of dementia. I explicitly told it to use its own knowledge, and made sure it didn't have grounding with Google search enabled. I then spot-checked it with reference to the official website.

It was bang-on. I'd call it a 9.5/10 reproduction, only falling short of perfection through minor sins of omission (it didn't mention all the validated screening tests by name, skipped a few alternative drugs that I wasn't even aware of before). It wasn't a word for word reproduction, but it covered all the essentials and even most of the fine detail.

The net utility of this answer is rather high to say the least, and I don't expect even senior clinicians who haven't explicitly tried to memorize the entire page to be able to do better from memory. If you want to argue that I could have just googled this, well, you could have just googled the Runescape build too.

I think it's fair to say that this makes your Runescape example seem like an inconsequential failing. It's about the same magnitude of error as saying that a world-class surgeon is incompetent because he sometimes forgets how to lace his shoes.

You didn't even use the best model for the job, for a query like that you'd want a reasoning model. 4.5 is a relic of a different regime, too weird to live, too rare to die. OAI pushed it out because people were clamoring for it. I expect that with the same prompt, o3 or o1, which I presume you have access to as a paying user, would fare much better.

The idea that these models will soon (especially given the plateau they seem to be hitting) replace real work is absurd

Man, there's plateaus, and there's plateaus. Anyone who thinks this is an AI winter probably packs a fur coat to the Bahamas.

The rate of iteration in AI development has ramped up massively, which contributes to the impression that there aren't massive gaps between successive models. Which is true: jumps of the same magnitude as, say, GPT 3.5 to 4 are rare, but that's mostly because the race is so hot that companies release new versions the moment they have even the slightest justification in performance. It's not like back when OAI could leisurely dole out releases; their competitors have caught up or even beaten them in some aspects.

In the last year, we had a paradigm shift with reasoning models like o1 or R1. We just got public access to native image gen.

Even as the old scaling paradigms leveled off, we've already found new ones. Brand new steep slopes of the sigmoidal curve to ascend.

METR finds that the duration of tasks (based on how long humans take to do them) that AIs can reliably perform doubles every 7 months.

On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

We think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people’s day-to-day work: the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long
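
A rough sketch of what that trend implies if (and it's a big if) it holds, with H_0 the reliable task horizon today and a doubling time of 7 months:

    H(t) = H_0 \cdot 2^{\,t / 7\ \text{months}}

So a model that reliably handles hour-long tasks today would be at a full 8-hour workday in roughly 21 months. That's an extrapolation of a fitted line, not a guarantee, but it's why "no single giant leap between releases" and "rapid progress" aren't contradictory.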

At any rate, what does it matter? I expect reality to smack you in the face, and that's always more convincing than random people on the internet asking why you won't look ahead and consider even modest, iterative improvement.

My main gripe with current day models is their lack of consistency. On the one hand, they can do very impressive things that save me hours of work, on the other, they can fuck things up in simple ways and it costs me hours of work to fix it. I was using Claude to program a scene in Godot and the file was showing a parsing error at line 1. I let him try to fix it multiple times, started new chats, etc. Then I just looked at the file, noticed a comment in line 1 starting with # and thought "maybe that's not allowed". I took the comment away and the file was fixed. It's insanely frustrating when the AI fucks up such a simple thing. The main benefit of AI is that you can just let it rip and create something without knowing what you are doing. If I have to check the code all the time for fuckups, it really drags on productivity.

I'm not a programmer, the best I can say about myself is that I once did a Leetcode medium successfully, in Python, with an abysmal score because it wasn't remotely optimized. At that level, everything from GPT-4 onwards is clearly superior to what I can do unaided.

I think the utility varies in different ways based on the domain skill of the user. A beginner programmer? Even if they get frustrating issues I find it hard to imagine they aren't immensely better off. The other end of the spectrum? You have people like Karpathy and Carmack singing their praises, while Linus says they're not nearly good enough. There are a dozen different programmers here saying different things.

There's also skill when it comes to using them, and that's an acquired ability. In your situation, it would likely have been better to give up on that conversation and try again, or to copy and paste the code into a different instance or a different model and ask it to find the issue. I expect this would have worked well. With too much cruft gunking up the context, LLMs can still fall into ruts or miss obvious problems. When in doubt, retry.

I'm a mid-level software dev who mostly spent the age of the LLMs doing Java Spring Boot backend development at two different companies. I've tried using the various chatbots provided to me, and so far they've been useless in 100% of all cases. It's entirely possible that I'm doing it wrong.

I've already mentioned Karpathy and Co. Even in this subreddit, you've got people like @DaseindustriesLtd or @faul_sname (are you a programmer? Well, you know your ML, so close enough for government work) who get clear utility out of them.

You recognize you might be using them wrong (and what are the specifics of how you attempted to use them? Which model? What kind of prompt? Which interface?), but I'm certainly not the best person to tell you how to go about it better. I could still try, if you want me to.

I've tried the reasoning models. They fail just as much (just tried Gemini 2.5 too and it did even worse). The purpose was to illustrate an example of how they fail. To showcase their poor reliability. I did not say they won't get better. They will, just not as much as you think. You can't just take 2 datapoints and extrapolate forever.

And I don't get your example, wouldn't the NICE CKS be in the dataset many times over? Maybe my point wasn't clear. These tools are amazing as search engines, as long as the user is responsible and able to validate the responses. That does not mean they are thinking very well. Which means they will have a hard time doing things not in the dataset. These models are not a pathway to AGI. They might be a part of it, but it's gonna need something else. And that/those parts might be discovered tomorrow, or in 50 years.

And I don't see why reality will smack me in the face. I'm already using these as much as possible since they are great tools. But I don't expect my work to look very different in 2030 compared to now. Since programming does not feel very different today compared to 2015. The main problem has always been to make the program not collapse under its own weight, by simplifying it as much as possible. Typing the code has never been relevant. Thanks for the comment btw, it made me try out programming with Gemini 2.5 and it's pretty good.

I mean, I assume both of us are operating on far more than 2 data points. I just think that if you open with an example of a model failing at a rather inconsequential task, I'm entitled to respond with an example of it succeeding at a task that could be more important.

My impression of LLMs is that in the domains I personally care about:

  1. Medicine.
  2. Creative fiction.
  3. Getting them to explain random things I have no business knowing. Why do I want to understand lambda calculus or the Church Turing hypothesis? I don't know. I finally know why Y Combinator has that name.
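
(A quick aside on that last one: the accelerator takes its name from the Y fixed-point combinator of untyped lambda calculus,

    Y = \lambda f.\,(\lambda x.\, f\,(x\,x))\,(\lambda x.\, f\,(x\,x)), \quad \text{so that} \quad Y\,f = f\,(Y\,f),

which lets you express recursion without ever naming the function being recursed on.)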

They've been great at 1 and 3 for a while, since GPT-4. 2? It's only circa Claude 3.5 Sonnet that I've been reasonably happy with their creative output, occasionally very impressed.

Number 3 encompasses a whole heap of topics. Back in the day, I'd spot-check far more frequently; these days, if something looks iffy, I'll shop around with different SOTA models and see if they've got a consensus or a critique that makes sense to me. This almost never fails me.

And I don't get your example, wouldn't the NICE CKS be in the dataset many times over?

Almost certainly. But does that really matter to the end user? I don't know if the RS wiki has anti-scraping measures, but there are tons of random nuggets of RS build and item guides all over the internet. Memorization isn't the only reason that models are good; they think, or do something so indistinguishable from the output of human thought that it doesn't matter.

If you met a person who was secretly GPT-4.5 in disguise, you would be rather unlikely to be able to tell at all that they weren't a normal human, not unless you went in suspicious from the start. (Don't ask me how this thought experiment would work; assume a human who just reads lines off AR lenses, I guess.)

These tools are amazing as search engines, as long as the user is responsible and able to validate the responses. That does not mean they are thinking very well. Which means they will have a hard time doing things not in the dataset. These models are not a pathway to AGI. They might be a part of it, but it's gonna need something else. And that/those parts might be discovered tomorrow, or in 50 years.

This is a far more reasonable take in my opinion, if you'd said this at the start I'd have been far more agreeable.

I have minor disagreements nonetheless:

  1. 99% of the time or more, what current models say in my field of expertise (medicine) is correct when I check it. Some people claim to experience severe Gell-Mann amnesia when using AI models, and that has not really been my experience.
  2. This means that unless it's mission critical, the average user can usually get by with taking answers at face value. If it's something important, then checking is still worthwhile.
  3. Are current models AGI? Who even knows what AGI means these days. By most definitions before 2015, they count. It's valid to argue that that reveals a weakness of those previous definitions, but I think that at the absolute bare minimum these are proto-AGI. I expect an LLM to be smarter and more knowledgeable and generally flexible than the average human. I can't ask a random person on the street what beta reduction is and expect an answer unless I'm on the campus of a uni with a CS course. That the same entity can also give good medical advice? Holy shit.
  4. Are the current building blocks necessary or sufficient for ASI? Something so smart that even skeptics have to admit defeat (Gary Marcus is retarded, so he doesn't count)? Maybe. Existing ML models can theoretically approximate any computable function, but something like the Transformer architecture has real world limitations.

And I don't see why reality will smack me in the face. I'm already using these as much as possible since they are great tools. But I don't expect my work to look very different in 2030 compared to now. Since programming does not feel very different today compared to 2015.

Well, if you're using the tools regularly and paying for them, you'll note improvements if and when they come. I expect reality to smack me in the face too, in the sense that even if I expect all kinds of AI related shenanigans, seeing a brick wall coming at my car doesn't matter all that much when I don't control the brakes.

For a short span of time, I was seriously considering switching careers from medicine to ML. I did MIT OCW programs, managed to solve one Leetcode medium, and then realized that AI was getting better at coding faster than I would. (And that there are a million Indian coders already, that was a factor). I'm not saying I'm a programmer, but I have at least a superficial understanding.

I distinctly remember what a difference GPT-4 made. GPT-3.5 was tripped up by even simple problems and hallucinated all the time. 4 was usually reliable, and I would wonder how I'd ever learned to code before it.

I have little reason to write code these days, but I can see myself vibe-coding. Despite your claim that you don't feel programming has changed since 2015, there is no end of talented programmers like Karpathy or Carmack who would disagree.

Thanks for the comment btw, it made me try out programming with Gemini 2.5 and it's pretty good.

You're welcome. It's probably the best LLM for code at the moment. That title changes hands every other week, but it's true for now.

99% of the time or more, what current models say in my field of expertise (medicine) is correct when I check it. Some people claim to experience severe Gell-Mann amnesia when using AI models, and that has not really been my experience.

  1. Okay, can we get people to start using 'delusions' or 'confabulations' instead of 'hallucinations'? This always irks me.

  2. I know we've bickered about this in the past but I think you have to be very cautious about what decision support tools and LLMs are doing in practical medicine at this time - fact recall is not most of the problem or difficulty.

The average person here could use UpToDate to answer many types of clinical questions, even without the clinical context that you, I, and ChatGPT have.

That's not the hard part of medicine. The hard part is managing volume (which AI tools can do better than people) and vagary (which they are shit at). Patients reporting symptoms incorrectly, complex comorbidity, a Physical Exam, these sorts of things are HARD.

Furthermore the research base in medicine is ass, and deciding if you want a decision support tool to use the research base or not is not a simple question.

On the topic of hallucinations/confabulations from LLMs in medicine:

https://x.com/emollick/status/1899562684405670394

This should scare you. It certainly scares me. The paper in question has no end of big names in it. Sigh, what happened to loyalty to your professional brethren? I might praise LLMs, but I'm not conducting the studies that put us out of work.

The average person here could use UpToDate to answer many types of clinical questions, even without the clinical context that you, I, and ChatGPT have.

I expect that without medical education, and only googling things, the average person might get by fine for the majority of complaints, but the moment it gets complex (as in the medical presentation isn't textbook), they have a rate of error that mostly justifies deferring to a medical professional.

I don't think this is true when LLMs are involved. When presented with the same data as a human clinician, they're good enough to be the kind of doctor who wouldn't lose their license. The primary obstacles, as I see them, lie in legality, collecting the data, and the fact that the system is not set up for a user that has no arms and legs.

I expect that when compared to a telemedicine setup, an LLM would do just as well, or too close to call.

That's not the hard part of medicine. The hard part is managing volume (which AI tools can do better than people) and vagary (which they are shit at). Patients reporting symptoms incorrectly, complex comorbidity, a Physical Exam, these sorts of things are HARD.

I disagree that they can't handle vagary. They seem epistemically well calibrated, consider horses before zebras, and are perfectly capable of asking clarifying questions. If a user lies, human doctors are often shit out of luck. In a psych setting, I'd be forced to go off previous records and seek collateral histories.

Complex comorbidities? I haven't run into a scenario where an LLM gave me a grossly incorrect answer. It's been a while since I was an ICU doc, that was GPT-3 days, but I don't think they'd have bungled the management of any case that comes to mind.

Physical exams? Big issue, but if existing medical systems often use non-doctor AHPs to triage, then LLMs can often slot into the position of the senior clinician. I wouldn't trust the average psych consultant to find anything but the rather obvious physical abnormalities. They spend blissful decades avoiding PRs or palpating livers. In other specialities, such as for internists, that's certainly different.

I don't think an LLM could replace me out of the box. I think a system that included an LLM, with additional human support, could, and for significant cost-savings.

Where I currently work, we're more bed-constrained than anything, and that's true for a lot of in-patient psych work. My workload is 90% paperwork versus interacting with patients. My boss, probably 50%. He's actually doing more real work, at least in terms of care provided.

Current setup:

3-4 resident or intern doctors. 1 in-patient cons. 1 outpatient cons. 4 nurses a ward. 4-5 HCAs per ward. Two wards total, and about 16-20 patients.

An unknown number of AHPs like mental health nurses and social workers triaging out in the community. 2 ward clerks. A secretary or two, and a bunch of people whose roles are still inscrutable to me.

Today, if you gave me the money and computers that weren't locked down, I could probably get rid of half the doctors, and one of the clerks. I could probably knock off a consultant, but at significant risk of degrading service to unacceptable levels.

We're rather underemployed as-is, and this is a sleepy district hospital, so I'm considering the case where it's not.

You would need at least one trainee or intern doctor who remembered clinical medicine. A trainee 2 years ahead of me would be effectively autonomous, and could replace a cons barring the legal authority the latter holds. If you need token human oversight for prescribing and authorizing detention, then keep a cons and have him see the truly difficult cases.

I don't think even the ridiculous amount of electronic paperwork we have would rack up more than $20 a day for LLM queries.

I estimate this would represent about £292,910 in savings from not needing to employ those people, without degrading service. I think I'm grossly over-estimating LLM query costs; asking one (how kind of it) suggests a more realistic $5 a day.
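
A purely illustrative back-of-envelope check, where every input is my own assumption rather than a measured figure: say the wards generate on the order of 50 documents a day at roughly 10,000 tokens each, and a frontier model charges around $10 per million tokens. Then

    50 \times 10{,}000 \times \frac{\$10}{10^6\ \text{tokens}} = \$5\ \text{per day},

which lands in the same ballpark as the model's own estimate.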

This is far from a hyperoptimized setup. A lot of the social workers spend a good fraction of their time doing paperwork and admin. Easy savings there, have the rest go out and glad-hand.

I reiterate that this is something I'm quite sure could be done today. At a certain point, it would stop making sense to train new psychiatrists at all, and that day might be now (not a 100% confidence claim). In 2 years? 5?

Do keep in mind how terrible most medical research is, and that includes research into our replacements. This isn't from lack of effort but from the various systems, pressures, and ethics at play.

How do you simulate a real patient encounter when testing an LLM? Well, maybe you write a vignette (okay, that's artificial and not a good example). Maybe you sanitize the data inputs and have a physician translate them for the LLM. Well shit, that's not good either.

Do you have the patient directly talk to the LLM and have someone else feed in lab results? Okay maybe getting closer but let's see evidence they are actually doing that.

All in the setting of people very motivated to show the tool works well and therefore are biased in research publication (not to mention all the people who run similar experiments and find that it doesn't work but can't get published!).

You see this all the time in microdosing, weed, and psychedelic research. The quality is ass.

Also keep in mind that a good physician is a manager also - you are picking up the slack on everyone else's job, calling family, coordinating communication for a variety of people, and doing things like actually convincing the patient to follow recommendations.

I haven't seen any papers on an LLM's attempts to get someone to take their 'beetus medication vs a living, breathing person.

Also Psych will be up there with the proceduralists among the last to be replaced.

Also also other white collar jobs will go first.

Do you have the patient directly talk to the LLM and have someone else feed in lab results? Okay maybe getting closer but let's see evidence they are actually doing that.

I expect this would work. You could have the AI be something like GPT-4o Advanced Voice for the audio communication. You could record video and feed it into the LLM. This is something you can do now with Gemini; I'm not sure about ChatGPT.

You could, alternatively, have a human (cheaper than the doctor) handle the fussy bits. Ask the questions the AI wants asked, while there's a continuous processing loop in the background.

No promises, but I could try recording a video of myself pretending to be a patient and see how it fares.
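
If I do, the mechanics are straightforward. Here's a minimal, untested sketch of the video-in idea using Google's Python SDK; the model name, file name, and prompt are placeholders I picked for illustration, not a validated clinical setup:

    # Rough sketch: upload a mock consultation video and ask Gemini to act as a
    # clinical assistant. Everything here is illustrative, not a tested pipeline.
    import time
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # assumes you have an API key

    # The Files API processes video asynchronously, so poll until it's ready.
    video = genai.upload_file("mock_patient_consultation.mp4")
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)

    model = genai.GenerativeModel("gemini-1.5-pro")
    response = model.generate_content([
        video,
        "Summarise the presenting complaint, list likely differentials, "
        "and suggest the clarifying questions a clinician should ask next.",
    ])
    print(response.text)

Lab results or collateral history could just be appended as plain text in the same request.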

All in the setting of people very motivated to show the the tool works well and therefore are biased in research publication (not to mention all the people who run similar experiments and find that it doesn't work but can't get published!).

I mean, quite a few of the authors are doctors, and I presume they'd also have a stake in us being gainfully employed.

Also keep in mind that a good physician is a manager also - you are picking up the slack on everyone else's job, calling family, coordinating communication for a variety of people, and doing things like actually convincing the patient to follow recommendations.

I'd take orders from an LLM, if I was being paid to. This doesn't represent the bulk of a doctor's work, so if you keep a fraction of them around... People are already being conditioned to take what LLMs say seriously. They can be convinced to take them more seriously, especially if vouched for.

I haven't seen any papers on an LLM's attempts to get someone to take their 'beetus medication vs a living, breathing person.

That specific topic? Me neither. But there are plenty of studies of the ability of LLMs to persuade humans, and the very short answer is that they're not bad.

I mean, quite a few of the authors are doctors, and I presume they'd also have a stake in us being gainfully employed.

Nah most of us Get Too Excited About Making A Difference.

Sidebar - I was watching "In Good Company" at lunch today (a podcast in which the manager of Norway's sovereign wealth fund interviews the most successful people in the world) and the CEO of Goldman asked Nicolai about the best features in leaders - empathy was one of them! And this was noted in the context of LLMs taking over other parts of the job for many things!

Empathy and leadership are core to being a physician (at least in the U.S.) and if two of the world's most successful people are going to emphasize the importance of that I'm going to imagine we will be well positioned lol.


If I had a dollar for every time someone used the current state of AI as their primary justification for claiming it won't get noticeably better, I wouldn't need UBI.

He didn't say that. He said that the state today is not very good, not that being unimpressive today means it will be unimpressive in the future.

Besides which, your logic cuts both ways. Rates of change are not constant. Moore's Law was a damn good guarantee of processors getting faster year over year... right until it wasn't, and it very likely never will be again. Maybe AI will keep improving fast enough, for long enough, that it really will become all it's hyped up to be within 5-10 years. But neither of us actually knows whether that's true, and your boundless optimism is every bit as misplaced as if I were to say it definitely won't happen.

But a lot of people are like you, so these models will start to get used everywhere, destroying quality like never before.

I can however imagine a future workflow where these models do basic tasks (answer emails, business operations, programming tickets) overseen by someone that can intervene if it messes up. But this won't end capitalism.

This conveys to me the strong implication that in the near term, models will make minimal improvements.

At the very beginning, he said that benchmarks are Goodharted and given too much weight. That's not a very controversial statement; I'm happy to say it has merit, but I can also say that these improvements are noticeable:

Metrics and statistics were supposed to be a tool that would aid in the interpretation of reality, not supersede it. Just because a salesman with some metrics claims that these models are better than butter does not make it true. Even if they manage to convince every single human alive.

You say:

Besides which, your logic cuts both ways. Rates of change are not constant. Moore's Law was a damn good guarantee of processors getting faster year over year... right until it wasn't, and it very likely never will be again. Maybe AI will keep improving fast enough, for long enough, that it really will become all it's hyped up to be within 5-10 years. But neither of us actually knows whether that's true, and your boundless optimism is every bit as misplaced as if I were to say it definitely won't happen.

I think that blindly extrapolating lines on the graph to infinity is as bad an error as thinking they must stop now. Both are mistakes, reversed stupidity isn't intelligence.

You can see me noting that the previous scaling laws no longer hold as strongly. The diminishing returns mean that scaling models to the size of GPT 4.5, spending compute just on more parameters and more training time on larger datasets, isn't worth the investment.

Yet we've found a new scaling law: test-time compute using reasoning and search, which has started afresh and hasn't shown any sign of leveling out.

Moore's law was an observation of both increasing transistors per dollar and increasing transistor density.

The former metric hasn't budged, and newer nodes might be more expensive per transistor. Yet the density, and hence available compute, continues to improve. Newer computers are faster than older ones, and we occasionally get a sudden bump; for example, Apple and their M1.

Note that the doubling time for Moore's law was revised multiple times. Right now, transistors per unit area seem to double every 3-4 years. It's not fair to say the law is dead, but it's clearly struggling.

Am I certain that AI will continue to improve to superhuman levels? No. I don't think anybody is justified in saying that. I just think it's more likely than not.

  1. Diminishing returns != negative returns.
  2. We've found new scaling regimes.
  3. The models that are out today were trained using data centers that are now outdated. Grok 3 used a mere fraction of the number of GPUs that xAI has, because they were still building out.
  4. Capex and research shows no signs of stopping. We went from a million dollar training run being considered ludicrously expensive to companies spending hundreds of millions. They've demonstrated every inclination to spend billions, and then tens of billions. The economy as a whole can support trillion dollar investments, assuming the incentive was there, and it seems to be. They're busy reopening nuclear plants just to meet power demands.
  5. All the AI skeptics were pointing out that we're running out of data. Alas, it turned out that synthetic data works fine, and models are bootstrapping.
  6. Model capabilities are often discontinuous. A self-driving car that is safe 99% of the time has few customers. GPT 3.5 was too unreliable for many use cases. You can't really predict with much certainty what new tasks a model is capable of based on extrapolating the reducing loss, which we can predict very well. Not that we're entirely helpless, look at the METR link I shared. The value proposition of a PhD level model is far greater than that of one as smart as a high school student.
  7. One of the tasks most focused upon is the ability to code and perform maths. Guess how AI models are made? Frontier labs like Anthropic have publicly said that a large fraction of the code they write is generated by their own models. That's a self-spinning fly-wheel. It's also one of the fields that has actually seen the most improvement, people should see how well GPT-4 compares to the current SOTA, it's not even close.

Standing where I am, seeing the straight line, I see no indication of it flattening out in the immediate future. Hundreds of billions of dollars and thousands of the world's brightest and best paid scientists and engineers are working on keeping it going. We are far from hitting the true constraints of cost, power, compute and data. Some of those constraints once thought critical don't even apply.

Let's go like 2 years without noticeable improvement before people start writing things off.