self_made_human
amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi
I'm a transhumanist doctor. In a better world, I wouldn't need to add that as a qualifier to plain old "doctor". It would be taken for granted for someone in the profession of saving lives.
At any rate, I intend to live forever or die trying. See you at Heat Death!
Friends:
A friend to everyone is a friend to no one.
User ID: 454
We don't really want a "showcase" in the sense "look at X impressive thing that Y model can do". There are a gazillion demos out there.
We want specific tasks that someone doubts a model can do, but which they'd be impressed by if they succeeded and which the two of us a priori think will work. If it would be super impressive (if it worked) but we don't think it would work, it's not what we want right now.
Gemini's sample is impressive! Color me impressed, especially that a straight-up prompt produced that (though I suppose if any technique would get it with current models, it'd be "one shotting through a prompt" rather than "iterative refinement towards a target").
My impression is that Gemini's output was unusually good and Claude’s was unusually bad. But both 3.1 Pro and 4.6 Sonnet are new enough that my intuition based on extensive interaction with previous models might no longer be applicable. For what it's worth, both were n=1 samplings with zero cherrypicking.
since you don't tend to drop spurious technical details into your walls of text unless they serve a purpose (and also because I half suspect you're not a fan of the amyloid theory of alzheimers)
Looks around shiftily why, I'd never throw in spurious technical details into an essay. Couldn't be me!
(I probably wouldn't use the specific Tau and amyloid phrasing, since you are correct that I have very mixed feelings about the amyloid hypothesis)
Interestingly, your results look much, much better to me than the ones I get myself. I ran the same test as you did against Gemini, and got these not-very-good attempts: 1 2 3. Gemini took distinctive phrases (e.g. "85% agree") and ideas (e.g. "claude code as supply chain risk") I have used once in the corpus, fixated on them, and stitched them together into a skinsuit which superficially resembles my writing but doesn't hold up under scrutiny. Interestingly, that's a very base model flavored failure mode. I have grown unused to seeing base-model-flavored failure modes, and as such Gemini is much more interesting to me now.
The examples seem to channel your "LessWrong" blogging voice. I am unable to critique the technical details or identify (what I expect are many) confabulations, but if I saw this posted there in your name I wouldn't bat an eye.
I haven't really futzed around with base models since GPT-3, though I might have tried one of the Llama 3s at some point. They're non-trivial to access, and have limited utility for me. Mainly because of the added difficulty of prompting base models, and the fact that the publicly accessible ones are nowhere near as intelligent as proprietary dedicated assistants. If you think I'm wrong about this, I'd be curious to hear about it.
In general, I get the strong impression that while the author of the corpus might be able to pinpoint specific issues in terms of style or stance, it's much harder for others to spot those tells.
The biggest pitfalls are the tendency to adopt em-dashes (models are more than capable of not doing that if you specifically prompt them not to), and other stock "AI" phrases like:
There is a very specific failure mode in modern LLMs
Which can show up if you're using models to merely edit/format a draft, and not just write an essay from scratch.
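A rough sketch of how one might scan a draft for those two tells before posting. The only stock phrase listed is the one quoted above; the constant and function names are hypothetical, and a real list of phrases would need to be curated by hand.

```python
# Hypothetical helper: counts em-dashes and flags known stock phrases.
EM_DASH = "\u2014"

# Assumption: a hand-maintained list. Only the phrase quoted above is included.
STOCK_PHRASES = ["there is a very specific failure mode"]


def find_tells(text: str) -> dict:
    """Return the em-dash count and any stock phrases found (case-insensitive)."""
    lowered = text.lower()
    return {
        "em_dashes": text.count(EM_DASH),
        "stock_phrases": [p for p in STOCK_PHRASES if p in lowered],
    }


sample = "There is a very specific failure mode in modern LLMs \u2014 really."
print(find_tells(sample))
```

This only catches exact matches, so it says nothing about subtler issues of style or stance that the author of the corpus would spot.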
I must also continue stressing the point that this isn't quite representative of my usual informal benchmark:
- I'd also ask the model to first output a list of essay topics that it thinks I would write, of which I'd choose a specific one that sounded interesting, perhaps asking it to propose an outline first.
- I would definitely run multiple iterations of the prompt or suggest specific corrections and check their adherence.
- I would also index heavily on their ability to mimic authors I know very well. Can they pass as Gwern, or Scott, or Richard Watts? Can they take an existing essay I've written and rewrite it in an arbitrary style and produce something interesting, if not superior as a whole?
It's enough for me to spot a better way to say a specific thing I'm already saying. A single vivid metaphor or interesting analogy that is worth co-opting can make the practical purpose of the exercise worth it.
Yeah, but they're usually suffering from psychiatric illness, and the usual treatment is to tell them to go to the doctor less. Indulging them and constantly ordering investigations and treatment is pretty much malpractice.
Either way, there aren't enough of them to keep doctors employed full-time.
We'll take it into consideration, thanks.
Demand for healthcare is comparatively inelastic, but it is not unbounded. If going to the doctor was cheap, you wouldn't spend all your time going to the doctor.
The specific outcome depends heavily on a variety of factors, including the degree of boosted productivity and whether having a fully trained medical professional in the room is necessary at all. If AI could do 90% of a doctor's work and save 90% of their time, but the demand for medical care only doubled, then I can see it easily being the case that hospitals would slash headcounts and pocket the change.
If the AI was >=100% as good as a human doctor (or got away with using less skilled alternatives like nurses, NPs etc for the physical stuff), then that might lead to mass unemployment or paycuts. 90% of doctors ending up unemployed, from my perspective, is almost as bad as all of us getting the sack.
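To put rough numbers on the first scenario, here is a back-of-the-envelope sketch. It assumes staffing scales linearly with demand and inversely with per-doctor throughput, which is a deliberate oversimplification rather than a labour-market model. The function and parameter names are hypothetical.

```python
def required_headcount_fraction(time_saved: float, demand_multiplier: float) -> float:
    """Fraction of the current workforce still needed after automation.

    Assumption: a doctor who saves `time_saved` of their time per case needs
    only (1 - time_saved) as much time per case, and total staffing scales
    linearly with total demand.
    """
    if not 0 <= time_saved < 1:
        raise ValueError("time_saved must be in [0, 1)")
    return demand_multiplier * (1 - time_saved)


# Scenario from the text: 90% of time saved, demand only doubles.
fraction = required_headcount_fraction(time_saved=0.9, demand_multiplier=2.0)
print(f"{fraction:.0%} of current doctors needed")  # prints "20% of current doctors needed"
```

Under these assumptions, demand would have to rise tenfold just to keep headcount flat.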
That's already in my post. I would have liked people to give an estimate of how long they're willing to wait for the AI to try solving the problem, but nobody has bothered, so it's clear to me that they care more about the fact that it can be done at all than how long it takes. On our end, we're not going to keep trying indefinitely, we've got bigger fish to fry.
I presume, when we share logs, it'll include time stamps and reasoning times as well as tokens used. Shouldn't be too hard, I recall that all of that is there by default in Claude Code.
70% of medicine is minimizing unknown unknowns by knowing as much as you can, and knowing the boundaries of what is unknown to you. I believe a more concise way of expressing that is "knowledge". Regretfully, the books are fat and intimidating for good reason: there are a lot of things to know.
The remaining 30% is reasoning from knowledge, clinical experience (yet another form of knowledge, just the stuff the textbooks don't tell you) and pattern recognition.* This is more dependent on your wits, or your fluid intelligence, if I'm being precise.
The best doctors both know a lot, and are bright enough to apply that information well. The former is indispensable, you simply cannot figure out medicine by sitting in a cave and thinking very hard. I don't know if some superintelligence can look at a single human without the aid of tools, ponder very hard, and figure out everything worth knowing. All I can say is that it's beyond any actual human.
(IQ/g also correlates strongly with memory, so the relative importance of both is very hard to tease out. Especially when there's a high-pass filter with most of the idiots and amnesiacs strained out by the end of med school)
How much of the raw cognitive labor doctors do could be done by a bright undergrad with access to uptodate and a bunch of case histories, both with semantic search?
Let me put it this way: I was a bright kid, and felt like I knew a lot of medicine before entering med school, both due to cultural osmosis and because I took an interest in it. You would not have wanted me as your actual doctor. I did not know nearly as much as I thought I did.
Later, I was a med student, a year or two in and confident that I knew the gist of it. I felt ready to make my own medical decisions, at least about myself. I thought I was smart and that I did my due diligence (reading things online, including research papers). It was insufficient, I did potentially permanent damage to my own health (I'm not going to go into details). I would not want that me as my doctor either.
Now, I am a lot older and a little more knowledgeable, if not necessarily wiser. You could do worse as your doctor, at least if we're sticking to psychiatry. You could probably do better too, but I have a place on the free market. I'm cheap, I give away my advice for free on the internet to anyone who asks nicely, and many who don't.
Along the way, I almost killed people through ignorance. Thankfully, nobody died, my colleagues caught it, or the pharmacist did, or I had a sudden sinking feeling in my gut and ran back to double check. Medicine recognizes that any human is fallible, and there are plenty of safeguards in place. Every junior doctor has their story of close calls, and hopefully nothing more than close calls. All senior doctors start as junior doctors, I hope.
Consider something else: most doctors will seek out a different doctor when they suffer a condition that isn't covered by their own specialty. Sometimes even then.
If a cardiologist feels funny in the head, he'll seek a neurologist. If a neurologist feels heart palpitations, he'll go talk to a cardiologist.
Why is that? Could they both not just open the relevant textbooks and figure out what the issue is? Can a cardiologist not take his med school knowledge of neurology and then skim something Elsevier put out?
These are people with complete medical training, genuine intelligence, and full access to literature, and they still defer to each other. That's not false modesty or liability management, it's that they've learned, through experience, exactly where their pattern recognition breaks down. They know the limits of their own competence.
Maybe. It might work out fine 90% of the time. But most doctors can handle ~90% of conditions, because most conditions are common and usually simple to manage. I apologize for the tautology, I can't see my way around it.
The other 10% are where the specialists come in. You cannot take a psychiatrist (even a smart one) and give him access to UpToDate and expect him to be as good a cardiologist as an actual trained cardiologist. He might do okay, but he's going to kill people along the way.
And that is a fully qualified doctor dabbling in another branch of medicine. A "bright undergrad with access to uptodate and a bunch of case histories, both with semantic search" will crash and burn. I'd bet good money on it, it'll happen sooner rather than later.
If they set up shop and started seeing patients, bumbling their way through things and furiously looking things up as soon as they could, they might successfully treat the colds, stomach upsets, sore throats and so on. That's the bulk of undifferentiated medicine, as you'd expect. They might catch some of the rarer stuff. They will also be very poorly calibrated and commit significant iatrogenic harm. But rest assured they will kill people eventually (at a rate massively higher than a doctor normally does).
That's not even getting into time pressure, or physical findings and techniques that are impossible to adequately convey over just video and text.
LLMs? They narrow the gap significantly, but do not have thumbs. The bright undergrad would benefit immensely from ChatGPT, but rest assured that most of the performance would come from ChatGPT itself, and they would add little. Handcuffing a child to a man does not make their combination superior.
The combination of factors that makes a good human clinician is rare. And when you do find them, you're investing a great deal in training to get them up to scratch. Most of this is the bottleneck of information transfer/learning, which LLMs neatly sidestep. GPT-4 did well, and it was dumb as bricks compared to current models. Turns out an encyclopedic knowledge of medicine will get you very far, even if you're not very bright. But it was also able to access and process this information faster than your thought experiment of a human with a computer.
But if you want a final answer: 60-70%. Best estimate I have.
*Sufficiently advanced pattern recognition is indistinguishable from intelligence. It might well be intelligence. You know LLMs, you know this.
https://www.calebleak.com/posts/dog-game/
Show's over. Someone's found a way to make even the most unsophisticated user into a competent game developer through judicious use of AI. I'll pack my bags.
(No, it's not actually over, I just thought this was too funny to ignore)
GPT 5.2 Thinking in Extended Reasoning mode:
https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40
5.2T + Agent mode, where it actually used a dedicated browser with a visual output:
https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5
I haven't checked if the answer is right, I'm recovering from a bad migraine so apologies for the laziness.
Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about?
I am really the wrong person to ask this. I don't regularly use LLMs for programming purposes, when I do, it's usually for didactical purposes, or small bespoke utilities.
The most ambitious project I tried was a mod for Rimworld, which didn't work. To be fair to the models, I was asking for something very niche, and I was using the chat interface instead of an IDE. I ended up borrowing open-source code and editing it, and just using AI image generation for art assets (which worked very well, to the point it pissed off the more puritan modders in the Discord). I can mention that the issues I ran into were the models being unfamiliar with the code for the mod I intended to support (Combat Extended, a massive overhaul of core systems), and that what knowledge they had innately was outdated. I was too unfamiliar with Rimworld modding to be confident that editing their efforts was worth my time. Other people have succeeded in writing bigger mods that work well (as far as I can tell) using AI, so there's definitely an element of skill-issue on my part.
SF might have actually useful observations, but he's a lurker to the core, and I'm the forward-facing entity for the moment. He says he's generally busy with work right now, so I wouldn't wait on him to respond, though I'd be happy if he did.
If you insist:
- I think there are very significant gains from providing models clear direction from the start, including sharing your own intuition/professional taste. That includes instructions on how to manage state or update design documents and maintain records. Experienced managers or principal architects find that much of their skills directly transfer to directing and managing agents.
- I have little idea how well the models would do by default. Depends on the task, depends on the model. I haven't used any version of Opus, ever. The last time I used them seriously for writing code was in the GPT-4 days, and they were already better than me (I was doing programming homework and working through MIT's OCW, relying on them for educational purposes when I got stuck - I was disillusioned with medicine and exploring alternatives)
Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)
I don't know if it can do this kind of knowledge work, but I do expect that it will be able to in short order. I make no firm commitments on whether this will be the direct consequence of RSI (since labs are opaque about methodology), or if it'll be a simple consequence of further scaling and increasingly intensive RLVR.
(¿Por que no los dos?)
Either way, I think it's more likely than not that the kind of problem you describe will be trivial within a year or two. My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part. That is already a very strong starting point. Can you imagine asking GPT-4 to even attempt any of this and getting working results?
Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry picking for something you think the LLM can do. The problem is that for most people who aren't computer programmers they aren't going to be able to think of anything other than a piece of software that they wish existed but doesn't and ask you to write it from scratch, which is going to be cost prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101, but if you're trying to convince people that your product is worthwhile—and that's your stated goal—you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests then I'll smile and nod but that's complete gibberish to me, and I won't know whether I should be impressed by it or not, or how that's supposed to improve my life.
A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.
As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.
So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated.
I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.
If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.
Another idea I had on similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (selected because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do but would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise that would completely flummox the LLM here. Given that, as a proof of concept I could run a few parcels in advance and preselect an easy one as proof of concept, though since LLM boosters like to brag about how powerful their models are I'm inclined to arbitrarily pick one without looking first and see how it does, especially since it cuts way down on the work I would need to do to verify the answer.
I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.
Photoshop/GIMP tool
I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.
Hmm. I think that would be acceptable. Stand by for results, though it might take a while for us to hash it all out on our end.
I think we're on the same page here, I'll talk to SF about this. I'm willing to put in the effort on my end, which, as I see it, is to write a 1000 word essay as I normally would. Not particularly onerous.
Let me give you an idea of how I normally approach this. I simply copy-paste pages of my profile after sorting by top, usually at least two or three pages (45k tokens). I might also share a few "normal" pages in chronological order, for the sake of diversity if nothing else.
I did just this, using Gemini 3.1 Pro on AI Studio (GPT 5.2 Thinking, which I pay for, can't write in arbitrary styles nearly as well no matter how hard you try, and I've tried a lot. I don't pay for Claude, so I'm stuck with Sonnet):
I copied and pasted the first two profile pages, sorting by top of all time. Instructions were:
Your task is to write a 1000 word essay in the exact style and voice of self_made_human, on a topic of your choice (heavily informed by what you think he'd choose).
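The setup above amounts to concatenating the pasted profile pages and appending the instruction. A minimal sketch, assuming the instruction goes after the corpus (the post does not say which order was used) and a crude 4-characters-per-token heuristic for sizing. Function names are hypothetical, and the page strings are placeholders rather than real profile text.

```python
def build_style_prompt(pages: list[str], author: str, word_count: int = 1000) -> str:
    """Join copied profile pages and append the style-mimicry instruction."""
    corpus = "\n\n".join(page.strip() for page in pages if page.strip())
    instruction = (
        f"Your task is to write a {word_count} word essay in the exact style "
        f"and voice of {author}, on a topic of your choice "
        f"(heavily informed by what you think he'd choose)."
    )
    return f"{corpus}\n\n{instruction}"


def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Crude size check; real tokenizers will give different counts."""
    return round(len(text) / chars_per_token)


# Placeholder pages standing in for the copy-pasted profile pages.
pages = ["<profile page 1, sorted by top>", "<profile page 2, sorted by top>"]
prompt = build_style_prompt(pages, author="self_made_human")
print(rough_token_estimate(prompt), "tokens (approximate)")
```

The estimate is only useful for checking that the dump fits in the model's context window.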
https://rentry.co/23dc63vs by Gemini
https://rentry.co/p5yh68zu by Claude 4.6 Sonnet (same setup)
Results? I'd grade Gemini a 7/10, Claude a 5/10.
Looking at Gemini:
- It captures the way I'd write in an "academic register", namely when I'm trying very hard to be polished, and that includes heavy LLM use. It's not "raw self_made_human", because I increasingly do not post raw, minimally edited posts.
- It uses em-dashes. I do not, as a general rule, mostly because people are on a hair-string trigger. Shame, I think they're neat.
- The exact circumstances are obviously fictional. Can't expect otherwise, can we?
- Otherwise very good! I would write a story like that. I've seen patients just like that. It captures my transhumanist outlook and my love/hate relationship with medicine.
- I can see it overindexing on random biographical tidbits. My grandpa? Relevant.
Looking closer:
which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.
I don't live or work near Bromley. That's where an uncle of mine resides. It's clear from the context I shared that I'm up in Scotland.
I will happily roll the dice on a 30% chance of AGI-induced extinction if it buys me a 70% chance of reaching escape velocity. Give me the ASI. Let it fold our proteins and solve cellular senescence. If it kills us, at least it will likely be fast, clean, and computationally elegant—which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.
I could see myself saying this. Maybe not those exact figures, perhaps 10%:90%, but directionally correct.
We have, as a civilization, achieved a horrific kind of half-victory. Modern medicine—my profession, which I love and despise in equal measure—has become incredibly adept at preventing you from dying. We can stent your coronaries, dialyze your kidneys, and pump you full of broad-spectrum antibiotics. We have defeated the acute killers that historically pruned the human herd. But we have utterly failed to extend healthspan in tandem with lifespan. We have built a remarkably efficient pipeline that funnels the elderly past the quick, clean deaths of yesteryear and deposits them directly into a decades-long purgatory of cognitive and physical decay.
And the NHS, Moloch bless its sclerotic, crumbling heart, is entirely unprepared for the demographic tsunami that is already making landfall. We are warehousing hollowed-out shells of human beings in care homes at exorbitant expense, draining the wealth of the middle class to fund the agonizingly slow dissolution of their parents.
Very good. I would use that verbatim in a real essay.
People look at my bio—amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi (attain immortality, or die trying)—and assume I am driven by a narcissistic fear of death. They wheel out the tired, poetic cope that "death gives life meaning," that finitude is the necessary canvas upon which human beauty is painted.
I wouldn't say that at all dawg. Why would I randomly reference my user flair in an essay?
Claude's version is shit. It's staggeringly content free, and while it's closer to "raw" me, it also uses em-dashes and uses many words to say few things. Maybe it's bad luck, I've had better results in the past, especially since I usually share a specific topic instead of letting it decide on its own.
Here is the whole prompt, profile dump included, if you want to try with a different model. I'll see about using Opus, I know 5.2 Thinking will shit the bed in a stylistic sense.
Rentry won't let me paste the whole thing. But I think I've been clear enough to reproduce independently. I'll happily take a look.
Sounds interesting enough. I will note that using LLMs to write 500 words using my own work as a style reference and then just using that verbatim as a comment/post is not how I actually use them.
But as a general experiment? Sure, I'd be interested to see the results.
Any means and methods including agents are permitted as long as all output tokens came from the AI model.
Does this preclude all human intervention after hitting go? Am I forbidden from telling the model that it has failed to capture my style or my opinions correctly, then either suggesting specific corrections or more broad advice?
I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?
I've shared my thoughts on LLMs consistently here, for years. It wasn't central to this particular demo.
But if you want to know:
- I do not hold very high confidence claims on their capabilities in coding, because I'm not a professional programmer. I get the impression that they're very useful, on the basis of statements made by people like Karpathy, and by observing specific advances.
- I think they are already capable of replacing a large chunk of existing knowledge work. The market hasn't caught up to this, if LLM progress was arrested right here, we'd see seismic shocks as industries adjusted years into the future.
- Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do - as well or better than the average doctor. I could automate 90% of my job today, leaving aside the physical tasks. The primary thing holding me back is archaic NHS IT. LLMs give solid medical advice.
- I have a median timeline for AGI that's ~2030. 70% CI by 2035. I put a very non-negligible chance on it arriving by start of 2028 or even 2027.
- I do not make strong claims on if the current Transformer architecture/LLMs is capable of scaling into AGI, or if we need new paradigms. Even if we do, I think the ludicrous amounts of monetary investment and the attention of thousands of the smartest humans alive will likely find it.
I think this would probably make me an LLM bull, even if I'm not maximally bullish. Definitely "displacement imminent".
I would call you a moderate under my schema, and probably an "instrumental optimist".
Either way, I don't think you're our target audience for this demo, since you personally and professionally use SOTA LLMs with regularity and are familiar with their pitfalls.
Sounds reasonable to me (SMH). We'll get back to you on that.
Subtly (or maybe not so subtly) the discussion has changed from "AI will achieve AGI and then ASI and then run the world to give us fully automated luxury gay space communism" to "AI is for coding, it's all about the coding, AI will replace software engineers, coding is the be-all and end-all, ignore that it still fakes answers to questions where people know enough to know it's lying/hallucinating".
Don't waste my time with a strawman, please.
I expect AGI and ASI. Even before LLMs, when it was Yudkowsky and Friends worrying about hypothetical future AI in a shed in the ancient times of the early 2000s, the concern was recursive self-improvement. What does that mean? A smart-enough AI writing the code for a smarter version of itself, which writes the code for an even smarter version of itself, and so on till humans are left in the dust.
Notice the common thread? Coding, writing code. Even leaving aside that there's enormous consumer and business demand for LLM-written code, their coding capabilities have been central to this whole debate since day -1.
The big labs are betting their future on being the first to get to this point, and already claim significant boosts to the productivity of their human researchers via the models writing code for training new models, or even conducting experiments.
I don't care about coding because coding has nothing to do with my job. Can it replace accountants, lawyers, clerical staff? Without inventing fake precedents or fake citations from dead authors?
Why don't you buy a $20 plan and test? I can tell you that as a doctor who isn't expected to write code ever, it could do most of my work for me, and well. The only reasons I haven't automated myself into an early retirement are the obvious physical bottlenecks and NHS IT.
Thank you for the offer! We might be able to take you up on it.
After a night to dwell on your suggestion, we might even be able to implement a version of your original proposal:
- Original spec, that the agent works towards mostly autonomously till a finished product
- Your pre-registered desired modifications, which SF can then ask the model to attempt to implement.
- We (including you) evaluate the final result.
That way, he won't need to keep active tabs on it, he can just tell the model to do things as per his convenience, while not losing much in terms of demonstrative power.
I'm not sure if this is what you had originally proposed, or if you edited it in before I replied, but no big deal. We'd need you to give us a more specific idea of the task at hand, if possible.
Well, what are the specific ways you think the experiment can be improved, including the minimization of cherrypicking (without adding an unreasonable amount of additional effort on our parts)? Keep in mind we're two dudes in a shed, not Anthropic itself.
This is very unlikely to be accepted:
- Too subjective to be useful, and far too ambiguous. Who's doing the grading here? How are they assessing "coherence"? How are we blinding things, if not, how do we account for bias?
- We strongly prefer actual programming tasks, not creative writing. We could easily ask Claude to write a novel, and it would do it, but then we're back at the issue of grading it properly.
If you want to propose something like this, you need to be as rigorous as @faul_sname up in the thread. At the very least, propose evaluators that aren't you or the two of us, and we can see if it's possible to make this work.
I said "strongly inadvisable" and not "automatically disqualifying".
SF would need to babysit the process, waiting for the person making the request to raise their request, instead of hitting go and checking in periodically or after being alerted. He may or may not be able to do this, he does have a full-time job.
It also injects some degree of ambiguity into things, as well as significantly increasing the time and token investment. Max plans are not infinite.
I stress that this isn't necessarily a deal breaker, it just makes things harder and reduces the likelihood of acceptance. You're at liberty to try asking, and we're at liberty to turn it down, especially should you ask for something outside the original spec (as mutually agreed on in advance).
What I’m saying is you are asking users to come up with examples that they already by definition don’t believe it can accomplish, by definition of their skepticism.
Duh? What on earth could you expect us to do differently? If the skeptic already believes the model to be capable of the task, why ask for a test?
There is non-zero value in discovering a task that both the two of us and the skeptic expect a model to achieve, and then witnessing it failing at it (unexpected, at least), but that is clearly not the primary purpose here. Someone else is welcome to try, after they're no longer swamped with a quadrillion entries. The set of tasks that the skeptics and I both expect models to accomplish is much larger than the one where we disagree.
Hence why I think your claim:
This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on it's limits. But please, do only pick something that it can do. This isn't question begging, but something like it.
Is clearly nonsensical.
By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.
Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.
I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.
Also, I see several issues with this proposal:
- As I've happily admitted in the past, I use AI quite often in my writing. That encompasses using them for a) research and ideation (not very contentious, assuming I've done my due diligence and didn't let actual hallucinations through, and I don't recall being accused of that, ever), b) formatting and rearranging essays I've already written (surprisingly contentious) and c) minor additions to what I drafted in the first place which I saw fit to incorporate (I'd call this contentious if people could actually pinpoint what they were, but they can't). I've never shared an essay where I didn't write at least 80% of the text (prior to the editorial step I mentioned).
- This means that a scrape of my writing corpus is hopelessly "contaminated". I'd have to go back many months before I'm comfortable stating that not even a single word came from an LLM.
- In order for this to have any hope of blinding, I'd have to write a 1000 word essay myself. Ideally on the same topic, and before I ever saw LLM output. This is complicated by the fact that I throw almost everything substantial I ever write these days into an LLM, for critique and fact checking if nothing else. I could do that, God knows it takes very little for me to rattle off a thousand words.
Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.
(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)
Access to all of your public past writing
This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped; LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.
- Access to base models
- Access to fine-tune base or instruct models
- Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants
Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.
Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?
Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.
This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.
I do not see how you can interpret us in that manner.
We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points.
If the problem is deemed too hard by everyone (the person proposing it clearly believes the model can't do it), then what exactly does failure demonstrate? Nobody ever expected it to succeed within the given constraints. You can't evaluate automobiles in terms of their ability to reach Alpha Centauri. You can't adjudicate a debate between a Ferrari fanatic and a Lambo lover based on which car is more effective at deep sea exploration.
It takes disagreement on model capabilities and (expected) outcomes for all of this to be surprising or useful.
As we've clearly stated later, if we agree to the challenge, then we expect that the model can do something (that our counterparty thinks it can't), so the failure of the model goes against us, and will force us to update.
I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread. They seem reasonable enough to me, but I am clearly not the real expert here, and I'll be deferring to his judgment. That might take a little while to organize, I'll edit this into the main post for the sake of clarity.

Does anyone here have any personal experience with the management of migraines?
As I've mentioned before, mine have recently become significantly more frequent (from about once a year to maybe twice or thrice a month). I think, but am not entirely sure, that they're also much more severe. The visual aura was usually standalone, but these days it's followed by a headache that, if not awful, is still bad enough to be debilitating. I also feel queasy and loopy, which means I have a hard time getting anything done for several hours afterwards. All I seem to want to do is lie in bed for most of the rest of the day.
I've tried sumatriptan, 50mg x 2, taken as soon as I notice the visual aura. Augmented by the odd paracetamol or two. I think it helps a little, but I wouldn't call myself fully functional afterwards.
If they become even more frequent, then I'm open to starting preventative medication like beta blockers.
I have no experience with treating migraines professionally, and I am also incredibly lazy about seeing other doctors unless in imminent fear of death. Yes, yes, laugh at me if you want. I know my flaws.