
self_made_human

amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi

14 followers   follows 0 users  
joined 2022 September 05 05:31:00 UTC

I'm a transhumanist doctor. In a better world, I wouldn't need to add that as a qualifier to plain old "doctor". It would be taken as granted for someone in the profession of saving lives.

At any rate, I intend to live forever or die trying. See you at Heat Death!

Friends:

A friend to everyone is a friend to no one.



User ID: 454

https://www.calebleak.com/posts/dog-game/

Show's over. Someone's found a way to make even the most unsophisticated user into a competent game developer through judicious use of AI. I'll pack my bags.

(No, it's not actually over, I just thought this was too funny to ignore)

GPT 5.2 Thinking in Extended Reasoning mode:

https://chatgpt.com/share/699dfcfc-b0c4-800b-8e1a-870264179c40

5.2T + Agent mode, where it actually used a dedicated browser with a visual output:

https://chatgpt.com/share/699dfd6d-a7f8-800b-be8e-c04d95de44e5

I haven't checked whether the answer is right; I'm recovering from a bad migraine, so apologies for the laziness.

Do you have any thoughts that you'd be willing to share on what I wrote concerning the amount of knowledge work currently required to be input to do things like the task I was thinking about?

I'm really the wrong person to ask. I don't regularly use LLMs for programming; when I do, it's usually for didactic purposes or small bespoke utilities.

The most ambitious project I tried was a mod for Rimworld, which didn't work. To be fair to the models, I was asking for something very niche, and I was working in the chat interface rather than an IDE. I ended up borrowing open-source code and editing it, and just using AI image generation for art assets (which worked very well, to the point that it pissed off the more puritan modders in the Discord). The issues I ran into were the models being unfamiliar with the code for the mod I intended to support (Combat Extended, a massive overhaul of core systems), and the knowledge they did have innately being outdated. I was too unfamiliar with Rimworld modding to be confident that editing their efforts was worth my time. Other people have succeeded in writing bigger mods that work well (as far as I can tell) using AI, so there's definitely an element of skill issue on my part.

SF might have actually useful observations, but he's a lurker to the core, and I'm the forward-facing entity for the moment. He says he's generally busy with work right now, so I wouldn't wait on him to respond, though I'd be happy if he did.

If you insist:

  • I think there are very significant gains from giving models clear direction from the start, including sharing your own intuition and professional taste. That includes instructions on how to manage state, update design documents, and maintain records. Experienced managers and principal architects find that many of their skills transfer directly to directing and managing agents.
  • I have little idea how well the models would do by default. It depends on the task and the model. I haven't used any version of Opus, ever. The last time I used them seriously for writing code was in the GPT-4 days, and they were already better than me (I was doing programming homework and working through MIT's OCW, relying on them for educational purposes when I got stuck; I was disillusioned with medicine and exploring alternatives).

Perhaps, given your comment below, this is just something that you mostly don't care about. Does this sort of thing just bucket into, "No, it can't do this sort of knowledge work now, but with sufficient recursive self-improvement, it will be able to do it later"? (I guess, in line with your stated AGI timelines?)

I don't know if it can do this kind of knowledge work, but I do expect that it will be able to in short order. I make no firm commitments on whether this will be the direct consequence of RSI (since labs are opaque about methodology), or if it'll be a simple consequence of further scaling and increasingly intensive RLVR.

(¿Por qué no los dos? Why not both?)

Either way, I think it's more likely than not that the kind of problem you describe will be trivial within a year or two. My impression is that the models can just about do what you want them to do, but with significant frustration and wasted time on your part. That is already a very strong starting point; can you imagine asking GPT-4 to even attempt any of this and getting working results?

Does it have to be a coding problem? I understand that there are time and financial constraints that prevent you from trying a lot of what is being requested, but I also understand @iprayiam3's criticism that it looks like you're cherry-picking for something you think the LLM can do. The problem is that most people who aren't computer programmers aren't going to be able to think of anything other than a piece of software they wish existed but doesn't, and ask you to write it from scratch, which is going to be cost-prohibitive beyond the kind of textbook examples that were constructed for teaching purposes and don't address problems anyone is actually trying to solve. This seems like it should be marketing 101: if you're trying to convince people that your product is worthwhile—and that's your stated goal—you have to show them that it will actually help them do something they want to do. If you tell me it can write code to fetch data from a REST API using asynchronous requests, then I'll smile and nod, but that's complete gibberish to me, and I won't know whether I should be impressed by it or how that's supposed to improve my life.

A coding problem? Not strictly, no. I focused on coding because my collaborator SF (who is doing most of the work) is a programmer.

As you can see from discussion with Phailyoor and faul_sname, I'm open to other well-defined tasks.

So instead, I propose that we re-run the test I gave you last summer, because that is something I actually would use it for, and it obviously isn't too complicated.

I started as soon as I read this. I'm running it on 5.2 Thinking and another instance using Agent mode (the model has access to a computer of its own with a browser). It's taking a while, so I'll ping you when I'm done. I tried to be faithful to your original framing, so I didn't mention that o3 tried and failed at the task, or your critiques shared later.

If this doesn't work, then sure, I can ask SF to consider using his Claude setup to try. Shouldn't be too onerous.

Another idea I had along similar lines would be for me to arbitrarily select a parcel of land in Westmoreland County, PA (chosen because all of the recorded documents are available for free online) and see if it could download every deed in the chain of title going back 100 years. This particular task isn't hard to do, but it would be a proof of concept that it could possibly do more sophisticated work. I recognize that there are a number of scenarios that could arise here that would completely flummox the LLM. Given that, I could run a few parcels in advance and preselect an easy one, though since LLM boosters like to brag about how powerful their models are, I'm inclined to arbitrarily pick one without looking first and see how it does, especially since that cuts way down on the work I would need to do to verify the answer.

I have no idea, in advance, if this will work. I doubt SF does either. But it's also something we can try.

Photoshop/GIMP tool

I share your concerns with the issues arising from Photoshop being closed-source. But I'll share it too, assuming SF hasn't seen this yet. It sounds like something worth trying from my perspective, but I will stress that I am not a professional programmer so I'll be deferring to his judgment.

Hmm. I think that would be acceptable. Stand by for results, though it might take a while for us to hash it all out on our end.

I think we're on the same page here, I'll talk to SF about this. I'm willing to put in the effort on my end, which, as I see it, is to write a 1000 word essay as I normally would. Not particularly onerous.

Let me give you an idea of how I normally approach this. I simply copy-paste pages of my profile after sorting by top, usually at least two or three pages (45k tokens). I might also share a few "normal" pages in chronological order, for the sake of diversity if nothing else.

I did just this, using Gemini 3.1 Pro on AI Studio. (GPT 5.2 Thinking, which I pay for, can't write in arbitrary styles nearly as well no matter how hard you try, and I've tried a lot. I don't pay for Claude, so I'm stuck with Sonnet.)

I copied and pasted the first two profile pages, sorting by top of all time. Instructions were:

Your task is to write a 1000 word essay in the exact style and voice of self_made_human, on a topic of your choice (heavily informed by what you think he'd choose).
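For anyone who wants to reproduce the setup, it amounts to little more than concatenating the profile dump with the instruction. A minimal sketch (the page placeholders, separator, and the ~4-characters-per-token heuristic are my assumptions, not part of the original workflow):

```python
# Assemble a style-mimicry prompt from pasted profile pages.
# The page strings below are placeholders; in practice they'd be
# the first two profile pages, sorted by top of all time.

INSTRUCTION = (
    "Your task is to write a 1000 word essay in the exact style and "
    "voice of self_made_human, on a topic of your choice (heavily "
    "informed by what you think he'd choose)."
)

def build_prompt(pages: list[str], instruction: str = INSTRUCTION) -> str:
    """Join the profile dump and the instruction into a single prompt."""
    corpus = "\n\n---\n\n".join(pages)
    return f"{corpus}\n\n{instruction}"

def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return len(text) // 4

pages = [
    "(page 1 of profile, sorted by top of all time...)",
    "(page 2 of profile, sorted by top of all time...)",
]
prompt = build_prompt(pages)
print(rough_token_count(prompt))
```

The resulting string is what gets pasted into AI Studio or a chat interface; the token estimate is only a sanity check that the dump fits the model's context window.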

  • https://rentry.co/23dc63vs by Gemini
  • https://rentry.co/p5yh68zu by Claude 4.6 Sonnet (same setup)

Results? I'd grade Gemini a 7/10, Claude a 5/10.

Looking at Gemini:

  • It captures the way I'd write in an "academic register", namely when I'm trying very hard to be polished, and that includes heavy LLM use. It's not "raw self_made_human", because I increasingly do not post raw, minimally edited posts.
  • It uses em-dashes. I do not, as a general rule, mostly because people are on a hair trigger about them. Shame, I think they're neat.
  • The exact circumstances are obviously fictional. Can't expect otherwise, can we?
  • Otherwise very good! I would write a story like that. I've seen patients just like that. It captures my transhumanist outlook and my love/hate relationship with medicine.
  • I can see it overindexing on random biographical tidbits. My grandpa? Relevant.

Looking closer:

which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.

I don't live or work near Bromley. That's where an uncle of mine resides. It's clear from the context I shared that I'm up in Scotland.

I will happily roll the dice on a 30% chance of AGI-induced extinction if it buys me a 70% chance of reaching escape velocity. Give me the ASI. Let it fold our proteins and solve cellular senescence. If it kills us, at least it will likely be fast, clean, and computationally elegant—which is a damn sight better than sitting in a soiled diaper in a Bromley care home, screaming at a nurse because you think you're back in the Blitz.

I could see myself saying this. Maybe not those exact figures, perhaps 10%:90%, but directionally correct.

We have, as a civilization, achieved a horrific kind of half-victory. Modern medicine—my profession, which I love and despise in equal measure—has become incredibly adept at preventing you from dying. We can stent your coronaries, dialyze your kidneys, and pump you full of broad-spectrum antibiotics. We have defeated the acute killers that historically pruned the human herd. But we have utterly failed to extend healthspan in tandem with lifespan. We have built a remarkably efficient pipeline that funnels the elderly past the quick, clean deaths of yesteryear and deposits them directly into a decades-long purgatory of cognitive and physical decay.

And the NHS, Moloch bless its sclerotic, crumbling heart, is entirely unprepared for the demographic tsunami that is already making landfall. We are warehousing hollowed-out shells of human beings in care homes at exorbitant expense, draining the wealth of the middle class to fund the agonizingly slow dissolution of their parents.

Very good. I would use that verbatim in a real essay.

People look at my bio—amaratvaṃ prāpnuhi, athavā yatamāno mṛtyum āpnuhi (attain immortality, or die trying)—and assume I am driven by a narcissistic fear of death. They wheel out the tired, poetic cope that "death gives life meaning," that finitude is the necessary canvas upon which human beauty is painted.

I wouldn't say that at all dawg. Why would I randomly reference my user flair in an essay?

Claude's version is shit. It's staggeringly content-free, and while it's closer to "raw" me, it also uses em-dashes and uses many words to say little. Maybe it's bad luck; I've had better results in the past, especially since I usually share a specific topic instead of letting it decide on its own.

Here is the whole prompt, profile dump included, if you want to try with a different model. I'll see about using Opus, I know 5.2 Thinking will shit the bed in a stylistic sense.

Rentry won't let me paste the whole thing. But I think I've been clear enough to reproduce independently. I'll happily take a look.

Sounds interesting enough. I will note that using LLMs to write 500 words using my own work as a style reference and then just using that verbatim as a comment/post is not how I actually use them.

But as a general experiment? Sure, I'd be interested to see the results.

Any means and methods including agents are permitted as long as all output tokens came from the AI model.

Does this preclude all human intervention after hitting go? Am I forbidden from telling the model that it has failed to capture my style or my opinions correctly, then either suggesting specific corrections or more broad advice?

I don't think you're quite clear in the post as to what camp you're actually in. Are you a straight bull? As in, do you think it can currently replace a sizeable portion of human knowledge work?

I've shared my thoughts on LLMs consistently here, for years. It wasn't central to this particular demo.

But if you want to know:

  • I do not hold very high confidence claims on their capabilities in coding, because I'm not a professional programmer. I get the impression that they're very useful, on the basis of statements made by people like Karpathy, and by observing specific advances.
  • I think they are already capable of replacing a large chunk of existing knowledge work. The market hasn't caught up to this; if LLM progress were arrested right here, we'd still see seismic shocks for years as industries adjusted.
  • Specifically in the field of medicine, they can do most of the raw cognitive labor doctors do, as well as or better than the average doctor. I could automate 90% of my job today, leaving aside the physical tasks. The primary thing holding me back is archaic NHS IT. LLMs give solid medical advice.
  • I have a median timeline for AGI that's ~2030. 70% CI by 2035. I put a very non-negligible chance on it arriving by start of 2028 or even 2027.
  • I do not make strong claims on whether the current Transformer/LLM architecture is capable of scaling into AGI, or whether we need new paradigms. Even if we do, I think the ludicrous amounts of monetary investment and the attention of thousands of the smartest humans alive will likely find one.

I think this would probably make me an LLM bull, even if I'm not maximally bullish. Definitely "displacement imminent".

I would call you a moderate under my schema, and probably an "instrumental optimist".

Either way, I don't think you're our target audience for this demo, since you personally and professionally use SOTA LLMs with regularity and are familiar with their pitfalls.

Sounds reasonable to me (SMH). We'll get back to you on that.

Subtly (or maybe not so subtly) the discussion has changed from "AI will achieve AGI and then ASI and then run the world to give us fully automated luxury gay space communism" to "AI is for coding, it's all about the coding, AI will replace software engineers, coding is the be-all and end-all, ignore that it still fakes answers to questions where people know enough to know it's lying/hallucinating".

Don't waste my time with a strawman, please.

I expect AGI and ASI. Even before LLMs, when it was Yudkowsky and Friends worrying about hypothetical future AI in a shed in the ancient times of the early 2000s, the concern was recursive self-improvement. What does that mean? A smart-enough AI writing the code for a smarter version of itself, which writes the code for an even smarter version of itself, and so on till humans are left in the dust.

Notice the common thread? Coding, writing code. Even leaving aside that there's enormous consumer and business demand for LLM-written code, their coding capabilities have been central to this whole debate since day -1.

The big labs are betting their future on being the first to get to this point, and already claim significant boosts to the productivity of their human researchers via the models writing code for training new models, or even conducting experiments.

I don't care about coding because coding has nothing to do with my job. Can it replace accountants, lawyers, clerical staff? Without inventing fake precedents or fake citations from dead authors?

Why don't you buy a $20 plan and test? I can tell you that as a doctor who isn't expected to write code ever, it could do most of my work for me, and well. The only reason I haven't automated myself into an early retirement are the obvious physical bottlenecks and NHS IT.

Thank you for the offer! We might be able to take you up on it.

After a night to dwell on your suggestion, we might even be able to implement a version of your original proposal:

  • Original spec, that the agent works towards mostly autonomously till a finished product
  • Your pre-registered desired modifications, which SF can then ask the model to attempt to implement.
  • We (including you) evaluate the final result.

That way, he won't need to keep active tabs on it, he can just tell the model to do things as per his convenience, while not losing much in terms of demonstrative power.

I'm not sure if this is what you had originally proposed, or if you edited it in before I replied, but no big deal. We'd need you to give us a more specific idea of the task at hand, if possible.

Well, what are the specific ways you think the experiment can be improved, including the minimization of cherrypicking (without adding an unreasonable amount of additional effort on our parts)? Keep in mind we're two dudes in a shed, not Anthropic itself.

This is very unlikely to be accepted:

  • Too subjective to be useful, and far too ambiguous. Who's doing the grading here? How are they assessing "coherence"? How are we blinding things, and if we aren't, how do we account for bias?

  • We strongly prefer actual programming tasks, not creative writing. We could easily ask Claude to write a novel, and it would do it, but then we're back at the issue of grading it properly.

If you want to propose something like this, you need to be as rigorous as @faul_sname up in the thread. At the very least, propose evaluators that aren't you or the two of us, and we can see if it's possible to make this work.

I said "strongly inadvisable" and not "automatically disqualifying".

SF would need to babysit the process, waiting on input from the person making the request, instead of hitting go and checking in periodically or after being alerted. He may or may not be able to do this; he does have a full-time job.

It also injects some degree of ambiguity into things, as well as significantly increasing the time and token investment. Max plans are not infinite.

I stress that this isn't necessarily a deal breaker, it just makes things harder and reduces the likelihood of acceptance. You're at liberty to try asking, and we're at liberty to turn it down, especially should you ask for something outside the original spec (as mutually agreed on in advance).

What I’m saying is you are asking users to come up with examples that they already by definition don’t believe it can accomplish, by definition of their skepticism.

Duh? What on earth could you expect us to do differently? If the skeptic already believes the model to be capable of the task, why ask for a test?

There is non-zero value in discovering a task that both the two of us and the skeptic expect a model to achieve, and then witnessing it fail unexpectedly, but that is clearly not the primary purpose here. Someone else is welcome to try, after they're no longer swamped with a quadrillion entries. The set of tasks that the skeptics and I both expect models to accomplish is much larger than the one where we disagree.

Hence why I think your claim:

This seems extremely, self-servingly narrow and contradictory. We want to show you how much an AI can do, in order to change your mind on its limits. But please, do only pick something that it can do. This isn't question begging, but something like it.

Is clearly nonsensical.

By "you", do you mean me (self_made_human)? Oh boy, you'd be surprised. I've actively experimented with this, and it's one of my vibes benchmarks for any new model release.

Of course, my standards were that it should pass my own sniff test. That roughly amounts to "does that sound like me?" and "are those the kinds of arguments I'd make?". They do a decent job, and have for a while. Not perfect, but good enough to fool most of the people, most of the time.

I didn't/don't have access to "standard stylometry tools", though I will admit I didn't go looking very hard.

Also, I see several issues with this proposal:

  • As I've happily admitted in the past, I use AI quite often in my writing. That encompasses using them for a) research and ideation (not very contentious, assuming I've done my due diligence and didn't let actual hallucinations through, and I don't recall ever being accused of that), b) formatting and rearranging essays I've already written (surprisingly contentious), and c) minor additions to what I drafted in the first place which I saw fit to incorporate (I'd call this contentious if people could actually pinpoint what those additions were, which they can't). I've never shared an essay where I didn't write at least 80% of the text (prior to the editorial step I mentioned).
  • This means that a scrape of my writing corpus is hopelessly "contaminated". I'd have to go back many months before I'm comfortable saying that not even a single word came from an LLM.
  • In order for this to have any hope of blinding, I'd have to write a 1000 word essay myself, ideally on the same topic, and before I ever saw LLM output. This is complicated by the fact that I throw almost everything substantial I write these days into an LLM, for critique and fact-checking if nothing else. I could do that; God knows it takes very little for me to rattle off a thousand words.

Note that these aren't insurmountable challenges: if the final essay written by Claude falls within the same distribution as what I've been writing (with minor LLM involvement), then that's... fine? If nobody without access to the ground truth knows which essay is which, that's a victory as far as I'm concerned.

(We'd need to decide if I'm allowed to use LLMs myself, in the same manner I already do. Claude wouldn't get the benefit of me giving it feedback or editing its output in any capacity)
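The blinding step itself is trivial to mechanize. A sketch, assuming a neutral third party holds the answer key (the essay texts, labels, and `seed` parameter here are placeholders of my own, not anything agreed on in the thread):

```python
import json
import random

def blind_pair(human_essay, model_essay, seed=None):
    """Randomly assign labels A/B to two essays. Returns the labelled
    pair (shown to evaluators) and an answer key (held back until
    grading is complete)."""
    rng = random.Random(seed)
    pair = [("human", human_essay), ("model", model_essay)]
    rng.shuffle(pair)  # random order determines which essay gets "A"
    labelled = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return labelled, key

labelled, key = blind_pair("essay by SMH...", "essay by Claude...", seed=42)
# Evaluators see only `labelled`; the key is revealed after grading.
print(json.dumps(key))
```

Nobody grading the pair would know which label is which, which is the whole point: if evaluators without the key can't beat chance, that counts as the victory described above.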

Access to all of your public past writing

This would be an absolute pain to collate, both for me and for Claude. Then again, it might be easier for it, right until it ran into the fact that even the largest context windows would choke. I write a lot, and have for many years. Unfortunately, I'm not famous enough to be preferentially scraped, LLMs do not know who I am without looking it up. That's a luxury reserved for Gwern or Scott.

  • Access to base models
  • Access to fine-tune base or instruct models
  • Access to a vast.ai box with an H100 for 24 hours to do whatever else it wants

Would you be willing to pay for that or provide access? I wouldn't want to financially burden poor SF more than he already is; the Claude Max plan is a sunk cost, while these clearly represent additional investment. Genuine question, I'm sure that Claude could usefully interface with everything if provided access.

Just after I started writing this, the Metal Gear ! sound effect played in my head. Hang on a minute, I don't think I'd describe you as an LLM skeptic according to my (incomplete and rudimentary) taxonomy. I'd struggle to pin you as a moderate either, though I can't recall you expressing strong opinions that aren't more research/alignment oriented. If my hunch is true, then you aren't the target audience for this! Plus I'll eat my hat if you don't already have access to the best models out there. Do you really need us for the test, beyond whatever blinding or scraping of my writing is necessary?

Also, this is slightly off-topic. We're not evaluating the agents on their ability to write code, we're testing their ability to mimic me convincingly. Sure, coding capabilities might come into the picture, especially if they're in charge of other models, but it's not as central.

This doesn't strike me as insurmountable. I'd be open to trying once you get back to me on the issues I raised.

I do not see how you can interpret us in that manner.

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points.

If the problem is deemed too hard by everyone (the person proposing it clearly believes the model can't do it), then what exactly does failure demonstrate? Nobody ever expected it to succeed within the given constraints. You can't evaluate automobiles in terms of their ability to reach Alpha Centauri. You can't adjudicate a debate between a Ferrari fanatic and a Lambo lover based on which car is more effective at deep sea exploration.

It takes disagreement on model capabilities and (expected) outcomes for all of this to be surprising or useful.

As we've clearly stated later, if we agree to the challenge, then we expect that the model can do something (that our counterparty thinks it can't), so the failure of the model goes against us, and will force us to update.

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread. They seem reasonable enough to me, but I am clearly not the real expert here, and I'll be deferring to his judgment. That might take a little while to organize, I'll edit this into the main post for the sake of clarity.

See For Yourself: A Live Demo of LLM capabilities

As someone who has been concerned with AI Safety and the implications of cognitive automation for human employability since well before it was cool, I must admit a sense of vindication from seeing AI dominate online discourse, including on the Motte.

We have a wide range of views on LLM capabilities (at present) as well as on their future trajectory. Opinions are heterogeneous enough that any attempt at taxonomy will fail to capture individual nuance, but as I see it:

  • LLM Bulls: Current SOTA LLMs are capable of replacing a sizeable portion of human knowledge work. Near-future models or future architectures promise AGI, then ASI in short order. The world won't know what hit it.

  • LLM moderates: Current SOTA models are useful, but incapable of replacing even mid-level devs without negative repercussions on work quality or code performance/viability. They do not fully substitute for the labor of the average professional programmer in the West. They may or may not achieve this in the near future. AGI is uncertain, ASI less likely.

  • LLM skeptics: Current SOTA models are grossly overhyped. They are grossly incompetent at the majority of programming tasks and shouldn't be used for anything more than boilerplate, if that. AGI is unlikely in the near-term, ASI is a pipedream.

  • Gary Marcus, the dearly departed Hlynka. Opinions not worth discussing.

Then there's the question of whether LLMs or recognizable derivatives are capable of becoming AGI/ASI, or if we need to make significant discoveries in terms of new architectures and/or training pipelines (new paradigms). Fortunately, that isn't relevant right now.


Alternatively, according to Claude:

The Displacement Imminent camp thinks current models already threaten mid-level knowledge work, and the curve is steep enough that AGI is a near-term planning assumption, not a thought experiment.

The Instrumental Optimist thinks current models are genuinely useful in a supervised workflow, trajectory is positive but uncertain, AGI is possible but not imminent. This is probably the modal position among working engineers who actually use these tools.

The Tool Not Agent camp thinks current models are genuinely useful as sophisticated autocomplete or search, but the "agent" framing is mostly hype — they fail badly without tight human scaffolding, and trajectory is uncertain enough that AGI is not worth pricing in.

The Stochastic Parrot camp (your skeptics, minus the pejorative) thinks the capabilities are brittle, benchmark gaming is rampant, and real-world coding performance is far below reported evals. They're often specifically focused on the unsupervised case and the question of whether the outputs are actually understood vs. pattern-matched.

The dimension you might also want to add explicitly is who bears the cost of the failure modes — because a lot of the disagreement between practitioners isn't about raw capability but about whether the errors are cheap (easily caught, low stakes) or expensive (subtle, compounding, hard to audit). Someone who works on safety-critical systems has a very different prior than someone shipping web apps.


Coding ability is more of a vector than it is a scalar. Using a breakdown helpfully provided by ChatGPT 5.2 Thinking:


Most arguments are really about which of these capabilities you think models have:

  1. Local code generation (Boilerplate, idioms, small functions, straightforward CRUD, framework glue.)

  2. Code understanding in situ (Reading unfamiliar code, tracing control flow, handling large repos, respecting existing patterns.)

  3. Debugging and diagnosis (Finding root cause, interpreting logs, stepping through runtime behavior, reproducing bugs.)

  4. Refactoring and maintenance (Changing code without breaking invariants, reducing complexity, untangling legacy.)

  5. System design and requirements translation (Turning vague specs into robust design, choosing tradeoffs, anticipating failure modes.)

  6. Operational competence (Tests, CI, tooling, dependency management, security posture, deploy and rollback, observability.)

Two people can both say “LLMs are great at coding” and mean (1) only vs (1)+(2)+(6) vs “end-to-end ticket closure.”


With terminology hopefully clarified, I come to the actual proposal:

@strappingfrequent (one of the many Mottizens I am reasonably well-acquainted with off-platform), has very generously offered:

  1. A sizeable amount of tokens from his very expensive Claude Max plan ($200 a month!) and access to the latest Claude Opus.

  2. His experience using agent frameworks and orchestration. I can personally attest that he was doing this well before it was cool, I recall seeing detailed experimentation as early as GPT-4.

  3. His time in personally setting up experiments/tests, as well as overseeing their progress, while potentially interacting with an audience over a livestream.

He works as a professional programmer, and has told me that he has been consistently impressed by the capabilities of AI coding agents. They've served his needs well.

Here's his description of his skills and experience:

in my professional capacity, I've been working with Python for back-end (computer vision algorithms, FastAPI, Django) & Java (Spring). For Front-end; React. 95 percent of what I do is boilerplate, although Sonnet 3.5 did help me solve a novel problem last year but it did take quite a bit of back & forth -- the key was discussing what additional metrics I could capture to help nail down ~30+ parameters influencing a complicated computer vision pipeline.

tldr; the more represented your use case is in the training corpus, the better the results (probably) -- but I am absolutely confident that Opus 4.6 can help with novel problems, too. And, y'know -- Terence Tao thinks that as well.

To what end?

He and I share a dissatisfaction with AI discourse that substitutes confident assertion for empirical investigation, and we think the most useful contribution we can make is to show the tools actually working on tasks that skeptics consider beyond their reach.

What do we want from you?

If you self-identify as someone who is either on the fence about LLMs, or strongly skeptical that they're useful for anything: share a coding challenge that you think they're presently incapable of doing, or doing well.

An ideal candidate is a proposal that you think is beyond the abilities of any LLM, while not being so difficult that it would be entirely intractable. Neither of us claims we can solve Fermat's Last Theorem (or that Claude can solve it for us).

Other requirements:

  • A clear problem specification, or a willingness to submit a vaguer one and then approve a tighter version as created by us/Claude.

  • Nothing so easy/trivial that a quick Google shows that someone's already done it. If you want a C++ compiler written by an LLM, well, there's one out there (though that is the opposite of trivial).

  • Nothing too hard. He provides an example of "coding a Netflix clone in 4 hours".

  • An agreement on the degree of human intervention allowed. Can we prompt the model if it gets stuck? Help it in other ways? Do you want to add something to the scope later? (Strongly inadvisable.) Note that if you expect literally zero human intervention, SF isn't game. He says: "I don't think I'd care to demonstrate any sort of zero-shot capacity... that's a silly expectation. If my prime orchestrator spends 30 minutes building a full-stack webapp that doesn't work I'll say 'It doesn't work; troubleshoot, please.' I trust your judgement."

  • A time-horizon. Even a Max plan has its limits; we can't be expected to start a task that'll take days to complete.

  • Some kind of semi-objective rubric for grading the outcome, if it isn't immediately obvious. Is it enough to succeed at all? Or do you want code that even Torvalds can't critique? And no, "I know it when I see it" isn't really good enough, for obvious reasons. Ideally, give us an idea of the tests everything needs to pass.

  • If your task requires the model to review/extend proprietary code, that's not off the table entirely, but it's up to you to make sure we can access it. Either send us a copy or point us at a repo.

  • Nothing illegal.

But to sum up, we want a task that we agree is probably feasible for an LLM, and where success will change your mind significantly. By which I mean: "If it succeeds at X, I will revise my estimate of LLM capability from Y to Z." Otherwise I can only imagine a lot of post-hoc goalpost movement, "okay but it still needed 3 prompts" or "the code works but a good programmer would have done it differently."

We reserve the right to choose which proposals we attempt, partly because some will be more interesting than others, and partly because we have finite tokens and finite time.


Miscellaneous concerns:

Why Claude Opus 4.6?

Well, the most honest answer is that @strappingfrequent already has a Claude Max plan, and is familiar with its capabilities. The other SOTA competitors include Gemini 3.1 Pro and GPT 5.3 Codex, which are nominally superior on a few benchmarks, but a very large fraction of programmers insist that Claude is still the best for general programming use cases. We don't think this choice matters that much; the models are in fact roughly interchangeable, while being noticeably superior to anything released before very late 2025.

Why bother at all?

We are hoping to change minds. This is disclosed upfront. We're also open to changing our own minds. We would also be genuinely interested to discover cases where we expect success and get failure - mapping the capability frontier is valuable independent of which direction the evidence points. We will share logs and a final repo either way. An interactive livestream is possible if there is sufficient interest.

Anything else?

You know the model, and we'll be using Claude Code. The specifics of the demo are TBD: a livestream with user engagement if there's sufficient interest, otherwise a dump of the logs plus the final repo.

The floor is open. What do you think Claude cannot do?


Edit:

I'll forward the proposals to @strappingfrequent, assuming he doesn't show up in the thread himself. I am clearly not the real expert here, plus he's doing all of the heavy lifting. Expect at least a day or two of back and forth before we come to a consensus (but if he agrees to something, then assume I won't object, but not vice versa unless specifically noted), and that includes conversations within this thread to narrow down the scope and make things as rigorous as possible while adhering to the restrictions we've mentioned. At some point, we'll announce winners and get to scheduling.

Very few, but still non-zero. Classic examples would be Ender's Game; then we've got HPMOR and other rat-fic.

Exactly, if a woman marries me for my money, extends me love and attention, raises my kids, watches me die of natural causes and then goes to the Bahamas to cry on a cruise ship, I'm not really seeing the issue here.

There are very few women who don't care about money at all. I ask the married male Mottizens here to consider what would happen if they suddenly gave away all their money, quit their jobs and then told their wives that. "But don't you love me for who I am?", you'll have to cry plaintively as she files papers and takes the kids.

She's never been permabanned. I seem to recall her saying she'd lost the password to her previous account, and she then turned down our offer to restore it.

Thank you, that's the one. My internal betting market had strong odds in favor of you being the first to find the link, good to see I'm well-calibrated.

doesn't quite match your assertion

Hmm. It seems I was misremembering. I'll weaken my claim: the graph doesn't actually support 18 (or my speculated 16) as peak female attractiveness.

I will note:

as you can see, men tend to focus on the youngest women in their already skewed preference pool, and, what's more, they spend a significant amount of energy pursuing women even younger than their stated minimum. No matter what he's telling himself on his setting page, a 30 year-old man spends as much time messaging 18 and 19 year-olds as he does women his own age. On the other hand, women only a few years older are largely neglected.

I think this supports part of my argument: namely, that by setting an age minimum at 18, OKCupid obscures the fact that many/most men would happily approach younger women if they had the option. That shouldn't even be controversial; women don't magically go from being devoid of sexual value at 18 years minus 1 Planck time to being hot when the clock strikes 12 on their 18th birthday.

Also look at the charts titled "The shape of the dating pool" and "how a person's attractiveness changes with time":

The latter shows that 18 year old women are about 75% as attractive as they are at their absolute peak at 21. They are roughly twice as attractive as they would be at 34. This strongly implies that women below 18 are more attractive than the majority of older women; the range restriction just doesn't allow us to measure this.

I had an ex who was actually two years older than me, but could have passed as 18 without much hassle. I visited London with her when I was 26ish, and she was 28. I remember getting dirty looks at a liquor store with her on my arm as we were gawking at the variety of booze on offer. The next time, when she went alone, she got even dirtier looks, and was finally accosted by both a random old granny and the lady at the till on suspicion of underage drinking. It was funny in hindsight, as much as women complain about getting carded, they're even more upset when it stops.

On the other hand, excluding venues where they have a policy of carding anyone who walks in, I haven't been specifically asked for ID since I was 16. I can only presume that we were giving off the impression of a sizeable age gap.

Anecdotes aside, I think the primary driver of age gap discourse is the bitterness of a specific age group of women engaged in intrasexual warfare that spills out into intersexual forms.

Ages 25-35, I'd say. Just young enough to be terminally online, unlike even older women who grew up and settled down before this was capital-d Discourse. (There are very few grannies out there who are going to lecture their granddaughters about dating a 35 yo when they're 22.)

They notice that the youth they once prized is fading, and while they're still perfectly happy to go for older men (as are almost all women), they resent the fact that the men in their ideal age range don't consider them to be in their ideal age range.

Lip-service to feminism makes it difficult to directly attack their direct competitors (younger girls), without coming off as bitter and butt-hurt. But you can attack the men. And if you can successfully pathologize male preference for youth as predatory, you accomplish two things simultaneously: you make the competing demographic seem like victims who need protection rather than rivals, and you make the men who prefer them seem like villains.

This reframing has the additional advantage of being unfalsifiable in ways that make it rhetorically robust. Any counterexample, any young woman who says she's perfectly happy in her relationship and was not victimized, can be explained as evidence of how thorough the manipulation was. She doesn't know she's a victim. That's the worst part.

The frontal-lobe argument is where things get especially interesting. The claim is that the prefrontal cortex isn't fully developed until 25, therefore people under 25 lack sufficient judgment to consent to relationships with older partners. I've seen this argument made by people with actual MDs on /r/medicine, which I find both impressive and alarming. It's impressive because it successfully launders a social preference into neuroscience. It's alarming because it's bad neuroscience.

Neurodevelopment is continuous. The "fully developed at 25" framing suggests a step function where below 25 you're basically a golden retriever and above 25 you're suddenly Immanuel Kant. This is not how brains work. The research shows gradual changes in certain cognitive and regulatory processes, with enormous individual variation, and basically no evidence that this translates into systematic inability to make reasonable decisions about relationships.

The younger girls? They absorb this by cultural osmosis. Younger Gen Z is actually the most vocal about age-gap discourse. Unfortunately (or fortunately), that isn't enough to overcome their innate biological preference for older, successful men, so actual behavior doesn't change much. If a 20 year old girl meets a 30 year old man she thinks is cute, she'll usually have few qualms about sleeping with him or getting into a relationship, age-gaps be damned.

Power-disparity is bad? Huh, someone should tell all the women who prefer that kind of disparity, in favor of the men they desire. Men tend to be more focused on attributes such as physical attractiveness and youth, which are, no prizes for guessing, more common in younger women.

I find such pathologization of universal human preferences distasteful, doubly so when my field is molested and forcefully conscripted to shore up bad arguments. Oh well, so be it. I'm lucky enough to be a MILF enjoyer and thus immune from direct blowback for the most part, even if I regretfully note that "MILF" increasingly just means women my age.

(Another anecdote: I remember grinding on a girl I vaguely knew at a club in Scotland. An older friend of mine had a thing for a bisexual woman about the same age as me. She ended up chatting with the first girl, who seemed receptive to her advances. Then the girl disclosed that she was 19, and that made the woman freak out, as they later explained in our company. I put aside any plans to approach the girl later, since the headache was far from worth it.)

If I was less lazy/busy, I'd insert the usual OkCupid stats blogs/archives from before they were bought and cucked. They showed that female attractiveness peaked at 18, but that was their minimum age cutoff, so I suspect the actual figure is even lower at around 16. Men also showed tolerance to wider age gaps as they got older. 30 year old and 35 year old men showed roughly the same willingness to approach 25 year old women.

I believe Gwern has a copy. Someone please do this in the comments, thanks, :*

Just look at what the side bar on the blog is titled.

I think my actual favorite by Watts is the Sunflower series/novella. There's no scope for heavy-handed ecological metaphors, just good old-fashioned scifi and existential dread.

That's like saying Einstein and a village idiot both suffer from the "same" problem, they stub their toes at equal rates. Or saying that a drunk Asian grandma and a professional F1 driver are as incompetent because F1 drivers crash their cars too.

How often they fail is important.