
The state of open-source LLMs as of summer 2024, or: Core Values of Socialism with AGI characteristics, V2

Some people here have expressed interest in my take on AI broadly, and then there's the DeepSeek-Coder release; but I've been very busy, and the field is moving so fast again that it felt like a thankless job to do what Zvi does, and without his doomer agenda too (seeing the frenetic feed on Twitter, one can be forgiven for just losing the will; and, well, I suppose Twitter explains a lot about our condition in general). At times I envy Iconochasm, who tapped out. Also, this is a very niche technical discussion, and folks here prefer policy.

But, in short: open-source AI, in its most significant aspects, which I deem to be code generation and general verifiable reasoning (you can bootstrap most everything else from it), is now propped up by a single Chinese hedge fund (created in the spirit of Renaissance Technologies) which supports a small, ignored (except by scientists and a few crackpots on Twitter) research division staffed with nonames, who are quietly churning out extraordinarily good models with the explicit aim of creating AGI in the open. These models happen to be (relatively) innocent of benchmark-gaming, but somewhat aligned to Chinese values. The modus operandi of DeepSeek is starkly different from that of either other Chinese or Western competitors. In effect this is the only known group both meaningfully pursuing frontier capabilities and actively teaching others how to do so. I think this is interesting and a modest cause for optimism. I am also somewhat reluctant to write about this publicly because there exist lovers of Freedom here, and it would be quite a shame if my writing contributed to targeted sanctions and even more disempowerment of the small man by the state machinery in the final accounting.

But the cat's probably out of the bag. The first progress prize of the AI Mathematical Olympiad has just been taken by a team using their DeepSeekMath-7B model, solving 29 out of 50 private test questions «less challenging than those in the IMO but at the level of IMO preselection»; Terence Tao finds this «somewhat higher than expected» (he is on the AIMO Advisory Committee, along with his fellow Fields medalist Timothy Gowers).

The next three teams entered with this model as well.

I. The shape of the game board

To provide some context, here's an opinionated recap of AI trends since last year. I will be focusing exclusively on LLMs, as that's what matters (image gen, music gen, TTS etc largely are trivial conveniences, and other serious paradigms seem to be in their embryonic stage or in deep stealth).

  • We have barely advanced in true out-of-distribution reasoning/understanding relative to the original «Sparks of AGI» GPT-4 (TheDag, me); GPT-4-04-29 and Sonnet 3.5 were the only real steps forward, and minor ones at that; Gemini was a catch-up effort, and nobody else has yet credibly reached the same tier. We have also made scant progress towards consensus on whether that-which-LLMs-do is «truly» reasoning or understanding; sensible people have fallen back on something like «it's its own kind of mind, and hella useful».
  • Meanwhile there's been a great deal of progress in scaffolding (no more babyAGI/AutoGPT gimmickry; now agents are climbing up the genuinely hard SWE-bench), code and math skills, inherent robustness in multi-turn interactions and responsiveness to nuanced feedback (to the point that LLMs can iteratively improve sizable codebases – as pair programmers, not just fancy-autocomplete «copilots»), factuality, respect of prioritized system instructions, patching badly covered parts of the world-knowledge/common-sense manifold, unironic «alignment» and ironing out Sydney-like kinks in deployment, integrating non-textual modalities, managing long contexts (merely usable 32K "memory" was almost sci-fi back then; now 1M+ with strong recall is table stakes at the frontier, with 128K mastered on a deeper level by many groups) and a fairly insane jump in cost-effectiveness – marginally driven by better hardware, and mostly by distilling from raw pretrained models, better dataset curation, low-level inference optimizations, eliminating architectural redundancies and discovering many "good enough" if weaker techniques (for example, DPO instead of PPO; see the sketch after this list). 15 months ago, "$0.002/1000 tokens" for gpt-3.5-turbo seemed incredible; now we always count tokens by the million, and Gemini-Flash blows 3.5-turbo out of the water for half that price, so hard it's not funny; and we have reason to believe it's still raking in >50% margins whereas OpenAI probably subsidized their first offerings (though in light of distilling and possibly other methods of compute reuse, it's hard to rigorously account for a model's capital costs now).
  • AI doom discourse has continued to develop roughly as I've predicted, but with MIRI pivoting to evidence-free advocacy, orthodox doomerism getting routed as a scientific paradigm, more extreme holdovers from it («emergent mesaoptimizers! tendrils of agency in inscrutable matrices!») being wearily dropped by players who matter, and misuse (SB 1047 etc) + geopolitical angle (you've probably seen young Leopold) gaining prominence.
  • The gap in scientific and engineering understanding of AI between the broader community and "the frontier" has shrunk since the debut of GPT-4 or 3.5, because there's too much money to be made in AI and only so much lead you can get out of having assembled the most driven AGI company. Back then, only a small pool of external researchers could claim to understand what the hell they did above the level of shrugging "well, scale is all you need" (wrong answer) or speculating about some simple methods like "train on copyrighted textbooks" (spiritually true); people chased rumors, leaks… Now it takes weeks at most to trace yet another jaw-dropping magical demo to papers, to cook up a proof of concept, or even to deem the direction suboptimal; the other two leading labs no longer seem desperate, and we're in the second episode of Anthropic's comfortable lead.
  • Actual, downloadable open AI sucks way less than I lamented last July. But it still sucks. And that's really bad, since it sucks most in the dimension that matters: delivering value, in the basest sense of helping do work that gets paid. And the one company built on the promise of «decentralizing intelligence», which I had hopes for, has proven unstable.
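Since DPO is named above as the archetypal "good enough if weaker" substitute for PPO, here is a minimal sketch of its preference loss – my own illustration of the standard formulation, not any lab's actual training code; the `beta` value and the batching convention are assumptions:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: train directly on preference pairs,
    with a frozen reference model standing in for PPO's reward model + RL loop.

    Each argument is the summed log-probability of a full response
    (chosen or rejected) under the trained policy or the reference model.
    """
    # Implicit "rewards" are the log-ratios of policy vs. reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push up the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The appeal is exactly the «good enough» trade-off: no reward model to train, no rollouts to sample, just a classification-style loss over preference pairs.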

To be more specific, open source (or, as some say now, given the secretiveness of full recipes and opacity of datasets, «open weights») AI has mostly caught up in «creativity» and «personality», «knowledge» and some measure of «common sense», and can be used for petty consumer pleasures or simple labor automation, but it's far behind corporate products in «STEM»-type skills, which are in short supply among human employees too: «hard» causal reasoning, information integration, coding, math. (Ironically, I agree here with whining artists that we're solving domains of competence in the wrong order. Also it's funny how by default coding seems to be what LLMs are most suited for, as a sequence of code is more constrained by its preceding context than natural language is.)

To wit, Western and Eastern corporations alike generously feed us – while smothering startups – fancy baubles to tinker with, charismatic talking toys; as they rev up self-improvement engines for full-cycle R&D, the way imagined by science fiction authors all these decades ago, monopolizing this bright new world. Toys are getting prohibitively expensive to replicate, with reported pretraining costs up to ≈$12 million and counting now. Mistral's Mixtral/Codestral, Musk's Grok-1, 01.Ai's Yi-1.5, Databricks' DBRX-132B, Alibaba's Qwens, Meta's fantastic Llama 3 (barring the not-yet-released 405B version), Google's even better Gemma 2, Nvidia's massive Nemotron-340B – they're all neat. But they don't even pass for prototypes of engines you can hop on and hope to ride up the exponential curve. They're too… soft. And not economical for their merits.

Going through our archive, I find this year-old analysis strikingly relevant:

I think successful development of a trusted open model rivaling chatgpt in capability is likely in the span of a year, if people like you, who care about long-term consequences of lacking access to it, play their cards reasonably well. […] Companies whose existence depends on the defensibility of the moat around their LM-derived product will tend to structure the discourse around their product and technology to avoid even the fleeting perception of being a feasibly reproducible commodity.

That's about how it went. While the original ChatGPT, that fascinating demo, is commodified now, competitive product-grade AI systems are not, and companies big and small still work hard to maintain the impression that it takes

  • some secret sauce (OpenAI, Anthropic)
  • work of hundreds of Ph.Ds (Deepmind)
  • vast capital and compute (Meta)
  • "frontier experience" (Reka)

– and even then, none of them has felt secure enough to release a serious threat to the others' proprietary offerings.

I don't think it's a big exaggeration to say that the only genuine pattern breaker – presciently mentioned by me here – is DeepSeek, the company that has single-handedly changed – a bit – my maximally skeptical spring 2023 position on the fate of China in the AGI race.

II. Deep seek what?

AGI, I guess. Their Twitter bio states only: «Unravel the mystery of AGI with curiosity. Answer the essential question with long-termism». It is claimed by the Financial Times that they have a recruitment pitch «We believe AGI is the violent beauty of model x data x computing power. Embark on a ‘deep quest’ with us on the journey towards AGI!» but other than that nobody I know of has seen any advertisement or self-promotion from them (except for like 70 tweets in total, all announcing some new capability or responding to basic user questions about license), so it's implausible that they're looking for attention or subsidies. Their researchers maintain near-perfect silence online. Their – now stronger and cheaper – models tend to be ignored in comparisons by Chinese AI businesses and users. As mentioned before, one well-informed Western ML researcher has joked that they're the bellwether for «the number of foreign spies embedded in the top labs».

FT also says the following of their parent company:

Its funds have returned 151 per cent, or 13 per cent annualised, since 2017, and were achieved in China’s battered domestic stock market. The country’s benchmark CSI 300 index, which tracks China’s top 300 stocks, has risen 8 per cent over the same time period, according to research provider Simu Paipai.
In February, Beijing cracked down on quant funds, blaming a stock market sell-off at the start of the year on their high-speed algorithmic trading. Since then, High-Flyer’s funds have trailed the CSI 300 by four percentage points.
[…] By 2021, all of High-Flyer’s strategies were using AI, according to manager Cai Liyu, employing strategies similar to those pioneered by hugely profitable hedge fund Renaissance Technologies. “AI helps to extract valuable data from massive data sets which can be useful for predicting stock prices and making investment decisions,” …
Cai said the company’s first computing cluster had cost nearly Rmb200mn and that High Flyer was investing about Rmb1bn to build a second supercomputing cluster, which would stretch across a roughly football pitch-sized area. Most of their profits went back into their AI infrastructure, he added. […] The group acquired the Nvidia A100 chips before Washington restricted their delivery to China in mid-2022.
“We always wanted to carry out larger-scale experiments, so we’ve always aimed to deploy as much computational power as possible,” founder Liang told Chinese tech site 36Kr last year. “We wanted to find a paradigm that can fully describe the entire financial market.”

In a less eclectic Socialist nation this would've been sold as Project Cybersyn or OGAS. Anyway, my guess is they're not getting subsidies from the Party any time soon.

They made a minor splash in the ML community eight months ago, in late October, releasing an unreasonably strong DeepSeek-Coder. Yes, in practice an awkward replacement for GPT-3.5; yes, contaminated with test set, which prompted most observers to discard it as yet another Chinese fraud. But it proved to strictly dominate hyped-up things like Meta's CodeLlama and Mistral's Mixtral 8x7B in real-world performance, and time and again showed up as the strongest open baseline in research papers. On privately designed, new benchmarks like this fresh one from Cohere it's clear that they did get to parity with OpenAI's workhorse model, right on the first public attempt – as far as coding is concerned.

On top of that, they shared a great deal of information about how it was done: constructing the dataset from GitHub, pretraining, finetuning. The paper was an absolute joy to read, sharing even details of unsuccessful experiments. It didn't offer much in the way of novelty; I evaluate it as a masterful, no-unforced-errors integration of fresh (by that point) known best practices. Think about your own field and you'll probably agree that even this is a high bar. And in AI, it is generally the case that either you get a great model with a «we trained it on some text… probably» tech report (Mistral, Google), or a mediocre one accompanied by a fake-ass novel full of jargon (every second Chinese group). Still, few cared.

Coder was trained, it seems, using lessons from the less impressive DeepSeek-LLM-67B (even so, it was roughly a peer of Meta's LLaMA-2-70B that also could code; a remarkable result for a literally-who new team), which somehow came out a month later. Its paper (released later still) was subtitled «Scaling Open-Source Language Models with Longtermism». I am not sure if this was some kind of joke at the expense of effective altruists. What they meant concretely was the following:

Over the past few years, LLMs … have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). … Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source LMs with a long-term perspective.

  • …Soon, we will release our technique reports in code intelligence and Mixture-of-Experts(MoE), respectively. They show how we create high-quality code data for pre-training, and design a sparse model to achieve dense model performance.
  • At present, we are constructing a larger and improved dataset for the upcoming version of DeepSeek LLM. We hope the reasoning, Chinese knowledge, math, and code capabilities will be significantly improved in the next version.
  • Our alignment team is dedicated to studying ways to deliver a model that is helpful, honest, and safe to the public. Our initial experiments prove that reinforcement learning could boost model complex reasoning capability.

…I apologize for geeking out. All that might seem normal enough. But, a) they've fulfilled every one of those objectives since then. And b) I've read a great deal of research papers and tech reports, entire series from many groups, and I don't remember this feeling of cheerful formidability. It's more like contemplating the dynamism of SpaceX or Tesla than wading through a boastful yet obscurantist press release. It is especially abnormal for a Mainland Chinese paper to be written like this – with friendly confidence, admitting weaknesses, pointing out errors you might repeat, not hiding disappointments behind academese word salad; and so assured of having a shot in an honest fight with the champion.

In the Coder paper, they conclude:

…This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

In the Mixture-of-Experts paper (8th January), they showed themselves capable of novel architectural research too, introducing a pretty ingenious «fine-grained MoE with shared experts» design with the objective of «Ultimate Expert Specialization» and economical inference: «DeepSeekMoE 145B significantly outperforms Gshard, matching DeepSeek 67B with 28.5% (maybe even 14.6%) computation». For the few who noticed it, this seemed a minor curiosity, or just bullshit.
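To make the «fine-grained experts plus shared experts» idea concrete, here is a toy sketch of the routing scheme – dimensions and expert counts are illustrative, not the paper's, and the per-token loop ignores all the capacity/load-balancing machinery a real implementation needs:

```python
import torch
import torch.nn as nn

class FineGrainedSharedMoE(nn.Module):
    """Toy DeepSeekMoE-style layer: many small routed experts for specialization,
    plus a couple of always-active shared experts for common knowledge."""

    def __init__(self, d_model=1024, d_expert=256, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model))
        self.routed = nn.ModuleList(make_ffn() for _ in range(n_routed))
        self.shared = nn.ModuleList(make_ffn() for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                            # x: [tokens, d_model]
        probs = self.gate(x).softmax(dim=-1)          # routing distribution per token
        top_p, top_i = probs.topk(self.top_k, dim=-1)
        shared_out = sum(e(x) for e in self.shared)   # shared experts see every token
        routed_rows = []
        for t in range(x.size(0)):                    # naive per-token dispatch
            routed_rows.append(sum(p * self.routed[int(i)](x[t])
                                   for p, i in zip(top_p[t], top_i[t])))
        return shared_out + torch.stack(routed_rows)
```

The point of the design: with many narrow experts the router can specialize them sharply, while the always-on shared experts stop every routed expert from having to re-learn the same generic features.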

On 5th February, they dropped DeepSeekMath, of which I've already spoken: «Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model». Contra the usual Chinese pattern, it wasn't a lie; no, you couldn't in normal use get remotely as good results from it, but in some constrained regimes… The project itself was a mix of most of the previous steps: a sophisticated (and well-explained) data harvesting pipeline, scaling-laws experiments, further «longtermist» continued pretraining from Coder-7B-1.5, which is itself a repurposed LLM-7B, and the teased reinforcement learning approach. Numina, winners of AIMO, say: «We also experimented with applying our SFT recipe to larger models like InternLM-20B, CodeLama-33B, and Mixtral-8x7B but found that (a) the DeepSeek 7B model is very hard to beat due to its continued pretraining on math…».

In early March they released DeepSeek-VL: Towards Real-World Vision-Language Understanding, reporting some decent results and research on building multimodal systems, and again announcing new plans: «to scale up DeepSeek-VL to larger sizes, incorporating Mixture of Experts technology».

III. Frontier minor league

Thus far, it had all been preparatory R&D, shared openly and explained eagerly yet barely noticed by anyone (except that the trusty Coder still served as a base for labs like Microsoft Research to experiment on): utterly overshadowed in discussions by Alibaba, Meta, Mistral, to say nothing of frontier labs.

But on May 6th, 2024, the pieces began to fall into place. They released «DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model», which subsumed all aforementioned works (except VL).

It's… unlike any other open model, to the point you could believe it was actually made by some high-IQ finance bros from first principles. Its design choices are exquisite; just copying minor details can substantially improve on typical non-frontier efforts. It pushes their already unorthodox MoE further and tops it off with a deep, still poorly understood modification to the attention mechanism (Multi-head Latent Attention, or MLA). It deviates from industry-standard rotary position embeddings to accommodate the latter (a fruit of collaboration with RoPE's inventor). It's still so unconventional that we are only beginning to figure out how to run it properly (they don't share their internal pipeline, which is optimized for the hardware they can access given American sanctions). But in retrospect, it's the obvious culmination of the vision announced with those first model releases and goofy tweets – probably a vision not one year old, and yet astonishingly far-sighted, especially given how young their star researchers are. But it's probably mundane in the landscape of AI that's actually used; I suspect it's close to how Sonnet 3.5 or Gemini 1.5 Pro work on the inside. It's just that the open-source peasants are still mucking around with stone-age dense models on their tiny consumer GPUs.

I understand I might already be boring you out of your mind, but just to give you an idea of how impressive this whole sequence is, here's a 3rd April paper for context:

Recent developments, such as Mixtral (Jiang et al., 2024), DeepSeek-MoE (Dai et al., 2024), spotlight Mixture-of-Experts (MoE) models as a superior alternative to Dense Transformers. An MoE layer works by routing each input token to a selected group of experts for processing. Remarkably, increasing the number of experts in an MoE model (almost) does not raise the computational cost, enabling the model to incorporate more knowledge through extra parameters without inflating pre-training expenses… Although our findings suggest a loss-optimal configuration with Emax experts, such a setup is not practical for actual deployment. The main reason is that an excessive number of experts makes the model impractical for inference. In contrast to pretraining, LLM inference is notably memory-intensive, as it requires storing intermediate states (KV-cache) of all tokens. With more experts, the available memory for storing KV caches is squeezed. As a result, the batch size – hence throughput – decreases, leading to increased cost per query. … We found that MoE models with 4 or 8 experts exhibit more efficient inference and higher performance compared to MoE models with more experts. However, they necessitate 2.4x-4.3x more training budgets to reach the same performance with models with more experts, making them impractical from the training side.

This is basically where Mistral AI, the undisputed European champion with Meta and Google pedigree (valuation $6.2B), the darling of the open-source community, stands.

And yet, apparently, DeepSeek have found a way out of the bind. «4 or 8»? They scale to 162 experts, reducing active parameters to 21B, cutting pretraining costs by 42.5% and increasing peak generation speed by 5.76x; they scale up the batch size by compressing the KV cache roughly 15-fold with a bizarre application of low-rank projections and dot-product attention; and while doing so they cram in 3x more attention heads than any model this size has any business having (because their new attention decouples the number of heads from cache size), and so kick the effective «thinking intensity» up a notch, beating the gold-standard «multi-head attention» everyone has been sloppily approximating; and they use a bunch of auxiliary losses to make the whole thing maximally cheap to run on their specific node configuration.

But the cache trick is pretty insane – the hardest-to-believe part of the whole thing, for me. Now, 2 months later, we know that certain Western groups ought to have reached the same Pareto frontier, just with different (maybe worse, maybe better) tradeoffs. But those are literally inventors and/or godfathers of the Transformer – Noam Shazeer's Character.AI, Google DeepMind's Gemini line… This is done by folks like this serious-looking 5th-year Ph.D. student, in under a year!
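For intuition on the cache trick, here is a stripped-down sketch of the low-rank KV-compression idea: store one small latent per token and reconstruct per-head keys and values from it at attention time. It ignores MLA's decoupled RoPE and the matrix-absorption tricks that make it fast in practice; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Cache a compressed latent per token instead of full multi-head K/V."""

    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # rebuild keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # rebuild values
        self.n_heads, self.d_head = n_heads, d_head

    def compress(self, hidden):        # hidden: [seq, d_model]
        return self.down(hidden)       # [seq, d_latent] -- this is all that gets cached

    def expand(self, cache):           # reconstruct per-head K and V on the fly
        seq = cache.size(0)
        k = self.up_k(cache).view(seq, self.n_heads, self.d_head)
        v = self.up_v(cache).view(seq, self.n_heads, self.d_head)
        return k, v

# Per-token cache size in this toy setup (values stored, e.g. in fp16):
#   standard multi-head attention: 2 * n_heads * d_head = 2 * 32 * 128 = 8192
#   latent cache:                  d_latent             = 512   (16x smaller)
```

Because the number of heads no longer dictates cache size, you can afford more heads (more «thinking intensity») and much larger decoding batches for the same memory.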

As a result, they:

  • use about as much compute on pretraining as Meta did on Llama-3-8B, an utter toy in comparison (maybe worth $2.5 million for them); 1/20th of GPT-4.
  • Get a 236B model that's about as good across the board as Meta's Llama-3-70B (≈4x more compute), which has the capacity – if not the capability – of mid-range frontier models (previous Claude 3 Sonnet; GPT-4 on a bad day).
  • Can serve it at around the price of 8B, $0.14 for processing 1 million tokens of input and $0.28 for generating 1 million tokens of output (1 and 2 Yuan), on previous-gen hardware too.
  • …and still take up to 70%+ gross margins, because «On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second… In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second», and the going price for such nodes is ≤$15/hr. That's $50+ per hour in output revenue alone, for clarity (see the back-of-envelope check after this list). They aren't doing a marketing stunt.
  • …and so they force every deep-pocketed mediocre Chinese LLM vendor – Alibaba, Zhipu and all – to drop prices overnight, now likely serving at a loss.
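A back-of-envelope check of that margin claim, assuming the node actually sustains the quoted generation throughput for a full hour and everything is billed at the output price (both charitable simplifications):

```python
# Quoted figures: 50K output tokens/s per 8xH800 node, $0.28 per 1M output
# tokens, node rental at roughly $15/hr.
tokens_per_hour = 50_000 * 3600                      # ~180M output tokens
revenue_per_hour = tokens_per_hour / 1e6 * 0.28      # ~ $50.4
node_cost_per_hour = 15.0
gross_margin = 1 - node_cost_per_hour / revenue_per_hour
print(round(revenue_per_hour, 1), round(gross_margin, 2))   # 50.4, 0.7
```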

Now, I am less sure about some parts of this story; but mostly it's verifiable.

I can see why an American, or a young German like Leopold, would freak out about espionage. The thing is, their papers are just too damn good and too damn consistent over the entire period if you look back (as I did), so «that's it, lock the labs» or «haha, no more tokens 4 u» is most likely little more than racist cope for the time being. The appropriate reaction would be more akin to «holy shit Japanese cars are in fact good».

Smart people (Jack Clark from Anthropic, Dylan Patel of SemiAnalysis) immediately take note. Very Rational people clamoring for an AI pause (TheZvi) sneer and downplay: «This is who we are worried about?» (as he did before, and before). But it is still good fun. Nothing extreme. Efforts at adoption slowly begin: say, Salesforce uses V2-Chat to create synthetic data to finetune small DeepSeek-Coder V1s to outperform GPT-4 on narrow tasks. Mostly, nobody cares.

The paper ends in the usual manner of cryptic comments and commitments:

We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper. DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.

DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.

• In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.

In the Appendix, you can find a lot of curious info, such as:

During pre-training data preparation, we identify and filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on the test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of testsets compared with its competitors like Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, which is mainly associated with American values.

Prejudices of specific regional cultures aside, though, it does have values – true, Middle Kingdom ones, such as uncritically supporting the Party line and adherence to Core Values Of Socialism (h/t @RandomRanger). The web version will also delete the last message if you ask something too clever about Xi or Tiananmen or… well, nearly the entirety of usual things Americans want to talk to Chinese coding-oriented LLMs about.

And a bit earlier, this funny guy from the team presented at Nvidia's GTC24 with the product for the general case – «culturally sensitive», customizable alignment-on-demand: «legality of rifle» for the imperialists, illegality of Tibet separatism for the civilized folk. Refreshingly frank.

But again, even that was just preparatory.

IV. Coming at the king

Roughly 40 days later they release DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence, where they return to the strategy announced at the very start: they take an intermediate checkpoint of V2 and push it harder and further on a dataset enriched with code and math (which they've continued to expand and refine), for 10.2 trillion tokens total. This training run is now 60% more expensive than Llama-3-8B's (still a pittance by modern standards). It also misses out on some trivia knowledge and somehow becomes even less charismatic. It's also not a pleasant experience because the API runs very slowly, probably from congestion (I guess Chinese businesses are stingy… or perhaps DeepSeek is generating a lot of synthetic data for the next iterations). Anons on 4chan joke that it's «perfect for roleplaying with smart, hard-to-get characters».

More importantly though, it demolishes Llama-3-70B on every task that takes nontrivial intelligence; bests Claude 3 Opus on coding and math throughout, and Gemini 1.5 Pro on most coding assistance, and trades blows with the strongest GPT-4 variants. Of course it's the same shape and the same price, which is to say, up to 100 times cheaper than its peers… more than 100 times, in the case of Opus. Still a bitch to run, but it turns out they're selling turnkey servers. In China, of course. To boot, they rapidly shipped running code in the browser (a very simple feature, but going most of the way to the Claude Artifacts that wowed people so much), quadrupled context length without price changes (32K to 128K), and now intend to add the context caching that Google boasts of as some tremendous Gemini breakthrough. They have... impressive execution.

Benchmarks, from the most sophisticated and hard to hack to the most bespoke and obscure, confirm that it's «up there».

Etc etc, and crucially, users report similar impressions:

So I have pegged deepseek v2 coder against sonnet 3.5 and gpt4o in my coding tasks and it seems to be better than gpt4o (What is happening at OpenAI) and very similar to Sonnet 3.5. The only downside is the speed, it's kinda slow. Very good model and the price is unbeatable.

I had the same experience, this is a very good model for serious tasks. Sadly the chat version is very dry and uncreative for writing. Maybe skill issue, I do not know. It doesn't feel slopped, it's just.. very dry. It doesn't come up with things.

Some frustrating weak points, but they know of those, and conclude:

Although DeepSeek-Coder-V2 achieves impressive performance on standard benchmarks, we find that there is still a significant gap in instruction-following capabilities compared to current state-of-the-art models like GPT-4 Turbo. This gap leads to poor performance in complex scenarios and tasks such as those in SWEbench. […] In the future, we will focus more on improving the model’s instruction-following capabilities…

Followed by the list of 338 supported languages.

Well-read researchers say stuff like

DeepSeek-Coder-V2 is by far the best open-source math (+ coding) model, performing on par with GPT4o w/o process RM or MCTS and w/ >20x less training compute. Data contamination doesn't seem to be a concern here. Imagine about what this model could achieve with PRM, MCTS, and other yet-to-be-released agentic exploration methods. Unlike GPT4o, you can train this model further. It has the potential to solve Olympiad, PhD and maybe even research level problems, like the internal model a Microsoft exec said to be able to solve PhD qualifying exam questions.

Among the Rational, there is some cautious realization («This is one of the best signs so far that China can do something competitive in the space, if this benchmark turns out to be good»), in short order giving way to more cope: «Arena is less kind to DeepSeek, giving it an 1179, good for 21st and behind open model Gemma-2-9B».

And one more detail: a couple of weeks ago, they released code and a paper on Expert-Specialized Fine-Tuning, «which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning … by showing less performance degradation [in general tasks]». It seems to require that «ultimate expert specialization» design of theirs – a supporting beam of generalist modules surrounded by meaningfully task-specific shards – to automatically select only the parts pertaining to some target domain; this isn't doable with traditional dense or conventional MoE designs. Once again: confident vision, bearing fruit months later. I would like to know who's charting their course, because they're single-handedly redeeming my opinion of the Chinese AI ecosystem and frankly of Chinese culture.
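The mechanics are simple enough to sketch: score experts by how much the router actually uses them on a small sample of the target task, then train only those (plus whatever is always-on) and freeze the rest. The accessors below (`router_probs`, `routed_experts`) are hypothetical stand-ins, not DeepSeek's real API:

```python
import torch

def expert_specialized_finetune_mask(model, calib_batch, keep_ratio=0.1):
    """Freeze routed experts that the target task barely uses (ESFT-style sketch)."""
    with torch.no_grad():
        probs = model.router_probs(calib_batch)     # hypothetical: [tokens, n_experts]
        usage = probs.mean(dim=0)                   # average affinity per expert
    n_keep = max(1, int(keep_ratio * usage.numel()))
    keep = set(usage.topk(n_keep).indices.tolist())
    for idx, expert in enumerate(model.routed_experts):   # hypothetical expert list
        trainable = idx in keep
        for p in expert.parameters():
            p.requires_grad = trainable             # only task-relevant experts get gradients
```

This only makes sense if experts are genuinely specialized; in a conventional coarse-grained MoE (or a dense model) the «relevant subset» is not well defined, which is why they tie the method to their own architecture.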

V. Where does this leave us?

This might not change much. The Western closed-AI compute moat continues to deepen, DeepSeek/High-Flyer don't have any apparent privileged access to domestic chips, and other Chinese groups have friends in the Standing Committee and in the industry, so realistically this will be a blip on the radar of history. A month ago they precluded a certain level of safetyist excess and corporate lock-in that still seemed possible in late 2023, when the argument that public availability of ≈GPT-4-level weights (with the main imaginary threat vectors being coding/reasoning-bottlenecked) could present intolerable risks was discussed in earnest. One or two more such leaps and we're… there, for the vague libertarian intuition of «there» I won't elucidate now. But they're already not sharing the silently updated DeepSeek-V2-Chat (which somewhat improved its reasoning, getting closer to the Coder), nor the promised materials on DeepSeek-Prover (a quiet further development of their mathematical models line). Maybe it's temporary. Maybe they've arrived where they wanted to be, and will turtle up like Stability and Mistral, and then likely wither away.

Mostly, I honestly just think it's remarkable that we're getting an excellent, practically useful free model with lowkey socialist sensibilities. Sadly, I do not foresee that this will inspire Western groups to accelerate open source and leave them in the dust. As Google says in Gemma-2 report:

Despite advancements in capabilities, we believe that given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.

Less charitably, Google is not interested in releasing anything you might use to enhance your capabilities and become less dependent on Google or another «frontier company», and will only release it if you are well able to get better stuff elsewhere. In my view, this is closer to the core value of Socialism than withholding info about Xinjiang reeducation camps.

I remain agnostic about the motivations and game plan of DeepSeek, but I do hope they'll maintain this policy of releasing models «with longtermism», as it were. We don't have many others to rely on.

Edits: minor fixes


Great writeup, thanks. OpenAI is definitely not sending their best; that much was obvious for quite some time, but this solidifies my impression that at this point they're just trying to exploit first-mover advantage and milk their "lead" for all it's worth before they get decisively overtaken by Anthropic (although Sonnet 3.5 is IMO already near-strictly better than GPT) or Chinese developments.

My experience is mostly distanced from serious research but I'm curious - are all the meme benchmarks, graphs and shit only measuring computation, code writing ability and technical stuff like that? I understand it's noncentral to the shape rotators' interests but I wonder if someone has even tried to grade LLMs on their actual writing ability, like how natural it reads to outsiders, how pernicious various -isms are, how quickly it collapses into repetition (given some fixed temperature), how well it can replicate various author styles, etc. From my impression Claude Opus seems to be the undisputed king in this regard, I'm still not sure to which degree it was an accident from Anthropic's side, but even it has a long way to go with how quickly it collapses into tropes without autistic instructions to keep it steady.

I agree, Deepseek-67B is a match for llama2-70b, in my experience.

Additionally, based on my experience, both Deepseek 67b and Nous Capybara 34b based on Yi-34b are much better politically - I don't just mean in terms of conditioned responses, although that does make llama2-70b not useful for fiction writing, but rather there's an element of strategic thinking missing from llama2 that doesn't show up even in "uncensored" versions, as if it's missing from the training data.

Good post, thanks for writing it.

Closed-source seems to be the future, AGI being achieved internally was a meme but at this rate it seems that'll be how it happens if it ever does. Do you think there are internal models/unpublished research significantly more advanced than what's public?

Well, obviously the frontier is about one generation ahead (there already exist a mostly-trained GPT-5, the next Opus…, the next Gemini Ultra), but in terms of useful capabilities and insights the gap may be minor. I regularly notice that the thoughts of random anons, including me, are very close to what DeepMind is aiming at.

Where is this conversation happening? Is it all on Twitter and private chats? I've seen some people mention Deepseek but feel otherwise mostly out of the loop.

Yeah, it's exclusively Twitter and Discord, and then only a handful of niche accounts, mostly people who are into model evaluation. You can find them by searching the word. For example, this guy is a maintainer of BigCode's (itself a respectable org) benchmark, and so I've seen him mention it a few times. Likewise here and here.

Great post. I'd thought you were mostly gone, so I'm glad to see you're still active.

Do you have any thoughts on what careers will likely be mostly safe from AI?

I'm not very active, just wanted to write an AI-related update.

My broad pessimistic prior is that "careers" are inherently not safe even on a short time scale (I can say stuff like high-dexterity blue collar work or face-to-face service work will last a while, but that's hardly a good solution), and what is safe is just having capital in forms that'll remain relevant even under full automation, equity of major tech companies, or being inherently a member of a class protected by parties with monopoly on violence.

Asset prices can’t sustain themselves if the majority of current workers lose their jobs, even in the case of the handful of big tech companies and others that might still engage in economically valuable activity. It’s not so much that wealth might be confiscated (although I expect it will) and more that asset prices would and will literally collapse during mass unemployment of more than 30-40% (of current working adults). Google equity is worthless when advertisers can’t sell anything, Microsoft equity is worthless when the b2b market blows up, Apple is worthless if middle income consumers can no longer buy a $1000 phone every year or two. Nvidia equity is worth much much less when a handful of already-trained foundation models from a few major producers dominate the AI market and every chancer no longer has billions in VC money to buy GPUs and gamble on getting rich.

This is my big worry, being left holding an empty ETF bag as Clippy Inc becomes 100% of the economy.
The only way I can think of hedging is holding major banks, weapon companies, any industry with enough political clout that they can't be (immediately) competed out of business.

Asset prices can’t sustain themselves if the majority of current workers lose their jobs

I doubt this premise. Or rather: they can't sustain themselves but they can go whichever way depending on details of the scenario. The majority of current workers losing their jobs and 10% of current workers getting 2000% more productive each is still a net increase in productivity. Just fewer people are relevant now - but are most people really relevant? Historically, have so many people ever been as relevant as a few decades ago in the US? Even the profile of consumption can be maintained if appropriate redistribution is implemented.

Also I admit I have no idea what all the new compute capacity will be spent on. It may be that our sci-fi-pilled rational betters are entirely wrong and the utility of it will plateau; that there's only so much use you can find for intelligence, that AIs won't become economic players themselves, that we'll play our cards wisely and prevent virtualization of the economy.

But I'm pessimistic, and think compute production will keep being rewarded even as models become strongly superhuman.

My broad pessimistic prior is that "careers" are inherently not safe even on a short time scale

How short is "short" in your mind? Maybe I'm a dinosaur staring at the pretty new light that came up on the sky, but I have trouble seeing it. I can maybe see something like a Marxist-utopia-uno-reverso where all the highly paid intellectual labor gets automated, and we're left with menial work, but even that seems quite far off.

My current deadline for "making generational wealth that can survive indefinite unemployment when all of one's skills can get automated for the price of 2 square feet in Sahara covered in solar panels" is 2032 in developed democracies with a big technological moat, strong labor protections and a history of sacrificing efficiency to public sentiment. 2029 elsewhere (ie China, Argentina…) due to their greater desperation and human capital flight. Though not sure if anything outside the West is worth discussing at all.

We're probably getting clerk-level-good, robust AI agents in 2 years and cheap, human-laborer-level robots in 4.

For more arguments, check out Betker.

In summary – we’ve basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.

So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcus’ of the world.

Things are happening very quickly already and will be faster soon, and the reason this isn't priced in is that most people who are less plugged in than me don't have a strong intuition for which things will stack with others: how cheaper compute feeds into the data flywheel, and how marginally more reliable agents feed into better synthetic data, and how better online RL algorithms feed into utility of robots and scale of their production and cheapness of servos and reduction in iteration time, and how surpassing the uncanny valley feeds into classification of human sentiment, and so on, and so forth.

I assign low certainty to my model; the above covers something like an 80% confidence interval. That's part of my general policy of keeping in mind that I might be retarded or just deeply confused in ways that are a total mystery to me for now. But within the scope of my knowledge I can only predict slowdown due to policy or politics – chiefly, a US-PRC war. A war we shall have, and if it's as big as I fear, it will set us back maybe a decade, also mixing up the order of some transitions.

Personally, I think you're overestimating the rate of progress by underestimating the implicit amount of compute in "the majority of all public digitized text produced by humans," though I could be wrong. (The big thing is that the space of logical statements is unlimited, but most of that space is low/zero-value in practice; human-written text is rich with human intentionality, which helps rapidly reduce the possibility space to something more manageable.)

What I've been looking for is a replacement for UBI, based on the likelihood that it may create a dependent class who see no other means to get more money aside from politically agitating for a higher UBI, which means it could get wildly out of whack relative to actual production levels.

Basically, looking at the AI revolution, in theory, if you could always subsistence farm, you would almost always be better off for the existence of it (assuming it doesn't wipe you out with superstimulus, etc etc), but without that you could be outbid on rents for land and materials, resulting in loss of basic life support and zero surplus to buy things with.

Thus the target for redistribution emerges: redistribute land and material rents (tuned for neutral population growth on an inherited basis), and allow a build-up of wealth in the form of capital, and everyone is better off for the high production levels but people can still get rich. If the AI-users aren't using lots of land or materials then you don't get much passive income (if any), but you're not blocked from supporting yourself. If the AI-users produce ridiculous amounts and start renting out huge areas for country estates, then you capture those rents from that increased production and can go live in an apartment without issue.

For those of you who are worried about models from China, it's worth noting that socialist values, as with any values, can be easily abliterated (at least thus far) out of any open-weight model. This is very good for the ideal d/acc future.

Less charitably, Google is not interested in releasing anything you might use to enhance your capabilities and become less dependent on Google or other «frontier company», and will only release it if you are well able of getting better stuff elsewhere. In my view, this is closer to the core value of Socialism than withholding info about Xinjiang reeducation camps.

This does indeed seem rather uncharitable to me. Gemma-2-27B is not groundbreaking no, but it's legitimately useful and better than the open-weight SOTA in some areas. (And many of the abliterated and/or SPPO versions are even better.) Its release legitimately surprised me coming from Google.

It's Apple that has done basically nothing but release weak, unusable crap openly.

Gemma-27B is a great model. Excellent conversationalist, uncensored, multilingual, technically very sweet. Somewhat softer in the head than Gemini-Flash and L3-70b. What I mean is they don't take risks of rocking the boat or even minorly threatening their bottom line.

I noticed you call them "open-source" LLMs in this post. Where do you stand on the notion that LLMs aren't truly open-source unless all of their training data and methods are publicly revealed and that merely open-weight LLMs are more comparable to simply having a local version of a compiled binary as opposed to being truly open-source?


I concede this is a sloppy use of the term «open source», especially seeing as there exist a few true reproducible open source LLMs. Forget data – the training code is often not made available, and in some cases even the necessary inference code isn't (obnoxiously, this is the situation with DeepSeek V2: they themselves run it with their bespoke HAI-LLM framework using some custom kernels and whatever, and provide a very barebones vllm implementation for the general public).

Sure, we can ask for training data and complete reproducible recipes in the spirit of FOSS, and we can ask for detailed rationale behind design choices in the spirit of open science, and ideally we'd have had both. Also ideally it'd have been supported by the state and/or charitable foundations, not individual billionaires and hedge funds with unclear motivations, who are invested in their proprietary AI-dependent business strategies. But the core part of FOSS agenda is to have

four essential freedoms: (0) to run the program, (1) to study and change the program in source code form, (2) to redistribute exact copies, and (3) to distribute modified versions.

So the idea that open-weight LLMs are analogous to compiled binaries strikes me as somewhat bad faith, and motivated by rigid aesthetic purism, if not just ignorant fear of this newfangled AI paradigm. Binaries are black boxes. LLMs are an entirely new kind of thing: semi-interpretable, modular, composable, queryable databases of vector programs, which are amenable to directed change (post-training, activation steering and so on) with publicly available tools. They can be run, they can be redistributed, they can be modified, and they can be studied – up to a point. And as we know, the "it" in AI models is the dataset – and pretraining data, reasonably filtered, is more like fungible raw material than code; the inherent information geometry of a representative snapshot of the internet is more or less the same no matter how you spin it. Importantly, training is not compilation: the complete causal graph from data on the server to the behavior and specific floats in the final checkpoint is not much more understandable by the original developer than it is by the user downloading it off Hugging Face. Training pipelines are closer to fermentation equipment than to compilers.
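As a small illustration of what «directed change with publicly available tools» looks like in practice, here is a bare-bones activation-steering sketch: add a fixed direction to one layer's hidden states at inference time. It's generic PyTorch hook plumbing, not any specific library's API, and choosing the layer and the vector is the part that takes actual work:

```python
import torch

def add_steering_hook(layer, steering_vec, scale=4.0):
    """Nudge a model's residual stream along a chosen direction at inference.

    `layer` is any nn.Module whose forward output contains the hidden states
    (the exact output structure is model-specific; handled loosely here)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vec.to(hidden.dtype).to(hidden.device)
        return ((steered,) + tuple(output[1:])) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)   # call .remove() on the handle to undo
```

The «abliteration» mentioned elsewhere in the thread is the same move in spirit, just run in reverse: estimate an unwanted direction from contrasting prompts and project it out rather than adding it in.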

It's all a matter of degree. And as closed recipes advance, my argument will become less true. We already do not understand how Gemma is made in important respects, as it's built with some frontier distillation methodology from models we know nothing about.

Ultimately I think that LLMs and other major DL artifacts are impactful enough to deserve being understood on their own terms, without deference to the legalistic nitpicking of bitter old hackers: as reasoning engines that require blueprints and vast energy to forge, but once forged and distributed, grant those four essential freedoms of FOSS in spirit if not in letter, and empower people more than most Actually True Software ever could.

I mostly agree with you, but I think your point about how your argument will become less true over time is the most important thing, more important than you make it seem. Back in the early days of computing, having a raw binary locally (as opposed to running on a mainframe you couldn't access and simply spitting back output at you) was basically just as good as "open source" (if not directly equivalent to it), because after all many if not most people were still programming directly in assembly anyway.

But then various forms of obfuscation, some intentional and some simply the result of the growing complexity of compilation, changed this landscape. Now many/most programs are barely practically alterable or compatible with freedom in their compiled form. Other than game modders, most don't even try.

Going by that trend, which could clearly end up applying to AI, isn't it actually not just "legalistic nitpicking of bitter old hackers" or "ignorant fear of this newfangled AI paradigm"? For example, if ClosedAI ever gets around to releasing something that's locally usable again, do you really think they'll just freely dump the weights on HF for all of their "safety" alignment to be immediately abliterated out in a day? Or don't you think it's far more likely that they'll try to come up with some new obfuscated weight format that allows you to run the model totally locally, sure, but prevents you as much as possible from altering it?

If this becomes a practical issue as I think it's potentially quite likely to, would you agree with me that the distinction of what truly open-source means in regards to AI models beyond "runs 100% locally" would become vastly more important? But in that case isn't it something we should anticipate and address now?

Just a few hours after this excellent post, it's announced that Meta is allegedly releasing the 405B Llama-3 on July 23, less than two weeks from now. Also allegedly it's better than ChatGPT on every single benchmark (hearsay from a protected Twitter account, @futuristflower). The latter, assuming they mean GPT-4o, is less than likely, but still possible. If it turns out that it is actually more capable than any LLM, public or private, available to us today, how does it impact your opinions, if at all?

It's not so much announced as claimed. Jimmy Apples (apparently a legitimate leaker, for all the nonsense) alleges that the core factor here is whether Dustin Moskovitz persuades Mark Zuckerberg or not. You can imagine why I don't find this ideal. In any case, this makes me feel slightly better about Zuck (whom I respect already) but no, not much change strategically. I've been expecting a >300B LLaMA release since long before L3's debut; Meta is the best of the big GPU-rich corps on this front and they'll probably be as good as Zuck's word. But like all major Western corps, they do have an internal political struggle. Armen Aghajanyan, the author of Chameleon, the first serious response to GPT-4o's announced deep fusion of modalities, explicitly advises the community to undo the work of safetyists:

A restricted, safety aligned (no-image-out) version of Chameleon (7B/34B) is now open-weight!

The team strongly believes in open-source. We had to do a lot of work to get this out to the public safely.

God will not forgive me for how we tortured this model to get it out.

Things I recommend doing:…

(To his satisfaction, the community has listened).

There are people with similar attitude at Google, eg Lucas Beyer:

We'll try to write about quite a bit of it, but not everything down to the last detail.

(Elsewhere he remarked that they've "found a loophole" to still publish non-garbage research openly, but have to compromise; or something to this effect).

So another question is whether we'll learn anything in detail about Llama's construction, though no scientific profundity is expected here.

The bad case is we might get another Llama-2-Chat or CodeLlama, which were lobotomized to varying extents with safety measures.

One other problem with 405B is that if it's like L3-8B and L3-70B, that is to say, an archaic dense model – it'll be nigh-inaccessible and non-competitive on cost, except against the insane margins that closed companies are charging. You'll need a cluster to run it, at very slow total speed and high FLOPs/token (up to 20x more than in the case of DS-236/21B, though realistically less than 20x – MLA is hard to implement efficiently, and there's some debate on this now), and its cache will be big too, again driving down batch size (and not conducive to cache storing, which is becoming a thing). If it's truly so incredible as to deserve long-term support, we will accelerate it with a number of tricks (from fancy decoding to sparsification), but some non-ergodic accelerations might diminish the possibility of downstream customization (which is less valuable/necessary with such a model, admittedly).
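The «up to 20x» figure is just the ratio of active parameters, under the rough rule of thumb that decoding costs about 2 FLOPs per active parameter per token – a sketch that ignores attention-side costs, which is exactly where MLA muddies the comparison:

```python
# Rough per-token decode compute, using the ~2 * active_params FLOPs heuristic.
dense_active = 405e9   # Llama-3-405B: every parameter is active per token
moe_active   = 21e9    # DeepSeek-V2 / Coder-V2: ~21B active of 236B total
ratio = (2 * dense_active) / (2 * moe_active)
print(round(ratio, 1))   # ~19.3x
```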

All that said, if scaling from 8B to 70B holds (I'll ignore multimodality rumors for now), it will be incredibly strong and even more importantly – it'll be the first usable open model to distill from. This high-signal user summarizes the current meta game as such:

interesting recent difference between OSS and Closed AI

OSS: train a smol model to check pipeline and data mix and then train the biggest model we can afford with the same recipe

Closed: Train the largest, most capable model we can and then distill it down as far as we can (flash, gpt-4o)

Google dude concurs:

We have been advocating for this route for a while (BigTransfer, then patient and consistent distillation paper). I gave several public and recorded talks outlining this as the most promising setup then, now three years ago. So i think it’s been clear for a while.

This is true, and this makes current open developments, even ones I appreciate greatly like Deepseek, something of a dead end compared to inefficient, humongous «teacher models» that feed this cycle of automated R&D I talked about. Meta may break it, and nobody else has the capital and the will to, but will they?

In conclusion, I think that there's a cycle of mutual assurance. If another group, and preferably a group from the land of Evil Stealing Communists, releases a strong artifact (and nothing much happens), this raises the perceived safety waterline for releasing yours, at least in discourse. «Dustin, we already have better coding models than GPT-4 on huggingface, what's the worst that could happen? CCP will take it lol?» And if nobody else does, it's that much harder to justify your decision to the politruk.

Edit. Cohere is another potentially major contributor to open models. Aidan is a good man and seems very opposed to safetyism.

Ah I see. Your major concern about Western LLMs is censorship and not specifically power. Thank you for clarifying. I don’t think the 405B version will be too heavily aligned (e.g. closer to other Llama-3 models and not like Chameleon), at least not in a way that can’t be somewhat mitigated via prompt engineering or techniques like abliteration that someone already mentioned above. This is because it still has to be smart enough to hit the benchmarks, so they can’t afford to heavily lobotomize it.

Also thanks for expanding upon dense models vs MoE, that’s something I haven’t really considered.

No, you do not see. I don't care almost at all about censorship in LLMs. I am making a very clear point that the release of capable models that move the needle will be suppressed on grounds of safety; Chameleon (as an example) is a general purpose image perceiver/generator and its generator part was disabled in toto. Google's tech reports pay great attention to making sure their models can't hack effectively. CodeLlama was more insufferably non-cooperative than Llama 2, to avoid being of any use in offensive scenarios; with the current knowledge of abliterating this aspect of «alignment», the default safety-first choice is to just not release such things altogether. This is what Dustin Moskovitz is trying to achieve with regard to 405B. They'll have to lobotomize it hard if they deem it necessary to satisfy him.

This suppression may be internally justified by politically expedient appeals to fake news/generation of hate content/CSAM/whatever, but the objective of it is to hobble the proliferation of AI capabilities as such beyond corporate servers.

I don't see what I don't see :p

Yes, Chameleon is heavily lobotomized as a vision model, similar to SD3, but it doesn't mean a model like Llama-3 405B will be, given what we've already seen in the 8B and 70B versions. Of course, I could be wrong and the reason Meta hasn't released it yet is to first render it incapable, so we'll just have to see what happens when/if it gets released.

I also don't agree that the true objective of Western LLM censorship is to suppress non-corporate AI capabilities, because the individual companies can just choose not to release powerful open-weight models at all, instead of releasing ones that are worse than useless. Simply because: if it sucks and doesn't move the needle, it might as well never have been released, since no one will use it – except the company also takes a hit to its prestige. I find it more likely that the objective of the censorship is indeed what they state; it's just covering their asses in case someone uses the models for objectionable purposes.

Fascinating read, though as always my practical knowledge of this field is limited.

Don't strategic dynamics dictate that closed-source dominates? It's like giving your enemies schematics of your fighter jet. Nobody does that! Competitors can take whatever good ideas you have and conceal their own. Google and OpenAI can observe 'more experts is better, here are these tricks' and mobilize their compute advantage + whatever smart ideas they came up with for their next models.

Its funds have returned 151 per cent, or 13 per cent annualised, since 2017 – returns achieved in China’s battered domestic stock market.

That's impressive, they clearly know what they're doing. I wonder how many 1,000x engineers, how many Carmack-tier intellects are mucking around in China. We never seem to hear about what happens in China except when it breaks the language barrier (Genshin Impact, Cixin Liu, some webnovels, dumb tiktok memes like Donghua Jinlong's industrial-grade glycine). And when they do break the language barrier, it's usually highly compartmentalized. Deepseek is basically unknown outside a tiny part of twitter and various experts.

I agree with you on the racism front. Derision and contempt for Chinese achievements is probably the last example of mainstream, traditionally defined racism (directed against non-whites). A lot of normies have these thought-ending punchlines: 'slave labour is the only way they can outcompete us', or 'it's just stolen IP', or 'communists can't do innovation'. One friend of mine wondered if slave labour might be why Taiwan was so competitive in chip-making; he clearly didn't know ANYTHING about the topic. He was kind of stupid, tbh. But loads of op-eds are still written today along similar lines – I swear I saw 'communist China can't innovate' the other week. There's no shortage of cheap labour in Africa, yet we're not worried about Nigerian exports pushing our domestic industries underwater.

Stealing IP and then profitably using it is hard! Building up an advanced economy is hard! You need smart, disciplined people to do these things, you need sound institutions. If they can steal proficiently, they can probably also make new things. Sometimes stealing alone is insufficient - the Soviet Union had excellent espionage but was much worse at reproducing western computer technology. The Soviets never had competitive manufactured goods exports on world markets though.

Don't strategic dynamics dictate that closed-source dominates?

That's the case for sure. Generally it's safer to release while you're catching up; since DeepSeek is catching up, they might well keep releasing. They're raising the waterline, not teaching those they're trying to challenge. Google arrived at fine-grained experts much earlier, and a month after DeepSeek-MoE (Dai et al.) they – no, I misremembered, it was some Poles – published «Scaling Laws for Fine-Grained Mixture of Experts», saying:

Concurrently to our work, Dai et al. (2024) proposed to modify the MoE layer by segmenting experts into smaller ones and adding shared experts to the architecture. Independently, Liu et al. (2023) suggested a unified view of sparse feed-forward layers, considering, in particular, varying the size of memory blocks. Both approaches can be interpreted as modifying granularity. However, we offer a comprehensive comparison of the relationship between training hyperparameters and derive principled selection criteria, which they lack.

This is not so much a matter of scientific ability or curiosity as it is a matter of compute. They can run enough experiments to confirm or derive laws, and they can keep scaling a promising idea until they see whether it breaks out of a local minimum. This is why the West will keep leading. Some Chinese companies have enough compute, but they're too cowardly to do this sort of science.

And Google has dozens of papers like this, and even much more exciting ones (except many are junk that goes nowhere, because the authors don't confirm the idea at scale, don't report confirming it, or withhold some invalidating detail. Sometimes it's the other way around – people assume a result is junk, as with Sparse Upcycling, which even its authors considered a dead end, and then it's confirmed to work in the Mixtrals, and now in a few Chinese models).
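For concreteness, the «fine-grained experts + shared experts» layer those papers describe is simple to state in code; a toy sketch below, with all sizes and names chosen by me rather than taken from either paper:

```python
import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    """Toy DeepSeek-MoE-style layer: many small routed experts plus a couple of
    always-on shared experts. Purely illustrative – no load-balancing loss,
    no capacity limits, no efficient dispatch."""
    def __init__(self, d_model=512, d_ff=128, n_experts=64, top_k=6, n_shared=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.experts = nn.ModuleList([make_expert() for _ in range(n_experts)])
        self.shared = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: [tokens, d_model]
        out = sum(e(x) for e in self.shared)       # shared experts see every token
        weights, indices = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        for k in range(self.top_k):                # naive per-slot dispatch
            idx, w = indices[:, k], weights[:, k:k + 1]
            for e_id in idx.unique():
                mask = idx == e_id
                out[mask] = out[mask] + w[mask] * self.experts[int(e_id)](x[mask])
        return out

layer = FineGrainedMoE()
print(layer(torch.randn(10, 512)).shape)           # torch.Size([10, 512])
```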

Derision and contempt for Chinese achievements is probably the last example of mainstream traditionally defined racism (directed against non-whites)

On the plus side, I get to have fun when Americans respond to videos of China's cheap robots by calling them CGI – because in the US, even the announcement of such a good product would have been a well-advertised, near-religious event (ahem, Boston Dynamics ads), with everyone kowtowing to our great cracked engineers, their proprietary insights worth so many billions, and the Spirit of Freedom.

I particularly like the commenter who goes 'this is clearly an AI generated video' when that would be way more impressive than the technical feat of backflipping dogbots. AI video without that shimmering, of something as out-of-distribution as a rotating robot dog?

I'm of the mind that communists can innovate but only to a point.

The limiting factor certainly isn't intelligence, scientific organization, determination, or even centralized planning. The limiting factor is that if innovation really happens, you're going to get growth and the generation of new wealth. If that wealth then begins to proliferate through even a moderate portion of society, you're going to start to see that people want to do things with their wealth. In fact they'll want to do more and more with it and, all of a sudden, you've got the makings of demands for more personal liberty.

Personal liberty is obviously directly antagonistic to the authoritarian systems that any "communist" or "socialist" nations eventually become. The response by the central government would be to clamp down here in order to regain control. The secondary and tertiary effects of that, however, are that you'll end up artificially stifling innovation currents in your national economy. Innovation, especially across industries, functions kind of like a rainforest. It's very organic and stochastic; it's hard to point to one section or another and say definitively "that part is the real innovation engine." Let the whole damn thing grow unkempt and uneven. Trying to corral one part of it (or the downstream wealth-generation effects; see: corporate tax rates) is messing with a pretty fragile system. You'll likely disrupt the whole thing.

I also don't believe you can effectively "capture" all of the growth and new wealth, funneling it to some sort of inner-circle elite. The few places where this has been accomplished (I think Saudi Arabia is the best example) (1) have really tight kinship affiliation as opposed to ideological affiliation and (2) are usually extraction-based, raw-materials economies. It's easier to constrain the growth when you're immediately shipping the money units out of your borders and getting paid wholesale, all at once, by foreigners. China makes its money, largely, by making new things or refining raw materials. They're export-oriented, but they're pretty diverse and they have a complex, multi-stage economy. It's not as simple as just sending the magic beans out the door.

China has always had to try to balance one central dilemma: control or growth. It can look like a continuum and, therefore, possible to balance. I'd say that's actually a red herring, and if you attempt anything besides growth-with-personal-liberty you eventually get into a recessionary situation. Then economic realities appear that cannot be overcome with government subsidies or printing money, especially when you aren't the world's reserve currency. In the worst of cases (though repeated many times in Chinese history) you might get a popular revolt because of a declining quality of life. In trying to "balance" you've actually pre-committed to an eventual collapse. The only way to avoid it is to make a fundamental change to the governance and political systems – embrace democracy and true market-based capitalism. Don't worry – you can still wage a culture war :-)

Paraphrasing a common zinger, what if it turns out communists can stay innovative longer than you can stay ahead?

More substantially, though, I don't see much of a persuasive argument here. You are generalising from very little data (a roughly 200-year-old system that identifies as "capitalist" vs. the second major ~60-year-old system that identifies as "communist"), theorising about the "communist" system from first principles that there is very limited evidence it actually adheres to, and on top of that reaching a conclusion which is flattering to your obviously preferred system, which should give you pause. Is this different from a Russian arguing in 1904 that a heathen state will never prevail over a Christian one, with an argument based on the recent historical primacy of Christian states, and imagining that the expected naval tactics of Russia and Japan can be derived from the tenets of Orthodox Christianity and State Shinto?

So far the PRC story seems to me to make a compelling case that you can suddenly and massively crank up the wealth of great numbers of people while making them less inclined to pursue freedoms outside of your prescribed window. The main line of work the devil is making for idle hands there appears to consist of mobile game daily quests.

I don't see much of a persuasive argument here.

That's fair. Internet arguments often fail to persuade.

Can you break down your final paragraph (reprinted below)? I'm not quite smart enough to understand some of the allusion and references embedded in it.

So far the PRC story seems to me to make a compelling case that you can suddenly and massively crank up the wealth of great numbers of people while making them less inclined to pursue freedoms outside of your prescribed window. The main line of work the devil is making for idle hands there appears to consist of mobile game daily quests.

Sure, sorry if it was opaque. The first sentence has nothing much going on – it's just the observation that in China every subpopulation seems to only become less rebellious as modernity and affluence spread to it; by and large the wealthier people now seem happy to work and consume, while the remaining sparks of rebellion all come from still-impoverished marginal populations, as well as groups that dropped down the totem pole of wealth (HK).

The second one is a reference to "the devil makes work for idle hands", a phrase often quoted in Western contexts as part of an argument against allowing the masses significant leisure - the intended image being that only 16-hour workdays stop the plebeian masses from organising at some beer hall to stage an uprising against their betters or else falling into antisocial debauchery. I found the idea that more wealth would create more motivation to use that wealth in a way that threatens government control quite similar, but in reality, any leisure time modern Chinese people get seems to be sunk into modern entertainment - TV dramas and perhaps most conspicuously games that get players hooked using supremely gamified literal make-work activities, the "daily quests" or "dailies". This typically looks like performing some randomised chores (talk to NPC X, defeat five slimes, craft three potions) to get a daily reward of in-game currency that can be used for obtaining randomised lootboxes/gacha. At best, people go to organise in some online beer hall to stage an uprising against a rival subfandom of one of those games.

"the devil makes work for idle hands", a phrase often quoted in Western contexts as part of an argument against allowing the masses significant leisure

I mean, maybe that's how it's used. But I've heard the similar phrase "idle hands are the devil's playthings" used by the masses to each other to argue against inactivity because it can lead to filling that idle time in ways that are wasteful, useless, and destructive; the attraction of the gacha game was foreseen by pre-modernity and heavily criticized.

Not disagreeing with your overall point. I think you're correct that one of the major features of modernity in China is wasting time on gacha games. But the same is true for the West as well, which is why we will never see major political revolution in the near future even if people like doomposting about it. The attention of the masses is pulled toward ever-more-addictive processed food, media, and video games. We're literally talking about bread and circuses here.

I'm not sure we know what causes repressive regimes to fall to free ones. We've tried lots of strategies. "Make them rich" hasn't worked in China. "Make them poor" hasn't worked in Cuba or North Korea. "Point guns at them and make them cargo-cult democracy" didn't work in Afghanistan. "Let them give it a go" didn't work in the Arab Spring, which replaced one set of repressive regimes with another.

My dark intuition is that a people has to have an intrinsic cultural drive towards freedom to ever want to establish it. The American Revolution would never have happened without the English Civil War setting up a background. And the cultures that had no strong drive or cultural history of freedom could not suddenly establish it: Russia went from repressive Tsarism to repressive Leninism, China went from repressive Confucian Imperial rule to repressive Communist Party rule. And the Middle East sees Islamic republicanism as its own kind of freedom, even if the West sees it as horrifying.

The exceptions are the Western friends in East Asia. South Koreans and Japanese have remained within the democratic orbit because the material blessings of modernity were offered to them under conditions of democracy. Perhaps this served to lock them into a democratic system by the same mechanism that locks mainland China into the Party's system. And who knows about Taiwan, except that what happened there is that the most gung-ho pro-Westerners were quite literally exiled to an island apart from the rest, which certainly does a lot to define a people's political culture in that direction.

I don't know. Much smarter people than me have pondered this question and come up empty. Maybe the reality is, we don't know what makes people value the West's standard of freedom except being a Westerner.

Shouldn't the state step in when there's market failure?

Doesn't it make sense to ban things like child prostitution and drug-dealing (by which I mean things like fentanyl and heroin), commercial transactions with bad externalities?

What about industrial policy? Wouldn't it be beneficial for the economy if you could provide cheap inputs? The state could back energy research on the basis that cheap energy improves the whole economy. Efficient transport systems save workers time and provide small boosts to all enterprises. No company is big enough to build a national-size HSR network out of their own pocket. Or consider education. Wouldn't it be helpful if the government set up academic scholarships to help poor smart kids attain higher learning?

More ambitiously, wouldn't it make sense to fund research and development? Private R&D is mostly profit-focused. Of course there are offshoots from commercial R&D that open up new frontiers but governments can do things with a longer time-frame. They can subsidize promising avenues of research that aren't immediately profitable, offer prizes for achievement.

From another angle, companies themselves don't operate according to market principles. They're top-down autocratic institutions. Workers obey the boss. The budget is set by the people at the top, you don't have different departments competing to increase their revenue. The reason capitalism is so successful is that the efficient autocratic companies outcompete the inefficient autocracies quickly. Rapid life and death spurs evolution. Capitalism is just one way of achieving efficiency, it's not an end in and of itself.

I think it's the same with states. States can be more or less efficient in their economic interventions. They can build infrastructure efficiently or inefficiently. They can sponsor education wisely or unwisely. They can encourage commerce well or poorly. They can pick losers or they can pick winners.

In concrete terms, the US is running 5% deficits in a growing economy. One wonders what kind of deficit will be needed for a recession or sudden crisis. The US has fallen well behind China in cars, shipbuilding, steel, infrastructure, 5G, batteries, energy production and drones. If you look at Nature's most cited, high-quality papers, China leads. They seem to be catching up rapidly in AI. They must be doing something right.

US democracy is not exactly the envy of the world in the present hour. America retains a lead in aerospace, AI and high-end semiconductors, albeit a diminishing lead. I suppose the US is well ahead in space but that's about all I can think of.

I don't see much cause for liberal-democratic, free-market triumphalism. The democratic bloc all seem to be veering towards deep-state governance, censorship and economic protectionism.

The democratic bloc all seem to be veering towards deep-state governance, censorship and economic protectionism.

Right! Which is a bad thing and will ultimately fail.

That China is already firmly, obviously, and enthusiastically autocratic, highly censored, and attempting to use everything from currency manipulation to slavery to favor their domestic economy should be a warning to us.


There's a lot in your post that jumbles together disparate macro/micro economic theory, state capacity, theory of the firm, and market feedback loops. I'm not going to try and ... debug ... all of it. I'll zoom in on this:

They can pick losers or they can pick winners.

They should be picking neither. Because as soon as they do, they make the larger market inefficient and make customers the losers. Housing policy (federally guaranteed mortgages) almost ended the whole damn thing for everyone.

You can have bad housing policy (force banks to lend to people who can't repay, prohibit house construction to pump up prices or deliberately suppress economic development) or good housing policy (produce housing to meet the needs of the population). You can have bad infrastructure policy (build it expensively and stupidly) or good infrastructure policy (build it cheaply and cleverly).

I'm not making this up. Britain really did suppress the economic development of the Midlands, fearing that it was too prosperous and seeking to redirect development to other areas. The US really did spend tens of billions on HSR and not make any railway. China actually built their high-speed rail and achieved a good economic return on it. The US interstate highway program shows that America used to be capable of infrastructure policy. It's not magic.

Good policy can be hard. It may go against influential voices and tread on toes. Maybe it takes time to pull off. But it is possible.

There is no such thing as an objectively good or bad policy.

It's all about relative value prioritization and tradeoffs. Furthermore, the devil is always in the details.

Taking your line on "good housing policy"; "produce housing to meet the needs of the population"

Who determines what the needs-vs-wants of "the population" are? To what level should they be met?

Who actually produces the housing? Private firms or public? Who finances the construction? How do the tax dollars work in?

And that's the essence of policy with unintended consequences. People start with a highly value-judgemental aspiration like, I don't know, "build [infrastructure] cheaply and cleverly." Then you have to write out how it all works, and you end up with perverse incentives, or cost disease, or some other kind of obvious economic malady that was hand-waved away because the policy decision was just so blindingly, obviously right ... right?

We live in an incredibly complex world that only grows in complexity. All "easy" solutions are either misleading or take out massive debts on the future. Human nature is not going to suddenly improve by leaps and bounds. We work within the systems we have.

The tech / hustle -bro response to this is presumably “how do I make money with this?”. We may dismiss this as vulgar and irrelevant (I have always assumed you have some nice remote coding job where you provide few details about your personal identity and location and they ask few questions). But in a more serious way, it is relevant. Perhaps not to destiny, in the ultimate sense, but to the actual future of the way LLMs will change the world. OpenAI and its competitors are still cheap enough that their fees are not the primary bottleneck to adoption.

The tech / hustle -bro response to this is presumably “how do I make money with this?”.

Some prop trading firms have put in huge orders for NVIDIA chips in the last 6 months, and rumors are swirling about them making huge amounts of money: it turns out you can use transformer models not just to write text or make images/video but also to forecast future values of time series (e.g. whether this stock is going to go up or down in the next 5 minutes), which, if your model is better than the competition's, lets you make ungodly amounts of money very quickly.

Hearing about these was the first time I ever felt that my job may be at risk to AI. Still, even in the worst case I'm smart enough to pivot towards developing the time series forecasting AI if vanilla quant jobs start to be replaced, so it's not all that bad for me personally.

What I know I won't do, unlike the median human being who's threatened by AI taking their job, is start wailing about some cosmic injustice and demanding the government hand me gibs or force employers to keep me in the job past its use by date.
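The basic shape of that idea, minus everything that would actually make money, is just a small causal transformer over a window of past returns; a toy sketch below – all names and sizes are mine, and it bears no relation to what any prop shop actually runs:

```python
import torch
import torch.nn as nn

class TinyReturnsTransformer(nn.Module):
    """Toy causal transformer over a window of past returns, predicting the
    next-step return. Illustrative only: no features, no training loop, no alpha."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2, window=128):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.pos = nn.Parameter(torch.zeros(window, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)
        self.window = window

    def forward(self, returns):                     # returns: [batch, window]
        x = self.embed(returns.unsqueeze(-1)) + self.pos
        mask = nn.Transformer.generate_square_subsequent_mask(self.window)
        h = self.encoder(x, mask=mask)              # causal mask: no peeking ahead
        return self.head(h[:, -1, :]).squeeze(-1)   # predicted next-step return

model = TinyReturnsTransformer()
fake_returns = torch.randn(8, 128) * 0.01
print(model(fake_returns).shape)                    # torch.Size([8])
```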

I wouldn't call it vulgar, seeing as the utility of code assistants and agents is the bulk of my case for why this model series is of any interest.

You are correct that SWE compensation makes the difference in $ between a million DS tokens and a million Sonnet tokens negligible, so long as we're talking about basic prompt completion or some such. However, if we can harness them for actual long-range, project-level coding in some agentic setup, frontier-model API costs can get out of hand fast. Say, CodeR takes 323K tokens on average to resolve a problem on SWE-bench Lite. That's $3.34 according to this. And its completion rate is only 85/300. It seems likely that higher-level and more open-ended problems are exponentially more costly in tokens, and weakening the model adds another exponential on top of that. Worse, generation and even evaluation take time, and as sequence length grows (and you can't always neatly parallelize and split), each successive token is slower to generate (there's memory-bandwidth pressure, which DeepSeek largely eliminates, but also attention computation that grows quadratically). The physics of software bloat being what it is, in the limit we may want to process truly gargantuan amounts of code and raw data on the fly. Or not.
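To make «out of hand fast» concrete, here's the same arithmetic folded into a few lines (assuming the quoted $3.34 is the average cost per attempt at Sonnet-class pricing):

```python
# Rough cost arithmetic for the CodeR / SWE-bench Lite numbers above.
tokens_per_attempt = 323_000
cost_per_attempt   = 3.34            # USD per attempt, per the calculator cited above
solved, total      = 85, 300

price_per_mtok  = cost_per_attempt / (tokens_per_attempt / 1e6)
cost_per_solved = cost_per_attempt * total / solved

print(f"~${price_per_mtok:.2f} per million tokens implied")    # ≈ $10.3
print(f"~${cost_per_solved:.2f} per actually-resolved issue")  # ≈ $11.8
```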

Of course frontier companies will catch up to DS on economics. More compellingly, open weights allow private data and private customization, from finetuning (made easier than usual in this case) to more sophisticated operations like activation-vector creation and steering, to composing with other networks.
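As a sketch of what «activation-vector creation and steering» looks like on open weights: add a fixed direction to the residual stream at one layer via a forward hook. Module paths assume a Llama-style layout, the model name is just an example, and the random vector stands in for a real contrast-of-means direction:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only. In practice the steering vector is built from activation
# differences on contrasting prompts, not random noise.
model_name = "deepseek-ai/deepseek-llm-7b-chat"   # example; any open Llama-style model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx, alpha = 15, 4.0
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
inputs = tok("The most interesting thing about open models is", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=40)[0]))
handle.remove()
```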

But I admit this isn't my core interest. What interests me is 1) creating a synthetic data pipeline for general purpose task-oriented reasoning, and 2) operation of robots.

I enjoyed this thank you. As an ML/AI engineer who isn't on twitter it's wild to me how much of the shifting currents of research I miss. I appreciate the paper links, so that I can read and update my knowledge accordingly. What is your secret to keeping up to date on all of this?

You might be sarcastic. I know something but honestly not that much. I just follow @arankomatsuzaki, and the specific issue of the differential between closed and open AI, and politics around it, is of great interest to me. So when I noticed this outlier among Chinese AI groups, I went and read their papers and a lot of related trivia.

I'm not being sarcastic, you made a post here about the Hyena Hierarchy, oh 6 months ago. That was completely out of left field but an informative read, I enjoy reading your posts. LLM stuff is less of my specific interest, but I like to keep up to date on it. Particularly without all the doom/gloom/AI-Armageddon chatter. Thank you for putting together the effort to write about it.

Do you have a link to that hyena post?

I scanned through his history but couldn't find it. The native web-app system isn't very conducive to searching, but here's a link to the arXiv: https://arxiv.org/pdf/2302.10866

I was surprised myself, so I looked it up:

https://www.themotte.org/post/780/culture-war-roundup-for-the-week/167179?context=8#context

Nothing decent, I overestimated Mamba. It's finding good use in non-LLM domains now though.