
The state of open-source LLMs as of summer 2024, or: Core Values of Socialism with AGI characteristics, V2

Some people here have expressed interest in my take on AI broadly, and then there's the DeepSeek-Coder release; but I've been very busy, and the field is moving so very fast again that it felt like a thankless job to do what Zvi does (and without his doomer agenda, too). Seeing the frenetic feed on Twitter, one can be forgiven for just losing the will; and, well, I suppose Twitter explains a lot about our condition in general. At times I envy Iconochasm, who tapped out. Also, this is a very niche technical discussion, and folks here prefer policy.

But, in short: open source AI, in its most significant aspects, which I deem to be code generation and general verifiable reasoning (you can bootstrap most everything else from it), is now propped up by a single Chinese hedge fund (created in the spirit of Renaissance Capital) which supports a small, ignored (except by scientists and a few crackpots on Twitter) research division staffed with some nonames, who are quietly churning out extraordinarily good models with the explicit aim of creating AGI in the open. These models happen to be (relatively) innocent of benchmark-gaming, but somewhat aligned to Chinese values. The modus operandi of DeepSeek is starkly different from that of either other Chinese or Western competitors. In effect this is the only known group both meaningfully pursuing frontier capabilities and actively teaching others how to do so. I think this is interesting and a modest cause for optimism. I am also somewhat reluctant to write about this publicly because there exist lovers of Freedom here, and it would be quite a shame if my writing contributed to targeted sanctions and even more disempowerment of the small man by the state machinery in the final accounting.

But the cat's probably out of the bag. The first progress prize of the AI Mathematical Olympiad has just been taken by a team using their DeepSeekMath-7B model, solving 29 out of 50 private test questions «less challenging than those in the IMO but at the level of IMO preselection»; Terence Tao finds this «somewhat higher than expected» (he is on the AIMO Advisory Committee, along with fellow Fields medalist Timothy Gowers).

The next three teams entered with this model as well.

I. The shape of the game board

To provide some context, here's an opinionated recap of AI trends since last year. I will be focusing exclusively on LLMs, as that's what matters (image gen, music gen, TTS etc largely are trivial conveniences, and other serious paradigms seem to be in their embryonic stage or in deep stealth).

  • We have barely advanced in true out-of-distribution reasoning/understanding relative to the original «Sparks of AGI» GPT-4 (TheDag, me); GPT-4-04-29 and Sonnet 3.5 were the only real – though still modest – steps forward, Gemini was a catch-up effort, and nobody else has yet credibly reached the same tier. We have also made scant progress toward consensus on whether that-which-LLMs-do is «truly» reasoning or understanding; sensible people have settled on something like «it's its own kind of mind, and hella useful».
  • Meanwhile there's been a great deal of progress in scaffolding (no more babyAGI/AutoGPT gimmickry; now agents are climbing the genuinely hard SWE-bench), code and math skills, inherent robustness in multi-turn interactions and responsiveness to nuanced feedback (to the point that LLMs can iteratively improve sizable codebases – as pair programmers, not just fancy-autocomplete «copilots»), factuality, respect for prioritized system instructions, patching badly covered parts of the world-knowledge/common-sense manifold, unironic «alignment» and ironing out Sydney-like kinks in deployment, integrating non-textual modalities, managing long contexts (merely usable 32K "memory" was almost sci-fi back then; now 1M+ with strong recall is table stakes at the frontier, with 128K mastered on a deeper level by many groups) and a fairly insane jump in cost-effectiveness – marginally driven by better hardware, and mostly by distilling from raw pretrained models, better dataset curation, low-level inference optimizations, eliminating architectural redundancies and discovering many "good enough" if weaker techniques (for example, DPO instead of PPO). 15 months ago, "$0.002/1000 tokens" for gpt-3.5-turbo seemed incredible; now we always count tokens by the million, and Gemini-Flash blows 3.5-turbo out of the water for half that price, so hard it's not funny; and we have reason to believe it's still raking in >50% margins, whereas OpenAI probably subsidized their first offerings (though in light of distilling and possibly other methods of compute reuse, it's hard to rigorously account for a model's capital costs now).
  • AI doom discourse has continued to develop roughly as I've predicted, but with MIRI pivoting to evidence-free advocacy, orthodox doomerism getting routed as a scientific paradigm, more extreme holdovers from it («emergent mesaoptimizers! tendrils of agency in inscrutable matrices!») being wearily dropped by players who matter, and misuse (SB 1047 etc) + geopolitical angle (you've probably seen young Leopold) gaining prominence.
  • The gap in scientific and engineering understanding of AI between the broader community and "the frontier" has shrunk since the debut of GPT-4 or 3.5, because there's too much money to be made in AI and only so much lead you can get out of having assembled the most driven AGI company. Back then, only a small pool of external researchers could claim to understand what the hell they did above the level of shrugging "well, scale is all you need" (wrong answer) or speculating about some simple methods like "train on copyrighted textbooks" (spiritually true); people chased rumors, leaks… Now it takes weeks at most to trace yet another jaw-dropping magical demo to papers, to cook up a proof of concept, or even to deem the direction suboptimal; the other two leading labs no longer seem desperate, and we're in the second episode of Anthropic's comfortable lead.
  • Actual, downloadable open AI sucks way less than I lamented last July. But it still sucks. And that's really bad, since it sucks most in the dimension that matters: delivering value, in the basest sense of helping do work that gets paid. And the one company built on the promise of «decentralizing intelligence», which I had hope for, has proven unstable.
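On the «good enough if weaker» point above: DPO is a representative case. It replaces PPO's reward model and on-policy rollouts with a simple classification-style loss over preference pairs. A minimal sketch of the per-pair loss, in plain Python with illustrative numbers (not any lab's actual implementation):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, computed from sequence log-probs.

    Unlike PPO, there is no reward model and no on-policy sampling:
    the policy is trained directly to widen its log-prob margin on the
    chosen completion relative to a frozen reference model.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# With no separation the loss sits at log 2; it falls as the policy
# upweights the chosen answer and downweights the rejected one:
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The appeal is exactly what the bullet describes: strictly weaker in principle than a full RL loop, but so much cheaper and more stable that it became the default almost overnight.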

To be more specific, open source (or, as some say now, given the secretiveness of full recipes and opacity of datasets, «open weights») AI has mostly caught up in «creativity» and «personality», «knowledge» and some measure of «common sense», and can be used for petty consumer pleasures or simple labor automation, but it's far behind corporate products in «STEM»-type skills that are in short supply among human employees too: «hard» causal reasoning, information integration, coding, math. (Ironically, I agree here with the whining artists that we're solving domains of competence in the wrong order. Also, it's funny how by default coding seems to be what LLMs are most suited for, as a sequence of code is more constrained by preceding context than natural language is.)

To wit, Western and Eastern corporations alike generously feed us – while smothering startups – fancy baubles to tinker with, charismatic talking toys; as they rev up self-improvement engines for full cycle R&D, the way imagined by science fiction authors all these decades ago, monopolizing this bright new world. Toys are getting prohibitively expensive to replicate, with reported pretraining costs up to ≈$12 million and counting now. Mistral's Mixtral/Codestral, Musk's Grok-0, 01.Ai's Yi-1.5, Databricks' DBRX-132B, Alibaba's Qwens, Meta's fantastic Llama 3 (barring the not-yet-released 405B version), Google's even better Gemma 2, Nvidia's massive Nemotron-340B – they're all neat. But they don't even pass for prototypes of engines you can hop on and hope to ride up the exponential curve. They're too… soft. And not economical for their merits.

Going through our archive, I find this year-old analysis strikingly relevant:

I think successful development of a trusted open model rivaling chatgpt in capability is likely in the span of a year, if people like you, who care about long-term consequences of lacking access to it, play their cards reasonably well. […] Companies whose existence depends on the defensibility of the moat around their LM-derived product will tend to structure the discourse around their product and technology to avoid even the fleeting perception of being a feasibly reproducible commodity.

That's about how it went. While the original ChatGPT, that fascinating demo, is commodified now, competitive product-grade AI systems are not, and companies big and small still work hard to maintain the impression that it takes

  • some secret sauce (OpenAI, Anthropic)
  • work of hundreds of Ph.Ds (Deepmind)
  • vast capital and compute (Meta)
  • "frontier experience" (Reka)

– and even then, none of them has yet felt secure enough to release a serious threat to the others' proprietary offerings.

I don't think it's a big exaggeration to say that the only genuine pattern breaker – presciently mentioned by me here – is DeepSeek, the company that has single-handedly changed – a bit – my maximally skeptical spring'2023 position on the fate of China in the AGI race.

II. Deep seek what?

AGI, I guess. Their Twitter bio states only: «Unravel the mystery of AGI with curiosity. Answer the essential question with long-termism». The Financial Times claims they have a recruitment pitch – «We believe AGI is the violent beauty of model x data x computing power. Embark on a ‘deep quest’ with us on the journey towards AGI!» – but other than that, nobody I know of has seen any advertisement or self-promotion from them (except for some 70 tweets in total, all announcing a new capability or answering basic user questions about the license), so it's implausible that they're looking for attention or subsidies. Their researchers maintain near-perfect silence online. Their – now stronger and cheaper – models tend to be ignored in comparisons by Chinese AI businesses and users. As mentioned before, one well-informed Western ML researcher has joked that they're the bellwether for «the number of foreign spies embedded in the top labs».

FT also says the following of their parent company:

Its funds have returned 151 per cent, or 13 per cent annualised, since 2017, and were achieved in China’s battered domestic stock market. The country’s benchmark CSI 300 index, which tracks China’s top 300 stocks, has risen 8 per cent over the same time period, according to research provider Simu Paipai.
In February, Beijing cracked down on quant funds, blaming a stock market sell-off at the start of the year on their high-speed algorithmic trading. Since then, High-Flyer’s funds have trailed the CSI 300 by four percentage points.
[…] By 2021, all of High-Flyer’s strategies were using AI, according to manager Cai Liyu, employing strategies similar to those pioneered by hugely profitable hedge fund Renaissance Technologies. “AI helps to extract valuable data from massive data sets which can be useful for predicting stock prices and making investment decisions,” …
Cai said the company’s first computing cluster had cost nearly Rmb200mn and that High Flyer was investing about Rmb1bn to build a second supercomputing cluster, which would stretch across a roughly football pitch-sized area. Most of their profits went back into their AI infrastructure, he added. […] The group acquired the Nvidia A100 chips before Washington restricted their delivery to China in mid-2022.
“We always wanted to carry out larger-scale experiments, so we’ve always aimed to deploy as much computational power as possible,” founder Liang told Chinese tech site 36Kr last year. “We wanted to find a paradigm that can fully describe the entire financial market.”

In a less eclectic Socialist nation this would've been sold as Project Cybersyn or OGAS. Anyway, my guess is they're not getting subsidies from the Party any time soon.

They made a minor splash in the ML community eight months ago, in late October, releasing an unreasonably strong DeepSeek-Coder. Yes, in practice an awkward replacement for GPT-3.5; yes, contaminated with test set, which prompted most observers to discard it as yet another Chinese fraud. But it proved to strictly dominate hyped-up things like Meta's CodeLlama and Mistral's Mixtral 8x7B in real-world performance, and time and again proved the strongest open baseline in research papers. On privately designed, new benchmarks like this fresh one from Cohere, it's clear that they did get to parity with OpenAI's workhorse model, right on the first public attempt – as far as coding is concerned.

On top of that, they shared a great deal of information about how they did it: constructing the dataset from GitHub, pretraining, finetuning. The paper was an absolute joy to read, sharing details even of unsuccessful experiments. It didn't offer much in the way of novelty; I evaluate it as a masterful, no-unforced-errors integration of the (by that point) known best practices. Think about your own field and you'll probably agree that even this is a high bar. And in AI, it is generally the case that either you get a great model with a «we trained it on some text… probably» tech report (Mistral, Google), or a mediocre one accompanied by a fake-ass novel full of jargon (every second Chinese group). Still, few cared.

Coder was trained, it seems, using lessons from the less impressive DeepSeek-LLM-67B (even so, it was roughly a peer of Meta's LLaMA-2-70B that also could code – a remarkable result for a literally-who new team), which somehow came out a month later. Its paper (released later still) was subtitled «Scaling Open-Source Language Models with Longtermism». I am not sure whether this was some kind of joke at the expense of effective altruists. What they meant concretely was the following:

Over the past few years, LLMs … have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). … Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source LMs with a long-term perspective.

  • …Soon, we will release our technique reports in code intelligence and Mixture-of-Experts(MoE), respectively. They show how we create high-quality code data for pre-training, and design a sparse model to achieve dense model performance.
  • At present, we are constructing a larger and improved dataset for the upcoming version of DeepSeek LLM. We hope the reasoning, Chinese knowledge, math, and code capabilities will be significantly improved in the next version.
  • Our alignment team is dedicated to studying ways to deliver a model that is helpful, honest, and safe to the public. Our initial experiments prove that reinforcement learning could boost model complex reasoning capability.

…I apologize for geeking out. All that might seem normal enough. But, a) they've fulfilled every one of those objectives since then. And b) I've read a great deal of research papers and tech reports, entire series from many groups, and I don't remember this feeling of cheerful formidability. It's more like contemplating the dynamism of SpaceX or Tesla than wading through a boastful yet obscurantist press release. It is especially abnormal for a Mainland Chinese paper to be written like this – with friendly confidence, admitting weaknesses, pointing out errors you might repeat, not hiding disappointments behind academese word salad; and so assured of having a shot in an honest fight with the champion.

In the Coder paper, they conclude:

…This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

In the Mixture-of-Experts paper (8th January), they showed themselves capable of novel architectural research too, introducing a pretty ingenious «fine-grained MoE with shared experts» design with the objective of «Ultimate Expert Specialization» and economical inference: «DeepSeekMoE 145B significantly outperforms GShard, matching DeepSeek 67B with 28.5% (maybe even 14.6%) computation». For the few who noticed it, this seemed a minor curiosity, or just bullshit.
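For intuition, the «fine-grained MoE with shared experts» idea can be sketched in a few lines: a handful of always-on shared experts carry common knowledge, while the router picks a few of many small specialized experts per token, so per-token compute stays fixed while total parameters grow with the expert count. The scalar «experts» below are toys of my own, not their implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, shared_experts, routed_experts, router_logits, top_k):
    """Fine-grained MoE with shared experts, sketched.

    Shared experts process every token; the router selects top_k of the
    many routed experts, and their outputs are mixed by normalized gates.
    """
    out = sum(e(token) for e in shared_experts)           # always active
    gates = softmax(router_logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)
    for i in top:                                          # sparse branch
        out += (gates[i] / norm) * routed_experts[i](token)
    return out

# Toy scalar "experts": two shared plus four routed specialists.
shared = [lambda x: 0.5 * x, lambda x: 1.5 * x]
routed = [lambda x: x, lambda x: 10 * x, lambda x: 100 * x, lambda x: -x]
# The router strongly prefers expert 1; with top_k=1 only it runs:
y = moe_forward(1.0, shared, routed, [0.0, 9.0, 1.0, 0.0], top_k=1)
# y = (0.5 + 1.5) * 1 + 10 * 1 = 12.0
```

Adding more routed experts to this scheme grows capacity without touching the per-token cost, which is precisely the property the later V2 release exploits at scale.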

On 5th February they dropped DeepSeekMath, of which I've already spoken: «Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model». Contra the usual Chinese pattern, it wasn't a lie; no, you couldn't get remotely as good results from it in normal use, but in some constrained regimes… The project itself was a mix of most of the previous steps: a sophisticated (and well-explained) data-harvesting pipeline, scaling-laws experiments, further «longtermist» continued pretraining from Coder-7B-1.5 (itself a repurposed LLM-7B), and the teased reinforcement learning approach. Numina, winners of AIMO, say «We also experimented with applying our SFT recipe to larger models like InternLM-20B, CodeLama-33B, and Mixtral-8x7B but found that (a) the DeepSeek 7B model is very hard to beat due to its continued pretraining on math…».

In early March they released DeepSeek-VL: Towards Real-World Vision-Language Understanding, reporting some decent results and research on building multimodal systems, and again announcing new plans: «to scale up DeepSeek-VL to larger sizes, incorporating Mixture of Experts technology».

III. Frontier minor league

Thus far, it had all been preparatory R&D, shared openly and explained eagerly yet barely noticed by anyone (except that the trusty Coder still served as a base for labs like Microsoft Research to experiment on): utterly overshadowed in discussions by Alibaba, Meta, Mistral, to say nothing of the frontier labs.

But on May 6th, 2024, the pieces began to fall into place. They released «DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model», which subsumed all aforementioned works (except VL).

It's… unlike any other open model, to the point you could believe it was actually made by some high-IQ finance bros from first principles. Its design choices are exquisite; just copying minor details can substantially improve on typical non-frontier efforts. It pushes their already unorthodox MoE further and tops it off with a deep, still poorly understood modification to the attention mechanism (Multi-head Latent Attention, or MLA). It deviates from industry-standard rotary position embeddings to accommodate the latter (a fruit of collaboration with RoPE's inventor). It's still so unconventional that we are only beginning to figure out how to run it properly (they don't share their internal pipeline, which is optimized for the hardware they can access given American sanctions). But in retrospect, it's the obvious culmination of the vision announced with those first model releases and goofy tweets – probably a vision not one year old, and yet astonishingly far-sighted, especially given how young their star researchers are. It's probably mundane in the landscape of AI that's actually used; I suspect it's close to how Sonnet 3.5 or Gemini 1.5 Pro work on the inside. It's just that the open-source peasants are still mucking around with stone-age dense models on their tiny consumer GPUs.

I understand I might already be boring you out of your mind, but just to give you an idea of how impressive this whole sequence is, here's a 3rd April paper for context:

Recent developments, such as Mixtral (Jiang et al., 2024), DeepSeek-MoE (Dai et al., 2024), spotlight Mixture-of-Experts (MoE) models as a superior alternative to Dense Transformers. An MoE layer works by routing each input token to a selected group of experts for processing. Remarkably, increasing the number of experts in an MoE model (almost) does not raise the computational cost, enabling the model to incorporate more knowledge through extra parameters without inflating pre-training expenses… Although our findings suggest a loss-optimal configuration with Emax experts, such a setup is not practical for actual deployment. The main reason is that an excessive number of experts makes the model impractical for inference. In contrast to pretraining, LLM inference is notably memory-intensive, as it requires storing intermediate states (KV-cache) of all tokens. With more experts, the available memory for storing KV caches is squeezed. As a result, the batch size – hence throughput – decreases, leading to increased cost per query. … We found that MoE models with 4 or 8 experts exhibit more efficient inference and higher performance compared to MoE models with more experts. However, they necessitate 2.4x-4.3x more training budgets to reach the same performance with models with more experts, making them impractical from the training side.

This is basically where Mistral.AI, the undisputed European champion with Meta and Google pedigree (valuation $6.2B), the darling of the opensource community, stands.
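The memory pressure the quoted passage describes is easy to quantify. A back-of-envelope KV-cache calculator (the 70B-class configuration below is illustrative, not any specific model's):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache per sequence: keys + values, every layer, every head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Full multi-head attention at 70B scale, fp16, 32K context: ~86 GB per
# sequence -- this is what squeezes batch size and raises cost per query.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=32_768)

# Grouped-query attention with 8 KV heads shrinks it 8x, to ~10.7 GB:
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)
```

Every byte of cache you shave converts directly into batch size, and batch size into throughput – which is why the cache trick below matters so much.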

And yet, apparently DeepSeek have found a way to get out of the bind. «4 or 8»? They scale to 162 experts, reducing active parameters to 21B, cutting down pretraining costs by 42.5% and increasing peak generation speed by 5.76x; and they scale up the batch size via compressing the KV cache by like 15 times with a bizarre application of low-rank projections and dot attention; and while doing so they cram in 3x more attention heads than any model this size has any business having (because their new attention decouples number of heads from cache size), and so kick the effective «thinking intensity» up a notch, beating the gold standard «Multihead attention» everyone has been lousily approximating; and they use a bunch of auxiliary losses to make the whole thing maximally cheap to use on their specific node configuration.

But the cache trick is pretty insane – the hardest-to-believe part of the whole thing, for me. Now, two months later, we know that certain Western groups appear to have reached the same Pareto frontier, just with different (maybe worse, maybe better) tradeoffs. But those are literally the inventors and/or godfathers of the Transformer – Noam Shazeer's CharacterAI, Google Deepmind's Gemini line… This was done by folks like this serious-looking 5th-year Ph.D student, in under a year!
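The trick, as far as the paper explains it, is caching a single small latent vector per token instead of full keys and values, and re-expanding it with learned up-projections at attention time. A toy numpy sketch – all dimensions here are made up for illustration, and DeepSeek's actual MLA also routes RoPE through a separate decoupled path, omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, d_latent = 512, 16, 64, 64  # toy sizes, not theirs

W_down = rng.standard_normal((d_latent, d_model)) * 0.02             # shared compression
W_up_k = rng.standard_normal((n_heads * head_dim, d_latent)) * 0.02  # K re-expansion
W_up_v = rng.standard_normal((n_heads * head_dim, d_latent)) * 0.02  # V re-expansion

def cache_token(h):
    """Cache one d_latent vector per token instead of full K and V
    (2 * n_heads * head_dim floats) -- the low-rank KV compression idea."""
    return W_down @ h

h = rng.standard_normal(d_model)
latent = cache_token(h)                            # only this enters the KV cache
k = (W_up_k @ latent).reshape(n_heads, head_dim)   # rebuilt at attention time
v = (W_up_v @ latent).reshape(n_heads, head_dim)

compression = (2 * n_heads * head_dim) / d_latent  # 32x in this toy setup
```

Note the decoupling this buys: the number of attention heads no longer dictates cache size, which is how V2 affords 3x more heads than peers while still shrinking its cache.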

As a result, they:

  • use about as much compute on pretraining as Meta did on Llama-3-8B, an utter toy in comparison (maybe worth $2.5 million for them); 1/20th of GPT-4.
  • get a 236B model that's about as good across the board as Meta's Llama-3-70B (≈4x more compute), which has the capacity – if not the capability – of mid-range frontier models (previous Claude 3 Sonnet; GPT-4 on a bad day).
  • can serve it at around the price of an 8B, $0.14 for processing 1 million tokens of input and $0.28 for generating 1 million tokens of output (1 and 2 Yuan), on previous-gen hardware too.
  • …and still take up to 70%+ gross margins, because «On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second… In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second», and the going price for such nodes is ≤$15/hr. That's $50 in revenue, for clarity. They aren't doing a marketing stunt.
  • …and so they force every deep-pocketed mediocre Chinese LLM vendor – Alibaba, Zhipu and all – to drop prices overnight, now likely serving at a loss.
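The margin claim in those bullets checks out arithmetically, taking the throughput and price figures above at face value:

```python
# Back-of-envelope check of the gross-margin claim (figures from the text):
gen_tokens_per_sec = 50_000   # generation throughput on one 8xH800 node
price_per_m_output = 0.28     # USD per million output tokens
node_cost_per_hr = 15.0       # upper-bound rental price for such a node

revenue_per_hr = gen_tokens_per_sec * 3600 / 1e6 * price_per_m_output
gross_margin = 1 - node_cost_per_hr / revenue_per_hr
# revenue_per_hr comes to about $50 per node-hour, and gross_margin to ~70%,
# matching the "$50 in revenue" and "70%+ gross margins" figures above.
```
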

Now, I am less sure about some parts of this story; but mostly it's verifiable.

I can see why an American, or a young German like Leopold, would freak out about espionage. The thing is, their papers are just too damn good and too damn consistent over the entire period if you look back (as I did), so «that's it, lock the labs» or «haha, no more tokens 4 u» is most likely little more than racist cope for the time being. The appropriate reaction would be more akin to «holy shit Japanese cars are in fact good».

Smart people (Jack Clark of Anthropic, Dylan Patel of SemiAnalysis) immediately take note. Very Rational people clamoring for an AI pause (TheZvi) sneer and downplay: «This is who we are worried about?» (as he did before, and before). But it is still good fun. Nothing extreme. Efforts at adoption slowly begin: say, Salesforce uses V2-Chat to create synthetic data to finetune small DeepSeek-Coder V1s to outperform GPT-4 on narrow tasks. Mostly, nobody cares.

The paper ends in the usual manner of cryptic comments and commitments:

We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper. DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.

DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.

• In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.

In the Appendix, you can find a lot of curious info, such as:

During pre-training data preparation, we identify and filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on the test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of testsets compared with its competitors like Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, which is mainly associated with American values.

Prejudices of specific regional cultures aside, though, it does have values – true, Middle Kingdom ones, such as uncritically supporting the Party line and adherence to Core Values Of Socialism (h/t @RandomRanger). The web version will also delete the last message if you ask something too clever about Xi or Tiananmen or… well, nearly the entirety of usual things Americans want to talk to Chinese coding-oriented LLMs about.

And a bit earlier, this funny guy from the team presented at Nvidia's GTC24 with the product for the general case – «culturally sensitive», customizable alignment-on-demand: «legality of rifle» for the imperialists, illegality of Tibet separatism for the civilized folk. Refreshingly frank.

But again, even that was just preparation.

IV. Coming at the king

Roughly 40 days later they release «DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence», where they return to the strategy announced at the very start: they take an intermediate checkpoint of V2 and push it harder and further on a dataset enriched with code and math (which they've continued to expand and refine), for 10.2 trillion tokens total. This training run is 60% more expensive than Llama-3-8B's (still a pittance by modern standards). The model also misses out on some trivia knowledge and somehow becomes even less charismatic. It's not a pleasant experience either, because the API runs very slowly, probably from congestion (I guess Chinese businesses are stingy… or perhaps DeepSeek is generating a lot of synthetic data for the next iterations). Anons on 4chan joke that it's «perfect for roleplaying with smart, hard-to-get characters».

More importantly though, it demolishes Llama-3-70B on every task that takes nontrivial intelligence; bests Claude 3 Opus on coding and math throughout, and Gemini 1.5 Pro on most coding assistance, and trades blows with the strongest GPT-4 variants. Of course, it's the same shape and the same price, which is to say up to 100 times cheaper than its peers… more than 100 times, in the case of Opus. Still a bitch to run, but it turns out they're selling turnkey servers. In China, of course. To boot, they rapidly shipped running code in the browser (a very simple feature, but going most of the way to the Claude Artifacts that wowed people so much), quadrupled context length without price changes (32K to 128K), and now intend to add the context caching that Google boasts of as some tremendous Gemini breakthrough. They have… impressive execution.

Benchmarks, from the most sophisticated and hard to hack to the most bespoke and obscure, confirm that it's «up there».

Etc etc, and crucially, users report similar impressions:

So I have pegged deepseek v2 coder against sonnet 3.5 and gpt4o in my coding tasks and it seems to be better than gpt4o (What is happening at OpenAI) and very similar to Sonnet 3.5. The only downside is the speed, it's kinda slow. Very good model and the price is unbeatable.

I had the same experience, this is a very good model for serious tasks. Sadly the chat version is very dry and uncreative for writing. Maybe skill issue, I do not know. It doesn't feel slopped, it's just.. very dry. It doesn't come up with things.

There are some frustrating weak points, but they know of those, and conclude:

Although DeepSeek-Coder-V2 achieves impressive performance on standard benchmarks, we find that there is still a significant gap in instruction-following capabilities compared to current state-of-the-art models like GPT-4 Turbo. This gap leads to poor performance in complex scenarios and tasks such as those in SWEbench. […] In the future, we will focus more on improving the model’s instruction-following capabilities…

Followed by the list of 338 supported languages.

Well-read researchers say stuff like

DeepSeek-Coder-V2 is by far the best open-source math (+ coding) model, performing on par with GPT4o w/o process RM or MCTS and w/ >20x less training compute. Data contamination doesn't seem to be a concern here. Imagine about what this model could achieve with PRM, MCTS, and other yet-to-be-released agentic exploration methods. Unlike GPT4o, you can train this model further. It has the potential to solve Olympiad, PhD and maybe even research level problems, like the internal model a Microsoft exec said to be able to solve PhD qualifying exam questions.

Among the Rational, there is some cautious realization («This is one of the best signs so far that China can do something competitive in the space, if this benchmark turns out to be good»), in short order giving way to more cope: «Arena is less kind to DeepSeek, giving it an 1179, good for 21st and behind open model Gemma-2-9B».

And one more detail: a couple of weeks ago, they released code and a paper on Expert-Specialized Fine-Tuning, «which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning … by showing less performance degradation [in general tasks]». It seems to require that «ultimate expert specialization» design of theirs, with its supporting beam of generalist modules surrounded by meaningfully task-specific shards, to automatically select only the parts pertaining to some target domain; this isn't doable with traditional dense or MoE designs. Once again: confident vision, bearing fruit months later. I would like to know who's charting their course, because they're single-handedly redeeming my opinion of the Chinese AI ecosystem and, frankly, Chinese culture.
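The mechanism, as I read the paper's description, reduces to ranking experts by the routing mass a downstream task sends them and unfreezing only the top of that ranking. A hypothetical selection helper of my own, not their code:

```python
def select_experts_to_tune(gate_scores, coverage=0.9):
    """Expert-Specialized Fine-Tuning, sketched: unfreeze the smallest set
    of experts that receives `coverage` of the task's total routing mass;
    everything else (other experts, shared modules) stays frozen.

    gate_scores: total router weight each expert received on task data.
    """
    order = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)
    threshold = coverage * sum(gate_scores)
    picked, acc = [], 0.0
    for i in order:
        picked.append(i)
        acc += gate_scores[i]
        if acc >= threshold:
            break
    return sorted(picked)  # tune these experts; freeze the rest

# A task whose tokens route overwhelmingly to experts 2 and 5:
tuned = select_experts_to_tune([1, 2, 52, 1, 2, 40, 2])  # → [2, 5]
```

This only works because fine-grained specialization concentrates a task's routing mass on a few shards; in a dense model (or a coarse 8-expert MoE) no such small, task-specific subset exists to isolate.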

V. Where does this leave us?

This might not change much. The Western closed-AI compute moat continues to deepen, DeepSeek/High-Flyer don't have any apparent privileged access to domestic chips, and other Chinese groups have friends in the Standing Committee and in the industry, so realistically this will be a blip on the radar of history. A month ago, though, they precluded a certain level of safetyist excess and corporate lock-in that still seemed possible in late 2023, when the argument that public availability of ≈GPT-4-level weights (with the main imaginary threat vectors being coding/reasoning-bottlenecked) could present intolerable risks was discussed in earnest. One or two more such leaps and we're… there, for the vague libertarian intuition of «there» I won't elucidate now. But they're already not sharing the silently updated DeepSeek-V2-Chat (which somewhat improved its reasoning, getting closer to the Coder), nor the promised materials on DeepSeek-Prover (a quiet further development of their mathematical-models line). Maybe it's temporary. Maybe they've arrived where they wanted to be, and will turtle up like Stability and Mistral, and then likely wither away.

Mostly, I honestly just think it's remarkable that we're getting an excellent, practically useful free model with lowkey socialist sensibilities. Sadly, I do not foresee that this will inspire Western groups to accelerate open source and leave them in the dust. As Google says in Gemma-2 report:

Despite advancements in capabilities, we believe that given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.

Less charitably: Google is not interested in releasing anything you might use to enhance your capabilities and become less dependent on Google or another «frontier company», and will only release it if you are well able to get better stuff elsewhere. In my view, this is closer to the core value of Socialism than withholding info about Xinjiang reeducation camps.

I remain agnostic about the motivations and game plan of DeepSeek, but I do hope they'll maintain this policy of releasing models «with longtermism», as it were. We don't have many others to rely on.

Edits: minor fixes


Fascinating read, though as always my practical knowledge of this field is limited.

Don't strategic dynamics dictate that closed-source dominates? It's like giving your enemies schematics of your fighter jet. Nobody does that! Competitors can take whatever good ideas you have and conceal their own. Google and OpenAI can observe 'more experts is better, here are these tricks' and mobilize their compute advantage + whatever smart ideas they came up with for their next models.

Its funds have returned 151 per cent, or 13 per cent annualised, since 2017, and were achieved in China’s battered domestic stock market.

That's impressive, they clearly know what they're doing. I wonder how many 1,000x engineers, how many Carmack-tier intellects are mucking around in China. We never seem to hear about what happens in China except when it breaks the language barrier (Genshin Impact, Cixin Liu, some webnovels, dumb tiktok memes like Donghua Jinlong's industrial-grade glycine). And when they do break the language barrier, it's usually highly compartmentalized. Deepseek is basically unknown outside a tiny part of twitter and various experts.

I agree with you on the racism front. Derision and contempt for Chinese achievements is probably the last example of mainstream traditionally-defined racism (directed against non-whites). A lot of normies have these thought-ending punchlines: 'slave labour is the only way they can outcompete us', or 'it's just stolen IP', or 'communists can't do innovation'. One friend of mine wondered if slave labour might be why Taiwan was so competitive in chip-making; he clearly didn't know ANYTHING about the topic. He was kind of stupid, tbh. But loads of op-eds are still written today along similar lines - I swear I saw 'communist China can't innovate' the other week. There's no shortage of cheap labour in Africa, yet we're not worried about Nigerian exports pushing our domestic industries underwater.

Stealing IP and then profitably using it is hard! Building up an advanced economy is hard! You need smart, disciplined people to do these things; you need sound institutions. If they can steal proficiently, they can probably also make new things. Sometimes stealing alone is insufficient - the Soviet Union had excellent espionage but was much worse at reproducing Western computer technology. The Soviets never had competitive manufactured-goods exports on world markets, though.

Don't strategic dynamics dictate that closed-source dominates?

That's the case for sure. Generally it's safer to release while you're catching up; since DeepSeek is catching up, they might well keep releasing. They're raising the waterline, not teaching those they're trying to challenge. Google arrived at fine-grained experts much earlier, and a month after DeepSeek-MoE (Dai et al.), a Polish group published Scaling Laws for Fine-Grained Mixture of Experts, saying:

Concurrently to our work, Dai et al. (2024) proposed to modify the MoE layer by segmenting experts into smaller ones and adding shared experts to the architecture. Independently, Liu et al. (2023) suggested a unified view of sparse feed-forward layers, considering, in particular, varying the size of memory blocks. Both approaches can be interpreted as modifying granularity. However, we offer a comprehensive comparison of the relationship between training hyperparameters and derive principled selection criteria, which they lack.
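The «granularity» knob both papers converge on is mostly bookkeeping: shrink each expert by a factor G, multiply the expert count and the routed top-k by G, and both total and active parameters per token stay roughly constant while routing gets more flexible. A toy sketch of the arithmetic (names and numbers are mine, not from either paper):

```python
# Fine-grained MoE "granularity" as a parameter-count transformation:
# split every expert into `granularity` smaller shards and route each
# token to `granularity` times as many of them. Illustrative only.

def fine_grain(n_experts, top_k, hidden, granularity):
    return {
        "n_experts": n_experts * granularity,       # more, smaller experts
        "top_k": top_k * granularity,               # more experts per token
        "expert_hidden": hidden // granularity,     # each shard is thinner
    }

base = dict(n_experts=16, top_k=2, hidden=4096)
cfg = fine_grain(**base, granularity=4)
print(cfg)  # {'n_experts': 64, 'top_k': 8, 'expert_hidden': 1024}

# Active FFN width per token is unchanged: 2*4096 == 8*1024,
# but the router now composes 8 specialized shards instead of 2 generalists.
```

What the scaling-laws paper adds on top of this identity is the empirical question of which G is compute-optimal at a given budget, which is exactly the kind of answer that takes many training runs to buy.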

This is not so much a matter of scientific ability or curiosity as it's a matter of compute. They can run enough experiments to confirm or derive laws, and they can keep scaling a promising idea until they see whether it breaks out of a local minimum. This is why the West will keep leading. Some Chinese companies have enough compute, but they're too cowardly to do this sort of science.

And Google has dozens of papers like this, and even much more exciting ones (except many are junk and go nowhere, because they don't confirm the idea at scale, or don't report confirming it, or withhold some invalidating detail. Or sometimes it's the other way around – people assume junk, as with Sparse Upcycling: even the authors thought it was a dead end, and then it was confirmed to work in the Mixtrals, and now in a few Chinese models).
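For reference, Sparse Upcycling is almost embarrassingly simple, which is partly why people wrote it off: seed every expert of a fresh MoE layer with a copy of the dense checkpoint's FFN and keep training, letting the copies drift apart. A toy illustration (my own, with stand-in weights, not anyone's actual training code):

```python
import copy

# Toy sketch of Sparse Upcycling: initialize each expert of a new MoE
# layer as an identical copy of a dense checkpoint's FFN weights.
# Continued training then differentiates the copies.

def upcycle_ffn(dense_ffn_weights, n_experts):
    """Return n_experts independent copies of the dense FFN weights."""
    return [copy.deepcopy(dense_ffn_weights) for _ in range(n_experts)]

dense = [[0.1, -0.2], [0.3, 0.4]]        # stand-in 2x2 weight matrix
experts = upcycle_ffn(dense, n_experts=4)

experts[0][0][0] = 9.9                   # "training" perturbs one copy...
print(dense[0][0], experts[1][0][0])     # ...the original and the others
                                         # are untouched: prints 0.1 0.1
```

The deepcopy is the whole trick: the experts must be independent tensors, not views of one buffer, or no specialization can ever emerge.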

Derision and contempt for Chinese achievements is probably the last example of mainstream traditionally defined racism (directed against non-whites)

On the plus side, I get to have fun when Americans respond to videos of their cheap robots calling them CGI – because in the US, even the announcement of such a good product would have been a well-advertised, near-religious event (ahem, Boston Dynamics ads), with everyone kowtowing to our great cracked engineers, their proprietary insights worth so many billions, and the Spirit of Freedom.

I particularly like the commenter who goes 'this is clearly an AI generated video' when that would be way more impressive than the technical feat of backflipping dogbots. AI video without that shimmering, of something as out-of-distribution as a rotating robot dog?

I'm of the mind that communists can innovate but only to a point.

The limiting factor certainly isn't intelligence, scientific organization, determination, or even centralized planning. The limiting factor is that if innovation really happens, you're going to get growth and a new generation of wealth. If that wealth then begins to proliferate through even a moderate portion of society, you'll start to see people wanting to do things with their wealth. In fact, they'll want to do more and more with it and, all of a sudden, you've got the makings of demands for more personal liberty.

Personal liberty is obviously directly antagonistic to the authoritarian systems that any "communist" or "socialist" nations eventually become. The response by the central government is to clamp down in order to regain control. The secondary and tertiary effects of that, however, are that you'll end up artificially stifling innovation currents in your national economy. Innovation, especially across industries, functions kind of like a rainforest. It's very organic and stochastic; it's hard to point to one section or another and say definitively "that part is the real innovation engine." Let the whole damn thing grow unkempt and uneven. Trying to corral one part of it (or the downstream wealth-generation effects, see: corporate tax rates) is messing with a pretty fragile system. You'll likely disrupt the whole thing.

I also don't believe you can effectively "capture" all of the growth and new wealth, funneling it to some sort of inner-circle elite. The few places where this has been accomplished (I think Saudi Arabia is the best example) have (1) really tight kinship affiliation as opposed to ideological affiliation and (2) are usually extraction-based raw-materials economies. It's easier to constrain the growth when you're immediately shipping the money units out of your borders and getting paid wholesale, all at once, by foreigners. China makes its money, largely, by making new things or refining raw materials. They're export-oriented, but they're pretty diverse and they have a complex, multi-stage economy. It's not as simple as just sending the magic beans out the door.

China has always had to balance one central dilemma: control or growth. It can look like a continuum and, therefore, possible to balance. I'd say that's actually a red herring, and if you attempt anything besides growth-with-personal-liberty you eventually get into a recessionary situation. Then economic realities appear that cannot be overcome with government subsidies or printing money, especially when you aren't the world's reserve currency. In the worst of cases (though repeated many times in Chinese history) you might get a popular revolt because of a declining quality of life. In trying to "balance" you've actually pre-committed to an eventual collapse. The only way to avoid it is to make a fundamental change to the governance and political systems - embrace democracy and true market-based capitalism. Don't worry - you can still wage a culture war :-)

Paraphrasing a common zinger, what if it turns out communists can stay innovative longer than you can stay ahead?

More substantially, though, I don't see much of a persuasive argument here. You are generalising from very little data (a roughly 200-year-old system that identifies as "capitalist" vs. the second major ~60-year-old system that identifies as "communist"), theorising about the "communist" system from first principles that there is very limited evidence it actually adheres to, and on top of that reaching a conclusion flattering to your obviously preferred system, which should give you pause. Is this different from a Russian arguing in 1904 that a heathen state will never prevail over a Christian one, with an argument based on the recent historical primacy of the former, imagining that the expected naval tactics of Russia and Japan can be derived from the tenets of Orthodox Christianity and State Shinto?

So far the PRC story seems to me to make a compelling case that you can suddenly and massively crank up the wealth of great numbers of people while making them less inclined to pursue freedoms outside of your prescribed window. The main line of work the devil is making for idle hands there appears to consist of mobile game daily quests.

I don't see much of a persuasive argument here.

That's fair. Internet arguments often fail to persuade.

Can you break down your final paragraph (reprinted below)? I'm not quite smart enough to understand some of the allusion and references embedded in it.

So far the PRC story seems to me to make a compelling case that you can suddenly and massively crank up the wealth of great numbers of people while making them less inclined to pursue freedoms outside of your prescribed window. The main line of work the devil is making for idle hands there appears to consist of mobile game daily quests.

Sure, sorry if it was opaque. The first sentence has nothing much going on - it's just the observation that in China every subpopulation seems to only become less rebellious as modernity and affluence spread to it; the wealthier people now seem happy to work and consume, while the remaining sparks of rebellion all came from still-impoverished marginal populations, as well as groups that dropped on the totem pole of wealth (HK).

The second one is a reference to "the devil makes work for idle hands", a phrase often quoted in Western contexts as part of an argument against allowing the masses significant leisure - the intended image being that only 16-hour workdays stop the plebeian masses from organising at some beer hall to stage an uprising against their betters or else falling into antisocial debauchery. I found the idea that more wealth would create more motivation to use that wealth in a way that threatens government control quite similar, but in reality, any leisure time modern Chinese people get seems to be sunk into modern entertainment - TV dramas and perhaps most conspicuously games that get players hooked using supremely gamified literal make-work activities, the "daily quests" or "dailies". This typically looks like performing some randomised chores (talk to NPC X, defeat five slimes, craft three potions) to get a daily reward of in-game currency that can be used for obtaining randomised lootboxes/gacha. At best, people go to organise in some online beer hall to stage an uprising against a rival subfandom of one of those games.

"the devil makes work for idle hands", a phrase often quoted in Western contexts as part of an argument against allowing the masses significant leisure

I mean, maybe that's how it's used. But I've heard the similar phrase "idle hands are the devil's playthings" used by the masses to each other to argue against inactivity because it can lead to filling that idle time in ways that are wasteful, useless, and destructive; the attraction of the gacha game was foreseen by pre-modernity and heavily criticized.

Not disagreeing with your overall point. I think you're correct that one of the major features of modernity in China is wasting time on gacha games. But the same is true for the West as well, which is why we will never see major political revolution in the near future even if people like doomposting about it. The attention of the masses is pulled toward ever-more-addictive processed food, media, and video games. We're literally talking about bread and circuses here.

I'm not sure we know what causes repressive regimes to fall to free ones. We've tried lots of strategies. "Make them rich" hasn't worked in China. "Make them poor" hasn't worked in Cuba or North Korea. "Point guns at them and make them cargo-cult democracy" didn't work in Afghanistan. "Let them give it a go" didn't work in the Arab Spring, which replaced one set of repressive regimes with another.

My dark intuition is that a people has to have an intrinsic cultural drive towards freedom to ever want to establish it. The American Revolution would never have happened without the English Civil War setting up a background. And the cultures that had no strong drive or cultural history of freedom could not suddenly establish it: Russia went from repressive Tsarism to repressive Leninism, China went from repressive Confucian Imperial rule to repressive Communist Party rule. And the Middle East sees Islamic republicanism as its own kind of freedom, even if the West sees it as horrifying.

The exceptions are the Western friends in East Asia. South Koreans and Japanese have remained within the democratic orbit because the material blessings of modernity were offered to them under conditions of democracy. Perhaps this served to lock them into a democratic system with the same features that lock mainland China into the party's system. And who knows about Taiwan, except that what happened is the most gung-ho pro-Westerners were quite literally placed on an island exiled from the others, which certainly does a lot to define the political culture of a people in that direction.

I don't know. Much smarter people than me have pondered this question and come up empty. Maybe the reality is, we don't know what makes people value the West's standard of freedom except being a Westerner.

Shouldn't the state step in when there's market failure?

Doesn't it make sense to ban things like child prostitution and drug-dealing (by which I mean things like fentanyl and heroin), commercial transactions with bad externalities?

What about industrial policy? Wouldn't it be beneficial for the economy if you could provide cheap inputs? The state could back energy research on the basis that cheap energy improves the whole economy. Efficient transport systems save workers time and provide small boosts to all enterprises. No company is big enough to build a national-size HSR network out of their own pocket. Or consider education. Wouldn't it be helpful if the government set up academic scholarships to help poor smart kids attain higher learning?

More ambitiously, wouldn't it make sense to fund research and development? Private R&D is mostly profit-focused. Of course there are offshoots from commercial R&D that open up new frontiers but governments can do things with a longer time-frame. They can subsidize promising avenues of research that aren't immediately profitable, offer prizes for achievement.

From another angle, companies themselves don't operate according to market principles. They're top-down autocratic institutions. Workers obey the boss. The budget is set by the people at the top, you don't have different departments competing to increase their revenue. The reason capitalism is so successful is that the efficient autocratic companies outcompete the inefficient autocracies quickly. Rapid life and death spurs evolution. Capitalism is just one way of achieving efficiency, it's not an end in and of itself.

I think it's the same with states. States can be more or less efficient in their economic interventions. They can build infrastructure efficiently or inefficiently. They can sponsor education wisely or unwisely. They can encourage commerce well or poorly. They can pick losers or they can pick winners.

In concrete terms, the US is running 5% deficits in a growing economy. One wonders what kind of deficit will be needed for a recession or sudden crisis. The US has fallen well behind China in cars, shipbuilding, steel, infrastructure, 5G, batteries, energy production and drones. If you look at Nature's most cited, high-quality papers, China leads. They seem to be catching up rapidly in AI. They must be doing something right.

US democracy is not exactly the envy of the world in the present hour. America retains a lead in aerospace, AI and high-end semiconductors, albeit a diminishing lead. I suppose the US is well ahead in space but that's about all I can think of.

I don't see much cause for liberal-democratic, free-market triumphalism. The democratic bloc all seem to be veering towards deep-state governance, censorship and economic protectionism.

The democratic bloc all seem to be veering towards deep-state governance, censorship and economic protectionism.

Right! Which is a bad thing and will ultimately fail.

That China is already firmly, obviously, and enthusiastically autocratic, highly censored, and attempting to use everything from currency manipulation to slavery to favor their domestic economy should be a warning to us.

There's a lot in your post that jumbles together disparate macro/micro economic theory, state capacity, theory of the firm, and market feedback loops. I'm not going to try and ... debug ... all of it. I'll zoom in on this:

They can pick losers or they can pick winners.

They should be picking neither. Because as soon as they do, they make the larger market inefficient and make customers the losers. Housing policy (federally guaranteed mortgages) almost ended the whole damn thing for everyone.

You can have bad housing policy (force banks to lend to people who can't repay, prohibit house construction to pump up prices or deliberately suppress economic development) or good housing policy (produce housing to meet the needs of the population). You can have bad infrastructure policy (build it expensively and stupidly) or good infrastructure policy (build it cheaply and cleverly).

I'm not making this up. Britain really did suppress the economic development of the Midlands, fearing that it was too prosperous and seeking to redirect development to other areas. The US really did spend tens of billions on HSR and not make any railway. China actually built their high-speed rail and achieved a good economic return on it. The US interstate highway program shows that America used to be capable of infrastructure policy. It's not magic.

Good policy can be hard. It may go against influential voices and tread on toes. Maybe it takes time to pull off. But it is possible.

There is no such thing as an objectively good or bad policy.

It's all about relative value prioritization and tradeoffs. Furthermore, the devil is always in the details.

Taking your line on "good housing policy"; "produce housing to meet the needs of the population"

Who determines what the needs-vs-wants of "the population" are? To what level should they be met?

Who actually produces the housing? Private firms or public? Who finances the construction? How do the tax dollars work in?

And that's the essence of policy with unintended consequences. People start with a highly value-judgemental aspiration like, I don't know, "build [infrastructure] cheaply and cleverly." Then you have to write out how it all works, and you end up with perverse incentives, or cost disease, or some other obvious economic malady that was hand-waved away because the policy decision was just so blindingly, obviously right ... right?

We live in an incredibly complex world that only grows in complexity. All "easy" solutions are either misleading or take out massive debts on the future. Human nature is not going to suddenly improve by leaps and bounds. We work within the systems we have.