DaseindustriesLtd

late version of a small language model

72 followers   follows 27 users   joined 2022 September 05 23:03:02 UTC

Tell me about it.

User ID: 745

You are noticing that none of these companies want to race. The whole competition to build Sand God is largely kayfabe. The Western AI scene is not really a market; it's a highly inefficient cartel (with massive state connections too), which builds up enormous capacity but drags its feet on products, because none of them ultimately believe their business models are sustainable in the case of rapid commoditization. This is why DeepSeek was such a disruption: not only was it absurdly cheap (current estimates put their annual operating costs at around $200M), not only were they Chinese, but they dared to actively work to bring the cost of frontier capabilities to zero and make it logistically mundane, in alignment with Liang Wenfeng's personal aesthetic and nationalist preferences.

I think R1's release has sped up every Western frontier lab by 20-50% simply by denying them this warm feeling that they can feed the user base some slop about hidden wonder weapons in their basements, release incremental updates bit by bit, and focus on sales. Now we are beginning to see a bit more of their actual (still disappointingly low, not a single one of these companies could have plausibly made R1 on that cluster I think) power level.

Again I have to quote Boaz Barak (currently OpenAI): AI will change the world, but won’t take it over by playing “3-dimensional chess”.

Consider the task of predicting the consequences of a particular action in the future. In any sufficiently complex real-life scenario, the further away we attempt to predict, the more there is inherent uncertainty. For example, we can use advanced methods to predict the weather over a short time frame, but the further away the prediction, the more the system “regresses to the mean”, and the less advantage that highly complex models have over simpler ones (see Figure 4). As in meteorology, this story seems to play out similarly in macroeconomic forecasting. In general, we expect prediction success to behave like Figure 1 below—the error increases with the horizon until it plateaus to a baseline level of some simple heuristic(s). Hence while initially highly sophisticated models can beat simpler ones by a wide margin, this advantage eventually diminishes with the time horizon.

Tetlock’s first commandment to potential superforecasters is to triage: “Don’t waste time either on “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp). Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most.” Another way to say it is that outside of the Goldilocks zone, more effort or cognitive power does not give much returns.

Rather, based on what we know, it is likely that AI systems will have a “sweet spot” of a not-too-long horizon in which they can provide significant benefits. For strategic and long-term decisions that are far beyond this sweet spot, the superior information processing skills of AIs will give diminishing returns. (Although AIs will likely supply valuable input and analysis to the decision makers.). An AI engineer may well dominate a human engineer (or at least one that is not aided by AI tools), but an AI CEO’s advantage will be much more muted, if any, over its human counterpart. Like our world, such a world will still involve much conflict and competition, with all sides aided by advanced technology, but without one system that dominates all others.

In essence, irreducible error and chaotic events blunt the edge of any superintelligent predictor in a sufficiently high-dimensional environment.
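
Barak's "error grows, then plateaus" picture is easy to reproduce in miniature. Below is a toy illustration of my own (not from the quoted text): a forecaster with a *perfect* model of the chaotic logistic map, handicapped only by a tiny error in the initial condition. The error compounds with the horizon until it saturates at the attractor's own spread, past which point no amount of extra sophistication helps.

```python
import random

# Toy sketch of the error-plateau claim: a perfect-model forecast of the
# chaotic logistic map, starting from an initial condition known only up
# to eps. Error grows with horizon, then saturates at the attractor's
# own spread.

def logistic(x, r=3.9):
    return r * x * (1 - x)

def forecast_error(horizon, eps=1e-6, trials=200, seed=0):
    """Mean absolute forecast error after `horizon` steps, averaged over
    random starting points, given initial-condition uncertainty eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        truth = rng.uniform(0.1, 0.9)
        guess = truth + eps          # same dynamics, slightly wrong start
        for _ in range(horizon):
            truth, guess = logistic(truth), logistic(guess)
        total += abs(truth - guess)
    return total / trials

for h in (1, 5, 10, 20, 40):
    print(h, round(forecast_error(h), 4))
```

At short horizons the error is still near eps; by horizon ~40 the two trajectories are fully decorrelated and the error flatlines, which is the regime where the superintelligent predictor stops beating the dart-throwing chimp.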

What remains to be answered for me:

  1. Can AI planners intervene in events with enough frequency and precision to proactively suppress chaos and reduce the world to a game of chess they can model to the draw?
  2. Is decently superhuman prediction and execution not enough to eliminate war, simply because humans are already close to this level and only initiate wars they won't win (instead of pragmatically retreating to some defensible compromise) in feats of retardation (see: Russia)?

I think a case can be made that whenever the Chinese state is stable, it is very stable and very good at suppressing peasant rebellions. This is perhaps grounds for an unflattering stereotype about cruelty and power distance, but not so much about obedience of the masses.

that I argue that you do not understand how well connected people like them are or are not to the dominant American (or Western) political cultures.

So, how well connected are they? Enlighten me. Being a clueless Imperialist (or however you see me), I have developed the impression that Peter Thiel and his creatures, and Palantir specifically, are fairly well connected in the current American establishment.

Functional American hegemony, whether as a means or as an end, has clearly lasted for decades. Do you simply concede that it's not going to survive, and that the US accepts this because its continuation is either not worth fighting for or not feasible?

Do you argue that people like Palmer Luckey, Alex Karp, Alex Wang, Dario Amodei, Sam Altman are, similarly to me, clueless and disconnected from your political culture? Because they definitely argue for the maintenance and indeed revitalization of hegemony, not some strategic retreat to domestic affairs. Says Amodei:

This means that in 2026-2027 we could end up in one of two starkly different worlds. In the US, multiple companies will definitely have the required millions of chips (at the cost of tens of billions of dollars). The question is whether China will also be able to get millions of chips.

If they can, we'll live in a bipolar world, where both the US and China have powerful AI models that will cause extremely rapid advances in science and technology — what I've called "countries of geniuses in a datacenter". A bipolar world would not necessarily be balanced indefinitely. Even if the US and China were at parity in AI systems, it seems likely that China could direct more talent, capital, and focus to military applications of the technology. Combined with its large industrial base and military-strategic advantages, this could help China take a commanding lead on the global stage, not just for AI but for everything.

If China can't get millions of chips, we'll (at least temporarily) live in a unipolar world, where only the US and its allies have these models. It's unclear whether the unipolar world will last, but there's at least the possibility that, because AI systems can eventually help make even smarter AI systems, a temporary lead could be parlayed into a durable advantage. Thus, in this world, the US and its allies might take a commanding and long-lasting lead on the global stage.

Well-enforced export controls are the only thing that can prevent China from getting millions of chips, and are therefore the most important determinant of whether we end up in a unipolar or bipolar world.

As you can see he deems bipolar outcome unacceptable, since it's merely a prelude to American (and all Western/liberal) defeat: either the US wins a “durable strategic advantage” by capitalizing on its compute edge, or China does by capitalizing on its industrial capacity. For my part I think he's wrong and dumb, the US is highly defensible and not at risk of Chinese unipolar dominance. But that is his argument, and others are making near-identical ones.

From CSIS, I don't know, maybe you hold them too in contempt, but they use the same terminology:

China’s success to date suggests that, at least for Huawei Ascend chips, the answer is that they will have millions of chips within the next year or two. Thankfully, these chips are, at present, dramatically lower performing than Nvidia ones for training advanced AI models; they are also supported by a much weaker software ecosystem with many complex issues that will likely take years to sort out. This is the time that the export controls have bought for the United States to win the race to AGI and then use that victory to try and build more durable strategic advantages. At this point, all the margin for sloppy implementation of export controls or tolerance of large-scale chip smuggling has already been consumed. There is no more time to waste.

All of this does not look to me like acceptance of coming multipolarity.

Do you write it off as inconsequential self-interest of individual players, because the vote of salt-of-the-earth rednecks is more influenced by price of eggs?

I think this is still too self-serving and frankly racist a spin: “Chinese are robots, sure, but they can train robots on Western genius and follow their lead, still copying the West by proxy”.

I try to approach this technically. Technically, you say that Asians are incapable of thought in full generality, that – speaking broadly – they can only “execute” but not come up with effective out-of-the-box plans; that their very robust IQ edge (Zhejiang, where DeepSeek is headquartered and primarily recruits, and where Wenfeng comes from, has an average IQ of around 110 – that's on par with Ashkenazim!) is achieved with some kind of interpolation and memorization, but not universal reasoning faculties. To me this looks like a very shaky proposition. From my/r1's deleted post:

The West’s mythos of creative genius – from Archimedes to Musk – emerged from unnaturally prolonged frontiers. When Europe lost 30-60% of its population during the Black Death, it reset Malthusian traps and created vacant niches for exploration. The American frontier, declared "closed" by the 1890 Census, institutionalized risk-taking as cultural capital. By contrast, China’s Yangtze Delta approached carrying capacity by the Song Dynasty (960-1279 CE). Innovation became incremental: water mills optimized, tax registers refined, but no steam engines emerged.

This wasn’t a failure of intelligence, but a rational allocation of cognitive resources. High population density selects for "intensive" IQ – pattern-matching within constraints – rather than "extensive" creativity. The same rice paddies that demanded obsessive irrigation schedules cultivated the hyper-adaptive minds now dominating international math Olympiads. China’s historical lack of Nobel laureates in science (prior to 1950) reflects not a missing "genius gene," but a Nash equilibrium where radical exploration offered negative expected value.

R1 might understate the case for deep roots of Western exploratory mindset, but where we agree is that its expression is contingent. Consider: how innovative is Europe today? It sure innovates in ways of shooting itself in the foot with bureaucracy, I suppose. Very stereotypically Chinese, I might say.

What I argue is that whereas IQ is a fundamental trait we can trace to neural architecture, and so are risk-avoidance or conformism, which we can observe even in murine models, “innovativeness” is not. It's an application of IQ to problem-solving in new domains. There's not enough space in the genome to specify problem-solving skills only for domains known in the bearer's lifetime, because domains change; Asians are as good in CTF challenges as their grandfathers were in carving on wood. What can be specified is lower tolerance to deviating from the groupthink, for example as cortisol release once you notice that your idea has not been validated by a higher-status peer; or higher expectation of being backstabbed in a vulnerable situation if you expend resources on exploration; or greater subjective sense of reward for minimizing predictive error, incentivizing optimization at the expense of learning the edges of the domain, thinking how it extends, testing hypotheses and hoping to benefit from finding a new path. Modulo well-applied and tempered IQ, this eagerness to explore OOD is just a result of different hyperparameter values that can also produce maladaptive forms like useless hobbies, the plethora of Western sexual kinks (furries?) and the – no, no, it's not just Jewish influence, own up to it – self-destructive leftist ideologies.
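
The "different hyperparameter values" framing maps directly onto the classic explore/exploit dial in reinforcement learning. A minimal sketch of my own (not from the post): two otherwise identical learners on a 10-armed bandit, differing only in their exploration rate epsilon. Zero exploration tends to lock in the first arm that happens to pay off, not the best one.

```python
import random

# Hypothetical illustration of the explore/exploit "hyperparameter" claim:
# same learning rule, same environment, only the exploration rate differs.

def run_bandit(epsilon, steps=5000, arms=10, seed=0):
    """Average per-step reward of an epsilon-greedy agent on a
    10-armed Gaussian bandit."""
    rng = random.Random(seed)
    means = [rng.gauss(0, 1) for _ in range(arms)]  # true (hidden) arm values
    est = [0.0] * arms   # running estimates of each arm's value
    counts = [0] * arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(arms)                      # explore
        else:
            a = max(range(arms), key=lambda i: est[i])   # exploit current belief
        r = rng.gauss(means[a], 1)
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]               # incremental mean
        total += r
    return total / steps

print(run_bandit(0.0))   # pure exploitation: settles on a "good enough" arm
print(run_bandit(0.1))   # modest exploration: usually finds the best arm
```

The greedy agent is not less intelligent in any sense; it just never pays the short-term cost of testing the edges of the domain, and so converges to a locally adequate arm. That is the whole analogy in six lines of control flow.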

One anecdote is illustrative of the conundrum, I think. Some time ago, a ByteDance intern came up with a very interesting image generation technique, VAR. It eventually won the NeurIPS best paper award! Yandex trained a model based on it already, by the way, and Yandex has good taste (I may be biased of course). But what else did that intern do? Driven by ambition to scale his invention and make an even bigger name for himself, he sabotaged training runs of his colleagues to appropriate the idle compute, fully applying his creative intelligence to derail an entire giant corporation's R&D program! Look at this cyberpunk shit:

  • Modifying PyTorch Source Code: Keyu Tian modified the PyTorch source code in the cluster environment used by his colleagues, including changes to random seeds, the optimizer's direction, and data loading procedures. These modifications were made within Docker containers, which are not tracked by Git.
  • Disrupting Training Processes: Keyu Tian deliberately hacked the clusters to terminate multi-machine experiment processes, causing large-scale experiments (e.g., experiments on over thousands of GPUs) to stall or fail.
  • Security Attack: Tian gained unauthorized access to the system by creating login backdoors through checkpoints, allowing him to launch automated attacks that interrupted processes of colleagues' training jobs.
  • Interference with Debugging: Tian participated in the cluster debugging meeting and continuously refined the attack code based on colleagues' diagnostic approaches, exacerbating the issue.
  • Corrupting the Experiments: Tian modified colleagues' well-trained model weights, making their experimental results impossible to reproduce.

Upon uncovering clear evidence, ByteDance terminated Tian's internship. Instead of taking responsibility, he retaliated by publicly accusing other employees of framing him and manipulating public opinion in a malicious manner.

This, I think, is peak non-conformist genius, the stuff of the Romance of the Three Kingdoms and the warlord era. This is the essence of what the Confucian paradigm is trying to suppress, crushing benign self-expression at the same time.

But what if your peers cannot backstab you? What if resources are abundant? What if all your peers are rewarded for exploration and it clearly has positive ROI for them? It might not transmogrify the Chinese into archetypal Hajnalis, who engage in these behaviors without stilts, but the result will be much the same.

Only on greater scale.

R1:

Liang’s meta-derisking – making exploration legible, replicable, and prestigious – could trigger a phase shift. But true transformation requires more than outlier firms. It demands ecosystems that reward speculative genius as reliably as rice farmers once rewarded meticulousness. The question isn’t whether Chinese minds can innovate, but whether China’s institutional lattice will let a thousand DeepSeeks bloom – or if this lone swallow merely heralds a cultural spring that never comes.

China's fragile treasure

TL;DR: after months of observation, I am convinced that DeepSeek has been an inflection point in Chinese AI development and probably beyond that, to the level of reforming the national psyche and long-term cultural trajectory, actualizing the absurd potential they have built up in the last two decades and putting them on a straight path to global economic preeminence or even comprehensive hegemony. It is not clear to me what can stop this, except the idiocy of the CCP, which cannot be ruled out.

Last time I wrote on this topic I got downvoted to hell for using DeepSeek R1 to generate the bulk of text (mostly to make a point about the state of progress with LLMs, as I warned). So – only artisanal tokens now, believe it or not. No guarantees of doing any better though.

The direct piece of news inspiring this post is The Information's claim that DeepSeek, a private Chinese AGI company owned by Liang Wenfeng, is implementing some very heavy-handed measures: «employees told not to travel, handing in passports; investors must be screened by provincial government; gov telling headhunters not to approach employees». This follows OpenAI's new Global Policy chief Chris Lehane accusing them of being state-subsidized and state-controlled and framing as the main threat to the West, popular calls on Twitter (eg from OpenAI staff) to halt Chinese AI progress by issuing O1 visas or better offers to all key DeepSeek staff, and the sudden – very intense – attention of Beijing towards this unexpected national champion (they weren't among the «six AI tigers» pegged for that role, nor did they have the backing of incumbent tech giants; what they did have was grassroots attention of researchers and users in the West, which China trusts far more than easily gamed domestic indicators).

I am not sure if this is true; it may be more FUD, like the claims about them having 50K H100s and lying about costs, claims of them serving at a loss to undercut competition, claims of compensation over $1M, and other typical pieces of «everything in China is fake» doctrine that have been debunked. But China does have a practice of restricting travel for people deemed crucial for national security (or involved in financial institutions). And DeepSeek fits this role now: they have breathed new life into the Chinese stock market, integrating their model is a must for every business in China that wants to look relevant and even for government offices, and their breakthrough is the bright spot of the National People’s Congress. They are, in short, a big deal. Bigger than I predicted 8 months ago:

This might not change much. Western closed AI compute moat continues to deepen, DeepSeek/High-Flyer don't have any apparent privileged access to domestic chips, and other Chinese groups have friends in the Standing Committee and in the industry, so realistically this will be a blip on the radar of history.

Seems like this is no longer in the cards.

Recently, @ActuallyATleilaxuGhola has presented the two opposite narratives on China which dominate the discourse: a Paper Tiger that merely steals, copies and employs smoke and mirrors to feign surpassing the fruit of American genius born of free exchange of ideas etc. etc.; and the Neo-China coming from the future, this gleaming juggernaut of technical excellence and industrial prowess. The ironic thing is that the Chinese themselves are caught between these two narratives, undecided on what they are, or how far they've come. Are they merely «industrious» and «good at math», myopic, cheap, autistic narrow optimizers, natural nerdy sidekicks to the White Man with his Main Character Energy and craaazy fits of big picture inspiration, thus doomed to be a second-tier player as a nation; with all cultural explanations of their derivative track record being «stereotype threat» level cope – as argued by @SecureSignals? Or are they just held back by old habits, path-dependent incentives and lack of confidence but in essence every bit as capable, nay, more capable of this whole business of pushing civilization forward, and indeed uplifting the whole planet, as argued by Chinese Industrial Party authors – doing the «one thing that Westerners have been unwilling or powerless to accomplish»?

In the now-deleted post, R1 and I argued that they are in a superposition. There are inherent racial differences in cognition, sure, and stereotypes have truth to them. But those differences only express themselves as concrete phenotypes and stereotypes contextually. In the first place, the evo psych story for higher IQ of more northern ancestral populations makes some sense, but there is no plausible selection story for Whites being unmatched innovators in STEM or anything else. What is plausible is that East Asians are primed (by genetics and, on top of that, by Confucian culture and path dependence) towards applying their high (especially in visually and quantitatively loaded tasks) IQ to exploitation instead of exploration, grinding in low-tail-risk, mapped-out domains. Conformism is just another aspect of it; and so you end up with a civilization that will hungrily optimize a derisked idea towards razor-thin margins, but won't create an idea worth optimizing in a million years. Now, what if the calculus of returns changes? What if risk-taking itself gets derisked?

And I see DeepSeek as a vibe shift moment nudging them in this direction.

The Guoyun narrative around DeepSeek began when Feng Ji 冯骥, creator of the globally successful game “Black Myth: Wukong,” declared it a “national destiny-level technological achievement.” The discourse gained momentum when Zhou Hongyi 周鸿祎, Chairperson of Qihoo 360, positioned DeepSeek as a key player in China’s “AI Avengers Team” against U.S. dominance. This sentiment echoed across media, with headlines like “Is DeepSeek a breakthrough of national destiny? The picture could be bigger.” The discourse around 国运论 (guóyùn lùn, or “national destiny theory”) reveals parallels to America’s historical myth-making. Perhaps the most striking similarity between China and the US is their unwavering belief in their own exceptionalism and their destined special place in the world order. While America has Manifest Destiny and the Frontier Thesis, China’s “national rejuvenation” serves as its own foundational myth from which people can derive self-confidence.

And to be clear, DeepSeek is not alone. Moonshot is on a very similar level (at least internally – their unreleased model dominates LiveCodeBench), and so are StepFun, Minimax and Alibaba Qwen. Strikingly, you see the sudden formation of an ecosystem. Chinese chip and software designers are optimizing their offerings towards efficient serving of DeepSeek-shaped models, Moonshot adopts and builds on DeepSeek's designs in new ways, Minimax's CEO says he was inspired by Wenfeng to open source their LLMs, there are hundreds of papers internationally that push beyond R1's recipe… the citation graph is increasingly painted red. This, like many other things, looks like a direct realization of Wenfeng's long-stated objectives:

Innovation is undoubtedly costly, and our past tendency to adopt existing technologies was tied to China’s earlier developmental stage. But today, China’s economic scale and the profits of giants like ByteDance and Tencent are globally significant. What we lack isn’t capital but confidence and the ability to organize high-caliber talent for effective innovation … I believe innovation is, first and foremost, a matter of belief. Why is Silicon Valley so innovative? Because they dare to try. When ChatGPT debuted, China lacked confidence in frontier research. From investors to major tech firms, many felt the gap was too wide and focused instead on applications.

NVIDIA’s dominance isn’t just its effort—it’s the result of Western tech ecosystems collaborating on roadmaps for next-gen tech. China needs similar ecosystems. Many domestic chips fail because they lack supportive tech communities and rely on secondhand insights. Someone must step onto the frontier.

We won’t go closed-source. We believe that establishing a robust technology ecosystem matters more.

No “inscrutable wizards” here—just fresh graduates from top universities, PhD candidates (even fourth- or fifth-year interns), and young talents with a few years of experience. … V2 was built entirely by domestic talent. The global top 50 might not be in China today, but we aim to cultivate our own.

BTW: I know @SecureSignals disagrees on the actual innovativeness of all this innovation. Well, suffice to say, the opinion in the industry is different. Their paper on Native Sparse Attention, pushed to arXiv (by Wenfeng personally – he is an active researcher and is known to have contributed to their core tech) just the day before Wenfeng went to meet Xi, looks more impressive than what we see coming from the likes of Google DeepMind, and it has a… unique cognitive style. They have their very distinct manner, as does R1. They had nowhere to copy that from.

Maybe all of it is not so sudden; the hockey-stick-like acceleration of Chinese progress is a matter of boring logistics, not some spiritual rebirth, much like the hockey stick of their EV or battery sales. For decades, they've been mainly a supplier of skilled labor to America, which masked systemic progress. All the while they have been building domestic schools to retain good educators, training new researchers and engineers without entrusting this to Microsoft Asia and Nvidia and top American schools, growing the economy and improving living conditions to increase retention and have businesses to employ top talent and give them interesting enough tasks… so at some point it was bound to happen that they begin graduating about as much talent as the rest of the world combined, a giant chunk goes to their companies, and that's all she wrote for American incumbents in a largely fake, sluggish market. DeepSeek, or Wenfeng personally, is not so much a crown jewel of the Chinese economy as a seed of crystallization of the new state of things, after all the pieces have been set.

The boost of confidence is visible outside the AI sphere too. I find it remarkable that He Jiankui is shitposting on Twitter all the time and threatening to liberate humanity from the straitjacket of «Darwin's evolution». A decade earlier, one would expect his type to flee to the West and give lectures about the menace of authoritarianism. But after three years in Chinese prison, he's been made inaugural director of the Institute of Genetic Medicine at Wuchang University and conspicuously sports a hammer-and-sickle flag on his desk. The martyr of the free market, Jack Ma, also has been rehabilitated, with Xi giving him a very public handshake (alongside Wenfeng, Unitree's Wang Xingxing, Xiaomi's Lei Jun and other entrepreneurs).

…but this is all fragile, because China remains a nation led by the CCP, which remains led by one boomer of unclear sentience and a very clear obsession with maximizing his control and reducing risk to himself. In that, Wenfeng is similar – he's bafflingly refusing all investment, from both private and state entities, because it always has strings attached, I suppose.

“We pulled top-level government connections and only got to sit down with someone from their finance department, who said ‘sorry we are not raising’,” said one investor at a multibillion-dollar Chinese tech fund. “They clearly are not interested in scaling up right now. It’s a rare situation where the founder is wealthy and committed enough to keep it lean in a Navy Seal-style for his pursuit of AGI.”

But you can't just refuse the CCP forever. Reports that he's been told not to interact with the press seem credible; perhaps the story about passports will come true too, as DeepSeek's perceived value grows. In that moment, China will largely abandon its claim to ascendancy, vindicating American theory that Freedom always wins hearts and minds. People, even in China, do not acquire world-class skills to be treated like serfs.

…If not, though? If China does not just shoot itself in the foot, with heavy-handed securitization, with premature military aggression (see them flexing their blue water navy they supposedly don't have in Australian waters, see their bizarre landing ships designed for Taiwan Operation, see their 6th generation aircraft…), with some hare-brained economic scheme – where does this leave us?

I've been thinking lately: what exactly is the American theory of victory? And by victory I mean retaining hegemony, as the biggest strongest etc. etc. nation on the planet, and ideally removing all pesky wannabe alternative poles like Russia, China and Iran. Russia and Iran are not much to write home about, but what to do with China?

The main narrative I see is something something AGI Race: the US builds a God-level AI first, then… uh, maybe grows its economy 100% a year, maybe disables China with cyberattacks or nanobots. I used to buy it when the lead time was about 2 years. It's measured in months now: research-wise, they have fully caught up, releases after V3 and R1 show that the West has no fundamental moat at all, and it's all just compute.

In terms of compute, it's very significant to my eyes that TSMC has been caught supplying Huawei with over 2 million Ascend chip dies. This could not have been obfuscated with any amount of shell companies – TSMC, and accordingly Taipei, knew they were violating the American decree. Seeing Trump's predatory attitude towards TSMC (them being forced to invest into manufacturing on American soil and now to fix Intel's mess with a de facto technology transfer… as an aside, Intel's new CEO is a former director of SMIC, so literally all American chip companies are now headed by Chinese or Taiwanese people), I interpret this as hedging rather than mere corruption – they suspect they will not be able to deter an invasion or convince the US to do so, and are currying favor with Beijing. By the way, the Ascend 910C is close to the performance of the Nvidia H800. R1 was trained on 2048 H800s, so just from this one transaction, China will have around 500 times more compute, and by the end of the year they will be able to produce another couple million dies domestically. So, it is baked in that China will have AGI and ASI shortly after the US at worst, assuming no first strike from the latter.
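
The "around 500 times" figure follows from simple arithmetic, if we grant the post's own numbers plus one assumption it leaves implicit: that a 910C packages two dies (otherwise 2M dies against 2048 chips would give roughly 1000x, not 500x). A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the "~500x" claim. Inputs are the post's
# figures, not independently verified; the two-dies-per-chip assumption is
# inferred from the claim itself.

dies = 2_000_000        # Ascend dies reportedly supplied by TSMC
dies_per_chip = 2       # assumed: 910C is a dual-die package
r1_cluster = 2048       # H800s used to train R1, per the text

chips = dies // dies_per_chip
ratio = chips / r1_cluster
print(chips, round(ratio))   # 1000000 chips, ~488x the R1 cluster
```

So under those assumptions the transaction is worth roughly 488 R1-sized clusters (treating one 910C as roughly one H800), which the post rounds to 500.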

In terms of cyberattacks for first strike, AIs are already good enough to meaningfully accelerate vulnerability search; coupled with the vast advantage in computer-literate labor force (and to be honest, actual state-backed hackers), China will be able to harden their infrastructure in short order, and there's no amount of cleverness that gets past provably hardened code. So this is a very uncertain bet.

In terms of economic growth, this is usually tied to automation. China seems to be on par in robotics research (at least), controls almost the entire supply chain, and has an incomparably bigger installed automated manufacturing base (see their EV factories, which are now also producing robots). They will have OOMs more humanoids and probably faster compounding growth. This more than covers for their workforce aging, too.

Then I hear something about Malacca strait blockade. Suffice to say this seemed more convincing when they really didn't have a «blue water navy», which they now clearly have, contra Peter Zeihan. They're also making great progress in weaning their civilian economy off oil (high speed rail instead of planes, normal rail for freight, EVs again, nuclear and renewable buildouts…) and have stockpiled giant reserves so oil cutoff won't really deter them. They are not quite food-secure but likely won't starve without imports. So blockade is no solution.

Lastly, I've seen this theory that Starship (once it's ready for prime time) provides the US with insurmountable advantage in mass to orbit, thus all the old Star Wars plans are back in action and Chinese nuclear deterrence is neutralized. This doesn't seem feasible because they're working on their own economical reusable rockets – across multiple companies as usual – and are very close to success, and there are signs that this project has very favorable scalability, to the point the US will lose its mass to orbit lead in under three years, or at least it will be diminished. (Personally I think Zhuque-3 is a more sensible design than Musk's monstrosity, though it's just a tasteful interpolation between Falcon and Starship. Learning from mistakes of others is a common late mover advantage).

Sector by sector and attack vector by attack vector, it's all like that.

So… what is left?

As far as I can tell, at this trajectory only China can defeat China – the hidebound, unironic Communists in control, fulfilling the mawkish Western prophecy they try to avoid, bear-hugging to death the young civilization that grew around their mandate and is now realizing its destiny. Confiscating passports, banning open source that widens the talent funnel, cracking down on «speculative investments», dragging them back into the 20th century at the brink of the long-coveted «national rejuvenation».

…Parallels to the US are probably clear enough.

I'd say it's another bit of evidence for Google upgrading their product strategy, but nothing unexpected capabilities-wise. Shame they did not release the weights, instead shipping only Gemma 3 with image-in text-out. «Safety» reasoning is obvious enough.

Contra @SkoomaDentist I think it's not fair to describe this as «The LLM is still talking to the image generator», ie that the main LLM is basically just the encoder for some diffusion model or another separate module. The semantic fidelity and surgical precision of successive edits suggest nothing like that, and point instead to a unified architecture with a single context where each token, be it textual or visual, is embedded in its network of relationships with all others (well, that's what these models are – literally, hypotheses about the shape of the training data manifold). Back when OpenAI announced their image-out capabilities with 4o, the teaser generation said «suppose we directly model P(text, image, sounds) with one big autoregressive transformer». Shortly after, Meta (or really Armen Aghajanyan, who has since departed largely in protest over Chameleon's safety-informed nerfing, and his team) published their Chameleon, a parallel work in identical spirit:

This early-fusion approach, where all modalities are projected into a shared representational space from the start, allows for seamless reasoning and generation across modalities. … Chameleon represents images, in addition to text, as a series of discrete tokens and takes advantage of the scaling properties of auto-regressive Transformers … We train a new BPE tokenizer (Sennrich et al., 2016) over a subset of the training data outlined below with a vocabulary size of 65,536, which includes the 8192 image codebook tokens …
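
To make the quoted design concrete, here is a toy sketch of the shared-vocabulary bookkeeping, using the sizes from the quote (65,536 total, 8,192 image codes) with everything else – token ids, sequence layout – invented:

```python
# Early fusion: text and image share ONE vocabulary and one autoregressive
# sequence. Sizes follow the Chameleon quote (65,536 total, 8,192 image codes);
# the token ids themselves are hypothetical.
TOTAL_VOCAB = 65_536
IMAGE_CODEBOOK = 8_192
TEXT_VOCAB = TOTAL_VOCAB - IMAGE_CODEBOOK
IMAGE_OFFSET = TEXT_VOCAB  # image codes occupy the top of the shared vocab

def image_to_shared(vq_ids):
    """Map raw VQ codebook indices into the shared vocabulary range."""
    return [IMAGE_OFFSET + i for i in vq_ids]

def fuse(caption_ids, vq_ids):
    """One flat sequence: the transformer predicts the next token
    regardless of whether it happens to be textual or visual."""
    return caption_ids + image_to_shared(vq_ids)

seq = fuse([101, 7, 42], [0, 5, 8_191])
assert all(t < TOTAL_VOCAB for t in seq)
assert seq[3:] == [57_344, 57_349, 65_535]  # image tokens, offset into shared space
```

The point is that nothing downstream of the tokenizer knows which modality a token came from; an "image edit" is then literally next-token prediction over the same context window as text.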

Later, DeepSeek, who are probably the best team in the business (if not for resource limits), have been working on Janus, which is also a unified model of a potentially superior design:

Specifically, we introduce two independent visual encoding pathways: one for multimodal understanding and one for multimodal generation, unified by the same transformer architecture … Autoregressive models, influenced by the success in language processing, leverage transformers to predict sequences of discrete visual tokens (codebook IDs) [24, 65, 75]. These models tokenize visual data and employ a prediction approach similar to GPT-style [64] techniques. … Chameleon [77] adopts a VQ Tokenizer to encode images for both multimodal understanding and generation. However, this practice may lead to suboptimal outcomes, as the vision encoder might face a trade-off between the demands of understanding and generation. In contrast, our Janus can explicitly decouple the visual representations for understanding and generation, recognizing that different tasks may require varying levels of information. … for text understanding, we use the built-in tokenizer of the LLM to convert the text into discrete IDs and obtain the feature representations corresponding to each ID. For multimodal understanding, we use the SigLIP [92] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor is used to map these image features into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [73] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. 
The built-in prediction head of the LLM is utilized for text predictions in both the pure text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image predictions in the visual generation task.
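
Rendered as a shape-level sketch (all dimensions are invented; only the wiring – two visual pathways, two adaptors, one LLM input space – follows the quoted description):

```python
import numpy as np

# Janus-style decoupled visual pathways feeding one LLM trunk.
# Dimensions are made up for illustration; only the wiring follows the paper.
D_MODEL = 16      # LLM hidden size (hypothetical)
D_SEMANTIC = 8    # SigLIP feature dim (hypothetical)
CODEBOOK = 32     # VQ codebook size (hypothetical)

rng = np.random.default_rng(0)
understanding_adaptor = rng.normal(size=(D_SEMANTIC, D_MODEL))  # for image understanding
generation_embed = rng.normal(size=(CODEBOOK, D_MODEL))         # VQ codebook -> LLM space

def encode_for_understanding(siglip_grid):
    """Flatten a 2-D grid of SigLIP features into a 1-D sequence in LLM space."""
    flat = siglip_grid.reshape(-1, D_SEMANTIC)
    return flat @ understanding_adaptor            # (H*W, D_MODEL)

def encode_for_generation(vq_ids):
    """Look up VQ codebook ids through the generation adaptor."""
    return generation_embed[np.asarray(vq_ids)]    # (T, D_MODEL)

img_tokens = encode_for_understanding(rng.normal(size=(2, 2, D_SEMANTIC)))
gen_tokens = encode_for_generation([3, 17, 5])
# Both land in the same input space, so the LLM processes one fused sequence:
fused = np.concatenate([img_tokens, gen_tokens], axis=0)
assert fused.shape == (4 + 3, D_MODEL)
```

The trade-off the paper names falls out of this: the understanding pathway can stay semantic while the generation pathway stays reconstructive, since neither encoder has to serve both masters.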

I expect DeepSeek's next generation large model to be based on some mature form of Janus.

I think Gemini is similar. This may be the first time we get to evaluate the power of modality transfer in a well-trained model – usually you run into the bottleneck of the projection layer, as @self_made_human describes. But here, it can clearly copy an image (up to the effective "resolution" of its codebook and tokenizer) and make isolated transformations, precisely the way transformers can do to a text string. Hopefully this means its pure verbalized understanding of the visual modality (eg spatial relations, say… anatomy…) is upgraded. Gooners from 4chan ought to be reaching the same conclusion as I type this.

In the next iteration, video and probably 3D meshes will get similar treatment.

P.S. SkoomaDentist, bizarrely aggressive and insistent that this is anything like inpainting, is being very funny. Inpaint this. No, no, these are not vulgar tricks, and I don't see why one would be invested in bitterly arguing against that.

Manus is a generic thin wrapper over a strong Western model (Sonnet 3.7), if a bit better executed than most, and I am quite unhappy about this squandering of DeepSeek's cultural victory. The developers are not deeply technical and have instead invested a lot into hype, with invites to influencers and creating a secondary invite market, cherrypicked demos aimed at low value add SEO-style content creation (eg “write a course on how to grow your audience on X”) and pretty UX. Its performance on GAIA is already almost replicated by this opensource repo. This is the China we know and don't very much love, the non-DeepSeek baseline: tacky self-promotion, jumping on trends, rent-seeking, mystification. In my tests it hallucinates a lot – even in tasks where naked Sonnet can catch those same hallucinations.

The real analogy to DeepSeek is that, like R1 was the first time laymen used to 4o-mini or 3.5-turbo level slop got a glimpse of a SoTA reasoner in a free app, this is the first time laymen have been exposed to a strong-ish agent system, integrating all features that make sense at this stage – web browsing, pdf parsing, code sandbox, multi-document editing… but ultimately it's just a wrapper bringing out some lower bound of the underlying LLM's latent capability. Accordingly it has no moat, and does not benefit China particularly.

Ah well. R2 will wash all that away.

It is interesting, however, that people seemed to have no clue just how good and useful LLMs already are, probably due to lack of imagination. They are not really chatbot machines, they can execute sophisticated operations on any token sequences, if you just give them the chance to do so.

My radical thesis is that both are shitholes but not really militarily inept ones in the way people might imagine.

Experience in modern warfare, army size, operational logistics. Western aviation is unlikely to be a game-changer.

I do not account for nukes.

Putin would have caved because his nation is barely hanging on while fighting against a 3rd rate local power

As a matter of fact I think this is not how we should be perceiving Ukraine, and in its present condition it would likely have been able to overwhelm any European military except perhaps France and Poland one on one. Consider that Europeans are not actually Aryan superhumans; their pretty exercises would amount to meme material within a week of fighting a real large-scale war, and they have very little in the way of materiel too. They are concerned about Russia for a good reason: they are in fact weak.

That's fine, I don't feel entitled to your time at all. I also can't predict what might trigger you, just like you cannot predict what would trigger me, nor does it seem like you would care.

Fetishizing algorithmic design is, I think, a sign of mediocre understanding of LLMs, being enthralled by cleverness. Data engineering carves more interesting structure into weights.

The discussion was originally about labs overwhelmingly focused on LLMs and competing for top talent in all of ML industry so partially that was just me speaking loosely.

I do in fact agree with the heads of those labs and most star researchers they've got that LLMs strikingly similar to what was found in 2017 will suffice for the shortest, even if not the globally optimal, route to “AGI” (it's an economic concept now anyway, apparently). But it is fair that in terms of basic research there are bigger, greener pastures of intellectual inquiry, and who knows – maybe we will even find something more general and scalable than a well-designed Transformer there. Then again, my view is that taste is to be calibrated to the best current estimate of the promise of available directions, and in conjunction with the above this leads me to a strong opinion on people who dismiss work around Transformers, chiefly work on training signal sources that I've covered above, as “not new science”. Fuck it, it is science, even if a bit of a different discipline. You don't own the concept, what is this infuriatingly infantile dick-measuring?

It's not so much that I hold non-LLM, non-Transformer-centric algo design work in contempt as I am irritated by their own smug, egocentric condescension towards what I see as the royal road. Contrarianism, especially defensive contrarianism, is often obnoxious.

We don't know how Gemini is made. At this point I assume it's something incredibly dumb like Noam Shazeer's reduced attention schemes and not, say, DeepSeek's NSA. In short though, attention inherently allows for sparsity.
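
The claim that attention inherently allows for sparsity is easy to show in miniature: a reduced-attention scheme is just a mask over the score matrix. A toy sliding-window (causal, local) pattern, with arbitrary sizes:

```python
# Sparsifying attention = masking the score matrix before softmax.
# A sliding-window causal pattern like those in "reduced attention" schemes;
# sizes are arbitrary, chosen only for illustration.
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True iff query i may attend to key j."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=2)
# Each query attends to at most `window` keys instead of all i+1 previous
# positions, taking the cost from O(n^2) toward O(n * window).
visible = sum(sum(row) for row in mask)
assert visible == 11  # row 0 sees 1 position, rows 1..5 see 2 each
```

Schemes like NSA are much more sophisticated (learned, blockwise, hardware-aware selection), but they exploit this same degree of freedom: the attention pattern is a choice, not a fixed cost.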

Once again I notice that I am usually right in being rude to people, as their responses demonstrate precisely the character flaws inferred. This is a low-content post in defense of wounded ego, with snappy Marvel-esque one-liners («Won a Nobel prize») and dunks optimized for the audience but next to no interest in engagement on the object level. Yes, ML != LLMs, so what? Are you not talking about Altman and Elon who both clearly put their chips on LLMs? «That was a joke», yeah I get your jokes. Here's one you missed:

data engineering is needed and important! But it's not revolutionary. Data engineering, is the same as its always been.

It's not the same, though, that's the thing. Returning back to my point that has upset you –

Fetishizing algorithmic design is, I think, a sign of mediocre understanding of ML, being enthralled by cleverness. Data engineering carves more interesting structure into weights.

– I meant concretely that this is why leading companies now prioritize the creation of training signal sources, that is: datasets themselves (filtered web corpora, enriched and paraphrased data, purely synthetic data, even entirely non-lingual data with properties that induce interesting behaviors), curricula of datasets, model merging and distillation methods, training environments and reward shaping – over basic architecture research, in terms of non-compute spend and researcher hours; under the (rational, I believe) assumption that this has higher ROI for the ultimate goal of reaching "AGI", and that its fruit will be readily applicable to whatever future algorithmic progress may yield. This goes far beyond ScaleAI's efforts in harnessing Pinoy Intelligence to annotate samples for RLHF, and you have not even bothered to address any of this. If you think the names of Old Titans are a valid argument, I present Hutter as someone who Gets It: at our stage, what you train a sufficiently general architecture to approximate is more interesting, in terms of eventual structure, than how you achieve that generality.
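
For concreteness, here is the kind of thing "training signal engineering" means, shrunk to a toy. The quality proxy and the curriculum ordering are invented stand-ins; real pipelines use learned filters and far subtler curricula:

```python
# Toy data pipeline: filter a corpus by a quality proxy, then order the
# survivors into a curriculum. Both heuristics are illustrative stand-ins.
def quality_score(doc: str) -> float:
    """Lexical-diversity proxy for document quality (hypothetical filter)."""
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0

corpus = [
    "the the the the",                                    # degenerate, low signal
    "a short clean sentence",
    "diverse tokens carry more gradient signal per step",
]
filtered = [d for d in corpus if quality_score(d) > 0.5]
curriculum = sorted(filtered, key=lambda d: len(d.split()))  # "easy" docs first
assert "the the the the" not in curriculum
assert curriculum[0] == "a short clean sentence"
```

The design choice that matters is that both decisions – what survives and in what order it is seen – shape the final weights as surely as any architectural tweak, which is why this part of the stack is guarded.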

This older paper is a neat illustration too. Sohl-Dickstein and Metz have done a little bit of work in non-LLM algo design if you recall, maybe you'll recognize at least them as half-decent scientists.

Now, as regards poor taste in intellectual disagreements, let's revisit this:

Regardless of whether transformers are a dead-end or not, the current approach isn't doing new science or algo design. Its throwing more and more compute at the problem and then doing the Deepseek approach of finetuning the assembly level gpu instructions to exploit the compute even better so you can throw more compute at it. I doubt, Hinton, Goodfellow, LeCunn, Schimdhubber et al. have any desire to do that. Maybe if xAI did something revolutionary like leave the LLM space or introduce a non-MoE-Transformer model for AGI, then talent of that caliber might want to work there. Currently they exist so Elon can piss all over Altman.

My rudeness was not unprovoked; it was informed by the bolded parts. I saw it as a hubristic, elitist, oblivious, tone-deaf insult towards people – scientists – actually moving the field forward today, rather than 8 or 28 years ago, and I do not care that it's slightly obfuscated or that you lack the self-awareness to recognize the douche in the mirror but are eager to chimp out at it as you currently do.

Likewise my entire point, before you jumped into insult me, is that the Big Names in ML/AI are "fetishy algo freaks" They shockingly don't want to do non "mediocre algo butt sniffing" work. And Data Engineering isn't new, it isn't revolutionary, it's great, it works well, but it doesn't require some 1% ML researcher to pull it off. It requires a solid engineering team, some technical know-how, and a willingness to get your hands dirty. But no one is going to get famous doing it. It's an engineering task not a research task. And since research tasks are what people pay the ludicrously big bucks for at tech companies the engineers at xAI aren't being paid some massive king-sized salary...

yes thanks for clarification, that's exactly as I understood you.

I claim that to the extent that «talent of that caliber» shares your conceit that design of clever new algorithmic primitives for ANNs is «exciting new science» whereas data work remains and will remain ScaleAI-tier «mere data engineering, same as always», this talent is behind the times, too set in its ways, and is resting on its laurels; indeed this is the same high-level philosophical error of prizing manual structure design over simplicity, generality and scalability that keeps repeating with every revolution in AI, and that Sutton has famously exposed. They are free to work on whatever excites them, publish cute papers for fellow aficionados where they beat untuned mainstream baselines, or just leave the frontlines altogether, and even loudly assert that they have superior taste if they so choose, which in my view is just irrational fetishism plus inflamed ego; I think taste is to be calibrated to the actual promise of directions. But indeed, what do I know. You are free to share their presumptions. New scientific talent will figure it out.

Seems like you are asking us to praise ignorance over discovery?

To me it seems like the opposite, we just disagree on what qualifies as discovery or science at all, due to differences in taste.

As an exercise, can you tell me THE engineer at Deepseek who proposed or wrote their Parallel Thread Execution(PTX) code with a citation?

Egoists gonna be egoists.

Zhean Xu probably. But I think everyone on (Chenggang Zhao and Shangyan Zhou and Liyue Zhang and Chengqi Deng and Zhean Xu and Yuxuan Liu and Kuai Yu and Jiashi Li and Liang Zhao) list could ask for a megabuck total comp in a frontier lab now, and expect affirmative response.

I see you took this pretty personally.

All I have to say is that top AI research companies (not ScaleAI) are already doing data engineering (expansively understood to include training signal source) and this is the most well-guarded part of the stack, everything else they share more willingly. Data curation, curricula, and yes, human annotation are a giant chunk of what they do. I've seen Anthropic RLHF data, it's very labor intensive and it instantly becomes clear why Sonnet is so much better than its competitors.

They clearly enjoy designing "algos", and the world clearly respects them greatly for that expertise.

Really glad for them and the world.

Past glory is no evidence of current correctness, however. LeCun with his «AR-LLMs suck» has made himself a lolcow, so has Schmidhuber. Hochreiter has spent the last few years trying to one-up the Transformer and fell to the usual «untuned baseline» issue, miserably. Meta keeps churning out papers on architectures; they got spooked by DeepSeek V3, whose architecture section opens with «The basic architecture of DeepSeek-V3 is still within the Transformer (Vaswani et al., 2017) framework», and decided to rework the whole Llama 4 stack. Tri Dao did incredibly hard work with Mamba 1/2 – and where is Mamba? In models that fall apart on any long-context eval more rigorous than NIAH. Google published Griffin/Hawk because it's not valuable enough to hide. What has Hinton done recently, Forward-Forward? Friston tried his hand at this with EBMs and seems to have degraded into pure grift. Shazeer's last works are just «transformers but less attention», and it works fine. What's Goodfellow up to? More fundamental architecture search is becoming the domain of mentally ill 17yo twitter anons.

The most significant real advances in it are driven by what you also condescendingly dismiss – «low-level Cuda compiler writing and server orchestration», or rather hardware-aware Transformer redesigns for greater scalability and unit economics, see DeepSeek's NSA paper.

This Transformer is just a paltry, fetish, "algo".

Transformer training is easy to parallelize and it's expressive enough. Incentives to find anything substantially better increase by an OOM year on year, as do the compute and labor spent on it, to no discernible result. I think it's time to let go of faulty analogies and accept the most likely reality.

Regardless of whether transformers are a dead-end or not, the current approach isn't doing new science or algo design. Its throwing more and more compute at the problem

Fetishizing algorithmic design is, I think, a sign of mediocre understanding of ML, being enthralled by cleverness. Data engineering carves more interesting structure into weights.

Mistral was an upstart lab formed by Meta and GDM alumni. We can look up the bios of DeepSeek's researchers: they are youngsters with no experience in pretraining large language models and no connection to frontier labs; most have never worked nor studied outside China (according to the CEO, the V2 model is all indigenous). They strongly emphasize their disinterest in your prior experience when hiring. You can consider this all clever spectacle, but I think it's time to admit they're just that good.

By the way, they're starting to opensource their infrastructure on Monday.

Conspiracy theories around DeepSeek are pretty funny, people twist themselves into pretzels to not acknowledge the most parsimonious hypothesis. Because it feels too wild, I guess. Maybe too scary as well, because it suggests that China can birth like a hundred more such companies if it finds a hundred such CEOs. I collect these stories. They've used all of Singapore's compute! They pay $1.3M to top researchers! It's a «choreographed emergence» to deceive the oh-so-important Dean (but he knows better than to trust ChiComs)! The scale is much bigger, there are hidden disciples in cloistered cultivation! It's all so very creative.

In any case I have the direct opposite impression about the papers. They are kids, overwhelmingly under 30 and often 20, and they write very naturally and not academically. It's just raw intelligence and curiosity, not experience. It is known that roughly everyone at DeepSeek speaks fluent English – not normal for Chinese labs; they pay extreme attention to culture fit and aptitude in recruiting, and are severely ageist. Many core innovations come from undergrad interns; the first author on the NSA paper is an intern with an anime pfp too. We have reports from competitors' employees who rejected their offers because they perceived the company as too small and weak for its declared ambition. I don't know how to break it to you, but there's no Iron Curtain; things are fairly transparent.

These are the best ML PhDs China has.

Maybe. I'm not sure if DeepSeek has even 50 Ph.Ds though, and ByteDance has thousands.

If your hypothesis is correct, they will not significantly accelerate, now that they're acknowledged as national champions and are offered OOMs more resources than they had before. I think they will.

Grok 3 is quite certainly the best among available Instruct models. Grok-Mini is on par with o3-mini, PR nonsense about low-medium-high differentiation aside (DeepSeek docs suggest it's just something like an inference parameter for length of reasoning, so we should not assume that o3-mini-high is a superior quality model). Grok's RL training has only just begun, it is probable that they've reassembled their pipeline after DeepSeek-R1 paper came out.

Every lab will accelerate significantly in the near future, so I wouldn't overindex on this leap. Elon just wanted to claim priority at least for a while, much like DeepSeek's Wenfeng, so both rushed to ship a next-generation project faster than the competition, but fundamentals are still in favor of Google, Meta (despite a horrible self-defeating culture) and OpenAI, given the Stargate program. Grok 3 is just the first in a large crop of models trained on 100K+ class GPU fleets. Orion, Llama 4, Claude-Next are coming.

Even so, we have definitely gained another frontier actor.

With a few weeks of space between the initial marketing hype and observation, and Deepseek seems to be most notable for (a) claiming to have taken less money to develop (which is unclear given the nature of China subsidies), (b) being built off of other tech (which helps explain (a), and (c) being relatively cheap (which is partially explained by (a).

Man, you're really committed to the bit of an old spook who disdains inspecting the object level because you can't trust nuthin' out of Red Choyna. I could just go watch this guy or his pal instead.

It wouldn't be an exaggeration to say that the technical ML community has acknowledged DeepSeek as the most impressive lab currently active, pound for pound. What subsidies? There is nothing to subsidize, except perhaps sharashkas with hidden geniuses.

The US of A was founded by Emma Lazarus in 1965, or so I'm told.

This is definitely true, but are you really expecting good results from collapsing the whole house of polite fictions that keep us in the Rules-Based International Liberal Order regime?

It's just Netanyahu making an edgy present

This degree of tastelessness is befitting a bloodthirsty teenager rather than head of state. I doubt SS' exact interpretation of the symbolism, but to deny that it's symbolic is just not serious. Indeed, you concede it is symbolic since you recognize it as a "joke" about a particular Mossad operation.

P.S.

Why mock and threaten your best ally?

(Assuming for the sake of argument that this is what was happening) because the nature of relationship isn't an alliance. Israel offers the US and Trump personally nothing except perhaps tolerance. They do not fight your wars, they do not contribute to your war chest. For all intents and purposes the US is simply doing Israel's bidding, and administrations that oppose this get nowhere and are replaced. Not just Biden, but the whole mighty edifice of the Left, wokeness, progressivism, neo-marxism, all those forces the advances of which had caused the creation of this community, crumbled from the aftershocks of Oct 7. If you don't notice this seismic shift I don't know what to tell you.

The same explanation applies if we assume it's a mere edgy joke, however. Simply put, Israel does not act like a small country dependent on its great ally.

Sounds like they need LLM writing assistance more than anyone, then.