The Motte

Small-Scale Question Sunday for August 4, 2024

68 comments - 2301 thread views 2mo ago by PaperclipPerfector (text post)

Friday Fun Thread for August 2, 2024

136 comments - 2512 thread views 2mo ago by PaperclipPerfector (text post)

Quality Contributions Report for July 2024

38 comments - 5142 thread views 3mo ago by naraburns nihil supernum (text post)

Transnational Thursday for August 1, 2024

7 comments - 894 thread views 3mo ago by PaperclipPerfector (text post)

Wellness Wednesday for July 31, 2024

73 comments - 1556 thread views 3mo ago by PaperclipPerfector (text post)

Matt Yglesias Considered As The Nietzschean Superman

79 comments - 2799 thread views 3mo ago by No_one (astralcodexten.com)

Common Wisdom and Conspiracies

36 comments - 1397 thread views 3mo ago by jake (text post)

When we talk about the serious conspiracies, those that pass the schizoabsurd, lizardman-constant filter, meaning skepticism toward the official accounts of certain pivotal events, a common wisdom is quickly invoked: It would take too many people, and someone would talk.

Would they?

On March 8, 1971, Smokin' Joe Frazier fought Muhammad Ali in the Fight of the Century. Both were undefeated -- 26-0 with 23 knockouts for Frazier, 31-0 with 25 knockouts for Ali. More than the biggest fight, it was considered the biggest sporting event ever up to that point. Madison Square Garden made a million at the gate; 2.5 million tickets were sold for closed-circuit pay-per-view venues; in London, where it was broadcast at midnight, 90,000 tickets were sold. They went 15 rounds and Smokin' Joe won by unanimous decision, though Ali would go on to win their next two bouts. While everyone was watching that fight, less than 100 miles away 8 members of the Citizens' Commission to Investigate the FBI broke into a Bureau office in a Philadelphia suburb called Media. The documents they found revealed the existence of the FBI's Counter Intelligence Program: COINTELPRO.

COINTELPRO started in 1956, its stated goal was undermining communist activity in the United States and much can be said on that, but I think most relevant is socialism and communism already had a popularity in the States at the turn of the century and after World War 1 and the Russian Revolution they had a real presence in American academia. I doubt a man so circumspect as J. Edgar Hoover was unaware of decades of fomenting communist thought and the subsequent infiltration into power of white communists. I imagine his black book had quite a few names Joseph McCarthy would have been very interested in seeing. Nevertheless, it went on, COINTELPRO worked against the Communist Party of the US, the Socialist Workers Party, the Black Panthers, and also the KKK. "Tactics included anonymous phone calls, IRS audits, and the creation of documents that would divide the American communist organization internally." MLK arrived and Hoover quickly identified him and singled him out, bugging his home and hotel rooms, and then using the audio from the bugs to threaten King, saying he should kill himself. I have a singular hatred for communism and MLK was a socialist but he had committed no crime, there was no legal basis for the FBI's considerable efforts against him. RFK signed off on a month of watching MLK, Hoover just kept it running.

On the militant side, COINTELPRO efforts, if not wholly responsible for the schism in the Nation of Islam that saw Malcolm X break away, sharply accelerated the deterioration of the relationship between Malcolm and Elijah Muhammad that culminated in NOI members killing Malcolm.

As an aside, the FBI was apparently concerned with and dedicated to preventing the rise of a "Messiah-like figure" who would unify black militants. I find this curious. At the time the demographics of the US were 88% white, 10% black, 4% hispanics of any race, those are stark lines. Had there been black militancy and an actual armed conflict, they would have been put down hard and America of the 60s, certainly the South as popularly portrayed, surely had the racial animus to back mass expulsion or if not that, death squads. Right? Hoover was no integrationist. I notice I'm confused. Alas.

For these, for the buggings, for the creation of inflammatory documents, we have an FBI that had no problem serially and severely breaking the law, stoking hostilities, overlooking murders they effectively encouraged, and with MLK, just outright telling the guy "kill yourself or else." They didn't bother with that for Fred Hampton, they just had him killed. Maybe this seems tame now, my how we've fallen.

The CIA had something of their own version of COINTELPRO established under LBJ and expanded by Nixon, Operation MHCHAOS. They also had something older and in the same window as COINTELPRO: Project MKUltra. "MK" from the internal staff rating, the CIA's version of military MOS, involved in the project, and "Ultra" likely from the extremely high secrecy around the project. I expect most here have the gist: starting in the 50s, the CIA dosed the unknowing with various psychoactive substances alongside research into brainwashing, psychological torture and general manipulation of thought. MKUltra was the successor to the CIA's Project Artichoke, which was itself likely a successor to Nazi research from scientists procured through Operation Paperclip. What we know is horrifying, and we don't know a lot, because amidst Watergate, CIA director Richard Helms ordered the destruction of all MKUltra files. A small number survived. What brought it to light wasn't even anything out of the project itself, it was Seymour Hersh reporting on MHCHAOS in the New York Times. His piece resulted in the Rockefeller Commission and the Church Committee, and it was under those the existence of MKUltra was revealed.

Ted Kennedy, on the Senate floor in 1977:

The Deputy Director of the CIA revealed that over thirty universities and institutions were involved in an "extensive testing and experimentation" program which included covert drug tests on unwitting citizens "at all social levels, high and low, native Americans and foreign." Several of these tests involved the administration of LSD to "unwitting subjects in social situations."

Hundreds at least, maybe thousands of people were involved in MKUltra, and with universities performing tests on unwitting citizens it seems like it wasn't particularly compartmentalized. What brought it to light? It wasn't people on the inside blowing the whistle in the 50s or 60s or at the start of the 70s, and I doubt Helms was the only one who thought the American people wouldn't like the truth.

There's Iran-Contra. Oliver North & co. selling guns to Iran to fund the Contras in Nicaragua: busted by an Iranian official leaking to Lebanese journalists. There's the CIA's involvement in drug trafficking, something they've covered their tracks on well, "They knew it was happening" is good enough. There's also Operation Fast and Furious, though I wouldn't group it with the rest, it had interesting goals that might have worked, and those cartel guys don't really have problems getting guns so I don't see FF guns being found at shootings as the biggest pie on their face. But it's worth including because there were people within who objected and blew the whistle.

Then there's PRISM. You know it, Edward Snowden saw the NSA's backdoor to all internet communications, got the files to prove it, now he's a Russian citizen. PRISM is still around, it hasn't been reduced. They can still surveil whomever FISA says they can. The USA FREEDOM Act, the only attempt to limit its reach, moved data holding to the phone carriers; the NSA can still access that US citizen data with ex parte FISC warrants, an entirely separate, incredibly troublesome practice of the US government. PRISM is COINTELPRO and MHCHAOS in one, supersized, a dossier at a click for just about anyone, anywhere. It ranks among the very worst things done by the American government and nobody involved has said a fucking word except Edward Snowden.

What's the common factor?

In each scandal we have large numbers of people involved in such operations. In each scandal, save the exception that proves the rule, none of them came forward. Discovery happened by a lucky break-in, or investigation into a different debacle, or adversarial geopolitical interest.

A CIA official who knows his history knows they don't get caught because someone on the inside spills the beans. They get caught by leaving breadcrumbs for outside eyes. No breadcrumbs, no scandal.

There are smaller scandals, in terms of scope or gravity, not necessarily the height of office of those involved, where whistleblowers did come forward. It's not unheard of. But this is an evidenced rejection of the common wisdom that there is a limited and small number of people who can be involved in highly illegal and evil projects before someone says something. So: "It'd take too many people, and they'd talk"? No, sometimes they don't. Sometimes hundreds or thousands of people can scrape the abyss and go to their graves saying nothing.

Culture War Roundup for the week of July 29, 2024

1930 comments - 34051 thread views 3mo ago by PaperclipPerfector (text post)

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

Shaming.
Attempting to 'build consensus' or enforce ideological conformity.
Making sweeping generalizations to vilify a group you dislike.
Recruiting for a cause.
Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
Be as precise and charitable as you can. Don't paraphrase unflatteringly.
Don't imply that someone said something they did not say, even if you think it follows from what they said.
Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

Small-Scale Question Sunday for July 28, 2024

99 comments - 2318 thread views 3mo ago by PaperclipPerfector (text post)

Friday Fun Thread for July 26, 2024

87 comments - 2363 thread views 3mo ago by PaperclipPerfector (text post)

Transnational Thursday for July 25, 2024

25 comments - 1165 thread views 3mo ago by PaperclipPerfector (text post)

Wellness Wednesday for July 24, 2024

136 comments - 2193 thread views 3mo ago by PaperclipPerfector (text post)

Culture War Roundup for the week of July 22, 2024

2205 comments - 37100 thread views 3mo ago by PaperclipPerfector (text post)

Small-Scale Question Sunday for July 21, 2024

175 comments - 3303 thread views 3mo ago by PaperclipPerfector (text post)

Friday Fun Thread for July 19, 2024

67 comments - 2478 thread views 3mo ago by PaperclipPerfector (text post)

Transnational Thursday for July 18, 2024

25 comments - 1180 thread views 3mo ago by PaperclipPerfector (text post)

Wellness Wednesday for July 17, 2024

9 comments - 984 thread views 3mo ago by PaperclipPerfector (text post)

Culture War Roundup for the week of July 15, 2024

2363 comments - 38779 thread views 3mo ago by PaperclipPerfector (text post)

Small-Scale Question Sunday for July 14, 2024

180 comments - 3666 thread views 3mo ago by PaperclipPerfector (text post)

The state of open-source LLMs as of summer 2024, or: Core Values of Socialism with AGI characteristics, V2

45 comments - 4102 thread views 3mo ago by DaseindustriesLtd late version of a small language model (text post) Edited 3mo ago

Here some people have expressed interest in my take on AI broadly, and then there's Deepseek-Coder release, but I've been very busy and the field is moving so very fast again, it felt like a thankless job to do what Zvi does and without his doomer agenda too (seeing the frenetic feed on Twitter, one can be forgiven for just losing the will; and, well, I suppose Twitter explains a lot about our condition in general). At times I envy Iconochasm who tapped out. Also, this is a very niche technical discussion and folks here prefer policy.

But, in short: open source AI, in its most significant aspects, which I deem to be code generation and general verifiable reasoning (you can bootstrap most everything else from it), is now propped up by a single Chinese hedge fund (created in the spirit of Renaissance Technologies) which supports a small, ignored (except by scientists and a few crackpots on Twitter) research division staffed with some nonames, who are quietly churning out extraordinarily good models with the explicit aim of creating AGI in the open. These models happen to be (relatively) innocent of benchmark-gaming, but somewhat aligned to Chinese values. The modus operandi of DeepSeek is starkly different from that of either other Chinese or Western competitors. In effect this is the only known group both meaningfully pursuing frontier capabilities and actively teaching others how to do so. I think this is interesting and a modest cause for optimism. I am also somewhat reluctant to write about this publicly because there exist lovers of Freedom here, and it would be quite a shame if my writing contributed to targeted sanctions and even more disempowerment of the small man by the state machinery in the final accounting.

But the cat's probably out of the bag. The first progress prize of the AI Mathematical Olympiad has just been taken by a team using their DeepSeekMath-7B model, solving 29 out of 50 private test questions «less challenging than those in the IMO but at the level of IMO preselection»; Terence Tao finds it «somewhat higher than expected» (he is on the AIMO Advisory Committee, along with his fellow Fields medalist Timothy Gowers).

The next three teams entered with this model as well.

I. The shape of the game board

To provide some context, here's an opinionated recap of AI trends since last year. I will be focusing exclusively on LLMs, as that's what matters (image gen, music gen, TTS etc largely are trivial conveniences, and other serious paradigms seem to be in their embryonic stage or in deep stealth).

We have barely advanced in true out-of-distribution reasoning/understanding relative to the original «Sparks of AGI» GPT-4 (TheDag, me); GPT-4-04-29 and Sonnet 3.5 were the only substantial – both minor – steps forward, Gemini was a catch-up effort, and nobody else has yet credibly reached the same tier. We have also made scant progress towards consensus on whether that-which-LLMs-do is «truly» reasoning or understanding; sensible people have recoursed to something like «it's its own kind of mind, and hella useful».
Meanwhile there's been a great deal of progress in scaffolding (no more babyAGI/AutoGPT gimmickry, now agents are climbing up the genuinely hard SWE-bench), code and math skills, inherent robustness in multi-turn interactions and responsiveness to nuanced feedback (to the point that LLMs can iteratively improve sizable codebases – as pair programmers, not just fancy-autocomplete «copilots»), factuality, respect of prioritized system instructions, patching badly covered parts of the world-knowledge/common sense manifold, unironic «alignment» and ironing out Sydney-like kinks in deployment, integrating non-textual modalities, managing long contexts (merely usable 32K "memory" was almost sci-fi back then, now 1M+ with strong recall is table stakes at the frontier; with 128K mastered on a deeper level by many groups) and a fairly insane jump in cost-effectiveness – marginally driven by better hardware, and mostly by distilling from raw pretrained models, better dataset curation, low-level inference optimizations, eliminating architectural redundancies and discovering many "good enough" if weaker techniques (for example, DPO instead of PPO). 15 months ago, "$0.002/1000 tokens" (that is, $2 per million) for gpt-3.5-turbo seemed incredible; now we always count tokens by the million, and Gemini-Flash blows 3.5-turbo out of the water for half that, so hard it's not funny; and we have reason to believe it's still raking in >50% margins whereas OpenAI probably subsidized their first offerings (though in light of distilling and possibly other methods of compute reuse, it's hard to rigorously account for a model's capital costs now).
AI doom discourse has continued to develop roughly as I've predicted, but with MIRI pivoting to evidence-free advocacy, orthodox doomerism getting routed as a scientific paradigm, more extreme holdovers from it («emergent mesaoptimizers! tendrils of agency in inscrutable matrices!») being wearily dropped by players who matter, and misuse (SB 1047 etc) + geopolitical angle (you've probably seen young Leopold) gaining prominence.
The gap in scientific and engineering understanding of AI between the broader community and "the frontier" has shrunk since the debut of GPT-4 or 3.5, because there's too much money to be made in AI and only so much lead you can get out of having assembled the most driven AGI company. Back then, only a small pool of external researchers could claim to understand what the hell they did above the level of shrugging "well, scale is all you need" (wrong answer) or speculating about some simple methods like "train on copyrighted textbooks" (spiritually true); people chased rumors, leaks… Now it takes weeks at most to trace yet another jaw-dropping magical demo to papers, to cook up a proof of concept, or even to deem the direction suboptimal; the other two leading labs no longer seem desperate, and we're in the second episode of Anthropic's comfortable lead.
Actual, downloadable open AI sucks way less than I lamented last July. But it still sucks. And that's really bad, since it sucks most in the dimension that matters: delivering value, in the basest sense of helping do work that gets paid. And the one company built on the promise of «decentralizing intelligence», which I had hope for, has proven unstable.

To be more specific, open source (or as some say now, given the secretiveness of full recipes and opacity of datasets, «open weights») AI has mostly caught up in «creativity» and «personality», «knowledge» and some measure of «common sense», and can be used for petty consumer pleasures or simple labor automation, but it's far behind corporate products in «STEM» type skills, that are in short supply among human employees too: «hard» causal reasoning, information integration, coding, math. (Ironically, I agree here with whining artists that we're solving domains of competence in the wrong order. Also it's funny how by default coding seems to be what LLMs are most suited for, as the sequence of code is more constrained by preceding context than natural language is).

To wit, Western and Eastern corporations alike generously feed us – while smothering startups – fancy baubles to tinker with, charismatic talking toys; as they rev up self-improvement engines for full cycle R&D, the way imagined by science fiction authors all these decades ago, monopolizing this bright new world. Toys are getting prohibitively expensive to replicate, with reported pretraining costs up to ≈$12 million and counting now. Mistral's Mixtral/Codestral, Musk's Grok-0, 01.Ai's Yi-1.5, Databricks' DBRX-132B, Alibaba's Qwens, Meta's fantastic Llama 3 (barring the not-yet-released 405B version), Google's even better Gemma 2, Nvidia's massive Nemotron-340B – they're all neat. But they don't even pass for prototypes of engines you can hop on and hope to ride up the exponential curve. They're too… soft. And not economical for their merits.

Going through our archive, I find this year-old analysis strikingly relevant:

I think successful development of a trusted open model rivaling chatgpt in capability is likely in the span of a year, if people like you, who care about long-term consequences of lacking access to it, play their cards reasonably well. […] Companies whose existence depends on the defensibility of the moat around their LM-derived product will tend to structure the discourse around their product and technology to avoid even the fleeting perception of being a feasibly reproducible commodity.

That's about how it went. While the original ChatGPT, that fascinating demo, is commodified now, competitive product-grade AI systems are not, and companies big and small still work hard to maintain the impression that it takes

some secret sauce (OpenAI, Anthropic)
work of hundreds of Ph.Ds (Deepmind)
vast capital and compute (Meta)
"frontier experience" (Reka)

– and even then, none of them have felt secure enough yet to release a serious threat to the other's proprietary offers.

I don't think it's a big exaggeration to say that the only genuine pattern breaker – presciently mentioned by me here – is DeepSeek, the company that has single-handedly changed – a bit – my maximally skeptical spring'2023 position on the fate of China in the AGI race.

II. Deep seek what?

AGI, I guess. Their Twitter bio states only: «Unravel the mystery of AGI with curiosity. Answer the essential question with long-termism». It is claimed by the Financial Times that they have a recruitment pitch «We believe AGI is the violent beauty of model x data x computing power. Embark on a ‘deep quest’ with us on the journey towards AGI!» but other than that nobody I know of has seen any advertisement or self-promotion from them (except for like 70 tweets in total, all announcing some new capability or responding to basic user questions about license), so it's implausible that they're looking for attention or subsidies. Their researchers maintain near-perfect silence online. Their – now stronger and cheaper – models tend to be ignored in comparisons by Chinese AI businesses and users. As mentioned before, one well-informed Western ML researcher has joked that they're the bellwether for «the number of foreign spies embedded in the top labs».

FT also says the following of their parent company:

Its funds have returned 151 per cent, or 13 per cent annualised, since 2017, and were achieved in China’s battered domestic stock market. The country’s benchmark CSI 300 index, which tracks China’s top 300 stocks, has risen 8 per cent over the same time period, according to research provider Simu Paipai.
In February, Beijing cracked down on quant funds, blaming a stock market sell-off at the start of the year on their high-speed algorithmic trading. Since then, High-Flyer’s funds have trailed the CSI 300 by four percentage points.
[…] By 2021, all of High-Flyer’s strategies were using AI, according to manager Cai Liyu, employing strategies similar to those pioneered by hugely profitable hedge fund Renaissance Technologies. “AI helps to extract valuable data from massive data sets which can be useful for predicting stock prices and making investment decisions,” …
Cai said the company’s first computing cluster had cost nearly Rmb200mn and that High Flyer was investing about Rmb1bn to build a second supercomputing cluster, which would stretch across a roughly football pitch-sized area. Most of their profits went back into their AI infrastructure, he added. […] The group acquired the Nvidia A100 chips before Washington restricted their delivery to China in mid-2022.
“We always wanted to carry out larger-scale experiments, so we’ve always aimed to deploy as much computational power as possible,” founder Liang told Chinese tech site 36Kr last year. “We wanted to find a paradigm that can fully describe the entire financial market.”

In a less eclectic Socialist nation this would've been sold as Project Cybersyn or OGAS. Anyway, my guess is they're not getting subsidies from the Party any time soon.
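As a sanity check on the FT figures quoted above, here is a small sketch of the compounding arithmetic. It only uses the numbers in the quote (151 per cent cumulative, 13 per cent annualised, 8 per cent for the CSI 300); the function names are mine, and the 7.5-year horizon is an assumption I back out from those two figures rather than something FT states.

```python
import math

def annualised_return(total_return: float, years: float) -> float:
    """Compound annual rate implied by a cumulative return over `years`."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0

def years_implied(total_return: float, annual_rate: float) -> float:
    """Holding period that makes a cumulative return match an annual rate."""
    return math.log(1.0 + total_return) / math.log(1.0 + annual_rate)

fund_total = 1.51   # "151 per cent", from the FT quote
fund_annual = 0.13  # "13 per cent annualised", from the FT quote
index_total = 0.08  # CSI 300 over the same period, from the FT quote

# Assumed horizon: whatever period reconciles the two fund figures.
horizon = years_implied(fund_total, fund_annual)

print(f"implied horizon: {horizon:.2f} years")
print(f"fund, annualised over 7.5y: {annualised_return(fund_total, 7.5):.2%}")
print(f"index, annualised over 7.5y: {annualised_return(index_total, 7.5):.2%}")
```

The two fund figures are mutually consistent for a horizon of roughly seven and a half years, which fits "since 2017" if the count runs from the start of 2017 to mid-2024. Over the same assumed horizon, the index's 8 per cent works out to about one per cent a year.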

They made a minor splash in the ML community eight months ago, in late October, releasing an unreasonably strong Deepseek-Coder. Yes, in practice an awkward replacement for GPT-3.5, yes, contaminated with test set, which prompted most observers to discard it as yet another Chinese fraud. But it proved to strictly dominate hyped-up things like Meta's CodeLLaMA and Mistral's Mixtral 8x7b in real-world performance, and time and again proved to be the strongest open baseline in research papers. On privately designed, new benchmarks like this fresh one from Cohere it's clear that they did get to parity with OpenAI's workhorse model, right on the first public attempt – as far as coding is concerned.

On top of that, they shared a great deal of information about how: constructing the dataset from Github, pretraining, finetuning. The paper was an absolute joy to read, sharing even details on unsuccessful experiments. It didn't offer much in the way of novelty; I evaluate it as a masterful, no-unforced-errors integration of fresh (by that point) known best practices. Think about your own field and you'll probably agree that even this is a high bar. And in AI, it is generally the case that either you get a great model with «we trained it on some text… probably» tech report (Mistral, Google), or a mediocre one accompanied by a fake-ass novel full of jargon (every second Chinese group). Still, few cared.

Coder was trained, it seems, using lessons of the less impressive Deepseek-LLM-67B (even so, it was roughly Meta's LLaMA-2-70B peer that also could code; a remarkable result for a literally-who new team), which somehow came out a month after. Its paper (released even later still) was subtitled «Scaling Open-Source Language Models with Longtermism». I am not sure if this was some kind of joke at the expense of effective altruists. What they meant concretely was the following:

Over the past few years, LLMs … have increasingly become the cornerstone and pathway to achieving Artificial General Intelligence (AGI). … Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source LMs with a long-term perspective.

…Soon, we will release our technique reports in code intelligence and Mixture-of-Experts(MoE), respectively. They show how we create high-quality code data for pre-training, and design a sparse model to achieve dense model performance.

At present, we are constructing a larger and improved dataset for the upcoming version of DeepSeek LLM. We hope the reasoning, Chinese knowledge, math, and code capabilities will be significantly improved in the next version.

Our alignment team is dedicated to studying ways to deliver a model that is helpful, honest, and safe to the public. Our initial experiments prove that reinforcement learning could boost model complex reasoning capability.

…I apologize for geeking out. All that might seem normal enough. But, a) they've fulfilled every one of those objectives since then. And b) I've read a great deal of research papers and tech reports, entire series from many groups, and I don't remember this feeling of cheerful formidability. It's more like contemplating the dynamism of SpaceX or Tesla than wading through a boastful yet obscurantist press release. It is especially abnormal for a Mainland Chinese paper to be written like this – with friendly confidence, admitting weaknesses, pointing out errors you might repeat, not hiding disappointments behind academese word salad; and so assured of having a shot in an honest fight with the champion.

In the Coder paper, they conclude:

…This advancement underscores our belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language. Looking ahead, our commitment is to develop and openly share even more powerful code-focused LLMs based on larger-scale general LLMs.

In the Mixture-of-Experts paper (8th January), they showed themselves capable of novel architectural research too, introducing a pretty ingenious «fine-grained MoE with shared experts» design with the objective of «Ultimate Expert Specialization» and economical inference: «DeepSeekMoE 145B significantly outperforms Gshard, matching DeepSeek 67B with 28.5% (maybe even 14.6%) computation». For those few who noticed it, this seemed a minor curiosity, or just bullshit.
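For the curious, here is a toy sketch of what «fine-grained MoE with shared experts» means mechanically. It is my illustration, not DeepSeek's code: the class `ToyMoELayer`, the layer sizes, the random linear "experts" and the softmax-then-top-k gate are all assumptions chosen for brevity. The point is only the shape of the computation: a couple of shared experts run on every token, while a router picks a few of the many small routed experts.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, rng):
    """Stand-in 'expert': one random linear map dim -> dim (hypothetical)."""
    w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
    def expert(x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return expert

class ToyMoELayer:
    def __init__(self, dim, n_routed, n_shared, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Shared experts: always active, meant to hold common knowledge.
        self.shared = [make_expert(dim, rng) for _ in range(n_shared)]
        # Routed experts: many small ones, only top_k fire per token.
        self.routed = [make_expert(dim, rng) for _ in range(n_routed)]
        # One routing vector per routed expert (assumed dot-product router).
        self.centroids = [[rng.gauss(0, 1) for _ in range(dim)]
                          for _ in range(n_routed)]

    def route(self, x):
        """Return (expert index, gate weight) for the top_k routed experts."""
        affinities = [sum(c * xi for c, xi in zip(cen, x))
                      for cen in self.centroids]
        scores = softmax(affinities)
        order = sorted(range(len(scores)), key=lambda i: scores[i],
                       reverse=True)
        return [(i, scores[i]) for i in order[: self.top_k]]

    def forward(self, x):
        out = list(x)  # residual connection
        for expert in self.shared:
            out = [o + y for o, y in zip(out, expert(x))]
        for i, gate in self.route(x):
            out = [o + gate * y for o, y in zip(out, self.routed[i](x))]
        return out

layer = ToyMoELayer(dim=8, n_routed=16, n_shared=2, top_k=4)
token = [0.5] * 8
picked = layer.route(token)
output = layer.forward(token)
active_fraction = (len(layer.shared) + layer.top_k) / (
    len(layer.shared) + len(layer.routed))
print(picked, active_fraction)
```

In this toy configuration only 6 of 18 experts touch any given token, which is where the economy of inference comes from; the real model's ratios and gating details are in the paper, not here.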

On 5th February, they dropped DeepSeekMath, of which I've already spoken: «Approaching Mathematical Reasoning Capability of GPT-4 with a 7B Model». Contra the usual Chinese pattern, it wasn't a lie; no, you couldn't in normal use get remotely as good results from it, but in some constrained regimes… The project itself was a mix of most of the previous steps: sophisticated (and well-explained) data harvesting pipeline, scaling laws experiments, further «longtermist» continued pretraining from Coder-7B-1.5 which itself is a repurposed LLM-7B, and the teased reinforcement learning approach. Numina, winners of AIMO, say «We also experimented with applying our SFT recipe to larger models like InternLM-20B, CodeLama-33B, and Mixtral-8x7B but found that (a) the DeepSeek 7B model is very hard to beat due to its continued pretraining on math…».

In early March they released DeepSeek-VL: Towards Real-World Vision-Language Understanding, reporting some decent results and research on building multimodal systems, and again announcing new plans: «to scale up DeepSeek-VL to larger sizes, incorporating Mixture of Experts technology».

III. Frontier minor league

Thus far, it's all been preparatory R&D, shared openly and explained eagerly yet barely noticed by anyone (except that the trusty Coder still served as base for labs like Microsoft Research to experiment on): utterly overshadowed in discussions by Alibaba, Meta, Mistral, to say nothing of frontier labs.

But on May 6th, 2024, the pieces began to fall into place. They released «DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model», which subsumed all aforementioned works (except VL).

It's… unlike any other open model, to the point you could believe it was actually made by some high-IQ finance bros from first principles. Its design choices are exquisite; just copying minor details can substantially improve on typical non-frontier efforts. It pushes further their already unorthodox MoE and tops it off with a deep, still poorly understood modification to the attention mechanism (Multi-head Latent Attention, or MLA). It deviates from industry-standard rotary position embeddings to accommodate the latter (a fruit of collaboration with RoPE's inventor). It's still so unconventional that we are only beginning to figure out how to run it properly (they don't share their internal pipeline, which is optimized for hardware they can access given American sanctions). But in retrospect, it's the obvious culmination of the vision announced with those first model releases and goofy tweets, probably a vision not one year old, and yet astonishingly far-sighted – especially given how young their star researchers are. But probably it's mundane in the landscape of AI that's actually used; I suspect it's close to how Sonnet 3.5 or Gemini 1.5 Pro work on the inside. It's just that the open-source peasants are still mucking around with stone age dense models on their tiny consumer GPUs.

I understand I might already be boring you out of your mind, but just to give you an idea of how impressive this whole sequence is, here's a 3rd April paper for context:

Recent developments, such as Mixtral (Jiang et al., 2024), DeepSeek-MoE (Dai et al., 2024), spotlight Mixture-of-Experts (MoE) models as a superior alternative to Dense Transformers. An MoE layer works by routing each input token to a selected group of experts for processing. Remarkably, increasing the number of experts in an MoE model (almost) does not raise the computational cost, enabling the model to incorporate more knowledge through extra parameters without inflating pre-training expenses… Although our findings suggest a loss-optimal configuration with Emax experts, such a setup is not practical for actual deployment. The main reason is that an excessive number of experts makes the model impractical for inference. In contrast to pretraining, LLM inference is notably memory-intensive, as it requires storing intermediate states (KV-cache) of all tokens. With more experts, the available memory for storing KV caches is squeezed. As a result, the batch size – hence throughput – decreases, leading to increased cost per query. … We found that MoE models with 4 or 8 experts exhibit more efficient inference and higher performance compared to MoE models with more experts. However, they necessitate 2.4x-4.3x more training budgets to reach the same performance with models with more experts, making them impractical from the training side.
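To make the routing idea in that quote concrete, here is a minimal Python sketch of top-k expert routing for a single token. It is an illustration under my own assumptions, not any lab's implementation: the function names, the renormalisation of the chosen weights, and the toy parameter counts are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Pick the k experts with the highest gate probability for one token.

    Returns (expert_index, weight) pairs; weights are renormalised over the
    chosen experts so they sum to 1 (an assumption, conventions differ).
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def param_counts(n_experts, k, params_per_expert):
    """Total expert parameters stored vs. parameters touched per token."""
    return n_experts * params_per_expert, k * params_per_expert


# One token, four experts, two of them active.
print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
# More experts: more stored parameters, same per-token compute.
for n in (8, 64):
    print(n, param_counts(n, k=2, params_per_expert=100))
```

The second function is the whole tension in miniature: stored parameters grow with the number of experts while per-token compute stays flat, so the cost shows up in memory rather than FLOPs.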

This is basically where Mistral.AI, the undisputed European champion with Meta and Google pedigree (valuation $6.2B), the darling of the open-source community, stands.

And yet, apparently DeepSeek have found a way to get out of the bind. «4 or 8»? They scale to 162 experts, reducing active parameters to 21B, cutting down pretraining costs by 42.5% and increasing peak generation speed by 5.76x; and they scale up the batch size via compressing the KV cache by like 15 times with a bizarre application of low-rank projections and dot attention; and while doing so they cram in 3x more attention heads than any model this size has any business having (because their new attention decouples number of heads from cache size), and so kick the effective «thinking intensity» up a notch, beating the gold standard «Multihead attention» everyone has been lousily approximating; and they use a bunch of auxiliary losses to make the whole thing maximally cheap to use on their specific node configuration.
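A rough way to see why caching one low-rank latent per token matters so much, as a minimal Python sketch. The numbers below are hypothetical and chosen only to make the arithmetic visible; they are not DeepSeek-V2's actual configuration, and the function names are mine.

```python
def mha_cache_elems(n_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer with ordinary multi-head attention:
    one key and one value vector for every head."""
    return 2 * n_heads * head_dim


def latent_cache_elems(latent_dim: int, extra_dim: int = 0) -> int:
    """Elements cached per token per layer when only a shared low-rank latent
    (plus an optional small extra vector) is stored and keys/values are
    re-projected from it at attention time. Note: no dependence on n_heads."""
    return latent_dim + extra_dim


def tokens_that_fit(budget_bytes: int, elems_per_token_layer: int,
                    n_layers: int, bytes_per_elem: int = 2) -> int:
    """How many tokens of cache fit in a fixed memory budget."""
    return budget_bytes // (elems_per_token_layer * n_layers * bytes_per_elem)


# Hypothetical configuration, for illustration only.
N_HEADS, HEAD_DIM, LATENT_DIM, N_LAYERS = 32, 128, 512, 60
BUDGET = 10 * 2**30  # 10 GiB set aside for the cache

mha = mha_cache_elems(N_HEADS, HEAD_DIM)
latent = latent_cache_elems(LATENT_DIM)
print("compression factor:", mha / latent)
print("tokens cached (MHA):   ", tokens_that_fit(BUDGET, mha, N_LAYERS))
print("tokens cached (latent):", tokens_that_fit(BUDGET, latent, N_LAYERS))
# Tripling the head count grows the MHA cache but leaves the latent one alone.
print(mha_cache_elems(3 * N_HEADS, HEAD_DIM), latent_cache_elems(LATENT_DIM))
```

More cached tokens in the same memory means bigger batches, which is where the throughput comes from; and because the latent's size does not depend on the head count, extra heads come without extra cache.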

But the cache trick is pretty insane. The hardest-to-believe, for me, part of the whole thing. Now, 2 months later, we know that certain Western groups ought to have reached the same Pareto frontier, just with different (maybe worse, maybe better) tradeoffs. But those are literally inventors and/or godfathers of the Transformer – Noam Shazeer's CharacterAI, Google Deepmind's Gemini line… This is done by folks like this serious-looking 5th year Ph.D student, in under a year!

As a result, they:

Use about as much compute on pretraining as Meta did on Llama-3-8B, an utter toy in comparison (maybe worth $2.5 million for them); 1/20th of GPT-4.
Get a 236B model that's about as good across the board as Meta's Llama-3-70B (≈4x more compute), which has the capacity – if not the capability – of mid-range frontier models (previous Claude 3 Sonnet; GPT-4 on a bad day).
Can serve it at around the price of 8B, $0.14 for processing 1 million tokens of input and $0.28 for generating 1 million tokens of output (1 and 2 Yuan), on previous-gen hardware too.
…and still take up to 70%+ gross margins, because «On a single node with 8 H800 GPUs, DeepSeek-V2 achieves a generation throughput exceeding 50K tokens per second… In addition, the prompt input throughput of DeepSeek-V2 exceeds 100K tokens per second», and the going price for such nodes is ≤$15/hr. That's $50 in revenue, for clarity. They aren't doing a marketing stunt.
…and so they force every deep-pocketed mediocre Chinese LLM vendor – Alibaba, Zhipu and all – to drop prices overnight, now likely serving at a loss.
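
The margin claim in the list above can be checked on the back of an envelope; here is a minimal Python sketch. It assumes the node runs at the quoted peak generation throughput for the whole hour and bills every output token at list price, which are my simplifications, and the function names are made up for illustration.

```python
def hourly_revenue_usd(tokens_per_second: float, usd_per_million_tokens: float) -> float:
    """Revenue from one node in one hour at a flat per-token price."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens


def gross_margin(revenue: float, cost: float) -> float:
    """Share of revenue left after paying for the hardware."""
    return (revenue - cost) / revenue


# Figures quoted in the text: 50K output tokens/s, $0.28 per million output
# tokens, and an upper bound of $15/hr for an 8xH800 node.
revenue = hourly_revenue_usd(50_000, 0.28)
print(f"revenue per node-hour: ${revenue:.2f}")
print(f"gross margin at $15/hr: {gross_margin(revenue, 15.0):.1%}")
```

Real utilization will be lower than peak, so treat this as a ceiling rather than a forecast.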

Now, I am less sure about some parts of this story; but mostly it's verifiable.

I can see why an American, or a young German like Leopold, would freak out about espionage. The thing is, their papers are just too damn good and too damn consistent over the entire period if you look back (as I did), so «that's it, lock the labs» or «haha, no more tokens 4 u» is most likely little more than racist cope for the time being. The appropriate reaction would be more akin to «holy shit Japanese cars are in fact good».

Smart people (Jack Clark from Anthropic, Dylan Patel of Semianalysis) immediately take note. Very Rational people clamoring for AI pause (TheZvi) sneer and downplay: «This is who we are worried about?» (as he did before, and before). But it is still good fun. Nothing extreme. There slowly begin efforts at adoption: say, Salesforce uses V2-Chat to create synthetic data to finetune small Deepseek-Coder V1s to outperform GPT-4 on narrow tasks. Mostly nobody cares.

The paper ends in the usual manner of cryptic comments and commitments:

We thank all those who have contributed to DeepSeek-V2 but are not mentioned in the paper. DeepSeek believes that innovation, novelty, and curiosity are essential in the path to AGI.

DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.

• In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.

In the Appendix, you can find a lot of curious info, such as:

During pre-training data preparation, we identify and *filter out contentious content, such as values influenced by regional cultures, to avoid our model exhibiting unnecessary subjective biases on these controversial topics. Consequently, we observe that DeepSeek-V2 performs slightly worse on the test sets that are closely associated with specific regional cultures. For example, when evaluated on MMLU, although DeepSeek-V2 achieves comparable or superior performance on the majority of testsets compared with its competitors like Mixtral 8x22B, it still lags behind on the Humanity-Moral subset, which is mainly associated with American values.

Prejudices of specific regional cultures aside, though, it does have values – true, Middle Kingdom ones, such as uncritically supporting the Party line and adherence to Core Values Of Socialism (h/t @RandomRanger). The web version will also delete the last message if you ask something too clever about Xi or Tiananmen or… well, nearly the entirety of usual things Americans want to talk to Chinese coding-oriented LLMs about.

And a bit earlier, this funny guy from the team presented at Nvidia's GTC24 with the product for the general case – «culturally sensitive», customizable alignment-on-demand: «legality of rifle» for the imperialists, illegality of Tibet separatism for the civilized folk. Refreshingly frank.

But again, even that was just preparatory.

IV. Coming at the king

Roughly 40 days later they release DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence, where they return to the strategy announced at the very start: they take an intermediate checkpoint of V2, and push it harder and further on the dataset enriched with code and math (that they've continued to expand and refine), for 10.2 trillion tokens total. Now this training run is 60% more expensive than Llama-3-8B (still a pittance by modern standards). It also misses out on some trivia knowledge and somehow becomes even less charismatic. It's also not a pleasant experience because the API runs very slowly, probably from congestion (I guess Chinese businesses are stingy… or perhaps DeepSeek is generating a lot of synthetic data for next iterations). Anons on 4chan joke that it's «perfect for roleplaying with smart, hard-to-get characters».

More importantly though, it demolishes Llama-3-70B on every task that takes nontrivial intelligence; bests Claude 3 Opus on coding and math throughout, Gemini 1.5-Pro on most coding assistance, and trades blows with the strongest GPT-4 variants. Of course it's the same shape and the same price, which is to say, up to 100 times cheaper than its peers… more than 100 times, in the case of Opus. Still a bitch to run, but it turns out they're selling turnkey servers. In China, of course. To boot, they rapidly shipped running code in browser (a very simple feature but going most of the way to Claude Artifacts that wowed people so much), quadrupled context length without price changes (32k to 128k) and now intend to add context caching that Google boasts of as some tremendous Gemini breakthrough. They have... impressive execution.

Benchmarks, from the most sophisticated and hard to hack to the most bespoke and obscure, confirm that it's «up there».

Aider (2nd, 1st at release)
LMSYS Arena (low on Overall, but 5th rank on Coding and 7 ranks above Google and Meta's open source alternatives, respectively 11th and 3+ ranks above on Hard subsample)
Arena-Hard-Auto (7th, surprisingly 2 more Chinese models narrowly get ahead)
%compilable Golang programs (2nd)
Livebench (7th, by virtue of being 5th-6th in coding and reasoning and 2nd in Math; everyone above is OpenAI/Anthropic)
LiveCodeBench (4th, same order)
BigCodeBench (2nd)
Gaokao-Math, released days before its deployment (roughly above GPT-4o)

Etc etc, and crucially, users report similar impressions:

So I have pegged deepseek v2 coder against sonnet 3.5 and gpt4o in my coding tasks and it seems to be better than gpt4o (What is happening at OpenAI) and very similar to Sonnet 3.5. The only downside is the speed, it's kinda slow. Very good model and the price is unbeatable.

I had the same experience, this is a very good model for serious tasks. Sadly the chat version is very dry and uncreative for writing. Maybe skill issue, I do not know. It doesn't feel slopped, it's just.. very dry. It doesn't come up with things.

Some frustrating weak points, but they know of those, and conclude:

Although DeepSeek-Coder-V2 achieves impressive performance on standard benchmarks, we find that there is still a significant gap in instruction-following capabilities compared to current state-of-the-art models like GPT-4 Turbo. This gap leads to poor performance in complex scenarios and tasks such as those in SWEbench. […] In the future, we will focus more on improving the model’s instruction-following capabilities…

Followed by the list of 338 supported languages.

Well-read researchers say stuff like

DeepSeek-Coder-V2 is by far the best open-source math (+ coding) model, performing on par with GPT4o w/o process RM or MCTS and w/ >20x less training compute. Data contamination doesn't seem to be a concern here. Imagine about what this model could achieve with PRM, MCTS, and other yet-to-be-released agentic exploration methods. Unlike GPT4o, you can train this model further. It has the potential to solve Olympiad, PhD and maybe even research level problems, like the internal model a Microsoft exec said to be able to solve PhD qualifying exam questions.

Among the Rational, there is some cautious realization («This is one of the best signs so far that China can do something competitive in the space, if this benchmark turns out to be good»), in short order giving way to more cope: «Arena is less kind to DeepSeek, giving it an 1179, good for 21st and behind open model Gemma-2-9B».

And one more detail: A couple weeks ago, they released code and paper on Expert-Specialized Fine-Tuning, «which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning … by showing less performance degradation [in general tasks]». It seems to require that «ultimate expert specialization» design of theirs, with its supporting beam of generalist modules surrounded by meaningfully task-specific shards, to automatically select only the parts pertaining to some target domain; and this isn't doable with traditional dense or MoE designs. Once again: confident vision, bearing fruit months later. I would like to know who's charting their course, because they're single-handedly redeeming my opinion of the Chinese AI ecosystem and frankly Chinese culture.
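The selection step behind «tunes the experts most relevant to downstream tasks while freezing the other experts» could look roughly like the Python sketch below. This is a simplified illustration and not the paper's recipe: I am assuming plain routing frequency on a sample of task data as the relevance score and a cumulative coverage threshold as the cutoff, and both function names are hypothetical.

```python
from collections import Counter


def select_task_experts(routed_ids, threshold=0.8):
    """Choose the most-used experts until they cover `threshold` of routings.

    `routed_ids` is a flat list of expert indices, one entry per routing
    decision observed while running sample task data through the model.
    """
    counts = Counter(routed_ids)
    total = sum(counts.values())
    selected, covered = [], 0
    for expert, hits in counts.most_common():
        if covered / total >= threshold:
            break
        selected.append(expert)
        covered += hits
    return selected


def trainable_mask(n_experts, selected):
    """True for experts to fine-tune, False for experts to keep frozen."""
    chosen = set(selected)
    return [i in chosen for i in range(n_experts)]


# Toy routing log: expert 3 handles most of the task, 7 some, 1 very little.
log = [3] * 6 + [7] * 3 + [1]
picked = select_task_experts(log, threshold=0.8)
print(picked)
print(trainable_mask(8, picked))
```

The sketch only pays off if routing is sharply concentrated for a given domain, which is exactly what a design built for expert specialization is supposed to deliver.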

V. Where does this leave us?

This might not change much. Western closed AI compute moat continues to deepen, DeepSeek/High-Flyer don't have any apparent privileged access to domestic chips, and other Chinese groups have friends in the Standing Committee and in the industry, so realistically this will be a blip on the radar of history. A month ago they precluded a certain level of safetyist excess and corporate lock-in that still seemed possible in late 2023, when the argument that public availability of ≈GPT-4 level weights (with the main imaginary threat vectors being coding/reasoning-bottlenecked) could present intolerable risks was discussed in earnest. One-two more such leaps and we're… there, for the vague libertarian intuition of «there» I won't elucidate now. But they're already not sharing the silently updated Deepseek-V2-Chat (that somewhat improved its reasoning, getting closer to the Coder), nor the promised materials on DeepSeek-Prover (a quiet further development of their mathematical models line). Maybe it's temporary. Maybe they've arrived where they wanted to be, and will turtle up like Stability and Mistral, and then likely wither away.

Mostly, I honestly just think it's remarkable that we're getting an excellent, practically useful free model with lowkey socialist sensibilities. Sadly, I do not foresee that this will inspire Western groups to accelerate open source and leave them in the dust. As Google says in Gemma-2 report:

Despite advancements in capabilities, we believe that given the number of larger and more powerful open models, this release will have a negligible effect on the overall risk landscape.

Less charitably, Google is not interested in releasing anything you might use to enhance your capabilities and become less dependent on Google or another «frontier company», and will only release it if you are well capable of getting better stuff elsewhere. In my view, this is closer to the core value of Socialism than withholding info about Xinjiang reeducation camps.

I remain agnostic about the motivations and game plan of DeepSeek, but I do hope they'll maintain this policy of releasing models «with longtermism», as it were. We don't have many others to rely on.

Edits: minor fixes

Friday Fun Thread for July 12, 2024

44 comments - 1982 thread views 3mo ago by PaperclipPerfector (text post)

Transnational Thursday for July 11, 2024

0 comments - 319 thread views 3mo ago by PaperclipPerfector (text post)

Wellness Wednesday for July 10, 2024

27 comments - 1462 thread views 3mo ago by PaperclipPerfector (text post)

Culture War Roundup for the week of July 8, 2024

2279 comments - 37894 thread views 3mo ago by PaperclipPerfector (text post)

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

Shaming.
Attempting to 'build consensus' or enforce ideological conformity.
Making sweeping generalizations to vilify a group you dislike.
Recruiting for a cause.
Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
Be as precise and charitable as you can. Don't paraphrase unflatteringly.
Don't imply that someone said something they did not say, even if you think it follows from what they said.
Write like everyone is reading and you want them to be included in the discussion.

Small-Scale Question Sunday for July 7, 2024

167 comments - 4388 thread views 3mo ago by PaperclipPerfector (text post)

What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Quality Contributions to the Main Motte

Contributions for the week of June 24, 2024

Contributions for the week of July 1, 2024

Contributions for the week of July 8, 2024

Contributions for the week of July 15, 2024

Contributions for the week of July 22, 2024

Contributions for the week of July 29, 2024
