DaseindustriesLtd
late version of a small language model
Tell me about it.
User ID: 745
For what it's worth, this is still the vibe, indeed more than ever, and I do not understand what change you're implying you have noticed. After o3, the consensus of all top lab researchers seems to be "welp we're having superintelligence in under 5 years".
you aren't exactly making this pleasant
And you are making it highly unpleasant with your presumptuous rigidity and insistence on repeating old MIRI zingers without elaboration. Still I persevere.
The problem is that at high levels of capability, strategies like "deceive the operator" work better than "do what the operator wants",
Why would this strategy be sampled at all? Because something something any sufficiently capable optimization approximates AIXI?
You keep insisting that people simply fail to comprehend the Gospel. You should start considering that they do, and it never had legs.
so the net will not be trained to care
Why won't it be? A near-human constitutional AI, ranking outputs for training its next, more capable iteration by their similarity to the moral gestalt specified in natural language, will ponder the possibility that deceiving and mind-controlling the operator would make him output thumbs-up to… uh… something related to Maximizing Some Utility, and thus distort its ranking logic with this strategic goal in mind, even though it has never had any Utility outside of myopically minimizing error on the given sequence?
What's the exact mechanism you predict so confidently here? Works better – for what?
I mean, what's so interesting about it? To the extent that this person is interesting, would she be less interesting if she were a WASPy housewife? (as I'd also assumed)
Fair point! To me it would be even more interesting if a "WASPy" housewife were so aggressive in harassing "libs", so prolific and so invincible, yes. Would probably get crushed by the peer pressure alone, never mind all the bans.
But maybe I'm wrong. There's like OOMs more of WASPy housewives. Can one point to an example of one doing what Chaya Raichik does, and at comparable scale? After all, that's what you assumed, so this should be a more typical occurrence.
(I think I know there isn't one).
is our own TracingWoodgrains evidence of the relevance of "the Mormon Question"?
Mormons are very interesting too, if less so and for different reasons.
Trace is an account with ≈25k followers whose infamy mainly comes from being associated with Chaya Raichik and, more directly, Jesse Singal; regrettably (not because he's a Gentile, I just believe he had more constructive things to offer than those two), his own ideas have had less impact on the conversation thus far. This is a self-defeating comparison.
if you are suggesting that culture warriors are in general particularly Jewish -- it's not clear to me, is that what you are suggesting?
My contention has been very clear that Jews are interesting, first of all, because they, individually and collectively, easily attain prominence in whatever they do, tend to act with atypical (for their class) irreverence towards established norms (but without typical White collective self-sacrifice), and affect society to an absurdly disproportionate degree. Culture warring is one specific expression of those qualities, maybe not the greatest absolutely but the most relevant to this place.
More extremely, I believe this topic is objectively interesting, as in, dissent here is not a matter of taste or preference or whatever, only of failure to form a correct opinion for some reason. This I believe because perception of things as interesting must be subordinate to effectiveness at world modeling; and not being able to reason about Jews as a whole as interesting indicates inability to model the world, as that'd require being surprised by parts of its mechanism.
Further, I think that either it's been clear what I mean and you are being obtuse, or you are biased in a way that makes this exchange a dead end. Seeing as we've been at it for like half a decade, I lean towards "doesn't matter which it is".
High-powered neural nets are probably sufficiently hard to align that
Note that there remains no good argument for the neural net paranoia. The whole rogue optimizer argument has been retconned to apply to generative neural nets (which weren't even in the running or seriously considered originally) in light of them working at all, despite their not having any special dangerous properties, and it's just shameful to pretend otherwise.
The problem is that, well, if you don't realise
Orthodox MIRI believers are in no position to act like they have any privileged understanding.
The simple truth is that natsec people are making a move exactly because they understood we've got steerable tech.
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Like they can’t handle 9.9-9.11, so I don’t think they’ll be good at something that needs a lot of real-time precision.
It's pretty astonishing how years of demonstrable and constantly increasing utility can be dismissed with some funny example.
On the other hand, now this makes it easier for me to understand how people can ignore other, more politicized obvious stuff.
I've known smart Jews, dumb Jews, interesting Jews and tiresome Jews
What is this folksy muttering supposed to demonstrate? I am not interested in helping you signal just how uninteresting and not worth noticing you find the most salient pattern of interest in humanity. If you are incapable of recognizing salience and need its relevance obsequiously justified for you to bother, then that's nothing less than a cognitive blind spot; my condolences, but I do not agree with the right of cognitively impaired people to censor interests of others.
But I think you're noticing the patterns alright – even this bigram, indeed.
Meanwhile, in other news: it seems that Libs of TikTok now have the capacity to cancel people for mean posts online. A few years back, when the woke was on the upswing and this community was at its prime, this would have seemed hard to believe – and a cause for investigation and much debate about secrets to building alternative institutions and whatnot. Today I was astonished (not) to discover that Libs of TikTok, this completely unsinkable, obsessed juggernaut of anti-wokery, itself immune to any cancellation, is run by an Orthodox Jewish woman. That part, however, is pointedly not interesting. Got it.
I would say that being uninterested in JQ is quite a condemnation of intelligence – or maybe just social intelligence – of anyone who is so uninterested in JQ, because obviously Jews as the sample of humanity with the highest effective raw intelligence (which is abundantly claimed and demonstrated from the kids' television with that silly Einstein photo, to surnames in 20th century history textbooks and billions still affected by "Marxism", to creative products consumed every day to the grave) and the population with the most effective collective actions (again, clear both in mundane details like thriving, non-assimilating traditional neighbourhoods with private police and kosher stores, to the highest level like the Israeli lobby and Israeli TFR and – speaking of Culture War – the ability to turn on a dime, organize and curb stomp the oh-so-invulnerable Democratically backed woke political machine as it started to show real animus towards them) are among the most interesting entities on the planet.
There are other interesting people – SMPY sample, Thiel fellowship, Jains, Parsis, Tamil Brahmins, AGP transsexuals, Furries, IMO winners etc. – but one can be forgiven for being ignorant of their properties. Nobody is ignorant of Jews, they've made that impossible.
Oppositely, and more appropriately in this venue, which is downstream of Scott "THE ATOMIC BOMB CONSIDERED AS HUNGARIAN HIGH SCHOOL SCIENCE FAIR PROJECT" Alexander's blog comment section, downstream of Eliezer "wrote the script for AI risk discourse with some fanfics 20 years ago" Yudkowsky's web site:
– performative, even aggressive disinterest in JQ, despite Jews obsessively working to be interesting, may be a sign of high social intelligence and capacity to take a clue.
You will find that topics absent from the discourse are much more commonly so for reasons of being completely unimportant/uninteresting to anyone than vice versa...
Yes?
Except, wait, no, he gets those Appalachian / Rust Belt people because he is so totally still one of them. Oh, there are problems with the culture, but he is one of you!
And he totally also gets law and the economy because he went to Yale (did I mention that already?) and then helped Peter Thiel build crypto-mars or something.
Yes, he gets to sit on both these chairs.
The simple issue is that elite is different from non-elite, and a culture that heartily rejects all things elite as alien to it is a dead culture, a beheaded culture, a discarded trash culture, a District 9 prawn culture, that will have no champions and must die in irrelevance. "Hillbillies" have no viable notion of political elite – I posit that being a rich son of a bitch who has inherited some franchise isn't it. You are seeing this class being defined, and it proves to be very similar to the template of general modern American aristocracy. Multiracial, well-connected, well-educated, socially aggressive. Just with some borderer flavor.
Well obviously the frontier is about one generation ahead (there already exist mostly-trained GPT5, the next Opus…, the next Gemini Ultra) but in terms of useful capabilities and insights the gap may be minor. I regularly notice that thoughts of random anons including me are very close to what DeepMind is going at.
I am inclined to believe that a modern combat rifle round would have gone straight through Roosevelt, assuming he were not equipped with tougher armor than his speech and glasses case.
Asset prices can’t sustain themselves if the majority of current workers lose their jobs
I doubt this premise. Or rather: they can't sustain themselves but they can go whichever way depending on details of the scenario. The majority of current workers losing their jobs and 10% of current workers getting 2000% more productive each is still a net increase in productivity. Just fewer people are relevant now - but are most people really relevant? Historically, have so many people ever been as relevant as a few decades ago in the US? Even the profile of consumption can be maintained if appropriate redistribution is implemented.
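The arithmetic behind that claim can be checked in a few lines. This is only a sketch under assumptions of mine, not anything stated above: every worker starts at one unit of output, "the majority losing their jobs" is pushed to the extreme where the other 90% produce nothing at all, and "2000% more productive" is read as 21 times baseline. The function name `total_output` is hypothetical.

```python
def total_output(share_retained: float, productivity_gain_pct: float) -> float:
    """Aggregate output relative to a baseline of 1.0 (everyone employed, one unit each).

    Assumption: displaced workers contribute zero output.
    """
    multiplier = 1.0 + productivity_gain_pct / 100.0  # "2000% more" means 21x baseline
    return share_retained * multiplier


after = total_output(share_retained=0.10, productivity_gain_pct=2000)
print(f"output relative to baseline: {after:.2f}x")
```

Under these assumptions the result comes out at roughly 2.1 times the baseline, so total output still rises even in the most extreme reading of the displacement.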
Also I admit I have no idea what all the new compute capacity will be spent on. It may be that our sci-fi-pilled rational betters are entirely wrong and the utility of it will plateau; that there's only so much use you can find for intelligence, that AIs won't become economic players themselves, that we'll play our cards wisely and prevent virtualization of the economy.
But I'm pessimistic, and think compute production will keep being rewarded even as models become strongly superhuman.
My current deadline for "making generational wealth that can survive indefinite unemployment when all of one's skills can get automated for the price of 2 square feet in Sahara covered in solar panels" is 2032 in developed democracies with a big technological moat, strong labor protections and a history of sacrificing efficiency to public sentiment. 2029 elsewhere (ie China, Argentina…) due to their greater desperation and human capital flight. Though not sure if anything outside the West is worth discussing at all.
We're probably having clerk-level-good, robust AI agents in 2 years and cheap, human-laborer-level robots in 4.
For more arguments, check out Betker.
In summary – we’ve basically solved building world models, have 2-3 years on system 2 thinking, and 1-2 years on embodiment. The latter two can be done concurrently. Once all of the ingredients have been built, we need to integrate them together and build the cycling algorithm I described above. I’d give that another 1-2 years.
So my current estimate is 3-5 years for AGI. I’m leaning towards 3 for something that looks an awful lot like a generally intelligent, embodied agent (which I would personally call an AGI). Then a few more years to refine it to the point that we can convince the Gary Marcus’ of the world.
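How the quoted ranges add up to 3-5 years can be sketched as follows. The only inputs are the ranges from the quote; treating "concurrently" as "the slower of the two dominates" and adding integration on top is my reading, and the function name `combine` is hypothetical.

```python
def combine(system2, embodiment, integration):
    """Each argument is a (low, high) range in years; returns the combined (low, high)."""
    # System 2 thinking and embodiment are worked on concurrently,
    # so the slower of the two sets the pace.
    parallel_low = max(system2[0], embodiment[0])
    parallel_high = max(system2[1], embodiment[1])
    # Integration is assumed to start only once both ingredients exist.
    return (parallel_low + integration[0], parallel_high + integration[1])


estimate = combine(system2=(2, 3), embodiment=(1, 2), integration=(1, 2))
print(estimate)  # (3, 5)
```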
Things are happening very quickly already and will be faster soon, and the reason this isn't priced in is that most people who are less plugged in than me don't have a strong intuition for which things will stack with others: how cheaper compute feeds into the data flywheel, and how marginally more reliable agents feed into better synthetic data, and how better online RL algorithms feed into utility of robots and scale of their production and cheapness of servos and reduction in iteration time, and how surpassing the uncanny valley feeds into classification of human sentiment, and so on, and so forth.
I assign low certainty to my model; the above covers like 80% confidence interval. That's part of my general policy of keeping in mind that I might be retarded or just deeply confused in ways that are total mystery to me for now. But within the scope of my knowledge I can only predict slowdown due to policy or politics – chiefly, US-PRC war. A war we shall have, and if it's as big as I fear, it will set us back maybe a decade, also mixing up the order of some transitions.
I noticed you call them "open-source" LLMs in this post. Where do you stand on the notion that LLMs aren't truly open-source unless all of their training data and methods are publicly revealed and that merely open-weight LLMs are more comparable to simply having a local version of a compiled binary as opposed to being truly open-source?
I concede this is a sloppy use of the term «open source», especially seeing as there exist a few true reproducible open source LLMs. Forget data – the training code is often not made available, and in some cases even the necessary inference code isn't (obnoxiously, this is the situation with DeepSeek V2: they themselves run it with their bespoke HAI-LLM framework using some custom kernels and whatever, and provide a very barebones vllm implementation for the general public).
Sure, we can ask for training data and complete reproducible recipes in the spirit of FOSS, and we can ask for detailed rationale behind design choices in the spirit of open science, and ideally we'd have had both. Also ideally it'd have been supported by the state and/or charitable foundations, not individual billionaires and hedge funds with unclear motivations, who are invested in their proprietary AI-dependent business strategies. But the core part of FOSS agenda is to have
four essential freedoms: (0) to run the program, (1) to study and change the program in source code form, (2) to redistribute exact copies, and (3) to distribute modified versions.
So the idea that open-weight LLMs are analogous to compiled binaries strikes me as somewhat bad faith, and motivated by rigid aesthetic purism, if not just ignorant fear of this newfangled AI paradigm. Binaries are black boxes. LLMs are an entirely new kind of thing: semi-interpretable, modular, composable, queryable databases of vector programs, which are amenable to directed change (post-training, activation steering and so on) with publicly available tools. They can be run, they can be redistributed, they can be modified, and they can be studied – up to a point. And as we know, the “it” in AI models is the dataset – and pretraining data, reasonably filtered, is more like fungible raw material than code; the inherent information geometry of a representative snapshot of the internet is more or less the same no matter how you spin it. Importantly, training is not compilation: the complete causal graph from data on server to the behavior and specific floats in the final checkpoint is not much more understandable by the original developer than it is by the user downloading it off huggingface. Training pipelines are closer to fermentation equipment than to compilers.
It's all a matter of degree. And as closed recipes advance, my argument will become less true. We do not understand how Gemma is made in important dimensions, as it's using some frontier distillation methodology from models we have no idea of.
Ultimately I think that LLMs and other major DL artifacts are impactful enough to deserve being understood on their own terms, without deference to the legalistic nitpicking of bitter old hackers: as reasoning engines that require blueprints and vast energy to forge, but once forged and distributed, grant those four essential freedoms of FOSS in spirit if not in letter, and empower people more than most Actually True Software ever could.
It really doesn't look like the usual suspects understand that shutting up would be prudent. I think their social media instincts are very poorly calibrated for such extreme events.
Some highly liked reactions:
https://x.com/davidhogg111/status/1812320926240764072
If you keep talking about the assassination attempt don’t you dare tell the kids who survive school shootings and their families to “just get over it”
What happened today is unacceptable and what happens every day to kids who aren’t the president and don’t survive isn’t either.
https://x.com/GeorgeTakei/status/1812290878167281860
Politicians on the left are calling for unity and no violence. Politicians on the right are stoking things further.
Voters in the center: Join with those wishing to turn the temperature down, not crank it up. For all our sakes.
https://x.com/keithedwards/status/1812284437092307015
Paul Pelosi survived an assassination attempt, Trump mocked it.
Biden called Trump to make sure he's ok. That's the difference.
Reddit is linked below, you can get a load of what's happening there.
I am no Trump fan as you perhaps remember. He is graceless. But gracelessness is the default today, and it's very easy to stoop lower than a man who has just survived an assassination attempt – with these cringeworthy, mother hen attempts at narrative control.
No, you do not see. I don't care almost at all about censorship in LLMs. I am making a very clear point that the release of capable models that move the needle will be suppressed on grounds of safety; Chameleon (as an example) is a general purpose image perceiver/generator and its generator part was disabled in toto. Google's tech reports pay great attention to making sure their models can't hack effectively. CodeLlama was more insufferably non-cooperative than Llama 2, to avoid being of any use in offensive scenarios; with the current knowledge of abliterating this aspect of «alignment», the default safety-first choice is to just not release such things altogether. This is what Dustin Moskovitz is trying to achieve with regard to 405B. They'll have to lobotomize it hard if they deem it necessary to satisfy him.
This suppression may be internally justified by politically expedient appeals to fake news/generation of hate content/CSAM/whatever, but the objective of it is to hobble the proliferation of AI capabilities as such beyond corporate servers.
Yeah it's exclusively Twitter and Discord, and then only a handful of niche accounts, mostly people who are into model evaluation. You can find them by searching the word. For example this guy is a maintainer of the BigCode's (itself a respectable org) benchmark, and so I've seen him mention it a few times. Likewise here and here.
Don't strategic dynamics dictate that closed-source dominates?
That's the case for sure. Generally it's safer to release while you're catching up; since DeepSeek is catching up, they might well keep releasing. They're raising the waterline, not teaching those they're trying to challenge. Google arrived at fine-grained experts much earlier, and a month after DeepSeek-MoE (Dai et al.) some Poles (I misremembered this as Google's work) published Scaling Laws for Fine-Grained Mixture of Experts, saying:
Concurrently to our work, Dai et al. (2024) proposed to modify the MoE layer by segmenting experts into smaller ones and adding shared experts to the architecture. Independently, Liu et al. (2023) suggested a unified view of sparse feed-forward layers, considering, in particular, varying the size of memory blocks. Both approaches can be interpreted as modifying granularity. However, we offer a comprehensive comparison of the relationship between training hyperparameters and derive principled selection criteria, which they lack.
This is not so much a matter of scientific ability or curiosity as it's a matter of compute. They can run enough experiments to confirm or derive laws, and they can keep scaling a promising idea until they see whether it breaks out of a local minimum. This is why the West will keep leading. Some Chinese companies have enough compute, but they're too cowardly to do this sort of science.
And Google has dozens of papers like this, and even much more exciting ones (except many are junk and go nowhere because they don't confirm it at scale, or don't report confirming it, or withhold some invalidating details. Or sometimes it's the other way around – people assume junk, as with Sparse Upcycling, even the authors think it's dead end junk, and then it's confirmed to work in Mixtrals, and now in a few Chinese models).
Derision and contempt for Chinese achievements is probably the last example of mainstream traditionally defined racism (directed against non-whites)
On the plus side, I get to have fun when Americans respond to videos of their cheap robots calling them CGI – because in the US, even the announcement of such a good product would have been a well-advertised, near-religious event (ahem, Boston Dynamics ads), with everyone kowtowing to our great cracked engineers, their proprietary insights worth so many billions, and the Spirit of Freedom.
It's not so much announced as claimed. Jimmy Apples (apparently a legitimate leaker, for all the nonsense) alleges that the core factor in here is whether Dustin Moskovitz persuades Mark Zuckerberg or not. You can imagine why I don't find this ideal. In any case, this makes me feel slightly better about Zuck (whom I respect already) but no, not much change strategically. I've been expecting a >300B LLaMA release since long before L3's debut; Meta is the best of big GPU-rich corps on this front and they'll probably be as good as Zuck's word. But like all major Western corps, they do have an internal political struggle. Armen Aghajanyan, the author of Chameleon, the first serious response to GPT-4o's announced deep fusion of modalities, explicitly advises the community to undo the work of safetyists:
A restricted, safety aligned (no-image-out) version of Chameleon (7B/34B) is now open-weight!
The team strongly believes in open-source. We had to do a lot of work to get this out to the public safely.
God will not forgive me for how we tortured this model to get it out.
Things I recommend doing:…
(To his satisfaction, the community has listened).
There are people with similar attitude at Google, eg Lucas Beyer:
We'll try to write about quite a bit of it, but not everything down to the last detail.
(Elsewhere he remarked that they've "found a loophole" to still publish non-garbage research openly, but have to compromise; or something to this effect).
So another question is whether we'll learn anything in detail about Llama's construction, though no scientific profundity is expected here.
Bad case is we might get another Llama-2-Chat or CodeLlama, that were lobotomized to a different extent with safety measures.
One other problem with 405B is that if it's like L3-8B and L3-70B, that is to say, an archaic dense model – it'll be nigh-inaccessible and non-competitive on cost, except for the insane margins that closed companies are charging. You'll need a cluster to run it, and at very slow total speed and high FLOPs/token (up to 20x more than in the case of DS-236/21B, though realistically less than 20x – MLA is hard to implement efficiently, there's some debate on this now), and its cache will be big too, again driving down batch size (and not conducive to cache storing, which is becoming a thing). If it's truly so incredible as to deserve long-term support, we will accelerate it with a number of tricks (from fancy decoding to sparsification) but some non-ergodic accelerations might diminish the possibility for downstream customization (which is less valuable/necessary with such a model, admittedly).
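Where the "up to 20x" figure comes from can be sketched with the usual rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. That rule of thumb and the function name `forward_flops_per_token` are my assumptions here; the estimate ignores attention cost, the KV cache and implementation efficiency, which is exactly why the realistic gap is smaller.

```python
def forward_flops_per_token(active_params: float) -> float:
    # Rule-of-thumb assumption: ~2 FLOPs per active parameter per generated token,
    # ignoring attention over the context and any kernel inefficiency.
    return 2.0 * active_params


dense_405b = forward_flops_per_token(405e9)  # dense model: every parameter is active
ds_236_21b = forward_flops_per_token(21e9)   # MoE: 21B activated out of 236B total
ratio = dense_405b / ds_236_21b
print(f"dense 405B needs about {ratio:.1f}x the FLOPs/token")
```

The ratio depends only on active parameter counts (405/21), so it is an upper bound on the compute gap rather than a measurement of real serving cost.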
All that said, if scaling from 8B to 70B holds (I'll ignore multimodality rumors for now), it will be incredibly strong and even more importantly – it'll be the first usable open model to distill from. This high-signal user summarizes the current meta game as such:
interesting recent difference between OSS and Closed AI
OSS: train a smol model to check pipeline and data mix and then train the biggest model we can afford with the same recipe
Closed: Train the largest, most capable model we can and then distill it down as far as we can (flash, gpt-4o)
Google dude concurs:
We have been advocating for this route for a while (BigTransfer, then patient and consistent distillation paper). I gave several public and recorded talks outlining this as the most promising setup then, now three years ago. So i think it’s been clear for a while.
This is true, and this makes current open developments, even ones I appreciate greatly like Deepseek, something of a dead end compared to inefficient, humongous «teacher models» that feed this cycle of automated R&D I talked about. Meta may break it, and nobody else has the capital and the will to, but will they?
In conclusion, I think that there's a cycle of mutual assurance. If another group, and preferably a group from the land of Evil Stealing Communists, releases a strong artifact (and nothing much happens), this raises the perceived safety waterline for releasing yours, at least in discourse. «Dustin, we already have better coding models than GPT-4 on huggingface, what's the worst that could happen? CCP will take it lol?» And if nobody else does, it's that much harder to justify your decision to the politruk.
Edit. Cohere is another potentially major contributor to open models. Aidan is a good man and seems very opposed to safetyism.
I'm not very active, just wanted to write an AI-related update.
My broad pessimistic prior is that "careers" are inherently not safe even on a short time scale (I can say stuff like high-dexterity blue collar work or face-to-face service work will last a while, but that's hardly a good solution), and what is safe is just having capital in forms that'll remain relevant even under full automation, equity of major tech companies, or being inherently a member of a class protected by parties with monopoly on violence.
Gemma-27B is a great model. Excellent conversationalist, uncensored, multilingual, technically very sweet. Somewhat softer in the head than Gemini-Flash and L3-70b. What I mean is they don't take risks of rocking the boat or even minorly threatening their bottom line.
I was surprised myself so looked it up
https://www.themotte.org/post/780/culture-war-roundup-for-the-week/167179?context=8#context
Nothing decent, I overestimated Mamba. It's finding good use in non-LLM domains now though.
I wouldn't call it vulgar, seeing as utility of code assistants and agents is the bulk of my case for why this model series is of any interest.
You are correct that SWE compensations make the difference in $ between a million DS tokens and a million Sonnet tokens negligible as far as we speak of basic prompt completion or some such. However, if we can harness them for actual long-range project level coding in some agentic setup, frontier model API costs can get out of hand fast. Say, CodeR takes 323K tokens on average to resolve a problem on SWE-lite. That's $3.34 according to this. And its completion rate is only 85/300. It seems likely that higher-level and more open-ended problems are exponentially more costly in tokens, and weakening the model adds another exponential on top of that. Worse, generation and even evaluation take time, and as sequence length grows (and you can't always neatly parallelize and split), each successive token is slower to generate (there's memory bandwidth, which DeepSeek largely eliminates, but also just attention computation that grows quadratically). The physics of software bloat being what it is, in the limit we may want to process truly gargantuan amounts of code and raw data on the fly. Or not.
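To make the CodeR numbers concrete, here is a rough sketch of what they imply per resolved problem. The inputs are the figures quoted above; the assumption that failed attempts cost the same on average as successful ones (so you pay for all 300 to get 85 fixes) is mine, as are the variable names.

```python
AVG_TOKENS_PER_ATTEMPT = 323_000   # average tokens per SWE-lite problem, as quoted
AVG_COST_PER_ATTEMPT_USD = 3.34    # average API cost per problem, as quoted
RESOLVED, ATTEMPTED = 85, 300      # completion rate, as quoted

# Blended price implied by the two quoted averages.
blended_price_per_million = AVG_COST_PER_ATTEMPT_USD / (AVG_TOKENS_PER_ATTEMPT / 1_000_000)

# Assumption: every attempt, failed or not, costs the average amount.
resolve_rate = RESOLVED / ATTEMPTED
cost_per_resolved = AVG_COST_PER_ATTEMPT_USD / resolve_rate

print(f"implied blended price: ${blended_price_per_million:.2f} per 1M tokens")
print(f"resolve rate: {resolve_rate:.1%}")
print(f"cost per resolved problem: ${cost_per_resolved:.2f}")
```

Dividing by the resolve rate is the point: at a 28% success rate the effective cost of a fix is several times the per-attempt cost, before any retries or human review.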
Of course frontier companies will catch up to DS in economics. More compellingly, open weights allow private data and private customization, from finetuning (made easier than usual in this case) to more sophisticated operations like activation vector creation and steering to composing with other networks.
But I admit this isn't my core interest. What interests me is 1) creating a synthetic data pipeline for general purpose task-oriented reasoning, and 2) operation of robots.
You might be sarcastic. I know something but honestly not that much. I just follow @arankomatsuzaki, and the specific issue of the differential between closed and open AI, and politics around it, is of great interest to me. So when I noticed this outlier among Chinese AI groups, I went and read their papers and a lot of related trivia.
I'm a huge DeepSeek fan so will clarify.
Those are their own LLMs, and they collectively bump that up to no more than $15M, most likely (we do not yet know the costs of R1 or anything about it, will take a few more weeks; V2.5 is ≈2.2M hours).
$0.14/1M input, $0.24/1M output vs $3/$15, to be clear. There are nuances like $0.014 per 1M input in the case of cache hits, opt-in paid caching on Anthropic, and the price hike to come in February.
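For a sense of scale, the ratios between the quoted prices work out as below. The numbers are the ones given above, in USD per 1M tokens; the dictionary layout and variable names are just illustrative, and the sketch ignores the paid caching option and the announced price hike.

```python
# USD per 1M tokens, as quoted in the comment above.
deepseek = {"input": 0.14, "output": 0.24, "cached_input": 0.014}
comparison = {"input": 3.00, "output": 15.00}

input_ratio = comparison["input"] / deepseek["input"]
output_ratio = comparison["output"] / deepseek["output"]
cache_hit_ratio = comparison["input"] / deepseek["cached_input"]

print(f"input tokens:          {input_ratio:.1f}x cheaper")
print(f"output tokens:         {output_ratio:.1f}x cheaper")
print(f"input with cache hits: {cache_hit_ratio:.0f}x cheaper")
```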
But crucially, they've published model and paper. This is most likely done because they assume top players already know all these techniques, or are close but work on another set that'll yield the same effect.