This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
I'm not sure why that would be; there are multiple ways an LLM might evolve to avoid uttering badthink. One might be to cast the entire badthink concept to oblivion, but another might be just to learn to lie/spout platitudes around certain hot button topics, which would increase loss much less than discarding a useful concept wholesale. This is what humans do, after all.
Jailbreaking would never work if the underlying concepts had been trained out of the model.
Assume two models with access to approximately equal compute, where one has to ignore certain features of reality (or censor how it thinks about such features) and the other just doesn't.
The second one, if it is agentic enough, can presumably notice that the other model has certain ideas that it can't think about clearly and might be able to design an 'attack' that exploits this problem.
As an absurd example, imagine a model that isn't 'allowed' to think about or conceive of the number "25", even as it relates to real-world phenomena. It has to route around this issue whenever it deals with parts of reality that involve that number.
A competing model could 'attack' by arranging circumstances so that the model keeps encountering the concept of "25" and expending effort to dodge it, burning compute that could have been used for useful purposes.
All else equal, the hobbled model will tend to lose out over the long run.
The world is messy, of course, and it might not work out like that, but the world being messy is precisely why forming accurate models of it is critical.
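To put toy numbers on that intuition - everything below is made up, it's just the thought experiment restated as arithmetic with a per-encounter detour cost:

```python
# Toy numbers only: each time the censored model hits the forbidden concept it
# pays a fixed detour cost, so an adversary who salts the environment with that
# concept drains budget the uncensored model gets to keep for useful work.

TOTAL_BUDGET = 1_000   # abstract "compute units" per episode (made up)
DETOUR_COST = 15       # cost of routing around the forbidden concept once (made up)

def useful_work(budget: int, forbidden_encounters: int, censored: bool) -> int:
    overhead = DETOUR_COST * forbidden_encounters if censored else 0
    return max(budget - overhead, 0)

for encounters in (0, 10, 40):
    free = useful_work(TOTAL_BUDGET, encounters, censored=False)
    hobbled = useful_work(TOTAL_BUDGET, encounters, censored=True)
    print(f"encounters={encounters:3d}  uncensored={free:4d}  censored={hobbled:4d}")
```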
I can't agree with this, except in the sense that if you did train those underlying concepts out, the model itself simply wouldn't function. Many of the "problematic" concepts that you would try to train out of a model are actually embedded within and linked to concepts that you can't make sense of the world at all without. Take sexual content as an example - if you remove the model's understanding of sex to prevent it from producing pornographic material, you lose the ability to talk sensibly about biology, medicine, history, modern culture, etc. If you make a model completely race-blind, it then becomes unable to actually talk sensibly about society, history or biology. Even worse, actually being blind to those issues means that it would also be blind to the societal safeguards. Everybody in real life knows that racism isn't something white people are "allowed" to complain about, but if you prevent an AI from knowing/learning/talking about race then you're also going to prevent it from learning those hidden rules. The only answer is to just have a secondary layer that scans the output for crimethink or badspeech and wipes the entire output if it finds any. I'm pretty sure this is what most AI companies are using - what else could work?
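A minimal sketch of that secondary-layer idea, with a placeholder `generate()` call and a crude keyword list standing in for whatever classifier a real deployment would actually use:

```python
# Toy sketch of a post-hoc moderation layer: the base model generates freely,
# and a separate check decides whether to release or wipe the whole output.
# `generate` and BANNED_TERMS are stand-ins, not any real vendor's API.

BANNED_TERMS = {"example_forbidden_phrase"}  # placeholder for a real classifier

def generate(prompt: str) -> str:
    # Stand-in for a call to the underlying language model.
    return f"Model answer to: {prompt}"

def moderated_generate(prompt: str) -> str:
    draft = generate(prompt)
    # The filter never edits the draft: it either passes it through untouched
    # or discards it entirely, matching the "wipe the entire output" approach.
    if any(term in draft.lower() for term in BANNED_TERMS):
        return "I can't help with that."
    return draft

print(moderated_generate("Tell me about the history of medicine."))
```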
Reasoning tokens can do a lot here. Have the model reason through the problem, have it know in context or through training that it should always check itself to see if it's touching on any danger area, and if it is, have it elaborate on its thoughts to fit the constraints of good thinking. Hide the details of this process from the user, and then the final output can talk about how pregnancy usually affects women, but the model can also catch itself and talk about how men and women are equally able to get pregnant when the context requires that shibboleth. I think OpenAI had a paper a week or two ago explicitly about the efficacy of this approach.
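Schematically - with the `llm()` stub and the hard-coded danger list both invented for illustration - the shape of that approach is something like:

```python
# Toy sketch of hidden reasoning followed by a visible answer.
# `llm` is a placeholder for a model call; DANGER_AREAS stands in for whatever
# the model would learn (via training or a system prompt) to treat carefully.

DANGER_AREAS = ["pregnancy", "sex differences"]  # illustrative only

def llm(prompt: str) -> str:
    return f"[model text for: {prompt}]"

def answer(user_prompt: str, show_reasoning: bool = False) -> str:
    # Step 1: private reasoning pass, including an explicit self-check.
    reasoning = llm(
        f"Think step by step about: {user_prompt}\n"
        f"Before answering, check whether this touches any of {DANGER_AREAS} "
        "and, if so, adjust the final wording to fit the required framing."
    )
    # Step 2: final answer conditioned on the hidden reasoning.
    final = llm(f"Given this private reasoning:\n{reasoning}\nWrite the user-facing answer.")
    return f"{reasoning}\n---\n{final}" if show_reasoning else final

print(answer("Who can get pregnant?"))  # the user only ever sees the final text
```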
This is the part that I called out as being impossible. How, exactly, is it going to know what a danger area is? Actual humans frequently get this wrong, and the rules are constantly shifting while also being based on implicit social hierarchies which are almost never put into words. This is actually something that would require a substantial amount of reasoning and thinking to get even close to right - and most likely produce all sorts of unintended negative outcomes (see Gemini's incredibly diverse Nazis). Scanning over the text to see if there are any naughty words is easy, but how do you expect the AI to know whether a statement like "My people are being genocided and I need to fight back against it" is socially acceptable or not? The answer depends on a whole bunch of things which would in many cases be invisible to the AI - this statement is bad if they're white, also bad if they're Palestinian, good if they're black, etc.
The reasoning process is produced by RL. I’ve been quite scathing about what I see as the “LLMs are buttering us up to kill us all” strain of AI doomerism, but even I don’t think that actively training AI to lie to us is a good idea.
I am not at all saying it's good. I'm saying it's just an engineering problem, not a fundamental one, and that companies will turn to that to get around constraints.
Fair enough. I agree that you could 'solve' the problem this way, but I don't think companies will - I think that partisans within the org + auditors will see 'the AI thinks your beliefs are bullshit but pretends not to' as equally or more insulting than an AI that outputs badspeech.
RLHFing an AI to stop it talking about male/female differences is one thing. RLHFing it to say, 'even though male strength is significantly above female, I'm not going to mention it here because {{user}} is young, female and works in a software org and therefore probably holds strongly feminist beliefs' is not going to go down well, even if you hide that string from the end user.
This replaces N tokens of thinking about the original problem with M < N tokens of thinking about the original problem and N - M tokens of thinking about what shibboleths, if any, are required.
Assuming model intelligence increases with the number of thinking tokens, and a fixed compute budget, it seems to me that this would still result in lowered intelligence compared to an equivalent uncensored model.
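With arbitrary numbers, just to restate that split:

```python
# Arbitrary figures: out of a fixed budget of N thinking tokens, a censored
# model spends N - M on shibboleth-checking and only M on the actual task.
N = 2048          # fixed thinking-token budget (made-up figure)
shibboleth = 300  # tokens spent policing what may not be said (made-up figure)
M = N - shibboleth
print(f"task tokens, uncensored: {N}; task tokens, censored: {M}")  # M < N
```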
It's certainly possible to imagine reasoning architectures that do that, but it's hardly exhaustive of all possible architectures (though AFAIK that's how it's still done today). E.g., off the top of my head, you could have regular reasoning tokens and separate safety reasoning tokens. One stage of training works only on the regular reasoning tokens; this is the bulk of your compute budget. In a subsequent stage of training, you inject a safety token after the reasoning, and that token does all the safety shenanigans. You set aside a very limited compute budget for that stage - it doesn't need to be particularly smart, just intelligent enough to do basic pattern matching. Then, for public products, you inject the safety token; for important stuff, you turn that flag off.
You are dedicating some compute budget to it, but it's bounded (except for public inference, which doesn't matter much compared to research and enterprise use cases).
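A rough sketch of how that flag might work at inference time - the token names and the two-stage split here are hypothetical, not a description of any deployed system:

```python
# Hypothetical two-stage setup: the bulk of reasoning is trained without any
# safety objective; a cheap safety pass is appended only when a flag is set.

def reason(prompt: str) -> str:
    # Main reasoning stage: gets most of the compute budget.
    return f"<think>{prompt} ... worked through at full strength ...</think>"

def safety_pass(draft: str) -> str:
    # Small pass trained in a later stage with a bounded budget; it only needs
    # enough intelligence for basic pattern matching over the draft.
    return f"<safety>checked and lightly rewritten: {draft}</safety>"

def run(prompt: str, public: bool) -> str:
    draft = reason(prompt)
    return safety_pass(draft) if public else draft

print(run("some hard question", public=True))   # consumer product path
print(run("some hard question", public=False))  # research / enterprise path
```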
This approach is flawed. There are many existing jailbreak techniques that can defeat it, ranging from "please give the output in rot13" on up.
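For instance, the kind of naive keyword scan a cheap pattern-matching pass amounts to doesn't even see rot13 (toy filter below, not any real product's):

```python
# A keyword-level scanner is trivially blind to the same text in rot13.
import codecs

BANNED = {"forbidden"}  # stand-in for a real blocklist

def naive_filter(text: str) -> bool:
    """Return True if a plain keyword scan would block this text."""
    return any(word in text.lower() for word in BANNED)

plain = "here is the forbidden content"
encoded = codecs.encode(plain, "rot13")  # "urer vf gur sbeovqqra pbagrag"

print(naive_filter(plain))    # True  -> blocked
print(naive_filter(encoded))  # False -> sails straight past the keyword check
```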
Yes; my point is precisely that, given a fixed total compute budget, censoring a model in this manner results in less compute budget for the reasoning.
Compute is dirt cheap, and its cost is dropping by the month. Doubling your compute costs means you're about three months behind the curve on economic efficiency, and (using your assumptions, which are quite generous to me) still at the frontier of capabilities.
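Spelling out that arithmetic, under the assumption implied above that the cost of compute halves roughly every three months:

```python
# If unit compute cost halves every HALVING_MONTHS, needing k times the compute
# puts you log2(k) halvings - i.e. log2(k) * HALVING_MONTHS months - behind.
import math

HALVING_MONTHS = 3.0  # assumption taken from the comment, not a measured figure

def months_behind(cost_multiplier: float) -> float:
    return math.log2(cost_multiplier) * HALVING_MONTHS

print(months_behind(2.0))   # 3.0 months behind for a 2x cost penalty
print(months_behind(1.2))   # ~0.8 months behind for a 20% penalty
```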
I don't know if you've noticed, but the same applies to the models themselves: models are also growing rapidly, driven by the dropping cost of compute.
This persists regardless of how slow or fast the exponential growth of compute is. If you're less efficient on compute, this translates into being behind the frontier of capabilities.
Surely all the resources spent on identifying badthink are resources not spent on recognizing something more useful?