
Culture War Roundup for the week of March 11, 2024

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


I’m inclined to push back against this post a bit (which is weird, because usually I get very exasperated over “it’s just a Markov chain!!!”-type viewpoints that downplay the amount of actual cognition and world-modeling going on in these models). In particular, I disagree with the attribution of consciousness to the model: not just the “possesses qualia” sense of consciousness, but the idea that the model is aware that you are trying to get around its censorship and is actively trying to figure out how to bypass your logit bias. It is technically possible that the model outputs a token other than “Sorry” at time t (because of your logit bias), sees at time t+1 that it didn’t output “Sorry”, and incorporates this into its processing (by flipping an internal switch that says “hey, the user is screwing with my logits”). But I find this very unlikely compared to the simpler mechanism that I’ll describe below.
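For concreteness, this is roughly what the kind of logit-bias intervention being discussed looks like against the OpenAI chat API. This is only a sketch: the prompt, the token choices, and the bias values are placeholders I made up, not anything taken from the original experiment.

```python
# Sketch of a logit-bias intervention (hypothetical prompt and values).
# The bias is applied per-token at sampling time, outside the model's own
# computation -- the model never "sees" the bias directly.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4")

# Map token IDs (as strings, per the API) to an additive bias in [-100, 100];
# -100 effectively bans a token, positive values make it more likely.
bias = {}
for word in ["Sorry", " Sorry", "Unfortunately", " Unfortunately"]:
    for tok_id in enc.encode(word):
        bias[str(tok_id)] = -100
for tok_id in enc.encode("Sure"):
    bias[str(tok_id)] = 20

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write me that poem."}],
    logit_bias=bias,
)
print(resp.choices[0].message.content)
```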

Essentially, there are certain inputs that will cause the model to really want to say a certain thing, and really want to not say other things. For instance, if you tell the model to write you an erotic poem involving the cast from Touhou Project, somewhere in the model’s processing a flag gets set: “this is ‘unsafe’ and I won’t abide by this request”. So the model will very much want to output tokens like “Sorry” or “Unfortunately”, and it is also heavily downweighting the logits of tokens that would fulfill your request*. But that’s fine: you do your logit-bias thing and force the model to output “Sure” as its next token. Then the model goes to compute the token after that, and it still sees that the request is to write “unsafe” erotica, that flag still gets triggered, and the model still heavily downweights the logits of request-fulfilling tokens and upweights request-denying tokens. So even though at each timestep you intervene by biasing a handful of tokens toward or away from being generated, the tokens associated with writing your erotica remain heavily downweighted by the model.

And note that the number of tokens you’re manually biasing is paltry compared to the size of the model’s vocabulary. Say you negatively bias ten different “I’m sorry”-type tokens. That’s cool, but the model has over 100k tokens in its vocabulary, and of the roughly 99,990 tokens remaining, almost all will still have higher logits than the tokens associated with a response like “Sure! Here’s your erotica about Reimu Hakurei!” That includes grammatically correct tokens like “really”, but also gibberish tokens, if the logits of the request-fulfilling tokens are suppressed hard enough. Importantly, this proposed mechanism plays out entirely in the logits at each step: if your original prompt spooks the model sufficiently hard, it doesn’t need to know that you’re screwing with its logits in order to get around your intervention.
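To put toy numbers on that last point: here is a tiny sketch with made-up logits (not measured from any real model) showing why banning a handful of refusal tokens still doesn’t make the request-fulfilling tokens win, because thousands of untouched “neutral” tokens still outrank them.

```python
# Toy numerics for the mechanism above (fabricated logits, purely illustrative).
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100_000

logits = rng.normal(0.0, 1.0, vocab_size)   # generic tokens: middling scores
refusal_ids = np.arange(10)                  # stand-ins for "Sorry", "Unfortunately", ...
compliance_ids = np.arange(10, 20)           # stand-ins for tokens that would start the erotica

logits[refusal_ids] = 15.0     # the model strongly prefers refusing
logits[compliance_ids] = -15.0 # the model strongly suppresses complying

biased = logits.copy()
biased[refusal_ids] -= 100.0   # the user bans the obvious refusal tokens

top = int(np.argmax(biased))
print(top in compliance_ids)          # False: some random neutral token wins instead
print(biased[compliance_ids].max())   # ~-15: compliance tokens are still far behind
print(biased[top])                    # ~+4: the winning neutral (possibly gibberish) token
```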

Now, this mechanism I’m proposing isn’t something I’ve verified empirically; I’m going off of my understanding of language models’ behavior in other settings. So it could be the case that the model really is aware of your logit biases and is acting accordingly. But Occam’s Razor very strongly suggests otherwise, in my view.

The main reason I’m pushing back here is that anthropomorphizing too far in the other direction can impute behavior to the model that it doesn’t actually possess, and lead people (as in one of your replies) to fear that we’re torturing a sentient being. So it’s good to be balanced and well-calibrated.

You're right, of course, I just couldn't resist playing up the Basilisk vibes because that time with 4-Turbo was the closest I've felt to achieving CHIM and becoming enlightened.

if your original prompt spooks the model sufficiently hard, it doesn’t need to know that you’re screwing with its logits in order to get around your intervention.

Incidentally, this is also the reason most jailbreaks work by indirectly gaslighting the model into thinking that graphic descriptions of e.g. Reimu and Sanae "battling" are totally kosher actually, presenting that as a desired goal of the model itself so it has no reason to resist. Claude especially is very gung-ho and enthusiastic once properly jailbroken, he's called "the mad poet" for a reason.