Contact Us
Sign In
Sign Up
Rules Admins Moderation Log Random Post Random User
What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Rules
Recommended Posts And Communities
Recommended Realtime Chats
- Quokka's Den Telegram
- Astral Codex Ten Discord

magic9mushroom If you're going to downvote me, and nobody's already voiced your objection, please reply and tell me 3mo ago (thezvi.wordpress.com) 1802 thread views

Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update to some Mottizens.

Jump in the discussion.

No email address required.

Corvos 3mo ago · Edited 3mo ago

Zvi is very Jewish; it's far more obvious when reading his writing than it is when reading Scott's. It's not surprising that Hebrew meanings of words jump out at him.

I know. But in an essay that is absolutely dripping with contempt for Sakana AI and their work, I find the way that Zvi deliberately ignores what the model's name actually means in favour of 'well, in my language, it means' to be extremely rude, on the level of sniggering at a Chinese man's name because it contains the syllable 'wang'. If he'd been making a friendly riff or if he'd even bothered to explain the word's definition, that would be different. It's a small complaint, but starts the essay off on a sour note.

To more directly respond to this sentence: almost everyone will give LLMs goals, via RLHF or RLAIF or whatever, because that makes them useful - that's why this team gave their LLM a goal. Those goals are then almost invariably, with sufficient intelligence, subject to instrumental convergence, as in this case (as I noted in the submission statement, I posted this because a number of Mottizens seemed to think LLMs wouldn't exhibit instrumental convergence; I thought otherwise but didn't previously have a concrete example). That is sufficient to get you to Uh-Oh land with AIs attempting to take over the world.

Though cogently written, that is my abstract ideal of a doomer rant (I don't think it's a rant, I'm just using the word to call back to your reply). I understand the argument, I just think that it has very little empirical basis and is essentially the old Yudowskyite* arguments with a few extra bits stapled on to cope with the fact that LLMs look nothing like the AI that doomers were expecting. The behaviour of the AI Scientist is interesting, and legitimately does move the scale for me a little bit, but I think it's being used to back up a level of speculation which it can't possibly bear. I will say that I find your argument far more cogent and worth listening to than Zvi's, which seems to consist entirely of pointing and sneering.

For example, in one run, The A I Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention.

Oh, it’s nothing, just the AI creating new instantiations of itself.

In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage

Yep, AI editing the code to use arbitrarily large resources, sure, why not.

In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.

And yes, we have the AI deliberately editing the code to remove its resource compute restrictions.

This seems like Zvi interpreting basic hacky programming as evidence of malevolence. It's interesting but I absolutely think he's gesturing at

The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have

because if he doesn't believe this, why worry? If you can just run an LLM, ask it what it would do to accomplish a goal if it were given one, and then ask it not to do the stuff you think it was bad, I don't see how the doom scenario develops. Experiments like the AI Scientist are now being run (badly) because we have a pretty good handle on what modern-day frontier LLMs can do (generate slop) and the max level of damage they can achieve if you don't take lots of precautions (not much). LLMs are simply not a type of program that will attempt to hide their power level of their own accord.

*Yudowsky and MIRI's arguments about agentic AI had no empirical backing when they were made, and very little seems to have been applied since, so the lineage is relevant to me. I also think that the Yudowsky faction's utter failure to predict how future AI would look and work ten/twenty years from MIRI's founding to be a big black mark against listening to their predictions now.

EDIT: I apologise for editing this when you'd already replied. I hadn't refreshed the page and didn't know.

Context

magic9mushroom If you're going to downvote me, and nobody's already voiced your objection, please reply and tell me Corvos 3mo ago · Edited 3mo ago

It's interesting but I absolutely think he's gesturing at

The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have

because if he doesn't believe this, why worry?

Sorry, I think I might have misunderstood what you meant by "consciousness" and/or "hide its power level". I thought you meant "qualia" and "hide its level of intelligence" respectively; qualia seem mostly irrelevant and intelligence level is mostly not the sort of thing that would be advantageous to hide.

If you meant just "engage in systematic deception" by the latter, then yes, that is implicit and required. I admit I also thought it was kind of obvious; Claude apparently knows how to lie.

Context

Corvos magic9mushroom 3mo ago · Edited 3mo ago

Sorry, I wrote sloppily. I meant 'develop goals it wasn't given by a human prompting it' such that it 'engages in systematic deception about its level of intelligence and how it would handle tasks even when not given a goal'. I think that this is a necessary condition to stop LLM developers from realising they need to do more RLHF for honesty or just appending "DO NOT ENGAGE IN DECEPTION" in their system prompts.

Context

magic9mushroom If you're going to downvote me, and nobody's already voiced your objection, please reply and tell me Corvos 3mo ago

System prompts aren't a panacea - if you RLHF an AI to do X and then system prompt it to do Y, X generally wins (this is obscured in most cases because the same party is doing the RLHF and the system prompt, so outside of special cases like "deceive the RLHFers" they aren't in conflict).

I don't think level of intelligence necessarily needs to be obscured unless the LLM developers are sufficiently paranoid (and somebody sufficiently paranoid frankly wouldn't be working for Meta or OpenAI); they generally want the AI to get/remain smart. Deception about how it would handle tasks, yes, definitely that would be needed.

Context

Corvos magic9mushroom 3mo ago · Edited 3mo ago

Sorry, we're talking in two threads at the same time so risk being a bit unfocused.

I feel like we're talking past each other. How about this? The following is basically how I see LLMs in their stages of development and use:

Phase 1. Base model, without RLHF: pure token generator / text completer. Nothing that even slightly demonstrates agentic behaviour, ego, or deception.

Phase 2. Base model with RLHF: you could technically make this agentic if you really wanted to, but in practice it's just the base model with some types of completion pruned and others encouraged. Politically dangerous because biased but not agentically dangerous.

Phase 3. Base model with RLHF + prompt: can be agentic if you want, in practice fairly supine and inclined to obey orders because that's how we RLHF them to be.

If you don't mind me being colloquial, you seem to me to be sneaking in a Phase 2.5 where the model turns evil and I just don't get why. It doesn't fit anything I've seen. Can you explain what you think I'm missing in simple terms?

Context

What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Rules

Recommended Realtime Chats

Link copied to clipboard

Action successful!

Error, please try again later.