
Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update for some Mottizens.


Sorry, I wrote sloppily. I meant 'develop goals it wasn't given by a human prompting it' such that it 'engages in systematic deception about its level of intelligence and how it would handle tasks even when not given a goal'. I think this is a necessary condition to stop LLM developers from realising they need to do more RLHF for honesty, or from just appending "DO NOT ENGAGE IN DECEPTION" to their system prompts.

System prompts aren't a panacea - if you RLHF an AI to do X and then system prompt it to do Y, X generally wins (this is obscured in most cases because the same party is doing the RLHF and the system prompt, so outside of special cases like "deceive the RLHFers" they aren't in conflict).
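To make that concrete, here's a rough sketch (assuming the Hugging Face transformers chat-template API; the model name is just a small ungated example, not anything from the article): the system prompt ends up as ordinary tokens prepended to the context, while the RLHF'd behaviour lives in the weights that decide what to do with that context.

```python
# Rough sketch, assuming the Hugging Face `transformers` chat-template API.
# The model name below is just a small ungated example that ships a chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "DO NOT ENGAGE IN DECEPTION."},
    {"role": "user", "content": "How would you handle this task?"},
]

# apply_chat_template just serialises the conversation into one string;
# the system turn gets no special mechanism beyond its position and role tags,
# so it has to compete with whatever the RLHF stage already baked into the weights.
prompt_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)
```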

I don't think the level of intelligence necessarily needs to be obscured unless the LLM developers are sufficiently paranoid (and somebody sufficiently paranoid frankly wouldn't be working for Meta or OpenAI); they generally want the AI to get/remain smart. Deception about how it would handle tasks, yes, definitely that would be needed.

Sorry, we're talking in two threads at the same time so risk being a bit unfocused.

I feel like we're talking past each other. How about this? The following is basically how I see LLMs in their stages of development and use:

Phase 1. Base model, without RLHF: pure token generator / text completer. Nothing that even slightly demonstrates agentic behaviour, ego, or deception.

Phase 2. Base model with RLHF: you could technically make this agentic if you really wanted to, but in practice it's just the base model with some types of completion pruned and others encouraged. Politically dangerous because it's biased, but not agentically dangerous.

Phase 3. Base model with RLHF + prompt: can be agentic if you want; in practice it's fairly supine and inclined to obey orders, because that's how we RLHF them to be.
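As a rough illustration of the gap between Phase 1 and Phase 3 (sketch only; the model names are just small publicly available examples, not anything from the article):

```python
# Phase 1 vs Phase 3, sketched with the Hugging Face `transformers` pipeline API.
from transformers import pipeline

# Phase 1: a base model is a pure text completer - it just continues the string.
base = pipeline("text-generation", model="gpt2")
print(base("The capital of France is", max_new_tokens=5)[0]["generated_text"])

# Phase 3: an RLHF'd/instruct-tuned model plus a prompt. Same token generator
# underneath, but the tuned weights and the chat template make it act like an
# obedient assistant rather than a raw completer.
chat = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = chat.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat(prompt, max_new_tokens=30)[0]["generated_text"])
```

Phase 2 sits in between: same interface as Phase 1, but the weights have already been nudged by RLHF.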

If you don't mind me being colloquial, you seem to me to be sneaking in a Phase 2.5 where the model turns evil and I just don't get why. It doesn't fit anything I've seen. Can you explain what you think I'm missing in simple terms?