What is this place?

This website is a place for people who want to move past shady thinking and test their ideas in a court of people who don't all share the same biases. Our goal is to optimize for light, not heat; this is a group effort, and all commentators are asked to do their part.

The weekly Culture War threads host the most controversial topics and are the most visible aspect of The Motte. However, many other topics are appropriate here. We encourage people to post anything related to science, politics, or philosophy; if in doubt, post!

Check out The Vault for an archive of old quality posts. You are encouraged to crosspost these elsewhere.

Why are you called The Motte?

A motte is a stone keep on a raised earthwork common in early medieval fortifications. More pertinently, it's an element in a rhetorical move called a "Motte-and-Bailey", originally identified by philosopher Nicholas Shackel. It describes the tendency in discourse for people to move from a controversial but high value claim to a defensible but less exciting one upon any resistance to the former. He likens this to the medieval fortification, where a desirable land (the bailey) is abandoned when in danger for the more easily defended motte. In Shackel's words, "The Motte represents the defensible but undesired propositions to which one retreats when hard pressed."

On The Motte, always attempt to remain inside your defensible territory, even if you are not being pressed.

New post guidelines

If you're posting something that isn't related to the culture war, we encourage you to post a thread for it. A submission statement is highly appreciated, but isn't necessary for text posts or links to largely-text posts such as blogs or news articles; if we're unsure of the value of your post, we might remove it until you add a submission statement. A submission statement is required for non-text sources (videos, podcasts, images).

Culture war posts go in the culture war thread; all links must either include a submission statement or significant commentary. Bare links without those will be removed.

If in doubt, please post it!

Rules
Recommended Posts And Communities
Recommended Realtime Chats
- Astral Codex Ten Discord
- Quokka's Den Telegram

magic9mushroom If you're going to downvote me, and nobody's already voiced your objection, please reply and tell me 1yr ago (thezvi.wordpress.com) 2684 thread views

Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update to some Mottizens.

faul_sname Fuck around once, find out once. Do it again, now it's science. 1yr ago

> I posit that the optimal solution to RLHF, posed as a problem to NN-space and given sufficient raw "brain"power, is "an AI that can and will deliberately psychologically manipulate the HFer". Ergo, I expect this solution to be found given an extensive-enough search, and then selected by a powerful-enough RLHF optimisation. This is the idea of mesa-optimisers.

I posit that ML models will be trained using a finite amount of hardware for a finite amount of time. As such, I expect that the "given sufficient power" and "given an extensive-enough search" and "selected by a powerful-enough RLHF optimization" givens will not, in fact, be given.

There's a thought process that the Yudkowsky / Zvi / MIRI / agent foundations cluster tends to gesture at, which goes something like this:

1. Assume you have some ML system, with some loss function
2. Find the highest lower bound on loss you can mathematically prove
3. Assume that your ML system will achieve that
4. Figure out what the world looks like when it achieves that level of loss

(Also 2.5: use the phrase "utility function" to refer both to the loss function used to train your ML system and also to the expressed behaviors of that system, and 2.25: assume that anything you can't easily prove is impossible is possible).

I... don't really buy it anymore. One way of viewing Sutton's Bitter Lesson is "the approach of using computationally expensive general methods to fit large amounts of data outperforms the approach of trying to encode expert knowledge", but another way is "high-volume, low-quality reward signals are better than low-volume, high-quality reward signals". As long as trends continue in that direction, "an AI which monomaniacally pursues the maximal possible value of a single reward signal far in the future" is just not a super compelling threat model to me.

> I'm mostly thinking about the AI proper going rogue rather than the character it's playing

What "AI proper" are you talking about here? A base model LLM is more like a physics engine than it is like a game world implemented in that physics engine. If you're a player in a video game, you don't worry about the physics engine killing you, not because you've proven the physics engine safe, but because that's just a type error.

If you want to play around with base models to get a better intuition of what they're like and why I say "physics engine" is the appropriate analogy, hyperbolic has llama 405b base for really quite cheap.
