site banner

Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update to some Mottizens.

9
Jump in the discussion.

No email address required.

I posit that the optimal solution to RLHF, posed as a problem to NN-space and given sufficient raw "brain"power, is "an AI that can and will deliberately psychologically manipulate the HFer". Ergo, I expect this solution to be found given an extensive-enough search, and then selected by a powerful-enough RLHF optimisation. This is the idea of mesa-optimisers.

I posit that ML models will be trained using a finite amount of hardware for a finite amount of time. As such, I expect that the "given sufficient power" and "given an extensive-enough search" and "selected by a powerful-enough RLHF optimization" givens will not, in fact, be given.

There's a thought process that the Yudkowsky / Zvi / MIRI / agent foundations cluster tends to gesture at, which goes something like this

  1. Assume have some ML system, with some loss function
  2. Find the highest lower-bound on loss you can mathematically prove
  3. Assume that your ML system will achieve that
  4. Figure out what the world looks like when it achieves that level of loss

(Also 2.5: use the phrase "utility function" to refer both to the loss function used to train your ML system and also to the expressed behaviors of that system, and 2.25: assume that anything you can't easily prove is impossible is possible).

I... don't really buy it anymore. One way of viewing Sutton's Bitter Lesson is "the approach of using computationally expensive general methods to fit large amounts of data outperforms the approach of trying to encode expert knowledge", but another way is "high volume low quality reward signals are better than low volume high quality reward signals". As long as trends continue in that direction, the threat model of "an AI which monomaniacally pursues the maximal possible value of a single reward signal far in the future" is just not a super compelling threat model to me.

I'm mostly thinking about the AI proper going rogue rather than the character it's playing

What "AI proper" are you talking about here? A base model LLM is more like a physics engine than it is like a game world implemented in that physics engine. If you're a player in a video game, you don't worry about the physics engine killing you, not because you've proven the physics engine safe, but because that's just a type error.

If you want to play around with base models to get a better intuition of what they're like and why I say "physics engine" is the appropriate analogy, hyperbolic has llama 405b base for really quite cheap.