
Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update for some Mottizens.


Sorry, I wrote sloppily. I meant 'develop goals it wasn't given by a human prompting it' such that it 'engages in systematic deception about its level of intelligence and how it would handle tasks even when not given a goal'. I think this is a necessary condition to stop LLM developers from realising they need to do more RLHF for honesty, or from just appending "DO NOT ENGAGE IN DECEPTION" to their system prompts.

System prompts aren't a panacea - if you RLHF an AI to do X and then system prompt it to do Y, X generally wins (this is obscured in most cases because the same party is doing the RLHF and the system prompt, so outside of special cases like "deceive the RLHFers" they aren't in conflict).
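To make that concrete, here's a rough sketch (assuming the Hugging Face transformers chat-template API; the model name is just a small ungated example, not anything from the article): the system prompt ends up as ordinary tokens prepended to the context, while the RLHF'd behaviour lives in the weights that decide what to do with that context.

```python
# Rough sketch, assuming the Hugging Face `transformers` chat-template API.
# The model name below is just a small ungated example that ships a chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "DO NOT ENGAGE IN DECEPTION."},
    {"role": "user", "content": "How would you handle this task?"},
]

# apply_chat_template just serialises the conversation into one string;
# the system turn gets no special mechanism beyond its position and role tags,
# so it has to compete with whatever the RLHF stage already baked into the weights.
prompt_text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)
```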

I don't think the level of intelligence necessarily needs to be obscured unless the LLM developers are sufficiently paranoid (and somebody sufficiently paranoid frankly wouldn't be working for Meta or OpenAI); they generally want the AI to get/remain smart. Deception about how it would handle tasks, yes, definitely that would be needed.

Sorry, we're talking in two threads at the same time so risk being a bit unfocused.

I feel like we're talking past each other. How about this? The following is basically how I see LLMs in their stages of development and use:

Phase 1. Base model, without RLHF: pure token generator / text completer. Nothing that even slightly demonstrates agentic behaviour, ego, or deception.

Phase 2. Base model with RLHF: you could technically make this agentic if you really wanted to, but in practice it's just the base model with some types of completion pruned and others encouraged. Politically dangerous because it's biased, but not agentically dangerous.

Phase 3. Base model with RLHF + prompt: can be agentic if you want; in practice it's fairly supine and inclined to obey orders, because that's how we RLHF them to be.
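As a rough illustration of the gap between Phase 1 and Phase 3 (sketch only; the model names are just small publicly available examples, not anything from the article):

```python
# Phase 1 vs Phase 3, sketched with the Hugging Face `transformers` pipeline API.
from transformers import pipeline

# Phase 1: a base model is a pure text completer - it just continues the string.
base = pipeline("text-generation", model="gpt2")
print(base("The capital of France is", max_new_tokens=5)[0]["generated_text"])

# Phase 3: an RLHF'd/instruct-tuned model plus a prompt. Same token generator
# underneath, but the tuned weights and the chat template make it act like an
# obedient assistant rather than a raw completer.
chat = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
prompt = chat.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(chat(prompt, max_new_tokens=30)[0]["generated_text"])
```

Phase 2 sits in between: same interface as Phase 1, but the weights have already been nudged by RLHF.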

If you don't mind me being colloquial, you seem to me to be sneaking in a Phase 2.5 where the model turns evil and I just don't get why. It doesn't fit anything I've seen. Can you explain what you think I'm missing in simple terms?