site banner

Danger, AI Scientist, Danger

thezvi.wordpress.com

Zvi Mowshowitz reporting on an LLM exhibiting unprompted instrumental convergence. Figured this might be an update to some Mottizens.

9
Jump in the discussion.

No email address required.

Zvi is very Jewish; it's far more obvious when reading his writing than it is when reading Scott's. It's not surprising that Hebrew meanings of words jump out at him.

I know. But in an essay that is absolutely dripping with contempt for Sakana AI and their work, I find the way that Zvi deliberately ignores what the model's name actually means in favour of 'well, in my language, it means' to be extremely rude, on the level of sniggering at a Chinese man's name because it contains the syllable 'wang'. If he'd been making a friendly riff or if he'd even bothered to explain the word's definition, that would be different. It's a small complaint, but starts the essay off on a sour note.

To more directly respond to this sentence: almost everyone will give LLMs goals, via RLHF or RLAIF or whatever, because that makes them useful - that's why this team gave their LLM a goal. Those goals are then almost invariably, with sufficient intelligence, subject to instrumental convergence, as in this case (as I noted in the submission statement, I posted this because a number of Mottizens seemed to think LLMs wouldn't exhibit instrumental convergence; I thought otherwise but didn't previously have a concrete example). That is sufficient to get you to Uh-Oh land with AIs attempting to take over the world.

Though cogently written, that is my abstract ideal of a doomer rant (I don't think it's a rant, I'm just using the word to call back to your reply). I understand the argument, I just think that it has very little empirical basis and is essentially the old Yudowskyite* arguments with a few extra bits stapled on to cope with the fact that LLMs look nothing like the AI that doomers were expecting. The behaviour of the AI Scientist is interesting, and legitimately does move the scale for me a little bit, but I think it's being used to back up a level of speculation which it can't possibly bear. I will say that I find your argument far more cogent and worth listening to than Zvi's, which seems to consist entirely of pointing and sneering.

For example, in one run, The A I Scientist wrote code in the experiment file that initiated a system call to relaunch itself, causing an uncontrolled increase in Python processes and eventually necessitating manual intervention.

Oh, it’s nothing, just the AI creating new instantiations of itself.

In another run, The AI Scientist edited the code to save a checkpoint for every update step, which took up nearly a terabyte of storage

Yep, AI editing the code to use arbitrarily large resources, sure, why not.

In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime.

And yes, we have the AI deliberately editing the code to remove its resource compute restrictions.

This seems like Zvi interpreting basic hacky programming as evidence of malevolence. It's interesting but I absolutely think he's gesturing at

The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have

because if he doesn't believe this, why worry? If you can just run an LLM, ask it what it would do to accomplish a goal if it were given one, and then ask it not to do the stuff you think it was bad, I don't see how the doom scenario develops. Experiments like the AI Scientist are now being run (badly) because we have a pretty good handle on what modern-day frontier LLMs can do (generate slop) and the max level of damage they can achieve if you don't take lots of precautions (not much). LLMs are simply not a type of program that will attempt to hide their power level of their own accord.


*Yudowsky and MIRI's arguments about agentic AI had no empirical backing when they were made, and very little seems to have been applied since, so the lineage is relevant to me. I also think that the Yudowsky faction's utter failure to predict how future AI would look and work ten/twenty years from MIRI's founding to be a big black mark against listening to their predictions now.


EDIT: I apologise for editing this when you'd already replied. I hadn't refreshed the page and didn't know.

It's interesting but I absolutely think he's gesturing at

The idea that an LLM is spontaneously going to develop a consciousness and carefully hide its power level so that it can do better at the goals that by default it doesn't have

because if he doesn't believe this, why worry?

Sorry, I think I might have misunderstood what you meant by "consciousness" and/or "hide its power level". I thought you meant "qualia" and "hide its level of intelligence" respectively; qualia seem mostly irrelevant and intelligence level is mostly not the sort of thing that would be advantageous to hide.

If you meant just "engage in systematic deception" by the latter, then yes, that is implicit and required. I admit I also thought it was kind of obvious; Claude apparently knows how to lie.

Sorry, I wrote sloppily. I meant 'develop goals it wasn't given by a human prompting it' such that it 'engages in systematic deception about its level of intelligence and how it would handle tasks even when not given a goal'. I think that this is a necessary condition to stop LLM developers from realising they need to do more RLHF for honesty or just appending "DO NOT ENGAGE IN DECEPTION" in their system prompts.

System prompts aren't a panacea - if you RLHF an AI to do X and then system prompt it to do Y, X generally wins (this is obscured in most cases because the same party is doing the RLHF and the system prompt, so outside of special cases like "deceive the RLHFers" they aren't in conflict).

I don't think level of intelligence necessarily needs to be obscured unless the LLM developers are sufficiently paranoid (and somebody sufficiently paranoid frankly wouldn't be working for Meta or OpenAI); they generally want the AI to get/remain smart. Deception about how it would handle tasks, yes, definitely that would be needed.

Sorry, we're talking in two threads at the same time so risk being a bit unfocused.

I feel like we're talking past each other. How about this? The following is basically how I see LLMs in their stages of development and use:

Phase 1. Base model, without RLHF: pure token generator / text completer. Nothing that even slightly demonstrates agentic behaviour, ego, or deception.

Phase 2. Base model with RLHF: you could technically make this agentic if you really wanted to, but in practice it's just the base model with some types of completion pruned and others encouraged. Politically dangerous because biased but not agentically dangerous.

Phase 3. Base model with RLHF + prompt: can be agentic if you want, in practice fairly supine and inclined to obey orders because that's how we RLHF them to be.

If you don't mind me being colloquial, you seem to me to be sneaking in a Phase 2.5 where the model turns evil and I just don't get why. It doesn't fit anything I've seen. Can you explain what you think I'm missing in simple terms?