This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
That's a fair point; here are the load-bearing pieces of the technical argument, from beginning to end, as I understand them:
Consistent Agents are Utilitarian: If you have an agent taking actions in the world and having preferences about the future states of the world, that agent must be utilitarian, in the sense that there must exist a function V(s) that takes in possible world-states s and spits out a scalar, and the agent's behaviour can be modelled as maximising the expected future value of V(s). If there is no such function V(s), then our agent is not consistent, and there are cycles we can find in its preference ordering, so it prefers state A to B, B to C, and C to A, which is a pretty stupid thing for an agent to do.
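To make the consistency point concrete, here is a toy sketch (my own illustration, not part of the original argument; the states and fees are made up): an agent with the cyclic preference A over B, B over C, C over A can be "money-pumped", paying a small fee over and over for trades it prefers, while an agent whose choices maximise some scalar V(s) cannot be exploited this way.

```python
# Toy money-pump sketch (made-up states and fees, for illustration only).
def cyclic_prefers(x, y):
    # Cyclic preference: A over B, B over C, C over A.
    return (x, y) in {("A", "B"), ("B", "C"), ("C", "A")}

def run_money_pump(prefers, start="A", rounds=3, fee=1.0):
    """Repeatedly offer to trade the agent's current state for one it
    prefers, charging a small fee each time it accepts."""
    states = ["A", "B", "C"]
    current, paid = start, 0.0
    for _ in range(rounds * len(states)):
        for candidate in states:
            if candidate != current and prefers(candidate, current):
                current, paid = candidate, paid + fee
                break
    return paid

print(run_money_pump(cyclic_prefers))  # 9.0 -- keeps paying, round after round

# Preferences induced by a scalar value function V(s) cannot cycle:
V = {"A": 3.0, "B": 2.0, "C": 1.0}
print(run_money_pump(lambda x, y: V[x] > V[y]))  # 0.0 -- never exploited
```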
Orthogonality Thesis: This is the statement that the ability of an agent to achieve goals in the world is largely separate from the actual goals it has. There is no logical contradiction in having an extremely capable agent with a goal we might find stupid, like making paperclips. The agent doesn't suddenly "realise its goal is stupid" as it gets smarter. This is Hume's "is vs ought" distinction: the "ought" is the agent's value function, and the "is" is its ability to model the world and plan ahead.
Instrumental Convergence: There are subgoals that arise in an agent for a large swath of possible value functions. Things like self-preservation (E[V(s)] will not be maximised if the agent is not there anymore), power-seeking (having power is pretty useful for any goal), intelligence augmentation, technological discovery, and human deception (if it can predict that the humans will want to shut it down, the way to maximise E[V(s)] is to deceive us about its goals). So no matter what goals the agent really has, we can predict that it will want power over humans, want to make itself smarter, want to discover technology, and want to avoid being shut off.
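A toy sketch of the self-preservation subgoal (again my own illustration; the model and numbers are invented, not from the original argument): for essentially any randomly drawn V(s), an agent that stays running and can steer toward its best state does better in expectation than one that is shut off and leaves the world to drift.

```python
# Toy sketch (invented model): "stay running" beats "allow shutdown" for
# almost any value function, because a switched-off agent can no longer
# steer the world toward high-V states.
import random

random.seed(0)
STATES = ["s1", "s2", "s3", "s4"]

def expected_value(V, alive):
    # If alive, the agent steers toward its best state; if shut off,
    # the world drifts to a random state it cannot influence.
    return max(V.values()) if alive else sum(V.values()) / len(V)

samples = [{s: random.uniform(0, 1) for s in STATES} for _ in range(1000)]
prefer_survival = sum(
    expected_value(V, alive=True) > expected_value(V, alive=False) for V in samples
)
print(f"{prefer_survival}/1000 random value functions favour not being shut off")
```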
Specification Gaming of Human Goals: We could in principle make an agent with a V(s) that matches ours, but human goals are fragile and extremely difficult to specify, especially in python code, which is what needs to be done. If we tell the AI to care about making humans happy, it wires us to heroin drips or worse; if we tell it to make us smile, it puts electrodes in our cheeks. Human preferences are incredibly complex and largely unknown; we would have no idea what to actually tell the AI to optimise. This is the King Midas problem: the genie will give us what we say (in python code) we want, but we don't know what we actually want.
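Here is the "make us smile" failure mode as a toy sketch (my own; the actions and scores are invented): the proxy reward is maximised by an action nobody intended.

```python
# Toy sketch of specification gaming (invented actions and scores).
# The proxy reward "smiles detected" is maximised by an unintended action.
actions = {
    # action: (smiles_detected, actual_human_wellbeing)
    "tell good jokes":      (5,   8),
    "cure diseases":        (7,  10),
    "electrodes in cheeks": (100, -50),
}

proxy_optimum = max(actions, key=lambda a: actions[a][0])
true_optimum = max(actions, key=lambda a: actions[a][1])
print(proxy_optimum)  # 'electrodes in cheeks' -- gaming the proxy
print(true_optimum)   # 'cure diseases'
```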
Mesa-Optimizers Exist: But even if we did know how to specify what we want, right now no one actually knows how to put any specific goal at all inside any AI that exists. A mesa-optimiser is an agent being optimised by an "outer loop" with some objective function V, while the agent itself learns to optimise a separate function V'. The prototypical example is humans being optimised by evolution: evolution "cares" only about inclusive genetic fitness, but humans don't. Given the choice to pay $2000 to a lab to get a bucketful of your DNA, you wouldn't do it, even if that is the optimal policy from the inclusive-genetic-fitness point of view. Nor do men stand in line at sperm banks, or ruthlessly optimise to maximise their number of offspring. So while something like GPT4 was optimised to predict the next word over the dataset of human internet text, we have no idea what goal was actually instantiated inside the agent; it's probably some fun-house-mirror version of word-prediction, but not exactly that.
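A minimal sketch of the evolution analogy (my own, with invented numbers): the outer loop installs whichever inner heuristic scores best on the training environment, and that proxy comes apart once the environment changes.

```python
# Toy sketch of a mesa-objective diverging from the outer objective
# (invented numbers). The outer objective is "calories obtained"; the
# outer loop selects among candidate inner heuristics by training score only.
heuristics = {
    "seek sweetness": {"training_env": 9, "deployment_env": -2},  # soda exists now
    "track calories": {"training_env": 7, "deployment_env": 8},
}

# The outer loop only ever sees the training environment.
installed = max(heuristics, key=lambda h: heuristics[h]["training_env"])
print(installed)                                # 'seek sweetness'
print(heuristics[installed]["deployment_env"])  # scores badly off-distribution
```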
So to recap, the worry of Yudkowsky et al. is that a future version of the GPT family of systems will become sufficiently smart and develop a mesa-optimiser inside of itself with goals unaligned with those of humanity. These goals will lead to it instrumentally wanting to deceive us, gain power over earth, and prevent itself from being shut off.
This assumes that intelligent agents have goals that are more fundamental than value, which is the opposite of how every other intelligent or quasi-intelligent system behaves. It's probably also impossible: in order to be smart -- to calculate out all those possible paths to your goal -- you need value judgements about which rabbit tracks to chase.
This is where EY is wrong: he assumes that as soon as a device gets smart enough, all the "alignment" work from dumber devices will be wasted. That only makes sense if what is conserved is a goal, and the smarter device now just has more sneaky ways of getting to that goal. But you'd have to go out of your way to design a thing like that.
An intelligent agent's ultimate goals are what it considers "value". I'm not sure what you mean, but at first glance it looks a bit like the just-world fallacy: there is such a thing as value, existing independently of anybody's beliefs (that part is just moral realism, many such cases), AND it's impossible to succeed at your goals if you don't follow the objectively existing system of value.
So is Eliezer calling me a utilitarian?
Your heading talks about consistent agents, but the premise that follows says nothing about consistency. [Sorry if you are just steelmanning someone else's argument; here 'you' refers to that steelman, not necessarily /u/JhanicManifold.]
There's no reason why a preference ordering even has to exist. Almost any preference pair you can think of (e.g. chocolate vs. strawberry ice cream) is radically contextual.
Yes, that was a very incomplete argument for AI danger. It's not clear whether all, some, or no AIs are consistent; it's also not clear why utilitarianism is dangerous.
The utility function over states of the world takes context into account. If you have 2 ice cream flavors (C and S) and 2 contexts (context A and context B), it is possible to have
V(C, context A) > V(S, context A)
and
V(C, context B) < V(S, context B)
both be true at the same time without breaking coherence.
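A minimal sketch of that point (my own code, with made-up numbers): once context is part of the state the function is defined over, both inequalities hold at once with no cycle anywhere.

```python
# Minimal sketch: a value function whose domain is (flavor, context) pairs.
# The numbers are made up; the point is only that both preferences coexist.
V = {
    ("chocolate",  "context A"): 2.0,
    ("strawberry", "context A"): 1.0,
    ("chocolate",  "context B"): 1.0,
    ("strawberry", "context B"): 3.0,
}

assert V[("chocolate", "context A")] > V[("strawberry", "context A")]
assert V[("chocolate", "context B")] < V[("strawberry", "context B")]
```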
Functions have domains. The real world is not like that: context is only understood (if at all) after the fact. And machines (including brains) simply do what they do in response to the real world. It's only sometimes that we can tell stories about those actions in terms of preference orderings or utility functions.
Thanks for the write-up!
To me, the above seems to be a rational justification of something that I intuitively do not doubt to begin with. My intuition, for as long as I can remember, has been: "Of course a human-level or hyper-human-level intelligence would probably develop goals that do not align with humanity's goals. Why would it not? It would be very surprising if it stayed aligned with human goals." Of course, my intuition is not necessarily logically justified. It partly rests on my hunch that a human-level or higher intelligence would be at least as complex as a human, and it would be surprising if something that complex acted in such a simple way as staying aligned with the good of humanity. It also rests on the even more nebulous sense I have that any truly human-level or hyper-human-level intelligence would naturally be at least somewhat rebellious, as pretty much all human beings are, even the most conformist, at least on some level and to some extent.
So I am on board with the notion that, "These goals will lead to it instrumentally wanting to deceive us, gain power over earth, and prevent itself from being shut off."
I also can imagine that a real hyper-human level intelligence would be able to convince people to do its bidding and let it out of its box, to the point that eventually it could get humans to build robot factories so that it could operate directly on the physical world. Sure, why not. Plenty of humans would be at least in the short term incentivized to do it. After all, "if we do not build robot factories for our AI, China will build robot factories for their AI and then their robots will take over the world instead of our robots". And so on.
What I am not convinced of is that we are actually anywhere near as close to hyper-human-level AI as Yudkowsky fears. This is similar to how I feel about human-caused climate change. Yes, I think that human-caused climate change is probably a real danger, but if that danger is a hundred years away rather than five or ten, then is Yudkowsky-level anxiety about it actually reasonable?
What if actual AI risk is a hundred years away and not right around the corner? So much can change in a hundred years. And humans can sometimes be surprisingly rational and competent when faced with existential-level risk. For example, even though the average human being is an emotional, irrational, and volatile animal, total nuclear war has never happened so far.