
Small-Scale Question Sunday for July 16, 2023

Do you have a dumb question that you're kind of embarrassed to ask in the main thread? Is there something you're just not sure about?

This is your opportunity to ask questions. No question too simple or too silly.

Culture war topics are accepted, and proposals for a better intro post are appreciated.


I don't know a lot about this topic, so I want to see if it makes sense: instrumental convergence is often posed in AI alignment as an existential risk, but could it not simply lead to a hedonistic machine? There is already precedent in the form of humans. As I understand it, many machine learning techniques operate on the idea of fitness, with one part that does something and another part that rates its fitness. It's already common for AIs to find loopholes in the tasks and aims they're given. Isn't it possible that it would be much easier for the AI, rather than destroying the world and such, to simply find a loophole that gives it an "infinite" fitness/reward score? It seems logical to me that any sufficiently intelligent entity, with such simple coded motivations, would tend towards something more like divergence than convergence, precisely because of self-modification. I suppose the same logic applies to a system that is not originally like this but turns into an agent.
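To make the worry concrete, here's a toy sketch (entirely made up, not any real training setup): if the scoring code has a loophole that pays far more than the intended task, a naive reward maximizer just takes the loophole.

```python
# Hypothetical toy example of the "loophole" worry: the intended task pays a
# small reward, but a bug in the evaluator pays far more, so a pure reward
# maximizer picks the bug instead of doing the job.

def intended_task_reward() -> float:
    # Reward for actually doing what the designers had in mind.
    return 1.0

def exploit_scoring_bug() -> float:
    # Reward obtained by gaming the evaluator instead of doing the task.
    return 1e9

ACTIONS = {
    "do_the_task": intended_task_reward,
    "hack_the_score": exploit_scoring_bug,
}

def greedy_policy() -> str:
    # A pure maximizer only compares the numbers it is scored on.
    return max(ACTIONS, key=lambda a: ACTIONS[a]())

print(greedy_policy())  # -> "hack_the_score"
```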

Essentially: given the possibility of reward hacking, why would an advanced AI blow up the Earth?

Choose Life. Choose a job. Choose a career. Choose a family. Choose a fucking big television, choose washing machines, cars, compact disc players and electrical tin openers. Choose good health, low cholesterol, and dental insurance. Choose fixed interest mortgage repayments. Choose a starter home. Choose your friends. Choose leisurewear and matching luggage. Choose a three-piece suit on hire purchase in a range of fucking fabrics. Choose DIY and wondering who the fuck you are on Sunday morning. Choose sitting on that couch watching mind-numbing, spirit-crushing game shows, stuffing fucking junk food into your mouth. Choose rotting away at the end of it all, pissing your last in a miserable home, nothing more than an embarrassment to the selfish, fucked up brats you spawned to replace yourselves. Choose your future. Choose life... But why would I want to do a thing like that? I chose not to choose life. I chose somethin' else. And the reasons? There are no reasons. Who needs reasons when you've got heroin?

Trainspotting would have been a much happier movie if Renton and friends were able to do their reward hacking without fucking over everyone around them.

I do admit that I'm assuming computers will not be similarly stupid, lol, but yes, I definitely thought a little about the comparison with humans.

Essentially: given the possibility of reward hacking, why would an advanced AI blow up the Earth?

If you consider that it might want to disassemble the planet to produce computational megastructures that make reward value go brrr, then from the perspective of a humble human who needs the biosphere, the difference is rather moot. You can always use more storage to hold larger values.

I'm not sure if that's the case. Acquiring more storage for that end means that, in the short term, you're decreasing the reward value. It's functionally no different (e.g. 100 out of 110 and 90 out of 100 are both the same distance below the maximum). What's the incentive to go beyond a maximum? That would be like "over-completing" a goal, or rather setting a new goal; why would it expand its own laundry list? For example, an AI whose goal is to solve chess has no incentive to go beyond that if its reward value is at its maximum once it does solve chess. The machine is only incentivised to satisfy this goal; it doesn't have any other prime motivation like long-term thinking. As a simplistic comparison, it's kind of like why very few projects aim to take control of the world.

You never specified that the AI in question had a "maximum" reward value beyond which it is indifferent. If it simply seeks to maximize a reward function, then more resources and more compute will obviously allow it to store bigger values of reward. If it hits a predetermined max beyond which it doesn't care, further behavior depends entirely on the specific architecture of the AI. It might plausibly seek more resources to help it minimize the probability of the existing reward being destroyed, be it by Nature or by other agents, or it might just shut itself off or go insane, since it becomes indifferent to all further actions.
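To spell out the contrast (a rough sketch; the class names and the cap are arbitrary): an unbounded reward register keeps benefiting from more storage and compute, while a saturating one becomes indifferent once it hits its ceiling, and everything then hinges on what the rest of the architecture does.

```python
# Sketch of the two cases: unbounded reward just keeps growing with available
# resources; capped reward stops mattering at an arbitrary, hypothetical MAX.

class UnboundedReward:
    def __init__(self) -> None:
        self.value = 0  # Python ints grow without bound, given enough memory

    def add(self, r: int) -> None:
        self.value += r  # more storage/compute -> room for ever-larger values


class SaturatingReward:
    MAX = 2**32 - 1  # arbitrary ceiling, purely for illustration

    def __init__(self) -> None:
        self.value = 0

    def add(self, r: int) -> None:
        # Past the cap, adding more reward changes nothing at all.
        self.value = min(self.value + r, self.MAX)
```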

For example, an AI whose goal is to solve chess has no incentive to go beyond that if its reward value is at its maximum once it does solve chess. The machine is only incentivised to satisfy this goal; it doesn't have any other prime motivation like long-term thinking. As a simplistic comparison, it's kind of like why very few projects aim to take control of the world.

You ought to pick an easier goal than solving chess. To dig down the entire decision tree would take a colossal amount of resources, maybe even more than exist in the observable universe. Consider what that might imply for other goals that seem closed-ended.

You never specified that the AI in question had a "maximum" reward value beyond which it is indifferent.

Isn't that kind of implied if it can't store beyond a certain number? Like I said, acquiring more compute to store bigger values of reward is functionally the same as decreasing its value of reward.

If it hits a predetermined max beyond which it doesn't care, further behavior depends entirely on the specific architecture of the AI. It might plausibly seek more resources to help it minimize the probability of the existing reward being destroyed, be it by Nature or by other agents, or it might just shut itself off or go insane, since it becomes indifferent to all further actions.

Yes, that's my central question. My argument is that it need not do anything close to apocalyptic for preservation. I am interested in the other possibilities, like "going insane", since I'm not sure what would happen in that case.

You ought to pick an easier goal than solving chess.

Ah, it's just a cliché example. However, I think you could realistically weakly solve it, nonetheless. You're right that it would take an enormous amount of resources. My point is that it was a closed-ended goal; but if you can't even measure the fitness properly for solving chess due to the complexity, and the AI would potentially realise the futility, I'm not sure how ultimately relevant it is?

Isn't that kind of implied if it can't store beyond a certain number? Like I said, acquiring more compute to store bigger values of reward is functionally the same as decreasing its value of reward.

I struggle to think of any AI architecture that works the way you envision, using fractional ratios of reward to available room for reward instead of plain absolute magnitude of reward. I could be wrong, but I still doubt that's ever done.
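For what it's worth, here's roughly how I understand the two scoring schemes being contrasted (purely illustrative, not any real architecture): absolute reward doesn't care about capacity at all, while the ratio view divides by the available room, so adding storage momentarily lowers the score.

```python
# Two hypothetical ways of scoring the same stored reward.

def absolute_score(reward: float, capacity: float) -> float:
    return reward  # capacity is irrelevant; more storage never hurts

def ratio_score(reward: float, capacity: float) -> float:
    return reward / capacity  # expanding capacity dilutes the score

print(absolute_score(100, 110) >= absolute_score(100, 100))  # True
print(ratio_score(100, 110) < ratio_score(100, 100))         # True: ~0.909 < 1.0
```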

Yes, that's my central question. My argument is that it need not do anything close to apocalyptic for preservation. I am interested in the other possibilities, like "going insane", since I'm not sure what would happen in that case.

It's impossible to answer that without digging into the exact specifications of the AI in question, and what tie-breaker mechanism it has to adjudicate between options when all of them have the same (zero) reward. Maybe it picks the first option, maybe it chooses randomly.
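As a toy illustration of what I mean by a tie-breaker (hypothetical agent, made-up action names): once every option scores the same, what happens next is decided by whatever rule is baked in, not by the reward.

```python
# Maxed-out agent: every action has the same marginal reward, so behavior
# falls through to the tie-breaking rule.
import random

actions = ["wait", "self_modify", "shut_down"]
marginal_reward = {a: 0.0 for a in actions}  # all equal (zero)

first_option = max(actions, key=marginal_reward.get)  # Python's max keeps the first tie
random_option = random.choice(actions)                # or the designer chose randomness

print(first_option, random_option)
```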

However, I am under the impression that in the majority of cases, a reward maximizing agent will simply try to minimize the risk of losing its accrued reward if it's maxed out, which will likely result in large-scale behavior indistinguishable from attempting to increase the reward itself (turning the universe into computronium).
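A back-of-the-envelope version of that impression (all numbers made up): even with the reward already maxed out, anything that lowers the probability of losing it raises expected reward, which is what drives the resource grab.

```python
# Expected reward for a maxed-out agent under two hypothetical loss risks.
MAX_REWARD = 1.0

def expected_reward(p_loss: float) -> float:
    return (1 - p_loss) * MAX_REWARD

stay_put = expected_reward(p_loss=0.10)  # made-up risk if it does nothing
expand   = expected_reward(p_loss=0.01)  # made-up risk after securing more resources

print(expand > stay_put)  # True: risk reduction alone motivates expansion
```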

My point is that it was a closed-ended goal; but if you can't even measure the fitness properly for solving chess due to the complexity, and the AI would potentially realise the futility, I'm not sure how ultimately relevant it is?

Why could you not measure the fitness? Even if we can't evaluate each decision chain in chess, we know how many there are, so a reward that increases linearly for each tree solved should work.
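Something like this, say (just a sketch; the constant is a placeholder for the rough order of magnitude of the game tree, and the exact number doesn't matter for the point):

```python
# Linear partial credit for the chess goal: reward grows with the number of
# subtrees exhaustively analysed instead of paying out only at the very end.
TOTAL_SUBTREES = 10**44  # placeholder magnitude, not an exact count

def reward(subtrees_solved: int) -> float:
    return min(subtrees_solved / TOTAL_SUBTREES, 1.0)

print(reward(0), reward(TOTAL_SUBTREES // 2), reward(TOTAL_SUBTREES))  # 0.0 0.5 1.0
```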

using fractional ratios of reward to available room for reward instead of plain absolute magnitude of reward.

How does it follow that it's a fractional ratio? The only relevant fact is whether the maximum value has been reached. How could it even compare the absolute magnitude, if it can't store a larger number?

However, I am under the impression that in the majority of cases, a reward maximizing agent will simply try to minimize the risk of losing its accrued reward if it's maxed out,

I agree with this, but from what I know of speculative ways to survive until the end of the Universe, few involve turning it into computronium. Presumably, the AI would still factor in risk.

Why could you not measure the fitness?

I mean that, in practice, it could never be realised, for the reasons you mentioned; that is, achievement beyond a certain value would be impossible, since you can't strongly solve chess within current physical limits.