13 Comments

If I understand your proposal correctly, LT:BGROW isn't an intelligence at all, super- or not. It's a plan of action that would require a superintelligence to develop. A plan of action can indeed be fully "aligned", "friendly", or whatever, if you look at outcomes, but that doesn't imply that it's possible to construct an unfettered intelligence that always does what a given person thinks is best. That's even the case for the *same* person, upgraded to have more intelligence or information, as we arguably see all the time in humans.

author

An intelligence is merely a planner.

LT:BGROW is something like a specialized superintelligence: one built to do exactly the thing you want, and no more.

Is that not good enough for you?

author

I think these types of methods can be extended significantly to show the following:

- existence proof of aligned superintelligence that is not just logically but also physically realizable

- proof that ML interpretability/mech interp cannot possibly be logically necessary for aligned superintelligence

- proof that ML interpretability/mech interp cannot possibly be logically sufficient for aligned superintelligence

- proof that, given certain minimal emulation ability of humans by AI (e.g. it understands common-sense morality and cause and effect) and of AI by humans (humans can do multiplication, etc.), the internal details of AI models cannot possibly make a difference to the set of realizable good outcomes (see the sketch right after this list)
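
To make that last point concrete, here is a minimal sketch, entirely my own toy example rather than part of any proof: two agents with completely different internals but identical input-to-output behaviour. Since the world only ever reacts to outputs, the set of outcomes either one can realize is the same.

```python
# Toy illustration (hypothetical names and situations): internals differ,
# input -> output behaviour is identical, so the realizable outcomes are identical.

from typing import Callable, List

def transparent_agent(observation: str) -> str:
    """Decides via an explicit, human-readable lookup table."""
    table = {
        "patient is ill": "call a doctor",
        "house is on fire": "call the fire brigade",
    }
    return table.get(observation, "do nothing")

# Precomputed hashes stand in for opaque, uninterpretable internals.
_OPAQUE_TABLE = {
    hash("patient is ill"): "call a doctor",
    hash("house is on fire"): "call the fire brigade",
}

def opaque_agent(observation: str) -> str:
    """Same mapping, computed in a way no inspector would call readable."""
    return _OPAQUE_TABLE.get(hash(observation), "do nothing")

def realized_outcomes(agent: Callable[[str], str], situations: List[str]) -> List[str]:
    """The environment only ever sees the agent's outputs, never its internals."""
    return [agent(s) for s in situations]

situations = ["patient is ill", "house is on fire", "quiet afternoon"]
assert realized_outcomes(transparent_agent, situations) == realized_outcomes(opaque_agent, situations)
```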

author

- proof that, given near-perfect or perfect technical alignment (i.e. the AI will do what its creators ask of it, interpreted with the correct intent), awful outcomes are a Nash equilibrium for rational agents (see the toy game after this list)

- proof that small or even large alignment deviations make no fundamental difference to outcomes: the boundary between good and bad is determined by game theory and initial conditions, not by alignment fidelity.
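
Here is a toy game to illustrate the Nash-equilibrium point; the payoff numbers are invented for illustration and not taken from any actual model. Two principals each command a perfectly obedient AI and each chooses between deploying cautiously and racing; the only equilibrium is mutual racing, the worst joint outcome.

```python
# Toy payoff matrix (illustrative numbers): two principals, each with a perfectly
# aligned AI, choose to "cooperate" (deploy cautiously) or "race".
from itertools import product

ACTIONS = ("cooperate", "race")

# payoffs[(a, b)] = (utility to player 1, utility to player 2)
payoffs = {
    ("cooperate", "cooperate"): (3, 3),  # mutual restraint: good for both
    ("cooperate", "race"):      (0, 4),  # the racer grabs a decisive advantage
    ("race",      "cooperate"): (4, 0),
    ("race",      "race"):      (1, 1),  # mutual racing: awful, but stable
}

def is_nash(a: str, b: str) -> bool:
    """Neither player can gain by unilaterally switching actions."""
    u1, u2 = payoffs[(a, b)]
    return (u1 >= max(payoffs[(alt, b)][0] for alt in ACTIONS)
            and u2 >= max(payoffs[(a, alt)][1] for alt in ACTIONS))

print([p for p in product(ACTIONS, repeat=2) if is_nash(*p)])
# -> [('race', 'race')]  the worst joint outcome is the only Nash equilibrium
```

Nothing in the game depends on either AI being misaligned; the bad outcome comes from the incentive structure alone.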

author

- it is possible to define which states or goals the AI should achieve in a way that mostly doesn't end up in a contradiction between all of humanity and that achieves maximal satisfaction in the sense of being Pareto optimal (ELYSIUM); the Pareto criterion is illustrated below
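
To pin down the Pareto-optimality criterion, here is a small sketch; the candidate world-states and utility numbers are made up for illustration, not drawn from ELYSIUM itself. A state is Pareto optimal if no alternative makes everyone at least as well off and someone strictly better off.

```python
# Toy illustration of Pareto optimality (states and utilities are invented).
candidate_states = {
    "separate utopias for everyone":      {"alice": 9, "bob": 9, "carol": 8},
    "alice's preferences imposed on all": {"alice": 10, "bob": 2, "carol": 2},
    "status quo":                         {"alice": 5, "bob": 5, "carol": 5},
    "uniform compromise world":           {"alice": 6, "bob": 6, "carol": 9},
}

def dominates(u: dict, v: dict) -> bool:
    """u Pareto-dominates v: everyone at least as well off, someone strictly better."""
    return all(u[p] >= v[p] for p in v) and any(u[p] > v[p] for p in v)

pareto_optimal = [
    name for name, u in candidate_states.items()
    if not any(dominates(other, u)
               for other_name, other in candidate_states.items() if other_name != name)
]
print(pareto_optimal)
# Everything except "status quo" survives: Pareto optimality rules out clearly
# wasteful states but does not, by itself, adjudicate between the rest.
```

Note that the filter alone does not pick a unique winner, which is why the "mostly doesn't end up in a contradiction" hedge still matters.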


I personally haven't seen much serious discussion about whether aligning an AI to "something" is impossible. The general argument against alignment is that it is impossible to specify what the AI should align to in a way that doesn't end up in a contradiction between all of humanity.

Sidestepping this issue in your essay means leaving unanswered the "hard problem" of alignment, namely "aligned to what?", which cannot be answered without the AI becoming unaligned with other significant groups of people.

author

> to specify what the AI should align to in a way that doesn't end up in a contradiction between all of humanity.

There are proposals to solve this, but they would take us away from the topic of AI Alignment and into moral philosophy. I will discuss them in a forthcoming post.


I disagree. That is the core problem of alignment that is impossible to solve. Aligning AI behaviour towards a goal is the easy problem of AI alignment (as you showed above); aligning AI behaviour towards humanity is the nigh-impossible hard problem of alignment. You need to solve both before you can say you solved alignment.

author

> That is the core problem of alignment that is impossible to solve.

It is not impossible; perhaps I need to dedicate a whole post to that.


I'd be keen to read it.

author

Coming soon!


Fortunately, control is an alternative to alignment.


Absolutely... until AI becomes more potent than whoever is controlling it.
