13 Comments

If I understand your proposal correctly, LT:BGROW isn't an intelligence at all, super- or not. It's a plan of action that would require a superintelligence to develop. A plan of action can indeed be fully "aligned", "friendly", or whatever, if you look at outcomes, but that doesn't imply that it's possible to construct an unfettered intelligence that always does what a given person thinks is best. That's even the case for the *same* person, upgraded to have more intelligence or information, as we arguably see all the time in humans.

author

An intelligence is merely a planner.

LT:BGROW is something like a specialized superintelligence: one built to do exactly the thing you want, and no more.

Is that not good enough for you?

author

I think these types of methods can be extended significantly to show the following:

- existence proof of aligned superintelligence that is not just logically but also physically realizable

- proof that ML interpretability/mech interp cannot possibly be logically necessary for aligned superintelligence

- proof that ML interpretability/mech interp cannot possibly be logically sufficient for aligned superintelligence

- proof that, given certain minimal emulation ability of humans by AI (e.g. it understands common-sense morality and cause and effect) and of AI by humans (humans can do multiplication, etc.), the internal details of AI models cannot possibly make a difference to the set of realizable good outcomes (see the sketch right after this list)
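
To make that last point concrete, here is a minimal sketch, entirely my own toy example rather than part of any proof: two agents with completely different internals but identical input-to-output behaviour. Since the world only ever reacts to outputs, the set of outcomes either one can realize is the same.

```python
# Toy illustration (hypothetical names and situations): internals differ,
# input -> output behaviour is identical, so the realizable outcomes are identical.

from typing import Callable, List

def transparent_agent(observation: str) -> str:
    """Decides via an explicit, human-readable lookup table."""
    table = {
        "patient is ill": "call a doctor",
        "house is on fire": "call the fire brigade",
    }
    return table.get(observation, "do nothing")

# Precomputed hashes stand in for opaque, uninterpretable internals.
_OPAQUE_TABLE = {
    hash("patient is ill"): "call a doctor",
    hash("house is on fire"): "call the fire brigade",
}

def opaque_agent(observation: str) -> str:
    """Same mapping, computed in a way no inspector would call readable."""
    return _OPAQUE_TABLE.get(hash(observation), "do nothing")

def realized_outcomes(agent: Callable[[str], str], situations: List[str]) -> List[str]:
    """The environment only ever sees the agent's outputs, never its internals."""
    return [agent(s) for s in situations]

situations = ["patient is ill", "house is on fire", "quiet afternoon"]
assert realized_outcomes(transparent_agent, situations) == realized_outcomes(opaque_agent, situations)
```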

author

- proof that, given near-perfect or perfect technical alignment (i.e. the AI will do what its creators ask of it, interpreted with the correct intent), awful outcomes are a Nash equilibrium for rational agents (see the toy game after this list)

- proof that small or even large alignment deviations make no fundamental difference to outcomes: the boundary between good and bad is determined by game theory and initial conditions, not by alignment fidelity.
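
Here is a toy game to illustrate the Nash-equilibrium point; the payoff numbers are invented for illustration and not taken from any actual model. Two principals each command a perfectly obedient AI and each chooses between deploying cautiously and racing; the only equilibrium is mutual racing, the worst joint outcome.

```python
# Toy payoff matrix (illustrative numbers): two principals, each with a perfectly
# aligned AI, choose to "cooperate" (deploy cautiously) or "race".
from itertools import product

ACTIONS = ("cooperate", "race")

# payoffs[(a, b)] = (utility to player 1, utility to player 2)
payoffs = {
    ("cooperate", "cooperate"): (3, 3),  # mutual restraint: good for both
    ("cooperate", "race"):      (0, 4),  # the racer grabs a decisive advantage
    ("race",      "cooperate"): (4, 0),
    ("race",      "race"):      (1, 1),  # mutual racing: awful, but stable
}

def is_nash(a: str, b: str) -> bool:
    """Neither player can gain by unilaterally switching actions."""
    u1, u2 = payoffs[(a, b)]
    return (u1 >= max(payoffs[(alt, b)][0] for alt in ACTIONS)
            and u2 >= max(payoffs[(a, alt)][1] for alt in ACTIONS))

print([p for p in product(ACTIONS, repeat=2) if is_nash(*p)])
# -> [('race', 'race')]  the worst joint outcome is the only Nash equilibrium
```

Nothing in the game depends on either AI being misaligned; the bad outcome comes from the incentive structure alone.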

author

- it is possible to define which states or goals the AI should achieve in a way that mostly doesn't end up in a contradiction between all of humanity and that achieves maximal satisfaction in the sense of being Pareto optimal (ELYSIUM); the Pareto criterion is illustrated below
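
To pin down the Pareto-optimality criterion, here is a small sketch; the candidate world-states and utility numbers are made up for illustration, not drawn from ELYSIUM itself. A state is Pareto optimal if no alternative makes everyone at least as well off and someone strictly better off.

```python
# Toy illustration of Pareto optimality (states and utilities are invented).
candidate_states = {
    "separate utopias for everyone":      {"alice": 9, "bob": 9, "carol": 8},
    "alice's preferences imposed on all": {"alice": 10, "bob": 2, "carol": 2},
    "status quo":                         {"alice": 5, "bob": 5, "carol": 5},
    "uniform compromise world":           {"alice": 6, "bob": 6, "carol": 9},
}

def dominates(u: dict, v: dict) -> bool:
    """u Pareto-dominates v: everyone at least as well off, someone strictly better."""
    return all(u[p] >= v[p] for p in v) and any(u[p] > v[p] for p in v)

pareto_optimal = [
    name for name, u in candidate_states.items()
    if not any(dominates(other, u)
               for other_name, other in candidate_states.items() if other_name != name)
]
print(pareto_optimal)
# Everything except "status quo" survives: Pareto optimality rules out clearly
# wasteful states but does not, by itself, adjudicate between the rest.
```

Note that the filter alone does not pick a unique winner, which is why the "mostly doesn't end up in a contradiction" hedge still matters.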


I personally haven't seen much serious discussion about whether aligning an AI to "something" is impossible. The general argument against alignment is that it is impossible to specify what the AI should align to in a way that doesn't end up in a contradiction between all of humanity.

Sidestepping this issue in your essay means leaving unanswered the "hard problem" of alignment, namely "aligned to what?", which cannot be answered without the AI becoming unaligned with other significant groups of people.

author

> to specify what the AI should align to in a way that doesn't end up in a contradiction between all of humanity.

There are proposals to solve this, but they would take us away from the topic of AI Alignment and into moral philosophy. I will discuss them in a forthcoming post.


I disagree. That is the core problem of alignment that is impossible to solve. Aligning AI behaviour towards a goal is the easy problem of AI alignment (as you showed above); aligning AI behaviour towards humanity is the nigh-impossible hard problem of alignment. You need to solve both before you can say you solved alignment.

author

> That is the core problem of alignment that is impossible to solve.

It is not impossible; perhaps I need to dedicate a whole post to that.


I'd be keen to read it.

author

Coming soon!


Fortunately, control is an alternative to alignment.


Absolutely... until AI becomes more potent than whoever is controlling it.
