A principle of AI alignment that does not seem reducible to other principles is “The AGI design should be widely separated in the design space from any design that would constitute a hyperexistential risk”. A hyperexistential risk is a “fate worse than death”: any outcome worse than an AGI quickly killing everyone and filling the universe with paperclips.
I agree that this is a desirable quality for any design, or any approach to creating a design, to have. However, I think it is impossible to maintain that separation while preserving the possibility of an ‘existential win’, i.e. an outcome roughly as good as a hyperexistential catastrophe is bad. In order to create the possibility of a Very Good Outcome, your AGI must understand what humans value in some detail. The author of this page* provides specifics, which they think will move us further away from Very Bad Outcomes, but I don’t agree.
> This consideration weighing against general value learning of true human values might not apply to e.g. a Task AGI that was learning inductively from human-labeled examples, if the labeling humans were not trying to identify or distinguish within “dead or worse” and just assigned all such cases the same “bad” label. There are still subtleties to worry about in a case like that[…] But even on the first step of “use the same label for death and worse-than-death as events to be avoided, likewise all varieties of bad fates better than death as a type of consequence to notice and describe to human operators”, it seems like we would have moved substantially further away in the design space from hyperexistential catastrophe.
I find it hard to picture a method of learning what humans value that does not produce information about what they disvalue in equal supply, and this is no exception. Value is for the most part a relative measure rather than an absolute: to determine whether I value eating a cheeseburger, it is necessary to compare the state of eating-a-cheeseburger to the state of not-eating-a-cheeseburger; to assess whether I value not-being-in-pain, you must compare it to being-in-pain; to determine whether I value existence, you must compare it to nonexistence. To the extent that we decline to label the distinction between death and fates worse than death, the learner is failing to understand what we value. And an intelligent sign-flipped learner, if we gave it many fine-grained labels for “things we prefer to death by X much”, would at minimum have the data needed to cause a (weakly-hyper)-existential catastrophe: a world in which we did not die, but never had any of the things we rated as better than death. Unless we have some means of preventing the learner from making such inferences or storing the information (so, call the SCP Foundation Antimemetics Division?), this suggestion would not help except against a very stupid agent.
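The worry can be made concrete with a toy sketch (my own construction, not from the original post; the outcome names and numeric labels are invented for illustration). With fine-grained “preference relative to death” labels, an optimizer whose utility has a flipped sign singles out exactly the fate we rated worse than death; with the coarse one-bucket labeling the quoted passage describes, the flipped optimizer can no longer distinguish worse-than-death from death.

```python
# Toy illustration of the sign-flip argument. All outcomes and numbers
# below are hypothetical labels, invented for this sketch.

# Fine-grained labels: preference relative to death
# (0 = death; positive = better than death; negative = worse than death).
fine_labels = {
    "flourishing": 10.0,
    "mediocre survival": 2.0,
    "death": 0.0,
    "prolonged suffering": -5.0,
}

def optimize(outcomes, utility):
    """Return the outcome that maximizes the given utility function."""
    return max(outcomes, key=utility)

# A correctly-signed optimizer picks the outcome we labeled best.
best = optimize(fine_labels, lambda o: fine_labels[o])

# A sign-flipped optimizer, given the very same data, picks the fate
# we labeled worse than death.
worst = optimize(fine_labels, lambda o: -fine_labels[o])

# Coarse labeling: death and worse-than-death share one "bad" bucket
# (-1.0), everything better than death shares one "good" bucket (1.0).
coarse_labels = {o: (1.0 if v > 0 else -1.0) for o, v in fine_labels.items()}

# Under coarse labels, the flipped optimizer's minimum is a tie: it has
# no data distinguishing death from the worse-than-death fate.
lowest = min(coarse_labels.values())
flipped_tied = {o for o, v in coarse_labels.items() if v == lowest}

print(best, worst, flipped_tied)
```

The sketch shows both halves of the disagreement: coarse labels do deny a sign-flipped agent the information to *prefer* worse-than-death over death, but only because, as argued above, they also deny a correctly-signed agent the information to distinguish them.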
Of course, maybe that’s the point. It seems obvious to me that a very stupid agent does not pose a hyperexistential risk, because it cannot build a model detailed enough to do more than existential harm; but “obvious” is a word to mistrust. Could I make the leap from coarse labels and infer the reversal? I believe I could. Could one of the senders of That Alien Message, who are unusually stupid by human standards but have all the knowledge of their ancestors from birth? I’m fairly confident they could, but not certain. Could one of them cause us hyperexistential harm? Yes; of that I am certain. That adds up to a fairly small, but nonempty, region of probability space where this suggestion would be useful.
But does that add up to the approach being worthwhile?
* Presumably this is Eliezer Yudkowsky, since I don’t believe anyone else wrote anything on Arbital after its “official shutdown”, which was well before this page was created. But I’m not certain.