Benignness is Bottomless

If you are not interested in AI Safety, this may bore you. If you consider your sense of mental self fragile, this may damage it. This is basically a callout post of Paul Christiano for being ‘not paranoid enough’. Warnings end.

I find ALBA and Benign Model-Free AI hopelessly optimistic. My objection has several parts, but the crux starts very early in the description:

Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution).

Specifically, I claim that no agent H yet exists, and furthermore that if you had an agent H you would already have solved most of value alignment. This is fairly bold, but I am quite confident in at least the first clause.

Obviously the H is intended to stand for Human, and smuggles in the assumption that an (educated, intelligent, careful) human is benign. I can demonstrate this to be false via thought experiment.

Experiment 1: Take a human (Sam). Make a perfect uploaded copy (Sim). Run Sim very fast for a very long time in isolation, working on some problem.

Sim will undergo value drift. Some kinds of value drift are self-reinforcing, so Sim could drift arbitrarily far within the bounds of what a human mind could in theory value. If Sim is run long enough, pseudorandom value drift will eventually hit one of these self-reinforcing patches and carry Sim an arbitrarily large distance in an arbitrary direction.
It seems obvious from this example that Sim is eventually malign.
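
The shape of this argument can be sketched as a toy random walk. This is only an illustration under loud assumptions, not a model of a mind: values are collapsed to a single number, drift is unbiased noise, and a "self-reinforcing patch" is any region beyond a hypothetical threshold `patch_start`, where an extra outward `push` applies. Every name and parameter here is invented for the sketch.

```python
import random


def simulate_drift(steps, patch_start=10.0, push=1.0, seed=0):
    """Toy 1-D value drift. All parameters are illustrative assumptions.

    Position 0.0 stands for Sam's original values. Every step adds
    unbiased noise in [-1, 1]. Whenever |position| exceeds patch_start,
    the walk is inside a self-reinforcing patch and is pushed further
    from the origin by `push`. With push >= 1 the noise can never
    carry the walk back out of the patch.
    """
    rng = random.Random(seed)
    position = 0.0
    entered_patch_at = None  # step at which the patch was first entered
    for step in range(1, steps + 1):
        position += rng.uniform(-1.0, 1.0)
        if abs(position) > patch_start:
            if entered_patch_at is None:
                entered_patch_at = step
            position += push if position > 0 else -push
    return position, entered_patch_at


if __name__ == "__main__":
    final_position, entered = simulate_drift(steps=100_000)
    print("entered patch at step:", entered)
    print("final distance from start:", abs(final_position))
```

In this toy the unbiased walk wanders until it crosses the threshold, and after that the distance from the starting values never shrinks again. The only point is that "long enough" plus "some regions are self-reinforcing" is sufficient for unbounded drift; nothing here says how large the safe region is for a real human.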

Experiment 2: Make another perfect copy of Sam (Som), and hold it “asleep”, unchanging and ready to be copied further without changes. Then repeat this process indefinitely: Make a copy of Som (Sem) and give him short written instructions, written by Sam or anyone else, and run Sem for one hour. By the end of the hour, have some set of instructions and state written in the same format. Shut off Sem at the end of the hour and take the written instructions to pass to the next instance, which will be copied off the original Som. (If there is a problem and a Sem does not create an instruction set, start from the beginning with the original instructions; deterministic loops are a potential problem but unimportant for purposes of this argument.)

Again, this can result in significant drift. Assume for a moment that this process could produce arbitrary plain text input to be read by a new Sem. Somewhere in the space of plain text inputs there could exist a tailored, utterly convincing argument that the one true good in the universe is the construction of paperclips; one which exploits human fallibility, the fallibilities of Sam in particular, biases likely to be present in Som because he is a stored copy, and biases likely to be peculiar to a short-lived Sem that knows it will be shut down within one subjective hour. This could cause significant value drift even in short timeboxes, and once it began it could be self-reinforcing just as easily as the problems with Sim.
Getting to the “golden master key” argument for any position, starting from a sane and normal starting point, is obviously quite hard. It is not impossible, though, and while the difficulty of hitting any one master key argument is high, there is a very large set of potential “locks”, each of which poses the same problem. If we run Sem loops for an arbitrarily long time, Sem will eventually fall into a lock and become malign.
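
The relay in Experiment 2 and the "eventually" claim can be sketched in a few lines. This is a hedged sketch, not a specification: `run_sem` is a hypothetical stand-in for "copy Som, hand him the notes, run him for one hour", and `prob_lock_by` assumes each session has the same small, independent chance `p` of producing a lock, which is a simplification I am adding for illustration.

```python
from typing import Callable, Optional


def run_relay(
    run_sem: Callable[[str], Optional[str]],
    initial_instructions: str,
    sessions: int,
) -> str:
    """Pass written notes through a chain of fresh one-hour Sem sessions.

    run_sem is hypothetical: it takes the incoming notes and returns the
    outgoing notes, or None if that Sem produced no instruction set.
    Each call stands for a new copy of the unchanged Som.
    """
    notes = initial_instructions
    for _ in range(sessions):
        result = run_sem(notes)
        # On failure, restart from the original instructions.
        notes = initial_instructions if result is None else result
    return notes


def prob_lock_by(p: float, sessions: int) -> float:
    """Chance that at least one of `sessions` sessions writes a lock.

    Assumes (for illustration only) a fixed, independent per-session
    probability p, giving 1 - (1 - p) ** sessions.
    """
    return 1.0 - (1.0 - p) ** sessions


if __name__ == "__main__":
    # A trivial stand-in Sem that appends one line of progress per hour.
    final_notes = run_relay(lambda notes: notes + "\nprogress", "start", 3)
    print(final_notes)
    print(prob_lock_by(1e-6, 10_000_000))
```

Under the independence assumption, any nonzero `p` drives the probability toward 1 as the number of sessions grows, which is all the argument needs. The real per-session probability is unknown and need not be constant.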

Experiment 3: Instead of just Sam, use a number of people, placed in groups that are regularly recombined from different parts of a massively parallel system of simulations. Like Sem, each group uses only plain-text I/O and is timeboxed to one hour per session. Call the Som-instance in one of these groups Sum, who works with Diffy, Prada, Facton, and so on.

Now, rather than drifting to a lock which is a value-distorting plain text input for a single Sem, we need one for the entire group: it must propagate to one member via reading and to enough of the rest via persuasion. This is clearly a harder problem, but there is also more attack surface; only one of the participants in the group, perhaps the most charismatic, needs to propagate the self-reinforcing state. The group can also drift faster, once motivated, because it has more brainpower to direct toward the drift. On balance, it seems likely to be safer for much longer, but how much longer? Exponentially? Quadratically?

What I am conveying here is that we are patching holes in the basic framework, while the downside risks play the game of Nearest Unblocked Strategy against each patch. Relying on a human is not benign; humans seem to be benign only because they are, in the environment we intuitively evaluate them in, confined to a very normal set of possible input states and stimuli. An agent which is benign only as long as it is never exposed to an edge case is malign, and examples like these convince me thoroughly that a human subjected to extreme circumstances is malign in the same sense that the universal prior is malign.

This, then, is my point: we have no examples of benign agents, we do not have a diverse enough set of environments in which to observe agents to realistically conclude that any agent is benign, and so we have nowhere that a hierarchy of benignness can bottom out. The first benign agent will be a Friendly AI (not necessarily a particularly capable one), and any approach predicated on enhancing a benign agent to higher capability to generate an FAI is in some sense affirming the consequent.