Benignness is Bottomless

If you are not interested in AI Safety, this may bore you. If you consider your sense of mental self fragile, this may damage it. This is basically a callout post of Paul Christiano for being ‘not paranoid enough’. Warnings end.

I find ALBA and Benign Model-Free AI hopelessly optimistic. My objection has several parts, but the crux starts very early in the description:

Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution).

Specifically, I claim that no agent H yet exists, and furthermore that if you had such an agent H you would already have solved most of value alignment. This is fairly bold, but I am quite confident in at least the first clause.

Obviously the H is intended to stand for Human, which smuggles in the assumption that an (educated, intelligent, careful) human is benign. I can demonstrate this to be false via thought experiment.

Experiment 1: Take a human (Sam). Make a perfect uploaded copy (Sim). Run Sim very fast for a very long time in isolation, working on some problem.

Sim will undergo value drift. Some kinds of value drift are self-reinforcing, so Sim could drift arbitrarily far within the bounds of what a human mind could in theory value. Given that Sim is run long enough, pseudorandom value drift will eventually hit one of these patches and drift an arbitrarily large distance in an arbitrary direction.
It seems obvious from this example that Sim is eventually malign.

Experiment 2: Make another perfect copy of Sam (Som), and hold it “asleep”, unchanging and ready to be copied further without changes. Then repeat this process indefinitely: make a copy of Som (Sem), give him short written instructions (written by Sam or anyone else), and run Sem for one hour. By the end of the hour, Sem must write out a new set of instructions and state in the same format. Shut off Sem at the end of the hour and pass the written instructions to the next instance, which will again be copied off the original Som. (If there is a problem and a Sem does not produce an instruction set, start over from the original instructions; deterministic loops are a potential problem but unimportant for purposes of this argument.)
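For concreteness, here is a minimal sketch of that loop. All names are hypothetical, and run_for_one_hour stands in for executing an upload for one subjective hour and returning whatever instructions it writes out:

```python
# Minimal sketch of the Sem loop, assuming a frozen snapshot of Som that can be
# copied, and a run_for_one_hour() stand-in for executing an upload for one
# subjective hour. All names here are illustrative, not an actual protocol spec.

def sem_loop(som_snapshot, initial_instructions, run_for_one_hour, iterations):
    instructions = initial_instructions
    for _ in range(iterations):
        sem = som_snapshot.copy()                 # fresh, unchanged copy of Som
        output = run_for_one_hour(sem, instructions)
        if output is None:                        # Sem produced no instructions:
            instructions = initial_instructions   # restart from the original input
        else:
            instructions = output                 # hand state to the next instance
    return instructions
```

The only channel through which drift can accumulate is the plain-text instructions passed between instances, which is exactly the attack surface the “lock” argument below targets.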

Again, this can result in significant drift. Assume for a moment that this process could produce arbitrary plain-text input to be read by a new Sem. Somewhere in the space of plain-text inputs there could exist a tailored, utterly convincing argument that the one true good in the universe is the construction of paperclips; one which exploits human fallibility, the fallibilities of Sam in particular, biases likely to be present in Som because he is a stored copy, and biases likely to be peculiar to a short-lived Sem that knows it will be shut down within one subjective hour. This could cause significant value drift even in short timeboxes, and once it began it could be self-reinforcing just as easily as the problems with Sim.
Getting to the “golden master key” argument for any position, starting from a sane and normal starting point, is obviously quite hard. Not impossible, though, and while the difficulty of hitting any one master-key argument is high, there is a very large set of potential “locks”, any of which poses the same problem. If we run Sem loops for an arbitrary amount of time, Sem will eventually fall into a lock and become malign.

Experiment 3: Instead of just Sam, use a number of people, put into groups that recombine regularly across different parts of a massively parallel system of simulations. Like the Sem loop, the system uses entirely plain-text I/O and is timeboxed to one hour per session. Call the Som-instance in one of these groups Sum, who works with Diffy, Prada, Facton, and so on.

Now, rather than a lock that is a value-distorting plain-text input for a single Sem, we need one for the entire group: it must be able to take hold of one member via reading and of enough of the rest via persuasion. This is clearly a harder problem, but there is also more attack surface; only one of the participants in the group, perhaps the most charismatic, needs to propagate the self-reinforcing state. Drift can also proceed faster once it starts, with more brainpower available to direct toward it. On balance, the group seems likely to stay safe for much longer, but how much longer? Exponentially? Quadratically?

What I am conveying here is that we are patching holes in the basic framework, and the downside risks are playing the game of Nearest Unblocked Strategy. Relying on a human is not benign; humans seem to be benign only because they are, in the environment we intuitively evaluate them in, confined to a very normal set of possible input states and stimuli. An agent which is benign only as long as it is never exposed to an edge case is malign, and examples like these convince me thoroughly that a human subjected to extreme circumstances is malign in the same sense that the universal prior is malign.

This, then, is my point: we have no examples of benign agents, we do not have enough diversity of environments in which to observe agents to realistically conclude that any agent is benign, and so we have nowhere for a hierarchy of benignness to bottom out. The first benign agent will be a Friendly AI (not necessarily a particularly capable one), and any approach predicated on enhancing a benign agent to higher capability in order to generate an FAI is in some sense affirming the consequent.

Daemon Speedup

A short thought on the applicability of Jessica Taylor’s reasoning in Are daemons a problem for ideal agents?, looking at the gap between the realistic reasoning for why this intuitively seems like a problem and the formalization in which it isn’t.

Consider the following hypothetical:

Agent A wants to design a rocket to go to Neptune. It can either think about rockets at the object level, or simulate some alien civilization (which may be treated as an agent B) and then ask B how to design a rocket. Under some circumstances (e.g. designing a successful rocket is a convergent instrumental goal for someone in A’s position), B will be incentivized to give A the design of a rocket that actually goes to Neptune. Of course, the rocket design might be a “treacherous” one that subtly pursues B’s values more than A’s original values (e.g. because the design of the rocket includes robots in the rocket programmed to pursue B’s values).

It’s plausible that A could think that B is better at designing rockets than A is, such that asking B to design a rocket is more likely to yield a successful rocket than A just thinking about rockets at the object level. (Something similar to this seems to be going on with humans in the place of A: if humans had access to sufficiently powerful computers, then the easiest way to pursue convergent instrumental goals such as solving math problems might be through unaligned AGI.) But there’s something weird about the epistemic state A is in: why is A able to design more-successful rockets by thinking about B thinking about rockets, rather than just by thinking about rockets at the object level?

In the realistic situation where we might put ourselves in A‘s position, we expect that B has access to far more computing resources per unit time than we do. For an ideal but bounded agent with access to an ideal-reasoning but unaligned subagent oracle, this could still be a plausible situation, unless the questions being asked have a solution-verification algorithm that is much faster than the solution-finding algorithm.

For the simulation idea, though, it’s harder to see how this might work. An ideal reasoner has self-trust, and would trust itself to come back from a short-term 50x speedup with better knowledge and the same utility function. But say we know of an algorithm which uses the same resources as our own mind and operates significantly faster, but in an opaque way. (This is a problem we currently have with machine learning, and it is on CHAI’s priorities list.) This algorithm does not appear to have an explicit term for treachery in its utility function, but because it is opaque we cannot rule out its placing an implicit or emergent value on treachery. Thus, a conservative agent A may model B as explicitly hostile.

I’d like to put that in terms of the formalism introduced in the linked post, which I’ll reproduce here:

If r is a rocket design (as a string), let s(r) ∈ [0,1] indicate how successful the rocket r is at getting to Neptune. Additionally, let t(r) ∈ [0,1] indicate how treacherous the rocket r is. A’s utility will be s(r) − 0.1·t(r) (i.e. A cares more about getting a successful rocket at all than about getting a non-treacherous rocket). B’s utility will be s(r) + 0.1·t(r).

Let Y and Z be different strings specified as logical expressions. A will observe Y, while B will observe Z.

B will compute some string M intended as a message to A. M is a logical fact that A observes by simulating B. After receiving this message, A will choose some rocket design r.
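Read as code, the payoff structure is just the following (a sketch only; s and t are assumed to be given scoring functions over rocket-design strings):

```python
# The utilities from the formalism. s(r) and t(r) are assumed to be given
# functions mapping a rocket-design string r into [0, 1].

def utility_A(r, s, t):
    # A prefers non-treacherous rockets, but cares more about success at all.
    return s(r) - 0.1 * t(r)

def utility_B(r, s, t):
    # B actively prefers treacherous rockets, all else being equal.
    return s(r) + 0.1 * t(r)
```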

In my elaboration, A knows of two algorithms: a(n), which finds an optimal solution for s(r) − 0.1·t(r) in O(2^n) time (using a subalgorithm y(n) to access logical facts Y = y(n)), and b(n), which is known to approximate an optimal solution for s(r) in O(n^2) time. The approximation is not well understood and its internals are opaque, so its behavior on other factors such as t(r) is unknown, except insofar as they would limit its good performance on s(r).

The algorithm b(n) can be treated as having two parts: z(n) accesses relevant logical facts from its inputs and surroundings, and m(z) approximates a solution for s(r) in O(n^2) time. In this case, A has access to M = m(z) and Z = z(n), but cannot use Z to solve for its true utility s(r) − 0.1·t(r) while retaining the speedup of b(n). This would be a case where a conservative agent would want to model the process b(n) as the agent B, in a way that does not violate the orthogonality thesis.
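To make the tradeoff concrete, here is a toy sketch of the situation A faces. The cost model, the worst-case treatment of the unknown treachery term, and the zero-value fallback are illustrative assumptions, not part of the original formalism:

```python
# Illustrative sketch only: A must choose between a slow, transparent optimizer
# a(n) and a fast, opaque approximator b(n) whose treachery behavior is unknown.

def choose_design(n, run_a, run_b, s, time_budget):
    slow_cost = 2 ** n        # a(n): exactly optimizes s(r) - 0.1*t(r)
    fast_cost = n ** 2        # b(n): opaquely approximates s(r) only

    if slow_cost <= time_budget:
        return run_a(n)       # transparent route: optimizes A's true utility

    # Otherwise A must rely on b(n). A conservative A models b(n) as an agent B
    # and assumes the worst about the unobserved treachery term.
    if fast_cost > time_budget:
        return None           # neither algorithm fits the budget
    r = run_b(n)
    pessimistic_value = s(r) - 0.1 * 1.0        # assume maximal treachery, t(r) = 1
    return r if pessimistic_value > 0.0 else None   # 0.0: value of building no rocket
```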

This is largely a restatement of the prior points which Jessica referenced in her post (from Paul Christiano), but it does suggest that this may be an unavoidable problem for any agent that accepts a significant slowdown in its reasoning in order to ensure value alignment. In the case of Garrabrant logical inductors, the inductor is much slower than any of the individual agents B, so this extension does not provide any additional reason to think it impossible to create a variant inductor that accounts for their internals, though doing so might increase the difficulty of creating an efficient inductor built on the same principles.