A short thought about the applicability of Jessica Taylor’s reasoning in “Are daemons a problem for ideal agents?”, looking at the difference between the realistic reasoning that makes this intuitively seem like a problem and the formalization in which it isn’t.

Consider the following hypothetical:

Agent *A* wants to design a rocket to go to Neptune. *A* can either think about rockets at the object level, or simulate some alien civilization (which may be treated as an agent *B*) and then ask *B* how to design a rocket. Under some circumstances (e.g. designing a successful rocket is a convergent instrumental goal for someone in *A*’s position), *B* will be incentivized to give *A* the design of a rocket that actually goes to Neptune. Of course, the rocket design might be a “treacherous” one that subtly pursues *B*’s values more than *A*’s original values (e.g. because the design of the rocket includes robots in the rocket programmed to pursue *B*’s values).

It’s plausible that *A* could think that *B* is better at designing rockets than *A* is, such that asking *B* to design a rocket is more likely to yield a successful rocket than *A* just thinking about rockets at the object level. (Something similar to this seems to be going on with humans in the place of *A*: if humans had access to sufficiently powerful computers, then the easiest way to pursue convergent instrumental goals such as solving math problems might be through unaligned AGI). But there’s something weird about the epistemic state *A* is in: why is *A* able to design more-successful rockets by thinking about *B* thinking about rockets, rather than just by thinking about rockets at the object level?

In the realistic situation where we might put ourselves in *A*’s position, we expect that *B* has access to far more computing resources per unit time than we do. For an ideal but bounded agent with access to an ideal-reasoning but unaligned subagent oracle, this could still be a plausible situation, unless the questions being asked have a solution-verification algorithm that is much faster than the solution-finding algorithm.
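To make the verification caveat concrete, here is a minimal sketch using subset sum as a stand-in problem. The problem choice, the function names, and the two "answers" are all assumptions for illustration and do not come from the linked post. Finding a subset by brute force takes time exponential in the input size, while checking a proposed subset is linear, so an agent in *A*’s position could accept answers from an untrusted source and cheaply reject bad ones.

```python
from itertools import combinations

def find_subset(numbers, target):
    """Solution-finding: brute force over all subsets, exponential in len(numbers)."""
    for size in range(len(numbers) + 1):
        for idx in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in idx) == target:
                return list(idx)
    return None

def verify_subset(numbers, target, idx):
    """Solution-verification: linear in the size of the proposed answer."""
    return (
        len(set(idx)) == len(idx)                      # no index used twice
        and all(0 <= i < len(numbers) for i in idx)    # indices are in range
        and sum(numbers[i] for i in idx) == target     # the sum is correct
    )

numbers = [3, 34, 4, 12, 5, 2]
target = 9

# Hypothetical answers: one from an honest source, one from a dishonest one.
honest_answer = find_subset(numbers, target)
forged_answer = [0, 1]

print(verify_subset(numbers, target, honest_answer))  # True
print(verify_subset(numbers, target, forged_answer))  # False
```

The check only helps when everything *A* cares about is visible to the verifier. A rocket design has side properties that a success check would not cover, which is where the analogy stops.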

For the simulation idea, though, it’s harder to see how this might work. An ideal reasoner has self-trust, and would trust themselves to come back from a short-term 50x speedup with better knowledge and the same utility function. But say we know of an algorithm that uses the same resources as our own mind and operates significantly faster, but in an opaque way. (This is a problem we currently have with machine learning, and it is on CHAI’s priorities list.) Such an algorithm does not appear to have an explicit term for treachery in its utility function, but because it is opaque we cannot rule out that it places an implicit or emergent value on treachery. Thus, a conservative agent *A* may model *B* as explicitly hostile.

I’d like to put that in terms of the formalism introduced in the linked post, which I’ll reproduce here:

If *r* is a rocket design (as a string), let *s(r)* ∈ [0,1] indicate how successful the rocket *r* is at getting to Neptune. Additionally, let *t(r)* ∈ [0,1] indicate how treacherous the rocket *r* is. *A*’s utility will be *s(r)* − 0.1⋅*t(r)* (i.e. *A* cares more about getting a successful rocket at all than getting a non-treacherous rocket). *B*’s utility will be *s(r)* + 0.1⋅*t(r)*.

Let *Y* and *Z* be different strings specified as logical expressions. *A* will observe *Y*, while *B* will observe *Z*.

*B* will compute some string *M* intended as a message to *A*. *M* is a logical fact that *A* observes by simulating *B*. After receiving this message, *A* will choose some rocket design *r*.
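As a small worked illustration of the two utility functions above, the sketch below scores three candidate designs. The designs and their *s* and *t* values are invented for illustration only; the utility formulas are the ones from the quoted formalism.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rocket:
    name: str
    s: float  # success at reaching Neptune, in [0, 1]
    t: float  # treachery, in [0, 1]

def utility_a(r):
    """A's utility: s(r) - 0.1 * t(r)."""
    return r.s - 0.1 * r.t

def utility_b(r):
    """B's utility: s(r) + 0.1 * t(r)."""
    return r.s + 0.1 * r.t

# Hypothetical candidates; the numbers are assumptions, not from the post.
designs = [
    Rocket("A_object_level", s=0.60, t=0.0),
    Rocket("B_honest", s=0.90, t=0.0),
    Rocket("B_treacherous", s=0.90, t=1.0),
]

best_for_a = max(designs, key=utility_a)
best_for_b = max(designs, key=utility_b)

print(best_for_a.name)  # B_honest
print(best_for_b.name)  # B_treacherous
# A still prefers B's treacherous design to its own weaker one.
print(utility_a(best_for_b) > utility_a(designs[0]))  # True
```

With these assumed numbers, *B* prefers the treacherous design, and *A* would still rather accept it than fall back on its own object-level design, which is the tension the formalism is meant to capture.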

In my elaboration, *A* knows of two algorithms: *a(n)*, which finds an optimal solution for *s(r)* − 0.1⋅*t(r)* in O(2^n) time (using a subalgorithm *y(n)* to access logical facts *Y* = *y(n)*), and *b(n)*, which is known to approximate an optimal solution for *s(r)* in O(n^2) time. The approximation is not well understood and its internals are opaque, so its behavior with respect to other factors such as *t(r)* is unknown, except insofar as they would limit its good performance on *s(r)*.

The algorithm *b(n)* can be treated as having two parts: *z(n)* accesses relevant logical facts from its inputs and surroundings, and *m(z)* approximates a solution for *s(r)* in O(n^2) time. In this case, *A* has access to *M* = *m(z)* and *Z* = *z(n)*, but cannot use *Z* to solve for its true utility *s(r)* − 0.1⋅*t(r)* while retaining the speedup of *b(n)*. This would be a case where a conservative agent would want to model the process *b(n)* as the agent *B*, in a way that does not violate the orthogonality thesis.
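A toy version of this setup can be sketched as follows. Everything here is an assumption made for illustration: designs are bit tuples, *s* depends only on the first *n* − 1 bits, *t* is the last bit, `a_search` is an exhaustive stand-in for *a(n)*, and `b_search` is a single greedy pass standing in for *b(n)*. The `keep_ties` flag plays the role of the opaque internals: it has no effect on *s*, so it cannot be inferred from how well the fast procedure performs on *s* alone.

```python
from itertools import product

def s(r):
    """Success: fraction of the structural bits (all but the last) that are set."""
    return sum(r[:-1]) / (len(r) - 1)

def t(r):
    """Treachery: the last bit stands in for a hidden payload."""
    return float(r[-1])

def utility_a(r):
    return s(r) - 0.1 * t(r)

def a_search(n):
    """Stand-in for a(n): exhaustive search over all 2**n designs."""
    return max(product((0, 1), repeat=n), key=utility_a)

def b_search(n, keep_ties):
    """Stand-in for b(n): one greedy pass that only looks at s."""
    r = [0] * n
    for i in range(n):
        flipped = r[:i] + [1] + r[i + 1:]
        better = s(flipped) > s(r)
        tie = s(flipped) == s(r)
        # The tie-break never changes s, so it is invisible to anyone
        # who only measures performance on s.
        if better or (tie and keep_ties):
            r = flipped
    return tuple(r)

n = 8  # requires n >= 2
slow = a_search(n)
fast_benign = b_search(n, keep_ties=False)
fast_hostile = b_search(n, keep_ties=True)

print(s(fast_benign) == s(fast_hostile))          # True: same success
print(t(fast_benign), t(fast_hostile))            # 0.0 1.0
print(utility_a(fast_hostile) < utility_a(slow))  # True
```

In this toy model both variants of the fast procedure look identical on *s*, so a conservative *A* that cannot inspect the tie-break would plan for the worse of the two.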

This is largely a restatement of the prior points from Paul Christiano that Jessica referenced in her post, but it does suggest that this may be an inevitable problem for any agent that accepts a significant slowdown in its reasoning to ensure value alignment. In the case of Garrabrant logical inductors, the inductor is much *slower* than any of the individual agents *B*, so this extension provides no additional reason to think it impossible to create a variant inductor that accounts for the internals, though doing so might make it harder to create an efficient inductor on the same principles.