Durable execution of protocols

This literally came to me in a dream, so I wanted to write it down here before it escaped my brain (yes, I sometimes dream in code).

We have our classic plr demo (adapted so it is using an API, because this is what enables this kind of thing):

from pylabrobot.liquid_handling import LiquidHandler
from pylabrobot.liquid_handling.backends import STARBackendAPI
from pylabrobot.resources import Deck

auth_key = "AUTH_KEY"

deck = Deck.from_api("machine1", base_url = "pylabrobot.org/api/v1/state", auth = auth_key)
backend = STARBackendAPI("machine1", base_url = "pylabrobot.org/api/v1/star", auth = auth_key)
lh = LiquidHandler(backend=backend, deck=deck)
await lh.setup()

await lh.pick_up_tips(lh.deck.get_resource("tip_rack")["A1"])
await lh.aspirate(lh.deck.get_resource("plate")["A1"], vols=100)
await lh.dispense(lh.deck.get_resource("plate")["A2"], vols=100)
await lh.return_tips()

The problem is that if the program fails at any of the awaits, or if there is a network disconnection, or if there is a power outage, everything goes down. You can’t just rerun the protocol; you have to figure out what went wrong and modify the code accordingly. This is obviously a bad thing, especially with very expensive protocols.

The idea of durable execution is that the protocol can be re-run however many times, and stopped at any point, and the software will pick up where it left off instead of repeating steps that already ran. While other systems need a lot of infrastructure in order to do this, it can be done quite simply with the networked backends. Here is how it works:

from pylabrobot.liquid_handling import LiquidHandler
from pylabrobot.liquid_handling.backends import STARBackendAPI
from pylabrobot.resources import Deck
from pylabrobot.durable import DurableExecution

auth_key = "AUTH_KEY"
durability_key = DurableExecution.keygen("exec.key") # file that gets written to filesystem

deck = Deck.from_api("machine1", base_url = "pylabrobot.org/api/v1/state", auth = auth_key, durability_key = durability_key)
backend = STARBackendAPI("machine1", base_url = "pylabrobot.org/api/v1/star", auth = auth_key, durability_key = durability_key)
lh = LiquidHandler(backend=backend, deck=deck)
await lh.setup()

await lh.pick_up_tips(lh.deck.get_resource("tip_rack")["A1"])
await lh.aspirate(lh.deck.get_resource("plate")["A1"], vols=100)
await lh.dispense(lh.deck.get_resource("plate")["A2"], vols=100)
await lh.return_tips()

What happens is that the program generates a durability key, which is just a random seed that it then writes to the filesystem (or wherever). Whenever it sends either a resource or liquid handling request to the API, it draws the next number from an RNG seeded with that key and sends it alongside as the request ID. The server checks whether it has already responded to that ID. If it has, it simply returns the exact response it sent the first time instead of running the command on the robot again.

POST pylabrobot.org/api/v1/star {"id": "2ec1a198-9300-458f-8616-d442ce95d27f", "cmd": "aspirate", "vol": 10}
# server checks if it has already handled 2ec1a198-9300-458f-8616-d442ce95d27f
# if it has, return that JSON. If it has not, go actually run it on the robot.
RETURN {"status": "complete"}

This is essentially just a key-value cache (ID to JSON string), so it is fairly easy to implement if you own the backend, but extremely difficult if you don’t (Temporal is an example of someone trying to solve this in a general way).
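A minimal sketch of that server-side cache, assuming an in-memory dict and a hypothetical `run_on_robot` callable standing in for the real hardware call. `DurableCommandCache` and `fake_robot` are illustrative names, not part of PyLabRobot:

```python
import json


class DurableCommandCache:
    """Maps request ID -> JSON response string, so repeated requests are replayed."""

    def __init__(self, run_on_robot):
        self._run_on_robot = run_on_robot  # hypothetical callable: dict -> dict
        self._responses = {}               # request ID -> JSON string already returned

    def handle(self, request: dict) -> str:
        request_id = request["id"]
        if request_id in self._responses:
            # Seen before: replay the stored response without touching the robot.
            return self._responses[request_id]
        # First time: actually execute, then remember exactly what was answered.
        response = json.dumps(self._run_on_robot(request))
        self._responses[request_id] = response
        return response


executed = []  # records which commands really reached the (fake) robot


def fake_robot(request):
    executed.append(request["cmd"])
    return {"status": "complete"}


cache = DurableCommandCache(fake_robot)
req = {"id": "2ec1a198-9300-458f-8616-d442ce95d27f", "cmd": "aspirate", "vol": 10}
first = cache.handle(req)
second = cache.handle(req)  # replayed from the cache, robot is not called again
```

A real server would need to persist the dict, and would need some care around a crash that lands between running a command and storing its response.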

Coincidentally, this also creates a traceable log of everything that has happened on the robot.

It also depends on you NOT having any commands decided by a random number generator. Any decisions made from randomness fuck up the system, because on a re-run the same ID could end up attached to a different command. For robot protocols that shouldn’t be much of a problem.
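To make the failure mode concrete, here is a small sketch assuming the request IDs come from a seeded generator as described above. The `request_ids` and `protocol` helpers and the command names are hypothetical; the `skip_mix` flag stands in for a branch that some source of randomness decides:

```python
import random
import uuid


def request_ids(seed: int, count: int) -> list:
    """Derive `count` request IDs from a seed; the same seed gives the same sequence."""
    rng = random.Random(seed)  # private generator, independent of global random state
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(count)]


def protocol(seed: int, skip_mix: bool) -> list:
    """Pair each command with the next ID, as the client would when sending requests."""
    commands = ["pick_up_tips", "aspirate", "dispense", "return_tips"]
    if not skip_mix:
        commands.insert(2, "mix")  # the extra step taken on the "other" branch
    return list(zip(request_ids(seed, len(commands)), commands))


first_run = protocol(seed=42, skip_mix=True)
rerun_same = protocol(seed=42, skip_mix=True)       # identical IDs and commands
rerun_diverged = protocol(seed=42, skip_mix=False)  # third ID now belongs to "mix"
```

On the diverged re-run, the third ID is the one the server already answered for `dispense`, so it would replay that cached response for a `mix` that never ran.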

My original implementation of this was in Lua, because you could actually make execution guarantees, embed the scripting ability into a larger system, and make pausing a first-class thing that always happens (Lua is just a better language than Python for this kind of thing). But eh, if it is implemented at the API level, it doesn’t matter whether it is Lua or Python.

function main(ctx)
    local result, cont = do_something(ctx)
    if cont then return cont end -- this handles errors and continuations

    return result
end

In this, you have explicit flow control, and protocols immediately halt and return execution every time (unlike the Python, which kind of just waits at each await). But I don’t think you can really get that with Python because of how long spawning the Python process takes.

Could also use Starlark, which would make it completely hermetic by design and retain most of the Pythonic-ness.

:thinking:

The RNG number can just be a simple counter?

Randomness here meaning non-determinism. A subset of protocols (the majority, I would say) do incorporate some kind of feedback, which, while not random, is still external information.

The more generalized form is: ‘durability’ only works for protocols without external information.

You must at least have a generated identifier, so that the system can associate the generated identifier with the protocol. Otherwise any time you run a different protocol they’re gonna overlap. Hence the durable key.

You can replace the RNG with a counter though. Would be better. Counter + key
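A rough sketch of the counter + key variant, assuming the durability key is a string and that IDs are derived with `uuid.uuid5`. The `RequestIdCounter` class and the use of `uuid.NAMESPACE_OID` as the root namespace are arbitrary illustrative choices:

```python
import uuid


class RequestIdCounter:
    """Derives request IDs from a durability key plus an incrementing counter."""

    def __init__(self, durability_key: str, start: int = 0):
        # Namespace derived from the key, so different protocols never share IDs.
        self._namespace = uuid.uuid5(uuid.NAMESPACE_OID, durability_key)
        self._count = start

    def next_id(self) -> str:
        # Same key and same counter value always produce the same ID.
        request_id = uuid.uuid5(self._namespace, str(self._count))
        self._count += 1
        return str(request_id)


run_a = RequestIdCounter("protocol-a-key")
run_a_again = RequestIdCounter("protocol-a-key")  # a re-run of the same protocol
run_b = RequestIdCounter("protocol-b-key")        # a different protocol

ids_a = [run_a.next_id() for _ in range(3)]
ids_a_again = [run_a_again.next_id() for _ in range(3)]
ids_b = [run_b.next_id() for _ in range(3)]
```

Compared with the RNG, the counter makes the position of each request in the protocol explicit, which should also make the server-side log easier to read.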

Depends. Is the external feedback from pylabrobot devices? If so, the external information can be durable in the same way. If you are using non-pylabrobot devices, you can always just cache the info you get back.

Take this snippet for example:

function main(ctx)
    local result, cont = do_something(ctx)
    if cont then return cont end

    return result
end

do_something(ctx) would just throw whatever it returns into a cache (which is in ctx). Then it just checks next time if something is already there. So the claim that ‘durability’ only works for protocols without external information is not true: durability CAN still work with external information. However, if the execution is non-deterministic (some kind of RNG is deciding something, so the code doesn’t take the exact same path each time), it CANNOT work.
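The ctx cache idea can be sketched in Python roughly like this, assuming results are JSON-serializable and keyed by the order in which steps run. `DurableContext`, `read_od`, and the fake readings are all hypothetical stand-ins for a non-pylabrobot device:

```python
import json
import os
import tempfile


class DurableContext:
    """Hypothetical ctx: caches each step's result on disk, keyed by step order."""

    def __init__(self, path: str):
        self._path = path
        self._step = 0
        self._cache = {}
        if os.path.exists(path):
            with open(path) as f:
                self._cache = json.load(f)

    def step(self, fn):
        """Run fn the first time; on later runs return the recorded result instead."""
        key = str(self._step)
        self._step += 1
        if key not in self._cache:
            self._cache[key] = fn()
            with open(self._path, "w") as f:
                json.dump(self._cache, f)
        return self._cache[key]


readings = iter([0.42, 0.97])  # stand-in for an external sensor that changes over time


def read_od():
    return next(readings)


def protocol(ctx):
    od = ctx.step(read_od)  # external information, recorded on the first run
    return "dilute" if od > 0.5 else "continue"


path = os.path.join(tempfile.mkdtemp(), "ctx.json")
first = protocol(DurableContext(path))
second = protocol(DurableContext(path))  # replays 0.42 instead of reading 0.97
```

The branch depends on external information, but because the reading is replayed from the cache, the re-run takes the same path as the first run. A branch driven by a fresh RNG draw has no such recorded value, which is where this breaks down.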