Durable execution of protocols

This literally came to me in a dream, so I wanted to write it down here before it escaped my brain (yes, I sometimes dream in code).

We have our classic PLR demo (adapted to use an API backend, because that is what enables this kind of thing):

from pylabrobot.liquid_handling import LiquidHandler
from pylabrobot.liquid_handling.backends import STARBackendAPI
from pylabrobot.resources import Deck

auth_key = "AUTH_KEY"

deck = Deck.from_api("machine1", base_url="pylabrobot.org/api/v1/state", auth=auth_key)
backend = STARBackendAPI("machine1", base_url="pylabrobot.org/api/v1/star", auth=auth_key)
lh = LiquidHandler(backend=backend, deck=deck)
await lh.setup()

await lh.pick_up_tips(lh.deck.get_resource("tip_rack")["A1"])
await lh.aspirate(lh.deck.get_resource("plate")["A1"], vols=100)
await lh.dispense(lh.deck.get_resource("plate")["A2"], vols=100)
await lh.return_tips()

The problem is that if the program fails at any of the awaits, or there is a network disconnection, or there is a power outage, everything goes down. You can’t just rerun the protocol; you have to figure out what went wrong and modify the code accordingly. This is obviously bad, especially with very expensive protocols.

The idea of durable execution is that the protocol can be re-run any number of times, and stopped at any point, and the software will continue execution as expected. While other systems need a lot of infrastructure to do this, it can be done quite simply with the networked backends. Here is how it works:

from pylabrobot.liquid_handling import LiquidHandler
from pylabrobot.liquid_handling.backends import STARBackendAPI
from pylabrobot.resources import Deck
from pylabrobot.durable import DurableExecution

auth_key = "AUTH_KEY"
durability_key = DurableExecution.keygen("exec.key") # file that gets written to filesystem

deck = Deck.from_api("machine1", base_url="pylabrobot.org/api/v1/state", auth=auth_key, durability_key=durability_key)
backend = STARBackendAPI("machine1", base_url="pylabrobot.org/api/v1/star", auth=auth_key, durability_key=durability_key)
lh = LiquidHandler(backend=backend, deck=deck)
await lh.setup()

await lh.pick_up_tips(lh.deck.get_resource("tip_rack")["A1"])
await lh.aspirate(lh.deck.get_resource("plate")["A1"], vols=100)
await lh.dispense(lh.deck.get_resource("plate")["A2"], vols=100)
await lh.return_tips()

What happens is that the program generates a durability key, which is just a random seed that it then writes to the filesystem (or wherever). Whenever it sends a resource or liquid handling request to the API, it generates the next RNG number and sends it alongside. The server checks whether it has already responded to that RNG number. If it has, it simply returns the exact data it already sent, without running anything on the robot again.

POST pylabrobot.org/api/v1/star {"id": "2ec1a198-9300-458f-8616-d442ce95d27f", "cmd": "aspirate", "vol": 10}
# server checks if it has already generated 2ec1a198-9300-458f-8616-d442ce95d27f
# if it has, return that JSON. If it has not, go actually run it on the robot.
RETURN {"status": "complete"}
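For illustration, here is a minimal sketch of how the client could derive that ID stream, assuming the durability key is nothing more than an RNG seed persisted to exec.key (next_request_id and the seeding scheme are hypothetical, not existing PLR API):

import os, random, uuid

# load or create the seed that makes the ID sequence reproducible across reruns
if os.path.exists("exec.key"):
    seed = int(open("exec.key").read())
else:
    seed = random.SystemRandom().getrandbits(64)
    open("exec.key", "w").write(str(seed))

rng = random.Random(seed)  # deterministic: same seed => same sequence

def next_request_id() -> str:
    # the n-th call always returns the same UUID for a given seed,
    # so a rerun of the protocol re-sends the same IDs in the same order
    return str(uuid.UUID(int=rng.getrandbits(128)))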

This is essentially just a key-value cache (id to JSON string), so it is fairly easy to implement, but extremely difficult to implement if you don’t own the backend (Temporal is an example of someone trying to solve this in a general way).

Coincidentally, this also creates a traceable log of everything that has happened on the robot.

It also depends on you NOT issuing any commands on the basis of a random number generator. Any decisions made from randomness fuck up the system. Which, for robot protocols, shouldn’t be much of a problem.

My original implementation of this was in Lua, because you could actually make execution guarantees, embed the scripting ability into a larger system, and make pausing a first-class thing that always happens (Lua is just a better language than Python for this kind of thing). But eh, if it is implemented at the API level it doesn’t matter whether it is Lua or Python.

function main(ctx)
    local result, cont = do_something(ctx)
    if cont then return cont end  -- this handles errors and continuations

    return result
end

In this, you have explicit flow control, and protocols immediately halt and return execution every time (unlike the Python version, which kind of just waits at each await). But I don’t think you can really get that with Python because of how long spawning the Python process takes.

Could also use Starlark, which would make it completely hermetic by design and retain most of the Pythonic-ness.

:thinking:

this can just be a simple counter?

randomness meaning non-determinism. Some subspace of protocols (the majority, I would say) do incorporate some kind of feedback, which, while not random, is still external information.

the more generalized form is: ‘durability’ only works for protocols without external information

You must at least have a generated identifier, so that the system can associate the generated identifier with the protocol. Otherwise any time you run a different protocol they’re gonna overlap. Hence the durable key.

You can replace the RNG with a counter though. Would be better. Counter + key

Depends. Is the external feedback from pylabrobot devices? If so, the external information can be durable in the same way. If you are using non-pylabrobot devices, you can always just cache the info you get back.

Take this snippet for example

function main(ctx)
    local result, cont = do_something(ctx)
    if cont then return cont end

    return result
end

do_something(ctx) would just throw whatever it returns into a cache (which lives in ctx). Then, the next time around, it just checks whether something is already there for that step.

So the claim above is not true: durability CAN still work with external information. However, if the execution is non-deterministic (some kind of RNG is deciding something, so the code doesn’t take the exact same paths each run), it CANNOT work.
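For the non-pylabrobot-device case, a minimal sketch of that cached-feedback pattern, assuming ctx is just a dict that gets persisted to disk alongside the counter (durable_read and the cache layout are hypothetical):

import json

def durable_read(ctx, step_key, read_fn):
    # replay: if this step already ran in a previous attempt, reuse its result
    if step_key in ctx["cache"]:
        return ctx["cache"][step_key]
    value = read_fn()                      # e.g. read an external plate reader
    ctx["cache"][step_key] = value
    with open(ctx["path"], "w") as f:      # persist so a crash can't lose it
        json.dump(ctx["cache"], f)
    return value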

Let’s classify the problems we need to address to achieve durable execution in PLR:

  1. Type I: Unreliability at the USB/wire level (noise, connection drops). This is already solved at the firmware level for most devices. On the Hamilton STAR, commands carry a counter/ID, and the PLR resource model only updates on a matching response with no error.

  2. Type II: Unreliability at the network level (client-server disconnections). These are introduced by networked backends, which enable machine sharing but create desync risks. The fix is deterministic ID generation (an RNG seeded from a durability key) so reruns reproduce the same ID sequence, while the server caches responses for idempotency without needing to know the steps up front.

  3. Type III: Failures that leave the physical state unknown or changed (power outages, firmware crashes, mechanical issues). These are the most common failures of integrated workcells. Recovery means initializing the equipment to a known state (e.g. discarding the partially processed beads from a mixing step). Auto-resuming without errors requires adapting dynamically, e.g. pulling fresh foil-sealed reagents from storage (overprovisioning).

It seems the best way to resolve Type II durability concerns, while leaving space for Type III branching recovery logic, is granular step idempotency: caching the robot's response to exactly one command:

command_id = sha256(
    durability_key +                 # random key persisted to exec.key
    run_id +                         # identifies this particular run of the protocol
    str(current_counter).encode() +  # client-side, starts at 0, increments on each successful response
    canonical_json(action_params)    # the action payload the server will send on to the robot over USB
)
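canonical_json just has to produce byte-identical output for the same action on every rerun, so the hash is stable. A minimal sketch of such a helper (assumed here, not existing PLR code):

import json

def canonical_json(obj) -> bytes:
    # sorted keys + no whitespace => byte-identical output across reruns
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()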

The client-side send logic:

import hashlib

def send(action_params):
    state = load_client_state()  # counter, durability key and run id persisted on disk
    current_counter = state['current_counter']
    durability_key = state['durability_key']
    run_id = state['run_id']

    step_id = str(current_counter)
    action_bytes = canonical_json(action_params)
    command_id = hashlib.sha256(
        durability_key.encode() + run_id.encode() + step_id.encode() + action_bytes
    ).hexdigest()

    cmd = {
        "run_id": run_id,
        "command_id": command_id,
        "action": action_params,
    }

    try:
        response = server.send(cmd)
        if response.error:
            raise ProtocolError("Break on error")  # no increment: a rerun retries this step

        # Success: advance the counter so the next command gets a new id
        state['current_counter'] += 1
        persist_client_state(state)

        return response

    except NetworkError:
        reconnect()
        return send(action_params)  # recurse: reloads state, same counter, same command_id
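load_client_state / persist_client_state are assumed helpers; a minimal sketch, keeping the state in a JSON file next to exec.key (filenames and layout are illustrative):

import json, os, uuid

STATE_PATH = "exec.state.json"

def load_client_state():
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH) as f:
            return json.load(f)
    # first run: create a fresh run_id and start the counter at 0
    state = {
        "durability_key": open("exec.key").read().strip(),
        "run_id": str(uuid.uuid4()),
        "current_counter": 0,
    }
    persist_client_state(state)
    return state

def persist_client_state(state):
    with open(STATE_PATH, "w") as f:
        json.dump(state, f)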

The server logic:

def handle_command(cmd):
    key = (cmd['run_id'], cmd['command_id'])
    persisted_response = load_from_db(key)  # quick lookup in the idempotency cache
    if persisted_response:
        return persisted_response  # replay: never run the same command on the robot twice

    response = robot.execute(cmd['action'])
    save_to_db(key, response)
    return response
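load_from_db / save_to_db just need to be a persistent key-value store so the cache also survives server restarts; a minimal sketch with sqlite (table name and JSON encoding are assumptions):

import json, sqlite3

db = sqlite3.connect("idempotency_cache.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS responses (run_id TEXT, command_id TEXT, response TEXT, "
    "PRIMARY KEY (run_id, command_id))"
)

def load_from_db(key):
    row = db.execute(
        "SELECT response FROM responses WHERE run_id = ? AND command_id = ?", key
    ).fetchone()
    return json.loads(row[0]) if row else None

def save_to_db(key, response):
    db.execute("INSERT INTO responses VALUES (?, ?, ?)", (*key, json.dumps(response)))
    db.commit()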

So if the client disconnects or the kernel crashes, as long as the same run_id and durability_key are used, the server will play back cached responses until it reaches a command_id it has not seen yet, at which point execution resumes for real.

Most importantly, this only resolves software errors on the client side. It cannot resolve physical robot errors on the server side: those require clearing the cache for recovery, because reagent overprovisioning drives the branching logic needed for a proper physical recovery, which invalidates the run_id and restarts the protocol from the beginning with new reagents.