Error handling api design

@CamilloMoschner and i have been talking a lot about a good api design for error handling. Ideally we make

  • a pattern that can be shared across all front ends to handle all errors.
  • a standard library of error handlers, while also allowing the user to easily write their own.
  • stackable error handlers: “retry pickup in the same place, if that doesn’t work try the next 5 tipspots, if that doesn’t work send a slack message.”

Below are three main ideas. I will use lh.pick_up_tips to demonstrate because it’s an easy example, but keep in mind we should generalize to all methods.

1. Context manager

PoC PR here: error handling: option 1 (PoC) by rickwierenga · Pull Request #507 · PyLabRobot/pylabrobot · GitHub

with lh.on_fail(ChannelizedError, try_next_tip_spot(iter(tip_rack["B1:H1"]))):
  await lh.pick_up_tips(tip_rack["A1"])
  • You can stack context managers.
  • It’s nice that it’s possible to use the same error handler on multiple calls / across functions.
  • However it’s also a downside. Camillo raised a good point that this is too dangerous: it’s not clear which error handler is active where. You can have things like:
with lh.on_fail(ChannelizedError, try_next_tip_spot(iter(tip_rack["B1:H1"]))):
  deep_function()
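
To make the scoping concern concrete, here is a minimal sketch of how a handler stack behind `on_fail` might behave. It is not the implementation from the PoC PR: `ToyLiquidHandler`, `PickupError` and the handler signature `handler(lh, error)` are all assumptions made for illustration.

```python
import asyncio
from contextlib import contextmanager


class PickupError(Exception):
  """Stand-in for a PLR error such as ChannelizedError."""


class ToyLiquidHandler:
  """Hypothetical front end; it only illustrates handler scoping."""

  def __init__(self):
    self._handlers = []  # stack of (error_type, handler) pairs
    self.log = []

  @contextmanager
  def on_fail(self, error_type, handler):
    # The handler is active for everything executed inside the with-block.
    self._handlers.append((error_type, handler))
    try:
      yield
    finally:
      self._handlers.pop()

  async def pick_up_tips(self, spot):
    try:
      if spot == "A1":  # pretend A1 is empty
        raise PickupError(f"no tip at {spot}")
      self.log.append(f"picked up {spot}")
    except Exception as error:
      # The innermost matching handler wins.
      for error_type, handler in reversed(self._handlers):
        if isinstance(error, error_type):
          await handler(self, error)
          return
      raise


def try_next_tip_spot(spots):
  async def handler(lh, error):
    await lh.pick_up_tips(next(spots))
  return handler


async def deep_function(lh):
  # Nothing in this function shows that a handler is active.
  await lh.pick_up_tips("A1")


async def main():
  lh = ToyLiquidHandler()
  with lh.on_fail(PickupError, try_next_tip_spot(iter(["B1", "C1"]))):
    await deep_function(lh)
  return lh.log


log = asyncio.run(main())
print(log)
```

The call inside `deep_function` is silently redirected to B1, which is exactly the convenience and the danger discussed above.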

The other two options are specific to each lh/frontend call. “It does mean more code for every command but it means it is very clear what error handling is applied to what command and how”

2. Separate functions that do error handling

Roughly the following:

await lh.tip_pickup_handler(
    tip_iterator=iter(tip_rack["B1:H1"]),
    ...
)

essentially new functions that incorporate error handling.

It’s nice that it’s very explicit what’s going on. The main downsides are adding error handling would completely change your code, and it’s not clear how you would stack error handlers. It can also cloud/make ambiguous the api of the front end.

3. A new parameter to existing functions

Similar to #2, but using the original method as opposed to having a new method wrapping the original, we pass the handler as an argument to the existing method

await lh.pick_up_tips(
  tip_rack["A1"],
  error_handler=try_next_tip_spot(iter(tip_rack["B1:H1"]))
)

what do you think?


I kind of like #1 and #3.

As a user, my main need is error handling for long-running steps so they do not error out and stop. The most common ones I can think of are tip issues: either miscalculating and running out of tips, or random tip pickup issues. In both cases, I would want error handling to:

  1. Try to pick up the next tip
  2. Optionally prompt the user whether it is ok to continue from the current step (i.e. an option to quit and start over, or to continue from where the error occurred). In the case of tips running out, the user could tell PLR that the tips were reset

In practice, I would want some global error handling for long-running scripts, like the “context manager” described here (#1). For each potential error, I could give an error handling action to try (ChannelizedError, try_next_tip; TooLittleLiquidError, wait for operator reply).

But a new parameter for existing lh functions (#3) would be easy enough to get started with, especially for tip issues. For more complex error catching, this may be insufficient depending on the error.
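
A minimal sketch of the “one handler per error type” idea for a whole script. The error classes are stand-ins, and `GLOBAL_HANDLERS`, `run_step` and the handler functions are hypothetical names, not PLR API.

```python
import asyncio


class ChannelizedError(Exception):
  """Stand-in for PLR's ChannelizedError."""


class TooLittleLiquidError(Exception):
  """Stand-in for a liquid-level error."""


async def try_next_tip(error):
  return "retry with next tip"


async def wait_for_operator(error):
  return "paused until operator replies"


# Hypothetical script-wide table: error type -> handler.
GLOBAL_HANDLERS = {
  ChannelizedError: try_next_tip,
  TooLittleLiquidError: wait_for_operator,
}


async def run_step(step):
  """Run one protocol step; on failure, dispatch on the error type."""
  try:
    return await step()
  except Exception as error:
    for error_type, handler in GLOBAL_HANDLERS.items():
      if isinstance(error, error_type):
        return await handler(error)
    raise  # no handler registered: stop the script as before


async def failing_pickup():
  raise ChannelizedError("channel 3 failed")


async def failing_aspirate():
  raise TooLittleLiquidError("well A1 is empty")


results = [
  asyncio.run(run_step(failing_pickup)),
  asyncio.run(run_step(failing_aspirate)),
]
print(results)
```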

The other benefit of error handling is having the script stop itself if there is an issue and remember where it left off so you don’t need to rerun the entire script. I described a little bit of this in this thread about “pausing a PLR script”.


Thank you @rickwierenga for the summary of our discussions :slight_smile:

My views are still evolving, and I am not sure there is an objectively best solution here, but currently I believe that…

  1. a context handler implementation with with is dangerous and ambiguous:
    (a) depending on Python preferences, only 2-4 spaces (or whatever you set your tab to :roll_eyes: ) separate the non-context handler code from the context handler code - standard checks cannot detect whether one line of code has been accidentally added into the with statement → that can lead to some disastrous physical results, which would only be recognized during machine run-time.
    => Opinion: PLR’s error handling system should be close to the handled code by design and explicit - there should not be any ambiguity regarding what commands are managed by an error handler.

  2. separate functions or wrappers require more code… I think that is the only con about it?
    But they are explicit and very easy to read because they are always directly associated with the command they handle (as opposed to with statement maybe dozens or even hundreds of lines above).

  3. I really like the idea but wouldn’t this intrinsically link the error handling system with the creation of every command in PLR?
    => Opinion: PLR’s error handling system should be modular.
    While I think we can have a general handler the specific handling functions have to be adapted to the commands they are handling, e.g.: tip pickup requires different error handling than aspirate/dispense which requires different error handling to “measure OD450 in every plate”… I don’t think it is reasonable to add the handling straight into every command in PLR?
    (Note: some error handling functions can be universally used, e.g. stop_script_send_operator_warning_email, and stop_script_ask_operator_for_error_cause_resolution; but I believe besides a few hyper-useful functions like this, most error handling functions will be very specific)

As a result of this I think PLR’s error handling system should probably be based on a wrapper implementation.
I don’t like writing more code either but I like dangerous code or code I don’t understand in 2 months while re-reading it even less, and think maintaining non-modular error handling functions for every PLR command is not feasible :sweat_smile:


Here is an old example of a tip_pickup error handler I made a little while ago:

from typing import Any, Callable, Iterator, List, Optional
from pylabrobot.resources import Resource
from pylabrobot.liquid_handling.errors import ChannelizedError
from pylabrobot.resources.errors import HasTipError


async def tip_pickup_handler(
    pick_up_fn: Callable[..., Any],
    tip_iterator: Iterator[Resource],
    use_channels: Optional[List[int]] = None,
    num_channels: int = 8,
    **kwargs
) -> None:
    """
    Pick up tips on specified channels. If any channels fail, retry each failed channel
    individually using new tips from the iterator until all have succeeded or input is exhausted.

    Stops gracefully if the tip source runs out, without raising an error.
    """
    # Default to all channels; this avoids a mutable list as a default argument.
    if use_channels is None:
        use_channels = list(range(num_channels))

    channel_to_tip = {}

    # Step 1: Assign initial tips
    for ch in use_channels:
        try:
            channel_to_tip[ch] = next(tip_iterator)
        except StopIteration:
            print("tip_pickup_handler: Not enough tips to start. Stopping.")
            return

    # Step 2: Try batch pickup
    try:
        tips = [channel_to_tip[ch] for ch in use_channels]
        await pick_up_fn(tips, use_channels=use_channels, **kwargs)
        print("tip_pickup_handler: Pickup succeeded.")
        return
    except ChannelizedError as e:
        failed_channels = list(e.errors.keys())
        print(f"tip_pickup_handler: Initial batch failed. Retrying individually. Failed: {failed_channels}")
    except HasTipError:
        print("tip_pickup_handler: HasTipError — some channels already have tips.")
        return

    # Step 3: Retry one-by-one
    for ch in failed_channels:
        print(f"tip_pickup_handler: Retrying channel {ch} individually...")
        while True:
            try:
                channel_to_tip[ch] = next(tip_iterator)
            except StopIteration:
                print(f"tip_pickup_handler: Tip source exhausted while retrying channel {ch}. Stopping.")
                return

            try:
                await pick_up_fn([channel_to_tip[ch]], use_channels=[ch], **kwargs)
                print(f"tip_pickup_handler: Channel {ch} succeeded.")
                break
            except ChannelizedError:
                print(f"tip_pickup_handler: Channel {ch} failed. Trying next tip...")
                continue
            except HasTipError:
                print(f"tip_pickup_handler: Channel {ch} already has a tip. Skipping.")
                break

    print("tip_pickup_handler: All channels now have tips.")


for x in range(12):
    await tip_pickup_handler(
        pick_up_fn=lh.pick_up_tips,
        tip_iterator=tip_1000ul_input_iterator,
        # num_channels=5,
        # use_channels=list(range(8)),
        retries=5,
        )

This is just a bare-bones, but functional, implementation of a single error handling function. What we want, of course, is multiple modular handling functions and a main SerialErrorHandler that manages them.
@rickwierenga had some great ideas regarding this!

i agree with this

i also agree with these:

  1. PLR’s error handling system should be close to the handled code by design and explicit
  2. PLR’s error handling system should be modular.

i see multiple cons:

  • (requires more code in PLR, more maintenance and room for bugs)
  • when adding/switching error handlers, users have huge diffs in their protocol
  • it is not modular at all
    • adding user-defined error handlers is less easy since they have to write entirely new functions
      • i.m.o.: the ideal scenario for user-defined & modular error handlers is each handler is a function
    • it makes it difficult to combine error handlers
      • how would you specify “retry with the same spot, then the next spot, then send a slack message”? Each combination seems to require a new function
    • it is impossible to cleanly define universal error handlers like “send email” without writing a new wrapper function. we have tens and soon hundreds of front-end methods in PLR. combine that with error handlers and it’s a combinatorial explosion.
  • worse api: assuming these would be front-end-level methods, it makes it ambiguous which function of the front end to use.

not to say we shouldn’t use this method, but it definitely comes at a high cost

imagine wanting to add a slack message to the tip_pickup_handler (should be named pick_up_tip_using_iterator or sth), you would have to copy the function, edit it, update your function call. it’s a lot of work.
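
for comparison, a minimal sketch of what “each handler is a function” could look like, including the “same spot, then the next spot, then slack” combination. everything here (`stack`, `retry_same_spot`, `try_next_spots`, the `handler(error)` signature) is a hypothetical name chosen for illustration, and `pick_up` is a toy stand-in for the real call.

```python
import asyncio


class PickupError(Exception):
  """Stand-in for a tip pickup error."""


attempts = []


async def pick_up(spot):
  attempts.append(spot)
  if spot in ("A1", "B1"):  # pretend both spots are empty
    raise PickupError(spot)


def retry_same_spot(spot):
  async def handler(error):
    await pick_up(spot)
  return handler


def try_next_spots(spots):
  async def handler(error):
    last = error
    for spot in spots:
      try:
        await pick_up(spot)
        return
      except PickupError as e:
        last = e
    raise last  # every spot failed
  return handler


async def send_slack_message(error):
  attempts.append(f"slack: {error!r}")  # a real handler would post a message


def stack(*handlers):
  """Combine handlers: try each in order until one returns without raising."""
  async def handler(error):
    for h in handlers:
      try:
        await h(error)
        return
      except Exception as e:
        error = e
    raise error
  return handler


async def main():
  handler = stack(retry_same_spot("A1"), try_next_spots(["B1"]), send_slack_message)
  try:
    await pick_up("A1")
  except PickupError as error:
    await handler(error)


asyncio.run(main())
print(attempts)
```

adding the slack message is one extra argument to `stack`, not a new copy of the pickup function.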

could you explain what you mean with “link the error handling system with the creation of every command in PLR”?

yes, being able to pass functions freely (modular) comes at this cost, specifically in a dynamically typed language. but at the same time this is a power, like for the general examples you already named.

i think this can be mitigated by specifying the types of error handlers carefully, e.g.

def pick_up_tips(
  ...,
  error_handler: Callable[  # type of pick_up_tips
    [
      List[TipSpot],
      Optional[List[int]],
      Optional[List[Coordinate]]
    ],
    Coroutine[Any, Any, None]
  ]
):
  ...
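
a runnable sketch of the same idea, assuming the handler is called with the arguments of the failed call. `TipSpot` and `Coordinate` are stand-in classes here so the example runs without PLR, and `PickUpTipsHandler` and `use_b1_instead` are hypothetical names. a type checker such as mypy can then reject a handler whose parameters do not match.

```python
import asyncio
from typing import Any, Callable, Coroutine, List, Optional


class TipSpot:  # stand-in for PLR's TipSpot
  def __init__(self, name: str):
    self.name = name


class Coordinate:  # stand-in for PLR's Coordinate
  pass


# Alias for "a coroutine function with the same parameters as pick_up_tips".
PickUpTipsHandler = Callable[
  [List[TipSpot], Optional[List[int]], Optional[List[Coordinate]]],
  Coroutine[Any, Any, None],
]

picked: List[str] = []


async def pick_up_tips(
  tip_spots: List[TipSpot],
  use_channels: Optional[List[int]] = None,
  offsets: Optional[List[Coordinate]] = None,
  error_handler: Optional[PickUpTipsHandler] = None,
) -> None:
  try:
    if any(spot.name == "A1" for spot in tip_spots):  # pretend A1 is empty
      raise RuntimeError("no tip")
    picked.extend(spot.name for spot in tip_spots)
  except RuntimeError:
    if error_handler is None:
      raise
    await error_handler(tip_spots, use_channels, offsets)


async def use_b1_instead(
  tip_spots: List[TipSpot],
  use_channels: Optional[List[int]],
  offsets: Optional[List[Coordinate]],
) -> None:
  # Retry without a handler, so a second failure is raised to the caller.
  await pick_up_tips([TipSpot("B1")], use_channels, offsets)


asyncio.run(pick_up_tips([TipSpot("A1")], error_handler=use_b1_instead))
print(picked)
```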

why?

after speaking with @CamilloMoschner, here’s option 4 he suggested:

await lh.on_fail(lh.some_method("wrong input"), handler)

kind of like a context manager, but specific to one function.

it could work like this:

the pros are

  • type checking still works
  • minimal changes to PLR
  • clear which error handler is used for what
  • relatively clean diff for protocols
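
a minimal sketch of how a per-call `on_fail` might behave, under the assumption that the handler receives the error and is responsible for any retry. `ToyFrontend` and the `handler(error)` signature are hypothetical, not existing PLR API.

```python
import asyncio


class ToyFrontend:
  """Hypothetical front end with a per-call on_fail wrapper."""

  async def some_method(self, value):
    if value == "wrong input":
      raise ValueError(f"cannot process {value!r}")
    return f"processed {value}"

  async def on_fail(self, call, handler):
    # `call` is the coroutine object created by `self.some_method(...)`.
    # It has not started running yet; it runs when awaited here.
    try:
      return await call
    except Exception as error:
      # A coroutine object can only be awaited once, so a retry has to be
      # a fresh call made by the handler.
      return await handler(error)


async def main():
  lh = ToyFrontend()

  async def handler(error):
    return await lh.some_method("corrected input")

  return await lh.on_fail(lh.some_method("wrong input"), handler)


result = asyncio.run(main())
print(result)
```

because `lh.some_method("wrong input")` is still an ordinary call expression, the type checker sees the arguments exactly as it would without error handling.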

I was actually thinking more of something like this:

await error_handler(
    fn=some_machine_method,
    fn_kwargs=kwargs,        # keyword arguments for some_machine_method
    retry_strategies=None,   # Optional[List[Tuple[Error, Callable]]]
    fallback_strategy=None,  # Optional[Callable]
)

The architecture:

This is a general purpose error handler, consisting of

  1. the original class method (some_machine_method)
  2. its keyword-arguments (I don’t believe any PLR method has pure positional arguments?)
  3. a list of what I would call retry_functions, given to the error handler function as a list of tuples; each tuple contains the specific error we want to handle and the retry_function, which modifies the class method’s arguments in a pre-specified manner,
  4. a fallback function, which could be anything from another method argument modification-based retry_function to a simple input() request

Here is a demo:


Why?

We are step-by-step generating a design requirement list for PLR’s general purpose error handling system here:

We already established:

  1. PLR’s error handling system should be close to the handled code by design and explicit
  2. PLR’s error handling system should be modular.

I would like to add:

  1. The top-level / general error handler should not be dependent on the machine class method it handles (i.e. there must be no need to modify any machine class method code definition).
  2. Serial error handling should be possible by default. (e.g. in the tip pickup example: try next tip_spot in the iterator)
  3. A “fallback” action should be possible by default in case serial error handling fails (e.g. in the tip pickup example: ask user to reset tipracks and confirm reset has occurred)

Please add more if you can think of more.

This demo from above:

await error_handler(
    fn=unstable_operation,
    fn_kwargs={'x': -1},
    retry_strategies=[
        (ValueError, handle_value_error, {'increment_dist': 2}),
        (RuntimeError, handle_runtime_error)
    ],
    fallback_strategy=fallback_handler
)

…showcases what I believe we actually want to achieve: execute retries of the method that failed with different keyword-arguments (kwargs).
Then the error handler checks whether the new kwargs let the method execute without errors.
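
A minimal sketch of how such an error_handler could be implemented, so the demo call can actually be run. The calling convention for retry functions, `retry_function(kwargs, error, **retry_kwargs)` returning new kwargs, is my assumption for this sketch, as are the toy bodies of `unstable_operation`, `handle_value_error`, `handle_runtime_error` and `fallback_handler`.

```python
import asyncio
from typing import Any, Callable, Dict, List, Optional, Tuple


async def error_handler(
    fn: Callable[..., Any],
    fn_kwargs: Dict[str, Any],
    retry_strategies: Optional[List[Tuple]] = None,
    fallback_strategy: Optional[Callable[..., Any]] = None,
) -> Any:
    """Call fn(**fn_kwargs); on a listed error, retry with modified kwargs."""
    kwargs = dict(fn_kwargs)
    try:
        return await fn(**kwargs)
    except Exception as e:
        error = e

    for strategy in retry_strategies or []:
        # Each strategy is (error_type, retry_function[, retry_kwargs]).
        error_type, retry_function, *rest = strategy
        retry_kwargs = rest[0] if rest else {}
        if not isinstance(error, error_type):
            continue
        kwargs = retry_function(kwargs, error, **retry_kwargs)
        try:
            return await fn(**kwargs)
        except Exception as e:
            error = e  # a later strategy may match the new error

    if fallback_strategy is not None:
        return await fallback_strategy(error)
    raise error


async def unstable_operation(x: int) -> str:
    if x < 0:
        raise ValueError("x must not be negative")
    return f"ok with x={x}"


def handle_value_error(kwargs, error, increment_dist=1):
    return {**kwargs, "x": kwargs["x"] + increment_dist}


def handle_runtime_error(kwargs, error):
    return kwargs  # retry unchanged


async def fallback_handler(error):
    return f"fallback after {error!r}"


result = asyncio.run(error_handler(
    fn=unstable_operation,
    fn_kwargs={"x": -1},
    retry_strategies=[
        (ValueError, handle_value_error, {"increment_dist": 2}),
        (RuntimeError, handle_runtime_error),
    ],
    fallback_strategy=fallback_handler,
))
print(result)
```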

(Note: though this is focusing on retry logic of the original method, we have the power to choose any functionalised action: if we wanted to write a function that sends us an email with the error report and the log up to the error encounter, that is still possible!)


Yes, this is much longer than a simple method call… but the more I am testing this, the more I realise that this is the bare minimum information that is required to achieve proper error handling. Error handling is just intrinsically a complex responsive action.
This is especially important when retry_functions require constraint-combinatorial kwargs modifications:
e.g. if an LLD aspiration fails and one wants to aspirate from the bottom, they have to not only modify the LLD_mode but also have to ensure that there is no Liquid Level Following distance specified when making the switch → otherwise it will crash.

As a result, the retry_functions must take their own keyword arguments to specify the change in the method kwargs we want to see.
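
A sketch of such a coupled modification for the LLD example. The parameter names `lld_mode`, `liquid_height` and `liquid_following_distance` are hypothetical placeholders; the real backend keyword arguments may be named differently.

```python
from typing import Any, Dict


def aspirate_from_bottom(
    kwargs: Dict[str, Any],
    error: Exception,
    height_above_bottom: float = 1.0,
) -> Dict[str, Any]:
    """Return new aspirate kwargs that switch from LLD to a fixed height."""
    new_kwargs = dict(kwargs)  # never mutate the caller's kwargs
    # Hypothetical parameter names, used for illustration only.
    new_kwargs["lld_mode"] = "off"
    new_kwargs["liquid_height"] = height_above_bottom
    # Coupled constraint from the text above: the following distance has
    # to be removed together with the mode switch.
    new_kwargs.pop("liquid_following_distance", None)
    return new_kwargs


original = {"lld_mode": "capacitive", "liquid_following_distance": 2.0, "vols": [100]}
modified = aspirate_from_bottom(original, RuntimeError("LLD failed"))
print(modified)
```

Keeping both changes inside one retry_function means the caller cannot apply one without the other.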


The only thing: it isn’t pretty code

But it is explicit, highly adaptive, very readable and incredibly powerful - without modifying the method definition itself or being dependent on any specific machine (at this top / general purpose error handler level).

Applied to our specific (and comparatively simple) tip pickup error scenario:

without error handling:

await lh.pick_up_tips(tip_spots=tip_rack_1000ul_00["A1:H1"])
# imagine worst case scenario: 
# 2 channels already have old/used tips on them -> HasTipError 
# (then) 2 tip_spots don't have tips -> NoTipError
# (then) both of these tip pickup actions take a long time -> RuntimeError

with error handling:

await error_handler(
    fn=lh.pick_up_tips,
    fn_kwargs={'tip_spots': tip_rack_1000ul_00["A1:H1"]},
    retry_strategies=[
        (HasTipError, discard_preexisting_tip_and_retry_pickup, {
            'tip_iterator': tip_1000ul_input_iterator
        }),
        (NoTipError, retry_tip_pickup_on_next_tipspots, {
            'tip_iterator': tip_1000ul_input_iterator,
        }),
        (RuntimeError, retry_tip_pickup_with_extended_timeout, {
            'timeout': 240,
        }),
    ],
    fallback_strategy=reset_all_tipracks_and_wait_for_user_confirmation
)

All of the behaviour mentioned above must be declared, and it must coordinate the actions of each retry_function across (in this case) 8 different channels, no matter which implementation we choose. This is what I mean by “error handling is inherently a complex task”… exactly the right type of problem for PyLabRobot :smirking_face:
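
As an illustration of the per-channel coordination, here is a sketch of what retry_tip_pickup_on_next_tipspots might do. It assumes the retry function receives the current kwargs and the error and returns new kwargs; `ChannelizedError` is a stand-in class with an `errors` dict keyed by channel, and plain strings stand in for tip spots.

```python
from typing import Any, Dict, Iterator, List


class ChannelizedError(Exception):
    """Stand-in for PLR's ChannelizedError: maps channel index -> error."""

    def __init__(self, errors: Dict[int, Exception]):
        super().__init__(f"errors on channels {sorted(errors)}")
        self.errors = errors


def retry_tip_pickup_on_next_tipspots(
    kwargs: Dict[str, Any],
    error: ChannelizedError,
    tip_iterator: Iterator[str],
) -> Dict[str, Any]:
    """Build kwargs that retry only the failed channels with fresh tip spots."""
    failed_channels: List[int] = sorted(error.errors)
    # next() raises StopIteration once the iterator is exhausted; this
    # sketch does not handle that case.
    new_spots = [next(tip_iterator) for _ in failed_channels]
    return {**kwargs, "tip_spots": new_spots, "use_channels": failed_channels}


error = ChannelizedError({2: Exception("no tip"), 5: Exception("no tip")})
spare_spots = iter(["A2", "B2", "C2"])
new_kwargs = retry_tip_pickup_on_next_tipspots(
    {"tip_spots": ["A1", "B1", "C1", "D1", "E1", "F1", "G1", "H1"]},
    error,
    spare_spots,
)
print(new_kwargs)
```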

here’s another question to ponder: if the error handler raises an error, should that be raised to the user making the original call? E.g. “max retry attempts” reached? Or should it be the last error that was raised by the function when attempting execution?
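
one possible answer, sketched with python's exception chaining: the handler raises its own error and attaches the last underlying error via `raise ... from ...`, so the user sees both. `MaxRetriesExceededError` and `with_retries` are hypothetical names.

```python
import asyncio


class MaxRetriesExceededError(Exception):
  """Hypothetical error raised by the handler itself."""


async def flaky():
  raise TimeoutError("pickup timed out")


async def with_retries(fn, max_attempts=3):
  last_error = None
  for _ in range(max_attempts):
    try:
      return await fn()
    except Exception as error:
      last_error = error
  # `raise ... from ...` stores the last underlying error in __cause__,
  # so both errors appear in the traceback.
  raise MaxRetriesExceededError(f"gave up after {max_attempts} attempts") from last_error


async def main():
  try:
    await with_retries(flaky)
  except MaxRetriesExceededError as error:
    return error


caught = asyncio.run(main())
print(repr(caught), "caused by", repr(caught.__cause__))
```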

@CamilloMoschner: your description of the wrapper is starting to convince me. The generalizable error_handler looks right and does look easy/approachable to work with.

I’m still struggling to picture a full implementation and I think my problem is trying to guess at which level the error may occur. Should each lh (or machine) type have a list of possible errors? Having a context manager to be aware of a full script is nice so you can front load all of the error handling after developing the script. For example, I may start a new script by playing around in a notebook where I quickly catch/fix errors (ex. I noticed a longer step had a timeout issue so I just hardcoded a longer timeout). When I lock it in and want to make it into a succinct method and/or .py script, I would be able to add all the error handling upfront for the whole code and then let it run using the existing code. Using the wrapper, I would have to rewrap each step/function in an error_handler.

I wonder how the OEM software handles errors/retries?

@harley, I’m glad; it is quite difficult to get this right because error handling is just very complex.
But when things are complex, they must be extremely explicit - no hiding away of functionality; otherwise nobody knows what is happening (when commands are working, and worse, when commands are failing).

I will continue on a full implementation but I don’t have access to a STAR for the next 4-5 weeks.
(I’m moving jobs - I’ll finally take on an automation role :sweat_smile: )
Please feel free to work on an implementation in the meantime if you want to!

You can easily check this if you still have a Windows PC and the free VENUS 4.5: open VENUS, generate a basic deck layout and an aspirate command, the error handling options for 1 single command are buried inside the “Error Settings” tab:

→ yes, every command has to be told the specific error and the specific action of how to respond to it… and of course, it is up to you, the automator to keep all the information of all your “tab-buried” error handlers for every command in the entire automated protocol in your mind :smiling_face_with_tear:
(and don’t forget: there is no “Undo” button/Ctrl+Z, nor is git-based version control possible due to the binary nature of the files)

…I think it is fair to say that any PLR error handling implementation we’ve been discussing in this thread will massively accelerate automated protocol development time, and be more explicit.


this is the fundamental tradeoff between readability/having the error handler and function close, or having it apply broadly. i think maybe we should support both and leave it up to the user what they want to do. (as camillo points out, the context manager can be dangerous)


Yes, and more:
PLR already has some nice error abstraction: e.g. ChannelizedError is a pretty broad PLR error that captures various firmware errors (e.g. Z-drive error [i.e. crash], NotEnoughLiquid, …).
Since different errors are raised by the firmware of different machines we most likely require - as you point out - a list of errors 1) per machine, and 2) per command.
The good news: we already have 1) the list of errors per machine … they are inside the machine backend :slight_smile:
(and can usually be found very easily in the firmware specification sheets that OEMs provide [if they do provide them, which many do: e.g. Inheco, Mettler Toledo, Brooks Automation, Azenta, … :heart: i.e. the companies that want to sell you their machines :joy:])

Yes, I completely agree: that would be wonderful, and I believe this was @rickwierenga’s initial idea when proposing the context handler/option 1 approach.
The problem is that this would require an additional step:
in addition to the error_type specification and the error_handling_response specification, which are supplied by the programmer and are required for all implementations, the context handler would require autonomous error identification!

Imagine specifying a context handler, then writing a tip pickup and an aspirate function: how does the context handler know whether a Z-drive error was raised by tip pickup or aspirate, and then correctly map a predefined error_handling_response to the command that caused the error?

This is incredibly complicated, and if one gets it wrong it is very dangerous.
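
To illustrate what that mapping would have to look like, here is a sketch in which the response is looked up by both the command name and the error type. It assumes the command name is known at the point where the error is caught, which is exactly the information a broad context handler would need to obtain; `RESPONSES`, `run_command` and `ZDriveError` are hypothetical names.

```python
import asyncio


class ZDriveError(Exception):
  """Stand-in for a firmware Z-drive error."""


# Hypothetical table keyed by (command name, error type), so the same
# error type can get a different response per command.
RESPONSES = {
  ("pick_up_tips", ZDriveError): "try next tip spot",
  ("aspirate", ZDriveError): "stop and ask operator",
}


async def run_command(name, command):
  """Run a command and look up the response by command name and error type."""
  try:
    return await command()
  except Exception as error:
    response = RESPONSES.get((name, type(error)))
    if response is None:
      raise
    return response


async def crash():
  raise ZDriveError("Z-drive blocked")


responses = [
  asyncio.run(run_command("pick_up_tips", crash)),
  asyncio.run(run_command("aspirate", crash)),
]
print(responses)
```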

So I would love for this feature to exist, but based on my understanding of context handlers (which is limited!) I believe it is not a time/maintainability/benefit-efficient approach.

see also this post: Clarifying / revising error handling


i am actually more and more leaning towards this implementation: error handling: option 3 (PoC) by rickwierenga · Pull Request #542 · PyLabRobot/pylabrobot · GitHub as a way to accomplish all of the above

my reasoning:
a handler for every possible current and future PLR method is equivalent to a handler for every method in every python program. this universality (e.g. passing the method and kwargs separately) comes at a cost: the type checker does not work anymore. you really have to change your code a lot to use it. it’s very non-pythonic.

the implementation in PR 542 is roughly the same, except the error handlers are passed directly to the method (meaning you still call the method as usual). it has the same power as camillo’s wrapper (the decorator here plays the same role as the wrapper, easy to see), while being more user friendly and pythonic. it’s also very easy to add to methods, every front end would just have to have @with_error_handler. granted, it’s a little more work than a plain universal wrapper but imo it’s worth it.
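
a minimal sketch of how a decorator like this could work. this is not the code from PR 542: the handler signature `handler(func, error, *args, **kwargs)` and `ToyLiquidHandler` are assumptions made for illustration.

```python
import asyncio
import functools


def with_error_handler(method):
  """Add an optional `error_handler` keyword argument to an async method."""

  @functools.wraps(method)
  async def wrapper(self, *args, error_handler=None, **kwargs):
    try:
      return await method(self, *args, **kwargs)
    except Exception as error:
      if error_handler is None:
        raise
      # The handler receives the bound method, the error and the original
      # arguments, so it can retry with modified arguments.
      bound = functools.partial(wrapper, self)
      return await error_handler(bound, error, *args, **kwargs)

  return wrapper


class ToyLiquidHandler:
  """Hypothetical front end used only for this sketch."""

  @with_error_handler
  async def pick_up_tips(self, tip_spot):
    if tip_spot == "A1":  # pretend A1 is empty
      raise RuntimeError(f"no tip at {tip_spot}")
    return f"picked up {tip_spot}"


def try_next_tip_spot(spots):
  async def handler(func, error, tip_spot, **kwargs):
    # Retry without a handler, so a second failure is raised to the caller.
    return await func(next(spots), **kwargs)
  return handler


async def main():
  lh = ToyLiquidHandler()
  return await lh.pick_up_tips("A1", error_handler=try_next_tip_spot(iter(["B1"])))


result = asyncio.run(main())
print(result)
```

the call site stays `lh.pick_up_tips(...)`, so removing the handler is a one-argument diff.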
