Structured Concurrency

I want to make an argument for moving the PLR internals toward Structured Concurrency. In short, that is a relatively new programming paradigm (8 years old, so not that new anymore) which makes reasoning about concurrent tasks (such as waiting for instrument command responses) much easier, thereby greatly reducing bugs. The linked document by Nathaniel Smith is probably the origin of the term; he is the author of the Python trio library (a mostly incompatible alternative to asyncio). He argues that all other concurrency primitives should really be banned (like goto is banned nowadays), which is why trio intentionally doesn't support them. However, for some projects, abandoning asyncio is not an option for various reasons. Luckily, the anyio library (inspired by, and modelled after, trio) re-implements structured concurrency on top of either event loop (asyncio and trio), so that would be the path to take.
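To make the idea concrete, here is a minimal sketch (not PLR code) of the core primitive in anyio: concurrent tasks can only be started inside a task group, and the async with block cannot be exited until every task inside it has finished or been cancelled, so no task can silently outlive its scope.

import anyio

async def wait_for_response(name):
    # stand-in for awaiting an instrument's reply
    await anyio.sleep(1)
    print(f"{name} responded")

async def main():
    # the task group scopes both tasks: this block only exits once
    # both have completed, failed, or been cancelled
    async with anyio.create_task_group() as tg:
        tg.start_soon(wait_for_response, "pump")
        tg.start_soon(wait_for_response, "shaker")
    print("all responses are in")

anyio.run(main)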

What problem does it solve concretely? Timeouts, cancellation, OS resource management, and error handling. Currently, PLR tends to hang in many cases when an unexpected response comes from a machine. A failed setup call sometimes cannot be recovered by a stop call and may require restarting the process (i.e. the interpreter or kernel). On Linux, with the STAR backend, I quickly run into "OSError: Too many open files" because of the many asyncio event loops that are being started and stopped. Sure, all of these issues can be solved one by one, but I argue that most of them would have been avoided in the first place if structured concurrency had been used.

So what will it entail? Mostly, adaptation would have to happen within the hardware communication loops. However, structured concurrency inherently means that each such loop must live within some async context manager (structuring the lifetime of the concurrent tasks). Thus, the preferred API for enabling communication with hardware would move away from setup/stop to context managers. I anticipate push-back on this, primarily because of interactive (notebook) use. There is a solution: setup/stop availability could be restored by having a (hidden) global object lifecycle context manager. Does that sound scary? It should: global state is a scary thing. It is required because setup/stop implicitly relies on global state in all of those detached futures which become managed by structured concurrency. By making this explicit in that global object lifecycle manager, we at least gain control over it.
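To sketch what that hidden lifecycle manager could look like (the names here are hypothetical, not existing PLR API): setup() would enter the backend's async context manager on a module-level contextlib.AsyncExitStack, and stop() would unwind it, so notebook users keep the familiar call pattern while the lifetime still has one explicit owner.

import contextlib

# hypothetical module-level stack that owns every entered backend context
_lifecycle = contextlib.AsyncExitStack()

async def setup(backend):
    # enter the backend's async context manager and keep it open on the
    # global stack, restoring the setup()-style API for notebooks
    return await _lifecycle.enter_async_context(backend)

async def stop():
    # unwind every context entered via setup(), newest first
    await _lifecycle.aclose()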

What are the feelings from the community? Should I attempt a PoC?


Good writeup.

One thing I found helpful was looking at where PLR actually uses asyncio.create_task() and ensure_future(): it's a handful of specific places (telemetry, pump array, BioTek shaking, visualizer, Inheco SiLA). The vast majority of PLR's async code is sequential await. So the structured concurrency question might be more focused than it first appears: scoping those specific background tasks rather than restructuring the whole async architecture.

Have you looked at asyncio.TaskGroup (Python 3.11+)? It brings structured concurrency to asyncio natively, no need for anyio. Could be a lighter path for scoping those specific background tasks to the device lifetime.
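For reference, a minimal sketch of what that could look like (hypothetical device code, not PLR's):

import asyncio

async def telemetry_loop():
    # hypothetical background poller
    while True:
        await asyncio.sleep(1.0)

async def run_protocol():
    await asyncio.sleep(0.1)  # stand-in for the actual protocol work

async def run_device():
    async with asyncio.TaskGroup() as tg:
        poll = tg.create_task(telemetry_loop())
        await run_protocol()
        poll.cancel()  # scope the poller to the protocol's lifetime
    # leaving the block guarantees the poller is gone

asyncio.run(run_device())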

On the notebook/setup-stop question, I looked at how other lab automation frameworks handle this. Bluesky/ophyd, QCoDeS, and QMI all use setup/connect + stop/close for device connections, not context managers. The context manager pattern lives at the experiment level (Bluesky’s RunEngine, QCoDeS’s Measurement context manager). That might be worth considering for where the scoping boundary should sit.

Thanks for the proposal and post!

setup failing because of a failed setup call before it is a problem that should be solved by backends. The idea is that you should be able to call setup again if the previous call failed, and it should recover. But not all backends are written/tested to support that behavior.

One concrete problem I am thinking about is people often have notebook cells like this:

star = STARBackend()
await star.setup()

This might fail for some reason, and then they will run the cell again. Setup then fails because the previous instance of STARBackend, which is now lost, is still holding the connection. The SC alternative is

async with STARBackend() as star:
  ...

The goal here with SC would be that a failed setup triggers the shutdown automatically (through the context manager), leaving the device in a clean state, if I understand correctly? The problem SC addresses is that it makes it impossible to NOT close the connection/state after it fails.
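A sketch of how a backend could guarantee that (assuming setup/stop keep their current meaning; this is not the actual PLR implementation):

class STARBackend:
    async def setup(self):
        ...  # open the connection (existing behaviour)

    async def stop(self):
        ...  # close the connection (existing behaviour)

    async def __aenter__(self):
        try:
            await self.setup()
        except BaseException:
            # __aexit__ never runs if __aenter__ raises, so clean up here:
            # a failed setup can never leave a half-open connection behind
            await self.stop()
            raise
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.stop()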

First of all, I think context managers are a nice thing to support for this purpose.

yes, this is an idea that has been discussed before (maybe not on a public page…) that I am in favor of. I think the io layer should keep a reference to all io objects so it can close them if necessary. Right now it is possible to create an io object, open a connection, and then lose the reference to it while the connection is still open…

This sounds like a terrible idea. One could imagine the io layer "auto-recovering" in the problematic notebook example above. The io connections are all uniquely identified, so the io object could look in a cache map for an already-open connection in this process and, rather than trying to create a new one, just grab the existing one.
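A sketch of that cache idea (hypothetical names; it assumes each connection is keyed by a unique device identifier):

class Connection:
    # stand-in for an io-layer object (e.g. a USB/serial handle)
    def __init__(self, device_id: str):
        self.device_id = device_id
        self.closed = False

# hypothetical process-wide map of open connections
_open_connections: dict[str, Connection] = {}

def get_connection(device_id: str) -> Connection:
    # reuse an existing open connection for this device (e.g. when the
    # previous owner lost its reference) instead of failing to reopen
    conn = _open_connections.get(device_id)
    if conn is None or conn.closed:
        conn = Connection(device_id)
        _open_connections[device_id] = conn
    return conn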

The OS resource management question is probably just one of several issues of this type, so storing the connections globally is at best a partial fix.

going into technical stuff, the STAR should spawn a thread when it needs to read (and stop it when it’s done reading)

:grimacing:

before we get too deep into this, I should point out a really annoying thing, which is that the cytation camera driver (spinnaker) requires python 3.10 …

yes I remember that. I was actually thinking about how to get around that. also don't like downloading the sdk. I found this: harvesters · PyPI but don't know if that would work

if we could find a way to get rid of it I'm all for it! it's a really nasty thing. but we are getting distracted, topic for another post :slight_smile:


Have you looked at asyncio.TaskGroup (Python 3.11+)? It brings structured concurrency to asyncio natively, no need for anyio. Could be a lighter path for scoping those specific background tasks to the device lifetime.

Yeah, people are starting to realise that structured concurrency is a good thing. However asyncio, while moving towards structured concurrency, is still a far cry from it. One thing it is missing is "level cancellation"; it uses "edge cancellation" instead. This means that you cannot reliably cancel something "from the outside", which is essential for good error behaviour of asyncio.TaskGroup. If I remember correctly, anyio.CancelScope works around this by repeatedly cancelling as needed, but it can only do that when it gets a chance to. As Nathaniel writes, structured concurrency only gives you its guarantees if everything you call into follows the rules. That's why he advocates banning "nonconforming" primitives from the language, the same way goto is banned. With asyncio, the banned primitives remain available, but if everything you call into uses anyio only, your code is safe. The added benefit of anyio is that people can run it on trio too, where they do in fact get that guarantee - provided nothing in their application relies on asyncio.
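For illustration, this is how a deadline is imposed from the outside in anyio (a sketch; receive_reply is a hypothetical coroutine function):

import anyio

async def read_with_timeout(receive_reply):
    # the scope stays cancelled once the deadline passes (level
    # cancellation), so the task cannot escape at a later checkpoint
    with anyio.move_on_after(5) as scope:
        return await receive_reply()
    if scope.cancelled_caught:
        raise TimeoutError("no response from the instrument within 5 s")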

I looked at how other lab automation frameworks handle this. Bluesky/ophyd, QCoDeS, and QMI all use setup/connect + stop/close for device connections, not context managers.

People stick to their habits; that doesn't mean those habits are necessarily good. It took ~20 years after Dijkstra's letter until widespread acceptance that goto is bad.

The context manager pattern lives at the experiment level (Bluesky’s RunEngine, QCoDeS’s Measurement context manager). That might be worth considering for where the scoping boundary should sit.

Even without structured concurrency, it's good practice to explicitly scope resource acquisitions. If you forget to call stop, you leak your libusb resource and have to clear it by stopping the process or physically removing the USB device. Scoping guarantees that there is no exit path which misses the exit action. With structured concurrency, task-group-like scopes are the only way you can run a background task. So if any of your setup/connect needs to start a background task that outlives setup/connect, then it needs to happen in a scope that outlives setup/connect, as in the sketch below.
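Concretely, the pattern that forces on you is to hand setup the outer scope it may start tasks in (a sketch with hypothetical names):

import anyio
from anyio.abc import TaskGroup

class Backend:
    async def _read_loop(self):
        while True:
            await anyio.sleep(1)  # stand-in for reading from the device

    async def setup(self, task_group: TaskGroup):
        # the reader must outlive setup(), so it is started in a
        # caller-owned task group rather than as a detached future
        task_group.start_soon(self._read_loop)

async def main():
    backend = Backend()
    async with anyio.create_task_group() as tg:
        await backend.setup(tg)
        ...  # run the protocol
        tg.cancel_scope.cancel()  # tear down the reader on the way out

anyio.run(main)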

setup failing because of a failed setup call before it is a problem that should be solved by backends. The idea is that you should be able to call setup again if the previous call failed, and it should recover. But not all backends are written/tested to support that behavior.

Sure. But writing such code is quite nontrivial, e.g. considering exceptions along the way. With structured concurrency, it becomes much easier.

going into technical stuff, the STAR should spawn a thread when it needs to read (and stop it when it’s done reading)

That is what it is currently doing, and creating a new asyncio event loop with each thread. I believe that is the reason for the leaked/hogged file handles on Linux. But with structured concurrency, you should not have to spawn a thread in the first place. Multithreading/processing is for compute-bound work, not for waiting on I/O. If you only have blocking I/O libraries at hand to communicate with, then yes, you need a thread. But then you would use await anyio.to_thread.run_sync(read_something) just for that I/O call, and anyio.to_thread manages a thread pool for you so you don't end up spraying threads all over.
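In code, that looks roughly like this (blocking_read and the handle are hypothetical stand-ins for whatever blocking library is in use):

from anyio import to_thread

def blocking_read(handle):
    # hypothetical blocking read from a device handle (e.g. pyserial)
    return handle.read(64)

async def read_response(handle):
    # only the blocking call is shipped off to anyio's pooled worker
    # threads; no extra event loop, no hand-rolled thread lifecycle
    return await to_thread.run_sync(blocking_read, handle)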

I found this: harvesters · PyPI but don't know if that would work

I have worked with several different cameras through that API before. Can't say I'm super happy with it, but it worked ok for our cameras. It relies on a closed-source binary backend too, though (genicam), which, at the time, could only be compiled and released after the genicam committee decided to do so (each time), which also meant that it was often lagging one to two Python minor versions behind. I haven't worked with it in the last two years though, so perhaps things have improved.

That is what it is currently doing, and creating a new asyncio event loop with each thread.

a bandage on the larger problem here, but this fixed that particular problem:

(might move to asyncio.Task later, similar to some new backends)
