Thoughts on having a compact "save format" for a deck/a resource that isn't just the built-in .serialize?

aferrari-unt · May 28, 2026, 9:42pm

The current .serialize() / .save() on Resource does a full recursive dump of the entire resource tree (every child, every coordinate, every well). For a HamiltonSTARDeck with a single 96-well plate on a carrier this comes out to ~3,600 lines / 124KB of JSON.

A fully loaded STAR deck (~40 plates) is roughly 5.5MB.

This is fine for archiving, but it’s too much if you want to e.g. sync deck state over a network, store a layout in a database, or version-control your deck setup.

It also looks like most of the lines are redundant: the well coordinates, sizes, and geometry are already encoded in the PLR resource class definitions. If the deserializer has access to the same PLR version, all it really needs is:

{
  "type": "pylabrobot.resources.hamilton.STARDeck",
  "children": [{
    "type": "pylabrobot.resources.hamilton.plate_carriers.PLT_CAR_L5AC_A00",
    "name": "first_carrier",
    "rails": 7,
    "children": [{
      "slot": 0,
      "type": "pylabrobot.resources.Eppendorf_96_wellplate_250ul_Vb_semiskirted",
      "name": "cell_plate_1"
    }]
  }]
}

That’s ~200 bytes for the same deck and it round-trips cleanly as long as both sides are on the same PLR version.

Would it make sense to add an optional compact=True mode to .serialize() that serializes by class name + position rather than full geometry? And a matching load_compact() that reconstructs by instantiating the classes?
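For illustration, here is a minimal sketch of what such a pair of functions could look like. ToyResource, serialize_compact and load_compact are hypothetical stand-ins and not the PLR API; a real loader would look up the "type" string and instantiate the matching PLR class to recover the geometry.

```python
import json
from dataclasses import dataclass, field


# Hypothetical stand-in for a PLR resource; NOT the real pylabrobot API.
@dataclass
class ToyResource:
    type_name: str                                 # class/factory that defines the geometry
    name: str
    position: dict = field(default_factory=dict)   # e.g. {"rails": 7} or {"slot": 0}
    children: list = field(default_factory=list)


def serialize_compact(resource: ToyResource) -> dict:
    """Emit only type, name, position and children; no geometry."""
    out = {"type": resource.type_name, "name": resource.name, **resource.position}
    if resource.children:
        out["children"] = [serialize_compact(c) for c in resource.children]
    return out


def load_compact(data: dict) -> ToyResource:
    """Rebuild the tree from the compact form (geometry would come from the class)."""
    reserved = {"type", "name", "children"}
    return ToyResource(
        type_name=data["type"],
        name=data["name"],
        position={k: v for k, v in data.items() if k not in reserved},
        children=[load_compact(c) for c in data.get("children", [])],
    )


plate = ToyResource("Eppendorf_96_wellplate_250ul_Vb_semiskirted", "cell_plate_1", {"slot": 0})
carrier = ToyResource("PLT_CAR_L5AC_A00", "first_carrier", {"rails": 7}, [plate])
deck = ToyResource("STARDeck", "deck", {}, [carrier])

compact = serialize_compact(deck)
restored = load_compact(json.loads(json.dumps(compact)))
```

The round trip only preserves what the compact form records, so anything that deviates from the class definition would be lost silently.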

Is there demand for this? Or does this go against PLR’s philosophy? rails is Hamilton-specific, so this feature would need to be somewhat backend-specific (but serialize already is, with with_teaching_rack and core_grippers).

burnpanck · May 29, 2026, 9:38am

I also noticed that. From my point of view, the issue is in fact more fundamental. In the current resource model, the “model” of a resource is mostly an afterthought, tucked away in a single string. That makes it very difficult to reason about types of resources: every instance of a resource appears like a unique identity. So for example, it is very difficult today to build a system that tracks inventory of items by type, or to catalog the set of “available”, “compatible” or “preferable” labware. Sure, you can refer to the PLR method or the type string (which is usually the name of that method), but this isn’t easily introspectable. You can create a “test instance” to inspect the geometry of a particular plate, but from the data model perspective, there is nothing that guarantees that the next instance you create will have the same geometry.

I would probably have preferred a data model where instances of labware (as Python object instances) only carry the truly instance-specific data, like parent-child association, including identity and geometry of these relations. The “model geometry” would be stored in a separate object, which could be the Python type object of that instance (i.e. class GrainerDeepWellPlatePN12345(Plate):), or it could be an instance of a separate PlateGeometry class.

With that change, one still would have to adjust the serialization format, e.g. to make use of JSON $ref references to deduplicate the shared geometry data, but at least that more compact serialisation format would then be well-defined. Today, implementing compact=True would either risk discarding geometry information (because the data model doesn’t constrain any resource from deviating from its model), or would require expensive deep equality checks for every resource (and there is still no well defined ground truth to compare against!).

aferrari-unt · May 30, 2026, 2:22am

because the data model doesn’t constrain any resource from deviating from its model

I’m not entirely sure what you mean by this. Can you point to another system where this is the case? Isn’t a solution to just add a hash to the example serialization I have above which just checks if the full resource hashes the same way?

Regarding json refs, are you thinking of adding the full serialization of every individual resource at the bottom of the serialization let’s say, and then have instantiations of that resource reference the full serialization? This way we get both a check on “is this the same resource as the original serialization” (also provided by the hash) and “what is the original resource like”?

burnpanck · May 30, 2026, 8:36pm

In the current data model, the type attribute of a resource is fully independent of e.g. its size. You can have two resources with the same value of type, but a different size. The obvious resolution makes type an object containing the size information. Then, if you find two instances whose type is identical, you are guaranteed that they have the same size, because it is the same information.
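A minimal sketch of that idea, assuming hypothetical PlateModel and PlateInstance classes (neither exists in PLR under these names, and the dimensions are illustrative values only):

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical per-model object: immutable, shared by all instances of one labware model.
@dataclass(frozen=True)
class PlateModel:
    name: str
    size_x: float
    size_y: float
    size_z: float


# Hypothetical per-instance object: carries only what differs between physical plates.
@dataclass
class PlateInstance:
    name: str
    model: PlateModel

    @property
    def size(self) -> Tuple[float, float, float]:
        # Size is read from the shared model; an instance cannot deviate from it.
        return (self.model.size_x, self.model.size_y, self.model.size_z)


# Illustrative dimensions, not taken from any labware definition.
deep_well = PlateModel("example_deep_well_plate", 127.76, 85.48, 44.0)

plate_a = PlateInstance("cell_plate_1", deep_well)
plate_b = PlateInstance("cell_plate_2", deep_well)

same_model = plate_a.model is plate_b.model
```

Because size is not stored on the instance at all, two instances with the same model object cannot disagree about it.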

This design separates the per-instance information from the per-model information into two separate Python objects. One design difficulty is that the current resource model has a nice geometric simplicity in that the location of each entity is relative to its parent. But e.g. for wells within plates, that relative location is part of the per-model information of the parent, whereas for plates within a deck, the relative location is part of the per-instance information of the child. You do still want to make the relative location accessible on the instance uniformly. There are a number of clean ways to solve this though.
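One possible way to do that, sketched under assumptions: the plate stores its own location per instance, while a well derives its location from the parent's model, and both expose the same location attribute. PlateModel, Plate and Well here are hypothetical classes with made-up offsets, not PLR code.

```python
from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float, float]


# Hypothetical per-model data: well positions relative to the plate origin.
@dataclass(frozen=True)
class PlateModel:
    a1_offset: Vec   # offset of well A1 (illustrative values below)
    pitch: float     # centre-to-centre well spacing

    def well_offset(self, row: int, col: int) -> Vec:
        x0, y0, z0 = self.a1_offset
        return (x0 + col * self.pitch, y0 - row * self.pitch, z0)


@dataclass
class Plate:
    name: str
    model: PlateModel
    location: Vec    # relative to the parent: per-instance information of the child

    def well(self, row: int, col: int) -> "Well":
        return Well(self, row, col)


@dataclass
class Well:
    parent: Plate
    row: int
    col: int

    @property
    def location(self) -> Vec:
        # Relative to the parent plate: per-model information of the parent.
        return self.parent.model.well_offset(self.row, self.col)


model = PlateModel(a1_offset=(14.0, 74.0, 1.0), pitch=9.0)
plate = Plate("cell_plate_1", model, location=(100.0, 63.0, 100.0))
well = plate.well(row=1, col=2)
```

Callers read plate.location and well.location the same way, even though one is stored on the instance and the other is computed from the shared model.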

burnpanck · May 30, 2026, 8:46pm

Isn’t a solution to just add a hash to the example serialization

Not easily. The individual wells within a plate do not hash equally: They are at a different location. They have a different name. Because the resource model doesn’t distinguish what should be constant across different instances of the same model from what is allowed to vary, you don’t know generically what to hash and what to retain.
You can of course add that information in, but then you still don’t solve the other use cases for having per-model information. You force every use case that intends to work at the model level to deep-compare/hash every single instance every time, with a high risk of introducing skew between what is being hashed and what is not.
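A small sketch of the problem, using plain dicts as stand-ins for serialized wells (the field names and values are made up for illustration). Hashing the full serialization distinguishes two wells of the same plate; hashing only "model" fields works, but only after someone has decided which fields those are.

```python
import hashlib
import json


def digest(data: dict) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding."""
    return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()


# Two wells of the same plate; illustrative field names and values.
well_a1 = {"name": "plate_A1", "type": "Well", "size_x": 6.0, "location": [14.0, 74.0, 1.0]}
well_b1 = {"name": "plate_B1", "type": "Well", "size_x": 6.0, "location": [14.0, 65.0, 1.0]}

# Hashing everything: the wells differ, because name and location differ.
full_equal = digest(well_a1) == digest(well_b1)

# Assumption: this list is a decision the current data model does not make for us.
MODEL_FIELDS = ("type", "size_x")


def model_digest(data: dict) -> str:
    return digest({key: data[key] for key in MODEL_FIELDS})


model_equal = model_digest(well_a1) == model_digest(well_b1)
```

Every consumer that maintains its own MODEL_FIELDS list is a place where the definition of "same model" can drift.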

Regarding json refs, are you thinking of adding the full serialization of every individual resource at the bottom of the serialization?

That depends on whether you want a geometric archival format (then it would be important to retain old model definitions) or a semantic description format. I just wanted to say that if you define a type object that is shared by all instances, blind JSON serialisation will still give you a copy of that type object at every instance. Recall that the model guarantee relies on the identity of the type, but bare JSON only encodes values, not identities. JSON ref is one way to encode an identity. There are many other ways that one can think of.
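To make the difference concrete, here is a rough sketch with plain dicts standing in for instances and a shared model object. The "#/models/m0" reference strings follow the JSON Pointer style commonly used with $ref, but the layout of the document and the function name are assumptions for illustration only.

```python
import json

# One shared model object, referenced by two instances (illustrative values).
shared_model = {"type": "ExampleDeepWellPlate", "size_x": 127.76, "size_y": 85.48}
instances = [
    {"name": "cell_plate_1", "model": shared_model},
    {"name": "cell_plate_2", "model": shared_model},
]

# Blind serialisation: the shared model is written out once per instance.
blind = json.dumps(instances)


def serialize_with_refs(items: list) -> dict:
    """Write each distinct model object once and refer to it by a transient key."""
    key_by_identity = {}   # id(model object) -> transient key
    models = {}
    out = []
    for item in items:
        model = item["model"]
        key = key_by_identity.get(id(model))
        if key is None:
            key = f"m{len(key_by_identity)}"
            key_by_identity[id(model)] = key
            models[key] = model
        out.append({"name": item["name"], "model": {"$ref": f"#/models/{key}"}})
    return {"models": models, "instances": out}


deduplicated = serialize_with_refs(instances)
compact = json.dumps(deduplicated)
```

The keys m0, m1, ... carry no meaning outside this one document; they only restore the identity that bare JSON values cannot express.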

vcjdeboer · May 31, 2026, 4:35pm

Really like where this is going. The per-model vs per-instance split would clean up a lot in practice. On the variety side: the “same kind of thing” in the lab varies along several axes (geometry, material, surface treatment, sometimes even lot for binding-sensitive assays), and every consumer of the data currently has to reinvent what counts as “the same model” themselves. Having that decision live in one place would be a real win.

Could you sketch a small code example of what the type object and an instance pointing at it might look like?

burnpanck · June 5, 2026, 9:25pm

Could you sketch a small code example of what the type object and an instance pointing at it might look like?

Here you go: Sketch of a data model where resources are split into instance-specific and model-specific classes · GitHub

vcjdeboer · June 6, 2026, 5:56am

This is great.

Pulling geometry into the model object means the type name finally means something, instead of being a label nobody checks. Only open question is how global_id gets assigned. Content-derived so identical models collide on the same id, right?

burnpanck · June 6, 2026, 7:16am

Only open question is how global_id gets assigned. Content-derived so identical models collide on the same id, right?

I wouldn’t do that. Conceptually, it should uniquely identify exactly one orderable item (i.e. equivalent to the combination of manufacturer, mpn, and perhaps a production version/date), independent of how you represent that thing in PLR. It is ok if two things that look the same in PLR have different IDs. Furthermore, it enables reasoning about updates to the PLR representation that refer to the same thing. Say we encode the geometry of wells in a more detailed fashion. The representation changes, but it is still the same thing. If you receive a serialisation of some dataset/protocol/etc. where the model representation looks different from what you have locally, you have a conflict that carries semantic value: both intend to describe the same reality, so one of them is probably more correct, at least for a specific aspect. You may want to replace the model representation with the more accurate one. Or, you may want to re-instantiate with the exact same model representation to ensure bitwise reproducibility (which depends on having the exact same PLR version though).

After this discussion, I realise that we actually want a global_model_id as part of the “Model” data model with the semantic meaning described above. In the serialisation however, we don’t refer to the global_model_id, but to a model_id which carries no semantic value other than to refer to a specific PLR representation of a model (including its semantic identity encoded in global_model_id). Outside of serialisation, in a running PLR program, that identity is instead encoded simply through Python object identity.
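A minimal sketch of that distinction, under assumptions: Model and serialize are hypothetical, and the global_model_id string is a made-up example. The model_id is assigned ad hoc from object identity while writing, so two different representations of the same global_model_id end up under different keys.

```python
from dataclasses import asdict, dataclass


# Hypothetical immutable model object; global_model_id is its semantic identity.
@dataclass(frozen=True)
class Model:
    global_model_id: str
    well_depth: float


def serialize(instances: list) -> dict:
    """instances: list of (instance name, Model) pairs."""
    model_id_by_identity = {}   # id(Model object) -> transient model_id
    models = {}
    items = []
    for name, model in instances:
        model_id = model_id_by_identity.get(id(model))
        if model_id is None:
            model_id = f"m{len(model_id_by_identity)}"
            model_id_by_identity[id(model)] = model_id
            models[model_id] = asdict(model)
        items.append({"name": name, "model_id": model_id})
    return {"models": models, "instances": items}


# Two representations of the same orderable item (made-up id and depths).
older = Model("example-vendor:PN12345", well_depth=40.0)
newer = Model("example-vendor:PN12345", well_depth=41.3)

doc = serialize([("p1", older), ("p2", older), ("p3", newer)])
```

Here p1 and p2 share one entry in the models table, while p3 gets its own, even though all three describe the same global_model_id.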

robertbrown · June 15, 2026, 11:23am

I’d keep global_model_id for semantic identity and model_id strictly for serialization. That makes the save format compact while still allowing model updates, conflict detection, and reproducibility without coupling runtime object identity to the serialized data.

burnpanck · June 17, 2026, 9:09am

I think that is what I am saying: global_model_id for semantic identity beyond PLR object lifetime (i.e. it acts like a semantic UUID), and model_id only as a transient thing within a serialised dataset, derived from object identity.

However, if you are suggesting to also store a model_id in the PLR object at runtime, then I would argue otherwise. Like any API, the essential part of a data model is the semantic specification: what does it actually mean to have a specific value there? So what would model_id encode then, if we allow this to be an independent parameter? How do we handle a situation where we find two instances in a PLR program at runtime that share the same model_id, but in fact differ in some other aspect? If that is forbidden, then the only way to legally end up with the same model ID is to make an exact copy. There are zero benefits from making a copy in the first place, rather than sharing the instance. Specifically, none of the mentioned features relies on model_id:

model updates: Keep global_model_id, create a new model object instance with new data (model objects are immutable, so you are forced to create a new model instance even if you do model updates in-process).
reproducibility: Requires the system to be capable of handling model instances that can be identified as referring to the same semantic entity (via global_model_id), while still not being equal. At runtime - easy. In serialisation, there needs to be a way for two serialisations of a single global_model_id to coexist, so we need to use a different key to deduplicate equal models. The key has no other value though.
no coupling of object identity to serialised data: There is none. Object identity is irrelevant. Deep equality is what matters. Object identity is just a quick way to determine deep equality.
conflict detection: model_id does not help with that, nor is there a need for it. A conflict is when two semantically equivalent model objects (instances) differ in some of their values. If you find two separate runtime instances with matching global_model_id, you need to deep compare the object to determine if the two objects are in conflict or not. There is no definition of model_id that can reliably guarantee object equality (other than a full serialisation of the model). Adding a redundant model_id is the classical SQL anti-pattern of data denormalisation: little value, but at risk of catastrophically destroying data model consistency.

Note that conflict detection is rarely necessary during runtime, because we tend to tear down the PLR process when we significantly change the model. So, the only time we may be interested in conflict detection in the first place is during deserialisation. With a compact serialisation format that doesn’t contain deep copies of model objects in the first place, that makes for very few deep comparisons.
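As a rough illustration of conflict detection at deserialisation time: the function below is a hypothetical sketch, with plain dicts standing in for model objects and a local registry keyed by global_model_id. It deep-compares only the few model entries that the compact file actually contains.

```python
def find_conflicts(incoming_models: dict, local_registry: dict) -> list:
    """Return the global_model_ids whose incoming representation differs from the local one.

    incoming_models: transient model_id -> model representation from the file
    local_registry:  global_model_id    -> locally known model representation
    """
    conflicts = []
    for representation in incoming_models.values():
        global_id = representation["global_model_id"]
        local = local_registry.get(global_id)
        # dict comparison in Python is a deep, value-based comparison.
        if local is not None and local != representation:
            conflicts.append(global_id)
    return conflicts


# Made-up ids and values for illustration.
local_registry = {
    "example-vendor:PN12345": {"global_model_id": "example-vendor:PN12345", "well_depth": 41.3},
    "example-vendor:PN67890": {"global_model_id": "example-vendor:PN67890", "well_depth": 10.0},
}
incoming_models = {
    "m0": {"global_model_id": "example-vendor:PN12345", "well_depth": 40.0},   # differs locally
    "m1": {"global_model_id": "example-vendor:PN67890", "well_depth": 10.0},   # identical
    "m2": {"global_model_id": "example-vendor:PN00000", "well_depth": 5.0},    # unknown locally
}

conflicts = find_conflicts(incoming_models, local_registry)
```

What to do with a conflict (adopt the incoming representation, keep the local one, or keep both for reproducibility) is a policy decision this sketch leaves open.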

IMHO there is semantic value in global_model_id and deep object equality, but nothing else. Python object identity is just a simple way to guarantee deep equality without expensive deep comparisons. In serialisation, we do want to use deep equality to reduce serialisation sizes and for that we need some way to refer to the “prototype” - an ad-hoc model_id is one way to do that, JSON $ref another. With that semantic model, there is no harm in having copies of model instances, other than increasing total data size, possibly both at runtime and in serialisation. But as long as we are not creating copies easily, this is a minor threat, and we can freely decide if and when to do object consolidation via deep object comparisons to reduce the storage footprint. I would argue it is easy not to create copies to the point that we never need to do object consolidation at all.