The current .serialize() / .save() on Resource does a full recursive dump of the entire resource tree (every child, every coordinate, every well). For a HamiltonSTARDeck with a single 96-well plate on a carrier this comes out to ~3,600 lines / 124KB of JSON.
A fully loaded STAR deck (~40 plates) is roughly 5.5MB.
This is fine for archiving, but it’s too much if you want to e.g. sync deck state over a network, store a layout in a database, or version-control your deck setup.
It also looks like most of the lines are redundant: the well coordinates, sizes, and geometry are already encoded in the PLR resource class definitions. If the deserializer has access to the same PLR version, all it really needs is:
That’s ~200 bytes for the same deck and it round-trips cleanly as long as both sides are on the same PLR version.
Would it make sense to add an optional compact=True mode to .serialize() that serializes by class name + position rather than full geometry? And a matching load_compact() that reconstructs by instantiating the classes?
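To make the proposal concrete, here is a minimal sketch of serializing by class name + position. It does not use the real PLR classes: Resource, Plate96, Deck, the REGISTRY lookup, the children_from_class flag, and the function names serialize_compact / load_compact are all hypothetical stand-ins, and the well pitch is only illustrative.

```python
import json

REGISTRY = {}  # hypothetical lookup: class name -> class


def register(cls):
    REGISTRY[cls.__name__] = cls
    return cls


class Resource:
    # True when the class constructor itself creates all children (e.g. wells).
    children_from_class = False

    def __init__(self, name, location=(0.0, 0.0, 0.0)):
        self.name = name
        self.location = tuple(location)
        self.children = []

    def assign(self, child):
        self.children.append(child)
        return child

    def serialize_full(self):
        # Full recursive dump, analogous to what the thread describes.
        return {
            "type": type(self).__name__,
            "name": self.name,
            "location": list(self.location),
            "children": [c.serialize_full() for c in self.children],
        }


@register
class Well(Resource):
    pass


@register
class Plate96(Resource):
    children_from_class = True

    def __init__(self, name, location=(0.0, 0.0, 0.0)):
        super().__init__(name, location)
        for row in range(8):
            for col in range(12):
                well_name = f"{name}_{'ABCDEFGH'[row]}{col + 1}"
                self.assign(Well(well_name, (col * 9.0, row * 9.0, 0.0)))


@register
class Deck(Resource):
    pass


def serialize_compact(res):
    # Record only class name, name and position; skip class-defined children.
    out = {"type": type(res).__name__, "name": res.name,
           "location": list(res.location)}
    if res.children and not res.children_from_class:
        out["children"] = [serialize_compact(c) for c in res.children]
    return out


def load_compact(data):
    # Rebuild by instantiating the class; the constructor restores geometry.
    res = REGISTRY[data["type"]](data["name"], tuple(data["location"]))
    for child in data.get("children", []):
        res.assign(load_compact(child))
    return res


deck = Deck("deck")
deck.assign(Plate96("plate_1", (100.0, 63.0, 100.0)))
compact = serialize_compact(deck)
restored = load_compact(compact)
print(len(json.dumps(deck.serialize_full())), len(json.dumps(compact)))
```

The round trip only works because the toy Plate96 constructor is the single source of its wells. Nothing in this sketch detects a plate whose wells were modified after construction, which is the gap discussed further down the thread.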
Is there demand for this? Or does this go against PLR’s philosophy? rails is Hamilton-specific, so this feature would need to be somewhat backend-specific (but serialize already is, with with_teaching_rack and core_grippers).
I also noticed that. From my point of view, the issue is in fact more fundamental. In the current resource model, the “model” of a resource is mostly an afterthought, tucked away in a single string. That makes it very difficult to reason about types of resources: every instance of a resource appears to be a unique identity. So for example, it is very difficult today to build a system that tracks inventory of items by type, or to catalog the set of “available”, “compatible” or “preferable” labware. Sure, you can refer to the PLR method or the type string (which is usually the name of that method), but this isn’t easily introspectable. You can create a “test instance” to inspect the geometry of a particular plate, but from the data model perspective, nothing guarantees that the next instance you create will have the same geometry.
I would probably have preferred a data-model, where instances of labware (as a python object instance) only carry the truly instance-specific data, like parent-child association, including identity and geometry of these relations. The “model geometry” would be stored in a separate object, which could be the python type object of that instance (i.e. class GrainerDeepWellPlatePN12345(Plate):), or it could be an instance of a separate PlateGeometry class.
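As a rough sketch of the second variant (a separate geometry object shared by instances), assuming hypothetical names WellGeometry, PlateModel and PlateInstance that do not exist in PLR, and purely illustrative dimensions:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class WellGeometry:
    offset: Vec        # position of the well relative to the plate origin
    diameter: float
    depth: float


@dataclass(frozen=True)
class PlateModel:
    # Per-model data: immutable and shared by every instance of this model.
    name: str
    size: Vec
    wells: Tuple[WellGeometry, ...]


@dataclass
class PlateInstance:
    # Per-instance data only: identity and placement.
    name: str
    model: PlateModel
    location: Optional[Vec] = None   # relative to the parent, None if unplaced

    @property
    def size(self) -> Vec:
        # Geometry is read from the model, never stored on the instance.
        return self.model.size


def make_96_well_model(name: str) -> PlateModel:
    # Illustrative numbers only, not taken from a real labware definition.
    wells = tuple(
        WellGeometry(offset=(14.38 + col * 9.0, 11.24 + row * 9.0, 1.0),
                     diameter=6.9, depth=10.7)
        for row in range(8)
        for col in range(12)
    )
    return PlateModel(name=name, size=(127.76, 85.48, 14.2), wells=wells)


model = make_96_well_model("ExampleDeepWellPlate")
plate_a = PlateInstance("plate_a", model, location=(100.0, 63.0, 100.0))
plate_b = PlateInstance("plate_b", model)
print(plate_a.model is plate_b.model, plate_a.size == plate_b.size)
```

Because PlateModel is frozen and shared, two instances pointing at the same model object cannot drift apart in geometry; that is the guarantee the current type string does not give.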
With that change, one still would have to adjust the serialization format, e.g. to make use of JSON $ref references to deduplicate the shared geometry data, but at least that more compact serialisation format would then be well-defined. Today, implementing compact=True would either risk discarding geometry information (because the data model doesn’t constrain any resource from deviating from its model), or would require expensive deep equality checks for every resource (and there is still no well defined ground truth to compare against!).
because the data model doesn’t constrain any resource from deviating from its model
I’m not entirely sure what you mean by this. Can you point to another system where this is the case? Isn’t a solution to just add a hash to the example serialization I have above which just checks if the full resource hashes the same way?
Regarding json refs, are you thinking of adding the full serialization of every individual resource at the bottom of the serialization let’s say, and then have instantiations of that resource reference the full serialization? This way we get both a check on “is this the same resource as the original serialization” (also provided by the hash) and “what is the original resource like”?
In the current data model, the type attribute of a resource is fully independent of, e.g., its size. You can have two resources with the same type value but different sizes. The obvious resolution is to make type an object that contains the size information. Then, if two instances have the identical type, you are guaranteed that they have the same size, because it is the same information.
This design separates the per-instance information from the per-model information into two separate Python objects. One design difficulty is that the current resource model has a nice geometric simplicity in that the location of each entity is relative to its parent. But for wells within plates, for example, that relative location is part of the per-model information of the parent, whereas for plates within a deck, the relative location is part of the per-instance information of the child. You still want the relative location to be uniformly accessible on the instance. There are a number of clean ways to solve this though.
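One possible way, sketched with hypothetical classes (PlateModel, Plate, WellView are not PLR names): a well is a lightweight view whose location property reads from the parent’s model, while a plate stores its location on the instance. Callers see the same attribute either way.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class PlateModel:
    # Per-model: where each well sits relative to the plate origin.
    well_offsets: Tuple[Vec, ...]


@dataclass
class Plate:
    model: PlateModel
    location: Vec   # per-instance: where this plate sits relative to its parent

    def well(self, index: int) -> "WellView":
        return WellView(plate=self, index=index)


@dataclass
class WellView:
    plate: Plate
    index: int

    @property
    def location(self) -> Vec:
        # Same attribute name as Plate.location, but looked up in the
        # parent's model instead of being stored on the well itself.
        return self.plate.model.well_offsets[self.index]


def absolute_location(well: WellView) -> Vec:
    # Parent-relative locations compose the same way for both kinds of data.
    px, py, pz = well.plate.location
    wx, wy, wz = well.location
    return (px + wx, py + wy, pz + wz)


model = PlateModel(well_offsets=((10.0, 10.0, 1.0), (19.0, 10.0, 1.0)))
plate = Plate(model=model, location=(100.0, 50.0, 0.0))
print(plate.well(1).location, absolute_location(plate.well(1)))
```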
Isn’t a solution to just add a hash to the example serialization
Not easily. The individual wells within a plate do not hash equally: They are at a different location. They have a different name. Because the resource model doesn’t distinguish what should be constant across different instances of the same model from what is allowed to vary, you don’t know generically what to hash and what to retain.
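A small illustration of the problem, using plain dicts as a stand-in for serialized resources (the field names and the INSTANCE_FIELDS set are assumptions made for this sketch, not something PLR defines):

```python
import hashlib
import json


def digest(obj) -> str:
    # Stable hash of a JSON-serializable structure.
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode()).hexdigest()


def make_plate(name, location):
    # Toy "full serialization" of a 96-well plate.
    return {
        "type": "Plate96",
        "name": name,
        "location": location,
        "size": [127.76, 85.48, 14.2],
        "children": [
            {
                "type": "Well",
                "name": f"{name}_well_{i}",
                "location": [9.0 * (i % 12), 9.0 * (i // 12), 0.0],
                "diameter": 6.9,
            }
            for i in range(96)
        ],
    }


# Assumption: fields that may vary between instances of the same model.
# The current data model does not declare this anywhere.
INSTANCE_FIELDS = {"name"}


def model_view(res, is_root=True):
    # "location" is per-instance for the root, but per-model for its children.
    skip = INSTANCE_FIELDS | ({"location"} if is_root else set())
    out = {k: v for k, v in res.items() if k not in skip and k != "children"}
    out["children"] = [model_view(c, is_root=False)
                       for c in res.get("children", [])]
    return out


plate_a = make_plate("plate_a", [100.0, 63.0, 100.0])
plate_b = make_plate("plate_b", [250.0, 63.0, 100.0])

print(digest(plate_a) == digest(plate_b))                          # naive hash
print(digest(model_view(plate_a)) == digest(model_view(plate_b)))  # filtered
```

The filtered hash only matches because model_view hard-codes which fields count as instance data, including the special case for the root location. That rule lives in the hashing code rather than in the resource model, which is where skew can creep in.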
You can of course add that information in, but then you still don’t solve the other use cases for having per-model information. You force every use case that intends to work at the model level to deep-compare or hash every single instance every time, with a high risk of introducing skew between what is being hashed and what is not.
Regarding json refs, are you thinking of adding the full serialization of every individual resource at the bottom of the serialization?
That depends on whether you want a geometric archival format (then it would be important to retain old model definitions) or a semantic description format. I just wanted to say that if you define a type object that is shared by all instances, blind JSON serialisation will still give you a copy of that type object at every instance. Recall that the model guarantee relies on the identity of the type, but bare JSON only encodes values, not identities. JSON ref is one way to encode an identity. There are many other ways that one can think of.
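To illustrate the identity point with the standard json module only: a naive round trip turns one shared model object into independent copies, while a small reference table restores the sharing. The "$ref": "#/models/…" shape follows the JSON Reference convention, but the helper functions, the document layout, and the resolution logic here are hand-written assumptions for this sketch.

```python
import json

plate_model = {"name": "Plate96", "size": [127.76, 85.48, 14.2]}
instances = [
    {"name": "plate_a", "model": plate_model},
    {"name": "plate_b", "model": plate_model},  # same object, not a copy
]

# Naive round trip: values survive, identity does not.
naive = json.loads(json.dumps(instances))


def dump_with_refs(items):
    models = {}     # model_id -> model dict, written once
    seen = {}       # id(model object) -> model_id
    out = []
    for inst in items:
        key = id(inst["model"])
        if key not in seen:
            seen[key] = f"m{len(seen)}"
            models[seen[key]] = inst["model"]
        out.append({"name": inst["name"],
                    "model": {"$ref": f"#/models/{seen[key]}"}})
    return json.dumps({"models": models, "instances": out})


def load_with_refs(text):
    doc = json.loads(text)
    result = []
    for inst in doc["instances"]:
        model_id = inst["model"]["$ref"].rsplit("/", 1)[-1]
        # Every instance with the same reference gets the same object back.
        result.append({"name": inst["name"], "model": doc["models"][model_id]})
    return result


restored = load_with_refs(dump_with_refs(instances))
print(naive[0]["model"] is naive[1]["model"])        # copies after naive load
print(restored[0]["model"] is restored[1]["model"])  # shared after ref load
```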
Really like where this is going. The per-model vs per-instance split would clean up a lot in practice. On the variety side: the “same kind of thing” in the lab varies along several axes (geometry, material, surface treatment, sometimes even lot for binding-sensitive assays), and every consumer of the data currently has to reinvent what counts as “the same model” themselves. Having that decision live in one place would be a real win.
Could you sketch a small code example of what the type object and an instance pointing at it might look like?
Pulling geometry into the model object means the type name finally means something, instead of being a label nobody checks. Only open question is how global_id gets assigned. Content-derived so identical models collide on the same id, right?
Only open question is how global_id gets assigned. Content-derived so identical models collide on the same id, right?
I wouldn’t do that. Conceptually, it should uniquely identify exactly one orderable item (i.e. equivalent to the combination of manufacturer, MPN, and perhaps a production version/date), independent of how you represent that thing in PLR. It is OK if two things that look the same in PLR have different IDs. Furthermore, it enables reasoning about updates to the PLR representation that refer to the same thing. Say we encode the geometry of wells in a more detailed fashion. The representation changes, but it is still the same thing. If you receive a serialisation of some dataset/protocol/etc. where the model representation looks different from what you have locally, you have a conflict that carries semantic value: both intend to describe the same reality, so one of them is probably more correct, at least for a specific aspect. You may want to replace the model representation with the more accurate one. Or you may want to re-instantiate with the exact same model representation to ensure bitwise reproducibility (which depends on having the exact same PLR version though).
After this discussion, I realise that we actually want a global_model_id as part of the “Model” data model, with the semantic meaning described above. In the serialisation, however, we don’t refer to the global_model_id, but to a model_id which carries no semantic value other than to refer to a specific PLR representation of a model (including its semantic identity encoded in global_model_id). Outside of serialisation, in a running PLR program, that identity is instead encoded simply through Python object identity.
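A sketch of how that split could look in a serialized document. The layout, the "representation" key, the example vendor ID, and the find_conflicts helper are all assumptions for illustration; only the two field names global_model_id and model_id come from the discussion above.

```python
doc = {
    # model_id ("m0") is local to this document and carries no meaning
    # beyond pointing at one specific representation.
    "models": {
        "m0": {
            "global_model_id": "example-vendor:PN12345",  # made-up identifier
            "representation": {"size": [127.76, 85.48, 14.2],
                               "well_count": 96},
        },
    },
    "instances": [
        {"name": "plate_a", "model_id": "m0",
         "location": [100.0, 63.0, 100.0]},
        {"name": "plate_b", "model_id": "m0",
         "location": [250.0, 63.0, 100.0]},
    ],
}

# Hypothetical local catalogue, keyed by the semantic identity. Here it holds
# a more detailed representation of the same orderable item.
local_models = {
    "example-vendor:PN12345": {"size": [127.76, 85.48, 14.2],
                               "well_count": 96, "well_depth": 10.7},
}


def find_conflicts(document, catalogue):
    # Same global_model_id but a different representation: both sides describe
    # the same physical item, so the difference is worth surfacing.
    conflicts = []
    for model in document["models"].values():
        local = catalogue.get(model["global_model_id"])
        if local is not None and local != model["representation"]:
            conflicts.append(model["global_model_id"])
    return conflicts


print(find_conflicts(doc, local_models))
```

What to do with a reported conflict (adopt the local representation, or keep the serialized one for reproducibility) is left to the caller, matching the two options described above.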
I’d keep global_model_id for semantic identity and model_id strictly for serialization. That makes the save format compact while still allowing model updates, conflict detection, and reproducibility without coupling runtime object identity to the serialized data.