On Primary Identifiers in FOLIO


#1

There has been a lot of concern about UUIDs as the primary identifier in certain circumstances. In the context of ongoing operations, there are a number of areas where people transact in identifiers in analog ways. The purpose of this thought piece is to call out a few types of identifiers that are in use in library systems, note some of their properties and what circumstances they are useful in, and to call for deliberate choices when choosing an identifier for any particular type of record based on the context in which it is used.

By way of context for this discussion, the ways in which libraries and their patrons and partners transact in identifiers are not to be minimized. For example, when people need to unambiguously reference a specific bibliographic record they will jot down that bib id or read it over the phone. This happens all the time: when troubleshooting, helping a patron, and in similar circumstances. In the context of migration, there is a concern that we already have identifiers for many records and the relationships (in the RDBMS sense) between records are already encoded with these, so translating to UUIDs is a potential source of error. Libraries also often use identifiers in constructing persistent identifiers for bib records or to coordinate with external systems. For example, bib ids are used in persistent identifiers in the public catalog for patrons to have persistent links, for professors to make reading lists, etc. Changing these, even their format, has repercussions outside of the library.

There are three basic types of identifiers that are of interest:

Human-defined codes
These are alphanumeric codes, typically short in length, which are defined by hand. There may be some system defaults for a specific code and the implementing site may define more.
These are useful when there are a relatively small number of such things. Item status codes, borrower types, location codes, and donor codes often fit this description. The codes often serve as a valuable local shorthand, whether for operational reference or for making SQL queries of the data. The display labels corresponding to the codes may change over time as the context of their display changes, but operational references to these codes are stable and staff learn them quickly.

Sequential identifiers
Identifiers which start with a number and automatically increment as assigned. These are used almost universally for bibliographic records, for example. (Variations would include accession numbers that might begin with a year.) These allow unambiguous reference over time by humans and between systems.

[Addition:
URIs
See post 7 below by @peter ]

UUIDs
128-bit integers with a particular pattern of alphanumeric display. These are convenient from the software perspective as they can be generated without reference to existing storage, but are miserable for humans to transact in. From that perspective, they are most suitable for cases where humans do not use the identifier directly. For example, transactional records like loans. A UUID makes good sense here: the humans looking for the loan will typically be looking for the patron or for the item in the loan, often by barcode in either case, and I believe rarely if ever need to make direct analog references to the loan record itself.
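
As a rough illustration of the difference (a Python sketch, not FOLIO code), the snippet below generates a random UUID with no reference to storage and contrasts it with a sequential identifier that depends on shared state. The `next_bib_id` helper and its in-memory counter are hypothetical stand-ins for a database sequence.

```python
import itertools
import uuid

# A random (version 4) UUID needs no lookup against existing records.
loan_id = uuid.uuid4()

# The same value in two forms: canonical text and the underlying integer.
text_form = str(loan_id)   # 32 hex digits in 8-4-4-4-12 groups, 36 characters
int_form = loan_id.int     # an integer that fits in 128 bits

# A sequential identifier, by contrast, depends on shared state.
# Hypothetical: an in-memory counter stands in for a database sequence.
_bib_counter = itertools.count(start=1)


def next_bib_id() -> int:
    """Return the next sequential identifier (hypothetical helper)."""
    return next(_bib_counter)


first_bib = next_bib_id()
second_bib = next_bib_id()

print("UUID to read over the phone:", text_form)
print("Bib ids to read over the phone:", first_bib, second_bib)
```

The printed values make the usability gap plain: the sequential ids are short enough to jot down, while the UUID is not.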

So far UUIDs have been used as the primary identifiers. One appealing feature of the UUID is that it is Universally Unique. However, in order to resolve any UUID in FOLIO, one would need to know what type of object it refers to and which tenant it belongs to in order to query the correct storage module. (Or to query each by brute force.) Any message pointing to an object would need to carry enough contextual information to identify the address of the storage module where that object can be found. So the universally unique attribute would seem to be less important, and the main concern would be uniqueness within that storage module.

There has been talk about accommodating human-readable IDs as a secondary identifier, tracked in parallel with the UUIDs. It’s not clear that having two parallel identifiers for record types that need something human-readable is desirable. It’s a form of double-entry of data. And for those operations where the human-readable identifier is useful, APIs would either need to support operations based on both identifiers, or there would always be some lookup involved. In short, we would be talking about adding human-readable identifiers as an afterthought for those records where they are needed, and that would seem to introduce some unnecessary complexity, assuming that the previous conclusion about uniqueness within the storage module holds.

Where does this line of thinking lead?

My current view is that any of these three types of identifiers could be suitable for a given record type, depending on the manner in which it is used. A desirable approach might be, for any given record type, to determine the type of primary identifier based on the context in which that identifier will be used. This could come out of discussion between the developers and the relevant SIG. RMB might be extended to allow for the type of identifier to be specified by the developer, and that would be reflected in the relevant JSON schema.

I am happy to discuss and to have holes poked in the above.


#2

Hi Tod,

I’ll just add from an acquisitions point of view, human-readable identifiers are essential shorthands for sharing information between vendors and libraries. We’ve been assuming that we’ll need these regardless of what is happening with UUIDs. There’s no way that vendors are going to transact in long, complex UUIDs. Even if it can be done by the larger, more automated vendors (and I’m not sure it can), they are completely unworkable for smaller vendors or situations where a reference number needs to be printed in an e-mail or on PDF/paper invoice.

Some of the key ones that come to mind quickly:

  • Purchase order numbers and purchase order lines: for quick reference and for invoice line matching. Probably fall into the “sequential” category you reference above.

  • Vendor codes: alphanumeric shorthand that identifies a particular vendor; code may be the same or different from the university ERP code. These get used in transactions all the time, especially when pushing EDI, order, invoice data in and out of the system. They are also handy when the vendor name changes, but it’s still the same vendor. You can update the vendor details, but retain the existing code. For example, how many libraries still use YANKEE as their vendor code for Yankee Book Peddler, then YBP Library Services, then GOBI Library Solutions from EBSCO?

  • Fund codes: alphanumeric shorthand for a longer, formal fund name. Again, often exchanged between vendor and library with regard to orders.

  • Location codes: alphanumeric shorthand for library locations, often exchanged between vendor and library, critical for much shelfready and custom cataloging work

In short, I totally agree. I don’t really care much about UUIDs except that they seem to clutter the displays. The eye-readable, shorter numbers/codes seem much more important to me. The developers are welcome to use UUIDs and do whatever they want with them behind the scenes for storage/identification purposes, so long as we also can transact/deduplicate/identify using human-readable codes as well.

My 2 cents.


#3

There are some endpoints that do not use UUIDs at all, for example the check-out endpoint that takes itemBarcode and userBarcode (and the optional proxyUserBarcode and loanDate): https://s3.amazonaws.com/foliodocs/api/mod-circulation/circulation.html#circulation_check_out_by_barcode_post

Having two unique identifiers in the same record is very easy to implement and is not unnecessary complexity. There is no double-entry of data because a new UUID gets assigned automatically. Looking up a UUID using the other unique identifier is super fast; one cannot notice any time difference.
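
A minimal sketch of that pattern, using Python with an in-memory SQLite table as a stand-in for a module's real storage. The table layout, the sample barcode, and the `create_user` and `uuid_for_barcode` helpers are all hypothetical and only illustrate the idea of a UUID primary key alongside a unique human-readable key.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id      TEXT PRIMARY KEY,  -- UUID, assigned automatically
        barcode TEXT UNIQUE        -- human-readable alternate key
    )
    """
)


def create_user(barcode: str) -> str:
    """Insert a user; the caller supplies only the barcode."""
    new_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO users (id, barcode) VALUES (?, ?)", (new_id, barcode)
    )
    return new_id


def uuid_for_barcode(barcode: str):
    """Resolve a barcode to its UUID via the unique index; None if absent."""
    row = conn.execute(
        "SELECT id FROM users WHERE barcode = ?", (barcode,)
    ).fetchone()
    return row[0] if row else None


created_id = create_user("P0001234")
print(uuid_for_barcode("P0001234") == created_id)
```

The UNIQUE constraint gives the barcode its own index, so the lookup is an indexed read, and a second insert with the same barcode is rejected by the database.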

I disagree with the proposal to have a discussion between developers and SIGs whether to remove some UUID fields. The SIGs request the fields and API endpoint fields they need, the developers decide how they implement them and may add UUIDs.


#4

An analogy - I was just typing up some notes, and for pretty much every Google drive URL, I tend to shorten to something less ugly for hotlink purposes. Which is easier on the eyes?

https://drive.google.com/drive/folders/0B7G8S7WF6N20VlVENkE4LTZqd1k

or

Click Here [which is what I usually change the hotlink text into]

Same argument for the bit.ly shortened urls.


#5

This is the RAML JSON schema of the check-out-by-barcode endpoint mentioned above:


#6

“Universal” is the key word here. The reason we want UUIDs is in order to ensure that no two records of any type in any FOLIO instance anywhere in the universe have the same identifier.

This provides the FOLIO persistence layer with a very nice property, which is certainty that we can merge all record sets from any two FOLIO instances without any key collisions. (This is not to say that merging two FOLIO instances would be trivial or easy, by no means, just that one obvious complication can be ruled out.)
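
To illustrate the merge property with a toy example (a Python sketch with made-up record titles, not FOLIO code): two hypothetical instances that each number records from 1 collide on every key, while records keyed by independently generated UUIDs merge without conflict.

```python
import uuid

# Two hypothetical instances holding made-up records.
instance_a = ["Bib A1", "Bib A2"]
instance_b = ["Bib B1", "Bib B2"]

# Sequential keys: both instances hand out 1, 2, ... independently.
seq_a = {n: title for n, title in enumerate(instance_a, start=1)}
seq_b = {n: title for n, title in enumerate(instance_b, start=1)}
colliding_keys = seq_a.keys() & seq_b.keys()

# UUID keys: each instance generates keys with no coordination.
uuid_a = {uuid.uuid4(): title for title in instance_a}
uuid_b = {uuid.uuid4(): title for title in instance_b}
merged = {**uuid_a, **uuid_b}

print("sequential key collisions:", sorted(colliding_keys))
print("records after UUID merge:", len(merged))
```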

The problem with alternative unique identifiers is that they are unique until suddenly they are not. They can, for example, easily be locally unique but not universally unique. But even locally, the business logic can change in such a way that a property intended to be unique no longer is. We’ve had examples of that in FOLIO already. This presents a real challenge in a microservices environment. With an RDBMS you would at least be able to inspect the schema and automatically find all foreign key references to a given primary key, in case there was a scenario where multiple records with the same primary key were candidates for insertion; you could then find ways of handling that situation. Not quite so with microservices, where by design you cannot know for sure what clients might have child records referencing a primary key in your module. In practice you could find out, you could say, but it’s dicey. At any rate, fixing such collisions is not pleasant.

That’s the rationale anyways.

But that definitely doesn’t mean UUIDs have to be exposed to users. @julianladisch already pointed to the checkout app, see his comments above. Wherever it makes sense, alternative unique identifiers can be supplied, for example to help users that need to pass identifiers around. We already have examples of that in FOLIO.

I agree with Julian that maintaining the UUID is not difficult. You obviously need to add the alternate key property to the schema and devise a pattern for creating the keys. This could be the admin user coming up with unique keys for example.

We have that in the users table, where both username and barcode are alternate keys (sort of, they are not mandatory for the time being). The barcode might be a serial number actually?

This places no extra burden on the client for creating the UUID; if the general convention is followed, the UUID is generated by the server if the client doesn’t supply it.
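
A small sketch of that convention in Python. The `assign_id` function and the record shape are hypothetical, not taken from any FOLIO module; the point is only that the server fills in the UUID when the client leaves it out.

```python
import uuid


def assign_id(record: dict) -> dict:
    """Return a copy of the record with an 'id', generating one if absent."""
    stored = dict(record)
    if stored.get("id"):
        # Validate a client-supplied id; uuid.UUID raises ValueError
        # if the string is not a well-formed UUID.
        stored["id"] = str(uuid.UUID(stored["id"]))
    else:
        stored["id"] = str(uuid.uuid4())
    return stored


# The client sends only the alternate keys; the server adds the UUID.
without_id = assign_id({"username": "tod", "barcode": "P0001234"})

# The client supplies its own UUID; the server keeps it.
with_id = assign_id(
    {"id": "2a70833a-ebe5-4a18-9ec4-19317eb17de5", "username": "peter"}
)

print(without_id["id"], with_id["id"])
```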

It will probably place extra requirements on the server, beyond providing the alternate key field, since users would likely want to be able to search and find records by the alternate key(s), as is the case in the Users UI where you can search users by both username and barcode. You can search for a user by UUID as well, I just found. But that’s not an architectural requirement; you would want to make sure that all users have a username or a barcode if you intended one of those fields to be a real alternate key.


#7

There is another kind of identifier not listed in the three from Tod’s original message: URIs. In fact, the components of a URI with a UUID take into account some of the issues that Tod mentioned:

The full URI brings with it the tenant and the object type. For instance:

http://folio-snapshot-stable.aws.indexdata.com/users/view/2a70833a-ebe5-4a18-9ec4-19317eb17de5
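
As a rough sketch of how those pieces can be pulled back out of such a URI, here is a Python example using the standard library. It assumes the host identifies the deployment (and so, indirectly, the tenant), that the first path segment names the object type, and that the last segment is the UUID; those assumptions come from reading this one example, not from any FOLIO specification.

```python
import uuid
from urllib.parse import urlparse

uri = (
    "http://folio-snapshot-stable.aws.indexdata.com"
    "/users/view/2a70833a-ebe5-4a18-9ec4-19317eb17de5"
)

parts = urlparse(uri)
host = parts.netloc                          # assumed to identify the deployment
segments = parts.path.strip("/").split("/")  # ['users', 'view', '<uuid>']
object_type = segments[0]                    # assumed: first segment is the type
record_id = uuid.UUID(segments[-1])          # raises ValueError if not a UUID

print(host, object_type, record_id)
```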

What might be interesting is to have a system service that provides a shortened URI, similar to what Confluence does for wiki pages:

[Image: example of a Confluence "Tiny Link"]


#8

#9

I think it’s fine to have UUIDs in addition to other identifiers that are more human-friendly. It needs to be possible to have the human-friendly identifier used as a record’s identifier in integrations and export files, though. The example that comes to my mind is VuFind – we use that as the patron-facing Library catalog search tool. It shows the bib ID in the URL right now, e.g. “https://asa.lib.lehigh.edu/Record/1”. When Library staff want to know unambiguously what record a patron is looking at in the catalog, they can ask the patron to read them that ID from the URL. VuFind will be indexing an export of our catalog from FOLIO. If FOLIO puts the UUID in the ID field indexed by VuFind, then that’s what will show up as the bib ID in that URL, and this very common task will become impractical.

This is not explicitly an argument against having UUIDs, though if we don’t actually use them for anything in practice then one could reasonably question their utility. My intention here is just to point out that the human-friendly identifiers need to be available as the primary ID that FOLIO provides in exports and integrations with external software.


#10

I’m glad you thought of URIs. URIs are interesting for a couple of reasons. As you say, the full URI of an object in FOLIO brings with it the tenant and object type. I’m wondering what contexts it would be used in. In the example it seems to be tied to a particular host, so it would not persist if the system (or just the service) were to migrate to a different host. But that may just be inference on my part.

The other interesting thing comes up as we think about a future with linked data. And I’m not even talking BIBFRAME here. There are other cases like authority control where it would be useful to reference and use objects that are maintained by some other agency, such as ISNI names or FAST subject headings, even if we have to cache some of that data locally. We’re not there yet, but it won’t be long.


#11

Thank you for the discussion, everyone. I think the summary is:

  1. Human-readable IDs are essential in a number of cases, especially when transacting with other systems or when humans need to communicate in analog ways.
  2. UUIDs are desirable from a development point of view and for global uniqueness.
  3. Having UUIDs by default and adding human-readable IDs where needed involves little overhead and meets both needs.

Whether UUIDs have a role in the UI (e.g. display by default, omit entirely, or make them available on inquiry) is a question, but a side issue.