Discuss.FOLIO.org is no longer used. This is a static snapshot of the website as of February 14, 2023.

Reference Data and Upgrades

wayne

15 May '20

Background

A number of modules in FOLIO support bootstrapping a tenant with parameters that are sent to the module when it is initialized for a tenant (see the Okapi Guide for more technical details). Two tenant parameters that are supported by many modules are loadSample and loadReference. These parameters tell the module to bootstrap the module storage for the tenant with module-provided “sample” or “reference” data.

The tenant interface for a module can be called whenever a module is enabled or disabled for a tenant, including the case of a module “upgrade” (which in this context might be better described as a module “replacement,” in that one version of the module is disabled for the tenant and another is enabled). This means that tenant parameters like loadSample and loadReference are honored both for the initial enabling of a module, and for upgrades.
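To make the difference between an initial enable and an upgrade concrete, here is a rough sketch of the request body a module might receive on its tenant interface in each case. The field names (module_from, module_to, parameters) follow my reading of the Okapi Guide and should be checked against it; the module name mod-example and the helper function are made up for illustration.

```python
import json


def tenant_attributes(module_to, module_from=None,
                      load_reference=False, load_sample=False):
    """Build a tenant-init body (field names assumed, see the Okapi Guide)."""
    body = {"module_to": module_to}
    if module_from is not None:
        # Present only on an upgrade, i.e. when a previous version is replaced.
        body["module_from"] = module_from
    params = []
    if load_reference:
        params.append({"key": "loadReference", "value": "true"})
    if load_sample:
        params.append({"key": "loadSample", "value": "true"})
    if params:
        body["parameters"] = params
    return body


# First-time enable: no previous version, both data sets requested.
initial = tenant_attributes("mod-example-1.0.0",
                            load_reference=True, load_sample=True)

# Upgrade: the same parameter is honored again, now with module_from set.
upgrade = tenant_attributes("mod-example-1.1.0",
                            module_from="mod-example-1.0.0",
                            load_reference=True)

print(json.dumps(upgrade, indent=2))
```

The only structural difference between the two bodies is module_from, so a module has to inspect that field itself if it wants to treat the parameters differently on upgrade.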

The Problem

A FOLIO upgrade, as it is currently implemented, involves replacing the set of modules enabled for a tenant with a new set of modules. Each module is responsible for upgrading its storage for the tenant in place – for example, updating existing records with new required fields, or adding a database index to improve performance.

A new version of a module (or a new module that was not previously included in the tenant’s module set) might contain new or updated reference data. It seems reasonable that an operator might choose to specify loadReference=true in the call to the tenant install API to load the new reference data.

As currently implemented in most modules, this will cause the module to attempt to load all reference data (not just new data). New records will be created if needed, and existing records (matched by UUID) will be overlaid.

Due to this, issues arise if the tenant has altered or deleted any of the reference data loaded by the module when it was first enabled. Any changes will of course be overwritten with the system default, and deleted records will be re-created.
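A minimal sketch of the create-or-overlay behavior described above, using an in-memory dict in place of module storage. The record shapes and short ids are invented stand-ins for real records and UUIDs; actual modules do this against their database.

```python
def load_reference_data(store, reference_records):
    """Create or overlay every shipped record, matched by id."""
    created, overlaid = [], []
    for record in reference_records:
        if record["id"] in store:
            overlaid.append(record["id"])
        else:
            created.append(record["id"])
        # Unconditional write: any local change to this id is lost.
        store[record["id"]] = dict(record)
    return created, overlaid


# Tenant state before upgrade: "ug-1" was renamed locally, "ug-2" was deleted.
tenant_store = {"ug-1": {"id": "ug-1", "group": "staff-local"}}

# Reference data shipped with the new module version.
shipped = [
    {"id": "ug-1", "group": "staff"},
    {"id": "ug-2", "group": "faculty"},
]

created, overlaid = load_reference_data(tenant_store, shipped)
print(created, overlaid, tenant_store)
```

After the call, the local rename of ug-1 is gone and the deleted ug-2 is back, which is exactly the pair of problems described above.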

More subtle problems arise if the record type in question has data constraints (for example, the requirement that a particular property be unique), and the tenant has created a new record of that type which causes a conflict with incoming reference data. As currently implemented, this kind of conflict causes the module upgrade to fail, potentially leaving the tenant data in an inconsistent state.
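The constraint conflict can be sketched the same way. This assumes a hypothetical uniqueness rule on a name property and a loader that stops at the first error; the exception class and record ids are invented for the example.

```python
class UniqueViolation(Exception):
    """Raised when two records would share the same unique property."""


def upsert(store, record, unique_field="name"):
    """Write a record, enforcing uniqueness of one field across ids."""
    for other_id, other in store.items():
        if other_id != record["id"] and other[unique_field] == record[unique_field]:
            raise UniqueViolation(
                f"{unique_field} '{record[unique_field]}' already used by {other_id}"
            )
    store[record["id"]] = dict(record)


# The tenant created its own "graduate" record under a locally generated id.
store = {"local-1": {"id": "local-1", "name": "graduate"}}

# The new module version ships a "graduate" record under a different id.
reference = [
    {"id": "ref-1", "name": "staff"},
    {"id": "ref-2", "name": "graduate"},
    {"id": "ref-3", "name": "faculty"},
]

loaded, error = [], None
try:
    for record in reference:
        upsert(store, record)
        loaded.append(record["id"])
except UniqueViolation as exc:
    error = exc  # the load stops here; ref-3 is never attempted

print(loaded, error)
```

Here ref-1 has been written, ref-2 failed, and ref-3 was never attempted, so the store ends up partially loaded. That is a small-scale picture of the inconsistent state mentioned above.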

These kinds of issues would very likely also arise if an operator specified loadSample=true in an upgrade, but that is currently untested, and seems like an unlikely use case, at least for production.

Desired Behavior

In the SysOps SIG meeting of 15 May 2020, we discussed what the desired behavior for reference data upgrades might be. A range of possible behaviors was discussed, including:

Do not honor the loadReference tenant parameter at all on upgrade, only on first-time module tenant initialization.
Keep the existing behavior to create or overlay all reference data.
Only create new records. If a reference record already exists, do not overlay it.
Introduce a new tenant parameter to allow for customizing reference data overlay behavior (not discussed in the meeting, proposed out-of-band).
Create new records and merge existing records, to allow for local updates to be preserved while adding new properties if the schema for the record type has changed.
Treat the base reference records as immutable, and introduce local updates as overlays on top of the base (like a customized view of the record for the tenant). This might be seen as a more sophisticated implementation of the merging strategy.
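Three of the options above (overlay everything, create only, and merge) differ only in what happens when a record already exists. A rough sketch, again using a dict as a stand-in for module storage; the strategy names and the merge rule (local values win, new properties are added) are my own assumptions, not an agreed design.

```python
import copy


def apply_reference(store, incoming, strategy):
    """Apply shipped reference records to a store using one of three strategies."""
    if strategy not in ("overlay", "create-only", "merge"):
        raise ValueError(f"unknown strategy: {strategy}")
    for record in incoming:
        existing = store.get(record["id"])
        if existing is None:
            store[record["id"]] = dict(record)   # new record in every strategy
        elif strategy == "overlay":
            store[record["id"]] = dict(record)   # shipped version replaces local
        elif strategy == "merge":
            # Start from the shipped record, then let local values win.
            store[record["id"]] = {**record, **existing}
        # "create-only": leave the existing record untouched
    return store


local = {"g1": {"id": "g1", "name": "Local staff"}}
incoming = [
    {"id": "g1", "name": "staff", "code": "STF"},   # gained a new "code" property
    {"id": "g2", "name": "faculty", "code": "FAC"},
]

overlay = apply_reference(copy.deepcopy(local), incoming, "overlay")
create_only = apply_reference(copy.deepcopy(local), incoming, "create-only")
merged = apply_reference(copy.deepcopy(local), incoming, "merge")

print(merged["g1"])
```

In this sketch the merge result keeps the local name and picks up the new code property, while create-only leaves g1 without a code. That matters if code later becomes a required field.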

While no consensus was reached regarding the recommended behavior, it was generally agreed that a failure to load reference data should not be treated as a fatal error that prevents a successful upgrade. This may require more sophisticated reporting from the tenant install API to report non-fatal errors.
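One way to picture non-fatal handling: try each reference record on its own, collect failures as warnings, and return a report instead of aborting. This is only a sketch; the report shape, the uniqueness rule on name, and the function names are hypothetical and not part of any existing tenant install API.

```python
def save(store, record):
    """Write one record; reject it if its name is used by another id."""
    for other in store.values():
        if other["id"] != record["id"] and other["name"] == record["name"]:
            raise ValueError(f"name '{record['name']}' already used by {other['id']}")
    store[record["id"]] = dict(record)


def load_reference_non_fatal(store, incoming):
    """Load what can be loaded and report the rest as warnings."""
    report = {"loaded": [], "warnings": []}
    for record in incoming:
        try:
            save(store, record)
            report["loaded"].append(record["id"])
        except ValueError as exc:
            # Record the problem and keep going; the upgrade is not failed.
            report["warnings"].append(f"{record['id']}: {exc}")
    return report


store = {"local-1": {"id": "local-1", "name": "graduate"}}
incoming = [
    {"id": "ref-1", "name": "staff"},
    {"id": "ref-2", "name": "graduate"},   # conflicts with the local record
    {"id": "ref-3", "name": "faculty"},
]

report = load_reference_non_fatal(store, incoming)
print(report)
```

The operator would then see which records were skipped and why, and could resolve them by hand after the upgrade.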

What are Reference Data?

To this point, reference data in FOLIO have been loosely defined as data that are referred to by other records, without which it is impossible to create those records – or more loosely still as data that are “required for the system to operate.” There is definitely a grey area between records that are part of external controlled vocabularies, such as RDA-defined content types, and records that are referred to by other records, but will almost certainly have local data, such as user groups or locations. It would be beneficial to the project to document a more precise definition of reference data, and each module that provides them probably needs to examine its reference data and determine if it meets the updated definition, or if some data should be moved to sample data.

Next Steps

The SysOps SIG needs to provide guidance to the core platform and functional development teams to help resolve these issues for system upgrades. In particular:

Which behavior would we like to see for reference data on upgrade?
Is there the possibility of a phased implementation for upgrade behavior?
Do we want to work with the core teams to develop a more precise definition of reference data? What is the best venue for that work?

Other ideas and comments welcome! Please comment below.

brandon-tharp

20 May '20

I like these two ideas.

Create new records and merge existing records, to allow for local updates to be preserved while adding new properties if the schema for the record type has changed.
Treat the base reference records as immutable, and introduce local updates as overlays on top of the base (like a customized view of the record for the tenant). This might be seen as a more sophisticated implementation of the merging strategy.

Reference data should be considered data “that is required to operate the system”. The goal of updating reference data should be to ensure that the data required to run the system exists once the upgrade is complete. This should cover both changed reference data (new properties) and any new reference data that needs to be added. It would be tragic if an upgrade completed successfully (without error), but the system did not function correctly because some required reference data was missing. It should be very clear when there is a need to change reference data as part of an upgrade. I imagine institutions would want to review those changes to ensure that they make sense in their environments.

I would like to see the project do a few things with regard to this:

make a clear distinction between sample and reference data
make it clear either before or after the upgrade process that reference data was changed and needs to be reviewed
in general, the upgrade process needs to do a better job handling errors and indicating what exactly was changed/upgraded during upgrades.

zenotajoli

20 May '20

This is the behavior that I would like to see:
1) That it is possible to upgrade WITHOUT loadReference and WITHOUT loadSample
2) A good CHANGELOG of reference data between releases (is there one anywhere?)
3) A reference data set that is as small as possible.

My idea is to fix changes to reference data by hand after the upgrade.
My environment (Italy, UNIMARC data) has a lot of reference data that is very different.
For example, I don’t use RDA codes or Contributor types.

I’m OK with working with the core teams to develop a more precise definition of reference data.

enettifee

20 May '20

From an end-user perspective, thinking about librarians who might interact with settings but not be technical or have knowledge about the upgrade process, I’m not sure I would expect an upgrade to add new settings values. I don’t think I would want to suddenly have new values appearing in things like drop-downs if I hadn’t already vetted it or thought about how it might be used.

An exception might be if there’s a new parameter perhaps for a particular setting value - like something before was just a list of names, and now each name has an associated code. But if I’ve added values or changed the reference data, I did that for a reason, and I don’t want an upgrade to change the work that I did.

Ingolf_Kuss

21 May '20

This is what I would like to see:

Existing records will not be overlaid
New records will be created, if needed
Deleted records will not be created anew
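These three rules can be sketched as follows. Telling "deleted by the tenant" apart from "new in this release" requires knowing which ids the previous module version shipped; the previously_shipped_ids argument below is a hypothetical input that a module would have to record somewhere, and the function is illustrative only.

```python
def upgrade_reference(store, incoming, previously_shipped_ids):
    """Create only records that are new in this release; never overlay or re-create."""
    created, skipped_existing, skipped_deleted = [], [], []
    for record in incoming:
        record_id = record["id"]
        if record_id in store:
            # Rule 1: existing records are not overlaid.
            skipped_existing.append(record_id)
        elif record_id in previously_shipped_ids:
            # Rule 3: shipped before but absent now, so the tenant deleted it.
            skipped_deleted.append(record_id)
        else:
            # Rule 2: new in this release, so create it.
            store[record_id] = dict(record)
            created.append(record_id)
    return created, skipped_existing, skipped_deleted


store = {"loc-1": {"id": "loc-1", "name": "Main (renamed locally)"}}
previously_shipped = {"loc-1", "loc-2"}   # loc-2 was deleted by the tenant
incoming = [
    {"id": "loc-1", "name": "Main"},
    {"id": "loc-2", "name": "Annex"},
    {"id": "loc-3", "name": "Online"},    # new in this release
]

created, skipped_existing, skipped_deleted = upgrade_reference(
    store, incoming, previously_shipped
)
print(created, skipped_existing, skipped_deleted)
```

The two skipped lists could feed the kind of review note mentioned below, so that operators know which shipped records were not applied.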

We might try to merge existing records, but we should be careful with that. If a merge has happened, there should be a note that reference data was changed and needs to be reviewed.

A failure to load reference data should not be considered a fatal error. The install-API should differentiate between errors, warnings and info messages.

Default reference data sets should be as small as possible. For example, there should be only one user group and only one location in the default reference data that is loaded by the system upon a new installation.

Yes, we want to work with the core team to define what is reference data (for each module) and what is sample data. I would call it a Working Group of SysOps SIG. I will be willing to work in that group. I will ask if we can set up a new Zoom slot for this group (once a week, until the work is done) and do a Doodle Poll for those who would like to participate.

I don’t understand what is meant by a “phased implementation for upgrade behavior”.

drexljo

25 May '20

Set up the wiki page here:
https://wiki.folio.org/display/SYSOPS/Upgrades+with+Reference+data