Discuss.FOLIO.org is no longer used. This is a static snapshot of the website as of February 14, 2023.

On distributed updates and eventual consistency

tod

18 Jul '18

Problem Statement

One concern I keep hearing about the microservices-inspired architecture is how distributed updates are coordinated across data storage modules and how data consistency is maintained. Libraries have grown accustomed to the ACID guarantees of relational database management systems. The move to a distributed storage architecture that lacks those guarantees creates discomfort.

Microservices have developed programming strategies to address those data consistency issues. Perhaps now is a good time to discuss recommendations and best practices for FOLIO.

Towards eventual consistency

Points:

As microservices are decoupled, we see data references between modules and perhaps redundant storage, the denormalizing of data.
There are cases where data will be replicated or kept in sync across data stores, presumably based upon an agreed-upon source of truth.
It is legitimate to think that not all data will be kept in sync. For example, if two different KBs are integrated into the system with overlapping and conflicting data, it is unlikely that they will be corrected to line up with each other. What matters is how the system decides to choose between one or the other, i.e. the source of truth, keeping in mind that the source of truth may not be global but contextual.
This is a common problem in microservices architectures, and the industry solution seems to be “eventual consistency,” where temporary lags are acceptable, at least in some cases, but the system settles to a consistent state.
It might be useful to distinguish between issues of “eventual consistency” at the microservice level and at the system level. A microservice might allow temporary inconsistencies, for scalability reasons, during something like a bulk upload, and then “reconcile” the data in a later pass. Note that “reconcile” is distinct from “synchronize”.
While synchronous RESTful HTTP calls are easy to consume and make for easy-to-use external APIs, in the internal microservices architecture they can at times be problematic from a coordination/orchestration point of view. It is common for microservices architectures to solve some such problems through message-based RPC-like communication over HTTP(S).
There are patterns of messages. Messages could be synchronous or asynchronous, and could be consumed by a single listener or overheard by multiple listeners. There are patterns for different kinds of needs for messages.
The microservice that is the source of truth for some data element issues messages on update, and subscribers act accordingly.
There seems to be an architectural role, to look at the interactions between modules and to make intentional decisions about which patterns are appropriate for the different interactions.
This all seems to fit well with the needs for FOLIO in several ways: we’ve heard of the desire for a message-queuing system in FOLIO; we need ways to coordinate data updates, like data updates in storage modules being reflected in inventory; coordinating tasks through a workflow engine; and even flowing data out to the Data Lake.
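To make the publish/subscribe point above concrete, here is a minimal sketch in Python. It is an in-process stand-in for a real message broker, not a proposal for how FOLIO should implement it. The `EventBus` class, the `item.updated` topic, and the two listeners are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-process stand-in for a real message broker."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Every subscriber "overhears" the message; the publisher does not
        # know or care who is listening.
        for handler in self._subscribers[topic]:
            handler(message)


bus = EventBus()
inventory_view = {}  # hypothetical denormalized copy held by an inventory module
audit_log = []       # hypothetical second listener, e.g. a feed to a data lake


def update_inventory(msg: dict) -> None:
    inventory_view[msg["id"]] = msg["title"]


bus.subscribe("item.updated", update_inventory)
bus.subscribe("item.updated", audit_log.append)

# The storage module (the source of truth) announces the change after committing it.
bus.publish("item.updated", {"id": "item-1", "title": "Moby Dick"})
```

In a real deployment the handlers would run asynchronously and could fail independently, which is exactly where the "eventual" in eventual consistency comes from.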

There are well-described general patterns applicable in the microservices setting: Event Sourcing[1] and Saga[2]. While it’s entirely possible for module authors to implement such patterns across selected modules, there’s no direct support for them in the platform (Okapi, RMB, etc). An RFC could be written to propose such support at the platform level. Also note that these patterns solve multiple issues, but not necessarily all of them at the same time. For example, solving transactional integrity between multiple µservices doesn’t automatically solve data consistency.
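As a rough illustration of the Event Sourcing idea, the sketch below keeps an append-only log of events and derives the current state by replaying it. The `LoanEventStore` class and the `checked_out` / `checked_in` event names are assumptions made up for this example; they do not correspond to any existing FOLIO module or API.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Event = Tuple[str, str]  # (event_type, item_id); hypothetical event shape


@dataclass
class LoanEventStore:
    """Append-only log; current state is never stored, only derived."""

    events: List[Event] = field(default_factory=list)

    def append(self, event_type: str, item_id: str) -> None:
        self.events.append((event_type, item_id))

    def items_on_loan(self) -> Set[str]:
        # Rebuild state by replaying every event from the start of the log.
        on_loan: Set[str] = set()
        for event_type, item_id in self.events:
            if event_type == "checked_out":
                on_loan.add(item_id)
            elif event_type == "checked_in":
                on_loan.discard(item_id)
        return on_loan


store = LoanEventStore()
store.append("checked_out", "item-1")
store.append("checked_out", "item-2")
store.append("checked_in", "item-1")
current = store.items_on_loan()
```

Because the log is the source of truth, another module could rebuild its own view by replaying the same events after an outage.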

That said, there do not seem to be guidelines or best practices documented in dev.folio.org, and it seems like this is an area where the project would benefit from having a discussion about how best to deal with distributed updates.

[1] http://microservices.io/patterns/data/event-sourcing.html
[2] http://microservices.io/patterns/data/saga.html

Thanks to Jakub and Vince for their comments and contributions.

mreno-EBSCO

1 Aug '18

Data consistency is critical for production environments. The ostrich approach, which I feel we are currently taking, is very risky. We definitely need guidelines to address this problem. There are already modules, with more to come, that update/create data in multiple tables/databases/microservices. If any of these fail, previously updated data will remain committed. There is no rollback or replay functionality.
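To illustrate what rollback could look like, here is a minimal sketch of a Saga-style coordinator in Python: each step is paired with a compensating action, and if a later step fails, the steps that already completed are undone in reverse order. The `run_saga` function and the loan/fee steps are hypothetical and only stand in for calls to separate modules.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_saga(steps: List[Step]) -> bool:
    """Run each action in order; on failure, undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


# Hypothetical stand-ins for two modules' data stores.
loans = set()
fees = set()


def create_loan() -> None:
    loans.add("loan-1")


def delete_loan() -> None:
    loans.discard("loan-1")


def create_fee() -> None:
    raise RuntimeError("fee service unavailable")  # simulated outage


def delete_fee() -> None:
    fees.discard("fee-1")


ok = run_saga([(create_loan, delete_loan), (create_fee, delete_fee)])
```

Here the fee step fails, so the loan created in the first step is removed again. A production version would also have to cope with compensations that themselves fail, which this sketch ignores.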

We all know that no service is available 100% of the time and we also do not want to manually fix problems due to data inconsistency. The patterns @tod describes above make sense for much of what we are doing. I don’t think we should leave this to be implemented in individual modules. A unified library should be available, whether it uses event sourcing or some other mechanism.

Then there are the situations that don’t seem to be covered yet. For example, what should happen if you delete an item from inventory? What happens to the associated loans, requests, fees, etc.? Right now, we simply delete the record from the DB. These other modules will need to know something happened to the item.

Another thing, somewhat related, that I believe is critical for production environments is having a monitoring and alerting mechanism built into or integrated with FOLIO so that failed operations are reported to the proper resources. Of course, this could be added external to FOLIO, for example via log monitoring, but requiring each FOLIO deployer to design a mechanism is not ideal. There are many monitoring systems that FOLIO could integrate with; perhaps this: https://issues.folio.org/browse/FOLIO-887.

In the case where data is inconsistent and failed operations will never complete, for example, due to invalid data in the payload that somehow passed API level validation but fails some DB validation, something monitoring for this type of situation could detect it and alert someone to manually fix the data.

Sorry for the rambling post.