
Planning for scalability and testing of scalability

tod
11 Nov '16

What are the plans for scalability and testing of scalability in FOLIO? I’m thinking both of the platform itself and of any scaffolding to help module/app builders test their work.

Scalability can, of course, include magnitude of data, complexity of data, and frequency of (potentially competing) CRUD operations. And there will be considerable variation according to the particulars. I’m asking in the context of a technical review, and thinking about how we ensure or measure scalability. How are you thinking about and planning for this?

peter
17 Nov '16

Hi, @tod – I spoke with @jakub this morning. He is circulating the questions around the developers to formulate a reply.

Sebastian_Hammer
17 Nov '16

This isn’t to pre-empt comments from the developers, but I’ll share some observations reflecting my own point of view as well as our early conversations around this topic. As with all other things FOLIO, it’s an evolving topic, and our experiences moving forward, as well as the insights and support of the community, really matter here.

I’m basing my own perspective on experience from building complex distributed systems in the past, and from battling with and working through performance issues in various ways. You always come into a new, greenfield project wanting to learn from those mistakes and to avoid repeating them. I am thinking here primarily of frequency of operations and of bottlenecks that might arise and impede our performance; volume and complexity of data we can perhaps address separately. I tend to think of request frequency as a major concern because the microservices approach, and especially the mediation of requests through OKAPI (which doubles the number of hops), stands out and naturally gives people pause. Network hops are orders of magnitude more expensive than local context switches, and we must be mindful of exploding numbers of requests, or of moving large amounts of data around because of ill-considered architectural or implementation choices.

Note that this discussion is focused mostly on interactive activity. Although most of these issues also relate to batch processing (whether interactive or deferred), the latter use case presents, I think, some separate issues.

I see us pursuing a three-pronged strategy here:

  1. Good interface design. The textbook example of a distributed system design gone wrong surely must be a scenario where a single user request explodes into hundreds or thousands of expensive network hops, or where enormous amounts of data are moved across long-haul networks because of poor interfaces or module boundaries. A basic premise, to my mind, is that most APIs (but especially those frequently used to meet basic or common user functions) should be designed so that they match the granularity of the user request. In other words, the ideal we strive for is to have no more than one request to a given service for any one user request as you travel down the stack (the first sketch after this list illustrates the difference). I say strive because life has a way of upsetting the best-laid plans, but I find that much pain can be avoided by keeping this in mind. By the same token, I believe that the amount of data moved about should be no more than necessary to meet the user’s request, plus perhaps a lookahead buffer if that is needed to provide a better experience. There are exceptions to this, and as in all things, compromises rule, but in general, a good design will avoid moving arbitrary-sized buffers around.

  2. Instrumentation. The second strategy is to build in instrumentation for performance monitoring even at an early stage. In the old days, when entire, complex systems might be running on a single server or on a small number of carefully curated servers, this might have been less essential; simple system monitoring tools or lightweight log analysis might have provided all the information we needed. But when you imagine a complex system of networked modules dynamically assigned to virtual servers by your cloud infrastructure, monitoring tools and dashboards are no longer a gimmick or a luxury: they are essential. In the FOLIO architecture, we aim at first to focus instrumentation in OKAPI and likely in the storage layer modules provided by the platform, because this reduces the need for casual app developers to think about it. Since all traffic passes through OKAPI, it provides a nice central location at which to measure. OKAPI today does have some initial instrumentation in place, but I expect that this will be extended and refined as more practical experience is gathered during the next year.

  3. Continuous testing. It is my sense that while some performance issues show themselves immediately as a result of poor service design or shoddy implementation choices, in some cases problems are more insidious. Even with an architecture that tests out great in early performance experiments (the FOLIO architecture has been ‘lab tested’ by throwing large numbers of requests at fake modules through OKAPI), real-world problems have a way of creeping up on you as developers work over time to develop and refine actual functionality. An addition here and a tweak there, and suddenly you have a sluggish dog of a system on your hands. It is my view that to prevent this, as we collaborate among ever more, and ever more distributed, teams, we should treat performance and stress testing as constant concerns, just as we do unit testing and continuous integration workflows, and discover problems sooner rather than later (the second sketch after this list gives a flavour of such a check). We do not have such a framework in place today because, to date, we have not had enough pieces in place to meaningfully test, but I hope that we will see this coming together during the beginning of 2017.
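
To make the granularity point in item 1 concrete, here is a minimal sketch. The interfaces and types (LoanStorageClient, ItemStorageClient, LoanWithItem and so on) are made up for illustration and are not actual FOLIO APIs; the point is only the shape of the calls. The naive version turns one user action into 1 + N network hops, while an interface shaped like the user request keeps it at two hops regardless of how many loans the patron has.

```java
// Illustrative only: hypothetical clients for two storage services.
import java.util.*;
import java.util.stream.Collectors;

interface LoanStorageClient {
    List<Loan> loansForPatron(String patronId);           // one network request
}

interface ItemStorageClient {
    Item item(String itemId);                              // one request per call
    Map<String, Item> itemsByIds(Collection<String> ids);  // one batched request
}

record Loan(String id, String itemId) {}
record Item(String id, String title) {}
record LoanWithItem(Loan loan, Item item) {}

class LoanService {
    private final LoanStorageClient loans;
    private final ItemStorageClient items;

    LoanService(LoanStorageClient loans, ItemStorageClient items) {
        this.loans = loans;
        this.items = items;
    }

    // Anti-pattern: one user action fans out into 1 + N network hops.
    List<LoanWithItem> openLoansNaive(String patronId) {
        return loans.loansForPatron(patronId).stream()
                .map(l -> new LoanWithItem(l, items.item(l.itemId())))
                .collect(Collectors.toList());
    }

    // Preferred: the interface matches the user request, so the fan-out
    // collapses to two hops no matter how many loans there are.
    List<LoanWithItem> openLoans(String patronId) {
        List<Loan> open = loans.loansForPatron(patronId);
        Map<String, Item> byId = items.itemsByIds(
                open.stream().map(Loan::itemId).collect(Collectors.toSet()));
        return open.stream()
                .map(l -> new LoanWithItem(l, byId.get(l.itemId())))
                .collect(Collectors.toList());
    }
}
```

The specific shape of the batch call matters less than the principle: the service boundary should offer an operation whose granularity matches the user-facing request.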
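
And for item 3, a rough sketch of the kind of smoke-level load check that could eventually run alongside unit tests in CI. The endpoint, concurrency, request count and latency threshold below are placeholders rather than an agreed FOLIO policy; I have simply pointed it at an Okapi instance on its default port.

```java
// Sketch of a CI-style load check: fire concurrent requests at one endpoint
// and fail loudly if the 95th-percentile latency regresses past a threshold.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.concurrent.*;

public class SmokeLoadTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9130/_/proxy/modules")).GET().build();

        ExecutorService pool = Executors.newFixedThreadPool(20);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            futures.add(pool.submit(() -> {
                long start = System.nanoTime();
                client.send(request, HttpResponse.BodyHandlers.discarding());
                return (System.nanoTime() - start) / 1_000_000; // milliseconds
            }));
        }

        List<Long> latencies = new ArrayList<>();
        for (Future<Long> f : futures) {
            latencies.add(f.get());
        }
        pool.shutdown();

        Collections.sort(latencies);
        long p95 = latencies.get((int) (latencies.size() * 0.95) - 1);
        System.out.println("p95 latency: " + p95 + " ms");
        if (p95 > 200) { // illustrative threshold only
            throw new AssertionError("p95 latency regression: " + p95 + " ms");
        }
    }
}
```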

I hope this answers some of your questions @tod. What else should we be thinking about?

jakub
17 Nov '16

A couple of high-level notes to get the conversation started.

In terms of scalability we have tried to tackle some of the issues early in the design process by choosing an architecture that is inherently horizontally scalable. That’s the primary reason for splitting the Platform into three distinct tiers: system/data modules, stateless business logic modules, and in-browser presentation/UI modules (the so-called SPA).

Scalability on the data layer is probably the hardest to get right, but we hope we can exploit some elements specific to the library domain: e.g. the fact that the write-heavy pieces (e.g. circulation and, to some extent, cataloging) can in many circumstances be partitioned by tenant (thus limiting the size of any particular DB instance), while the read-heavy pieces (e.g. search) can be optimized by creating a separate index that’s updated much less frequently or whose updates are scheduled during idle time. It’s worth noting that we would like to avoid forcing any single partitioning approach: we know that tenants will vary in size. In some cases it will be cost-effective to group smaller tenants within a single DB instance, while large tenants will be partitioned separately from others and, in extreme cases, their DB instances could be sharded. It’s of course crucial that the DB engine we use allows for those flexible modes; the ones we have been investigating (MongoDB and PostgreSQL) do.
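
To make that a bit more concrete, here is a sketch of the schema-per-tenant flavour of partitioning in PostgreSQL. The table, credentials and naming convention are invented for illustration, and the actual storage modules may well do this differently; the point is only that small tenants can share one DB instance (one schema each), while a large tenant can be handed a different JDBC URL pointing at a dedicated, possibly sharded, instance.

```java
// Illustrative sketch: route each tenant's queries to its own schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TenantStorage {
    private final String jdbcUrl; // small tenants share a URL; big ones get their own

    public TenantStorage(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    // Point all statements on this connection at the tenant's own schema.
    private Connection connect(String tenantId) throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl, "folio", "folio");
        try (Statement st = conn.createStatement()) {
            // strip anything that is not a word character before building the name
            st.execute("SET search_path TO tenant_" + tenantId.replaceAll("\\W", ""));
        }
        return conn;
    }

    // Example query against a hypothetical per-tenant "loans" table.
    public int openLoanCount(String tenantId) throws Exception {
        try (Connection conn = connect(tenantId);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT count(*) FROM loans WHERE returned = false");
             ResultSet rs = ps.executeQuery()) {
            rs.next();
            return rs.getInt(1);
        }
    }
}
```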

Scaling the business logic modules should be much simpler, especially in an elastic cloud environment, assuming we can enforce statelessness. This requires rigour in how developers structure their apps, and we will provide examples and guidelines for how one should approach it. With stateless modules, CPU is usually the bottleneck, but that is also an aspect we have standard methods for dealing with (e.g. load-balancing across multiple processes).
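
As a rough illustration of the kind of statelessness we mean – this is a toy Vert.x handler, not a real FOLIO module – everything needed to answer a request arrives with the request itself (here, the tenant header) or would come from the storage tier, so any number of identical processes can sit behind a load balancer without coordination between them:

```java
// Toy example of a stateless HTTP module: no session map, no mutable
// per-user fields, so any replica can serve any request.
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Vertx;

public class StatelessModule extends AbstractVerticle {
    @Override
    public void start() {
        vertx.createHttpServer().requestHandler(req -> {
            // All context arrives with the request; nothing is remembered
            // between requests inside this process.
            String tenant = req.getHeader("X-Okapi-Tenant");
            req.response()
               .putHeader("Content-Type", "application/json")
               .end("{\"tenant\":\"" + tenant + "\"}");
        }).listen(8081);
    }

    public static void main(String[] args) {
        // Run several identical copies on different hosts/ports and put a
        // load balancer in front of them; no shared state is required.
        Vertx.vertx().deployVerticle(new StatelessModule());
    }
}
```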

Finally, the UI is implemented as a Single Page App (SPA) that is pre-compiled into a static bundle of assets (JavaScript, HTML, CSS, images, etc.), based on the tenant’s selection of apps, and re-generated after every change to that selection. Such a bundle can be served by a general-purpose web server (or a CDN), an approach that is generally extremely fast and scalable, and the dynamic elements are executed directly on the client (browser).

Stress/performance testing and instrumentation are, as you and Seb point out, crucial to making sure this well-crafted architecture, which is also complex because of its distributed nature, actually works. We will be performance testing continuously on our CI system, although that is of course focused on the core modules. For debugging performance/scalability/latency issues, instrumentation (and visualisation of metrics) is key – Okapi will have first-class support for standard instrumentation collectors (like Graphite) and we will provide guides for how to report instrumentation data from within the modules.
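
For a flavour of what reporting from within a module might look like, here is a sketch using the Dropwizard Metrics library and its Graphite reporter; the host, prefix and metric names are placeholders, and the eventual guides may recommend a different setup:

```java
// Sketch: time an operation and push the measurements to a Graphite/Carbon
// collector every 10 seconds.
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public class RequestTiming {
    static final MetricRegistry registry = new MetricRegistry();

    public static void main(String[] args) throws Exception {
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("folio.mod-example")                // placeholder prefix
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(new Graphite(new InetSocketAddress("graphite.example.org", 2003)));
        reporter.start(10, TimeUnit.SECONDS);

        Timer handlerTimer = registry.timer("handler.get-loans"); // placeholder name
        for (int i = 0; i < 100; i++) {
            try (Timer.Context ignored = handlerTimer.time()) {
                Thread.sleep(5); // stand-in for handling an actual request
            }
        }
    }
}
```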