A couple of high-level notes to get the conversation started.
In terms of scalability, we have tried to tackle some of the issues early in the design process by choosing an architecture that is inherently horizontally scalable. That's the primary reason for splitting the Platform into three distinct tiers: system/data modules, stateless business logic modules, and in-browser presentation/UI modules (the so-called SPA).
Scalability on the data layer is probably the hardest to get right, but we hope to exploit some characteristics specific to the library domain: the write-heavy pieces (e.g. circulation and, to some extent, cataloging) can in many circumstances be partitioned by tenant (thus limiting the size of any particular DB instance), while the read-heavy pieces (e.g. search) can be optimized by maintaining a separate index that is updated much less frequently, or whose updates are scheduled during idle time. It's worth noting that we would like to avoid forcing any single partitioning approach: we know that tenants will vary in size. In some cases it will be cost-effective to group smaller tenants within a single DB instance, while large tenants will be partitioned separately from the others and, in extreme cases, their DB instances could be sharded. It is of course crucial that the DB engine we use allows for these flexible modes; the ones we have been investigating (MongoDB and PostgreSQL) do.
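To make the flexible partitioning idea a bit more concrete, here is a minimal, hypothetical sketch (the class, tenant names, and JDBC URLs are illustrative only, not actual Platform code) of routing a tenant either to a shared PostgreSQL instance with one schema per tenant, or to a dedicated instance for a large tenant:

```java
// Hypothetical sketch of flexible tenant-to-database routing. Small tenants share
// one PostgreSQL instance (separated by schema); large tenants get a dedicated instance.
import java.util.HashMap;
import java.util.Map;

public class TenantDbRouter {

    /** Connection target for a tenant: a JDBC URL plus a schema within that instance. */
    public static final class DbTarget {
        final String jdbcUrl;
        final String schema;
        DbTarget(String jdbcUrl, String schema) {
            this.jdbcUrl = jdbcUrl;
            this.schema = schema;
        }
        @Override public String toString() {
            return jdbcUrl + " (schema: " + schema + ")";
        }
    }

    // Tenants that have been assigned their own dedicated (possibly sharded) instance.
    private final Map<String, String> dedicatedInstances = new HashMap<>();
    // Everyone else lands on a shared instance, one schema per tenant.
    private final String sharedInstanceUrl;

    public TenantDbRouter(String sharedInstanceUrl) {
        this.sharedInstanceUrl = sharedInstanceUrl;
    }

    public void assignDedicatedInstance(String tenantId, String jdbcUrl) {
        dedicatedInstances.put(tenantId, jdbcUrl);
    }

    /** Resolve the database target for a given tenant's request. */
    public DbTarget resolve(String tenantId) {
        String url = dedicatedInstances.getOrDefault(tenantId, sharedInstanceUrl);
        return new DbTarget(url, "tenant_" + tenantId);
    }

    public static void main(String[] args) {
        TenantDbRouter router = new TenantDbRouter("jdbc:postgresql://shared-db:5432/folio");
        router.assignDedicatedInstance("big_library", "jdbc:postgresql://big-db:5432/folio");

        System.out.println(router.resolve("small_library")); // shared instance, own schema
        System.out.println(router.resolve("big_library"));   // dedicated instance
    }
}
```

The point of keeping the routing in one small component is that the grouping/partitioning policy can change per tenant over time without touching the modules that use the data.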
Scaling the business logic modules should be much simpler, especially in an elastic cloud environment, assuming we can enforce statelessness. This requires rigour in how developers structure their apps, and we will provide examples and guidelines for how to approach it. With stateless modules, CPU is usually the bottleneck, but that is also an aspect we have standard methods for dealing with (e.g. load-balancing across multiple processes).
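As an illustration of the statelessness guideline (a sketch only, not an actual Platform module; the endpoint and the use of a tenant header such as X-Okapi-Tenant are assumptions here), the handler below keeps no state between requests, so any number of identical processes can be run behind a load balancer:

```java
// Minimal sketch of a stateless business logic module: no per-user state is held
// in memory between requests, so the process can be replicated freely.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class StatelessModuleExample {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        server.createContext("/loans", exchange -> {
            // All context comes from the request itself (and from the data layer) --
            // no instance fields, no session objects kept between calls.
            String tenant = exchange.getRequestHeaders().getFirst("X-Okapi-Tenant");
            if (tenant == null) tenant = "unknown";

            byte[] body = ("{\"tenant\": \"" + tenant + "\", \"loans\": []}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        // Run several identical copies on different ports/hosts and load-balance across them.
        server.start();
    }
}
```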
Finally, the UI is implemented as a Single Page App (SPA) that is pre-compiled into a static bundle of assets (JavaScript, HTML, CSS, images, etc.), based on the tenant's selection of apps, and re-generated after every change to that selection. Such a bundle can be served by a general-purpose web server (or a CDN, which is generally extremely fast and scalable), and the dynamic elements execute directly on the client (browser).
Stress/performance testing and instrumentation are, as you and Seb point out, crucial to making sure this well-crafted, but also complex (because of its distributed nature), architecture actually works. We will be performance testing continuously on our CI system, although that is of course focused on the core modules. For debugging performance/scalability/latency issues, instrumentation (and visualisation of metrics) is key – Okapi will have first-class support for standard instrumentation collectors (like Graphite), and we will provide guides for how to report instrumentation data from within the modules.
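As an example of the kind of reporting we have in mind (a sketch under assumptions, not the actual guides: the host, port, and metric names are placeholders), the snippet below pushes a single metric to a Graphite/Carbon collector using its plaintext protocol, i.e. one "path value epoch-seconds" line per metric on TCP port 2003:

```java
// Sketch of sending one metric value to Graphite's plaintext listener.
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphiteExample {

    /** Send a single gauge value to a Graphite/Carbon plaintext listener. */
    static void sendMetric(String host, int port, String path, double value) throws Exception {
        long epochSeconds = System.currentTimeMillis() / 1000L;
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            // Plaintext protocol: "<metric path> <value> <timestamp>\n"
            out.write(path + " " + value + " " + epochSeconds + "\n");
            out.flush();
        }
    }

    public static void main(String[] args) throws Exception {
        // e.g. the latency of a circulation checkout request, in milliseconds
        sendMetric("graphite.example.org", 2003, "folio.circulation.checkout.latency_ms", 42.0);
    }
}
```

In practice a module would batch and periodically flush such values rather than open a socket per metric; the guides would cover that, along with naming conventions so the metrics are easy to visualise and compare across modules.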