I was present at the Reporting SIG meeting on 12/4/2017 and have reviewed the meeting recording a second time to try to get a clear sense of Vince’s presentation. Based on that, I’d like to raise some questions:
Based on the Analytical Reports graphic and Vince’s explanation, I infer that the only data stored in what we’ve been calling the reporting system will be the transactional records in the data lake. Is that the plan at this point? There doesn’t appear to be any place to store MARC records, for example, either in whole or in part.
In order to get metadata for a report, such as title or author, the analytics tool will access the Codex from the operational system; there will be no Codex or equivalent in the reporting database. Correct?
If my first point is correct and the only data stored in the data lake is transactional data from FOLIO, what are the plans, if any, to convert retrospective transactional data? It looks to me as though the data lake is intended to record transactional data from FOLIO Day 1 going forward only. Is that an accurate assessment? Or is there going to be some facility for incorporating historic transactional data into FOLIO, such as previous circulation history or the record of serials payments for the last several years?
Vince was cut off in the original meeting before he finished laying out the scenario for creating a report that combined circulation and acquisitions history. Vince, could you please complete that scenario here? How would I do a report, for example, that traces the cost per circulation of items that were bought on Engineering College funds and checked out by faculty or students of the Engineering College? I understand that I could get author/title metadata from the Codex, but where would I derive the data for the College affiliation of the person who checked out the item?
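To make the question concrete, here is a minimal relational sketch of the cost-per-circulation report, assuming for illustration that loan events, order lines, and some source of patron affiliation are all queryable in one place. Every table and column name below is hypothetical, not an actual FOLIO or data-lake schema; whether the college affiliation would live in the data lake or have to be fetched from the operational user records is precisely the open question.

```python
import sqlite3

# Hypothetical schema -- all table and column names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE loans(item_id TEXT, patron_id TEXT);
CREATE TABLE patrons(patron_id TEXT, college TEXT);
CREATE TABLE order_lines(item_id TEXT, fund TEXT, amount REAL);
""")
con.executemany("INSERT INTO loans VALUES (?, ?)",
                [("i1", "p1"), ("i1", "p2"), ("i2", "p3")])
con.executemany("INSERT INTO patrons VALUES (?, ?)",
                [("p1", "Engineering"), ("p2", "Engineering"), ("p3", "Law")])
con.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                [("i1", "ENGR", 100.0), ("i2", "ENGR", 50.0)])

# Cost per circulation for items bought on Engineering funds and
# checked out by Engineering-affiliated patrons.
rows = con.execute("""
SELECT o.item_id,
       o.amount / COUNT(l.patron_id) AS cost_per_circ
FROM order_lines o
JOIN loans l   ON l.item_id = o.item_id
JOIN patrons p ON p.patron_id = l.patron_id
WHERE o.fund = 'ENGR' AND p.college = 'Engineering'
GROUP BY o.item_id, o.amount
""").fetchall()
print(rows)
```

The join on `patrons.college` is the crux: the report cannot be produced at all unless patron affiliation is reachable from wherever the circulation transactions are stored.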
Qulto is currently reviewing analytics tools. Will they go on to develop the process of streaming the transactional data from the operational system to the data lake? Who, if anyone, has been tasked with developing what we’ve been calling the in-module reports?
“Transaction” is a word with many meanings depending on how deep in the software stack you are looking. Editing a MARC record is a transaction (the act of transmitting a new or updated record from the user interface through the Okapi Gateway to the business logic and storage modules), so that information is available in the data lake/refinery.
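For the sake of discussion, a transaction in this sense might be captured as an event envelope streamed to the data lake. The sketch below is purely illustrative: every field name is an assumption, not the actual FOLIO event schema.

```python
import json

# Hypothetical event envelope for a MARC record update passing through
# the Okapi gateway on its way to the data lake. All field names here
# are illustrative assumptions, not the actual FOLIO event schema.
event = {
    "eventType": "RECORD_UPDATED",      # the "transaction" in this sense
    "domain": "inventory",              # originating FOLIO app/domain
    "recordId": "iid-0001",             # id of the record that changed
    "timestamp": "2017-12-04T10:15:00Z",
    "payload": {                        # record state after the edit
        "title": "Example title",
        "updatedBy": "user-42",
    },
}

serialized = json.dumps(event)
print(serialized)
```

On this model, a record that is never edited in FOLIO never generates such an event, which is exactly why the retrospective-data question below matters.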
This is an interesting problem of data modeling and migration. If the transactional data is available in the old system, then it becomes a question of how much effort one is going to put into creating the transformations to get that data into a format that the FOLIO apps recognize. I think this is something that is still to be worked out, but it will likely depend on a case-by-case review of the data that is being migrated.
Thanks for the clarification on the use of transaction in this context. But that answer feeds back to my question about what data is going to be available in the reporting system Day 1. How will retrospective records be handled, those which haven’t been updated in FOLIO or added as new to FOLIO? How do we get access to all of the MARC data that is considered reportable, for example, regardless of whether it’s gone through an update transaction? It still sounds as though the record won’t be in the data lake unless it has been updated in FOLIO, in which case we would have to access some records from one source and some from another. Or will certain types of records be streamed into the data lake as part of the initial conversion to FOLIO?
And this doesn’t apply just to MARC records, of course, but any type of record. If we migrate purchase order (PO) records, for example, and need to combine their data with circ data to determine if the College of Engineering is using the stuff we’re buying for them, will we have to access those POs from two different sources, i.e. operational FOLIO and the data lake?
These questions may sound unnecessarily detailed at this point, but I think we need to hear scenarios that address these kinds of issues, so that we can correctly visualize how reporting will work.
I don’t have a more exact mechanism other than “it’ll happen in migration”. I recognize that this is important, but until some other architectural points are addressed it isn’t easy to say how it will happen.
I think some of the concerns around reporting are best illustrated with real examples. Here’s an example from just a month ago. The Law public services librarians needed a report to support a weeding project:
Materials located in the Law Reference collection
Data about those materials:
  call number, copy number
  author, title, imprint (publisher and publication date)
  additional copies of these materials and their locations
  volume counts for all copies
  extent of ownership and active subscription status (required in past projects and anticipated here)
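The weeding report above is the kind of query our librarians build today against a relational database. As a sketch only, under an invented schema (every table and column name is hypothetical), the core of it looks like this:

```python
import sqlite3

# Hypothetical schema for the weeding report; all names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bibs(bib_id TEXT, author TEXT, title TEXT, imprint TEXT);
CREATE TABLE items(item_id TEXT, bib_id TEXT, location TEXT,
                   call_number TEXT, copy_number INTEGER);
""")
con.executemany("INSERT INTO bibs VALUES (?, ?, ?, ?)",
                [("b1", "Smith", "Torts", "West, 1998")])
con.executemany("INSERT INTO items VALUES (?, ?, ?, ?, ?)",
                [("i1", "b1", "Law Reference", "KF1250 .S6", 1),
                 ("i2", "b1", "Law Stacks",    "KF1250 .S6", 2)])

# Start from items shelved in Law Reference, pull bib metadata, and
# list every other copy of the same title with its location.
rows = con.execute("""
SELECT b.title, b.author, b.imprint, ref.call_number, ref.copy_number,
       other.item_id, other.location
FROM items ref
JOIN bibs b           ON b.bib_id = ref.bib_id
LEFT JOIN items other ON other.bib_id = ref.bib_id
                     AND other.item_id <> ref.item_id
WHERE ref.location = 'Law Reference'
""").fetchall()
print(rows)
```

Each refinement (adding volume counts, subscription status) is one more join or column, which is why iterative ad hoc construction is cheap in a relational environment and why the equivalent workflow in FOLIO's reporting architecture needs to be spelled out.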
Our TS librarians have the skills to get this kind of report out of a relational database. In any environment, this is the sort of ad hoc report that will be built up iteratively, so each iteration should not be too time consuming. There are similar inquiries that we may make of millions of records, so the ability to scale is important.
What kind of skills will be needed to produce such ad hoc reports in FOLIO? How well does the reporting architecture scale? These questions require answers that indicate a practical reporting architecture if FOLIO is to be successful.
Recall that I am proposing drawing a line between what I’m calling Statistical reporting and Analytical reporting. The former would be within the scope of a single Folio domain and would thus occur within Folio. The latter would involve multiple domains and the Data Lake, and would sit outside of Folio. Of course nothing prevents using Analytical reporting for a single domain.
Legacy data may or may not be migrated into Folio
If legacy data are migrated, there will be a transaction trail for that migration which will make them available for reporting in the Data Lake
If legacy data are not migrated, they can still be added to the Data Lake independently of Folio. This is because the Analytics reporting system sits outside of Folio.
More generally, having an external Analytics reporting system is very powerful. It allows making data available without first needing to find a way to support them in Folio, and some of those data may never be directly supported in Folio at all.
By FOLIO domain, I think @vbar is referring to the different record types defined in FOLIO. In the FOLIO design, there is generally a one-to-one correspondence between record types and apps: a vendors app for vendor records, a users app for user records, an orders app for order records, and so forth.