Listing MARC fields for reporting

I know we have made the point several times in our SIG meetings that when it comes to what data we need from FOLIO for reports, we say, “we need all of it.” Given that there has always been substantial agreement on that point in the SIG, I wonder why we are trying to put together a list of MARC fields that are necessary in the reporting system. I feel the correct answer is that they are all necessary. We can’t choose which MARC fields we will need because we can’t anticipate future needs; based on the examples in the master spreadsheet and our demos I believe we have documented that we currently use all fields in MARC bibliographic and holdings records and many in the authority records also.

I’m very much afraid that if we don’t have complete MARC records available in the FOLIO reporting system, we’ll end up having to build that data someplace else and keep it updated there. Or even worse, that when we need a report where the data appears only in a 5xx field (such as which items in Texas A&M’s Special Collections were acquired as “part of the personal Science Fiction/Fantasy library of Anne McCaffrey”) we’ll have to export the whole MARC bib record database and scan the records one by one to get the information. Let’s just say we need our MARC records complete in our reporting system, structured so that individual fields are easy to access. And be done with it.

I think Anne has a valid point.

And seconded (thirded?). As surely as fields are left out, some library will want or need them. Can we start from the assumption that all fields should be reportable - all fields in all records: user records, item records, order records, invoice records, vendor records, holdings records, bib records (in all forms)? And if so, are there any good reasons to pull back from that, or should we just assume all fields should be available for reporting? The only ones that seem a little iffy to me are some of the sensitive user data.

Yes, my feeling is that we should have all fields (this applies to other record migrations as well as MARC). If there are strong reasons why being selective is preferable, those reasons need to be fully explained. As Anne notes, it is extremely difficult to anticipate future needs for reporting purposes, and almost certainly more difficult to cope with the absence of those fields after the fact.

Just to defend my idea of creating a list of MARC fields (bib + holdings) for reporting: the intent was to have a listing of “what information should be taken from where” (for the developer teams and the database design), not to be selective about which fields are included.

I have what may be a contrary view.

MARC is not the only descriptive format that will be in FOLIO. Data from the electronic holdings knowledge bases is not likely to be MARC, and we have made deliberate design decisions to enable descriptive formats other than MARC to be used in FOLIO. By concentrating on just MARC fields, we are going to miss a lot of other data formats. On the other hand, setting ourselves up to include all data in all formats may make the reporting system unmanageably large.

I’d prefer a scenario where fields are specified for the different instance descriptive formats and have the reporting engine work with that. If additional fields are needed, then perhaps they could be added to the format-specific definition and a batch job launched to pull that information out of the source records into the reporting database.

Hello Peter, can I conclude from your lines that you do not want us (the Reporting SIG) to make any reference to MARC, and instead want us to just describe the fields we need (in words or by their contents)? (I am not quite sure what you mean by “different instance descriptive formats.”) That would leave the job of mapping from MARC to some other SIG (probably Metadata Management).

I certainly understand the point that MARC will not be the only descriptive format in FOLIO. And I’m glad to see it, knowing that we will have Dublin Core and likely other flavors of xml to load if not other formats.

I don’t see, however, why the requirement that all MARC records be available intact in the reporting system should be a problem; it almost sounds, Peter, as though you’re saying that including complete MARC records would preclude having other record types, and I don’t understand why that would be. In fact, I’d like to shift the discussion even further and ask: why do we have to specify a single list of data elements for reporting at all? Why couldn’t we, for instance, have a library-specific configuration file that specifies which data elements from which records are available in the reporting database, whether they come from MARC records or not?
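To make that idea concrete, here is a minimal sketch of what such a library-specific configuration might look like. The structure, record names, and field names are all invented for illustration; this is not an existing FOLIO feature.

```python
import json

# Hypothetical per-library configuration of which data elements are
# exposed in the reporting database. "all" means the complete record;
# a list means only the named elements. All names are illustrative.
config = json.loads("""
{
  "bib":      {"source": "marc",        "fields": "all"},
  "holdings": {"source": "marc",        "fields": "all"},
  "users":    {"source": "json-record", "fields": ["patron_group", "campus"]}
}
""")

# The extract process would consult this to decide what to stream
# into the reporting database for each record type.
reportable = {record_type: spec["fields"] for record_type, spec in config.items()}
```

A file like this would let one library expose everything while another trims sensitive user data, without changing the reporting engine itself.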

An additional point in favor of requiring complete MARC records in the reporting system is this: for most institutions, FOLIO will be the only source where those records are available. We may load our repository, archival finding aid and digital asset records into FOLIO to permit wider discovery, but we’re not going to dump our repository or archival finding aid software or our digital asset management system and put those records solely into FOLIO. They will continue to exist in their “home” systems. MARC records won’t; FOLIO will be the only place they’re available.

I have discussed with my boss this issue of specifying MARC fields or any fields for the reporting database to get his opinion. And his off-the-cuff response was that if we have to specify a list of fields for any record type, not just MARC records, our first local modification to FOLIO would probably be to rewrite the reporting module so that we have everything available to us. Which, of course, we’d really rather not do.

As I read this thread, it strikes me that the discussion is focused on implementation details - i.e., it is specifying parts of a solution design. What is less clear to me are the specific problems that such a solution is meant to address.

I would personally find it most beneficial to understand the use cases that the Reporting SIG has come up with. Are those available somewhere? Put another way, the use cases would form a strong basis for defining what data need to be extracted for reporting purposes.

In part, I want the Reporting SIG to be prepared to consider record formats other than MARC. And in considering things other than MARC, I think that the use cases we have are not limited to just MARC constructs.

I think what I’m saying is that one won’t be including complete MARC records in the reporting database as MARC record blobs. There will be some transformation (for instance, breaking up the leader into individual components) and denormalization (for instance, replacing codes with label values to reduce the number of joins that need to be done to create a report) on MARC records – and other record formats – as they are streamed to the reporting database. And so there is already work involved in specifying these transformation and denormalization steps for every record type.
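As a rough sketch of the kind of transformation and denormalization step described above: break the MARC leader into named components and replace positional codes with human-readable labels before loading. The column names and the (abbreviated) label tables here are illustrative only.

```python
# Illustrative label tables for two MARC leader positions (abbreviated;
# the real code lists in the MARC 21 spec are longer).
RECORD_STATUS = {"c": "corrected", "d": "deleted", "n": "new", "p": "partial"}
RECORD_TYPE = {"a": "language material", "e": "cartographic material",
               "j": "musical sound recording"}

def explode_leader(leader: str) -> dict:
    """Split a 24-character MARC leader into labeled components,
    denormalizing coded values into display labels where known."""
    return {
        "record_length": int(leader[0:5]),
        "record_status": RECORD_STATUS.get(leader[5], leader[5]),
        "record_type": RECORD_TYPE.get(leader[6], leader[6]),
        "bibliographic_level": leader[7],
        "encoding_level": leader[17],
    }

row = explode_leader("01234nam a2200289 a 4500")
```

The same pattern (split, look up, store labeled columns) would apply to fixed fields like the 008 and to coded subfields in other record formats.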

The use cases are defined in the Reporting SIG Master Spreadsheet

I suspected as much, but ran into a permission issue trying to access the Google Drive folder. I’ve requested permission, but is there any particular reason the folder is locked down?

Vince, I’m sending you a separate email to address the issue of access to Google Drive for Reporting.

Meanwhile, I didn’t mean to start a discussion that would result in our having a SIG meeting online, but it looks like that’s where we’ve ended up. Should we just table this until the SIG meets, at which time Sharon and Ingolf will also be able to report on their meeting with Katalin, our Product Owner, which might also shed some light on this?

Picking this up a bit late, and because I cannot tell where this went from the meeting notes:

It looks like the reporting module is the primary way we have to do research on our data, whether for canned reports for administrative use or for data consistency and other operational functions. We cannot say which fields need to be indexed and which do not, because we do not know in advance which we will need to search on. Canned reports are one thing, but operationally dealing with data maintenance is another.

Let us just take the example of data maintenance. There are many ways in which erroneous data can get into our systems: a vendor supplies some bad data or an operator misapplies policy, the upload and/or overlay process goes wrong, some business logic has errors, or we discover errors in data migration from a previous system. Some of these problems can lie fallow for a long time before being discovered. In a similar vein, there may be changes in policy or practice that we need to retroactively apply to our data.

Whatever the case, we do not know in advance what fields will be affected, or which need to be cross-referenced while determining the scope of a problem or identifying specific data to be fixed. In practice, in our current and previous systems, we may do complicated joins across different tables. These queries may cross different modules, for example bringing together purchase order information with bib, holdings, and item information. As mentioned above, we cannot predict what will be needed, and our experience indicates anything may be needed; there is no way to pre-define the scope of our data mining needs.
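The kind of cross-module query described here might look like the following in a relational reporting store. The table and column names are invented for the sake of a runnable example; they are not FOLIO’s actual schema.

```python
import sqlite3

# Toy relational reporting store illustrating a cross-module join
# (purchase orders + bibs + items). All names are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, bib_id INTEGER, barcode TEXT);
CREATE TABLE purchase_order (po_id INTEGER PRIMARY KEY, bib_id INTEGER,
                             vendor TEXT, amount REAL);
INSERT INTO bib VALUES (1, 'Dragonflight');
INSERT INTO item VALUES (10, 1, '31234000000001');
INSERT INTO purchase_order VALUES (100, 1, 'Vendor A', 25.00);
""")

# One ad-hoc question spanning three "modules" in a single query.
rows = db.execute("""
    SELECT b.title, i.barcode, po.vendor, po.amount
    FROM purchase_order po
    JOIN bib b ON b.bib_id = po.bib_id
    JOIN item i ON i.bib_id = b.bib_id
""").fetchall()
```

The point is that any field left out of the store is a field this sort of ad-hoc join can never reach.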

It is probably also true that new fields will be added to the various metadata schemas. Is it okay if we start with a subset of all and provide a way for new fields to be added to the index retrospectively? That way the same mechanism could be used whether it is a new field or an existing field that now has a need for reporting.

I’m not sure if Peter’s question was addressed generally or specifically to Tod, but I’d like to add my two cents in any case.

If the list Peter suggests exists in addition to the reporting system having all MARC content complete with content designation, that would be very useful. But in my opinion that doesn’t take the place of having all MARC content & content designation in the reporting database.

That leads me to a response I’d like to make to a point that was made in the meeting on Monday, 11/13/2017. (Always have to reflect on these issues, particularly when new info is presented.) On the question of whether we need all the data in the reporting database, I contend that we do need all of the data in the same format for reporting in the same place. I think we need access to all of the “raw” data as well as reports via a data warehousing concept. I want all the raw data in the same place because, as Tod makes the point above, “… we cannot predict what will be needed, and our experience indicates anything may be needed, there is no way to pre-define the scope of our data mining needs.” That implies to me that regardless of whatever data warehouse-style reports we develop, programmers at individual institutions will have to write custom reports. They should be able to do that, and do it from a single data source, without having to constantly futz (highly technical word) with a config file or go back and forth between operational and reporting systems.

I’m feeling the need to push back against this, because there are issues with storing it all: it impacts performance and costs money to store, back up, and recover in the case of disaster. A data warehouse should be structured in such a way that adding new elements can be accomplished either by adding a column or by adding a supporting table (sometimes referred to as a snowflake schema).
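A minimal sketch of that snowflake-style layout, with all names invented: a fact table points to a dimension table, which points to a supporting sub-dimension, and a later reporting need is met by adding a column rather than redesigning the schema.

```python
import sqlite3

# Snowflake-style layout: loan facts -> bib dimension -> material-type
# sub-dimension. Names are illustrative, not FOLIO's schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE material_type (type_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE bib_dim (bib_id INTEGER PRIMARY KEY, title TEXT,
                      type_id INTEGER REFERENCES material_type(type_id));
CREATE TABLE loan_fact (loan_id INTEGER PRIMARY KEY,
                        bib_id INTEGER REFERENCES bib_dim(bib_id),
                        loan_date TEXT);
""")

# A new reportable element later on is an ALTER, not a redesign.
db.execute("ALTER TABLE bib_dim ADD COLUMN language_code TEXT")
columns = [row[1] for row in db.execute("PRAGMA table_info(bib_dim)")]
```

Adding a supporting table (say, a language sub-dimension) instead of a column would be the same kind of incremental, low-disruption change.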

If we do this right (and since we are talking about it early enough, I would expect it to be done ‘right’), modifying a data integration process between the transaction system and the data warehouse is a quick task to complete in a fairly agile environment.

I’ve been talking with a college friend who has built a career in business intelligence systems. He said that in his experience we should design a system that answers the most critical questions – that usually amounts to 70-80% of the analytically valuable data. He also introduced a new term: data lake. When there is a need to do reporting on raw data, the ‘data lake’ concept can be used to store and retrieve the raw MARC records. It can be thought of as a “staging area” for quickly bringing needed data points into the data warehouse.
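A toy illustration of the ‘data lake as staging area’ idea described above: raw records are kept whole and unindexed in the lake, and when a new reporting need arises, one field is promoted into the structured warehouse. The record shapes and names are invented for illustration.

```python
# Raw records sit in the 'lake' untransformed. Here each record keeps
# its fields keyed by MARC tag; shapes are invented for illustration.
lake = [
    {"id": "b1", "fields": {"245": "Dragonflight", "590": "McCaffrey gift"}},
    {"id": "b2", "fields": {"245": "Foundation"}},
]

warehouse = {}  # record id -> promoted, structured columns

def promote(tag: str, column: str) -> None:
    """Batch-pull one field out of the raw lake into the warehouse,
    the 'quick bring-in of a needed data point' described above."""
    for rec in lake:
        warehouse.setdefault(rec["id"], {})[column] = rec["fields"].get(tag)

# A 5xx-note report is suddenly needed: promote the field on demand.
promote("590", "local_note")
```

This is roughly the batch-job-on-demand model Peter described earlier in the thread: the raw data is never lost, but only promoted fields carry warehouse cost.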

This is what I had in mind.

I think it is crucial that the new system will offer a comfortable way to add new fields to the central reporting data warehouse. Yes, one should not have to deal with a config file and restart the system. I agree that having all valuable data in the reporting data warehouse out of the box probably comes with too much cost (of performance, money etc.). Considering the locally defined fields it is not even possible to have all fields without customizing. I like the idea of a staging area between the raw data and the data warehouse.

I have to admit that I am pretty ignorant about much of what is being proposed about the data warehouse/data lake concept. Here are some questions that I have:

  1. How is the ‘data lake’ different from the ‘data warehouse’?
  2. If all the data in all the records are being stored in a data lake, how is that a cheaper solution than storing it in a data warehouse?
  3. The bottom line for me is how this plan would affect creating reports (time, difficulty, etc.). Suppose we went with the data lake proposal. What would happen if I wanted to do a query that used 80% of data from the data warehouse, and 20% of data from the data lake. How does this impact the actual time it takes to construct and run the query and get results? And, if I’ve used data from the data lake, does that data get put into the data warehouse (or does it remain in the lake)? Should I care?
  4. What are the potential (or known) downsides to using the ‘data lake’ system?

Thanks all! - Joanne