Discuss.FOLIO.org is no longer used. This is a static snapshot of the website as of February 14, 2023.

Mapping MARC fields to Codex

16 Jul '17

I had some very specific thoughts about the “Instance Metadata Elements” slides from @Kathryn’s powerpoint. I’m curious what others think.

I’m wondering why 130 is mapped to creator? I think it would be appropriate to map to title. Often, a 245 is not enough to distinguish a resource. When this is the case, there should be a 130 that would help.

As was already brought up in chat, publisher should come from both 260 and 264 $b

I’m concerned about taking the publication date(s) from 26x. For serial publications at least, often the 26x$c is not populated, but the fixed field date elements should be.

As a former map cataloger, I’m also thinking about some difficulties particular to accurate identification of maps, and I’m sure other non-book formats (any music catalogers here?) have additional challenges. Would it be possible to have a field somewhat akin to the 368 “other attributes” (http://www.loc.gov/marc/authority/ad368.html) currently used in authority records? This could be mapped, for instance, to scale and coordinates fields for maps.

I’m also wondering if the “creator” field could be repeatable. Currently, due to the structure of MARC, a book with two authors will have one traced in a 100 and the other traced in a 700. But both have creator roles. (This might present a challenge with pre-RDA records?)

Finally, in some cases I think the 7xx fields could be very helpful for identification/disambiguation. For example, many serial publications have an issuing corporate body in a 710, while the 130 might have a place or date that is much less helpful.

17 Jul '17

Hi, Laura…

A couple of responses to your questions and comments…

  1. Re: the 130, that’s likely my error. The uniform title should have been mapped to the Resource Name element, along with the 245 and its subfields.

  2. Mapping publisher data from both fields is an easy enough change to make.

  3. To confirm, the publication date mappings would be more accurate/complete if they came from the 008/07–14…correct?

  4. I do think that we need to have a conversation about non-book resources and some of the challenges they may present to this model – or other types of data that we may need to include to support access to that content through the Codex. This is an area in which I will need to defer to the group’s expertise…and probably one in which a workshop-type call might be most beneficial to fully exploring the issues.

  5. I see no reason why Creator can’t be repeatable…it’s a carryover from MARC, but probably an anachronistic one given where we’re headed. The challenges of mapping probably warrant a discussion, either here or in a future call.

  6. The 7xx fields have been in and out of the work I’ve done…and was a topic that I wanted to broach with the group. For the purposes of disambiguation, do you expect this information to appear on a search results display?

I’m going to update the spreadsheet I referenced in my slide deck with some of your comments about proper mappings so that this stuff stays in sync.

Thanks for the feedback and questions!

17 Jul '17

Hi, Kathryn,
As a serials cataloger I think that mapping the data from the 246 and 247 fields is important as well as mapping the series statements from the 490 and 830 fields. Having all of this info would help with searching and easily identifying/disambiguating records. And I agree with Laura–we should definitely consider including info from some 7xx fields and that mapping dates from the 008/07-14 would be best.
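The suggested additions can be sketched as data. A minimal illustration in Python of extending the mapping table with the fields mentioned above; the Codex element names on the right are my assumptions for illustration, not the actual spreadsheet values:

```python
# Hypothetical extension of the MARC -> Codex mapping, expressed as a table.
# The Codex element names here are illustrative placeholders only.
SERIAL_TITLE_MAPPINGS = {
    "245": "Resource Name",
    "130": "Resource Name",      # uniform title (per the corrected mapping above)
    "246": "Variant Title",      # varying form of title
    "247": "Former Title",       # former title for serials
    "490": "Series Statement",   # series as transcribed
    "830": "Series Statement",   # series added entry (controlled form)
}

def codex_elements_for(tag: str):
    """Look up the candidate Codex element for a MARC tag, if any."""
    return SERIAL_TITLE_MAPPINGS.get(tag)
```

Keeping the mapping as plain data like this makes it easy to keep the code and the spreadsheet in sync.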

18 Jul '17

Hi Kathryn,
I agree that the 7xx fields (particularly the 700 added entry personal name) should be traced in the Codex. The difference between the 100 and the 700 is, I believe, a legacy from the days of the card catalog. At times the difference between an author being placed in the 100 and the 700 is as simple as the order in which they are listed on the title page.
I think reaching out to experts in other formats is a good idea. In addition to maps, music, and a/v, I would like to reach out to some of the DCRM community (DCRM stands for Descriptive Cataloging of Rare Materials).

19 Jul '17

Thanks for this input, Natascha…

I’m going to add these fields to the spreadsheet I’m maintaining (https://docs.google.com/spreadsheets/d/1yVc-BYtM5eQF1zIooeP3b7JP2wr5_ztYsPIHwV3WBfg/edit?usp=sharing) so that all of the fields that we need to consider are in one spot, and I’ll discuss with the architecture team in our call later today so that we’ll all be prepared for tomorrow’s MM SIG conversations!

19 Jul '17

Likewise, thanks to you, Sarah, for this input…I’ll add these comments to the spreadsheet (https://docs.google.com/spreadsheets/d/1yVc-BYtM5eQF1zIooeP3b7JP2wr5_ztYsPIHwV3WBfg/edit?usp=sharing), as well!

I’ll also see if we can reserve a few minutes with the group tomorrow to talk about identifying and involving some experts in other formats. That’s a real need and will require some special focus.

19 Jul '17

Hi again, @LauraW

As I was updating the spreadsheet, I realized I needed some additional input from you (and others who have indicated that the 7XX fields should be included in the Codex).

Currently, the 700, 710, 711, and 720 are included in the Codex, but mapped to Contributor – is that correct? (Actually, as I re-read @sarahlschmidt’s comments, I believe she’s indicating that they would most likely be mapped to Creator, since the 700 field is often populated with second+ authors, etc.).


19 Jul '17

I would like to know what others think (@natascha ? @sarahlschmidt ?). I think that since 700, 710, and 711 (and I assume 720 also, though my institution doesn’t use it) may contain either Creator or Contributor names, it would make more sense to have a single (repeatable) Codex category for Creator/Contributor.

Can anyone think of an example where this would cause problems/confusion?

19 Jul '17

Laura, I can’t think of any reason why Creator/Contributor shouldn’t be combined into one category for the purposes of the Codex.

Kathryn, I was just taking a peek at the spreadsheet and I thought it might be worth clarifying the various date info that is found in serial MARC records. The 008/07-14 dates are actually drawn from the 362 field(s) whenever possible. Generally, the dates in the 362 field apply to the chronological designation of publication and not the actual publication date. For example, an annual report might have a chronological designation of 2014 with a publication date of 2015. However, if there is no chronological info then publication date is used. So, sometimes the dates in the 008/07-14 correspond to the dates in the 260/264 fields (if there are any) but sometimes they don’t. Anyway, this is just reiterating that we should map “publication” dates from the 008/07-14.
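A minimal sketch of pulling those dates from the fixed field, assuming Python and the documented MARC 21 layout of the 008 (008/06 = type of date, 008/07-10 = Date 1, 008/11-14 = Date 2):

```python
# Sketch only, not the official FOLIO mapping: extract publication dates
# from a raw MARC 008 string instead of relying on 26x$c.
def dates_from_008(field_008: str) -> dict:
    """Slice date info out of the 008 fixed field (MARC 21 positions)."""
    date_type = field_008[6]            # e.g. 's' single date, 'c' continuing resource
    date1 = field_008[7:11].strip()     # 008/07-10
    date2 = field_008[11:15].strip()    # 008/11-14 ('9999' = still being published)
    return {"dateType": date_type, "date1": date1, "date2": date2}

# A serial still in publication:
example = dates_from_008("170716c20099999nyufr p       0   a0eng  ")
# -> {'dateType': 'c', 'date1': '2009', 'date2': '9999'}
```

For a serial with no 26x$c, this still yields usable dates, which is exactly the gap Laura raised.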

19 Jul '17

A combined Creator/Contributor Codex field seems logical, but results in other questions/considerations.

How do we map/display the fields in a way that ‘makes sense’? Do we map the 1XX field (when it exists) to the first Creator/Contributor field? Does that get a label other than “Creator/Contributor”, or is it displayed differently from the other Creator/Contributors?
If we don’t distinguish 1xx mapped Cr/Co field from 7XX Cr/Co fields, we could create an interpretation/parsing problem - I’m thinking here of records for video specifically, where the list of contributors/actors/directors/etc. etc. can get quite long.
Relator terms (when they exist) might help with parsing, but there are lots of legacy records that don’t have that data.

my 2cents!

20 Jul '17

Hey All - apologies if I’m way out of place here, I’d like to contribute, but I’ve not found a gap in the conversation that felt right - but at some point my technical discomfort has outweighed my social anxiety and this is the result :wink:

When we did the work on the eBooks pilot in GOKb (some of that thinking made it into the Kabalog thinking that was at least a tributary precursor to some of the Codex ideas) we had a pretty strong requirement for creator data (and subject data, and a few other reference properties).

I’m reading this thread alongside the linked data thread also, and the thing that I’d like to throw out there is that as a storage (Codex?) model Person (Beer, Stafford) -> Role (Author) <- Resource (“Platform for Change”) — In effect formally separating out Work, Person and the Work-Person relation/role — might make much more sense than work.creator -> Author. This is especially true where we might want creators to play other roles, but be able to walk semantic relationships. I’m thinking particularly here with cataloguing and managing datasets for researchers so they can cite those datasets in published works. Similar problems appear in article processing charge handling systems for OA publishing, and in Reserves/other course materials handling.
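The Person -> Role <- Resource shape above can be sketched as plain data structures. This is only an illustration of the separation being proposed, not a proposed Codex schema; all the class and field names are mine:

```python
# Sketch: formally separating Work, Person, and the Work-Person relation,
# so the role lives on the link rather than on the work record.
from dataclasses import dataclass

@dataclass
class Person:
    name: str                      # e.g. "Beer, Stafford"

@dataclass
class Resource:
    title: str                     # e.g. "Platform for Change"

@dataclass
class RoleLink:
    """The Work-Person relation, carrying the role itself."""
    person: Person
    resource: Resource
    role: str                      # "author", "editor", "dataset creator", ...

beer = Person("Beer, Stafford")
work = Resource("Platform for Change")
links = [RoleLink(beer, work, "author")]

# Because the role is on the link, the same Person can play other roles
# against other resources (datasets, articles, course materials) without
# touching the Person record itself.
```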

I guess my worry is that as soon as you start to see the related items (Authors, Subjects, Editors, Publishers, etc) in the context of satellite “apps” (FOLIO speak) like APC Handling, Dataset cataloging, Course Reserves, Current Awareness Services, etc you start to need to treat people/subjects as first class citizens that can have their own relationships outside the catalog of works/instances/items.

My worry about this thread is that it seems to lean very heavily on (and to accept as a given) a traditional bib perspective. I spoke with seb (H) the other day and I felt there was a real miscommunication about how problems like this are perceived in bibframe too - and the root of that miscommunication was the conflation of a specific serialisation of the model with the model itself. I’m feeling like those assumptions might be baked in here also.

It strikes me that discussing the marc mapping in this way, we might be skipping a whole load of detailed domain mapping that might bite us later on. Apols if I’m speaking out of turn here - it’s not an easy thread to break into, and I’m substantially intimidated by the breadth of experience here - but I’m a bit wary of how this might play out in terms of the final built system, which is why I’m sticking my head up now.

Apols if this is all taken care of in the thinking already, really I just wanted to find an in to the conversation.


25 Jul '17

Hi Ian-
I appreciate you speaking up! Speaking for myself only :wink: I do have a bit (a lot) of legacy, cataloger, tunnel-vision, so it is definitely helpful to have folks outside the catalog-centric world give us a different perspective.

I’m in agreement that breaking out the Work / Person / Work-person relationship data points is going to serve us best in the long term.

I think what we’re wrestling with is the iceberg of legacy bibliographic data where the relationship/role is not explicitly stated, but inferred by assigning a name entry the 1xx MARC tag or 7xx tag - which is, I think, what leads us to the “Creator” / Contributor data bucket(s).

@vbar - would it make more sense for the Codex to have a “Name” data bucket - with role attributes that can be very granular if they are specified in the source record (think MARC 1xx/7xx subfield e or subfield 4) or very general (creator, contributor, publisher) if they are not specified in the source record?

This means more work on the mapping side of things - and would the work be worth it, if the Codex is really for internal FOLIO functions, and not exposed outside of FOLIO?
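The “Name bucket with role attributes” idea above can be sketched quickly. A hedged illustration in Python: the relator-code table is a tiny subset of the real MARC relator list, and the output shape is a placeholder, not the actual Codex schema:

```python
# Sketch of a single "Name" bucket whose role is as granular as the source
# record allows ($e relator term or $4 relator code), falling back to a
# generic creator/contributor split when no relator data exists.
RELATOR_CODES = {"aut": "author", "edt": "editor", "ill": "illustrator"}  # tiny subset

def name_entry(tag: str, name: str, sub_e: str = "", sub_4: str = "") -> dict:
    """Build a Codex-style name entry from a MARC 1xx/7xx name field."""
    if sub_e:                       # granular: relator term, e.g. "editor."
        role = sub_e.rstrip(".,")
    elif sub_4 in RELATOR_CODES:    # granular: relator code, e.g. "edt"
        role = RELATOR_CODES[sub_4]
    elif tag.startswith("1"):       # generic fallback for legacy records
        role = "creator"
    else:
        role = "contributor"
    return {"name": name, "role": role}
```

So a legacy record with no relator data still gets a usable generic role, while richer records keep their granularity.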

26 Jul '17

Ian, definitely a tangent to this discussion, so apologies. I wanted to pick up on an implication in what you posted about using semantic tagging to help with work/person/work-person relationships. In your thoughts, you talk about the advantages of separating out work/person/work-person relationships, and use datasets as an example that would benefit from this approach.

Here’s the tangent…it would be helpful, for objects such as datasets (and perhaps other types too), to have an ID minting facility in FOLIO that could be marshaled from relevant workflows such as the one discussed here for mapping incoming source records to the Codex. The ID Mint would allow an operator to generate and register, with appropriate management infrastructure, a unique identifier to associate with an object. This would allow the library to offer a service to the campus for minting DOIs or other domain identifiers and managing these for researchers and other users. The ID Mint would be a separate module (perhaps several, with a coordinating module that manages UX) that generates the appropriate object identifier and assigns it to the object. Useful for datasets, local repo objects, or other sorts of unique local content.


26 Jul '17

Interesting – this tangent is related to the Sharable Local Name Authorities project that is coming to a close with the drafting of its report this summer. I wrote about this project and how it might intersect with FOLIO last year. This is a more general case for the specific local name authorities issue (minting identifiers for all sorts of things), with the added twist that they somehow be published and discoverable by others that might want to link to them as part of their own linked data stores.

This probably isn’t a version-1 deliverable, though, so I’m tempted to tag this as a ‘#folio-future’ idea.

26 Jul '17

Peter…tag away!

Good point you make, and one I meant to add is that FOLIO could have a module that provides a public interface for resolving locally minted identifiers. External interfaces may require a specific implementation to respond to resolution requests from the network, but could be built on a general capability for lightweight discovery support.

26 Jul '17

Yep, to tie this thread back to others, I also think that this comes down to doing a little more rich domain modelling within FOLIO itself -

  1. Because I’d like to see if it’s feasible to come up with any standards that let us use hashing as a way to create identifiers rather than just minting another opaque ID - It seems really attractive to me to be able to say something like check index for HASH(PERSON+SURNAME+INITIAL+DISCRIMINATOR). I can imagine that we might need several variant “Standard” hashes (for different combinations).

  2. Related - I’ve been pondering the idea of using some kind of deferred name resolution for the reference data - essentially the ability to go RESOURCE -> REFDATA_OCCURRENCE <- CONTROLLED REFDATA. Although I know this looks just like a standard relational M:N or a semantic blank node, the approach taken in Jisc MONITOR was to use the relationship node to carry the refdata as it appeared in the source record. So if the author name “Beer, s” appears in a record, that is held in REFDATA_OCCURRENCE. We then later on do the more complex job of resolving that into our controlled vocab. This allows us to both store exactly what was in the source record and connect it to authority data. The downside is that you then have 2 places to choose from, although one would hope that the authority data wins. The other good thing about this is that it gives us a pivot point to correct errors when our name resolution jobs get the result wrong.

Certainly seems that shared name authorities are a core part of this though - along with mechanisms for propagating and sharing the contents of those authority files (which turns out to be trickier than the process of minting new identifiers)

Be good to get some concrete ideas on how we explore some of these issues in detail.

26 Jul '17

I completely understand the issue that Ian is raising about whether to “treat People as first class citizens” within the FOLIO model and, furthermore, how to represent and manage the semantic relationships. I agree that providing support for an abstract model that represents complex relationships between various entities is not in the current Codex scope. The focus for Codex has been around supporting Resources (instances and items). The concept of a Work is on the periphery of the discussion, but out of scope for v1. That Work object is meant to introduce the notion of relationships as they relate to resources.

The Codex, in contrast to the previously discussed Kabalog, is a higher level normalization layer that seeks to simplify and flatten data structures, including any complex semantic relationships. So I would imagine that there remains the opportunity to create a separate domain to define and manage those semantic relationships. That domain could then inform other domains such as Codex or Inventory.

26 Jul '17

@ianibbo & @Lynn_W
You’re right, many of us here have a very MARC-centric perspective (well, at least I do) and while it’s important that FOLIO be able to work with MARC we definitely need to think beyond that.

I’m trying to make my way through the new IFLA LRM (Library Reference Model–i.e., the successor to FRBR) and I really like the idea of having an even broader entity than either you or I have suggested, something like “Agent” or “Name” (i.e., some term that encompasses not just individual persons but corporate bodies, or really any entity capable of creating or contributing to a resource) and a “Role” that could be as generic as “contributor” or as specific as “director” or “actor” or “editor” or “illustrator.” So, if there are granular relationships coded in the source record, they have a place to be represented in the Codex, and if not, there is still a place for the data to live.

26 Jul '17

Hey Vince!

Ah you’re absolutely right to raise this - and it came up when Marc J and I discussed this. I think where Marc and I left this we managed to confuse ourselves a little about the exact boundary between the codex and more expressive domain model. It would be incredibly helpful for me to try and understand the service boundary. I know that might mean having a conversation at a more concrete level, but I think it would help.

We’ve got some code that takes marc records (A big batch) which MW mediated for us from Chicago. It would be really useful to understand how we see the flow of onboarding a recordset through some command interface and the FOLIO pathways.

I suddenly realise I’m exposing the fact I perhaps really don’t understand how people see these things fitting together - if anyone can suggest ideas I’d really appreciate it.

27 Jul '17

Hi @ianibbo,
@Kathryn has started a new topic re. collection of use cases to be discussed at today’s MM SIG meeting, e.g. see @fhemme’s post about Union Catalog records

27 Jul '17

I agree that it seems like we should just have the most generic label for people or organizations that contribute to an instance - basically any of the MARC personal name/corporate 1xx/7xx fields, instead of trying to categorize them. That doesn’t mean we’re wiping out the relator terms in the source data or trying to figure out a way to impose categories on names that don’t have relator terms in the source data. And we’re not trying to assign primary or secondary contributions, which is often just a factor of who is listed first on a resource and a legacy from card catalog days. Instead, in the Codex, we’re just saying these people or entities have a creator/contributor type relationship to this instance (as opposed to a subject relationship). A book about Shakespeare by Bill and Ann Smith would have name entries in the Codex record for Bill Smith and Ann Smith, but not for William Shakespeare.

27 Jul '17

I’d like to suggest that most Codex fields be repeatable, e.g., title, publisher, and format.

Rationale is primarily based on serials, which often have multiple titles (abbreviated forms or minor changes) and change publishers.

28 Jul '17

I suddenly realise I’m exposing the fact I perhaps really don’t understand how people see these things fitting together - if anyone can suggest ideas I’d really appreciate it.

Ian I think that’s because nobody exactly knows, we’re figuring it out as we go along. It feels to me that we’re about ready to get into looking at how the codex instance and item/holding unfolds from a UX perspective for print management, and I suspect that we will learn things and gain clarity as we do.

Ian, I’m not sure that I follow all of your concerns about flattening these structures, but that’s possibly because I don’t view the Codex as a finished product, but as a first step in what will inevitably be a series of iterations as we challenge it through UX and code. I don’t think so far that links between entities in the “Codex domain” have been explored that much, but they will be as we load bibliographic data. E.g. right now we tend to represent creators/contributors as strings IIRC, but the right way to model authorities in the system IMO is probably as links between separate entities. The same thing will hold if we use the codex internally to allow apps to view/work with data that lives natively as BIBFRAME entities out on the open web.

I tend to view our progression here as gradually claiming territory by building forward bases and deploying our apps and microservices as we go along. The space is simply too big for us to attack every single thing at once… we’d end up designing something terribly complex or overly general or just bad.

So how do we evaluate the Codex, or any model in FOLIO really, at this stage? I would suggest that the baseline probably has to draw on our limited scope for v1 at this stage… if there’s a workflow or feature in v1 that we can’t represent, that’s a hard error and should be fixed. If there’s a feature that sits right over the horizon, and which might come into sight more quickly if someone comes along and starts working on it, then it would be best if we already had SOME idea how it could be supported in the Codex (for example), even if it can’t today.

We’ve got some code that takes marc records (A big batch) which MW mediated for us from Chicago. It would be really useful to understand how we see the flow of onboarding a recordset through some command interface and the FOLIO pathways.

I’m not sure that flow is completely described at this point, so step #1 is to invent it. :slight_smile:

28 Jul '17

Pulling together all the recent threads here and elsewhere, the intuition I’m developing is that I should stop worrying about the Codex quite so much. It sounds to me like a more natural home for my real concerns is the Inventory system. I don’t know that we really bottomed out some of those concerns (essentially authority data and how it works) before we seemed to switch to the Codex being the hot topic.

I think I’ve probably not been totally transparent on the record onboarding question - MARC records seemed to be the logical starting point, but for me the real value of the “Dataload” procedure is that it gives us an endpoint where I can say “I’ve got a resource description for a research dataset created by the SDL research data microservice, where can I load it?”. Having a record onboarding service endpoint gives us a structured place to do experiments which might allow us to validate whether our internal data model is capable of storing the very diverse range of item types that we will encounter in 2017 and onwards (and answering the questions that modern apps will need answered). Since we want to load MARC, it’s an ideal candidate, but I’d only see that as the first in a range of per-type schemas.

To put it another way - record onboarding seems to be a vital tool to validate approaches - it’s not that I see it as end-user functionality per se, but it seems to be the most critical-path item for allowing us to test our assumptions on the data modelling front.

Does that make any more sense than my previous ramblings? It feels like it’s a more clear statement of the concern :slight_smile:

28 Jul '17

Quick clarification, are you looking to ‘load’ a resource description of the dataset, or the dataset itself?

If you are looking to load a research dataset, then my view is that while that is a totally valid FOLIO pursuit, it is out of scope of the Codex. I hope that someone will build that infrastructure on the FOLIO platform, and a suite of apps for it. @nassar has been pursuing that idea as a side project and has explored at least one possible direction one might take, but not the only way. FOLIO is a platform, so you’re never limited to just one way of doing things.

If you’re looking to load bibliographic descriptions of data set(s) into the Codex you have options.

The Codex grew out of a desire to have a common data model and interface that would span a lot of popular and emergent data models that already exist in the wild for the purpose of bibliographic description and electronic and print holdings management. Nothing more. Its scope and capabilities are, to my mind, influenced by a few important factors:

  1. The functional needs driving the initial set of apps, which focus largely on core library administration functions. Specifically resource/inventory management.

  2. The desire to try to harmonize or bring together print and electronic resource management into a single data model to facilitate a more integrated solution across the two main content delivery mechanisms

  3. A desire to support multiple, detailed schema for description (including BIBFRAME) while allowing a set of relatively naive apps to ignore all but the data elements they need to support common workflows

  4. A desire to support multiple storage mechanisms, including a conventional, local-to-the-system data well or multiple remote knowledge bases, union catalogs, authorities, or LOD data sources. These storage mechanisms can be facilitated simply by writing new implementations of the Codex interface.

The Codex leans on BIBFRAME in the sense that it recognizes a similar/identical set of entities. The “Instance” and “Item” are the ones we focus on initially at this stage because you need them to drive present-day functions. The core Codex Instance record is our most basic descriptive metadata set that the v1 apps can rely on (when we discussed this in Ireland, I suggested we call it the “Real Dublin Core”). If richer metadata is associated with a given Codex record, e.g. as a linked MARC or BIBFRAME or MODS record, then v1 apps can be extended to take advantage of that richer metadata if desired, or someone might write a new app specifically to take advantage of some specific format.

Back to your question

Assuming that you want to ‘load’ descriptive metadata, not actual research datasets: First, I’d say, do you really want to load it? Or, do you want to write an implementation of the Codex Instance interface that will expose the metadata to FOLIO apps while leaving it in place in SDL (or whatever). If you really want to load it, you have options…

Assume we have a “local-to-FOLIO” storage module for metadata which allows you to load up data in any schema you like provided there exists a crosswalk to the Codex (so that v1 apps can see the data without having to know your schema).

Is your data set ‘supported’? I.e. does a crosswalk already exist? If not, you either create the crosswalk, then load the data, or, you map your data into one of the already existing schemas like MARC, BIBFRAME, whatever. Then you load it. As a special case, obviously we can support simply loading the data in the Codex schema, which would imply a no-op crosswalk.
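The crosswalk-and-load flow described above can be sketched concretely. A hedged illustration in Python: the registry, the field names, and the Codex output shape are all placeholders of mine, not the real FOLIO interfaces:

```python
# Sketch of the crosswalk idea: per-schema functions that project a native
# record into the flat Codex instance view, plus the no-op special case.
def marc_to_codex(marc: dict) -> dict:
    """Naive MARC-ish dict -> Codex instance crosswalk (field names illustrative)."""
    return {
        "title": marc.get("245", ""),
        "publisher": marc.get("264b") or marc.get("260b", ""),  # either 26x $b
        "identifiers": marc.get("020", []),
    }

def codex_passthrough(record: dict) -> dict:
    """Data already in the Codex schema implies a no-op crosswalk."""
    return record

CROSSWALKS = {"marc": marc_to_codex, "codex": codex_passthrough}

def load(schema: str, record: dict) -> dict:
    """Loading = look up the crosswalk for the declared schema, then apply it."""
    if schema not in CROSSWALKS:
        raise ValueError(f"no crosswalk registered for schema {schema!r}")
    return CROSSWALKS[schema](record)
```

An unsupported schema fails up front, which mirrors the choice above: either register a new crosswalk, or map your data into an already-supported schema first.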

What if your data set can’t be captured as a set of ‘records’ because it’s a free-floating set of triples or an RDBMS? Then, you will have to create a ‘view’ of your data set that represents entities or units of descriptive metadata conceptually similar to records, and load them. If you can’t do THAT, then your data set probably won’t make sense to any of the apps written to deal with a Codex view of the world.

Again, the Codex is a ‘view’ of the universe of bibliographic and holding metadata designed to let us do a certain set of things… by all means we should push the boundaries of what that is and what we can do with it, that’s the whole point… I keep expecting someone to make a pitch to use JSON-LD and shared schemas for the Codex and I’d be really intrigued by that conversation. But if you want to build apps in FOLIO that lean on all the expressive power of the semantic web and the breakdown of normal notions of entities/records, etc., then I don’t think the Codex is or should be the tool for that… trying to make it that in the short term would only trip us up.

At the same time, I really do hope that someone will integrate a triplestore into FOLIO and start to build apps that leverage it to do library and non-library things. It’s just that for the part of FOLIO which is some people racing to build a suite of apps you can use to run a library, that’s not the approach that has been chosen.

Please note, I’m not trying to shut down a conversation about Linked Data and where FOLIO should be moving. We need that conversation. But that’s a really broad conversation, and, there are actually some parts of that conversation that probably SHOULD inform even the present roadmap, while others can help us keep an eye on the long game even while we focus on something less ambitious right this moment.