
Bulk user import feature


#1

FOLIO will have a bulk user import feature to support the migration from existing ILS systems, as well as to provide some mechanism to automatically import data from external systems like a Student Information System.

We have created a draft document to describe the workflow of an import and the contents of the user import file format. This document can be found here: https://docs.google.com/document/d/14S4YfB7g6L5lrdnebFJU1ioNqaRdcSTZZLUMrJlbpQY/edit?usp=sharing


On Slack I recently asked some questions about the automated import. I’m moving these questions here for discussion:

  • I’ve been thinking about the configuration file we discussed, which would configure how FOLIO behaves with the automated import. I don’t know what the distribution and installation method of FOLIO will be when it is actually released, but with the current Vagrant VM images it seems difficult to me to edit configuration files. I think it would be better to create a configuration page for the user import in the Settings app, so that you can set up the import there. What do you think?

  • I was also thinking about importing through an API call, which you mentioned at the end. My proposed method works by polling the file system, so it’s basically a pull method. But an API-based push import is not a bad idea either. In my experience push can be more real-time and can avoid the creation of temporary files. With a good API design the caller can also be notified instantly about issues, and it’s easier to support some kind of transaction controlled by the client application. However, the SIS conversion script/tool has to be a little more complex, because it must handle the additional work of calling the API. Which of the two import methods would support your needs better?


#2

I think it would be much more in alignment with the overall FOLIO approach to start with a well-defined API and add bulk processing externally. The institutions that are likely to be early adopters, I suspect, are also likely to be the sort of place that’s looking to move away from traditional batch processing towards something more like dynamic updates from an identity management system. Starting with a good API would support both in the end without making one more favored than the other.


#3

@cam2 I agree with your suggestion here. It is significant work for libraries to be in the middle between ILS needs for bulk import of students (and don’t forget faculty and staff are not in the SIS) and crazy SIS systems, data warehouses, HR systems, and the like. An API approach puts no additional burdens on a library that has no institutional support for coordinated IdM, yet makes it possible for a library that has that IdM support to use a better approach.


#4

Bulk loading in a time-efficient fashion will be necessary for transitioning into FOLIO, and not just for users. There are different ways this could be done, but whatever method is settled on will need to be reasonably efficient.

On the users themselves, while we would like libraries to not have to resolve differences between SIS, HR, and other data feeds, it seems that many do still have to resolve identities between disparate systems. At UChicago we’re lucky. We used to have to do that resolution ourselves, but there is now a central identity management infrastructure on campus which takes care of this. But in this I believe we are in the minority.


#5

Do you get updates/deltas from identity management as data changes, or is it a full feed of all users each time?


#6

How about dragging and dropping an entire table from MS Excel into FOLIO? FOLIO would create the table format and then ask for confirmation of each column/table according to the FOLIO database architecture…


#7

The problem with spreadsheet-like import formats (Excel, CSV) is that some patron data structures can be repeated arbitrarily. An example is the address block. We can force them into columns, but it makes the format unfriendly. In my example I was using these column names for the fields of the first address:

addresses[1].addressType,addresses[1].primaryAddress,addresses[1].country,addresses[1].addressLine1,addresses[1].city,addresses[1].region,addresses[1].postalCode

It’s error-prone and looks weird. JSON is a more natural format to describe these.

Of course there should be a way to import column-based formats, but this can be solved by building a tool which converts the spreadsheet format to JSON. The same goes for other formats, like an LDAP file. JSON will be the primary format we are supporting.
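
To make the spreadsheet-to-JSON conversion concrete, here is a minimal sketch in Python. It is only an illustration, not an existing FOLIO tool: the function name `unflatten` and the `username` column are assumptions, and only the `addresses[n].field` column style comes from the example above.

```python
import csv
import io
import json
import re

# One path segment of a column name, e.g. "addresses[1]" or "username".
SEGMENT = re.compile(r"^(?P<name>[^\[\]]+)(?:\[(?P<index>\d+)\])?$")


def unflatten(row):
    """Turn {"addresses[1].city": "X"} into {"addresses": [{"city": "X"}]}.

    Hypothetical helper: column names use 1-based indexes, as in the
    example columns above.
    """
    result = {}
    for column, value in row.items():
        if value is None or value == "":
            continue  # empty cells are simply left out of the JSON
        node = result
        segments = column.split(".")
        for pos, segment in enumerate(segments):
            match = SEGMENT.match(segment)
            if match is None:
                raise ValueError(f"unsupported column name: {column!r}")
            name, index = match.group("name"), match.group("index")
            last = pos == len(segments) - 1
            if index is None:
                if last:
                    node[name] = value
                else:
                    node = node.setdefault(name, {})
            else:
                items = node.setdefault(name, [])
                slot = int(index) - 1  # convert the 1-based column index
                while len(items) <= slot:
                    items.append({})
                if last:
                    items[slot] = value
                else:
                    node = items[slot]
    return result


if __name__ == "__main__":
    # Assumed sample data; "username" is an illustrative column name.
    sample = (
        "username,addresses[1].addressType,addresses[1].city,addresses[2].city\n"
        "jdoe,Home,Chicago,Ithaca\n"
    )
    for row in csv.DictReader(io.StringIO(sample)):
        print(json.dumps(unflatten(row), indent=2))
```

Each spreadsheet row becomes one JSON user object, and the repeated address columns collapse into a single `addresses` array.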


#8

Good question. Typically it is a full load. Ideally we could do deltas. The issue is whether the upstream system can, in practice, produce deltas that represent comprehensive notification. I don’t question that it can be done; the question is whether institutional resources such as IdM systems are set up for this. What are the cases at Cornell & Chicago?


#9

We use full feed. I don’t know whether there is an option for deltas.

What we have found in the past with deltas for various data is that there is eventual drift, sometimes more quickly than expected, so we should anticipate a need for robust bulk loading in any case.


#10

@tod Fair enough. We will certainly need robust bulk loads - even in an update situation, it is wise to do clean loads on some periodic basis, like weekly, perhaps.


#11

I see mention of a “configuration file” above but I haven’t found any other info about it here or on Slack yet. Is there a vision for what might be contained in a config file? Just wondering if it will contain anything that somehow needs to take into account loads coming from multiple sources, e.g. here at FLO we have 10+ sources we might receive a patron load file from.


#12

We don’t have configuration files anymore, because they are only necessary if we go with the pull approach, and the SIG voted for a push method through the API.

In your case, I can imagine either one or several scripts (possibly one per source system) that call the FOLIO user import API in order to update the patron database.


#13

I may have missed the vote, but I know there has been considerable discussion of doing pull registrations from campus IdM systems. Has FOLIO abandoned this thinking? Bulk loading is probably necessary to accommodate implementations where deep integration is not possible. But tighter integration into the campus infrastructure seems to best position FOLIO for use as a services platform.

Did we really decide not to do this?


#14

We haven’t abandoned these requirements; only the approach has changed.

  • Instead of trying to support all possible pull mechanisms and source system formats, we leave the conversion to an external tool or script. Based on the reactions, institutions already have these; updating them to support FOLIO instead of the existing ILS is not an issue. This way the tool can do any kind of data transformation (like applying overlays) and FOLIO doesn’t have to deal with these (which is a good thing, to stay as institution-independent as possible).

  • Tools can be run periodically, so integration is there. The tool/script works as the glue layer. Scheduling can be fixed, for example by registering a cron trigger in the OS, or, if the source IdM supports event triggers/hooks, it can be closer to real time, reacting to changes almost instantly.

  • Bulk loading is possible. Currently the script has to call the API separately for each user. This enables fine-grained error handling. If the separate inserts become a performance bottleneck, there is a very high chance that an “insert multiple users in one call” API endpoint will be implemented as an alternative.


#15

So, a fairly standard implementation here like what we already do for our current generation tools.


#16

Tania: Out of curiosity, are you thinking of institution- or campus-specific configuration files like the LDAP/Active Directory files in Ex Libris’ authentication ecosystem to map attributes correctly (per that institution’s user administration rules)? Or the PLIF (Patron Info Loader Format)?

Istvan_Nagy: Thanks for the confirmation about the push method through the API. It still sounds like there will be some level of manual/explicit pull (as Mike Winkler noted, much like in current generation tools). I would like to know more about the status of any developments regarding this: “Currently the script has to call the API independently for each user… If the separate insert becomes a performance bottleneck, there is a very high chance that an ‘insert multiple users in one call’ API endpoint will be implemented as an alternative.”

Thanks, Marc


#17

@MarcK - I wasn’t sure what might be contained in a config file, so I was trying to discover if it would be attempting to apply any default data or default processes that might need to differ from source file to source file.

In my current and past lives we’ve had loaders that don’t impose much of anything other than a data format, so we’ve done tons of pre-processing on files of patrons to give them appropriate groups, roles, IDs, match points, etc. We’d certainly be willing to continue doing that work as long as any loader that FOLIO has won’t impose rules or restrictions on us that would make loading patrons exactly the way we want them difficult or impossible. E.g. if we need a match point to be different based on the source of the load file but FOLIO dictates a single match point. Or if we need to stack match points.

Possibly a moot question since it looks like a config file is not the direction we’re headed, but if config files were to exist and were to affect the data/roles/groups that resulted from a load, we might need to account for more than one set of rules to be followed and define when to follow which rule set. If that makes any sense at all!


#18

We have assembled an example script in NodeJS for importing users into FOLIO. It works by calling the FOLIO API. It has one entry point and does the following things:

  • Logs in to FOLIO with a username/password set in a configuration file
  • Uses the token from the login result for further actions (some actions may require special permissions)
  • Lists the currently available address types in the system (they will be mapped in the user object)
  • Lists the currently available patron groups in the system (they will also be mapped in the user object)
  • Reads user data from a JSON file (file path is configurable)
  • Queries users for existence in batches of 10 (the batch size is configurable)
  • Decides for each user whether it should be an update of an existing user or the creation of a new user
  • Updates the user data with the id and, if present, replaces the address type, patron group, and preferred contact type references with the corresponding ids from FOLIO
  • Sends update requests for existing users
  • Creates new users, and creates credential information for these users
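
The batching and create-or-update decision in the steps above can be sketched roughly as follows. This is a Python illustration of the idea, not code from the NodeJS script: `plan_import`, `batches`, the `lookup` callable and the `match_field` default of `username` are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterator, List, Tuple

User = Dict[str, object]


def batches(items: List[User], size: int = 10) -> Iterator[List[User]]:
    """Yield consecutive slices of at most `size` users."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_import(
    users: List[User],
    lookup: Callable[[List[object]], Dict[object, str]],
    match_field: str = "username",  # assumed match point for this sketch
    batch_size: int = 10,
) -> Tuple[List[User], List[User]]:
    """Split users into (to_create, to_update), one lookup per batch.

    `lookup` is a hypothetical callable standing in for the existence
    query: given the match values of one batch, it returns a mapping
    {match value: existing id} for the users that already exist.
    """
    to_create: List[User] = []
    to_update: List[User] = []
    for batch in batches(users, batch_size):
        existing = lookup([user[match_field] for user in batch])
        for user in batch:
            existing_id = existing.get(user[match_field])
            if existing_id is None:
                to_create.append(user)
            else:
                # Existing users carry the id so they can be sent as updates.
                to_update.append({**user, "id": existing_id})
    return to_create, to_update
```

Keeping the lookup behind a plain callable means the same planning logic can be exercised with a fake lookup in tests and with a real API client in production.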

Permission insert/update is not yet implemented in the script but it is also possible. Please contact me if you have any questions about this script.

The repository with the source code can be found here: https://github.com/qultoltd/folio-user-import


#19

We have updated the user import script with some of the requests discussed at last week’s SIG meeting.

  • The externalSystemId is used as a matching point for user creation/update.
  • An empty permission list is now assigned to the imported users. This is necessary because otherwise new permissions cannot be added to a user from the FOLIO UI.
  • The code was updated to manage multiple requests. It was successfully tested with the import of 1000 users. This still has to be tested with larger amounts of data.
  • Error management was also updated. When a user’s import fails because of an error in the data, it will not cause the whole batch to fail. But if a user cannot be created/updated because of a server error, then the whole batch will fail and the script won’t import more users.
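
The error-handling rule in the last point can be illustrated with a short Python sketch. It is not the script’s actual code: the exception classes `DataError` and `ServerError`, the function `import_batch` and the `send` callable are hypothetical names chosen for the example.

```python
class DataError(Exception):
    """Hypothetical: the record itself was rejected (bad or missing data)."""


class ServerError(Exception):
    """Hypothetical: the server failed while handling the request."""


def import_batch(users, send):
    """Send users one by one; skip bad records, stop on a server failure.

    `send` is an assumed callable that creates or updates a single user
    and raises DataError or ServerError when that fails.

    Returns (imported, rejected), where rejected holds (user, message)
    pairs. A ServerError is not caught here, so it propagates to the
    caller and no further users are sent.
    """
    imported, rejected = [], []
    for user in users:
        try:
            send(user)
        except DataError as err:
            # A problem with this record only: note it and move on.
            rejected.append((user, str(err)))
            continue
        imported.append(user)
    return imported, rejected
```

Letting the server error propagate keeps the decision about retrying or aborting the whole run with the caller, which matches the "whole batch fails" behaviour described above.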

Currently all user data in FOLIO (except for permissions) is overwritten with the data coming from the import script.