Discuss.FOLIO.org is no longer used. This is a static snapshot of the website as of February 14, 2023.

Version of Unicode supported by FOLIO?

Charles_Riley
4 Sep '20

Up to what version of the Unicode Standard does FOLIO currently support? Is it supporting scripts such as Adlam, Yezidi, and Hanifi Rohingya?

Best,
Charles Riley

massoud
5 Sep '20

Hello Charles,

I am not familiar with those variations of Unicode. For Arabic, the standard Arabic Unicode code page has been sufficient so far.

Thanks,
Massoud.

Charles_Riley
8 Sep '20

Thanks Massoud,

Unicode is a versioned standard, and updates are made about every 1 to 2 years toward more complete representation of the world’s scripts and languages. It is up to Unicode 13.0 now. Newer operating systems are handling the more recently added scripts, including Adlam, but individual software applications sometimes lag behind in their implementation.

Best,
Charles
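
Because support depends on the Unicode version each component ships with, one quick check is to ask a runtime which version its character database implements and whether it knows a given code point. The following is a minimal sketch in Python, not something FOLIO itself runs; the helper name `describe_support` is hypothetical, and the three sample code points are simply the first code point of the Adlam, Hanifi Rohingya, and Yezidi blocks.

```python
import sys
import unicodedata


def describe_support(text):
    """Map each character to its name in this runtime's Unicode database.

    A value of None means the runtime's database has no name for that
    code point, which usually means it predates the character.
    """
    return {ch: unicodedata.name(ch, None) for ch in text}


if __name__ == "__main__":
    print("Python", sys.version_info[:2],
          "ships Unicode", unicodedata.unidata_version)
    # One sample code point each: Adlam, Hanifi Rohingya, Yezidi.
    samples = "\U0001E900\U00010D00\U00010E80"
    for ch, name in describe_support(samples).items():
        status = name if name else "not in this runtime's database"
        print(f"U+{ord(ch):04X}: {status}")
```

The same idea applies to any layer in a stack: each one carries its own copy of the Unicode data, so the answer can differ between layers.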

peter
9 Sep '20

This is an interesting question with a couple of nuances, I expect. On the one hand, the user interface is all in a web browser and the interactions with the server are JSON documents passed over HTTP. I would expect all of that to be up to the latest version of Unicode. On the back end, almost all of the modules are using Java (Java 8 primarily but moving to Java 11, I’m told). The data is stored in PostgreSQL version 8 in the hosted reference implementation, although there are some sites that are using version 11 and version 12. To the best of my knowledge, all of those components handle whatever the latest version of Unicode is.

Where I wonder if there might be differences is in things like sorting order. For that we would probably have to look at the various routines that are performing the sorting—most likely using a common library in RAML Module Builder. As far as display goes, though, I would expect any of the scripts to display fine. Feel free to try out one of the hosted reference environments and post back what you find.
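
The JSON-over-HTTP point can be checked in isolation. The sketch below, in Python, serializes a string containing two characters from the Adlam block, encodes it as UTF-8 as it would travel in an HTTP body, and parses it back. It says nothing about what any FOLIO module does with the data afterwards; the function name `json_round_trip` and the `title` field are hypothetical.

```python
import json

# Two code points from the Adlam block (outside the Basic Multilingual Plane).
SAMPLE = "\U0001E92B\U0001E92C"


def json_round_trip(text, ensure_ascii=True):
    """Serialize text to JSON bytes and parse it back.

    With ensure_ascii=True the serializer writes each supplementary
    character as a pair of \\uXXXX surrogate escapes; with False it
    writes the character directly as UTF-8.
    """
    payload = json.dumps({"title": text}, ensure_ascii=ensure_ascii)
    wire_bytes = payload.encode("utf-8")
    restored = json.loads(wire_bytes.decode("utf-8"))["title"]
    return restored, wire_bytes


if __name__ == "__main__":
    for flag in (True, False):
        restored, wire = json_round_trip(SAMPLE, ensure_ascii=flag)
        print(flag, wire, restored == SAMPLE)
```

Both forms decode to the same string, so a correct JSON parser on either side should not be the place where characters are lost.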

Charles_Riley
9 Sep '20

Thanks! I think even on Java 8, the ICU4J software library can be helpful. I’d like to check out the reference environments, but am not sure how to obtain a login.

Charles

peter
10 Sep '20

Hey Charles. Take a look at the links under the “Demo Sites” heading on wiki.folio.org. I would recommend the “Current Release” one to ensure you have a stable environment.

Charles_Riley
24 Sep '20

Hi Peter. I was able to test a record containing valid Adlam characters from Unicode 9.0. The data included “𞤫𞤬𞤼𞤫𞤪𞤫 𞤨𞤢𞤴𞤳𞤮𞤴”, romanized as “Deftere paykoy”. It was rendered as “�������������� ������������” on import into the FOLIO Goldenrod release. I would have liked to see at least empty boxes, indicating lossless conversion, but what was output was lossy.

Best,
Charles
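
The lossy versus lossless distinction can be tested mechanically: if the stored text contains U+FFFD REPLACEMENT CHARACTER where the source did not, the original code points are gone, whereas empty boxes usually mean the data is intact and only a font is missing. The Python sketch below compares a source string with what came back. The function `simulate_lossy_step` is purely illustrative (it decodes UTF-8 bytes as ASCII) and is an assumption for demonstration, not a diagnosis of what happened during the Goldenrod import.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def simulate_lossy_step(text):
    """Illustrative only: decode UTF-8 bytes as ASCII, so every
    non-ASCII byte is replaced by U+FFFD and the original is lost."""
    return text.encode("utf-8").decode("ascii", errors="replace")


def check_stored(original, stored):
    """Report whether the stored text matches the source and how many
    replacement characters it contains."""
    return {
        "identical": stored == original,
        "replacement_count": stored.count(REPLACEMENT),
    }


if __name__ == "__main__":
    source = "\U0001E92B\U0001E92C a"  # two Adlam code points, space, "a"
    print(check_stored(source, source))
    print(check_stored(source, simulate_lossy_step(source)))
```

Running a check like this against the record as retrieved through the API, rather than as displayed, separates storage problems from rendering problems.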

peter
24 Sep '20

It would be good to have an issue in the project tracker (issues.folio.org) for this, Charles. I’m willing to add it on your behalf if you want me to. This web forum tool is somewhat limiting in the file types it accepts, so you can send the MARC file to me at peter@indexdata.com. Also, describe the sequence of events you used to import the record so we can reproduce it exactly.

Thanks for testing this out.

Charles_Riley
24 Sep '20

A sample title in the Yezidi script, added in Unicode 13.0, that can be used to test for support is “𐺋𐺣𐺗𐺀𐺩𐺋 𐺀𐺩𐺏𐺀𐺨𐺀𐺢”. Hanifi Rohingya was added in the 11.0 release, and a test string for its characters is “𐴌𐴟𐴗𐴝𐴙𐴣𐴒 𐴧𐴙𐴝”.

Best,
Charles
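
Before importing test strings like these, it can help to confirm that the file really contains the intended code points, since copy and paste through some tools can already damage them. A minimal Python sketch, assuming only the published block ranges for the three scripts discussed in this thread; the names `BLOCKS`, `block_of`, and `summarize` are hypothetical.

```python
from collections import Counter

# Unicode block ranges (inclusive) for the scripts discussed here.
BLOCKS = {
    "Adlam": (0x1E900, 0x1E95F),
    "Hanifi Rohingya": (0x10D00, 0x10D3F),
    "Yezidi": (0x10E80, 0x10EBF),
}


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= cp <= high:
            return name
    return None


def summarize(text):
    """Count non-whitespace characters per block; unlisted ones are 'other'."""
    counts = Counter(block_of(ch) or "other" for ch in text if not ch.isspace())
    return dict(counts)


if __name__ == "__main__":
    print(summarize("\U00010E80\U00010E81 \U00010D00"))
```

A test string that reports anything under "other" was probably altered before it ever reached the import step.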

peter
24 Sep '20

Thanks for the details, Charles. I’ve created UXPROD-2685 with the sample file to track this need.

Charles_Riley
24 Sep '20

Thanks a lot!

Charles

peter
2 Nov '20

Hi Charles! Could you look at the comments on https://issues.folio.org/browse/MODDATAIMP-332 and see if the issue is addressed?

Charles_Riley
31 Oct '22

Hi Peter, I believe it has been addressed. Apologies for the long delay in getting back to you here; I had left a note to the developers under the issue indicating that I thought it had been resolved. Thanks for your time!

Charles

peter
31 Oct '22

Great! Thanks for the update!

Charles_Riley
31 Oct '22

At least, resolved as far as ADLaM in UXPROD-2685. I would still like to test with Yezidi and Hanifi Rohingya.

Charles_Riley
2 Nov '22

In the meantime, here is a prototype Mongolian script record from LC:
https://catalog.loc.gov/vwebv/staffView?searchId=21764&recPointer=2&recCount=25&bibId=21042143

I would be interested to know the roadmap for supporting scripts beyond MARC-8 in authority records.

Charles_Riley
3 Nov '22

For the sorting question, the Unicode Collation Algorithm (UTS #10) should be helpful.

Charles

marcjohnson
4 Nov '22

FOLIO does not necessarily use Unicode for sorting.

This will likely vary (as Peter suggested above) by the tooling used by individual modules.

Some parts of the system use PostgreSQL to perform sorting, the behaviour of which is dependent upon the collation configuration of the database. This is defined by the system operator and varies by implementation.

Other parts use a search engine, e.g. Elasticsearch or OpenSearch, and these may use other sorting mechanisms.

And it’s quite likely that some sorting is done using code libraries / language features that will vary too.
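
To illustrate why results can differ between components, the Python sketch below sorts the same three titles twice: once by raw code point, which is what a plain string comparison does, and once with a rough key that strips combining marks and folds case. The rough key is an assumption for demonstration only. It is not the Unicode Collation Algorithm and is not what any FOLIO module, PostgreSQL collation, or search engine is known to use; the name `rough_sort_key` is hypothetical.

```python
import unicodedata


def rough_sort_key(text):
    """Crude, language-neutral key: decompose, drop combining marks,
    then case-fold. A stand-in for real collation, not a substitute."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


TITLES = ["Zebra", "apple", "\u00c9mile"]  # "\u00c9" is precomposed É

if __name__ == "__main__":
    # Code point order puts uppercase before lowercase and É after both.
    print(sorted(TITLES))
    # The rough key interleaves them the way most readers would expect.
    print(sorted(TITLES, key=rough_sort_key))
```

Two components that each sort "correctly" by their own rules can therefore return the same records in different orders.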