title: ai4lam Metadata/Discovery WG May Monthly Meeting
May 7, 2024
8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris
Attending
Name, institution
Jeremy Nelson, Stanford
Abigail Porter, Library of Congress
Caroline Saccucci, Library of Congress
Julia Kim, Library of Congress
Erik Radio, Colorado
Sara Amato, Eastern Academic Scholars’ Trust
Ian Bogus, ReCAP
Helpful Links
Project Documents and Data
Agenda
Announcements
Presentation “Exploring Computational Description: An update on Library of Congress experiments to automatically create MARC metadata from ebooks” by Abigail Porter and Caroline Saccucci
Update on their Fantastic Futures (FF) 2023 presentation
Cataloging and LC Labs are collaborating on this experiment
LC Labs AI Planning Framework
Understand, Experiment, Implement; Governance and Policy
Come in with a quality baseline and then implement; have robust auditing and shared quality standards
Tools
Articulating Principles
Use case Risks & Benefits
Domain Profiles
Data Assessment
Acquisitions
Goal of Exploring Machine Learning
First task order in the Digital Innovation IDIQ, which is scoped for experiments in AI and ML
What are the examples, benefits, risks, costs, and quality benchmarks?
What technologies and workflow models are most promising to support metadata creation and assist with cataloging workflows?
What similar activities are being employed by other organizations?
This is an experiment, not building toward production
E-Books with Cataloging Records Used to Train Models
Ground truth for testing
CIP: 13,802 items
Open Access: 5,825 items
E-Deposit ebooks: 403 items
Legal reports: 3,750 items
Test data
Existing catalog records for the ebooks
Test models against test data
Generate performance reports
Target data
Uncataloged ebooks
Run the best-performing models
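A minimal sketch of what this test-then-report loop could look like in code; the record structure, field tags, and model interface are illustrative assumptions, not LC's actual tooling:

```python
# Hypothetical evaluation loop: run a model over the test ebooks and
# compare each prediction to the existing catalog record (ground truth).
# Record structure and field tags are illustrative only.

def evaluate(model, test_set, fields=("245", "700", "655")):
    counts = {f: {"tp": 0, "fp": 0, "fn": 0} for f in fields}
    for ebook_text, gold_record in test_set:
        predicted = model.predict(ebook_text)  # assumed model interface
        for f in fields:
            pred = set(predicted.get(f, []))
            gold = set(gold_record.get(f, []))
            counts[f]["tp"] += len(pred & gold)
            counts[f]["fp"] += len(pred - gold)
            counts[f]["fn"] += len(gold - pred)
    return counts  # feed these counts into a per-field performance report
```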
What are we testing?
Token classification: extracting specific bibliographic metadata from the text, such as Title or Author name
Text classification: characterizing the text as a whole, for example into subject headings or genres
Models: BERT, spaCy, GPTs with variations (NLP, NER, LLM, transformer and non-transformer models). Vendor-picked models; the 2022 selection predated ChatGPT
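The vendor's models aren't public, but token classification of this kind is commonly done with a fine-tuned transformer; a hedged sketch using the Hugging Face pipeline API, where the checkpoint name and its labels are hypothetical:

```python
from transformers import pipeline

# Sketch of token classification for bibliographic metadata extraction.
# "my-org/marc-field-ner" is a hypothetical fine-tuned checkpoint that
# tags TITLE, AUTHOR, ISBN, etc.; it is not one of the vendor's models.
ner = pipeline("token-classification",
               model="my-org/marc-field-ner",
               aggregation_strategy="simple")

front_matter = "The Example Title / by Jane Smith. ISBN 978-0-00-000000-0"
for entity in ner(front_matter):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```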
Results: Token Classification for All Fields
F1 score for each of the models, ranked from highest (best) to lowest (worst)
80% for some fields
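For reference, F1 is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). For example, a field extracted with precision 0.90 and recall 0.72 scores F1 = 2(0.90)(0.72)/(0.90 + 0.72) = 0.80.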
Results: Token Classification for One Field
Fields:
700 (added entry, personal name)
655 (genre/form term)
264 (publication statement)
245 (title statement)
Expectation: ~80%; quality standard: ~95%
Exceeded ~80% for some fields (ISBN, Author, Title; the models did well identifying these fields)
Date was not easily identified; this may be the vendor's setting for the date parameter rather than the model itself
LCCN: 100%; the list of records almost always had the LCCN, and the copyright page in an e-book would include the LCCN
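Because the LCCN appears in a predictable pattern on the copyright page, even a simple regular expression can recover it; a hedged sketch covering common post-2001 LCCN forms only, not the vendor's actual method:

```python
import re

# Illustrative only: pull a post-2001 LCCN (4-digit year + 6-digit
# serial) off an ebook's copyright page.
LCCN_RE = re.compile(
    r"(?:LCCN|Library of Congress Control Number)[:\s]+(\d{10})",
    re.IGNORECASE,
)

page = "Library of Congress Control Number: 2022123456"
match = LCCN_RE.search(page)
if match:
    print(match.group(1))  # -> 2022123456
```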
Early Metrics Analysis: Matches and Non-Matches
Results after applying Annif (a suggest-and-match sketch follows this list)
Green shaded areas are exact matches to the MARC XML (a 1-to-1 match); white rows are ML suggestions that were not exact matches
The model didn't assign "fiction"; nothing in the text indicated that the work is fiction
Ability to get a URL for a heading with freeform subdivisions; the headings with "fiction" that are established have a URL, but the freeform ones may not
The combination of positive and negative hits raises questions about relevancy and accuracy: how well does the model do?
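A sketch of the Annif step and the exact-match check, using Annif's documented REST API; the host, project id, and the cataloged URI below are assumptions for illustration:

```python
import requests

# Ask a running Annif instance for subject suggestions.
# "lcsh-ebooks" is a hypothetical project id.
resp = requests.post(
    "http://localhost:5000/v1/projects/lcsh-ebooks/suggest",
    data={"text": open("ebook.txt").read(), "limit": 10},
)
suggestions = resp.json()["results"]

# Exact-match check against subjects already in the MARC XML record
# (the "green shaded" rows): compare suggested URIs to cataloged URIs.
cataloged_uris = {"http://id.loc.gov/authorities/subjects/sh85076841"}
for s in suggestions:
    flag = "match" if s["uri"] in cataloged_uris else "no match"
    print(flag, s["label"], round(s["score"], 3))
```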
Assisted Cataloging HITL (Human-in-the-Loop) Prototype
The vendor met with a group of catalogers for a workshop conversation about how this work could benefit catalogers
Not just the name of the author, but the name of the author as given in the authority record
The authorized form of the name or concept
Two low-fidelity prototypes
Number of tabs for the cataloger to go through
Abstracts and summaries
The summary is extracted directly from the text of the book; the machine picks the main sentences by scanning through the text (a sketch of this extractive approach follows this section)
The cataloger selects among model suggestions: an opportunity for the cataloger to confirm what the machine is suggesting, and to see how well the ML workflow suggests Subjects and Names from the authority records
In both cases, narrower and broader terms from LCSH, Wikidata, and other linked-data resources
How well did catalogers receive the suggestions, what is and is not beneficial, and how to provide feedback to the model
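"The machine picks main sentences" describes extractive summarization; a minimal frequency-based sketch of the idea, not the prototype's actual implementation:

```python
import re
from collections import Counter

def extract_summary(text: str, n_sentences: int = 3) -> list[str]:
    """Score sentences by average word frequency; keep the top few."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Return the chosen sentences in their original order.
    return [s for s in sentences if s in top]
```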
Assessing Quality
How to assess quality for these tools in different ways
F1 score: highest-performing scores by field
Human-in-the-loop prototypes to increase the quality of the records
Contractor qualitative scoring of the models
Reliability
Compute cost
Training data
Activity
Documentation
Developer
Compliance to security and privacy considerations
Overall quality of the service, program evaluation factors
Likelihood of maintaining quality over time
Reasonable cost
Benefits to staff/catalogers
Benefits to users
Benefits to organization
Fairness and equity risks
Risks to users
Risks to organization
Security risks
Privacy risks
Authenticity risks
Reputation risks
Compliance risks
Want to collaborate with other organizations
Challenges
Unbalanced data: a long tail of subject terms
Need to create well-balanced training data to train and test models (one balancing approach is sketched after this list)
Exploring correcting over-representation of English-language material and other biases
Short-term timeline: several NLP tools have been in development for over 20 years; a short experiment can't reach the state of the art
Need to develop quality standards and policies for these approaches
Stability of tools, unknown costs, tooling lock-in
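On the unbalanced-data challenge: one common mitigation is to cap how many training examples any single subject heading contributes; a hedged sketch, where the data shape and cap are illustrative assumptions:

```python
import random
from collections import defaultdict

def balance(examples, max_per_label=200, seed=42):
    """examples: (text, subject_heading) pairs; cap each heading's share."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        rng.shuffle(items)
        balanced.extend(items[:max_per_label])  # head terms get truncated
    rng.shuffle(balanced)
    return balanced
```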
What did I learn?
There are two types of ML classification: text and token
One of the models, Annif, had some success with text classification predicting subjects
Some models were very successful at token classification, predicting authors, titles, and identifiers such as ISBN and LCCN
ML requires lots of training data to improve results
½ of the training data contained similar patterns of LCSH
½ of the training data contained unique LCSH
Catalogers reacted more positively to the results than expected
Cataloger-assisted prototypes were really cool and have potential
Catalogers interested in ML and seem less afraid of it than expected
LLMs (ChatGPT) show promise but need more experimentation
Room for both HITL and LLM approaches
What do I still want to learn?
Would faceted subject headings (post-coordinated) be more successful than subject strings à la LCSH (pre-coordinated) in ML processes? (See the splitting example after this list.)
Are some subject categories more successfully cataloged using ML than others?
Could a model be trained to accurately predict LC Classification and/or Dewey Decimal Classification?
What other metadata elements can be extracted from ebooks?
Can LLMs like ChatGPT be trained to predict accurate bibliographic descriptions?
What are the ML policies/decisions that the Library needs to make, e.g.:
Copyright concerns
Accuracy vs. Relevance
Training data biases
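On the faceted-headings question above: a pre-coordinated LCSH string can be split mechanically into post-coordinated facets, and each facet recurs across far more records than the full unique string does, which is why faceting might help an ML classifier. A tiny illustration with an invented heading:

```python
# "--" is the standard LCSH subdivision delimiter in display form.
heading = "Women--Employment--History--19th century"
facets = heading.split("--")
print(facets)  # ['Women', 'Employment', 'History', '19th century']
```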
Next steps
ECD2: Toward Piloting Computational Description
Where are the most effective combinations of automation and human intervention in generating high-quality catalog records that will be usable at the Library of Congress?
What are the benefits, risks, and requirements for building a pilot application for ML-assisted cataloging workflows?
ECD3: Extending Experiments to Explore Computational Description
How can ML methods support the CIP cataloging workflow?
How can CIP metadata generated through ML be ingested and used in the BFDB?
How can additional elements added to BF descriptions improve the quality and usefulness of the metadata compared to ECD1 and ECD2?
Experiment with 3 ML models
Use prepublication galleys, all in PDF, some with minimal text provided
Create BIBFRAME descriptions that can be loaded into the test BFDB (a minimal sketch follows this list)
Require more metadata beyond the 6 fields required in task order 1
Allows for cataloger review in the BIBFRAME Editor
Extension of cataloger assistant prototypes
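A minimal sketch of the kind of BIBFRAME description that could be loaded into a BFDB test instance, built with rdflib against the published BIBFRAME ontology; the instance URI and title are invented:

```python
from rdflib import BNode, Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

BF = Namespace("http://id.loc.gov/ontologies/bibframe/")

g = Graph()
g.bind("bf", BF)
instance = URIRef("http://example.org/instances/ecd3-0001")  # invented URI
title = BNode()
g.add((instance, RDF.type, BF.Instance))
g.add((instance, BF.title, title))
g.add((title, RDF.type, BF.Title))
g.add((title, BF.mainTitle, Literal("The Example Title")))
print(g.serialize(format="turtle"))
```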
Questions:
Other LLMs? Currently using ChatGPT, maybe Claude. Tested Llama: almost as good as ChatGPT 3.5, and more permissive than the later GPT-4.0.
Timeline for next steps and BF work? The current task order, working on the prototypes, ends in August; the BF work begins in August 2024.
Can you expand on faceted subjects as potentially more successful? Extreme text classification breaks strings down into facets; the plan is to use linked data, and URLs are necessary for BF. There is controversy about whether pre-coordinated or post-coordinated strings are more useful for users. If there are only one or two patterns of data across the training data, that is an interesting question for policy.
AI metadata clean-up? An example from Alvin Stockdale at the National Library of Medicine: using ChatGPT to format a TOC properly for a MARC record, giving it prompts to format the TOC. Running a script to fix things is more automation than ML; in an automated find/replace nothing is inferred, while ML makes choices about what it thinks the value should be. The difference is in the application of the ML method.
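A sketch of what such a TOC-cleanup prompt could look like with the OpenAI Python SDK; the prompt wording and model name are assumptions, not Stockdale's actual script:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_toc = """Introduction....1
Chapter 1 Getting started....9
Chapter 2 Going further....41"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[
        {"role": "system",
         "content": "Reformat this table of contents as a MARC 505 "
                    "contents note: strip page numbers and join the "
                    "entries with ' -- '. Return only the note."},
        {"role": "user", "content": raw_toc},
    ],
)
print(resp.choices[0].message.content)
```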
Prototype integrated with FOLIO? Ebook cataloging happens in a number of ways; will the same ML method produce the same quality? Getting results, and very open to plans for integrating with FOLIO. The output is not serialized in MARC format. The BF description goes into a BF test instance, then migrates to a FOLIO instance. Not near a pilot phase; these are experiments: learning about requirements and what is possible, and creating requirements for future systems rather than the systems themselves. Not dealing with production systems, only experiments with prototypes. Fortunate to work with LC Labs and Cataloging to experiment with ML methods.