title: ai4lam Metadata/Discovery WG May Monthly Meeting

May 7, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

Attending

  • Name, institution

  • Jeremy Nelson, Stanford

  • Abigail Porter, Library of Congress

  • Caroline Saccucci, Library of Congress

  • Julia Kim, Library of Congress

  • Erik Radio, Colorado

  • Sara Amato, Eastern Academic Scholars’ Trust

  • Ian Bogus, ReCAP

Project Documents and Data

Agenda

  • Announcements

  • Presentation “Exploring Computational Description: An update on Library of Congress experiments to automatically create MARC data metadata from ebooks” by Abigail Porter and Caroline Saccucci

    • Update on FF 2023 presentation

    • Cataloging and LC Labs are collaborating on this experiment

    • LC Labs AI Planning Framework

      • Understand, Experiment, Implement - Governance and Policy

      • Establish a quality baseline, then implement; maintain robust auditing and shared quality standards

    • Tools

      • Articulating Principles

      • Use case Risks & Benefits

      • Domain Profiles

      • Data Assessment

      • Acquisitions

    • Goal of Exploring Machine Learning

      • First task order in the Digital Innovation IDIQ, which is scoped for experiments in AI and ML

      • What are examples, benefits, risks, costs, and quality benchmarks

      • What technologies and workflow models are most promising to support metadata creation and assist with cataloging workflows?

      • Similar activities being employed by other organizations

      • This is an experiment, not building toward production

    • E-Books with Cataloging Records Used to Train Models

      • Ground truth for testing

        • CIP: 13,802 items

        • Open Access: 5,825 items

        • E-Deposit ebooks: 403 items

        • Legal reports: 3,750 items

      • Test data

        • Existing catalog records for the ebooks

          • Test models against test data

          • Generate performance reports

      • Target data

        • Uncataloged ebooks

          • Run most models

    • What are we testing?

      • Token classification - extracting specific bibliographic metadata from the text, such as a Title or Author name (see the sketch after this list)

      • Text classification - characterizing the whole text, for example assigning subject headings or genres

      • Models: BERT, spaCy, and GPTs with variations (NLP, NER, LLM, transformer and non-transformer models); the vendor picked the models; 2022 was before ChatGPT
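
A minimal sketch (not from the presentation) of what NER-style token classification for bibliographic field extraction can look like, using the Hugging Face transformers pipeline. The public model name is an illustrative stand-in, not one of the vendor-selected models.

```python
# Hedged sketch: token classification (NER-style) over e-book front matter to
# surface candidate Author/Title/identifier spans. "dslim/bert-base-NER" is a
# generic public model used only for illustration, not the vendor's model.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

front_matter = "The Great Gatsby by F. Scott Fitzgerald. ISBN 978-0-7432-7356-5."

for entity in ner(front_matter):
    # Each entity carries a label (e.g. PER for a person), the matched span,
    # and a confidence score that can be thresholded before human review.
    print(f"{entity['entity_group']:5s} {entity['word']:35s} {entity['score']:.2f}")
```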

    • Results: Token Classification for All Fields

      • F1 score for each of the models, ranked from highest (best) to lowest (worst); a per-field F1 calculation is sketched below

      • 80% for some fields
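
A minimal sketch (assumed, not from the presentation) of how a per-field F1 score can be computed by comparing extracted values against the existing catalog records used as ground truth; the record IDs and titles are invented for illustration.

```python
# Hedged sketch: score one extracted field (e.g. Title) against ground truth.
# Only exact matches count as true positives, mirroring an exact-match audit.
def f1_for_field(predicted: dict, truth: dict) -> float:
    tp = sum(1 for rec_id, value in predicted.items() if truth.get(rec_id) == value)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

truth = {"rec1": "The Great Gatsby", "rec2": "Moby-Dick", "rec3": "Beloved"}
predicted = {"rec1": "The Great Gatsby", "rec2": "Moby Dick"}  # one exact match

print(f"Title F1: {f1_for_field(predicted, truth):.2f}")  # -> Title F1: 0.40
```

Repeating this per field and per model yields the ranked, per-field scores described here.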

    • Results: Token Classification for One Field

      • Fields

        • 700

        • 655

        • 264

        • 245

      • Expectation: ~80%; quality standard: ~95%

      • Exceeded ~80% for some fields (ISBN, Author, Title; the models did well at identifying these fields)

      • Date was not easily identified; this may be due to the vendor's setting for the date parameter rather than the model itself

      • LCCN: 100%; the records almost always had the LCCN, and the copyright page in the e-book would include the LCCN

    • Early Metrics Analysis: Matches and Non-Matches

      • Results after applying Annif (a sketch of requesting Annif suggestions follows this list)

        • Green-shaded areas are exact matches to the MARC XML

        • 1-to-1 match

        • White rows are ML model suggestions that did not match

        • The model did not assign "fiction"; nothing in the text indicated that it is fiction

        • Ability to get a URL even with freeform subdivisions

        • Some of the headings with "fiction" are established with a URL; others have no URL

        • The combination of positive and negative hits raises questions about relevancy and accuracy: how well does the model do?
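
Annif is the open-source automated subject indexing toolkit named above. A minimal sketch of how suggestions are typically requested from a running Annif instance over its REST API; the base URL and the project id "lcsh-en" are assumptions for illustration, not the configuration used in this experiment.

```python
# Hedged sketch: ask a locally running Annif instance for subject suggestions.
# The URL and project id are illustrative assumptions; the response fields
# (uri, label, score) follow Annif's documented suggest endpoint.
import requests

ANNIF_URL = "http://localhost:5000/v1"   # assumed local Annif instance
PROJECT_ID = "lcsh-en"                   # hypothetical LCSH-trained project

def suggest_subjects(text: str, limit: int = 5) -> list:
    resp = requests.post(
        f"{ANNIF_URL}/projects/{PROJECT_ID}/suggest",
        data={"text": text, "limit": limit},
    )
    resp.raise_for_status()
    return resp.json()["results"]

for hit in suggest_subjects("A novel about a whaling voyage in the nineteenth century"):
    print(f"{hit['score']:.3f}  {hit['label']}  {hit['uri']}")
```

Suggestions above a chosen score threshold can then be compared against the MARC XML to produce the match/non-match table described here.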

    • Assisted Cataloging HITL Prototype

      • The vendor met with a group of catalogers for a workshop conversation on how this work can benefit catalogers

        • Not just the name of the author, but the name of the author as it appears in the authority record

        • Authorized form of name or concept.

      • Two low-fidelity prototypes

      • Number of tabs for the cataloger to go through

        • Abstracts and summaries

        • Extractive summary taken directly from the text of the book; the machine picks the main sentences (see the sketch after this prototype section)

        • Abstractive summary generated by scanning through the text

      • The cataloger selects from the model's suggestions, giving the cataloger a chance to respond to what the machine is suggesting and an opportunity to see how well the ML workflow suggests Subjects and Names from the authority records

      • In both cases, narrower and broader terms from LCSH, Wikidata, and other linked-data resources

      • How well did catalogers appreciate the suggestions, what is and is not beneficial, and how can feedback be provided to the model?
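
A minimal sketch (assumed, not the prototype's implementation) of the extractive summarization described above, where the machine picks the highest-scoring sentences straight from the book's text; the frequency-based scoring is a simple illustration, not the vendor's method.

```python
# Hedged sketch: pick the "main sentences" of a text by scoring each sentence
# on how many frequent words it contains, then returning the top few verbatim.
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower())),
        reverse=True,
    )
    chosen = set(scored[:n_sentences])
    # Keep the selected sentences in their original order of appearance.
    return " ".join(s for s in sentences if s in chosen)

text = ("Whaling voyages shaped the port towns of New England. "
        "The crews signed on for years at a time. "
        "This chapter follows one voyage from Nantucket to the Pacific.")
print(extractive_summary(text))
```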

    • Assessing Quality

      • How to assess quality for these tools in different ways

      • F1 score, highest performing scores by field

      • Human-in-the-loop prototypes to increase the quality of the records

      • Contractor qualitative scoring of the models

        • Reliability

        • Compute cost

        • Training data

        • Activity

        • Documentation

        • Developer

        • Compliance to security and privacy considerations

      • Overall quality of the service, program evaluation factors

        • Likelihood of maintaining quality over time

        • Reasonable cost

        • Benefits to staff/catalogers

        • Benefits to users

        • Benefits to organization

        • Fair and equitable risks

        • Risks to users

        • Risks to organization

        • Security risks

        • Privacy risks

        • Authenticity risks

        • Reputation risks

        • Compliance risks

      • Want to collaborate with other organizations

    • Challenges

      • Unbalanced data - long tail of subject terms

        • Create well-balanced training data to train and test models (see the sketch after the challenges list)

        • Exploring correcting over-representation of the English language and other biases

      • Short-term timeline - several NLP tools have been in development for over 20 years; the state of the art cannot be reached on this timeline

      • Need to develop quality standards and policies for these approaches

      • Stability of tools, unknown costs, tooling lock-in
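
A minimal sketch (assumed) of how the long tail of subject terms can be measured before assembling a balanced training set; the subject labels and counts are invented for illustration.

```python
# Hedged sketch: count how many training records carry each subject term, so
# over-represented terms can be capped (downsampled) and rare terms surfaced.
from collections import Counter

training_subjects = [
    ["United States -- History"], ["United States -- History"],
    ["United States -- History"], ["Whaling -- Fiction"],
    ["Quantum field theory"],
]

counts = Counter(term for record in training_subjects for term in record)
for term, n in counts.most_common():
    print(f"{n:3d}  {term}")

# Terms seen only once form the "long tail" that is hard for models to learn.
long_tail = [term for term, n in counts.items() if n == 1]
print("Long tail:", long_tail)
```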

    • What did I learn?

      • There are two types of ML classification: text and token

      • One of the models, Annif, had some success with text classification predicting subjects

      • Some models were very successful at token classification, predicting authors, titles, and identifiers such as ISBN and LCCN

      • ML requires lots of training data to improve results

        • ½ of the training data contained similar patterns of LCSH

        • ½ of the training data contained unique LCSH

      • Catalogers reacted more positively to the results than expected

      • Cataloger-assisted prototypes were really cool and have potential

      • Catalogers interested in ML and seem less afraid of it than expected

      • LLMs (ChatGPT) show promise but need more experimentation

      • There is room for both HITL approaches and LLMs

    • What do I still want to learn?

      • Would faceted subject headings (post-coordinated) be more successful in ML processes than subject strings à la LCSH (pre-coordinated)?

      • Are some subject categories more successfully cataloged using ML than others?

      • Could a model be trained to accurately predict LC Classification and/or Dewey Decimal Classification?

      • What other metadata elements can be extracted from ebooks?

      • Can LLMs like ChatGPT be trained to predict accurate bibliographic descriptions?

      • What are the ML policies/decisions that the Library needs to make, e.g.:

        • Copyright concerns

        • Accuracy vs. Relevance

        • Training data biases

    • Next steps

      • ECD2: Toward Piloting Computational Description

        • What are the most effective combinations of automation and human intervention in generating high-quality catalog records that will be usable at the Library of Congress?

        • What are the benefits, risks, and requirements for building a pilot application for ML-assisted cataloging workflows?

      • ECD3: Extending Experiments to Explore Computational Description

        • How can ML methods support the CIP cataloging workflow

        • How can CIP metadata generated through ML be ingested and used in BFDB

        • How can additional elements added to BF descriptions improve quality and usefulness of the metadata compared to ECD1 and ECD2?

        • Experiment with 3 ML models

        • Use prepublication galleys, all in PDF, some with minimal text provided

        • Created BIBFRAME descriptions that can be loaded to test BFDB

        • Require more metadata beyond the 6 fields required in task order 1

        • Allows for cataloger review in the BIBFRAME Editor

        • Extension of cataloger assistant prototypes

    • Questions:

      • Other LLMs? Currently using ChatGPT and maybe Claude; tested Llama, which was almost as good as ChatGPT 3.5; more permissive with the later GPT 4.0

      • Timeline for next steps and BF work? In the current task order we are working on prototypes; the task order ends in August, and the BF work begins in August 2024.

      • Can you expand on faceted subjects as potentially more successful? Extreme text classification breaks strings down; the plan is to use linked data, and URLs are necessary for BF. There is controversy about whether pre-coordinated or post-coordinated strings are more useful for users. If there are only one or two patterns of data across the training data, it becomes an interesting question for policy.

      • AI metadata clean-up? An example from the National Library of Medicine via Alvin Stockdale: using ChatGPT to format a TOC properly for a MARC record, giving the model prompts to format the TOC (a minimal prompt sketch follows at the end of these notes). Running a script to fix things is more automation than ML: an automated find/replace applies fixed rules, whereas ML makes choices about what it thinks the value should be. The difference is in the application, i.e. the ML method.

      • Prototype integrated with FOLIO? E-book cataloging happens in a number of ways; will the same ML method produce the same quality? We are getting results and are very open to plans for integrating with FOLIO. The output is not serialized in MARC format and sits in an odd space alongside it. The BF descriptions go into a BF test instance and would then migrate to a FOLIO instance. We are not near a pilot phase; these are experiments. We are learning about requirements and what is possible, creating future systems or creating requirements for future systems. We are not dealing with production systems, only experiments with prototypes. It has been fortunate that LC Labs can work with Cataloging on experiments with ML methods.
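
A minimal sketch (assumed, not the National Library of Medicine's actual workflow) of prompting an LLM to reformat a raw table of contents into the " -- "-separated form used in a MARC 505 contents note; the model name and prompt wording are illustrative.

```python
# Hedged sketch: prompt an LLM to normalize a scraped table of contents into a
# single MARC 505-style contents note. Requires the OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

raw_toc = """Chapter 1. Getting started....... 1
Chapter 2: Data cleaning        17
3 - Modeling basics ... 45"""

prompt = (
    "Rewrite this table of contents as a single MARC 505 formatted contents "
    "note: strip page numbers and dot leaders, keep chapter titles in order, "
    "and separate entries with ' -- '.\n\n" + raw_toc
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Unlike a find/replace script, the model makes judgment calls about what each line should become, which is the distinction drawn in the answer above.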