title: ai4lam Metadata/Discovery WG Monthly Meeting

March 12, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

Attending

  • Jeremy Nelson, Stanford

  • Aaron Krebeck, Washington Research Library Consortium

  • Erik Radio, Colorado

  • Tim Thompson, Yale

  • Sara Amato

Project Documents and Data

Agenda

  • Announcements

  • Aaron Krebeck Presentation

    • https://docs.google.com/presentation/d/1e4jemOguxXVYLvPESyWnIw8M7BzgPv2SlqFcXDgtSd8/edit?usp=sharing

    • Practical application using ChatGPT

    • Asked ChatGPT to write a script for Bob's Burgers and was impressed with the output and its formatting. What other applications could give ChatGPT a voice? Had a few projects in the library that might benefit.

    • The project comes out of shared print, specifically shared print serials. Serials holdings statements vary widely; parts and dates are not recorded consistently. A metadata committee came up with metadata standards for writing serials statements: guidelines for how statements for new and ongoing serials should be written in the consortium.

    • Not a retroactive project: with 900k serials, there is not enough staff time to go back and revise manually.

    • The metadata committee came up with good documentation for serials statements.

    • This documentation is a good starting point for AI.

    • Proof of Concept

      • Slowly refine a query with ChatGPT

      • Item description templates very helpful

      • Great opportunity to learn to use AI

      • The ChatGPT free tier allows roughly 100 lines of input, which is plenty for testing query development.

      • Enough to test 25 gnarly descriptions

    • The prompt included guides for volumes, numbers, etc. (a hedged sketch of such a prompt appears below).

      • Got to the point of parsing correctly 19/20 or 20/20 descriptions.
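
      • The exact prompt wasn't shared; as a hedged illustration only, a comparable request sent through the OpenAI Python SDK might look like the sketch below. The model name, rule wording, and sample descriptions are assumptions, not the actual WRLC prompt, and Aaron worked in the ChatGPT web interface rather than the API.

```python
# Hypothetical sketch: normalizing serial item descriptions with the OpenAI Python SDK.
# The rule wording, model name, and sample data are illustrative assumptions only;
# the presenter used the ChatGPT web interface, not API calls.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GUIDELINES = """Rewrite each serial item description to follow these rules:
- Use 'v.' for volume and 'no.' for number; translate non-English terms into English.
- Put enumeration first and chronology in parentheses, e.g. 'v.12:no.3(1998:July)'.
- Keep every numeric element from the original description.
Return one rewritten description per input line, in the same order."""

descriptions = [
    "Vol 12 no 3 July 1998",
    "Jahrg. 5, Heft 2 (1987)",
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": GUIDELINES},
        {"role": "user", "content": "\n".join(descriptions)},
    ],
)
print(response.choices[0].message.content)
```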

    • Next question was how to use API calls and build an application.

    • OpenAI then released custom GPTs, such as the data analysis GPT.

      • Upload a CSV, tell it what to do, and download results

      • Only needed to buy a Pro license.

      • Didn't need to use APIs or develop a custom app.

    • Results came back as a CSV with the barcode and description columns plus a new column, "New Description".

    • A colleague says that training a GPT is like teaching.

    • Bookmarked progress for iterative training: in the 900k-row CSV, create a bookmark at 100k rows, 200k rows, and so on.

    • Nine different universities and 50 different tech services managers over the years; each one got it wrong in different ways.

    • Q: Talk more about the process?

      • Started with Alma Analytics and first pulled the 900k rows. ChatGPT doesn't struggle with this volume on the Pro GPT-4 engine; it has handled up to 1.5 million rows. Keep the export pretty simple: an identifier like the barcode plus the field to be standardized.

      • Once the CSV file is ready, upload it through the ChatGPT web interface.

      • Prompts started off processing all 900k rows at once; had better success breaking it down by university, 200-300k rows at a time (see the chunking sketch after this Q&A).

      • This allows the prompt to be tailored to each university's "house" style of serials statements.

      • Use the English word for volume: for example, "Jahr" in German is always translated into English; this needed a special rule to process.
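
      • As a hedged sketch of the chunking step, the Alma Analytics export could be split into per-institution CSVs with pandas before upload; the column names below ("Institution", "Barcode", "Description") are assumptions rather than the actual Analytics report headers.

```python
# Hypothetical sketch: split a large Alma Analytics export into per-institution
# chunks so each batch can use a prompt tailored to that university's "house" style.
# Column names are assumptions, not the actual Analytics report headers.
import pandas as pd

export = pd.read_csv("alma_serials_export.csv")  # e.g. the ~900k-row export

for institution, rows in export.groupby("Institution"):
    # Keep each upload simple: an identifier plus the field to be standardized.
    rows[["Barcode", "Description"]].to_csv(f"serials_{institution}.csv", index=False)
    print(institution, len(rows), "rows")
```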

    • Q: When was the last time you ran this process?

      • Already did the 900k volumes in the shared storage facility.

      • Partners have asked to do the volumes held on campus as well, which will be over 1 million.

      • Ran into a problem with the network configuration in Alma: material in the shared storage facility is also in the home zone, so it is duplicated between the consortium and the institutions. The item description must match exactly, so the process needs to be run twice, once for the consortium and once for the individual institution.

    • It isn't perfect; some output falls outside the specification and the metadata format. It definitely gets confused by four-digit numbers that aren't dates: in rare instances, such as a page number, ChatGPT treats them like dates.

    • Also not great with bad input (garbage in, garbage out): it doesn't fix data with cataloging errors or problem special characters. Generally speaking, the average success rate is above 80%.

    • What to do with the 15% or so that it gets wrong or doesn't fix?

      • Ask ChatGPT to check over its work.

      • It looks at the original description and all of its number elements and ensures that all of those elements appear in the new description.

      • If information is missing, don't use that result from the checking process and flag it for review (a hedged sketch of this kind of check appears below).
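
      • A hedged local equivalent of that check (Aaron asked ChatGPT itself to verify its work; this sketch just shows the same idea in Python, with assumed file and column names): compare the numeric tokens in the original and new descriptions and flag any row where a number went missing.

```python
# Hypothetical sketch: flag rows where the rewritten description drops a numeric
# element that appeared in the original description. File and column names are assumed.
import re
import pandas as pd

def missing_numbers(original: str, rewritten: str) -> list[str]:
    """Return numeric tokens from the original that do not appear in the rewrite."""
    remaining = re.findall(r"\d+", str(rewritten))
    missing = []
    for num in re.findall(r"\d+", str(original)):
        if num in remaining:
            remaining.remove(num)
        else:
            missing.append(num)
    return missing

df = pd.read_csv("serials_with_new_description.csv")
df["Missing Numbers"] = [
    missing_numbers(old, new)
    for old, new in zip(df["Description"], df["New Description"])
]
df[df["Missing Numbers"].str.len() > 0].to_csv("flagged_for_review.csv", index=False)
```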

    • Next steps?

      • From normalized item descriptions, to 866 fields, to coded fields.

      • National/international policies and templates for standardized 866s

      • Or not!

      • What other periodical problems can this solve?

    • The EAST consortium has a list of publishers to exclude, with no uniform standard for the names; could ChatGPT be used in a similar way?

    • Q: Have you used other models, such as Gemini or Claude?

      • Gemini is good for small datasets in Google Sheets. Haven't done any large datasets with it; there is no CSV upload option.

    • Once we had the CSV with the revised descriptions, a Python script uploads the changes to Alma (a hedged sketch appears below).
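
      • The script itself wasn't shown; as a hedged sketch only, pushing revised descriptions back through the Ex Libris Alma REST Items API might look roughly like this. The endpoint, JSON field names, and CSV columns are assumptions based on the public Alma API docs, not WRLC's actual script.

```python
# Hypothetical sketch: write revised item descriptions back to Alma via its REST API.
# Endpoint, field names, and CSV columns are assumptions; WRLC's script was not shared.
import csv
import requests

API_BASE = "https://api-na.hosted.exlibrisgroup.com/almaws/v1"
HEADERS = {
    "Authorization": "apikey YOUR_ALMA_API_KEY",  # placeholder key
    "Accept": "application/json",
    "Content-Type": "application/json",
}

with open("serials_with_new_description.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Look the item up by barcode to get its full record and canonical URL.
        resp = requests.get(f"{API_BASE}/items", headers=HEADERS,
                            params={"item_barcode": row["Barcode"]})
        resp.raise_for_status()
        item = resp.json()
        # Replace the description and PUT the whole record back to the item's link.
        item["item_data"]["description"] = row["New Description"]
        requests.put(item["link"], headers=HEADERS, json=item).raise_for_status()
```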

    • Q: Is this publicly shared? A link to a public thread or documentation would be helpful.

    • Q: Curious about parsing into structured data instead of formatted strings?

      • Some success taking these item descriptions, breaking them apart, and adding them as enumeration and chronology fields in MARC, done in the Alma environment.

      • On a task force, a committee on the 583 field subfield $a; pushing Ex Libris so that if you have retention information, it goes into the 583 $a.

      • The standards themselves do not provide much guidance about holdings information.

    • Q: What is the ultimate purpose? A common dataset?

    • Given this success, how can these tools be applied to other problems? Which tasks are safe for an LLM to do, and which should we keep for human staff?

    • Q: How long does it take to process 900k?

      • A minute or two

    • Tim: Working on place hierarchies in subject headings, using the API for one-off requests like "is place A part of place B?" Will go back and try again with a CSV, a familiar juggling process (a hedged sketch of batching those checks appears below).
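
      • A hedged sketch of that CSV-driven retry (prompt wording, model name, and column names are assumptions, not Tim's actual workflow): send all candidate pairs in one prompt instead of one-off requests.

```python
# Hypothetical sketch of batching "is place A part of place B?" checks from a CSV
# in a single prompt, rather than one-off API requests. Prompt wording, model name,
# and column names are assumptions.
import csv
from openai import OpenAI

client = OpenAI()

with open("place_pairs.csv", newline="") as f:
    pairs = [f"{row['place_a']} | {row['place_b']}" for row in csv.DictReader(f)]

prompt = (
    "For each line below, answer only 'yes' or 'no': is the first place "
    "geographically part of the second place? Keep the lines in order.\n"
    + "\n".join(pairs)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```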

    • Also asking ChatGPT about shared print monographs: keep two copies of every title, ideally with both copies in a shared facility. Didn't want any partners to have unequal representation, so prioritize shared storage and equal representation. Of 1.8 million monographs, how many are in shared storage but are not the "official" retained copy?

      • Rows are fine, but extra cells are problematic.

      • Only give it what it needs to make a decision.

    • Q: Will you publish this work in Code4Lib?

      • Even a GitHub repo with the prompts would be very helpful.

    • Q: Has a similar problem parsing semi-structured data; is there a way to turn off the code interpreter and use the purely statistical model?

      • Haven’t done it on this project

    • Q: Could these models be applied to other areas in the library?

      • In the Harvard high-density storage model, items are stored by size. Accessioning is challenging: remove a small paperback and replace it with a large volume and you end up with "Swiss cheese"; there are some policies around this.

      • More efficient rules could come from using ChatGPT to build heat maps for high-density storage; the row with the most duplication could be targeted.

      • Storage Optimization.