title: ai4lam Metadata/Discovery WG Monthly Meeting
March 12, 2024
8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris
Attending
Jeremy Nelson, Stanford
Aaron Krebeck, Washington Research Library Consortium
Erik Radio, Colorado
Tim Thompson, Yale
Sara Amato
Helpful Links
Project Documents and Data
Agenda
Announcements
Aaron Krebeck Presentation
https://docs.google.com/presentation/d/1e4jemOguxXVYLvPESyWnIw8M7BzgPv2SlqFcXDgtSd8/edit?usp=sharing
Practical application using ChatGPT
Asked ChatGPT to write a script for Bob's Burgers; impressed with the output and formatting. What other applications could give ChatGPT a voice? Had a few projects in the library that might benefit.
Comes from shared print and shared print serials. Serials holdings statements have large variability; parts and dates are not consistent. A metadata committee came up with metadata standards for writing serials statements: guidelines for how statements are written for new and ongoing serials in the consortium.
Not a retroactive project: 900k serials, not enough staff time to go back manually.
The metadata committee came up with good documentation for serials statements.
This documentation is a good starting point for AI.
Proof of Concept
Slowly refine a query with ChatGPT
Item description templates very helpful
Great opportunity to learn to use AI
ChatGPT free tier allows for 100 lines of input, which is plenty for testing query development
Enough to test 25 gnarly descriptions
Prompt had guides for volumes, numbers, etc.
Got to the point of parsing correctly 19/20, then 20/20
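The actual prompt was not shared; as a rough, hypothetical sketch, the kind of normalization prompt being refined here might look like this (the rules below are made up, not the WRLC guidelines):

```python
# Illustrative reconstruction only: the real WRLC prompt and guidelines were not
# shared, so the rules and examples below are assumptions.
NORMALIZATION_PROMPT = """\
You are normalizing serial item descriptions to a consortium standard.
Rules (examples only):
- Use "v." for volume, "no." for number, "pt." for part.
- Dates go in parentheses after the enumeration, e.g. v.12:no.3 (1994).
- Always translate non-English terms (e.g. German "Jahr") into English.
Rewrite each description below, returning one normalized description per line.
"""
```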
How to use API calls and build an application?
OpenAI released custom GPTs, like the data analysis one
Upload a CSV, tell it what to do, and download results
Only needed to buy a Pro license
Didn't need to use APIs or develop a custom app
Results: a CSV with barcode and description columns, plus a new column, "New Description"
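A minimal sketch of the upload/results round trip described above, assuming hypothetical file names (the column names follow the notes):

```python
import pandas as pd

# Hypothetical file names; the column names follow the notes above.
upload = pd.read_csv("serials_export.csv")      # columns: barcode, description
results = pd.read_csv("chatgpt_results.csv")    # columns: barcode, description, New Description

# Sanity check that every uploaded barcode came back with a normalized value.
missing = upload.loc[~upload["barcode"].isin(results["barcode"]), "barcode"]
print(f"{len(missing)} barcodes missing from the results file")
```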
Colleague says that training GPT is like teaching
Bookmarked progress with iterative training: in the 900k-row CSV, create a bookmark at 100k, 200k, and so on
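A minimal sketch of the bookmark idea, splitting the big export into chunks so progress can be checkpointed (chunk size and file names are assumptions):

```python
import pandas as pd

CHUNK = 100_000  # create a "bookmark" every 100k rows (chunk size is an assumption)

df = pd.read_csv("serials_export.csv")  # the ~900k-row export
for start in range(0, len(df), CHUNK):
    # Each chunk is uploaded to the custom GPT on its own; saving the pieces
    # means a failure partway through doesn't lose the earlier results.
    df.iloc[start:start + CHUNK].to_csv(f"serials_chunk_{start:07d}.csv", index=False)

# The same idea works for splitting by institution, as discussed below, e.g.:
# for university, group in df.groupby("institution"): ...
```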
9 different universities and 50 different tech services managers over the years; each one was wrong in different ways
Q: Talk more about the process?
Started with Alma Analytics; first pulled the 900k rows. ChatGPT doesn't struggle with this on the Pro ChatGPT 4.0 engine; they have included as many as 1.5 million rows. Keep the export pretty simple: an identifier like the barcode and the field to be standardized.
Once you have the CSV file, upload it through the ChatGPT web interface
Prompts started off doing all 900k; better success by breaking it down by university, 200-300k at any one time.
Allows tailoring the prompt to each university's "house" style of serials statements
Use the English word for volume; for example, "Jahr" in German is always translated into English. Needed a special rule to process these.
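A sketch of the kind of special rule described here, mapping non-English enumeration terms to English; the term list is illustrative, not the actual guideline:

```python
import re

# Illustrative term map only; the real rules may differ.
TERM_MAP = {
    r"\bJahrgang\b": "v.",
    r"\bBand\b": "v.",
    r"\bHeft\b": "no.",
    r"\bJahr\b": "year",
}

def translate_terms(description: str) -> str:
    """Replace non-English enumeration terms with English equivalents."""
    for pattern, english in TERM_MAP.items():
        description = re.sub(pattern, english, description, flags=re.IGNORECASE)
    return description

print(translate_terms("Jahrgang 12, Heft 3"))  # -> "v. 12, no. 3"
```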
Q: When was the last time you ran this process?
Already did the 900k volumes in the shared storage facility
Partners have asked to do the volumes on campus, which will be over 1 million
Ran into a problem with the network configuration in Alma: material in the shared storage facility is also in the home zone, duplicated in the consortium and the institutions. Item descriptions must match exactly, so the work needs to be done twice, for the consortium and for the individual institution
It isn't perfect; some output falls outside the specification and the metadata format. It definitely gets confused by 4-digit numbers that aren't dates: in the rare instance of something like a page number, ChatGPT treats it like a date
Also not great with bad data (garbage in, garbage out): it doesn't fix cataloging errors, and special characters cause problems. Generally speaking, the average success rate is above 80%.
What to do with the roughly 15% it gets wrong or doesn't fix?
Ask ChatGPT to check over its work
It looks at the original description and all of the number elements, and ensures that all of the number elements appear in the new description
If information is missing, don't use the result from the checking process and flag those rows.
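The same check can be reproduced outside ChatGPT; a minimal sketch that verifies every numeric element of the original survives into the new description and flags rows that fail (column names as in the results CSV above):

```python
import re
import pandas as pd

def numbers_preserved(original: str, new: str) -> bool:
    """True if every numeric element of the original appears in the new description."""
    new_numbers = set(re.findall(r"\d+", str(new)))
    return all(n in new_numbers for n in re.findall(r"\d+", str(original)))

results = pd.read_csv("chatgpt_results.csv")
results["flagged"] = ~results.apply(
    lambda row: numbers_preserved(row["description"], row["New Description"]), axis=1
)
results[results["flagged"]].to_csv("flagged_for_review.csv", index=False)
```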
Next steps?
From normalized item descriptions, to 866s, to coded fields
National/international policies and templates for standardized 866s
Or not!
What other periodical problems can this solve?
East Consortium has a list of publishers to exclude, but no uniform standard for the names; could ChatGPT be used in a similar way?
Q: Use other models, like Gemini or Claude?
Gemini is good for small datasets in Google Sheets. Haven't done any large datasets with it; there was no option to upload a CSV.
Once we had the revised CSV, we have a Python script to upload it to Alma
Q: Publicly shared? Link to public thread or documentation.
From migrating to Alma there is the WRLC/alma_notes_import GitHub repository; Alma doesn't offer much in the way of batch updates
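As a rough sketch of the general pattern (not the actual repository code), updating an item description through the Alma REST API means fetching the item by barcode and writing it back with the new description; the endpoint and field names below are from the public Alma API docs as best recalled, so verify against the WRLC repo:

```python
import requests

API = "https://api-na.hosted.exlibrisgroup.com/almaws/v1"
HEADERS = {"Authorization": "apikey YOUR_ALMA_API_KEY", "Accept": "application/json"}

def update_item_description(barcode: str, new_description: str) -> None:
    # Look up the item by its barcode; Alma redirects to the full item URL.
    resp = requests.get(f"{API}/items", params={"item_barcode": barcode}, headers=HEADERS)
    resp.raise_for_status()
    item = resp.json()
    # Overwrite the description and write the item back to the URL Alma returned.
    item["item_data"]["description"] = new_description
    put = requests.put(item["link"], json=item,
                       headers={**HEADERS, "Content-Type": "application/json"})
    put.raise_for_status()
```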
Q: Curious about parsing into structured data instead of formatted strings?
Some success taking these item descriptions, breaking them apart, and adding them as enumeration and chronology fields in MARC, done in the Alma environment
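An illustrative sketch of that "break apart" step, parsing a normalized description into enumeration and chronology parts; the pattern and field names are assumptions:

```python
import re

# Assumes a normalized shape like "v.12:no.3 (1994)"; real data would need more patterns.
PATTERN = re.compile(r"v\.(?P<enum_a>\d+)(?::no\.(?P<enum_b>\d+))?\s*\((?P<chron_i>\d{4})\)")

def split_description(description: str) -> dict:
    """Break a normalized description into enumeration/chronology parts."""
    match = PATTERN.search(description)
    if not match:
        return {}  # flag for manual review instead of guessing
    return {key: value for key, value in match.groupdict().items() if value}

print(split_description("v.12:no.3 (1994)"))
# -> {'enum_a': '12', 'enum_b': '3', 'chron_i': '1994'}
```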
On a task force/committee for the 583 field, subfield $a; pushing Ex Libris so that if you have retention information, it goes into the 583 $a.
The standards themselves do not provide much information about holdings
Q: What is the ultimate purpose? Common Dataset?
With this success, how can these tools be applied to other problems? What things are safe for an LLM to do, and what should we keep for ourselves as human work?
Q: How long does it take to process 900k?
A minute or two
Tim: Working on places in subject headings in a hierarchy, using the API to do one-off requests like "is place A part of place B"; will go back and try again with a CSV, a familiar juggling process.
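A sketch of the one-off request pattern Tim describes, using the OpenAI Python client; the model name and prompt wording are placeholders, and the batched CSV approach would replace this loop:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_part_of(place_a: str, place_b: str) -> str:
    """One-off hierarchy question; a CSV-based batch would replace this loop."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": f"Answer yes or no: is {place_a} part of {place_b}?"}],
    )
    return response.choices[0].message.content.strip()

print(is_part_of("New Haven", "Connecticut"))
```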
Asking ChatGPT about shared print: keep two copies of every title, and ideally both copies are in a shared facility. Didn't want any partners to have unequal representation, so prioritize shared storage and equal representation. Of 1.8 million monographs, how many are in shared storage but are not the "official" retained copy?
Rows are fine, but extra cells are problematic.
Only give it what it needs to make a decision
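A minimal sketch of that kind of question as a dataframe filter, keeping only the columns needed for the decision (all column names here are hypothetical):

```python
import pandas as pd

# Hypothetical columns: barcode, location, retention_status.
monographs = pd.read_csv("monographs.csv",
                         usecols=["barcode", "location", "retention_status"])

in_storage_not_retained = monographs[
    (monographs["location"] == "shared_storage")
    & (monographs["retention_status"] != "retained")
]
print(len(in_storage_not_retained), "volumes in shared storage without official retention")
```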
Q: Publish work in Code4Lib?
Even a Github repo with the prompts would be very helpful
Q: Similar problem parsing semi-structured data; is there a way to turn off the code interpreter and use just the pure statistical model?
Haven’t done it on this project
Q: Use of these models applied to other areas in the library?
Harvard high-density storage model: storage by size. Accessions are challenging; remove a small paperback and replace it with a large volume and you end up with "Swiss cheese". There are some policies.
More efficient rules using ChatGPT, e.g., heat maps for high-density storage: the row with the most duplication could be targeted.
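A rough sketch of how the duplication-by-row idea could be computed; this is the speculative application, not something already built, and the column names are entirely hypothetical:

```python
import pandas as pd

# Hypothetical inventory: one record per volume with its shelf row and a title identifier.
inventory = pd.read_csv("storage_inventory.csv")  # columns: row_id, oclc_number

# Count duplicate titles per storage row; rows with the most duplication would be
# the first targets for storage optimization.
duplication = (
    inventory.groupby("row_id")["oclc_number"]
    .agg(lambda titles: titles.duplicated().sum())
    .sort_values(ascending=False)
)
print(duplication.head(10))
```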
Storage Optimization.