title: ai4lam Metadata/Discovery WG Monthly Meeting

January 12, 2021#

9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris

Connection Information

Topic: AI-LAM Metadata Working Group


Time: This is a recurring meeting. Meet anytime


Join from PC, Mac, Linux, iOS or Android: [https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09](https://stanford.zoom.us/j/91421044393?pwd=L0VLbnQ0WlE4SDV0MDY5SUhTQnVydz09) 


	Password: 306295


Or iPhone one-tap (US Toll): +18333021536,,91421044393# or +16507249799,,91421044393#


Or Telephone:


	Dial: +1 650 724 9799 (US, Canada, Caribbean Toll) or +1 833 302 1536 (US, Canada, Caribbean Toll Free)


     	 


	Meeting ID: 914 2104 4393


	Password: 306295


	International numbers available: https://stanford.zoom.us/u/aeoeCDrpd


	Meeting ID: 914 2104 4393


	Password: 306295


	SIP: 91421044393@zoomcrc.com


	Password: 306295

Attending

  • _Name (institution) _

  • Tim Thompson (Yale University)

  • Jeremy Nelson (Stanford University)

  • Joshua Gomez (UCLA)

  • Mike Jones (USC Libraries) mikejone@usc.edu

  • Amy Rudersdorf (AVP)

  • Corey Harper (Elsevier Labs / UvA)

  • David Lowe (Texas A&M University)

Regrets

  • Jodi Schneider

  • Melissa Gill (J. Paul Getty Trust)

**Notetaker: **

#

Helpful Links

Project Documents and Data#

Agenda Topics#

  1. Updates and announcements (5-10 mins.)

  2. Report on recent Stanford University Libraries project on using machine learning to classify electronic theses and dissertations (15-20 mins.)

  3. Preliminary discussion of metadata requirements for ML models and other AI resources (15 mins.)

  • Corey Harper’s comment on draft WG charter:

      _To address the roles of metadata in helping curate, collect, describe, and make discoverable the resources needed for AI, ML, and Data Science more broadly._
    
    
      _A few examples of this include:_
    
    
      _* Model cards to provide information about how ML models are trained: [https://arxiv.org/pdf/1810.03993.pdf](https://arxiv.org/pdf/1810.03993.pdf)_
    
    
      _* Documentation of datasets used in training and testing SOA [state of art] on specific tasks, such as [http://nlpprogress.com/](http://nlpprogress.com/)_
    
    
      _* And perhaps most importantly, documenting bias in both models and datasets that could have undue effects on already marginalized groups._
    
  • Approaches:

  • “Model Cards” proposal spells out sections and requirements for ML model documentation. But the data recorded is not necessarily machine actionable. Example:

              Intended Use > Out-of-Scope Uses:
    
    
              _Examples include “not for use on text examples shorter than 100 tokens” or “for use on black-and-white images only; please consider our research group’s full-color-image classifier for color images.” (p. 4)_
    
  • Potential guide: Emulation as a Service Infrastructure (Software Preservation Network). Using Wikidata (WikiProject Informatics) to model structured data about software, file formats, and computing environments. Example: Atari TT030

  • Wikidata entry for ML model: fastText (for text classification)

  1. Next steps (10-20 mins.)

    1. Proposed standing meeting time for Metadata WG be the second Tuesday of every month at 9AM California / 12PM Washington DC / 5PM UK / 6PM Oslo & Paris.

    2. Discuss communication preferences and platforms.

    3. Forecast for group’s plans for 2021.

Meeting Notes#

  1. Updates and announcements

  2. Report on recent Stanford University Libraries project on using machine learning to classify electronic theses and dissertations. Jupyter notebook for this presentation is at jermnelson/ai4lam-jan-2021.

    Scoping project: \

  • Catalogers realized there were no subjects in these materials \



  • 2.5 week project to automate assignment of FAST headings

    Workflow: \

  • Loading raw text for 7k+ documents (ocr’d from pdf) \

  • Did some segmentation previously, TOCs, etc, but didnt’ do much with it. \

  • Initial approach, using BERT via HuggingFace Transformers (https://huggingface.co/transformers/pretrained_models.html) \

    • Also tried FastAI’s BERT \

  • Needed to overcome lack of labeled training/eval data – wanted to avoid assignment by catalogers, but ended up needing to go back to that manual work. \

  • First attempt was using Google books data that Stanford has access to. \

    • Mapped google books LC headings to corresponding FAST headings

      Challenges: \

  • Couldn’t get appropriate GPU environment in place \



  • 2nd idea was to build a tool for catalogers to make training data: \

  • Streamlit app and a custom spaCy workflow \

  • Running these on abstract only, spaCy statistical NER model with dictionary based FAST-BIO additions, then human in the loop correction / quality-control routine in Streamlit.


    Secondary project around clustering: \

  • K-means based on BERT embeddings \

  • Setting up clusters, then looking at distribution of departments in each cluster \

  • Streamlit app to tune number of clusters and view results

    Challenges: \

  • Huge number of FAST headings; relatively small corpus

    Questions about genesis of project: \

  • Phil S is a proponent of providing ML to cataloging, and he championed the experiment

    Discussion a bit about how this POC really showed how metadata folks can be involved in this process. Metadata librarians are now proponents.

    Amy making connectiosn to AVP’s AMP work and their MGMs, including a H-MGM (Human Metadata Generation Mechanisim), and they have found that the algorithmic approaches aren’t super useful without the human components…

  1. Preliminary discussion of metadata requirements for ML models and other AI resources

    See above for context. Tim’s done a lot of work on identifying resources around state of the art in this space. What kind of things folks are thinking about.

    Tim’s set up a zotero library on this topic and what’s going on in this space. Suggesting that we collaborate on providing notes, discussion, etc.

    Various sub areas: \

  • Evaluation of ML models, performance on diff datasets \

  • Ethical considerations and fairness in ML models \

    • Intersections of attributes, etc

      Another possible role for this group could be formalizing some of this work in an ontology or other actionable schema… \

  • Yales’ Emulation as a service infrax has been using wikidata to model similar kinds of documentation \

  • Wikidata actually has some metadata on some popular models… \

- Importance of documenting individual projects and workflows--so much is project specific. 
  1. Next steps

  2. Proposed standing meeting time for Metadata WG be the second Tuesday of every month at 9AM California / 12PM Washington DC / 5PM UK / 6PM Oslo & Paris.

  3. Discuss communication preferences and platforms. \

  4. Forecast for group’s plans for 2021.

Recording available at

https://stanford.zoom.us/rec/share/W8vzeFV2BDrxlwPXNRucJOZ3faU-ivXYrvH_onASb5-RmDNi3IPpg7Z48BntYRrF.nevqxwLcw102GgKR

Action Items#