title: ai4lam Metadata/Discovery WG Monthly Meeting

June 08, 2021

9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris


  • Name (institution)

  • Tim Thompson (Yale)

  • Jeremy Nelson (Stanford)

  • Nicole Coleman (Stanford)

  • Amanda Whitmire (Stanford)

  • Erik Radio (Colorado)

  • Massimo Petrozzi (Computer History Museum)



**Notetaker (alpha by first name):** Tim

Helpful Links

Project Documents and Data

Agenda Topics

  1. Updates, announcements, intros

  2. Stanford Taxa Project - Nicole Coleman, Amanda Whitmire, Jeremy Nelson

  3. Sinopia Knowledge Graph using kglab (time permitting) - Jeremy Nelson

Stanford Taxa Project

  • Marine station library. 100+ year collection: theses, dissertations, research artifacts, including student work (1946-present)

    • Using measurements from the past in context of climate change

      • But data is hidden in texts

      • Digitizing student papers, etc., to extract species observations, i.e., species occurrences

        • Going into GBIF database (6 billion entries)

      • OCR on images, parsed TEI XML through GROBID → Segmentation of text into divs

      • Using NER to identify species

      • Having human review tagged text to confirm species occurrence (can’t be 100% unsupervised) → Making confirmation as fast as possible

    • Collaborating with colleagues in British Columbia, etc. (also with historical papers)

    • Broad range of potential stakeholders for tool

  • Oversight: Stanford AI Steering Group

    • Process and project design template

      • Lit review

      • Reflections on other projects

        • E.g., COPIOUS project

          • Using Biodiversity Heritage Library

          • Focused on Philippines

  • Text/data processing → Segmentation can influence results → species references not necessarily sequential

    • Expression of observations can be complex → narrative, not consistently structured (context varies)

    • First step: ingest paper into GNRD tool → frequencies provide clues as to subject of paper

    • Catalog of Life as knowledge source → link to WoRMS → which links to Wikipedia/Wikidata → Description of habitat

    • GBIF also a data source

  • Q: At what point does human intervention occur in the process?

    • Created Taxa app to allow librarian to verify in a UI

    • Rethinking pipeline to try to make prefiltering smarter in order to limit what’s included for verification via app

    • Interested in process more generally to make it applicable as a model → More emphasis on data sources, preparing the groundwork

    • What’s the implication of using contemporary ML models on historical sources?

  • Q: How was model trained? Was expert knowledge leveraged?

    • Custom spaCy lookup defined using values from WoRMS

    • Also custom geographic data from West Coast + list of habitats

    • Body of TEI organized into divs → each div run through NER

      • Verifier represents divs

      • Pattern-based initial approach had some limitations

      • Looking into entity rules, other built-in parts of spaCy to improve the model

      • Currently focusing on location entities for improvement → minimizing false positives

    • All initial setup was done in one week!

    • Data problems could be spotted because subject expert + metadata expert were part of the team → Building more complete knowledge graph

      • More coordination costs, but serves the goals of the project

      • Building a tool for experts to be able to verify

  • Example of what the library can provide to address a research problem using AI

  • Also unique doing this in the context of science, biodiversity.

  • Created “methods planner” to communicate about steps in pipeline. Processes can be treated as routine, but involve many decision points along the way.

    • Metadata about ML pipelines and processes → How can we standardize it and document decision points in a structured way?

    • Looked at data nutrition labels, datasheets for datasets → robust readme file

    • But definitions of “data” can be ambiguous → at what point do we call it “data”? → How much processing has to be done before it’s “data”?

    • Line between what’s data & what’s metadata → e.g., are identifiers data or metadata? → Yes, but the border is contextual

    • But to the extent that outputs are hosted in library, we have to put them somewhere, relate them to other objects → Risk of not fully appreciating the transformation of data along the way.

    • Treating text/corpora as primary texts/objects of concern

    • Question of goals: why are we working with data? For access? For research? How to preserve project for future in order to understand how “data” was defined. With AI, we can mobilize both data + metadata → Yet risk of data deluge

      • Still important to keep the distinction for future understanding, reuse/reproducibility of work

    • Possibilities for how we can rethink relation between data & metadata → Theses/dissertations FAST project example → Better approach is unsupervised topic modeling (Andromeda Yelton’s work), rather than supervised assignment of subject terms → But means we need different kinds of discovery tools.
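The processing steps noted above (TEI body split into divs, each div checked against a name list, hits queued for a librarian to verify, name frequencies as a clue to the paper's subject) can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the Taxa project's code: the project used GROBID output and a custom spaCy lookup built from WoRMS values, whereas here a plain regular-expression lookup stands in for spaCy. The sample TEI, the two species names, and the function names `segment_divs`, `tag_taxa`, and `candidates` are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace used by TEI XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample standing in for a GROBID-parsed student paper.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head>Introduction</head>
        <p>Pisaster ochraceus was common at the study site.</p>
      </div>
      <div><head>Results</head>
        <p>We counted Pisaster ochraceus and Mytilus californianus in each quadrat.</p>
      </div>
    </body>
  </text>
</TEI>"""

# Hypothetical lookup list; in the project the values came from WoRMS.
TAXA = ["Pisaster ochraceus", "Mytilus californianus"]


def segment_divs(tei_xml):
    """Return the whitespace-normalized text of each top-level body div."""
    root = ET.fromstring(tei_xml)
    divs = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        text = " ".join(" ".join(div.itertext()).split())
        divs.append(text)
    return divs


def tag_taxa(text, taxa):
    """Find case-insensitive whole-name matches; a stand-in for a spaCy lookup."""
    hits = []
    for name in taxa:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), name))
    return sorted(hits)


def candidates(tei_xml, taxa):
    """Build one row per tagged mention, left unverified for human review."""
    rows = []
    for index, text in enumerate(segment_divs(tei_xml)):
        for start, end, name in tag_taxa(text, taxa):
            rows.append(
                {"div": index, "taxon": name, "start": start, "end": end, "verified": None}
            )
    return rows


rows = candidates(SAMPLE_TEI, TAXA)
# Name frequencies across the paper, as a rough clue to its subject.
freq = Counter(row["taxon"] for row in rows)
```

With the sample above, `rows` holds three candidate mentions (one in the first div, two in the second), each with `verified` left as `None` until a reviewer confirms or rejects it. Keeping the div index on every row reflects the point in the notes that segmentation can influence results, since a reviewer needs to see the surrounding div to judge whether a mention is really an occurrence.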

Recording of today’s call is available at https://stanford.zoom.us/rec/share/woYmC8o5Rqa63TycKWzUuKEN9_s95WMfiD7y0rnDRIINJ_BuFMYEab8kQZILyhza.LwrpUEBKa5Jg7T6O