ai4lam Metadata/Discovery WG Monthly Meeting
June 08, 2021
9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
Attending
- Name (institution) 
- Tim Thompson (Yale) 
- Jeremy Nelson (Stanford) 
- Nicole Coleman (Stanford) 
- Amanda Whitmire (Stanford) 
- Erik Radio (Colorado) 
- Massimo Petrozzi (Computer History Museum) 
Regrets
- Name 
**Notetaker (alpha by first name):** Tim
Helpful Links
Project Documents and Data
Agenda Topics
- Updates, announcements, intros 
- Stanford Taxa Project - Nicole Coleman, Amanda Whitmire, Jeremy Nelson 
- Sinopia Knowledge Graph using kglab (time permitting) - Jeremy Nelson 
Stanford Taxa Project
- Marine station library: a 100+ year collection of theses, dissertations, and research artifacts, including student work (1946-present) 
- Using measurements from the past in the context of climate change → but the data is hidden in the texts 
- Digitizing student papers, etc., to extract species observations, i.e., species occurrences → going into the GBIF database (6 billion entries) 
 
- OCR on the images; text parsed into TEI XML with GROBID → segmentation of the text into divs 
- Using NER to identify species (see the sketch below) 
- Human review of the tagged text to confirm species occurrences (can’t be 100% unsupervised) → making confirmation as fast as possible 
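A minimal sketch of the TEI-to-NER step described above, assuming GROBID has already produced TEI XML; the file name and the stock `en_core_web_sm` model are illustrative stand-ins for the project's own inputs and custom pipeline.

```python
# Sketch: extract the text of each GROBID TEI <div> and run spaCy NER over it.
# File name and model are illustrative; the project uses its own custom pipeline.
from lxml import etree
import spacy

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_divs(path):
    """Yield the plain text of each <div> in the TEI body."""
    tree = etree.parse(path)
    for div in tree.findall(".//tei:text//tei:div", namespaces=TEI_NS):
        text = " ".join(div.itertext()).strip()
        if text:
            yield text

nlp = spacy.load("en_core_web_sm")  # stand-in for the project's custom model

for i, div_text in enumerate(tei_divs("student_paper.tei.xml")):
    doc = nlp(div_text)
    for ent in doc.ents:
        print(i, ent.text, ent.label_)
```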
 
- Collaborating with colleagues in British Columbia, etc. (also with historical papers) 
- Broad range of potential stakeholders for tool 
 
- Oversight: Stanford AI Steering Group → process and project design template, lit review 
- Reflections on other projects, e.g., the COPIOUS project → used the Biodiversity Heritage Library, focused on the Philippines 
 
- Text/data processing → segmentation can influence results → species references are not necessarily sequential 
- Expression of observations can be complex → narrative, not consistently structured (context varies) 
- First step: ingest the paper into the GNRD tool → frequencies provide clues as to the subject of the paper 
- Catalog of Life as a knowledge source → links to WoRMS → which links to Wikipedia/Wikidata → description of habitat (see the sketch below) 
- GBIF also a data source 
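A minimal sketch of the linking idea, showing just the WoRMS piece: resolve a detected name against the WoRMS REST service to confirm it is a known taxon and pick up identifiers for onward linking. The endpoint and field names follow the public WoRMS REST API, but treat the exact parameters as an assumption to verify; the species name is only an example.

```python
# Sketch: resolve a detected scientific name against WoRMS.
# Endpoint/field names follow the public WoRMS REST API; verify before relying on them.
import requests

WORMS_URL = "https://www.marinespecies.org/rest/AphiaRecordsByName/{name}"

def lookup_worms(name):
    resp = requests.get(
        WORMS_URL.format(name=name),
        params={"like": "false", "marine_only": "true"},
        timeout=30,
    )
    if resp.status_code != 200:  # WoRMS returns no body when there is no match
        return []
    return resp.json() or []

for record in lookup_worms("Pisaster ochraceus"):  # example name only
    print(record.get("AphiaID"), record.get("scientificname"), record.get("status"))
```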
 
- Q: At what point does human intervention occur in the process? → Created the Taxa app to allow a librarian to verify in a UI 
- Rethinking the pipeline to make prefiltering smarter, in order to limit what’s included for verification via the app (see the sketch below) 
- Interested in the process more generally, to make it applicable as a model → more emphasis on data sources, preparing the groundwork 
- What are the implications of using contemporary ML models on historical sources? 
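A minimal sketch of the prefiltering idea: only divs where the NER pass finds a candidate taxon that also resolves against a lookup get queued for verification in the Taxa app. The `TAXON` label and `is_known_taxon` helper are hypothetical stand-ins, not the project's actual names.

```python
# Sketch: prefilter divs before queueing them for human verification.
# "TAXON" and is_known_taxon() are hypothetical stand-ins for the project's own names.
def prefilter(div_texts, nlp, is_known_taxon):
    """Return (div_text, candidate_taxa) pairs worth sending to the verifier."""
    kept = []
    for div_text in div_texts:
        doc = nlp(div_text)
        candidates = [ent.text for ent in doc.ents if ent.label_ == "TAXON"]
        if any(is_known_taxon(name) for name in candidates):
            kept.append((div_text, candidates))
    return kept
```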
 
- Q: How was the model trained? Was expert knowledge leveraged? → A custom spaCy lookup was defined using values from WoRMS 
- Also custom geographic data from the West Coast + a list of habitats 
- Body of the TEI is organized into divs → each div is run through the NER pipeline → the verifier represents divs 
- The initial pattern-based approach had some limitations 
- Looking into entity rulers and other built-in parts of spaCy to improve the model (see the sketch below) 
- Currently focusing on location entities for improvement → minimizing false positives 
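A minimal sketch of the entity-ruler approach mentioned above, assuming names pulled from WoRMS plus a custom habitat list; the labels, example terms, and the `en_core_web_sm` base model are illustrative.

```python
# Sketch: seed a spaCy EntityRuler with WoRMS-derived names and a habitat gazetteer,
# placed ahead of the statistical NER component. Labels and terms are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")

worms_names = ["Pisaster ochraceus", "Strongylocentrotus purpuratus"]  # e.g., from WoRMS
habitats = ["intertidal zone", "kelp forest"]                          # custom habitat list

patterns = [{"label": "TAXON", "pattern": name} for name in worms_names]
patterns += [{"label": "HABITAT", "pattern": habitat} for habitat in habitats]
ruler.add_patterns(patterns)

doc = nlp("Pisaster ochraceus was abundant in the intertidal zone near the station.")
print([(ent.text, ent.label_) for ent in doc.ents])
```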
 
- All initial setup was done in one week! 
- Data problems could be spotted because a subject expert + a metadata expert were part of the team → building a more complete knowledge graph (see the sketch below) → more coordination costs, but it serves the goals of the project 
- Building a tool for experts to be able to verify 
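A minimal sketch of what one verified occurrence might look like in a knowledge graph, using rdflib and Darwin Core terms; the local URI, the placeholder AphiaID in the LSID, and the choice of predicates are assumptions for illustration, not the project's actual modeling.

```python
# Sketch: one verified occurrence as RDF, linked to a WoRMS identifier.
# URIs, the placeholder AphiaID, and predicate choices are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()
g.bind("dwc", DWC)

occurrence = URIRef("https://example.org/occurrence/1")  # hypothetical local URI
g.add((occurrence, DWC.scientificName, Literal("Pisaster ochraceus")))
g.add((occurrence, DWC.taxonID, URIRef("urn:lsid:marinespecies.org:taxname:000000")))  # placeholder AphiaID
g.add((occurrence, DWC.locality, Literal("Hopkins Marine Station")))

print(g.serialize(format="turtle"))
```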
 
- Example of what the library can provide to address a research problem using AI 
- Also unique doing this in the context of science, biodiversity. 
- Created a “methods planner” to communicate about steps in the pipeline. Processes can be treated as routine, but they involve many decision points along the way. 
- Metadata about ML pipelines and processes → how can we standardize it and document decision points in a structured way? (See the sketch after this list.) 
- Looked at data nutrition labels and datasheets for datasets → a robust readme file 
- But definitions of “data” can be ambiguous → at what point do we call it “data”? → How much processing has to be done before it’s “data”? 
- The line between what’s data & what’s metadata → e.g., are identifiers data or metadata? → Yes, but the border is contextual 
- But to the extent that outputs are hosted in the library, we have to put them somewhere and relate them to other objects → risk of not fully appreciating the transformation of data along the way 
- Treating texts/corpora as primary texts/objects of concern 
- Question of goals: why are we working with data? For access? For research? How do we preserve the project for the future so others can understand how “data” was defined? With AI, we can mobilize both data + metadata → yet there’s a risk of data deluge → still important to keep the distinction for future understanding and for reuse/reproducibility of the work 
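A minimal sketch of what a structured, datasheet-style record for one pipeline step could look like; the field names are illustrative, loosely modeled on the datasheets-for-datasets and data-nutrition-label ideas mentioned above, and the values simply restate details from these notes.

```python
# Sketch: a datasheet-style record for one pipeline step, capturing decision points.
# Field names are illustrative; values restate details from the meeting notes.
import json

step_record = {
    "step": "ner_species_tagging",
    "input": "TEI divs from GROBID (OCR'd student papers)",
    "output": "candidate species occurrences per div",
    "tooling": {"library": "spaCy", "components": ["entity_ruler", "ner"]},
    "knowledge_sources": ["WoRMS", "Catalog of Life", "custom habitat list"],
    "decision_points": [
        "segmentation level (div vs. paragraph)",
        "which gazetteer values seed the entity ruler",
        "which divs get sent to human verification",
    ],
    "human_review": "Taxa app, librarian verification",
}

print(json.dumps(step_record, indent=2))
```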
 
- Possibilities for how we can rethink the relation between data & metadata → example: the theses/dissertations FAST project → a better approach is unsupervised topic modeling (Andromeda Yelton’s work) rather than supervised assignment of subject terms (see the sketch below) → but that means we need different kinds of discovery tools 
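A minimal sketch of the unsupervised alternative: topic modeling over thesis/dissertation abstracts instead of hand-assigned FAST terms. LDA via scikit-learn is used here as a generic example and is not necessarily the specific method referenced; the abstracts are placeholders.

```python
# Sketch: unsupervised topic modeling over thesis abstracts (generic LDA example,
# not necessarily the specific method referenced above). Abstracts are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "Intertidal sea star populations observed at the marine station over two summers.",
    "Kelp forest community structure and grazing pressure from purple urchins.",
    "Metadata practices for digitized student research papers and theses.",
]  # placeholder documents

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top_terms)}")
```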
 
Recording of today’s call is available at https://stanford.zoom.us/rec/share/woYmC8o5Rqa63TycKWzUuKEN9_s95WMfiD7y0rnDRIINJ_BuFMYEab8kQZILyhza.LwrpUEBKa5Jg7T6O
