title: ai4lam Metadata/Discovery WG Monthly Meeting
June 08, 2021
9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
Attending
Name (institution)
Tim Thompson (Yale)
Jeremy Nelson (Stanford)
Nicole Coleman (Stanford)
Amanda Whitmire (Stanford)
Erik Radio (Colorado)
Massimo Petrozzi (Computer History Museum)
Regrets
Name
**Notetaker (alpha by first name):** Tim
Helpful Links
Project Documents and Data
Agenda Topics
Updates, announcements, intros
Stanford Taxa Project - Nicole Coleman, Amanda Whitmire, Jeremy Nelson
Sinopia Knowledge Graph using kglab (time permitting) - Jeremy Nelson
Stanford Taxa Project
Marine station library. 100+ year collection: theses, dissertations, research artifacts, including student work (1946-present)
Using measurements from the past in context of climate change
But data is hidden in texts
Digitizing student papers, etc., to extract species observations, i.e., species occurrences
Going into GBIF database (6 billion entries)
OCR on images, parsed into TEI XML through GROBID → Segmentation of text into divs
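As an illustrative sketch (not discussed in the meeting), pulling the text of each div out of GROBID’s TEI output might look like this in Python; the file name is hypothetical, and only standard TEI element names are assumed:

```python
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_divs(tei_path):
    """Yield the plain text of each <div> in a TEI body produced by GROBID."""
    tree = etree.parse(tei_path)
    for div in tree.xpath("//tei:text//tei:body//tei:div", namespaces=TEI_NS):
        # itertext() flattens nested markup (heads, paragraphs, references)
        yield " ".join(" ".join(div.itertext()).split())

# Hypothetical file name, for illustration only
for i, div_text in enumerate(extract_divs("thesis.grobid.tei.xml")):
    print(i, div_text[:80])
```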
Using NER to identify species
Having a human review the tagged text to confirm species occurrences (can’t be 100% unsupervised) → Making confirmation as fast as possible
Collaborating with colleagues in British Columbia, etc. (also with historical papers)
Broad range of potential stakeholders for tool
Oversight: Stanford AI Steering Group
Process and project design template
Lit review
Reflections on other projects
E.g., COPIOUS project
Using Biodiversity Heritage Library
Focused on Philippines
Text/data processing → Segmentation can influence results → species references not necessarily sequential
Expression of observations can be complex → narrative, not consistently structured (context varies)
First step: ingest paper into GNRD tool → frequencies provide clues as to subject of paper
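A toy sketch of that frequency heuristic, assuming taxon names have already been detected (e.g., by GNRD); the name list here is invented:

```python
from collections import Counter

# Hypothetical output of a name-detection pass over one paper
detected_names = [
    "Pisaster ochraceus", "Mytilus californianus", "Pisaster ochraceus",
    "Pisaster ochraceus", "Balanus glandula", "Mytilus californianus",
]

# The most frequent names hint at the paper's actual subject,
# as opposed to taxa mentioned only in passing.
for name, count in Counter(detected_names).most_common(3):
    print(f"{count:3d}  {name}")
```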
Catalog of Life as knowledge source → links to WoRMS → which links to Wikipedia/Wikidata → Description of habitat
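A minimal sketch of following one hop in that chain via the public WoRMS REST service (endpoints as documented at marinespecies.org/rest; error handling omitted, example name arbitrary):

```python
import requests

WORMS = "https://www.marinespecies.org/rest"

def worms_record(name):
    """Resolve a scientific name to its WoRMS Aphia record, or None."""
    r = requests.get(f"{WORMS}/AphiaIDByName/{name}", params={"marine_only": "true"})
    if r.status_code != 200:
        return None
    rec = requests.get(f"{WORMS}/AphiaRecordByAphiaID/{r.json()}")
    return rec.json() if rec.status_code == 200 else None

rec = worms_record("Pisaster ochraceus")
if rec:
    print(rec["scientificname"], rec["rank"], rec["status"])
```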
GBIF also a data source
Q: At what point does human intervention occur in the process?
Created Taxa app to allow librarian to verify in a UI
Rethinking pipeline to try to make prefiltering smarter in order to limit what’s included for verification via app
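One way such prefiltering could work, as a hedged sketch (the rule and all field names are invented): only queue divs whose candidate entities co-occur in a way that suggests a real observation.

```python
def worth_verifying(ents, min_taxa=1):
    """Keep a div only if it has a candidate taxon plus a place/habitat cue."""
    labels = [e["label"] for e in ents]
    return (labels.count("TAXON") >= min_taxa
            and ("LOCATION" in labels or "HABITAT" in labels))

divs = [  # hypothetical NER output per div
    {"id": 1, "ents": [{"label": "TAXON"}, {"label": "HABITAT"}]},
    {"id": 2, "ents": [{"label": "TAXON"}]},
    {"id": 3, "ents": []},
]
print([d["id"] for d in divs if worth_verifying(d["ents"])])  # -> [1]
```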
Interested in process more generally to make it applicable as a model → More emphasis on data sources, preparing the groundwork
What’s the implication of using contemporary ML models on historical sources?
Q: How was model trained? Was expert knowledge leveraged?
Custom spaCy lookup defined using values from WoRMS
Also custom geographic data from West Coast + list of habitats
Body of TEI organized into divs → each div run through NER pipeline
Verifier presents divs for review
Pattern-based initial approach had some limitations
Looking into entity rules, other built-in parts of spaCy to improve the model (see the sketch below)
Currently focusing on location entities for improvement → minimizing false positives
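A minimal sketch of the entity-rules idea in spaCy 3; the taxon, habitat, and location terms are invented stand-ins for values that would come from WoRMS and the project’s curated lists:

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# In the real pipeline these patterns would be generated from WoRMS names
# and the habitat/geography lists, not hard-coded.
ruler.add_patterns([
    {"label": "TAXON", "pattern": [{"LOWER": "pisaster"}, {"LOWER": "ochraceus"}]},
    {"label": "HABITAT", "pattern": [{"LOWER": "intertidal"}, {"LOWER": "zone"}]},
    {"label": "LOCATION", "pattern": [{"LOWER": "monterey"}, {"LOWER": "bay"}]},
])

doc = nlp("Pisaster ochraceus was abundant in the intertidal zone near Monterey Bay.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```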
All initial setup was done in one week!
Data problems could be spotted because subject expert + metadata expert were part of the team → Building more complete knowledge graph
More coordination costs, but serves the goals of the project
Building a tool for experts to be able to verify
Example of what the library can provide to address a research problem using AI
Also unique doing this in the context of science, biodiversity.
Created “methods planner” to communicate about steps in pipeline. Processes can be treated as routine, but involve many decision points along the way.
Metadata about ML pipelines and processes → How can we standardize it, document decision points in a structured way?
Looked at data nutrition labels, datasheets for datasets → robust readme file
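A hedged sketch of what documenting decision points in a structured way might look like; every field name here is invented for illustration, not a standard:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PipelineStep:
    """One documented step in an ML pipeline, with its decision points."""
    name: str
    tool: str
    decisions: List[str] = field(default_factory=list)

steps = [
    PipelineStep("segmentation", "GROBID", ["how div boundaries are treated"]),
    PipelineStep("ner", "spaCy", ["lookup source (WoRMS)", "entity labels kept"]),
    PipelineStep("verification", "Taxa app", ["what prefiltering excludes"]),
]
for s in steps:
    print(f"{s.name} ({s.tool}): " + "; ".join(s.decisions))
```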
But definitions of “data” can be ambiguous → at what point do we call it “data”? → How much processing has to be done before it’s “data”?
Line between what’s data & what’s metadata → e.g., are identifiers data or metadata? → Yes, but the border is contextual
But to the extent that outputs are hosted in the library, we have to put them somewhere, relate them to other objects → Risk of not fully appreciating the transformation of data along the way
Treating text/corpora as primary texts/objects of concern
Question of goals: why are we working with data? For access? For research? How to preserve the project for the future in order to understand how “data” was defined. With AI, we can mobilize both data + metadata → Yet risk of data deluge
Still important to keep the distinction for future understanding, reuse/reproducibility of work
Possibilities for how we can rethink the relation between data & metadata → Theses/dissertations FAST project example → Better approach is unsupervised topic modeling (Andromeda Yelton’s work), rather than supervised assignment of subject terms → But means we need different kinds of discovery tools
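For the topic-modeling contrast, a toy sketch using scikit-learn’s LDA (illustrative only, not Andromeda Yelton’s actual method; the mini-corpus is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # stand-in thesis abstracts
    "sea star predation in the rocky intertidal zone",
    "mussel bed community structure and wave exposure",
    "library metadata workflows for digitized theses",
    "machine learning pipelines for species occurrence data",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}:", ", ".join(top))
```

Unsupervised topics like these would then need discovery interfaces built around clusters rather than assigned subject headings.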
Recording of today’s call is available at https://stanford.zoom.us/rec/share/woYmC8o5Rqa63TycKWzUuKEN9_s95WMfiD7y0rnDRIINJ_BuFMYEab8kQZILyhza.LwrpUEBKa5Jg7T6O