ai4lam Metadata/Discovery WG Monthly Meeting
June 08, 2021
9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
Attending
- Name (institution) 
- Tim Thompson (Yale) 
- Jeremy Nelson (Stanford) 
- Nicole Coleman (Stanford) 
- Amanda Whitmire (Stanford) 
- Erik Radio (Colorado) 
- Massimo Petrozzi (Computer History Museum) 
Regrets
- Name 
**Notetaker (alpha by first name):** Tim
Helpful Links
Project Documents and Data
Agenda Topics
- Updates, announcements, intros 
- Stanford Taxa Project - Nicole Coleman, Amanda Whitmire, Jeremy Nelson 
- Sinopia Knowledge Graph using kglab (time permitting) - Jeremy Nelson 
Stanford Taxa Project
- Marine station library: a 100+ year collection of theses, dissertations, and research artifacts, including student work (1946-present) 
- Using measurements from the past in the context of climate change → but the data is hidden in the texts 
- Digitizing student papers, etc., to extract species observations, i.e., species occurrences → going into the GBIF database (6 billion entries) 
 
- OCR on the images; text parsed into TEI XML with GROBID → segmentation of the text into divs 
- Using NER to identify species (see the sketch below) 
- Human review of the tagged text to confirm species occurrences (can’t be 100% unsupervised) → making confirmation as fast as possible 
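A minimal sketch of the TEI-to-NER step described above, assuming GROBID has already produced TEI XML; the file name and the stock `en_core_web_sm` model are illustrative stand-ins for the project's own inputs and custom pipeline.

```python
# Sketch: extract the text of each GROBID TEI <div> and run spaCy NER over it.
# File name and model are illustrative; the project uses its own custom pipeline.
from lxml import etree
import spacy

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_divs(path):
    """Yield the plain text of each <div> in the TEI body."""
    tree = etree.parse(path)
    for div in tree.findall(".//tei:text//tei:div", namespaces=TEI_NS):
        text = " ".join(div.itertext()).strip()
        if text:
            yield text

nlp = spacy.load("en_core_web_sm")  # stand-in for the project's custom model

for i, div_text in enumerate(tei_divs("student_paper.tei.xml")):
    doc = nlp(div_text)
    for ent in doc.ents:
        print(i, ent.text, ent.label_)
```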
 
- Collaborating with colleagues in British Columbia, etc. (also with historical papers) 
- Broad range of potential stakeholders for tool 
 
- Oversight: Stanford AI Steering Group → process and project design template, lit review 
- Reflections on other projects, e.g., the COPIOUS project → used the Biodiversity Heritage Library, focused on the Philippines 
 
- Text/data processing → segmentation can influence results → species references are not necessarily sequential 
- Expression of observations can be complex → narrative, not consistently structured (context varies) 
- First step: ingest the paper into the GNRD tool → frequencies provide clues as to the subject of the paper 
- Catalog of Life as a knowledge source → links to WoRMS → which links to Wikipedia/Wikidata → description of habitat (see the sketch below) 
- GBIF also a data source 
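A minimal sketch of the linking idea, showing just the WoRMS piece: resolve a detected name against the WoRMS REST service to confirm it is a known taxon and pick up identifiers for onward linking. The endpoint and field names follow the public WoRMS REST API, but treat the exact parameters as an assumption to verify; the species name is only an example.

```python
# Sketch: resolve a detected scientific name against WoRMS.
# Endpoint/field names follow the public WoRMS REST API; verify before relying on them.
import requests

WORMS_URL = "https://www.marinespecies.org/rest/AphiaRecordsByName/{name}"

def lookup_worms(name):
    resp = requests.get(
        WORMS_URL.format(name=name),
        params={"like": "false", "marine_only": "true"},
        timeout=30,
    )
    if resp.status_code != 200:  # WoRMS returns no body when there is no match
        return []
    return resp.json() or []

for record in lookup_worms("Pisaster ochraceus"):  # example name only
    print(record.get("AphiaID"), record.get("scientificname"), record.get("status"))
```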
 
- Q: At what point does human intervention occur in the process? → Created the Taxa app to allow a librarian to verify in a UI 
- Rethinking the pipeline to make prefiltering smarter, in order to limit what’s included for verification via the app (see the sketch below) 
- Interested in the process more generally, to make it applicable as a model → more emphasis on data sources, preparing the groundwork 
- What are the implications of using contemporary ML models on historical sources? 
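A minimal sketch of the prefiltering idea: only divs where the NER pass finds a candidate taxon that also resolves against a lookup get queued for verification in the Taxa app. The `TAXON` label and `is_known_taxon` helper are hypothetical stand-ins, not the project's actual names.

```python
# Sketch: prefilter divs before queueing them for human verification.
# "TAXON" and is_known_taxon() are hypothetical stand-ins for the project's own names.
def prefilter(div_texts, nlp, is_known_taxon):
    """Return (div_text, candidate_taxa) pairs worth sending to the verifier."""
    kept = []
    for div_text in div_texts:
        doc = nlp(div_text)
        candidates = [ent.text for ent in doc.ents if ent.label_ == "TAXON"]
        if any(is_known_taxon(name) for name in candidates):
            kept.append((div_text, candidates))
    return kept
```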
 
- Q: How was the model trained? Was expert knowledge leveraged? → A custom spaCy lookup was defined using values from WoRMS 
- Also custom geographic data from the West Coast + a list of habitats 
- Body of the TEI is organized into divs → each div is run through the NER pipeline → the verifier represents divs 
- The initial pattern-based approach had some limitations 
- Looking into entity rulers and other built-in parts of spaCy to improve the model (see the sketch below) 
- Currently focusing on location entities for improvement → minimizing false positives 
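A minimal sketch of the entity-ruler approach mentioned above, assuming names pulled from WoRMS plus a custom habitat list; the labels, example terms, and the `en_core_web_sm` base model are illustrative.

```python
# Sketch: seed a spaCy EntityRuler with WoRMS-derived names and a habitat gazetteer,
# placed ahead of the statistical NER component. Labels and terms are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")

worms_names = ["Pisaster ochraceus", "Strongylocentrotus purpuratus"]  # e.g., from WoRMS
habitats = ["intertidal zone", "kelp forest"]                          # custom habitat list

patterns = [{"label": "TAXON", "pattern": name} for name in worms_names]
patterns += [{"label": "HABITAT", "pattern": habitat} for habitat in habitats]
ruler.add_patterns(patterns)

doc = nlp("Pisaster ochraceus was abundant in the intertidal zone near the station.")
print([(ent.text, ent.label_) for ent in doc.ents])
```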
 
- All initial setup was done in one week! 
- Data problems could be spotted because a subject expert + a metadata expert were part of the team → building a more complete knowledge graph (see the sketch below) → more coordination costs, but it serves the goals of the project 
- Building a tool for experts to be able to verify 
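A minimal sketch of what one verified occurrence might look like in a knowledge graph, using rdflib and Darwin Core terms; the local URI, the placeholder AphiaID in the LSID, and the choice of predicates are assumptions for illustration, not the project's actual modeling.

```python
# Sketch: one verified occurrence as RDF, linked to a WoRMS identifier.
# URIs, the placeholder AphiaID, and predicate choices are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

g = Graph()
g.bind("dwc", DWC)

occurrence = URIRef("https://example.org/occurrence/1")  # hypothetical local URI
g.add((occurrence, DWC.scientificName, Literal("Pisaster ochraceus")))
g.add((occurrence, DWC.taxonID, URIRef("urn:lsid:marinespecies.org:taxname:000000")))  # placeholder AphiaID
g.add((occurrence, DWC.locality, Literal("Hopkins Marine Station")))

print(g.serialize(format="turtle"))
```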
 
- Example of what the library can provide to address a research problem using AI 
- Also unique doing this in the context of science, biodiversity. 
- Created a “methods planner” to communicate about steps in the pipeline. Processes can be treated as routine, but they involve many decision points along the way. 
- Metadata about ML pipelines and processes → how can we standardize it and document decision points in a structured way? (See the sketch after this list.) 
- Looked at data nutrition labels and datasheets for datasets → a robust readme file 
- But definitions of “data” can be ambiguous → at what point do we call it “data”? → How much processing has to be done before it’s “data”? 
- The line between what’s data & what’s metadata → e.g., are identifiers data or metadata? → Yes, but the border is contextual 
- But to the extent that outputs are hosted in the library, we have to put them somewhere and relate them to other objects → risk of not fully appreciating the transformation of data along the way 
- Treating texts/corpora as primary texts/objects of concern 
- Question of goals: why are we working with data? For access? For research? How do we preserve the project for the future so others can understand how “data” was defined? With AI, we can mobilize both data + metadata → yet there’s a risk of data deluge → still important to keep the distinction for future understanding and for reuse/reproducibility of the work 
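A minimal sketch of what a structured, datasheet-style record for one pipeline step could look like; the field names are illustrative, loosely modeled on the datasheets-for-datasets and data-nutrition-label ideas mentioned above, and the values simply restate details from these notes.

```python
# Sketch: a datasheet-style record for one pipeline step, capturing decision points.
# Field names are illustrative; values restate details from the meeting notes.
import json

step_record = {
    "step": "ner_species_tagging",
    "input": "TEI divs from GROBID (OCR'd student papers)",
    "output": "candidate species occurrences per div",
    "tooling": {"library": "spaCy", "components": ["entity_ruler", "ner"]},
    "knowledge_sources": ["WoRMS", "Catalog of Life", "custom habitat list"],
    "decision_points": [
        "segmentation level (div vs. paragraph)",
        "which gazetteer values seed the entity ruler",
        "which divs get sent to human verification",
    ],
    "human_review": "Taxa app, librarian verification",
}

print(json.dumps(step_record, indent=2))
```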
 
- Possibilities for how we can rethink the relation between data & metadata → example: the theses/dissertations FAST project → a better approach is unsupervised topic modeling (Andromeda Yelton’s work) rather than supervised assignment of subject terms (see the sketch below) → but that means we need different kinds of discovery tools 
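A minimal sketch of the unsupervised alternative: topic modeling over thesis/dissertation abstracts instead of hand-assigned FAST terms. LDA via scikit-learn is used here as a generic example and is not necessarily the specific method referenced; the abstracts are placeholders.

```python
# Sketch: unsupervised topic modeling over thesis abstracts (generic LDA example,
# not necessarily the specific method referenced above). Abstracts are placeholders.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "Intertidal sea star populations observed at the marine station over two summers.",
    "Kelp forest community structure and grazing pressure from purple urchins.",
    "Metadata practices for digitized student research papers and theses.",
]  # placeholder documents

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"topic {i}: {', '.join(top_terms)}")
```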
 
Recording of today’s call is available at https://stanford.zoom.us/rec/share/woYmC8o5Rqa63TycKWzUuKEN9_s95WMfiD7y0rnDRIINJ_BuFMYEab8kQZILyhza.LwrpUEBKa5Jg7T6O
