title: ai4lam Metadata/Discovery WG Monthly Meeting

# June 08, 2021

9 AM California \| 12 PM Washington DC \| 5 PM UK \| 6 PM Oslo & Paris


**Attending**

-   *Name (institution)*
-   Tim Thompson (Yale)
-   Jeremy Nelson (Stanford)
-   Nicole Coleman (Stanford)
-   Amanda Whitmire (Standord)
-   Erik Radio (Colorado)
-   Massimo Petrozzi (Computer History Museum)
-   

**Regrets**

-   *Name*

**Notetaker (alpha by first name): **Tim

[]{#anchor}Helpful Links

-   [*Metadata WG Zotero Group
    Library*](https://www.zotero.org/groups/2709151/ai4lam_metadata_wg/library)

[]{#anchor-1}Project Documents and Data

-   [*WG
    charter*](https://drive.google.com/file/d/1ypcx2F30siqr-KYOKFZtVv8h9PIS9a77/view?usp=sharing)

```{=html}
<!-- -->
```
-   [*WG Google Drive
    folder*](https://drive.google.com/drive/folders/1cpZtbjKadgD30794fD97XY-EChUSy2r9?usp=sharing)

[]{#anchor-2}Agenda Topics

1.  Updates, announcements, intros
2.  Stanford Taxa Project - Nicole Coleman, Amanda Whitmire, Jeremy
    Nelson
3.  Sinopia Knowledge Graph using kglab (time permitting) Jeremy Nelson

Stanford Taxa Project

-   Marine station library. 100+ year collection: theses, dissertations,
    research artifacts, including student work (1946-present)

    -   Using measurements from the past in context of climate change

        -   But data is hidden in texts

        -   Digitizing student papers, etc., to extract species
            observations\--i.e., species occurrence

            -   Going into GBIF database (6 billion entries)

        -   OCR on images, parsed TEI XML through GROBID → Segmentation
            of text into divs

        -   Using NER to identify species

        -   Having human review tagged text to confirm species
            occurrence (can\'t be 100% unsupervised) → Making
            confirmation as fast as possible

    -   Collaborating with colleagues in British Columbia, etc. (also
        > with historical papers)

    -   Broad range of potential stakeholders for tool

-   Oversight: Stanford AI Steering Group

    -   Process and project design template

        -   Lit review

        -   Reflections on other projects

            -   E.g., COPIOUS project

                -   Using Biodiversity Heritage Library
                -   Focused on Philippines

-   Text/data processing → Segmentation can influence results → species
    references not necessarily sequential

    -   Expression of observations can be complex → narrative, not
        > consistently structured (context varies)

    -   First step: ingest paper into GNRD tool → frequencies provide
        > clues as to subject of paper

    -   Catalog of Life as knowledge source → link to WoRMS → which
        > links to Wikipedia/Wikidata → Description of habitat

    -   GBIF also a data source

-   Q: At what point does human intervention occur in the process?

    -   Created Taxa app to allow librarian to verify in a UI

    -   Rethinking pipeline to try to make prefiltering smarter in order
        > to limit what\'s included for verification via app

    -   Interested in process more generally to make it applicable as a
        > model → More emphasis on data sources, preparing the
        > groundwork

    -   What\'s the implication of using contemporary ML models on
        > historical sources?

-   Q: How was model trained? Was expert knowledge leveraged?

    -   Custom spaCy lookup defined using values from WoRMS

    -   Also custom geographic data from West Coast + list of habitats

    -   Body of TEI organized into divs → each div run through NER
        > pipeline

        -   Verifier represents divs
        -   Pattern-based initial approach had some limitations
        -   Looking into entity rules, other built-in parts of spaCy to
            improve the model
        -   Currently focusing on locations entities for improvement →
            minimizing false positives

    -   All initial setup was done in one week!

    -   Data problems could be spotted because subject expert + metadata
        > expert were part of the team. → Building more complete
        > knowledge graph.

        -   More coordination costs, but serves the goals of the project
        -   Building a tool for experts to be able to verify

-   Example of what the library can provide to address a research
    problem using AI

-   Also unique doing this in the context of science, biodiversity.

-   Created \"methods planner\" to communicate about steps in pipeline.
    Processes can be treated as routine, but involve many decision
    points along the way.

    -   Metadata about ML pipelines and processes → How can we
        > standardize it, document decision points in a structured way.

    -   Looked at data nutrition labels, datasheets for datasets →
        > robust readme file

    -   But definitions of \"data\" can be ambiguous → at what point do
        > we call it \"data\" → How much processing has to be done
        > before it\'s \"data\"

    -   Line between what\'s data & what\'s metadata → e.g. are
        > identifiers data or metadata → Yes, but the border is
        > contextual

    -   But to the extent that outputs are hosted in library, we have to
        > put them somewhere, relate them to other objects → Risk of not
        > fully appreciating the transformation of data along the way.

    -   Treating text/corpora into primary texts/objects of concern

    -   Question of goals: why are we working with data? For access? For
        > research? How to preserve project for future in order to
        > understand how \"data\" was defined. With AI, we can mobilize
        > both data + metadata → Yet risk of data deluge

        -   Still important to keep the distinction for future
            understanding, reuse/reproducibility of work

    -   Possibilities for how we can rethink relation between data &
        > metadata → Theses/dissertations FAST project example → Better
        > approach is unsupervised topic modeling (Andromeda Yelton\'s
        > work), rather than supervised assignment of subject terms →
        > But means we need different kinds of discovery tools.

Recording of today's call is available at
https://stanford.zoom.us/rec/share/woYmC8o5Rqa63TycKWzUuKEN9_s95WMfiD7y0rnDRIINJ_BuFMYEab8kQZILyhza.LwrpUEBKa5Jg7T6O