title: ai4lam Metadata/Discovery WG Monthly Meeting

April 13, 2021

9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris


  • *Name (institution)*

  • Tim Thompson (Yale)

  • Jeremy Nelson (Stanford)

  • Eric Lease Morgan (University of Notre Dame)

  • David Lowe (Texas A&M)

  • Corey Harper (Elsevier Labs / Univ. of Amsterdam)

  • Olga Barysheva


**Notetaker (alpha by first name): **

[]{#anchor-1}Helpful Links

[]{#anchor-2}Project Documents and Data


[]{#anchor-3}Agenda Topics

  1. Updates, announcements, intros

  2. Presentation by Kalyan Dutia (Science Museum Group) on the Heritage Connector project

Background reading:

Kalyan Dutia, John Stack. Heritage Connector: A Machine Learning Framework for Building Linked Open Data from Museum Collections. Authorea. January 06, 2021. https://www.authorea.com/users/387788/articles/502720-heritage-connector-a-machine-learning-framework-for-building-linked-open-data-from-museum-collections

  3. Presentation by Eric Lease Morgan (University of Notre Dame) on the Distant Reader application

  4. Discussion


  • Kalyan Dutia: NER, Entity Linking and Record Linkage in Heritage Connector. Part of five museums in the UK. One of five people on the project for an 18-month grant.

  • How can existing digital tools and methods be used to build relationships at scale between poorly and inconsistently catalogued digitised collection objects and other content sources?

  • How can such an approach be made to scale?

  • Some constraints: 59% of heritage organisations don’t have the time or resources for mass data-labelling

  • Heritage collections often have little metadata

    • Often inconsistently applied and does not follow a controlled vocabulary

    • Collections will often mention people, orgs or objects whose Wikidata record is also lacking in metadata

  • Therefore we aim to design/use methods which require little labelled data, finding other ways to inject expert knowledge

  • Tabular Collection Data -> Ingestion Process (converts table to RDF, maps categorical fields to controlled LOD vocabularies, pattern matching to find existing LOD URLs) -> Heritage Connector knowledge graph -> Record Linkage to Wikidata (identify links from heritage records to Wikidata) -> Information Retrieval.
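The ingestion step described above (table rows to RDF, with a categorical field mapped to a controlled vocabulary) could be sketched roughly as follows. This is only an illustration and not the Heritage Connector code: the `example.org` namespaces, the `HAS_CATEGORY` predicate, the `CATEGORY_URIS` mapping and the sample rows are all invented for the example. Only the `rdfs:label` URI is a real vocabulary term.

```python
import csv
import io

BASE = "https://example.org/object/"                 # hypothetical record namespace
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
HAS_CATEGORY = "https://example.org/prop/category"   # hypothetical predicate

# Hypothetical mapping from free-text category values to controlled-vocabulary URIs.
CATEGORY_URIS = {
    "locomotive": "https://example.org/vocab/locomotive",
    "telescope": "https://example.org/vocab/telescope",
}


def row_to_triples(row):
    """Turn one table row (a dict) into a list of N-Triples lines."""
    subject = f"<{BASE}{row['id']}>"
    # Escape backslashes and double quotes so the literal stays valid.
    title = row["title"].replace("\\", "\\\\").replace('"', '\\"')
    triples = [f'{subject} <{RDFS_LABEL}> "{title}" .']
    category_uri = CATEGORY_URIS.get(row["category"].strip().lower())
    if category_uri:  # unmapped categories are skipped rather than guessed
        triples.append(f"{subject} <{HAS_CATEGORY}> <{category_uri}> .")
    return triples


# Invented sample table, for illustration only.
TABLE = """id,title,category
1,Model steam locomotive,Locomotive
2,Unknown instrument,Misc
"""

triples = [
    triple
    for row in csv.DictReader(io.StringIO(TABLE))
    for triple in row_to_triples(row)
]
for triple in triples:
    print(triple)
```

Rows whose category has no mapping still get a label triple, so no record is lost; the unmapped value is simply not asserted as a vocabulary link.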

  • spaCy pipeline - Tokeniser, POS Tagger, Parser, NER -> text to annotated text

  • Custom component - Dictionary matcher - wrapper around spaCy PhraseMatcher, and Rule-based Matcher, Custom patterns (e.g. dates, collections) written for Heritage collections. Some success in overcoming limitations in spaCy

  • DictionaryMatcher, label, pattern, and id. DATE_PATTERNS - use regex. COLLECTION_NAME_PATTERNS. Applied across the collections without people needing to add labels.
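As a rough illustration of rule-based date matching with a label, a pattern and an id per rule, here is a minimal sketch using only Python's `re` module. It does not use spaCy and is not the project's `DictionaryMatcher`; the three regexes, the ids and the `find_entities` helper are assumptions made up for this example.

```python
import re

# Hypothetical patterns; each entry has a label, an id and a regex,
# mirroring the label/pattern/id structure mentioned in the notes.
DATE_PATTERNS = [
    {"label": "DATE", "id": "year_range", "pattern": r"\b\d{4}\s*-\s*\d{4}\b"},
    {"label": "DATE", "id": "circa_year", "pattern": r"\b(?:c\.|circa)\s*\d{4}\b"},
    {"label": "DATE", "id": "year", "pattern": r"\b\d{4}\b"},
]


def find_entities(text, patterns):
    """Return non-overlapping (start, end, label, id, text) tuples.

    Patterns are tried in list order, so earlier (more specific) patterns
    win over later ones when their matches overlap.
    """
    found = []
    for p in patterns:
        for m in re.finditer(p["pattern"], text, flags=re.IGNORECASE):
            if all(m.end() <= s or m.start() >= e for s, e, *_ in found):
                found.append((m.start(), m.end(), p["label"], p["id"], m.group()))
    return sorted(found)


text = "Made c. 1850 and exhibited 1851-1862."
for entity in find_entities(text, DATE_PATTERNS):
    print(entity)
```

Because the rules are written once and applied to every record, no per-record labelling is needed, which is the point made in the notes.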

  • A Framework for Low-resource Entity Linking and Record Linkage

    • Search (Candidate Generation)

      • Source DB and Target DB - Elasticsearch indices

      • Get back a lot of candidates

    • Feature Creation

      • $f = [\mathrm{sim}_1(a_1, b_1), \mathrm{sim}_2(a_2, b_2), \ldots, \mathrm{sim}_n(a_n, b_n)]$

      • Where:

        • $a_i$ is the value of a feature from the source record

        • $b_i$ is the value of the corresponding feature from the target record

        • The classifier doesn’t need to have seen all the test samples (common requirement of EL methods)

        • Feature vectors are likely low-dimensional -> low data requirements

        • New similarity functions can be added to accommodate domain knowledge

    • Classification (Candidate Ranking)

      • If multiple correct links per record -> binary classifier

      • If one correct link per record -> learning to rank model, e.g. RankSVM

      • In our tests so far on Science Museum data, 100-1000 labels are sufficient to train the classifier
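The feature-creation step in the framework above can be sketched as one similarity function per field, applied to a source record and each candidate. This is a minimal illustration under assumptions: the field names, the `string_sim` and `year_sim` functions, the `tolerance` parameter and the sample records are all invented here, and the classifier that would consume the vectors is not shown.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Similarity in [0, 1] between two strings (case-insensitive)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_sim(a, b, tolerance=10):
    """1.0 for equal years, falling linearly to 0.0 at `tolerance` years apart."""
    if a is None or b is None:
        return 0.0
    return max(0.0, 1.0 - abs(a - b) / tolerance)


# One (field, similarity function) pair per feature; hypothetical field names.
FEATURES = [
    ("name", string_sim),
    ("birth_year", year_sim),
    ("occupation", string_sim),
]


def feature_vector(source, target):
    """Build [sim_1(a_1, b_1), ..., sim_n(a_n, b_n)] for one source/candidate pair."""
    return [sim(source.get(field), target.get(field)) for field, sim in FEATURES]


# Invented records, for illustration only.
source = {"name": "Ada Lovelace", "birth_year": 1815, "occupation": "mathematician"}
candidates = [
    {"name": "Ada Lovelace", "birth_year": 1815, "occupation": "mathematician"},
    {"name": "Ada Smith", "birth_year": 1902, "occupation": "painter"},
]

vectors = [feature_vector(source, candidate) for candidate in candidates]
for vector in vectors:
    print([round(value, 2) for value in vector])
```

Adding domain knowledge then amounts to appending another `(field, function)` pair to `FEATURES`, which keeps the vector low-dimensional.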

  • Entity Linking to the Heritage Connector

    • Search -

      • NER-Annotated Source Text

      • Candidates for Entity Link, label and description

    • Feature Creation

      • Mention exactly matches label

      • Label is in mention

      • Levenshtein distance between mention and label

      • One-hot-encoded mention type

      • One-hot-encoded candidate type

      • FuzzyWuzzy token-sort distance between mention and label

      • FuzzyWuzzy token-sort distance between mention and label, with common organisation suffixes removed

    • Classification (Candidate Ranking)

      • Binary classifier as there is more than one possible link per mention

      • 10-field
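The mention/label features listed above could be computed along these lines. This is a sketch under assumptions, not the Heritage Connector implementation: it uses a plain Levenshtein edit distance over token-sorted strings as a stand-in for FuzzyWuzzy's token-sort scoring (which returns a 0-100 similarity rather than a distance), and the suffix list, entity types and example strings are invented.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical organisation suffixes and entity types.
ORG_SUFFIXES = {"ltd", "limited", "co", "company", "inc", "plc"}
TYPES = ["PERSON", "ORG", "OBJECT"]


def normalise(text, drop_suffixes=False):
    """Lowercase, strip simple punctuation, sort tokens; optionally drop suffixes."""
    tokens = [t.strip(".,&").lower() for t in text.split()]
    tokens = [t for t in tokens if t and not (drop_suffixes and t in ORG_SUFFIXES)]
    return " ".join(sorted(tokens))


def one_hot(value):
    """One-hot encode an entity type against TYPES."""
    return [1 if value == t else 0 for t in TYPES]


def link_features(mention, mention_type, label, candidate_type):
    """Feature list for one (mention, candidate) pair."""
    m, l = mention.lower(), label.lower()
    return (
        [
            int(m == l),                                        # exact match
            int(l in m),                                        # label is in mention
            levenshtein(m, l),                                  # edit distance
            levenshtein(normalise(mention), normalise(label)),  # token-sorted
            levenshtein(normalise(mention, True), normalise(label, True)),  # suffixes removed
        ]
        + one_hot(mention_type)
        + one_hot(candidate_type)
    )


# Invented example strings.
features = link_features("Acme Co. Ltd", "ORG", "Acme Company", "ORG")
print(features)
```

In this example the suffix-stripped distance drops to zero even though the raw strings differ, which shows why removing common organisation suffixes is useful as a separate feature.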

  • Entity Linking to Wikidata

  • Record Linkage to Wikidata follows the same pattern; values for the source and candidate are the properties that relate the source to value.

    • Binary classifier (decision tree)

    • Really difficult to collect enough data on the entity

  • What’s next

    • Iterating

  • Eric Morgan - Distant Reader - the Reader is a tool for reading

    • Descriptive bibliographics (authors, titles, publication dates, extents, readability scores)

    • Analytics bibliographics (keywords, summaries, and added entries to some extent)

    • Ngram features

    • Parts-of-speech

    • Named entities

    • Textual snippets based on grammars (noun-verb clauses, adjective-noun clauses, questions, etc.)

    • The Reader uses machine learning (spaCy)

    • Study carrels: collections of papers and other material. Can ask who-what-where-why-how questions of the study carrel, as well as more sophisticated ones

    • Reader uses machine learning - spaCy

      • import spacy

      • FILE = './walden.text'

      • MODEL = 'en'  # shortcut name; spaCy 3 requires a full package name such as 'en_core_web_sm'

      • text = open(FILE).read()

      • nlp = spacy.load(MODEL)

      • doc = nlp(text)

      • for ent in doc.ents: print(ent.text, ent.label_)

      • exit()

      • Also has scientific models for chemical and other labels.

    • Reader adds entities to database for users

    • Reader can be used - Topic Modeling and Classification

      • Study carrel content can be used as input for topic modeling, and the results can be “pivoted” to compare topics to metadata (authors, dates, places, keywords, etc.)

      • Study carrel content can be used to classify texts, and the results can be used to identify the salient characteristics, denote authorship, classify new/unclassified text

      • Output can be used in machine learning

    • Distant Reader and Machine Learning

      • Reader creates study carrels; there was thought of converting them to linked data to discover other relationships.

      • Hard part is finding URIs for the entities.

    • Distantreader.org, collection of 287 study carrels

      • Browse the study carrel,

      • Raw input text or PDF

      • Create tab-delimited entities from a particular text

      • POS - word, lemma, part of speech

      • Really get a good idea of what is happening in a text collection

    • CLI: rdr model-build homer; take the output of the Distant Reader and then do other types of machine learning
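N-grams are one of the features the Reader extracts from a collection. As a rough illustration of what such a feature looks like, here is a small word n-gram counter in plain Python. It is not the Reader's implementation; the tokenisation rule and the sample sentence are assumptions chosen for brevity.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Count word n-grams in `text` after lowercasing and simple tokenisation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list to form n-grams
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Invented sample sentence.
sample = "the quick brown fox and the quick red fox"
bigrams = ngrams(sample, 2)
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Counts like these can be tabulated per document and then used as input to topic modeling or classification, as described in the notes.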

  • Questions:

    • Kalyan: David posted a question about mentally modeling subject classification or call-number assignment, wondering if that would require a greater amount of data. Which areas? Call-number assignment, subject, discipline (field) the document is in. Scales linearly with the number of classes. Very hard to do: need to find good sample data and a model, and make sure the model isn’t class-imbalanced; would need to work around that.

    • Kalyan: Appreciated the use of spaCy to create models; curious about a linked-data source with ArchivesHub, very old information, but was doing something similar a few years ago. Not trying to build an aggregation; much more interested in how far you can get with just the right amount of human intervention.

    • Kalyan: How much augmentation is happening? The number of triples about tripled in size when running

    • Kalyan: Tables to RDF - entity linking on columns in a Pandas df; what about taking textual information from tables in PDFs? Everything is in a relational database; hopefully a follow-on project will look at what they can do with historical material

    • Kalyan: Noticed that “hell” is an organization. One of the differences when running NER or before, a load of false positives.

    • Kalyan: Performance in the spaCy pipeline: pattern matching or Entity Ruler? The pattern matcher is much more computationally efficient

    • Kalyan: VIAF or other library sources, using GeoNames, LOD? Haven’t found a control or FB, doesn’t have to worry much about controlled vocabularies

    • Cultural heritage institutions and machine learning: enhance bibliographic data? Other uses? Living with Machines at the British Library (https://livingwithmachines.ac.uk/), with other applications of machine learning. Read the map, better OCR?