April 13, 2021

title: ai4lam Metadata/Discovery WG Monthly Meeting

April 13, 2021#

9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris

Attending

*Name (institution) *
Tim Thompson (Yale)
Jeremy Nelson (Stanford)
Eric Lease Morgan (University of Notre Dame)
David Lowe (Texas A&M)
Corey Harper (Elsevier Labs / Univ. of Amsterdam)
Olga Barysheva

Regrets

**Notetaker (alpha by first name): **

[]{#anchor-1}Helpful Links

Metadata WG Zotero Group Library

[]{#anchor-2}Project Documents and Data

WG charter

<<<<<<< HEAD

=======

9c0c03e (Adds remaining 2021 meetings)

WG Google Drive folder

[]{#anchor-3}Agenda Topics

Updates, announcements, intros
Presentation by Kalyan Dutia (Science Museum Group) on the Heritage Connector project

Background reading:

Kalyan Dutia, John Stack. Heritage Connector: A Machine Learning Framework for Building Linked Open Data from Museum Collections. Authorea. January 06, 2021. https://www.authorea.com/users/387788/articles/502720-heritage-connector-a-machine-learning-framework-for-building-linked-open-data-from-museum-collections

Presentation by Eric Lease Morgan (University of Notre Dame) on the Distant Reader application
Discussion

[]{#anchor-4}Notes

Kalyan Dutia, NER, Entity Linking and Record Linkage in Heritage Connector. Part of five museums in the UK. One of five people on the project for a 18 month grant.
How can existing digital tools and methods be used to build relationships at scale between poorly and inconsistently catalogued digitised collection objects and other content source?
How can make an approach scaled
Some constraints, 59% of heritage organizsations don’t have the time or resources for mass data-=labelling
Heritage collections often have little metadata
- Often inconsistently applied and does not follow a controlled
  
  vocabulary
- Collections will often mention people, orgs or objects whose
  
  wikidata record is also lacking in metadata
Therefore we aim to design/use methods which require little labelled data, finding other ways to inject expert knowledge
Tabular Collection Data -> Ingestion Process, converts table to RDF, map categorical field to controlled LOD vocabularies, pattern matching to find existing LOD URLS -> Heritage Connector knowledge graph -> Record Linkage to WIKIDATA, identify links from heritage records to WIKIData -> Information Retrieval.
spaCY pipeline - Tokeniser, POS Tagger, Parser, NER -> text to annotated text
Custom component - Dictionary matcher -warpper around spacCy PhraseMatcher, and Rule-based Matcher, Custom patterns (e.g. dates, collections) written for Heritage collections. Some success in overcoming limitations in spaCy
DictionaryMatcher, label, pattern, and id. DATE_PATTERNS - use regex. COLLECTION_NAME_PATTERNS. Applied across the collections without people needing to add labels.
A Framework for Low-resource Entity Linking and Record Lionkage
- Search (Candidate Generation)
  - Source DB and Target DB - Elasticsearch indices
  - Get back a lot of candidate
- Feature Creation
  - fi[sim1(a1, 1), sim2(a2,b2)…, simn(an, bn)
  - Where:
    - A is the value of a feature from the source record
    - Bi is the value of the corresponding feature from the
      
      target record
    - The classifier doesn’t need to have seen all the test
      
      samples (common requirement of EL methods)
    - Feature vectors are likely low-dimensional -> low data
      
      requirements
    - New similarity functions can be added to accommodate
      
      domain knowledge
- Classification (Candidate Ranking)
  - If multiple correct links per record -> binary classifier
  - If one correct link per record -> learning to rank model e.e RankSVM
  - In our tests so far on Science Museum data, 100-1000 labels is sufficient to train the classifier
Enitty Linking to the Heritage Connector
- Search -
  - NER-Annotated Source Text
  - Candidates for Entity Link, label and description
- Feature Creation
  - Mention exactly matches label
  - Label is in mention
  - Levenshtein distance between mention and label
  - One-hot-encoded mention type
  - One-hot-encoded candidate type
  - Fuzzwuzzy token-sort distance between mention and label
  - Fuzzwuzzy token-sort distance bgetween mention and label, with common organisation suffixes removed
- Classification (Candidate Ranking)
  - Binary classifier as there is more than one possible link mention
  - 10-field
Entity Linking Wikidata,
Record Linkage to Wikidata, follows same pattern, values for the source and candidate are the properties that relate the source to value.
- Binary classifier (decision tree)
- Really difficult to collect enough data on the entity
What’s next
- Iterating
Eric Morgan - Distant Reader - the Reader is tool for Reading
- Descriptive bibliographics (authors, titles, publication dates,
  
  extents, readability scores)
- Analytics bibliographics (keywords, summaries, and added entries
  
  to some extend)
- Ngram features
- Parts-of-speech
- Named entities
- Textual snippets based on grammars (noun-verb clauses,
  
  adjective-non-claused, questions, etc.)
- The Reader uses machine learning spaCy
- Study carrells collections of the papers and other material. Can
  
  ask Who-What-Where-Why-How questions to the study also more sophisticated
- Reader uses machine learning - spaCy
  - Import spacy
  - FILE = ‘./walden.text’
  - MODEL = ‘en’
  - Text = open (FILE).read()
  - Nlp = spacy.laod(MODEL)
  - Doc = nlp ( text )
  - For ent in doc.ents: print(ent.text, ent.label_)
  - exit()
  - Also has scientific models for chemical and other labels.
- Reader adds entities to database for users
- Reader can be used - Topic Modeling and Classification
  - Study carrel content can be used as input for topic modeling, and the results can be “pivoted” to compare topics to metadata (authors, dates, places, keywords, etc.)
  - Study carrel content can be used to classify texts, and the results can be used to identify the salient characteristics, denote authorship, classify new/unclassified text
  - Output can be used in machine learning
- Distant Reader and Machine Learning
  - Reader creates study carrels, thought to maybe convert to linked-data, discover other relationships.
  - Hard part if finding URIs for the entities.
- Distantreader.org, collection of 287 study carrels
  - Browse the study carrel,
  - Raw input text or PDF
  - Create tab-delimited entities from a particular text
  - POS - word, lemma , parts of the speech
  - Really get a good idea of what is happening in a text collection
- CLI rdr model-build homer, take the output of the distant reader
  
  and then do other types of machine-learning
Questions:
- Kalyan: David posted a question, modeling mentally
  
  subject-classification, or call-number assignment. Wondering if that require a greater amount of data. What areas? Call-number assignment, subject, discipline (field) the document is in. Scaling linear with the number of classes. Very hard to do, find good sample data and model, and model isn’t class imbalanced, would need to work around that.
- Kalyan: Appreciated that using spaCy to create models, curious
  
  linked-data source with ArchivesHub, very old information but doing something similar a few years ago. Not trying to build an aggregated, much more interested in how far you can get with just the right amount of human interventions.
- Kalyan: How much augmentation is happening? Number of triples
  
  run about tripled in size when running
- Kalyan: Tables to RDF - entity linking on columns in Pandas df,
  
  taking from textual information in tables in PDF? Everything is in relational database, hopefully a follow-on project, what they can do a historical material
- Kaylan Noticed that “hell” is an organization. One of the
  
  differences when running NER or before, a load of false-positives.
- Kaylan: Performance in spaCy pipeline, Kaylan, Pattern matching
  
  or Entity Ruler, pattern matcher much more computationally efficient
- Kaylan: VIAF or other library sources using Geonames, LOD,
  
  haven’t found a control or FB, doesn’t have to worry much about controlled vocabularies
- Cultural heritage institutions and machine learning, enhance
  
  bibliographic? Other uses? Living with machines at British Libraries https://livingwithmachines.ac.uk/ with other applications to machine learning. Read the map, better OCR?
  - David Lowe - looking at Patents
  - Cory Harper - https://www.cultural-ai.nl/missionandvision, looking at topic drift