title: ai4lam Metadata/Discovery WG Monthly Meeting
April 13, 2021
9 AM California | 12 PM Washington DC | 5 PM UK | 6 PM Oslo & Paris
Attending
*Name (institution) *
Tim Thompson (Yale)
Jeremy Nelson (Stanford)
Eric Lease Morgan (University of Notre Dame)
David Lowe (Texas A&M)
Corey Harper (Elsevier Labs / Univ. of Amsterdam)
Olga Barysheva
Regrets
**Notetaker (alpha by first name): **
[]{#anchor-1}Helpful Links
[]{#anchor-2}Project Documents and Data
[]{#anchor-3}Agenda Topics
Updates, announcements, intros
Presentation by Kalyan Dutia (Science Museum Group) on the Heritage Connector project
Background reading:
Kalyan Dutia, John Stack. Heritage Connector: A Machine Learning Framework for Building Linked Open Data from Museum Collections. Authorea. January 06, 2021. https://www.authorea.com/users/387788/articles/502720-heritage-connector-a-machine-learning-framework-for-building-linked-open-data-from-museum-collections
Presentation by Eric Lease Morgan (University of Notre Dame) on the Distant Reader application
Discussion
[]{#anchor-4}Notes
Kalyan Dutia: NER, Entity Linking and Record Linkage in Heritage Connector. Part of five museums in the UK. One of five people on the project for an 18-month grant.
How can existing digital tools and methods be used to build relationships at scale between poorly and inconsistently catalogued digitised collection objects and other content sources?
How can such an approach be made to scale?
Some constraints: 59% of heritage organisations don’t have the time or resources for mass data-labelling
Heritage collections often have little metadata
Metadata is often inconsistently applied and does not follow a controlled vocabulary
Collections will often mention people, orgs or objects whose Wikidata record is also lacking in metadata
Therefore we aim to design/use methods which require little labelled data, finding other ways to inject expert knowledge
Tabular Collection Data -> Ingestion Process (converts table to RDF, maps categorical fields to controlled LOD vocabularies, pattern matching to find existing LOD URLs) -> Heritage Connector knowledge graph -> Record Linkage to Wikidata (identify links from heritage records to Wikidata) -> Information Retrieval.
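The ingestion step (table rows to RDF-style triples, with a categorical field mapped to a controlled vocabulary) could be sketched roughly as below. This is a minimal illustration only: the column names, the `example.org` namespaces and the `CATEGORY_VOCAB` mapping are hypothetical and not taken from Heritage Connector, and a real pipeline would likely use an RDF library rather than plain tuples.

```python
# Hypothetical namespaces and vocabulary mapping, for illustration only.
BASE = "https://example.org/object/"
PRED = "https://example.org/prop/"
CATEGORY_VOCAB = {
    "photograph": "https://example.org/vocab/photograph",
    "locomotive": "https://example.org/vocab/locomotive",
}

def row_to_triples(row):
    """Turn one table row (a dict) into (subject, predicate, object) triples."""
    subject = BASE + str(row["id"])
    triples = []
    for column, value in row.items():
        if column == "id" or value in (None, ""):
            continue  # skip the key column and empty cells
        if column == "category":
            # Map the categorical field to a controlled-vocabulary URI when known.
            value = CATEGORY_VOCAB.get(value.strip().lower(), value)
        triples.append((subject, PRED + column, value))
    return triples

rows = [
    {"id": 1, "title": "Portrait of an engineer", "category": "Photograph"},
    {"id": 2, "title": "Rocket", "category": "locomotive", "maker": ""},
]
triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

Unknown category values are passed through as literals here, so they remain visible for later manual mapping.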
spaCy pipeline - Tokeniser, POS Tagger, Parser, NER -> text to annotated text
Custom component - Dictionary matcher - wrapper around spaCy PhraseMatcher and rule-based Matcher, with custom patterns (e.g. dates, collections) written for heritage collections. Some success in overcoming limitations in spaCy
DictionaryMatcher, label, pattern, and id. DATE_PATTERNS - use regex. COLLECTION_NAME_PATTERNS. Applied across the collections without people needing to add labels.
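The idea of label/pattern/id entries applied without hand-labelled data can be illustrated with a small pure-Python stand-in. It does not use spaCy's PhraseMatcher or Matcher, and the regexes and the collection-name dictionary are hypothetical examples, not the project's actual patterns.

```python
import re

# Hypothetical patterns and terms, for illustration only.
DATE_PATTERNS = [
    r"\b\d{4}\s*-\s*\d{4}\b",  # year ranges such as "1829-1830"
    r"\b\d{4}s?\b",            # single years or decades: "1829", "1860s"
]
COLLECTION_NAMES = {"Science Museum Group": "COLLECTION-1"}

def annotate(text):
    """Return sorted (start, end, label, matched_text, id) tuples without overlaps."""
    found = []
    for pattern in DATE_PATTERNS:
        for m in re.finditer(pattern, text):
            found.append((m.start(), m.end(), "DATE", m.group(), None))
    for name, ident in COLLECTION_NAMES.items():
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            found.append((m.start(), m.end(), "COLLECTION", m.group(), ident))
    # Prefer the longest match at each start, then drop spans overlapping a kept one.
    found.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], -1
    for span in found:
        if span[0] >= last_end:
            kept.append(span)
            last_end = span[1]
    return kept

spans = annotate("Built 1829-1830, now held by the Science Museum Group since the 1860s.")
for span in spans:
    print(span)
```

Resolving overlaps by longest match keeps a year range as one DATE rather than two separate years.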
A Framework for Low-resource Entity Linking and Record Linkage
Search (Candidate Generation)
Source DB and Target DB - Elasticsearch indices
Get back a lot of candidates
Feature Creation
$f = [\mathrm{sim}_1(a_1, b_1), \mathrm{sim}_2(a_2, b_2), \dots, \mathrm{sim}_n(a_n, b_n)]$
Where:
$a_i$ is the value of a feature from the source record
$b_i$ is the value of the corresponding feature from the target record
The classifier doesn’t need to have seen all the test samples (common requirement of EL methods)
Feature vectors are likely low-dimensional -> low data requirements
New similarity functions can be added to accommodate domain knowledge
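A minimal sketch of the feature-creation step: one similarity function per field, applied to a source/target record pair. The field names, the record values and the three similarity functions are assumptions chosen for illustration; the notes do not say which functions the project uses for record linkage.

```python
from difflib import SequenceMatcher

def exact(a, b):
    return 1.0 if a == b else 0.0

def string_ratio(a, b):
    # difflib ratio in [0, 1]; a stand-in for any string similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_closeness(a, b, scale=10.0):
    # 1.0 for identical years, falling linearly to 0.0 at `scale` years apart.
    return max(0.0, 1.0 - abs(a - b) / scale)

# One (field, similarity function) pair per feature; all names are hypothetical.
SIMILARITIES = [
    ("name", string_ratio),
    ("birth_year", year_closeness),
    ("type", exact),
]

def feature_vector(source, target):
    """Apply each similarity function to the matching field of both records."""
    return [sim(source[field], target[field]) for field, sim in SIMILARITIES]

source = {"name": "Stephenson, George", "birth_year": 1781, "type": "person"}
target = {"name": "George Stephenson", "birth_year": 1781, "type": "person"}
f = feature_vector(source, target)
print(f)
```

Adding domain knowledge then amounts to appending another `(field, function)` pair to `SIMILARITIES`; the vector stays low-dimensional.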
Classification (Candidate Ranking)
If multiple correct links per record -> binary classifier
If one correct link per record -> learning to rank model, e.g. RankSVM
In our tests so far on Science Museum data, 100-1000 labels are sufficient to train the classifier
Entity Linking to the Heritage Connector
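The difference between the two cases shows up in how scored candidates are turned into links. The sketch below covers only that decision rule, not model training; the candidate names, scores and the 0.5 threshold are hypothetical.

```python
def links_binary(scored, threshold=0.5):
    """Keep every candidate whose score clears the threshold (zero, one, or many)."""
    return [cand for cand, score in scored if score >= threshold]

def link_ranked(scored):
    """Keep only the single best-scoring candidate, or None if there are none."""
    return max(scored, key=lambda pair: pair[1])[0] if scored else None

# Hypothetical model scores for the candidates of one source record.
scored = [("candidate-a", 0.91), ("candidate-b", 0.64), ("candidate-c", 0.12)]

print(links_binary(scored))  # all candidates above the threshold
print(link_ranked(scored))   # the single top-ranked candidate
```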
Search -
NER-Annotated Source Text
Candidates for Entity Link, label and description
Feature Creation
Mention exactly matches label
Label is in mention
Levenshtein distance between mention and label
One-hot-encoded mention type
One-hot-encoded candidate type
FuzzyWuzzy token-sort distance between mention and label
FuzzyWuzzy token-sort distance between mention and label, with common organisation suffixes removed
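The features listed above can be assembled into one vector per mention/candidate pair. The following is a pure-Python sketch under stated assumptions: `difflib` stands in for the FuzzyWuzzy token-sort score, and the suffix list and type inventory are hypothetical.

```python
from difflib import SequenceMatcher

ORG_SUFFIXES = {"ltd", "limited", "co", "company", "inc", "plc"}  # hypothetical list
TYPES = ["PERSON", "ORG", "LOC"]  # hypothetical type inventory

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def token_sort(text, drop_suffixes=False):
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    if drop_suffixes:
        tokens = [t for t in tokens if t not in ORG_SUFFIXES]
    return " ".join(sorted(tokens))

def token_sort_similarity(a, b, drop_suffixes=False):
    # difflib stand-in for a token-sort score, scaled to [0, 1].
    return SequenceMatcher(
        None, token_sort(a, drop_suffixes), token_sort(b, drop_suffixes)
    ).ratio()

def one_hot(value):
    return [1.0 if value == t else 0.0 for t in TYPES]

def features(mention, mention_type, label, candidate_type):
    m, l = mention.lower(), label.lower()
    return (
        [float(m == l), float(l in m), float(levenshtein(m, l))]
        + one_hot(mention_type)
        + one_hot(candidate_type)
        + [token_sort_similarity(mention, label),
           token_sort_similarity(mention, label, drop_suffixes=True)]
    )

f = features("Stephenson and Co. Ltd", "ORG", "Stephenson and Co", "ORG")
print(f)
```

In this example the suffix-stripped score reaches 1.0 while the plain token-sort score does not, which is the kind of signal the last feature is meant to capture.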
Classification (Candidate Ranking)
Binary classifier, as there is more than one possible link per mention
10-field
Entity Linking Wikidata,
Record Linkage to Wikidata, follows same pattern, values for the source and candidate are the properties that relate the source to value.
Binary classifier (decision tree)
Really difficult to collect enough data on the entity
What’s next
Iterating
Eric Morgan - Distant Reader - the Reader is a tool for reading
Descriptive bibliographics (authors, titles, publication dates, extents, readability scores)
Analytic bibliographics (keywords, summaries, and added entries to some extent)
Ngram features
Parts-of-speech
Named entities
Textual snippets based on grammars (noun-verb clauses, adjective-noun clauses, questions, etc.)
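As one example of the features listed above, word n-grams can be counted with the standard library alone. This is an illustrative sketch, not the Reader's own implementation; the tokenising regex is an assumption.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Count word n-grams in lowercased text (n-grams may span sentence breaks)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(*(words[i:] for i in range(n))))

counts = ngrams("The mass of men lead lives of quiet desperation. The mass of men.", n=2)
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```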
The Reader uses machine learning via spaCy
Study carrels are collections of papers and other material. Users can ask who-what-where-why-how questions of a study carrel, as well as more sophisticated ones
Reader uses machine learning - spaCy
import spacy
FILE = './walden.text'
MODEL = 'en'  # spaCy 3+ needs a full model name such as 'en_core_web_sm'
text = open(FILE).read()
nlp = spacy.load(MODEL)
doc = nlp(text)
for ent in doc.ents: print(ent.text, ent.label_)
exit()
Also has scientific models for chemical and other labels.
Reader adds entities to database for users
Reader output can be used for topic modeling and classification
Study carrel content can be used as input for topic modeling, and the results can be “pivoted” to compare topics to metadata (authors, dates, places, keywords, etc.)
Study carrel content can be used to classify texts, and the results can be used to identify the salient characteristics, denote authorship, classify new/unclassified text
Output can be used in machine learning
Distant Reader and Machine Learning
Reader creates study carrels; there has been thought of converting them to linked data to discover other relationships.
The hard part is finding URIs for the entities.
Distantreader.org, collection of 287 study carrels
Browse the study carrel,
Raw input text or PDF
Create tab-delimited entities from a particular text
POS - word, lemma, part of speech
Really get a good idea of what is happening in a text collection
CLI: rdr model-build homer; take the output of the Distant Reader and then do other types of machine learning
Questions:
Kalyan: David posted a question about modeling subject classification or call-number assignment, wondering whether that would require a greater amount of data. What areas? Call-number assignment, subject, the discipline (field) the document is in. Scales linearly with the number of classes. Very hard to do: need to find good sample data and a model, and make sure the data is not class-imbalanced, or work around that.
Kalyan: Appreciated the use of spaCy to create models; curious about a linked-data source with ArchivesHub, very old information, but was doing something similar a few years ago. Not trying to build an aggregator; much more interested in how far you can get with just the right amount of human intervention.
Kalyan: How much augmentation is happening? The number of triples about tripled when running
Kalyan: Tables to RDF - is entity linking done on columns in a pandas DataFrame, or taken from textual information in tables in PDFs? Everything is in a relational database; hopefully a follow-on project will look at what they can do with historical material
Kalyan: Noticed that “hell” is tagged as an organization. One of the differences when running NER or before, a load of false positives.
Kalyan: Performance in the spaCy pipeline - pattern matching or Entity Ruler? The pattern matcher is much more computationally efficient
Kalyan: VIAF or other library sources using GeoNames, LOD? Haven’t found a control or FB, doesn’t have to worry much about controlled vocabularies
Cultural heritage institutions and machine learning: enhance bibliographic data? Other uses? Living with Machines at the British Library https://livingwithmachines.ac.uk/ with other applications of machine learning. Read the map, better OCR?
David Lowe - looking at Patents
Corey Harper - https://www.cultural-ai.nl/missionandvision, looking at topic drift