title: ai4lam Metadata/Discovery WG June 2024 Monthly Meeting
Jun 11, 2024#
8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris
Attending
Name, institution
Tim Thompson, Yale
Gavin Mendel-Gleason, TerminusDB
Jeremy Nelson, Stanford
Stephen McConnachie, BFI
Erik Radio, Colorado
Sara Amato, Eastern Academic Scholars’ Trust
Victor Mireles, National University of Mexico
Kalli Mathios, Stanford
Joy Panigabutra-Roberts
Helpful Links#
Project Documents and Data#
Agenda#
Announcements
Presentation by Dr Gavin Mendel-Gleason, who will present on the following topic: What are text embeddings, and how can we use them for AI-assisted information retrieval and data quality?
AI For Document Retrieval and Data Quality presentation
Using text embeddings to enhance the user experience of data collections
What is a text embedding?
LLMs offer a representation of text in a high dimensional vector space
These semantic spaces allow meaning to be represented with maths
King - man + woman = queen
Transformer models are the current state of the art for obtaining vector representations
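A minimal sketch of this kind of vector arithmetic, assuming the gensim library and its downloadable GloVe vectors (the specific model name is an illustrative choice, not something from the presentation):

```python
# Illustrative sketch: the "king - man + woman ≈ queen" analogy with
# pre-trained GloVe word vectors loaded through gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads the model on first use

# Vector arithmetic in the embedding space: king - man + woman.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # the nearest word to the combined vector is expected to be "queen"
```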
What can you do with a vector representation?
Semantic Record retrieval
Improve data quality
Entity matching
Anomaly detection: outliers in the vector space sometimes represent problems in the data
RAG (Retrieval-Augmented Generation) promises to really help us find resources interactively
Semantic Record retrieval
The semantic distance between two texts is the distance between their vectors:
d(“dogs are the best”, “canines are the greatest”) ~ 0
Similar documents are near each other in the space, allowing clustering, matching, etc.
Some transformers can also split the space into queries and answers, allowing us to obtain a vector representation from the question which is near the answer, rather than the question:
d(“QUERY Who was president in 1964”, “ANSWER Lyndon Johnson”) ~ 0
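A hedged sketch of measuring semantic distance with the sentence-transformers library; the small all-MiniLM-L6-v2 model is used here purely as a stand-in for whichever embedding model (MxBai, Ada, etc.) is actually chosen:

```python
# Sketch: semantic similarity of two texts via embedding vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

a = model.encode("dogs are the best")
b = model.encode("canines are the greatest")

# Cosine similarity near 1 means cosine distance near 0, i.e. d(a, b) ~ 0.
print(util.cos_sim(a, b))  # high similarity despite no shared keywords
```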
Data Quality
Matching with LLMs is flexible - text can work across languages and orthographies, with spelling mistakes and without normalisation. This is very helpful in entity recognition tasks:
d(“Jim”, “James”) ~ 0
d(“Khrushchev”, “Chruscthschow”) ~0
The strategy of “embed and cluster” can help to find duplicate records
It also provides a strategy for controlling record matching at scale (1 billion x 1 billion = 1 quintillion comparisons) - it’s much faster to search just the neighbors (low-distance vectors) of each record
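A small sketch of the “embed and cluster” idea, assuming sentence-transformers plus scikit-learn; the record strings are made up for illustration:

```python
# Sketch: flag candidate duplicate records by comparing each record only
# with its nearest neighbour in embedding space, not with every other record.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

records = [
    "Khrushchev, Nikita. Memoirs.",
    "Chruschtschow, Nikita. Memoiren.",
    "Clancy, Tom. The Hunt for Red October.",
    "Clancy, Tom. Hunt for Red October (The).",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
X = model.encode(records, normalize_embeddings=True)

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X)  # column 0 is the record itself, column 1 its neighbour

for i, (d, j) in enumerate(zip(dist[:, 1], idx[:, 1])):
    print(f"{records[i]!r} <-> {records[j]!r}  cosine distance {d:.3f}")
# Pairs with small distances are candidate duplicates; the exact cut-off
# is a tuning decision rather than a fixed rule.
```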
Anomaly detection
RAG Retrieval-Augmented Generation
Multi-stage process using chatbot to obtain answers
Vectorize documents
Ask a question
Get the “QUESTION” embedding, and search for neighbors, e.g. “QUESTION What are some novels about the Cold War”
Extract information about the neighbors from a traditional RDBMS or Graph Database (title, abstract, author, hyper-link, etc)
Use this information to produce a prompt
“Answer the following question given the relevant documents and their hyperlinks:
Author: Tom Clancy, Title: The Hunt for Red October
Author: Tom Clancy, Title: Clear and Present Danger
Feed the prompt and question to a chat bot
The chatbot responds with the correct answer from the internal document records, without hallucination (hopefully)
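A hedged end-to-end sketch of these RAG steps; the toy catalogue, the embedding model, and the ask_chatbot() helper are illustrative placeholders rather than a description of any particular system:

```python
# Sketch: vectorise records, retrieve neighbours of the question,
# and build a grounded prompt for a chat model.
from sentence_transformers import SentenceTransformer, util

catalogue = [
    {"author": "Tom Clancy", "title": "The Hunt for Red October"},
    {"author": "Tom Clancy", "title": "Clear and Present Danger"},
    {"author": "Jane Austen", "title": "Pride and Prejudice"},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode([f"{r['author']}: {r['title']}" for r in catalogue])

# Embed the question and search for its nearest documents.
question = "What are some novels about the Cold War?"
hits = util.semantic_search(model.encode(question), doc_vectors, top_k=2)[0]

# Pull the matched records (in practice from an RDBMS or graph database)
# and turn them into a prompt.
context = "\n".join(
    f"Author: {catalogue[h['corpus_id']]['author']}, "
    f"Title: {catalogue[h['corpus_id']]['title']}"
    for h in hits
)
prompt = (
    "Answer the following question given the relevant documents:\n"
    f"{context}\n\nQuestion: {question}"
)

# Finally feed the prompt to a chat model; ask_chatbot() is a hypothetical
# stand-in for whatever LLM API is in use.
# answer = ask_chatbot(prompt)
print(prompt)
```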
What do I need to make these things work?
An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things)
A traditional database (Graph or RDBMS)
A vector database (currently HNSW (Hierarchical Navigable Small World) variants are the best performing)
A way to create strings for the embeddings from records (JSON+handlebars templates?)
Good prompt engineering (Good luck!)
Some glue code (python?)
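As a tiny illustration of the “strings from records” item above, here is a sketch that flattens a JSON-style record into an embedding string; Python’s built-in formatting stands in for handlebars templates, and the field names are assumptions:

```python
# Sketch: turn a structured record into the text that gets embedded.
record = {"title": "The Hunt for Red October", "author": "Tom Clancy", "year": 1984}

template = "Title: {title}. Author: {author}. Published: {year}."
embedding_text = template.format(**record)

print(embedding_text)  # this string is what the embedding model would see
```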
Question: Vectors for graph documents?
Graph embeddings: Transformers can embed information from a graph
Easiest to create an embedding of the query
Get the answer back from the query, then link back to the original elements in the graph
Disaggregate a book into chapters and paragraphs, query, then bubble results up; the structure is reassembled after the query using the vectors
Question: MxBai
For a large data set with small documents, the vector size is relatively small, so it vectorises faster
Question: Thoughts about RAG graph vs. relational database
Bias towards graph databases
Easier to model hierarchical and more complex data
Artificial Intelligence - Deep Learning for Language Modelling presentation
A simple neural network
Input layer -> hidden layer -> output layer
Inputs are combined with weights to create a new vector at each layer (“higher order features from previously recognized features”); the last vector is what is provided as output
Generative model: output is fed back through the layers to approximate the input
How do we vectorise language?
No single answer, still an open question - anything I say here is but one approach
BUT we have amazingly good language models now
Some questions:
What is the unit of vectorisation? Word, sentence, paragraph, document
How do we incorporate context?
Vectorising words
Input vector - a “1-hot vector”: exactly one element set to 1 in a 10k-dimensional vector, with the remaining elements zero
Hidden Layer - Linear Neurons
Output Layer - Softmax Classifier
Probability that the word at a randomly chosen nearby position is “abandon”, “ability”, etc. (one output per vocabulary word)
Semantic context
CBOW
Input -> Projection -> Output
Skip-gram
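A hedged sketch of training CBOW or skip-gram word vectors with gensim; the toy corpus and hyper-parameters are only for illustration:

```python
# Sketch: word2vec-style training, switching between CBOW and skip-gram.
from gensim.models import Word2Vec

corpus = [
    ["dogs", "are", "the", "best"],
    ["canines", "are", "the", "greatest"],
    ["the", "king", "and", "the", "queen"],
]

# sg=0 trains CBOW (predict a word from its context);
# sg=1 trains skip-gram (predict the context from a word).
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["dogs"].shape)                 # one dense vector per word
print(model.wv.similarity("dogs", "canines")) # cosine similarity of two words
```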
How do we get the right weights?
Use a loss/cost/objective function to measure how good the answer is - many possibilities
Use a search strategy to alter the weights (gradient descent for instance)
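A minimal, self-contained sketch of the loss-plus-gradient-descent idea with a single weight; real training applies the same loop to millions of weights:

```python
# Sketch: gradient descent on a squared loss for one weight.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
target = np.array([2.0, 4.0, 6.0])   # the "right answers" (true weight is 2)

w = 0.0                              # the weight we want to learn
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    loss = np.mean((prediction - target) ** 2)         # loss/cost/objective function
    gradient = np.mean(2 * (prediction - target) * x)  # slope of the loss w.r.t. w
    w -= learning_rate * gradient                      # move downhill

print(w, loss)  # w converges towards 2 and the loss towards 0
```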
Rise of the Transformers
Sequence data with context was being addressed with recurrent neural networks
These had problems with keeping good track of context
Largely superseded by an attention model
Attention tells us what part of a sequence we should be paying attention to in order to understand the next bit.
Attention
I want to go to the store
Ich möchte zum Geschäft gehen
Look forward and backward in time to construct a complex context
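A hedged numpy sketch of (scaled) dot-product attention, the core operation behind this idea; the 4-token, 8-dimensional toy input is an assumption:

```python
# Sketch: each position computes weights over all positions and mixes
# their values accordingly.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of every position to every other
    weights = softmax(scores)                # attention weights, each row sums to 1
    return weights @ V                       # context-weighted mix of the values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))          # 4 tokens, 8-dimensional representations
print(attention(Q, K, V).shape)              # (4, 8)
```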
Transformers (training)
Word IDs “You are welcome”
Embeddings and Position Encoding
Encoder-1 -> Enc-1 Out -> Encoder-2 -> ...
The Transformer (training) proceeds as follows
The input sequence is converted into embeddings (with position encoding) and fed to the Encoder
The stack of Encoders processes this and produces an encoded representation of the input sequence.
The target sequence is prepended with a start-of-sequence token and fed to the Decoder
Transformer (inference)
Builds up the answer based on the probability of the next word; looks back at what has been generated so far and feeds it back into the loop
Trained with enough data, the results are very good
AI is the future of content
These models allow sophisticated modelling of semantics
We need to be at the forefront of semantic modelling of content to win
Some ideas
A librarian which knows about your content and can converse with you about it
A library that knows about connections (attention model embeddings?)
Content summarisation engine
Automatic schema generation from examples
Synthetic content generation
Auto-clustering
Entity Resolution
Anomaly detection
Question: Why use HNSW vector database?
For the vector database, one of the requirements was that it be open source; TerminusDB is an open-source graph database
Not many databases can go to a billion vectors smoothly or with a nice CLI
The main problem is recall, which is the most important thing when dealing with these systems
Product search for “fish and chips” needs essentially 100% recall; 80% is not good enough. High recall at scale means something like 99.999% for 1 million documents
They built their own vector database and are proud of its recall and scale
Vespa also works at billion scale
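A sketch of building an HNSW index and roughly checking its recall against brute force, assuming the hnswlib library; the sizes and parameters are illustrative:

```python
# Sketch: approximate nearest-neighbour search with HNSW, plus a recall check.
import numpy as np
import hnswlib

dim, n = 64, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(100)  # higher ef gives better recall at the cost of slower queries

queries = data[:100]
approx, _ = index.knn_query(queries, k=10)

# Brute-force ground truth: exact top-10 by cosine similarity.
norm = data / np.linalg.norm(data, axis=1, keepdims=True)
exact = np.argsort(-(norm[:100] @ norm.T), axis=1)[:, :10]

recall = np.mean([len(set(a) & set(t)) / 10 for a, t in zip(approx, exact)])
print(f"recall@10 ~ {recall:.3f}")
```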
Question: Vectors in Postgres?
Storage great
Indexing is the issue
Worth looking at the recall numbers
Question: Different use cases you worked on?
Library vs other industry use cases
Entity recognition for fraud detection, sales, and bankruptcy; matching records between datasets - similar to what is being done here
Older, traditional matching techniques mixed together with vectors give the best results
Phone numbers and Dates don’t work very well
Transformers are very bad at reasoning about time
Semantic search
Experiments with RAG, but no big RAG project yet; refocusing TerminusDB on very low-touch, easy-to-use RAG with a range of different options
Question: Names and disambiguation
Names and titles - anything that can be misspelled - can still be matched well