title: ai4lam Metadata/Discovery WG June 2024 Monthly Meeting

Jun 11, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

Attending

  • Tim Thompson, Yale

  • Gavin Mendel-Gleason, TerminusDB

  • Jeremy Nelson, Stanford

  • Stephen McConnachie, BFI

  • Erik Radio, Colorado

  • Sara Amato, Eastern Academic Scholars’ Trust

  • Victor Mireles, National University of Mexico

  • Kalli Mathios, Stanford

  • Joy Panigabutra-Roberts

Project Documents and Data

Agenda

  • Announcements

  • Presentation by Dr Gavin Mendel-Gleason: What are text embeddings, and how can we use them for AI-assisted information retrieval and data quality?

    • AI For Document Retrieval and Data Quality presentation
      Using text embeddings to enhance the user experience of data collections

    • What is a text embedding?

      • LLMs offer a representation of text in a high dimensional vector space

      • These semantic spaces allow meaning to be represented with maths

      • King - man + woman ≈ queen (see the sketch after this list)

      • Transformer models are the current state of the art for obtaining vector representations
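
A minimal sketch of the vector arithmetic above, using pretrained GloVe word vectors via the gensim library; the library and model name are illustrative assumptions (word vectors predate transformer embeddings, but they make the analogy concrete).

```python
# Sketch: "king - man + woman ≈ queen" with pretrained word vectors.
# gensim and the GloVe model name are illustrative choices, not from the talk.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

# Vector arithmetic in the semantic space: start from "king", remove "man", add "woman".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" is typically the nearest word to the resulting vector
```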

    • What can you do with a vector representation?

      • Semantic Record retrieval

      • Improve data quality

        • Entity matching

        • Anomaly detection: outliers in the vector space sometimes represent problems in the data

      • RAG (Retrieval-Augmented Generation) promises to really help us find resources interactively

    • Semantic Record retrieval

      • The semantic distance between two texts is the distance between their vectors:
        d(“dogs are the best”, “canines are the greatest”) ~ 0

      • Similar documents are near each other in the space, allowing clustering, matching, etc.: d(“QUERY Who was president in 1964”, “ANSWER Lyndon Johnson”) ~ 0

      • Some transformers can also split the space into queries and answers, allowing us to obtain a vector representation of the question that lands near the answer rather than near other questions (see the retrieval sketch below)
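
A minimal sketch of semantic record retrieval under the QUERY/ANSWER prefix convention described above; the sentence-transformers library and the model name are assumptions for illustration, not choices named in the talk.

```python
# Sketch: semantic record retrieval by nearest-neighbour search over embeddings.
# sentence-transformers and the model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "ANSWER Lyndon Johnson was president of the United States in 1964.",
    "ANSWER The Hunt for Red October is a Cold War novel by Tom Clancy.",
]
record_vecs = model.encode(records)

query_vec = model.encode("QUERY Who was president in 1964")
hits = util.semantic_search(query_vec, record_vecs, top_k=1)[0]
print(records[hits[0]["corpus_id"]])  # the record nearest the query in the space
```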

    • Data Quality

      • Matching with LLMs is flexible: text can be matched across languages and orthographies, with spelling mistakes and without normalisation. This is very helpful in entity recognition tasks:

        • d(“Jim”, “James”) ~ 0

        • d(“Khrushchev”, “Chruscthschow”) ~0

      • The strategy of “embed and cluster” can help to find duplicate records (see the sketch after this list)

      • It also provides a strategy for controlling the cost of record matching at scale (1 billion x 1 billion = 1 quintillion comparisons): it’s much faster to search only the neighbors (low-distance vectors) of each record

      • Anomaly detection
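
A minimal sketch of the "embed and cluster" idea for duplicate detection: embed the records, then inspect only each record's nearest neighbours rather than comparing all pairs. The libraries, model name, and distance threshold are illustrative assumptions.

```python
# Sketch: find candidate duplicate records by searching each record's neighbours
# in the vector space instead of comparing every pair. Libraries, model, and
# the 0.3 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer("all-MiniLM-L6-v2")

names = ["James Smith", "Jim Smith", "Nikita Khrushchev", "Nikita Chruschtschow"]
vecs = model.encode(names)

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vecs)
distances, indices = nn.kneighbors(vecs)

for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
    if dist < 0.3:  # tune the threshold to the collection
        print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (distance {dist:.2f})")
```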

    • RAG Retrieval-Augmented Generation

      • Multi-stage process using a chatbot to obtain answers

        • Vectorize documents

        • Ask a question

        • Get the “QUESTION” embedding, and search for neighbors, e.g. “QUESTION What are some novels about the Cold War”

        • Extract information about the neighbors from a traditional RDBMS or Graph Database (title, abstract, author, hyper-link, etc)

        • Use this information to produce a prompt

        • “Answer the following question given the relevant documents and their hyperlinks:

          • Author: Tom Clancy, Title: The Hunt for Red October

          • Author: Tom Clancy, Title: Clear and Present Danger

        • Feed the prompt and question to a chat bot

        • The chatbot responds with the correct answer from the internal document records, without hallucination (hopefully); a sketch of the full loop follows this list
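
A minimal sketch of the multi-stage RAG loop above. The embedding model, the OpenAI chat client, and the tiny in-memory catalogue are assumptions standing in for the real vector index, RDBMS/graph lookup, and chatbot.

```python
# Sketch of the RAG loop: vectorize documents, embed the question, retrieve
# neighbours, build a prompt from their metadata, and ask a chat model.
# The libraries, model names, and toy catalogue are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

catalogue = [
    {"author": "Tom Clancy", "title": "The Hunt for Red October"},
    {"author": "Tom Clancy", "title": "Clear and Present Danger"},
]
doc_vecs = embedder.encode([f"{r['author']}: {r['title']}" for r in catalogue])

question = "What are some novels about the Cold War?"
q_vec = embedder.encode("QUESTION " + question)
hits = util.semantic_search(q_vec, doc_vecs, top_k=2)[0]

# In a real system the metadata would come from the RDBMS or graph database.
context = "\n".join(
    f"Author: {catalogue[h['corpus_id']]['author']}, "
    f"Title: {catalogue[h['corpus_id']]['title']}"
    for h in hits
)
prompt = (
    "Answer the following question given the relevant documents:\n"
    f"{context}\n\nQuestion: {question}"
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```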

    • What do I need to make these things work?

      • An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things)

      • A traditional database (Graph or RDBMS)

      • A vector database (HNSW, Hierarchical Navigable Small World, variants are currently the best performing)

      • A way to create strings for the embeddings from records (JSON + handlebars templates?); a record-to-string sketch follows this list

      • Good prompt engineering (Good luck!)

      • Some glue code (python?)
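
A minimal sketch of turning a catalogue record into a single string for embedding. The slide suggests JSON plus handlebars templates; plain Python string formatting is used here as a stand-in, and the record fields are hypothetical.

```python
# Sketch: build the string that gets embedded from a JSON record.
# Field names and the template are hypothetical; a handlebars template engine
# could be substituted for str.format without changing the idea.
import json

template = "TITLE {title}. AUTHOR {author}. SUBJECTS {subjects}. ABSTRACT {abstract}"

record = json.loads("""{
  "title": "The Hunt for Red October",
  "author": "Tom Clancy",
  "subjects": ["Cold War", "Submarines", "Fiction"],
  "abstract": "A Soviet submarine captain attempts to defect to the United States."
}""")

embedding_text = template.format(
    title=record["title"],
    author=record["author"],
    subjects="; ".join(record["subjects"]),
    abstract=record["abstract"],
)
print(embedding_text)  # this string is what the embedding model sees
```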

  • Question: Vector for Graph documents

    • Graph embeddings exist: Transformers can embed information contained in a graph

    • The easiest approach is to embed the result of a query over the graph

    • Get the answer back from the query and link it back to the original elements in the graph

    • Disaggregate a book into chapters and paragraphs, query at that level, then bubble results up; structure can be reassembled after the fact from the query and the vectors

  • Question: MxBai

    • Works well for a large data set with small documents; the vector size is relatively small, so it vectorizes faster

  • Question: Thoughts about RAG graph vs. relational database

    • Bias towards graph databases

    • Easier to model hierarchical and more complex data

  • Artificial Intelligence - Deep Learning for Language Modelling presentation

    • A simple neural network
      Input layer -> hidden layer -> output layer

    • At each layer, the inputs are combined with weights to create a new vector (“higher-order features from previously recognized features”); the last vector is what is provided as output (see the sketch below)

    • In a generative model, the output is fed back through the layers to approximate the input
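
A minimal sketch of the input layer -> hidden layer -> output layer network above, written with NumPy; the sizes and random weights are arbitrary illustrations.

```python
# Sketch: input layer -> hidden layer -> output layer, with random weights.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input (3 features) -> hidden (4 neurons)
W2 = rng.normal(size=(2, 4))   # hidden (4 neurons) -> output (2 values)

def forward(x):
    hidden = np.tanh(W1 @ x)   # higher-order features from the raw inputs
    return W2 @ hidden         # the last vector is what is provided as output

print(forward(np.array([1.0, 0.5, -0.2])))
```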

  • How do we vectorise language?

    • No single answer, still an open question; anything I say here is but one approach

    • BUT we have amazingly good language models now

    • Some questions:

      • What is the unit of vectorisation? Word, sentence, paragraph, document

      • How do we incorporate context?

  • Vectorising words

    • Input Vector - a “one-hot” vector: exactly one element set to 1 in a ~10k-element vector, with the remaining elements zero

    • Hidden Layer - Linear Neurons

    • Output Layer - Softmax Classifier

      • Each output gives the probability that the word at a randomly chosen nearby position is a particular vocabulary word (“abandon”, “ability”, …)

  • Semantic context

    • CBOW (Continuous Bag of Words)

      • Input -> Projection -> Output

    • Skip-gram

  • How do we get the right weights?

    • Use a loss/cost/objective function to measure how good the answer is; many possibilities

    • Use a search strategy to alter the weights (gradient descent, for instance); see the training sketch below
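
A toy sketch tying the last few sections together: a skip-gram-style word vectoriser with a one-hot input layer, a linear hidden layer, a softmax output, and gradient descent on a cross-entropy loss. The corpus, window size, and dimensions are arbitrary assumptions.

```python
# Toy sketch of skip-gram-style word vectorisation: one-hot input, linear
# hidden layer, softmax output, and gradient descent on cross-entropy loss.
import numpy as np

corpus = "the king and the queen rule the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 5                       # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))  # rows of W_in become the word vectors
W_out = rng.normal(scale=0.1, size=(H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(200):
    for pos, word in enumerate(corpus):
        # Context window: up to two words on either side of the centre word.
        for ctx in corpus[max(0, pos - 2):pos] + corpus[pos + 1:pos + 3]:
            h = W_in[idx[word]]            # one-hot input just selects one row
            p = softmax(W_out.T @ h)       # P(nearby word = each vocabulary word)
            p[idx[ctx]] -= 1.0             # gradient of cross-entropy wrt logits
            grad_h = W_out @ p
            W_out -= lr * np.outer(h, p)   # gradient descent on both weight matrices
            W_in[idx[word]] -= lr * grad_h

print({w: np.round(W_in[i], 2) for w, i in idx.items()})  # learned word vectors
```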

  • Rise of the Transformers

    • Sequence data with context was being addressed with recurrent neural networks

    • These had problems with keeping good track of context

    • Largely superseded by an attention model

    • Attention tells us what part of a sequence we should be paying attention to in order to understand the next bit.

  • Attention

    • Example: translating “I want to go to the store” into “Ich möchte zum Geschäft gehen”; attention aligns each output word with the relevant input words

    • Look forward and backward in time to construct a complex context (see the attention sketch below)
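
A minimal sketch of scaled dot-product attention, the mechanism that decides which parts of the sequence to attend to; the matrices here are toy random values rather than learned projections.

```python
# Sketch: scaled dot-product attention over a short sequence of toy vectors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each position to each other
    weights = softmax(scores)                # attention weights, each row sums to 1
    return weights @ V                       # a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                          # e.g. 5 tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, dim)) for _ in range(3))
print(attention(Q, K, V).shape)              # (5, 8): one context vector per token
```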

  • Transformers (training)

    • Word IDs “You are welcome”

    • Embeddings and Position Encoding

    • Encoder-1 -> Encoder-1 out -> Encoder-2 -> …

  • The Transformer (training) process is as follows

    • The input sequence is converted into embeddings (with position encoding) and fed to the Encoder

    • The stack of Encoders processes this and produces an encoded representation of the input sequence.

    • The target sequence is prepended with a start-of-sequence token and fed to the Decoder

  • Transformer (inference)

    • The answer is built up from the probability of the next word: the output generated so far is fed back into the decoder in a loop, while attending to the encoded input (see the decoding sketch below)

    • Train with enough data and it becomes very good at this
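
A minimal sketch of the inference loop: generate one word at a time from next-word probabilities, feeding the output so far back in. The next_token_probs function is a hypothetical stand-in for a trained decoder.

```python
# Sketch: autoregressive inference, building up an answer one likeliest word
# at a time. next_token_probs is a hypothetical stand-in for a trained model.
import numpy as np

vocab = ["<start>", "you", "are", "welcome", "<end>"]

def next_token_probs(tokens):
    # Placeholder for a trained transformer decoder: returns a probability
    # distribution over the vocabulary given the tokens generated so far.
    table = {"<start>": "you", "you": "are", "are": "welcome", "welcome": "<end>"}
    probs = np.full(len(vocab), 0.01)
    probs[vocab.index(table[tokens[-1]])] = 1.0
    return probs / probs.sum()

tokens = ["<start>"]
while tokens[-1] != "<end>" and len(tokens) < 10:
    probs = next_token_probs(tokens)             # look back at everything generated so far
    tokens.append(vocab[int(np.argmax(probs))])  # greedily take the likeliest next word
print(" ".join(tokens[1:-1]))                    # -> "you are welcome"
```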

  • AI is the future of content

    • These models allow sophisticated modelling of semantics

    • We need to be at the forefront of semantic modelling of content to win

    • Some ideas

      • A librarian which knows about your content and can converse with you about it

      • A library that knows about connections (attention-model embeddings?)

      • Content summarisation engine

      • Automatic schema generation from examples

      • Synthetic content generation

      • Auto-clusterings

      • Entity Resolution

      • Anomaly detection

  • Question: Why use HNSW vector database?

    • One requirement for the vector database was to be open source; TerminusDB is an open-source graph database

    • Not many databases can scale to a billion vectors smoothly or offer a nice CLI

    • The main problem is recall, which is the most important thing when dealing with these systems.

      • A product search for “fish and chips” needs close to 100% recall; 80% is not good enough. High recall at scale means something like 99.999% over 1 million documents

    • They are building their own vector database and are proud of its recall and scale (see the sketch after this answer)

    • Vespa also works at billion scale
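
A minimal sketch of an HNSW index with a quick recall check against brute-force search; hnswlib and all parameters are illustrative assumptions, not the database discussed in the talk.

```python
# Sketch: build an HNSW (Hierarchical Navigable Small World) index and estimate
# recall@10 against brute-force search. hnswlib and parameters are illustrative.
import numpy as np
import hnswlib

dim, n = 64, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)                      # higher ef -> better recall, slower queries

query = rng.normal(size=(1, dim)).astype(np.float32)
labels, _ = index.knn_query(query, k=10)

# Brute-force ground truth to estimate recall@10 for this query.
sims = (data @ query[0]) / (np.linalg.norm(data, axis=1) * np.linalg.norm(query[0]))
truth = np.argsort(-sims)[:10]
print("recall@10:", len(set(labels[0]) & set(truth)) / 10)
```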

  • Question: Vector support in Postgres?

    • Storage great

    • Indexing is the issue

    • Worth looking at the recall numbers

  • Question: Different use cases you worked on?

    • Library vs other industry use cases

    • Entity recognition for fraud detection, sales, and bankruptcy; matching records across datasets, similar to what is being done here

    • Mixing older, traditional matching techniques with vectors gives the best results

    • Phone numbers and Dates don’t work very well

      • Transformers reason very poorly about time

    • Semantic search

    • Experiments with RAG, but no big RAG project yet; refocusing TerminusDB on very low-touch, easy-to-use RAG with a range of different options

  • Question: Names and disambiguation

    • Names, titles, anything that can be misspelled is a good match for this approach