title: ai4lam Metadata/Discovery WG June 2024 Monthly Meeting

Jun 11, 2024

8 AM California | 11 AM Washington DC | 4 PM UK | 5 PM Oslo & Paris

Attending

  • Tim Thompson, Yale

  • Gavin Mendel-Gleason, TerminusDB

  • Jeremy Nelson, Stanford

  • Stephen McConnachie, BFI

  • Erik Radio, Colorado

  • Sara Amato, Eastern Academic Scholars’ Trust

  • Victor Mireles, National University of Mexico

  • Kalli Mathios, Stanford

  • Joy Panigabutra-Roberts

Project Documents and Data

Agenda

  • Announcements

  • Presentation by Dr Gavin Mendel-Gleason: What are text embeddings, and how can we use them for AI-assisted information retrieval and data quality?

    • AI For Document Retrieval and Data Quality presentation
      Using text embeddings to enhance the user experience of data collections

    • What is a text embedding?

      • LLMs offer a representation of text in a high dimensional vector space

      • These semantic spaces allow meaning to be represented with maths

      • King - man + woman ≈ queen (see the sketch after this list)

      • Transformer models are the current state of the art for obtaining vector representations
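
A minimal sketch of the vector arithmetic above, using pretrained GloVe word vectors via the gensim library; the library and model name are illustrative assumptions (word vectors predate transformer embeddings, but they make the analogy concrete).

```python
# Sketch: "king - man + woman ≈ queen" with pretrained word vectors.
# gensim and the GloVe model name are illustrative choices, not from the talk.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

# Vector arithmetic in the semantic space: start from "king", remove "man", add "woman".
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" is typically the nearest word to the resulting vector
```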

    • What can you do with a vector representation?

      • Semantic Record retrieval

      • Improve data quality

        • Entity matching

        • Anomaly detection: outliers in the vector space sometimes represent problems in the data

      • RAG (Retrieval-Augmented Generation) promises to really help us find resources interactively

    • Semantic Record retrieval

      • The semantic distance between two texts is the distance between their vectors:
        d(“dogs are the best”, “canines are the greatest”) ~ 0

      • Similar documents are near each other in the space, allowing clustering, matching, etc.: d(“QUERY Who was president in 1964”, “ANSWER Lyndon Johnson”) ~ 0

      • Some transformers can also split the space into queries and answers, allowing us to obtain a vector representation of the question that lands near the answer rather than near other questions (see the retrieval sketch below)
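
A minimal sketch of semantic record retrieval under the QUERY/ANSWER prefix convention described above; the sentence-transformers library and the model name are assumptions for illustration, not choices named in the talk.

```python
# Sketch: semantic record retrieval by nearest-neighbour search over embeddings.
# sentence-transformers and the model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "ANSWER Lyndon Johnson was president of the United States in 1964.",
    "ANSWER The Hunt for Red October is a Cold War novel by Tom Clancy.",
]
record_vecs = model.encode(records)

query_vec = model.encode("QUERY Who was president in 1964")
hits = util.semantic_search(query_vec, record_vecs, top_k=1)[0]
print(records[hits[0]["corpus_id"]])  # the record nearest the query in the space
```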

    • Data Quality

      • Matching with LLMs is flexible: text can be matched across languages and orthographies, with spelling mistakes and without normalisation. This is very helpful in entity recognition tasks:

        • d(“Jim”, “James”) ~ 0

        • d(“Khrushchev”, “Chruscthschow”) ~0

      • The strategy of “embed and cluster” can help to find duplicate records (see the sketch after this list)

      • It also provides a strategy for controlling the cost of record matching at scale (1 billion x 1 billion = 1 quintillion comparisons): it’s much faster to search only the neighbors (low-distance vectors) of each record

      • Anomaly detection
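
A minimal sketch of the "embed and cluster" idea for duplicate detection: embed the records, then inspect only each record's nearest neighbours rather than comparing all pairs. The libraries, model name, and distance threshold are illustrative assumptions.

```python
# Sketch: find candidate duplicate records by searching each record's neighbours
# in the vector space instead of comparing every pair. Libraries, model, and
# the 0.3 threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer("all-MiniLM-L6-v2")

names = ["James Smith", "Jim Smith", "Nikita Khrushchev", "Nikita Chruschtschow"]
vecs = model.encode(names)

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vecs)
distances, indices = nn.kneighbors(vecs)

for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
    if dist < 0.3:  # tune the threshold to the collection
        print(f"possible duplicate: {names[i]!r} ~ {names[j]!r} (distance {dist:.2f})")
```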

    • RAG Retrieval-Augmented Generation

      • Multi-stage process using a chatbot to obtain answers

        • Vectorize documents

        • Ask a question

        • Get the “QUESTION” embedding, and search for neighbors, e.g. “QUESTION What are some novels about the Cold War”

        • Extract information about the neighbors from a traditional RDBMS or Graph Database (title, abstract, author, hyper-link, etc)

        • Use this information to produce a prompt

        • “Answer the following question given the relevant documents and their hyperlinks:

          • Author: Tom Clancy, Title: The Hunt for Red October

          • Author: Tom Clancy, Title: Clear and Present Danger

        • Feed the prompt and question to a chat bot

        • The chatbot responds with the correct answer from the internal document records, without hallucination (hopefully); a sketch of the full loop follows this list
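
A minimal sketch of the multi-stage RAG loop above. The embedding model, the OpenAI chat client, and the tiny in-memory catalogue are assumptions standing in for the real vector index, RDBMS/graph lookup, and chatbot.

```python
# Sketch of the RAG loop: vectorize documents, embed the question, retrieve
# neighbours, build a prompt from their metadata, and ask a chat model.
# The libraries, model names, and toy catalogue are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()

catalogue = [
    {"author": "Tom Clancy", "title": "The Hunt for Red October"},
    {"author": "Tom Clancy", "title": "Clear and Present Danger"},
]
doc_vecs = embedder.encode([f"{r['author']}: {r['title']}" for r in catalogue])

question = "What are some novels about the Cold War?"
q_vec = embedder.encode("QUESTION " + question)
hits = util.semantic_search(q_vec, doc_vecs, top_k=2)[0]

# In a real system the metadata would come from the RDBMS or graph database.
context = "\n".join(
    f"Author: {catalogue[h['corpus_id']]['author']}, "
    f"Title: {catalogue[h['corpus_id']]['title']}"
    for h in hits
)
prompt = (
    "Answer the following question given the relevant documents:\n"
    f"{context}\n\nQuestion: {question}"
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```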

    • What do I need to make these things work?

      • An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things)

      • A traditional database (Graph or RDBMS)

      • A vector database (HNSW, Hierarchical Navigable Small World, variants are currently the best performing)

      • A way to create strings for the embeddings from records (JSON + handlebars templates?); a record-to-string sketch follows this list

      • Good prompt engineering (Good luck!)

      • Some glue code (python?)
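
A minimal sketch of turning a catalogue record into a single string for embedding. The slide suggests JSON plus handlebars templates; plain Python string formatting is used here as a stand-in, and the record fields are hypothetical.

```python
# Sketch: build the string that gets embedded from a JSON record.
# Field names and the template are hypothetical; a handlebars template engine
# could be substituted for str.format without changing the idea.
import json

template = "TITLE {title}. AUTHOR {author}. SUBJECTS {subjects}. ABSTRACT {abstract}"

record = json.loads("""{
  "title": "The Hunt for Red October",
  "author": "Tom Clancy",
  "subjects": ["Cold War", "Submarines", "Fiction"],
  "abstract": "A Soviet submarine captain attempts to defect to the United States."
}""")

embedding_text = template.format(
    title=record["title"],
    author=record["author"],
    subjects="; ".join(record["subjects"]),
    abstract=record["abstract"],
)
print(embedding_text)  # this string is what the embedding model sees
```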

  • Question: Vector for Graph documents

    • Graph embeddings exist: Transformers can embed information contained in a graph

    • The easiest approach is to embed the result of a query over the graph

    • Get the answer back from the query and link it back to the original elements in the graph

    • Disaggregate a book into chapters and paragraphs, query at that level, then bubble results up; structure can be reassembled after the fact from the query and the vectors

  • Question: MxBai

    • Works well for a large data set with small documents; the vector size is relatively small, so it vectorizes faster

  • Question: Thoughts about RAG graph vs. relational database

    • Bias towards graph databases

    • Easier to model hierarchical and more complex data

  • Artificial Intelligence - Deep Learning for Language Modelling presentation

    • A simple neural network
      Input layer -> hidden layer -> output layer

    • At each layer, the inputs are combined with weights to create a new vector (“higher-order features from previously recognized features”); the last vector is what is provided as output (see the sketch below)

    • In a generative model, the output is fed back through the layers to approximate the input
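
A minimal sketch of the input layer -> hidden layer -> output layer network above, written with NumPy; the sizes and random weights are arbitrary illustrations.

```python
# Sketch: input layer -> hidden layer -> output layer, with random weights.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input (3 features) -> hidden (4 neurons)
W2 = rng.normal(size=(2, 4))   # hidden (4 neurons) -> output (2 values)

def forward(x):
    hidden = np.tanh(W1 @ x)   # higher-order features from the raw inputs
    return W2 @ hidden         # the last vector is what is provided as output

print(forward(np.array([1.0, 0.5, -0.2])))
```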

  • How do we vectorise language?

    • No single answer, still an open question; anything I say here is but one approach

    • BUT we have amazingly good language models now

    • Some questions:

      • What is the unit of vectorisation? Word, sentence, paragraph, document

      • How do we incorporate context?

  • Vectorising words

    • Input Vector - a “one-hot” vector: exactly one element set to 1 in a ~10k-element vector, with the remaining elements zero

    • Hidden Layer - Linear Neurons

    • Output Layer - Softmax Classifier

      • Each output gives the probability that the word at a randomly chosen nearby position is a particular vocabulary word (“abandon”, “ability”, …)

  • Semantic context

    • CBOW (Continuous Bag of Words)

      • Input -> Projection -> Output

    • Skip-gram

  • How do we get the right weights?

    • Use a loss/cost/objective function to measure how good the answer is; many possibilities

    • Use a search strategy to alter the weights (gradient descent, for instance); see the training sketch below
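
A toy sketch tying the last few sections together: a skip-gram-style word vectoriser with a one-hot input layer, a linear hidden layer, a softmax output, and gradient descent on a cross-entropy loss. The corpus, window size, and dimensions are arbitrary assumptions.

```python
# Toy sketch of skip-gram-style word vectorisation: one-hot input, linear
# hidden layer, softmax output, and gradient descent on cross-entropy loss.
import numpy as np

corpus = "the king and the queen rule the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 5                       # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))  # rows of W_in become the word vectors
W_out = rng.normal(scale=0.1, size=(H, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(200):
    for pos, word in enumerate(corpus):
        # Context window: up to two words on either side of the centre word.
        for ctx in corpus[max(0, pos - 2):pos] + corpus[pos + 1:pos + 3]:
            h = W_in[idx[word]]            # one-hot input just selects one row
            p = softmax(W_out.T @ h)       # P(nearby word = each vocabulary word)
            p[idx[ctx]] -= 1.0             # gradient of cross-entropy wrt logits
            grad_h = W_out @ p
            W_out -= lr * np.outer(h, p)   # gradient descent on both weight matrices
            W_in[idx[word]] -= lr * grad_h

print({w: np.round(W_in[i], 2) for w, i in idx.items()})  # learned word vectors
```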

  • Rise of the Transformers

    • Sequence data with context was being addressed with recurrent neural networks

    • These had problems with keeping good track of context

    • Largely superseded by an attention model

    • Attention tells us what part of a sequence we should be paying attention to in order to understand the next bit.

  • Attention

    • Example: translating “I want to go to the store” into “Ich möchte zum Geschäft gehen”; attention aligns each output word with the relevant input words

    • Look forward and backward in time to construct a complex context (see the attention sketch below)
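
A minimal sketch of scaled dot-product attention, the mechanism that decides which parts of the sequence to attend to; the matrices here are toy random values rather than learned projections.

```python
# Sketch: scaled dot-product attention over a short sequence of toy vectors.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each position to each other
    weights = softmax(scores)                # attention weights, each row sums to 1
    return weights @ V                       # a weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                          # e.g. 5 tokens, 8-dimensional vectors
Q, K, V = (rng.normal(size=(seq_len, dim)) for _ in range(3))
print(attention(Q, K, V).shape)              # (5, 8): one context vector per token
```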

  • Transformers (training)

    • Word IDs “You are welcome”

    • Embeddings and Position Encoding

    • Encoder-1 -> Encoder-1 out -> Encoder-2 -> …

  • The Transformer (training) process is as follows

    • The input sequence is converted into embeddings (with position encoding) and fed to the Encoder

    • The stack of Encoders processes this and produces an encoded representation of the input sequence.

    • The target sequence is prepended with a start-of-sequence token and fed to the Decoder

  • Transformer (inference)

    • The answer is built up from the probability of the next word: the output generated so far is fed back into the decoder in a loop, while attending to the encoded input (see the decoding sketch below)

    • Train with enough data and it becomes very good at this
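
A minimal sketch of the inference loop: generate one word at a time from next-word probabilities, feeding the output so far back in. The next_token_probs function is a hypothetical stand-in for a trained decoder.

```python
# Sketch: autoregressive inference, building up an answer one likeliest word
# at a time. next_token_probs is a hypothetical stand-in for a trained model.
import numpy as np

vocab = ["<start>", "you", "are", "welcome", "<end>"]

def next_token_probs(tokens):
    # Placeholder for a trained transformer decoder: returns a probability
    # distribution over the vocabulary given the tokens generated so far.
    table = {"<start>": "you", "you": "are", "are": "welcome", "welcome": "<end>"}
    probs = np.full(len(vocab), 0.01)
    probs[vocab.index(table[tokens[-1]])] = 1.0
    return probs / probs.sum()

tokens = ["<start>"]
while tokens[-1] != "<end>" and len(tokens) < 10:
    probs = next_token_probs(tokens)             # look back at everything generated so far
    tokens.append(vocab[int(np.argmax(probs))])  # greedily take the likeliest next word
print(" ".join(tokens[1:-1]))                    # -> "you are welcome"
```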

  • AI is the future of content

    • These models allow sophisticated modelling of semantics

    • We need to be at the forefront of semantic modelling of content to win

    • Some ideas

      • A librarian which knows about your content and can converse with you about it

      • A library that knows about connections (attention-model embeddings?)

      • Content summarisation engine

      • Automatic schema generation from examples

      • Synthetic content generation

      • Auto-clusterings

      • Entity Resolution

      • Anomaly detection

  • Question: Why use HNSW vector database?

    • One requirement for the vector database was to be open source; TerminusDB is an open-source graph database

    • Not many databases can scale to a billion vectors smoothly or offer a nice CLI

    • The main problem is recall, which is the most important thing when dealing with these systems.

      • A product search for “fish and chips” needs close to 100% recall; 80% is not good enough. High recall at scale means something like 99.999% over 1 million documents

    • They are building their own vector database and are proud of its recall and scale (see the sketch after this answer)

    • Vespa also works at billion scale
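
A minimal sketch of an HNSW index with a quick recall check against brute-force search; hnswlib and all parameters are illustrative assumptions, not the database discussed in the talk.

```python
# Sketch: build an HNSW (Hierarchical Navigable Small World) index and estimate
# recall@10 against brute-force search. hnswlib and parameters are illustrative.
import numpy as np
import hnswlib

dim, n = 64, 10_000
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)                      # higher ef -> better recall, slower queries

query = rng.normal(size=(1, dim)).astype(np.float32)
labels, _ = index.knn_query(query, k=10)

# Brute-force ground truth to estimate recall@10 for this query.
sims = (data @ query[0]) / (np.linalg.norm(data, axis=1) * np.linalg.norm(query[0]))
truth = np.argsort(-sims)[:10]
print("recall@10:", len(set(labels[0]) & set(truth)) / 10)
```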

  • Question: Vector support in Postgres?

    • Storage great

    • Indexing is the issue

    • Worth looking at the recall numbers

  • Question: Different use cases you worked on?

    • Library vs other industry use cases

    • Entity recognition for fraud detection, sales, and bankruptcy; matching records across datasets, similar to what is being done here

    • Mixing older, traditional matching techniques with vectors gives the best results

    • Phone numbers and Dates don’t work very well

      • Transformers reason very poorly about time

    • Semantic search

    • Experiments with RAG, but no big RAG project yet; refocusing TerminusDB on very low-touch, easy-to-use RAG with a range of different options

  • Question: Names and disambiguation

    • Names, titles, anything that can be misspelled is a good match for this approach