How to Select an Embedding Model for RAG or Search Using a Vector DB?

Apr 25, 2024

An embedding is an array of numbers (a vector) representing a piece of information, such as text, images, audio, or video.

Types of Embeddings

1. Dense Embeddings

Dense embeddings capture the overall semantic meaning of words or phrases. This makes them suitable for tasks like dense retrieval, which maps each piece of text into a single embedding so that documents can be matched and ranked by content similarity.
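The ranking step can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the similarity measure and toy four-dimensional vectors made up for the example. The names cosine_similarity and rank_documents are hypothetical, and a real embedding model would produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in document_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors invented for illustration, not produced by any real model.
query = [0.9, 0.1, 0.0, 0.3]
documents = {
    "doc_a": [0.8, 0.2, 0.1, 0.4],
    "doc_b": [0.0, 0.9, 0.7, 0.1],
    "doc_c": [0.5, 0.5, 0.5, 0.5],
}

ranking = rank_documents(query, documents)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these toy values, doc_a points in nearly the same direction as the query and ranks first, while doc_b ranks last.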

2. Sparse Embeddings

Sparse embeddings, on the other hand, are representations where most values are zero, emphasizing only relevant information. Sparse vectors focus on relative word weights per document, resulting in a more efficient and interpretable system.
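To make "most values are zero" concrete, here is a small sketch that stores only the non-zero word weights of each document in a dictionary. It assumes a simple TF-IDF-style weighting chosen for illustration, not a learned sparse model, and the function names build_sparse_vectors and sparse_dot are hypothetical.

```python
import math
from collections import Counter


def build_sparse_vectors(docs):
    """Map each document to {term: weight}; absent terms are implicit zeros."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def sparse_dot(a, b):
    """Dot product that only visits terms present in both vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[term] for term, weight in a.items() if term in b)


docs = [
    "vector search with embeddings",
    "sparse vectors weight rare words",
    "dense vectors capture meaning",
]
vectors = build_sparse_vectors(docs)
print(vectors[1])
```

Because each weight is tied to a visible term, it is easy to inspect why two documents match, which is where the interpretability comes from.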

3. Multi-Vector Embeddings

Multi-vector embedding models like ColBERT feature late interaction, where the interaction between query and document representations occurs late in the process, after both have been independently encoded. This approach contrasts with early interaction models, where query and document embeddings interact at earlier stages, potentially leading to increased computational complexity.
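The late interaction step can be sketched with the MaxSim scoring that ColBERT is known for: each query token vector is matched against its most similar document token vector, and those maxima are summed. The code below is a simplified sketch, assuming the per-token vectors have already been produced by an encoder and are unit-normalized; the vectors and the name maxsim_score are made up for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best match among document tokens."""
    return sum(
        max(dot(q, d) for d in doc_vectors)
        for q in query_vectors
    )


# One vector per token. Toy two-dimensional values, not real model output.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
doc_b = [[0.6, 0.8], [0.8, 0.6]]

print(maxsim_score(query_vectors, doc_a))
print(maxsim_score(query_vectors, doc_b))
```

Since documents are encoded independently of the query, their token vectors can be computed and indexed ahead of time, and only this lightweight scoring runs at query time.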

4. Long Context Embeddings

Long documents have always posed a particular challenge for embedding models. The limitation on maximum sequence lengths, often rooted in architectures like BERT, leads practitioners to segment documents into smaller chunks. Unfortunately, this segmentation can result in fragmented semantic meanings and misrepresentation of entire paragraphs. Additionally, producing many chunks per document increases memory usage, computational demands during vector searches, and latencies. Long context embedding models address this by accepting much longer input sequences, so that a document can be represented with far fewer chunks.
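The chunking workaround described above can be sketched as a sliding window over words. This is only an illustration: it assumes whitespace-separated words as a stand-in for model tokens, and the name chunk_words and the default sizes are hypothetical choices, not recommendations.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


document = (
    "Long documents have to be split before they fit into a model "
    "with a short maximum sequence length and each piece is embedded separately"
)
for chunk in chunk_words(document):
    print(repr(chunk))
```

The overlap repeats a few words across neighbouring chunks to soften the fragmentation problem, at the cost of storing and searching more vectors.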

5. Variable Dimension Embeddings

Variable dimension embeddings are a unique concept built on Matryoshka Representation Learning (MRL). MRL learns lower-dimensional embeddings that are nested into the original embedding, akin to a series of Matryoshka Dolls. Each representation sits inside a larger one, from the smallest to the largest "doll". This hierarchy of nested subspaces is learned by MRL, and it efficiently packs information at logarithmic granularities.
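In practice, the nesting means a shorter embedding can be obtained by keeping only the first few dimensions of the full vector and rescaling it to unit length. The sketch below shows that step; the vector values and the name truncate_embedding are made up for illustration, and the truncation is only meaningful for models that were actually trained with MRL.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first dims values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    prefix = vector[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0:
        return prefix
    return [x / norm for x in prefix]


# Toy 8-dimensional vector standing in for the output of an MRL-trained model.
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
small = truncate_embedding(full, 4)
print(len(full), len(small))
```

The smaller vector is cheaper to store and compare, so one possible design is to search with the short prefix first and re-score the top results with the full vector.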

6. Code Embeddings

Code embeddings are a recent development used to integrate AI-powered capabilities into Integrated Development Environments (IDEs), fundamentally transforming how developers interact with codebases. Unlike traditional text search, code embedding offers semantic understanding, allowing it to interpret the intent behind queries related to code snippets or functionalities. Code embedding models are built by training models on paired text data, treating the top-level docstring in a function along with its implementation as a (text, code) pair.

Different types of embeddings commonly used in various applications

1. Word Embeddings

Word2Vec: Developed by Google, Word2Vec represents words as dense vectors, capturing semantic relationships between words based on their co-occurrence patterns in large text corpora.

GloVe (Global Vectors for Word Representation): GloVe is another popular word embedding technique that combines global word co-occurrence statistics with local context information to generate word vectors.

2. Contextual Word Embeddings

ELMo (Embeddings from Language Models): ELMo uses bidirectional LSTMs to generate context-dependent word embeddings. It captures word meanings in different contexts.

BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model that generates deep contextualized word embeddings. It has been influential in various NLP tasks.

3. Transformer-Based Embeddings

Transformer Models: Transformers, like BERT, GPT, T5, etc., produce high-quality embeddings by considering the relationships between words in a sentence. These models are pre-trained on large text corpora and fine-tuned for specific tasks.

4. Document Embeddings

Doc2Vec: An extension of Word2Vec, Doc2Vec generates fixed-length vectors representing documents or paragraphs. It takes into account the semantic information of words within a document.

Paragraph Vectors (PV-DBOW and PV-DM): These are the two variants of the Paragraph Vector method on which Doc2Vec is based, and both create embeddings for paragraphs or documents. PV-DBOW and PV-DM stand for the Distributed Bag of Words and Distributed Memory versions, respectively.

5. Image Embeddings

Convolutional Neural Networks (CNN) Embeddings: CNNs can be used to generate embeddings for images by extracting features at different layers of the network. These embeddings capture visual characteristics of the image.

Pre-trained Models for Image Embeddings: Models like ResNet, VGG, and Inception are often pre-trained on large image datasets. The activations of these models’ layers can be used as image embeddings.

6. Graph Embeddings

Node Embeddings: In graph-based tasks, node embeddings represent nodes (entities) in a graph. Techniques like Node2Vec and GraphSAGE learn embeddings that capture the structural and semantic properties of nodes in a graph.

7. Audio Embeddings

Similar to image embeddings, audio embeddings translate sound into vectors, capturing features like pitch, tone, and rhythm. These are used in voice recognition, music analysis, and sound classification tasks.

8. Video Embeddings

Video embeddings capture both the visual and temporal dynamics of videos. They’re used for activities like video search, classification, and understanding scenes or activities within the footage.

9. Knowledge Graph Embeddings

TransE, TransR, DistMult: These models embed entities and relations in a knowledge graph into continuous vector spaces. They help capture semantic relationships between entities and can be used in link prediction and knowledge graph completion tasks.
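As a concrete example of how such a model scores a candidate link, TransE treats a relation as a translation in the vector space: for a true triple (head, relation, tail), head + relation should land close to tail. The sketch below uses hand-picked two-dimensional vectors rather than trained ones, and the names transe_distance and rank_tails are hypothetical.

```python
import math


def transe_distance(head, relation, tail):
    """Euclidean distance between head + relation and tail (lower is better)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, entities):
    """Order candidate tail entities from most to least plausible."""
    return sorted(
        entities,
        key=lambda name: transe_distance(head, relation, entities[name]),
    )


# Hand-picked toy vectors for illustration, not learned embeddings.
entities = {
    "paris": [1.0, 2.0],
    "france": [2.0, 3.0],
    "tokyo": [5.0, 1.0],
}
capital_of = [1.0, 1.0]

print(rank_tails(entities["paris"], capital_of, entities))
```

Link prediction then amounts to ranking all candidate entities by this distance and keeping the closest ones.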

These are just a few examples of embeddings in different domains. Embeddings play a crucial role in transferring real-world entities into a format that machine learning algorithms can understand and process effectively.

How to Create Vector Embeddings

Choose Your Vector Embedding Model

Decide on the type of model based on your needs. Word2Vec, GloVe, and FastText are popular for word embeddings, while BERT-based encoders and the dedicated embedding models offered by providers such as OpenAI are used for sentence and document embeddings.

Prepare Your Data

Clean and preprocess your data. For text, this can include tokenization, removing “stopwords,” and possibly lemmatization (reducing words to their base form). For images, this might include resizing, normalizing pixel values, etc.

Train or Use Pre-trained Models

You can train your model on your dataset or use a pre-trained model. Training from scratch requires a significant amount of data, time, and computational resources. Pre-trained models are a quick way to get started and can be fine-tuned (or augmented) with your specific dataset.

Generate Embeddings

Once your model is ready, feed your data through it (via SDK, REST, etc.) to generate embeddings. Each item will be transformed into a vector that represents its semantic meaning. Typically, the embeddings are stored in a database, sometimes right alongside the original data.
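Storing each embedding next to its original text can be sketched with an in-memory SQLite table. Everything here is an assumption made for illustration: fake_embed is a placeholder that does not capture meaning and merely stands in for a call to a real model or API, and the table layout and helper names are hypothetical.

```python
import json
import sqlite3


def fake_embed(text, dims=4):
    """Placeholder only: NOT a real embedding model. Swap in a real one."""
    vector = [0.0] * dims
    for i, byte in enumerate(text.encode("utf-8")):
        vector[i % dims] += byte / 255.0
    return vector


def store_documents(conn, texts, embed=fake_embed):
    """Save each text together with its embedding, serialized as JSON."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO documents (text, embedding) VALUES (?, ?)",
        [(text, json.dumps(embed(text))) for text in texts],
    )
    conn.commit()


def load_documents(conn):
    """Return (text, embedding) pairs in insertion order."""
    rows = conn.execute("SELECT text, embedding FROM documents ORDER BY id")
    return [(text, json.loads(embedding)) for text, embedding in rows]


conn = sqlite3.connect(":memory:")
store_documents(conn, ["first document", "second document"])
rows = load_documents(conn)
print(rows[0][0], len(rows[0][1]))
```

A dedicated vector database would add an index for fast similarity search, but the basic idea of keeping text and vector in the same record is the same.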

How to Measure Embedding Performance?

Retrieval Average

Represents the average Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) across several datasets. NDCG is a common metric to measure the performance of retrieval systems. A higher NDCG indicates a model that is better at ranking relevant items higher in the list of retrieved results.
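For a single query, NDCG@10 can be computed as in the sketch below. It assumes graded relevance labels listed in the order the system returned the documents, uses the linear-gain form of DCG (some implementations use 2^rel - 1 as the gain instead), and, as a simplification, builds the ideal ranking from the labels of the retrieved list only. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(
        rel / math.log2(position + 2)  # position 0 gets discount log2(2) = 1
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the same labels in the best possible order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal


# Relevance labels in ranked order: 3 = highly relevant, 0 = not relevant.
print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
print(ndcg_at_k([0, 1]))        # the only relevant result is ranked second
```

A benchmark score of this kind is then the average of the per-query values over every query and dataset.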

Model Size

Size of the model (in GB). It gives an idea of the computational resources required to run the model. While retrieval performance tends to improve with model size, larger models are also slower to run, so model size has a direct impact on latency. This latency-performance trade-off becomes especially important in a production setup.

Max Tokens

Number of tokens that can be compressed into a single embedding. You typically don’t want to put more than a single paragraph of text (~100 tokens) into a single embedding. So even models with max tokens of 512 should be more than enough.

Embedding Dimensions

Length of the embedding vector. Smaller embeddings offer faster inference and are more storage-efficient, while more dimensions can capture nuanced details and relationships in the data. Ultimately, we want a good trade-off between capturing the complexity of data and operational efficiency.

Evaluation metrics

Embedding latency: Time taken to create embeddings

Retrieval quality: Relevance of retrieved documents to the user query

Cost Considerations

Querying Cost

Ensure high availability of the embedding API service, considering factors like model size and latency needs. OpenAI and similar providers offer reliable APIs, while open-source models may require additional engineering efforts.

Indexing Cost

The cost of indexing documents is influenced by the chosen encoder service. Storing the embeddings separately is advisable, as it gives you the flexibility to reset the service or reindex without encoding every document again.

Storage Cost

Storage cost scales linearly with the embedding dimension, so the choice of embeddings, such as OpenAI's 1536-dimensional vectors, impacts the overall cost. Calculate the average number of units (for example, chunks) per document to estimate storage cost.
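A back-of-the-envelope estimate of the raw vector storage can be sketched as follows. It assumes each value is stored as a 4-byte float32 and ignores index structures, metadata, and replication, all of which add overhead in a real vector database. The function name and the example workload are hypothetical.

```python
def estimate_storage_bytes(num_documents, avg_units_per_document,
                           dimensions, bytes_per_value=4):
    """Raw size of the vectors alone: vectors x dimensions x bytes per value."""
    num_vectors = num_documents * avg_units_per_document
    return int(num_vectors * dimensions * bytes_per_value)


# Hypothetical workload: 100,000 documents, about 10 chunks each,
# embedded with 1536 dimensions (as in OpenAI's text-embedding-ada-002).
raw_bytes = estimate_storage_bytes(100_000, 10, 1536)
print(f"{raw_bytes / 1e9:.2f} GB")

# Halving the dimension halves the raw storage.
half_bytes = estimate_storage_bytes(100_000, 10, 768)
print(f"{half_bytes / 1e9:.2f} GB")
```

For this example the raw vectors come to about 6.14 GB, and multiplying that figure by your provider's price per GB gives a first approximation of the storage bill.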

Applications of Vector Embeddings

Natural Language Processing (NLP)

Semantic Search: Improving search relevance and user experience by better utilizing the meaning behind search terms, above and beyond traditional text-based searching.

Sentiment Analysis: Analyzing customer feedback, social media posts, and reviews to gauge sentiment (positive, negative, or neutral).

Language Translation: Understanding the semantics of the source language and generating appropriate text in the target language.

Recommendation Systems

E-commerce: Personalizing product recommendations based on browsing and purchase history.

Content Platforms: Recommending content to users based on their interests and past interactions.

Computer Vision

Image Recognition and Classification: Identifying objects, people, or scenes in images for applications like surveillance, tagging photos, identifying parts, etc.

Visual Search: Enabling users to search with images instead of text queries.

Healthcare

Drug Discovery: Helping to identify interactions, such as those between candidate compounds and their biological targets.

Medical Image Analysis: Diagnosing diseases by analyzing medical images such as X-rays, MRIs, and CT scans.

Finance

Fraud Detection: Analyzing transaction patterns to identify and prevent fraudulent activities.

Credit Scoring: Analyzing financial history and behavior.

Thiyagarajan’s Substack
