How to Select an Embedding Model for RAG or Search Using a Vector DB?
An embedding is an array of numbers (a vector) representing a piece of information, such as text, images, audio, video, etc.
Types of Embeddings
1. Dense Embeddings
Dense embeddings capture the overall semantic meaning of words or phrases, making them suitable for tasks like dense retrieval, which maps each text to a single embedding. This helps match and rank documents effectively based on content similarity.
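As a rough illustration of ranking by content similarity, the sketch below scores documents against a query with cosine similarity. The three-dimensional vectors and the document names are made up for the example; a real model would produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dense embeddings (illustrative values, not model output).
documents = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

# Rank document names from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same kind of comparison, usually with an approximate nearest-neighbour index instead of a full sort.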
2. Sparse Embeddings
Sparse embeddings, on the other hand, are representations where most values are zero, so only the relevant terms carry weight. Sparse vectors focus on relative word weights per document, which makes them efficient to store and easy to interpret.
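One common way to hold a sparse vector is to store only its non-zero entries. The sketch below assumes a plain dictionary of term weights; the terms and weights are invented for illustration and do not come from any particular sparse model.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[term] for term, weight in a.items() if term in b)

# Hypothetical term weights; every term not listed is implicitly zero.
doc_vec = {"embedding": 1.2, "vector": 0.8, "database": 0.5}
query_vec = {"vector": 1.0, "database": 1.0}

score = sparse_dot(query_vec, doc_vec)  # only "vector" and "database" overlap
print(score)
```

Because each non-zero entry is tied to a term, the score can be traced back to the terms that produced it, which is where the interpretability comes from.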
3. Multi-Vector Embeddings
Multi-vector embedding models like ColBERT feature late interaction, where the interaction between query and document representations occurs late in the process, after both have been independently encoded. This approach contrasts with early interaction models, where query and document embeddings interact at earlier stages, potentially leading to increased computational complexity.
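Late interaction in ColBERT-style models is usually scored with a "MaxSim" operator: each query token vector is matched with its most similar document token vector, and those maxima are summed. The sketch below shows that scoring step only, using tiny made-up token vectors; the encoding step that would produce them is assumed to have already run.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """Sum, over query token vectors, of the best dot product with any doc token vector."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Hypothetical per-token embeddings (one vector per token).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.6, 0.4], [0.5, 0.5]]

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)
```

Since documents are encoded independently of the query, their token vectors can be computed and indexed ahead of time, and only this lightweight scoring runs at query time.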
4. Long Context Embeddings
Long documents have always posed a particular challenge for embedding models. The limit on maximum sequence length, often rooted in architectures like BERT, leads practitioners to segment documents into smaller chunks. Unfortunately, this segmentation can fragment semantic meaning and misrepresent entire paragraphs. It also increases memory usage, computational demands during vector searches, and latency. Long context embedding models raise this limit so that longer passages can be embedded in one piece.
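The chunking workaround described here can be sketched as a sliding window over tokens. This is a minimal illustration that assumes whitespace-separated words stand in for model tokens; `max_len` and `overlap` are illustrative parameters, not values tied to a specific model.

```python
def chunk_tokens(tokens, max_len, overlap):
    """Split a token list into windows of at most max_len tokens that share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

words = "long documents are split into overlapping windows before embedding".split()
chunks = chunk_tokens(words, max_len=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap softens the fragmentation problem a little, but every extra chunk is another vector to store and search, which is the cost the paragraph points to.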
5. Variable Dimension Embeddings
Variable dimension embeddings are a unique concept built on Matryoshka Representation Learning (MRL). MRL learns lower-dimensional embeddings that are nested into the original embedding, akin to a series of Matryoshka Dolls. Each representation sits inside a larger one, from the smallest to the largest "doll". This hierarchy of nested subspaces is learned by MRL, and it efficiently packs information at logarithmic granularities.
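In practice, using a smaller "doll" typically means keeping the first few values of the full vector and rescaling the result to unit length. The sketch below shows only that step and assumes the vector came from a model trained with MRL; truncating the output of a model that was not trained this way generally discards information unpredictably. The input values are made up.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale them to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]

full = [3.0, 4.0, 1.0, 2.0]  # hypothetical full-size embedding
small = truncate_embedding(full, 2)
print(small)
```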
6. Code embeddings
Code embeddings are a recent development used to integrate AI-powered capabilities into Integrated Development Environments (IDEs), fundamentally transforming how developers interact with codebases. Unlike traditional text search, code embedding offers semantic understanding, allowing it to interpret the intent behind queries related to code snippets or functionalities. Code embedding models are built by training models on paired text data, treating the top-level docstring in a function along with its implementation as a (text, code) pair.
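To make the (text, code) pairing concrete, the sketch below pulls such pairs out of Python source with the standard `ast` module. It is only an illustration of how the pairs could be collected, not the data pipeline of any specific code embedding model; the `sample` source is invented.

```python
import ast

def docstring_code_pairs(source):
    """Return (docstring, function source) pairs for top-level functions that have a docstring."""
    tree = ast.parse(source)
    pairs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                pairs.append((doc, ast.get_source_segment(source, node)))
    return pairs

sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

pairs = docstring_code_pairs(sample)
for text, code in pairs:
    print(text)
    print(code)
```

Functions without a docstring are skipped because they offer no text side for the pair.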
Different types of embeddings commonly used in various applications
1. Word Embeddings
Word2Vec: Developed by Google, Word2Vec represents words as dense vectors, capturing semantic relationships between words based on their co-occurrence patterns in large text corpora.
GloVe (Global Vectors for Word Representation): GloVe is another popular word embedding technique that combines global word co-occurrence statistics with local context information to generate word vectors.
2. Contextual Word Embeddings
ELMo (Embeddings from Language Models): ELMo uses bidirectional LSTMs to generate context-dependent word embeddings. It captures word meanings in different contexts.
BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based model that generates deep contextualized word embeddings. It has been influential in various NLP tasks.
3. Transformer-Based Embeddings
Transformer Models: Transformers, like BERT, GPT, T5, etc., produce high-quality embeddings by considering the relationships between words in a sentence. These models are pre-trained on large text corpora and fine-tuned for specific tasks.
4. Document Embeddings
Doc2Vec: An extension of Word2Vec, Doc2Vec generates fixed-length vectors representing documents or paragraphs. It takes into account the semantic information of words within a document.
Paragraph Vectors (PV-DBOW and PV-DM): These are the two training variants of the Paragraph Vector model that Doc2Vec implements. PV-DBOW and PV-DM stand for the Distributed Bag of Words and Distributed Memory versions, respectively.
5. Image Embeddings
Convolutional Neural Networks (CNN) Embeddings: CNNs can be used to generate embeddings for images by extracting features at different layers of the network. These embeddings capture visual characteristics of the image.
Pre-trained Models for Image Embeddings: Models like ResNet, VGG, and Inception are often pre-trained on large image datasets. The activations of these models’ layers can be used as image embeddings.
6. Graph Embeddings
Node Embeddings: In graph-based tasks, node embeddings represent nodes (entities) in a graph. Techniques like Node2Vec and GraphSAGE learn embeddings that capture the structural and semantic properties of nodes in a graph.
7. Audio Embeddings
Similar to image embeddings, audio embeddings translate sound into vectors, capturing features like pitch, tone, and rhythm. These are used in voice recognition, music analysis, and sound classification tasks.
8. Video Embeddings
Video embeddings capture both the visual and temporal dynamics of videos. They’re used for activities like video search, classification, and understanding scenes or activities within the footage.
9. Knowledge Graph Embeddings
TransE, TransR, DistMult: These models embed entities and relations in a knowledge graph into continuous vector spaces. They help capture semantic relationships between entities and can be used in link prediction and knowledge graph completion tasks.
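Of the three, TransE has the simplest scoring rule: a triple (head, relation, tail) is plausible when head + relation lands close to tail in the vector space. The sketch below shows that score with hand-picked two-dimensional vectors; the entity and relation names and their values are illustrative only, and TransR and DistMult use different scoring functions that are not shown.

```python
import math

def transe_score(head, relation, tail):
    """Negative Euclidean distance between head + relation and tail; higher means more plausible."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

# Hypothetical embeddings chosen so that paris + capital_of equals france.
paris = [1.0, 2.0]
capital_of = [0.5, -1.0]
france = [1.5, 1.0]
japan = [4.0, 4.0]

good = transe_score(paris, capital_of, france)
bad = transe_score(paris, capital_of, japan)
print(good, bad)
```

Link prediction then amounts to ranking candidate tails (or heads) by this score.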
These are just a few examples of embeddings in different domains. Embeddings play a crucial role in transferring real-world entities into a format that machine learning algorithms can understand and process effectively.
How to Create Vector Embeddings
Choose Your Vector Embedding Model
Decide on the type of model based on your needs. Word2Vec, GloVe, and FastText are popular for word embeddings, while BERT-based sentence encoders and hosted embedding APIs such as OpenAI's are used for sentence and document embeddings.
Prepare Your Data
Clean and preprocess your data. For text, this can include tokenization, removing “stopwords,” and possibly lemmatization (reducing words to their base form). For images, this might include resizing, normalizing pixel values, etc.
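For text, the cleaning steps above might look like the following sketch. The stopword list is a tiny illustrative set, and the regular-expression tokenizer is an assumption made for the example; transformer-based embedding models normally ship their own tokenizer and often need far less manual preprocessing than this.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The quick brown fox is in the garden.")
print(tokens)
```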
Train or Use Pre-trained Models
You can train your model on your dataset or use a pre-trained model. Training from scratch requires a significant amount of data, time, and computational resources. Pre-trained models are a quick way to get started and can be fine-tuned (or augmented) with your specific dataset.
Generate Embeddings
Once your model is ready, feed your data through it (via SDK, REST, etc.) to generate embeddings. Each item will be transformed into a vector that represents its semantic meaning. Typically, the embeddings are stored in a database, sometimes right alongside the original data.
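The sketch below illustrates the "store embeddings alongside the original data" pattern with an in-memory SQLite table. `toy_embed` is a hypothetical stand-in that hashes words into buckets; it is not a real embedding model, and in practice that call would be replaced by your model's SDK or REST request. The table and column names are also assumptions.

```python
import hashlib
import json
import sqlite3

def toy_embed(text, dims=8):
    """Stand-in for a real model call: hashes each word into one of `dims` buckets."""
    vector = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vector[bucket] += 1.0
    return vector

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")

for text in ["vector databases store embeddings", "embeddings represent meaning"]:
    # Keep the original text and its vector in the same row.
    conn.execute(
        "INSERT INTO items (text, embedding) VALUES (?, ?)",
        (text, json.dumps(toy_embed(text))),
    )
conn.commit()

rows = conn.execute("SELECT text, embedding FROM items ORDER BY id").fetchall()
stored = [(text, json.loads(blob)) for text, blob in rows]
print(stored[0])
```

A dedicated vector database replaces the JSON column with a native vector type and an index, but the row shape (original item plus its embedding) stays the same.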
How to Measure Embedding Performance?
Retrieval Average
Represents the average Normalized Discounted Cumulative Gain at rank 10 (NDCG@10) across several datasets. NDCG is a common metric for measuring the performance of retrieval systems. A higher NDCG indicates a model that is better at ranking relevant items higher in the list of retrieved results.
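For a single query, NDCG@k can be computed as in the sketch below. It assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1) and assumes that the input list holds the graded relevance of every judged document in ranked order, so the ideal ordering is just that list sorted. Benchmarks may use an exponential gain instead, so treat this as one common variant.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (ranks start at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0])   # already in the best possible order
swapped = ndcg_at_k([0, 1])         # the only relevant result is ranked second
print(perfect, swapped)
```

The "Retrieval Average" figure is this value averaged over all queries in a dataset and then over the datasets.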
Model Size
Size of the model (in GB). It gives an idea of the computational resources required to run the model. While retrieval performance tends to improve with model size, larger models also take longer to produce each embedding, so model size has a direct impact on latency. The latency-performance trade-off becomes especially important in a production setup.
Max Tokens
Number of tokens that can be compressed into a single embedding. You typically don’t want to put more than a single paragraph of text (~100 tokens) into a single embedding. So even models with max tokens of 512 should be more than enough.
Embedding Dimensions
Length of the embedding vector. Smaller embeddings offer faster inference and are more storage-efficient, while more dimensions can capture nuanced details and relationships in the data. Ultimately, we want a good trade-off between capturing the complexity of data and operational efficiency.
Evaluation metrics
Embedding latency: Time taken to create embeddings
Retrieval quality: Relevance of retrieved documents to the user query
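Both metrics can be measured with very little code. The sketch below times an embedding function and computes recall@k as one simple proxy for retrieval quality. `fake_embed`, the document IDs, and the relevance judgments are all hypothetical placeholders.

```python
import time

def mean_latency_seconds(embed_fn, texts):
    """Average wall-clock time per call to embed_fn over the given texts."""
    start = time.perf_counter()
    for text in texts:
        embed_fn(text)
    return (time.perf_counter() - start) / len(texts)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def fake_embed(text):
    # Hypothetical stand-in for a real embedding call.
    return [float(len(text))]

latency = mean_latency_seconds(fake_embed, ["a", "bb", "ccc"])
recall = recall_at_k(["d3", "d7", "d1"], {"d1", "d2"}, k=3)
print(latency, recall)
```

For a hosted API, the timing would also include network round trips, so it is worth measuring from the same environment the application runs in.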
Cost Considerations
Querying Cost
Ensure high availability of the embedding API service, considering factors like model size and latency needs. OpenAI and similar providers offer reliable APIs, while open-source models may require additional engineering efforts.
Indexing Cost
The cost of indexing documents is influenced by the chosen encoder service. Separate storage of embeddings is advisable for flexibility in service resets or reindexing.
Storage Cost
Storage cost scales linearly with dimension, so the choice of embeddings, such as OpenAI's 1536-dimensional vectors, impacts the overall cost. Calculate the average number of units per document to estimate storage cost.
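A back-of-the-envelope estimate follows directly from the linear scaling. The sketch below assumes 32-bit floats (4 bytes per value), 1536-dimensional vectors, and made-up document counts; it ignores index overhead and metadata, which a real vector database adds on top.

```python
def embedding_storage_bytes(num_documents, units_per_document, dimensions, bytes_per_value=4):
    """Raw size of the stored vectors, excluding index structures and metadata."""
    return num_documents * units_per_document * dimensions * bytes_per_value

# Hypothetical corpus: one million documents, five embedded units (chunks) each.
total = embedding_storage_bytes(1_000_000, 5, 1536)
half = embedding_storage_bytes(1_000_000, 5, 768)

print(f"{total / 1024**3:.1f} GiB at 1536 dimensions")
print(f"{half / 1024**3:.1f} GiB at 768 dimensions")
```

Halving the dimension halves the raw storage, which is the trade-off to weigh against any loss in retrieval quality.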
Applications of Vector Embeddings
Natural Language Processing (NLP)
Semantic Search: Improving search relevance and user experience by better utilizing the meaning behind search terms, above and beyond traditional text-based searching.
Sentiment Analysis: Analyzing customer feedback, social media posts, and reviews to gauge sentiment (positive, negative, or neutral).
Language Translation: Understanding the semantics of the source language and generating appropriate text in the target language.
Recommendation Systems
E-commerce: Personalizing product recommendations based on browsing and purchase history.
Content Platforms: Recommending content to users based on their interests and past interactions.
Computer Vision
Image Recognition and Classification: Identifying objects, people, or scenes in images for applications like surveillance, tagging photos, identifying parts, etc.
Visual Search: Enabling users to search with images instead of text queries.
Healthcare
Drug Discovery: Helping to identify interactions, such as those between candidate drugs and their biological targets.
Medical Image Analysis: Diagnosing diseases by analyzing medical images such as X-rays, MRIs, and CT scans.
Finance
Fraud Detection: Analyzing transaction patterns to identify and prevent fraudulent activities.
Credit Scoring: Analyzing financial history and behavior to assess creditworthiness.