Interacting with Documents in Langchain

Chatting with Documents

In the world of Generative AI, the ability to interact with and extract information from various data sources, including PDF documents, is crucial. Langchain, a framework for building AI applications, provides a seamless way to integrate language models with PDF processing, enabling users to leverage large language models (LLMs) to extract insights and generate relevant responses from PDF content.

RecursiveCharacterTextSplitter in Langchain [1]

The RecursiveCharacterTextSplitter is a text splitter in Langchain that splits text into smaller chunks while preserving semantic meaning. It works through a prioritized list of separators: it first tries to split on paragraph breaks ("\n\n"), then on line breaks ("\n"), then on spaces, and finally on individual characters, recursing with the next separator whenever a piece still exceeds the configured chunk size. This lets it preserve the structure of the text better than a simple character-based splitter.

Why Split Documents into Chunks?

Documents are often split into smaller chunks before storing them in a vector database for a few reasons:

  1. Memory Efficiency: Large documents can consume a lot of memory when stored in their entirety. Splitting them into smaller chunks reduces the memory footprint.
  2. Granular Retrieval: Storing documents as chunks allows for more granular retrieval, where you can find the most relevant parts of a document instead of the entire document.
  3. Parallelization: Splitting documents into chunks enables parallel processing, which can speed up the ingestion and querying process.
  4. Semantic Relevance: Splitting documents at logical boundaries (e.g., paragraphs, sections) can help preserve the semantic meaning of the text, which is important for tasks like question answering.

The chunk_overlap parameter specifies how much of the end of one chunk is repeated at the start of the next. For example, with chunk_overlap set to 20 and chunk_size set to 100, the splitter creates chunks of up to 100 characters each, and the last 20 characters of one chunk reappear as the first 20 characters of the next. This overlap helps keep context that straddles a chunk boundary retrievable from either chunk.
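
As a minimal sketch (assuming a recent Langchain installation where the splitter lives in the langchain-text-splitters package; older versions import it from langchain.text_splitter instead):

```python
# A minimal chunking sketch with RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,    # maximum characters per chunk
    chunk_overlap=20,  # characters repeated between consecutive chunks
)

text = (
    "Langchain provides a seamless way to integrate language models with "
    "document processing. Splitting long documents into overlapping chunks "
    "keeps each piece small enough to embed while preserving context."
)
chunks = splitter.split_text(text)
for chunk in chunks:
    print(repr(chunk))
```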

Embeddings in Langchain

In the context of Langchain, embeddings are high-dimensional vector representations of text, images, or other data that enable similarity search, clustering, and other Generative AI tasks. Embeddings are typically generated by pre-trained language models or specialized embedding models, such as BERT, GPT, or Sentence Transformers; these models take input data and produce a numerical vector that captures its semantic and contextual meaning. Embeddings are a crucial component of Langchain, as they let vector databases and similarity search algorithms efficiently retrieve and process relevant information for a wide range of Generative AI applications, such as question answering, document retrieval, and text generation.
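
As a quick sketch of generating embeddings (assuming the langchain-community and sentence-transformers packages are installed; the model name is just one common choice):

```python
# A hedged sketch: turning text into embedding vectors via Langchain.
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

query_vector = embeddings.embed_query("What is a vector database?")
chunk_vectors = embeddings.embed_documents(["chunk one", "chunk two"])

print(len(query_vector))  # 384 dimensions for this particular model
```

Here are some common evaluation metrics for model embeddings: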

  1. Word Similarity:

    • Measures how well the embeddings capture semantic and syntactic relationships between words.
    • Metrics include cosine similarity, Euclidean distance, and correlation with human-annotated word similarity scores.
  2. Word Analogy:

    • Tests the embeddings' ability to capture linguistic relationships, such as "king is to queen as man is to woman".
    • Measures how well the model can complete analogies.
  3. Concept Categorization:

    • Evaluates how well the embeddings group words into semantic categories.
    • Measures the clustering of words into predefined categories.
  4. Outlier Detection:

    • Tests the embeddings' ability to identify words that are semantically unrelated to a group.
    • Measures how well the model can detect words that don't fit with a set of related words.
  5. Downstream Task Performance:

    • Measures how well the embeddings perform on specific NLP tasks like text classification, named entity recognition, etc.
    • Evaluates the embeddings' usefulness for real-world applications.
  6. Perplexity:

    • Measures how well the language model can predict the next word in a sequence.
    • Lower perplexity indicates better language modeling ability.
  7. Accuracy, F1-score, ROUGE, BLEU, METEOR:

    • Common metrics for evaluating the quality of text generation tasks like translation, summarization, and question answering.
  8. Embedding-based Metrics:

    • Metrics like BERTScore that leverage contextual embeddings to assess semantic similarity between generated and reference text.
  9. Ontology-specific Metrics:

    • Measures the quality of embeddings for ontological concepts, including categorization, hierarchy, and relationships.

The choice of evaluation metrics depends on the specific use case and the properties of the embeddings that need to be assessed. A combination of intrinsic and extrinsic evaluations is often recommended to get a comprehensive understanding of the embedding quality.

Similarity Search Methods

The most common similarity search methods used in Langchain and vector databases are:

  1. Cosine Similarity: Measures the cosine of the angle between two vectors. It is a popular choice for text-based similarity search, as it is effective at capturing semantic similarity.
  2. Dot Product: Calculates the dot product between two vectors. It is a simple and efficient similarity measure, but it can be sensitive to vector magnitude.
  3. Euclidean Distance: Measures the straight-line distance between two vectors. It is useful for finding the most similar items in terms of their numerical values, but it may not capture semantic similarity as well as cosine similarity.

The choice of similarity search method depends on the specific requirements of your application, the nature of your data, and the performance characteristics of the vector database you're using.
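
For intuition, here is how the three measures compare on toy vectors (illustrative NumPy only):

```python
import numpy as np

# Toy vectors pointing in the same direction but with different magnitudes.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0
dot_product = np.dot(a, b)                                       # 28.0
euclidean = np.linalg.norm(a - b)                                # ~3.74

print(cosine, dot_product, euclidean)
```

Note that cosine similarity is 1.0 here even though the vectors differ in length, while the dot product and Euclidean distance are both sensitive to magnitude; this is exactly the trade-off described above.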

Vectors and Vector Databases [2] [3]

A vector is a numerical representation of an object, such as a piece of text, an image, or a sound. Vector databases are specialized databases that store and index these vectors, allowing for efficient similarity search and other vector-based operations.

The main advantages of vector databases over traditional databases are:

  1. Similarity Search: Vector databases excel at finding the most similar items to a given query, which is essential for tasks like recommendation systems, image search, and text retrieval.
  2. Scalability: Vector databases can handle large volumes of high-dimensional data, making them suitable for applications with growing data needs.
  3. Performance: Vector databases are optimized for fast similarity search, often outperforming traditional databases for these types of queries.
  4. Flexibility: Vector databases can handle a wide range of data types, from text to images to audio, making them versatile for various Generative AI applications.

Best DBs for Similarity Search and Generative AI

Some of the best vector databases for similarity search and other Generative AI tasks include:

  1. Meilisearch: A fast and relevant open-source search engine that supports vector search.
  2. Chroma: An open-source embedding database with first-class Langchain integration that provides efficient similarity search.
  3. Pinecone: A managed vector database service that offers scalable and performant vector search.
  4. Faiss: An open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors.
  5. Elasticsearch: A popular search engine that also supports vector search through its built-in dense vector field type and k-nearest-neighbor (kNN) queries.

Apache Cassandra Database & DataStax Astra DB

A NoSQL database. ⬆️ Availability and scalability. Low latency and high throughput

Apache Cassandra is a highly scalable, fault-tolerant, and distributed NoSQL database management system. It is designed to handle large amounts of data across many servers, providing high availability with no single point of failure. Cassandra uses a masterless, peer-to-peer architecture in which every node is equal and any node can act as the coordinator for a given request, distributing data across the cluster. It supports CQL (Cassandra Query Language), an SQL-like language for querying its column-family data model. Cassandra is recommended for use cases that require horizontal scalability, such as real-time analytics, e-commerce data, and IoT data processing.

Figure 1: Apache Cassandra architecture. Image courtesy of Hostinger.

✅ Check out this article by Stackademic

Key differences between Cassandra and other NoSQL databases like MongoDB:

  1. Data Model: Cassandra uses a column-family data model, while MongoDB uses a document-oriented data model.
  2. Scalability: Cassandra is designed for linear horizontal scalability: you add nodes to the cluster as your data grows. MongoDB also scales horizontally, but through sharding, which adds operational complexity.
  3. Consistency: Cassandra prioritizes availability and partition tolerance over strict consistency, following the principles of the CAP theorem. MongoDB offers stronger consistency guarantees.
  4. Query Language: Cassandra uses CQL (Cassandra Query Language), which is similar to SQL. MongoDB uses its own document-oriented query language, which differs from SQL. A short CQL sketch follows this list.
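
To show how close CQL feels to SQL, here is a hedged sketch using the DataStax cassandra-driver package against a local node (the contact point, keyspace, and table names are placeholders):

```python
# Illustrative only: running CQL statements through the Python
# cassandra-driver against a locally running Cassandra node.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (user_id uuid PRIMARY KEY, name text)
""")

for row in session.execute("SELECT user_id, name FROM demo.users LIMIT 10"):
    print(row.user_id, row.name)
```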

DataStax Astra DB is a fully managed, cloud-native Apache Cassandra database-as-a-service (DBaaS) offering. It provides the benefits of Cassandra with the ease of use and scalability of a cloud-hosted service, without the need to manage the underlying infrastructure.

What is Cassio and what are its uses when integrated with Langchain? [4] [5]

A Python library

Cassio is a Python library that lets you use Apache Cassandra as a vector store for your Langchain applications. Its documentation describes it as abstracting "away the details of accessing the Cassandra database for the typical needs of generative artificial intelligence (AI) or other machine learning workloads" and as providing "a low-boilerplate, ready-to-use set of tools for seamless integration of Cassandra in most AI-oriented applications." When integrated with Langchain, Cassio can be used for vector similarity search, chat message history storage, and caching of LLM responses.

💻 Cassio - Documentation & References
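
As a hedged sketch of the typical wiring (the token, database id, and table name below are placeholders; requires the cassio and langchain-community packages):

```python
# A sketch of using Cassio to back a Langchain vector store with
# DataStax Astra DB. Credentials below are placeholders.
import cassio
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Cassandra

cassio.init(
    token="AstraCS:...",        # placeholder Astra DB application token
    database_id="your-db-id",   # placeholder Astra DB database id
)

vector_store = Cassandra(
    embedding=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    ),
    table_name="langchain_demo",  # Cassio creates this table if needed
    session=None,                 # None: reuse the session from cassio.init
    keyspace=None,
)

vector_store.add_texts(["Cassandra is a distributed NoSQL database."])
print(vector_store.similarity_search("What is Cassandra?", k=1))
```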

What are Vector Stores and how do they use embeddings to convert chunks into vectors?

Vector stores are specialized databases that store and index high-dimensional vector representations of data, such as text, images, or audio. These vector representations, known as embeddings, capture the semantic meaning and relationships between the data points.

In the context of Langchain, vector stores are used to store and retrieve the embeddings of document chunks. The process typically involves:

  1. Embedding Generation: The text chunks are passed through a language model or other embedding model to generate numerical vector representations of the content.
  2. Vector Storage: The generated vectors are stored in the vector store, along with metadata about the original document and chunk.
  3. Similarity Search: When a user query is received, it is also converted into a vector representation. The vector store can then be queried to find the most similar document chunks based on the vector similarity (e.g., cosine similarity, dot product, Euclidean distance).

This approach allows for efficient and semantically aware retrieval of relevant document chunks, which is crucial for tasks like question answering, document search, and content recommendation.
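
Putting the three steps together, here is a minimal sketch with the FAISS vector store (see footnotes 2 and 3; requires the faiss-cpu package):

```python
# End-to-end sketch: split, embed, store, and query with FAISS.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Langchain integrates large language models with external data. "
    "Documents are split into chunks, embedded, and stored as vectors."
)
chunks = RecursiveCharacterTextSplitter(
    chunk_size=100, chunk_overlap=20
).split_text(text)

# Steps 1-2: embed each chunk and store the vectors with their text.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
store = FAISS.from_texts(chunks, embeddings)

# Step 3: embed the query and retrieve the most similar chunks.
for doc in store.similarity_search("How does Langchain use documents?", k=2):
    print(doc.page_content)
```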

What are Vector Indices?

Vector indices are specialized data structures used in vector stores to efficiently store and query high-dimensional vector data. They enable fast similarity search and retrieval of the most relevant vectors based on a given query vector.

One common example of a vector index is the Hierarchical Navigable Small World (HNSW) index. HNSW is a graph-based index that organizes the vectors into a multi-level hierarchy, allowing for efficient nearest-neighbor search.

Here's an example of how HNSW works (a code sketch follows the list):

  1. Index Creation: When adding a new vector to the index, the HNSW algorithm determines where to place it in the hierarchical graph structure based on the vector's proximity to other vectors.
  2. Query Processing: To find the most similar vectors to a given query vector, the algorithm starts at the top level of the hierarchy and navigates down through the levels, following the links to the most promising regions of the graph.
  3. Optimization: The HNSW index is designed to minimize the number of distance computations required during the search process, making it highly efficient for large-scale vector databases.
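
As an illustrative sketch with Faiss's HNSW implementation (requires the faiss-cpu package; the parameter values are arbitrary):

```python
import faiss  # pip install faiss-cpu
import numpy as np

d = 384                             # vector dimensionality
index = faiss.IndexHNSWFlat(d, 32)  # 32 = M, links per node in the graph
index.hnsw.efSearch = 64            # search breadth: higher = more accurate

# Index creation: each added vector is placed into the hierarchical graph.
vectors = np.random.rand(1000, d).astype("float32")
index.add(vectors)

# Query processing: navigate the graph toward the nearest neighbors.
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # 5 approximate nearest neighbors
print(ids)
```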

Other approximate nearest-neighbor libraries include Faiss (Facebook AI Similarity Search), Annoy (Approximate Nearest Neighbors Oh Yeah), and NMSLIB (Non-Metric Space Library), each with its own strengths and trade-offs in terms of performance, memory usage, and supported features.

The choice of vector index depends on the specific requirements of the application, such as the size of the vector database, the desired search accuracy, and the available computational resources.

Code 🚧

This codebase is a Work in Progress (WIP). 🚧

Please refer to my GitHub repository for Langchain for the latest updates.


Citations

[1] https://blog.meilisearch.com/langchain-semantic-search-tutorial/
[2] https://python.langchain.com/docs/integrations/vectorstores/faiss
[3] https://python.langchain.com/docs/integrations/vectorstores/faiss_async
[4] https://github.com/langchain-ai/langchain/discussions/9984
[5] https://stackoverflow.com/questions/76678783/langchains-chroma-vectordb-similarity-search-with-score-and-vectordb-simil