Document Content Querying with LangChain

In a rapidly advancing world of AI, staying up to date with the latest information is essential. However, when it comes to extracting insights from lengthy documents such as research papers, the limitations of language models like GPT-4 present a real challenge. Imagine you have a crucial task at hand: extracting valuable insights from a lengthy PDF or a recently published research paper. But here's the catch: it spans multiple pages, making the conventional method of copying the entire content and posing questions a Herculean task, especially given the token limits of LLMs. What if we told you there's a game-changing solution on the horizon?

LangChain, a versatile open-source developer framework tailored for LLM applications, has been making waves in the world of natural language processing. While ChatGPT has a knowledge cutoff, LangChain empowers users to augment LLMs with their own up-to-date data. In this in-depth exploration, our spotlight turns to a specific application of LangChain: leveraging it to converse with our own data. Let's delve into the mechanics behind LangChain and discover how it revolutionizes the way we interact with documents, making data extraction from lengthy, multi-page PDFs and research papers a breeze.

Document Processing Pipeline

As an initial step, install the following packages: langchain; openai; pypdf, to load the contents of PDFs; faiss-cpu, which acts as a vector database where embeddings are stored; and tiktoken, for tokenization.

pip install langchain pypdf openai faiss-cpu tiktoken

Import the required libraries as shown below:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

Set the API key for the OpenAI API (any LLM can be used) in an environment variable.

import os
os.environ["OPENAI_API_KEY"] = " "
  • Document Loading

Document loading involves handling data from diverse sources and formats, be it structured data from websites and databases or unstructured data like YouTube videos and Twitter feeds. Notably, LangChain offers a wide range of document loaders, with over 80 available. These loaders can handle various data types, such as PDF, HTML, JSON, Word, PowerPoint, or tabular formats, and convert them into a standardized document format enriched with content and metadata. Some of the loaders include PyPDFLoader for processing PDFs, YouTubeAudioLoader for handling content from YouTube, and WebBaseLoader, which specializes in loading data from URLs, to name just a few.
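
For instance, loading a web page instead of a PDF is simply a matter of swapping in a different loader. A minimal sketch (the URL here is hypothetical):

from langchain.document_loaders import WebBaseLoader

# Fetch and parse the page into Document objects
web_loader = WebBaseLoader("https://example.com/article")
web_documents = web_loader.load()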

from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/autonomous_language_agents.pdf")

# Load the document into a list of page-level Document objects
documents = loader.load()

This returns a list of Document objects, with each Document representing a single page of the PDF. The index of the list corresponds to the page's position within the document. For example, documents[0] represents the content of the first page, documents[1] represents the content of the second page, and so forth.
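
As a quick sanity check, we can inspect the loaded pages (using the documents list from above):

print(len(documents))                   # number of pages loaded
print(documents[0].page_content[:200])  # first 200 characters of page 1
print(documents[0].metadata)            # e.g. {'source': '...', 'page': 0}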

  • Document Splitting

Once the documents are loaded, the next step often involves dividing them into more manageable chunks for efficient processing. This is a tricky task, primarily because it necessitates preserving the meaningful connections between these segments.

# Split the pages into overlapping chunks of ~1000 characters
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(documents)

The input document is divided into segments determined by a specified chunk size, often measured in characters or tokens, and with a designated overlap between these chunks. The overlap introduces a small amount of redundancy between adjacent chunks, preserving a sense of continuity between them.
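
To see the effect of these parameters at a small scale, here is a purely illustrative sketch with a short string and tiny chunk sizes (splitting on spaces rather than the default paragraph separator):

demo_splitter = CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=5)
chunks = demo_splitter.split_text("LangChain splits long documents into overlapping chunks for retrieval")
for chunk in chunks:
    print(repr(chunk))  # adjacent chunks share a few trailing characters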

  • Vector Store and Embeddings

Embeddings map a piece of text to a vector, a coordinate-based representation of its meaning. An embedding model can be used to convert each text chunk into such a vector. When vectors lie close to one another, it signifies that the corresponding pieces of text share a similar meaning or context. These embedding vectors, together with their corresponding text chunks, are stored within a vector store.
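
To make vector proximity concrete, here is a minimal sketch comparing embeddings directly (the example phrases are hypothetical, and OPENAI_API_KEY must be set):

import numpy as np

emb = OpenAIEmbeddings()
v1 = np.array(emb.embed_query("machine learning"))
v2 = np.array(emb.embed_query("artificial intelligence"))
v3 = np.array(emb.embed_query("banana bread recipe"))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(v1, v2))  # higher score: related concepts
print(cosine(v1, v3))  # lower score: unrelated concepts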

# Generate embeddings with OpenAI and index the chunks in FAISS
embeddings = OpenAIEmbeddings()
vectorstore_contents = FAISS.from_documents(documents, embeddings)

We use OpenAI to generate embeddings for our documents and store these embeddings within a FAISS vector store.
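
Before wiring up a full chain, we can query the vector store directly. A minimal sketch, with a hypothetical question:

# Retrieve the three chunks most similar to the query
docs = vectorstore_contents.similarity_search("What are autonomous language agents?", k=3)
for doc in docs:
    print(doc.page_content[:100])  # preview each retrieved chunk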

  • Retrieval and Question Answering

Retrieval becomes pivotal during the querying process, enabling us to locate the most relevant splits when a query is initiated. We establish the RetrievalQA chain, incorporating the vector store as our primary source of information. Behind the scenes, this process retrieves only the relevant data from the vector store, driven by the semantic similarity between the prompt and the stored information. We then invoke the qa_chain with our question of interest to obtain an answer.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Build a QA chain that retrieves relevant chunks from the vector store
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore_contents.as_retriever()
)

question = " "
result = qa_chain({"query": question})
result["result"]

There are other chains similar to RetrievalQA for question answering over documents, including load_qa_chain and ConversationalRetrievalChain; a sketch of the latter follows.
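
ConversationalRetrievalChain adds chat history on top of retrieval, so follow-up questions can refer back to earlier ones. A minimal sketch, reusing the retriever from above (the question is hypothetical):

from langchain.chains import ConversationalRetrievalChain

conv_chain = ConversationalRetrievalChain.from_llm(
    llm=OpenAI(),
    retriever=vectorstore_contents.as_retriever()
)

chat_history = []
response = conv_chain({"question": "What is the paper about?", "chat_history": chat_history})

# Record the exchange so the next question can build on it
chat_history.append(("What is the paper about?", response["answer"]))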

Conclusion

LangChain is paving the way for a transformative era in natural language processing, particularly in the realm of information extraction and question answering. By seamlessly integrating external documents and enhancing language models with relevant information, it's redefining the landscape of language understanding and generation. Retrieval-augmented generation (RAG) is a shining example of how LangChain is at the forefront of innovation, pushing the boundaries of question answering.

While challenges remain, the possibilities it unlocks are boundless, offering a promising future for AI-driven insights and knowledge extraction from lengthy documents. Say goodbye to the days of laborious manual copying and pasting: LangChain promises a future where information extraction is effortless and efficient.

