How to insert a large amount of custom data into ChatGPT and perform a search on it?

OpenAI's ChatGPT is a powerful general-purpose generative text AI that can provide personalized responses to individuals. However, it cannot answer questions about specific information it has never seen. Using LangChain, we can supply custom data to ChatGPT in the form of PDF, text, or CSV files and perform searches on this custom dataset.

Introduction :

Free users are limited to 4096 characters per prompt, so we cannot supply custom data beyond this limit to give ChatGPT a full understanding of it. We also cannot pass a text, PPT, or PDF document directly to ChatGPT as input. LangChain is a tool that overcomes these problems by providing additional functionality on top of LLMs. Using LangChain, we can give a large number of PDF, text, and PPT documents as input.

LangChain is a Python library designed to simplify building applications on top of large language models (LLMs). It provides document loaders, text splitters, embedding wrappers, vector store integrations, and chains that connect these pieces to an LLM, which makes it an efficient tool for processing and querying textual data.

With LangChain, developers can integrate LLM capabilities into their Python projects and extract valuable insights from text data. The library provides a user-friendly interface, so tasks such as loading documents and querying them take just a few lines of code.

Flow Diagram :

Steps :

Step 1: Generate an OpenAI API key
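
Once the key is generated, it is safer to keep it out of the source code. The sketch below reads it from an environment variable; the variable name OPENAI_API_KEY and the helper get_openai_api_key are assumptions made for illustration, not something this article's code requires.

```python
import os


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption; use whatever name you exported.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of a confusing API error later.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your OpenAI API key."
        )
    return key
```

With the variable exported in your shell, `openai_api_key = get_openai_api_key()` gives you the value used in the later steps.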


Step 2: Install Python packages

  • Install the following Python packages:

  • OpenAI

pip install openai
  • LangChain

pip install langchain
  • Other optional Python packages

pip install python-magic
pip install pypdf
pip install faiss-cpu
pip install pdf2image
pip install unstructured
pip install nltk
pip install tiktoken
pip install tabulate
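
A quick way to confirm the installation is to check which modules Python can find. This is a minimal sketch; the helper missing_modules and the REQUIRED list are hypothetical names. Note that some import names differ from the pip package names (faiss-cpu is imported as faiss, python-magic as magic).

```python
from importlib.util import find_spec

# Import names (not pip names) of the packages used in this walkthrough.
REQUIRED = ["openai", "langchain", "faiss", "pypdf", "tiktoken", "magic", "nltk"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", missing if missing else "none")
```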


Step 3: Import required Python packages.

import os

import magic
import nltk
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import (
    DirectoryLoader,
    OnlinePDFLoader,
    PyPDFDirectoryLoader,
    PyPDFLoader,
    UnstructuredPDFLoader,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS


Step 4: Python code to import data & perform the search operation.

  • To import a large number of files from a directory, use DirectoryLoader from LangChain.

# Get your loader ready
loader = DirectoryLoader('../data/PaulGrahamEssaySmall/', glob='**/*.txt')

# Load up your text into documents
documents = loader.load()
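
To see which files a pattern such as '**/*.txt' would pick up before loading them, you can list them with the standard library. This is only a sketch of recursive glob matching using pathlib, not DirectoryLoader's own implementation; find_text_files is a hypothetical helper name.

```python
from pathlib import Path


def find_text_files(root, pattern="**/*.txt"):
    """Return all files under root that match the glob pattern, sorted.

    With pathlib, '**/' matches the root directory itself and every
    subdirectory below it, so nested .txt files are included.
    """
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())
```

For example, `find_text_files('../data/PaulGrahamEssaySmall/')` lists the text files under the directory used above.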
  • Split your documents into individual pieces for efficient search using RecursiveCharacterTextSplitter from LangChain.

# Get your text splitter ready
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split your documents into texts
texts = text_splitter.split_documents(documents)
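
To get a feel for what chunk_size and chunk_overlap control, here is a simplified fixed-width splitter in plain Python. It is an illustration only: RecursiveCharacterTextSplitter prefers to break on natural separators such as paragraphs and newlines, which this sketch does not attempt. The function name split_text is hypothetical.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters, so a sentence cut
    at a boundary still appears whole in one of the two pieces.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
        start += step
    return chunks
```

A larger overlap costs more storage and embedding calls but reduces the chance that an answer is split across two pieces.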
  • Convert the document pieces into embeddings.

  • OpenAI's embedding models are trained using large-scale unsupervised learning techniques, typically on massive amounts of text data from diverse sources such as books, articles, websites, and more. OpenAIEmbeddings is the LangChain class that calls them.

  • The embeddings are numerical vectors that encode the semantic meaning of words or texts. Similar words or texts will have similar vector representations.

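# Assumption: the key from Step 1 is stored in the OPENAI_API_KEY environment variable
openai_api_key = os.environ["OPENAI_API_KEY"]

# Turn your texts into embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

# Get your docsearch ready
docsearch = FAISS.from_documents(texts, embeddings)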
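
The idea that similar texts have similar vectors can be shown with a few toy vectors. The sketch below ranks documents by cosine similarity to a query vector. It is only an illustration of vector search: FAISS uses its own optimized indexes and distance metrics, and real embeddings have many more dimensions. The names cosine_similarity and top_k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indexes of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The vector store does the same kind of ranking, only over the embeddings of your document pieces and the embedding of the question.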
  • Load the OpenAI LLM.

# Load up your LLM
llm = OpenAI(openai_api_key=openai_api_key)
  • Use RetrievalQA from LangChain to perform a search.

  • RetrievalQA refers to a task that focuses on answering questions by retrieving relevant information from a given collection of documents. It involves using retrieval techniques to search for the most appropriate documents or passages that contain the answer to a given question.

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
    return_source_documents=True,
)
query = "What is the download speed of AT&T?"
result = qa({"query": query})
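
The "stuff" chain type places all retrieved passages into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording and the name build_stuff_prompt are assumptions made for illustration; LangChain uses its own prompt template.

```python
def build_stuff_prompt(question, passages):
    """Combine every retrieved passage and the question into one prompt."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Because every passage goes into one prompt, this approach only works while the retrieved passages together fit within the model's input limit.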
Result :

  • Enter your question in the query variable.

query = "What is the download speed of AT&T?"
  • Execute the Python code to get the search results.
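
The value returned by the chain is a dictionary. With return_source_documents=True it holds the answer under "result" and the retrieved pieces under "source_documents". The sketch below formats such a dictionary for printing; format_answer is a hypothetical helper, fake_result is a stand-in with made-up values so the snippet runs without an API key, and the "source" metadata key is an assumption about what the loader records.

```python
from types import SimpleNamespace


def format_answer(result):
    """Return the answer text followed by one line per source document."""
    lines = [f"Answer: {result['result'].strip()}"]
    for doc in result.get("source_documents", []):
        # Assumption: the loader stored the file path under metadata["source"].
        lines.append(f"Source: {doc.metadata.get('source', 'unknown')}")
    return "\n".join(lines)


# Stand-in for the dictionary returned by qa({"query": query}); values are made up.
fake_result = {
    "query": "An example question?",
    "result": " An example answer. ",
    "source_documents": [
        SimpleNamespace(page_content="...", metadata={"source": "example.txt"}),
    ],
}
print(format_answer(fake_result))
```

In the real script, call `print(format_answer(result))` after running the chain.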
