ArXiv chatbot: building a RAG agent with LangChain and GPT-3.5
I spent some time last week building a chatbot that can answer questions based on the abstracts of more than 2.4 million articles on the ArXiv. The bot is built on top of GPT-3.5 (the LLM that powered an earlier version of ChatGPT) and uses Retrieval Augmented Generation (RAG). The repository with the code is here and you can chat with the final product here.
What is Retrieval Augmented Generation?
Introduced in a 2020 paper, Retrieval Augmented Generation (RAG) is a technique for augmenting the knowledge of a general-purpose LLM (or more generally any kind of generative AI tool) by retrieving information from external sources (often called documents). The idea is that access to this kind of specific external knowledge can help the model be more accurate, more up-to-date, less prone to hallucinations, or all of these.
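To make the idea concrete, here is a minimal, self-contained sketch of the retrieve-then-generate pattern. Everything in it is an illustrative assumption rather than the bot's actual code: the three toy documents, the word-overlap scoring (a crude stand-in for embedding similarity), and the helper names `retrieve` and `build_prompt`. The resulting prompt is what would be handed to the LLM.

```python
import re

# Toy "external knowledge"; in the real bot these would be ArXiv abstracts.
DOCUMENTS = [
    "Loop quantum gravity is a candidate theory of quantum gravity.",
    "Convolutional networks are widely used for image classification.",
    "Retrieval augmented generation combines a retriever with a generator.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Prepend the retrieved context to the question (the "augmentation" step)."""
    context_block = "\n".join(f"- {doc}" for doc in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}"


question = "What is quantum gravity?"
context = retrieve(question, DOCUMENTS)
prompt = build_prompt(question, context)
print(prompt)
```

A real system swaps the word-overlap score for vector similarity over embeddings, which is what the rest of this post sets up.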
ChatGPT, while being an incredibly useful tool, can often reach the limits of its capabilities or generate hallucinations when asked specific questions in scientific/technical domains. It also does not reveal how it arrived at a certain response or what sources it used. The idea behind the ArXiv Chatbot is to use a RAG approach to help GPT-3.5 overcome these limitations.
Data Preparation
The foundation of ArXiv Chatbot is the extensive dataset from ArXiv, which I obtained from Kaggle. This dataset contains metadata for approximately 2.4 million papers. Here’s how I processed the data:
Downloading and loading the data
The dataset was downloaded from Kaggle. I narrowed it down to the following five features: id, title, authors, abstract, and categories.
from langchain_community.document_loaders import DataFrameLoader
import pandas as pd
# Load dataset
df = pd.read_csv('path_to_arxiv_data.csv')
# Select relevant features
df = df[['id', 'title', 'authors', 'abstract', 'categories']]
# Load into DataFrameLoader
loader = DataFrameLoader(df, page_content_column="abstract")
docs = loader.load()
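Roughly speaking, DataFrameLoader turns each row into one document: the column named by page_content_column becomes the text that gets embedded, and the remaining columns travel along as metadata. The plain-Python sketch below mimics that mapping. The helper `row_to_document` and the placeholder record are assumptions made for illustration, not part of LangChain or of the dataset.

```python
def row_to_document(row: dict, page_content_column: str = "abstract") -> dict:
    """Split one record into the text to embed and its accompanying metadata."""
    metadata = {key: value for key, value in row.items() if key != page_content_column}
    return {"page_content": row[page_content_column], "metadata": metadata}


# Placeholder record with the five selected features (values are made up).
row = {
    "id": "0000.00000",
    "title": "An example title",
    "authors": "A. Author, B. Author",
    "abstract": "An example abstract.",
    "categories": "hep-th",
}

doc = row_to_document(row)
print(doc["page_content"])
print(sorted(doc["metadata"]))
```

Keeping the id, title, authors, and categories as metadata means they come back alongside each retrieved abstract, which is useful for citing sources in the answer.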
Embedding and Upserting into Pinecone
I chose BAAI/bge-small-en-v1.5 for generating the embeddings. This can be downloaded for free from Hugging Face.
# Use a GPU if we have one
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
# Load the embedding
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": device}
encode_kwargs = {
    "normalize_embeddings": True,
    "show_progress_bar": True,
    "batch_size": 128
}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
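A note on normalize_embeddings: once vectors are scaled to unit length, their dot product equals their cosine similarity, so a dot-product index ranks results the same way a cosine index would. The short sketch below checks this on two arbitrary example vectors; the helper names are mine and the vectors are made up.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]

dot_of_normalized = dot(normalize(a), normalize(b))
print(math.isclose(dot_of_normalized, cosine(a, b)))
```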
Next I connected to the Pinecone index I had already set up using the Pinecone web interface:
from pinecone import Pinecone
pc = Pinecone(api_key=pinecone_key)
index = pc.Index("arxiv-abstracts")
which had been configured with dimensions set to 384 and metric as dotproduct. Then I set up the PineconeVectorStore object for implementing the upserts:
from langchain_pinecone import PineconeVectorStore
index_name = "arxiv-abstracts"
# Connect to Pinecone index to insert the chunked docs as contents
docsearch = PineconeVectorStore(
    index=index,
    embedding=hf,
    text_key="text",
    distance_strategy='dotproduct'
)
and finally upserted the documents:
# add_documents embeds the documents, upserts them, and returns their IDs
ids = docsearch.add_documents(docs)
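With millions of abstracts, one possible refinement (not part of the code above) is to upsert in batches, so that a failure partway through does not throw away all progress. The helper `batched` and the batch size of 1000 are assumptions made for this sketch.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Small demonstration on plain integers.
batches = list(batched(list(range(5)), 2))
print(batches)

# Hypothetical usage with the objects defined earlier in this post:
# for batch in batched(docs, 1000):
#     docsearch.add_documents(batch)
```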
Backend Implementation
The backend of the chatbot is implemented using FastAPI and deployed on Google Cloud Run using a Docker image. Here’s a breakdown of the implementation:
FastAPI Setup
FastAPI provides a fairly straightforward way to handle GET and POST requests. The following is an excerpt from the backend/main.py file where this is implemented.
from fastapi import FastAPI
import query
app = FastAPI()
@app.post("/query")
async def chat(user_id: int, message: str):
    # Additional logic for figuring out the agent
    # based on the user id omitted.
    response = query.connect(agent, message)
    response['user_id'] = user_id
    return response
where the agent and its logic are implemented in the query module.
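Because user_id and message are declared as plain int and str parameters, FastAPI reads them from the query string rather than from a JSON body, even though the route is a POST. The sketch below builds such a request with the standard library without sending it. The base URL is a placeholder and `build_query_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder address; replace with wherever the backend is actually running.
BASE_URL = "http://localhost:8000"


def build_query_request(user_id: int, message: str) -> Request:
    """Build a POST request whose parameters are carried in the query string."""
    params = urlencode({"user_id": user_id, "message": message})
    return Request(f"{BASE_URL}/query?{params}", method="POST")


req = build_query_request(1, "What is quantum gravity?")
print(req.get_method(), req.full_url)

# To actually send it (requires a running backend):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(resp.read().decode())
```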
Creating the RAG Agent
This involved several steps. First I set up a global prompt,
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.chat import MessagesPlaceholder
def create_prompt(memory_key: str = "history"):
    system_prompt = """
    You are an intelligent agent equipped with a tool
    named "ArXiv-search" that can answer
    technical and scientific questions.
    When asked to answer a technical or scientific
    question, you should use "ArXiv-search" to
    handle the input.
    Instructions:
    1. If the input asks you to answer a technical
    or scientific question, use "ArXiv-search".
    2. If the input is a general query, answer it
    to the best of your knowledge.
    Examples:
    - Input: "Which is the largest ocean?"
    Action: Answer "The Pacific Ocean".
    - Input: "What is quantum gravity?"
    Action: Use "ArXiv-search" to answer the question.
    """
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(memory_key),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad")
    ])
    return prompt
PROMPT = create_prompt()
then the LLM (note that for this to work the OpenAI API key needs to be set in the OPENAI_API_KEY environment variable; alternatively you can pass it as an argument to ChatOpenAI),
from langchain_openai import ChatOpenAI
LLM = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0
)
and retriever:
RETRIEVER = docsearch.as_retriever()
I then made a wrapper around the base agent/executor that contains all of the objects needed to create and run a RAG chain within a conversation agent with memory (i.e. one that remembers the last few messages exchanged with the user for additional context). It also contains methods for invoking the individual elements when needed:
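from dataclasses import dataclass, field
from typing import Any, List

# Import paths below assume a LangChain 0.1/0.2-era installation
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.chains import ConversationChain, create_retrieval_chain
from langchain.chains.base import Chain
from langchain.memory import ConversationSummaryBufferMemory
from langchain_core.memory import BaseMemory
from langchain_core.runnables import Runnable
from langchain_core.tools import BaseTool, Tool

@dataclass
class ConversationAgentWrapper:
    memory_key: str = "history"
    input_key: str = "input"
    # The fields below are built in __post_init__, not passed by the caller
    memory: BaseMemory = field(init=False)
    chain: Chain = field(init=False)
    tools: List[BaseTool] = field(init=False)
    agent: Runnable = field(init=False)
    executor: AgentExecutor = field(init=False)

    def __post_init__(self):
        # Order matters: the chain needs the memory, the tools need the chain,
        # and the agent and executor need the tools.
        self.memory = self.create_memory()
        self.chain = self.create_chain()
        self.tools = self.create_tools()
        self.agent = self.create_agent()
        self.executor = self.create_executor()

    def create_memory(self) -> BaseMemory:
        # Keeps recent messages verbatim and summarizes older ones
        # once the buffer exceeds max_token_limit tokens
        return ConversationSummaryBufferMemory(
            llm=LLM,
            max_token_limit=650,
            memory_key=self.memory_key,
            return_messages=True,
            input_key=self.input_key
        )

    def create_chain(self) -> Chain:
        conversation_chain = ConversationChain(
            llm=LLM,
            memory=self.memory
        )
        return create_retrieval_chain(
            RETRIEVER,
            conversation_chain
        )

    def create_tools(self) -> List[Tool]:
        # Expose the retrieval chain to the agent as a single tool
        return [
            Tool(
                name="ArXiv-search",
                func=self.invoke_chain,
                description=("""
                use this tool when answering questions to get more
                information about a scientific or technical topic
                """)
            )
        ]

    def create_agent(self) -> Runnable:
        return create_openai_functions_agent(
            llm=LLM,
            tools=self.tools,
            prompt=PROMPT
        )

    def create_executor(self,
        remember_intermediate_steps: bool = True,
        verbose: bool = True,
        **kwargs: Any,
    ) -> AgentExecutor:
        return AgentExecutor(
            agent=self.agent,
            tools=self.tools,
            memory=self.memory,
            verbose=verbose,
            return_intermediate_steps=remember_intermediate_steps,
            **kwargs,
        )

    def invoke_chain(self, query: str) -> Any:
        print("Chain invoked")
        return self.chain.invoke({self.input_key: query})

    def invoke_executor(self, user_input: str) -> Any:
        return self.executor.invoke({self.input_key: user_input})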
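The memory used here keeps the most recent messages verbatim and, once they exceed the token limit, folds the oldest ones into a running summary. The toy version below illustrates that behaviour under simplifying assumptions: tokens are counted by splitting on whitespace, and `naive_summarize` simply concatenates the dropped messages, whereas the real class asks the LLM to write the summary. `ToySummaryBuffer` is a made-up name, not a LangChain class.

```python
from dataclasses import dataclass, field
from typing import List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words."""
    return len(text.split())


def naive_summarize(summary: str, dropped: List[str]) -> str:
    """Stand-in for an LLM summarizer: append the dropped messages to the summary."""
    return (summary + " " + " | ".join(dropped)).strip()


@dataclass
class ToySummaryBuffer:
    max_token_limit: int
    summary: str = ""
    buffer: List[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Store a message, moving the oldest ones into the summary if over budget."""
        self.buffer.append(message)
        dropped: List[str] = []
        while sum(count_tokens(m) for m in self.buffer) > self.max_token_limit:
            dropped.append(self.buffer.pop(0))
        if dropped:
            self.summary = naive_summarize(self.summary, dropped)


memory = ToySummaryBuffer(max_token_limit=5)
memory.add("a b c")   # 3 tokens: fits in the buffer
memory.add("d e f")   # 6 tokens in total: the oldest message is summarized away
print(memory.summary, memory.buffer)
```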
The last step was the function that actually gets called when a POST request is made:
import warnings

def connect(agent, query):
    # Silence library warnings so they do not clutter the server logs
    warnings.simplefilter('ignore')
    response = agent.invoke_executor(query)
    return response
All of this is in the backend/query.py file.
Docker Deployment
Deploying to Google Cloud Run took a while to figure out. Ultimately I had to use this repo as a guide to write my Dockerfile. The main issue was ASGI vs WSGI: FastAPI is an ASGI application, so the container has to start it with uvicorn rather than a WSGI server.
# Dockerfile
FROM python:3.9-slim
ENV PYTHONUNBUFFERED True
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
ENV PORT 80
RUN pip install -r requirements.txt
CMD exec uvicorn main:app --host 0.0.0.0 --port ${PORT} --workers 1
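The ${PORT} in the last line matters because Cloud Run passes the port the container should listen on through the PORT environment variable, which takes precedence over the default set in the Dockerfile. For anyone who prefers to start the server from Python instead of the shell, the lookup can be sketched as follows. The helper `resolve_port` is hypothetical and the fallback of 80 mirrors the Dockerfile above.

```python
import os
from typing import Mapping, Optional


def resolve_port(env: Optional[Mapping[str, str]] = None, default: int = 80) -> int:
    """Return the port from the PORT environment variable, or the default."""
    env = os.environ if env is None else env
    return int(env.get("PORT", default))


print(resolve_port({"PORT": "8080"}))
print(resolve_port({}))

# Hypothetical usage inside main.py (requires uvicorn to be installed):
# import uvicorn
# uvicorn.run("main:app", host="0.0.0.0", port=resolve_port())
```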
Frontend Implementation
The frontend is a Streamlit app that communicates with the FastAPI backend. This setup allows for a simple and interactive user experience. The code for this is fairly self-explanatory and is in streamlit_app.py.
Evaluation
I had no idea how RAG implementations are evaluated, so I looked around, found the Ragas framework, and ended up using it. I followed the guide here and the code is available in backend/ragas_test.py. Here are the results I got:
Metric | Score |
---|---|
answer_relevancy | 0.9494 |
answer_correctness | 0.6239 |
answer_similarity | 0.9308 |
faithfulness | 0.8448 |
context_recall | 0.8593 |
context_precision | 0.9784 |
There are a couple of drawbacks to the specific evaluation method I used:
- The test sample size of 16 is quite small. Unfortunately Ragas’s generation process is quite token-heavy and those tokens cost money.
- The same LLM (GPT-3.5-turbo) is used in all three steps of the process: for generating the test set, for making the predictions of the model (since the RAG agent uses it), and for calculating the evaluation metrics. So in some sense these results are just the LLM using itself to evaluate its own performance with itself as the benchmark. Future work should involve using different LLMs for all three steps.
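To put a rough number on the first drawback: assuming each per-sample score lies between 0 and 1 and the samples are independent, the standard deviation of a score is at most 0.5, which bounds the standard error of a mean over n samples by 0.5 / sqrt(n). The sketch below evaluates this worst-case bound; it is not an estimate derived from the actual per-sample scores, which are not shown here.

```python
import math


def max_standard_error(n: int, low: float = 0.0, high: float = 1.0) -> float:
    """Worst-case standard error of the mean for n scores bounded in [low, high]."""
    return (high - low) / 2 / math.sqrt(n)


print(max_standard_error(16))    # 0.125
print(max_standard_error(400))   # 0.025
```

So with 16 samples a reported mean could in the worst case carry an uncertainty of around 0.1, which is large compared with the differences between several of the metrics above.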
Conclusion
Building the ArXiv Chatbot was a great learning process for me. The final product could still do with many tweaks and improvements, and I’ll be thinking about those in a bit.
Try it out yourself at arxiv-chatbot.onrender.com and feel free to get in touch if you’d like to integrate the backend API into your own projects.