I spent some time last week building a chatbot that can answer questions based on the abstracts of more than 2.4 million articles on the ArXiv. The bot is built on top of GPT-3.5 (the LLM that powered an earlier version of ChatGPT) and uses Retrieval Augmented Generation (RAG). The repository with the code is here and you can chat with the final product here.

What is Retrieval Augmented Generation?

Introduced in a 2020 paper, Retrieval Augmented Generation (RAG) is a technique for augmenting the knowledge of a general-purpose LLM (or more generally any kind of generative AI tool) by retrieving information from external sources (often called documents). The idea is that access to this kind of specific external knowledge can help the model be more accurate, more up to date, less prone to hallucination, or all of these.
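To make the retrieve-then-generate idea concrete, here is a minimal toy sketch. It is not the code used in this project: word counts stand in for a real embedding model, the three-sentence corpus is made up for illustration, and the final prompt is only built, not sent to an LLM.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augment the question with the retrieved context before it goes to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# Hypothetical mini-corpus, for illustration only
documents = [
    "loop quantum gravity is a candidate theory of quantum gravity",
    "convolutional networks are widely used for image classification",
    "string theory is another approach to quantum gravity",
]
question = "what is quantum gravity"
prompt = build_prompt(question, retrieve(question, documents, k=2))
print(prompt)
```

The real pipeline swaps each toy piece for a production one: a neural embedding model instead of word counts, a vector database instead of a sorted list, and an LLM call on the final prompt.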

ChatGPT, while being an incredibly useful tool, can often reach the limits of its capabilities or generate hallucinations when asked specific questions in scientific/technical domains. It also does not reveal how it arrived at a certain response or what sources it used. The idea behind the ArXiv Chatbot is to use a RAG approach to help GPT-3.5 overcome these limitations.

Data Preparation

The foundation of ArXiv Chatbot is the extensive dataset from ArXiv, which I obtained from Kaggle. This dataset contains metadata for approximately 2.4 million papers. Here’s how I processed the data:

Downloading and loading the data

The dataset was downloaded from Kaggle. I narrowed it down to the following five fields: id, title, authors, abstract, and categories.

from langchain_community.document_loaders import DataFrameLoader
import pandas as pd

# Load dataset
df = pd.read_csv('path_to_arxiv_data.csv')

# Select relevant features
df = df[['id', 'title', 'authors', 'abstract', 'categories']]

# Wrap the DataFrame: each abstract becomes a document's page_content
# and the remaining columns become its metadata
loader = DataFrameLoader(df, page_content_column="abstract")
docs = loader.load()
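For readers who have not used DataFrameLoader, the sketch below mimics the split it makes, using plain dictionaries instead of LangChain's Document class. The function name rows_to_documents and the sample row are hypothetical placeholders, not part of the project or the dataset.

```python
from typing import Any


def rows_to_documents(
    rows: list[dict[str, Any]], page_content_column: str
) -> list[dict[str, Any]]:
    """Mimic DataFrameLoader: one column becomes the text, the rest become metadata."""
    documents = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        documents.append(
            {"page_content": row[page_content_column], "metadata": metadata}
        )
    return documents


# Placeholder row, not a real ArXiv record
rows = [
    {
        "id": "0000.00000",
        "title": "An example title",
        "authors": "A. Author, B. Author",
        "abstract": "An example abstract.",
        "categories": "hep-th",
    }
]
docs_sketch = rows_to_documents(rows, page_content_column="abstract")
print(docs_sketch[0])
```

Only page_content is embedded later on; the metadata travels with each vector, which is what would let the bot point back to the title, authors, and id of a paper it used.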

Embedding and Upserting into Pinecone

I chose BAAI/bge-small-en-v1.5 for generating the embeddings. This can be downloaded for free from Hugging Face.

# Use a GPU if we have one
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the embedding model
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {"device": device}
encode_kwargs = {
   "normalize_embeddings": True,
   "show_progress_bar": True,
   "batch_size": 128
   }
hf = HuggingFaceBgeEmbeddings(
   model_name=model_name,
   model_kwargs=model_kwargs,
   encode_kwargs=encode_kwargs
)
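A note on the normalize_embeddings flag: once every vector has unit length, a dot product between two embeddings equals their cosine similarity, so an index that uses a dot-product metric ranks results the same way a cosine index would. The small check below illustrates this with two made-up 2-dimensional vectors (real bge-small embeddings have 384 dimensions).

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors, for illustration only
a = [3.0, 4.0]
b = [4.0, 3.0]

dot_of_normalized = dot(normalize(a), normalize(b))
print(dot_of_normalized, cosine(a, b))
```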

Next I connected to the Pinecone index I had already set up using the Pinecone web interface:

from pinecone import Pinecone

pc = Pinecone(api_key=pinecone_key)
index = pc.Index("arxiv-abstracts")

which had been configured with dimensions set to 384 and metric as dotproduct. Then I set up the PineconeVectorStore object for implementing the upserts:

from langchain_pinecone import PineconeVectorStore

index_name = "arxiv-abstracts"

# Wrap the Pinecone index in a vector store that embeds documents with hf
docsearch = PineconeVectorStore(
   index=index,
   embedding=hf,
   text_key="text",
   distance_strategy='dotproduct'
)

and finally upserted the documents:

ids = docsearch.add_documents(docs)  # returns the IDs of the upserted documents
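With millions of documents, it may be more practical to upsert in batches, so that a failure partway through does not force a restart from scratch. The sketch below is one possible way to do that and is not taken from the project's code: batched and upsert_in_batches are hypothetical helpers, and fake_upsert stands in for docsearch.add_documents so the example runs without a Pinecone account.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(
    items: Iterable[T],
    upsert_fn: Callable[[List[T]], Sequence[str]],
    batch_size: int = 1000,
) -> List[str]:
    """Call upsert_fn once per batch and collect the IDs it returns."""
    ids: List[str] = []
    for batch in batched(items, batch_size):
        ids.extend(upsert_fn(batch))
    return ids


uploaded: List[List[int]] = []


def fake_upsert(batch: List[int]) -> List[str]:
    """Stand-in for docsearch.add_documents: records the batch, returns fake IDs."""
    uploaded.append(batch)
    return [f"id-{item}" for item in batch]


ids = upsert_in_batches(range(5), fake_upsert, batch_size=2)
print(uploaded, ids)
```

In the real setting the call would look something like upsert_in_batches(docs, docsearch.add_documents), with a batch size chosen to suit the embedding model and the Pinecone request limits.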

Backend Implementation

The backend of the chatbot is implemented using FastAPI and deployed on Google Cloud Run using a Docker image. Here’s a breakdown of the implementation:

FastAPI Setup

FastAPI provides a fairly straightforward way to handle GET and POST requests. The following is an excerpt from the backend/main.py file where this is implemented.

from fastapi import FastAPI
import query

app = FastAPI()

@app.post("/query")
async def chat(user_id: int, message: str):
    # Additional logic for figuring out the agent
    # based on the user id omitted.

    response = query.connect(agent, message)
    response['user_id'] = user_id
    return response

where the agent and its logic are implemented in the query module.
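Because user_id and message are declared as plain int and str parameters, FastAPI reads them from the query string even though the route is a POST. A client call could therefore be built as in the sketch below; the base URL http://localhost:8000 and the helper name build_query_request are assumptions for illustration, not the deployed service.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_query_request(base_url: str, user_id: int, message: str) -> Request:
    """Build a POST request to /query with both arguments in the query string."""
    params = urlencode({"user_id": user_id, "message": message})
    return Request(f"{base_url}/query?{params}", method="POST")


# Hypothetical local server address
request = build_query_request("http://localhost:8000", 1, "What is quantum gravity?")
print(request.get_method(), request.full_url)

# To actually send it (requires the backend to be running):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.read().decode())
```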

Creating the RAG Agent

This involved several steps. First I set up a global prompt,

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.chat import MessagesPlaceholder

def create_prompt(memory_key: str = "history"):

    system_prompt = """
        You are an intelligent agent equipped with a tool
        named "ArXiv-search" that can answer
        technical and scientific questions.
        When asked to answer a technical or scientific
        question, you should use "ArXiv-search" to
        handle the input.

        Instructions:
        1. If the input asks you to answer a technical 
        or scientific question, use "ArXiv-search".
        2. If the input is a general query, answer it 
        to the best of your knowledge.

        Examples:
        - Input: "Which is the largest ocean?"
        Action: Answer "The Pacific Ocean".
        - Input: "What is quantum gravity?"
        Action: Use "ArXiv-search" to answer the question.
    """

    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(memory_key),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad")
    ])
    return prompt
   
PROMPT = create_prompt()

then the LLM (note that for this to work the OpenAI API key needs to be loaded as the OPENAI_API_KEY environment variable; alternatively you can pass it as an argument),

from langchain_openai import ChatOpenAI

LLM = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0
)

and retriever:

RETRIEVER = docsearch.as_retriever()

I then wrote a wrapper around the base agent/executor that holds all of the objects needed to create and run a RAG chain inside a conversation agent with memory (i.e. one that remembers the last few messages exchanged with the user for additional context). It also has methods for invoking the individual elements when needed:

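# Imports this excerpt relies on (paths as of LangChain 0.1/0.2;
# adjust them to your installed version)
from dataclasses import dataclass, field
from typing import Any, List

from langchain.agents import AgentExecutor, Tool, create_openai_functions_agent
from langchain.chains import ConversationChain, create_retrieval_chain
from langchain.chains.base import Chain
from langchain.memory import ConversationSummaryBufferMemory
from langchain_core.memory import BaseMemory
from langchain_core.runnables import Runnable
from langchain_core.tools import BaseTool

@dataclass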
class ConversationAgentWrapper:
    memory_key: str = "history"
    input_key: str = "input"
    memory: BaseMemory = field(init=False)
    chain: Chain = field(init=False)
    tools: List[BaseTool] = field(init=False)
    agent: Runnable = field(init=False)
    executor: AgentExecutor = field(init=False)

    def __post_init__(self):
        self.memory = self.create_memory()
        self.chain = self.create_chain()
        self.tools = self.create_tools()
        self.agent = self.create_agent()
        self.executor = self.create_executor()

    def create_memory(self) -> BaseMemory:
        return ConversationSummaryBufferMemory(
            llm = LLM,
            max_token_limit = 650,
            memory_key=self.memory_key,
            return_messages=True,
            input_key=self.input_key
        )
    
    def create_chain(self) -> Chain:
        conversation_chain = ConversationChain(
            llm = LLM,
            memory = self.memory
        )

        return create_retrieval_chain(
            RETRIEVER,
            conversation_chain
        )

    def create_tools(self) -> List[Tool]:
        return [
            Tool(
                name="ArXiv-search",
                func = self.invoke_chain,
                description = ("""
                    use this tool when answering questions to get more 
                    information about a scientific or technical topic
                """)
            )
        ]

    def create_agent(self) -> Runnable:
        return create_openai_functions_agent(
            llm = LLM,
            tools = self.tools,
            prompt = PROMPT
        )
    
    def create_executor(self,
        remember_intermediate_steps: bool = True,
        verbose: bool = True,
        **kwargs: Any,
    ) -> AgentExecutor:
        return AgentExecutor(
            agent = self.agent,
            tools = self.tools,
            memory = self.memory,
            verbose = verbose,
            return_intermediate_steps = remember_intermediate_steps,
            **kwargs,
        )
   
    def invoke_chain(self, query: str) -> Any:
        print("Chain invoked")
        return self.chain.invoke({self.input_key: query})

    def invoke_executor(self, input):
        return self.executor.invoke({self.input_key: input})
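The memory piece deserves a short illustration. A summary buffer keeps recent messages verbatim and, once they exceed the token limit (650 in the wrapper above), drops the oldest ones and folds them into a running summary. The toy version below shows only that pruning logic; it is not LangChain's implementation. Tokens are approximated by whitespace-separated words, and toy_summarize is a hypothetical stand-in for the LLM call that would write the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (real tokenizers differ)."""
    return len(text.split())


@dataclass
class SummaryBuffer:
    max_token_limit: int
    summarize: Callable[[str, List[str]], str]
    summary: str = ""
    messages: List[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Append a message, then move the oldest ones into the summary if over the limit."""
        self.messages.append(message)
        pruned: List[str] = []
        while (
            self.messages
            and sum(count_tokens(m) for m in self.messages) > self.max_token_limit
        ):
            pruned.append(self.messages.pop(0))
        if pruned:
            self.summary = self.summarize(self.summary, pruned)


def toy_summarize(previous: str, pruned: List[str]) -> str:
    """Stand-in for the LLM: keep the first three words of each pruned message."""
    parts = [previous] if previous else []
    parts += [" ".join(m.split()[:3]) + "..." for m in pruned]
    return " ".join(parts)


buffer = SummaryBuffer(max_token_limit=8, summarize=toy_summarize)
buffer.add("user: what is quantum gravity")  # 5 words, fits
buffer.add("bot: a theory unifying gravity and quantum mechanics")  # 8 more, over the limit
print(buffer.summary)
print(buffer.messages)
```

The trade-off is that older context survives only in compressed form, which keeps prompts short and costs bounded at the price of some detail.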

The last step was the function that actually gets called when a POST request is made:

import warnings

def connect(agent, query):
    warnings.simplefilter('ignore')
    response = agent.invoke_executor(query)
    return response

All of this is in the backend/query.py file.

Docker Deployment

Deploying to Google Cloud Run took a while to figure out. Ultimately I used this repo as a guide for writing my Dockerfile. The main issue was ASGI vs WSGI: FastAPI is an ASGI application, so the container has to start it with an ASGI server such as uvicorn.

# Dockerfile
FROM python:3.9-slim 

ENV PYTHONUNBUFFERED True

ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./

ENV PORT 80

RUN pip install -r requirements.txt 

CMD exec uvicorn main:app --host 0.0.0.0 --port ${PORT} --workers 1

Frontend Implementation

The frontend is a Streamlit app that communicates with the FastAPI backend. This setup allows for a simple and interactive user experience. The code for this is fairly self-explanatory and is in streamlit_app.py.

Evaluation

I had no idea how RAG implementations are evaluated, so I looked around and found the Ragas framework, which is what I ended up using. I followed the guide here, and the code is available in backend/ragas_test.py. Here are the results I got:

Metric               Score
answer_relevancy     0.9494
answer_correctness   0.6239
answer_similarity    0.9308
faithfulness         0.8448
context_recall       0.8593
context_precision    0.9784

There are a couple of drawbacks to the specific evaluation method I used:

  • The test sample size of 16 is quite small. Unfortunately Ragas’s generation process is quite token-heavy and those tokens cost money.
  • The same LLM (GPT-3.5-turbo) is used in all three steps of the process: for generating the test set, for making the predictions of the model (since the RAG agent uses it), and for calculating the evaluation metrics. So in some sense these results are just the LLM using itself to evaluate its own performance with itself as the benchmark. Future work should involve using a different LLM for each of the three steps.
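To put a rough number on the first drawback: assuming each per-question score lies between 0 and 1 and the questions are independent, the standard deviation of a score can be at most 0.5, which bounds the standard error of the average. The sketch below computes that worst-case bound; it uses no actual evaluation data, and max_standard_error is a hypothetical helper.

```python
import math


def max_standard_error(n: int, low: float = 0.0, high: float = 1.0) -> float:
    """Worst-case standard error of the mean of n independent scores in [low, high]."""
    # Popoviciu's inequality: variance <= (high - low)^2 / 4
    max_std = (high - low) / 2
    return max_std / math.sqrt(n)


bound_16 = max_standard_error(16)    # sample size used here
bound_100 = max_standard_error(100)  # a larger test set, for comparison
print(bound_16, bound_100)
```

With 16 samples the worst case is 0.125, so differences of a few hundredths between metrics should not be over-interpreted. The true uncertainty is probably smaller than this bound, but estimating it would need the per-question scores.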

Conclusion

Building the ArXiv Chatbot was a great learning process for me. The final product could still do with many tweaks and improvements, and I’ll be thinking about those in a bit.

Try it out yourself at arxiv-chatbot.onrender.com and feel free to get in touch if you’d like to integrate the backend API into your own projects.