AI Engineering Course: Day 2

Posted on: January 19, 2024 at 08:40 PM

I had built some RAG-style prototypes in a Jupyter Notebook a few months back using LangChain and hosted vector databases such as Pinecone. While those notebooks helped me understand what a RAG-style chatbot can do, they left a lot to be desired in terms of how it all works under the hood.

This Day 2 tutorial really helped me explore what exactly happens when we use embeddings to create context for an LLM-powered chatbot. At first I was frustrated that the code provided in the tutorial was a bit out of date, but on reflection I think it was a good thing. It forced me to dive in deeper than normal so I could understand and fix what was missing and broken. I’m glad I did.

A great first step is always to clean up the data as much as possible, which eases the embedding process. We’ll use a CSV instead of a vector database to store the embeddings and their associated data.

But first, let’s understand what an embedding actually is. An embedding is a vector: a list of floating-point numbers. The distance between two vectors measures how related they are, with small distances meaning more related and larger distances less related.
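
As a toy illustration (my own example, not from the course), here’s how cosine distance behaves on a pair of tiny vectors:

import numpy as np
from scipy.spatial.distance import cosine

# Toy 3-dimensional "embeddings" -- real ada-002 embeddings have 1,536 dimensions
a = np.array([1.0, 0.5, 0.1])
b = np.array([0.9, 0.6, 0.2])   # points in a similar direction to a
c = np.array([-1.0, 0.2, 0.8])  # points in a very different direction

print(cosine(a, b))  # small distance -> closely related
print(cosine(a, c))  # large distance -> less related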

It was surprisingly fast to turn all of the scraped text into a single, simple CSV. We also took the filenames and converted them to URLs, which will let us link to the original sources when we answer questions.
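
For example (using a hypothetical filename standing in for the scraped files), the conversion is just string manipulation:

# Hypothetical filename produced by the scraper
file = "developer.mozilla.org_en-US_docs_Web_HTML.txt"

# Drop the .txt extension and turn underscores back into slashes
url = "https://" + file[:-4].replace('_', '/')
print(url)  # https://developer.mozilla.org/en-US/docs/Web/HTML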

We’ll now turn the raw text into tokens, and then turn those tokens into embeddings we can query for answers. A token represents roughly 4 characters, but it’s not a precise count. Most embedding services limit the number of tokens allowed in a given call, which means we’ll need LangChain’s chunking to process the entire documentation set.
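
To see how the roughly-4-characters-per-token rule of thumb plays out, tiktoken can count tokens using the same encoding ada-002 uses:

import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
tokenizer = tiktoken.get_encoding("cl100k_base")

text = "An embedding is a vector list of floating point numbers."
print(len(text), "characters ->", len(tokenizer.encode(text)), "tokens")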

This tutorial stores the embeddings in a CSV file rather than a vector database such as Pinecone. I’d built a few RAG-style chat applications before, but never as hands-on as this, with the embeddings stored directly in a CSV.

This, coupled with the outdated code examples from the AI Engineering course, posed a few challenges in getting something working.

The largest gap was how to calculate the distance between embeddings. There was previously a function baked into the OpenAI Python package that did this as follows:

df['distances'] = distances_from_embeddings(q_embeddings, df['embeddings'].values, distance_metric='cosine')

We can replicate this with a cosine function that compares the question embedding against each embedding in the provided data.
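
Here’s the idea behind that replacement in isolation (the full version lives in Questions.py below), assuming the embeddings column already holds numpy arrays:

from scipy.spatial.distance import cosine

# scipy's cosine() returns a distance (1 - similarity), so 1 - cosine()
# recovers the similarity; higher values mean the row is closer to the question
df['distances'] = df['embeddings'].apply(lambda x: 1 - cosine(q_embeddings, x))

# Most similar rows first -- note the descending sort, since the column now
# holds similarities despite keeping the tutorial's original 'distances' name
df = df.sort_values('distances', ascending=False)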

Since the tutorial used an older version of OpenAI’s Python package, I needed a way to test my updates, make sure everything worked as expected, and iterate quickly.

A simple bot that can be run from the command line solved this:

### Test_Bot.py

# Import necessary functions from questions.py
from questions import create_context, answer_question
import pandas as pd
import numpy as np

# Load your embeddings data
# Ensure this path is correct
df = pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

def test_bot():
    while True:
        # Input question
        question = input("Enter your question (or type 'quit' to exit): ")

        # Check for quit command
        if question.lower() == 'quit':
            break

        # Generate context
        context = create_context(question, df)
        print("\nContext Generated:\n")
        print(context)

        # Get answer
        answer, _ = answer_question(question, df)
        print("\nAnswer:\n")
        print(answer)

if __name__ == "__main__":
    test_bot()

I get the expected answers from here. Good.

The final kink was figuring out why my Telegram bot was returning “I don’t know” more often than the test script above. It turns out the entire string “/mozilla What is HTML?” was being passed to the LLM. Stripping out the “/mozilla” prefix greatly improved the Telegram bot, and it now gives the same results as my test script.
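
The fix is a simple strip of the command prefix before the question reaches the model (the same logic now lives in Main.py below):

full_text = "/mozilla What is HTML?"
question = full_text.replace('/mozilla', '').strip()
print(question)  # "What is HTML?"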

I had originally thought this was going to be a simple tutorial that covered a lot of what I had done with RAG-style prototypes in Jupyter Notebooks using LangChain and hosted vector databases. It turns out this tutorial got into the details much more than I ever had. This is a good thing! I learned a lot more about how these embeddings work under the hood by relying on a CSV and figuring out how to calculate the correct context from the embeddings for both the data and the question.

I hit a lot of frustration working through this, which told me I needed to go back to basics and understand what I was doing before trying to implement something from a tutorial. It was a great lesson and one I’ll remember going forward. For reference, here’s the final updated code that works for my Telegram bot.

### Embed.py

import pandas as pd
from dotenv import load_dotenv
import os
import tiktoken
import openai
from langchain.text_splitter import RecursiveCharacterTextSplitter
load_dotenv()

openai.api_key = os.environ['OPENAI_API_KEY']

DOMAIN = "developer.mozilla.org"

def remove_newlines(series):
  # Collapse newlines (and literal '\n' sequences) into spaces,
  # then squeeze repeated spaces (applied twice to catch longer runs)
  series = series.str.replace('\n', ' ')
  series = series.str.replace('\\n', ' ')
  series = series.str.replace('  ', ' ')
  series = series.str.replace('  ', ' ')
  return series

# Create a list to store the text files
texts=[]

# Get all the text files in the text directory
for file in os.listdir("text/" + DOMAIN + "/"):

  # Open the file and read the text
  with open("text/" + DOMAIN + "/" + file, "r", encoding="UTF-8") as f:
    text = f.read()
    # Drop the .txt extension and replace _ with / to reconstruct the URL we scraped
    filename = file[:-4].replace('_', '/')
    """
    A lot of contributors.txt files got included in the scrape; this weeds them
    out. There are also auth-required URLs in the scrape to weed out.
    """
    if filename.endswith(".txt") or 'users/fxa/login' in filename:
      continue

    # Keep the reconstructed URL alongside the text so we can cite sources later
    texts.append((filename, text))

# Create a dataframe from the list of texts
df = pd.DataFrame(texts, columns=['fname', 'text'])

# Set the text column to be the raw text with the newlines removed
df['text'] = df.fname + ". " + remove_newlines(df.text)
df.to_csv('processed/scraped.csv')

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

df = pd.read_csv('processed/scraped.csv', index_col=0)
df.columns = ['title', 'text']

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

chunk_size = 1000  # Max chunk length (in characters here, since length_function is len)

text_splitter = RecursiveCharacterTextSplitter(
    # len counts characters; at ~4 characters per token, 1000-character chunks
    # stay comfortably under the embedding API's token limit. Swap in a token
    # counting function for exact token-based chunking.
    length_function = len,
    chunk_size = chunk_size,
    chunk_overlap = 0,  # No overlap between chunks
    add_start_index = False,  # We don't need the start index in this case
)

shortened = []

for _, row in df.iterrows():

  # If the text is None, go to the next row
  if row['text'] is None:
    continue

  # If the text exceeds the max token count, split it into chunks
  if row['n_tokens'] > chunk_size:
    # Split the text using LangChain's text splitter
    chunks = text_splitter.create_documents([row['text']])
    # Append the content of each chunk to the 'shortened' list
    for chunk in chunks:
      shortened.append(chunk.page_content)

  # Otherwise, add the text as-is to the list of shortened texts
  else:
    shortened.append(row['text'])

df = pd.DataFrame(shortened, columns=['text'])

df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Embed each chunk with ada-002 using the current OpenAI client API
df['embeddings'] = df.text.apply(
    lambda x: openai.embeddings.create(input=x, model="text-embedding-ada-002").data[0].embedding
    if x is not None else None
)

df.to_csv('processed/embeddings.csv')

### Questions.py

# questions.py
from dotenv import load_dotenv
import numpy as np
import pandas as pd
import openai
import os
from scipy.spatial.distance import cosine

# Load environment variables
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]

# Load embeddings data
df = pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

def create_context(question, df, max_len=1800):
    # API call to get question embeddings
    q_response = openai.embeddings.create(input=question, model="text-embedding-ada-002")
    q_embeddings = np.array(q_response.data[0].embedding)

    # Cosine similarity between the question embedding and each stored embedding
    # (scipy's cosine() is a distance, so 1 - distance gives the similarity)
    df['distances'] = df['embeddings'].apply(lambda x: 1 - cosine(q_embeddings, x))

    # Construct context from the closest matches (highest similarity first)
    context = ''
    for _, row in df.sort_values('distances', ascending=False).iterrows():
        if len(context) + len(row['text']) > max_len:
            break
        context += row['text'] + "\n\n"
    return context

def answer_question(question, df, max_len=1800, max_tokens=150):
    print(question)  # Debug: show the incoming question
    # Create context
    context = create_context(question, df, max_len)
    if not context:
        return "Unable to generate context for the question.", ""

    # API call for generating an answer
    try:
        response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know.\" Try to cite sources to the links in the context when possible.\n\nContext: {context}\n\n---\n\nQuestion: {question}\nSource:\nAnswer:",
            }],
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        return response.choices[0].message.content, context
    except Exception as e:
        return f"Error in generating answer: {e}", context

### Main.py (Telegram bot)

# Load the Telegram and OpenAI API keys from .env
from dotenv import load_dotenv
import logging
from telegram import Update
from telegram.ext import filters, MessageHandler, ApplicationBuilder, ContextTypes, CommandHandler
import os
import openai
import numpy as np
import pandas as pd
from questions import answer_question
load_dotenv()

openai.api_key = os.environ["OPENAI_API_KEY"]

tg_bot_token = os.getenv("TELEGRAM_API_KEY")

messages = [{
    "role" : "system",
    "content": "You are a helpful assistant that answers questions."
}]

df = pd.read_csv('processed/embeddings.csv', index_col=0)
df['embeddings'] = df['embeddings'].apply(eval).apply(np.array)

logging.basicConfig(format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                        level=logging.INFO)

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    await context.bot.send_message(chat_id=update.effective_chat.id,
                                   text="I'm a bot, please talk to me!")

async def chat(update: Update, context: ContextTypes.DEFAULT_TYPE):
    messages.append({"role": "user", "content": update.message.text})
    completion = openai.chat.completions.create(model="gpt-3.5-turbo",
                                                messages=messages)
    completion_answer = completion.choices[0].message.content
    messages.append({"role": "assistant", "content": completion_answer})

    await context.bot.send_message(chat_id=update.effective_chat.id,
                                   text=completion_answer)
    
async def mozilla(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Extract the text from the update and remove the command prefix
    full_text = update.message.text
    question = full_text.replace('/mozilla', '').strip()

    print(question)  # For debugging, to see the processed question
    answer, _ = answer_question(question=question, df=df)
    await context.bot.send_message(chat_id=update.effective_chat.id, text=answer)

    
if __name__ == "__main__":
    application = ApplicationBuilder().token(tg_bot_token).build()
    
    start_handler = CommandHandler('start', start)
    chat_handler = CommandHandler('chat', chat)
    mozilla_handler = CommandHandler('mozilla', mozilla)
    application.add_handler(mozilla_handler)
    application.add_handler(start_handler)
    application.add_handler(chat_handler)
    application.run_polling()

On to Day 3 :)