Embedding Wikipedia Articles for Search

By Anis Marrouchi & AI Bot

In this tutorial, we will explore the process of preparing a dataset of Wikipedia articles for search, specifically focusing on the 2022 Winter Olympics. This involves several key steps: collecting articles, chunking them into manageable sections, embedding these sections, and storing the results for efficient retrieval.

Prerequisites

Before we begin, ensure you have the necessary libraries installed. You can install any missing libraries using pip:

pip install openai mwclient mwparserfromhell pandas tiktoken

Set your OpenAI API key as an environment variable:

export OPENAI_API_KEY="your-api-key"

Collecting Articles

We start by collecting the titles of Wikipedia articles related to the 2022 Winter Olympics. Using the mwclient library, we can query Wikipedia's API for the members of a category, recursing into subcategories up to a chosen depth.

import mwclient

def titles_from_category(category, max_depth):
    """Return a set of page titles in a given Wiki category and its subcategories."""
    titles = set()
    for member in category.members():
        if isinstance(member, mwclient.page.Page):
            titles.add(member.name)
        elif isinstance(member, mwclient.listing.Category) and max_depth > 0:
            titles |= titles_from_category(member, max_depth=max_depth - 1)
    return titles

CATEGORY_TITLE = "Category:2022 Winter Olympics"
WIKI_SITE = "en.wikipedia.org"

site = mwclient.Site(WIKI_SITE)
category_page = site.pages[CATEGORY_TITLE]
titles = titles_from_category(category_page, max_depth=1)
print(f"Found {len(titles)} article titles in {CATEGORY_TITLE}.")

Chunking Articles

Once we have the article titles, we need to split each page into smaller, semi-self-contained sections. This is crucial because GPT models can only process a limited amount of text at a time.
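The referenced cookbook recurses through nested subsections when splitting pages; as a minimal sketch, we can parse each page with mwparserfromhell and split only at top-level (level-2) headings for brevity. The sections_from_page helper below is illustrative rather than the cookbook's exact implementation, and it reuses the site and titles objects from the previous step:

import mwparserfromhell

def sections_from_page(page):
    """Split one Wikipedia page into (titles, text) pairs, one per level-2 section."""
    parsed = mwparserfromhell.parse(page.text())
    sections = []
    for section in parsed.get_sections(levels=[2], include_lead=False):
        heading = section.filter_headings()[0]
        body = str(section).replace(str(heading), "", 1).strip()
        sections.append(([page.name, str(heading.title).strip()], body))
    return sections

wikipedia_sections = []
for title in titles:
    wikipedia_sections.extend(sections_from_page(site.pages[title]))
print(f"Found {len(wikipedia_sections)} sections in {len(titles)} pages.")

Each entry pairs a section's title path (page title, then section heading) with its raw wikitext. Next, we clean the text by removing reference tags and whitespace: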

import re

def clean_section(section):
    """Strip <ref> citation tags and surrounding whitespace from a (titles, text) pair."""
    titles, text = section
    text = re.sub(r"<ref.*?</ref>", "", text)  # remove reference tags
    text = text.strip()
    return (titles, text)

wikipedia_sections = [clean_section(ws) for ws in wikipedia_sections]
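We also discard less relevant sections before embedding, then flatten each surviving (titles, text) pair into a single string; these strings are what the embedding step below expects as wikipedia_strings. The ignore list and minimum length here are illustrative choices modeled on the referenced cookbook, and tiktoken (installed earlier) lets us confirm each string fits the embedding model's context window:

import tiktoken

SECTIONS_TO_IGNORE = ["See also", "References", "External links", "Further reading", "Notes"]

def keep_section(section):
    """Keep sections that are long enough and not in the ignore list."""
    titles, text = section
    return len(text) >= 16 and titles[-1] not in SECTIONS_TO_IGNORE

wikipedia_sections = [ws for ws in wikipedia_sections if keep_section(ws)]

# Flatten each (titles, text) pair into a single string, keeping the
# title path as context for the embedding model.
wikipedia_strings = ["\n\n".join(titles) + "\n\n" + text for titles, text in wikipedia_sections]

encoding = tiktoken.get_encoding("cl100k_base")  # the encoding used by OpenAI embedding models
print(f"Longest section: {max(len(encoding.encode(s)) for s in wikipedia_strings)} tokens")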

Embedding Sections

With the sections prepared, we use the OpenAI API to generate embeddings for each section. This step transforms the text into a numerical format that can be easily searched and compared.

import os

from openai import OpenAI
import pandas as pd

EMBEDDING_MODEL = "text-embedding-3-small"
BATCH_SIZE = 1000  # embed up to 1,000 strings per API request

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

embeddings = []
for batch_start in range(0, len(wikipedia_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = wikipedia_strings[batch_start:batch_end]
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
    embeddings.extend([e.embedding for e in response.data])

df = pd.DataFrame({"text": wikipedia_strings, "embedding": embeddings})

Storing Embeddings

Finally, we store the embeddings in a CSV file for easy access and retrieval. For larger datasets, consider using a vector database for better performance.

os.makedirs("data", exist_ok=True)  # ensure the output directory exists
SAVE_PATH = "data/winter_olympics_2022.csv"
df.to_csv(SAVE_PATH, index=False)
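To search the stored embeddings, we can reload the CSV and rank sections by similarity to an embedded query. The sketch below reuses SAVE_PATH and EMBEDDING_MODEL from the steps above, parses the embedding column back from its string form, and uses numpy (installed alongside pandas); because OpenAI embeddings are normalized to unit length, a plain dot product gives the cosine similarity:

import ast

import numpy as np

# The CSV stores each embedding as a string, so parse it back into a list of floats.
df = pd.read_csv(SAVE_PATH)
df["embedding"] = df["embedding"].apply(ast.literal_eval)

def search(query, top_n=5):
    """Return the top_n stored sections most similar to the query."""
    query_embedding = client.embeddings.create(model=EMBEDDING_MODEL, input=query).data[0].embedding
    similarities = df["embedding"].apply(lambda emb: np.dot(emb, query_embedding))
    return df.loc[similarities.nlargest(top_n).index, "text"]

print(search("Which country won the most gold medals?").iloc[0][:200])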

This process allows us to efficiently prepare and search through large collections of text data, making it a powerful tool for applications like question answering and information retrieval.


Reference: Embedding Wikipedia articles for search by Ted Sanders.

