An Introduction to GPT-4o and GPT-4o mini

Welcome to a comprehensive tutorial aimed at introducing you to the pioneering models of OpenAI – GPT-4o and GPT-4o mini. As you embark on this journey to understand these groundbreaking technologies, our goal is to provide you with the foundational knowledge and skills to leverage them effectively in your own applications.
What Are GPT-4o and GPT-4o mini?
GPT-4o, where the "o" stands for "omni," is the latest generational leap in OpenAI's family of language models. Unlike earlier GPT models, which worked primarily with text, GPT-4o is a multimodal model, designed to understand and reason over a combination of text, audio, and video inputs.
GPT-4o mini is essentially its "little sibling": a smaller, more affordable variant that remains fast and accurate while still supporting multimodal interactions.

Getting Started with GPT-4o mini
Before diving into the practical aspects, it helps to understand that GPT-4o mini is trained as a single unified model across text, visual, and audio data, rather than as separate systems stitched together. In the Chat Completions API, requests currently accept text and image inputs and return text outputs; audio and video are handled indirectly for now (for example, by sampling video frames, as shown later in this tutorial).
Installation
To begin, you'll need to install the OpenAI SDK for Python. This can be done with the package manager pip using the following command:
%pip install --upgrade openai
Configuration
Next, you'll have to configure the OpenAI client, for which an API key is essential. If you don't already have one, create a new project on the OpenAI platform and generate an API key. Once obtained, set this API key as an environment variable for easy access across projects.
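For example, you can export the key in your terminal before starting Python (export OPENAI_API_KEY="..." on macOS/Linux), or set it for the current process only, as in this minimal sketch (the value shown is a placeholder, not a real key):

import os

# Quick sketch: set the key for this process only. In practice, prefer
# exporting OPENAI_API_KEY in your shell or loading it from a .env file.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder -- replace with your own key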
Your First Request
Once your installation and configuration are in place, it's time to make your first request. Here's how you can initiate the conversation with GPT-4o mini:
from openai import OpenAI
import os

MODEL = "gpt-4o-mini"

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))

# A simple text-only request: a system message sets the behavior,
# and the user message carries the actual question
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},
        {"role": "user", "content": "Hello! Could you solve 2+2?"}
    ]
)
print("Assistant: " + response.choices[0].message.content)
The output will be the assistant's answer to the math question posed in the user message.
Processing Images with GPT-4o mini
With its multimodal capabilities, GPT-4o mini can also interpret image-based queries. For instance, if you ask about the area of a triangle and provide an image of a triangle, GPT-4o mini can analyze it and respond accordingly.
Base64 Encoded Images
To process images, you can pass them either as Base64-encoded strings or as direct URLs. Here's an example of encoding an image and sending a request (a URL-based variant follows this example):
import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image("triangle.png")  # Replace with your actual image path

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
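If your image is already hosted online, you can skip the encoding step and pass its URL directly in the image_url field. A minimal sketch, assuming the same client and MODEL as above and using a placeholder URL:

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            # Placeholder URL -- replace with a link to your own image
            {"type": "image_url", "image_url": {"url": "https://example.com/triangle.png"}}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)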
Summarization and Q&A with Video Content
While the API does not yet accept video input directly, GPT-4o can still reason about videos by sampling frames from them, which opens the door to applications such as video summarization and question answering.
Video Processing Setup
First, ensure you have the necessary dependencies installed:
%pip install opencv-python
%pip install moviepy
Next, process the video to extract frames and audio:
import cv2
from moviepy.editor import VideoFileClip

def process_video(video_path, seconds_per_frame=2):
    # ... code to process video
    # this will append frames to base64Frames and save the audio as an mp3 file
    ...

base64Frames, audio_path = process_video("keynote_recap.mp4")  # Replace with your actual video path
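The body of process_video is elided above; for reference, one possible implementation uses OpenCV to grab a frame every few seconds and moviepy to save the audio track. Treat this as a sketch: the sampling interval, JPEG encoding, and output path are choices you can adapt.

import base64
import cv2
from moviepy.editor import VideoFileClip

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)

    # Sample one frame every `seconds_per_frame` seconds and Base64-encode it as JPEG
    for frame_index in range(0, total_frames, frames_to_skip):
        video.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
    video.release()

    # Extract the audio track and save it as an mp3 next to the video
    audio_path = video_path.rsplit(".", 1)[0] + ".mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.close()

    return base64Frames, audio_path

The frames come back as Base64-encoded JPEG strings, so they can be passed to the model the same way as the image example earlier.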
Summarizing Video Content
After processing, send both the sampled frames and the audio transcript to the model for summarization (a sketch of the elided pieces follows the request below):
# ... code to display frames and play audio for context

# Now generate a summary with visual and audio inputs
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"},
        {"role": "user", "content": [
            # ... messages containing image URLs of video frames and the text transcription
        ]},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
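To make the elided pieces concrete, here is one way the user content could be assembled. It assumes base64Frames and audio_path from the process_video step above, transcribes the audio with Whisper through the same client (one option among several), and interleaves the sampled frames with the transcript; the resulting user_content list would take the place of the elided content in the request above.

# Transcribe the extracted audio; Whisper via the same client is one option
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Interleave the sampled frames (as data URLs) with the transcript text;
# "detail": "low" keeps per-image token usage down
user_content = [
    {"type": "text", "text": "These are the frames from the video."},
    *[
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame}", "detail": "low"}}
        for frame in base64Frames
    ],
    {"type": "text", "text": f"The audio transcription is: {transcription.text}"},
]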
Through this method, GPT-4o mini can give you a rich, comprehensive summary by leveraging both visual and spoken details in the video.
This tutorial has laid out the steps to get started with GPT-4o and GPT-4o mini, from installation to making more sophisticated requests that combine text and image inputs. With practice, you will become adept at leveraging these models for a wider array of tasks as OpenAI introduces additional modalities such as audio.
Expand your understanding and keep exploring the capabilities of these powerful AI tools.
Source: OpenAI Cookbook: Introduction to GPT-4o by Juston Forte. Published on July 18, 2024.