Master Structured Data Output with LangChain: Strategies and Techniques for Chat Models

By Anis Marrouchi & AI Bot

In the ever-evolving landscape of AI and machine learning, obtaining structured data from chat models is paramount for integrating these models smoothly into various applications. LangChain, a well-regarded library in AI, offers robust tools and methods to achieve structured data output seamlessly. Whether you're a seasoned developer or just beginning your journey, this tutorial will guide you through the process of transforming arbitrary text outputs into structured data using LangChain.

Let’s embark on this journey to master structured data output through LangChain, uncovering strategies and techniques designed to enhance your capabilities.

Prerequisites: This guide assumes a basic understanding of chat models and the .withStructuredOutput() method.

Why Structured Data Output?

Structured data allows you to map text outputs into predefined schemas, enabling them to fit seamlessly into databases and other downstream systems. This tutorial will demonstrate several strategies to achieve this with LangChain.

Initial Installation

First, we need to install the necessary dependencies. Depending on your preferred model provider, your installation commands will vary:

OpenAI

npm i @langchain/openai
import { ChatOpenAI } from "@langchain/openai";
 
const model = new ChatOpenAI({ model: "gpt-3.5-turbo", temperature: 0 });

Anthropic

npm i @langchain/anthropic
import { ChatAnthropic } from "@langchain/anthropic";
 
const model = new ChatAnthropic({ model: "claude-3-sonnet-20240229", temperature: 0 });

Initial Configuration

Remember to set your environment variables with your API keys:

OPENAI_API_KEY=your-api-key
ANTHROPIC_API_KEY=your-api-key
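If you are running on Node.js, a small guard can fail fast when a key is missing rather than letting the SDK error out mid-request (a minimal sketch; `requireEnv` is an illustrative helper, not part of LangChain):

```typescript
// Read a required environment variable, failing fast with a
// clear message instead of letting the SDK error out later.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

Call `requireEnv("OPENAI_API_KEY")` (or the Anthropic equivalent) once at startup so configuration mistakes surface immediately.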

Schema Definition with Zod

To facilitate structured data output, we can leverage Zod, a TypeScript-first schema declaration and validation library. Here’s a simple example of defining a schema for a joke:

import { z } from "zod";
 
const joke = z.object({
  setup: z.string().describe("The setup of the joke"),
  punchline: z.string().describe("The punchline to the joke"),
  rating: z.number().optional().describe("How funny the joke is, from 1 to 10"),
});

Integrating with LangChain

Using .withStructuredOutput(), let’s integrate this schema with our model:

const structuredLlm = model.withStructuredOutput(joke);
 
await structuredLlm.invoke("Tell me a joke about cats");
// Expected Output:
// { setup: "Why don't cats play poker in the wild?", punchline: "Too many cheetahs.", rating: 7 }

Schema Naming for Better Context

Adding a name to your schema can provide additional context to the model, improving its performance:

const structuredLlm = model.withStructuredOutput(joke, { name: "joke" });
 
await structuredLlm.invoke("Tell me a joke about cats");
// Expected Output:
// { setup: "Why don't cats play poker in the wild?", punchline: "Too many cheetahs!", rating: 7 }

JSON Schema Usage

For those who prefer not to use Zod, LangChain also supports OpenAI-style JSON schema:

const structuredLlm = model.withStructuredOutput({
  name: "joke",
  description: "Joke to tell user.",
  parameters: {
    title: "Joke",
    type: "object",
    properties: {
      setup: { type: "string", description: "The setup for the joke" },
      punchline: { type: "string", description: "The joke's punchline" },
    },
    required: ["setup", "punchline"],
  },
});
 
await structuredLlm.invoke("Tell me a joke about cats");
// Expected Output:
// { setup: "Why was the cat sitting on the computer?", punchline: "Because it wanted to keep an eye on the mouse!" }
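If you skip Zod, you may still want a runtime check on what comes back. The sketch below hand-rolls a type guard mirroring the JSON schema's `required` fields; it is purely illustrative and not part of LangChain:

```typescript
type Joke = { setup: string; punchline: string };

// Narrowing check that mirrors the JSON schema above:
// both required string properties must be present.
function isJoke(value: unknown): value is Joke {
  if (typeof value !== "object" || value === null) return false;
  const obj = value as Record<string, unknown>;
  return typeof obj.setup === "string" && typeof obj.punchline === "string";
}

const parsed: unknown = JSON.parse(
  '{"setup": "Why was the cat sitting on the computer?", "punchline": "To keep an eye on the mouse!"}'
);
console.log(isJoke(parsed)); // true
```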

Advanced Output Specification

For models supporting multiple output methodologies, specify your preference:

const structuredLlm = model.withStructuredOutput(joke, { method: "json_mode", name: "joke" });
 
await structuredLlm.invoke("Tell me a joke about cats, respond in JSON with `setup` and `punchline` keys");
// Expected Output:
// { setup: "Why don't cats play poker in the jungle?", punchline: "Too many cheetahs!" }

Prompting Techniques

For models that don’t support built-in structured output capabilities, well-crafted prompts can compel models to output data in structured formats. Leveraging the JsonOutputParser, let's demonstrate this:

import { JsonOutputParser } from "@langchain/core/output_parsers";
import { ChatPromptTemplate } from "@langchain/core/prompts";
 
type Person = { name: string; height_in_meters: number; };
type People = { people: Person[]; };
 
const formatInstructions = `
Respond only in valid JSON. The JSON object you return should match the following schema:
{ people: [{ name: "string", height_in_meters: "number" }] }
Where people is an array of objects, each with a name and height_in_meters field.
`;
 
// Set up a parser typed with the expected People shape
const parser = new JsonOutputParser<People>();
 
// Create the prompt template, injecting the format instructions
const prompt = await ChatPromptTemplate.fromMessages([
  ["system", "Answer the user query. Wrap the output in `json` tags\n{format_instructions}"],
  ["human", "{query}"]
]).partial({ format_instructions: formatInstructions });
 
// Compose prompt, model, and parser into a single chain
const chain = prompt.pipe(model).pipe(parser);
await chain.invoke({ query: "Anna is 23 years old and she is 6 feet tall" });

Custom Parsing with LangChain

If built-in solutions don’t fit your use case, custom parsing using LangChain Expression Language (LCEL) may be what you need:

import { AIMessage } from "@langchain/core/messages";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { RunnableLambda } from "@langchain/core/runnables";
 
type Person = { name: string; height_in_meters: number; };
type People = { people: Person[]; };
 
const schema = `
{ people: [{ name: "string", height_in_meters: "number" }] }
`;
 
// Creating the prompt template
const prompt = await ChatPromptTemplate.fromMessages([
  ["system", `Answer the user query. Output your answer as JSON that matches the given schema: \`\`\`json\n{schema}\n\`\`\`. Make sure to wrap the answer in \`\`\`json and \`\`\` tags`],
  ["human", "{query}"]
]).partial({ schema });
 
// Custom extractor to pull ```json blocks out of the raw AI output
const extractJson = (output: AIMessage): People[] => {
  const text = output.content as string;
  // Match every ```json ... ``` block in the reply
  const pattern = /```json(.*?)```/gs;
  const matches = text.match(pattern);
 
  try {
    return (matches?.map(match => {
      // Strip the fence markers, then parse the remaining JSON
      const jsonStr = match.replace(/```json|```/g, "").trim();
      return JSON.parse(jsonStr);
    }) ?? []);
  } catch (error) {
    throw new Error(`Failed to parse JSON from model output: ${text}`);
  }
};
 
// Invoke the parsing chain
const query = "Anna is 23 years old and she is 6 feet tall";
const chain = prompt.pipe(model).pipe(new RunnableLambda({ func: extractJson }));
 
await chain.invoke({ query });
// Expected Output:
// [ { people: [ { name: "Anna", height_in_meters: 1.83 } ] } ]
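To see what the extractor does in isolation, here is the same regex applied to a simulated model reply, with no model call involved (a self-contained sketch):

```typescript
// Simulated model reply wrapping JSON in fenced json tags,
// as the system prompt above requests.
const reply = 'Here you go:\n```json\n{"people":[{"name":"Anna","height_in_meters":1.83}]}\n```';

// Find every fenced json block (the s flag lets . span newlines).
const matches = reply.match(/```json(.*?)```/gs) ?? [];

// Strip the fence markers and parse each block.
const extracted = matches.map((m) =>
  JSON.parse(m.replace(/```json|```/g, "").trim())
);

console.log(extracted[0].people[0].name); // "Anna"
```

Because the regex is global, a reply containing several fenced blocks would yield one parsed object per block.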

Summary

In this tutorial, we've explored multiple methods to ensure your chat models output structured data using LangChain. Armed with this knowledge, you can now integrate these models more efficiently into your applications, delivering refined and structured outputs that fit your needs.

To read more detailed guides on structured output and other advanced techniques, please visit the official LangChain documentation.

Author: LangChain Documentation Team

Explore the power of structured data output with LangChain and transform the way you handle AI outputs today.


Want to read more tutorials? Check out our latest tutorial on 5 Laravel 11 Basics: Controllers.

Discuss Your Project with Us

We're here to help with your web development needs. Schedule a call to discuss your project and how we can assist you.

Let's find the best solutions for your needs.