Structured Data Extraction with OpenAI's JSON Schema: An Upgraded AI-Driven Solution
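In the previous article, we showed how to use the OpenAI API to extract structured recipe data from a PDF file. In this article, we go a step further and use OpenAI's JSON Schema support to extract the data in a more precise and structured way. The improvement combines Python's pydantic data models with the response_format parameter of the OpenAI API, so that the returned data matches the structure we expect.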
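Advantages of Using JSON Schema
With OpenAI's JSON Schema feature, we can make sure the model's output follows the expected data format, which greatly reduces the amount of data cleaning and post-processing. In natural language processing tasks, a predefined schema avoids formatting errors and keeps the data consistent. Note that the schema constrains the shape of the output, not the correctness of the extracted values, so the content itself is still worth spot-checking.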
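To make "predefined schema" concrete, the sketch below prints the JSON Schema that pydantic (version 2 is assumed) derives from the Ingredient model used later in this article. It only illustrates what such a schema describes; the OpenAI SDK may adjust the schema further before sending it with the request.

```python
import json
from typing import Optional, Union

from pydantic import BaseModel


class Ingredient(BaseModel):
    item: Optional[str] = None
    quantity: Optional[Union[float, str]] = None
    unit: Optional[str] = None


# model_json_schema() (pydantic v2) returns the schema as a plain dict.
schema = Ingredient.model_json_schema()
print(json.dumps(schema, indent=2))
```

Each Optional field appears in the schema as a choice between its declared type and null, which is why the expected output can contain values such as "unit": null.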
Expected Output Format
The data structure we want to extract from the PDF file looks like this:
{
  "id": 1,
  "title": "Tomato Soup",
  "page": 11,
  "author": "Mrs. Wilhelmina Albrecht",
  "ingredients": [
    { "item": "tomatoes", "quantity": 12, "unit": null },
    { "item": "soda", "quantity": 1, "unit": "teaspoon" },
    { "item": "butter", "quantity": 1, "unit": "tablespoon" },
    { "item": "flour", "quantity": 1, "unit": "teaspoon" }
  ],
  "instructions": [
    "Boil 12 tomatoes until they are soft.",
    "Add a teaspoon of soda to a quart of pulp.",
    "Put a tablespoon of butter in a sauce pan and add a teaspoon of flour.",
    "Add hot milk, salt, cayenne pepper, and cracker crumbs.",
    "Serve at once."
  ]
}
Python Code Walkthrough
1. Defining the Data Models
First, we use pydantic to define our data structures:
from typing import List, Optional, Union

from pydantic import BaseModel


class Ingredient(BaseModel):
    item: Optional[str] = None
    quantity: Optional[Union[float, str]] = None  # e.g. 12 or "a pinch"
    unit: Optional[str] = None


class Recipe(BaseModel):
    id: int
    title: Optional[str] = None
    page: Optional[int] = None
    author: Optional[str] = None
    ingredients: Optional[List[Ingredient]] = None
    instructions: Optional[List[str]] = None


# Top-level container passed as response_format in step 4;
# a single page can hold several recipes.
class CookBook(BaseModel):
    recipes: List[Recipe]
These models help ensure that all extracted data matches the defined format and avoid problems caused by inconsistent output.
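As a rough illustration of what "matches the defined format" means in practice, the following sketch validates two hand-written JSON strings against the models (pydantic version 2 is assumed). The sample strings are made up for this example and are not taken from the cookbook.

```python
from typing import List, Optional, Union

from pydantic import BaseModel, ValidationError


class Ingredient(BaseModel):
    item: Optional[str] = None
    quantity: Optional[Union[float, str]] = None
    unit: Optional[str] = None


class Recipe(BaseModel):
    id: int
    title: Optional[str] = None
    page: Optional[int] = None
    author: Optional[str] = None
    ingredients: Optional[List[Ingredient]] = None
    instructions: Optional[List[str]] = None


# Hypothetical inputs: one well-formed, one with a missing id and a bad page.
good = '{"id": 1, "title": "Tomato Soup", "ingredients": [{"item": "soda", "quantity": 1, "unit": "teaspoon"}]}'
bad = '{"title": "Tomato Soup", "page": "eleven"}'

recipe = Recipe.model_validate_json(good)
print(recipe.title, recipe.page)  # fields that were omitted fall back to None

try:
    Recipe.model_validate_json(bad)
except ValidationError as exc:
    # Collect the names of the fields that failed validation.
    bad_fields = sorted(str(err["loc"][0]) for err in exc.errors())
    print(bad_fields)
```

Only id is required here, so a recipe with no author or page still validates, while a missing id or a non-numeric page is rejected.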
2. Extracting Data with the OpenAI API
We use OpenAI's response_format parameter to ensure the returned data conforms to our JSON Schema:
def find_structured_info_with_ai(prompt, system_content, response_format):
    """Return the parsed pydantic object, or the refusal message if parsing failed."""
    # openai_client is the module-level OpenAI client (see the complete code).
    completion = openai_client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
        response_format=response_format,
    )
    message = completion.choices[0].message
    if message.parsed:
        return message.parsed
    else:
        return message.refusal
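To exercise the two return paths (parsed result or refusal) without calling the API, one option is to pass the client in as a parameter and substitute a stand-in object. The client parameter and the make_fake_client helper below are hypothetical; they are not part of the original code or of the OpenAI SDK, and the stand-in only mimics the attributes the function reads.

```python
from types import SimpleNamespace


def find_structured_info_with_ai(client, prompt, system_content, response_format):
    """Same logic as above, with the client passed in explicitly (an assumption)."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ],
        response_format=response_format,
    )
    message = completion.choices[0].message
    return message.parsed if message.parsed else message.refusal


def make_fake_client(parsed=None, refusal=None):
    """Hypothetical stand-in exposing only client.beta.chat.completions.parse."""
    message = SimpleNamespace(parsed=parsed, refusal=refusal)
    completion = SimpleNamespace(choices=[SimpleNamespace(message=message)])

    def parse(**kwargs):
        # Ignore the request and return the canned completion.
        return completion

    completions = SimpleNamespace(parse=parse)
    return SimpleNamespace(beta=SimpleNamespace(chat=SimpleNamespace(completions=completions)))


# A plain dict stands in for the parsed CookBook object in this sketch.
ok = find_structured_info_with_ai(
    make_fake_client(parsed={"recipes": ["Tomato Soup"]}), "prompt", "system", dict
)
refused = find_structured_info_with_ai(
    make_fake_client(refusal="I can't help with that."), "prompt", "system", dict
)
print(ok, refused)
```

Because the function can return either a model instance or a string, callers need a type check before using the result, which is what the isinstance test in the main program does.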
3. Generating the Query Prompt
We generate a prompt for each page to help the AI accurately identify the recipes in the text:
def generate_prompt(page_number, text):
    """Build the extraction prompt for one page, marking where the page text starts."""
    return (f"Extract all recipes from the following text starting from ---Page {page_number}--- "
            f"Each recipe should include: - title - Ingredients (with item, quantity, and unit) - "
            f"Instructions (use the original text as much as possible) - Author - Page number\n"
            f"There could be multiple recipes in a page. "
            f"---Page {page_number}---\n{text}")
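To see what the model actually receives, this short sketch calls the function with a made-up page number and snippet of text (both are illustrative assumptions, not content from the cookbook) and prints the result.

```python
def generate_prompt(page_number, text):
    return (f"Extract all recipes from the following text starting from ---Page {page_number}--- "
            f"Each recipe should include: - title - Ingredients (with item, quantity, and unit) - "
            f"Instructions (use the original text as much as possible) - Author - Page number\n"
            f"There could be multiple recipes in a page. "
            f"---Page {page_number}---\n{text}")


# Hypothetical page text used only for this demonstration.
sample_text = "TOMATO SOUP. Boil 12 tomatoes until they are soft."
prompt = generate_prompt(11, sample_text)
print(prompt)
```

The page marker appears twice: once in the instructions and once immediately before the page text, so the model can tell where the instructions end and the source material begins.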
4. Main Program Logic
Below is the main program, which extracts the text from the PDF file and uses the AI to extract structured data:
def main():
    content = read_pdf(FILE_PATH)
    recipes = []
    no_recipes_warnings = []

    for page_number, text in content.items():
        prompt = generate_prompt(page_number, text)
        response = find_structured_info_with_ai(prompt, SYSTEM_CONTENT, CookBook)

        if isinstance(response, CookBook) and response.recipes:
            recipes.extend(response.recipes)
            logger.info(f"Found {len(response.recipes)} recipes on page {page_number}")
        else:
            # Either a refusal message or a CookBook with no recipes; convert
            # the latter to a dict so the warning stays JSON-serializable.
            warning = response.model_dump() if isinstance(response, CookBook) else response
            no_recipes_warnings.append({"page_number": page_number, "warning": warning})

    # Save the extracted recipe data
    recipes_dict = [recipe.model_dump() for recipe in recipes]
    dump_json(recipes_dict, STRUCTURED_RECIPES_FILEPATH)
    dump_json(no_recipes_warnings, STRUCTURED_NO_RECIPES_WARNINGS_FILEPATH)


if __name__ == "__main__":
    main()
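The read_pdf helper is imported from a local module that the article does not show. Since the loop calls content.items(), it presumably returns a mapping from page number to page text. The sketch below is one possible shape for it, assuming the pypdf library, 1-based page numbers, and that empty pages are skipped; all three are assumptions rather than the author's implementation.

```python
def pages_to_dict(pages):
    """Map 1-based page numbers to extracted text, skipping empty pages."""
    content = {}
    for page_number, page in enumerate(pages, start=1):
        text = (page.extract_text() or "").strip()
        if text:
            content[page_number] = text
    return content


def read_pdf(file_path):
    """Hypothetical read_pdf: extract the text of every page with pypdf."""
    # Imported here so pages_to_dict can be tried without pypdf installed.
    from pypdf import PdfReader

    return pages_to_dict(PdfReader(file_path).pages)


class FakePage:
    """Stand-in for a PDF page, used only to demonstrate pages_to_dict."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


content = pages_to_dict([FakePage("Tomato Soup ..."), FakePage("   "), FakePage("Apple Pie ...")])
print(content)
```

Keeping the original page numbers as dictionary keys lets the prompt and the warning log refer to the same page the reader would see in the PDF.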
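Complete Python Code
Below is the complete code. It relies on the find_structured_info_with_ai and generate_prompt functions from steps 2 and 3, and on the local pdf_text, settings, and utils modules, so these need to be in place before you run it: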
import os
from typing import List, Optional, Union

from dotenv import load_dotenv
from icecream import ic
from loguru import logger
from openai import OpenAI
from pydantic import BaseModel

from pdf_text import read_pdf
from settings import FILE_PATH, SYSTEM_CONTENT, STRUCTURED_RECIPES_FILEPATH, STRUCTURED_NO_RECIPES_WARNINGS_FILEPATH
from utils import dump_json


class Ingredient(BaseModel):
    item: Optional[str] = None
    quantity: Optional[Union[float, str]] = None
    unit: Optional[str] = None


class Recipe(BaseModel):
    id: int
    title: Optional[str] = None
    page: Optional[int] = None
    author: Optional[str] = None
    ingredients: Optional[List[Ingredient]] = None
    instructions: Optional[List[str]] = None


class CookBook(BaseModel):
    recipes: List[Recipe]


# Read OPENAI_API_KEY from a .env file and create the client.
load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# find_structured_info_with_ai() and generate_prompt() from steps 2 and 3 go here.


def main():
    # Simplified version of the main program in step 4 (no warning log).
    content = read_pdf(FILE_PATH)
    recipes = []

    for page_number, text in content.items():
        prompt = generate_prompt(page_number, text)
        response = find_structured_info_with_ai(prompt, SYSTEM_CONTENT, CookBook)
        if isinstance(response, CookBook) and response.recipes:
            recipes.extend(response.recipes)

    recipes_dict = [recipe.model_dump() for recipe in recipes]
    dump_json(recipes_dict, STRUCTURED_RECIPES_FILEPATH)


if __name__ == "__main__":
    main()
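The code imports dump_json from a local utils module that is not shown. Its name and its (data, file path) argument order come from the call sites above; the body below is only a guess at a reasonable implementation using the standard library.

```python
import json
import tempfile
from pathlib import Path


def dump_json(data, file_path):
    """Hypothetical dump_json: write data as UTF-8 JSON, creating parent folders."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(data, f, ensure_ascii=False, indent=2)


# Round-trip a small sample through a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out" / "recipes.json"
    dump_json([{"id": 1, "title": "Tomato Soup"}], target)
    loaded = json.loads(target.read_text(encoding="utf-8"))

print(loaded)
```

Since json.dump cannot serialize pydantic objects directly, the main program converts each recipe with model_dump() before handing the list to this helper.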
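Conclusion
With this upgrade, we have shown how to combine OpenAI's JSON Schema support with Python's pydantic library to extract structured data from a PDF in a more precise way. This improves the reliability of the extraction and reduces the amount of data cleaning needed afterwards.
In the future, the same approach can be applied to other kinds of documents, such as contracts, technical manuals, and reports, giving businesses and data scientists a more efficient data-processing tool.
We hope this article helps you better understand OpenAI's JSON Schema feature and where it can be applied.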