
How to pass multimodal data to models

Here we demonstrate how to pass multimodal (/docs/concepts/multimodality/) input directly to models.

LangChain supports multimodal data as input to chat models:

  1. Following provider-specific formats
  2. Adhering to a cross-provider standard

Below, we demonstrate the cross-provider standard. See chat model integrations for detail on native formats specific to a given provider.

note

Most chat models that support multimodal image inputs will also accept those values in OpenAI's Chat Completions format:

{
    "type": "image_url",
    "image_url": {"url": image_url},
}
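
As a minimal sketch (an illustration, not from the examples below), such a block can be placed directly in a message's content list; `llm` and `image_url` are assumed to be initialized as in the examples later in this guide:

# Sketch only: the OpenAI Chat Completions-style block goes directly into the
# message's content list. `llm` and `image_url` are defined as in the examples below.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
}
response = llm.invoke([message])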

Images

Many providers will accept images passed inline as base64-encoded data. Some will additionally accept images from URLs directly.

Images from base64 data

To pass images inline, format them as content blocks of the following form:

{
    "type": "image",
    "source_type": "base64",
    "mime_type": "image/jpeg",  # or image/png, etc.
    "data": "<base64 data string>",
}

Example:

import base64

import httpx
from langchain.chat_models import init_chat_model

# Fetch image data
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")


# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "base64",
            "data": image_data,
            "mime_type": "image/jpeg",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
API Reference: init_chat_model
The image shows a beautiful clear day with bright blue skies and wispy cirrus clouds stretching across the horizon. The clouds are thin and streaky, creating elegant patterns against the blue backdrop. The lighting suggests it's during the day, possibly late afternoon given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no indication of rain. It's the kind of perfect, mild weather that's ideal for walking along the wooden boardwalk through the marsh grass.

See the LangSmith trace for more detail.

Images from a URL

Some providers (including OpenAI, Anthropic, and Google Gemini) will also accept images from URLs directly.

To pass images as URLs, format them as content blocks of the following form:

{
    "type": "image",
    "source_type": "url",
    "url": "https://...",
}

Example:

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "url",
            "url": image_url,
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The weather in this image appears to be pleasant and clear. The sky is mostly blue with a few scattered, light clouds, and there is bright sunlight illuminating the green grass and plants. There are no signs of rain or stormy conditions, suggesting it is a calm, likely warm day—typical of spring or summer.

We can also pass in multiple images:

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Are these two images the same?"},
        {"type": "image", "source_type": "url", "url": image_url},
        {"type": "image", "source_type": "url", "url": image_url},
    ],
}
response = llm.invoke([message])
print(response.text())
Yes, these two images are the same. They depict a wooden boardwalk going through a grassy field under a blue sky with some clouds. The colors, composition, and elements in both images are identical.

Documents (PDF)

Some providers (including OpenAI, Anthropic, and Google Gemini) will accept PDF documents.

note

OpenAI requires file names to be specified for PDF inputs. When using LangChain's format, include the filename key. See the example below.

Documents from base64 data

To pass documents inline, format them as content blocks of the following form:

{
    "type": "file",
    "source_type": "base64",
    "mime_type": "application/pdf",
    "data": "<base64 data string>",
}

Example:

import base64

import httpx
from langchain.chat_models import init_chat_model

# Fetch PDF data
pdf_url = "https://pdfobject.com/pdf/sample.pdf"
pdf_data = base64.b64encode(httpx.get(pdf_url).content).decode("utf-8")


# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
API Reference: init_chat_model
This document appears to be a sample PDF file that contains Lorem ipsum placeholder text. It begins with a title "Sample PDF" followed by the subtitle "This is a simple PDF file. Fun fun fun."

The rest of the document consists of several paragraphs of Lorem ipsum text, which is a commonly used placeholder text in design and publishing. The text is formatted in a clean, readable layout with consistent paragraph spacing. The document appears to be a single page containing four main paragraphs of this placeholder text.

The Lorem ipsum text, while appearing to be Latin, is actually scrambled Latin-like text that is used primarily to demonstrate the visual form of a document or typeface without the distraction of meaningful content. It's commonly used in publishing and graphic design when the actual content is not yet available but the layout needs to be demonstrated.

The document has a professional, simple layout with generous margins and clear paragraph separation, making it an effective example of basic PDF formatting and structure.

Documents from a URL

Some providers (specifically Anthropic) will also accept documents from URLs directly.

To pass documents as URLs, format them as content blocks of the following form:

{
    "type": "file",
    "source_type": "url",
    "url": "https://...",
}

Example:

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "url",
            "url": pdf_url,
        },
    ],
}
response = llm.invoke([message])
print(response.text())
This document appears to be a sample PDF file with both text and an image. It begins with a title "Sample PDF" followed by the text "This is a simple PDF file. Fun fun fun." The rest of the document contains Lorem ipsum placeholder text arranged in several paragraphs. The content is shown both as text and as an image of the formatted PDF, with the same content displayed in a clean, formatted layout with consistent spacing and typography. The document consists of a single page containing this sample text.

Audio

Some providers (including OpenAI and Google Gemini) will accept audio inputs.

Audio from base64 data

To pass audio inline, format it as content blocks of the following form:

{
    "type": "audio",
    "source_type": "base64",
    "mime_type": "audio/wav",  # or appropriate mime-type
    "data": "<base64 data string>",
}

Example:

import base64

import httpx
from langchain.chat_models import init_chat_model

# Fetch audio data
audio_url = "https://upload.wikimedia.org/wikipedia/commons/3/3d/Alcal%C3%A1_de_Henares_%28RPS_13-04-2024%29_canto_de_ruise%C3%B1or_%28Luscinia_megarhynchos%29_en_el_Soto_del_Henares.wav"
audio_data = base64.b64encode(httpx.get(audio_url).content).decode("utf-8")


# Pass to LLM
llm = init_chat_model("google_genai:gemini-2.0-flash-001")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe this audio:",
        },
        {
            "type": "audio",
            "source_type": "base64",
            "data": audio_data,
            "mime_type": "audio/wav",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
API Reference: init_chat_model
The audio appears to consist primarily of bird sounds, specifically bird vocalizations like chirping and possibly other bird songs.

Provider-specific parameters

Some providers will support or require additional fields on content blocks containing multimodal data. For example, Anthropic lets you specify caching of specific content to reduce token consumption.

To use these fields, you can:

  1. Store them directly on the content block; or
  2. Use the native format supported by each provider (see the sketch below and chat model integrations for detail).
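
As a minimal sketch of option 2 (an illustration of Anthropic's native content-block shape, not part of the standard format above), an image block with caching might look like this, reusing `llm` and `image_data` from earlier:

# Sketch only: Anthropic's native image block nests type/media_type/data under a
# "source" key; provider-specific fields such as cache_control sit alongside it.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the weather in this image:"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_data,
            },
            "cache_control": {"type": "ephemeral"},
        },
    ],
}
response = llm.invoke([message])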

We show three examples below.

Example: Anthropic prompt caching

llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "url",
            "url": image_url,
            "cache_control": {"type": "ephemeral"},
        },
    ],
}
response = llm.invoke([message])
print(response.text())
response.usage_metadata
The image shows a beautiful, clear day with partly cloudy skies. The sky is a vibrant blue with wispy, white cirrus clouds stretching across it. The lighting suggests it's during daylight hours, possibly late afternoon or early evening given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no threatening weather conditions. It's the kind of perfect weather you'd want for a walk along this wooden boardwalk through the marshland or grassland area.
{'input_tokens': 1586,
 'output_tokens': 117,
 'total_tokens': 1703,
 'input_token_details': {'cache_read': 0, 'cache_creation': 1582}}
next_message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Summarize that in 5 words.",
        }
    ],
}
response = llm.invoke([message, response, next_message])
print(response.text())
response.usage_metadata
Clear blue skies, wispy clouds.
{'input_tokens': 1716,
 'output_tokens': 12,
 'total_tokens': 1728,
 'input_token_details': {'cache_read': 1582, 'cache_creation': 0}}

Example: Anthropic citations

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Generate a 5 word summary of this document.",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
            "citations": {"enabled": True},
        },
    ],
}
response = llm.invoke([message])
response.content
[{'citations': [{'cited_text': 'Sample PDF\r\nThis is a simple PDF file. Fun fun fun.\r\n',
    'document_index': 0,
    'document_title': None,
    'end_page_number': 2,
    'start_page_number': 1,
    'type': 'page_location'}],
  'text': 'Simple PDF file: fun fun',
  'type': 'text'}]

Example: OpenAI file names

OpenAI requires that PDF documents be associated with file names:

llm = init_chat_model("openai:gpt-4.1")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
            "filename": "my-file",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The document is a sample PDF file containing placeholder text. It consists of one page, titled "Sample PDF". The content is a mixture of English and the commonly used filler text "Lorem ipsum dolor sit amet..." and its extensions, which are often used in publishing and web design as generic text to demonstrate font, layout, and other visual elements.

**Key points about the document:**
- Length: 1 page
- Purpose: Demonstrative/sample content
- Content: No substantive or meaningful information, just demonstration text in paragraph form
- Language: English (with the Latin-like "Lorem Ipsum" text used for layout purposes)

There are no charts, tables, diagrams, or images on the page—only plain text. The document serves as an example of what a PDF file looks like rather than providing actual, useful content.

Tool calls

Some multimodal models support tool calling features as well. To call tools using such models, bind tools to them in the usual way, then invoke the model using content blocks of the desired type (e.g., containing image data).

from typing import Literal

from langchain_core.tools import tool


@tool
def weather_tool(weather: Literal["sunny", "cloudy", "rainy"]) -> None:
"""Describe the weather"""
pass


llm_with_tools = llm.bind_tools([weather_tool])

message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the weather in this image:"},
        {"type": "image", "source_type": "url", "url": image_url},
    ],
}
response = llm_with_tools.invoke([message])
response.tool_calls
API Reference: tool
[{'name': 'weather_tool',
  'args': {'weather': 'sunny'},
  'id': 'toolu_01G6JgdkhwggKcQKfhXZQPjf',
  'type': 'tool_call'}]