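How to pass multimodal data to models
Here we demonstrate how to pass multimodal (/docs/concepts/multimodality/) input directly to models.
LangChain supports multimodal data as input to chat models:
- Following provider-specific formats
- Adhering to a cross-provider standard
Below, we demonstrate the cross-provider standard. For details on the native formats of specific providers, see the chat model integrations.
Note
Most chat models that support multimodal image inputs also accept those values in OpenAI's Chat Completions format: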
{
    "type": "image_url",
    "image_url": {"url": image_url},
}
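As a minimal sketch of the OpenAI-style block shown in the note, the helpers below wrap a URL in that structure. The function names `openai_image_block` and `base64_data_url` are hypothetical and not part of LangChain. The second helper assumes the target provider accepts a `data:` URL in the `url` field for inline base64 images, which is worth checking against the provider's own documentation.

```python
def openai_image_block(url: str) -> dict:
    """Wrap an image URL in an OpenAI Chat Completions-style content block."""
    return {"type": "image_url", "image_url": {"url": url}}


def base64_data_url(b64_data: str, mime_type: str = "image/jpeg") -> str:
    """Build a data URL from an already base64-encoded string.

    Assumption: the provider accepts data URLs where an image URL is expected.
    """
    return f"data:{mime_type};base64,{b64_data}"
```

The result of `openai_image_block(...)` can be placed in a message's `content` list alongside a text block.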
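Images
Many providers accept images passed in-line as base64 data. Some also accept an image directly from a URL.
Images from base64 data
To pass images in-line, format them as content blocks of the following form:
{
    "type": "image",
    "source_type": "base64",
    "mime_type": "image/jpeg",  # or image/png, etc.
    "data": "<base64 data string>",
}
Example: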
import base64

import httpx
from langchain.chat_models import init_chat_model

# Fetch image data
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")

# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "base64",
            "data": image_data,
            "mime_type": "image/jpeg",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
API Reference: init_chat_model
The image shows a beautiful clear day with bright blue skies and wispy cirrus clouds stretching across the horizon. The clouds are thin and streaky, creating elegant patterns against the blue backdrop. The lighting suggests it's during the day, possibly late afternoon given the warm, golden quality of the light on the grass. The weather appears calm with no signs of wind (the grass looks relatively still) and no indication of rain. It's the kind of perfect, mild weather that's ideal for walking along the wooden boardwalk through the marsh grass.
See the LangSmith trace for more detail.
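When the image is already available as raw bytes (for example, read from disk), the base64 content block can be assembled without a network call. The following is a rough sketch, not a LangChain API: `image_block_from_bytes` is a hypothetical helper, and guessing the MIME type from the file name via the standard-library `mimetypes` module is an assumption made for convenience.

```python
import base64
import mimetypes


def image_block_from_bytes(raw: bytes, filename: str) -> dict:
    """Build a base64 image content block from raw bytes.

    The MIME type is guessed from the file name (an assumption of this sketch);
    pass a name with a recognizable image extension such as .jpg or .png.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    return {
        "type": "image",
        "source_type": "base64",
        "data": base64.b64encode(raw).decode("utf-8"),
        "mime_type": mime_type,
    }
```

The returned dictionary has the same keys as the image block in the example above and can be appended to a message's `content` list.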
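Images from a URL
Some providers (including OpenAI, Anthropic, and Google Gemini) also accept images directly from a URL.
To pass an image as a URL, format it as a content block of the following form:
{
    "type": "image",
    "source_type": "url",
    "url": "https://...",
}
Example: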
message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the weather in this image:",
        },
        {
            "type": "image",
            "source_type": "url",
            "url": image_url,
        },
    ],
}
response = llm.invoke([message])
print(response.text())
The weather in this image appears to be pleasant and clear. The sky is mostly blue with a few scattered, light clouds, and there is bright sunlight illuminating the green grass and plants. There are no signs of rain or stormy conditions, suggesting it is a calm, likely warm day—typical of spring or summer.
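Because the provider fetches the URL itself, it can help to catch malformed URLs before sending a request. The sketch below builds the URL content block after a basic check; `image_block_from_url` is a hypothetical helper, and restricting the scheme to `http` and `https` is an assumption of this sketch rather than a documented provider requirement.

```python
from urllib.parse import urlparse


def image_block_from_url(url: str) -> dict:
    """Build a URL image content block after a basic sanity check."""
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are useful to a remote provider.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    return {"type": "image", "source_type": "url", "url": url}
```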
We can also pass in multiple images:
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Are these two images the same?"},
        {"type": "image", "source_type": "url", "url": image_url},
        {"type": "image", "source_type": "url", "url": image_url},
    ],
}
response = llm.invoke([message])
print(response.text())
Yes, these two images are the same. They depict a wooden boardwalk going through a grassy field under a blue sky with some clouds. The colors, composition, and elements in both images are identical.
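For an arbitrary number of images, the message can be assembled programmatically instead of being written out by hand. This is a small sketch under the same content-block format as above; `build_image_message` is a hypothetical helper name, not a LangChain function.

```python
def build_image_message(prompt: str, image_urls: list[str]) -> dict:
    """Build a user message with one text block followed by one image block per URL."""
    content = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image", "source_type": "url", "url": url} for url in image_urls
    )
    return {"role": "user", "content": content}
```

The result can be passed to the model in the same way as the hand-written message, for example `llm.invoke([build_image_message("Are these two images the same?", [image_url, image_url])])`.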
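Documents (PDF)
Some providers (including OpenAI, Anthropic, and Google Gemini) accept PDF documents.
Note
OpenAI requires a file name to be specified for PDF inputs. When using LangChain's format, include the filename key. See the example below.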
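As a rough illustration of the `filename` key mentioned in the note, the sketch below adds it to a base64 PDF block. The helper name `pdf_block_with_filename` and the file name in the usage line are hypothetical, and placing `filename` at the top level of the block next to the other keys is an assumption of this sketch; the other keys follow the base64 file block described in this section.

```python
def pdf_block_with_filename(b64_data: str, filename: str) -> dict:
    """Build a base64 PDF content block that also carries a file name."""
    if not filename:
        raise ValueError("filename must be a non-empty string")
    return {
        "type": "file",
        "source_type": "base64",
        "mime_type": "application/pdf",
        "data": b64_data,
        # Assumption: the key sits alongside the other top-level keys.
        "filename": filename,
    }
```

For example, `pdf_block_with_filename(pdf_data, "my-file.pdf")` yields a block that can be added to a message's `content` list.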
Documents from base64 data
To pass documents in-line, format them as content blocks of the following form:
{
    "type": "file",
    "source_type": "base64",
    "mime_type": "application/pdf",
    "data": "<base64 data string>",
}
Example:
import base64

import httpx
from langchain.chat_models import init_chat_model

# Fetch PDF data
pdf_url = "https://pdfobject.com/pdf/sample.pdf"
pdf_data = base64.b64encode(httpx.get(pdf_url).content).decode("utf-8")

# Pass to LLM
llm = init_chat_model("anthropic:claude-3-5-sonnet-latest")

message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "Describe the document:",
        },
        {
            "type": "file",
            "source_type": "base64",
            "data": pdf_data,
            "mime_type": "application/pdf",
        },
    ],
}
response = llm.invoke([message])
print(response.text())
API Reference: init_chat_model
This document appears to be a sample PDF file that contains Lorem ipsum placeholder text. It begins with a title "Sample PDF" followed by the subtitle "This is a simple PDF file. Fun fun fun."
The rest of the document consists of several paragraphs of Lorem ipsum text, which is a commonly used placeholder text in design and publishing. The text is formatted in a clean, readable layout with consistent paragraph spacing. The document appears to be a single page containing four main paragraphs of this placeholder text.
The Lorem ipsum text, while appearing to be Latin, is actually scrambled Latin-like text that is used primarily to demonstrate the visual form of a document or typeface without the distraction of meaningful content. It's commonly used in publishing and graphic design when the actual content is not yet available but the layout needs to be demonstrated.
The document has a professional, simple layout with generous margins and clear paragraph separation, making it an effective example of basic PDF formatting and structure.
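If the PDF bytes come from an untrusted or uncertain source, a quick check before encoding can avoid sending a non-PDF payload to the model. The sketch below is one possible approach, not part of LangChain: `pdf_block_from_bytes` is a hypothetical helper, and testing for the `%PDF-` header at the start of the data is a simple sanity check rather than full validation.

```python
import base64


def pdf_block_from_bytes(raw: bytes) -> dict:
    """Build a base64 PDF content block from raw bytes.

    Performs only a lightweight check that the data starts with the PDF header.
    """
    if not raw.startswith(b"%PDF-"):
        raise ValueError("data does not start with the %PDF- header")
    return {
        "type": "file",
        "source_type": "base64",
        "data": base64.b64encode(raw).decode("utf-8"),
        "mime_type": "application/pdf",
    }
```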