
VectorizeRetriever

This notebook shows how to use the LangChain Vectorize retriever.

Vectorize helps you build AI apps faster and with less hassle. It automates data extraction, finds the best vectorization strategy using RAG evaluation, and lets you quickly deploy real-time RAG pipelines for your unstructured data. Your vector search indexes stay up-to-date, and it integrates with your existing vector database, so you maintain full control of your data. Vectorize handles the heavy lifting, freeing you to focus on building robust AI solutions without getting bogged down by data management.

Setup

In the following steps, we set up the Vectorize environment and create a RAG pipeline.

Create a Vectorize account and get your access token

1. Sign up for a free Vectorize account at https://platform.vectorize.io/
2. Generate an access token in the Access Tokens section
3. Gather your organization ID: from the browser URL, extract the UUID after /organization/
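As a quick sketch of step 3 (the URL below is a made-up example, not a real organization), the organization ID is the UUID segment immediately after /organization/ in the browser address bar:

```python
# Hypothetical URL copied from the browser address bar
url = "https://platform.vectorize.io/organization/123e4567-e89b-12d3-a456-426614174000/pipelines"

# The organization ID is the path segment right after /organization/
org_id = url.split("/organization/")[1].split("/")[0]
print(org_id)  # 123e4567-e89b-12d3-a456-426614174000
```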

Configure the token and organization ID:

import getpass

VECTORIZE_ORG_ID = getpass.getpass("Enter Vectorize organization ID: ")
VECTORIZE_API_TOKEN = getpass.getpass("Enter Vectorize API Token: ")

Installation

This retriever lives in the langchain-vectorize package:

!pip install -qU langchain-vectorize

Download a PDF file

!wget "https://raw.githubusercontent.com/vectorize-io/vectorize-clients/refs/tags/python-0.1.3/tests/python/tests/research.pdf"

Initialize the vectorize client

import vectorize_client as v

api = v.ApiClient(v.Configuration(access_token=VECTORIZE_API_TOKEN))

Create a file upload source connector

import json
import os

import urllib3

connectors_api = v.ConnectorsApi(api)
response = connectors_api.create_source_connector(
    VECTORIZE_ORG_ID, [{"type": "FILE_UPLOAD", "name": "From API"}]
)
source_connector_id = response.connectors[0].id

Upload the PDF file

file_path = "research.pdf"

http = urllib3.PoolManager()
uploads_api = v.UploadsApi(api)
metadata = {"created-from-api": True}

upload_response = uploads_api.start_file_upload_to_connector(
    VECTORIZE_ORG_ID,
    source_connector_id,
    v.StartFileUploadToConnectorRequest(
        name=file_path.split("/")[-1],
        content_type="application/pdf",
        # add additional metadata that will be stored along with each chunk in the vector database
        metadata=json.dumps(metadata),
    ),
)

with open(file_path, "rb") as f:
    response = http.request(
        "PUT",
        upload_response.upload_url,
        body=f,
        headers={
            "Content-Type": "application/pdf",
            "Content-Length": str(os.path.getsize(file_path)),
        },
    )

if response.status != 200:
    print("Upload failed: ", response.data)
else:
    print("Upload successful")

Connect to the AI platform and vector database

ai_platforms = connectors_api.get_ai_platform_connectors(VECTORIZE_ORG_ID)
builtin_ai_platform = [
    c.id for c in ai_platforms.ai_platform_connectors if c.type == "VECTORIZE"
][0]

vector_databases = connectors_api.get_destination_connectors(VECTORIZE_ORG_ID)
builtin_vector_db = [
    c.id for c in vector_databases.destination_connectors if c.type == "VECTORIZE"
][0]
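The two list comprehensions above pick the first connector whose `type` is `VECTORIZE`. The same selection can be sketched with `next()`, which avoids building the full list and makes the "no match" case explicit. The mock objects below stand in for the API response objects, which expose `id` and `type` attributes:

```python
from types import SimpleNamespace

# Mock connectors standing in for the API response objects
connectors = [
    SimpleNamespace(id="c1", type="PINECONE"),
    SimpleNamespace(id="c2", type="VECTORIZE"),
]

# First connector of the built-in VECTORIZE type, or None if absent
builtin = next((c.id for c in connectors if c.type == "VECTORIZE"), None)
print(builtin)  # c2
```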

Configure and deploy the pipeline

pipelines = v.PipelinesApi(api)
response = pipelines.create_pipeline(
    VECTORIZE_ORG_ID,
    v.PipelineConfigurationSchema(
        source_connectors=[
            v.SourceConnectorSchema(
                id=source_connector_id, type="FILE_UPLOAD", config={}
            )
        ],
        destination_connector=v.DestinationConnectorSchema(
            id=builtin_vector_db, type="VECTORIZE", config={}
        ),
        ai_platform=v.AIPlatformSchema(
            id=builtin_ai_platform, type="VECTORIZE", config={}
        ),
        pipeline_name="My Pipeline From API",
        schedule=v.ScheduleSchema(type="manual"),
    ),
)
pipeline_id = response.data.id

Configure tracing (optional)

If you want to get automated tracing from individual queries, you can also set your LangSmith API key by uncommenting below:

# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

Instantiation

from langchain_vectorize.retrievers import VectorizeRetriever

retriever = VectorizeRetriever(
    api_token=VECTORIZE_API_TOKEN,
    organization=VECTORIZE_ORG_ID,
    pipeline_id=pipeline_id,
)

Usage

query = "Apple Shareholders equity"
retriever.invoke(query, num_results=2)

Use within a chain

Like other retrievers, VectorizeRetriever can be incorporated into LLM applications via chains.

We will need an LLM or chat model:

pip install -qU "langchain[google-genai]"
import getpass
import os

if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

from langchain.chat_models import init_chat_model

llm = init_chat_model("gemini-2.0-flash", model_provider="google_genai")
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
chain.invoke("...")
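`format_docs` simply joins the retrieved page contents with blank lines before they are interpolated into the prompt's `{context}` slot. A self-contained check of that behavior (a plain dataclass stands in for `langchain_core.documents.Document` here, so the snippet runs without LangChain installed):

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in exposing the page_content attribute the chain relies on."""

    page_content: str


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


docs = [Doc("First chunk."), Doc("Second chunk.")]
context = format_docs(docs)
print(context)
```

With real retriever output, `docs` would be the `Document` list returned by `retriever.invoke(...)`.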

API reference

For detailed documentation of all VectorizeRetriever features and configurations, head to the API reference.