
Build a semantic search engine

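This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or RAG (see our RAG tutorial).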

Here we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.

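Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:

  • Documents and document loaders;
  • Text splitters;
  • Embeddings;
  • Vector stores and retrievers.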

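Setup

Jupyter Notebook

This and other tutorials are perhaps most conveniently run in a Jupyter notebook. See the Jupyter installation instructions for how to install it.

Installation

This tutorial requires the langchain-community and pypdf packages: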

pip install langchain-community pypdf

For more details, see our installation guide.

LangSmith

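Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith.

After you sign up for LangSmith, make sure to set your environment variables to start logging traces: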

export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

Or, if you are in a notebook, you can set them with:

import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

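Documents and document loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has three attributes:

  • page_content: a string representing the content;
  • metadata: a dict containing arbitrary metadata;
  • id: (optional) a string identifier for the document.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other details. Note that an individual Document object often represents a chunk of a larger document.

We can generate sample documents when desired: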

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

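However, the LangChain ecosystem implements document loaders that integrate with hundreds of common data sources. This makes it easy to incorporate data from these sources into your AI application.

Loading documents

Let's load a PDF into a sequence of Document objects. The LangChain repo includes a sample PDF, a 10-K filing for Nike from 2023. We can consult the LangChain documentation for the available PDF document loaders. Let's select PyPDFLoader, which is fairly lightweight.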

from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))
107
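Tip: see the guide on PDF document loaders for more detail.

PyPDFLoader loads one Document object per PDF page. For each, we can easily access:

  • The string content of the page;
  • Metadata containing the file name and page number.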
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO

{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}

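Splitting

For both information retrieval and downstream question answering, a page may be too coarse a representation. Our goal in the end is to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meaning of relevant portions of the document is not "washed out" by surrounding text.

We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which recursively splits the document using common separators such as new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

We set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as the metadata attribute "start_index".

See the guide on PDF document loaders for more detail about working with PDFs, including how to extract text from specific sections and images.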

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)
514
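To make the roles of chunk_size, chunk_overlap, and start_index concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators such as new lines); the helper name split_with_overlap and its dict output are assumptions made purely for illustration.

```python
from typing import Dict, List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Dict]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({"start_index": start, "page_content": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece["start_index"], piece["page_content"])
```

Each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence near a chunk boundary attached to some of its context.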

Embeddings

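Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text.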
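As a small illustration of the similarity metric mentioned above, the following sketch computes cosine similarity with the standard library only. The three-dimensional vectors are made up for the example; real embedding vectors are far longer.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" invented for illustration.
query = [1.0, 0.0, 1.0]
close_text = [2.0, 0.0, 2.0]      # same direction as the query, different length
unrelated_text = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close_text))
print(cosine_similarity(query, unrelated_text))
```

Because cosine similarity depends only on direction, the vector that points the same way as the query scores close to 1 regardless of its length, while the orthogonal vector scores 0.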

LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model:

pip install -qU langchain-openai
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
Generated vectors of length 1536

[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]

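Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads. Let's select a vector store: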

pip install -qU langchain-core
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

Having instantiated our vector store, we can now index the documents.

ids = vector_store.add_documents(documents=all_splits)

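Note that most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail.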
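Conceptually, indexing pairs each text with its vector and hands back an identifier, and querying ranks the stored vectors against the query vector. The sketch below shows that idea in plain Python. It is not the InMemoryVectorStore implementation: ToyVectorStore, toy_embed, and the five-word vocabulary are hypothetical stand-ins, and the "embedding" is just a word count rather than a learned model.

```python
import math
import uuid
from typing import Dict, List, Tuple

VOCABULARY = ["dog", "cat", "loyal", "independent", "pet"]  # toy vocabulary (assumption)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: count words starting with each term."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    return [float(sum(1 for w in words if w.startswith(term))) for term in VOCABULARY]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class ToyVectorStore:
    """Illustrative only: keeps every vector in a dict and scans all of them per query."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[List[float], str]] = {}

    def add_texts(self, texts: List[str]) -> List[str]:
        ids = []
        for text in texts:
            doc_id = str(uuid.uuid4())  # one generated identifier per stored text
            self._items[doc_id] = (toy_embed(text), text)
            ids.append(doc_id)
        return ids

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        query_vector = toy_embed(query)
        ranked = sorted(
            self._items.values(),
            key=lambda item: cosine(query_vector, item[0]),
            reverse=True,  # most similar first
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
ids = store.add_texts([
    "Dogs are great companions, known for their loyalty and friendliness.",
    "Cats are independent pets that often enjoy their own space.",
])
print(store.similarity_search("Which pet is independent?", k=1))
```

A full scan like this is only reasonable for small collections; dedicated vector stores typically add indexing structures so that queries do not have to compare against every stored vector.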

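Once we have instantiated a VectorStore that contains documents, we can query it. [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) includes methods for querying:

  • Synchronously and asynchronously;
  • By string query and by vector;
  • With and without returning similarity scores;
  • By similarity and by maximum marginal relevance (which balances similarity to the query against diversity in the retrieved results).

These methods will generally include a list of Document objects in their outputs.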
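Maximum marginal relevance is easier to see in code than in prose. The sketch below is a plain-Python version of the usual greedy formulation: at each step it picks the candidate that maximizes lambda_mult × (similarity to the query) − (1 − lambda_mult) × (highest similarity to anything already picked). The function name mmr_select and the toy two-dimensional vectors are assumptions for illustration, and a given vector store's implementation may differ in its details.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate indices, trading query relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy is the closest match among already selected candidates.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # index 0: very close to the query
    [1.0, 0.11],  # index 1: near-duplicate of index 0
    [0.6, 0.8],   # index 2: less similar, but points in a different direction
]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure similarity ranking
print(mmr_select(query, candidates, k=2, lambda_mult=0.3))  # weights diversity more heavily
```

With lambda_mult=1.0 the two near-duplicates are returned (indices 0 and 1). With lambda_mult=0.3 the second pick switches to index 2, because the near-duplicate is penalized for repeating what was already selected.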

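Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key terms used in the document.

Return documents based on similarity to a string query: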

results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}

Async query:

results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])
page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

Return scores:

# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}

Return documents based on similarity to an embedded query:

embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}


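Retrievers

LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can also interface with non-vector-store sources of data (such as external APIs).

We can create a simple version of this ourselves, without subclassing Retriever. If we choose which method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the similarity_search method: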

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
[Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]

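Vector stores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following: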

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
[Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]

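VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold the documents output by the retriever by similarity score.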
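The idea behind score thresholding can be sketched in a few lines of plain Python. This is an illustration rather than the retriever's actual code: the helper filter_by_score and the (document, score) pairs are made up, and the sketch assumes scores are similarities where higher means more relevant.

```python
from typing import List, Tuple


def filter_by_score(
    scored_docs: List[Tuple[str, float]], score_threshold: float, k: int = 4
) -> List[str]:
    """Keep at most k documents whose similarity score is at least score_threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return [doc for doc, _ in kept[:k]]


# Made-up (document, similarity) pairs for illustration.
scored = [("chunk A", 0.82), ("chunk B", 0.41), ("chunk C", 0.67)]
print(filter_by_score(scored, score_threshold=0.5))
```

Unlike a fixed k, a threshold can return fewer results than requested, or none at all, when nothing in the store is close enough to the query.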

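Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.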
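As a rough sketch of what "combining a question with retrieved context into a prompt" can look like, the snippet below assembles a prompt string in plain Python. The helper build_rag_prompt and the prompt wording are assumptions for illustration, not a template taken from LangChain; the context sentence is copied from the retrieval output shown earlier.

```python
from typing import List


def build_rag_prompt(question: str, context_chunks: List[str]) -> str:
    """Join retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    "NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon.",
]
prompt = build_rag_prompt("When was Nike incorporated?", retrieved)
print(prompt)
```

In a real application the list of chunks would come from a retriever call, and the resulting string would be passed to a chat model.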

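Learn more:

Retrieval strategies can be rich and complex. The retrievers section of the how-to guides covers many built-in retrieval strategies.

It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See our how-to guide on custom retrievers.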

Next steps

You have now seen how to build a semantic search engine over a PDF document.

For more on document loaders, embeddings, vector stores, and RAG, see the corresponding sections of the LangChain documentation.