How to use the LangChain indexing API

Here, we will look at a basic indexing workflow using the LangChain indexing API.

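The indexing API lets you load documents from any source and keep them in sync with a vector store. Specifically, it helps:

Avoid writing duplicated content into the vector store
Avoid re-writing unchanged content
Avoid re-computing embeddings over unchanged content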

All of this should save you time and money, as well as improve your vector search results.

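Crucially, the indexing API works even with documents that have gone through several transformation steps (e.g., text chunking) relative to the original source documents.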

How it works

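LangChain indexing uses a record manager (RecordManager) to keep track of documents written to the vector store.

When indexing content, a hash is computed for each document, and the following information is stored in the record manager:

The document hash (a hash of both page content and metadata)
The write time
The source ID: each document should include information in its metadata that lets us determine the ultimate source of the document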
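
The bookkeeping described above can be pictured with a minimal sketch. It is not LangChain's implementation: the `fingerprint` and `make_record` helpers, the choice of SHA-256, and the record layout are all assumptions made for illustration, using only the Python standard library.

```python
import hashlib
import json
import time


def fingerprint(page_content: str, metadata: dict) -> str:
    """Hash page content and metadata together (hypothetical scheme)."""
    # sort_keys makes the hash independent of metadata key order.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_record(page_content: str, metadata: dict, source_id_key: str = "source") -> dict:
    """Build the three fields the record manager is described as storing."""
    return {
        "hash": fingerprint(page_content, metadata),
        "written_at": time.time(),
        "source_id": metadata.get(source_id_key),
    }


a = make_record("kitty", {"source": "kitty.txt"})
b = make_record("kitty", {"source": "kitty.txt"})
c = make_record("kitty", {"source": "other.txt"})
print(a["hash"] == b["hash"])  # identical content and metadata hash the same
print(a["hash"] == c["hash"])  # changing the metadata changes the hash
```

Because the hash covers metadata as well as page content, a change to either one makes a document look new to this kind of bookkeeping.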

Deletion modes

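When indexing documents into a vector store, some existing documents in the vector store may need to be deleted. In certain situations you may want to remove any existing documents that derive from the same sources as the new documents being indexed. In others you may want to delete all existing documents wholesale. The indexing API deletion modes let you pick the behavior you want: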

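Cleanup Mode	De-Duplicates Content	Parallelizable	Cleans Up Deleted Source Docs	Cleans Up Mutations of Source Docs and/or Derived Docs	Clean Up Timing
None	✅	✅	❌	❌	-
Incremental	✅	✅	❌	✅	Continuously
Full	✅	❌	✅	✅	At end of indexing
Scoped_Full	✅	✅	❌	✅	At end of indexing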

None does not do any automatic cleanup, leaving the user to clean up old content manually.

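incremental, full and scoped_full offer the following automated cleanup:

If the content of the source document or derived documents has changed, all 3 modes will clean up (delete) previous versions of the content.
If the source document has been deleted (meaning it is not included in the documents currently being indexed), the full cleanup mode will delete it from the vector store correctly, but the incremental and scoped_full modes will not.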

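When content is mutated (e.g., the source PDF file was revised), there will be a period of time during indexing when both the new and old versions may be returned to the user. This happens after the new content has been written, but before the old version has been deleted.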

incremental indexing minimizes this period of time because it cleans up continuously, as it writes.
full and scoped_full modes do the cleanup after all batches have been written.
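
To make the difference between the modes concrete, here is a toy model of which stored entries a cleanup pass would remove. It is a sketch under stated assumptions, not the library's algorithm: `plan_cleanup` is a hypothetical function, documents are reduced to a mapping from hash to source ID, and scoped_full is left out because, per the table above, it differs from incremental in when cleanup happens rather than in the columns shown.

```python
def plan_cleanup(existing, incoming, mode):
    """Return the set of stored hashes a cleanup pass would delete.

    existing / incoming: dicts mapping document hash -> source id.
    mode: None, "incremental", or "full".
    """
    if mode is None:
        return set()  # no automatic cleanup
    # Stored entries that were not re-submitted in this run.
    stale = {h for h in existing if h not in incoming}
    if mode == "full":
        # Anything not passed in this run is removed.
        return stale
    if mode == "incremental":
        # Only stale entries whose source was seen in this run are removed.
        seen_sources = set(incoming.values())
        return {h for h in stale if existing[h] in seen_sources}
    raise ValueError(f"unknown mode: {mode!r}")


existing = {"h-kitty": "kitty.txt", "h-doggy": "doggy.txt"}
incoming = {"h-puppy": "doggy.txt"}  # doggy.txt was revised; kitty.txt was not passed

print(plan_cleanup(existing, incoming, None))             # set()
print(plan_cleanup(existing, incoming, "incremental"))    # {'h-doggy'}
print(sorted(plan_cleanup(existing, incoming, "full")))   # ['h-doggy', 'h-kitty']
```

In this model the entry for kitty.txt survives incremental cleanup because its source never appeared in the run, which mirrors the "Cleans Up Deleted Source Docs" column of the table.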

Requirements

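Do not use with a store that has been pre-populated with content independently of the indexing API, as the record manager will not know that records have been inserted previously.
Only works with LangChain vectorstores that support:
- document addition by ID (add_documents method with ids argument)
- deletion by ID (delete method with ids argument)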

Compatible vector stores: Aerospike, AnalyticDB, AstraDB, AwaDB, AzureCosmosDBNoSqlVectorSearch, AzureCosmosDBVectorSearch, AzureSearch, Bagel, Cassandra, Chroma, CouchbaseVectorStore, DashVector, DatabricksVectorSearch, DeepLake, Dingo, ElasticVectorSearch, ElasticsearchStore, FAISS, HanaDB, Milvus, MongoDBAtlasVectorSearch, MyScale, OpenSearchVectorSearch, PGVector, Pinecone, Qdrant, Redis, Rockset, ScaNN, SingleStoreDB, SupabaseVectorStore, SurrealDBStore, TimescaleVector, Vald, VDMS, Vearch, VespaStore, Weaviate, Yellowbrick, ZepVectorStore, TencentVectorDB.
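
For a store that is not on the list, a rough first check is whether it exposes the two methods named in the requirements. The sketch below is only a heuristic: `supports_indexing` and the two example classes are hypothetical, and a signature check cannot tell whether a method is actually implemented, so a store whose delete method exists but raises an error would still pass.

```python
import inspect


def supports_indexing(store) -> bool:
    """Heuristic: does `store` expose add_documents and delete accepting `ids`?"""
    for method_name in ("add_documents", "delete"):
        method = getattr(store, method_name, None)
        if not callable(method):
            return False
        params = inspect.signature(method).parameters
        # A **kwargs catch-all could also receive `ids`.
        accepts_kwargs = any(
            p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
        )
        if "ids" not in params and not accepts_kwargs:
            return False
    return True


class GoodStore:
    """Hypothetical store with both ID-based methods."""

    def add_documents(self, documents, ids=None): ...

    def delete(self, ids=None): ...


class AppendOnlyStore:
    """Hypothetical store that can neither take IDs nor delete."""

    def add_documents(self, documents): ...


print(supports_indexing(GoodStore()))        # True
print(supports_indexing(AppendOnlyStore()))  # False
```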

Caution

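The record manager relies on a time-based mechanism to determine what content can be cleaned up (when using the full, incremental or scoped_full cleanup modes).

If two tasks run back-to-back, and the first task finishes before the clock time changes, then the second task may not be able to clean up content.

This is unlikely to be an issue in practice for the following reasons:

The record manager uses higher-resolution timestamps.
The data would need to change between the first and second task runs, which becomes unlikely if the time interval between the tasks is small.
Indexing tasks typically take more than a few milliseconds.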
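
The edge case can be reproduced with a toy model in which a record counts as stale only if its timestamp is strictly older than the start of the current run. This is an assumption made for illustration, not the record manager's actual logic, and `run_index` is a hypothetical helper that behaves like a full-style cleanup.

```python
def run_index(records, docs, now):
    """Index `docs` at time `now`, then delete records older than `now`.

    records: dict mapping doc key -> last-updated timestamp (mutated in place).
    Returns the keys that were cleaned up.
    """
    for key in docs:
        records[key] = now  # add, or refresh the timestamp of an unchanged doc
    stale = [k for k, ts in records.items() if ts < now]
    for key in stale:
        del records[key]
    return stale


# Coarse clock: both runs see the same timestamp, so nothing looks stale.
records = {}
run_index(records, ["a", "b"], now=100)
print(run_index(records, ["a"], now=100))  # [] -> "b" survives

# Finer clock: the second run gets a later timestamp and "b" is removed.
records = {}
run_index(records, ["a", "b"], now=100.000)
print(run_index(records, ["a"], now=100.002))  # ['b']
```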

Quickstart

from langchain.indexes import SQLRecordManager, index
from langchain_core.documents import Document
from langchain_elasticsearch import ElasticsearchStore
from langchain_openai import OpenAIEmbeddings


Initialize a vector store and set up the embeddings:

collection_name = "test_index"

embedding = OpenAIEmbeddings()

vectorstore = ElasticsearchStore(
    es_url="http://localhost:9200", index_name="test_index", embedding=embedding
)

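Initialize a record manager with an appropriate namespace.

Suggestion: Use a namespace that takes into account both the vector store and the collection name in the vector store; e.g., "redis/my_docs", "chromadb/my_docs" or "postgres/my_docs".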

namespace = f"elasticsearch/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

Create a schema before using the record manager.

record_manager.create_schema()

Let's index some test documents:

doc1 = Document(page_content="kitty", metadata={"source": "kitty.txt"})
doc2 = Document(page_content="doggy", metadata={"source": "doggy.txt"})

Indexing into an empty vector store (the helper below clears any existing content first):

def _clear():
    """Hacky helper method to clear content. See the `full` mode section to understand why it works."""
    index([], record_manager, vectorstore, cleanup="full", source_id_key="source")

`None` deletion mode

This mode does not automatically clean up old versions of content; however, it still takes care of content de-duplication.

_clear()

index(
    [doc1, doc1, doc1, doc1, doc1],
    record_manager,
    vectorstore,
    cleanup=None,
    source_id_key="source",
)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

_clear()

index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

The second time around, all content will be skipped:

index([doc1, doc2], record_manager, vectorstore, cleanup=None, source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
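
The counts above follow from de-duplication alone. A minimal sketch of that bookkeeping, with plain strings standing in for documents and a hypothetical `index_counts` helper (not the real `index` function, and reporting only two of its four counters), reproduces the same numbers:

```python
def index_counts(batch, already_indexed):
    """Count what a dedup-only pass would add or skip."""
    unique = list(dict.fromkeys(batch))  # drop in-batch duplicates, keep order
    added = [d for d in unique if d not in already_indexed]
    skipped = [d for d in unique if d in already_indexed]
    already_indexed.update(added)  # remember what has now been written
    return {"num_added": len(added), "num_skipped": len(skipped)}


store = set()
print(index_counts(["kitty"] * 5, store))        # {'num_added': 1, 'num_skipped': 0}

store = set()
print(index_counts(["kitty", "doggy"], store))   # {'num_added': 2, 'num_skipped': 0}
print(index_counts(["kitty", "doggy"], store))   # {'num_added': 0, 'num_skipped': 2}
```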

`"incremental"` deletion mode

_clear()

index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Indexing again should result in both documents being skipped, which also skips the embedding operation!

index(
    [doc1, doc2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}

If we provide no documents with incremental indexing mode, nothing will change.

index([], record_manager, vectorstore, cleanup="incremental", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

If we mutate a document, the new version will be written and all old versions sharing the same source will be deleted.

changed_doc_2 = Document(page_content="puppy", metadata={"source": "doggy.txt"})

index(
    [changed_doc_2],
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 1, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 1}

`"full"` deletion mode

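In full mode the user should pass the full universe of content that should be indexed into the indexing function.

Any documents that are not passed into the indexing function and are present in the vectorstore will be deleted!

This behavior is useful for handling deletions of source documents.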

_clear()

all_docs = [doc1, doc2]

index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Say someone deleted the first doc:

del all_docs[0]

all_docs

[Document(page_content='doggy', metadata={'source': 'doggy.txt'})]

Using full mode will clean up the deleted content as well.

index(all_docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 0, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}

Source

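The metadata attribute contains a field called source. This source should point to the ultimate provenance associated with the given document.

For example, if these documents represent chunks of some parent document, the source for both documents should be the same and reference the parent document.

In general, source should always be specified. Only use None if you never intend to use incremental mode and, for some reason, cannot specify the source field correctly.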
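
The idea that every chunk keeps pointing at its parent can be sketched without any library. The `Doc` dataclass and `split_with_source` function below are hypothetical stand-ins, not LangChain classes; the only point is that each chunk receives a copy of the parent's metadata, including source.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a document with content and metadata (hypothetical)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_source(doc: Doc, chunk_size: int) -> list[Doc]:
    """Cut a document into fixed-size chunks that inherit the parent's metadata."""
    text = doc.page_content
    return [
        Doc(text[i : i + chunk_size], dict(doc.metadata))
        for i in range(0, len(text), chunk_size)
    ]


parent = Doc("kitty kitty kitty", {"source": "kitty.txt"})
chunks = split_with_source(parent, chunk_size=6)
print([c.page_content for c in chunks])         # ['kitty ', 'kitty ', 'kitty']
print({c.metadata["source"] for c in chunks})   # {'kitty.txt'}
```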

from langchain_text_splitters import CharacterTextSplitter


doc1 = Document(
    page_content="kitty kitty kitty kitty kitty", metadata={"source": "kitty.txt"}
)
doc2 = Document(page_content="doggy doggy the doggy", metadata={"source": "doggy.txt"})

new_docs = CharacterTextSplitter(
    separator="t", keep_separator=True, chunk_size=12, chunk_overlap=2
).split_documents([doc1, doc2])
new_docs

[Document(page_content='kitty kit', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),
 Document(page_content='doggy doggy', metadata={'source': 'doggy.txt'}),
 Document(page_content='the doggy', metadata={'source': 'doggy.txt'})]

_clear()

index(
    new_docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 5, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

changed_doggy_docs = [
    Document(page_content="woof woof", metadata={"source": "doggy.txt"}),
    Document(page_content="woof woof woof", metadata={"source": "doggy.txt"}),
]

This should delete the old versions of documents associated with the doggy.txt source and replace them with the new versions.

index(
    changed_doggy_docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 2}

vectorstore.similarity_search("dog", k=30)

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='tty kitty', metadata={'source': 'kitty.txt'}),
 Document(page_content='tty kitty ki', metadata={'source': 'kitty.txt'}),
 Document(page_content='kitty kit', metadata={'source': 'kitty.txt'})]

Using with loaders

Indexing can accept either an iterable of documents or any loader.

Attention: The loader must set source keys correctly.

from langchain_core.document_loaders import BaseLoader


class MyCustomLoader(BaseLoader):
    def lazy_load(self):
        text_splitter = CharacterTextSplitter(
            separator="t", keep_separator=True, chunk_size=12, chunk_overlap=2
        )
        docs = [
            Document(page_content="woof woof", metadata={"source": "doggy.txt"}),
            Document(page_content="woof woof woof", metadata={"source": "doggy.txt"}),
        ]
        yield from text_splitter.split_documents(docs)

    def load(self):
        return list(self.lazy_load())


_clear()

loader = MyCustomLoader()

loader.load()

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]

index(loader, record_manager, vectorstore, cleanup="full", source_id_key="source")

{'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

vectorstore.similarity_search("dog", k=30)

[Document(page_content='woof woof', metadata={'source': 'doggy.txt'}),
 Document(page_content='woof woof woof', metadata={'source': 'doggy.txt'})]
