Astra DB Vector Store

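This page provides a quickstart for using Astra DB as a Vector Store.

DataStax Astra DB is a serverless, AI-ready database built on Apache Cassandra® and made conveniently available through an easy-to-use JSON API.

Setup

Dependencies

Using the integration requires the langchain-astradb partner package: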

!pip install \
    "langchain>=0.3.23,<0.4" \
    "langchain-core>=0.3.52,<0.4" \
    "langchain-astradb>=0.6,<0.7"

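Credentials

Before using the Astra DB vector store, you must first go to the Astra DB website, create an account, and then create a new database. Initialization may take a few minutes.

Once the database has been initialized, retrieve your connection secrets, which you will need shortly. These are:

an API Endpoint, such as "https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com/"
and a Database Token, such as "AstraCS:aBcD123......"

You may optionally provide a keyspace (called "namespace" in LangChain components), which you can manage from the Data Explorer tab of your database dashboard. If you prefer, you can leave it empty in the prompt below and fall back to the default keyspace.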

import getpass

ASTRA_DB_API_ENDPOINT = input("ASTRA_DB_API_ENDPOINT = ").strip()
ASTRA_DB_APPLICATION_TOKEN = getpass.getpass("ASTRA_DB_APPLICATION_TOKEN = ").strip()

desired_keyspace = input("(optional) ASTRA_DB_KEYSPACE = ").strip()
if desired_keyspace:
    ASTRA_DB_KEYSPACE = desired_keyspace
else:
    ASTRA_DB_KEYSPACE = None

ASTRA_DB_API_ENDPOINT =  https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com
ASTRA_DB_APPLICATION_TOKEN =  ········
(optional) ASTRA_DB_KEYSPACE =
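
For non-interactive runs (scripts, scheduled jobs), the same three values could be read from environment variables instead of prompts. The following is a minimal sketch under that assumption: the helper name read_astra_settings is hypothetical, and the environment variable names simply mirror the prompts above.

```python
import os
from typing import Mapping, Optional


def read_astra_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect Astra DB connection settings from a mapping of environment variables.

    Hypothetical helper: the variable names mirror the prompts used on this page.
    """
    endpoint = env.get("ASTRA_DB_API_ENDPOINT", "").strip()
    token = env.get("ASTRA_DB_APPLICATION_TOKEN", "").strip()
    if not endpoint or not token:
        raise ValueError("Both the API endpoint and the application token are required.")
    # An empty or missing keyspace becomes None, meaning "use the default keyspace".
    keyspace: Optional[str] = env.get("ASTRA_DB_KEYSPACE", "").strip() or None
    return {"api_endpoint": endpoint, "token": token, "namespace": keyspace}
```

The keys of the returned dictionary match the keyword arguments passed to the AstraDBVectorStore constructor elsewhere on this page (api_endpoint, token, namespace), so the dictionary can be unpacked into the constructor with **.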

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

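Initialization

There are several ways to create an Astra DB vector store:

Method 1: Explicit embeddings

You can instantiate a langchain_core.embeddings.Embeddings class separately and pass it to the AstraDBVectorStore constructor, just as with most other LangChain vector stores.

Method 2: Server-side embeddings ('vectorize')

Alternatively, you can use Astra DB's server-side embedding computation ("vectorize") and simply specify an embedding model when creating the server infrastructure for the store. Embedding computations in subsequent read and write operations are then handled entirely within the database. (To proceed with this method, you must have enabled the desired embedding integration for your database, as described in the documentation.)

Method 3: Autodetect from a pre-existing collection

You may already have a collection in your Astra DB, possibly pre-populated with data by other means (for example, through the Astra UI or a third-party application), and simply want to start querying it from LangChain. In this case, the right approach is to enable the autodetect_collection mode in the vector store constructor and let the class work out the details. (Of course, if your collection has no 'vectorize', you still need to provide an Embeddings object.)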

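A note on "hybrid search"

Astra DB vector stores support metadata filtering in vector searches. In addition, version 0.6 fully supports hybrid search through the findAndRerank database primitive: documents are retrieved by both a vector-similarity search and a keyword-based ("lexical") search, and the two result sets are then merged by a reranker model. This search strategy, handled entirely on the server side, can improve the accuracy of the results and therefore the quality of your RAG application. Whenever it is available, the vector store uses hybrid search automatically (though you can control it manually if you wish).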
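
To make the idea concrete, here is a purely conceptual toy sketch of the "retrieve twice, merge, rerank" flow. It is not how Astra DB implements findAndRerank: in the real system both retrieval steps and the reranker model run on the server. The function names, document IDs, and scores below are all made up for illustration.

```python
from typing import Callable


def merge_candidates(vector_hits: list[str], lexical_hits: list[str]) -> list[str]:
    """Union of both hit lists, without duplicates, preserving first-seen order."""
    seen: dict[str, None] = {}
    for doc_id in [*vector_hits, *lexical_hits]:
        seen.setdefault(doc_id, None)
    return list(seen)


def rerank(candidates: list[str], score: Callable[[str], float], k: int) -> list[str]:
    """Order candidates by a relevance score (a stand-in for a reranker model)."""
    return sorted(candidates, key=score, reverse=True)[:k]


# Made-up results of the two retrieval steps.
vector_hits = ["a", "b", "c"]
lexical_hits = ["c", "d"]

# Made-up reranker scores for each candidate.
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.7, "d": 0.8}

candidates = merge_candidates(vector_hits, lexical_hits)
top = rerank(candidates, toy_scores.__getitem__, k=3)
print(top)
```

Note that document "d" reaches the final results even though the vector search alone did not return it; this is the kind of gain hybrid search is meant to provide.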

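Additional information

The AstraDBVectorStore can be configured in many ways; see the API Reference for a full guide covering, for example, asynchronous initialization; non-Astra-DB databases; custom indexing allow/deny lists; manual hybrid-search control; and much more.

Explicit embedding initialization (method 1)

Instantiate our vector store using an explicit embedding class: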

Select embeddings model:

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_astradb import AstraDBVectorStore

vector_store_explicit_embeddings = AstraDBVectorStore(
    collection_name="astra_vector_langchain",
    embedding=embeddings,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace=ASTRA_DB_KEYSPACE,
)

API Reference: AstraDBVectorStore

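Server-side embedding initialization ("vectorize", method 2)

This example code assumes that you have:

Enabled the OpenAI integration in your Astra DB organization.
Added an API key named "OPENAI_API_KEY" to the integration, and scoped it to the database you are using.

For more details, including instructions on switching provider/model, please consult the documentation.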

from astrapy.info import VectorServiceOptions

openai_vectorize_options = VectorServiceOptions(
    provider="openai",
    model_name="text-embedding-3-small",
    authentication={
        "providerKey": "OPENAI_API_KEY",
    },
)

vector_store_integrated_embeddings = AstraDBVectorStore(
    collection_name="astra_vectorize_langchain",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace=ASTRA_DB_KEYSPACE,
    collection_vector_service_options=openai_vectorize_options,
)

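Autodetect initialization (method 3)

You can use this pattern if the collection already exists on the database and your AstraDBVectorStore needs to use it (for reads and writes). The LangChain component will inspect the collection and work out the relevant details.

This is the recommended approach if the collection has been created and, most importantly, populated by tools other than LangChain (for example, data ingested through the Astra DB web interface).

Autodetect mode cannot coexist with collection settings (such as the similarity metric); on the other hand, if no server-side embeddings are in use, you still need to pass an Embeddings object to the constructor.

In the following example code, we will "autodetect" the very same collection that was created by method 2 ("vectorize"). Hence, no Embeddings object needs to be supplied.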

vector_store_autodetected = AstraDBVectorStore(
    collection_name="astra_vectorize_langchain",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace=ASTRA_DB_KEYSPACE,
    autodetect_collection=True,
)

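Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

All interactions with the vector store are independent of the initialization method: if you wish to test one of the vector stores you created, adapt the following cell to select it.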

# If desired, uncomment a different line here:

# vector_store = vector_store_explicit_embeddings
vector_store = vector_store_integrated_embeddings
# vector_store = vector_store_autodetected

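Add items to vector store

Add documents to the vector store by using the add_documents method.

The "id" field can be supplied separately, in a matching ids=[...] parameter to add_documents, or left out entirely to let the vector store generate IDs.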

from langchain_core.documents import Document

documents_to_insert = [
    Document(
        page_content="ZYX, just another tool in the world, is actually my agent-based superhero",
        metadata={"source": "tweet"},
        id="entry_00",
    ),
    Document(
        page_content="I had chocolate chip pancakes and scrambled eggs "
        "for breakfast this morning.",
        metadata={"source": "tweet"},
        id="entry_01",
    ),
    Document(
        page_content="The weather forecast for tomorrow is cloudy and "
        "overcast, with a high of 62 degrees.",
        metadata={"source": "news"},
        id="entry_02",
    ),
    Document(
        page_content="Building an exciting new project with LangChain "
        "- come check it out!",
        metadata={"source": "tweet"},
        id="entry_03",
    ),
    Document(
        page_content="Robbers broke into the city bank and stole "
        "$1 million in cash.",
        metadata={"source": "news"},
        id="entry_04",
    ),
    Document(
        page_content="Thanks to her sophisticated language skills, the agent "
        "managed to extract strategic information all right.",
        metadata={"source": "tweet"},
        id="entry_05",
    ),
    Document(
        page_content="Is the new iPhone worth the price? Read this "
        "review to find out.",
        metadata={"source": "website"},
        id="entry_06",
    ),
    Document(
        page_content="The top 10 soccer players in the world right now.",
        metadata={"source": "website"},
        id="entry_07",
    ),
    Document(
        page_content="LangGraph is the best framework for building stateful, "
        "agentic applications!",
        metadata={"source": "tweet"},
        id="entry_08",
    ),
    Document(
        page_content="The stock market is down 500 points today due to "
        "fears of a recession.",
        metadata={"source": "news"},
        id="entry_09",
    ),
    Document(
        page_content="I have a bad feeling I am going to get deleted :(",
        metadata={"source": "tweet"},
        id="entry_10",
    ),
]


vector_store.add_documents(documents=documents_to_insert)

API Reference: Document

['entry_00',
 'entry_01',
 'entry_02',
 'entry_03',
 'entry_04',
 'entry_05',
 'entry_06',
 'entry_07',
 'entry_08',
 'entry_09',
 'entry_10']
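
As an alternative to setting id on each Document, the IDs can be passed through the matching ids=[...] parameter mentioned above. The sketch below wraps that call in a small helper; the helper name add_with_separate_ids and its length check are illustrative additions, not part of the library.

```python
def add_with_separate_ids(vector_store, documents, ids):
    """Insert documents, supplying their IDs through the ids=[...] parameter.

    Hypothetical convenience wrapper around vector_store.add_documents.
    """
    if len(documents) != len(ids):
        raise ValueError("Provide exactly one ID per document.")
    return vector_store.add_documents(documents=documents, ids=ids)


# Example usage (assumes `vector_store` was created as shown earlier and
# `docs_without_ids` is a list of two Document objects with no id set):
# add_with_separate_ids(vector_store, docs_without_ids, ["entry_11", "entry_12"])
```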

Delete items from vector store

Delete items by ID by using the delete function.

vector_store.delete(ids=["entry_10", "entry_02"])

True

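Query the vector store

Once the vector store is created and populated, you can query it (for example, as part of your chain or agent).

Query directly

Similarity search

Search for documents similar to a provided text, with additional metadata filters if desired: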

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=3,
    filter={"source": "tweet"},
)
for res in results:
    print(f'* "{res.page_content}", metadata={res.metadata}')

* "Building an exciting new project with LangChain - come check it out!", metadata={'source': 'tweet'}
* "LangGraph is the best framework for building stateful, agentic applications!", metadata={'source': 'tweet'}
* "Thanks to her sophisticated language skills, the agent managed to extract strategic information all right.", metadata={'source': 'tweet'}

Similarity search with score

You can also return the similarity score:

results = vector_store.similarity_search_with_score(
    "LangChain provides abstractions to make working with LLMs easy",
    k=3,
    filter={"source": "tweet"},
)
for res, score in results:
    print(f'* [SIM={score:.2f}] "{res.page_content}", metadata={res.metadata}')

* [SIM=0.71] "Building an exciting new project with LangChain - come check it out!", metadata={'source': 'tweet'}
* [SIM=0.70] "LangGraph is the best framework for building stateful, agentic applications!", metadata={'source': 'tweet'}
* [SIM=0.61] "Thanks to her sophisticated language skills, the agent managed to extract strategic information all right.", metadata={'source': 'tweet'}
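
Because the search returns plain (document, score) pairs, they can be post-filtered on the client. The sketch below keeps only the pairs at or above a minimum score; it assumes that a higher score means a closer match, as the descending output above suggests, and the helper name keep_above is hypothetical. Strings stand in for Document objects to keep the example self-contained.

```python
def keep_above(results, min_score):
    """Keep only the (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Stand-in data shaped like the output of similarity_search_with_score.
sample_results = [
    ("Building an exciting new project with LangChain - come check it out!", 0.71),
    ("LangGraph is the best framework for building stateful, agentic applications!", 0.70),
    ("Thanks to her sophisticated language skills, the agent managed to ...", 0.61),
]

strong_matches = keep_above(sample_results, min_score=0.65)
print(len(strong_matches))
```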

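Specify a different keyword query (requires hybrid search)

Note: this cell can be run only if the collection supports the find-and-rerank command and the vector store is aware of this fact.

If the vector store is using a hybrid-enabled collection and has detected this fact, it will use that capability by default when running searches.

In that case, the same query text is used for both the vector-similarity and the lexical-based retrieval steps of the find-and-rerank process, unless you explicitly provide a different query for the latter: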

results = vector_store_autodetected.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=3,
    filter={"source": "tweet"},
    lexical_query="agent",
)
for res in results:
    print(f'* "{res.page_content}", metadata={res.metadata}')

* "Building an exciting new project with LangChain - come check it out!", metadata={'source': 'tweet'}
* "LangGraph is the best framework for building stateful, agentic applications!", metadata={'source': 'tweet'}
* "ZYX, just another tool in the world, is actually my agent-based superhero", metadata={'source': 'tweet'}

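The example above hardcodes the "autodetected" vector store, which has certainly inspected the collection and determined whether hybrid search is available. Another option is to supply hybrid-search parameters explicitly to the constructor (refer to the API reference for more details and examples).

Other search methods

A variety of other search methods, such as MMR search and search by vector, are not covered in this notebook.

For a full list of the search modes available in AstraDBVectorStore, check the API reference.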

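Query by turning into retriever

You can also turn the vector store into a retriever for easier use in your chains.

Transform the vector store into a retriever and invoke it with a simple query plus a metadata filter: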

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

[Document(id='entry_04', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]
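
If the same metadata filter should apply to every call, one option is to put it into search_kwargs when building the retriever rather than passing it at invocation time. The sketch below only assembles that dictionary; the helper name threshold_search_kwargs is hypothetical, and the commented usage assumes the retriever forwards a "filter" entry in search_kwargs to the underlying search, as LangChain vector store retrievers generally do.

```python
def threshold_search_kwargs(k, score_threshold, metadata_filter=None):
    """Build search_kwargs for a 'similarity_score_threshold' retriever."""
    kwargs = {"k": k, "score_threshold": score_threshold}
    if metadata_filter:
        # Copy the filter so later changes by the caller do not leak in.
        kwargs["filter"] = dict(metadata_filter)
    return kwargs


news_kwargs = threshold_search_kwargs(1, 0.5, {"source": "news"})
print(news_kwargs)

# Example usage (assumes `vector_store` was created as shown earlier):
# retriever = vector_store.as_retriever(
#     search_type="similarity_score_threshold",
#     search_kwargs=news_kwargs,
# )
# retriever.invoke("Stealing from the bank is a crime")
```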

Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the following sections:

For more, check out a complete RAG template using Astra DB here.
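
As a rough illustration of the retrieval step in RAG, the sketch below assembles a prompt from retrieved passages. It is a minimal sketch, not the template referenced above: the helper name build_rag_prompt and the prompt wording are made up, and the resulting string would still have to be sent to a chat model of your choice.

```python
def build_rag_prompt(question, contexts):
    """Join retrieved passages and the user question into a single prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )


# Example usage (assumes `retriever` was created as shown earlier):
# docs = retriever.invoke("What happened at the bank?")
# prompt = build_rag_prompt(
#     "What happened at the bank?",
#     [doc.page_content for doc in docs],
# )
```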

Cleanup vector store

If you want to completely delete the collection from your Astra DB instance, run this.
(You will lose the data you stored in it.)

vector_store.delete_collection()

API reference

For detailed documentation of all AstraDBVectorStore features and configurations, consult the API reference.

Vector store conceptual guide
Vector store how-to guides
