Couchbase

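Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for all of your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI with coding assistance for developers and vector search for their applications.

Vector Search is part of the Full Text Search Service (Search Service) in Couchbase.

This tutorial explains how to use Vector Search in Couchbase. You can work with either Couchbase Capella or your self-managed Couchbase Server.

Setup

To access CouchbaseSearchVectorStore, you first need to install the langchain-couchbase partner package: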

pip install -qU langchain-couchbase

Credentials

Head over to the Couchbase website and create a new connection, making sure to save your database username and password:

import getpass

COUCHBASE_CONNECTION_STRING = getpass.getpass(
    "Enter the connection string for the Couchbase cluster: "
)
DB_USERNAME = getpass.getpass("Enter the username for the Couchbase cluster: ")
DB_PASSWORD = getpass.getpass("Enter the password for the Couchbase cluster: ")
Enter the connection string for the Couchbase cluster:  ········
Enter the username for the Couchbase cluster: ········
Enter the password for the Couchbase cluster: ········

If you want best-in-class automated tracing of your model calls, you can also set your LangSmith API key by uncommenting the lines below:

# import os
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

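Initialization

Before instantiating the vector store, we need to create a connection.

Create Couchbase Connection Object

We first create a connection to the Couchbase cluster and then pass the cluster object to the vector store.

In this example, we connect using the username and password from above. You can also connect to your cluster in any other supported way.

For more information on connecting to a Couchbase cluster, see the documentation.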

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)

# Wait until the cluster is ready for use.
cluster.wait_until_ready(timedelta(seconds=5))

Next, we set the names of the bucket, scope, and collection in the Couchbase cluster that we want to use for Vector Search.

In this example, we use the default scope and collection.

BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

For details on how to create a Search index with support for vector fields, see the documentation.
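As a rough orientation, a Search index definition with a vector field might look like the sketch below, written as a Python dictionary that can be dumped to JSON and imported into the Search index editor. This is an assumption-laden sketch, not the definitive definition: the field names text, embedding, and metadata follow this tutorial's defaults, the 3072 dimensions assume the default output size of text-embedding-3-large, and the dot_product similarity is only one possible choice. Check the Couchbase documentation for the authoritative set of properties.

```python
import json

# Names taken from this tutorial's configuration.
BUCKET_NAME = "langchain_bucket"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "langchain-test-index"

# Assumption: text-embedding-3-large returns 3072-dimensional vectors by default.
EMBEDDING_DIMS = 3072

# Sketch of a Search index definition with one vector field ("embedding"),
# one stored text field ("text"), and a dynamically mapped "metadata" object.
index_definition = {
    "name": SEARCH_INDEX_NAME,
    "type": "fulltext-index",
    "sourceType": "gocbcore",
    "sourceName": BUCKET_NAME,
    "params": {
        "doc_config": {"mode": "scope.collection.type_field", "type_field": "type"},
        "mapping": {
            "default_mapping": {"dynamic": True, "enabled": False},
            "index_dynamic": True,
            "store_dynamic": True,
            "types": {
                f"{SCOPE_NAME}.{COLLECTION_NAME}": {
                    "dynamic": True,
                    "enabled": True,
                    "properties": {
                        "embedding": {
                            "enabled": True,
                            "fields": [
                                {
                                    "name": "embedding",
                                    "type": "vector",
                                    "dims": EMBEDDING_DIMS,
                                    "similarity": "dot_product",
                                    "index": True,
                                }
                            ],
                        },
                        "text": {
                            "enabled": True,
                            "fields": [
                                {"name": "text", "type": "text", "index": True, "store": True}
                            ],
                        },
                        "metadata": {"dynamic": True, "enabled": True},
                    },
                }
            },
        },
    },
}

# Print the definition as JSON so it can be reviewed before importing it.
print(json.dumps(index_definition, indent=2))
```

The dims value has to match the embedding model in use, so it should be changed whenever a different model is chosen.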

Simple Instantiation

Below, we create the vector store object with the cluster information and the search index name.

pip install -qU langchain-openai

import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

from langchain_couchbase.vectorstores import CouchbaseSearchVectorStore

vector_store = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
)

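Specify the Text & Embeddings Field

You can optionally specify the text and embedding fields of the document using the text_key and embedding_key parameters.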

vector_store_specific = CouchbaseSearchVectorStore(
    cluster=cluster,
    bucket_name=BUCKET_NAME,
    scope_name=SCOPE_NAME,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
    index_name=SEARCH_INDEX_NAME,
    text_key="text",
    embedding_key="embedding",
)

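Manage vector store

Once you have created your vector store, you can interact with it by adding and deleting items.

Add items to vector store

We can add items to our vector store by using the add_documents function.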

from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)
['f125b836-f555-4449-98dc-cbda4e77ae3f',
'a28fccde-fd32-4775-9ca8-6cdb22ca7031',
'b1037c4b-947f-497f-84db-63a4def5080b',
'c7082b74-b385-4c4b-bbe5-0740909c01db',
'a7e31f62-13a5-4109-b881-8631aff7d46c',
'9fcc2894-fdb1-41bd-9a93-8547747650f4',
'a5b0632d-abaf-4802-99b3-df6b6c99be29',
'0475592e-4b7f-425d-91fd-ac2459d48a36',
'94c6db4e-ba07-43ff-aa96-3a5d577db43a',
'd21c7feb-ad47-4e7d-84c5-785afb189160']

Delete items from vector store

vector_store.delete(ids=[uuids[-1]])
True

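Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

Query directly

Similarity search

A simple similarity search can be performed as follows: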

results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
* Building an exciting new project with LangChain - come check it out! [{'source': 'tweet'}]
* LangGraph is the best framework for building stateful, agentic applications! [{'source': 'tweet'}]

Similarity search with score

You can also fetch the scores for the results by calling the similarity_search_with_score method.

results = vector_store.similarity_search_with_score("Will it be hot tomorrow?", k=1)
for res, score in results:
    print(f"* [SIM={score:3f}] {res.page_content} [{res.metadata}]")
* [SIM=0.553112] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}]

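Filtering Results

You can filter the search results by specifying any filter on the text or metadata of the document that is supported by the Couchbase Search service.

The filter can be any valid SearchQuery supported by the Couchbase Python SDK. These filters are applied before the vector search is performed.

To filter on a field inside the metadata, refer to it with dot notation (.).

For example, to refer to the source field in the metadata, specify metadata.source.

Note that the filter needs to be supported by the Search index.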

from couchbase import search

query = "Are there any concerning financial news?"
filter_on_source = search.MatchQuery("news", field="metadata.source")
results = vector_store.similarity_search_with_score(
    query, fields=["metadata.source"], filter=filter_on_source, k=5
)
for res, score in results:
    print(f"* {res.page_content} [{res.metadata}] {score}")
* The stock market is down 500 points today due to fears of a recession. [{'source': 'news'}] 0.3873019218444824
* Robbers broke into the city bank and stole $1 million in cash. [{'source': 'news'}] 0.20637212693691254
* The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees. [{'source': 'news'}] 0.10404900461435318
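To make the dot notation concrete, the sketch below resolves a path such as metadata.source against a nested document. The stored_doc layout (text, embedding, and metadata keys) is an assumption based on this tutorial's default text_key and embedding_key, and resolve_path is a hypothetical helper written for illustration; it is not part of langchain-couchbase or the Couchbase SDK.

```python
from functools import reduce

# Hypothetical stored document; key names follow this tutorial's defaults.
stored_doc = {
    "text": "The stock market is down 500 points today due to fears of a recession.",
    "embedding": [0.01, -0.02, 0.03],  # shortened placeholder vector
    "metadata": {"source": "news"},
}


def resolve_path(doc, dotted_path):
    """Walk a nested dict along a dot-separated path such as 'metadata.source'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), doc)


print(resolve_path(stored_doc, "metadata.source"))  # news
```

The same path string is what the field argument of a filter query and the entries of the fields parameter refer to.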

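Specifying Fields to Return

You can specify the fields to return from the document using the fields parameter in the searches. These fields are returned as part of the metadata object in the returned Document. You can fetch any field that is stored in the Search index. The text_key of the document is returned as the Document's page_content.

If you do not specify any fields to fetch, all fields stored in the index are returned.

To fetch a field inside the metadata, refer to it with dot notation (.).

For example, to fetch the source field in the metadata, specify metadata.source.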

query = "What did I eat for breakfast today?"
results = vector_store.similarity_search(query, fields=["metadata.source"])
print(results[0])
page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.' metadata={'source': 'tweet'}

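Query by turning into retriever

You can also transform the vector store into a retriever for easier usage in your chains.

Here is how to transform your vector store into a retriever and then invoke the retriever with a simple query and filter.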

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
filter_on_source = search.MatchQuery("news", field="metadata.source")
retriever.invoke("Stealing from the bank is a crime", filter=filter_on_source)
[Document(id='c7082b74-b385-4c4b-bbe5-0740909c01db', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]

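Hybrid Queries

Couchbase allows you to do hybrid searches by combining vector search results with searches on non-vector fields of the document, such as the metadata object.

The results are based on the combination of the results from both the vector search and the searches supported by the Search Service. The scores of the component searches are added up to get the total score of each result.

To perform hybrid searches, pass the optional search_options parameter, which is accepted by all the similarity searches.

The different search/query possibilities for search_options are described in the Couchbase Search documentation.

Create Diverse Metadata for Hybrid Search

To simulate hybrid search, let us create some varied metadata for the documents loaded from an existing text file. We add three fields to the metadata in a uniform, repeating pattern: date from 2010 to 2019, rating from 1 to 5, and author set to either John Doe or Jane Doe.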

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Adding metadata to documents
for i, doc in enumerate(docs):
    doc.metadata["date"] = f"{range(2010, 2020)[i % 10]}-01-01"
    doc.metadata["rating"] = range(1, 6)[i % 5]
    doc.metadata["author"] = ["John Doe", "Jane Doe"][i % 2]

vector_store.add_documents(docs)

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(query)
print(results[0].metadata)
{'author': 'John Doe', 'date': '2016-01-01', 'rating': 2, 'source': '../../how_to/state_of_the_union.txt'}
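The metadata pattern is deterministic, which makes the later range queries easier to reason about. The sketch below reproduces the three expressions from the loop above in a standalone helper (synthetic_metadata is a name introduced here for illustration) and prints the values for a few chunk positions.

```python
def synthetic_metadata(i):
    """Return the metadata the tutorial loop assigns to the chunk at position i."""
    return {
        "date": f"{range(2010, 2020)[i % 10]}-01-01",  # cycles 2010..2019
        "rating": range(1, 6)[i % 5],                  # cycles 1..5
        "author": ["John Doe", "Jane Doe"][i % 2],     # alternates
    }


for i in (0, 1, 6, 9, 10):
    print(i, synthetic_metadata(i))
```

Position 6 yields date 2016-01-01, rating 2, and author John Doe, which is consistent with the metadata printed above. The whole pattern repeats every 10 chunks.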

Query by Exact Value

We can search for exact matches on a textual field, such as the author in the metadata object.

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={"query": {"field": "metadata.author", "match": "John Doe"}},
    fields=["metadata.author"],
)
print(results[0])
page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.' metadata={'author': 'John Doe'}

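Query by Partial Match

We can search for partial matches by specifying a fuzziness for the search. This is useful when you want to match slight variations or misspellings of a search term.

Here, "Jae" is close (fuzziness of 1) to "Jane".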

query = "What did the president say about Ketanji Brown Jackson"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {"field": "metadata.author", "match": "Jae", "fuzziness": 1}
    },
    fields=["metadata.author"],
)
print(results[0])
page_content='A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. 

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.' metadata={'author': 'Jane Doe'}
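Fuzziness in the Search service is expressed as an edit (Levenshtein) distance between terms. The sketch below is a plain-Python edit-distance function, written only to show why "Jae" matches "Jane" at fuzziness 1 but would not match "John". It is not the Search service's implementation, and the lower-casing of terms is an assumption about how the index analyzer treats the field.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Assumption: terms are compared in lower case after analysis.
print(levenshtein("jae", "jane"))  # 1: one inserted character
print(levenshtein("jae", "john"))  # 3: outside a fuzziness of 1
```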

Query by Date Range

We can search for documents that fall within a date range on a date field such as metadata.date.

query = "Any mention about independence?"
results = vector_store.similarity_search(
    query,
    search_options={
        "query": {
            "start": "2016-12-31",
            "end": "2017-01-02",
            "inclusive_start": True,
            "inclusive_end": False,
            "field": "metadata.date",
        }
    },
)
print(results[0])
page_content='And with 75% of adult Americans fully vaccinated and hospitalizations down by 77%, most Americans can remove their masks, return to work, stay in the classroom, and move forward safely. 

We achieved this because we provided free vaccines, treatments, tests, and masks.

Of course, continuing this costs money.

I will soon send Congress a request.

The vast majority of Americans have used these tools and may want to again, so I expect Congress to pass it quickly.' metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}
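The inclusive_start and inclusive_end flags decide whether the boundary dates themselves count as matches. The sketch below models that boundary logic locally with the standard library; in_date_range is a hypothetical helper for illustration, and the actual evaluation happens inside the Search service.

```python
from datetime import date


def in_date_range(value, start, end, inclusive_start=True, inclusive_end=False):
    """Check an ISO date string against a range with configurable boundaries."""
    v, s, e = (date.fromisoformat(x) for x in (value, start, end))
    lower_ok = v >= s if inclusive_start else v > s
    upper_ok = v <= e if inclusive_end else v < e
    return lower_ok and upper_ok


start, end = "2016-12-31", "2017-01-02"
print(in_date_range("2017-01-01", start, end))                      # True
print(in_date_range("2017-01-02", start, end))                      # False: end is exclusive
print(in_date_range("2017-01-02", start, end, inclusive_end=True))  # True
```

With the synthetic metadata used in this tutorial, only documents dated 2017-01-01 fall inside the range from the query above.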

Query by Numeric Range

We can search for documents that fall within a range on a numeric field such as metadata.rating.

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "min": 3,
            "max": 5,
            "inclusive_min": True,
            "inclusive_max": True,
            "field": "metadata.rating",
        }
    },
)
print(results[0])
(Document(id='3a90405c0f5b4c09a6646259678f1f61', metadata={'author': 'John Doe', 'date': '2014-01-01', 'rating': 5, 'source': '../../how_to/state_of_the_union.txt'}, page_content='In this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded liberty, defeated totalitarianism and terror. \n\nAnd built the strongest, freest, and most prosperous nation the world has ever known. \n\nNow is the hour. \n\nOur moment of responsibility. \n\nOur test of resolve and conscience, of history itself.'), 0.3573387440020518)

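Combining Multiple Search Queries

Different search queries can be combined using AND (conjuncts) or OR (disjuncts) operators.

In this example, we look for documents with a rating between 3 and 4 and a date between 2016-12-31 and 2017-01-02.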

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "conjuncts": [
                {"min": 3, "max": 4, "inclusive_max": True, "field": "metadata.rating"},
                {"start": "2016-12-31", "end": "2017-01-02", "field": "metadata.date"},
            ]
        }
    },
)
print(results[0])
(Document(id='7115a704877a46ad94d661dd9c81cbc3', metadata={'author': 'Jane Doe', 'date': '2017-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}, page_content='And with 75% of adult Americans fully vaccinated and hospitalizations down by 77%, most Americans can remove their masks, return to work, stay in the classroom, and move forward safely. \n\nWe achieved this because we provided free vaccines, treatments, tests, and masks. \n\nOf course, continuing this costs money. \n\nI will soon send Congress a request. \n\nThe vast majority of Americans have used these tools and may want to again, so I expect Congress to pass it quickly.'), 0.6898253780130769)
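Because search_options is a plain dictionary, nested queries can be assembled with small helper functions. The sketch below builds the OR (disjuncts) counterpart of the example above. The helper names (numeric_range, date_range, all_of, any_of) are hypothetical and introduced only for illustration; the dictionary keys are the ones already used in this tutorial, plus disjuncts.

```python
def numeric_range(field, lower, upper, inclusive_min=True, inclusive_max=False):
    """Build a numeric range query dictionary for one field."""
    return {
        "field": field,
        "min": lower,
        "max": upper,
        "inclusive_min": inclusive_min,
        "inclusive_max": inclusive_max,
    }


def date_range(field, start, end, inclusive_start=True, inclusive_end=False):
    """Build a date range query dictionary for one field."""
    return {
        "field": field,
        "start": start,
        "end": end,
        "inclusive_start": inclusive_start,
        "inclusive_end": inclusive_end,
    }


def all_of(*queries):
    """AND: every sub-query has to match."""
    return {"conjuncts": list(queries)}


def any_of(*queries):
    """OR: at least one sub-query has to match."""
    return {"disjuncts": list(queries)}


search_options = {
    "query": any_of(
        numeric_range("metadata.rating", 3, 4, inclusive_max=True),
        date_range("metadata.date", "2016-12-31", "2017-01-02"),
    )
}
print(search_options)
```

The resulting dictionary can be passed as the search_options argument of similarity_search or similarity_search_with_score in the same way as the handwritten dictionaries above.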

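Note

Hybrid search results might contain documents that do not satisfy all the search parameters. This is due to the way the scoring works. The score is the sum of the vector search score and the scores of the queries in the hybrid search. If the vector search score is high, the combined score can exceed that of results that match all the queries in the hybrid search. To avoid such results, use the filter parameter instead of hybrid search.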
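A small numeric illustration of this effect follows. The scores are hypothetical and the additive model is deliberately simplified; it is meant only to show why a strong vector match can outrank documents that satisfy the hybrid query, and how pre-filtering changes the outcome.

```python
# Hypothetical scores, chosen only to illustrate additive hybrid scoring.
candidates = [
    {"id": "A", "vector_score": 0.90, "query_score": 0.00},  # fails the extra query
    {"id": "B", "vector_score": 0.35, "query_score": 0.40},  # matches the extra query
    {"id": "C", "vector_score": 0.20, "query_score": 0.40},  # matches the extra query
]


def hybrid_rank(docs):
    """Rank by the sum of the vector score and the query score."""
    return sorted(docs, key=lambda d: d["vector_score"] + d["query_score"], reverse=True)


def prefiltered_rank(docs):
    """Drop non-matching documents first, then rank by vector score alone."""
    matching = [d for d in docs if d["query_score"] > 0]
    return sorted(matching, key=lambda d: d["vector_score"], reverse=True)


print([d["id"] for d in hybrid_rank(candidates)])       # ['A', 'B', 'C']
print([d["id"] for d in prefiltered_rank(candidates)])  # ['B', 'C']
```

Document A tops the hybrid ranking even though it does not match the extra query, while the pre-filtered ranking excludes it entirely.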

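Combining Hybrid Search Query with Filters

Hybrid search can be combined with filters to get the best of both: hybrid scoring plus results that are guaranteed to satisfy the filter.

In this example, we look for documents with a rating between 3 and 5 whose text field matches the string "independence".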

filter_text = search.MatchQuery("independence", field="text")

query = "Any mention about independence?"
results = vector_store.similarity_search_with_score(
    query,
    search_options={
        "query": {
            "min": 3,
            "max": 5,
            "inclusive_min": True,
            "inclusive_max": True,
            "field": "metadata.rating",
        }
    },
    filter=filter_text,
)

print(results[0])
(Document(id='23bb51b4e4d54a94ab0a95e72be8428c', metadata={'author': 'John Doe', 'date': '2012-01-01', 'rating': 3, 'source': '../../how_to/state_of_the_union.txt'}, page_content='And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them.  \n\nPutin has unleashed violence and chaos.  But while he may make gains on the battlefield – he will pay a continuing high price over the long run. \n\nAnd a proud Ukrainian people, who have known 30 years  of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards.'), 0.30549919644400614)

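Other Queries

Similarly, you can use any of the supported query methods, such as Geo Distance, Polygon Search, Wildcard, and Regular Expressions, in the search_options parameter. For more details on the available query methods and their syntax, see the documentation.

Usage for retrieval-augmented generation

For guides on how to use this vector store for retrieval-augmented generation (RAG), see the RAG tutorials and how-to guides in the LangChain documentation.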

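Frequently Asked Questions

Question: Should I create the Search index before creating the CouchbaseSearchVectorStore object?

Yes, currently you need to create the Search index before creating the CouchbaseSearchVectorStore object.

Question: I am not seeing all the fields that I specified in my search results.

In Couchbase, we can only return the fields stored in the Search index. Please make sure that the field you are trying to access in the search results is part of the Search index.

One way to handle this is to index and store a document's fields dynamically in the index.

  • In Capella, go to "Advanced Mode"; then, under the "General Settings" dropdown, check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".
  • In Couchbase Server, in the Index Editor (not the Quick Editor), under the "Advanced" dropdown, check "[X] Store Dynamic Fields" or "[X] Index Dynamic Fields".

Note that these options increase the size of the index.

For more details on dynamic mappings, see the documentation.

Question: I am unable to see the metadata object in my search results.

This is most likely because the metadata field in the document is not indexed and/or stored by the Couchbase Search index. To index the metadata field of the document, add it to the index as a child mapping.

If you choose to map all the fields in the mapping, you can search by all metadata fields. Alternatively, to optimize the index, you can select specific fields inside the metadata object to be indexed. See the documentation to learn more about indexing child mappings.

Creating Child Mappings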

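Question: What is the difference between filter and search_options / hybrid queries?

Filters are pre-filters that restrict the documents searched in a Search index. They are available in Couchbase Server 7.6.4 and higher.

Hybrid queries are additional search queries that can be used to tune the results returned from the Search index.

Filters and hybrid search queries have the same capabilities, with slightly different syntax: filters are SearchQuery objects, while hybrid search queries are dictionaries.

API reference

For detailed documentation of all CouchbaseSearchVectorStore features and configurations, head to the API reference.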