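Tencent Cloud VectorDB
Tencent Cloud VectorDB is a fully managed, self-developed distributed database service designed for storing, retrieving, and analyzing multi-dimensional vector data, aimed at enterprise applications. It supports multiple index types and similarity calculation methods. A single index can hold up to 1 billion vectors and can sustain tens of millions of QPS with millisecond-level query latency. Tencent Cloud VectorDB can serve as an external knowledge base for large models, improving the accuracy of their answers, and it is also widely applicable in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.
This notebook shows how to use the functionality related to Tencent Cloud VectorDB.
To run it, you need a database instance.
Basic Usage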
!pip3 install tcvectordb langchain-community
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.fake import FakeEmbeddings
from langchain_community.vectorstores import TencentVectorDB
from langchain_community.vectorstores.tencentvectordb import ConnectionParams
from langchain_text_splitters import CharacterTextSplitter
Load the documents and split them into chunks.
loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
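To make the chunking step more concrete, here is a minimal pure-Python sketch of separator-based splitting with no overlap. It is a simplification for illustration, not the actual `CharacterTextSplitter` implementation; the helper name `split_text` and the `"\n\n"` default separator are assumptions made for this example.

```python
def split_text(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks.

    Chunks stay within chunk_size characters, except when a single piece
    is longer than chunk_size; such a piece becomes a chunk of its own.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is an oversized piece starting a new chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\n\nbeta\n\ngamma delta"
chunks = split_text(sample, chunk_size=12)
print(chunks)
```

With `chunk_size=12`, the first two paragraphs fit into one chunk and the third starts a new one.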
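We support two ways to embed documents:
- Use any embeddings model compatible with LangChain Embeddings.
- Specify the name of an embedding model built into Tencent VectorDB. The available models are:
  - bge-base-zh, dimension: 768
  - m3e-base, dimension: 768
  - text2vec-large-chinese, dimension: 1024
  - e5-large-v2, dimension: 1024
  - multilingual-e5-base, dimension: 768
The following code shows both ways to embed documents; comment out one of them to use the other: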
## you can use a Langchain Embeddings model, like OpenAIEmbeddings:
# from langchain_community.embeddings.openai import OpenAIEmbeddings
#
# embeddings = OpenAIEmbeddings()
# t_vdb_embedding = None
## Or you can use a Tencent Embedding model, like `bge-base-zh`:
t_vdb_embedding = "bge-base-zh" # bge-base-zh is the default model
embeddings = None
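When choosing one of the built-in models, it can help to keep the vector dimensions listed above in one place. The lookup below is a small sketch; the names `TENCENT_EMBEDDING_DIMENSIONS` and `embedding_dimension` are hypothetical helpers for this example, not part of `tcvectordb` or LangChain.

```python
# Dimensions as listed above for the built-in embedding models.
TENCENT_EMBEDDING_DIMENSIONS = {
    "bge-base-zh": 768,
    "m3e-base": 768,
    "text2vec-large-chinese": 1024,
    "e5-large-v2": 1024,
    "multilingual-e5-base": 768,
}


def embedding_dimension(model_name: str) -> int:
    """Return the vector dimension for a listed model, or raise ValueError."""
    try:
        return TENCENT_EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        known = ", ".join(sorted(TENCENT_EMBEDDING_DIMENSIONS))
        raise ValueError(
            f"Unknown model {model_name!r}; expected one of: {known}"
        ) from None


dimension = embedding_dimension("bge-base-zh")
print(dimension)
```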
Now we can create a TencentVectorDB instance. You must provide at least one of the `embeddings` or `t_vdb_embedding` parameters. If both are provided, the `embeddings` parameter is used:
conn_params = ConnectionParams(
    url="http://10.0.X.X",
    key="eC4bLRy2va******************************",
    username="root",
    timeout=20,
)
vector_db = TencentVectorDB.from_documents(
    docs, embeddings, connection_params=conn_params, t_vdb_embedding=t_vdb_embedding
)
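The precedence rule between `embeddings` and `t_vdb_embedding` can be sketched as a small function. This only illustrates the rule as stated in the text, not the library's internal code; `resolve_embedding` and the `"langchain"` / `"tencent"` labels are hypothetical names chosen for this example.

```python
from typing import Any, Optional, Tuple


def resolve_embedding(
    embeddings: Optional[Any], t_vdb_embedding: Optional[str]
) -> Tuple[str, Any]:
    """Pick the embedding source, preferring `embeddings` when both are given."""
    if embeddings is not None:
        # A LangChain-compatible embeddings object takes priority.
        return ("langchain", embeddings)
    if t_vdb_embedding is not None:
        # Otherwise fall back to the named Tencent embedding model.
        return ("tencent", t_vdb_embedding)
    raise ValueError("Provide at least one of `embeddings` or `t_vdb_embedding`.")


choice = resolve_embedding(None, "bge-base-zh")
print(choice)
```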
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
docs[0].page_content
'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'
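Conceptually, a similarity search ranks stored vectors by how close they are to the query vector. The sketch below shows a brute-force version using cosine similarity in plain Python. It is an illustration of the idea only: the vectors are made up, `top_k` is a hypothetical helper, and a production vector database would normally rely on an index rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], vectors, k=2)
print(nearest)
```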
vector_db = TencentVectorDB(embeddings, conn_params)
vector_db.add_texts(["Ankush went to Princeton"])
query = "Where did Ankush go to college?"
docs = vector_db.max_marginal_relevance_search(query)
docs[0].page_content
'Ankush went to Princeton'
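Maximal marginal relevance (MMR) search trades relevance to the query against redundancy among the results already picked. The following is a minimal pure-Python sketch of the general MMR selection loop, assuming cosine similarity and made-up vectors; the function `mmr` and its `lambda_mult` parameter are named for this illustration and do not reproduce the code behind `max_marginal_relevance_search`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Select up to k candidate indices, balancing relevance and diversity."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already selected (0.0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
picked = mmr([1.0, 0.0], candidates, k=2, lambda_mult=0.3)
print(picked)
```

With a low `lambda_mult`, the near-duplicate second vector is skipped in favor of the more distinct third one, whereas a plain nearest-neighbor ranking would return the first two.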
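Metadata and filtering
Tencent VectorDB supports metadata and filtering. You can attach metadata to documents and filter search results based on that metadata.
Next, we will create a new TencentVectorDB collection with metadata and then demonstrate how to filter search results based on the metadata: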
from langchain_community.vectorstores.tencentvectordb import (
    META_FIELD_TYPE_STRING,
    META_FIELD_TYPE_UINT64,
    ConnectionParams,
    MetaField,
    TencentVectorDB,
)
from langchain_core.documents import Document
meta_fields = [
    MetaField(name="year", data_type=META_FIELD_TYPE_UINT64, index=True),
    MetaField(name="rating", data_type=META_FIELD_TYPE_STRING, index=False),
    MetaField(name="genre", data_type=META_FIELD_TYPE_STRING, index=True),
    MetaField(name="director", data_type=META_FIELD_TYPE_STRING, index=True),
]
docs = [
    Document(
        page_content="The Shawshank Redemption is a 1994 American drama film written and directed by Frank Darabont.",
        metadata={
            "year": 1994,
            "rating": "9.3",
            "genre": "drama",
            "director": "Frank Darabont",
        },
    ),
    Document(
        page_content="The Godfather is a 1972 American crime film directed by Francis Ford Coppola.",
        metadata={
            "year": 1972,
            "rating": "9.2",
            "genre": "crime",
            "director": "Francis Ford Coppola",
        },
    ),
    Document(
        page_content="The Dark Knight is a 2008 superhero film directed by Christopher Nolan.",
        metadata={
            "year": 2008,
            "rating": "9.0",
            "genre": "superhero",
            "director": "Christopher Nolan",
        },
    ),
    Document(
        page_content="Inception is a 2010 science fiction action film written and directed by Christopher Nolan.",
        metadata={
            "year": 2010,
            "rating": "8.8",
            "genre": "science fiction",
            "director": "Christopher Nolan",
        },
    ),
]
vector_db = TencentVectorDB.from_documents(
    docs,
    None,
    connection_params=ConnectionParams(
        url="http://10.0.X.X",
        key="eC4bLRy2va******************************",
        username="root",
        timeout=20,
    ),
    collection_name="movies",
    meta_fields=meta_fields,
)
query = "film about dream by Christopher Nolan"
# You can use the Tencent VectorDB filtering syntax with the `expr` parameter:
result = vector_db.similarity_search(query, expr='director="Christopher Nolan"')
# Alternatively, you can use the LangChain filtering syntax with the `filter` parameter:
# result = vector_db.similarity_search(query, filter='eq("director", "Christopher Nolan")')
result
[Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='The Dark Knight is a 2008 superhero film directed by Christopher Nolan.', metadata={'year': 2008, 'rating': '9.0', 'genre': 'superhero', 'director': 'Christopher Nolan'}),
Document(page_content='Inception is a 2010 science fiction action film written and directed by Christopher Nolan.', metadata={'year': 2010, 'rating': '8.8', 'genre': 'science fiction', 'director': 'Christopher Nolan'})]
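The effect of an equality filter such as `director="Christopher Nolan"` can be pictured as keeping only the records whose metadata matches before ranking them. The sketch below applies that idea to plain dictionaries built from the movie metadata above. It is an illustration under that assumption only: `filter_by_metadata` is a hypothetical helper, and how the database actually evaluates `expr` is not shown here.

```python
movies = [
    {"title": "The Shawshank Redemption", "year": 1994, "director": "Frank Darabont"},
    {"title": "The Godfather", "year": 1972, "director": "Francis Ford Coppola"},
    {"title": "The Dark Knight", "year": 2008, "director": "Christopher Nolan"},
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
]


def filter_by_metadata(records: list[dict], **conditions) -> list[dict]:
    """Keep records whose fields equal every given condition."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in conditions.items())
    ]


nolan = filter_by_metadata(movies, director="Christopher Nolan")
print([movie["title"] for movie in nolan])
```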