Azure Cosmos DB NoSQL
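This notebook shows how to use this integrated vector database to store documents in collections, create indexes, and run vector search queries that find documents close to a query vector. The search is an approximate nearest neighbor search, and closeness is measured with a distance metric such as COS (cosine distance), L2 (Euclidean distance), or IP (inner product).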
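To make the three distance metrics named above concrete, here is a minimal pure-Python sketch of their textbook definitions. The function names and the toy two-dimensional vectors are illustrative assumptions, not part of any Azure or LangChain API, and how a given service scores and orders results may differ from these definitions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (IP); larger values mean more similar vectors."""
    return dot(a, b)


# Toy vectors: doc_a points the same way as the query, doc_b is orthogonal.
query = [1.0, 0.0]
doc_a = [2.0, 0.0]
doc_b = [0.0, 3.0]

for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(
        name,
        cosine_distance(query, doc),
        euclidean_distance(query, doc),
        inner_product(query, doc),
    )
```

Cosine distance ignores vector length, so `doc_a` counts as identical in direction to the query even though it is twice as long, while Euclidean distance and inner product are both sensitive to magnitude.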
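Azure Cosmos DB is the database that powers OpenAI's ChatGPT service. It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale.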
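Azure Cosmos DB for NoSQL now offers vector indexing and search in preview. This feature is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can now store vectors directly in your documents alongside your data. Each document in the database can therefore contain not only traditional schema-free data but also high-dimensional vectors as additional properties of the document. Because vectors are stored in the same logical unit as the data they represent, they can be indexed and searched efficiently. This colocation simplifies data management and AI application architecture, and makes vector-based operations more efficient.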
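As a rough illustration of storing a vector next to the data it represents, the sketch below builds one JSON-serializable document. The helper `make_item` is hypothetical; the property names `id`, `text`, and `embedding` are assumptions chosen to mirror the `/id`, `/text`, and `/embedding` paths used in the policies later in this notebook, and the three-element vector is a toy stand-in for a real 1536-dimensional embedding.

```python
import json


def make_item(item_id, text, embedding, metadata=None):
    """Build one document that keeps text, metadata, and vector together."""
    return {
        "id": item_id,                                # matches the /id partition key path
        "text": text,                                 # matches the /text full-text path
        "metadata": metadata or {},                   # ordinary schema-free properties
        "embedding": [float(x) for x in embedding],   # matches the /embedding vector path
    }


item = make_item(
    "doc-1",
    "GPT-4 Technical Report",
    [0.12, -0.03, 0.88],  # toy vector; a real embedding would be much longer
    {"source": "https://arxiv.org/pdf/2303.08774.pdf", "page": 0},
)
print(json.dumps(item, indent=2))
```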
For more details, refer to the Azure Cosmos DB for NoSQL vector search documentation.
Sign up for lifetime free access to get started today.
%pip install --upgrade --quiet azure-cosmos langchain-openai langchain-community
Note: you may need to restart the kernel to use updated packages.
OPENAI_API_KEY = "YOUR_KEY"
OPENAI_API_TYPE = "azure"
OPENAI_API_VERSION = "2023-05-15"
OPENAI_API_BASE = "YOUR_ENDPOINT"
OPENAI_EMBEDDINGS_MODEL_NAME = "text-embedding-ada-002"
OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT = "text-embedding-ada-002"
Insert Data
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
API Reference:PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
API Reference:RecursiveCharacterTextSplitter
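The effect of `chunk_size` and `chunk_overlap` can be sketched with a simplified fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs, newlines, and spaces); the hypothetical `chunk_with_overlap` function below only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

With the notebook's settings (`chunk_size=1000`, `chunk_overlap=150`), each chunk would repeat up to 150 characters of the previous one, which helps keep sentences that straddle a boundary retrievable from either side.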
print(docs[0])
page_content='GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction' metadata={'source': 'https://arxiv.org/pdf/2303.08774.pdf', 'page': 0}
Create Azure Cosmos DB NoSQL Vector Search
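Azure Cosmos DB for NoSQL now supports vector search. The container is configured through three policies: an indexing policy (including a vector index and a full-text index), a vector embedding policy, and a full-text policy.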
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": '/"_etag"/?'}],
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    "fullTextIndexes": [{"path": "/text"}],
}

vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}

full_text_policy = {
    "defaultLanguage": "en-US",
    "fullTextPaths": [{"path": "/text", "language": "en-US"}],
}
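The policies above refer to each other by path, so a typo in one of them is easy to miss. The following sketch is a hypothetical local sanity check, not part of the azure-cosmos SDK. It assumes that every vector index path should have a matching entry in the vector embedding policy and that `dimensions` should be a positive integer.

```python
def check_vector_policies(indexing_policy, vector_embedding_policy):
    """Return a list of problems; an empty list means the two policies agree."""
    problems = []
    embeddings = {
        spec["path"]: spec
        for spec in vector_embedding_policy.get("vectorEmbeddings", [])
    }
    # Every vector index should point at a path that has an embedding definition.
    for index in indexing_policy.get("vectorIndexes", []):
        if index["path"] not in embeddings:
            problems.append(
                f"vector index {index['path']} has no matching vector embedding"
            )
    # Every embedding definition should declare a usable dimension count.
    for path, spec in embeddings.items():
        dims = spec.get("dimensions")
        if not isinstance(dims, int) or dims <= 0:
            problems.append(f"vector embedding {path} has invalid dimensions: {dims!r}")
    return problems


# Same values as the notebook's policies, repeated so the sketch runs on its own.
example_indexing_policy = {
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
}
example_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}

print(check_vector_policies(example_indexing_policy, example_embedding_policy))
```

The `dimensions` value also has to match the embedding model's output size; 1536 corresponds to `text-embedding-ada-002`, the model used in this notebook.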
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
    AzureCosmosDBNoSqlVectorSearch,
)
from langchain_openai import OpenAIEmbeddings

# Placeholders: replace with your Azure Cosmos DB account endpoint and key
HOST = "AZURE_COSMOS_DB_ENDPOINT"
KEY = "AZURE_COSMOS_DB_KEY"

cosmos_client = CosmosClient(HOST, KEY)
database_name = "langchain_python_db"
container_name = "langchain_python_container"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}

openai_embeddings = OpenAIEmbeddings(
    deployment="smart-agent-embedding-ada",  # replace with your own deployment name
    model="text-embedding-ada-002",
    chunk_size=1,
    openai_api_key=OPENAI_API_KEY,  # the variable defined earlier, not a string literal
)
# Insert the documents into Azure Cosmos DB NoSQL together with their embeddings
vector_search = AzureCosmosDBNoSqlVectorSearch.from_documents(
    documents=docs,
    embedding=openai_embeddings,
    cosmos_client=cosmos_client,
    database_name=database_name,
    container_name=container_name,
    vector_embedding_policy=vector_embedding_policy,
    full_text_policy=full_text_policy,
    indexing_policy=indexing_policy,
    cosmos_container_properties=cosmos_container_properties,
    cosmos_database_properties={},
    full_text_search_enabled=True,
)
API Reference:AzureCosmosDBNoSqlVectorSearch | OpenAIEmbeddings
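Once the documents are stored, a vector search amounts to ordering items by their distance to a query embedding. The sketch below shows one way such a query could be assembled with the `VectorDistance` system function of the Azure Cosmos DB NoSQL query language. The helper `build_vector_query` and the `SimilarityScore` alias are illustrative assumptions, and the commented-out call at the end requires a live account, so treat it as a starting point to check against the official documentation rather than a definitive implementation.

```python
def build_vector_query(query_embedding, k=3, vector_property="embedding"):
    """Return (query_text, parameters) for a top-k vector search."""
    if not vector_property.isidentifier():
        # The property name is interpolated into the query text, so keep it strict.
        raise ValueError("vector_property must be a simple property name")
    distance = f"VectorDistance(c.{vector_property}, @embedding)"
    query_text = (
        f"SELECT TOP {int(k)} c.id, c.text, {distance} AS SimilarityScore "
        f"FROM c ORDER BY {distance}"
    )
    # The vector itself is passed as a parameter rather than inlined in the text.
    parameters = [{"name": "@embedding", "value": list(query_embedding)}]
    return query_text, parameters


query_text, parameters = build_vector_query([0.1, 0.2, 0.3], k=2)
print(query_text)

# With a live account, the query could be run roughly like this (assumption:
# cosmos_client, database_name, and container_name are defined as in this notebook):
#
# container = cosmos_client.get_database_client(database_name).get_container_client(
#     container_name
# )
# results = container.query_items(
#     query=query_text,
#     parameters=parameters,
#     enable_cross_partition_query=True,
# )
```

In practice the query embedding would come from the same embedding model used at insert time, so that query and document vectors live in the same space.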