Pinecone Hybrid Search

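Pinecone is a vector database with broad functionality.

This notebook shows how to use a retriever that uses Pinecone and hybrid search under the hood.

The logic of this retriever is taken from this documentation.

To use Pinecone, you must have an API key and an environment. Here are the installation instructions.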

%pip install --upgrade --quiet  pinecone pinecone-text pinecone-notebooks

# Connect to Pinecone and get an API key.
from pinecone_notebooks.colab import Authenticate

Authenticate()

import os

api_key = os.environ["PINECONE_API_KEY"]

from langchain_community.retrievers import (
    PineconeHybridSearchRetriever,
)


We want to use OpenAIEmbeddings, so we need an OpenAI API key.

import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Set up Pinecone

You should only have to do this part once.

import os

from pinecone import Pinecone, ServerlessSpec

index_name = "langchain-pinecone-hybrid-search"

# initialize Pinecone client
pc = Pinecone(api_key=api_key)

# create the index
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # dimensionality of dense model
        metric="dotproduct",  # sparse values supported only for dotproduct
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )


Now that the index is created, we can use it.

index = pc.Index(index_name)

Get embeddings and sparse encoders

Embeddings are used for the dense vectors; the tokenizer is used for the sparse vectors.

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
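
To make the dense/sparse distinction concrete: a dense embedding is a fixed-length list of floats (1536 values for the index above), while Pinecone represents a sparse vector as parallel `indices` and `values` lists. The following is a toy sketch only. `toy_sparse_encode` and `sparse_dot` are hypothetical helpers that hash whitespace-separated tokens and use raw counts as weights, which is not how `BM25Encoder` or `SpladeEncoder` compute their values.

```python
import zlib
from collections import Counter


def toy_sparse_encode(text, vocab_size=2**16):
    """Hash each token to an index and use its raw count as the weight.

    Hypothetical illustration; real sparse encoders weight terms differently.
    """
    counts = Counter(text.lower().split())
    weights = {}
    for token, count in counts.items():
        idx = zlib.crc32(token.encode("utf-8")) % vocab_size
        # Tokens that collide on the same index have their counts summed.
        weights[idx] = weights.get(idx, 0.0) + float(count)
    indices = sorted(weights)
    return {"indices": indices, "values": [weights[i] for i in indices]}


def sparse_dot(a, b):
    """Dot product of two sparse vectors in indices/values form."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(v * b_lookup.get(i, 0.0) for i, v in zip(a["indices"], a["values"]))


doc_vec = toy_sparse_encode("foo bar foo")
query_vec = toy_sparse_encode("foo")
print(doc_vec)
print(sparse_dot(doc_vec, query_vec))
```

Only indices present in both vectors contribute to the sparse dot product, which is why the sparse side of hybrid search rewards exact keyword overlap.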


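To encode text as sparse values, you can choose either SPLADE or BM25. For out-of-domain tasks, we recommend BM25.

For more information about the sparse encoders, see the pinecone-text library documentation.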

from pinecone_text.sparse import BM25Encoder

# or from pinecone_text.sparse import SpladeEncoder if you wish to work with SPLADE

# use default tf-idf values
bm25_encoder = BM25Encoder().default()

The code above uses the default tf-idf values. We strongly recommend fitting the tf-idf values to your own corpus. You can do that as follows:

corpus = ["foo", "bar", "world", "hello"]

# fit tf-idf values on your corpus
bm25_encoder.fit(corpus)

# store the values to a json file
bm25_encoder.dump("bm25_values.json")

# load the values into your BM25Encoder object
bm25_encoder = BM25Encoder().load("bm25_values.json")
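
Fitting matters because term weights depend on how common each term is in your corpus. The sketch below illustrates that idea under stated assumptions: a plain whitespace tokenizer and one common smoothed IDF formula, log((N + 1) / (df + 0.5)). The helper names are hypothetical, and the exact tokenization and formula used by pinecone-text may differ.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, how many documents contain it at least once."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc.lower().split()))
    return df


def smoothed_idf(token, df, n_docs):
    """Rarer tokens get larger weights; unseen tokens get the largest."""
    return math.log((n_docs + 1) / (df.get(token, 0) + 0.5))


toy_corpus = ["pinecone vector database", "vector search", "hybrid vector search"]
df = document_frequencies(toy_corpus)
for token in ("vector", "search", "pinecone"):
    print(token, round(smoothed_idf(token, df, len(toy_corpus)), 3))
```

A term such as "vector" that appears in every document receives a low weight, while a term confined to one document receives a higher one. Default values fitted on a generic corpus cannot reflect this for domain-specific vocabulary.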

Load Retriever

We can now construct the retriever!

retriever = PineconeHybridSearchRetriever(
    embeddings=embeddings, sparse_encoder=bm25_encoder, index=index
)
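
Hybrid search needs a way to balance the dense and sparse signals. One common approach is convex weighting: multiply the dense query vector by a weight `alpha` and the sparse query values by `1 - alpha` before querying a dot-product index. As far as we know, the retriever exposes such an `alpha` setting, but treat the parameter name and its default as assumptions to verify against the API reference. The function below is a hypothetical plain-Python sketch of the weighting itself, not the library's code.

```python
def convex_scale(dense, sparse, alpha):
    """Weight dense values by alpha and sparse values by (1 - alpha).

    alpha=1.0 keeps only the dense (semantic) signal;
    alpha=0.0 keeps only the sparse (keyword) signal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [v * (1.0 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


dense_query = [1.0, 2.0]  # stand-in for a real 1536-dimensional embedding
sparse_query = {"indices": [3], "values": [4.0]}
print(convex_scale(dense_query, sparse_query, alpha=0.25))
```

Because the index uses the dot-product metric, scaling the query this way scales each side's contribution to the final score by the same factors.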

Add texts (if necessary)

We can optionally add texts to the retriever if they are not already in the index.

retriever.add_texts(["foo", "bar", "world", "hello"])

100%|██████████| 1/1 [00:02<00:00,  2.27s/it]

Use Retriever

We can now use the retriever!

result = retriever.invoke("foo")

result[0]

Document(page_content='foo', metadata={})

Retriever conceptual guide
Retriever how-to guides
