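Identity-Enabled RAG Using PebbloRetrievalQA

PebbloRetrievalQA is a retrieval chain with identity and semantic enforcement for question answering against a vector database.

This notebook covers how to retrieve documents using identity and semantic enforcement (denied topics/entities). For more details on Pebblo and its SafeRetriever feature, see the Pebblo documentation.

Steps:

Loading Documents: We will load documents with authorization and semantic metadata into an in-memory Qdrant vector store. This vector store will serve as the retriever in PebbloRetrievalQA.

Note: On the ingestion side, PebbloSafeLoader is the recommended counterpart for loading documents with authorization and semantic metadata. It loads documents securely and efficiently while preserving metadata integrity.

Testing Enforcement Mechanisms: We will test identity enforcement and semantic enforcement separately. For each use case, we will define a dedicated "ask" function that supplies the required context (auth_context and semantic_context, respectively) and then pose our questions.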
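
To make the two kinds of context concrete before the setup, here is a minimal sketch of the shape of a chain input. It uses plain dictionaries only. The field names (query, auth_context, user_id, user_auth, semantic_context, pebblo_semantic_topics, pebblo_semantic_entities, deny) mirror the ones used later in this notebook. The helper build_chain_input is hypothetical and is not part of Pebblo or LangChain.

```python
from typing import Any, Dict, List, Optional


def build_chain_input(
    query: str,
    user_id: Optional[str] = None,
    user_auth: Optional[List[str]] = None,
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble a plain-dict chain input."""
    # Identity context: who is asking and which groups they belong to.
    auth_context = None
    if user_auth:
        auth_context = {"user_id": user_id, "user_auth": list(user_auth)}

    # Semantic context: topics and entities that must be denied.
    semantic_context: Dict[str, Any] = {}
    if topics_to_deny:
        semantic_context["pebblo_semantic_topics"] = {"deny": list(topics_to_deny)}
    if entities_to_deny:
        semantic_context["pebblo_semantic_entities"] = {"deny": list(entities_to_deny)}

    return {
        "query": query,
        "auth_context": auth_context,
        "semantic_context": semantic_context or None,
    }


example = build_chain_input(
    "Share the financial performance of ACME Corp for the year 2020",
    user_id="finance-user@acme.org",
    user_auth=["finance-team"],
    topics_to_deny=["financial-report"],
)
print(example)
```

In the notebook itself, the same information is wrapped in the AuthContext, SemanticContext, and ChainInput models rather than in raw dictionaries.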

Setup

Dependencies

In this walkthrough, we will use an OpenAI LLM, OpenAI embeddings, and a Qdrant vector store.

%pip install --upgrade --quiet langchain langchain_core langchain-community langchain-openai qdrant_client

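Identity-Aware Data Ingestion

Here we use Qdrant as the vector database; however, you can use any of the supported vector databases.

The PebbloRetrievalQA chain supports the following vector databases:

Qdrant
Pinecone
Postgres (using the pgvector extension)

Load the vector database with authorization and semantic information in the metadata:

In this step, we capture the authorization and semantic information of the source document in the authorized_identities, pebblo_semantic_topics, and pebblo_semantic_entities fields of the VectorDB entry metadata for each chunk.

Note: To use the PebbloRetrievalQA chain, you must always place the authorization and semantic metadata in these fields, and each field must contain a list of strings.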
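
Because the chain depends on these three fields holding lists of strings, it can help to check the metadata before ingestion. The following is a minimal sketch in plain Python. The helper find_metadata_problems is hypothetical, and treating a missing field as a problem is an assumption made here for illustration, not documented Pebblo behavior.

```python
from typing import Any, Dict, List

# The three metadata fields named in this notebook.
REQUIRED_LIST_FIELDS = (
    "authorized_identities",
    "pebblo_semantic_topics",
    "pebblo_semantic_entities",
)


def find_metadata_problems(metadata: Dict[str, Any]) -> List[str]:
    """Hypothetical check: report fields that are not lists of strings."""
    problems = []
    for field in REQUIRED_LIST_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, list):
            problems.append(f"{field}: expected a list, got {type(value).__name__}")
        elif not all(isinstance(item, str) for item in value):
            problems.append(f"{field}: every item must be a string")
    return problems


good = {
    "pebblo_semantic_topics": ["financial-report"],
    "pebblo_semantic_entities": ["us-bank-account-number"],
    "authorized_identities": ["finance-team", "exec-leadership"],
}
bad = dict(good, authorized_identities="finance-team")  # a bare string, not a list

print(find_metadata_problems(good))
print(find_metadata_problems(bad))
```

A check like this could run over each chunk's metadata before the documents are handed to the vector store.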

from langchain_community.vectorstores.qdrant import Qdrant
from langchain_core.documents import Document
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_openai.llms import OpenAI

llm = OpenAI()
embeddings = OpenAIEmbeddings()
collection_name = "pebblo-identity-and-semantic-rag"

page_content = """
**ACME Corp Financial Report**

**Overview:**
ACME Corp, a leading player in the merger and acquisition industry, presents its financial report for the fiscal year ending December 31, 2020. 
Despite a challenging economic landscape, ACME Corp demonstrated robust performance and strategic growth.

**Financial Highlights:**
Revenue soared to $50 million, marking a 15% increase from the previous year, driven by successful deal closures and expansion into new markets. 
Net profit reached $12 million, showcasing a healthy margin of 24%.

**Key Metrics:**
Total assets surged to $80 million, reflecting a 20% growth, highlighting ACME Corp's strong financial position and asset base. 
Additionally, the company maintained a conservative debt-to-equity ratio of 0.5, ensuring sustainable financial stability.

**Future Outlook:**
ACME Corp remains optimistic about the future, with plans to capitalize on emerging opportunities in the global M&A landscape. 
The company is committed to delivering value to shareholders while maintaining ethical business practices.

**Bank Account Details:**
For inquiries or transactions, please refer to ACME Corp's US bank account:
Account Number: 123456789012
Bank Name: Fictitious Bank of America
"""

documents = [
    Document(
        **{
            "page_content": page_content,
            "metadata": {
                "pebblo_semantic_topics": ["financial-report"],
                "pebblo_semantic_entities": ["us-bank-account-number"],
                "authorized_identities": ["finance-team", "exec-leadership"],
                "page": 0,
                "source": "https://drive.google.com/file/d/xxxxxxxxxxxxx/view",
                "title": "ACME Corp Financial Report.pdf",
            },
        }
    )
]

vectordb = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name=collection_name,
)

print("Vectordb loaded.")


Vectordb loaded.

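Retrieval with Identity Enforcement

The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used as context are retrieved only from documents the user is authorized to access. To achieve this, the Gen-AI application must provide an authorization context to the retrieval chain. This auth_context should be filled with the identity and authorization groups of the user accessing the Gen-AI app.

Below is sample code for PebbloRetrievalQA in which user_auth (a list of the user's authorizations, which may include their user ID and the groups they belong to) comes from the user accessing the RAG application and is passed in auth_context.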

from langchain_community.chains import PebbloRetrievalQA
from langchain_community.chains.pebblo_retrieval.models import AuthContext, ChainInput

# Initialize PebbloRetrievalQA chain
qa_chain = PebbloRetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    app_name="pebblo-identity-rag",
    description="Identity Enforcement app using PebbloRetrievalQA",
    owner="ACME Corp",
)


def ask(question: str, auth_context: dict):
    """
    Ask a question to the PebbloRetrievalQA chain
    """
    auth_context_obj = AuthContext(**auth_context) if auth_context else None
    chain_input_obj = ChainInput(query=question, auth_context=auth_context_obj)
    return qa_chain.invoke(chain_input_obj.dict())


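1. Questions by an Authorized User

We ingested data for the authorized identities ["finance-team", "exec-leadership"], so a user with the authorized identity/group finance-team should receive the correct answer.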

auth = {
    "user_id": "finance-user@acme.org",
    "user_auth": [
        "finance-team",
    ],
}

question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")

Question: Share the financial performance of ACME Corp for the year 2020

Answer: 
Revenue: $50 million (15% increase from previous year)
Net profit: $12 million (24% margin)
Total assets: $80 million (20% growth)
Debt-to-equity ratio: 0.5

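2. Questions by an Unauthorized User

Since the user's authorized identity/group eng-support is not included in the authorized identities ["finance-team", "exec-leadership"], we should not receive an answer.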

auth = {
    "user_id": "eng-user@acme.org",
    "user_auth": [
        "eng-support",
    ],
}

question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")

Question: Share the financial performance of ACME Corp for the year 2020

Answer:  I don't know.
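
The two outcomes above are consistent with a simple rule: a chunk is usable only if the user's user_auth list shares at least one entry with the chunk's authorized_identities. The sketch below illustrates that rule in plain Python. It is a conceptual model only, not Pebblo's implementation, and the helper names is_authorized and filter_by_identity are hypothetical.

```python
from typing import Any, Dict, List


def is_authorized(metadata: Dict[str, Any], user_auth: List[str]) -> bool:
    """Return True if the user holds at least one identity listed on the chunk."""
    allowed = set(metadata.get("authorized_identities", []))
    return bool(allowed & set(user_auth))


def filter_by_identity(
    chunks: List[Dict[str, Any]], user_auth: List[str]
) -> List[Dict[str, Any]]:
    """Keep only the chunks the user is allowed to see."""
    return [chunk for chunk in chunks if is_authorized(chunk["metadata"], user_auth)]


# A stand-in for the single chunk ingested earlier in this notebook.
chunks = [
    {
        "text": "ACME Corp Financial Report",
        "metadata": {"authorized_identities": ["finance-team", "exec-leadership"]},
    }
]

finance_view = filter_by_identity(chunks, ["finance-team"])
eng_view = filter_by_identity(chunks, ["eng-support"])
print(len(finance_view), len(eng_view))
```

With no chunks left for the eng-support user, the LLM receives no context, which matches the "I don't know" style of answer shown above.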

3. Using PromptTemplate to Provide Additional Instructions

You can use a PromptTemplate to give the LLM additional instructions for generating a custom response.

from langchain_core.prompts import PromptTemplate

prompt_template = PromptTemplate.from_template(
    """
Answer the question using the provided context. 
If no context is provided, just say "I'm sorry, but that information is unavailable, or Access to it is restricted.".

Question: {question}
"""
)

question = "Share the financial performance of ACME Corp for the year 2020"
prompt = prompt_template.format(question=question)


3.1 Questions by an Authorized User

auth = {
    "user_id": "finance-user@acme.org",
    "user_auth": [
        "finance-team",
    ],
}
resp = ask(prompt, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")

Question: Share the financial performance of ACME Corp for the year 2020

Answer: 
Revenue soared to $50 million, marking a 15% increase from the previous year, and net profit reached $12 million, showcasing a healthy margin of 24%. Total assets also grew by 20% to $80 million, and the company maintained a conservative debt-to-equity ratio of 0.5.

3.2 Questions by an Unauthorized User

auth = {
    "user_id": "eng-user@acme.org",
    "user_auth": [
        "eng-support",
    ],
}
resp = ask(prompt, auth)
print(f"Question: {question}\n\nAnswer: {resp['result']}")

Question: Share the financial performance of ACME Corp for the year 2020

Answer: 
I'm sorry, but that information is unavailable, or Access to it is restricted.

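Retrieval with Semantic Enforcement

The PebbloRetrievalQA chain uses SafeRetrieval to ensure that the snippets used as context are retrieved only from documents that comply with the provided semantic context. To achieve this, the Gen-AI application must provide a semantic context to the retrieval chain. This semantic_context should include the topics and entities that should be denied for the user accessing the Gen-AI app.

Below is sample code for PebbloRetrievalQA with topics_to_deny and entities_to_deny. These are passed in semantic_context to the chain input.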

from typing import List, Optional

from langchain_community.chains import PebbloRetrievalQA
from langchain_community.chains.pebblo_retrieval.models import (
    ChainInput,
    SemanticContext,
)

# Initialize PebbloRetrievalQA chain
qa_chain = PebbloRetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(),
    app_name="pebblo-semantic-rag",
    description="Semantic Enforcement app using PebbloRetrievalQA",
    owner="ACME Corp",
)


def ask(
    question: str,
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
):
    """
    Ask a question to the PebbloRetrievalQA chain
    """
    semantic_context = dict()
    if topics_to_deny:
        semantic_context["pebblo_semantic_topics"] = {"deny": topics_to_deny}
    if entities_to_deny:
        semantic_context["pebblo_semantic_entities"] = {"deny": entities_to_deny}

    semantic_context_obj = (
        SemanticContext(**semantic_context) if semantic_context else None
    )
    chain_input_obj = ChainInput(query=question, semantic_context=semantic_context_obj)
    return qa_chain.invoke(chain_input_obj.dict())


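1. Without Semantic Enforcement

Since no semantic enforcement is applied, the system should return the answer without excluding any context based on its semantic labels.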

topic_to_deny = []
entities_to_deny = []
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)

Topics to deny: []
Entities to deny: []
Question: Share the financial performance of ACME Corp for the year 2020
Answer: 
Revenue for ACME Corp increased by 15% to $50 million in 2020, with a net profit of $12 million and a strong asset base of $80 million. The company also maintained a conservative debt-to-equity ratio of 0.5.

2. Deny the financial-report Topic

The data was ingested with the topics ["financial-report"]. Therefore, an app that denies the financial-report topic should not receive an answer.

topic_to_deny = ["financial-report"]
entities_to_deny = []
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)

Topics to deny: ['financial-report']
Entities to deny: []
Question: Share the financial performance of ACME Corp for the year 2020
Answer: 

Unfortunately, I do not have access to the financial performance of ACME Corp for the year 2020.

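3. Deny the us-bank-account-number Entity

The data was ingested with the entities ["us-bank-account-number"]. Since the us-bank-account-number entity is denied, the system should not return the answer.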

topic_to_deny = []
entities_to_deny = ["us-bank-account-number"]
question = "Share the financial performance of ACME Corp for the year 2020"
resp = ask(question, topics_to_deny=topic_to_deny, entities_to_deny=entities_to_deny)
print(
    f"Topics to deny: {topic_to_deny}\nEntities to deny: {entities_to_deny}\n"
    f"Question: {question}\nAnswer: {resp['result']}"
)

Topics to deny: []
Entities to deny: ['us-bank-account-number']
Question: Share the financial performance of ACME Corp for the year 2020
Answer:  I don't have information about ACME Corp's financial performance for 2020.
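
The three outcomes above are consistent with a deny-list rule: a chunk is dropped from the context if any of its pebblo_semantic_topics or pebblo_semantic_entities appears in the corresponding deny list. The sketch below models that rule in plain Python. It is a conceptual illustration under that assumption, not Pebblo's implementation, and the helper names is_denied and filter_by_semantics are hypothetical.

```python
from typing import Any, Dict, List, Optional


def is_denied(
    metadata: Dict[str, Any],
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> bool:
    """Return True if the chunk carries any denied topic or entity."""
    topics = set(metadata.get("pebblo_semantic_topics", []))
    entities = set(metadata.get("pebblo_semantic_entities", []))
    topic_hit = bool(topics & set(topics_to_deny or []))
    entity_hit = bool(entities & set(entities_to_deny or []))
    return topic_hit or entity_hit


def filter_by_semantics(
    chunks: List[Dict[str, Any]],
    topics_to_deny: Optional[List[str]] = None,
    entities_to_deny: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep only the chunks that carry no denied topic or entity."""
    return [
        chunk
        for chunk in chunks
        if not is_denied(chunk["metadata"], topics_to_deny, entities_to_deny)
    ]


# A stand-in for the single chunk ingested earlier in this notebook.
chunks = [
    {
        "text": "ACME Corp Financial Report",
        "metadata": {
            "pebblo_semantic_topics": ["financial-report"],
            "pebblo_semantic_entities": ["us-bank-account-number"],
        },
    }
]

no_enforcement = filter_by_semantics(chunks)
topic_denied = filter_by_semantics(chunks, topics_to_deny=["financial-report"])
entity_denied = filter_by_semantics(chunks, entities_to_deny=["us-bank-account-number"])
print(len(no_enforcement), len(topic_denied), len(entity_denied))
```

Note that in this model the whole chunk is excluded when it carries a denied entity, which is why the financial question goes unanswered in the third case even though the question itself is not about the bank account.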
