Activeloop Deep Memory
Activeloop Deep Memory is a suite of tools that lets you optimize your Vector Store for your specific use case and achieve higher accuracy in your LLM apps.
Retrieval-Augmented Generation (RAG) has recently gained a lot of attention, and with the emergence of advanced RAG techniques and agents, the potential of what RAG can accomplish keeps expanding. However, integrating RAG into production comes with challenges. The primary factors to consider when implementing RAG in production settings are accuracy (recall), cost, and latency. For basic use cases, OpenAI's Ada model paired with a naive similarity search can produce satisfactory results. For higher accuracy or recall during searches, however, you may need to employ advanced retrieval techniques. These methods might involve varying chunk sizes, rewriting queries multiple times, and more, which can increase latency and costs. Activeloop's Deep Memory, a feature available to Activeloop Deep Lake users, addresses these issues by introducing a tiny neural network layer trained to match user queries with relevant data from the corpus. While this addition incurs minimal latency during search, it can boost retrieval accuracy by up to 27%, and it remains cost-effective and simple to use, without requiring any additional advanced RAG techniques.
In this tutorial, we will parse the DeepLake documentation and create a RAG system that can answer questions based on those docs.
1. Dataset Creation
In this tutorial we will parse activeloop's docs using the BeautifulSoup library and LangChain's document parsers, such as Html2TextTransformer and AsyncHtmlLoader. So we will need to install the following libraries:
%pip install --upgrade --quiet tiktoken langchain-openai python-dotenv datasets langchain deeplake beautifulsoup4 html2text ragas
You'll also need to create an Activeloop account.
ORG_ID = "..."
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import DeepLake
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API token: ")

# activeloop token is needed if you are not signed in using CLI: `activeloop login -u <USERNAME> -p <PASSWORD>`
if "ACTIVELOOP_TOKEN" not in os.environ:
    os.environ["ACTIVELOOP_TOKEN"] = getpass.getpass(
        "Enter your ActiveLoop API token: "
    )  # Get your API token from https://app.activeloop.ai, click on your profile picture in the top right corner, and select "API Tokens"

token = os.getenv("ACTIVELOOP_TOKEN")
openai_embeddings = OpenAIEmbeddings()

db = DeepLake(
    dataset_path=f"hub://{ORG_ID}/deeplake-docs-deepmemory",  # org_id stands for your username or organization from activeloop
    embedding=openai_embeddings,
    runtime={"tensor_db": True},
    token=token,
    # overwrite=True,  # use the overwrite flag if you want to overwrite the full dataset
    read_only=False,
)
Parsing all the links on the page using BeautifulSoup:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def get_all_links(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve the page: {url}")
        return []

    soup = BeautifulSoup(response.content, "html.parser")

    # Finding all 'a' tags, which typically contain the href attribute for links
    links = [
        urljoin(url, a["href"]) for a in soup.find_all("a", href=True) if a["href"]
    ]

    return links
base_url = "https://docs.deeplake.ai/en/latest/"
all_links = get_all_links(base_url)
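Note that get_all_links returns every anchor on the page, including duplicates and links that point outside the documentation. An optional pre-filter is sketched below; the startswith(base_url) check is just one reasonable heuristic and not part of the original tutorial:

# Optional: keep only in-domain links and drop duplicates before loading.
# The startswith(base_url) heuristic is an assumption, not part of the tutorial.
all_links = sorted({link for link in all_links if link.startswith(base_url)})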
Loading the data:
from langchain_community.document_loaders.async_html import AsyncHtmlLoader
loader = AsyncHtmlLoader(all_links)
docs = loader.load()
Converting the data into a user-readable format:
from langchain_community.document_transformers import Html2TextTransformer
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)
Now, let's chunk the documents further, since some of them contain too much text:
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 4096
docs_new = []

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
)

for doc in docs_transformed:
    if len(doc.page_content) < chunk_size:
        docs_new.append(doc)
    else:
        docs = text_splitter.create_documents([doc.page_content])
        docs_new.extend(docs)
Populating the VectorStore:
docs = db.add_documents(docs_new)
2. Generating Synthetic Queries and Training Deep Memory
The next step is to train a deep_memory model that will align your users' queries with the dataset you already have. If you don't have any user queries yet, no worries, we will generate them using an LLM!
[Figure: overall deep_memory training workflow]
The schema above shows the overall workflow of deep_memory. As you can see, in order to train it you need relevance labels and queries, together with the corpus data (the data we want to query). The corpus data was already populated in the previous section; here we will generate the questions and relevance.
- `questions` is a list of strings, where each string represents a query.
- `relevance` contains the ground-truth links for each question. Several documents may contain the answer to a given question, so relevance has the type `List[List[tuple[str, float]]]`: the outer list represents queries and the inner list relevant documents. Each tuple is a string-float pair where the string is the id of the source document (corresponding to the `id` tensor in the dataset) and the float indicates how relevant that document is to the question.
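For illustration, here is a tiny hypothetical pair in the expected shape (the question and document id below are made up):

# Hypothetical example of the expected shapes (the id is made up):
questions = ["How do I create a Deep Lake dataset?"]
relevance = [
    [("f0f0f0f0a1b2c3d4", 1.0)],  # one relevant source doc, fully relevant
]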
Now, let's generate synthetic questions and relevance labels:
from typing import List

from langchain.chains.openai_functions import (
    create_structured_output_chain,
)
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

# fetch dataset docs and ids if they exist (optional, you can also ingest)
docs = db.vectorstore.dataset.text.data(fetch_chunks=True, aslist=True)["value"]
ids = db.vectorstore.dataset.id.data(fetch_chunks=True, aslist=True)["value"]

# If we pass in a model explicitly, we need to make sure it supports the OpenAI function-calling API.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)


class Questions(BaseModel):
    """A question generated from the provided text."""

    question: str = Field(..., description="Question about the text")


prompt_msgs = [
    SystemMessage(
        content="You are a world class expert for generating questions based on provided context. "
        "You make sure the question can be answered by the text."
    ),
    HumanMessagePromptTemplate.from_template(
        "Use the given text to generate a question from the following input: {input}"
    ),
    HumanMessage(content="Tips: Make sure to answer in the correct format"),
]
prompt = ChatPromptTemplate(messages=prompt_msgs)
chain = create_structured_output_chain(Questions, llm, prompt, verbose=True)

text = "# Understanding Hallucinations and Bias ## **Introduction** In this lesson, we'll cover the concept of **hallucinations** in LLMs, highlighting their influence on AI applications and demonstrating how to mitigate them using techniques like the retriever's architectures. We'll also explore **bias** within LLMs with examples."

questions = chain.run(input=text)
print(questions)
import random

from langchain_openai import OpenAIEmbeddings
from tqdm import tqdm


def generate_queries(docs: List[str], ids: List[str], n: int = 100):
    questions = []
    relevances = []
    pbar = tqdm(total=n)
    while len(questions) < n:
        # 1. randomly draw a piece of text and its relevance id
        r = random.randint(0, len(docs) - 1)
        text, label = docs[r], ids[r]

        # 2. generate queries and assign the relevance id
        generated_qs = [chain.run(input=text).question]
        questions.extend(generated_qs)
        relevances.extend([[(label, 1)] for _ in generated_qs])
        pbar.update(len(generated_qs))

        if len(questions) % 10 == 0:
            print(f"q: {len(questions)}")
    return questions[:n], relevances[:n]


chain = create_structured_output_chain(Questions, llm, prompt, verbose=False)
questions, relevances = generate_queries(docs, ids, n=200)

train_questions, train_relevances = questions[:100], relevances[:100]
test_questions, test_relevances = questions[100:], relevances[100:]
Now we have created 100 training queries and 100 queries for testing. Let's train deep_memory:
job_id = db.vectorstore.deep_memory.train(
    queries=train_questions,
    relevance=train_relevances,
)
Let's track the training progress:
db.vectorstore.deep_memory.status(job_id)  # the job id returned by train() above
--------------------------------------------------------------
| 6538e02ecda4691033a51c5b |
--------------------------------------------------------------
| status | completed |
--------------------------------------------------------------
| progress | eta: 1.4 seconds |
| | recall@10: 79.00% (+34.00%) |
--------------------------------------------------------------
| results | recall@10: 79.00% (+34.00%) |
--------------------------------------------------------------
3. Evaluating Deep Memory Performance
Great, we've trained the model! It shows a substantial improvement in recall, but how can we use it now and evaluate it on new, unseen data? In this section we will dive into model evaluation and inference and see how they can be used with LangChain to increase retrieval accuracy.
3.1 Deep Memory Evaluation
For a start, we can use deep_memory's built-in evaluation method, which computes several recall metrics.
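Recall@k measures the fraction of queries for which at least one ground-truth document appears among the top-k retrieved results. A minimal sketch of what the metric computes (illustrative only, not Deep Memory's internal implementation):

from typing import List, Tuple


def recall_at_k(
    retrieved_ids: List[List[str]],  # top-k doc ids returned for each query
    relevance: List[List[Tuple[str, float]]],  # ground-truth (id, score) pairs
    k: int = 10,
) -> float:
    """Fraction of queries whose top-k results contain a relevant doc."""
    hits = 0
    for topk, relevant in zip(retrieved_ids, relevance):
        relevant_ids = {doc_id for doc_id, _ in relevant}
        if relevant_ids & set(topk[:k]):
            hits += 1
    return hits / len(retrieved_ids)

The built-in evaluation itself takes just a couple of lines: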
recall = db.vectorstore.deep_memory.evaluate(
    queries=test_questions,
    relevance=test_relevances,
)
Embedding queries took 0.81 seconds
---- Evaluating without model ----
Recall@1: 9.0%
Recall@3: 19.0%
Recall@5: 24.0%
Recall@10: 42.0%
Recall@50: 93.0%
Recall@100: 98.0%
---- Evaluating with model ----
Recall@1: 19.0%
Recall@3: 42.0%
Recall@5: 49.0%
Recall@10: 69.0%
Recall@50: 97.0%
Recall@100: 97.0%
It shows quite a substantial improvement on the unseen test dataset as well!
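To actually use the trained model at query time, the LangChain DeepLake integration lets you pass a deep_memory flag through the retriever's search kwargs. A minimal sketch (the query string is made up, and k=10 is an arbitrary choice):

# A minimal sketch of inference with the trained model.
# Assumes the DeepLake vector store `db` from above; setting
# search_kwargs["deep_memory"] = True routes similarity search
# through the trained Deep Memory model.
retriever = db.as_retriever()
retriever.search_kwargs["deep_memory"] = True
retriever.search_kwargs["k"] = 10

relevant_docs = retriever.invoke("How can I create a Deep Lake dataset?")
print(relevant_docs[0].page_content[:200])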