
Apache Doris

Apache Doris is a modern data warehouse for real-time analytics. It delivers lightning-fast analytics on real-time data at scale.

Usually, Apache Doris is categorized as an OLAP database, and it has shown excellent performance in ClickBench, a benchmark for analytical DBMSs. Since it has a super-fast vectorized execution engine, it can also be used as a fast vector database.

Setup

%pip install --upgrade --quiet  pymysql

Set update_vectordb = False at the beginning. If no docs have been updated, we don't need to rebuild the document embeddings.

!pip install sqlalchemy
!pip install langchain langchain-community langchain-openai langchain-text-splitters
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import (
    DirectoryLoader,
    UnstructuredMarkdownLoader,
)
from langchain_community.vectorstores.apache_doris import (
    ApacheDoris,
    ApacheDorisSettings,
)
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter

update_vectordb = False

Load docs and split them into tokens

Load all markdown files under the docs directory.

For the Apache Doris documentation, you can clone the repo from https://github.com/apache/doris, which contains a docs directory, as sketched below.
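A minimal sketch of fetching those files, assuming git is available in the notebook environment and that the repository's docs directory still contains the markdown sources (adjust the paths if the layout has changed):

# a shallow clone is enough for the markdown docs; paths are assumptions for this notebook
!git clone --depth 1 https://github.com/apache/doris.git
!cp -r doris/docs ./docs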

loader = DirectoryLoader(
    "./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()

Split the docs into tokens, and set update_vectordb = True because there are new docs/tokens.

# load text splitter and split docs into snippets of text
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)

# tell vectordb to update text embeddings
update_vectordb = True

split_docs[-20]

print("# docs = %d, # splits = %d" % (len(documents), len(split_docs)))

Create a vectordb instance

Use Apache Doris as the vectordb

def gen_apache_doris(update_vectordb, embeddings, settings):
    if update_vectordb:
        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)
    else:
        docsearch = ApacheDoris(embeddings, settings)
    return docsearch

Convert tokens into embeddings and put them into the vectordb

Here we use Apache Doris as the vectordb. You can configure the Apache Doris instance via ApacheDorisSettings.

Configuring an Apache Doris instance is pretty much like configuring a MySQL instance; a quick connectivity check is sketched after the list below. You need to specify:

  1. host/port
  2. username (default: 'root')
  3. password (default: '')
  4. database (default: 'default')
  5. table (default: 'langchain')
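Since Apache Doris speaks the MySQL protocol, you can optionally verify these connection parameters with pymysql (installed above) before creating the vector store. This is a minimal sketch; the host and port values mirror the settings used later in this notebook and are placeholders for your own environment:

# optional connectivity check: Doris speaks the MySQL protocol
# (host/port below are placeholders; reuse the values you pass to ApacheDorisSettings)
import pymysql

conn = pymysql.connect(host="172.30.34.130", port=9030, user="root", password="")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
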
import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass()
update_vectordb = True

embeddings = OpenAIEmbeddings()

# configure Apache Doris settings (host/port/user/pw/db)
settings = ApacheDorisSettings()
settings.port = 9030
settings.host = "172.30.34.130"
settings.username = "root"
settings.password = ""
settings.database = "langchain"
docsearch = gen_apache_doris(update_vectordb, embeddings, settings)

print(docsearch)

update_vectordb = False
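Before building a QA chain, you can optionally query the vector store directly. similarity_search is the generic LangChain vector-store method, and k=3 here is an arbitrary choice:

# quick retrieval check against the vector store (optional)
hits = docsearch.similarity_search("what is apache doris", k=3)
for hit in hits:
    print(hit.page_content[:80])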

Build QA and ask questions

llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
query = "what is apache doris"
resp = qa.run(query)
print(resp)
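
On newer LangChain releases, Chain.run is deprecated in favor of invoke. A sketch of the equivalent call, assuming the default RetrievalQA output key "result":

# equivalent call using invoke(), which replaces the deprecated run()
resp = qa.invoke({"query": query})
print(resp["result"])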