Apache Doris
Apache Doris is a modern data warehouse for real-time analytics. It delivers lightning-fast analytics on real-time data at scale.
Apache Doris is usually categorized as an OLAP system, and it is listed in ClickBench, a benchmark for analytical databases.
Setup
%pip install --upgrade --quiet pymysql
Set update_vectordb = False at the start. If no documents have been updated, there is no need to rebuild the document embeddings.
%pip install --upgrade --quiet sqlalchemy
%pip install --upgrade --quiet langchain langchain-community langchain-openai langchain-text-splitters
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import (
    DirectoryLoader,
    UnstructuredMarkdownLoader,
)
from langchain_community.vectorstores.apache_doris import (
    ApacheDoris,
    ApacheDorisSettings,
)
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter
update_vectordb = False
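The flag is set by hand here. As a minimal sketch of how the "has anything changed?" decision could be automated, the helper below hashes every markdown file under a directory and compares the result with a fingerprint saved on a previous run. The function names and the `.docs_fingerprint` state file are hypothetical and are not part of LangChain or Apache Doris.

```python
import hashlib
from pathlib import Path


def docs_fingerprint(root: str = "./docs") -> str:
    """Hash the relative path and contents of every .md file under root."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).glob("**/*.md")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode("utf-8"))
            digest.update(b"\0")
            digest.update(path.read_bytes())
            digest.update(b"\0")
    return digest.hexdigest()


def has_docs_changed(
    root: str = "./docs", state_file: str = ".docs_fingerprint"
) -> bool:
    """Return True when no fingerprint was saved or the docs differ from it."""
    state = Path(state_file)
    previous = state.read_text().strip() if state.exists() else None
    return docs_fingerprint(root) != previous


def remember_fingerprint(
    root: str = "./docs", state_file: str = ".docs_fingerprint"
) -> None:
    """Save the current fingerprint; call this after embeddings are rebuilt."""
    Path(state_file).write_text(docs_fingerprint(root))
```

With these helpers, one option is `update_vectordb = has_docs_changed()`, followed by `remember_fingerprint()` once the embeddings have been rebuilt successfully, so that a failed rebuild is retried on the next run.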
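Load documents and split them into tokens
Load all markdown files under the docs directory.
For the Apache Doris documentation, you can clone the repository from https://github.com/apache/doris, which contains a docs directory.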
loader = DirectoryLoader(
    "./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()
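To check which files the `**/*.md` pattern would pick up before running the loader, a standard-library sketch like the one below can help. It only approximates the loader's file selection with `pathlib`; `list_markdown_files` is a hypothetical helper, and the loader itself may apply extra filtering that is not reproduced here.

```python
from pathlib import Path


def list_markdown_files(root: str = "./docs") -> list[Path]:
    """Return every .md file under root, including nested subdirectories."""
    # "**" matches zero or more directory levels, so files directly in
    # root are included alongside deeply nested ones.
    return sorted(p for p in Path(root).glob("**/*.md") if p.is_file())
```

Calling `len(list_markdown_files("./docs"))` gives a rough expectation for how many documents the load step should produce.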
Split the documents into tokens and set update_vectordb = True, because there are new documents/tokens.
# load text splitter and split docs into snippets of text
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
# tell vectordb to update text embeddings
update_vectordb = True
split_docs[-20]
print("# docs = %d, # splits = %d" % (len(documents), len(split_docs)))
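The effect of chunk_size=400 and chunk_overlap=50 can be pictured with a small sliding-window sketch. It works on a plain list of integers standing in for token IDs, whereas the real splitter first tokenizes the text, so `split_with_overlap` is an illustrative helper rather than the library's implementation.

```python
def split_with_overlap(tokens, chunk_size=400, chunk_overlap=50):
    """Cut tokens into windows of chunk_size that share chunk_overlap items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks


chunks = split_with_overlap(list(range(1000)))
print([len(c) for c in chunks])  # [400, 400, 300]
```

Each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next. That repetition keeps sentences that straddle a boundary retrievable from either chunk.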