Grobid

GROBID is a machine learning library for extracting, parsing, and restructuring raw documents.

It is designed for, and expected to be used on, academic papers, where it works particularly well.

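Note: If the articles supplied to Grobid are large documents (e.g. dissertations) that exceed a certain number of elements, they might not be processed.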

This page covers how to use Grobid to parse articles for LangChain.

Installation

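Grobid installation is described in detail at https://grobid.readthedocs.io/en/latest/Install-Grobid/. However, it is probably easier and less troublesome to run Grobid in a Docker container, which is also covered in the Grobid documentation.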

Use Grobid with LangChain

Once Grobid is installed and running (you can check by opening http://localhost:8070), you are ready to go.
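The same check can be scripted rather than done in a browser. The sketch below is a minimal example using only the Python standard library. It assumes the server exposes a health-check endpoint at `/api/isalive` that replies with the text `true`; the helper name `grobid_is_alive` is hypothetical and not part of LangChain or Grobid.

```python
from urllib.error import URLError
from urllib.request import urlopen


def grobid_is_alive(base_url: str = "http://localhost:8070", timeout: float = 5.0) -> bool:
    """Return True if a Grobid server answers its health-check endpoint."""
    # Assumption: the server exposes GET /api/isalive and replies with "true".
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace").strip().lower()
            return response.status == 200 and body == "true"
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False


if __name__ == "__main__":
    print("Grobid reachable:", grobid_is_alive())
```

Returning `False` instead of raising keeps the helper easy to use as a guard before a long batch of PDFs is loaded.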

Now you can use GrobidParser to produce documents:

from langchain_community.document_loaders.parsers import GrobidParser
from langchain_community.document_loaders.generic import GenericLoader

# Produce chunks from article paragraphs
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

# Produce chunks from article sentences
loader = GenericLoader.from_filesystem(
    "/Users/31treehaus/Desktop/Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=True),
)
docs = loader.load()

Chunk metadata includes bounding boxes. Although these are a bit tricky to parse, they are explained at https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/.
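As a rough illustration of what parsing such coordinates can look like, the sketch below splits a coordinate string into one dictionary per box. It assumes the boxes are separated by semicolons and that each box holds five comma-separated numbers. The field order used here (`page, x, y, h, w`) is an assumption, as are the helper name `parse_coords` and the sample values; check the order against the page linked above before relying on it.

```python
from typing import Dict, List, Sequence

# Assumption: each box is "page,x,y,h,w". Verify the order of the last two
# values against the Grobid coordinates documentation.
ASSUMED_FIELDS = ("page", "x", "y", "h", "w")


def parse_coords(coords: str, fields: Sequence[str] = ASSUMED_FIELDS) -> List[Dict[str, float]]:
    """Split a semicolon-separated coords string into one dict per box."""
    boxes = []
    for raw_box in coords.split(";"):
        raw_box = raw_box.strip()
        if not raw_box:
            continue  # tolerate empty segments such as a trailing separator
        values = raw_box.split(",")
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values, got {raw_box!r}")
        box = {name: float(value) for name, value in zip(fields, values)}
        box["page"] = int(box["page"])  # page numbers are whole numbers
        boxes.append(box)
    return boxes


# Made-up sample values, for illustration only.
sample = "1,53.80,194.57,450.27,6.40;1,53.80,206.53,450.27,6.40"
for box in parse_coords(sample):
    print(box)
```

Passing the field names as a parameter means only one line has to change if the actual order differs from the one assumed here.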
