Recursive URL
The RecursiveUrlLoader lets you recursively scrape all child links from a root URL and parse them into Documents.
Overview
Integration details
| Class | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| RecursiveUrlLoader | langchain_community | ✅ | ❌ | ✅ |
Loader features
| Source | Document Lazy Loading | Native Async Support |
|---|---|---|
| RecursiveUrlLoader | ✅ | ❌ |
Setup
Credentials
No credentials are required to use the RecursiveUrlLoader.
Installation
The RecursiveUrlLoader lives in the langchain-community package. No other packages are required, though you'll get richer default Document metadata if you also have beautifulsoup4 installed.
%pip install -qU langchain-community beautifulsoup4 lxml
Instantiation
Now we can instantiate our document loader object and load Documents:
from langchain_community.document_loaders import RecursiveUrlLoader

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # max_depth=2,  # how many levels of child links to follow from the root
    # use_async=False,  # whether to use asynchronous requests under the hood
    # extractor=None,  # callable that turns raw HTML into page content
    # metadata_extractor=None,  # callable that builds Document metadata
    # exclude_dirs=(),  # URL prefixes to skip entirely
    # timeout=10,  # per-request timeout, in seconds
    # check_response_status=True,  # skip URLs that return error status codes
    # continue_on_failure=True,  # keep crawling if an individual request fails
    # prevent_outside=True,  # don't follow links outside the base URL
    # base_url=None,  # base URL that prevent_outside checks against
    # ...
)
Load
Use .load() to synchronously load all Documents into memory, with one Document per visited URL. Starting from the initial URL, we recurse through all linked URLs up to the specified max_depth.
Let's run through a basic example of how to use the RecursiveUrlLoader on the Python 3.9 docs.
docs = loader.load()
docs[0].metadata
{'source': 'https://docs.python.org/3.9/',
'content_type': 'text/html',
'title': '3.9.19 Documentation',
'language': None}
Great! The first document looks like the root page we started from. Let's look at the metadata of the next document.
docs[1].metadata
{'source': 'https://docs.python.org/3.9/using/index.html',
'content_type': 'text/html',
'title': 'Python Setup and Usage — Python 3.9.19 documentation',
'language': None}
That URL looks like a child of our root page, which is great! Next, let's move on from metadata and examine the content of one of our documents.
print(docs[0].page_content[:300])
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="_static/pydoctheme.css" type="text/css" />
<link rel=
That certainly looks like HTML from the URL https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations of the basic example that can be helpful in different situations.
Lazy loading
If we're loading a large number of Documents and our downstream operations can be done over subsets of all loaded Documents, we can lazily load our Documents one at a time to minimize our memory footprint:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(pages)
        pages = []
In this example we never have more than 10 Documents loaded into memory at a time.
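As a concrete illustration of this paged pattern, here's a minimal sketch in which process_batch is a hypothetical stand-in for whatever downstream operation you run on each batch (indexing, embedding, etc.). Note the final flush after the loop, which handles a last, partially filled batch:

def process_batch(batch):
    # Hypothetical downstream operation; here we just report the batch size.
    print(f"processing {len(batch)} documents")

pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        process_batch(pages)
        pages = []
if pages:  # flush the final, partially filled batch
    process_batch(pages)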
Adding an Extractor
By default the loader sets the raw HTML from each link as the Document page content. To parse this HTML into a more human/LLM-friendly format you can pass in a custom extractor method:
import re

from bs4 import BeautifulSoup


def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    # Strip the HTML down to its text and collapse runs of blank lines.
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()


loader = RecursiveUrlLoader("https://docs.python.org/3.9/", extractor=bs4_extractor)
docs = loader.load()
print(docs[0].page_content[:200])
3.9.19 Documentation
Download
Download these documents
Docs by version
Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (security-fixes)
Python 3.10 (security-fixes)
Python 3.9 (securit
That looks much nicer!
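The extractor is just a callable that maps raw HTML to a string, so you can swap in whatever parsing suits you. As one alternative, here's a minimal sketch that emits Markdown instead of plain text; it assumes you've installed the third-party markdownify package, which is not required by the loader itself:

from markdownify import markdownify

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # markdownify converts the raw HTML of each page into Markdown.
    extractor=lambda html: markdownify(html),
)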
You can similarly pass in a metadata_extractor to customize how Document metadata is extracted from the HTTP response. See the API reference for more on this.
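As an illustration, here's a minimal sketch of a custom metadata extractor, assuming the two-argument (raw HTML, URL) form the loader accepts; the fields returned here are a hypothetical choice, keeping only the source URL and page title:

from bs4 import BeautifulSoup

def simple_metadata_extractor(raw_html: str, url: str) -> dict:
    # Hypothetical example: record only the source URL and the <title> text.
    soup = BeautifulSoup(raw_html, "lxml")
    title = soup.title.get_text() if soup.title else ""
    return {"source": url, "title": title}

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    metadata_extractor=simple_metadata_extractor,
)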
API reference
These examples show just a few of the ways in which you can modify the default RecursiveUrlLoader, but there are many more modifications that can be made to best fit your use case. Using the link_regex and exclude_dirs parameters can help you filter out unwanted URLs, aload() and alazy_load() can be used for asynchronous loading, and more.
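As a final sketch, here's exclude_dirs combined with asynchronous loading; the excluded prefix is a hypothetical choice, and await requires an async context such as a Jupyter notebook or an asyncio event loop:

loader = RecursiveUrlLoader(
    "https://docs.python.org/3.9/",
    # Hypothetical example: skip everything under the FAQ section.
    exclude_dirs=("https://docs.python.org/3.9/faq",),
)
docs = await loader.aload()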
For detailed information on configuring and calling the RecursiveUrlLoader, see the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.recursive_url_loader.RecursiveUrlLoader.html.
Related
- Document loader conceptual guide
- Document loader how-to guides