How to split text by tokens
Language models have a token limit, and you should not exceed it. When you split your text into chunks, it is therefore a good idea to count the number of tokens. There are many tokenizers; when you count tokens in your text, use the same tokenizer that the language model uses.
tiktoken
tiktoken is a fast BPE tokenizer created by OpenAI.
We can use tiktoken to estimate the number of tokens used. It will likely be more accurate for OpenAI models.
- How the text is split: by character passed in.
- How the chunk size is measured: by the tiktoken tokenizer.
CharacterTextSplitter, RecursiveCharacterTextSplitter, and TokenTextSplitter can be used with tiktoken directly.
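Before wiring tiktoken into a splitter, it can help to count tokens on their own. A minimal sketch (the sample string is hypothetical):
import tiktoken
# Count tokens for a string with the cl100k_base encoding.
encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode("Hello, how are you?")))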
%pip install --upgrade --quiet langchain-text-splitters tiktoken
from langchain_text_splitters import CharacterTextSplitter
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
To split with a CharacterTextSplitter and then merge chunks with tiktoken, use its .from_tiktoken_encoder() method. Note that splits from this method can be larger than the chunk size measured by the tiktoken tokenizer.
The .from_tiktoken_encoder() method takes either encoding_name as an argument (e.g. cl100k_base) or model_name (e.g. gpt-4). All additional arguments like chunk_size, chunk_overlap, and separators are used to instantiate the CharacterTextSplitter:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
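To see the caveat above in practice, we can re-measure each chunk with the same encoding. A hedged check, reusing the texts list from the previous cell:
import tiktoken
# Some chunks may exceed chunk_size=100, because CharacterTextSplitter
# only splits on its separator.
encoding = tiktoken.get_encoding("cl100k_base")
print(max(len(encoding.encode(chunk)) for chunk in texts))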
To enforce a hard constraint on the chunk size, we can use RecursiveCharacterTextSplitter.from_tiktoken_encoder, where each split will be recursively split again if it is larger than the chunk size:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4",
chunk_size=100,
chunk_overlap=0,
)
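Applying it looks the same as before (a sketch, assuming state_of_the_union is still in scope):
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])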
We can also load a TokenTextSplitter, which works with tiktoken directly and will ensure each split is smaller than the chunk size.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our
Some written languages (e.g. Chinese and Japanese) have characters that encode to two or more tokens. Using the TokenTextSplitter directly can split a character's tokens between two chunks, producing malformed Unicode characters. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder or CharacterTextSplitter.from_tiktoken_encoder to ensure chunks contain valid Unicode strings.
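For example, a hedged sketch of splitting Chinese text safely (the sample string is hypothetical):
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Chunks are merged and split on character boundaries, so a multi-token
# CJK character is never cut in half.
cjk_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=50, chunk_overlap=0
)
chunks = cjk_splitter.split_text("语言模型有 token 限制。" * 20)
print(chunks[0])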
spaCy
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
LangChain implements splitters based on the spaCy tokenizer.
- How the text is split: by the spaCy tokenizer.
- How the chunk size is measured: by number of characters.
%pip install --upgrade --quiet spacy
# The default pipeline also needs to be downloaded: python -m spacy download en_core_web_sm
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import SpacyTextSplitter
text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
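By default, SpacyTextSplitter loads the en_core_web_sm pipeline. It also accepts a pipeline argument; a sketch using spaCy's lightweight rule-based sentencizer, which avoids loading a full model:
from langchain_text_splitters import SpacyTextSplitter
# "sentencizer" does rule-based sentence segmentation only,
# trading accuracy for speed.
text_splitter = SpacyTextSplitter(pipeline="sentencizer", chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)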
SentenceTransformers
The SentenceTransformersTokenTextSplitter is a specialized text splitter for use with sentence-transformer models. Its default behaviour is to split the text into chunks that fit the token window of the sentence-transformer model you would like to use.
To split text and constrain token counts according to the sentence-transformers tokenizer, instantiate a SentenceTransformersTokenTextSplitter. You can optionally specify:
- chunk_overlap: integer count of token overlap;
- model_name: sentence-transformer model name, defaulting to "sentence-transformers/all-mpnet-base-v2";
- tokens_per_chunk: desired token count per chunk.
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
text = "Lorem "
count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)
2
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier
print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")
tokens in text to split: 514
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[1])
lorem
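The parameters listed above can also be set explicitly. A hedged sketch that pins the model and chunk size (the values are illustrative; tokens_per_chunk must fit within the model's token window):
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",  # the default, shown explicitly
    tokens_per_chunk=200,
    chunk_overlap=10,
)
chunks = splitter.split_text(text_to_split)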
NLTK
Rather than just splitting on "\n\n", we can use NLTK to split based on NLTK tokenizers.
- How the text is split: by the NLTK tokenizer.
- How the chunk size is measured: by number of characters.
# pip install nltk
# NLTK's sentence models may also be required, e.g.: python -m nltk.downloader punkt
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import NLTKTextSplitter
text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.
Members of Congress and the Cabinet.
Justices of the Supreme Court.
My fellow Americans.
Last year COVID-19 kept us apart.
This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents.
But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
And with an unwavering resolve that freedom will always triumph over tyranny.
Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.
But he badly miscalculated.
He thought he could roll into Ukraine and the world would roll over.
Instead he met a wall of strength he never imagined.
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies.
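NLTKTextSplitter also accepts a language argument (defaulting to "english"), which selects the sentence model NLTK uses; a sketch for a hypothetical German document:
from langchain_text_splitters import NLTKTextSplitter
# Uses NLTK's German sentence models instead of the English default.
german_splitter = NLTKTextSplitter(chunk_size=1000, language="german")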
KoNLPy
KoNLPy: Korean NLP in Python is a Python package for natural language processing (NLP) of the Korean language.
Token splitting involves segmenting text into smaller, more manageable units called tokens. These tokens are often words, phrases, symbols, or other meaningful elements crucial for further processing and analysis. In languages like English, token splitting typically involves separating words by spaces and punctuation marks. Its effectiveness depends largely on the tokenizer's understanding of the language's structure, which ensures the generation of meaningful tokens. Since tokenizers designed for English cannot understand the unique semantic structures of other languages, such as Korean, they cannot be effectively used for Korean language processing.
Token splitting for Korean with KoNLPy's Kkma analyzer
For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer). Kkma provides detailed morphological analysis of Korean text: it breaks sentences down into words and words into their respective morphemes, identifying the part of speech of each token. It can also segment a block of text into individual sentences, which is particularly useful for processing long texts.
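For illustration, a minimal sketch of calling Kkma directly (assuming konlpy and a Java runtime are installed; the sample sentences are hypothetical):
from konlpy.tag import Kkma
kkma = Kkma()
# Segment a block of text into individual sentences.
print(kkma.sentences("춘향전은 한국의 고전 소설이다. 남원이 배경이다."))
# Morphological analysis: (morpheme, part-of-speech) pairs.
print(kkma.pos("춘향전은 한국의 고전 소설이다."))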
Usage considerations
While Kkma is renowned for its detailed analysis, it is important to note that this precision may impact processing speed. Thus, Kkma is best suited for applications where analytical depth is prioritized over rapid text processing.
# pip install konlpy
# This is a long Korean document that we want to split up into its component sentences.
with open("./your_korean_doc.txt") as f:
korean_document = f.read()
from langchain_text_splitters import KonlpyTextSplitter
text_splitter = KonlpyTextSplitter()
texts = text_splitter.split_text(korean_document)
# The sentences are split with "\n\n" characters.
print(texts[0])
춘향전 옛날에 남원에 이 도령이라는 벼슬아치 아들이 있었다.
그의 외모는 빛나는 달처럼 잘생겼고, 그의 학식과 기예는 남보다 뛰어났다.
한편, 이 마을에는 춘향이라는 절세 가인이 살고 있었다.
춘 향의 아름다움은 꽃과 같아 마을 사람들 로부터 많은 사랑을 받았다.
어느 봄날, 도령은 친구들과 놀러 나갔다가 춘 향을 만 나 첫 눈에 반하고 말았다.
두 사람은 서로 사랑하게 되었고, 이내 비밀스러운 사랑의 맹세를 나누었다.
하지만 좋은 날들은 오래가지 않았다.
도령의 아버지가 다른 곳으로 전근을 가게 되어 도령도 떠나 야만 했다.
이별의 아픔 속에서도, 두 사람은 재회를 기약하며 서로를 믿고 기다리기로 했다.
그러나 새로 부임한 관아의 사또가 춘 향의 아름다움에 욕심을 내 어 그녀에게 강요를 시작했다.
춘 향 은 도령에 대한 자신의 사랑을 지키기 위해, 사또의 요구를 단호히 거절했다.
이에 분노한 사또는 춘 향을 감옥에 가두고 혹독한 형벌을 내렸다.
이야기는 이 도령이 고위 관직에 오른 후, 춘 향을 구해 내는 것으로 끝난다.
두 사람은 오랜 시련 끝에 다시 만나게 되고, 그들의 사랑은 온 세상에 전해 지며 후세에까지 이어진다.
- 춘향전 (The Tale of Chunhyang)
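KonlpyTextSplitter inherits the usual TextSplitter parameters, so chunk size (measured in characters) and overlap can be tuned as with the other splitters; a sketch:
from langchain_text_splitters import KonlpyTextSplitter
# Merge Kkma-segmented sentences into chunks of at most ~500 characters.
text_splitter = KonlpyTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(korean_document)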
Hugging Face tokenizer
Hugging Face has many tokenizers.
We use the Hugging Face tokenizer GPT2TokenizerFast to count the text length in tokens.
- How the text is split: by character passed in.
- How the chunk size is measured: by number of tokens calculated by the Hugging Face tokenizer.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
tokenizer, chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.
Last year COVID-19 kept us apart. This year we are finally together again.
Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.
With a duty to one another to the American people to the Constitution.
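As in the tiktoken section, we can re-measure the chunks with the same tokenizer. A hedged check, reusing tokenizer and texts from above:
# Chunks may slightly exceed chunk_size=100 tokens, since
# CharacterTextSplitter only splits on its separator.
print(max(len(tokenizer.encode(chunk)) for chunk in texts))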