NVIDIA Riva: ASR and TTS
NVIDIA Riva
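NVIDIA Riva is a GPU-accelerated multilingual speech and translation AI software development kit for building fully customizable, real-time conversational AI pipelines. These pipelines include automatic speech recognition (ASR), text-to-speech (TTS), and neural machine translation (NMT) applications, and they can be deployed in clouds, in data centers, at the edge, or on embedded devices.
The Riva Speech API server exposes a simple API for performing speech recognition, speech synthesis, and a variety of natural language processing inferences, and it is integrated into LangChain for ASR and TTS. For instructions on setting up a Riva Speech API server, see the Setup section below.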
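Integrating NVIDIA Riva into LangChain Chains
The RivaASR and RivaTTS utility runnables are LangChain runnables that integrate NVIDIA Riva into LCEL chains for Automatic Speech Recognition (ASR) and Text-To-Speech (TTS).
This example goes over how to use these LangChain runnables to:
- accept streamed audio,
- convert the audio to text,
- send the text to an LLM,
- stream a textual LLM response, and
- convert the response to streamed, natural-sounding audio.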
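The five steps above can be sketched end to end with plain Python generators. This is only an illustration of the data flow: `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins invented here, not Riva or LangChain APIs, and no speech model or LLM is involved.

```python
from typing import Iterable, Iterator


def fake_asr(audio_chunks: Iterable[bytes]) -> str:
    """Stand-in for ASR: consume the whole audio stream and return one string."""
    # Hypothetical: pretend each audio chunk decodes directly to a word.
    return " ".join(chunk.decode("utf-8") for chunk in audio_chunks)


def fake_llm(prompt: str) -> Iterator[str]:
    """Stand-in for an LLM: stream a response piece by piece."""
    for word in prompt.upper().split():
        yield word + " "


def fake_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for TTS: turn each streamed text chunk into bytes."""
    for chunk in text_chunks:
        yield chunk.encode("utf-8")


def run_pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Audio in -> text -> streamed LLM text -> streamed audio out."""
    prompt = fake_asr(audio_chunks)  # accept streamed audio, convert to text
    llm_stream = fake_llm(prompt)    # send text to the LLM, stream the reply
    return fake_tts(llm_stream)      # convert the reply to streamed audio


audio_out = list(run_pipeline([b"hello", b"riva"]))
```

Note the asymmetry in the sketch: the ASR stage returns a single string, while the LLM and TTS stages are generators that pass chunks along as they become available.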
1. NVIDIA Riva Runnables
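There are two Riva Runnables:
a. RivaASR: Converts audio bytes into text for an LLM using NVIDIA Riva.
b. RivaTTS: Converts text into audio bytes using NVIDIA Riva.
a. RivaASR
The RivaASR runnable converts audio bytes into a string for an LLM using NVIDIA Riva.
It is useful for sending an audio stream (a message containing streaming audio) into a chain and preprocessing that audio by converting it to a string to create an LLM prompt.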
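ASRInputType = AudioStream  # AudioStream is a custom type for a message queue containing streaming audio
ASROutputType = str


class RivaASR(
    RivaAuthMixin,
    RivaCommonConfigMixin,
    RunnableSerializable[ASRInputType, ASROutputType],
):
    """A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva."""

    name: str = "nvidia_riva_asr"
    description: str = (
        "A Runnable for converting audio bytes to a string. "
        "This is useful for feeding an audio stream into a chain and "
        "preprocessing that audio to create an LLM prompt."
    )

    # riva options
    audio_channel_count: int = Field(
        1, description="The number of audio channels in the input audio stream."
    )
    profanity_filter: bool = Field(
        True,
        description=(
            "Controls whether or not Riva should attempt to filter "
            "profanity out of the transcribed text."
        ),
    )
    enable_automatic_punctuation: bool = Field(
        True,
        description=(
            "Controls whether Riva should attempt to correct "
            "sentence punctuation in the transcribed text."
        ),
    )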
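When this runnable is called on an input, it takes an input audio stream that acts as a queue and concatenates the transcription as chunks are returned. After the response is fully generated, a single string is returned.
- Note that since the LLM requires a full query, the ASR output is concatenated rather than streamed token by token.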
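The queue-and-concatenate behavior can be illustrated with the standard library alone. In this sketch a plain `queue.Queue` with a `None` end-of-stream marker stands in for the AudioStream type, and `transcribe_chunk` is a hypothetical placeholder for the recognizer. None of this is the actual RivaASR implementation.

```python
import queue
from typing import Optional


def transcribe_chunk(chunk: bytes) -> str:
    """Hypothetical recognizer: pretend the bytes are already UTF-8 text."""
    return chunk.decode("utf-8")


def transcribe_stream(audio: "queue.Queue[Optional[bytes]]") -> str:
    """Drain the queue until the end-of-stream marker, then return one string."""
    pieces = []
    while True:
        chunk = audio.get()
        if chunk is None:  # assumed end-of-stream marker
            break
        pieces.append(transcribe_chunk(chunk))
    # The LLM needs the complete query, so join instead of yielding pieces.
    return "".join(pieces)


audio_queue: "queue.Queue[Optional[bytes]]" = queue.Queue()
for part in (b"What is ", b"NVIDIA ", b"Riva?"):
    audio_queue.put(part)
audio_queue.put(None)

transcript = transcribe_stream(audio_queue)
```

Because the function only returns after the marker arrives, a producer thread can keep adding audio to the queue while the consumer waits on `get()`.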
b. RivaTTS
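The RivaTTS runnable converts text output to audio bytes.
It is useful for processing the streamed textual response from an LLM by converting the text to audio bytes. These audio bytes sound like a natural human voice and can be played back to the user.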
TTSInputType = Union[str, AnyMessage, PromptValue]
TTSOutputType = bytes


class RivaTTS(
    RivaAuthMixin,
    RivaCommonConfigMixin,
    RunnableSerializable[TTSInputType, TTSOutputType],
):
    """A runnable that performs Text-to-Speech (TTS) with NVIDIA Riva."""

    name: str = "nvidia_riva_tts"
    description: str = (
        "A tool for converting text to speech. "
        "This is useful for converting LLM output into audio bytes."
    )

    # riva options
    voice_name: str = Field(
        "English-US.Female-1",
        description=(
            "The voice model in Riva to use for speech. "
            "Pre-trained models are documented in "
            "[the Riva documentation]"
            "(https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html)."
        ),
    )
    output_directory: Optional[str] = Field(
        None,
        description=(
            "The directory where all audio files should be saved. "
            "A null value indicates that wave files should not be saved. "
            "This is useful for debugging purposes."
        ),
    )
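When this runnable is called on an input, it takes iterable text chunks and streams them into output audio bytes that are either written to a .wav file or played out loud.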
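As an illustration of the .wav option, streamed audio bytes can be collected into a WAV container with Python's standard `wave` module. The 8000 Hz rate and LINEAR_PCM encoding match the defaults listed in the schema later on this page; mono audio and a 16-bit sample width are assumptions, and `write_wav` is a helper name made up here. The chunks are synthetic silence, not real RivaTTS output.

```python
import io
import wave
from typing import Iterable


def write_wav(chunks: Iterable[bytes], target, sample_rate: int = 8000) -> int:
    """Write streamed PCM chunks to a WAV file path or file-like object.

    Returns the number of audio frames written.
    """
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)  # assumption: mono
        wav_file.setsampwidth(2)  # assumption: 16-bit samples
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
        return wav_file.getnframes()


# Synthetic stand-in for TTS output: two chunks of 0.5 s of silence each.
silence = b"\x00\x00" * 4000
buffer = io.BytesIO()
frames_written = write_wav([silence, silence], buffer)
```

Passing a file path instead of the `io.BytesIO` buffer writes the same data to disk.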
2. Installation
The NVIDIA Riva client library must be installed.
%pip install --upgrade --quiet nvidia-riva-client
Note: you may need to restart the kernel to use updated packages.
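After installing, it can be handy to confirm that the client package is importable before building a chain. The schema output later on this page shows the package being imported as `riva.client`; the sketch below checks for the top-level `riva` package using only the standard library, and `riva_client_installed` is an illustrative helper name.

```python
import importlib.util


def riva_client_installed() -> bool:
    """Return True if a top-level `riva` package can be found."""
    return importlib.util.find_spec("riva") is not None


if riva_client_installed():
    print("nvidia-riva-client appears to be available")
else:
    print("riva package not found; run the pip command above")
```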
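3. Setup
To get started with NVIDIA Riva:
- Follow the Riva Quick Start setup instructions for Local Deployment Using Quick Start Scripts.
4. Import and Inspect Runnables
Import the RivaASR and RivaTTS runnables and inspect their schemas to understand their fields.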
import json

from langchain_community.utilities.nvidia_riva import (
    RivaASR,
    RivaTTS,
)
Let's view the schemas.
print(json.dumps(RivaASR.schema(), indent=2))
print(json.dumps(RivaTTS.schema(), indent=2))
{
"title": "RivaASR",
"description": "A runnable that performs Automatic Speech Recognition (ASR) using NVIDIA Riva.",
"type": "object",
"properties": {
"name": {
"title": "Name",
"default": "nvidia_riva_asr",
"type": "string"
},
"encoding": {
"description": "The encoding on the audio stream.",
"default": "LINEAR_PCM",
"allOf": [
{
"$ref": "#/definitions/RivaAudioEncoding"
}
]
},
"sample_rate_hertz": {
"title": "Sample Rate Hertz",
"description": "The sample rate frequency of audio stream.",
"default": 8000,
"type": "integer"
},
"language_code": {
"title": "Language Code",
"description": "The [BCP-47 language code](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) for the target language.",
"default": "en-US",
"type": "string"
},
"url": {
"title": "Url",
"description": "The full URL where the Riva service can be found.",
"default": "http://localhost:50051",
"examples": [
"http://localhost:50051",
"https://user@pass:riva.example.com"
],
"anyOf": [
{
"type": "string",
"minLength": 1,
"maxLength": 65536,
"format": "uri"
},
{
"type": "string"
}
]
},
"ssl_cert": {
"title": "Ssl Cert",
"description": "A full path to the file where Riva's public ssl key can be read.",
"type": "string"
},
"description": {
"title": "Description",
"default": "A Runnable for converting audio bytes to a string.This is useful for feeding an audio stream into a chain andpreprocessing that audio to create an LLM prompt.",
"type": "string"
},
"audio_channel_count": {
"title": "Audio Channel Count",
"description": "The number of audio channels in the input audio stream.",
"default": 1,
"type": "integer"
},
"profanity_filter": {
"title": "Profanity Filter",
"description": "Controls whether or not Riva should attempt to filter profanity out of the transcribed text.",
"default": true,
"type": "boolean"
},
"enable_automatic_punctuation": {
"title": "Enable Automatic Punctuation",
"description": "Controls whether Riva should attempt to correct senetence puncuation in the transcribed text.",
"default": true,
"type": "boolean"
}
},
"definitions": {
"RivaAudioEncoding": {
"title": "RivaAudioEncoding",
"description": "An enum of the possible choices for Riva audio encoding.\n\nThe list of types exposed by the Riva GRPC Protobuf files can be found\nwith the following commands:\n\`\`\`python\nimport riva.client\nprint(riva.client.AudioEncoding.keys()) # noqa: T201\n\`\`\`",
"enum": [
"ALAW",
"ENCODING_UNSPECIFIED",
"FLAC",
"LINEAR_PCM",
"MULAW",
"OGGOPUS"
],
"type": "string"
}
}
}
{
"title": "RivaTTS",
"description": "A runnable that performs Text-to-Speech (TTS) with NVIDIA Riva.",
"type": "object",
"properties": {
"name": {
"title": "Name",
"default": "nvidia_riva_tts",
"type": "string"
},
"encoding": {
"description": "The encoding on the audio stream.",
"default": "LINEAR_PCM",
"allOf": [
{
"$ref": "#/definitions/RivaAudioEncoding"
}
]
},
"sample_rate_hertz": {
"title": "Sample Rate Hertz",
"description": "The sample rate frequency of audio stream.",
"default": 8000,
"type": "integer"
},
"language_code": {
"title": "Language Code",
"description": "The [BCP-47 language code](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) for the target language.",
"default": "en-US",
"type": "string"
},
"url": {
"title": "Url",
"description": "The full URL where the Riva service can be found.",
"default": "http://localhost:50051",
"examples": [
"http://localhost:50051",
"https://user@pass:riva.example.com"
],
"anyOf": [
{
"type": "string",
"minLength": 1,
"maxLength": 65536,
"format": "uri"
},
{
"type": "string"
}
]
},
"ssl_cert": {
"title": "Ssl Cert",
"description": "A full path to the file where Riva's public ssl key can be read.",
"type": "string"
},
"description": {
"title": "Description",
"default": "A tool for converting text to speech.This is useful for converting LLM output into audio bytes.",
"type": "string"
},
"voice_name": {
"title": "Voice Name",
"description": "The voice model in Riva to use for speech. Pre-trained models are documented in [the Riva documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html).",
"default": "English-US.Female-1",
"type": "string"
},
"output_directory": {
"title": "Output Directory",
"description": "The directory where all audio files should be saved. A null value indicates that wave files should not be saved. This is useful for debugging purposes.",
"type": "string"
}
},
"definitions": {
"RivaAudioEncoding": {
"title": "RivaAudioEncoding",
"description": "An enum of the possible choices for Riva audio encoding.\n\nThe list of types exposed by the Riva GRPC Protobuf files can be found\nwith the following commands:\n\`\`\`python\nimport riva.client\nprint(riva.client.AudioEncoding.keys()) # noqa: T201\n\`\`\`",
"enum": [
"ALAW",
"ENCODING_UNSPECIFIED",
"FLAC",
"LINEAR_PCM",
"MULAW",
"OGGOPUS"
],
"type": "string"
}
}
}
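The schema is ordinary JSON, so the configurable fields and their defaults can be pulled out programmatically. The sketch below works on a small excerpt copied from the RivaTTS output above rather than calling the library, and `field_defaults` is a helper name introduced for illustration.

```python
import json

# Excerpt of the RivaTTS schema printed above (trimmed to three fields).
schema_excerpt = json.loads("""
{
  "title": "RivaTTS",
  "properties": {
    "sample_rate_hertz": {"default": 8000, "type": "integer"},
    "voice_name": {"default": "English-US.Female-1", "type": "string"},
    "output_directory": {"type": "string"}
  }
}
""")


def field_defaults(schema: dict) -> dict:
    """Map each property name to its default, or None when none is declared."""
    return {
        name: spec.get("default")
        for name, spec in schema.get("properties", {}).items()
    }


defaults = field_defaults(schema_excerpt)
for name, value in defaults.items():
    print(f"{name}: {value!r}")
```

Fields that come back as `None`, such as `output_directory` here, are the ones with no declared default in the schema.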
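5. Declare Riva ASR and Riva TTS Runnables
For this example, a single-channel audio file (mulaw format, so .wav) is used.
You will need a Riva speech server set up, so if you don't have a Riva speech server, go to Setup.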
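The schema's encoding, sample rate, and channel count fields describe the audio stream, so it can help to read those values from the WAV header before declaring the runnables. Python's `wave` module may reject non-PCM files such as mu-law, so this sketch parses the `fmt ` chunk by hand with `struct`. It assumes a canonical layout in which `fmt ` is the first chunk after the RIFF header. `read_wav_format`, the name mapping, and the in-memory sample header are all made up for the illustration.

```python
import struct

# WAV format tags mapped to the encoding names listed in the schema above.
WAV_FORMAT_NAMES = {1: "LINEAR_PCM", 6: "ALAW", 7: "MULAW"}


def read_wav_format(header: bytes) -> dict:
    """Parse format tag, channel count and sample rate from a WAV header."""
    riff, _size, wave_id = struct.unpack_from("<4sI4s", header, 0)
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    chunk_id, _chunk_size = struct.unpack_from("<4sI", header, 12)
    if chunk_id != b"fmt ":
        raise ValueError("expected 'fmt ' as the first chunk")
    format_tag, channels, sample_rate = struct.unpack_from("<HHI", header, 20)
    return {
        "encoding": WAV_FORMAT_NAMES.get(format_tag, "ENCODING_UNSPECIFIED"),
        "audio_channel_count": channels,
        "sample_rate_hertz": sample_rate,
    }


# Hand-built header for a mono, 8000 Hz, 8-bit mu-law file with no samples.
fmt_body = struct.pack("<HHIIHH", 7, 1, 8000, 8000, 1, 8)
sample_header = (
    b"RIFF" + struct.pack("<I", 4 + 8 + len(fmt_body)) + b"WAVE"
    + b"fmt " + struct.pack("<I", len(fmt_body)) + fmt_body
)
wav_info = read_wav_format(sample_header)
```

For a real file, the first 36 bytes read in binary mode are enough for this parser, provided the file follows the assumed layout.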