Speech Processing with Machine Learning

2024-11-08

Preface

Which machine learning libraries are commonly used for speech processing?

Common Libraries

PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.

SpeechBrain

SpeechBrain is an open-source PyTorch toolkit that accelerates Conversational AI development, i.e., the technology behind speech assistants, chatbots, and large language models.

It is crafted for fast and easy creation of advanced technologies for Speech and Text Processing.

Kaldi

Kaldi is a speech recognition toolkit written in C++. The kaldi-asr/kaldi repository on GitHub is the official location of the Kaldi project.

ESPnet

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Speech Recognition (ASR)

Installing the Pretrained Model

pip install espnet espnet-model-zoo

Error

error: Microsoft Visual C++ 14.0 or greater is required

Solution

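Download the Microsoft C++ Build Tools installer.
Use it to install Visual Studio Build Tools 2022 (the "Desktop development with C++" workload provides the required compiler).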

Error

ImportError: cannot import name 'tarfile' from 'backports' (C:\Users\lenovo\anaconda3\lib\site-packages\backports\__init__.py)

Solution

pip install setuptools==69.0.0
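
Once both fixes are in place, it can be worth confirming that the packages are importable before running anything heavier. The following is a minimal sketch, not an official ESPnet check: the helper name `missing_modules` is hypothetical, and it assumes that `espnet2` and `espnet_model_zoo` are the top-level modules needed, which is what the script later in this post imports.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found.

    find_spec() only locates a module; it does not import it, so this is
    cheap even for large packages. Pass top-level names only.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the two top-level modules the ASR script imports.
    missing = missing_modules(["espnet2", "espnet_model_zoo"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

An empty result only means the modules were located; an import can still fail later if a compiled dependency is broken.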

Downloading the Examples

Download ESPNet_asr_egs.
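
The script in the next section reads `ESPNet_asr_egs/egs.csv` and uses its `path`, `lang`, `sr`, and `text` columns. A minimal sketch for checking the download before loading any model is shown below. The helper name `load_examples` is hypothetical, and the sketch assumes the CSV is UTF-8 encoded with a header row.

```python
import csv
import os

REQUIRED_COLUMNS = {"path", "lang", "sr", "text"}  # columns used by the ASR script


def load_examples(csv_path, lang):
    """Return the rows of the examples CSV whose `lang` column equals `lang`."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = set(reader.fieldnames or [])
        missing = REQUIRED_COLUMNS - columns
        if missing:
            raise ValueError(f"{csv_path} is missing columns: {sorted(missing)}")
        return [row for row in reader if row["lang"] == lang]


if __name__ == "__main__":
    csv_path = os.path.join("ESPNet_asr_egs", "egs.csv")
    if os.path.exists(csv_path):
        for row in load_examples(csv_path, "zh"):
            print(row["path"], row["sr"], row["text"])
    else:
        print(f"{csv_path} not found; download ESPNet_asr_egs first.")
```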

Running

import warnings
warnings.filterwarnings("ignore")  # silence library warnings during import and decoding
import os
# Run from the script's own directory so the relative paths below resolve
os.chdir(os.path.split(os.path.realpath(__file__))[0])

lang = 'zh'  # Chinese
fs = 16000  # sampling rate in Hz
tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave'

import string
import torch
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack(tag),
    # The original used device="cuda"; fall back to the CPU when no GPU is available
    device="cuda" if torch.cuda.is_available() else "cpu",
    minlenratio=0.0, maxlenratio=0.0,  # output-length ratio bounds for decoding
    ctc_weight=0.3,  # weight of the CTC score during decoding
    beam_size=10,  # beam search width
    batch_size=0,  # batch size
    nbest=1,  # number of best hypotheses to return
)


# Text normalization: upper-case the text and remove the ASCII
# punctuation characters listed in string.punctuation
def text_normalizer(text):
    text = text.upper()
    return text.translate(str.maketrans('', '', string.punctuation))


import pandas as pd
import soundfile
import librosa.display
from IPython.display import display, Audio
import matplotlib.pyplot as plt

egs = pd.read_csv("ESPNet_asr_egs/egs.csv")
for index, row in egs.iterrows():
    if row["lang"] == lang or lang == "multilingual":
        speech, rate = soundfile.read("ESPNet_asr_egs/" + row["path"])
        assert fs == int(row["sr"])  # the model expects 16 kHz input
        nbests = speech2text(speech)
        text, *_ = nbests[0]  # the first element of the best hypothesis is its text
        print(f"Input Speech: ESPNet_asr_egs/{row['path']}")
        display(Audio(speech, rate=rate))
        # Changed from the official example: waveplot -> waveshow
        librosa.display.waveshow(speech, sr=rate)
        plt.show()
        print(f"Reference text: {text_normalizer(row['text'])}")
        print(f"ASR hypothesis: {text_normalizer(text)}")
        print("*" * 50)
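
Note that `text_normalizer` removes only the ASCII characters in `string.punctuation`, so full-width Chinese punctuation such as `，` or `！` would pass through unchanged. A minimal sketch of a Unicode-aware variant follows. The name `unicode_text_normalizer` is hypothetical and not part of ESPnet, and treating every character in the Unicode punctuation (P*) and symbol (S*) categories as removable is an assumption about what should count as punctuation.

```python
import unicodedata


def unicode_text_normalizer(text):
    """Upper-case `text` and drop Unicode punctuation and symbol characters."""
    text = text.upper()
    return "".join(
        ch for ch in text
        # General categories starting with "P" are punctuation, "S" are symbols
        if unicodedata.category(ch)[0] not in ("P", "S")
    )


if __name__ == "__main__":
    print(unicode_text_normalizer("你好，世界！"))
    print(unicode_text_normalizer("Hello, world!"))
```

Symbols are included because `string.punctuation` also contains characters such as `+` and `$`, which Unicode classifies as symbols rather than punctuation.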

Results

Input Speech: ESPNet_asr_egs/zh/12.wav
<IPython.lib.display.Audio object>
Reference text: 星光在微白的天空闪烁
ASR hypothesis: 星光在微博的天空手术
**************************************************
Input Speech: ESPNet_asr_egs/zh/13.wav
<IPython.lib.display.Audio object>
Reference text: 欢迎有兴趣的青年朋友
ASR hypothesis: 昆凌有兴趣的青年朋友
**************************************************
Input Speech: ESPNet_asr_egs/zh/14.wav
<IPython.lib.display.Audio object>
Reference text: 整个下午依然哭个不停
ASR hypothesis: 整个项目依然库存
**************************************************
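
A common way to quantify the gap between a reference and a hypothesis is the character error rate (CER): the Levenshtein edit distance between the two strings divided by the length of the reference. The sketch below computes it in pure Python for the three pairs printed above. The function names `edit_distance` and `cer` are illustrative and not part of ESPnet.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Reference / hypothesis pairs copied from the output above
pairs = [
    ("星光在微白的天空闪烁", "星光在微博的天空手术"),
    ("欢迎有兴趣的青年朋友", "昆凌有兴趣的青年朋友"),
    ("整个下午依然哭个不停", "整个项目依然库存"),
]

if __name__ == "__main__":
    for ref, hyp in pairs:
        print(f"{ref} | {hyp} | CER = {cer(ref, hyp):.2f}")
```

For the first pair, three of the ten reference characters are substituted, so the CER is 0.30.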
