Machine Learning for Speech Processing

misaraty updated | 2024-11-15
Preface
Which machine learning libraries are commonly used for speech processing?

Common libraries

./2024-11-15_093128.jpg
GitHub star trends

PaddleSpeech

Easy-to-use Speech Toolkit including Self-Supervised Learning model, SOTA/Streaming ASR with punctuation, Streaming TTS with text frontend, Speaker Verification System, End-to-End Speech Translation and Keyword Spotting. Won NAACL2022 Best Demo Award.

SpeechBrain

SpeechBrain is an open-source PyTorch toolkit that accelerates Conversational AI development, i.e., the technology behind speech assistants, chatbots, and large language models.

It is crafted for fast and easy creation of advanced technologies for Speech and Text Processing.

Kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.

ESPnet

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses pytorch as a deep learning engine and also follows Kaldi style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Automatic speech recognition (ASR)

Installing the pretrained models

pip install espnet espnet-model-zoo
Error

error: Microsoft Visual C++ 14.0 or greater is required

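Fix

Install Microsoft Visual C++ 14.0 or greater, as the message requests (for example through the Microsoft C++ Build Tools installer).

Error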

ImportError: cannot import name 'tarfile' from 'backports' (C:\Users\lenovo\anaconda3\lib\site-packages\backports\__init__.py)

Fix
pip install setuptools==69.0.0
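
After pinning setuptools, it can help to confirm which versions are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package names are the ones from the pip commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Package names taken from the pip commands in this post
for name in ("setuptools", "espnet", "espnet-model-zoo"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If setuptools does not report 69.0.0, the pin was probably applied to a different environment than the one running the script.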

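Download the examples

The script below reads the example audio files and egs.csv from an ESPNet_asr_egs directory located next to the script.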

Run

import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the demo output

import os
import string

import librosa.display
import matplotlib.pyplot as plt
import pandas as pd
import soundfile
import torch
from IPython.display import display, Audio
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Work relative to this script so that the example paths resolve
os.chdir(os.path.split(os.path.realpath(__file__))[0])

lang = 'zh'  # Chinese
fs = 16000  # sampling rate (Hz)
tag = 'Emiru Tsunoo/aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave'

# Download (on first use) and load the pretrained model
d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack(tag),
    device="cuda" if torch.cuda.is_available() else "cpu",  # use the GPU when one is available
    minlenratio=0.0, maxlenratio=0.0,  # decoding length-ratio parameters
    ctc_weight=0.3,  # weight of the CTC score during decoding
    beam_size=10,  # beam search width
    batch_size=0,  # batch size
    nbest=1  # number of best hypotheses to return
)


# Text normalization: upper-case the text and remove all ASCII punctuation
def text_normalizer(text):
    text = text.upper()
    return text.translate(str.maketrans('', '', string.punctuation))


egs = pd.read_csv("ESPNet_asr_egs/egs.csv")
for index, row in egs.iterrows():
    if row["lang"] == lang or lang == "multilingual":
        speech, rate = soundfile.read("ESPNet_asr_egs/" + row["path"])
        assert fs == int(row["sr"])
        nbests = speech2text(speech)
        text, *_ = nbests[0]
        print(f"Input Speech: ESPNet_asr_egs/{row['path']}")
        display(Audio(speech, rate=rate))
        librosa.display.waveshow(speech, sr=rate)  # changed from the official example: waveplot -> waveshow
        plt.show()
        print(f"Reference text: {text_normalizer(row['text'])}")
        print(f"ASR hypothesis: {text_normalizer(text)}")
        print("*" * 50)
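
The normalizer in the script relies on `string.punctuation`, which only contains ASCII punctuation, so full-width Chinese marks such as "," or "。" would pass through unchanged. As a hedged alternative (not part of the official example), the sketch below drops every character whose Unicode category is punctuation. The function name `normalize_text` is hypothetical. Note that, unlike `string.punctuation`, this keeps symbols such as `$` or `+`, which Unicode classifies as symbols rather than punctuation.

```python
import unicodedata


def normalize_text(text):
    """Upper-case `text` and drop characters in the Unicode punctuation categories (P*)."""
    return "".join(
        ch for ch in text.upper()
        if not unicodedata.category(ch).startswith("P")
    )


print(normalize_text("Hello, world!"))
print(normalize_text("星光,在微白的天空闪烁。"))
```

Applying the same normalizer to both the reference and the hypothesis keeps the comparison consistent.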

Results

Input Speech: ESPNet_asr_egs/zh/12.wav
<IPython.lib.display.Audio object>
Reference text: 星光在微白的天空闪烁
ASR hypothesis: 星光在微博的天空手术
**************************************************
Input Speech: ESPNet_asr_egs/zh/13.wav
<IPython.lib.display.Audio object>
Reference text: 欢迎有兴趣的青年朋友
ASR hypothesis: 昆凌有兴趣的青年朋友
**************************************************
Input Speech: ESPNet_asr_egs/zh/14.wav
<IPython.lib.display.Audio object>
Reference text: 整个下午依然哭个不停
ASR hypothesis: 整个项目依然库存
**************************************************
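
The hypotheses above clearly differ from the references. One common way to quantify this for Chinese ASR is the character error rate (CER): the character-level edit distance divided by the reference length. The following is a minimal sketch in plain Python, not taken from the ESPnet example; the helper names `edit_distance` and `cer` are hypothetical, and the sentence pairs are copied from the output above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# (reference, hypothesis) pairs copied from the results above
pairs = [
    ("星光在微白的天空闪烁", "星光在微博的天空手术"),
    ("欢迎有兴趣的青年朋友", "昆凌有兴趣的青年朋友"),
    ("整个下午依然哭个不停", "整个项目依然库存"),
]

for ref, hyp in pairs:
    print(f"{ref} | {hyp} | CER = {cer(ref, hyp):.2f}")

total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
total_chars = sum(len(ref) for ref, _ in pairs)
print(f"Overall CER = {total_errors / total_chars:.3f}")
```

On these three utterances the sketch counts 3, 2 and 6 character errors against 10 reference characters each, for an overall CER of 11/30.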