
In this tutorial, we walk through an advanced yet practical workflow using SpeechBrain. We generate our own clean speech samples with gTTS, deliberately add noise to simulate real-world conditions, and then apply SpeechBrain's MetricGAN+ model to enhance the audio. Once the audio is denoised, we run automatic speech recognition with a language-model-rescored CRDNN system and compare word error rates before and after enhancement. This step-by-step approach shows firsthand how SpeechBrain lets us build a complete speech enhancement and recognition pipeline in just a few lines of code.
!pip -q install -U speechbrain gTTS jiwer pydub librosa soundfile torchaudio
!apt -qq install -y ffmpeg >/dev/null
import os, time, math, random, warnings, shutil, glob
warnings.filterwarnings("ignore")
import torch, torchaudio, numpy as np, librosa, soundfile as sf
from gtts import gTTS
from pydub import AudioSegment
from jiwer import wer
from pathlib import Path
from dataclasses import dataclass
from typing import List, Tuple
from IPython.display import Audio, display
from speechbrain.pretrained import EncoderDecoderASR, SpectralMaskEnhancement
root = Path("sb_demo"); root.mkdir(exist_ok=True)
sr = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"
We begin by setting up our Colab environment with all the required libraries and tools. We install SpeechBrain along with audio processing packages, define basic paths and parameters, and prepare the device so we are ready to build our speech pipeline.
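Before synthesizing anything, it can also help to pin the random seeds so that the Gaussian noise we inject later is reproducible across reruns. The snippet below is an optional addition to the original listing, and the seed value is an arbitrary illustrative choice.
# Optional: fix seeds so the injected noise (and hence the WER numbers) are reproducible.
seed = 1234  # arbitrary illustrative value
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)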
def tts_to_wav(text: str, out_wav: str, lang="en"):
    # Synthesize speech with gTTS, then convert the MP3 to 16 kHz mono WAV.
    mp3 = out_wav.replace(".wav", ".mp3")
    gTTS(text=text, lang=lang).save(mp3)
    a = AudioSegment.from_file(mp3, format="mp3").set_channels(1).set_frame_rate(sr)
    a.export(out_wav, format="wav")
    os.remove(mp3)

def add_noise(in_wav: str, snr_db: float, out_wav: str):
    # Mix in white Gaussian noise scaled to hit the requested SNR (in dB).
    y, _ = librosa.load(in_wav, sr=sr, mono=True)
    rms = np.sqrt(np.mean(y**2) + 1e-12)
    n = np.random.normal(0, 1, len(y))
    n = n / np.sqrt(np.mean(n**2) + 1e-12)
    target_n_rms = rms / (10**(snr_db/20))
    y_noisy = np.clip(y + n * target_n_rms, -1.0, 1.0)
    sf.write(out_wav, y_noisy, sr)

def play(title, path):
    # Render an inline audio player in the notebook.
    print(f"▶ {title}: {path}")
    display(Audio(path, rate=sr))

def clean_txt(s: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace for fair WER scoring.
    return " ".join("".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in s).split())

@dataclass
class Sample:
    text: str
    clean_wav: str
    noisy_wav: str
    enhanced_wav: str
We define small utilities that power our pipeline from end to end. We synthesize speech with gTTS and convert it to WAV, inject controlled Gaussian noise at a target SNR, and add helpers to preview audio and normalize text. We also create a Sample dataclass so that we can neatly track each utterance's clean, noisy, and enhanced paths.
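As a quick sanity check on the noise injection, a small helper like the sketch below (our addition, not part of the original listing; the name measure_snr is ours) estimates the SNR actually achieved by add_noise by treating the difference between the noisy and clean signals as the injected noise. The file paths in the usage comment refer to the files created in the next step.
def measure_snr(clean_path: str, noisy_path: str) -> float:
    # Estimate the achieved SNR by treating (noisy - clean) as the injected noise.
    c, _ = librosa.load(clean_path, sr=sr, mono=True)
    n, _ = librosa.load(noisy_path, sr=sr, mono=True)
    m = min(len(c), len(n))
    noise = n[:m] - c[:m]
    p_sig = np.mean(c[:m] ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    return 10 * np.log10(p_sig / p_noise)

# After the synthesis step below, measure_snr("sb_demo/clean_1.wav", "sb_demo/noisy_1.wav")
# should land close to the requested 3 dB (clipping can shift it slightly).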
sentences = [
    "Artificial intelligence is transforming everyday life.",
    "Open source tools enable rapid research and innovation.",
    "SpeechBrain brings flexible speech pipelines to Python.",
]
samples: List[Sample] = []
print("🗣️ Synthesizing short utterances with gTTS...")
for i, s in enumerate(sentences, 1):
    cw = str(root/f"clean_{i}.wav")
    nw = str(root/f"noisy_{i}.wav")
    ew = str(root/f"enhanced_{i}.wav")
    tts_to_wav(s, cw)
    add_noise(cw, snr_db=3.0 if i % 2 else 0.0, out_wav=nw)  # alternate between 3 dB and 0 dB SNR
    samples.append(Sample(text=s, clean_wav=cw, noisy_wav=nw, enhanced_wav=ew))
play("Clean #1", samples[0].clean_wav)
play("Noisy #1", samples[0].noisy_wav)
print("⬇️ Loading pretrained models (this downloads once) ...")
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    run_opts={"device": device},
    savedir=str(root/"pretrained_asr"),
)
enhancer = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    run_opts={"device": device},
    savedir=str(root/"pretrained_enh"),
)
In this step, we generate three spoken sentences with gTTS, save both clean and noisy versions, and organize them into our Sample objects. We then load SpeechBrain's pretrained ASR and MetricGAN+ enhancement models, giving us everything we need to enhance noisy audio and transcribe the result.
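As an optional sanity check (our addition, not part of the original walkthrough), transcribing one of the clean gTTS files should yield a near-perfect hypothesis, which confirms that the checkpoint downloaded correctly and runs on the chosen device.
# Optional sanity check: decode a clean file before touching the noisy ones.
print("Clean-file transcript:", asr.transcribe_file(samples[0].clean_wav))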
def enhance_file(in_wav: str, out_wav: str):
    # Run MetricGAN+ on a single file and save the enhanced waveform.
    sig = enhancer.enhance_file(in_wav)
    if sig.dim() == 1:
        sig = sig.unsqueeze(0)  # torchaudio.save expects a (channels, time) tensor
    torchaudio.save(out_wav, sig.cpu(), sr)

def transcribe(path: str) -> str:
    hyp = asr.transcribe_file(path)
    return clean_txt(hyp)

def eval_pair(ref_text: str, wav_path: str) -> Tuple[str, float]:
    # Return the hypothesis and its WER against the normalized reference.
    hyp = transcribe(wav_path)
    return hyp, wer(clean_txt(ref_text), hyp)
print("\n🔬 Transcribing noisy vs enhanced (MetricGAN+)...")
rows = []
t0 = time.time()
for smp in samples:
    enhance_file(smp.noisy_wav, smp.enhanced_wav)
    hyp_noisy, wer_noisy = eval_pair(smp.text, smp.noisy_wav)
    hyp_enh, wer_enh = eval_pair(smp.text, smp.enhanced_wav)
    rows.append((smp.text, hyp_noisy, wer_noisy, hyp_enh, wer_enh))
t1 = time.time()
We create helper functions to enhance noisy audio, transcribe speech, and evaluate WER against the reference text. We then run these steps across all our samples, comparing noisy and enhanced versions, and record both transcriptions and error rates along with the processing time.
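For intuition about the metric itself, jiwer scores a hypothesis by aligning it word by word against the reference. The toy example below (our addition) has one substitution among four reference words, so the WER is 0.25.
# Toy illustration of the WER metric: 1 substitution / 4 reference words = 0.25.
print(wer("the cat sat down", "the cat sit down"))  # -> 0.25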
def fmt(x): return f"{x:.3f}" if isinstance(x, float) else x
print(f"\n⏱️ Inference time: {t1 - t0:.2f}s on {device.upper()}")
print("\n# ---- Results (Noisy → Enhanced) ----")
for i, (ref, hN, wN, hE, wE) in enumerate(rows, 1):
    print(f"\nUtterance {i}")
    print("Ref: ", ref)
    print("Noisy ASR:", hN)
    print("WER noisy:", fmt(wN))
    print("Enh ASR: ", hE)
    print("WER enh: ", fmt(wE))
print("\n🧵 Batch decoding (looping API):")
batch_files = [s.clean_wav for s in samples] + [s.noisy_wav for s in samples]
bt0 = time.time()
batch_hyps = [transcribe(p) for p in batch_files]
bt1 = time.time()
for p, h in zip(batch_files, batch_hyps):
    print(os.path.basename(p), "->", h[:80] + ("..." if len(h) > 80 else ""))
print(f"⏱️ Batch elapsed: {bt1 - bt0:.2f}s")
play("Enhanced #1 (MetricGAN+)", samples[0].enhanced_wav)
avg_wn = sum(wN for _,_,wN,_,_ in rows) / len(rows)
avg_we = sum(wE for _,_,_,_,wE in rows) / len(rows)
print("\n📈 Summary:")
print(f"Avg WER (Noisy): {avg_wn:.3f}")
print(f"Avg WER (Enhanced): {avg_we:.3f}")
print("Tip: Try different SNRs or longer texts, and switch device to GPU if available.")
We summarize our experiment by timing inference, printing per-utterance transcriptions, and contrasting WER before and after enhancement. We also batch-decode multiple files, listen to an enhanced sample, and report average WERs so we clearly see the gains from MetricGAN+ in our pipeline.
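The batch section above still decodes files one at a time; SpeechBrain's EncoderDecoderASR also exposes a transcribe_batch method for padded batches. The sketch below is a minimal, hedged example of that route, assuming all utterances fit in memory; the helper name transcribe_files_batched and the padding logic are our own, while load_audio resamples each file to the model's expected rate.
# Minimal sketch of true batched decoding with transcribe_batch (assumes short files).
def transcribe_files_batched(paths):
    sigs = [asr.load_audio(p) for p in paths]                # 1-D tensors at the model rate
    lens = torch.tensor([s.shape[0] for s in sigs], dtype=torch.float)
    wavs = torch.nn.utils.rnn.pad_sequence(sigs, batch_first=True)
    hyps, _ = asr.transcribe_batch(wavs, lens / lens.max())  # relative lengths in (0, 1]
    return [clean_txt(h) for h in hyps]

# Example: transcribe_files_batched([s.noisy_wav for s in samples])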
In conclusion, we clearly see the power of integrating speech enhancement and ASR into a unified pipeline with SpeechBrain. By generating audio, corrupting it with noise, enhancing it, and finally transcribing it, we gain hands-on insights into how these models improve recognition accuracy in noisy environments. The results highlight the practical benefits of using open-source speech technologies. We conclude with a working framework that can be easily extended for larger datasets, different enhancement models, or custom ASR tasks.