2025 Edge Speech-to-Text Model Benchmark

Manideep Pranav Patel

Abstract

The proliferation of Speech-to-Text (STT) models and APIs necessitates objective performance benchmarks that reflect real-world conditions. Existing evaluations often fall short, either focusing on theoretical model capabilities or exhibiting bias from self-interested providers. This paper presents a practical and neutral benchmark of seven leading STT models, including open-source options such as Whisper, Distil-Whisper, and IBM Granite-Speech-3.3, alongside proprietary APIs such as Deepgram and AssemblyAI. Our methodology emphasizes real-world transcription challenges, particularly in high-stakes environments, utilizing a diverse dataset with clean and noisy speech. We evaluate models based on Word Error Rate (WER) and its components (substitutions, deletions, insertions), as well as Character Error Rate (CER). Findings indicate that Granite-Speech-3.3 consistently outperforms other models across both clean and noisy conditions, demonstrating superior accuracy and robustness. Distil-Whisper shows strong performance, particularly in CER, making it a viable option for resource-constrained applications. Wav2Vec2, however, exhibits significant limitations, rendering it unsuitable for production-grade use. This benchmark provides crucial insights for developers and product teams seeking reliable STT solutions for real-world scenarios where accuracy is paramount.

1 Introduction

The landscape of Speech-to-Text (STT) models and APIs is rapidly expanding, leading to a critical need for unbiased and reliable performance evaluations. Developers and product teams often struggle to answer the fundamental question: “Which STT solution should I use?” This challenge is exacerbated by the limitations of current benchmarking practices.

Academic research, while valuable for understanding model architectures, typically focuses on isolated model comparisons rather than their performance within production-grade APIs. These studies often overlook real-world constraints such as latency, computational cost, and diverse acoustic conditions. Conversely, many online API benchmarks are conducted by the providers themselves, leading to inherent biases where their own solutions are consistently ranked at the top. A prominent example is Deepgram’s benchmark, which, unsurprisingly, positions Deepgram first across all metrics. While strategically sound from a marketing perspective, such evaluations offer limited utility for developers seeking objective and reliable comparisons based on realistic scenarios.

This benchmark was initiated out of a practical necessity for a more neutral and robust evaluation. Our primary objective was not to generate clickbait or marketing collateral, but to answer a crucial question: “Which speech-to-text model can actually be trusted when accuracy isn’t optional?” We specifically focused on real-world transcription challenges, particularly those encountered in high-stakes environments like hospitals, where even a single transcription error can lead to serious consequences. This benchmark is meticulously designed to assist developers and product teams in selecting the most appropriate STT model for real-world speech processing, moving beyond idealized lab conditions to assess actual performance under duress.

2 Methodology

Our evaluation focused on real-world conditions that present significant challenges for speech recognition systems.

2.1 Dataset Description and Real-World Relevance

Dataset Overview:

Audio Specifications:

The dataset is publicly available for review and use at: https://huggingface.co/datasets/Ionio-ai/speech_to_text_benchmark/tree/main
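For reference, the repository can be pulled locally with the huggingface_hub client. The snippet below is a minimal sketch: the repo id comes from the link above, but the file layout inside the snapshot is not documented here, so the audio-file scan is purely illustrative.

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# Download a local copy of the benchmark dataset repository.
local_dir = snapshot_download(
    repo_id="Ionio-ai/speech_to_text_benchmark",
    repo_type="dataset",
)

# Enumerate audio files by common extensions; the actual directory layout
# inside the repo is an assumption here.
audio_files = [
    p for p in Path(local_dir).rglob("*")
    if p.suffix.lower() in {".wav", ".mp3", ".flac"}
]
print(f"Downloaded {len(audio_files)} audio files to {local_dir}")
```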

2.2 Evaluated Models

We tested seven leading speech-to-text models representing different approaches to automatic speech recognition: the open-source Whisper, Distil-Whisper, Wav2Vec2, Parakeet, and IBM Granite-Speech-3.3, plus the proprietary Deepgram and AssemblyAI APIs.
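As an illustration of how some of the open-source entrants can be invoked, the sketch below transcribes a single clip with Hugging Face's automatic-speech-recognition pipeline. Only Whisper, Distil-Whisper, and Wav2Vec2 are shown, since they share the same pipeline interface; the checkpoint identifiers and chunking settings are assumptions for illustration, not the exact configuration used in this benchmark.

```python
from transformers import pipeline

# Illustrative open-source checkpoints (assumed; not necessarily the exact
# revisions evaluated in this benchmark).
CHECKPOINTS = {
    "whisper": "openai/whisper-large-v3",
    "distil-whisper": "distil-whisper/distil-large-v3",
    "wav2vec2": "facebook/wav2vec2-large-960h-lv60-self",
}

def transcribe(audio_path: str, model_key: str = "whisper") -> str:
    """Run a single transcription pass over one audio file."""
    asr = pipeline(
        "automatic-speech-recognition",
        model=CHECKPOINTS[model_key],
        chunk_length_s=30,  # chunk long clips so they fit the model's context
    )
    return asr(audio_path)["text"]

print(transcribe("sample.wav", "distil-whisper"))
```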

2.3 Evaluation Metrics

Word Error Rate (WER) was our primary evaluation metric, widely used to quantify transcription accuracy. We also analyzed the specific types of errors that contribute to WER: substitutions, deletions, and insertions. These individual components often have outsized effects in real-world use cases.

$WER = \frac{S + D + I}{N} \times 100\%$

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the total number of words in the reference transcript.

Lower WER indicates better transcription quality, with 0% WER representing perfect accuracy.
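To make the metric and its components concrete, here is a minimal, self-contained sketch that aligns a hypothesis against a reference at the word level and counts the substitutions, deletions, and insertions that enter the formula above. Real evaluations usually also apply text normalization (punctuation, casing, number formats), which is intentionally reduced to lower-casing here.

```python
from dataclasses import dataclass

@dataclass
class ErrorCounts:
    substitutions: int
    deletions: int
    insertions: int
    reference_length: int

    @property
    def error_rate(self) -> float:
        """(S + D + I) / N * 100 -- WER for word tokens, CER for characters."""
        total = self.substitutions + self.deletions + self.insertions
        return 100.0 * total / max(1, self.reference_length)

def align(ref: list, hyp: list) -> ErrorCounts:
    """Levenshtein alignment between reference and hypothesis token sequences."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match / substitution
                cost[i - 1][j] + 1,                               # deletion
                cost[i][j - 1] + 1,                               # insertion
            )
    # Backtrace through the cost matrix to count each error type.
    i, j, S, D, I = n, m, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += int(ref[i - 1] != hyp[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D, i = D + 1, i - 1
        else:
            I, j = I + 1, j - 1
    return ErrorCounts(S, D, I, n)

def wer(reference: str, hypothesis: str) -> ErrorCounts:
    """Word Error Rate on lower-cased, whitespace-tokenised text."""
    return align(reference.lower().split(), hypothesis.lower().split())

# Example: one deletion ("the") and one substitution ("milligrams" -> "milligram").
counts = wer("give the patient ten milligrams", "give patient ten milligram")
print(counts, f"WER = {counts.error_rate:.1f}%")
```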

2.3.1 Significance of Error Types in Domain-Specific Applications

While WER provides a summary view, understanding how a model makes mistakes is equally important, particularly in domains like healthcare, law, or live broadcasts where each word may carry critical meaning: a substitution can silently change the meaning of a sentence, a deletion can drop a critical qualifier such as a negation or a dosage, and an insertion can introduce content the speaker never said.

We also tracked Character Error Rate (CER) to capture finer-grained errors in spelling, which matters for applications such as transcribing names, email addresses, or short codes.
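For CER (and WER), a third-party implementation can also be used instead of the hand-rolled alignment above; the snippet below is a small usage sketch with the jiwer package, which exposes both metrics. The example strings are invented for illustration.

```python
import jiwer  # pip install jiwer

reference = "Please email dr.patel@example.org the code A17-B"
hypothesis = "Please email doctor patel at example org the code A 17 B"

# Both metrics are returned as fractions; multiply by 100 for percentages.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")
```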

2.4 Inference Strategy: Multi-Pass vs. Zero-Pass Inference

To better understand model stability and variability, we evaluated the open-source models under two inference setups, multi-pass and zero-pass.
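The exact multi-pass and zero-pass configurations are not reproduced here, so the harness below is only a hedged sketch of the general idea: it wraps any transcription callable, runs it one or more times over the evaluation set, and reports the mean and spread of WER across passes. The `transcribe` callable and the `jiwer` dependency are assumptions for illustration.

```python
import statistics
from typing import Callable, List, Tuple

import jiwer  # pip install jiwer

def evaluate(
    transcribe: Callable[[str], str],      # any function: audio path -> transcript
    samples: List[Tuple[str, str]],        # (audio_path, reference_transcript) pairs
    n_passes: int = 1,                     # 1 = single pass, >1 = multi-pass
) -> Tuple[float, float]:
    """Return (mean WER, std-dev of WER) across the requested number of passes."""
    per_pass = []
    for _ in range(n_passes):
        refs, hyps = [], []
        for audio_path, reference in samples:
            refs.append(reference)
            hyps.append(transcribe(audio_path))
        # Corpus-level WER over all samples for this pass.
        per_pass.append(jiwer.wer(refs, hyps))
    spread = statistics.stdev(per_pass) if n_passes > 1 else 0.0
    return statistics.mean(per_pass), spread

# Usage (hypothetical names): mean_wer, wer_std = evaluate(transcribe, pairs, n_passes=5)
```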

2.5 Noise Augmentation Strategy

Our augmentation pipeline was designed to simulate diverse acoustic distortions, most notably background noise mixed into the clean recordings.
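As one example of such a distortion, the sketch below mixes a background-noise recording into a clean clip at a chosen signal-to-noise ratio using numpy and soundfile. It assumes mono audio at a shared sample rate and is not necessarily the exact augmentation code used for this benchmark.

```python
import numpy as np
import soundfile as sf

def mix_at_snr(clean_path: str, noise_path: str, out_path: str, snr_db: float = 10.0) -> None:
    """Mix background noise into clean speech at a target signal-to-noise ratio (mono audio)."""
    clean, sr = sf.read(clean_path)
    noise, noise_sr = sf.read(noise_path)
    assert sr == noise_sr, "resample the noise to the speech sample rate first"

    # Loop or trim the noise so it matches the clean clip's length.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(clean)]

    # Scale the noise so that 10 * log10(P_clean / P_noise) equals snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise

    # Normalize only if needed, to avoid clipping when writing back to disk.
    noisy = noisy / max(1.0, np.max(np.abs(noisy)))
    sf.write(out_path, noisy, sr)

mix_at_snr("clean_clip.wav", "cafe_noise.wav", "noisy_clip.wav", snr_db=5.0)
```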

3 Benchmark Analysis Results for Open-Sourced Models

3.1 Clean Speech Evaluation

For clean audio conditions, Granite-Speech-3.3 emerged as the standout performer with an exceptional 8.18% WER, significantly outpacing all competitors. Distil-Whisper secured second place with a respectable 14.93% WER, offering a good balance of accuracy and efficiency. Wav2Vec2 struggled considerably in this category, posting a disappointing 37.04% WER.

3.2 Noisy Speech Evaluation

The introduction of background noise revealed interesting shifts in model robustness. Granite-Speech-3.3 maintained its leading position with 15.72% WER, demonstrating exceptional noise resilience. Distil-Whisper showed strong noise handling capabilities, achieving 21.26% WER, while standard Whisper experienced more significant degradation, jumping to 29.80% WER. Wav2Vec2 remained the weakest performer at 54.69% WER.

Figure 1: CER performance across clean and noisy audio for open-source models.

Figure 2: Comparison of WER degradation from clean to noisy conditions for open-source models.

3.3 Error Pattern Analysis

3.3.1 Insertion Errors

Granite-Speech-3.3 demonstrated minimal insertion errors (0.828 clean, 0.862 noisy), making it suitable for real-time systems that demand accuracy. Whisper, however, showed a sharp increase in insertion errors under noisy conditions, reaching 2.362.

Figure 3: Insertion Error Rates for open-source models across clean and noisy conditions.

3.3.2 Deletion Errors

Parakeet led the field in this category with a deletion error rate of just 0.414 under clean conditions, making it well-suited for controlled environments where completeness is non-negotiable. Wav2Vec2 had a deletion rate of 8.897 under noise, suggesting a frequent tendency to drop entire phrases.

Figure 4: Deletion Error Rates for open-source models across clean and noisy conditions.

3.3.3 Substitution Errors

Granite-Speech-3.3 again led with the lowest substitution rates, scoring 2.276 in clean and 3.276 in noisy conditions. Distil-Whisper delivered moderate substitution rates. Wav2Vec2’s substitution error rate ballooned to 13.879 under noisy conditions.

Figure 5: Substitution Error Rates for open-source models across clean and noisy conditions.

Figure 6: Average insertions, deletions, and substitutions across clean and noisy audio for open-source models.

3.4 Character vs Word Error Analysis

The correlation between WER and CER showed a strong alignment across models. Distil-Whisper recorded the lowest overall CER (10.14%), demonstrating excellent character-level precision, with Granite-Speech-3.3 close behind at 10.26%.


Figure 8: Comparison of CER degradation from clean to noisy conditions for open-source models.

4 Benchmark Analysis Results for All Models

4.1 Clean Speech Evaluation (All Models)

In clean audio conditions, Granite-Speech-3.3’s exceptional 7.9% WER positions it as the leader. AssemblyAI and Parakeet, with WERs of 9.4% and 9.5%, perform admirably. Wav2Vec2’s 29.0% WER indicates significant struggles.

4.2 Noisy Speech Evaluation (All Models)

Granite-Speech-3.3 maintains its dominance with an 11.5% WER in noisy conditions, an increase of only about 3.6 percentage points over its clean-audio result. AssemblyAI's 14.1% WER also demonstrates notable robustness. Wav2Vec2's drastic jump to 49.6% WER reveals a critical weakness.

Figure 9: WER performance across clean and noisy audio for all models.

4.3 Error Pattern Analysis (All Models)

4.4 Overall Model Rankings

Figure 10: Overall WER performance across all models, showing average WER across clean and noisy conditions.

Rank  Model               Average WER
1     Granite-Speech-3.3  9.7%
2     AssemblyAI          11.75%
3     Parakeet            12.85%
4     Distil-Whisper      15.85%
5     Deepgram            17.6%
6     Whisper             19.6%
7     Wav2Vec2            39.3%

5 Limitations of This Benchmark

6 Key Insights and Use Case Recommendations

7 Conclusion and Future Work

This benchmark provides a comprehensive and unbiased evaluation of leading speech-to-text models under realistic conditions. Our findings clearly indicate that IBM Granite-Speech-3.3 stands out as the top performer, demonstrating exceptional accuracy and robustness across both clean and noisy audio environments. Distil-Whisper emerges as a strong contender among open-source models, offering a compelling balance of accuracy and efficiency. Conversely, Wav2Vec2 consistently underperformed, suggesting it is not yet ready for production-grade deployment in scenarios demanding high accuracy.

While this benchmark provides valuable guidance, it also highlights areas for future research. Expanding the dataset to include a wider variety of accents, conversational speech with overlapping speakers, and domain-specific vocabulary would further enhance the realism and applicability of such evaluations. Additionally, incorporating metrics for inference speed, latency, and computational efficiency would provide a more complete picture for real-time and edge computing applications. For organizations seeking to implement STT solutions, understanding these nuanced performance characteristics is paramount.

