Introducing MaskGCT

Ethan · Apr 29, 2025 · Product

Cutting-edge speech synthesis with lifelike voices, full control, and multilingual fluency.

When we set out to build a truly next-generation AI voice platform, we knew we couldn’t rely on existing TTS models. So we built our own — and it changed everything.

Masked Generative Codec Transformer (MaskGCT) is the proprietary speech synthesis model that powers All Voice Lab’s voice capabilities. It’s fast, controllable, multilingual, and most importantly — astonishingly real.

What makes it different?

Most large-scale text-to-speech (TTS) systems today fall into two camps: autoregressive models that generate speech token by token (smooth-sounding, but slower and harder to control), and non-autoregressive models that generate in parallel but typically depend on rigid, pre-defined alignment between text and speech.

MaskGCT takes a different approach.

It’s designed to generate speech in a fully parallel, highly flexible way — without needing manual alignment or detailed timing rules. Instead of trying to match every word to a precise timestamp, it learns how speech should “feel” — how a sentence flows, where to pause, how to emphasize — all from large-scale real-world data. The result is a model that not only sounds better, but adapts better.
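
To make "fully parallel, mask-based generation" concrete, here is a minimal, illustrative sketch of the general technique this family of models uses: start with every speech token masked, predict all positions at once, keep the most confident predictions, and re-mask the rest for the next pass. The model interface, mask id, sequence length, and cosine schedule below are assumptions for illustration, not MaskGCT's actual implementation.

```python
import math
import torch

MASK_ID = 1024        # hypothetical "masked" token id
SEQ_LEN = 200         # number of speech tokens to generate
NUM_STEPS = 10        # decoding iterations; every position is predicted in parallel

def generate_tokens(model, text_tokens):
    # Start with every speech position masked.
    tokens = torch.full((SEQ_LEN,), MASK_ID, dtype=torch.long)

    for step in range(1, NUM_STEPS + 1):
        # Predict a distribution over the codebook for all positions at once.
        logits = model(text_tokens, tokens)            # shape: (SEQ_LEN, vocab)
        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)     # best guess per position

        # Tokens revealed in earlier steps stay fixed.
        still_masked = tokens == MASK_ID
        tokens = torch.where(still_masked, candidates, tokens)

        # Cosine schedule: re-mask many low-confidence positions early, fewer later.
        num_to_remask = int(SEQ_LEN * math.cos(math.pi / 2 * step / NUM_STEPS))
        if num_to_remask > 0:
            confidence[~still_masked] = float("inf")   # never re-mask revealed tokens
            lowest = confidence.topk(num_to_remask, largest=False).indices
            tokens[lowest] = MASK_ID

    return tokens  # fully revealed after the final step
```

The key point is that nothing in this loop needs a word-to-timestamp alignment: the model fills in the whole sequence jointly over a handful of passes, which is what makes both the speed and the flexibility possible.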

Here’s what sets it apart:

1. High-Fidelity Voice Cloning in Seconds

MaskGCT delivers ultra-realistic voice cloning using just 3 seconds of reference audio. It accurately reproduces not only tone and pacing, but also stylistic detail and emotional nuance — whether it’s a human voice, a fictional character, or a whisper-style performance. The result is a voice that doesn’t just sound similar — it feels identical.
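
Conceptually, cloning from a short reference works by in-context prompting: the reference clip is encoded into the same discrete token space the model generates in, and those tokens condition generation so the new speech keeps the prompt's voice. The sketch below is a hypothetical illustration of that flow; `speech_model` and `codec_model` are stand-ins, not a real API.

```python
# Hypothetical sketch of zero-shot cloning via acoustic prompting.
# `speech_model` and `codec_model` are illustrative stand-ins, not a real API.

def clone_voice(speech_model, codec_model, reference_wav, text):
    # Encode roughly 3 seconds of the target speaker into discrete speech tokens.
    prompt_tokens = codec_model.encode(reference_wav)

    # Generate new tokens conditioned on both the text and the acoustic prompt;
    # the model imitates the prompt's timbre, pacing, and speaking style.
    generated_tokens = speech_model.generate(text=text, prompt=prompt_tokens)

    # Decode the generated tokens back into a waveform in the cloned voice.
    return codec_model.decode(generated_tokens)
```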

2. Cross-Lingual Fluency with Consistent Identity

Trained on the Emilia dataset — a massive 100,000-hour collection of richly diverse multilingual speech — MaskGCT can make a single voice speak six languages fluently — English, Chinese, Japanese, French, German and Korean — while maintaining the same vocal identity and natural rhythm. No mismatched tone. No robotic accent drift. Just one voice, truly global.

3. Fine-Grained Control Without Breaking Naturalness

MaskGCT is designed for precise control over speech attributes, including length, speed, pauses, and emotional tone — while preserving the natural rhythm, prosody, and timbre of the original voice. This fine-tuned controllability is built directly into the model’s generation process, without compromising fluency or expressiveness.
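
Because the whole token sequence length is chosen before generation starts, length and speed control reduce to simple arithmetic on the target duration. Here is a small illustrative example; the 50 tokens-per-second frame rate and the helper function are assumptions for illustration, not MaskGCT internals.

```python
TOKENS_PER_SECOND = 50  # assumed codec frame rate, for illustration only

def target_length(predicted_duration_s: float, speed: float = 1.0,
                  extra_pause_s: float = 0.0) -> int:
    """Number of speech tokens to generate.

    speed > 1.0 speaks faster (shorter output); speed < 1.0 speaks slower.
    extra_pause_s adds a silence budget, e.g. for a deliberate mid-sentence pause.
    """
    duration_s = predicted_duration_s / speed + extra_pause_s
    return round(duration_s * TOKENS_PER_SECOND)

# Example: a sentence predicted at 4.0 s, spoken 20% faster with a 0.3 s pause
# budget -> 4.0 / 1.2 + 0.3 = 3.63 s -> about 182 tokens.
print(target_length(4.0, speed=1.2, extra_pause_s=0.3))
```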

Benchmarked. Verified. Best-in-class.

MaskGCT has achieved state-of-the-art (SOTA) results across three authoritative public TTS benchmarks, surpassing even the most advanced models in the field — and in some cases, exceeding human-level performance.

• On SIM-O (Objective Speaker Similarity), which measures how closely the generated voice matches the reference speaker’s vocal identity, MaskGCT scores 0.728 on SeedTTS English and 0.777 on SeedTTS Chinese — approaching the ground-truth scores of 0.730 and 0.750, and outperforming all competing models (a sketch of how this kind of score is computed appears after this list).

• In SMOS (Similarity Mean Opinion Score), which reflects how similar the generated voice sounds to the reference based on human ratings, MaskGCT consistently scores above 4.1 across all three test sets — higher than any other model evaluated, and in some cases even higher than ground truth references.
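
For context on what a SIM-O number like 0.728 actually measures: objective speaker similarity is typically computed by embedding both clips with a speaker-verification model and taking the cosine similarity of the embeddings, then averaging over the test set. A minimal sketch follows; the embedding function is a hypothetical stand-in for whatever model the benchmark specifies.

```python
import numpy as np

def speaker_similarity(embed_speaker, generated_wav, reference_wav) -> float:
    # `embed_speaker` is a hypothetical stand-in for the benchmark's
    # speaker-verification embedding model; it maps a waveform to a vector.
    a = embed_speaker(generated_wav)
    b = embed_speaker(reference_wav)
    # Cosine similarity: values near 1.0 mean the generated voice is nearly
    # indistinguishable from the reference speaker in the embedding space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```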

These results confirm the model’s exceptional ability to replicate speaker-specific features such as tone, cadence, and accent. Complementary metrics such as word error rate (WER), Fréchet speech distance (FSD), and comparative mean opinion score (CMOS) also demonstrate MaskGCT’s overall strength in intelligibility, quality, and naturalness — confirming its position among the most advanced TTS systems available today.

We’re excited to see how MaskGCT will power the next generation of AI voice experiences — more faithful, more expressive, and more human than ever.