Introducing MaskGCT
Cutting-edge speech synthesis with lifelike voices, full control, and multilingual fluency.

When we set out to build a truly next-generation AI voice platform, we knew we couldn’t rely on existing TTS models. So we built our own — and it changed everything.
Masked Generative Codec Transformer (MaskGCT) is the proprietary speech synthesis model that powers All Voice Lab’s voice capabilities. It’s fast, controllable, multilingual, and most importantly — astonishingly real.
Most large-scale text-to-speech (TTS) systems today fall into two categories: autoregressive models generate speech step by step, which can sound smooth but offers limited control, while non-autoregressive models generate faster but often rely on rigid, pre-defined timing (explicit alignment) between text and speech.
MaskGCT takes a different approach.
It’s designed to generate speech in a fully parallel, highly flexible way — without needing manual alignment or detailed timing rules. Instead of trying to match every word to a precise timestamp, it learns how speech should “feel” — how a sentence flows, where to pause, how to emphasize — all from large-scale real-world data. The result is a model that not only sounds better, but adapts better.
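To make that concrete, here is a minimal sketch of iterative masked parallel decoding over discrete speech (codec) tokens, the family of techniques MaskGCT belongs to. The confidence-based re-masking, the cosine schedule, and every name below are illustrative assumptions, not MaskGCT’s exact implementation.

```python
import math
import torch

MASK_ID = 1024     # id of the special [MASK] token (illustrative)

def masked_parallel_decode(model, text_cond, seq_len=200, steps=20):
    """Iterative parallel decoding: start fully masked, then unmask the most
    confident predictions each step. `model` is assumed to return logits of
    shape (seq_len, vocab_size) given the partially masked token sequence."""
    tokens = torch.full((seq_len,), MASK_ID)
    for step in range(steps):
        logits = model(tokens, text_cond)          # (seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Cosine schedule: how many positions may stay masked after this step.
        remain = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        still_masked = tokens == MASK_ID
        # Fill every masked slot with its current best prediction ...
        tokens = torch.where(still_masked, pred, tokens)
        if remain > 0:
            # ... then re-mask the `remain` least confident of those slots,
            # so they get revisited with more context on the next iteration.
            conf = conf.masked_fill(~still_masked, float("inf"))
            tokens[conf.topk(remain, largest=False).indices] = MASK_ID
    return tokens   # discrete codec tokens, decoded to a waveform downstream
```

Because each pass predicts every position at once, the number of decoding steps stays small and fixed no matter how long the utterance is, which is where the speed and flexibility come from.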
Here’s what sets it apart:
1. High-Fidelity Voice Cloning in Seconds
MaskGCT delivers ultra-realistic voice cloning using just 3 seconds of reference audio. It accurately reproduces not only tone and pacing, but also stylistic detail and emotional nuance — whether it’s a human voice, a fictional character, or a whisper-style performance. The result is a voice that doesn’t just sound similar — it feels identical.
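As a usage illustration only: the module, class, and method names below are hypothetical, not All Voice Lab’s actual SDK. They simply show the shape of a clone-then-synthesize workflow from a short reference clip.

```python
# Hypothetical workflow sketch; `allvoicelab`, `MaskGCT`, and these method
# names are assumptions for illustration, not a documented API.
from allvoicelab import MaskGCT

model = MaskGCT.load("maskgct-base")                # assumed checkpoint name

# A ~3-second reference clip is enough to capture timbre, pacing, and style.
voice = model.clone_voice("reference_3s.wav")

audio = model.synthesize(
    text="Welcome back. Let's pick up right where we left off.",
    voice=voice,
)
audio.save("cloned_output.wav")
```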
2. Cross-Lingual Fluency with Consistent Identity
Trained on the Emilia dataset, a massive 100,000-hour collection of richly diverse multilingual speech, MaskGCT can make a single voice speak six languages fluently (English, Chinese, Japanese, French, German, and Korean) while maintaining the same vocal identity and natural rhythm. No mismatched tone. No robotic accent drift. Just one voice, truly global.
3. Fine-Grained Control Without Breaking Naturalness
MaskGCT is designed for precise control over speech attributes, including length, speed, pauses, and emotional tone — while preserving the natural rhythm, prosody, and timbre of the original voice. This fine-tuned controllability is built directly into the model’s generation process, without compromising fluency or expressiveness.
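One natural consequence of parallel generation is that the output length is chosen before decoding begins, which is a convenient hook for duration and speed control. The sketch below assumes the masked_parallel_decode helper from the earlier snippet and a 50 Hz codec frame rate; both are illustrative assumptions rather than the product’s actual control interface.

```python
CODEC_FRAME_RATE = 50   # codec frames per second of audio (assumed value)

def synthesize_with_controls(model, text_cond, target_seconds, speed=1.0):
    """Fix the number of codec frames up front: more frames for the same text
    means slower, more deliberate delivery; fewer means faster speech.
    Pauses can be steered at the text level (punctuation or break markers)
    without touching the decoder itself."""
    seq_len = int(target_seconds * CODEC_FRAME_RATE / speed)
    return masked_parallel_decode(model, text_cond, seq_len=seq_len)
```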
MaskGCT has achieved state-of-the-art (SOTA) results across three authoritative public TTS benchmarks, surpassing even the most advanced models in the field — and in some cases, exceeding human-level performance.
• On SIM-O (Objective Speaker Similarity), which measures how closely the generated voice matches the reference speaker’s vocal identity, MaskGCT scores 0.728 on SeedTTS English and 0.777 on SeedTTS Chinese, approaching the ground-truth scores of 0.730 and 0.750 and outperforming all competing models (a minimal sketch of how this score is computed follows this list).
• In SMOS (Similarity Mean Opinion Score), which reflects how similar the generated voice sounds to the reference based on human ratings, MaskGCT consistently scores above 4.1 across all three test sets — higher than any other model evaluated, and in some cases even higher than ground truth references.
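For context on what a SIM-O number like 0.728 means: the score is typically the cosine similarity between speaker embeddings of the generated and reference audio, extracted by a pretrained speaker-verification encoder. The helper below is a minimal sketch of that computation; the choice of embedding model is an assumption, not something specified here.

```python
import numpy as np

def speaker_similarity(emb_generated: np.ndarray, emb_reference: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; values near 1.0 mean
    the generated voice is nearly indistinguishable from the reference."""
    a = emb_generated / np.linalg.norm(emb_generated)
    b = emb_reference / np.linalg.norm(emb_reference)
    return float(np.dot(a, b))

# The embeddings would come from a pretrained speaker-verification encoder
# (a WavLM-style model is a common choice); that choice is an assumption here.
```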
These results confirm the model’s exceptional ability to replicate speaker-specific features such as tone, cadence, and accent. Complementary metrics, including word error rate (WER), Fréchet Speech Distance (FSD), and comparative mean opinion score (CMOS), also demonstrate MaskGCT’s overall strength in intelligibility, quality, and naturalness, underscoring its position among the most advanced TTS systems available today.
We’re excited to see how MaskGCT will power the next generation of AI voice experiences — more faithful, more expressive, and more human than ever.