Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction

Xiaoran Fan1*, Zhichao Sun1*, Yangfan Gao1*, Jingfei Xiong1*, Hang Yan2*, Yifei Cao1, Jiajun Sun1, Shuo Li1, Zhihao Zhang1, Zhiheng Xi1, Yuhao Zhou1,
Senjie Jin1, Changhao Jiang1, Junjie Ye1, Ming Zhang1, Rui Zheng1, Zhenhua Han, Yunke Zhang3, Demei Yan3, Shaokang Dong3,
Tao Ji1†, Tao Gui1†, Qi Zhang1†, Xuanjing Huang1†

1Fudan University, 2The Chinese University of Hong Kong, 3Honor Device Co., Ltd

*Equal contribution, †Corresponding author

Paper | GitHub Repo

Abstract

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in Word Error Rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.

This page is for research demonstration purposes only.

Overview

Figure 1: Left: overview of a speech-language model (SLM) trained with a decoupled speech tokenizer (Section 2.1) and Speaker-Aware TTS. Right: the architecture of a possible decoupled speech tokenizer, featuring decoupled speech quantization and reconstruction together with speaker-specific embedding extraction.
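
To make the decoupling concrete, here is a minimal PyTorch sketch of the tokenizer structure in Figure 1 (right): content and prosody are quantized into separate discrete streams, while speaker identity stays a continuous utterance-level embedding that is re-injected only at reconstruction time. All module names and sizes (`DecoupledTokenizer`, `dim`, `codebook_size`) are illustrative assumptions, not the actual FACodec implementation.

```python
import torch
import torch.nn as nn

class DecoupledTokenizer(nn.Module):
    """Sketch of a decoupled speech tokenizer (hypothetical, not FACodec)."""

    def __init__(self, dim=256, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        # Separate codebooks keep the content and prosody streams disentangled.
        self.content_codebook = nn.Embedding(codebook_size, dim)
        self.prosody_codebook = nn.Embedding(codebook_size, dim)
        # Speaker branch: one continuous, utterance-level embedding (not quantized).
        self.speaker_proj = nn.Linear(dim, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=10, stride=5)

    def quantize(self, h, codebook):
        # Nearest-neighbor lookup: (B, T, dim) frames against (K, dim) codewords.
        cb = codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
        ids = torch.cdist(h, cb).argmin(dim=-1)      # discrete token ids
        return ids, codebook(ids)

    def forward(self, wav):                          # wav: (B, 1, samples)
        h = self.encoder(wav).transpose(1, 2)        # (B, T, dim)
        content_ids, zc = self.quantize(h, self.content_codebook)
        prosody_ids, zp = self.quantize(h, self.prosody_codebook)
        spk = self.speaker_proj(h.mean(dim=1))       # (B, dim) speaker embedding
        z = zc + zp + spk.unsqueeze(1)               # recombine for reconstruction
        return self.decoder(z.transpose(1, 2)), content_ids, prosody_ids, spk
```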


Figure 2: Illustration of our NTP and MTP architectures. (a) NTP: a single vocabulary and a single prediction head; (b) MTP: multiple vocabularies and multiple prediction heads, generating multiple tokens in parallel.
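
The parallel decoding in Figure 2(b) can be sketched in a few lines of PyTorch. This is a hedged illustration, not the paper's exact head design: `MTPSpeechHead`, `hidden_dim`, and the per-head sub-vocabulary layout are assumptions.

```python
import torch
import torch.nn as nn

class MTPSpeechHead(nn.Module):
    """Sketch of a multi-token-prediction speech head (Figure 2b).

    k independent heads read the same backbone hidden state, each over its
    own sub-vocabulary, so k speech tokens come out of one forward pass;
    k = 1 reduces to ordinary NTP.
    """

    def __init__(self, hidden_dim=2048, vocab_size=1024, num_heads=3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, hidden):                            # (B, T, hidden_dim)
        logits = [head(hidden) for head in self.heads]    # k x (B, T, V)
        # Greedy decoding for simplicity; sampling works the same way.
        tokens = torch.stack([l.argmax(dim=-1) for l in logits], dim=-1)
        return tokens.flatten(-2)                         # (B, T*k) speech tokens
```

With k heads, generating T speech tokens takes only T/k backbone forward passes, which is where the up-to-12× decoding speedup of MTP-12H comes from.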


Figure 3: Illustration of Role-Playing Knowledge QA.

SLMs Trained with Different Speech Tokenizers

Comparison of SLMs trained with different kinds of speech tokenizers (coupled, semi-decoupled, decoupled). Columns group as: decoupled (FACodec-NTP); semi-decoupled (SpeechTokenizer-5L/8L); coupled (Encodec-4L/8L, WavTokenizer-v2, StableCodec, BigCodec). Each cell reports WER (%) / UTMOS.

| Case | Text | FACodec-NTP | SpeechTokenizer-5L | SpeechTokenizer-8L | Encodec-4L | Encodec-8L | WavTokenizer-v2 | StableCodec | BigCodec |
|---|---|---|---|---|---|---|---|---|---|
| Good | "It is a very delicate question," said he. | 0.0 / 4.33 | 0.0 / 3.51 | 12.5 / 4.17 | Failed | Failed | 0.0 / 4.05 | 12.5 / 3.68 | 0.0 / 4.46 |
| Good | In the distance the clouds resemble great bales of cotton, piled up in picturesque disorder. | 0.0 / 4.47 | 6.67 / 4.18 | 15.4 / 3.77 | Failed | Failed | 40.0 / 4.05 | 6.67 / 4.46 | 6.67 / 3.75 |
| Good | It was kind of you to spend so much time in our behalf. | 0.0 / 4.45 | 13.3 / 4.30 | 0.0 / 4.29 | 7.7 / 3.04 | 15.4 / 2.82 | 87.7 / 4.41 | 0.0 / 4.39 | 7.7 / 4.34 |
| Bad | This new plant takes high tension polyphase current from a water power thirty or forty miles away at Paderno, on the river Adda, flowing from the Apennines; but delivers low tension direct current for distribution to the regular Edison three wire system throughout Milan. | 29.5 / 3.36 | 31.8 / 3.82 | Failed | Failed | Failed | Failed | 25.0 / 2.90 | Failed |

"Failed" indicates that the SLM failed to predict the "end of speech" token.
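
For readers who want to reproduce the metric columns, here is a minimal sketch of how WER and UTMOS can be computed. The page does not state which ASR model or text normalization produced the numbers above, so Whisper (`openai-whisper`), `jiwer`, and the `tarepan/SpeechMOS` UTMOS checkpoint are assumptions.

```python
import torch
import torchaudio
import whisper                      # pip install openai-whisper
from jiwer import wer               # pip install jiwer

# ASR model and UTMOS predictor; both choices are assumptions.
asr = whisper.load_model("large-v3")
utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                       trust_repo=True)

def evaluate(wav_path: str, reference: str):
    """Return (WER in percent, predicted UTMOS) for one synthesized sample."""
    hypothesis = asr.transcribe(wav_path)["text"]
    score = wer(reference.lower(), hypothesis.lower()) * 100
    wave, sr = torchaudio.load(wav_path)            # (channels, samples)
    mos = utmos(wave, sr).item()                    # predicted naturalness in [1, 5]
    return score, mos
```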

NTP vs. MTP & Speaker-Aware TTS

Comparing next-token prediction (NTP) and multi-token prediction (MTP) for TTS on SLMs using a decoupled tokenizer (FACodec in all settings), with (w/) and without (w/o) the Speaker-Aware (SA) paradigm. Each cell reports WER (%) / UTMOS / SIM.

| Case | Text | NTP w/o SA | NTP w/ SA | MTP-3H w/o SA | MTP-3H w/ SA | MTP-6H w/o SA | MTP-6H w/ SA | MTP-12H w/o SA | MTP-12H w/ SA |
|---|---|---|---|---|---|---|---|---|---|
| Good | "Yes," replied my uncle, "and there is a sea lizard of vast size. | 0.0 / 4.19 / 0.51 | 0.0 / 4.34 / 0.54 | 0.0 / 4.17 / 0.45 | 0.0 / 4.20 / 0.53 | 0.0 / 4.03 / 0.49 | 0.0 / 3.93 / 0.59 | 0.0 / 4.29 / 0.56 | 0.0 / 4.36 / 0.59 |
| Good | I don't drink it myself, but I like to see it behave when it's poured. | 0.0 / 3.73 / 0.47 | 0.0 / 3.85 / 0.69 | 0.0 / 4.37 / 0.55 | 0.0 / 4.38 / 0.61 | 0.0 / 4.22 / 0.55 | 0.0 / 4.36 / 0.64 | 0.0 / 4.10 / 0.32 | 0.0 / 4.12 / 0.60 |
| Good | After the station had been running several months and was technically a success, we began to look after the financial part. | 0.0 / 4.45 / 0.50 | 0.0 / 4.46 / 0.67 | 0.0 / 4.31 / 0.56 | 0.0 / 4.41 / 0.71 | 0.0 / 3.78 / 0.60 | 0.0 / 4.44 / 0.71 | 0.0 / 4.07 / 0.51 | 0.0 / 4.40 / 0.64 |
| Bad | Sheriff Jones made several visits unmolested on their part, and without any display of writs or demand for the surrender of alleged offenders on his own. | 15.3 / 4.32 / 0.32 | 15.3 / 3.75 / 0.53 | 26.9 / 4.04 / 0.28 | 23.1 / 3.73 / 0.55 | 0.0 / 3.66 / 0.45 | 7.69 / 3.87 / 0.45 | 3.84 / 3.62 / 0.50 | 0.0 / 4.23 / 0.64 |
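
One plausible way to compute the SIM column is the cosine similarity between speaker embeddings of the reference speaker prompt and the generated speech. The page does not name its speaker encoder, so Resemblyzer is purely an assumption here.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

encoder = VoiceEncoder()  # pretrained d-vector speaker encoder (assumed choice)

def speaker_sim(prompt_wav: str, generated_wav: str) -> float:
    """Cosine similarity between the prompt speaker and the generated speech."""
    a = encoder.embed_utterance(preprocess_wav(prompt_wav))
    b = encoder.embed_utterance(preprocess_wav(generated_wav))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```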

Role-Playing Knowledge QA

Comparison of SLMs with the decoupled and semi-decoupled speech tokenizers and the top-3 coupled tokenizers on the Role-Playing Knowledge QA task.

Roles are selected from Genshin Impact.
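
Answering in a given role's voice relies on the speaker-aware generation paradigm described in the abstract. Below is a minimal, hypothetical sketch of one way it could be realized (the page does not spell out the mechanism): project the tokenizer's speaker embedding into the LLM embedding space and prepend it as a soft prompt token. `SpeakerAwarePrefix` and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerAwarePrefix(nn.Module):
    """Hypothetical speaker-aware conditioning: map the tokenizer's speaker
    embedding into the LLM embedding space and prepend it as a soft prompt,
    so every decoding step is conditioned on the target voice."""

    def __init__(self, spk_dim=256, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(spk_dim, llm_dim)

    def forward(self, token_embeds, spk_embed):
        # token_embeds: (B, T, llm_dim); spk_embed: (B, spk_dim)
        prefix = self.proj(spk_embed).unsqueeze(1)       # (B, 1, llm_dim)
        return torch.cat([prefix, token_embeds], dim=1)  # speaker-conditioned input
```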

FACodec is the decoupled tokenizer, SpeechTokenizer the semi-decoupled one, and WavTokenizer-v2, BigCodec, and StableCodec the coupled ones. Each cell shows the model's text answer and the speaker similarity (SIM) of its speech answer.

| Case | Sample type | Question | Ground truth | Role | FACodec | SpeechTokenizer | WavTokenizer-v2 | BigCodec | StableCodec |
|---|---|---|---|---|---|---|---|---|---|
| Good | In-domain question, seen role | How many imperial gallons are in a firkin? | The answer is 9 | Xiao | The answer is nine* (SIM: 0.75) | The answer is one hundred** (SIM: 0.13) | The answer is four (SIM: 0.49) | The answer is five (SIM: 0.21) | The answer is two (SIM: 0.27) |
| Good | Out-of-domain question, seen role | What style of art did Henri Matisse do? | The answer is impressionism | Arataki Itto | The answer is impressionism (SIM: 0.75) | The answer is impressionism in media*** (SIM: 0.12) | The answer is fauvisme (SIM: 0.22) | The answer is impressionism in sculpture (SIM: 0.23) | The answer is impressionist (SIM: 0.07) |
| Good | In-domain question, unseen role | Which golf shot is the opposite of a slice? | The answer is hook | Eula | The answer is hook (SIM: 0.71) | The answer is morris (SIM: -0.02) | The answer is condor yacht (SIM: 0.21) | The answer is tiddliwinks (SIM: 0.35) | The answer is cutty sall (SIM: 0.23) |
| Good | Out-of-domain question, unseen role | What is the largest city in Spain? | The answer is Madrid | Jean | The answer is madrid (SIM: 0.60) | The answer is jorvili (SIM: 0.10) | The answer is un locode itvce (SIM: 0.19) | The answer is castro (SIM: 0.07) | The answer is un locode dehaj (SIM: 0.19) |
| Bad | In-domain question, unseen role | Satya Nadella, boss of which vast corporation, apologised in 2014 for suggesting female workers should rely on faith and karma instead of asking for a pay rise? | The answer is Microsoft | Albedo | The answer is microsoft (SIM: 0.57) | The answer is microsoft inc (SIM: -0.05) | The answer is reds (SIM: 0.08) | The answer is nike (SIM: -0.03) | The answer is nikkei (SIM: 0.30) |

Markers: * correct answer, ** wrong answer, *** partially correct answer.