Speech-Language Models with Decoupled Tokenizers and Multi-Token Prediction
Xiaoran Fan1*,
Zhichao Sun1*,
Yangfan Gao1*,
Jingfei Xiong1*,
Hang Yan2*,
Yifei Cao1,
Jiajun Sun1,
Shuo Li1,
Zhihao Zhang1,
Zhiheng Xi1,
Yuhao Zhou1,
Senjie Jin1,
Changhao Jiang1,
Junjie Ye1,
Ming Zhang1,
Rui Zheng1,
Zhenhua Han,
Yunke Zhang3,
Demei Yan3,
Shaokang Dong3,
Tao Ji1†,
Tao Gui1†,
Qi Zhang1†,
Xuanjing Huang1†
1Fudan University,
2The Chinese University of Hong Kong,
3Honor Device Co., Ltd
*Equal contribution,
†Corresponding author
Abstract Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the impact of key components (i.e., speech tokenizers, speech heads, and speaker modeling) on the performance of LLM-centric SLMs. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in Word Error Rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
Contents
- Overview
- SLMs Trained with Different Speech Tokenizers
- NTP vs. MTP & Speaker-Aware TTS
- Role-Playing Knowledge QA
This page is for research demonstration purposes only.
Overview
Figure 1: Left: Overview of a Speech Language Model (SLM) trained with a decoupled speech tokenizer (Section 2.1) and Speaker-Aware TTS; Right: the architecture of a possible decoupled speech tokenizer, featuring speech quantization, decoupled reconstruction, and speaker-specific embedding extraction.
Figure 2: Illustration of our NTP and MTP architecture. (a) NTP: single vocabulary and single prediction head; (b) MTP: multiple vocabularies and multiple prediction heads, generating multiple tokens in parallel.
Figure 3: Illustration of Role-Playing Knowledge QA.
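The MTP design sketched in Figure 2 can be summarized as follows: instead of a single output head predicting one speech token per decoding step, several independent heads, each with its own speech vocabulary, read the same hidden state, so one forward step of the backbone emits several consecutive speech tokens. Below is a minimal PyTorch sketch of this idea; the class name, dimensions, and the use of plain linear heads are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MTPSpeechHead(nn.Module):
    """Multi-token prediction: several independent heads share one hidden state,
    each predicting one of the next k speech tokens from its own vocabulary."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int):
        super().__init__()
        # One linear projection per predicted position (vocabularies may differ in size
        # in practice; a shared size is used here for simplicity).
        self.heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(num_heads))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) -> logits: (batch, num_heads, vocab_size)
        return torch.stack([head(hidden) for head in self.heads], dim=1)

# Hypothetical sizes: one decoding step now yields num_heads speech tokens instead of one.
mtp = MTPSpeechHead(hidden_size=1024, vocab_size=1024, num_heads=12)
hidden_state = torch.randn(2, 1024)        # last hidden state of the SLM backbone
tokens = mtp(hidden_state).argmax(dim=-1)  # (2, 12) speech tokens per step
```

With 12 heads, each backbone step produces 12 speech tokens instead of one, which is what makes the reported up-to-12× decoding speedup possible.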
SLMs Trained with Different Speech Tokenizers
Comparison of SLMs trained with different kinds of speech tokenizers (Coupled, Semi-decoupled, Decoupled).
| Case | Text | FACodec-NTP (Decoupled) | SpeechTokenizer-5L (Semi-Decoupled) | SpeechTokenizer-8L (Semi-Decoupled) | Encodec-4L (Coupled) | Encodec-8L (Coupled) | WavTokenizer-v2 (Coupled) | StableCodec (Coupled) | BigCodec (Coupled) |
|---|---|---|---|---|---|---|---|---|---|
| Good cases | "It is a very delicate question," said he. | WER 0.0 / UTMOS 4.33 | WER 0.0 / UTMOS 3.51 | WER 12.5 / UTMOS 4.17 | Failed* | Failed | WER 0.0 / UTMOS 4.05 | WER 12.5 / UTMOS 3.68 | WER 0.0 / UTMOS 4.46 |
| | In the distance the clouds resemble great bales of cotton, piled up in picturesque disorder. | WER 0.0 / UTMOS 4.47 | WER 6.67 / UTMOS 4.18 | WER 15.4 / UTMOS 3.77 | Failed | Failed | WER 40.0 / UTMOS 4.05 | WER 6.67 / UTMOS 4.46 | WER 6.67 / UTMOS 3.75 |
| | It was kind of you to spend so much time in our behalf. | WER 0.0 / UTMOS 4.45 | WER 13.3 / UTMOS 4.30 | WER 0.0 / UTMOS 4.29 | WER 7.7 / UTMOS 3.04 | WER 15.4 / UTMOS 2.82 | WER 87.7 / UTMOS 4.41 | WER 0.0 / UTMOS 4.39 | WER 7.7 / UTMOS 4.34 |
| Bad case | This new plant takes high tension polyphase current from a water power thirty or forty miles away at Paderno, on the river Adda, flowing from the Apennines; but delivers low tension direct current for distribution to the regular Edison three wire system throughout Milan. | WER 29.5 / UTMOS 3.36 | WER 31.8 / UTMOS 3.82 | Failed | Failed | Failed | Failed | WER 25.0 / UTMOS 2.90 | Failed |
* Failed indicates that SLMs failed to predict the "end of speech" token.
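For context, WER in the table above is typically obtained by transcribing the synthesized audio with an ASR model and scoring the transcript against the input text, while UTMOS comes from an automatic MOS predictor. The snippet below is a hedged sketch of such a WER evaluation using openly available tools; Whisper, jiwer, and the file path are illustrative assumptions, not necessarily the systems behind the numbers reported here.

```python
import string

import whisper          # openai-whisper, used here only as an illustrative ASR backend
from jiwer import wer   # standard word-error-rate implementation

asr = whisper.load_model("base")

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so WER reflects word-level errors only.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def score_sample(reference_text: str, wav_path: str) -> float:
    """Transcribe a synthesized utterance and return its WER against the reference text."""
    hypothesis = asr.transcribe(wav_path)["text"]
    return wer(normalize(reference_text), normalize(hypothesis))

# Placeholder path; in the demo each tokenizer/sentence pair has its own synthesized clip.
print(score_sample('"It is a very delicate question," said he.', "facodec_ntp_sample.wav"))
```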
NTP vs. MTP & Speaker-Aware TTS
Comparison of Next-Token Prediction (NTP) and Multi-Token Prediction (MTP) for TTS on SLMs with a decoupled tokenizer*, with and without the Speaker-Aware paradigm.
| Case | Text | Prompt | NTP w/o Speaker-Aware | NTP w/ Speaker-Aware | MTP-3H w/o Speaker-Aware | MTP-3H w/ Speaker-Aware | MTP-6H w/o Speaker-Aware | MTP-6H w/ Speaker-Aware | MTP-12H w/o Speaker-Aware | MTP-12H w/ Speaker-Aware |
|---|---|---|---|---|---|---|---|---|---|---|
| Good cases | "Yes," replied my uncle, "and there is a sea lizard of vast size. | (audio) | WER 0.0 / UTMOS 4.19 / SIM 0.51 | WER 0.0 / UTMOS 4.34 / SIM 0.54 | WER 0.0 / UTMOS 4.17 / SIM 0.45 | WER 0.0 / UTMOS 4.20 / SIM 0.53 | WER 0.0 / UTMOS 4.03 / SIM 0.49 | WER 0.0 / UTMOS 3.93 / SIM 0.59 | WER 0.0 / UTMOS 4.29 / SIM 0.56 | WER 0.0 / UTMOS 4.36 / SIM 0.59 |
| | I don't drink it myself, but I like to see it behave when it's poured. | (audio) | WER 0.0 / UTMOS 3.73 / SIM 0.47 | WER 0.0 / UTMOS 3.85 / SIM 0.69 | WER 0.0 / UTMOS 4.37 / SIM 0.55 | WER 0.0 / UTMOS 4.38 / SIM 0.61 | WER 0.0 / UTMOS 4.22 / SIM 0.55 | WER 0.0 / UTMOS 4.36 / SIM 0.64 | WER 0.0 / UTMOS 4.10 / SIM 0.32 | WER 0.0 / UTMOS 4.12 / SIM 0.60 |
| | After the station had been running several months and was technically a success, we began to look after the financial part. | (audio) | WER 0.0 / UTMOS 4.45 / SIM 0.50 | WER 0.0 / UTMOS 4.46 / SIM 0.67 | WER 0.0 / UTMOS 4.31 / SIM 0.56 | WER 0.0 / UTMOS 4.41 / SIM 0.71 | WER 0.0 / UTMOS 3.78 / SIM 0.60 | WER 0.0 / UTMOS 4.44 / SIM 0.71 | WER 0.0 / UTMOS 4.07 / SIM 0.51 | WER 0.0 / UTMOS 4.40 / SIM 0.64 |
| Bad case | Sheriff Jones made several visits unmolested on their part, and without any display of writs or demand for the surrender of alleged offenders on his own. | (audio) | WER 15.3 / UTMOS 4.32 / SIM 0.32 | WER 15.3 / UTMOS 3.75 / SIM 0.53 | WER 26.9 / UTMOS 4.04 / SIM 0.28 | WER 23.1 / UTMOS 3.73 / SIM 0.55 | WER 0.0 / UTMOS 3.66 / SIM 0.45 | WER 7.69 / UTMOS 3.87 / SIM 0.45 | WER 3.84 / UTMOS 3.62 / SIM 0.50 | WER 0.0 / UTMOS 4.23 / SIM 0.64 |
* We chose FACodec as the decoupled tokenizer for all settings above.
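The SIM column above, and in the Role-Playing Knowledge QA section below, measures speaker similarity between the speaker prompt and the generated speech. A common recipe is the cosine similarity between embeddings from a pretrained speaker encoder; the sketch below assumes that setup, and the ECAPA-TDNN encoder from SpeechBrain is an illustrative choice rather than the model behind the reported scores.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder (illustrative choice; expects 16 kHz mono audio).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_similarity(prompt_wav: str, generated_wav: str) -> float:
    """Cosine similarity between speaker embeddings of the prompt and the synthesized speech."""
    embeddings = []
    for path in (prompt_wav, generated_wav):
        signal, sr = torchaudio.load(path)            # (channels, samples)
        signal = signal.mean(dim=0, keepdim=True)     # downmix to mono
        if sr != 16000:
            signal = torchaudio.functional.resample(signal, sr, 16000)
        embeddings.append(encoder.encode_batch(signal).squeeze())  # (emb_dim,)
    return torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()
```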
Role-Playing Knowledge QA
Comparison of SLMs with the decoupled, semi-decoupled, and top-3 coupled speech tokenizers on the Role-Playing Knowledge QA task.
Roles are selected from Genshin Impact.
| Case | Sample Type | Question | Ground Truth | Prompt | FACodec (Decoupled) | SpeechTokenizer (Semi-Decoupled) | WavTokenizer-v2 (Coupled) | BigCodec (Coupled) | StableCodec (Coupled) |
|---|---|---|---|---|---|---|---|---|---|
| Good cases | In-Domain Question & Seen role | How many imperial gallons are in a firkin? | The answer is 9 | Role: Xiao | Text: The answer is nine* / Speech SIM 0.75 | Text: The answer is one hundred** / Speech SIM 0.13 | Text: The answer is four / Speech SIM 0.49 | Text: The answer is five / Speech SIM 0.21 | Text: The answer is two / Speech SIM 0.27 |
| | Out-of-Domain Question & Seen role | What style of art did Henri Matisse do? | The answer is impressionism | Role: Arataki Itto | Text: The answer is impressionism / Speech SIM 0.75 | Text: The answer is impressionism in media*** / Speech SIM 0.12 | Text: The answer is fauvisme / Speech SIM 0.22 | Text: The answer is impressionism in sculpture / Speech SIM 0.23 | Text: The answer is impressionist / Speech SIM 0.07 |
| | In-Domain Question & Unseen role | Which golf shot is the opposite of a slice? | The answer is hook | Role: Eula | Text: The answer is hook / Speech SIM 0.71 | Text: The answer is morris / Speech SIM -0.02 | Text: The answer is condor yacht / Speech SIM 0.21 | Text: The answer is tiddliwinks / Speech SIM 0.35 | Text: The answer is cutty sall / Speech SIM 0.23 |
| | Out-of-Domain Question & Unseen role | What is the largest city in Spain? | The answer is Madrid | Role: Jean | Text: The answer is madrid / Speech SIM 0.60 | Text: The answer is jorvili / Speech SIM 0.10 | Text: The answer is un locode itvce / Speech SIM 0.19 | Text: The answer is castro / Speech SIM 0.07 | Text: The answer is un locode dehaj / Speech SIM 0.19 |
| Bad case | In-Domain Question & Unseen role | Satya Nadella, boss of which vast corporation, apologised in 2014 for suggesting female workers should rely on faith and karma instead of asking for a pay rise? | The answer is Microsoft | Role: Albedo | Text: The answer is microsoft / Speech SIM 0.57 | Text: The answer is microsoft inc / Speech SIM -0.05 | Text: The answer is reds / Speech SIM 0.08 | Text: The answer is nike / Speech SIM -0.03 | Text: The answer is nikkei / Speech SIM 0.30 |
** Wrong answer
*** Partially correct answer