UNISE: A UNIFIED FRAMEWORK FOR DECODER-ONLY AUTOREGRESSIVE LM-BASED SPEECH ENHANCEMENT

Intelligent Connectivity, Alibaba Group

Abstract

The development of neural audio codecs (NACs) has greatly promoted the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) remains unverified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles different SE tasks, including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech via AR modeling, which accommodates the distinct learning patterns of the individual tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, demonstrating the capacity of LMs to unify SE tasks.
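The conditional AR generation described above can be illustrated with a minimal toy sketch: a stand-in "decoder" consumes condition features of the input speech plus previously generated token embeddings and greedily predicts the next discrete codec token. All names and dimensions here (`VOCAB`, `DIM`, the `decoder` function) are illustrative placeholders, not the paper's actual model or the BiCodec vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy codec vocabulary size (placeholder, not BiCodec's)
DIM = 8     # toy embedding dimension

token_emb = rng.normal(size=(VOCAB, DIM))  # embeddings of discrete tokens
out_proj = rng.normal(size=(DIM, VOCAB))   # projection to token logits

def decoder(cond, prefix):
    """Stand-in for a transformer decoder: pools the condition features
    and the embeddings of already-generated tokens into one hidden state."""
    h = cond.mean(axis=0)
    if prefix:
        h = h + np.mean(prefix, axis=0)
    return h

def ar_generate(cond_features, n_steps):
    """Greedy AR next-token prediction of target-speech tokens,
    conditioned on input speech features."""
    tokens, prefix = [], []
    for _ in range(n_steps):
        h = decoder(cond_features, prefix)
        logits = h @ out_proj
        next_tok = int(np.argmax(logits))
        tokens.append(next_tok)
        prefix.append(token_emb[next_tok])
    return tokens

# e.g. frame-level features of the degraded or mixture speech
cond = rng.normal(size=(20, DIM))
tokens = ar_generate(cond, n_steps=5)
print(tokens)  # 5 discrete tokens, to be decoded back to a waveform by the codec
```

In the full system, the greedy `argmax` would typically be replaced by sampling, and the generated tokens would be passed to the codec decoder to synthesize the enhanced waveform.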

Fig. 1: Overall architecture of UniSE, where the BiCodec Encoder is only used to generate label tokens during training and is excluded during inference. The snowflake icon denotes parameters that are pre-trained and frozen, and the fire icon denotes parameters that are optimized during training.

[Figure: illustration of the three SE tasks — Speech Restoration (SR): clean / degraded / enhanced; Target Speech Extraction (TSE): target / mixture / reference / enhanced; Speech Separation (SS): mixture / speaker 1 / speaker 2.]