UNISE: A UNIFIED FRAMEWORK FOR DECODER-ONLY AUTOREGRESSIVE LM-BASED SPEECH ENHANCEMENT

Intelligent Connectivity, Alibaba Group

Abstract

The development of neural audio codecs (NACs) has greatly promoted the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) remains unverified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles different SE tasks, including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech via AR modeling, which accommodates the distinct learning patterns of the individual tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, demonstrating the capacity of LMs to unify SE tasks.
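The conditional AR generation described above can be illustrated with a minimal toy sketch: a stand-in "decoder" consumes condition features of the input speech plus previously generated token embeddings and greedily predicts the next discrete codec token. All names and dimensions here (`VOCAB`, `DIM`, the `decoder` function) are illustrative placeholders, not the paper's actual model or the BiCodec vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy codec vocabulary size (placeholder, not BiCodec's)
DIM = 8     # toy embedding dimension

token_emb = rng.normal(size=(VOCAB, DIM))  # embeddings of discrete tokens
out_proj = rng.normal(size=(DIM, VOCAB))   # projection to token logits

def decoder(cond, prefix):
    """Stand-in for a transformer decoder: pools the condition features
    and the embeddings of already-generated tokens into one hidden state."""
    h = cond.mean(axis=0)
    if prefix:
        h = h + np.mean(prefix, axis=0)
    return h

def ar_generate(cond_features, n_steps):
    """Greedy AR next-token prediction of target-speech tokens,
    conditioned on input speech features."""
    tokens, prefix = [], []
    for _ in range(n_steps):
        h = decoder(cond_features, prefix)
        logits = h @ out_proj
        next_tok = int(np.argmax(logits))
        tokens.append(next_tok)
        prefix.append(token_emb[next_tok])
    return tokens

# e.g. frame-level features of the degraded or mixture speech
cond = rng.normal(size=(20, DIM))
tokens = ar_generate(cond, n_steps=5)
print(tokens)  # 5 discrete tokens, to be decoded back to a waveform by the codec
```

In the full system, the greedy `argmax` would typically be replaced by sampling, and the generated tokens would be passed to the codec decoder to synthesize the enhanced waveform.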

Fig. 1: Overall architecture of UniSE, where the BiCodec Encoder is only used to generate label tokens during training and is excluded during inference. The snowflake icon denotes parameters that are pre-trained and frozen, and the fire icon denotes parameters that are optimized during training.

[Figure: illustration of the three SE tasks — Speech Restoration (SR): clean / degraded / enhanced; Target Speech Extraction (TSE): target / mixture / reference / enhanced; Speech Separation (SS): mixture / speaker 1 / speaker 2.]