Abstract: The demand for multi-speaker speech transcription and speaker diarization in applications such as meeting minutes and customer-service quality inspection continues to grow. Recent advances in multimodal large language models have given rise to Audio–Language Models (ALMs) that can simultaneously interpret audio signals and natural-language prompts within a unified autoregressive decoding framework, making them a natural fit for speaker diarization and offering a fresh approach to end-to-end multi-speaker audio transcription. This paper proposes an end-to-end speaker diarization system based on an ALM and jointly optimizes speech-recognition and speaker-attribution ability via a two-stage training strategy, thereby adapting the general capabilities of ALMs to this downstream task. In the first stage, supervised fine-tuning (SFT) adds a "speaker loss" to the standard cross-entropy objective, up-weighting the learning signal for sparse speaker-label tokens. In the second stage, we employ a reinforcement-learning scheme based on Group Relative Policy Optimization (GRPO), with a reward function that jointly considers cpCER and SA-CER, to push past the performance plateau of supervised learning. Experiments in a two-speaker setting compare our system against the open-source 3D-Speaker toolkit and the Diar Sortformer model, as well as the proprietary speaker diarization APIs from AssemblyAI and Microsoft Azure. We further conduct ablation studies to validate the training methodology and subsequently extend the experiments to a four-speaker scenario. Results demonstrate that the two-stage approach significantly improves both ASR and speaker-attribution performance in the two-speaker setting, whereas in the four-speaker setting conventional supervised fine-tuning already yields substantial improvements. We also discuss challenges such as resource consumption, input-length limits, and cross-domain adaptation, and propose future enhancements including streaming encoders, curriculum learning, and rejection-sampling strategies. Our study shows that end-to-end ALMs hold great promise for multi-speaker transcription and diarization but require additional technical advances to handle more complex acoustic scenarios.
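To make the two training signals concrete, the sketch below illustrates one plausible reading of them in PyTorch: a token-level cross-entropy in which speaker-label tokens receive an extra weight, and a scalar reward that mixes cpCER and SA-CER for scoring GRPO rollouts. The function names, the weighting factor `speaker_weight`, and the mixing coefficient `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the two training signals described in the abstract.
# speaker_weighted_ce, grpo_reward, speaker_weight, and alpha are assumed
# names/values for illustration, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def speaker_weighted_ce(logits, targets, speaker_token_ids, speaker_weight=3.0):
    """Cross-entropy with up-weighted, sparse speaker-label tokens.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) gold token ids, -100 = ignore
    speaker_token_ids: vocabulary ids of speaker tags (e.g. <spk1>, <spk2>)
    """
    # Per-token loss, keeping the sequence dimension so it can be reweighted.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=-100, reduction="none"
    )
    ids = torch.tensor(speaker_token_ids, device=targets.device)
    is_speaker = torch.isin(targets, ids)
    weights = torch.ones_like(per_token)
    weights[is_speaker] = speaker_weight  # assumed up-weighting factor
    valid = (targets != -100).float()
    return (per_token * weights * valid).sum() / (weights * valid).sum()


def grpo_reward(cp_cer: float, sa_cer: float, alpha: float = 0.5) -> float:
    """Reward for one sampled transcript: lower error rates, higher reward.

    cp_cer / sa_cer are metric values computed externally for the rollout;
    alpha is an assumed mixing coefficient.
    """
    return -(alpha * cp_cer + (1.0 - alpha) * sa_cer)
```

In GRPO, a reward of this form would score each of several sampled transcripts per input, with advantages then computed relative to the group mean, so only the relative ordering of the combined cpCER/SA-CER scores matters.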