Abstract: The demand for multi-speaker speech transcription and speaker diarization in applications such as meeting minutes and customer-service quality inspection continues to grow. Recent advances in multimodal large language models have given rise to Audio–Language Models (ALMs) that can simultaneously interpret audio signals and natural-language prompts within a unified autoregressive decoding framework, making them a natural fit for speaker diarization and offering a fresh approach to end-to-end multi-speaker audio transcription. This paper proposes an end-to-end speaker diarization system based on an ALM and jointly optimizes speech-recognition and speaker-attribution ability via a two-stage training strategy, thereby adapting the general capabilities of ALMs to this downstream task. In the first stage, supervised fine-tuning (SFT) adds a "speaker loss" to the standard cross-entropy objective, up-weighting the learning signal for sparse speaker-label tokens. In the second stage, we employ a reinforcement-learning scheme based on Group Relative Policy Optimization (GRPO), with a reward function that jointly considers cpCER and SA-CER, to push past the performance plateau of supervised learning. Experiments in a two-speaker setting compare our system against the open-source 3D-Speaker toolkit and the Diar Sortformer model, as well as the proprietary speaker diarization APIs from AssemblyAI and Microsoft Azure. We further conduct ablation studies to validate the training methodology and subsequently extend the experiments to a four-speaker scenario. Results demonstrate that the two-stage approach significantly improves both ASR and speaker-attribution performance in the two-speaker setting, whereas in the four-speaker setting conventional supervised fine-tuning already yields substantial improvements. We also discuss challenges such as resource consumption, input-length limits, and cross-domain adaptation, and propose future enhancements including streaming encoders, curriculum learning, and rejection-sampling strategies. Our study shows that end-to-end ALMs hold great promise for multi-speaker transcription and diarization but require additional technical advances to handle more complex acoustic scenarios.
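To make the two training signals concrete, the sketch below illustrates one plausible reading of them in PyTorch: a token-level cross-entropy in which speaker-label tokens receive an extra weight, and a scalar reward that mixes cpCER and SA-CER for scoring GRPO rollouts. The function names, the weighting factor `speaker_weight`, and the mixing coefficient `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the two training signals described in the abstract.
# speaker_weighted_ce, grpo_reward, speaker_weight, and alpha are assumed
# names/values for illustration, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def speaker_weighted_ce(logits, targets, speaker_token_ids, speaker_weight=3.0):
    """Cross-entropy with up-weighted, sparse speaker-label tokens.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) gold token ids, -100 = ignore
    speaker_token_ids: vocabulary ids of speaker tags (e.g. <spk1>, <spk2>)
    """
    # Per-token loss, keeping the sequence dimension so it can be reweighted.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=-100, reduction="none"
    )
    ids = torch.tensor(speaker_token_ids, device=targets.device)
    is_speaker = torch.isin(targets, ids)
    weights = torch.ones_like(per_token)
    weights[is_speaker] = speaker_weight  # assumed up-weighting factor
    valid = (targets != -100).float()
    return (per_token * weights * valid).sum() / (weights * valid).sum()


def grpo_reward(cp_cer: float, sa_cer: float, alpha: float = 0.5) -> float:
    """Reward for one sampled transcript: lower error rates, higher reward.

    cp_cer / sa_cer are metric values computed externally for the rollout;
    alpha is an assumed mixing coefficient.
    """
    return -(alpha * cp_cer + (1.0 - alpha) * sa_cer)
```

In GRPO, a reward of this form would score each of several sampled transcripts per input, with advantages then computed relative to the group mean, so only the relative ordering of the combined cpCER/SA-CER scores matters.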