High-precision Symbolic Music Understanding Incorporating Structured Representations of Music Knowledge

Fund: National Natural Science Foundation of China (62201524)


    Abstract:

    Symbolic Music Understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes, such as melody, dynamics, compositional style, emotion, and genre, from symbolic music representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, overlooking the inherent multi-dimensional hierarchical information; (2) Lack of music theory integration: purely data-driven sequence models struggle to incorporate structured knowledge of music theory, limiting deep semantic understanding of music. To address these issues, we propose CNN-Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations of musical knowledge and musical sequences based on music theory. Second, a complementary music feature extraction module employs Convolutional Neural Networks (CNN) to capture deep local features from the structured knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from the musical sequences. Finally, a feature fusion module with adaptive music-knowledge enhancement dynamically integrates the CNN-extracted knowledge features with the Transformer's semantic features via an efficient cross-attention mechanism, thereby enhancing contextual sequence understanding and representation learning. Comparative experiments on six public symbolic music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) demonstrate that CNN-Midiformer surpasses state-of-the-art methods on five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, achieving an average accuracy gain of 0.21% to 7.14% over baseline models.
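The fusion step described in the abstract, where sequence features attend over knowledge features via cross-attention, can be sketched roughly as follows. This is a minimal single-head scaled dot-product cross-attention in plain Python for illustration only, not the authors' implementation: the function names (`cross_attention`, `fuse`), the residual fusion, and the absence of learned projection matrices are all simplifying assumptions on our part.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n x d matrix (here: Transformer-side sequence features)
    keys, values: m x d matrices (here: CNN-side knowledge features)
    Returns an n x d matrix of attended knowledge features, one row
    per query token. (Learned Q/K/V projections are omitted.)
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    out = []
    for q in queries:
        # Similarity of this token to every knowledge vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(seq_feats, knowledge_feats):
    """Residual fusion: add attended knowledge features to sequence features."""
    attended = cross_attention(seq_feats, knowledge_feats, knowledge_feats)
    return [[s + a for s, a in zip(srow, arow)]
            for srow, arow in zip(seq_feats, attended)]
```

For example, `fuse([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])` enriches each of the two token vectors with a weighted summary of the knowledge vectors; because the attention weights form a convex combination, the attended part always lies inside the span of the knowledge features.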

Cite this article:

黄恒焱, 邹逸, 时乐轩, 程皓楠, 叶龙. High-precision Symbolic Music Understanding Incorporating Structured Representations of Music Knowledge. Journal of Software, 2026, 37(5).

History
  • Received: 2025-05-26
  • Revised: 2025-08-30
  • Published online: 2025-09-23
Copyright: Institute of Software, Chinese Academy of Sciences (京ICP备05046678号-3)
Address: No. 4, South Fourth Street, Zhongguancun, Haidian District, Beijing 100190
Tel: 010-62562563  Fax: 010-62562533  Email: jos@iscas.ac.cn