High-precision Symbolic Music Understanding Incorporating Structured Representations of Music Knowledge

    Abstract:

    Symbolic Music Understanding (SMU) is a crucial task in multimedia content understanding. It aims to extract multidimensional musical attributes, such as melody, dynamics, compositional style, emotion, and genre, from symbolic representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, overlooking their inherent multidimensional hierarchical information; (2) Lack of music theory integration: purely data-driven sequence models struggle to incorporate structured knowledge of music theory, which limits deep semantic understanding of music. To address these issues, we propose CNN Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations of music theory and musical sequences based on domain knowledge. Second, a complementary music feature extraction module uses a Convolutional Neural Network (CNN) to capture deep local features from the structured musical knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from the musical sequences. Finally, a music knowledge adaptive enhancement feature fusion module dynamically integrates the knowledge features extracted by the CNN with the semantic features from the Transformer through an efficient cross-attention mechanism, thereby enhancing contextual sequence understanding and representation learning.
    Comparative experiments on six public symbolic music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) show that CNN Midiformer surpasses state-of-the-art methods on five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, with average accuracy gains of 0.21% to 7.14% over baseline models.
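
The abstract names cross-attention as the fusion mechanism but gives no equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that sequence features (a stand-in for the Transformer encoder outputs) act as queries, that knowledge features (a stand-in for the CNN outputs) act as keys and values, and that the attended context is added back through a residual connection. The function names, the absence of learned projection matrices, and the toy vectors are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(seq_feats, know_feats):
    """Fuse knowledge features into sequence features (hypothetical helper).

    seq_feats:  list of d-dim vectors, used as queries
                (assumed stand-in for Transformer encoder outputs).
    know_feats: list of d-dim vectors, used as keys and values
                (assumed stand-in for CNN outputs).
    Returns one fused d-dim vector per sequence position.
    """
    dim = len(seq_feats[0])
    scale = math.sqrt(dim)
    fused = []
    for query in seq_feats:
        # Scaled dot-product score between this position and every knowledge vector.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in know_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the knowledge vectors.
        context = [sum(w * value[j] for w, value in zip(weights, know_feats))
                   for j in range(dim)]
        # Residual connection (an assumption): keep the original sequence feature.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy inputs: three sequence positions, two knowledge vectors, d = 2.
seq_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
know_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention_fuse(seq_feats, know_feats)
```

A real model would normally apply learned query, key, and value projections and use several attention heads; both are omitted here to keep the data flow visible.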

Get Citation

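Xiang Jianwen, Chen Ting, Yang Min, Zhou Junwei. Preface to the special topic on dependability and security of emerging software and systems. Journal of Software, 2025, 36(7): 2927-2928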

History
  • Received: December 12, 2024
  • Revised: August 30, 2025
  • Adopted:
  • Online: May 14, 2025
  • Published: July 06, 2025
Copyright: Institute of Software, Chinese Academy of Sciences. Beijing ICP No. 05046678-4
Address: 4# South Fourth Street, Zhong Guan Cun, Beijing 100190, Postal Code: 100190
Phone: 010-62562563 Fax: 010-62562533 Email: jos@iscas.ac.cn