High-precision Symbolic Music Understanding Incorporating Structured Representation of Music Knowledge

CLC Number: TP37

    Abstract:

    Symbolic music understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes such as melody, dynamics, compositional style, emotion, and genre from symbolic representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, discarding the inherent multi-dimensional hierarchical information. (2) Lack of music-theory integration: purely data-driven sequence models struggle to incorporate structured music-theory knowledge, limiting deep semantic understanding of music. To address these issues, this study proposes CNN-Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations of both music theory and musical sequences from domain knowledge. Second, a complementary music-feature extraction module employs a convolutional neural network (CNN) to capture deep local features from the structured musical-knowledge representations, while a Transformer encoder with self-attention extracts deep semantic features from the musical sequences. Finally, a music-knowledge adaptive-enhancement feature-fusion module dynamically integrates the CNN's deep musical-knowledge features with the Transformer's deep semantic features through an efficient cross-attention mechanism, enhancing contextual sequence understanding and representation learning.
Comparative experiments conducted on six public symbolic-music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) demonstrate that CNN-Midiformer surpasses state-of-the-art methods across five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, achieving a precision gain of 0.21–7.14 percentage points over baseline models.
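The fusion step described in the abstract, in which Transformer sequence features attend to CNN-extracted knowledge features, can be sketched as follows. This is a minimal single-head, pure-Python illustration under our own naming assumptions; the paper's actual module is learned, multi-dimensional, and likely multi-head, so treat this only as the scaled dot-product cross-attention pattern it is built on.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), plain nested lists.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def cross_attention(seq_feats, knowledge_feats):
    """Fuse sequence features (queries, e.g. Transformer outputs) with
    knowledge features (keys/values, e.g. CNN outputs) via scaled
    dot-product cross-attention. Projection matrices are omitted."""
    d = len(seq_feats[0])
    scores = matmul(seq_feats, transpose(knowledge_feats))   # n_q x n_k
    scale = math.sqrt(d)
    weights = [softmax([s / scale for s in row]) for row in scores]
    # Each fused row is a convex combination of knowledge features.
    return matmul(weights, knowledge_feats)                  # n_q x d
```

For example, a single query `[1.0, 0.0]` attending over the two knowledge vectors `[1.0, 0.0]` and `[0.0, 1.0]` yields a fused vector whose components sum to 1 and that is weighted toward the first (more similar) knowledge vector.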

Citation:

黄恒焱, 邹逸, 时乐轩, 程皓楠, 叶龙. High-precision Symbolic Music Understanding Incorporating Structured Representation of Music Knowledge. Journal of Software (软件学报), 2026, 37(5): 1887–1902

History
  • Received: May 26, 2025
  • Revised: July 11, 2025
  • Online: September 23, 2025
  • Published: May 06, 2026
Copyright: Institute of Software, Chinese Academy of Sciences. Beijing ICP No. 05046678-4