Abstract: Symbolic Music Understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes—such as melody, dynamics, compositional style, emotion, and genre—from symbolic representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: current methods typically flatten complex musical structures into linear symbolic sequences, overlooking the inherent multi-dimensional hierarchical information; (2) Lack of music theory integration: purely data-driven sequence models struggle to incorporate structured knowledge of music theory, limiting deep semantic understanding of music. To address these issues, we propose CNN-Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations of music theory and musical sequences based on domain knowledge. Second, a complementary music feature extraction module employs a Convolutional Neural Network (CNN) to capture deep local features from the structured musical knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from the musical sequences. Finally, a music knowledge adaptive enhancement feature fusion module dynamically integrates the deep musical knowledge features extracted by the CNN with the deep semantic features of the Transformer via an efficient cross-attention mechanism, thereby enhancing contextual sequence understanding and representation learning.
Comparative experiments conducted on six public symbolic music datasets—Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL—demonstrate that CNN-Midiformer surpasses state-of-the-art methods across five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, achieving an average accuracy gain of 0.21%-7.14% over baseline models.
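The cross-attention fusion described in the abstract can be illustrated with a minimal single-head sketch, in which Transformer sequence features act as queries and CNN knowledge features supply keys and values. This is an assumed simplification for illustration only (function names, shapes, and the single-head form are not taken from the paper):

```python
import math

def matmul(A, B):
    # Naive matrix multiply: A is m*k, B is k*n -> m*n.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative).

    queries: T*d sequence features (e.g. Transformer encoder outputs)
    keys, values: S*d knowledge features (e.g. CNN outputs)
    Returns T*d fused features: each query row becomes a convex
    combination of the value rows, weighted by query-key similarity.
    """
    d = len(queries[0])
    # Scaled dot-product scores, shape T*S: Q @ K^T / sqrt(d)
    keys_t = [list(col) for col in zip(*keys)]
    scores = [[s / math.sqrt(d) for s in row]
              for row in matmul(queries, keys_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, values)
```

In a multi-head, learned-projection form this reduces to the standard cross-attention used in encoder-decoder Transformers; the sketch above omits projections and heads to keep the weighting mechanism visible.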