HUANG Heng-Yan, ZOU Yi, SHI Le-Xuan, CHENG Hao-Nan, YE Long
2026, 37(5):1887-1902. DOI: 10.13328/j.cnki.jos.007542
Abstract:Symbolic music understanding (SMU) is a crucial task in multimedia content understanding, aiming to extract multi-dimensional musical attributes such as melody, dynamics, compositional style, emotion, and genre from symbolic representations. Although existing approaches have substantially advanced dependency modeling in musical sequences, two critical challenges remain: (1) Simplified representation: Current methods typically flatten complex musical structures into linear symbolic sequences, overlooking the inherent multi-dimensional hierarchical information; (2) Lack of music-theory integration: Purely data-driven sequence models struggle to incorporate structured music-theory knowledge, limiting deep semantic understanding of music. To address these issues, this study proposes CNN-Midiformer, a high-precision symbolic music understanding model that integrates structured representations of musical knowledge. First, the model constructs structured representations for music theory and musical sequences based on domain knowledge. Second, a complementary music-feature extraction module is devised to employ convolutional neural networks (CNN) for capturing deep local features from structured musical-knowledge representations, while a Transformer encoder with self-attention captures deep semantic features from musical sequences. Finally, a music-knowledge adaptive-enhancement feature-fusion module dynamically integrates the deep musical-knowledge features extracted by CNN with the deep semantic features of the Transformer via an efficient cross-attention mechanism, thus enhancing contextual sequence understanding and representation learning. 
Comparative experiments conducted on six public symbolic-music datasets (Pop1K7, ASAP, POP909, Pianist8, EMOPIA, and ADL) demonstrate that CNN-Midiformer surpasses state-of-the-art methods across five benchmark downstream tasks: melody recognition, dynamics prediction, composer classification, emotion classification, and genre classification, achieving a precision gain of 0.21–7.14 percentage points over baseline models.
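The cross-attention fusion step described above can be illustrated with a much-simplified sketch. This is plain single-head scaled dot-product attention with a residual connection and no learned projection matrices; the paper's actual module also includes trainable weights and the CNN and Transformer feature extractors, which are omitted here:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(seq_feats, know_feats):
    """Fuse sequence features (queries) with knowledge features (keys/values)
    via scaled dot-product cross-attention. Each feature is a plain list."""
    d = len(seq_feats[0])
    fused = []
    for q in seq_feats:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in know_feats]
        weights = softmax(scores)
        # attention-weighted sum of the knowledge features
        ctx = [sum(w * v[j] for w, v in zip(weights, know_feats))
               for j in range(d)]
        # residual connection: sequence token plus attended knowledge context
        fused.append([qi + ci for qi, ci in zip(q, ctx)])
    return fused
```

Each output token keeps its sequence semantics while absorbing the knowledge features it attends to most strongly.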
WEI Shu-Yu, QIU De-Lai, LIU Sheng-Ping, SANG Ji-Tao
2026, 37(5):1903-1918. DOI: 10.13328/j.cnki.jos.007541
Abstract:The demand for multi-speaker speech transcription and speaker attribution in applications such as meeting minutes and customer service quality inspection is increasing. Recent advances in multimodal large language models have given rise to audio-language models (ALMs) that can simultaneously interpret audio signals and natural-language prompts within a unified autoregressive decoding framework, making them a natural fit for the speaker diarization task and offering a fresh approach to end-to-end multi-speaker audio transcription. This study proposes an end-to-end speaker diarization system based on an ALM and achieves synergistic optimization of speech-recognition capability and speaker-attribution capability via a two-stage training strategy, thus generalizing the capability of ALMs to specific downstream tasks. In the first stage, supervised fine-tuning (SFT) introduces a "speaker loss" into the standard cross-entropy objective to weight and strengthen the learning signal for sparse speaker-label tokens. In the second stage, a reinforcement-learning scheme based on group relative policy optimization (GRPO) is employed, with a reward function that jointly considers cpCER and SA-CER to break through the performance bottleneck of supervised learning. Experiments in a two-speaker setting compare the proposed system with the open-source 3D-Speaker toolkit and the Diar Sortformer model, as well as the proprietary speaker diarization APIs from AssemblyAI and Microsoft Azure. Ablation studies are further conducted to validate the training methodology, and experiments are subsequently extended to a four-speaker scenario. Results demonstrate that the two-stage approach significantly improves both ASR and speaker-attribution performance in the two-speaker environment, whereas in the four-speaker setting, conventional SFT already yields substantial improvements. 
Challenges such as resource consumption, input-length limitations, and cross-domain adaptation are also discussed, and future enhancements are proposed, including streaming audio encoders, curriculum learning, and rejection-sampling strategies. This study shows that ALMs hold great promise for multi-speaker diarization tasks but require additional technical advances to handle more complex acoustic scenarios.
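The abstract does not give the exact reward formulation; one plausible shape, a weighted combination of the two error rates turned into a reward plus the group-relative normalization that gives GRPO its name, might look like the following sketch (the weight alpha is a hypothetical parameter, not from the paper):

```python
def reward(cp_cer, sa_cer, alpha=0.5):
    """Joint reward: lower cpCER (transcription errors) and SA-CER (speaker
    attribution errors) both raise the reward. alpha is a hypothetical weight."""
    return 1.0 - (alpha * cp_cer + (1 - alpha) * sa_cer)

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response is scored relative to the
    mean and std of its own sampling group, so no learned value network is
    needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]
```

Responses with above-average joint quality in a group receive positive advantages and are reinforced.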
TANG Wen-Neng, LI Yao-Chen, GAO Sheng-Jing, GAO Cong, PENG Yue-Han, LIU Yue-Hu
2026, 37(5):1919-1935. DOI: 10.13328/j.cnki.jos.007545
Abstract:The latest advancements in intelligent driving technology are primarily reflected in the environmental perception layer, where sensor data fusion is critical for enhancing system performance. Although point cloud data provides accurate 3D spatial descriptions, it suffers from unorderedness and sparsity. Image data, with its regular and dense distribution, can compensate for the limitations of single-modality detection when fused with point clouds. However, existing fusion algorithms face challenges such as limited semantic information and insufficient modal interaction, leaving room for improvement in high-precision multi-modal 3D object detection. To address this issue, this study proposes an innovative multi-sensor fusion method: generating pseudo-point clouds via depth completion from RGB images and combining them with real point clouds to identify regions of interest. It introduces three key improvements: (1) deformable attention-based multi-layer feature extraction that adaptively expands the receptive field to target regions; (2) 2D sparse convolution for efficient pseudo-point cloud feature extraction leveraging their regular distribution in the image domain; and (3) a two-stage feedback mechanism employing multi-modal cross-attention at the feature level to solve data alignment issues and an efficient fusion strategy at the decision level for interactive training across different stages. These innovations effectively resolve the trade-off between pseudo-point cloud accuracy and computational load while significantly enhancing both feature extraction efficiency and detection accuracy. Experimental results on the KITTI dataset demonstrate the superior performance of the proposed method in 3D traffic object detection, validating its effectiveness and offering a new approach for multi-modal fusion in autonomous driving environmental perception.
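The pseudo-point-cloud idea rests on standard pinhole back-projection: once depth completion assigns a depth to each RGB pixel, the pixel can be lifted into 3D. A minimal sketch follows, assuming the camera intrinsics fx, fy, cx, cy are known; the paper's depth-completion network itself is not shown:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift an image pixel (u, v) with completed depth into a 3D pseudo point
    using the pinhole camera model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def pseudo_point_cloud(depth_map, fx, fy, cx, cy):
    """Turn a completed depth map (here a dict pixel -> depth for simplicity)
    into pseudo points; pixels without a valid depth are skipped."""
    return [backproject(u, v, z, fx, fy, cx, cy)
            for (u, v), z in depth_map.items() if z > 0]
```

Because the pseudo points inherit the image's regular grid layout, 2D sparse convolution can process them efficiently, which is the property the second improvement exploits.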
LI Ze-Chao, JIN Lu, WANG Hao-Hua, TANG Jin-Hui
2026, 37(5):1936-1949. DOI: 10.13328/j.cnki.jos.007543
Abstract:In large-scale image retrieval tasks, image hashing typically relies on a large amount of manually annotated data to train deep hashing models. However, the high cost of manual annotation limits its practical application. To alleviate this dependency, existing studies attempt to use texts provided by web users as weak supervision to guide the model in mining semantic information associated with the texts from images. Nevertheless, the inherent noise in user tags often limits model performance. Multimodal pre-trained models such as CLIP exhibit strong image-text alignment capabilities. Inspired by this, this study utilizes CLIP to optimize user tags and proposes a weakly supervised hashing method called CLIP-guided tag refinement hashing (CTRH). The proposed method consists of three key components: a tag replacement module, a tag weighting module, and a tag-balanced loss function. The tag replacement module fine-tunes CLIP to mine potential image-relevant tags. The tag weighting module performs cross-modal global semantic interaction between the optimized text and images to learn discriminative joint representations. To address the imbalance of user tags, a tag-balanced loss is designed, which dynamically reweights hard samples to enhance the model’s representation learning. Experiments on two general datasets, MirFlickr and NUS-WIDE, verify the effectiveness of the proposed method compared to state-of-the-art approaches.
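The abstract does not give the loss formula; a focal-style reweighting is one common way to realize "dynamically reweighting hard samples", sketched here under that assumption (the paper's actual tag-balanced loss may take a different form):

```python
import math

def tag_balanced_loss(probs, labels, gamma=2.0):
    """Focal-style reweighting as a stand-in for the paper's tag-balanced
    loss: hard samples, whose predicted probability is far from the label,
    receive larger weights, counteracting tag imbalance."""
    total = 0.0
    for p, y in zip(probs, labels):
        pt = p if y == 1 else 1.0 - p    # probability assigned to the true tag
        weight = (1.0 - pt) ** gamma     # hard samples get larger weights
        total += -weight * math.log(max(pt, 1e-12))
    return total / len(probs)
```

A confidently correct prediction contributes almost nothing, so gradient signal concentrates on the rare or noisy tags the model still gets wrong.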
FANG Cheng-Yang, ZHU Chang, JIANG Wen-Hui, FANG Yu-Ming, YAN Jie-Bin
2026, 37(5):1950-1963. DOI: 10.13328/j.cnki.jos.007544
Abstract:In recent years, training-free video question answering (VQA) models have become a research hotspot for lightweight multimodal reasoning due to their plug-and-play nature. However, although high frame rate videos contain rich semantic information, their inherent redundancy leads to a balance problem between information density and computational efficiency in the temporal dimension, with traditional sampling strategies being susceptible to noise frame interference. Furthermore, in complex dynamic scenes, background clutter and local body parts, as non-target regions, introduce spatial feature bias, significantly affecting the reliability of answer generation. To address these two issues, this study proposes a dual adaptive redundancy elimination (DARE-VQA) framework, which aims to systematically improve the accuracy of video semantic understanding and answer quality in the training-free paradigm through a spatiotemporal redundancy collaborative optimization mechanism. First, a dual-relation temporal sampling method is proposed, based on text-visual alignment and inter-frame semantic consistency. This method selects key frame sequences through bidirectional interactive reasoning, while simultaneously eliminating redundant frames that conflict with the text context. Next, a dynamic spatial sampling method is introduced, which extracts the largest connected semantic region from candidate regions in the prompt-related heatmap, aiming to eliminate scattered non-target regions and enhance the compactness of spatial feature representations. Experiments are conducted on widely used datasets, including MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA. The proposed method is evaluated in a zero-shot setting against 14 state-of-the-art models. The results show that the proposed approach achieves competitive performance with significantly fewer video feature sequences. 
Visual analysis confirms that the proposed method exhibits more accurate spatiotemporal localization abilities in challenging tasks, such as multi-person interactions and fine-grained action recognition in complex scenes. The proposed DARE-VQA framework achieves significant improvements in video question answering performance by collaboratively optimizing spatiotemporal redundancy. It can generate accurate and high-quality answers within the training-free paradigm, demonstrating its potential in multimodal video understanding.
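The dual-relation temporal sampling can be approximated by a simple greedy rule: rank frames by text-visual alignment, then drop frames that are near-duplicates of already-selected ones. A toy sketch with cosine similarity over precomputed frame and text embeddings follows; the real method uses bidirectional interactive reasoning rather than this heuristic, and the redundancy threshold here is a hypothetical parameter:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def sample_key_frames(frame_feats, text_feat, k, redundancy=0.95):
    """Greedy key-frame selection: rank frames by text-visual alignment and
    skip frames nearly identical (cosine >= redundancy) to one already kept."""
    order = sorted(range(len(frame_feats)),
                   key=lambda i: cosine(frame_feats[i], text_feat),
                   reverse=True)
    kept = []
    for i in order:
        if all(cosine(frame_feats[i], frame_feats[j]) < redundancy
               for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return sorted(kept)
```

The returned indices preserve temporal order, so downstream reasoning still sees a coherent frame sequence.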
LAI Pei-Yuan, LU Yi-Hong, LIAO De-Zhang, WANG Chang-Dong, DAI Qing-Yun, LAI Jian-Huang
2026, 37(5):1964-1981. DOI: 10.13328/j.cnki.jos.007537
Abstract:The transformation of scientific and technological innovations into practical applications through patent recommendation is of great significance for realizing the economic value of science and technology and promoting socio-economic development. However, existing patent recommendation algorithms often overlook the multimodal information embedded in patents, leading to recommendation results that fail to comprehensively reflect the value and application potential of patents. Consequently, this affects the accuracy of matching patents with the needs of companies. To address this issue, this study proposes a novel patent recommendation algorithm based on a multimodal heterogeneous graph network (MHGN). The proposed method first utilizes pre-trained models to initialize the representation of multimodal information, including the textual and image attributes of patents as well as company information. Then, a graph attention network is employed to learn the preference representations of companies across different modalities. Based on this, the relationship weights of company-patent interactions are further learned based on the similarity of preference representations, and a graph convolutional network is designed to learn the node preference representations of companies and patents. Finally, to better integrate the multimodal information, an adaptation vector is introduced and an attention mechanism is used to fuse the node preference representations with multimodal representations. In addition, four real-world patent datasets from university-to-company transfers are constructed, and experiments are conducted comparing the proposed model with seven advanced baseline models. The results demonstrate that the proposed model significantly outperforms the baselines across all evaluation metrics. 
Both the datasets and the source code of the proposed model are released, providing robust data and model support for future research in patent recommendation and the transformation of scientific innovations.
LI Qi-Rui, LI Xue-Wei, ZHAO Qi, LI Jie, LI Xi
2026, 37(5):1982-2005. DOI: 10.13328/j.cnki.jos.007539
Abstract:The rapid development of generative technologies has revealed their potential for real-world applications. The core objective of pose-guided person image and video generation is to transform a person from inputs into a specified pose while maintaining a high level of appearance consistency. This technology can be widely applied in various fields such as virtual try-on and fashion, advertising video generation and editing, and multimodal content creation, driving advancements in user experience and technological innovation. However, despite significant progress, the technology still faces multiple challenges, including effective extraction and rearrangement of appearance information during pose transfer, generation of unseen information, consistency preservation, and efficient model training and deployment. Based on the existing challenges, this study provides a detailed analysis of the strategies employed by current mainstream pose-guided generation methods to address these issues, discussing their feasibility and limitations in practical applications. Moreover, it explores the commonly used generative models and pose representation methods in pose-guided generation. It also reviews the datasets, their sizes, characteristics, and evaluation benchmarks used in this field. Furthermore, this study discusses the applications of this technology in virtual try-on, video generation and editing, and multimodal content generation. It highlights the remaining challenges, such as the retention of personalized information, generation in complex scenes, and model efficiency and real-time performance. Finally, this study discusses potential future development trends of pose-guided generation technology, aiming to provide researchers with a systematic summary and reference to promote its application and innovation across industries.
LI Xin-Jin, WANG Wen-Jie, WANG Kai, TUO Hou-Zhen, WANG Shi-Ya, SUN Wei, TAN Xiao-Hui, TIAN Feng
2026, 37(5):2006-2023. DOI: 10.13328/j.cnki.jos.007540
Abstract:Parkinson’s disease (PD) affects nearly 10 million people worldwide and currently has no cure, but evidence-based medicine suggests that training based on sensory cues can slow disease progression. However, most current studies rely on a single modality and lack user perception and feedback. This study proposes an audiovisual multimodal gait training method, which generates and dynamically adjusts multimodal cues based on users’ gait data, to investigate the feasibility of assisting early-stage PD rehabilitation. The method constructs a multimodal cue generation framework that produces visual and auditory cues by calculating gait-cycle and step-height parameters from gait data. An interactive intervention training system is then built to dynamically adjust the audiovisual cues in response to gait changes, realizing an interactive loop between user perception and multimodal cue generation. Finally, 40 patients with early-stage PD (H&Y≤2) are recruited for a clinical experiment. Compared with the control group, the audiovisual synergy group shows the greatest improvement. Compared with baseline, the gait symmetry of the audiovisual synergy group increases by an average of 20.776% (p=0.0001) during training and 21.157% (p=0.0001) after training, and the velocity on the affected side increases by an average of 33.924% (p=0.0001) during training and 36.433% (p<0.0001) after training. The results indicate that audiovisual synergistic cues help patients improve gait performance more quickly and sustainably. The proposed multimodal cueing training method based on gait data provides a new approach for establishing a quantitatively driven precision rehabilitation model and promotes the application and development of multimodal interaction technology in the medical field.
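For reference, gait symmetry is commonly quantified by a symmetry index over matched left/right gait parameters, and a cue's period can track the measured gait cycle. A minimal sketch follows; the specific formulas and cue-adjustment policy used in the paper are not given in the abstract and may differ:

```python
def symmetry_index(left, right):
    """Classic gait symmetry index (%): 0 means perfectly symmetric.
    left/right are matched gait parameters, e.g. mean step time per side."""
    return abs(left - right) / (0.5 * (left + right)) * 100.0

def cue_period(step_times):
    """Set the auditory cue period to the patient's current mean step time,
    so the cue is re-adjusted as the gait changes (a simplified rule)."""
    return sum(step_times) / len(step_times)
```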
LIAO Jia-Jun, DONG Yi-Tao, MAO Jia-Li
2026, 37(5):2024-2042. DOI: 10.13328/j.cnki.jos.007563
Abstract:In the bilateral matching problem under dynamic environments, the mechanism for handling time constraints and multi-objective optimization is one of the important factors affecting matching efficiency. The transport order assignment in online freight platforms serves as a typical instance of such problems. Existing methods exhibit significant limitations in rigid modeling of time constraints and in the trade-off mechanisms for multi-objective conflicts, making it difficult to accurately characterize the behavioral patterns of decision agents near constraint boundaries. To address these issues, this study proposes a time-constraint-aware transport order assignment framework called TB-Match. The framework consists of four collaborative modules: elastic constraint quantification, preference representation learning, dynamic objective trade-off optimization, and policy generation. The core contributions are as follows: (1) a constraint elasticity representation mechanism based on conditional diffusion probabilistic models, which converts deterministic time boundaries into continuous probabilistic distributions through progressive noise diffusion and reverse denoising processes, thus accurately modeling the acceptance probability of decision agents in boundary regions; (2) a hierarchical decision framework integrating dynamic objective trade-off and proximal policy optimization, where the high-level network adaptively adjusts objective weights according to feedback signals, and the low-level network maximizes long-term cumulative rewards under trust region constraints. Experimental results on two large-scale real-world logistics datasets demonstrate that TB-Match achieves a 17.66% relative improvement in matching rate compared with state-of-the-art methods. It also exhibits significant advantages in metrics such as satisfaction, verifying the effectiveness and applicability of the proposed method under complex constraint environments.
QIAN Zhong-Sheng, QIN Lang-Yue, FAN Fu-Yu, FU Ting-Feng
2026, 37(5):2043-2062. DOI: 10.13328/j.cnki.jos.007436
Abstract:Test case prioritization (TCP) has gained significant attention due to its potential to reduce testing costs. Greedy algorithms based on various prioritization strategies are commonly used in TCP. However, most existing greedy algorithm-based TCP techniques rely on a single prioritization strategy and process all test cases simultaneously during each iteration, without considering the relationships between test cases. This results in excessive computational overhead when handling coverage information and performing prioritization, thus reducing overall efficiency. Among single-strategy approaches, the Additional strategy has been extensively studied but remains highly sensitive to random factors. When a tie occurs, test cases are typically selected at random, compromising prioritization effectiveness. To address these issues, a test case prioritization approach based on two-phase grouping (TPG-TCP) is proposed. In the first phase, coarse-grained grouping is conducted by mining hidden relationships among test cases, thus dividing them into a key group and an ordinary group. This lays the groundwork for applying diversity-based strategies in the next phase to enhance prioritization efficiency. In the second phase, fine-grained prioritization of test cases is performed. Key test cases are further subdivided based on the number of iterations. To mitigate the randomness inherent in the Additional strategy, a TP-Additional strategy based on test case potency is introduced to prioritize a portion of the key test cases. Meanwhile, a simple and efficient Total strategy is applied to prioritize the ordinary test cases and remaining key test cases. The results from the Total strategy are appended to those produced by the TP-Additional strategy. This method improves both the effectiveness and efficiency of test case prioritization. 
Experimental results on six datasets, compared with eight existing methods, demonstrate that the proposed method achieves average improvements of 1.29% in APFD and 9.54% in TETC.
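The Total and Additional greedy strategies the paper builds on, and the APFD metric it reports, are standard and can be sketched directly; the paper's TP-Additional tie-breaking by test case potency is not reproduced here (ties are broken by test id instead):

```python
def total_prioritize(coverage):
    """Total strategy: order tests by how many statements each covers,
    ties broken by test id. coverage: dict test -> set of statements."""
    return sorted(coverage, key=lambda t: (-len(coverage[t]), t))

def additional_prioritize(coverage):
    """Additional strategy: repeatedly pick the test covering the most
    not-yet-covered statements; once everything is covered, reset and repeat."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        gains = {t: len(stmts - covered) for t, stmts in remaining.items()}
        best_gain = max(gains.values())
        if best_gain == 0:
            if not covered:               # tests covering nothing: append as-is
                order.extend(sorted(remaining))
                break
            covered = set()               # all statements covered: reset
            continue
        best = min(t for t in gains if gains[t] == best_gain)
        order.append(best)
        covered |= remaining.pop(best)
    return order

def apfd(order, faults):
    """APFD: average percentage of faults detected. faults maps each fault to
    the set of tests exposing it (every fault is assumed detectable)."""
    n, m = len(order), len(faults)
    pos = {t: i + 1 for i, t in enumerate(order)}
    tf = sum(min(pos[t] for t in tests) for tests in faults.values())
    return 1 - tf / (n * m) + 1 / (2 * n)
```

The random tie-breaking the paper criticizes shows up exactly at the `best_gain` ties: whichever test is picked among equals changes the final ordering, which TP-Additional resolves deterministically via potency.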
PAN Wei-Feng, YANG Yan-Wei, YANG Zi-Jiang, JIANG Bo, WANG Jia-Le, YANG Bai-Lin
2026, 37(5):2063-2084. DOI: 10.13328/j.cnki.jos.007453
Abstract:Key classes are a crucial starting point for understanding complex software, contributing to the optimization of documentation and the compression of reverse-engineered class diagrams. Although many effective key class identification methods have been proposed, three major limitations remain: 1) software networks, which are graphs representing software elements and their dependencies, often include elements that are never or rarely executed at runtime; 2) networks constructed through dynamic analysis are frequently incomplete, potentially omitting truly key classes; and 3) most existing approaches consider only the effect of direct coupling between classes, while ignoring the influence of indirect (non-contact) coupling and the diversity of degree distribution among neighboring nodes. To address these issues, a key class identification approach is proposed that integrates dynamic analysis with a gravitational formula. First, a class coupling network (CCN) is constructed using static analysis to represent classes and their coupling relationships. Second, a gravitational entropy (GEN) metric is introduced to quantify class importance by jointly considering direct and indirect couplings in the CCN and the degree-distribution diversity of neighboring nodes. Third, classes are ranked in descending order based on their GEN values to obtain a preliminary ranking. Finally, dynamic analysis is performed to capture actual runtime interactions between classes, which are used to refine the preliminary results. A threshold is applied to filter out non-key classes, producing a final set of candidate key classes. Experimental results on eight open-source Java projects demonstrate that the proposed method significantly outperforms eleven baseline approaches when considering no more than the top 15% (or top 25) of nodes. The integration of dynamic analysis notably improves the performance of the proposed method. 
Moreover, the choice of weighting schemes for coupling types has a minimal impact on performance, and the overall computational efficiency is acceptable.
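The gravity-plus-entropy idea can be sketched on a toy class coupling network. This is a simplified reading: node degrees stand in for class "mass", shortest-path distance within a radius covers indirect (non-contact) coupling, and the Shannon entropy of neighbor degrees models their diversity; the paper's exact GEN definition may differ:

```python
import math
from collections import deque

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src in an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def gravitational_entropy(adj, radius=3):
    """Gravity-formula importance with a neighbor-degree-entropy factor:
    deg(i)*deg(j)/d(i,j)^2 captures direct and indirect coupling, and the
    entropy term captures the diversity of neighboring degrees."""
    deg = {u: len(vs) for u, vs in adj.items()}
    gen = {}
    for i in adj:
        dist = bfs_dist(adj, i)
        gravity = sum(deg[i] * deg[j] / d ** 2
                      for j, d in dist.items() if 0 < d <= radius)
        total = sum(deg[v] for v in adj[i]) or 1
        probs = [deg[v] / total for v in adj[i]]
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        gen[i] = gravity * (1 + entropy)
    return gen
```

Ranking classes by descending GEN gives the preliminary ordering that the dynamic-analysis step then refines.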
LIU Hao, DU Jun-Wei, LI Yu-Ying, FANG Min-Ying, JIA Xue-Hai
2026, 37(5):2085-2102. DOI: 10.13328/j.cnki.jos.007458
Abstract:Bug localization is a critical aspect of software maintenance, and improving the effectiveness and efficiency of automated fault localization has become a central research focus in software engineering. With the surge in open-source software and the increasing demand for software hot updates, automated bug localization focused on change sets has become a key tool for software quality assurance. Traditional bug localization methods based on information retrieval can only represent textual information and fail to fully account for structural and semantic changes within change sets, making them unsuitable for direct application to change set bug localization tasks. Therefore, this study proposes a graph Transformer-based method for change set bug localization, which uses an abstract syntax tree to represent change information and capture code structure changes. The method represents both local and global semantic information of the changed code and bug reports, enabling the matching and localization of bug information within change sets. To validate the proposed method, it is evaluated on bug reports and changes from six groups of bug-inducing change sets. Compared to state-of-the-art models, the proposed method demonstrates improvements of 11.4% and 12.9% in MAP and MRR metrics, respectively, validating its efficacy.
ZHANG Tian-Yi, ZHOU Tong, ZHANG Chen-Xi, PENG Xin
2026, 37(5):2103-2130. DOI: 10.13328/j.cnki.jos.007484
Abstract:Software configuration is a crucial component of software systems and plays an important role in enhancing the diversity and flexibility of software functionalities. As software systems become increasingly complex, the intricate constraint relationships between configuration options present a significant challenge for system administrators. To address this, researchers have proposed various constraint extraction methods based on different data sources and techniques to identify complex relationships between configurations. However, these methods face several limitations, such as limited applicability across multiple programming languages, constrained analysis scale, and a heavy reliance on high-quality annotated data. To overcome these issues, this study proposes LLM-Extractor, a configuration constraint extraction method based on large language models. This method consists of two main components: the construction of a configuration-function association graph and configuration constraint inference based on multi-configuration association subgraphs. In the graph construction phase, LLM-Extractor leverages the powerful text understanding and analysis capabilities of large language models to identify entities related to configurations and software functionalities from configuration documents and extract various types of relationships. In the constraint inference phase, LLM-Extractor searches for multi-configuration association subgraphs on the existing function graph and guides the large language model to infer configuration constraints based on the information within the subgraphs. By inferring constraints based on multi-configuration association subgraphs, LLM-Extractor can extract configuration constraints transmitted through software function states, filling the gap left by existing methods. It is also characterized by its language-agnostic nature and scalability. 
The effectiveness of this approach is evaluated on configuration documents from three open-source software systems, analyzing over 1,400 configuration options. Experimental results show that LLM-Extractor outperforms existing text analysis methods, with a 43.4% improvement in F1 score. Further ablation studies demonstrate the critical positive impact of multi-configuration association subgraphs on the effectiveness of configuration constraint inference.
QU Mu-Zi, KANG Liang-Yi, LIU Jie, WANG Shuai, YE Dan, HUANG Tao
2026, 37(5):2131-2150. DOI: 10.13328/j.cnki.jos.007508
Abstract:With the rapid development of large language model (LLM) technology, many Code LLMs have emerged to support tasks such as code generation, code completion, code testing, and code refactoring. Different models may show significant performance differences on the same task, and the decoding parameters at the inference stage also have an important influence on model performance. This study investigates how to efficiently select the best model and its optimal decoding parameters for a specific code development task. Existing methods generally divide model selection and parameter tuning into two independent stages; because the sampling strategies differ between stages, sample data cannot be shared, and the computational cost of sampling and evaluation is high. Considering that different Code LLMs share the same decoding parameter space, this study proposes using the propensity score matching (PSM) algorithm to perform weighted adjustment and align sample data drawn from different distributions, improving the reuse efficiency of sample data and reducing computational costs. On this basis, this study proposes CodeLLMTuner, a framework for Code LLM selection and decoding parameter tuning based on sample reuse. The framework comprises three stages. In the independent sampling stage, decoding parameter tuning (e.g., Bayesian optimization) is performed on multiple Code LLMs in parallel, and data sampling and evaluation are conducted to collect sample data. In the model selection stage, PSM is adopted to align the sample data of different models, and the model with the best expected performance is selected. In the decoding parameter tuning stage, the sample data of the selected model are reused and decoding parameter tuning continues from them, fully exploring the performance space while significantly reducing sampling costs. 
Experimental results show that in the three tasks of code generation, code summarization, and test case generation, CodeLLMTuner improves performance by 10% to 15% at the same cost compared to the baseline methods, or reduces the cost by more than 20% at the same performance.
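The PSM-based sample reuse can be illustrated in miniature: each evaluated sample receives a propensity score, and samples from one model's tuning run are matched or reweighted to stand in for another's. A toy sketch, assuming propensity scores are already estimated (the paper would fit them from the decoding-parameter distributions of the two runs):

```python
def match_samples(source, target, scores):
    """1-nearest-neighbor propensity score matching: for each target sample,
    reuse the source sample with the closest propensity score. scores maps a
    sample (e.g. a hashable decoding-parameter tuple) to its score."""
    matched = []
    for t in target:
        best = min(source, key=lambda s: abs(scores[s] - scores[t]))
        matched.append((t, best))
    return matched

def inverse_odds_weights(source, scores):
    """Alternative weighting view: weight each source sample by e/(1-e) so
    the reweighted source distribution resembles the target distribution."""
    return {s: scores[s] / (1.0 - scores[s]) for s in source}
```

Either view lets the decoding parameter tuning stage continue from already-collected evaluations instead of sampling from scratch.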
LIU Shu-Ning, WU Yi-Jian, SONG Xue-Zhi, CHEN Bi-Huan, PENG Xin, ZHAO Wen-Yun
2026, 37(5):2151-2166. DOI: 10.13328/j.cnki.jos.007510
Abstract:In modern software development, frequent code commits and updates have become the norm, which accelerates feature implementation but may introduce new defects, thus threatening software stability and reliability. Once a defect causes program errors or failures, the development team should take quick action to isolate the defect to ensure the continuous operation of the system. Defect isolation is a key technique for rapidly locating the problem and restoring system stability. However, the traditional delta debugging (DD) methods rely on numerous testing attempts, resulting in significant performance bottlenecks under a large change set. Additionally, they fail to effectively utilize the semantics of code changes, making it difficult to accurately locate defect-related code changes. This study proposes a defect isolation method based on code change semantic decomposition—DISAC. The method decomposes composite commits introduced by the defect into atomic commits with single functional semantics. It then models the sequential dependency between commits to ensure that the dependency chain will not be broken during the isolation process. Compared to the traditional DD methods, DISAC not only returns the smallest functional semantic changes but also preserves necessary context and dependency information, thereby providing developers with more complete and accurate support for defect repair. Experimental results show that compared to the DD method, DISAC significantly improves defect isolation efficiency and accuracy. Specifically, the isolation efficiency is increased by 633.65% on the Defects4J dataset and by 733.75% on the regression defect set. Additionally, when DISAC is combined with DD, the isolation reduction rate improves by 2.36% and 8.66% respectively, significantly enhancing isolation effectiveness. User experiments show that DISAC increases root cause determination efficiency by approximately 59.90% and improves accuracy by 12%. 
These results demonstrate that DISAC not only improves defect isolation accuracy but also reduces unnecessary change combination attempts, thus showing higher efficiency and stability in defect isolation tasks involving complex code commits.
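For context, the classic delta debugging (ddmin) baseline that DISAC improves upon can be sketched as follows; DISAC replaces these raw change subsets with semantically atomic commits and dependency-ordered chains, which this sketch does not model:

```python
def ddmin(changes, fails):
    """Classic delta debugging: shrink a failure-inducing change set to a
    small subset. fails(subset) returns True if the bug still reproduces."""
    n = 2  # current granularity: number of chunks to split into
    while len(changes) >= 2:
        chunk = max(1, len(changes) // n)
        subsets = [changes[i:i + chunk] for i in range(0, len(changes), chunk)]
        reduced = False
        for sub in subsets:
            complement = [c for c in changes if c not in sub]
            if fails(sub):                        # a subset alone reproduces it
                changes, n, reduced = sub, 2, True
                break
            if complement and fails(complement):  # the complement reproduces it
                changes, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(changes):                 # already at finest granularity
                break
            n = min(len(changes), n * 2)
    return changes
```

Each `fails` call is a full build-and-test attempt, which is exactly the cost DISAC's semantic decomposition cuts down.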
TANG Rui-Ze , HUANG Yu , OUYANG Ling-Zhi , CHENG Qian , ZHANG Yu-Qi , MA Xiao-Xing
2026, 37(5):2167-2201. DOI: 10.13328/j.cnki.jos.007580
Abstract:Distributed systems serve as the core of modern computing infrastructure, making their correctness essential. However, the high nondeterminism of the computing environment of distributed systems, combined with the complexity of their design and implementation, makes correctness verification a significant challenge. Distributed system model checking (DMCK) enables the discovery of deep bugs, the deterministic reproduction of bugs in real systems, and the verification of repair correctness through exhaustive code-level state-space exploration, thereby addressing the typical problems of distributed systems: bugs that are difficult to discover, diagnose, and repair. This study provides a systematic summary of research progress in DMCK. Centering on the trade-off between “state explosion” and “manual effort”, it divides the development of DMCK into three stages. The first stage focuses on the deterministic simulation execution and state-space exploration technologies that make DMCK effective. The second stage introduces a small amount of manual modeling to leverage system semantics for alleviating state explosion, and the third stage aims to enhance the interaction between the model layer and the code layer to improve code-level model-checking efficiency. Finally, based on this summary of existing work, the study discusses the current limitations of DMCK and promising future development directions.
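At its core, code-level model checking exhaustively explores the nondeterministic orders in which events can be delivered. The toy sketch below (all names hypothetical) enumerates every delivery order of a small event set and flags orders that violate a correctness check — the brute-force loop whose “state explosion” the surveyed reduction techniques exist to tame:

```python
from itertools import permutations

def explore(initial_state, events, apply_event, check):
    """Enumerate every delivery order of `events` and collect the orders
    whose final state violates `check` — exhaustive state-space
    exploration with none of the reductions real DMCK tools add."""
    violations = []
    for order in permutations(events):
        state = initial_state()
        for e in order:
            state = apply_event(state, e)
        if not check(state):
            violations.append(order)
    return violations

# Toy system: a register that blindly applies writes in delivery order.
# The intended final value is 2, but one interleaving delivers w2 first.
def initial():
    return {"x": 0}

def apply_write(state, event):
    name, value = event
    return {"x": value}

bad_orders = explore(initial, [("w1", 1), ("w2", 2)], apply_write,
                     lambda s: s["x"] == 2)
```

Even this two-event example has 2! schedules; real systems multiply message orders, crashes, and timeouts, which is exactly the explosion the three stages above attack from different angles.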
WANG Xie-Yang , CHAO Cheng , JIN Xin , XU Jian-Qiu , GAO Yun-Jun
2026, 37(5):2202-2234. DOI: 10.13328/j.cnki.jos.007483
Abstract:The abundance of sources, ease of acquisition, and frequent movement of moving objects have led to exponential growth in data volume. The growing need for efficient management of moving object data has made indexing and querying such data a pressing issue. Traditional moving object indexes, based on spatial partitioning, can effectively handle changes in the spatial position and temporal dynamics of objects. However, because the dynamic nature of moving objects requires frequent index updates, maintaining these indexes becomes costly with large datasets. Learned indexes, as an emerging indexing technique, have the potential to improve query efficiency and reduce storage costs by leveraging machine learning methods. Nevertheless, learned indexes are not well-suited for data with multidimensional characteristics. To address this limitation, this study proposes a learned index based on a non-uniform grid code algorithm (NUGC_LI). It employs a recursive hierarchical model structure similar to the B+-tree, divided into root, internal, and leaf nodes. The learned index uses a multi-phase linear model to adapt to the flexibly partitioned data distribution, placing gapped arrays and node key ranges in the leaf nodes to improve node update and query efficiency. Meanwhile, B+-tree, RMI, ALEX, NUGC_LI, 3D R-tree, and TB-tree indexes are constructed for real taxi trajectories, simulated train trajectories, and randomly generated trajectory datasets for comparison. The number of trajectory points in the real, simulated, and random datasets is approximately 917000, 51544, and 5222752, respectively. Through comparative experiments and scalability tests, NUGC_LI reduces index construction time by approximately 91.45%, 89.63%, 90.38%, 87.46%, and 13.71% compared to TB-tree, 3D R-tree, B+-tree, RMI, and ALEX, respectively. For update operations, update time is reduced by at least 93.76%.
Range queries, nearest neighbor queries, and similar trajectory queries based on NUGC_LI show significant advantages under large-scale data conditions, with query times reduced by at least 8.74%, 30%, and 16.07% compared to ALEX; 29.38%, 77.44%, and 25.24% compared to RMI; 52.72%, 92.44%, and 70.5% compared to B+-tree; 53.09%, 91.2%, and 67.58% compared to 3D R-tree; and 52.67%, 90.43%, and 67.47% compared to TB-tree. The NUGC_LI index not only demonstrates high scalability under multi-task loads but also achieves significant performance improvements in construction, updates, and query operations.
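The core idea behind any learned index — a model predicts a key's position and a bounded local search corrects it — can be sketched with a single linear model. NUGC_LI layers such models into a B+-tree-like hierarchy with gapped leaf arrays; this one-node version is only an illustration, and all names in it are assumptions:

```python
import bisect

class LinearLearnedIndex:
    """Minimal one-node learned index: a least-squares linear model
    predicts where a key sits in a sorted array, and a bounded local
    search corrects the prediction."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position = a*key + b.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        self.a = cov / var
        self.b = mean_p - self.a * mean_k
        # Maximum prediction error, fixed at build time.
        self.err = max(abs(self._predict(k) - p) for p, k in enumerate(self.keys))

    def _predict(self, key):
        return min(max(int(self.a * key + self.b), 0), len(self.keys) - 1)

    def lookup(self, key):
        p = self._predict(key)
        lo, hi = max(0, p - self.err), min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LinearLearnedIndex([2 * i for i in range(100)])
```

Because the maximum error is fixed at build time, each lookup scans at most 2·err+1 slots, which is where learned indexes gain over a full binary search on well-behaved key distributions.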
CHEN Di , YUAN Ye , PAN Ya-Ni , WANG Guo-Ren
2026, 37(5):2235-2256. DOI: 10.13328/j.cnki.jos.007493
Abstract:Graph data can represent a wide range of real-world application scenarios, and query processing over graphs plays a crucial role in various tasks, such as reachability, shortest path, keyword search, graph pattern matching, PageRank, SimRank, k-core, k-truss, and clique. For specific query problems, existing approaches typically propose corresponding query processing algorithms and build index structures to speed up the query. However, the diversification of application demands and the explosive growth in graph data scale present two major challenges to this methodology. First, a single graph dataset may involve multiple types of queries in practice, yet each query type often requires distinct processing mechanisms and index structures. Consequently, multiple indexes and corresponding query algorithms need to be constructed when designing a graph database. Second, index structures are often larger than the original graph data, and maintaining multiple indexes simultaneously can lead to significant space overhead, resulting in sharp performance degradation and limited practical applicability. To address these challenges, this study proposes a unified query processing mechanism. A unified and efficient index structure is constructed for large-scale graph data, upon which four query processing algorithms are designed, supporting reachability, shortest path, keyword search, and graph pattern matching. To build the unified index structure, the graph data is partitioned, and important vertices are extracted based on the characteristics of the four queries. The resulting unified index is smaller than the original graph and efficiently supports all four queries. Finally, the effectiveness and scalability of the unified index and the proposed algorithms are validated through experiments on four real-world datasets.
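The abstract does not detail how the unified index is built, but the flavor of “one structure, several queries” can be illustrated with a toy landmark index: one family of BFS passes yields component labels (answering reachability) and distances from a few important vertices (bounding shortest paths). The class name, the degree-based landmark rule, and the undirected setting are all assumptions for illustration:

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted BFS distances from src over an undirected adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

class UnifiedToyIndex:
    def __init__(self, adj, k=2):
        # Component labels: O(1) reachability on undirected graphs.
        self.comp = {}
        label = 0
        for u in adj:
            if u not in self.comp:
                for v in bfs_dist(adj, u):
                    self.comp[v] = label
                label += 1
        # Distances from the k highest-degree "important" vertices.
        landmarks = sorted(adj, key=lambda u: -len(adj[u]))[:k]
        self.dist = {l: bfs_dist(adj, l) for l in landmarks}

    def reachable(self, s, t):
        return self.comp[s] == self.comp[t]

    def dist_upper_bound(self, s, t):
        # Triangle inequality through each landmark.
        bounds = [d[s] + d[t] for d in self.dist.values() if s in d and t in d]
        return min(bounds) if bounds else None

adj = {1: [2], 2: [1, 3], 3: [2], 4: []}
idx = UnifiedToyIndex(adj, k=1)
```

The precomputed structure here is far smaller than a per-query index pair, which mirrors the space argument the abstract makes for its unified design.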
DU Xiao-Ni , WU Jia-Hui , XU Ying , SUN Rui
2026, 37(5):2257-2273. DOI: 10.13328/j.cnki.jos.007448
Abstract:This study investigates meet-in-the-middle attacks on three types of unbalanced generalized Feistel structures and conducts quantum meet-in-the-middle attacks in the Q1 model. First, for the 3-branch Type-III generalized Feistel structure, a 4-round meet-in-the-middle distinguisher is constructed using multiset and differential enumeration techniques. By extending one round forward and one round backward, a 6-round meet-in-the-middle attack is conducted. With the help of Grover’s algorithm and the quantum claw-finding algorithm, a 6-round quantum key recovery attack is performed, requiring O(2^(3ℓ/2)·ℓ) quantum queries, where ℓ is the branch length of the generalized Feistel structure. Then, for the 3-branch Type-I structure, a 9-round distinguisher is similarly extended by one round in both directions to conduct an 11-round meet-in-the-middle attack and a quantum key recovery attack, with time complexities of O(2^(2ℓ)) 11-round encryptions and O(2^(3ℓ/2)·ℓ) quantum queries. Finally, taking the 3-cell generalized Feistel structure as a representative case, this study explores a quantum meet-in-the-middle attack on the n-cell structure. A 2n-round meet-in-the-middle distinguisher is constructed, enabling a 2(n+1)-round meet-in-the-middle attack and quantum key recovery attack. The associated time complexities are O(2^(2ℓ)) 2(n+1)-round encryptions and O(2^(3ℓ/2)·ℓ) quantum queries. The results demonstrate that the time complexity in the Q1 model is significantly reduced compared with classical scenarios.
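The classical idea being extended above is the generic meet-in-the-middle attack: tabulate the forward half of a two-stage cipher over all first-half keys, then meet it from the backward half, trading memory for an exponential drop in time. A sketch on a toy one-byte cipher (the cipher and all names are illustrative, unrelated to the Feistel structures analyzed in the paper):

```python
def mitm_attack(p, c, enc1, dec2, keyspace):
    """Meet-in-the-middle key recovery for a two-stage cipher
    c = E2_k2(E1_k1(p)): tabulate the forward half, then meet it from
    the backward half — O(|K|) time instead of O(|K|^2)."""
    forward = {}
    for k1 in keyspace:
        forward.setdefault(enc1(p, k1), []).append(k1)
    matches = []
    for k2 in keyspace:
        mid = dec2(c, k2)  # decrypt the second stage back to the middle
        for k1 in forward.get(mid, ()):
            matches.append((k1, k2))
    return matches

# Toy cipher over one byte: E1 = XOR with k1, E2 = modular addition of k2.
enc1 = lambda x, k: x ^ k
enc2 = lambda x, k: (x + k) % 256
dec2 = lambda x, k: (x - k) % 256

k1_true, k2_true = 0x3A, 0x7C
p = 0x55
c = enc2(enc1(p, k1_true), k2_true)
candidates = mitm_attack(p, c, enc1, dec2, range(256))
```

The surviving candidate pairs would be filtered with a second plaintext-ciphertext pair; the distinguisher-based attacks in the paper refine this tabulate-and-meet pattern, with Grover's algorithm and quantum claw finding supplying the speedups in the Q1 model.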
ZHANG Yu-Han , ZHANG Lei , WU Wen-Ling
2026, 37(5):2274-2285. DOI: 10.13328/j.cnki.jos.007457
Abstract:Differential-linear cryptanalysis, a combined cryptanalysis method, has been applied to the analysis of many symmetric ciphers. Specifically, for the ARX block cipher SPECK, differential-linear cryptanalysis is an effective technique for evaluating its security. In the latest framework of differential-linear cryptanalysis, the cipher is divided into three components: the differential part, the middle part, and the linear part. These parts contain high-probability differential characteristics, high-correlation differential-linear approximations, and high-correlation linear approximations, respectively. For ARX ciphers, the traditional search process for differential-linear distinguishers typically involves first using experimental methods to obtain a high-correlation differential-linear approximation in the middle part. Subsequently, linear and differential characteristics are searched for forward and backward. However, this strategy may overlook some effective differential-linear distinguishers. This study proposes a search method for differential-linear distinguishers, which integrates the characteristics of the differential and linear parts in high-correlation differential-linear approximations and leverages high-probability differential and linear characteristics. The proposed search algorithm is applied to SPECK, yielding for the first time an 11-round differential-linear distinguisher for SPECK32 and a 12-round differential-linear distinguisher for SPECK48. Both outperform the best-known differential-linear distinguishers for these ciphers.
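The experimental step mentioned above — estimating the correlation of a differential-linear approximation by sampling — can be sketched on a toy ARX round function. The 8-bit word size, rotation amounts, and parameter names are illustrative stand-ins for SPECK, not the paper's actual distinguishers:

```python
import random

MASK8 = 0xFF
rol = lambda v, r: ((v << r) | (v >> (8 - r))) & MASK8

def round_fn(x, y, k):
    # One ARX round in the style of SPECK, scaled down to 8-bit words.
    x = ((rol(x, 8 - 3) + y) & MASK8) ^ k
    y = rol(y, 2) ^ x
    return x, y

def encrypt(x, y, keys):
    for k in keys:
        x, y = round_fn(x, y, k)
    return x, y

def parity(v):
    return bin(v).count("1") & 1

def dl_correlation(delta, mask, keys, samples=20000, seed=1):
    """Monte-Carlo estimate of the differential-linear correlation
    cor = 2*Pr[<mask, E(p)> = <mask, E(p ^ delta)>] - 1,
    the quantity the experimental middle-part searches measure."""
    rng = random.Random(seed)
    dx, dy = delta
    mx, my = mask
    agree = 0
    for _ in range(samples):
        x, y = rng.randrange(256), rng.randrange(256)
        cx1, cy1 = encrypt(x, y, keys)
        cx2, cy2 = encrypt(x ^ dx, y ^ dy, keys)
        b1 = parity(cx1 & mx) ^ parity(cy1 & my)
        b2 = parity(cx2 & mx) ^ parity(cy2 & my)
        agree += (b1 == b2)
    return 2 * agree / samples - 1
```

In practice the sample count must grow roughly as the inverse square of the correlation being measured, which is what bounds how many middle rounds such experiments can cover.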
TANG Hao , LI Ze-Chao , JIANG Xin , TANG Jin-Hui
2026, 37(5):2286-2308. DOI: 10.13328/j.cnki.jos.007491
Abstract:With the continuous advancement of computer vision technology, fine-grained image recognition plays a crucial role across various application domains. Unlike traditional coarse-grained image recognition, fine-grained image recognition aims to precisely distinguish subcategories with subtle visual differences within the same major category, making this task particularly challenging. In recent years, the vision Transformer has gained widespread adoption in image recognition due to its exceptional performance in modeling global contextual information. However, the vision Transformer exhibits certain limitations when applied to fine-grained image recognition, particularly in processing detailed features and mitigating background noise. To address these issues, this study proposes a dual-view recognition framework based on the vision Transformer. This framework effectively integrates global and local views to enhance recognition accuracy. In this framework, an attention-based fusion module is designed to filter redundant information and optimize the classification token embedding of global views by merging and filtering patch features through hierarchical attention weights within the encoder. In addition, an attention threshold-based key region localization module is introduced. This module dynamically selects and magnifies key patches in the global view using an adaptive threshold strategy, forming detailed local views for further analysis. Furthermore, an adaptive enhancement module for local region features is proposed to strengthen the focus on local details, thus enhancing the recognition capability of fine-grained features. To optimize the dual-view framework, a contrastive loss function based on dual-view similarity and an adaptive inference strategy based on dual-view confidence are proposed. 
These strategies aim to enhance the global and local feature discriminability of the vision Transformer model while reducing computational cost and inference time. Experimental results on the CUB-200-2011, Stanford Dogs, NABirds, and iNaturalist2017 public datasets demonstrate that the proposed method achieves significant improvements in recognition accuracy compared to the traditional vision Transformer model. These results validate the proposed method’s effectiveness and superiority in fine-grained image recognition tasks.
HU Bo , TIAN Rong-Ao , ZHENG Jia , GONG Bing-Bing , GAO Xin-Bo
2026, 37(5):2309-2324. DOI: 10.13328/j.cnki.jos.007494
Abstract:Image deblurring has attracted much attention due to its wide applications in fields such as security surveillance, medical image processing, and remote sensing image processing. Although end-to-end methods have made significant progress, a single U-Net network struggles to handle complex motion blur, while restoration approaches based on auxiliary tasks often suffer from large parameter sizes. In addition, the vast majority of methods fail to accurately identify the locations and degrees of blur in different images, while blur perception is often one of the key factors determining the restoration performance of models. Inspired by this, this study proposes a progressive image deblurring algorithm guided by blur perception (PDBP-Net). The main idea of the algorithm is to utilize auxiliary tasks to generate blur perception feature maps, thus guiding the algorithm to achieve more refined restoration. First, the high-frequency difference and image residual generative subnetwork (HDIRG-net) employs auxiliary learning to simultaneously generate high-frequency difference feature maps and residual maps. These are then fed into the blur perception module guided by high-frequency differences (BPGHD) for deep fusion and extraction of blur-related information, resulting in the generation of blur perception feature maps. Moreover, to alleviate the limitations of a single network in restoring complex scenes, this module uses the residual maps and blur maps to generate preliminary restored images. Finally, the blur perception-guided detail restoration subnetwork (BPGDR-net) conducts targeted re-optimization of the preliminary restored images under the guidance of the blur perception feature maps, thus generating the final restored images. The proposed deblurring model is extensively evaluated on multiple benchmark datasets and achieves significant improvements over state-of-the-art deblurring methods. 
Specifically, on the GoPro dataset, the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) reach 33.85 dB and 0.967, respectively, with the PSNR being 0.39 dB higher than that of the second-best method. Extensive experiments demonstrate that PDBP-Net outperforms state-of-the-art auxiliary learning-based methods and significantly enhances image deblurring performance, confirming the effectiveness of the proposed method.
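For reference, the PSNR figure quoted above is computed from the mean squared error between the restored and ground-truth images. A minimal implementation, using flat pixel lists instead of image arrays for brevity:

```python
import math

def psnr(ref, out, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given here as flat lists of pixel values: 10*log10(peak^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

A 0.39 dB gain at this scale corresponds to roughly a 9% reduction in mean squared error, which is why sub-decibel PSNR margins are treated as significant in deblurring benchmarks.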

