JIANG Wei , YANG Si-Fan , WANG Yi-Bo , ZHANG Li-Jun
2025, 36(11):4893-4905. DOI: 10.13328/j.cnki.jos.007383 CSTR: 32375.14.jos.007383
Abstract:Stochastic optimization algorithms are recognized as essential for addressing large-scale data and complex models in machine learning. Among these, variance reduction methods, such as the STORM algorithm, have gained attention for their ability to achieve the optimal convergence rate of $ {\mathrm{O}}\left({T}^{-1/3}\right) $. However, traditional variance reduction methods typically depend on specific problem parameters (e.g., the smoothness constant, noise variance, and gradient upper bound) for setting the learning rate and momentum, limiting their practical applicability. To overcome this limitation, this study proposes an adaptive variance reduction method based on a normalization technique, which eliminates the need for prior knowledge of problem parameters while maintaining the optimal convergence rate. Compared with existing adaptive variance reduction methods, the proposed approach offers several advantages: (1) no reliance on additional assumptions, such as bounded gradients, bounded function values, or excessively large initial batch sizes; (2) achievement of the optimal convergence rate of $ {\mathrm{O}}\left({T}^{-1/3}\right) $ without an extra $ {\mathrm{O}}\left(\mathrm{log}\,T\right) $ term; (3) a concise and straightforward proof, facilitating extensions to other stochastic optimization problems. The superiority of the proposed method is further validated through numerical experiments, which demonstrate enhanced performance compared with other approaches.
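For orientation, the classical STORM estimator that this line of work builds on combines a fresh stochastic gradient with a momentum-corrected reuse of the same sample at the previous iterate; the normalization idea then replaces the raw step with a unit-length direction so the step size does not depend on gradient magnitude. Below is a minimal one-dimensional sketch on a toy quadratic; the learning rate, momentum, and noise level are illustrative choices, and this is not the paper's adaptive algorithm:

```python
import random

def storm_normalized(x0, steps=300, lr=0.05, momentum=0.1, seed=0):
    """STORM recursion with a normalized step on f(x) = x^2 (toy sketch).

    d_t = g_t(x_t) + (1 - a) * (d_{t-1} - g_t(x_{t-1})), where both
    gradients share the same noise sample, then x is moved by a fixed
    step of size lr in the direction of d_t.
    """
    rng = random.Random(seed)
    x = x0
    d = 2.0 * x + rng.gauss(0.0, 0.1)        # initial gradient estimate
    for _ in range(steps):
        x_new = x - lr * d / (abs(d) + 1e-12)  # normalized update
        noise = rng.gauss(0.0, 0.1)            # one sample shared by both points
        g_new = 2.0 * x_new + noise
        g_old = 2.0 * x + noise
        d = g_new + (1.0 - momentum) * (d - g_old)
        x = x_new
    return x
```

Starting from x0 = 5, the iterate settles into a small neighborhood of the minimizer at 0 without any parameter being tuned to the noise variance.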
TANG Jia-Xin , WANG Xuan , LAI Wei , LU Ze-Yu , GUO Zhao-Qiang , YANG Yi-Biao , ZHOU Yu-Ming
2025, 36(11):4906-4952. DOI: 10.13328/j.cnki.jos.007376 CSTR: 32375.14.jos.007376
Abstract:Software vulnerabilities are code segments in software that are prone to exploitation. Ensuring that software is not easily attacked is a crucial security requirement in software development. Software vulnerability prediction involves analyzing and predicting potential vulnerabilities in software code. Deep learning-driven software vulnerability prediction has become a popular research field in recent years, spanning a long period and producing numerous studies and substantial achievements. To review relevant research findings and summarize the research hotspots, a survey of 151 studies related to deep learning-driven software vulnerability prediction published between 2017 and 2024 is conducted. It summarizes the research problems, progress, and challenges discussed in the literature, providing a reference for future research.
WEI Qiu-Yang , ZHAO Xu-Feng , ZHU Xue-Yang , ZHANG Wen-Hui , LU Yi-Han
2025, 36(11):4953-4974. DOI: 10.13328/j.cnki.jos.007356 CSTR: 32375.14.jos.007356
Abstract:Since the advent of Bitcoin, blockchain technology has profoundly influenced numerous fields. However, the absence of effective communication mechanisms between heterogeneous and isolated blockchain systems has hindered the advancement and sustainable development of the blockchain ecosystem. In response, cross-chain technology has emerged as a rapidly evolving field and a focal point of research. The decentralized nature of blockchain, coupled with the complexity of cross-chain scenarios, introduces significant security challenges. This study proposes a formal analysis of the IBC (inter-blockchain communications) protocol, one of the most widely adopted cross-chain communication protocols, to assist developers in designing and implementing cross-chain technologies with enhanced security. The IBC protocol is formalized using TLA+, a temporal logic specification language, and its critical properties are verified through the model-checking tool TLC. An in-depth analysis of the verification results reveals several issues impacting the correctness of packet transmission and token transfer. Corresponding recommendations are proposed to mitigate these security risks. The findings have been reported to the IBC developer community, with most of them receiving acknowledgment.
GONG Yuan-Jun , HUANG Jian-Jun , YOU Wei , SHI Wen-Chang , LIANG Bin , BIAN Pan , ZHANG Jian
2025, 36(11):4975-4989. DOI: 10.13328/j.cnki.jos.007362 CSTR: 32375.14.jos.007362
Abstract:The longest common subsequence (LCS) is a practical metric for assessing code similarity. However, traditional LCS-based methods face challenges in scalability and in effectively capturing critical semantics for identifying code fragments that are textually different but semantically similar, due to their reliance on discrete representation-based token encoding. To address these limitations, this study proposes an LCS-oriented embedding method that encodes code into low-dimensional dense vectors, effectively capturing semantic information. This transformation enables the computationally expensive LCS calculation to be replaced with efficient vector arithmetic, further accelerated using an approximate nearest neighbor algorithm. To support this approach, an embeddable LCS-based distance metric is developed, as the original LCS metric is non-embeddable. Experimental results demonstrate that the proposed metric outperforms tree-based and literal similarity metrics in detecting complex code clones. In addition, two targeted loss functions and corresponding training datasets are designed to prioritize retaining critical semantics in the embedding process, allowing the model to identify textually different but semantically similar code elements. This improves performance in detecting complex code similarities. The proposed method demonstrates strong scalability and high accuracy in detecting complex clones. When applied to similar bug identification, it has reported 23 previously unknown bugs, all of which have been confirmed by developers in real-world projects. Notably, several of these bugs are complex and challenging to detect using traditional LCS-based techniques.
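As background, the exact LCS computation that the embedding replaces is a quadratic-time dynamic program, and one common way to turn it into a distance is shown below. This is only the classical baseline metric; the paper's embeddable variant is a different construction:

```python
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over token sequences.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_distance(a, b):
    # A common LCS-based normalized distance in [0, 1]; identical
    # sequences get 0. Not the paper's embeddable metric.
    if not a and not b:
        return 0.0
    return 1.0 - lcs_length(a, b) / max(len(a), len(b))
```

The O(mn) cost of `lcs_length` per pair is exactly what makes large-scale clone search with raw LCS impractical, motivating the embedding approach.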
TIAN Xin-Lei , DONG Yi-Yi , ZHANG Ji-Xian , WANG Wei-Jia
2025, 36(11):4990-5007. DOI: 10.13328/j.cnki.jos.007368 CSTR: 32375.14.jos.007368
Abstract:Falcon, a post-quantum digital signature algorithm, has been selected as one of the first schemes standardized by the National Institute of Standards and Technology (NIST). Its core algorithms, however, are highly error-prone in practical implementations, raising risks of cryptographic misuse. Ensuring the correctness of Falcon through formal verification is therefore essential. This study introduces a comprehensive proof framework that bridges the gap between Falcon’s mathematical specification and its real-world implementation. Within the EasyCrypt proof system, the correctness of Falcon’s Montgomery modular multiplication, NTT, and FFT algorithms is formally verified, and proof techniques for integer Gaussian sampling are further explored. Moreover, Falcon’s signing and verification implementations are presented and optimized using Jasmin hybrid programming, thereby providing both formal correctness guarantees and practical efficiency.
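For readers unfamiliar with the first of those algorithms: Montgomery modular multiplication computes a*b*R^(-1) mod n using only shifts and masks in place of division by n, which is why it is attractive in constant-time cryptographic code. The sketch below is a plain Python reference of the textbook algorithm, not Falcon's verified implementation:

```python
def montgomery_multiply(a, b, n, r_bits):
    """Textbook Montgomery reduction: returns a*b*R^{-1} mod n, R = 2^r_bits.

    Requires n odd and n < R; a, b are assumed already in Montgomery form.
    """
    r = 1 << r_bits
    assert n % 2 == 1 and n < r
    n_prime = -pow(n, -1, r) % r         # n' with n * n' == -1 (mod R)
    t = a * b
    m = (t * n_prime) & (r - 1)          # m = t * n' mod R, via masking
    u = (t + m * n) >> r_bits            # t + m*n is divisible by R exactly
    return u - n if u >= n else u        # single conditional subtraction
```

For example, with n = 17 and R = 32, `montgomery_multiply(7, 9, 17, 5)` returns a value u satisfying u*32 ≡ 7*9 (mod 17), which is the defining property verified in the Falcon proofs.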
PAN Xing-Lu , ZHAO Xian-Lin , LIU Chen-Xiao , ZOU Yan-Zhen , XIE Bing
2025, 36(11):5008-5030. DOI: 10.13328/j.cnki.jos.007369 CSTR: 32375.14.jos.007369
Abstract:With the widespread adoption of programming naming conventions and the increasing emphasis on self-explanatory code, traditional summarizing code comments, which often merely restate the literal meaning of the code, are losing appeal among developers. Instead, developers value supplementary code comments that provide additional information beyond the code itself to facilitate program understanding and maintenance. However, generating such comments typically requires external information resources beyond the code base, and the diversity of supplementary information presents significant challenges to existing methods. This study leverages Issue reports as a crucial external information source and proposes an Issue-based retrieval augmentation method using large language models (LLMs) to generate supplementary code comments. The proposed method classifies the supplementary information found in Issue reports into five categories, retrieves Issue sentences containing this information, and generates corresponding comments using LLMs. In addition, the code relevance and Issue verifiability of the generated comments are evaluated to minimize hallucinations. Experiments conducted on two popular LLMs, ChatGPT and GPT-4o, demonstrate the effectiveness of the proposed method. Compared to existing approaches, the proposed method significantly improves the coverage of manual supplementary comments from 33.6% to 72.2% for ChatGPT and from 35.8% to 88.4% for GPT-4o. Moreover, the generated comments offer developers valuable supplementary information, which proves essential for understanding tricky code.
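To make the retrieval step concrete, a deliberately simple lexical retriever is sketched below: it ranks Issue sentences by token overlap with the code being commented. The function names and Jaccard scoring are illustrative assumptions; the paper's retriever and its five information categories are more elaborate:

```python
def jaccard(a, b):
    # Jaccard similarity between two token collections.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_issue_sentences(code_tokens, issue_sentences, top_k=3):
    """Rank Issue sentences by lexical overlap with the code under comment.

    A toy stand-in for retrieval augmentation: the top-k sentences would
    be passed to an LLM as context for generating a supplementary comment.
    """
    scored = sorted(
        issue_sentences,
        key=lambda s: jaccard(code_tokens, s.split()),
        reverse=True,
    )
    return scored[:top_k]
```

A usage sketch: `retrieve_issue_sentences(["parse", "config", "file"], sentences, top_k=1)` surfaces the Issue sentence most lexically related to the code, which then grounds the generated comment and makes it verifiable against the Issue.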
WU Jiang-Hao , DUAN Liang , YUE Kun , LI Ang-Sheng , YANG Pei-Zhong
2025, 36(11):5031-5044. DOI: 10.13328/j.cnki.jos.007374 CSTR: 32375.14.jos.007374
Abstract:Attributed graphs are increasingly used to represent data with relational structures, and detecting anomalies in them is gaining attention. Due to their characteristics, such as rich attribute information and complex structural relationships, various types of anomalies may exist, including global, structural, and community anomalies, which often remain hidden within the graph’s deep structure. Existing methods face challenges such as loss of structural information and difficulty in identifying abnormal nodes. Structural information theory leverages encoding trees to represent hierarchical relationships within data and establishes correlations across different levels by minimizing structural entropy, effectively capturing the graph’s essential structure. This study proposes an anomaly detection method for attributed graphs based on structural entropy. First, by integrating the structural and attribute information of attributed graphs, a K-dimensional encoding tree representing the hierarchical community structure is constructed through structural entropy minimization. Next, using the node attributes and hierarchical community information within the encoding tree, scoring mechanisms for detecting structural and attribute anomalies are designed based on Euclidean distance and connection strength between nodes. This approach identifies abnormal nodes and detects various types of anomalies. The proposed method is evaluated through comparative tests on several attributed graph datasets. Experimental results demonstrate that it effectively detects different types of anomalies and significantly outperforms existing state-of-the-art methods.
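A simplified stand-in for the attribute-anomaly scoring idea, the Euclidean distance between a node's attributes and its community's mean, can be sketched as follows. The paper scores against the hierarchical communities of the encoding tree; flat community labels are assumed here for brevity:

```python
import math

def attribute_anomaly_scores(attributes, communities):
    """Score each node by the Euclidean distance from its attribute vector
    to the mean attribute vector of its community.

    attributes: list of equal-length attribute vectors, one per node.
    communities: list of community ids, one per node (flat, for brevity).
    """
    # Accumulate per-community sums and counts.
    sums, counts = {}, {}
    for vec, c in zip(attributes, communities):
        acc = sums.setdefault(c, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += v
        counts[c] = counts.get(c, 0) + 1
    means = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
    # Distance to the community mean serves as the anomaly score.
    return [
        math.sqrt(sum((v - m) ** 2 for v, m in zip(vec, means[c])))
        for vec, c in zip(attributes, communities)
    ]
```

Nodes whose attributes sit far from their community's center receive high scores; the paper's full method additionally weighs connection strength and the multiple levels of the encoding tree.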
CAO Si-Cong , SUN Xiao-Bing , BO Li-Li , WU Xiao-Xue , LI Bin , CHEN Ting , LUO Xia-Pu , ZHANG Tao , LIU Wei
2025, 36(11):5045-5061. DOI: 10.13328/j.cnki.jos.007375 CSTR: 32375.14.jos.007375
Abstract:Software vulnerabilities pose significant threats to real-world systems. In recent years, learning-based vulnerability detection methods, especially deep learning-based approaches, have gained widespread attention due to their ability to extract implicit vulnerability features from large-scale vulnerability samples. However, due to differences in features among different types of vulnerabilities and the problem of imbalanced data distribution, existing deep learning-based vulnerability detection methods struggle to accurately identify specific vulnerability types. To address this issue, this study proposes MulVD, a deep learning-based multi-class vulnerability detection method. MulVD constructs a structure-aware graph neural network (SA-GNN) that can adaptively extract local and representative vulnerability patterns while rebalancing the data distribution without introducing noise. The effectiveness of the proposed approach in both binary and multi-class vulnerability detection tasks is evaluated. Experimental results demonstrate that MulVD significantly improves the performance of existing deep learning-based vulnerability detection techniques.
QU Yu-Bin , HUANG Song , CHEN Xiang , WANG Xing-Ya , LI Long , WANG Dan , YAO Yong-Ming , JU Xiao-Lin
2025, 36(11):5062-5081. DOI: 10.13328/j.cnki.jos.007379 CSTR: 32375.14.jos.007379
Abstract:In recent years, deep learning-based vulnerability detection models have demonstrated impressive capabilities in detecting vulnerabilities. Previous research has widely explored adversarial attacks that use variable renaming to introduce disturbances into source code and evade detection. However, the effectiveness of introducing multiple disturbances through various transformation techniques has not been adequately investigated. In this study, multiple synonymous transformation operators are applied to introduce disturbances into source code. A combination optimization strategy based on genetic algorithms is proposed, enabling the selection of the source code transformation operators with the highest fitness to guide the generation of adversarial code segments capable of evading vulnerability detection. The proposed method is implemented in a framework named non-vulnerability generator (NonVulGen) and evaluated against deep learning-based vulnerability detection models. When applied to recently developed deep learning models, an average attack success rate of 91.38% is achieved against the CodeBERT-based model and 93.65% against the GraphCodeBERT-based model, representing improvements of 28.94% and 15.52% over state-of-the-art baselines, respectively. To assess the generalization ability of the proposed attack method, common models including Devign, ReGVD, and LineVul are targeted, achieving average success rates of 98.88%, 97.85%, and 92.57%, respectively. Experimental results indicate that adversarial code segments generated by NonVulGen cannot be effectively distinguished by deep learning-based vulnerability detection models. Furthermore, significant reductions in attack success rates are observed after retraining the models with adversarial samples generated based on the training data, with decreases of 96.83% for CodeBERT, 97.12% for GraphCodeBERT, 98.79% for Devign, 98.57% for ReGVD, and 97.94% for LineVul.
These findings reveal the critical challenge of adversarial attacks in deep learning-based vulnerability detection models and highlight the necessity for model reinforcement before deployment.
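The combination-optimization strategy can be pictured as a genetic search over binary masks selecting which transformation operators to apply. The sketch below is a generic GA with one-point crossover and point mutation; the caller-supplied fitness function stands in for the detector-evasion fitness, and none of this is the NonVulGen implementation:

```python
import random

def genetic_search(num_operators, fitness, pop_size=20, generations=30, seed=0):
    """Generic genetic search over bit masks of candidate transformations.

    Each individual is a mask over `num_operators` operators; `fitness`
    scores a mask (in the attack setting, e.g. the drop in the detector's
    confidence after applying the selected transformations).
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_operators)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]           # keep the fitter half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, num_operators)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = rng.randrange(num_operators)
            child[i] ^= 1                      # point mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)
```

With a toy fitness such as `sum` (reward masks with more operators enabled), the search quickly concentrates on high-fitness masks, illustrating how operator combinations, rather than single renamings, are selected.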
LUO Shi-Yu , LI Xin-Lei , LUO Jun-Tao , WANG Xin , ZHANG Guo-Feng , CHEN Yang
2025, 36(11):5082-5101. DOI: 10.13328/j.cnki.jos.007380 CSTR: 32375.14.jos.007380
Abstract:With the widespread adoption and rapid advancement of open-source software, the maintenance of open-source software projects has become a critical phase within the software development cycle. As a globally representative developer community, GitHub hosts numerous software project repositories with similar functionalities within the same domain, creating challenges for users when selecting the appropriate project repository for use or further development. Therefore, accurate identification of project repository maintenance status holds substantial practical value. However, the GitHub platform does not provide direct metrics for assessing the maintenance status of repositories. This study proposes an automatic identification method for project repository maintenance status based on machine learning. A classification model, GitMT, has been developed and implemented to achieve this objective. By effectively integrating dynamic time series features and descriptive features, the proposed model enables accurate identification of “active” and “unmaintained” repository status. Through a series of experiments conducted on large-scale real-world data, an AUC value of 0.964 is achieved in maintenance status identification tasks. In addition, this study constructs an open-source dataset centered on the maintenance status of software project repositories—GitMT Dataset: https://doi.org/10.7910/DVN/OJ2NI3.
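For reference, the AUC reported above is the rank-based probability that a randomly chosen "active" repository receives a higher score than a randomly chosen "unmaintained" one. A minimal computation of that metric (not GitMT's feature pipeline) is:

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a positive outscores a negative,
    counting ties as half a win. labels are 0/1; scores are real-valued.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.964 therefore means that for about 96.4% of active/unmaintained repository pairs, the model ranks the active one higher.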
LIU Li-Pei , MAO Jian , LIN Qi-Xiao , LYU Yu-Song , LI Jia-Wei , LIU Jian-Wei
2025, 36(11):5102-5117. DOI: 10.13328/j.cnki.jos.007387 CSTR: 32375.14.jos.007387
Abstract:Mini programs are required to provide privacy policies that inform users about the types and purposes of the privacy data being collected and used. However, inconsistencies between the underlying code and the privacy statements may occur, potentially deceiving users and leading to privacy leakage. Existing methods for detecting such inconsistencies typically rely on converting the code and policies into predefined labels for comparison. This approach introduces information loss during label conversion, resulting in underreporting. In addition, traditional code analysis methods are often ineffective against obfuscated mini program code. To address these limitations, a semantic-analysis-based method for code-to-policy consistency detection in mini programs is proposed. Customized taint analysis is utilized to capture code behaviors based on mini program coding paradigms, and a code language processing model is applied to represent these behaviors as natural language descriptions. By aligning the natural language representation of code behaviors with the stated purposes in privacy policies, expert reviewers can effectively analyze the consistency between the two. Experiments indicate that the proposed taint analysis module covers all three data return methods and four common data flow patterns within mini program APIs, achieving superior sensitivity compared with existing methods. Semantic analysis of tens of thousands of mini programs reveals privacy leakage risks associated with certain high-frequency API calls. Case studies using the MiniChecker tool further identify real-world instances of mini programs where inconsistencies between code and privacy policies are detected.
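The taint-analysis idea can be illustrated with a toy forward propagation over straight-line statements, where a privacy source taints its result and taint flows through arguments until it reaches a sink call. The statement format and the two WeChat-style API names below are illustrative assumptions, far simpler than the paper's analysis of obfuscated mini program code:

```python
def taint_flows(statements, sources, sinks):
    """Toy forward taint propagation.

    statements: list of (target, func, args) tuples in execution order,
    where args names earlier targets. Returns (sink_func, target) pairs
    reached by data originating from a privacy source.
    """
    tainted = set()
    flows = []
    for target, func, args in statements:
        if func in sources:
            tainted.add(target)                 # source call taints its result
        elif any(a in tainted for a in args):
            if func in sinks:
                flows.append((func, target))    # tainted data reaches a sink
            tainted.add(target)                 # taint propagates onward
    return flows
```

In the example below, location data flows through a formatting step into a network request, the kind of behavior that must then be matched against the privacy policy's stated purposes.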
LI Deng , WU A-Ming , HAN Ya-Hong
2025, 36(11):5118-5133. DOI: 10.13328/j.cnki.jos.007321 CSTR: 32375.14.jos.007321
Abstract:Visual-language pre-training (VLP) aims to obtain a powerful multimodal representation by learning on large-scale image-text multimodal datasets. Multimodal feature fusion and alignment is a key challenge in multimodal model training. In most existing visual-language pre-training models, the extracted visual features and text features are directly input into a Transformer model for fusion and alignment. Since the attention mechanism in the Transformer computes pairwise similarities, it is difficult to achieve alignment among multiple entities. Hyperedges in hypergraph neural networks, by contrast, connect multiple entities and encode high-order entity correlations, enabling relationships among multiple entities to be established. This study proposes a visual-language multimodal pre-training method based on multi-entity alignment with hypergraph neural networks. In this method, a hypergraph neural network learning module is introduced into the Transformer multimodal fusion encoder to learn the alignment relationships of multimodal entities, thereby enhancing the entity alignment ability of the multimodal fusion encoder in the pre-trained model. The proposed visual-language pre-training model is pre-trained on large-scale image-text datasets and fine-tuned on multiple visual-language downstream tasks such as visual question answering, image-text retrieval, visual grounding, and natural language visual reasoning. Experimental results indicate that, compared with the baseline method, the proposed method achieves performance improvements on multiple downstream tasks, including a 1.8% accuracy gain on the NLVR2 task.
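The appeal of hyperedges, relating more than two entities in one step, can be seen in a single unlearned round of hyperedge message passing: each hyperedge averages its member nodes' features, and each node then averages the features of its incident hyperedges. The paper's module is a trained hypergraph neural network; this is only the aggregation skeleton, with mean pooling assumed in place of learned weights:

```python
def hypergraph_aggregate(node_feats, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation.

    node_feats: list of feature vectors, one per node.
    hyperedges: list of node-index lists; each hyperedge may connect
    any number of nodes, unlike an ordinary pairwise edge.
    """
    dim = len(node_feats[0])
    # Hyperedge features: mean of member node features.
    edge_feats = [
        [sum(node_feats[n][k] for n in edge) / len(edge) for k in range(dim)]
        for edge in hyperedges
    ]
    # Node update: mean of incident hyperedge features.
    out = []
    for i in range(len(node_feats)):
        member = [e for e, edge in enumerate(hyperedges) if i in edge]
        if not member:
            out.append(node_feats[i][:])        # isolated node kept as-is
        else:
            out.append([sum(edge_feats[e][k] for e in member) / len(member)
                        for k in range(dim)])
    return out
```

A single hyperedge over three nodes pulls all three toward a shared representation in one step, which pairwise attention would need multiple hops to achieve.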
YIN Ming , QIAO Sheng , CHEN Wei , JIANG Ji-Jiao
2025, 36(11):5134-5157. DOI: 10.13328/j.cnki.jos.007322 CSTR: 32375.14.jos.007322
Abstract:Online information comes from numerous and varied sources, and judging in a timely and accurate manner whether a piece of information is a rumor is a crucial issue in research on the cognitive domain of social media. Most previous studies have concentrated on the text content of rumors, user characteristics, or features confined to the propagation mode, ignoring the key clues offered by the collective emotions generated through users’ participation in event discussions and the emotional steady-state characteristics hidden in rumor spread. In this study, a social network rumor detection method oriented by collective emotional stabilization and integrating temporal and spatial steady-state features is proposed. Based on the text features and user behaviors in rumor propagation, the temporal and spatial steady-state features of collective emotions are combined for the first time, achieving strong expressiveness and detection accuracy. Specifically, the method takes the emotional keywords of users’ attitudes toward an event or topic as the basis and uses recurrent neural networks to construct emotional steady-state features of the temporal relationship, giving the collective emotions temporally consistent, strongly expressive features that reflect the convergence of collective emotions over time. A heterogeneous graph neural network is utilized to establish connections between users and keywords, as well as between texts and keywords, so that the collective emotions possess fine-grained spatial steady-state features. Finally, the two types of local steady-state features are fused to obtain global feature expression, and further classification yields the rumor detection results. The proposed method is evaluated on two widely used, publicly available Twitter datasets.
Compared with the best-performing method in the baselines, the accuracy is improved by 3.4% and 3.2% respectively; the T-F1 value is improved by 3.0% and 1.8% respectively; the N-F1 value is improved by 2.7% and 2.3% respectively; the U-F1 value is improved by 2.3% and 1.0% respectively.
XU Sheng , LI Pei-Feng , ZHU Qiao-Ming
2025, 36(11):5158-5177. DOI: 10.13328/j.cnki.jos.007367 CSTR: 32375.14.jos.007367
Abstract:The diversity and complexity of linguistic expressions often lead to event coreference relations being reflected as latent correlations between event mentions. Existing methods predominantly rely on semantic similarity computations based on internal event features, such as triggers and arguments, which limits their ability to address such latent correlations effectively. To overcome this limitation, an external knowledge-enhanced event coreference resolution method is proposed. This approach leverages large language models (LLMs) to generate external knowledge related to coreference, encompassing discourse coherence, logical relationships, and common sense background knowledge. First, the ultra-large language model ChatGPT is utilized to construct training data enriched with external knowledge. Next, foundational LLMs like FlanT5 are fine-tuned on this data to acquire the ability to generate coreference-related external knowledge. Finally, the fine-tuned LLM generates document-level event summaries and chain-of-thought (CoT) style coreference reasoning paths. By integrating internal event features with external knowledge, the proposed method effectively identifies event coreference. Experimental results on the KBP dataset demonstrate that the proposed method outperforms previous state-of-the-art baselines.
DING Rui-Qing , ZHAO Jun-Feng , WANG Le-Ye
2025, 36(11):5178-5196. DOI: 10.13328/j.cnki.jos.007370 CSTR: 32375.14.jos.007370
Abstract:Knowledge graphs (KGs), as structured representations of knowledge, have a wide range of applications in the medical field. Entity alignment, which involves identifying equivalent entities across different KGs, is a fundamental step in constructing large-scale KGs. Although extensive research has focused on this issue, most of it has concentrated on aligning pairs of KGs, typically by capturing the semantic and structural information of entities to generate embeddings, followed by calculating embedding similarity to identify equivalent entities. This study identifies the problem of alignment error propagation when aligning multiple KGs. Given the high accuracy requirements for entity alignment in medical contexts, this study proposes a multi-source Chinese medical knowledge graph entity alignment method (MSOI-Align) that integrates entity semantics and ontology information. The proposed method pairs multiple KGs and uses representation learning to generate entity embeddings. It also incorporates both the similarity of entity names and ontology consistency constraints, leveraging a large language model to filter a set of candidate entities. Subsequently, based on triadic closure theory and the large language model, MSOI-Align automatically identifies and corrects the propagation of alignment errors for the candidate entities. Experimental results on four Chinese medical knowledge graphs show that MSOI-Align significantly enhances the precision of the entity alignment task, with the Hits@1 metric increasing from 0.42 to 0.92 compared to the state-of-the-art baseline. The fused knowledge graph, CMKG, contains 13 types of ontologies, 190,000 entities, and approximately 700,000 triplets. Due to copyright restrictions on one of the KGs, only the fusion of the other three KGs is released, named OpenCMKG.
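The Hits@1 figure quoted above measures how often the nearest cross-graph embedding is the gold-aligned entity. A minimal cosine-similarity version is sketched below, with toy dictionary inputs assumed for illustration; real systems use learned embeddings and approximate nearest-neighbor search:

```python
import math

def cosine(u, v):
    # Cosine similarity of two vectors; 0.0 if either is the zero vector.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def hits_at_1(source_emb, target_emb, gold):
    """Fraction of source entities whose most similar target embedding
    is the gold-aligned entity (the Hits@1 metric).

    source_emb / target_emb: dict entity name -> embedding vector.
    gold: dict source entity -> its true target-side counterpart.
    """
    hits = 0
    for s, t_gold in gold.items():
        best = max(target_emb, key=lambda t: cosine(source_emb[s], target_emb[t]))
        hits += best == t_gold
    return hits / len(gold)
```

A Hits@1 of 0.92 thus means that for 92% of entities, the single nearest neighbor across graphs is already the correct alignment.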
SUN Chen-Chen , JIN Yu-Yuan , SHEN De-Rong , NIE Tie-Zheng , KOU Yue
2025, 36(11):5197-5212. DOI: 10.13328/j.cnki.jos.007371 CSTR: 32375.14.jos.007371
Abstract:Entity alignment (EA) aims to identify equivalent entities across different knowledge graph (KG). Embedding-based EA methods still have several limitations, listed below. First, the heterogeneous structures within KGs are not fully modeled. Second, the utilization of text information is constrained by word embeddings. Third, alignment inference algorithms are underexplored. To address these limitations, this study proposes a heterogeneous graph attention network for entity alignment (HGAT-EA). HGAT-EA consists of two channels: one for learning structural embeddings and the other for learning character-level semantic embeddings. The first channel employs a heterogeneous graph attention network (HGAT), which fully leverages heterogeneous structures and relation triples to learn entity embeddings. The second channel utilizes character-level literals to learn character-level semantic embeddings. HGAT-EA incorporates multiple views through these channels and maximizes the use of