LI Yan , YANG Wen-Zhang , ZHANG Yi , XUE Yin-Xing
2025, 36(6):2404-2431. DOI: 10.13328/j.cnki.jos.007323 CSTR: 32375.14.jos.007323
Abstract:Fuzzing, as an automated software testing method, aims to detect potential security vulnerabilities, software defects, or abnormal behaviors by feeding a large quantity of automatically generated test data into the target software system. However, traditional fuzzing techniques are limited by factors such as a low level of automation, low testing efficiency, and low code coverage, and thus cannot cope with modern large-scale software systems. In recent years, the rapid development of large language models (LLMs) has not only brought significant breakthroughs to natural language processing but also introduced new automation solutions to fuzzing. To enhance the effectiveness of fuzzing, existing works have proposed various fuzzing methods combined with LLMs, covering modules such as test input generation, defect detection, and post-fuzzing processing. Nevertheless, existing works lack a systematic survey and discussion of fuzzing techniques based on LLMs. To fill this gap, this study comprehensively analyzes and summarizes the current research and development status of LLM-based fuzzing techniques. The main contents include (1) summarizing the overall process of fuzzing and the LLM technologies commonly used in fuzzing research; (2) discussing the limitations of deep-learning-based fuzzing methods before the LLM era; (3) analyzing the application of LLMs in different stages of fuzzing; (4) exploring the main challenges and possible future development directions of LLM technology in fuzzing.
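The overall fuzzing process the survey summarizes (generate inputs, execute the target, watch for abnormal behavior) can be sketched in a few lines. The toy target below is invented for illustration; real fuzzers add coverage feedback, corpus management, and far richer mutations.

```python
import random

def fuzz(target, seeds, rounds=200, seed=0):
    """Minimal mutation-based fuzzing loop: pick a seed, flip one byte,
    run the target, and record inputs that trigger exceptions."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(rounds):
        data = bytearray(rng.choice(seeds))
        if data:
            pos = rng.randrange(len(data))
            data[pos] = rng.randrange(256)   # single-byte mutation
        try:
            target(bytes(data))
        except Exception:
            crashes.append(bytes(data))
    return crashes

def toy_target(data):
    # Hypothetical buggy parser: crashes on inputs starting with 0xFF.
    if data[:1] == b"\xff":
        raise ValueError("parser crash")

found = fuzz(toy_target, [b"hello", b"\xffhdr"])
```

LLM-based approaches replace the blind byte mutation with model-generated, format-aware inputs, which is precisely where the surveyed works intervene.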
LIU Tian-Yang , YE Jia-Wei , JI Wei-Xing , LIU Hui
2025, 36(6):2432-2452. DOI: 10.13328/j.cnki.jos.007327 CSTR: 32375.14.jos.007327
Abstract:Resource leaks, defects caused by the failure to close limited system resources in a timely and proper manner, are widely present in programs of various languages and possess a certain degree of concealment. Traditional defect detection methods usually predict resource leaks based on rules and heuristic search. In recent years, defect detection methods based on deep learning have captured the semantic information in code through different code representation forms and techniques such as recurrent neural networks and graph neural networks. Recent studies show that language models perform outstandingly in tasks such as code understanding and generation. However, the advantages and limitations of large language models (LLMs) in the specific task of resource leak detection have not been fully evaluated. This study evaluates the effectiveness of detection methods based on traditional models, small models, and LLMs in the task of resource leak detection, and explores various improvement methods such as few-shot learning, fine-tuning, and the combination of static analysis and LLMs. Specifically, taking the JLeaks and DroidLeaks datasets as experimental subjects, the performance of different models is analyzed from multiple dimensions such as the root causes of resource leaks, resource types, and code complexity. The experimental results show that fine-tuning can significantly improve the detection effect of LLMs in resource leak detection. However, most models still need improvement in identifying resource leaks caused by third-party libraries. In addition, code complexity has a greater influence on detection methods based on traditional models.
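The kind of defect these detectors target is easy to state in code. A minimal Python illustration (the functions are invented for this example and are not drawn from JLeaks or DroidLeaks):

```python
import os
import tempfile

def read_first_line_leaky(path):
    # Defect: the file handle is opened but never explicitly closed;
    # if readline() raises, or collection is delayed, the descriptor
    # leaks -- the pattern resource leak detectors look for.
    f = open(path)
    return f.readline()

def read_first_line_safe(path):
    # Fix: the context manager closes the handle on every exit path.
    with open(path) as f:
        return f.readline()

# Demonstrate the safe variant on a throwaway file.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("hello\n")
line = read_first_line_safe(path)
os.remove(path)
```

Both functions return the same result on the happy path, which is exactly why such leaks are concealed: only the exceptional or long-running case exposes them.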
LI Xiao-Peng , YAN Ming , FAN Xing-Yu , TANG Zhen-Tao , KAI Shi-Xiong , HAO Jian-Ye , YUAN Ming-Xuan , CHEN Jun-Jie
2025, 36(6):2453-2476. DOI: 10.13328/j.cnki.jos.007328 CSTR: 32375.14.jos.007328
Abstract:In the current intelligent era, chips, as the core components of intelligent electronic devices, play a critical role in fields such as artificial intelligence, the Internet of Things, and 5G communication, and ensuring their correctness, security, and reliability is of great significance. During chip development, developers first implement the chip design in software form (i.e., as chip design programs) using hardware description languages, and then conduct physical design and finally tape-out (i.e., production and manufacturing). As the basis of chip design and manufacturing, the quality of chip design programs directly impacts the quality of the chips, so the testing of chip design programs is of important research significance. Early testing methods for chip design programs mainly depend on test cases manually designed by developers, often requiring a large amount of manual cost and time. With the increasing complexity of chip design programs, various simulation-based automated testing methods have been proposed, improving the efficiency and effectiveness of chip design program testing. In recent years, more and more researchers have been committed to applying intelligent methods such as machine learning, deep learning, and large language models (LLMs) to chip design program testing. This study surveys 88 academic papers related to intelligent chip design program testing and summarizes the existing achievements from three perspectives: test input generation, test oracle construction, and test execution optimization.
It focuses on the evolution of chip design program testing methods from the machine learning stage through the deep learning stage to the large language model stage, exploring the potential of the methods at different stages to improve testing efficiency and coverage and to reduce testing costs. Additionally, it introduces research datasets and tools in the field of chip design program testing and envisions future development directions and challenges.
WANG Zhi-Peng , HE Tie-Ke , ZHAO Ruo-Yu , ZHENG Tao
2025, 36(6):2477-2500. DOI: 10.13328/j.cnki.jos.007325 CSTR: 32375.14.jos.007325
Abstract:As a crucial part of automated code review, the code refinement task is of great significance for improving development efficiency and code quality. Since large language models (LLMs) have shown far better performance than traditional small-scale pre-trained models in the field of software engineering, this study explores the performance of these two types of models in automatic code refinement, so as to evaluate the comprehensive advantages of LLMs. Traditional code quality evaluation metrics (e.g., BLEU, CodeBLEU, edit progress) are used to evaluate four mainstream LLMs and four representative small-scale pre-trained models on the code refinement task. Findings indicate that the refinement quality of LLMs in the pre-review code refinement subtask is inferior to that of small-scale pre-trained models. Since the existing code quality evaluation metrics struggle to explain this phenomenon, this study proposes Unidiff-based code refinement evaluation metrics to quantify the change operations performed during refinement, so as to explain the inferiority and reveal the models' tendencies when performing change operations: (1) the pre-review code refinement task is rather difficult, the accuracy of the models in performing correct change operations is extremely low, and LLMs are more “aggressive” than small-scale pre-trained models, that is, they tend to perform more code change operations, which leads to their poor performance; (2) compared with small-scale pre-trained models, LLMs tend to perform more ADD and MODIFY change operations in the code refinement task, with a larger average number of inserted code lines per ADD operation, further confirming their “aggressive” nature.
To alleviate the disadvantages of LLMs in the pre-review refinement task, this study introduces the LLM-Voter method based on LLMs and ensemble learning, which includes two sub-schemes, inference-based and confidence-based, aiming to integrate the advantages of different base models to improve code refinement quality. On this basis, a refinement determination mechanism is further introduced to enhance the decision stability and reliability of the method. Experimental results demonstrate that the confidence-based LLM-Voter method significantly increases the exact match (EM) value and achieves refinement quality better than that of all base models, thus effectively alleviating the disadvantages of LLMs.
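The idea of quantifying change operations from a diff, in the spirit of the Unidiff-based metrics described above, can be sketched with Python's difflib; the ADD/DELETE/MODIFY categories here are an illustrative approximation, not the paper's exact definitions.

```python
import difflib

def count_change_ops(before, after):
    """Classify line-level edit opcodes between two code versions into
    ADD / DELETE / MODIFY operations and count inserted lines."""
    ops = {"ADD": 0, "DELETE": 0, "MODIFY": 0, "added_lines": 0}
    sm = difflib.SequenceMatcher(a=before, b=after)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "insert":
            ops["ADD"] += 1
            ops["added_lines"] += j2 - j1   # lines inserted by this ADD
        elif tag == "delete":
            ops["DELETE"] += 1
        elif tag == "replace":
            ops["MODIFY"] += 1
    return ops

stats = count_change_ops(["a = 1", "return a"],
                         ["a = 1", "b = 2", "return a"])
```

Aggregating such counts over a model's outputs makes the “aggressiveness” claim measurable: a model performing more ADDs with more `added_lines` per ADD is the behavior the study reports for LLMs.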
XU Zi-Mao , JIANG Yan-Jie , ZHANG Yu-Xia , LIU Hui
2025, 36(6):2501-2514. DOI: 10.13328/j.cnki.jos.007329 CSTR: 32375.14.jos.007329
Abstract:Long methods, along with other types of code smells, prevent software applications from reaching their optimal readability, reusability, and maintainability. Consequently, automated detection and decomposition of long methods have been widely studied. Although existing approaches have significantly facilitated decomposition, their solutions often differ substantially from the optimal ones. To address this, the automatable portion of a publicly available dataset of real-world long methods is investigated. Based on the findings of this investigation, a new method based on large language models (LLMs), called Lsplitter, is proposed for automatically decomposing long methods. For a given long method, Lsplitter decomposes it into a series of shorter methods according to heuristic rules and LLMs. However, LLMs often split out highly similar methods. To handle such decomposition results, Lsplitter utilizes a location-based algorithm to merge physically contiguous and highly similar methods into a longer method, and finally ranks the candidate results. Experiments are conducted on 2 849 long methods in real Java projects. The experimental results show that compared with traditional methods combined with a modularity matrix, the hit rate of Lsplitter is improved by 142%, and compared with methods purely based on LLMs, the hit rate is improved by 7.6%.
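The merging step described above, recombining physically contiguous and highly similar fragments, can be sketched as follows; the 0.8 threshold and string-level similarity are illustrative choices, not Lsplitter's actual algorithm.

```python
import difflib

def merge_contiguous_similar(fragments, threshold=0.8):
    """Merge adjacent method bodies whose text is highly similar back
    into one longer method. The heuristic is location-based: only
    physical neighbors are ever compared."""
    merged = [fragments[0]]
    for body in fragments[1:]:
        sim = difflib.SequenceMatcher(None, merged[-1], body).ratio()
        if sim >= threshold:
            merged[-1] = merged[-1] + "\n" + body   # stitch neighbors
        else:
            merged.append(body)
    return merged

result = merge_contiguous_similar([
    "x = a + 1\nprint(x)",
    "y = a + 2\nprint(y)",
    "return done",
])
```

The two near-duplicate fragments collapse into one, while the unrelated tail survives as its own candidate, mirroring how over-eager LLM splits are repaired before ranking.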
WANG Xi-Zao , SHEN Tian-Qi , BIN Xiang-Rong , BU Lei
2025, 36(6):2515-2534. DOI: 10.13328/j.cnki.jos.007330 CSTR: 32375.14.jos.007330
Abstract:Datalog, a declarative logic programming language, is widely applied in various fields. In recent years, growing interest in Datalog from both academia and industry has led to the design and development of multiple Datalog engines and corresponding dialects. However, a problem brought about by these multiple dialects is that code implemented in one Datalog dialect generally cannot be executed on the engine of another. Therefore, when a new Datalog engine is adopted, existing Datalog code needs to be translated into the new dialect. Current Datalog code translation techniques fall into two categories, manually rewriting the code and manually designing translation rules, both of which are time-consuming, involve a large amount of repetitive work, and lack flexibility and scalability. In this study, a Datalog code translation technology empowered by large language models (LLMs) is proposed. By leveraging the powerful code understanding and generation capabilities of LLMs, together with a divide-and-conquer translation strategy, prompt engineering based on few-shot and chain-of-thought prompts, and an iterative error-correction mechanism based on a check-feedback-repair loop, it achieves high-precision code translation between different Datalog dialects, reducing the workload of developers in repeatedly developing translation rules. Based on this code translation technology, a general declarative incremental program analysis framework based on Datalog is designed and implemented. The performance of the proposed LLM-powered Datalog code translation technology is evaluated on different Datalog dialect pairs, and the evaluation results verify its effectiveness.
This study also conducts an experimental evaluation of the general declarative incremental program analysis framework, verifying the speedup effect of incremental program analysis based on the proposed code translation technology.
WANG Yi-Bo , WANG Ying , YU Yue , XU Chang , YU Hai , ZHU Zhi-Liang
2025, 36(6):2535-2557. DOI: 10.13328/j.cnki.jos.007324 CSTR: 32375.14.jos.007324
Abstract:The field of software engineering has been significantly influenced by the rapid development of large language models (LLMs). These models, pre-trained on vast amounts of code from open-source repositories, can efficiently accomplish tasks such as code generation and code completion. However, much of the code in open-source repositories is constrained by open-source licenses, bringing potential license violation risks to large models. This study focuses on the license violation risks between code generated by LLMs and open-source repositories. A detection framework that supports tracing the source of code generated by large models and identifying copyright infringement issues is developed based on code clone technology. With this framework, 135 000 Python code snippets generated by 9 mainstream code large models are traced to their sources, and their open-source license compatibility within the open-source community is checked. Through investigation of three research questions, the impact of large model code generation on the open-source software ecosystem is explored: (1) To what extent is the code generated by large models cloned from open-source software repositories? (2) Is there a risk of open-source license violations in the code generated by large models? (3) Is there a risk of open-source license violations in large model-generated code included in real open-source software? The experimental results indicate that among the 43 130 and 65 900 Python code snippets longer than six lines generated from functional descriptions and method signatures, respectively, 68.5% and 60.9% are traced to cloned open-source code segments. The CodeParrot and CodeGen series models have the highest clone ratios, while GPT-3.5-Turbo has the lowest. Besides, 92.7% of the code generated from functional descriptions lacks a license declaration.
By comparison with the licenses of the traced code, 81.8% of the code carries open-source license violation risks. Furthermore, among 229 LLM-generated code snippets collected from GitHub, 136 are traced to open-source code segments, among which 38 are Type-1 and Type-2 clones, and 30 carry open-source license violation risks. These issues have been reported to the developers in the form of issue reports, and feedback has so far been received from eight developers.
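The clone categories mentioned above (Type-1: identical up to whitespace and comments; Type-2: identical up to renamed identifiers and literals) can be illustrated with a crude token normalizer; this sketch is far simpler than the clone detection framework the study builds.

```python
import re

PY_KEYWORDS = {"def", "return", "if", "else", "for", "in", "while"}

def normalize(code):
    """Map identifiers to ID and numeric literals to NUM so that
    Type-2 clones (renamed copies) normalize to the same token list."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return ["NUM" if t.isdigit()
            else "ID" if re.fullmatch(r"[A-Za-z_]\w*", t)
                         and t not in PY_KEYWORDS
            else t
            for t in tokens]

def is_type2_clone(a, b):
    # Equal normalized token streams => Type-2 clone candidate.
    return normalize(a) == normalize(b)
```

Tracing generated code against a repository index with such normalized fingerprints is what lets renamed copies of licensed code be detected at all.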
WANG Lu-Qiao , ZHOU Yang-Tao , LI Qing-Shan , WANG Ming-Kang , XU Zi-Xuan , CUI Di , WANG Lu , LUO Yi-Xing
2025, 36(6):2558-2575. DOI: 10.13328/j.cnki.jos.007326 CSTR: 32375.14.jos.007326
Abstract:The pull request (PR)-based software development mechanism is of great significance in open-source software practice. Appropriate code reviewers can assist contributors in detecting potential errors in PRs through code review, thus providing quality assurance for the continuous development and integration process. However, the complexity of code change content and the inherent diversity of review behaviors increase the difficulty of reviewer recommendation. Existing methods mainly concentrate on mining the semantic information of changed code from PRs or constructing reviewer portraits from review history, and then make recommendations through various static strategy combinations. These studies are restricted by the richness of model training corpora and the complexity of interaction types, leading to unsatisfactory recommendation performance. Given this, this study proposes a novel code reviewer recommendation method based on inter-agent collaboration. The method utilizes advanced large language models to accurately capture the rich textual semantic information of PRs and reviewers, while the powerful planning, collaboration, and decision-making capabilities of AI agents enable the integration of information from different interaction types with high flexibility and adaptability. Experimental analysis on real datasets shows that compared with the baseline reviewer recommendation methods, the performance of the proposed method is improved by 4.45% to 26.04%. In addition, a case study shows that the proposed method performs outstandingly in interpretability, further verifying its effectiveness and reliability in practical applications.
CHEN Quan-Lin , CHEN Yi-Yu , HUO Jing , CAO Hong-Ye , GAO Yang , LI Dong , HAO Jian-Ye
2025, 36(6):2576-2603. DOI: 10.13328/j.cnki.jos.007304 CSTR: 32375.14.jos.007304
Abstract:Bayesian optimization is a technique for optimizing black-box functions. Owing to its high sample efficiency, it is widely applied across scientific and engineering fields, such as hyperparameter tuning of deep models, compound design, drug development, and material design. However, the performance of Bayesian optimization deteriorates significantly when the input space is high-dimensional. To overcome this limitation, numerous studies have extended Bayesian optimization methods to high-dimensional settings. To analyze high-dimensional Bayesian optimization research in depth, this study categorizes existing methods into three types based on their assumptions and characteristics: methods based on the effective low-dimensional hypothesis, methods based on additive assumptions, and methods based on local search. It then elaborates on and analyzes the research progress of these three types of methods, compares the advantages and disadvantages of each in the application of Bayesian optimization, and finally summarizes the main research trends in high-dimensional Bayesian optimization at the current stage and discusses future development directions.
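The effective low-dimensional hypothesis can be made concrete with a random-embedding sketch in the spirit of methods such as REMBO: search in a low-dimensional space z and evaluate the black box at a linear lift of z. The objective, dimensions, and budget below are invented for illustration, and plain random search stands in for the low-dimensional Bayesian optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 100, 2                          # ambient and effective dimensions

def f(x):
    # Black box that truly depends on only 2 of its 100 coordinates,
    # matching the effective low-dimensional hypothesis.
    return (x[3] - 0.5) ** 2 + (x[42] + 0.3) ** 2

A = rng.normal(size=(D, d))            # random embedding matrix

def lift(z):
    return np.clip(A @ z, -1.0, 1.0)   # map z into the box [-1, 1]^D

# Random search over the 2-D space stands in for the surrogate-based
# optimizer; the point is that the search happens in d dimensions, not D.
best_val = float("inf")
for _ in range(2000):
    z = rng.uniform(-1.0, 1.0, size=d)
    best_val = min(best_val, f(lift(z)))
```

Because the objective varies only along a low-dimensional subspace, optimizing over z is enough to beat the trivial point x = 0 (where f = 0.34), which is the premise these embedding methods exploit.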
SUN Ze-Yu , WU Jing-Zheng , LING Xiang , WEI Yi-Lin , LUO Tian-Yue , WU Yan-Jun
2025, 36(6):2604-2642. DOI: 10.13328/j.cnki.jos.007308 CSTR: 32375.14.jos.007308
Abstract:The current mainstream mode of software development is supply chain-level reuse of open-source software and components. It avoids repetitive development, reduces research and development costs, and enhances development efficiency. However, it inevitably brings about issues such as unknown component sources, unclear component compositions, unidentified component vulnerabilities, and license violations. To address these issues, researchers have proposed the software bill of materials (SBOM). An SBOM provides a detailed list of software components and their relationships, reveals potential and known threats, and makes software transparent. Since its proposal, research on SBOM both in China and abroad has mainly focused on its current status, applications, and tools, lacking theoretical and systematic investigation. This study presents a comprehensive review of the background, basic concepts, generation techniques, tools and performance analysis, applications, challenges, and trends of SBOM. It also proposes the new concept of SBOM+, which integrates fine-grained security vulnerability perception and license conflict detection. The aim is to support researchers engaged in SBOM, software development, and supply chain security from the perspectives of concepts, technologies, tools, applications, and development.
WANG Lei , YUAN Ye , WANG Guo-Ren
2025, 36(6):2643-2682. DOI: 10.13328/j.cnki.jos.007290 CSTR: 32375.14.jos.007290
Abstract:Design pattern detection is an essential research topic in software engineering. Many scholars both domestically and internationally have dedicated their efforts to researching and resolving design pattern detection, yielding fruitful results. This study reviews current technologies in software design pattern detection and discusses their prospects. Firstly, it briefly introduces the development history of software design pattern detection, discusses the objects of design pattern detection, summarizes the feature types of design patterns, and presents the evaluation metrics of design pattern detection. Then, existing classification methods for design pattern detection techniques are summarized, and the classification method proposed in this study is introduced. Next, following the development timeline of design pattern detection technologies, the research status and latest advancements are discussed for three categories of approaches: non-machine-learning design pattern detection, machine-learning-based design pattern detection, and design pattern detection based on pre-trained language models, with the current achievements summarized and compared. Finally, the main problems and challenges in this field are analyzed, and further research directions and potential solutions are pointed out. Covering early non-machine-learning methods, the use of machine learning technologies, and the application of modern pre-trained language models, this study comprehensively and systematically presents the development history, latest advancements, and prospects of this field, providing valuable guidance for future research directions and ideas.
LI Heng , WU Bang , GONG Zhu , GAO Cui-Ying , YUAN Wei , LUO Xia-Pu
2025, 36(6):2683-2712. DOI: 10.13328/j.cnki.jos.007312 CSTR: 32375.14.jos.007312
Abstract:In the face of the severe security risks posed by Android malware, effective Android malware detection has become a focus of common concern in both industry and academia. However, with the emergence of Android adversarial example techniques, existing malware detection systems are facing unprecedented challenges. Android malware adversarial example attacks can bypass existing malware detection models by perturbing the source code or features of malware while keeping its original functionality intact. Despite substantial research on adversarial example attacks against malware, there is still no comprehensive review specifically focusing on adversarial example attacks in the Android system, and the unique requirements for adversarial example design on Android have not been studied. Therefore, this study begins by introducing the fundamental concepts of Android malware detection. It then classifies existing Android adversarial example techniques from various perspectives and provides an overview of their development sequence. Subsequently, it reviews Android adversarial example techniques of recent years, introduces representative work in different categories, and analyzes their pros and cons. Furthermore, it categorizes and introduces common means of code perturbation in Android adversarial example attacks and analyzes their application scenarios. Finally, it discusses the challenges faced by Android malware adversarial example techniques and envisions future research directions in this emerging field.
WANG Bo , CHEN Chong , DENG Ming , DONG Zhen , LIN You-Fang , HAO Dan
2025, 36(6):2713-2746. DOI: 10.13328/j.cnki.jos.007313 CSTR: 32375.14.jos.007313
Abstract:Mobile applications, a new computing mode that has emerged in the past decade, significantly impact people’s lifestyles. Mobile applications primarily interact through graphical user interfaces (GUIs), and manual testing of them requires significant manpower and material resources. In response, researchers have proposed automated GUI test generation techniques for mobile applications to enhance testing efficiency and detect potential defects. This study collects 145 relevant papers and systematically sorts out, analyzes, and summarizes existing work. It proposes a research framework called “Test Generator-Test Environment” to categorize research in this domain according to the module to which it belongs. In particular, existing methods are roughly classified into five categories according to the approach on which the test generator is based: random-based, heuristic-search-based, model-based, machine-learning-based, and test-migration-based approaches. Furthermore, existing methods are analyzed and discussed from other classification dimensions, such as defect categories and test action categories. Additionally, influential datasets and open-source tools in this field are compiled. Finally, this study summarizes the current challenges and provides an outlook on future research directions.
ZOU Bai-Han , WANG Ying , PENG Xin , LOU Yi-Ling , LIU Li-Hua , ZHANG Xin-Dong , LIN Fan , LIU Ming-Wei
2025, 36(6):2747-2773. DOI: 10.13328/j.cnki.jos.007226 CSTR: 32375.14.jos.007226
Abstract:When writing code, software developers often refer to code snippets that implement similar functions in the project. Code generation models behave similarly when generating code fragments, using the code context provided in the input as a reference. Retrieval-augmented code completion follows the same idea: external code retrieved from a retrieval corpus serves as additional context to prompt the generation model to complete unfinished code fragments. Existing retrieval-augmented code completion methods directly concatenate the input code and the retrieval results as the input of the generation model. This brings a risk that the retrieved code fragments may mislead rather than help the model, resulting in inaccurate or irrelevant completions. In addition, the retrieved external code is concatenated with the input code and fed to the model regardless of whether it is actually relevant, so the effect of such methods largely depends on the accuracy of the code retrieval stage: if no usable code fragments are returned in the retrieval phase, the subsequent code completion may suffer. An empirical study is conducted on the retrieval augmentation strategies in existing code completion methods. Through qualitative and quantitative experiments, the impact of each stage of retrieval augmentation on the code completion effect is analyzed. The empirical study identifies three factors affecting the effect of retrieval augmentation: code granularity, code retrieval methods, and post-processing methods. Based on the conclusions of the empirical study, an improved method, MAGIC (multi-stage optimization for retrieval augmented code completion), is proposed to improve retrieval augmentation by optimizing the code retrieval strategy in stages.
Improved strategies such as code segmentation, retrieval-reranking, and template prompt generation are designed to effectively enhance the auxiliary effect of the code retrieval module on the code completion model, while reducing the interference of irrelevant code in the model’s generation phase and improving the quality of the generated code. Experimental results on a Java code dataset show that compared with existing retrieval-augmented code completion methods, this method increases the edit similarity and exact match metrics by 6.76% and 7.81%, respectively. Compared with a large code model with 6B parameters, this method saves 94.5% of GPU memory and 73.8% of inference time while improving the edit similarity and exact match metrics by 5.62% and 4.66%, respectively.
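The retrieve-then-prompt pipeline the study analyzes, with a filtering step so irrelevant code never reaches the prompt, can be sketched as below; the Jaccard scoring, the 0.2 threshold, and the prompt template are illustrative stand-ins for MAGIC's actual staged strategies.

```python
def tokens(code):
    return set(code.split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query, corpus, k=2):
    """Stage 1: coarse retrieval of the k snippets with the highest
    token overlap with the unfinished code."""
    return sorted(corpus,
                  key=lambda c: jaccard(tokens(query), tokens(c)),
                  reverse=True)[:k]

def build_prompt(unfinished, corpus, min_sim=0.2):
    """Stage 2: filter weakly related candidates, then wrap survivors
    in a template so the model reads them as references rather than as
    code to be continued."""
    refs = [c for c in retrieve(unfinished, corpus)
            if jaccard(tokens(unfinished), tokens(c)) >= min_sim]
    ref_block = "".join(f"# reference:\n{c}\n" for c in refs)
    return f"{ref_block}# complete:\n{unfinished}"

corpus = [
    "def read_json ( path ) : return json . load ( open ( path ) )",
    "class Foo : pass",
]
prompt = build_prompt("def read_yaml ( path ) :", corpus)
```

The unrelated snippet is retrieved but filtered out before prompting, which is precisely the failure mode of naive concatenation that the empirical study documents.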
XIAO Quan-Bin , CHEN Yuan , WU Yi-Jian , PENG Xin
2025, 36(6):2774-2793. DOI: 10.13328/j.cnki.jos.007228 CSTR: 32375.14.jos.007228
Abstract:In the field of software engineering, code repositories contain a wealth of knowledge resources that can provide developers with examples of programming practice. If the repetitive patterns that frequently occur in source code could be effectively extracted in the form of code templates, programming efficiency would be significantly improved. In current practice, developers often reuse existing solutions by searching through source code. However, this typically yields a large number of similar and redundant results, increasing the burden of subsequent filtering. Moreover, template mining techniques based on cloned code often fail to cover extensive patterns constructed from dispersed small clones, limiting the practicality of the templates. A method is proposed for extracting and retrieving code templates based on code clone detection. It achieves more effective function-level code template extraction by stitching together multiple fragment-level clones and by extracting and aggregating the shared parts of method-level clones, thereby addressing the issue of template quality. Based on the mined code templates, this study proposes a triplet representation of code structural features that effectively supplements plain-text features and implements an efficient, concise structural representation. In addition, it presents a template retrieval method that combines structural and textual search to retrieve templates by matching features of the programming context. CodeSculptor, the tool implemented based on this method, demonstrates a significant capability to extract high-quality code templates in an evaluation against a codebase of 45 high-quality Java open-source projects. The results show that the templates mined by the tool achieve an average code reduction of 60.87%, with 92.09% produced by stitching fragment-level clones, a proportion of templates not identifiable by traditional methods.
This proves the superior performance of the method in recognizing and constructing code templates. Furthermore, the accuracy of the top-5 results in code template search and recommendation is 96.87%. A preliminary case study on 9 600 randomly selected templates reveals that most of the sampled code templates are complete and semantically coherent, affirming their practicality. Nonetheless, there are a few meaningless templates, highlighting the future potential to refine the template extraction strategy. A user study further shows that code development tasks can be completed more efficiently with CodeSculptor.
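The core of extracting the shared parts of clones can be illustrated with difflib: keep the regions two similar fragments share and punch holes where they differ. This is a character-level toy, not CodeSculptor's stitching algorithm.

```python
import difflib

def shared_template(a, b, hole="<?>"):
    """Return the shared skeleton of two similar code fragments,
    replacing each differing region with a hole marker."""
    sm = difflib.SequenceMatcher(None, a, b)
    parts = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        parts.append(a[i1:i2] if tag == "equal" else hole)
    return "".join(parts)

template = shared_template("total = price * qty", "total = price * num")
```

Aggregating such skeletons across a clone class yields a reusable template whose holes mark the parameters a developer must fill in.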
LI Ya-Cong , LIU Hao-Bing , JIANG Ruo-Bing , LIU Cong , ZHU Yan-Min
2025, 36(6):2794-2826. DOI: 10.13328/j.cnki.jos.007319 CSTR: 32375.14.jos.007319
Abstract:Heterogeneous graphs, which can effectively capture the complex and diverse relationships between entities in the real world, play a crucial role in many domains. Heterogeneous graph representation learning aims to map the information in graphs into a low-dimensional space, so as to capture the deep semantic associations between nodes and support downstream tasks such as node classification and clustering. This study presents a comprehensive review of the latest research progress in heterogeneous graph representation learning, covering both methodological advancements and real-world applications. It first formally defines the concept of heterogeneous graphs and discusses the key challenges in heterogeneous graph representation learning. It then systematically reviews the mainstream methods for heterogeneous graph representation learning from the perspectives of shallow models and deep models, with a particular focus on deep models, which are further categorized and analyzed from the perspective of heterogeneous graph transformation. The strengths, limitations, and application scenarios of various methods are thoroughly analyzed, aiming to provide readers with a holistic research perspective. Furthermore, the commonly used datasets and tools in the field of heterogeneous graph representation learning are introduced, and their real-world applications are discussed. Finally, the main contributions of this study are summarized and an outlook on future research directions is presented. This study intends to offer researchers a comprehensive understanding of the field of heterogeneous graph representation learning, laying a solid foundation for future research and application.
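One basic transformation in heterogeneous graph learning, composing relations along a meta-path to obtain a homogeneous graph, reduces to matrix multiplication; the toy author-paper graph below is invented for illustration.

```python
import numpy as np

# Toy heterogeneous graph: 3 authors (A) and 2 papers (P).
# A_AP[i, j] = 1 if author i wrote paper j.
A_AP = np.array([[1, 0],
                 [1, 1],
                 [0, 1]])

# Composing the writes / written-by relations along the meta-path
# A-P-A yields a homogeneous author graph whose entry (i, j) counts
# the papers that authors i and j co-authored.
A_APA = A_AP @ A_AP.T
```

Many deep heterogeneous models operate on exactly such meta-path-induced graphs, applying a homogeneous graph neural network per meta-path and then fusing the results.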
BAI Rui-Rui , WANG Zhong-Qing , ZHOU Guo-Dong
2025, 36(6):2827-2843. DOI: 10.13328/j.cnki.jos.007246 CSTR: 32375.14.jos.007246
Abstract:Cross-lingual sentiment classification is very important in natural language processing and has been widely studied. It uses label information from the source language to construct a sentiment classification system for the target language, thereby greatly reducing laborious labeling work in the target language. A fundamental challenge in cross-lingual sentiment classification is the obvious difference in the expressions of different languages. This study proposes a method for cross-lingual sentiment classification based on a bilingual dependency graph model. Although surface expressions vary across languages, their internal syntactic dependencies are similar. By establishing edges among word nodes in different languages to represent the semantic relevance of bilingual comment instances, the bilingual dependency graph can explicitly model the similarity of the dependency relationships among words in different languages, allowing graph neural networks to integrate syntactic structure information within and across languages for cross-lingual sentiment classification. Experiments conducted on datasets in both English and Chinese show that the proposed method achieves an improvement of 3% over the baseline method. This demonstrates that bilingual dependency graphs can effectively model the correlation of comment instances in different languages, thereby significantly improving the accuracy of cross-lingual sentiment classification.
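To make the idea of a bilingual dependency graph concrete, here is a minimal sketch of how intra-language dependency edges and cross-lingual alignment edges could be merged into one undirected graph for message passing. The input format and node naming are assumptions for illustration, not the paper's actual data structures.

```python
def build_bilingual_graph(src_deps, tgt_deps, alignments):
    """Build one joint adjacency structure from a bilingual comment pair.
    src_deps / tgt_deps: (head, dependent) word-index pairs from each
    sentence's dependency parse; alignments: (src_word, tgt_word) pairs
    marking semantically related words across languages.
    Node ids are prefixed by language so the two sentences stay distinct."""
    edges = set()
    for h, d in src_deps:
        edges.add((f"src:{h}", f"src:{d}"))       # source-language syntax
    for h, d in tgt_deps:
        edges.add((f"tgt:{h}", f"tgt:{d}"))       # target-language syntax
    for s, t in alignments:
        edges.add((f"src:{s}", f"tgt:{t}"))       # cross-lingual relevance
    # symmetric adjacency: a GNN would pass messages in both directions
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj
```

A graph neural network run over this structure lets each word aggregate information from its syntactic neighbours in both languages, which is the mechanism the abstract describes for transferring sentiment signal across languages.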
ZHU Ming-Hui , LI Zheng , LI Rui-Yuan , CHEN Chao , ZHENG Yu
2025, 36(6):2844-2874. DOI: 10.13328/j.cnki.jos.007317 CSTR: 32375.14.jos.007317
Abstract:Advances in the IoT (Internet of Things) generate an enormous volume of floating-point time series data, which poses great challenges in storing and transmitting these data. To this end, floating-point time series data compression is extremely crucial. Based on data reversibility, it can be classified into lossy and lossless compression. Lossy compression methods achieve a better compression ratio by discarding some data information and are suitable for applications with lower precision requirements. Lossless compression methods, while reducing data size, retain all data information, which is essential for applications that require maintaining data integrity and accuracy. In addition, streaming compression algorithms have emerged to meet the requirements of real-time monitoring on edge devices. Existing reviews of time series compression suffer from incomplete coverage, unclear organization, single classification criteria, and the omission of recent representative algorithms. In this study, time series compression algorithms over the years are first divided into lossy and lossless compression. Different algorithmic frameworks are then further distinguished, including those based on data representation, prediction, machine learning, and transformation, and the compression characteristics of streaming and batch processing are summarized. Next, the design ideas of various compression algorithms are deeply analyzed, and diagrams of their development lineage are presented. The advantages and disadvantages of various algorithms are then compared experimentally. Finally, common application scenarios are summarized and future research directions are envisioned.
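As a concrete example of the prediction-based lossless family surveyed above, the following is a simplified, byte-level variant of the XOR encoding used by streaming compressors such as Gorilla (the real algorithm works at the bit level and uses a different framing); each value is XORed with its predecessor and only the non-zero byte window of the result is stored. The framing format here is an assumption for illustration.

```python
import struct

def f2b(x):  # reinterpret a float64 as a 64-bit unsigned integer
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def b2f(b):  # inverse of f2b
    return struct.unpack('>d', struct.pack('>Q', b))[0]

def compress(values):
    """Byte-level XOR delta coding: slowly changing series share most of
    their bit pattern with the previous value, so the XOR is mostly zero."""
    out = [struct.pack('>d', values[0])]          # first value stored raw
    prev = f2b(values[0])
    for x in values[1:]:
        cur = f2b(x)
        xor = prev ^ cur
        if xor == 0:
            out.append(b'\x00')                   # control byte 0: repeat
        else:
            raw = xor.to_bytes(8, 'big')
            lead = next(i for i, b in enumerate(raw) if b)
            trail = next(i for i, b in enumerate(reversed(raw)) if b)
            body = raw[lead:8 - trail]            # meaningful byte window
            # control byte packs leading-zero count and window length
            out.append(bytes([(lead << 4) | len(body)]) + body)
        prev = cur
    return b''.join(out)

def decompress(blob, n):
    vals = [struct.unpack('>d', blob[:8])[0]]
    prev = f2b(vals[0])
    pos = 8
    for _ in range(n - 1):
        c = blob[pos]; pos += 1
        if c == 0:
            xor = 0
        else:
            lead, ln = c >> 4, c & 0x0F
            body = blob[pos:pos + ln]; pos += ln
            xor = int.from_bytes(b'\x00' * lead + body
                                 + b'\x00' * (8 - lead - ln), 'big')
        prev ^= xor
        vals.append(b2f(prev))
    return vals
```

The design choice is the one the abstract classifies as "prediction-based": the predictor is simply the previous value, and only the prediction residual (here an XOR) is encoded, which is why the scheme suits streaming edge scenarios with one pass and constant memory.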
ZHANG Bin , ZHANG Yu , ZHANG Wei-Zhe , QIAO Yan-Chen , LIU Xiang , LIU Peng-Hui
2025, 36(6):2875-2899. DOI: 10.13328/j.cnki.jos.007305 CSTR: 32375.14.jos.007305
Abstract:The PKI system is currently an important facility for users to securely access basic resources. It ensures the security of users’ access to resources through public third-party authentication. With the gradual deployment and application of PKI technology, various security issues have arisen in practice. Attackers can steal user information and disrupt user access by attacking the PKI system. This study starts from the basic working principles of PKI and comprehensively introduces all the elements involved in the practical deployment and application of the PKI system, including the PKI architecture, workflow, certificates, certificate chains, certificate revocation, and CT (certificate transparency) log services. On this basis, the study comprehensively sorts out and summarizes the security issues that the PKI system faces during its operation, including operational and technical risks, measurement and risk detection of the PKI system, and various risk prevention technologies for PKI systems. Finally, future research directions in the field of PKI are prospected.
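To make the certificate-chain element of the PKI workflow concrete, here is a toy validator that checks only the structural rules of path validation (issuer/subject linkage, validity window, trust anchor). Real validation, as specified in RFC 5280, additionally verifies signatures, revocation status, and name constraints; the `Cert` type below is a hypothetical stand-in for an X.509 certificate, not a real library API.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Cert:
    subject: str
    issuer: str
    not_before: datetime
    not_after: datetime

def validate_chain(chain, trusted_roots, now):
    """chain: leaf certificate first, root last.
    Toy structural check only: every certificate must be within its
    validity window, each certificate's issuer must be the subject of
    the next one up, and the chain must end at a trusted root."""
    for cert in chain:
        if not (cert.not_before <= now <= cert.not_after):
            return False                       # expired or not yet valid
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            return False                       # broken issuer linkage
    return chain[-1].subject in trusted_roots  # anchored in the trust store
```

Each of these checks corresponds to an attack surface the survey discusses: tampering with validity periods, splicing certificates from unrelated issuers, or substituting an untrusted root.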
WANG Xin-Rui , YAO Yue , YU Dong-Xiao , GAO Hong , CHENG Xiu-Zhen
2025, 36(6):2900-2926. DOI: 10.13328/j.cnki.jos.007243 CSTR: 32375.14.jos.007243
Abstract:Dynamic information networks (DIN), which contain evolving objects in the real world and the links among them, are often modeled as a series of static undirected graph snapshots. A community consists of a group of well-connected objects in an information network. In a DIN, there is often a community whose size increases over time while its members remain well connected throughout that period. The evolving trajectory of such a community forms a sequence of the community over several snapshots of the DIN, which is termed a lasting enlarging community sequence in this study. Searching for lasting enlarging community sequences in a DIN is meaningful, yet no previous research has paid attention to such community sequences. This study formally defines the q-based lasting enlarging community sequence (qLEC) in a DIN by combining set inclusion with the triangle-connected k-truss model. A two-phase search algorithm is developed, which computes candidate vertex sets of communities from the beginning to the end of the time window and then performs the community sequence search from the end back to the beginning. This study also provides optimization strategies based on early termination and TCP index compression to reduce time and space costs. Extensive experiments demonstrate that the qLEC model has practical significance compared to existing dynamic community models, that the two-phase search algorithm effectively finds qLEC-based lasting enlarging community sequences, and that the proposed optimization strategies significantly reduce its spatiotemporal cost.
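For readers unfamiliar with the k-truss model that the qLEC definition builds on, the following is a minimal peeling-based k-truss computation: an edge survives only if it participates in at least k-2 triangles among the surviving edges. It omits the triangle-connectivity and cross-snapshot persistence constraints that qLEC adds, and it is a textbook sketch rather than the authors' two-phase algorithm.

```python
def k_truss(edges, k):
    """Return the edges of the k-truss of an undirected graph.
    An edge's support is its number of triangles, i.e. the number of
    common neighbours of its endpoints; edges with support < k-2 are
    peeled repeatedly until the graph stabilizes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        snapshot = [(u, v) for u in adj for v in adj[u] if u < v]
        for u, v in snapshot:
            if v not in adj[u]:
                continue                      # already peeled this pass
            if len(adj[u] & adj[v]) < k - 2:  # triangles through (u, v)
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return sorted((u, v) for u in adj for v in adj[u] if u < v)
```

In the qLEC setting, such a truss would additionally need its triangles to be connected through shared edges and its vertex set to grow monotonically across consecutive snapshots, which is what the two-phase forward/backward search enforces.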

