WANG Shu-Lan, QIU Yao, ZHAO Chen-Bin, ZOU Jia-Xu, WANG Cai-Fen
2025, 36(7):2929-2946. DOI: 10.13328/j.cnki.jos.007332
Abstract:Differential privacy, owing to its strong privacy protection capability, is applied to the random forest algorithm to address the privacy leakage problem. However, directly applying differential privacy to the random forest algorithm leads to a significant decline in the model’s classification accuracy. To balance privacy protection against model accuracy, this study proposes an efficient differential privacy random forest training algorithm, efficient differential privacy random forest (eDPRF). Specifically, the study designs a decision tree construction method based on the permute-and-flip mechanism. By exploiting the efficient query output of the permute-and-flip mechanism, corresponding utility functions are designed to output split features and labels accurately, effectively enhancing the tree model’s ability to learn data information under perturbation. At the same time, the study designs a privacy budget allocation strategy based on the composition theorem, which improves the privacy budget utilization of nodes by obtaining training subsets through sampling without replacement and adjusting internal budgets in a differentiated manner. Finally, theoretical analysis and experimental evaluation demonstrate that the proposed algorithm outperforms similar algorithms in terms of classification accuracy under the same privacy budget.
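For readers unfamiliar with the selection primitive named above, the following is a minimal, self-contained sketch of the generic permute-and-flip mechanism (McKenna & Sheldon, 2020) in Python; the candidate utility scores are toy placeholders, and the eDPRF utility functions and budget allocation described in the paper are not reproduced here.

```python
import math
import random

def permute_and_flip(candidates, utility, epsilon, sensitivity=1.0):
    """Generic permute-and-flip selection: shuffle the candidates, then accept
    each one with probability exp(eps*(u - u_max)/(2*sensitivity)).
    Higher-utility candidates are exponentially more likely to be returned."""
    u_max = max(utility(c) for c in candidates)
    order = list(candidates)
    random.shuffle(order)                      # "permute"
    for c in order:                            # "flip" a biased coin per candidate
        p = math.exp(epsilon * (utility(c) - u_max) / (2.0 * sensitivity))
        if random.random() <= p:
            return c
    # unreachable: the max-utility candidate is accepted with probability 1

# Toy usage: pick a split feature from placeholder (non-private) scores.
scores = {"age": 0.35, "income": 0.20, "zip": 0.05}
print(permute_and_flip(list(scores), lambda f: scores[f], epsilon=1.0))
```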
OU Xian-Fei, JIANG Yan-Yan, XU Chang
2025, 36(7):2947-2963. DOI: 10.13328/j.cnki.jos.007333
Abstract:Fuzz testing techniques play a significant role in software quality assurance and software security testing. However, when dealing with systems such as compilers that have complex input semantics, existing fuzz testing tools often struggle: their mutation strategies lack semantic awareness, so the generated programs fail to pass compiler front-end checks. This study proposes a semantics-aware greybox fuzz testing method aimed at enhancing the efficiency of fuzz testing tools in the domain of compiler testing. It designs and implements a series of mutation operators that maintain input semantic validity and explore contextual diversity, and develops efficient selection strategies according to the characteristics of these operators. The greybox fuzz testing tool SemaAFL is built by integrating these strategies with traditional greybox fuzz testing tools. Experimental results indicate that, with these mutation operators, SemaAFL achieves approximately 14.5% and 11.2% higher code coverage on the GCC and Clang compilers compared with AFL++ and similar tools such as GrayC. During a week-long experiment, SemaAFL discovered and reported six previously unknown bugs in GCC and Clang.
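The paper's mutation operators are not listed in the abstract; purely as a hedged illustration of mutations that keep a program acceptable to a compiler front end, the toy operator below rewrites integer literals in C source text. It is an assumption-level example, not one of SemaAFL's operators.

```python
import random
import re

# Toy mutation in the spirit of semantics-aware compiler fuzzing: replace
# integer literals so the mutant still compiles. NOT one of SemaAFL's operators.
INT_LITERAL = re.compile(r'\b\d+\b')

def mutate_int_literals(c_source: str, rate: float = 0.3) -> str:
    def repl(m):
        if random.random() < rate:
            # 2147483647 is INT_MAX on common platforms; values stay valid C literals
            return str(random.choice([0, 1, 2147483647, random.randint(0, 1 << 16)]))
        return m.group(0)
    return INT_LITERAL.sub(repl, c_source)

seed = "int main(void) { int a = 42; return a % 7; }"
print(mutate_int_literals(seed))
```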
CHEN Yuan-Liang, MA Fu-Chen, ZHOU Yuan-Hang, YAN Zhen, JIANG Yu, SUN Jia-Guang
2025, 36(7):2964-3002. DOI: 10.13328/j.cnki.jos.007334
Abstract:Distributed systems are pillars of the current computing ecosystem, making modern computing more powerful, reliable, and flexible and covering key fields ranging from cloud computing and big data processing to the Internet of Things. However, due to system complexity, code defects are inevitably introduced during the implementation of distributed systems, posing a huge threat to system availability, robustness, and security. Therefore, testing and defect detection for distributed systems are of great importance. Dynamic testing techniques analyze a system in real time while it runs to detect defects and evaluate its behavior and functions; they are widely used for defect detection in various system applications and have successfully found many code defects. This study proposes a four-layer defect threat model for distributed systems, analyzes the testing requirements and main challenges of distributed systems based on this model, and presents a general framework for dynamic testing of distributed systems. Typical dynamic testing tools for distributed systems are then introduced from the perspective of detecting different types of system defects. Next, the study highlights critical techniques such as multidimensional test input generation, system-critical state awareness, and defect judgment criteria. The coverage and defect discovery capabilities of current mainstream dynamic testing tools for distributed systems are also evaluated, and the findings show that multidimensional input generation significantly enhances testing efficiency. Finally, the study discusses emerging trends and future directions in dynamic testing of distributed systems, aiming to address its inherent challenges and improve testing outcomes.
JIA Ang, FAN Ming, XU Xi, JIN Wu-Xia, WANG Hai-Jun, LIU Ting
2025, 36(7):3003-3021. DOI: 10.13328/j.cnki.jos.007335
Abstract:Binary2source function similarity detection is one of the fundamental tasks in software composition analysis. Existing binary2source matching work mainly adopts a 1-to-1 matching mechanism, where one binary function is matched against one source function. However, because of function inlining, the mapping may actually be 1-to-n (one binary function corresponds to multiple source functions), and existing binary2source matching methods suffer a 30% performance loss under function inlining due to this difference. To meet the requirement of matching binary functions to source functions under function inlining, this study proposes a binary2source function similarity detection method for 1-to-n matching, which generates source function sets as the matching objects for inlined binary functions to make up for their absence in the source function library. The effectiveness of the proposed method is evaluated through a series of experiments. The experimental results indicate that the method not only improves existing binary2source function similarity detection but also identifies the inlined source functions, helping existing tools better cope with the challenges of inlining.
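As a rough illustration of how 1-to-n matching targets could be assembled, the sketch below walks a source-level call graph and groups a function with the callees that a placeholder heuristic deems likely to be inlined; the call graph, size table, and inlining heuristic are hypothetical stand-ins, not the paper's actual model.

```python
# Sketch: build 1-to-n matching targets by simulating function inlining on a
# source-level call graph. The "small callees get inlined" heuristic is a toy.
call_graph = {            # caller -> callees (toy data)
    "parse": ["skip_ws", "read_token"],
    "read_token": ["is_digit"],
    "skip_ws": [],
    "is_digit": [],
}
size = {"parse": 120, "skip_ws": 8, "read_token": 30, "is_digit": 4}

def likely_inlined(callee, threshold=32):
    return size[callee] <= threshold

def inlined_set(root):
    """Source functions that may end up inside the binary function for `root`."""
    group, stack = {root}, [root]
    while stack:
        f = stack.pop()
        for callee in call_graph.get(f, []):
            if likely_inlined(callee) and callee not in group:
                group.add(callee)
                stack.append(callee)
    return group

print(inlined_set("parse"))   # {'parse', 'skip_ws', 'read_token', 'is_digit'}
```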
SHEN Qing-Chao, TIAN Jia-Shuo, CHEN Jun-Jie, CHEN Xiang, CHEN Qing-Yan, WANG Zan
2025, 36(7):3022-3040. DOI: 10.13328/j.cnki.jos.007336
Abstract:Deep learning compilers (DL compilers) are widely applied to optimize and deploy deep learning models. Like traditional compilers, DL compilers also contain bugs. Buggy DL compilers can cause compilation failures, generate incorrect compilation results, and sometimes even lead to disastrous consequences. To deeply understand the characteristics of DL compiler bugs, existing work has analyzed 603 early bugs in DL compilers. In recent years, however, DL compilers have been updated rapidly, with a large number of new features introduced and some old ones abandoned, and several bug detection approaches for DL compilers have been developed. It is therefore necessary to examine whether previous research conclusions on DL compiler bugs still hold. In addition, the relationships among the symptoms, root causes, and locations of bugs have not been explored in depth, and the characteristics of bug-revealing tests and bug-fixing patches have not been studied. To analyze how the characteristics and distribution of DL compiler bugs have evolved over time, this study collects 613 recently fixed bugs in three mainstream DL compilers (i.e., TVM from Apache, Glow from Facebook, and AKG from Huawei) and manually labels their root causes, symptoms, locations, and other characteristics. Based on the labeling results, the study explores the distribution characteristics of these bugs from multiple dimensions and compares them with those reported in existing work, and further investigates the characteristics of bug-revealing regression tests and bug-fixing patches. In total, the study summarizes 12 major findings to comprehensively characterize the current situation and evolution of DL compiler bugs and provides a series of feasible suggestions for detecting, locating, and repairing them. Finally, to verify the usefulness of these findings, a testing tool based on optimized configuration, CfgFuzz, is developed. CfgFuzz conducts combinatorial tests on compilation configuration options and detects 8 TVM bugs, 7 of which have been confirmed or fixed by developers.
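The sketch below illustrates generic combinatorial (pairwise) testing over boolean compilation configuration options, the style of testing the abstract attributes to CfgFuzz; the option names and the compile_and_check oracle are hypothetical placeholders, not CfgFuzz's real interface.

```python
import itertools
import random

# Hypothetical option names; a real tool would use the compiler's own flags.
OPTIONS = ["opt_level_3", "fuse_ops", "vectorize", "fast_math", "layout_rewrite"]

def all_pairs_covered(configs):
    """True when every pair of options appears in all four on/off combinations."""
    for a, b in itertools.combinations(range(len(OPTIONS)), 2):
        if len({(c[a], c[b]) for c in configs}) < 4:
            return False
    return True

def sample_pairwise_configs(max_tries=10000):
    """Greedy random sampling until pairwise coverage is reached (illustrative)."""
    configs = []
    while not all_pairs_covered(configs) and max_tries > 0:
        configs.append(tuple(random.choice((0, 1)) for _ in OPTIONS))
        max_tries -= 1
    return configs

def compile_and_check(config):
    """Placeholder oracle: compile a model under `config`, compare against a
    reference build, return False on divergence or crash."""
    return True

for cfg in sample_pairwise_configs():
    assert compile_and_check(dict(zip(OPTIONS, cfg)))
```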
SUN Wei-Jie, XU Chang, WANG Ying
2025, 36(7):3041-3086. DOI: 10.13328/j.cnki.jos.007338
Abstract:Java has become one of the most popular programming languages for application development, owing to its rich dependency libraries and convenient build tools such as Maven and Gradle. However, as the scale of dependency libraries keeps growing, dependency management in Java projects becomes increasingly complex and constantly exceeds the capabilities of existing tools. Potential problems can be triggered unexpectedly and seriously affect the building and running of the current project and of other projects in the Java ecosystem, for example by causing build errors, runtime crashes, or semantic conflicts. To address the gaps in the analysis of dependency management issues in existing research and technical literature, this study introduces the concept of “dependency smell” to build a unified model for these problems. A comprehensive empirical study on dependency management issues is conducted, covering all categories of Maven- and Gradle-related problems. The study analyzes diverse dependency management issues gathered from open-source communities (e.g., GitHub), official documentation (e.g., the Maven manual), and various surveys and technical papers, and finally summarizes 13 types of dependency smells together with their triggering roots and impact characteristics. Based on the findings of this empirical study, a unified detection algorithm for dependency smells in Java projects is designed, and a dedicated detection tool, JDepAna, supporting the Maven and Gradle build tools is implemented. Experimental results demonstrate that for known dependency smells, JDepAna achieves a detection recall of 95.9%. On hundreds of new Java projects, JDepAna detects 30689 instances of dependency smells; 360 instances are sampled, and manual verification shows a true positive rate of 96.1%. Additionally, 48 instances are reported to developers, with 42 promptly confirmed and 21 promptly fixed, validating the efficacy and practicality of the proposed Java dependency smell detection algorithm and tool in facilitating quality assurance for Java projects.
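As a minimal illustration of what a dependency smell detector checks, the sketch below flags one classic smell, the same artifact resolved to multiple versions in a single build, from a flattened dependency list; the data model is hypothetical, and JDepAna's 13 smell types and graph-based analysis are not reproduced.

```python
from collections import defaultdict

# Toy resolved-dependency list: (group:artifact, version, introduced-by path).
resolved = [
    ("com.google.guava:guava", "31.1-jre", "app -> libA"),
    ("com.google.guava:guava", "19.0",     "app -> libB -> libC"),
    ("org.slf4j:slf4j-api",    "1.7.36",   "app"),
]

def version_conflicts(deps):
    """Group entries by artifact and keep those resolved to more than one version."""
    by_artifact = defaultdict(list)
    for ga, version, path in deps:
        by_artifact[ga].append((version, path))
    return {ga: vs for ga, vs in by_artifact.items()
            if len({v for v, _ in vs}) > 1}

for ga, versions in version_conflicts(resolved).items():
    print(f"dependency smell: {ga} resolved to multiple versions: {versions}")
```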
ZHAO Dong-Dong, SONG Bao-Gang, LIAO Hu-Cheng, YAN Jiang, XIANG Jian-Wen
2025, 36(7):3087-3108. DOI: 10.13328/j.cnki.jos.007339
Abstract:With the rapid development of information technology, security authentication has become a crucial safeguard for personal privacy and data security. Among authentication technologies, iris recognition, with its outstanding accuracy and stability, is widely applied in fields such as system access control, healthcare, and judicial practice. However, once a user’s iris feature data is leaked, the loss is permanent, because iris features cannot be changed or revoked; the privacy protection of iris feature data is therefore particularly important. With the strong performance of neural networks in image processing, secure iris recognition schemes based on neural networks have been proposed, which maintain high recognition performance while protecting private data. However, in the face of constantly changing data and environments, secure iris recognition schemes need effective scalability; that is, a recognition scheme should maintain its performance as new users register. Most existing research on neural network-based secure iris recognition does not consider such scalability. To address these problems, this study proposes the generative feature replay-based secure incremental iris recognition (GFR-SIR) method and the privacy-preserving template replay-based secure incremental iris recognition (PTR-SIR) method. Specifically, GFR-SIR uses generative feature replay and feature distillation to alleviate the forgetting of previous-task knowledge during the expansion of neural networks and adopts the improved TNCB method to protect the privacy of iris feature data. PTR-SIR preserves the privacy-protecting templates obtained through the TNCB method in previous tasks and replays these templates during model training on the current task to achieve scalability of the recognition scheme. Experimental results show that after 5 rounds of expansion tasks, the recognition accuracy of GFR-SIR and PTR-SIR on the CASIA-Iris-Lamp dataset reaches 68.32% and 98.49% respectively, improvements of 58.49% and 88.66% over the fine-tuning method. The analysis indicates that GFR-SIR has significant advantages in security and model training efficiency since the data of previous tasks is not saved, while PTR-SIR is better at maintaining recognition performance but is less secure and efficient than GFR-SIR.
ZHANG Xiao-Yi, LI Xing, LIU Yang, ZHENG Zheng, SUN Chang-Ai
2025, 36(7):3109-3133. DOI: 10.13328/j.cnki.jos.007340
Abstract:Path planning algorithms for intelligent agents plan the trajectory of an agent so that it can safely and efficiently reach the target point from the starting point without colliding with obstacles. Path planning algorithms are now widely applied in various critical cyber-physical systems, so they must be tested before being put into use to evaluate whether their performance meets the requirements. However, the distribution patterns of threat obstacles in the task space, which constitute the inputs of a path planning algorithm, are complex and diverse, and planning a path for each test case usually incurs a relatively high operational cost. To improve testing efficiency, this study adapts the concept of dynamic random testing to path planning algorithms and proposes the dynamic random testing approach for intelligent agent path planning algorithms (DRT-PP). Specifically, DRT-PP discretizes the path planning task space into sub-regions and introduces a threat generation probability within each sub-region, thus constructing a test profile that serves as the strategy for test case generation. Furthermore, DRT-PP dynamically adjusts the test profile during the testing process so that it is gradually optimized, thereby improving testing efficiency. Experimental results show that, compared with random testing and adaptive random testing, DRT-PP not only maintains test case diversity but also generates more test cases that expose the performance defects of the algorithm under test.
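A minimal sketch of the dynamic random testing loop described above: the task space is split into sub-regions, each with a threat (obstacle) generation probability, and the profile is nudged after every test. The 4×4 grid, the adjustment rule, and the run_planner oracle are placeholders, not DRT-PP's actual parameters.

```python
import random

REGIONS = [(r, c) for r in range(4) for c in range(4)]   # 4x4 sub-regions
profile = {reg: 0.25 for reg in REGIONS}                  # obstacle prob. per region

def generate_test_case():
    """Place an obstacle in each region independently with its profile probability."""
    return {reg for reg in REGIONS if random.random() < profile[reg]}

def run_planner(obstacles):
    """Placeholder oracle: True if the planned path meets the requirements."""
    return random.random() > 0.1 * len(obstacles) / len(REGIONS)

def adjust(obstacles, passed, eps=0.02):
    """Placeholder profile update: reward regions that exposed a defect."""
    for reg in obstacles:
        p = profile[reg] + (eps if not passed else -eps / 4)
        profile[reg] = min(0.95, max(0.05, p))

for _ in range(200):
    tc = generate_test_case()
    adjust(tc, run_planner(tc))
```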
QIU Shao-Jian, CHENG Jia-Hao, HUANG Meng-Yang, HUANG Qiong
2025, 36(7):3134-3150. DOI: 10.13328/j.cnki.jos.007341
Abstract:Vulnerability detection is a critical technology for software system security. In recent years, deep learning has achieved remarkable progress in vulnerability detection thanks to its outstanding ability to extract code features. However, existing deep learning-based methods concentrate only on the independent structural features of individual code instances, overlooking the structural feature similarities and correlations among different vulnerable code samples, which limits detection performance. To address this issue, this study proposes a vulnerability detection method based on the correlation of structural features between functions (CSFF-VD). The method first parses functions into code property graphs and extracts the independent structural features within each function using gated graph neural networks. On this basis, it builds an association network among functions based on feature similarity and uses a graph attention network to further extract correlation information between functions, thus improving vulnerability detection performance. Experimental results show that CSFF-VD surpasses current deep learning-based vulnerability detection methods on three public vulnerability detection datasets. In addition, further experiments on the inter-function correlation feature extraction component of CSFF-VD confirm the effectiveness of integrating correlation information between functions on top of the independent intra-function features.
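The sketch below shows one plausible way to build the inter-function association network from per-function feature vectors via thresholded cosine similarity; the feature vectors are random stand-ins for GGNN outputs, and the threshold is an assumed parameter rather than the paper's setting.

```python
import numpy as np

# Build a function-association adjacency matrix from per-function features
# (random placeholders for GGNN outputs); a GAT would then run on this graph.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 32))          # 6 functions, 32-d structural features

def cosine_adjacency(x, threshold=0.2):
    normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = normed @ normed.T                  # pairwise cosine similarity
    adj = (sim >= threshold).astype(float)   # keep sufficiently similar pairs
    np.fill_diagonal(adj, 0.0)               # no self loops
    return adj

print(cosine_adjacency(features))
```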
ZHANG Xiao, QIN Chun-Ling, WANG Wen-Shou, LIU Hao, CHEN Jin-Chuan, DU Xiao-Yong
2025, 36(7):3151-3183. DOI: 10.13328/j.cnki.jos.007366
Abstract:In recent years, blockchain technology has undergone rapid development and found widespread application in areas such as data circulation, finance, logistics, government affairs, and judiciary. Alongside this growth, various benchmarks have emerged to evaluate the performance of different blockchain systems. However, these benchmarks exhibit significant differences in scope and methodology, lacking a unified framework to standardize their design and evaluation processes. In addition, there is no consistent metric system to explicitly define the performance and security capabilities required of blockchain systems. Drawing from the development history of databases, a unified, repeatable, and fair benchmarking standard can provide valuable guidance for industry advancement. As a specialized form of distributed database management system, blockchain can benefit from the extensive experience gained through the evolution of database technology. Inspired by database benchmarks and considering the unique characteristics of blockchain systems such as decentralization, immutability, and trustworthiness, a blockchain benchmarking reference framework, UFBCB, has been proposed. This framework identifies five core elements essential to blockchain benchmarks: application model, data model, workload, metrics, and execution rules, and clarifies the interrelationship among these elements. It provides a unified standard for evaluating blockchain performance. Moreover, a comprehensive metric system encompassing performance, energy efficiency, scalability, and security is introduced, aiming to capture the critical capabilities of blockchain systems comprehensively. Based on this framework, a detailed comparative analysis of existing blockchain benchmarks is conducted, highlighting common issues and deficiencies. Finally, this study discusses potential directions for the future development of blockchain benchmarking.
QIN Zheng, XU Li-Jie, CHEN Wei, WANG Yi, WU Ming-Chao, ZENG Hong-Bin, WANG Wei
2025, 36(7):3184-3208. DOI: 10.13328/j.cnki.jos.007235
Abstract:With the advent of the big data era, massive volumes of user data have empowered numerous data-driven industry applications, such as smart grids, intelligent transportation, and product recommendation. In scenarios where real-time data is crucial, the business value embedded in data diminishes rapidly over time, so data analysis systems require high throughput and low latency. Big data stream processing systems, exemplified by Apache Flink, have been widely applied. Flink increases system throughput by parallelizing computing tasks across cluster nodes, but recent research indicates that it suffers from weak single-node performance and poor cluster scalability. To improve the throughput of stream processing systems, researchers have focused on optimizing the design of control planes, the implementation of system operators, and vertical scalability, while paying little attention to the data flow in streaming analysis applications. Such applications are driven by event streams and employ stateful processing functions, for example low-voltage detection in smart grids and advertising recommendation. This study analyzes the data flow characteristics of typical streaming analysis applications, identifies three bottlenecks that limit scalability, and proposes corresponding optimization strategies: a key-level watermark strategy, a dynamic load distribution strategy, and a key-value based exchange strategy. Based on these strategies, this study implements Trilink on top of Flink and applies it to applications such as low-voltage detection, bridge arch crown monitoring, and the Yahoo Streaming Benchmark. Experimental results show that, compared with Flink, Trilink achieves more than a 5-fold increase in throughput in a single-machine environment and over a 1.6-fold improvement in horizontal scaling speedup in an 8-node setup.
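A minimal sketch of the key-level watermark idea, assuming per-key event-time tracking so that a skewed or idle key does not hold back window firing for other keys; the window size, allowed lateness, and aggregation below are illustrative and do not reflect Trilink's implementation.

```python
from collections import defaultdict

ALLOWED_LATENESS = 5      # time units (illustrative)
WINDOW = 10

watermark = defaultdict(lambda: float("-inf"))
windows = defaultdict(lambda: defaultdict(list))   # key -> window_start -> events

def on_event(key, event_time, value):
    # Each key advances its own watermark from the event times it has seen.
    watermark[key] = max(watermark[key], event_time - ALLOWED_LATENESS)
    windows[key][(event_time // WINDOW) * WINDOW].append(value)
    # Fire every window of this key that its key-level watermark has passed.
    for start in sorted(w for w in windows[key] if w + WINDOW <= watermark[key]):
        agg = sum(windows[key].pop(start))
        print(f"key={key} window=[{start},{start + WINDOW}) sum={agg}")

for t, (k, v) in enumerate([("A", 1), ("A", 2), ("B", 5), ("A", 3), ("A", 4)]):
    on_event(k, t * 6, v)
```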
YANG Hong-Yu, WANG You-Wei, ZHANG Liang, HU Ze, JIANG Lai-Wei, CHENG Xiang
2025, 36(7):3209-3225. DOI: 10.13328/j.cnki.jos.007230
Abstract:As both Android frameworks and malware continue to evolve, the performance of existing malware classifiers degrades significantly over time. This study proposes droid slow aging (DroidSA), an Android malware detection method based on API clustering and call graph optimization. First, API clustering is performed before malware detection to generate cluster centers that reflect API functionality. To make the clustering results more accurate, the study designs API sentences that summarize vital features such as API names and permissions and uses NLP tools to mine the semantic information of these sentences, obtaining embeddings that fully reflect the semantic similarity of APIs. Then, call graphs are extracted from apps and optimized by removing unknown methods while preserving the connectivity among API nodes; this optimization enables detection methods to extract more robust contextual information about APIs, reflecting app behavior patterns. DroidSA extracts function call pairs from the optimized call graphs and abstracts the APIs in these pairs into the cluster centers obtained from API clustering to better adapt to changes in Android frameworks and malware. Finally, one-hot encoding is used to generate feature vectors, and the best-performing classifier among random forests, support vector machines, and the k-nearest neighbors algorithm is selected for malware detection. Experimental results demonstrate that DroidSA achieves an average F1-measure of 96.7% for malware detection. Under an experimental setup that eliminates temporal bias, DroidSA trained with apps from 2012 to 2013 achieves an average F1-measure of 82.6% when detecting malware developed from 2014 to 2018. Compared with the state-of-the-art detection methods MaMaDroid and MalScan, DroidSA stably maintains high detection metrics with minimal impact from temporal changes and effectively detects evolved malware.
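The sketch below illustrates the pair-abstraction and encoding steps: (caller, callee) API pairs are mapped to cluster identifiers and one-hot encoded over the abstract pair space. The cluster assignments here are toy placeholders; DroidSA derives them from API-sentence embeddings.

```python
from itertools import product

# Toy cluster assignments; DroidSA obtains these from API-sentence embeddings.
api_cluster = {
    "android.telephony.SmsManager.sendTextMessage": "C_sms",
    "android.telephony.TelephonyManager.getDeviceId": "C_id",
    "java.net.URL.openConnection": "C_net",
}
clusters = sorted(set(api_cluster.values()))
feature_space = list(product(clusters, clusters))        # all abstract pairs

def encode(app_call_pairs):
    """One-hot vector marking which abstract (cluster, cluster) pairs an app contains."""
    present = {(api_cluster[a], api_cluster[b]) for a, b in app_call_pairs
               if a in api_cluster and b in api_cluster}
    return [1 if pair in present else 0 for pair in feature_space]

app = [("android.telephony.TelephonyManager.getDeviceId",
        "java.net.URL.openConnection")]
print(encode(app))
```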
ZHAO Xiang-Fu, HUANG Sen, WEI Xia, TONG Xiang-Rong, OUYANG Dan-Tong, ZHANG Li-Ming
2025, 36(7):3226-3238. DOI: 10.13328/j.cnki.jos.007227
Abstract:In model-based diagnosis, all minimal hitting sets (MHSs) of the minimal conflict sets (MCSs) are the candidate diagnoses of the device to be diagnosed, so computing MHSs is a key step in generating candidate diagnoses. MHS computation is a classic NP-hard constraint solving problem whose difficulty grows exponentially with problem size. The Boolean algorithm is a typical method for computing MHSs; however, during solving, most of its runtime is spent minimizing intermediate solution sets. This study proposes the BWSS (Boolean with suspicious sets) algorithm, which computes MHSs with the help of suspicious set clusters. By analyzing the spanning tree rule of the Boolean algorithm in depth, the sets that cause a candidate solution to become a superset are identified. When extending elements to the root node, a candidate solution is minimal if it has at least one empty intersection with the suspicious set cluster; otherwise, the solution is removed. A recursive strategy ensures that all and only MHSs are generated when the algorithm terminates. In addition, at least m (m≥1) elements of each candidate solution, and in some cases the entire solution, require no complex minimization. Theoretically, the BWSS algorithm has far lower complexity than the Boolean algorithm. Experimental results on random data and massive reference circuit data show that, compared with many other state-of-the-art methods, the proposed algorithm reduces runtime by several orders of magnitude.
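For reference, the brute-force routine below defines the object BWSS computes, all minimal hitting sets of a family of minimal conflict sets, on a toy instance; it is deliberately exponential and is not the BWSS algorithm.

```python
from itertools import chain, combinations

def minimal_hitting_sets(conflict_sets):
    """Enumerate all minimal hitting sets by brute force (reference only)."""
    universe = sorted(set(chain.from_iterable(conflict_sets)))
    hits = lambda h: all(h & c for c in conflict_sets)   # h intersects every conflict
    candidates = [frozenset(c) for r in range(1, len(universe) + 1)
                  for c in combinations(universe, r)
                  if hits(set(c))]
    # Keep only hitting sets with no strictly smaller hitting subset.
    return [set(h) for h in candidates
            if not any(o < h for o in candidates)]

mcs = [{"M1", "A1"}, {"M2", "A1"}, {"M1", "M3"}]
print(minimal_hitting_sets(mcs))   # e.g. [{'M1', 'A1'}, {'M1', 'M2'}, {'A1', 'M3'}]
```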
HUANG Jing, TAO Zhu-Lin, DU Xiao-Yu, XIANG Xin-Guang
2025, 36(7):3239-3252. DOI: 10.13328/j.cnki.jos.007229
Abstract:Multi-label text classification aims to assign several predefined labels or categories to a text. To fully explore the correlations among labels, current methods typically build a label relation graph and integrate it with graph neural networks to obtain label feature representations. However, such methods often rely too heavily on the initially constructed graph and overlook the label correlations inherent in the current text; as a result, classification depends heavily on dataset statistics and may miss label-related information within the text itself. Therefore, this study proposes a multi-label text classification algorithm based on feature-fused dynamic graph networks. It designs dynamic graphs to model label correlations within the current text and integrates feature fusion with graph neural networks to form label representations conditioned on the current text, thus achieving more accurate multi-label text classification. Experimental results on three datasets demonstrate the effectiveness and feasibility of the proposed model, which shows excellent performance in multi-label text classification.
CAO Yi, GUO Mao-Zu, WU Wei-Ning
2025, 36(7):3253-3270. DOI: 10.13328/j.cnki.jos.007233
Abstract:Domain adaptation (DA) refers to machine learning tasks in which the training set (source domain) and the test set (target domain) exhibit different distributions. Its key question is how to overcome the negative impact caused by these distributional differences, in other words, how to design an effective training strategy that minimizes the difference between data domains so as to obtain a classifier with high generalization performance. This study focuses on unsupervised DA (UDA), where annotations are available in the source domain but absent in the target domain. The problem can be viewed as using partially annotated data and unannotated data to train a classifier within a semi-supervised learning framework. Accordingly, two kinds of semi-supervised learning techniques, namely pseudo labels (PLs) and consistency regularization (CR), are used to augment and annotate the observed data for training the classifier, so that the classifier generalizes better in UDA tasks. The proposed augmentation-based UDA (A-UDA) augments the unannotated data in the target domain by random augmentation and annotates high-confidence data with pseudo-labels based on the model's predicted outputs; the classifier is then trained on the augmented dataset. The distribution distance between the source domain and the target domain is measured by the maximum mean discrepancy (MMD), and minimizing this distance enables the classifier to achieve high generalization performance. The proposed method is evaluated on multiple UDA tasks, including MNIST-USPS, Office-Home, and ImageCLEF-DA, and achieves better performance than other existing methods.
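A minimal numpy sketch of the MMD term mentioned above, using a biased RBF-kernel estimator; the bandwidth and the Gaussian toy data are placeholders, and the pseudo-labeling and consistency-regularization parts of A-UDA are not shown.

```python
import numpy as np

def mmd2_rbf(x, y, sigma=1.0):
    """Biased empirical MMD^2 with an RBF kernel between two sample matrices."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # pairwise squared dists
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(100, 16))
target = rng.normal(0.5, 1.0, size=(100, 16))
print(mmd2_rbf(source, target))     # larger than mmd2_rbf(source, source) on average
```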
GAO Meng-Nan, CHEN Wei, WU Li-Fa, ZHANG Bo-Lei
2025, 36(7):3271-3305. DOI: 10.13328/j.cnki.jos.007364
Abstract:Deep learning models are integral components of artificial intelligence systems, widely deployed in various critical real-world scenarios. Research has shown that the low transparency and weak interpretability of deep learning models render them highly sensitive to perturbations. Consequently, artificial intelligence systems are exposed to multiple security threats, with backdoor attacks on deep learning models representing a significant concern. This study provides a comprehensive overview of the research progress on backdoor attacks and defenses in mainstream deep learning systems, including computer vision and natural language processing. Backdoor attacks are categorized based on the attacker’s capabilities into full-process controllable backdoors, model modification backdoors, and data poisoning backdoors, which are further classified according to the backdoor construction methods. Defense strategies are divided into input-based defenses and model-based defenses, depending on the target of the defensive measures. This study also summarizes commonly used datasets and evaluation metrics in this domain. Lastly, existing challenges in backdoor attack and defense research are discussed, alongside recommendations and future directions focusing on security application scenarios of backdoor attacks and the efficacy of defense mechanisms.
YIN Xin-Chun, WANG Jing-Wei, NING Jian-Ting
2025, 36(7):3306-3320. DOI: 10.13328/j.cnki.jos.007216
Abstract:Cloud storage auditing guarantees the security of data stored in the cloud, enabling data owners to easily verify data integrity. However, the vast amount of data in the cloud can lead to significant computational overhead during cloud storage auditing when verifying data integrity and modifying data ownership. To solve this problem and provide a practical solution, this study proposes a dynamic auditing scheme for cloud storage with efficient data ownership sharing. An efficient validation structure is constructed to aggregate information for data verification, avoiding a large number of computationally expensive bilinear pairing operations. An efficient data ownership sharing mechanism is designed based on the chameleon hash function's ability to generate new collisions: it allows the secret key of the corresponding user to be updated for shared data ownership without modifying the ciphertexts stored in the cloud. In addition, the proposed scheme achieves fine-grained data sharing, encrypted data auditing, and dynamic data modification. Security and performance analyses show that the proposed scheme ensures the security of cloud data without degrading performance, making it a practical scheme.
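To illustrate the collision property the scheme relies on, the sketch below implements a textbook discrete-log chameleon hash (Krawczyk-Rabin style) with toy-sized parameters; the paper's actual construction, parameter sizes, and key-update protocol may differ.

```python
import random

# Toy parameters: p = 2q + 1 with p, q prime; g generates the order-q subgroup.
p, q, g = 227, 113, 4
x = random.randrange(1, q)       # trapdoor (secret key)
y = pow(g, x, p)                 # public key

def ch(m, r):
    """Chameleon hash CH(m, r) = g^m * y^r mod p."""
    return (pow(g, m % q, p) * pow(y, r % q, p)) % p

def collide(m_old, r_old, m_new):
    """With the trapdoor x, find r_new such that ch(m_new, r_new) == ch(m_old, r_old)."""
    return (r_old + (m_old - m_new) * pow(x, -1, q)) % q

m1, r1 = 42, random.randrange(q)
m2 = 7
r2 = collide(m1, r1, m2)
assert ch(m1, r1) == ch(m2, r2)
print("collision found:", (m1, r1), (m2, r2), ch(m1, r1))
```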
XU Yin-Song, LUO Yi-Yuan, DONG Xiao-Yang, YUAN Zheng
2025, 36(7):3321-3338. DOI: 10.13328/j.cnki.jos.007218
Abstract:In the Q1 model, this paper proposes a low-data quantum key-recovery attack against Lai-Massey structures, Misty structures, Type-1 generalized Feistel structures, SMS4-like generalized Feistel structures, and MARS-like generalized Feistel structures. The attack needs only a constant number of chosen plaintext-ciphertext pairs; it analyzes the encryption process of the block cipher structure and recovers the key by searching for some intermediate states and round keys with Grover's algorithm. The attack belongs to the Q1 model, which is more practical than the Q2 model since no quantum superposition queries are required. For the 3-round Lai-Massey structure, compared with other quantum attacks, this attack requires only $ {\rm O}(1) $ data, belongs to the Q1 model, and even reduces the complexity product (time×data×classical memory×quantum bits) by a factor of $ n{2^{n/4}} $. For the 6-round Misty structure, the attack retains the advantage of low data complexity, and for the 6-round Misty L/R-FK structure in particular, it reduces the complexity product by a factor of $ {2^{n/2}} $. For the 9-round 3-branch Type-1 generalized Feistel structure, the attack matches other quantum attacks on the complexity product while retaining low data complexity and being a chosen-plaintext attack. In addition, low-data quantum key-recovery attacks on SMS4-like and MARS-like generalized Feistel structures are also given, complementing their security evaluation in the Q1 model.
WANG Han-Cheng, CHEN Zhi-Peng, DAI Hai-Peng, GU Rong, KIM Chaewon, CHEN Gui-Hai
2025, 36(7):3339-3357. DOI: 10.13328/j.cnki.jos.007214
Abstract:The cuckoo filter is an efficient probabilistic data structure that can quickly determine whether an element exists in a given set. Cuckoo filters are widely used in computer networks, IoT applications, and database systems, which in practice handle massive amounts of data and numerous concurrent requests. A cuckoo filter that supports high concurrency can significantly improve system throughput and data processing capability, which is crucial to system performance. Therefore, this study designs a cuckoo filter that supports lock-free concurrency. The filter achieves high-performance lookup, insertion, and deletion through a two-stage query, the separation of path exploration from element migration, and atomic migration based on multi-word compare-and-swap. Theoretical analysis and experimental results indicate that the lock-free concurrent cuckoo filter significantly improves concurrent performance over current state-of-the-art algorithms; its lookup throughput is on average 1.94 times that of a cuckoo filter using fine-grained locks.
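For context, the sketch below is a minimal single-threaded cuckoo filter (partial-key cuckoo hashing) showing the fingerprint/bucket layout and kick-out relocation that a lock-free design builds on; the two-stage query and MWCAS-based atomic migration themselves are not reproduced.

```python
import random

NUM_BUCKETS, BUCKET_SIZE, MAX_KICKS = 1 << 10, 4, 500   # power-of-two bucket count
buckets = [[] for _ in range(NUM_BUCKETS)]

def fingerprint(x):
    return (hash(("fp", x)) & 0xFF) or 1                # 8-bit fingerprint, never 0

def index_pair(x, fp):
    i1 = hash(("idx", x)) % NUM_BUCKETS
    i2 = (i1 ^ hash(("idx", fp))) % NUM_BUCKETS         # partial-key alternate bucket
    return i1, i2

def insert(x):
    fp = fingerprint(x)
    i1, i2 = index_pair(x, fp)
    for i in (i1, i2):
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):                          # evict a victim and relocate it
        victim = buckets[i].pop(random.randrange(len(buckets[i])))
        buckets[i].append(fp)
        fp, i = victim, (i ^ hash(("idx", victim))) % NUM_BUCKETS
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    return False                                        # table considered full

def lookup(x):
    fp = fingerprint(x)
    i1, i2 = index_pair(x, fp)
    return fp in buckets[i1] or fp in buckets[i2]

insert("10.0.0.1")
print(lookup("10.0.0.1"), lookup("10.0.0.2"))
```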
LI Meng, LUO Wen-Qi, DAI Hai-Peng, WANG Han-Cheng, GU Rong, CHEN Gui-Hai
2025, 36(7):3358-3374. DOI: 10.13328/j.cnki.jos.007221
Abstract:A cuckoo filter is a space-efficient approximate membership query data structure, widely used in network systems for applications such as network routing, network measurement, and network caching. However, the traditional cuckoo filter design does not adequately consider the scenario, common in network systems, where some or all queries in the collection are known in advance and carry associated costs, so existing cuckoo filters perform suboptimally in such situations. To address this, the variable hashing-fingerprint cuckoo filter (VHCF) is developed. VHCF introduces variable fingerprint hashing, which takes the known query collection and its associated costs into account: by searching for the optimal fingerprint hash function for each hash bucket, the overall cost of false positives is significantly reduced. In addition, this study proposes a single-hash technique to reduce the extra computational overhead introduced by variable hashing. A theoretical analysis of the operational complexity and false positive rate of VHCF is also provided. Finally, both experimental and theoretical results demonstrate that VHCF achieves a significantly lower false positive rate than existing cuckoo filters and their variants while ensuring comparable query throughput. Specifically, VHCF needs to allocate only 1–2 bits for each hash index unit, which can reduce the false positive rate to 12.5%–50% of the original.
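A small sketch of the per-bucket selection idea, assuming a family of four fingerprint hashes (selectable with 2 bits) and a known set of negative queries with costs; the hash family and cost model are placeholders rather than VHCF's actual design.

```python
def fp_hash(x, j, bits=8):
    """j-th fingerprint hash from a small family (placeholder construction)."""
    return (hash((j, x)) & ((1 << bits) - 1)) or 1

def choose_bucket_hash(stored_keys, negative_queries, family_size=4):
    """Pick the hash index minimizing the false-positive cost of known negatives
    that map to this bucket; the chosen index fits in 2 bits when family_size=4."""
    best_j, best_cost = 0, float("inf")
    for j in range(family_size):
        stored = {fp_hash(k, j) for k in stored_keys}
        cost = sum(c for q, c in negative_queries if fp_hash(q, j) in stored)
        if cost < best_cost:
            best_j, best_cost = j, cost
    return best_j, best_cost

keys = ["flow:10.0.0.1", "flow:10.0.0.7"]
negatives = [("flow:10.0.0.9", 5.0), ("flow:10.0.0.3", 1.0)]
print(choose_bucket_hash(keys, negatives))
```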
WU Hua, LUO Hao, ZHAO Shi-Shun, LIU Song-Tao, CHENG Guang, HU Xiao-Yan
2025, 36(7):3375-3404. DOI: 10.13328/j.cnki.jos.007236
Abstract:The rise of video platforms has led to the rapid dissemination of videos and their integration into many aspects of social life. Videos transmitted over the network may contain harmful content, so cyberspace security supervision urgently needs to accurately identify harmful videos that are transmitted in encrypted form. Existing methods collect traffic at main network access points, extract features of encrypted video traffic, and identify harmful videos by matching these features against harmful video databases. However, as encryption protocols for video transmission have evolved, HTTP/2, with its new multiplexing mechanism, has been widely adopted, so traditional traffic analysis methods based on HTTP/1.1 features fail to identify encrypted videos carried over HTTP/2. Moreover, current research mostly focuses on videos played at a fixed resolution, and few studies consider the impact of resolution switching on video identification. To address these problems, this study analyzes the factors that cause offsets in the length of audio/video data during HTTP/2 transmission and proposes a method that precisely reconstructs corrected fingerprints for encrypted videos by computing the size of the combined audio and video segments in the encrypted traffic. The study further proposes an encrypted video identification model based on the hidden Markov model and the Viterbi algorithm, which uses the corrected fingerprints of encrypted videos and a large plaintext video fingerprint database; the model applies dynamic programming to handle adaptive resolution switching. The proposed model achieves identification accuracy of 98.41% and 97.91% for encrypted videos with fixed and adaptive resolutions, respectively, on 400000-scale fingerprint databases built from Facebook and Instagram. The generality and generalizability of the proposed method are validated on three further video platforms: Triller, Twitter, and Mango TV. Comparisons with similar work in terms of recognition effectiveness, generalization, and time overhead further confirm the method's higher application value.
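The sketch below is a generic Viterbi decoder over log-probabilities, the dynamic-programming step such a model would use to align an observed segment-size sequence with fingerprint-derived states; all states and probabilities are toy values, not the paper's model.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations (log-space DP)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({}); back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda kv: kv[1])
            V[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):   # follow back-pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["videoA_seg1", "videoA_seg2"]
obs = ["small", "large", "large"]
start = {"videoA_seg1": 0.9, "videoA_seg2": 0.1}
trans = {"videoA_seg1": {"videoA_seg1": 0.2, "videoA_seg2": 0.8},
         "videoA_seg2": {"videoA_seg1": 0.3, "videoA_seg2": 0.7}}
emit = {"videoA_seg1": {"small": 0.8, "large": 0.2},
        "videoA_seg2": {"small": 0.1, "large": 0.9}}
print(viterbi(obs, states, start, trans, emit))
```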
ZHANG Peng-Fei, ZHU Yi-Bo, CHENG Xiang, ZHANG Zhi-Kun, LIU Xi-Meng, SUN Li, FANG Xian-Jin, ZHANG Ji
2025, 36(7):3405-3428. DOI: 10.13328/j.cnki.jos.007287
Abstract:To properly aggregate the sensed data of varying quality uploaded by workers in mobile crowdsensing, truth discovery has emerged as the cornerstone technology for providing precise data support to subsequent applications. Existing studies tend to adopt local differential privacy to protect against potential privacy breaches, but they often ignore the influence of outliers in the sensed data on truth discovery accuracy under local differential privacy. Such outliers often have a large value range, which results in a large amount of noise being injected into the data. In addition, because workers are concerned about privacy breaches, mobile crowdsensing servers cannot preprocess the data without privacy protection. To this end, this study proposes NATURE, which satisfies local differential privacy based on adaptive pruning. The core idea of the algorithm is to consider the types of noise in the data and adaptively prune either all of a worker's values or the values of certain tasks. In NATURE, the noise-aware weight and importance estimation (NWIE) method, formulated as a constrained optimization problem, is designed to guide data pruning. After proving that the optimal pruning problem is NP-hard, the study designs the utility-aware adaptive pruning (UAP) method, which performs pruning in polynomial time. Furthermore, a theoretical analysis of NATURE's privacy, utility, and complexity is carried out. Experimental results on two real-world datasets and one synthetic dataset demonstrate that NATURE improves the accuracy of the obtained “truth” by at least 20% compared with the baseline algorithms.
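As background, the sketch below runs a CRH-style weighted truth-discovery loop on numeric tasks, with a Laplace perturbation standing in for a generic LDP mechanism; NATURE's noise-aware weighting (NWIE) and adaptive pruning (UAP) are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
truth = np.array([20.0, 55.0, 3.0])                      # ground truth of 3 tasks
reports = truth + rng.normal(0, [1, 2, 5], size=(10, 3)) # 10 workers, varying quality

def perturb(values, epsilon, value_range):
    """Generic Laplace perturbation (stand-in for the LDP mechanism)."""
    return values + rng.laplace(scale=value_range / epsilon, size=values.shape)

def truth_discovery(obs, iters=20):
    """CRH-style iteration: weight workers by error, re-estimate truths."""
    est = obs.mean(axis=0)
    for _ in range(iters):
        err = ((obs - est) ** 2).sum(axis=1)             # per-worker total error
        w = -np.log(err / err.sum() + 1e-12)             # lower error -> higher weight
        est = (w[:, None] * obs).sum(axis=0) / w.sum()   # weighted truth update
    return est

noisy = perturb(reports, epsilon=2.0, value_range=10.0)
print("truth:   ", truth)
print("estimate:", np.round(truth_discovery(noisy), 2))
```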