    2024,35(7):3071-3092, DOI: 10.13328/j.cnki.jos.007110
    Abstract:
    Repairing software defects is an unavoidable and important problem in software engineering. Automated program repair (APR) techniques aim to alleviate it by repairing defective programs automatically, accurately, and efficiently. In recent years, with the rapid development of deep learning, a family of methods known as neural program repair (NPR) has emerged, which uses deep neural networks to automatically capture the relationship between defective programs and their patches. In terms of the number of defects correctly repaired on benchmarks, NPR tools have significantly outperformed non-deep-learning APR tools. However, a recent study found that the performance improvement of NPR systems may be due to the presence of test data in the training data, i.e., data leakage. Inspired by this, to further investigate the causes and effects of data leakage in NPR systems and to evaluate existing systems more fairly, this study: (1) systematically categorizes and summarizes the existing NPR systems, defines data leakage for NPR systems based on this classification, and designs a data leakage detection method for each category of system; (2) conducts large-scale testing of existing models with the detection methods from the previous step and investigates the effect of data leakage on model realism and evaluation performance, as well as its impact on the model itself; (3) analyzes the collection and filtering strategies of existing NPR datasets, improves and supplements them, constructs a clean large-scale NPR training dataset from existing popular datasets using the improved strategy, and verifies the effectiveness of this dataset in preventing data leakage.
From the experimental results, all ten NPR systems studied had data leakage on the evaluation dataset. Among them, RewardRepair had the most serious problem, with 24 data leaks on the Defects4J (v1.2.0) benchmark and a leakage ratio as high as 53.33%. In addition, data leakage affects the robustness of NPR systems: all five NPR systems investigated for robustness showed reduced robustness due to data leakage. Data leakage is therefore a very common problem that can lead to unfair performance evaluation of NPR systems and affect their robustness on the benchmark. When training NPR models, researchers should avoid data leakage as much as possible and account for its impact when evaluating NPR system performance, so that the systems are evaluated as fairly as possible.
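The idea of a leakage check can be illustrated with a minimal sketch. It assumes that a leak means an evaluated (bug, fix) pair also appears in the training data after whitespace normalization, and that the leakage ratio is the leaked share of the evaluated samples; both are illustrative assumptions, since the abstract does not give the exact definitions, and the names `normalize` and `leakage_ratio` are hypothetical. The reported figures (24 leaks, 53.33%) would be consistent with a denominator of 45, which is an inference rather than something the abstract states.

```python
def normalize(code: str) -> str:
    """Collapse whitespace so formatting differences do not hide a duplicate."""
    return " ".join(code.split())


def leakage_ratio(train_pairs, evaluated_pairs):
    """Return (number of leaked samples, leaked share of the evaluated samples).

    A sample counts as leaked when its normalized (bug, fix) pair also
    occurs in the training data. This matching rule is an assumption.
    """
    seen = {(normalize(bug), normalize(fix)) for bug, fix in train_pairs}
    leaked = [
        (bug, fix)
        for bug, fix in evaluated_pairs
        if (normalize(bug), normalize(fix)) in seen
    ]
    return len(leaked), len(leaked) / len(evaluated_pairs)


# Toy data: the first evaluated pair differs from a training pair only in spacing.
train = [("if (a > b)", "if (a >= b)"), ("return x;", "return x + 1;")]
evaluated = [("if (a  >  b)", "if (a >= b)"), ("i++;", "i += 2;")]
count, ratio = leakage_ratio(train, evaluated)
print(count, ratio)
```

Exact matching after normalization is the simplest possible rule; a real detector would likely need a rule per system category, as the abstract indicates.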
    2024,35(7):3093-3114, DOI: 10.13328/j.cnki.jos.007113
    Abstract:
    Mutation testing is an effective software testing technique. It helps improve the defect detection capability of an existing test suite by generating mutants that simulate software defects, and the quality of the mutants has a significant impact on its effectiveness. The traditional approach typically generates mutants with manually designed, syntactic rule-based mutation operators and has achieved some academic success. In recent years, many studies have started to use deep learning to generate mutants by learning from historical code in open-source projects, and this new approach has made preliminary progress. The two kinds of techniques, rule-based and learning-based, rely on different mechanisms but share the goal of improving the defect detection capability of the test suite through mutation, so a comprehensive comparison of them is crucial for mutation testing and its downstream tasks. To this end, this study designs and carries out an empirical study of rule-based and learning-based mutation techniques, aiming to understand how techniques with different mechanisms perform on the mutation testing task and how the mutants they generate differ in program semantics. Specifically, this study uses the Defects4J v1.2.0 dataset to compare the syntactic rule-based mutation techniques represented by MAJOR and PIT with the deep learning-based mutation techniques represented by DeepMutation, μBERT, and LEAM. The experimental results show that both rule-based and learning-based mutation techniques can effectively support mutation testing practice, but MAJOR has the best testing performance and is able to detect 85.4% of real defects. In terms of semantic representation, MAJOR has the strongest capability: its constructed test suite kills more than 95% of the mutants generated by the other mutation techniques. In terms of defect representation, both types of techniques are unique.
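The kill-rate figures above rest on the standard notion of a mutation score. The following minimal sketch shows that computation, assuming a mutant counts as killed when at least one test fails on it; the data and the name `mutation_score` are hypothetical and not taken from the study.

```python
def mutation_score(kill_matrix):
    """Share of mutants killed by the test suite.

    kill_matrix maps each mutant id to the set of tests that fail on it;
    a mutant is killed when that set is non-empty.
    """
    killed = sum(1 for failing_tests in kill_matrix.values() if failing_tests)
    return killed / len(kill_matrix)


# Toy kill matrix: m2 survives because no test fails on it.
kills = {"m1": {"t1"}, "m2": set(), "m3": {"t1", "t2"}, "m4": {"t3"}}
score = mutation_score(kills)
print(score)
```

The cross-technique comparison in the abstract can be read the same way: build the test suite against one tool's mutants, then compute this score over the mutants of another tool.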
    2024,35(7):3115-3140, DOI: 10.13328/j.cnki.jos.007104
    Abstract:
    Due to the large number of complex service dependencies and componentized modules in microservice systems, a failure in one service often causes one or more related services to fail, making it increasingly difficult to locate the cause of the failure. How to detect system faults effectively and locate their root causes quickly and accurately is therefore a focus of current microservice research. Existing work generally builds a failure relationship model by analyzing the relationships among failures, services, and metrics, but it suffers from insufficient use of operation and maintenance data, incomplete modeling of fault information, and coarse-grained root cause localization. This study therefore proposes AmazeMap, which comprises a multi-level fault impact graph modeling method and a microservice fault localization method based on the fault impact graph. Specifically, the modeling method comprehensively models fault information by mining the temporal metric data and trace data collected while the system is running and by considering the interrelationships between different levels. The fault localization method narrows the scope of fault impact, searches for the root cause among service instances and metrics, and finally outputs the most probable root cause of the fault together with a metrics sequence. Based on an open-source benchmark microservice system and the AIOps contest dataset, this study designs experiments to validate AmazeMap and compares it with existing methods. The results confirm AmazeMap’s effectiveness, accuracy, and efficiency.
    2024,35(7):3141-3161, DOI: 10.13328/j.cnki.jos.007105
    Abstract:
    As a widely used automated software testing technique, fuzz testing (fuzzing) aims to explore as many code regions of the program under test as possible, thereby achieving higher coverage and detecting more bugs. Most existing fuzzing methods schedule seeds based on their historical mutation data. This is simple to implement but ignores the distribution of the program space explored by the seeds, so testing may get stuck probing a single region of the program and waste testing resources. This study proposes Cluzz, a fuzzing approach whose seed scheduling is driven by cluster analysis. First, Cluzz analyzes the differences between seeds in the feature space based on the distribution of their execution path coverage, and uses cluster analysis to partition the seeds according to where they execute in the program space. Cluzz then evaluates the seeds according to the path coverage patterns of the different seed clusters and the clustering results, prioritizing seeds that explore rare code regions and have higher evaluation scores. Next, energy is allocated to the seeds according to their evaluation scores, and the interesting inputs obtained from mutation are retained and assigned to clusters to update the seed cluster information. Cluzz re-evaluates the seeds based on the updated clusters to keep the seeds effective during testing, thereby exploring more unknown code regions in limited time and improving the coverage of the program under test. Finally, Cluzz is implemented on three mainstream fuzzers and extensively tested on eight popular real-world programs. The results show that Cluzz detects an average of 1.7 times more unique crashes than a regular fuzzer and outperforms a benchmark fuzzer by an average of 22.15% in the number of new edges found. In addition, compared with existing seed scheduling methods, Cluzz achieves better overall performance than the other benchmark fuzzers.
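The scheduling idea can be sketched as follows. This is not Cluzz's algorithm: it assumes a greedy clustering on the Jaccard similarity of covered edges and an energy share inversely proportional to cluster size, both chosen only for illustration, and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two edge sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_seeds(coverage, threshold=0.5):
    """Greedy clustering: a seed joins the first cluster whose first member
    (its representative) is similar enough, else it starts a new cluster."""
    clusters = []
    for seed, edges in coverage.items():
        for members in clusters:
            if jaccard(edges, coverage[members[0]]) >= threshold:
                members.append(seed)
                break
        else:
            clusters.append([seed])
    return clusters


def assign_energy(clusters, budget=100):
    """Give seeds in smaller (rarer) clusters a larger share of the budget."""
    scores = {s: 1.0 / len(members) for members in clusters for s in members}
    total = sum(scores.values())
    return {s: round(budget * score / total) for s, score in scores.items()}


# Toy coverage map: seed name -> set of covered edge ids.
coverage = {
    "s1": {1, 2, 3},
    "s2": {1, 2, 4},
    "s3": {1, 2, 3, 4},
    "s4": {7, 8, 9},
}
clusters = cluster_seeds(coverage)
energy = assign_energy(clusters)
print(clusters, energy)
```

Here `s4` is alone in its cluster, so it receives the largest share of mutation energy, which mirrors the goal of steering the fuzzer toward rarely explored regions.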
    2024,35(7):3162-3179, DOI: 10.13328/j.cnki.jos.007106
    Abstract:
    GUI fuzzing plays a crucial role in enhancing the reliability and compatibility of mobile apps. However, most existing GUI fuzzing methods are inefficient, mainly because they are coarse-grained, relying solely on single-modal features to understand the GUI pages holistically. The excessive abstraction of app states leads to the neglect of many details, resulting in an insufficient understanding of GUI states and widgets. To address this issue, a GUI fuzzing framework called GUIFuzzer for mobile apps is proposed based on multi-modal representation. This framework leverages multi-modal features, such as visual features, layout context features, and fine-grained meta-attribute features, to jointly infer the semantics of GUI widgets. Then, it trains a multi-level reward-driven deep reinforcement learning model to optimize the GUI event selection strategy, thus improving the efficiency of fuzz testing. The proposed framework is evaluated on a large number of real apps. Experimental results show that GUIFuzzer significantly improves the coverage of fuzz testing compared with existing competitive baselines. A case study is also conducted on customized search for specific targets, namely sensitive API triggering, which further demonstrates the practicality of the GUIFuzzer framework.
    2024,35(7):3180-3203, DOI: 10.13328/j.cnki.jos.007107
    Abstract:
    The mobile application (APP) software market is evolving at an accelerating pace. Effective analysis of software defects helps developers understand and repair them in time. However, existing research analyzes too few data sources, which leads to isolated and fragmented information of poor quality. In addition, because data verification and version mismatch are not sufficiently considered, the analysis results contain errors, which results in ineffective software evolution. To provide more effective defect analysis results, an APP software defect tracking and analysis method oriented to version evolution (ASD-TAOVE) is proposed. First, the content of APP software defects is extracted from multi-source, heterogeneous APP software data, and the causal relationships between defect events are discovered. Then, a verification method for APP software defect content is designed: it combines information entropy with text features and structural features to compute a defect suspiciousness score, which is used to verify the content and to construct a heterogeneous graph of APP software defect content. To account for the impact of version evolution, an APP software defect tracking analysis method is designed to analyze how defects evolve across versions. The evolution relationships can be transformed into defect/evolutionary meta-paths that are useful for defect analysis. Finally, this study designs a deep learning-based heterogeneous information network to complete the APP software defect analysis. The experimental results for four research questions (RQs) confirm the effectiveness of ASD-TAOVE's defect content verification and tracking analysis in the process of version-oriented evolution, with defect identification accuracy increasing by about 9.9% and 5%, respectively (7.5% on average). Compared with baseline methods, ASD-TAOVE can analyze more APP software data and provide effective defect information.
    Available online:  July 17, 2024 , DOI: 10.13328/j.cnki.jos.007151
    Abstract:
    As merchant review websites develop rapidly, the efficiency gains brought by recommender systems have made rating prediction one of the emerging research tasks of recent years. Existing rating prediction methods are usually limited to collaborative filtering algorithms and various types of neural network models, and do not take full advantage of the rich semantic knowledge that current pre-trained models have already learned. To address this problem, this study proposes a personalized rating prediction method based on pre-trained language models. The method analyzes the historical reviews of users and merchants to provide users with rating predictions as a reference before consumption. It first designs a pre-training task through which the model learns to capture key information in the text. Next, the review text is processed by a fine-grained sentiment analysis method to obtain the aspect terms in the review text. Subsequently, the method designs an aspect term embedding layer to incorporate this external domain knowledge into the model. Finally, it uses an attention-based information fusion strategy to fuse the global and local semantic information of the input text. The experimental results show that the method achieves significant improvement in both automatic evaluation metrics compared to the benchmark models.
    Available online:  July 17, 2024 , DOI: 10.13328/j.cnki.jos.007153
    Abstract:
    As mobile data grows every day, predicting wireless traffic accurately is crucial for the efficient and sensible allocation of communication and network resources. However, most existing prediction methods use a centralized training architecture, which involves large-scale traffic data transmission and leads to security issues such as user privacy leakage. Federated learning can train a global model while the data stays stored locally, which protects users’ privacy and effectively reduces the burden of frequent data transmission. However, in wireless traffic prediction, the amount of data from a single base station is limited and traffic patterns vary among base stations, which makes the patterns difficult to capture and results in poor generalization of the global model. In addition, traditional federated learning methods aggregate models by simple averaging, ignoring the differences in client contributions, which further degrades the performance of the global model. To address these issues, this study proposes DualICA, an attention-based federated wireless traffic prediction model that follows an “intra-cluster average, inter-cluster attention” scheme. The model first clusters base stations based on their traffic data to better capture the traffic variation characteristics of base stations with similar traffic patterns. At the same time, a warm-up model is trained on a small amount of base station data to alleviate data heterogeneity and improve the generalization ability of the global model. The study introduces the attention mechanism in the aggregation stage to quantify the contributions of different clusters to the global model, and incorporates the warm-up model in the model iteration process to improve prediction accuracy. Extensive experiments are conducted on two real-world datasets (Milano and Trento), and the results show that DualICA outperforms all baseline methods. The mean absolute error performance gain over the state-of-the-art method is up to 10.1% and 9.6% on the two datasets, respectively.
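A toy version of "intra-cluster average, inter-cluster attention" aggregation is sketched below. The abstract does not specify the attention function, so this sketch assumes a softmax over the negative Euclidean distance between each cluster model and the current global model; that choice, the flat parameter lists, and the function names are illustrative assumptions, and the warm-up model is omitted.

```python
import math


def average(models):
    """Intra-cluster step: element-wise mean of the member models."""
    return [sum(values) / len(models) for values in zip(*models)]


def attention_aggregate(cluster_models, global_model):
    """Inter-cluster step: weight each cluster model by a softmax over its
    negative distance to the global model (an assumed attention score)."""
    scores = [math.exp(-math.dist(m, global_model)) for m in cluster_models]
    total = sum(scores)
    weights = [s / total for s in scores]
    merged = [
        sum(w * m[i] for w, m in zip(weights, cluster_models))
        for i in range(len(global_model))
    ]
    return merged, weights


# Toy models as flat parameter lists, grouped by base-station cluster.
clusters = {"A": [[1.0, 1.0], [3.0, 3.0]], "B": [[10.0, 10.0]]}
global_model = [2.0, 2.0]
cluster_models = [average(members) for members in clusters.values()]
new_global, weights = attention_aggregate(cluster_models, global_model)
print(cluster_models, weights)
```

With these numbers, cluster A averages to the current global model and receives almost all of the weight, while the distant cluster B contributes very little.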
    Available online:  July 17, 2024 , DOI: 10.13328/j.cnki.jos.007154
    Abstract:
    The label distribution in the real world often shows the long-tail effect, where a small number of categories account for the vast majority of samples. The temporal action detection problem is no exception. The existing temporal action detection methods often focus on the head categories with a large number of samples, while neglecting the few-sample categories. This study systematically defines the long-tail temporal action detection problem and proposes a weighted class-rebalancing self-training method (WCReST) based on a semi-supervised learning framework. WCReST makes full use of the large-scale unlabeled data that exists in the real world to rebalance the label distribution in the training samples to improve the model’s fit for the tail categories. Additionally, a pseudo-label loss weighting method is proposed for the temporal action detection task to enhance the stability of model training. Experiments are conducted on the THUMOS14 and HACS Segments datasets, using video samples from the THUMOS15 and ActivityNet1.3 datasets to form corresponding unlabeled datasets. In addition, the Dance dataset is collected to meet the application requirements of video review, which includes 35 action categories, 6632 labeled videos, and 13264 unlabeled videos, preserving the significant long-tail effect in data distribution. A variety of baseline models are used to conduct experiments on the THUMOS14, HACS Segments, and Dance datasets. The results demonstrate that the proposed WCReST can improve the model’s detection performance on tail action categories and can be applied to different baseline temporal action detection models to enhance their performance.
    Available online:  July 17, 2024 , DOI: 10.13328/j.cnki.jos.007150
    Abstract:
    In recent years, multi-agent reinforcement learning methods have demonstrated excellent decision-making capabilities and broad application prospects in successful cases such as AlphaStar, AlphaDogFight, and AlphaMosaic. In multi-agent decision-making systems in real-world environments, the decision space of a task is often a parameterized action space with both discrete and continuous action variables. The complex structure of this type of action space makes traditional multi-agent reinforcement learning algorithms inapplicable, so research on parameterized action spaces is important for real-world applications. This study proposes a factored multi-agent centralised policy gradients algorithm for parameterized action spaces in multi-agent settings. The factored centralised policy gradient algorithm ensures effective coordination among the agents. The output of the dual-headed policy from the parameterized deep deterministic policy gradient algorithm is then used to achieve effective coupling in the parameterized action space. Experimental results under different parameter settings in the hybrid predator-prey scenario show that the algorithm performs well on classic multi-agent collaboration tasks with parameterized action spaces. Additionally, the algorithm’s effectiveness and feasibility are validated in a complex, highly dynamic multi-cruise-missile collaborative penetration task.
    Available online:  July 10, 2024 , DOI: 10.13328/j.cnki.jos.007161
    Abstract:
    Constructing post-quantum key encapsulation mechanisms based on lattices (especially NTRU lattices) is one of the popular research fields in lattice-based cryptography. Most lattice-based cryptographic schemes are constructed over cyclotomic rings, which, however, are vulnerable to some attacks because of their abundant algebraic structure. An alternative and more secure underlying algebraic structure is the large-Galois-group prime-degree prime-ideal number field. NTRU-Prime is an excellent NTRU-based key encapsulation mechanism over the large-Galois-group prime-degree prime-ideal number field and has been widely adopted as the default in the OpenSSH standard. This study aims to construct a key encapsulation mechanism over the same algebraic structure but with better performance than NTRU-Prime. Firstly, this work studies the security risks of cyclotomic rings, especially the attacks on power-of-two cyclotomic rings, and demonstrates the security advantages of a large-Galois-group prime-degree prime-ideal number field in resisting these attacks. Next, an NTRU-based key encapsulation mechanism named CNTR-Prime over a large-Galois-group prime-degree prime-ideal number field is proposed, along with the corresponding detailed analysis and parameter sets. Then, a pseudo-Mersenne incomplete number theoretic transform (NTT) is provided, which computes polynomial multiplication efficiently over a large-Galois-group prime-degree prime-ideal number field. In addition, an improved pseudo-Mersenne modular reduction algorithm is proposed for use in the pseudo-Mersenne incomplete NTT. It is faster than Barrett reduction by 2.6% in software implementation and is 2 to 6 times faster than both Montgomery reduction and Barrett reduction in hardware implementation. Finally, a C-language implementation of CNTR-Prime is presented. Compared to SNTRU-Prime, CNTR-Prime has advantages in security, bandwidth, and implementation efficiency. 
For example, CNTR-Prime-761 has an 8.3% smaller ciphertext size, and its security is strengthened by 19 bits for both classical and quantum security. CNTR-Prime-761 is faster in key generation, encapsulation, and decapsulation algorithms by 25.3×, 10.8×, and 2.0×, respectively. The classical and quantum security of CNTR-Prime-653 is already comparable to that of SNTRU-Prime-761, with a 13.8% reduction in bandwidth, and it is faster in key generation, encapsulation, and decapsulation by 33.9×, 12.6×, and 2.3×, respectively. This study provides an important reference for subsequent research, analysis, and optimization of similar Lattice-based cryptographic schemes.
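To show why pseudo-Mersenne moduli allow cheap reduction, here is a minimal sketch of the textbook folding technique for a modulus of the form q = 2^k - c. It is not the improved algorithm proposed in the study, and the modulus 65521 = 2^16 - 15 is an illustrative prime, not a CNTR-Prime parameter.

```python
K = 16
C = 15
Q = (1 << K) - C  # 65521, a prime of pseudo-Mersenne form 2^K - C (illustrative)
MASK = (1 << K) - 1


def pm_reduce(x: int) -> int:
    """Reduce a non-negative integer modulo Q without division.

    Since 2^K ≡ C (mod Q), writing x = hi * 2^K + lo gives
    x ≡ hi * C + lo (mod Q), so the high bits can be folded down
    with one small multiplication and one addition per round.
    """
    while x >> K:  # fold until x fits in K bits
        x = (x >> K) * C + (x & MASK)
    # Now x < 2^K = Q + C, so one conditional subtraction is enough.
    return x - Q if x >= Q else x


product = 65520 * 65519  # e.g. the product of two residues
print(pm_reduce(product), product % Q)
```

Each folding round replaces a division by shifts, a mask, and a multiplication by the small constant C, which is the property that makes such moduli attractive in both software and hardware.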
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007195
    Abstract:
    Existing multi-view attributed graph clustering methods usually learn consistent information and complementary information in a unified representation of multiple views. However, under this fuse-then-learn approach, the specific information of the original views is lost, and consistency and complementarity are difficult to balance within the unified representation. To retain the original information of each view, this study adopts a learn-then-fuse approach: the shared representation and the specific representation of each view are learned separately before fusion, so that the consistent and complementary information of multiple views is learned at a finer granularity. On this basis, a multi-view attributed graph clustering model based on shared and specific representation (MSAGC) is constructed. Specifically, the primary representation of each view is obtained by a multi-view graph encoder, from which the shared information and specific information of each view are derived. The consistent information of multiple views is then learned by aligning the shared information of the views, the complementary information is exploited by combining the view-specific information, and redundant information is handled through a difference constraint. After that, a multi-view decoder is trained to reconstruct the topological structure and attribute feature matrix of the graph. Finally, an additional self-supervised clustering module drives the graph representation learning and clustering tasks toward consistency. The effectiveness of MSAGC is verified on real multi-view attributed graph datasets.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007217
    Abstract:
    Machine translation (MT) aims to build an automatic translation system that transforms a given sequence in the source language into a target-language sequence with identical semantics. Because of its wide range of applications, MT has been an important research direction in natural language processing and artificial intelligence. In recent years, neural machine translation (NMT) has greatly surpassed statistical machine translation (SMT) in performance and has become the mainstream method in MT research. However, NMT generally takes the sentence as the unit of translation. In document-level translation scenarios, translating each sentence independently separates it from its discourse context, which can cause discourse errors such as mistranslated words and incoherent sentences. Incorporating document-level information into the translation procedure is therefore a more reasonable and natural way to resolve discourse errors. This is the goal of document-level neural machine translation (DNMT), which has become a popular direction in MT research. This study reviews and summarizes work on DNMT in terms of discourse evaluation methods, the datasets and models used, and other aspects, to help researchers efficiently grasp the current state and future directions of DNMT. It also discusses the prospects and some challenges of DNMT, hoping to bring some inspiration to researchers.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007219
    Abstract:
    Self-training, a common strategy for tackling the scarcity of annotated data, typically takes the high-confidence automatically annotated data generated by a teacher model as reliable data. However, in low-resource scenarios for relation extraction (RE) tasks, this approach is hindered by the limited generalization capacity of the teacher model and by easily confused relation categories. Consequently, efficiently identifying reliable data among the automatically labeled data becomes challenging, and a large amount of low-confidence noisy data is generated. Therefore, this study proposes a novel self-training framework for low-resource relation extraction (SF-LRE). The approach helps the teacher model select reliable data based on its predictions on paraphrases, and uses a partial-labeling scheme to extract ambiguous but still reliable data from the low-confidence data. Considering the candidate categories of the ambiguous data, this study proposes a negative training approach based on the set of negative labels. Finally, a unified approach capable of both positive and negative training is proposed for the integrated training of reliable data and ambiguous data. In the experiments, SF-LRE consistently demonstrates significant improvements in low-resource scenarios on two widely used RE datasets, SemEval2010 Task-8 and Re-TACRED.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007220
    Abstract:
    Learned indexes are assisting or gradually replacing traditional index structures due to their low memory usage and high query performance. However, the online retraining triggered by data updates makes them ill-suited to scenarios with frequent data updates. To avoid index reconstruction under frequent data updates without significantly increasing memory consumption, this study proposes an adaptive update-distribution-aware learned index named DRAMA. It uses an LSM-Tree-like delayed learning method to actively learn the characteristics of the data update distribution, approximate fitting techniques to quickly establish the update-distribution model, a model merging strategy to replace frequent retraining, and a hybrid compression technique to reduce the memory usage of model parameters in the index. The index is constructed and validated on both real and synthetic datasets. The results show that, compared to traditional indexes and state-of-the-art learned indexes, the proposed index effectively reduces query latency in a data update environment without additional memory consumption.
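The general mechanics of a learned index with deferred updates can be sketched as follows. This is not DRAMA: it assumes a single linear model with a measured error bound, an insert buffer that is merged in bulk (loosely in the spirit of LSM-style delayed handling), and a plain refit on merge instead of DRAMA's model merging and compression. The class and method names are hypothetical.

```python
import bisect


class LearnedIndexSketch:
    """Toy learned index: a linear model predicts a key's position in a
    sorted array, and a bounded local search corrects the prediction."""

    def __init__(self, keys):
        self.keys = sorted(keys)  # assumes at least one key
        self.buffer = []          # sorted buffer of not-yet-merged inserts
        self._fit()

    def _fit(self):
        lo, hi = self.keys[0], self.keys[-1]
        self.base = lo
        self.slope = (len(self.keys) - 1) / (hi - lo) if hi != lo else 0.0
        # Largest gap between predicted and true position over all keys.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round((key - self.base) * self.slope))

    def contains(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return True
        j = bisect.bisect_left(self.buffer, key)
        return j < len(self.buffer) and self.buffer[j] == key

    def insert(self, key):
        bisect.insort(self.buffer, key)  # no retraining on the write path

    def merge(self):
        self.keys = sorted(self.keys + self.buffer)
        self.buffer = []
        self._fit()  # a full refit; DRAMA instead merges models


index = LearnedIndexSketch([10, 20, 30, 45, 50])
index.insert(25)
print(index.contains(45), index.contains(25), index.contains(26))
```

Lookups stay correct between merges because the buffer is searched as a fallback, so the model only has to be rebuilt when the buffer is folded in.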
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007228
    Abstract:
    In software engineering, code repositories contain a wealth of knowledge that can provide developers with examples of programming practice. If repetitive patterns that occur frequently in source code can be effectively extracted in the form of code templates, programming efficiency could be significantly improved. In current practice, developers often reuse existing solutions by searching through source code. However, this typically returns a large number of similar and redundant results, increasing the burden of subsequent filtering. Moreover, template mining techniques based on cloned code often fail to cover larger patterns composed of dispersed small clones, which limits the practicality of the templates. A new method is proposed for extracting and retrieving code templates based on code clone detection. To improve template quality, the method extracts function-level code templates more efficiently by stitching together multiple fragment-level clones and by extracting and aggregating the shared parts of method-level clones. Based on the mined code templates, this study proposes a triplet representation of code structural features that effectively supplements plain-text features and provides an efficient and concise structural representation. In addition, this study presents a template feature retrieval method that combines structural and textual search to retrieve templates by matching features of the programming context. The tool implementing this method, CodeSculptor, demonstrates a strong capability to extract high-quality code templates in a test on a codebase of 45 high-quality open-source Java projects. 
The results show that the templates mined by the tool achieve an average code reduction of 60.87%, with 92.09% produced by stitching fragment-level clones, a proportion of templates that traditional methods cannot identify. This demonstrates the superior performance of the method in recognizing and constructing code templates. Furthermore, the accuracy of the top-5 search results in code template search and recommendation is 96.87%. A preliminary case study on 9600 randomly selected templates reveals that most of the sampled code templates are semantically complete and coherent, affirming their practicality. Nonetheless, there are a few meaningless templates, which leaves room to refine the proposed template extraction strategy in the future. The user study further shows that code development tasks can be completed more efficiently with CodeSculptor.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007213
    Abstract:
    The task of completing knowledge graphs aims to reveal the missing fact triples within the knowledge graph based on existing fact triples (head entity, relation, tail entity). Existing research primarily focuses on utilizing the structural information within the knowledge graph. However, these efforts overlook that other modal information contained within the knowledge graph may also be helpful for knowledge graph completion. In addition, since task-specific knowledge is typically not integrated into general pre-training models, the process of incorporating task-related knowledge into modal information extraction becomes crucial. Moreover, given that different modal features contribute uniquely to knowledge graph completion, effectively preserving useful multimodal information poses a significant challenge. To address these issues, this paper proposes a multimodal knowledge graph completion method that incorporates task knowledge. It utilizes a fine-tuned multimodal encoder tailored to the current task to acquire entity vector representations across different modalities. Subsequently, a modal fusion-filtering module based on recurrent neural networks is utilized to eliminate task-independent multimodal features. Finally, the study utilizes a simple isomorphic graph network to represent and update all features, thus effectively accomplishing multimodal knowledge graph completion. Experimental results demonstrate the effectiveness of our approach in extracting information from different modalities. Furthermore, it shows that our method enhances entity representation capability through additional multimodal filtering and fusion, consequently improving the performance of multimodal knowledge graph completion tasks.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007226
    Abstract:
    When writing code, software developers often refer to code snippets that implement similar functions in the project. A code generation model behaves similarly when generating code fragments, using the code context provided in the input as a reference. Code completion based on retrieval augmentation follows the same idea: external code retrieved from a retrieval library is used as additional context to prompt the generation model so that it can complete the unfinished code fragments. Existing retrieval-augmented code completion methods directly splice the input code and the retrieval results together as the input of the generation model. This brings a risk that the retrieved code fragments may mislead the model rather than help it, resulting in inaccurate or irrelevant code. In addition, the retrieved external code is spliced with the input code and fed to the model regardless of whether it is actually related to the input code. Consequently, the effect of this method largely depends on the accuracy of the code retrieval stage. If usable code fragments cannot be returned in the retrieval phase, the subsequent code completion may also suffer. An empirical study is conducted on the retrieval augmentation strategies in existing code completion methods. Through qualitative and quantitative experiments, the impact of each stage of retrieval augmentation on the code completion effect is analyzed. The empirical study identifies three factors that influence the effect of retrieval augmentation, namely, code granularity, code retrieval methods, and post-processing methods. Based on the conclusions of the empirical study, a code completion method named MAGIC (multi-stage optimization for retrieval augmented code completion) is proposed, which improves retrieval augmentation by optimizing the code retrieval strategy in stages.
The improved strategies such as code segmentation, retrieval-reranking, and template prompt generation are designed to effectively enhance the auxiliary generation effect of the code retrieval module on the code completion model. Meanwhile, those strategies can also reduce the interference of irrelevant code in the code generation phase of the model and improve the quality of generated code. The experimental results on the Java code dataset show that, compared with the existing code completion methods based on retrieval augmentation, this method increases the editing similarity and perfect matching index by 6.76% and 7.81%, respectively. Compared with the large code model with 6B parameters, this method can save 94.5% of the video memory and 73.8% of the inference time, and improve the editing similarity and complete matching index by 5.62% and 4.66% respectively.
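The two metrics reported above can be illustrated with a small sketch. It assumes that "editing similarity" is a normalized Levenshtein similarity and that "perfect matching" is a whitespace-trimmed string equality; the abstract does not give the exact definitions, so the function names and the normalization below are illustrative assumptions, not the evaluation code used for MAGIC.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(pred: str, ref: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))


def exact_match(pred: str, ref: str) -> bool:
    """Assumed definition: equality after trimming surrounding whitespace."""
    return pred.strip() == ref.strip()


# Hypothetical completion and reference, for illustration only.
prediction = "return a + b;"
reference = "return a + c;"
similarity = edit_similarity(prediction, reference)
matched = exact_match(prediction, reference)
```

Under these assumptions, a completion that differs from the reference by a single character scores high on similarity but still counts as a miss for exact matching, which is why the two metrics are usually reported together.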
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007227
    Abstract:
    In the field of model-based diagnosis, all minimal hitting sets (MHS) of the minimal conflict sets (MCS) are the candidate diagnoses of the device to be diagnosed, so the calculation of MHS is a key step for generating candidate diagnoses. Computing MHS is a classic NP-hard constraint solving problem, and the difficulty of solving it grows exponentially with problem size. The Boolean algorithm is a typical method for calculating MHS. However, in the process of solving, most of the runtime is taken up by the minimization of the intermediate solution sets. This study proposes the BWSS algorithm, which is combined with suspicious set clusters, for calculating MHS. By analyzing the spanning tree rule of the Boolean algorithm in depth, the set that causes the candidate solution to become a superset is found. When extending elements to the root node, the candidate solution, if discovered to share at least one empty set with the suspicious set cluster, shall be minimal. Otherwise, the solution will be removed. A recursive strategy is employed to ensure that all and only MHS are generated at the end of the algorithm. In addition, at least m (m≥1) elements of each candidate solution, or even the entire solution, need no complex minimization. Theoretically, the BWSS algorithm is far less complex than the Boolean algorithm. Experimental results on random data and a large volume of reference circuit data show that, compared with many other state-of-the-art methods, the proposed algorithm reduces runtime by several orders of magnitude.
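To make the notion of a minimal hitting set concrete, here is a brute-force sketch. It is neither the Boolean algorithm nor BWSS; it only enumerates candidates by increasing size and keeps those that intersect every conflict set and contain no smaller hitting set. The function name and the example conflict sets are hypothetical.

```python
from itertools import combinations


def minimal_hitting_sets(conflict_sets):
    """Enumerate all minimal hitting sets by brute force (exponential time)."""
    sets = [frozenset(s) for s in conflict_sets]
    universe = sorted(set().union(*sets)) if sets else []
    found = []
    # Candidates are visited by increasing size, so every hitting set that is
    # smaller than the current candidate has already been recorded.
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = frozenset(candidate)
            # A superset of a recorded hitting set cannot be minimal.
            if any(h <= cand for h in found):
                continue
            # A hitting set must intersect every conflict set.
            if all(cand & s for s in sets):
                found.append(cand)
    return found


# Hypothetical minimal conflict sets over components 1..4.
conflicts = [{1, 2}, {2, 3}, {1, 3, 4}]
result = minimal_hitting_sets(conflicts)
```

For these three conflict sets the candidate diagnoses are {1, 2}, {1, 3}, {2, 3}, and {2, 4}. The superset check inside the loop is the kind of minimization step that, according to the abstract, dominates the runtime of the Boolean algorithm.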
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007221
    Abstract:
    A cuckoo filter is a space-efficient approximate membership query data structure, widely used in network systems for applications such as network routing, network measurement, and network caching. However, the traditional design of cuckoo filters has not adequately considered the scenario in network systems where some or all queries in the collection are known, and these queries come with associated costs. This limitation results in the suboptimal performance of existing cuckoo filters in such situations. To address this, the variable hashing-fingerprint cuckoo filter (VHCF) has been developed. VHCF introduces variable fingerprint hashing technology, taking into account the known query collection and their associated costs. By searching for the optimal fingerprint hash function for each hash bucket, the overall cost of false positives is significantly reduced. In addition, this study proposes a single-hash technology to reduce the additional computational overhead caused by the variable-hash technology. A theoretical analysis of the operational complexity and false positive rate of VHCF is also provided. Finally, experimental and theoretical results both demonstrate that VHCF achieves a significantly lower false positive rate than existing cuckoo filters and their variants while ensuring comparable query throughput. Specifically, VHCF only needs to allocate 1–2 bits for each hash index unit, which can reduce the false positive rate by 2–8 times compared to the standard cuckoo filter.
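For readers unfamiliar with the baseline data structure, the following is a minimal sketch of a standard cuckoo filter with partial-key cuckoo hashing. It is not VHCF: it uses one fixed fingerprint hash for all buckets, whereas VHCF, as described above, searches for a fingerprint hash function per bucket. The class name, parameter defaults, and the use of SHA-256 are assumptions made for illustration.

```python
import hashlib
import random


class CuckooFilter:
    """Standard cuckoo filter sketch (fixed fingerprint hash, not VHCF)."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=8, max_kicks=500):
        # A power-of-two bucket count keeps the XOR-based alternate index in range
        # and makes the alternate-index mapping its own inverse.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_bits = fp_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        fp = self._hash(b"fp:" + item.encode()) % (1 << self.fp_bits)
        return fp or 1  # avoid the all-zero fingerprint

    def _index(self, item: str) -> int:
        return self._hash(b"ix:" + item.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on (index, fp).
        return (index ^ self._hash(fp.to_bytes(4, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full; the last evicted fingerprint is dropped

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]


# Hypothetical flow identifiers, for illustration only.
cf = CuckooFilter()
flows = [f"10.0.0.{i}" for i in range(100)]
inserted = [cf.insert(f) for f in flows]
```

A lookup can return a false positive when an absent item shares a fingerprint with a stored item in one of its two buckets; this is the error whose cost VHCF aims to reduce when the query set and its costs are known in advance.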
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007222
    Abstract:
    Smart contracts are scripts running on the Ethereum blockchain that are capable of handling intricate business logic, most of which are written in Solidity. As security concerns surrounding smart contracts intensify, a formal verification method employing the modeling, simulation, and verification language (MSVL) alongside propositional projection temporal logic (PPTL) is proposed. A SOL2M converter is developed, facilitating semi-automatic modeling from Solidity programs to MSVL programs. However, a proof of the operational semantic equivalence of Solidity and MSVL is lacking. This study first defines Solidity’s operational semantics using big-step semantics across four levels: semantic elements, evaluation rules, expressions, and statements. Subsequently, it establishes equivalence relations between states, expressions, and statements in Solidity and MSVL. In addition, leveraging the operational semantics of both languages, it employs structural induction to prove expression equivalence and rule induction to establish statement equivalence.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007223
    Abstract:
    Formal methods have made significant strides in the field of requirements consistency verification. However, as the complexity of embedded system requirements continues to increase, verifying requirements consistency faces the challenge of dealing with an excessively large state space. To effectively reduce the verification state space, while also considering the strong dependency among devices in embedded system requirements, this study proposes a compositional verification method for ensuring the consistency of requirements in complex embedded systems. This method is based on requirement decomposition and identification of dependencies among requirements. By leveraging these dependencies, it assembles verification subsystems, enabling the compositional verification of complex embedded system requirements and facilitating the initial identification of inconsistencies. Specifically, the problem frames approach is employed for requirement modeling and decomposition, while a domain-specific device knowledge base is utilized for modeling the physical characteristics of devices. During the assembly of verification subsystems, models of expected software behavior are generated and dynamically integrated with physical device models. Finally, the feasibility and effectiveness of this method are validated through a case study of an airborne reconnaissance control system, demonstrating a significant reduction in the verification state space through five case evaluations. This method thus provides a practical solution for verifying the requirements of complex embedded systems.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007224
    Abstract:
    The rich development ecosystem of Python provides a large number of third-party libraries, significantly boosting developers' efficiency and quality. Third-party library developers encapsulate underlying code, enabling upper-layer application developers to swiftly accomplish tasks by calling relevant APIs. However, APIs of third-party libraries are not constant. Owing to fixes, refactoring, and feature additions, these libraries undergo continuous updates. Some APIs change incompatibly after updates, leading to abnormal termination or inconsistent results in upper-layer applications. Therefore, the API compatibility of Python third-party libraries has become an issue that needs to be solved. Existing studies have examined API compatibility issues of Python third-party libraries, but the causes have yet to be fully classified, so fine-grained causes cannot be provided. To this end, an empirical study is conducted on the symptoms and causes of API compatibility issues in Python third-party libraries, and a targeted static detection method is proposed. Initially, this study gathers 108 pairs of incompatible API versions by combining version update logs and regression tests across 6 version pairs of the Flask and Pandas libraries. Subsequently, an empirical study is conducted on the collected data, summarizing the symptoms and causes of compatibility issues. Finally, this study proposes a static analysis-based detection method for incompatible Python APIs, providing syntactic-level causes of incompatible API issues. This study conducts experimental evaluations on 12 version pairs of 4 popular Python third-party libraries. The results show that the proposed method performs well in effectiveness, generalization, time performance, memory performance, and usefulness.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007225
    Abstract:
    Requirements for the effective real-time analysis of instant data modification of database systems have driven the rapid development of Hybrid Transactional/Analytical Processing (HTAP) database systems, which support processing both OLTP and OLAP workloads. To realize fair comparisons and healthy development, it is crucial to define and implement new benchmarks to evaluate new features of HTAP database systems. Firstly, this study analyzes the key characteristics of HTAP database systems and summarizes the distinct technologies in their implementations. Secondly, the difficulties of designing HTAP database systems and the challenges of constructing HTAP benchmarks are extracted. Based on these, the design dimensions of HTAP benchmarks are proposed, including data generation, workload generation, evaluation metrics, and consistency model supportability. This study compares existing HTAP benchmarks in terms of design dimensions and implementation technologies and sums up their merits and defects in different dimensions. Additionally, the published benchmarks are demonstrated, and their abilities to evaluate key features and support horizontal comparisons among HTAP database systems are analyzed. Finally, this study summarizes the requirements for HTAP benchmarks and some future research directions, pointing out that semantically consistent workload control and fresh data access metrics are the key issues in defining benchmarks for HTAP database systems.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007208
    Abstract:
    As a distributed approach to problem solving, crowdsourcing reduces costs and efficiently utilizes resources. While blockchain technology is introduced to solve the problem of over-centralization in traditional crowdsourcing platforms, its transparency brings the risk of privacy leakage. Traditional anonymous authentication can hide the user's identity, but anonymity can be abused, and worker selection becomes more difficult. In this paper, a decentralized accountable attribute-based authentication scheme is proposed and combined with blockchain to design a novel crowdsourcing scheme. Using decentralized attribute-based encryption and non-interactive zero-knowledge proof, the scheme protects the privacy of users’ identities with linkability and traceability, and the requester can devise access policies to select workers. In addition, the scheme improves the security of the system by implementing attribute authorization authority and tracking groups through the threshold secret sharing technique. Through experimental simulation and analysis, it is demonstrated that the scheme meets the requirements of time and storage overhead in practical application.
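The threshold secret sharing mentioned above can be illustrated with Shamir's (t, n) scheme. The abstract does not state which threshold scheme the paper uses, so Shamir's scheme, the prime modulus, and the function names below are assumptions chosen for illustration only.

```python
import secrets

# Assumed field modulus: the Mersenne prime 2^127 - 1.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, num_shares: int):
    """Split `secret` into shares; any `threshold` of them recover it."""
    # Random polynomial of degree threshold - 1 with the secret as constant term.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Hypothetical tracing key shared among 5 group members with threshold 3.
shares = split_secret(123456789, threshold=3, num_shares=5)
recovered = recover_secret(shares[:3])
```

With a threshold of 3, any three members can jointly reconstruct the value, while fewer than three learn nothing about it, which is the property that lets a tracking group act only when enough members cooperate.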
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007149
    Abstract:
    Recently, deep learning has received increasing attention from researchers due to its excellent performance in various scenarios, but these methods often rely on the independent and identically distributed (i.i.d.) assumption. Domain adaptation is proposed to mitigate the impact of distribution shift; it uses labeled source domain data and unlabeled target domain data to achieve better performance on target data. Most existing methods are devised for static data, while methods for time series data need to capture the dependencies between variables. Although these methods use feature extractors for time series data, such as recurrent neural networks, to learn the dependencies between variables, they often extract redundant information. This information is easily entangled with semantic information, affecting the model performance. To solve these problems, this study proposes a path-signature-based time-series domain adaptation (PSDA) method. On the one hand, this method uses path signature transformation to capture sparse dependencies between variables and eliminate redundant correlations while preserving semantic dependencies, thereby facilitating the extraction of discriminative features from temporal data. On the other hand, the invariant dependency relationships are preserved by constraining the consistency of dependency relationships among different domains, and the changing dependency relationships between domains are excluded, which is conducive to extracting generalized features from temporal data. Based on the above methods, the study further proposes a distance metric and a generalization bound theory and obtains the best experimental results on multiple time series domain adaptation standard datasets.
    Available online:  July 03, 2024 , DOI: 10.13328/j.cnki.jos.007159
    Abstract:
    The network traffic measurement technology of programmable switches is capable of handling high-speed network traffic and offers significant advantages in terms of flexibility and real-time processing. However, because the internal logic of switches must be configured using the complex P4 programming language, the deployment of measurement tasks becomes intricate and error-prone. Furthermore, measurement accuracy is often constrained by the measurement resources available to measurement tasks within a switch. This study explores intent-based networking and network traffic measurement technology in detail and introduces an intent-driven network traffic distributed measurement method. Firstly, an intent representation format based on measurement intent primitives is designed, and an intent compiler is developed to translate abstract intent representations into executable P4 code. Secondly, a network traffic distributed measurement approach is introduced, utilizing the resources of multiple switches to collaboratively complete a measurement task in a distributed manner. The dynamic allocation of measurement resources and counter-configuration algorithms are exemplified with heavy-hitter measurements. Finally, experimental results demonstrate the feasibility and certain advantages of the proposed method.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007199
    Abstract:
    Social media text summarization aims to provide concise summaries for large-scale social media short texts (referred to as posts) targeting specific topics. Given the brief and informal contents of posts, traditional methods confront the challenges of sparse features and insufficient information. Recent research endeavors have leveraged social relationships among posts to refine post contents and remove redundant information, but these efforts neglect the presence of unreliable noise relationships in real social media contexts, leading to erroneous assessments of post importance and diversity. Therefore, this study proposes a novel unsupervised model DSNSum, which improves summarization performance by removing noise relationships in the social networks. Firstly, the noise relationships in real social relationship networks are statistically verified. Secondly, two noise functions are designed based on sociological theories, and a denoising graph auto-encoder (DGAE) is constructed to mitigate the influence of noise relationships and cultivate post contents of credible social relationships. Finally, a sparse reconstruction framework is utilized to select posts that maintain coverage, importance, and diversity to form a summary of a certain length. Experimental results on a total of 22 topics from two real social media platforms (Twitter and Sina Weibo) demonstrate the efficacy of the proposed model and provide new insights for subsequent research in related fields.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007200
    Abstract:
    As human pose estimation (HPE) methods based on graph convolutional networks (GCN) cannot sufficiently aggregate the spatiotemporal features of skeleton joints, which restricts the extraction of discriminative features, a parallel multi-scale spatiotemporal graph convolutional network (PMST-GNet) model is built in this paper to improve the performance of 3D HPE. Firstly, a diagonally dominant spatiotemporal attention graph convolutional layer (DDA-STGConv) is designed to construct a cross-domain spatiotemporal adjacency matrix and model the joint features based on self-constraint and attention mechanism constraint, thereby enhancing information interaction among nodes. Then, a graph topology aggregation function is devised to construct different graph topologies, and a parallel multi-scale sub-graph network module (PM-SubGNet) is constructed with DDA-STGConv as the basic unit. Finally, a multi-scale feature cross fusion block (MFEB) is designed, by which multi-scale information among PM-SubGNets can interact to improve the feature representation of GCN, thereby better extracting the context information of skeleton joints. The experimental results on the mainstream 3D HPE datasets Human3.6M and MPI-INF-3DHP show that the proposed PMST-GNet model performs well in 3D HPE and is superior to current mainstream GCN-based algorithms such as Sem-GCN, GraphSH, and UGCN.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007201
    Abstract:
    Many computational problems on graphs are NP-hard, so these problems are often restricted to well-structured graphs. This approach has yielded many efficient algorithms on natural graph classes, many of which can be subsumed under the framework of algorithmic meta-theorems. Algorithmic meta-theorems are general results that provide efficient algorithms for model-checking problems, that is, problems of verifying whether a formula of a certain logic is satisfied on a specific structure. Most existing algorithmic meta-theorems rely on structural graph theory in studying graph properties and take efficiency under parameterized complexity into consideration. On many well-structured graphs, some model-checking problems with common logics have efficient algorithms under parameterized complexity; in other words, they turn out to be fixed-parameter tractable. Due to the varying expressive power of different logics, the tractability of the corresponding model-checking problems differs greatly. Therefore, understanding the tractability is a significant question for algorithmic meta-theorems. Results have shown that the tractability of first-order logic model checking is closely related to the sparsity of input graphs. As the understanding of sparse graphs is fairly complete now, the focus of current research has shifted towards well-structured dense graphs, where challenging problems are abundant. Results show that model-checking problems may be tractable for many complex dense graphs, but many unsolved problems in this field still require further exploration. By giving an overview of the research about algorithmic meta-theorems, this survey aims to offer assistance and momentum to related research in China.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007202
    Abstract:
    In recent years, research achievements in deep learning have found widespread applications globally. To enhance the training efficiency of large-scale deep learning models, industry practices often involve constructing GPU clusters and configuring efficient task schedulers. However, deep learning training tasks exhibit complex performance characteristics such as performance heterogeneity and placement topological sensitivity. Scheduling without considering performance can lead to issues such as low resource utilization and poor training efficiency. In response to this challenge, a great number of schedulers of deep learning training tasks based on performance modeling have emerged. These schedulers, by constructing accurate performance models, delve into the intricate performance characteristics of tasks. Based on this understanding, they design more optimized scheduling algorithms, thereby forming more efficient scheduling solutions. This study begins with a modeling design perspective, providing a categorized review of the performance modeling methods employed by current schedulers. Subsequently, based on the optimized scheduling approaches from performance modeling by schedulers, a systematic analysis of existing task scheduling efforts is presented. Finally, this study outlines prospective research directions for performance modeling and scheduling in the future.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007203
    Abstract:
    Crash test sequences generated by Android automated test tools often contain too many redundant events, which makes test replay, defect comprehension, and repair difficult, so a great number of test sequence reduction approaches have been proposed. However, current approaches focus only on application interface changes and ignore the internal state changes during program execution. Moreover, current approaches model application states only at a single and abstract granularity, such as control widget granularity or activity granularity, resulting in long test sequences after reduction or inefficient reduction. This study proposes a multi-granularity Android test sequence reduction method based on event labeling. By taking into account the Android lifecycle management mechanism and data flow analysis to label critical events that trigger crashes, this method can narrow the sequence reduction space and design a strategy of rough selection under low granularity and detailed reduction under high granularity. Finally, a crash test sequence set containing complex scenarios such as inter-application interaction and user input is collected, and the comparison with other test sequence reduction approaches on this set verifies the effectiveness of the method proposed in this study.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007205
    Abstract:
    As software vulnerabilities grow in type, volume, and complexity, researchers have proposed various techniques to help developers discover, detect, and localize vulnerabilities. However, developers still need to exert considerable effort to manually repair these vulnerabilities. In recent years, some researchers have focused on automated software vulnerability repair. However, current advanced techniques treat this task merely as a generic text generation problem and do not locate the defects. As a result, the generation space of the repair program is large, and the generated repair program is of low quality. Providing developers with such low-quality repairs affects the efficiency and effectiveness of vulnerability repair. To solve the above problems, a general-purpose vulnerability repair approach based on chain-of-thought, named CotRepair, is proposed in this study. By utilizing chain-of-thought technology, the model first predicts the locations that are most likely to contain vulnerable code, and then generates the repair program more accurately based on the predicted locations. The experimental results show that CotRepair outperforms the baselines in various metrics, and the effectiveness of the proposed approach is demonstrated from multiple aspects.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007206
    Abstract:
    In the task of numerical question-answering with texts and tables, the models are required to perform numerical reasoning based on given texts and tables. The goal is to generate a computational program consisting of multi-step numerical calculations, and the program’s results are used as the answer to the question. To model the texts and tables, the current work linearizes the table into a series of cell sentences through templates and then designs a generator based on the texts and cell sentences to produce the computational program. However, this approach faces a specific problem: the differences between cell sentences generated by templates are minimal, making it difficult for the generator to distinguish between cell sentences that are essential for answering the question (supporting cell sentences) and those irrelevant to the question (distracting cell sentences). Ultimately, the model generates incorrect computational programs based on distracting cell sentences. To tackle this issue, this study proposes an approach called multi-granularity cell semantic contrast (MGCC) for our generator. The main purpose of this approach is to enhance the representation distances between supporting and distracting cell sentences, thereby helping the generator differentiate between them. Specifically, this contrast mechanism is composed of coarse-grained cell semantic contrasts and fine-grained constituent element contrasts, including contrasts in row names, column names, and cell values. The experimental results validate that the proposed MGCC approach enables the generator to achieve better performance than the benchmark model on the FinQA and MultiHiertt numerical reasoning datasets. On the FinQA dataset, it leads to an improvement of up to 3.38% in answer accuracy. Notably, on the more challenging hierarchical table dataset MultiHiertt, it yields a 7.8% increase in the accuracy of the generator. 
The subsequent analytical experiments further verify that the multi-granularity cell semantic contrast approach contributes to the model’s improved differentiation between supporting and distracting cell sentences.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007207
    Abstract:
    The interactions between elements in contemporary software systems are notably intricate, encompassing relationships between packages, classes, and functions. Accurate comprehension of these relationships is pivotal for optimizing system structures and enhancing software quality. Analyzing inter-package relationships can help unveil dependencies between modules, thereby assisting developers in more effectively managing and organizing software architectures. On the other hand, a clear understanding of inter-class relationships contributes to the creation of code repositories that are more scalable and maintainable. Moreover, a clear understanding of inter-function relationships facilitates rapid identification and resolution of logical errors within programs, consequently enhancing the robustness and reliability of the software. However, current predictions of software system interaction confront challenges such as granularity disparities, inadequate features, and version changes. To address this challenge, this study constructs corresponding software network models based on the three granularities, including software packages, classes, and functions. It introduces a novel approach combining local and global features to reinforce the analysis and prediction of software systems through feature extraction and link prediction of software networks. This approach is based on the construction and handling of software networks, involving specific steps such as leveraging the node2vec method to learn local features of software networks and combining Laplacian feature vector encoding to comprehensively represent the global positional information of nodes. Subsequently, the Graph Transformer model is employed to further optimize the feature vectors of node attributes, culminating in the completion of the interaction prediction task of the software system. 
Extensive experimental validations are conducted on three Java open-source projects, encompassing within-version and cross-version interaction prediction tasks. The experimental results demonstrate that, compared to benchmark methods, the proposed approach achieves average increases of 8.2% and 8.5% in AUC and AP values, respectively, in within-version prediction tasks, and average increases of 3.5% and 2.4% in AUC and AP values, respectively, in cross-version prediction tasks.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007209
    Abstract:
    In this study, the problem of mining cluster frequent patterns in time-ordered transaction data is discussed for the first time. To deal with redundant operations when the Naive algorithm solves this problem, the improved cluster frequent pattern mining (ICFPM) algorithm is proposed. The algorithm uses two optimization strategies. On the one hand, it can use the defined parameter minCF to effectively reduce the search space of mining results; on the other hand, it can refer to the discriminative results of (n – 1)-itemsets to accelerate the discriminative process of cluster frequent n-itemset. The algorithm also applies the ICFPM-list structure to reduce the overhead of the candidate n-itemsets construction. Simulation experiments based on two real-world datasets demonstrate the effectiveness of the ICFPM algorithm. Compared with the Naive algorithm, the ICFPM algorithm improves substantially in terms of time and space efficiency, which makes it an effective method for solving clustered frequent pattern mining.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007210
    Abstract:
    DEFAULT, a new lightweight cryptosystem presented at Asiacrypt in 2021, is designed to protect the information security of Internet of Things (IoT) devices, such as microchips, microcontrollers, and sensors. Based on the ciphertext-only attack assumption, the statistical fault analysis of the DEFAULT cipher with the algebraic relationship is proposed. The statistical fault analysis uses the random nibble-oriented fault model. It not only combines statistical distributions of the intermediate states before and after the fault injections but also takes advantage of the algebraic relationship and novel distinguishers, including Anderson Darling test–Square Euclidean imbalance, Anderson Darling test–Maximum likelihood estimate, and Anderson Darling test–Hamming weight. The analysis requires at least 1344 faults to achieve the reliability of 99% in the recovery of the 128-bit secret key of DEFAULT. The theoretical analysis and experimental results show that the DEFAULT lightweight cryptosystem is not resistant to the statistical fault attack based on the algebraic relationship. This study provides an important reference for the security analysis of the other lightweight cryptosystems.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007211
    Abstract:
    In real-world scenarios, rich interaction relationships often exist among users on different platforms such as e-commerce, consumer reviews, and social networks. Constructing these relationships into a graph structure and applying graph neural networks (GNNs) for malicious user detection has become a research trend in related fields in recent years. However, due to the small proportion of malicious users, as well as their disguises and high labeling costs, traditional GNN methods are limited by problems such as data imbalance, data inconsistency, and label scarcity. This study proposes a semi-supervised graph representation learning-based method for detecting malicious nodes. The method improves the GNN method for node representation learning and classification. Specifically, a class-aware malicious node detection (CAMD) method is constructed, which introduces a class-aware attention mechanism, inconsistent GNN encoders, and class-aware imbalance loss functions to solve the problems of data inconsistency and imbalance. Furthermore, to address the limitation of CAMD in detecting malicious nodes with scarce labels, a graph contrastive learning-based method, CAMD+, is proposed. CAMD+ introduces data augmentation, self-supervised graph contrastive learning, and class-aware graph contrastive learning to enable the model to learn more information from unlabeled data and fully utilize scarce label information. Finally, a large number of experimental results on real-world datasets verify that the proposed methods outperform all baseline methods and demonstrate good detection performance in situations with different degrees of label scarcity.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007212
    Abstract:
    In recent years, there has been rapid advancement in the application of artificial intelligence technology to sequential decision-making and adversarial game scenarios, resulting in significant progress in domains such as Go, various games, poker, and Mahjong. Notably, systems like AlphaGo, OpenAI Five, AlphaStar, DeepStack, Libratus, Pluribus, and Suphx have achieved or surpassed human expert-level performance in these areas. While these applications primarily focus on zero-sum games involving two players, two teams, or multiple players, there has been limited substantive progress in addressing mixed-motive games. Unlike zero-sum games, mixed-motive games necessitate comprehensive consideration of individual gains, collective gains, and equilibrium. These games are extensively applied in real-world contexts such as public resource allocation, task scheduling, and autonomous driving, making research in this area crucial. This paper offers a comprehensive overview of key concepts and relevant research in the field of mixed-motive games, providing an in-depth analysis of current trends and future directions both domestically and internationally. Specifically, this study first introduces the definition and classification of mixed-motive games. It then elaborates on game solution concepts and objectives, including Nash equilibrium, correlated equilibrium, and Pareto optimality, as well as objectives related to maximizing individual and collective gains, while considering fairness. Furthermore, the study engages in a thorough exploration and analysis of game theory methods, reinforcement learning methods, and their combination based on different solution objectives. In addition, the paper discusses relevant application scenarios and experimental simulation environments before concluding with a summary and outlook on future research directions.
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007214
    Abstract:
    The cuckoo filter is an efficient probabilistic data structure that can quickly determine whether an element exists in a given set. It is widely used in computer networks, IoT applications, and database systems. In practice, these systems usually handle massive amounts of data and numerous concurrent requests. A cuckoo filter that supports high concurrency can significantly improve system throughput and data processing capabilities, which is crucial to system performance. Therefore, a cuckoo filter that supports lock-free concurrency is designed. The filter achieves high-performance lookup, insertion, and deletion through a two-stage query, separation of path exploration and element migration, and atomic migration based on multi-word compare-and-swap. Theoretical analysis and experimental results indicate that the lock-free concurrent cuckoo filter significantly outperforms state-of-the-art algorithms in concurrent performance. Its lookup throughput is on average 1.94 times that of a cuckoo filter using fine-grained locks.
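For readers unfamiliar with the underlying structure, the following is a minimal single-threaded sketch of a plain cuckoo filter using partial-key cuckoo hashing. It is not the lock-free design described in the abstract: there is no two-stage query, no separation of path exploration from migration, and no multi-word compare-and-swap. The class name, parameters, and the SHA-256-based hashing are assumptions chosen for illustration only.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative single-threaded cuckoo filter (hypothetical names)."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index
        # in range and makes it its own inverse.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 8-bit fingerprint in 1..255; only this value is stored.
        return self._hash(b"fp:" + item.encode()) % 255 + 1

    def _index(self, item: str) -> int:
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Alternate bucket depends only on the current index and fingerprint.
        return (index ^ self._hash(fp.to_bytes(1, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Filter is considered full; the last evicted fingerprint is dropped.
        return False

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

In this sketch, the eviction loop in `insert` both searches for a free slot and moves fingerprints in one pass. That loop is the part a concurrent design has to protect, either with locks or with atomic migration as the abstract describes.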
    Available online:  June 20, 2024 , DOI: 10.13328/j.cnki.jos.007215
    Abstract:
    Previous pre-trained language models (PLMs) have demonstrated excellent performance in numerous tasks of natural language understanding (NLU). However, they generally suffer from shortcut learning, that is, learning spurious correlations between non-robust features and labels, which results in poor generalization in out-of-distribution (OOD) test scenarios. Recently, the outstanding performance of generative large language models (LLMs) in understanding tasks has attracted widespread attention, but the extent to which they are affected by shortcut learning has not been fully studied. In this paper, the shortcut learning effect of generative LLMs in three NLU tasks is investigated for the first time using the LLaMA series models and FLAN-T5 models as representatives. The results show that the shortcut learning problem still exists in generative LLMs. Therefore, a hybrid data augmentation framework based on controllable explanations is proposed as a mitigation strategy for the shortcut learning problem in generative LLMs. The framework is data-centric, constructing a small-scale mixed dataset composed of model-generated controllable explanation data and part of the original prompting data for model fine-tuning. The experimental results in three representative NLU tasks show that the framework can effectively mitigate shortcut learning and significantly improve the robustness and generalization of the model in OOD test scenarios, while maintaining or even improving model performance in in-distribution test scenarios. The solution code is available at https://github.com/Mint9996/HEDA.
    Available online:  June 18, 2024 , DOI: 10.13328/j.cnki.jos.007143
    Abstract:
    In recent years, deep learning has achieved excellent performance in software engineering (SE) tasks. Such performance depends on large-scale training sets, and collecting and labeling them requires considerable resources and cost, which limits the wide application of deep learning techniques in practical tasks. With the release of pre-trained models (PTMs) in the field of deep learning, researchers in SE have begun to pay attention to PTMs and have introduced them into SE tasks. PTMs have brought a qualitative leap in SE tasks, ushering intelligent software engineering into a new era. However, no study has yet summarized the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-field (pre-trained models for software engineering, PTM4SE), this study systematically reviews the current studies related to PTM4SE. Specifically, the study first describes the framework of the intelligent software engineering methods based on pre-trained models and then analyzes the commonly used pre-trained models in SE. Meanwhile, it introduces the downstream tasks in SE with pre-trained models in detail and compares and analyzes the performance of pre-trained model techniques on these tasks. The study then presents the datasets used in SE for training and fine-tuning the PTMs. Finally, it discusses the challenges and opportunities for PTM4SE. The collated PTMs and datasets in SE are published at https://github.com/OpenSELab/PTM4SE.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007170
    Abstract:
    Pre-training knowledge graph (KG) models facilitate various downstream tasks in e-commerce applications. However, large-scale social KGs are highly dynamic, and the pre-training models need to be updated regularly to reflect the changes in node features caused by user interactions. This paper proposes an efficient incremental update framework for the pre-training KG models. The framework comprises three components: a bidirectional imitation distillation method that fully uses the different types of facts in new data; a sampling strategy based on samples’ normality and abnormality that selects the most valuable facts from all new facts to reduce the training data size; and a reverse replay mechanism that generates high-quality negative facts better suited to the incremental training of social KGs in e-commerce. Experimental results on real-world e-commerce datasets and related downstream tasks demonstrate that the proposed framework can incrementally update the pre-training KG models more effectively and efficiently than state-of-the-art methods.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007171
    Abstract:
    In recent years, online transactions of digital collections have been increasing, with platforms such as Alibaba Auctions and OpenSea facilitating their circulation in the market. However, the bidder’s bidding privacy is at risk of being disclosed during an online auction. To address this issue, this study proposes a privacy-preserving online auction approach based on the homomorphic property of SM2, which not only protects the users’ bidding privacy but also ensures the usability of the bidding data. Specifically, this study creates a homomorphic encryption scheme based on SM2, encrypting bidders’ bidding information and constructing a piece of noisy bidding information to conceal the privacy data. The efficiency of the privacy-preserving online auction approach is improved by integrating the Chinese Remainder Theorem and Baby-Step-Giant-Step (CRT-BSGS) into the homomorphic encryption process with SM2, which has proved to be more efficient than the Paillier algorithm. Finally, the security and efficiency of the proposed scheme are verified in detail.
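To illustrate the Baby-Step-Giant-Step component mentioned above, here is a minimal sketch of the generic algorithm for recovering a small exponent. It rests on assumptions not stated in the abstract: it works over integers modulo a small prime rather than over points of the SM2 elliptic curve, it assumes the plaintext is encoded "in the exponent" (as in additively homomorphic ElGamal-style schemes), and it omits the Chinese Remainder Theorem step entirely. The function name `bsgs` and the toy parameters are hypothetical.

```python
from math import isqrt


def bsgs(g, h, p, bound):
    """Return the smallest x in [0, bound) with pow(g, x, p) == h, else None.

    Generic baby-step giant-step: about sqrt(bound) multiplications and
    table entries instead of a linear scan over all candidate exponents.
    Requires bound >= 1, gcd(g, p) == 1, and Python 3.8+ (negative
    exponents in three-argument pow).
    """
    m = isqrt(bound - 1) + 1          # ceil(sqrt(bound))
    # Baby steps: remember the smallest j for each value g^j, 0 <= j < m.
    baby = {}
    cur = 1
    for j in range(m):
        baby.setdefault(cur, j)
        cur = cur * g % p
    # Giant steps: compare h * g^(-i*m) against the table for 0 <= i < m.
    factor = pow(g, -m, p)            # modular inverse of g^m
    gamma = h % p
    for i in range(m):
        if gamma in baby:
            x = i * m + baby[gamma]
            return x if x < bound else None
        gamma = gamma * factor % p
    return None


# Toy parameters (assumed for illustration): 2 generates the full
# multiplicative group modulo the prime 101.
p, g = 101, 2
bid_a, bid_b = 20, 17
# Multiplying the encodings adds the hidden values (additive homomorphism).
combined = pow(g, bid_a, p) * pow(g, bid_b, p) % p
recovered = bsgs(g, combined, p, 100)
```

Here `recovered` equals `bid_a + bid_b`. The cost of the final decoding step grows with the square root of the plaintext range, which is why schemes of this kind restrict or split the range of values to be decoded.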
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007174
    Abstract:
    The cross-shard state transition protocol is the basis for ensuring the atomicity of cross-shard transactions, and its efficiency directly affects the performance of the sharding system. The cross-transaction process of the existing protocols can be divided into three phases: source-shard state move-out, cross-shard state transition, and destination-shard state move-in. These phases are executed sequentially, and all phases are tightly coupled. This paper proposes the ChannelLink cross-shard state transition protocol based on the off-chain state channel. Since the off-chain channels are highly flexible and can be confirmed instantly, the ChannelLink protocol can effectively decouple the tightly coupled three-phase process, reducing the average cost of cross-shard transactions, and improving state transition efficiency. On this basis, this paper designs a low-overhead off-chain channel routing algorithm. This algorithm solves the optimal state routing scheme by improving the genetic algorithm based on the characteristics of state transition transactions and off-chain channel topology. It reduces the user's cross-shard state transition overhead and guarantees transition efficiency. Finally, this paper implements the ChannelLink protocol prototype system and uses Bitcoin transactions and the Lightning Network state to construct the dataset for experimental verification. Results show that in a scenario with 16 shards and a cross-shard transaction ratio of 5.21%, the sharding system integrated with the ChannelLink protocol can improve the throughput by 7.04%, reduce the transaction confirmation latency by 52.51%, and reduce the cost of cross-shard state transition by more than 45.44%. Meanwhile, the performance advantages of the ChannelLink protocol gradually increase as the number of shards and the cross-shard transaction ratio increase.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007188
    Abstract:
    Although convolutional neural networks (CNNs) are widely used in image recognition due to their excellent generalization performance, adversarial samples contaminated by noise can easily deceive fully trained network models, posing security risks. Many existing defense methods improve the robustness of models, but most inevitably sacrifice model generalization. To alleviate this issue, a label-filtered weight parameter regularization method is proposed to balance the generalization and robustness of models using the label information of samples during model training. Many previous robust model training methods suffer from two main issues: 1) The robustness of models is mainly enhanced by increasing the quantity or complexity of training set samples, which not only diminishes the dominant role of clean samples in model training but also significantly increases the workload of training tasks. 2) The label information of samples is used only to compare with model predictions to control the direction of model parameter updates, neglecting the additional information hidden in sample labels. The proposed method selects weight parameters that play a decisive role in classifying samples by filtering the correct labels of samples and the classification labels of adversarial samples and optimizes these parameters regularly to achieve a balance between model generalization and robustness. Experiments and analysis on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that the proposed method achieves good training results.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007189
    Abstract:
    Slot filling is a crucial component of task-oriented dialogue systems, serving downstream tasks by identifying specific slot entities in utterances. However, in a specific domain, it requires a large amount of labeled data, which is costly to collect. In this context, cross-domain slot filling has emerged, addressing data scarcity efficiently through transfer learning. However, existing methods overlook the dependencies between slot types in utterances, leading to suboptimal performance when models are transferred to new domains. To address this issue, a cross-domain slot filling method based on slot dependency modeling is proposed in this study. Leveraging the prompt learning approach based on generative pre-trained models, a prompt template integrating slot dependency information is designed, establishing implicit dependency relationships between different slot types and fully exploiting the pre-trained model's ability to predict slot entities. Furthermore, to enhance the semantic dependencies between slot types, slot entities, and utterance texts, a discourse filling subtask is introduced to strengthen the inherent connections between utterances and slot entities through reverse filling. Transfer experiments across multiple domains demonstrate that the proposed model achieves significant performance improvements in zero-shot and few-shot settings. Additionally, a detailed analysis of the main structures in the model and ablation experiments are conducted to further validate the necessity of each part of the model.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007190
    Abstract:
    Serverless computing is an emerging cloud computing model based on the “function as a service (FaaS)” paradigm. Functions serve as the fundamental unit for deployment and scheduling, providing users with massively parallel and automatically scalable function execution services without the need to manage underlying resources. For users, serverless computing alleviates the burden of managing cluster-level infrastructure, enabling them to focus on business-layer development and innovation. For service providers, applications are decomposed into fine-grained functions, leading to significantly improved scheduling efficiency and resource utilization. These advantages have swiftly drawn attention from industry and propelled serverless computing into popularity. However, the distinct computing mode of serverless computing, which diverges from traditional cloud computing, along with its stringent limitations on various aspects of tasks, poses numerous obstacles to application migration. The escalating complexity of migrated tasks also imposes higher performance requirements on serverless computing. Therefore, performance optimization technology for serverless computing systems has emerged as a critical research topic. This study reviews and summarizes research efforts on performance optimization of serverless computing from four perspectives and introduces existing systems. Firstly, this study introduces the optimization technologies for typical tasks, including task adaptation and system optimization for specific task types. Secondly, it reviews the optimization work on sandbox environments, encompassing sandbox solutions and cold start optimization methods, which play a crucial role in the execution of serverless functions. Thirdly, it provides an overview of the optimization in I/O and communication technologies, which are major performance bottlenecks of serverless applications. 
Lastly, it briefly outlines related resource scheduling technologies, including platform-oriented and user-oriented scheduling strategies, which determine system resource utilization and task execution efficiency. In conclusion, it summarizes the current issues and challenges of performance optimization technologies of serverless computing and anticipates potential future research directions.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007191
    Abstract:
    Serverless computing is an emerging paradigm of cloud computing, allowing developers to focus only on application logic development without the need to manage complex underlying tasks. This paradigm allows developers to quickly build finer-grained applications at the function level. With the increasing popularity of serverless computing, major cloud computing vendors have introduced their commercial serverless platforms one after another. However, the characteristics of these platforms have yet to be systematically studied and reliably compared. A comprehensive analysis of these characteristics can help developers choose an appropriate serverless platform and develop and execute serverless applications in the right way. To this end, an empirical study is conducted on the characteristics of mainstream commercial serverless platforms, namely AWS Lambda, Google Cloud Functions, Microsoft Azure Functions, and Alibaba Function Compute. This study is divided into two major parts: feature summarization and runtime performance analysis. In the feature summarization, the official documents of these serverless platforms are examined, and their key features are summarized and compared in terms of development, deployment, and runtime. In the runtime performance analysis, representative benchmarks are applied to analyze the runtime performance offered by these serverless platforms along multiple dimensions. Specifically, key factors for the cold-start performance of the applications, such as programming languages and memory sizes, are first analyzed. Furthermore, the task execution performance of serverless platforms is discussed. Based on the results of feature summarization and runtime performance analysis, this study sums up a series of findings and provides practical insights and potential research opportunities for developers, cloud computing vendors, and researchers.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007192
    Abstract:
    Microkernels migrate system services to user mode. Thanks to this isolated framework, microkernels offer high reliability, which meets the needs of the aerospace field. SPARC processors are widely applied in the control equipment of spacecraft, satellite payloads, and planetary vehicles. The register window mechanism of SPARC affects the performance of inter-process communication (IPC) on microkernels. Besides, its inter-processor interrupt (IPI) also seriously affects the efficiency of cross-core IPC. As a key mechanism, IPC is vital to the overall performance of applications on microkernels. Based on an examination of the register window mechanism, this study redesigns and implements the register bank mechanism, where the register window is allocated and managed by the kernel. Thus BankedIPC on SPARC is implemented. At the same time, as IPI underperforms on SPARC, FlexIPC is designed to optimize the performance of cross-core IPC. These approaches are employed to optimize the general synchronous IPC implemented on ChCore, a self-developed microkernel. According to the tests, the average IPC performance of the optimized microkernel on SPARC is about two times better, and application performance improves by up to 15%.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007193
    Abstract:
    Multi-path transmission technology establishes multiple transmission paths between communication parties via various network interfaces on devices. In this way, bandwidth aggregation, load balancing, and path redundancy are achieved to increase transmission throughput and reliability. These benefits allow multi-path transmission technology to be widely used in application scenarios such as servers, terminals, and data centers. As an integral part of network architecture and transmission technology studies, the technology is of research significance and value. To this end, this study systematically analyzes multi-path transmission technology in terms of its concepts and core mechanisms. Firstly, the basic concepts, standardization process, and application value of multi-path transmission are outlined. Secondly, the core mechanisms of multi-path transmission technology are described, including congestion control, packet scheduling, path management, retransmission mechanism, security mechanism, and the mechanism for specialized applications. Classification methods and the main research results of each mechanism are elaborated, and the advantages, disadvantages, and development directions of the mechanisms are summarized. Finally, this study examines the challenges faced by multi-path transmission technology research and discusses the prospects for relevant studies.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007194
    Abstract:
    Capturing an accurate view of IP geolocation is of great interest to the networking research community as it has many uses ranging from network measuring and mapping to analyzing the network’s infrastructure. However, the scale of today’s Internet, coupled with the rapid development of Internet applications, makes it very challenging to acquire a complete and accurate snapshot of the IP geolocation technology. To the best of our knowledge, there is no systematic survey of the relevant research in this field. To fill this gap, this study systematically summarizes, for the first time, the research on client-independent IP geolocation, in which the clients do not participate in the geolocation process. This study examines the major research studies that have been conducted on topics related to IP geolocation in the last 22 years since the first IP-based geolocation technology was proposed. To this end, these prior studies are classified according to the measurement method, that is, active, passive, and hybrid. The main techniques for each category are described, and their significant advantages and limitations are identified. Also, the primary experience and lessons learned from these past efforts are presented. Then, the latest progress in IP geolocation in both academia and industry is shown. Finally, the survey is concluded and promising future directions are presented, in the hope of promoting the development of IP geolocation.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007172
    Abstract:
    The assessment of adversarial robustness requires a complete and accurate evaluation of deep learning models’ noise resistance by combining the attack ability and noise magnitude of adversarial samples. However, the lack of completeness in the adversarial robustness evaluation metric system is a key problem with the existing adversarial attack and defense methods. The existing work on adversarial robustness evaluation lacks analysis and comparison of the evaluation metric system. The impact of attack success rate and different norms on the completeness of the robustness evaluation metric system and the restrictions on designing attack and defense methods are neglected. In this study, the adversarial robustness evaluation metric system is discussed in two dimensions: norm selection and metric indicators. The theoretical analysis of robustness evaluation completeness is carried out from three aspects: the inclusion relation of the evaluation metric domain, robustness description granularity, and the order relationship of the robustness evaluation metric system. The following conclusions are drawn: using noise statistical quantities such as the mean results in a larger and more comprehensive definition domain of evaluation indicators compared to using attack success rates, while also ensuring that any two adversarial sample sets can be compared. Using the L2 norm is more complete in the description of adversarial robustness evaluation compared to using other norms. Extensive experiments on 23 models and 20 adversarial attacks across 6 datasets validate these conclusions.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007173
    Abstract:
    Benefiting from the rapid development of information technology and the widespread adoption of medical information systems, a vast amount of medical knowledge has been accumulated in medical databases, including patient clinical treatment events and medical expert consensus. It is crucial to extract knowledge from these medical facts and effectively manage and utilize it, which can advance the automation and intelligence of diagnosis and treatment. Knowledge graphs, as a novel knowledge representation tool, can effectively mine and organize information from abundant medical facts and have received extensive attention in the medical field. However, existing medical knowledge graphs often suffer from limitations such as small scale, numerous restrictions, and poor scalability, leading to a limited ability to express knowledge from medical facts. To address these issues, this study proposes a bilayer medical knowledge graph architecture and employs information extraction techniques on both English patient clinical treatment events and Chinese medical expert consensus to construct a billion-scale medical knowledge graph that is cross-lingual, multimodal, dynamically updated, and highly scalable, aiming to provide more accurate, intelligent medical services.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007196
    Abstract:
    As artificial intelligence and 5G technology are applied in the automotive industry, the intelligent connected vehicle has come into being. It is a complex distributed heterogeneous system composed of a large number of electronic control units (ECUs) from different suppliers, which are coordinated and controlled through in-vehicle network protocols represented by CAN. However, an attacker could penetrate the in-vehicle network of an intelligent connected vehicle through a variety of interfaces and then attack the network and its components such as ECUs. Therefore, in-vehicle network security for intelligent connected vehicles has become one of the focuses of vehicle security research in recent years. After introducing the structure of the intelligent connected vehicle, ECUs, the CAN bus, and on-board diagnostic protocols, this study first summarizes the research progress of reverse engineering technology for in-vehicle network protocols. The reverse engineering technology aims to obtain the implementation details of in-vehicle network protocols that are usually not disclosed in the automotive industry. It is also a prerequisite for the implementation of in-vehicle network attack and defense. The remaining part proceeds from the two angles of attack and defense. On the one hand, the attack vectors and main attack technologies of in-vehicle networks are summarized, including the attack technologies implemented through physical access and remote access, as well as the attack technologies implemented against ECUs and the CAN bus. On the other hand, the existing in-vehicle network defense technologies are discussed, including the intrusion detection technology based on feature extraction and machine learning methods, and the security enhancement technology of in-vehicle network protocols based on cryptographic approaches. Finally, future research directions are discussed.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007198
    Abstract:
    Persistent Memory (PM), serving as a supplement and potential replacement for main memory, offers a lower cost for data storage while ensuring data persistence. However, traditional index structures tailored for PM, such as B+ trees, fail to fully exploit the distribution characteristics of data to optimize read and write performance on PM. Recent research has sought to enhance indexes’ read and write performance on PM and support index persistence through the data distribution awareness of learned indexes. Nonetheless, existing designs of persistent learned index structures suffer from additional PM accesses and poor performance when confronted with real-world data. To address the performance degradation of persistent learned indexes under real data distributions, this study proposes PLTree, a learned index with a DRAM/PM hybrid architecture. PLTree optimizes read and write performance under real data distributions through the following approaches: (1) a two-stage approach to construct the index, eliminating last-mile search in internal nodes and reducing PM accesses, (2) model-based search for efficient query performance on PM and accelerated queries by leveraging metadata in DRAM, and (3) a log-based hierarchical overflow buffer structure tailored to PM characteristics to optimize write performance. The results show that, compared with the existing persistent memory indexes (APEX, FPTree, uTree, NBTree, and DPTree), PLTree improves index construction performance by 1.9× to 34× across various datasets. In single-threaded scenarios, PLTree exhibits an average query and insertion performance improvement of 1.26× to 4.45× and 2.63× to 6.83×, respectively. In multi-threaded scenarios, PLTree surpasses the baseline by up to 10.2× and 23.7× in query and insertion performance, respectively.
    Available online:  June 14, 2024 , DOI: 10.13328/j.cnki.jos.007169
    Abstract:
    With the application of artificial intelligence (AI) and end-to-end recognition methods in handwritten mathematical expression recognition, there has been a significant improvement in recognition accuracy. However, in contrast to tests on public datasets, real-world applications involving human input introduce more uncertain factors into recognition algorithms in practice. Factors such as personalized stroke information, ambiguous handwritten characters, and uncertain formula structures can significantly impact the performance of the recognition method. To address these challenges, HchMER, a hybrid human-machine intelligence method for handwritten mathematical expression recognition, is proposed. HchMER combines handwritten mathematical formula recognition algorithms, knowledge bases, and user feedback to enhance the machine's comprehension of user-input mathematical expressions, thereby improving the editing speed and accuracy of handwritten mathematical expressions. To assess the effectiveness of HchMER, it is compared with MyScript Math Recognition (MyScript) and a mature commercial product named “Microsoft Ink Equation” (InkEquation). Results show that HchMER outperformed MyScript and InkEquation in accuracy by 23.2% and 26.51%, respectively. In terms of average completion time, HchMER exceeded MyScript by 44.46% (9.6s) but fell short of InkEquation by 11.48% (4.05s). Furthermore, participants affirm HchMER in a questionnaire survey and semi-structured interviews.
    Available online:  June 12, 2024 , DOI: 10.13328/j.cnki.jos.007094
    Abstract:
    Online class-incremental learning aims to learn new classes effectively in data stream scenarios while satisfying small-cache and small-batch constraints. However, because a data stream is seen only once, the category information in small batches cannot be exploited over multiple passes as it is in offline learning. To alleviate this problem, current studies combine multiple data augmentations with contrastive learning for model training. Nevertheless, under the small-cache and small-batch constraints, existing methods that select and store data randomly are not conducive to obtaining diverse negative samples, which restricts model discriminability. Previous studies have shown that hard negative samples are the key to improving contrastive learning performance, but this is rarely explored in online learning scenarios. The Universum data proposed in traditional Universum learning provide a simple yet intuitive strategy for using hard negative samples. Specifically, the authors previously proposed the mixup-induced Universum (MIU) with certain coefficients, which effectively improves the performance of offline contrastive learning. Inspired by this, this study introduces MIU to online scenarios. Unlike the previous statically generated Universum, data stream scenarios pose additional challenges. First, because the number of classes keeps increasing, the conventional approach of statically generating Universum from globally given classes becomes inapplicable, necessitating redefinition and dynamic generation. Therefore, this study proposes to recursively generate MIU with the maximum entropy (incremental MIU, IMIU) relative to the seen (local) classes and provides it with an additional small cache while still meeting the overall memory limit. Second, the generated IMIU and the positive samples in small batches are mixed up again to produce diverse and high-quality hard negative samples. 
Finally, by combining the above steps, the IMIU-based contrastive learning (IUCL) algorithm is developed. Meanwhile, comparison experiments on the standard datasets CIFAR-10, CIFAR-100, and Mini-ImageNet verify the validity of the proposed algorithm.
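    As a rough illustration of the mixup idea underlying MIU, the sketch below mixes each sample in a batch with a sample from a different class to form a hard negative. It is only a minimal sketch: the mixing coefficient of 0.5, the partner selection rule, and the function names are assumptions, and the recursive maximum-entropy generation of IMIU described above is not implemented.

```python
def mixup(x_a, x_b, lam=0.5):
    """Convex combination of two feature vectors (plain lists of floats)."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]

def universum_negatives(batch, labels, lam=0.5):
    """For each sample, mix it with the next sample (cyclically) of another class.

    The mixed point belongs to neither class, so it can act as a hard
    negative for both. Samples with no other-class partner are skipped.
    """
    n = len(batch)
    negatives = []
    for i in range(n):
        for step in range(1, n):
            j = (i + step) % n
            if labels[j] != labels[i]:
                negatives.append(mixup(batch[i], batch[j], lam))
                break
    return negatives

batch = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]
labels = [0, 1, 0]
print(universum_negatives(batch, labels))
```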
    Available online:  May 29, 2024 , DOI: 10.13328/j.cnki.jos.007156
    Abstract:
    As a fine-grained sentiment analysis method, aspect-based sentiment analysis is playing an increasingly important role in many application scenarios. However, with the ubiquity of social media and online reviews, cross-domain aspect-based sentiment analysis faces two major challenges: insufficient labeled data in the target domain and textual distribution differences between the source and target domains. Currently, many data augmentation methods attempt to alleviate these issues, yet the target domain text generated by these methods often suffers from shortcomings such as lack of fluency, limited diversity of generated data, and convergent source domain. To address these issues, this study proposes a method for cross-domain aspect-based sentiment analysis based on data augmentation from a large language model (LLM). This method leverages the rich language knowledge of large language models to construct appropriate prompts for the cross-domain aspect-based sentiment analysis task. It mines similar texts between the target domain and the source domain and uses context learning to guide the LLM to generate labeled text data in the target domain with domain-associated keywords. This approach addresses the lack of data in the target domain and the domain-specificity problem, effectively improving the accuracy and robustness of cross-domain sentiment analysis. Experiments on multiple real datasets show that the proposed method can effectively enhance the performance of the baseline model in cross-domain aspect-based sentiment analysis.
    Available online:  May 15, 2024 , DOI: 10.13328/j.cnki.jos.007157
    Abstract:
    As embedded systems are widely applied, their requirements are becoming increasingly complex, making requirements analysis a critical stage in embedded system development. How to correctly describe and model requirements has become a primary issue. This study systematically investigates the current requirements descriptions of embedded systems and conducts a comprehensive comparative analysis to deepen the understanding of the core concerns of embedded system requirements. The study first applies the systematic literature review method to identify, retrieve, summarize, and analyze the relevant literature published between January 1979 and November 2023. Through the automatic retrieval and snowball processes, 150 papers closely related to the topic are finally selected for the comprehensiveness of the review. The study analyzes the existing capabilities of embedded requirements description languages from their description concerns, description contents, requirements analysis elements, etc. Finally, it summarizes the challenges to the current requirements descriptions. Moreover, aiming at the task of intelligent synthesis of embedded software, it puts forward the need for the expressive ability of embedded system requirement description languages.
    Available online:  May 08, 2024 , DOI: 10.13328/j.cnki.jos.007144
    Abstract:
    Multi-modal affective computing is a fundamental and important research task in the field of affective computing, using multi-modal signals to understand the sentiment of user-generated video. Although existing multi-modal affective computing approaches have achieved good performance on benchmark datasets, they generally ignore the problem of modal reliability bias in multi-modal affective computing tasks, whether in designing complex fusion strategies or learning modal representations. This study believes that compared to text, acoustic and visual modalities often express sentiment more realistically. Therefore, voice and vision have high reliability, while text has low reliability in affective computing tasks. However, existing learning abilities of different modality feature extraction tools are different, resulting in a stronger ability to represent textual modality than acoustic and visual modalities (e.g., GPT3 and ResNet). This further exacerbates the problem of modal reliability bias, which is unfavorable for high-precision sentiment judgment. To mitigate the bias caused by modal reliability, this study proposes a model-agnostic multi-modal reliability-aware affective computing approach (MRA) based on cumulative learning. MRA captures the modal reliability bias by designing a single textual-modality branch and gradually shifting the focus from sentiments expressed in low-reliability textual modality to high-reliability acoustic and visual modalities during the model learning process. Thus, MRA effectively alleviates inaccurate sentiment predictions caused by low-reliability textual modality. Multiple comparative experiments conducted on multiple benchmark datasets demonstrate that the proposed approach MRA can effectively highlight the importance of high-reliability acoustic and visual modalities and mitigate the bias of low-reliability textual modality. 
Additionally, the model-agnostic approach significantly improves the performance of multi-modal affective computing, indicating its effectiveness and generality in multi-modal affective computing tasks.
    Available online:  May 08, 2024 , DOI: 10.13328/j.cnki.jos.007155
    Abstract:
    The scene sketch is made up of multiple foreground and background objects, which can directly and generally express complex semantic information. It has a wide range of practical applications in real life and has gradually become one of the research hotspots in the field of computer vision and human-computer interaction. As the basic task of the semantic understanding of scene sketch, scene sketch semantic segmentation is rarely studied. Most of the existing methods are improved from the semantic segmentation of natural images, which cannot overcome the sparsity and abstraction of sketches. To solve the above problems, this study proposes a graph Transformer model directly from sketch strokes. The model combines the temporal-spatial information of sketch strokes to solve the semantic segmentation task of free-hand scene sketches. First, the vector scene sketch is constructed into a graph with strokes as the nodes of the graph and temporal and spatial correlations between strokes as the edges of the graph. The temporal-spatial global context information of the strokes is then captured by the edge-enhanced Transformer module. Finally, the encoded temporal-spatial features are optimized for multi-classification learning. The experimental results on the SFSD scene sketch dataset show that the proposed method can effectively segment scene sketches using stroke temporal-spatial information and achieve excellent performance.
    Available online:  May 08, 2024 , DOI: 10.13328/j.cnki.jos.007158
    Abstract:
    In the era of big data, the sample scale and the dynamic update and variation of dimensionality greatly increase the computational burden. Most of these data sets do not exist in the form of a single data type but are more often hybrid data containing both symbolic and numerical data. For this reason, scholars have proposed many feature selection algorithms for hybrid data. However, most of the existing algorithms are only applicable to static data or small-scale incremental data and cannot handle large-scale dynamic changing data, especially large-scale incremental data sets with changing data distribution. To address this limitation, this paper proposes a multi-granulation incremental feature selection algorithm for dynamic hybrid data based on an information fusion mechanism by analyzing the variations and updates of granularity space and granularity structure in dynamic data. The algorithm focuses on the mechanism of granularity space construction in dynamic hybrid data, the mechanism of dynamic update of multiple data granularity structures, and the mechanism of information fusion for data distribution variations. Finally, the paper verifies the feasibility and efficiency of the proposed algorithm by comparing the experimental results with other algorithms on the UCI dataset.
    Available online:  May 08, 2024 , DOI: 10.13328/j.cnki.jos.007141
    Abstract:
    As image data grows explosively on the Internet and image application fields widen, the demand for large-scale image retrieval is increasing greatly. Hash learning provides significant storage and retrieval efficiency for large-scale image retrieval and has attracted intensive research interest in recent years. Existing surveys on hash learning suffer from weak timeliness and unclear technical routes. Specifically, they mainly cover hashing methods proposed five to ten years ago, and few of them summarize the relationships between the components of hashing methods. In view of this, this study presents a comprehensive survey of hash learning for large-scale image retrieval by reviewing the hash learning literature published in the past twenty years. First, the technical route of hash learning and the key components of hashing methods are summarized, including the loss function, the optimization strategy, and the out-of-sample extension. Second, hashing methods for image retrieval are classified into two categories: unsupervised and supervised hashing methods. For each category, the research status and evolution are analyzed. Third, several image benchmarks and evaluation metrics are introduced, and the performance of some representative hashing methods is analyzed through comparative experiments. Finally, future research directions of hash learning are summarized in light of its limitations and new challenges.
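    The storage and retrieval efficiency mentioned above comes from comparing short binary codes with Hamming distance. The sketch below shows that retrieval step under simple assumptions: codes are obtained by thresholding fixed linear projections (a stand-in for a learned hash function and its out-of-sample extension), and all names and the projection values are hypothetical.

```python
def encode(vec, projections):
    """Map a feature vector to a binary code, one bit per projection (sign test)."""
    code = 0
    for w in projections:
        bit = 1 if sum(wi * vi for wi, vi in zip(w, vec)) >= 0 else 0
        code = (code << 1) | bit
    return code

def hamming(a, b):
    """Number of differing bits between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")

def search(query_code, db_codes, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: (hamming(query_code, db_codes[i]), i))
    return order[:k]

projections = [[1, 0], [0, 1], [1, 1]]  # hypothetical, would be learned in practice
db = [encode(v, projections) for v in ([1, 2], [-1, -2], [1, -2])]
query = encode([1, -2], projections)
print(search(query, db, 2))
```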
    Available online:  April 29, 2024 , DOI: 10.13328/j.cnki.jos.007160
    Abstract:
    As mobile devices are widely used, the performance of their graphics processors has increasingly improved. To meet users’ continuous pursuit of excellent experience, the screen resolution and refresh rate of mobile devices are constantly increasing every year. At the same time, the programmable shading pipeline in mobile games is becoming more complex, which leads to game applications becoming the main source of power consumption for mobile devices. This paper studies the rendering pipeline in mobile games and proposes a motion-aware rendering frame rate adjustment method to ensure rendering quality in power-saving mode. Unlike previous prediction models that only consider rendering errors of historical frames, this method builds a nonlinear model between camera pose and inter-frame rendering error and predicts error based on the new frame’s camera pose, thus achieving more accurate frame rate adjustment strategies. In addition, the method also includes a lightweight scene recognition module that can adjust the error threshold according to the specific scene where the player is located, thereby adopting different degrees of frame rate adjustment strategies. Quantitatively compared with the prediction model that only considers historical frame errors, the proposed model improves the prediction accuracy on game frame sequences by more than 30%. At the same time, in the qualitative comparison of user experiments, under the same frame-skipping ratio, the proposed algorithm can achieve higher rendering quality and better user experience. The algorithm integrates historical frame errors and camera information to predict more accurate future frame errors. It also combines prediction and scene recognition results to achieve better dynamic frame rate adjustment performance.
    Available online:  April 29, 2024 , DOI: 10.13328/j.cnki.jos.007148
    Abstract:
    The time-sensitive networking standard developed by IEEE 802.1 Task Group can be applied to build highly reliable, low latency, low jitter Ethernet, and the extension of time-sensitive networking to the wireless field is also a hot topic. Compared with traditional wired communication, wireless time-sensitive networking can not only achieve high reliability and low delay communication but also has the advantages of higher flexibility, stronger mobility, and lower wiring and maintenance costs. Therefore, wireless time-sensitive networking is considered a promising technology in the face of emerging applications such as autonomous driving, collaborative robotics, and remote medical control in the future. Generally, wireless networks can be divided into infrastructure-based wireless networks and non-infrastructure-based wireless networks. The latter can be divided into two categories based on mobility: mobile Ad hoc networks and wireless sensor networks. Therefore, this paper mainly studies and summarizes the application scenarios, related technologies, routing protocols, and high-reliability and low-delay transmission of the three types of networks.
    Available online:  April 24, 2024 , DOI: 10.13328/j.cnki.jos.007142
    Abstract:
    While releasing frequent application updates, Android application developers need to detect Android runtime permission (ARP) bugs as quickly as possible. Existing automated testing tools cannot effectively test the permission-related behaviors of Android applications, since they are rarely designed for ARP bugs. This study proposes a state transition graph guided testing approach for detecting ARP bugs in Android applications. First, it analyzes the APK file of the application under test for permission misuse, instruments the APIs that may cause ARP bugs in the APK file, and re-signs the APK file. Then, it installs the APK file and dynamically explores the application to generate its state transition graph (STG). Finally, it detects ARP bugs quickly by automated testing under the guidance of the STG. To evaluate the effectiveness of the approach, the study implements a prototype tool, RPBDroid, and conducts comparative experiments with the ARP bug detection tools SetDroid and PermDroid and the automated testing tool APE. The experimental results show that RPBDroid successfully detects 15 ARP bugs in 17 applications, which is 14, 12, and 14 more than APE, SetDroid, and PermDroid, respectively. In addition, RPBDroid reduces the average time required to detect ARP bugs by 86.42%, 86.72%, and 86.70% in comparison with SetDroid, PermDroid, and APE, respectively.
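    One way to read "guidance of the STG" is that the tester uses the graph to find a short event sequence leading to a state of interest, for example a screen that calls a permission-protected API. The sketch below shows that idea with a breadth-first search; the graph encoding, state names, and event names are hypothetical, and the abstract does not say that RPBDroid works this way.

```python
from collections import deque

def shortest_event_path(stg, start, target):
    """Shortest list of UI events that drives the app from start to target.

    stg maps a state to a list of (event, next_state) transitions.
    Returns None if the target state is unreachable.
    """
    parent = {start: None}  # state -> (previous_state, event) or None for start
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == target:
            events = []
            while parent[state] is not None:
                state, event = parent[state]
                events.append(event)
            return events[::-1]
        for event, nxt in stg.get(state, []):
            if nxt not in parent:
                parent[nxt] = (state, event)
                queue.append(nxt)
    return None

stg = {
    "Main": [("tap_settings", "Settings"), ("tap_camera", "Camera")],
    "Settings": [("revoke_permission", "Main")],
    "Camera": [("take_photo", "Preview")],
}
print(shortest_event_path(stg, "Main", "Preview"))
```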
    Available online:  April 24, 2024 , DOI: 10.13328/j.cnki.jos.007092
    Abstract:
    Identity-based matchmaking encryption is a new cryptographic primitive that allows both the receiver and the sender to specify each other's identity, so that communication succeeds only when the identities match. It also provides a non-interactive secret handshake protocol that removes the need for real-time interaction and further improves participant privacy. This study proposes an identity-based matchmaking encryption (IB-ME) scheme in prime-order groups under the symmetric external Diffie-Hellman (SXDH) assumption in the standard model. With short parameters and fewer matchmaking operations during decryption, it is the most efficient identity-based matchmaking encryption scheme. Additionally, this study puts forward the first inner product with equality matchmaking encryption (IPE-ME) scheme under the SXDH assumption in the standard model. Technically, it first constructs the two schemes in composite-order groups, then simulates them in prime-order groups with dual pairing vector spaces (DPVS), and further reduces the parameter size by decreasing the required dimension of the dual bases. Finally, for the proposed IPE-ME scheme, this study replaces the equality policy in the first layer of an IB-ME scheme with an inner-product policy.
    Available online:  April 12, 2024 , DOI: 10.13328/j.cnki.jos.007146
    Abstract:
    A function-as-a-service (FaaS) workflow, composed of multiple function services, can realize a complex business application by orchestrating and controlling the function services. The current FaaS workflow execution systems achieve data transfer among function services mainly based on centralized data storages, resulting in heavy data transmission overhead and affecting application performance significantly. In the cases of high concurrency, frequent data transmission will also cause serious contention for network bandwidth resources, resulting in application performance degradation. To address the above problems, this study analyzes the fine-grained data dependency between function services and proposes a critical path-based FaaS workflow deployment optimization method. In addition, the study designs a dependency-sensitive data access and management mechanism to effectively reduce the data transmission between function services, thereby reducing the data transmission latency and end-to-end execution latency of FaaS workflow applications. The study implements a FaaS workflow system, FineFlow, and conducts experiments based on five real-world FaaS workflow applications. The experimental results show that FineFlow can effectively reduce the data transmission latency (the highest reduction and the average reduction are 74.6% and 63.8%, respectively) compared with the FaaS workflow platform with the centralized data storing-based function interaction mechanism. On average, FineFlow reduces the latency of the end-to-end FaaS workflow executions by 19.6%. In particular, for the FaaS workflow application with fine-grained data dependencies, FineFlow can further reduce its data transmission latency and the end-to-end execution latency by 28.4% and 13.8% respectively compared with the state-of-the-art work. 
In addition, FineFlow can effectively alleviate the impact of network bandwidth fluctuations on application performance by reducing cross-node data transmission, improving the robustness of application performance influenced by the network bandwidth changes.
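    Critical-path analysis of a workflow can be sketched as a longest-path computation over the function dependency graph. The example below is a generic illustration, not FineFlow's deployment algorithm: the function names, durations, and the choice to ignore data transfer costs on edges are all assumptions.

```python
from graphlib import TopologicalSorter

def critical_path(durations, deps):
    """Longest (by total duration) dependency chain in a workflow DAG.

    durations: function name -> execution time
    deps: function name -> set of functions it depends on
    Returns (path, total_time).
    """
    graph = {name: deps.get(name, ()) for name in durations}
    finish, prev = {}, {}
    for name in TopologicalSorter(graph).static_order():
        # A function starts once its slowest predecessor has finished.
        slowest = max(deps.get(name, ()), key=lambda p: finish[p], default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[name] = start + durations[name]
        prev[name] = slowest
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1], total

durations = {"fetch": 2, "resize": 5, "tag": 1, "store": 3}
deps = {"resize": {"fetch"}, "tag": {"fetch"}, "store": {"resize", "tag"}}
print(critical_path(durations, deps))
```

    Functions on the returned path determine end-to-end latency, so co-locating them (to avoid cross-node transfers) is the kind of decision a critical path-based deployment method would prioritize.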
    Available online:  April 03, 2024 , DOI: 10.13328/j.cnki.jos.007054
    Abstract:
    Deep learning has been widely employed in many fields and yields excellent performance. However, this often requires the support of large amounts of labeled data, which usually means high costs and harsh application conditions. Therefore, with the development of deep learning, how to break through data limitations in practical scenarios has become an important research problem. Specifically, as one of the most important research directions, semi-supervised learning greatly relieves the data requirement pressure of deep learning by conducting learning with the assistance of abundant unlabeled data and a small number of labeled data. The pseudo-labeling method plays a significant role in semi-supervised learning, and the quality of its generated pseudo labels will influence the final results of semi-supervised learning. Focusing on pseudo-labeling in semi-supervised learning, this study proposes the pseudo-labeling method based on optimal transport theory, which introduces the pseudo-labeling procedure constraint with labeled data as generation process guidance. On this basis, the pseudo-labeling procedure is converted to the optimization problem of optimal transport, which offers a new form for solving pseudo-labeling. Meanwhile, to solve this problem, this study introduces the Sinkhorn-Knopp algorithm for approximate fast solutions to avoid the heavy computation burden. As an independent module, the proposed method can be combined with other semi-supervised learning tricks such as consistency regularization for complete semi-supervised learning. Finally, this study conducts experiments on four classic public image classification datasets of CIFAR-10, SVHN, MNIST, and FashionMNIST to verify the effectiveness of the proposed method. The experimental results show that compared with the state-of-the-art semi-supervised learning methods, this method yields better performance, especially under fewer labeled data.
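    To make the role of the Sinkhorn-Knopp algorithm concrete, the sketch below balances a matrix of predicted class probabilities so that each sample carries equal mass and each class receives a prescribed share, then reads pseudo labels from the balanced matrix. It is a plain-Python illustration under assumed inputs (uniform sample weights, given class proportions); the labeled-data constraint that the study adds to the transport problem is not modeled here.

```python
def sinkhorn_knopp(probs, col_marginals, n_iters=200):
    """Alternately rescale columns and rows of a positive matrix.

    After convergence, every row sums to 1/n and column j sums to
    col_marginals[j] (which must sum to 1).
    """
    n, k = len(probs), len(probs[0])
    q = [row[:] for row in probs]
    for _ in range(n_iters):
        for j in range(k):  # match the class proportions
            s = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= col_marginals[j] / s
        for i in range(n):  # give each sample mass 1/n
            s = sum(q[i])
            for j in range(k):
                q[i][j] *= (1.0 / n) / s
    return q

def pseudo_labels(q):
    """Assign each sample the class with the largest transported mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]

# Every sample leans toward class 0, but classes are assumed to be balanced.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
plan = sinkhorn_knopp(probs, [0.5, 0.5])
print(pseudo_labels(probs), pseudo_labels(plan))
```

    A plain argmax on `probs` labels all four samples as class 0, whereas the balanced plan moves the two least confident samples to class 1, which is the effect the marginal constraints are meant to have.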
    Available online:  March 27, 2024 , DOI: 10.13328/j.cnki.jos.007084
    Abstract:
    The homegrown Shenwei AI acceleration card is equipped with the Shenwei many-core processor based on systolic array enhancement, and although its intelligent computing power can be comparable to the mainstream GPU, there is still a lack of basic software support. To lower the utilization threshold of the Shenwei AI acceleration card and effectively support the development of AI applications, this study designs a runtime system SDAA for the Shenwei AI acceleration card, whose semantics is consistent with the mainstream CUDA. For key paths such as memory management, data transmission, and kernel function launch, the software and hardware co-design method is adopted to realize the multi-level memory allocation algorithm with segment and paged memory combined on the card, pageable memory transmission model of multiple threads and channels, adaptive data transmission algorithm with multi-heterogeneous components, and fast kernel function launch method based on on-chip array communication. As a result, the runtime performance of SDAA is better than that of the mainstream GPU. The experimental results indicate that the memory allocation speed of SDAA is 120 times the corresponding interface of NVIDIA V100, the memory transmission overhead is 1/2 of the corresponding interface, and the data transmission bandwidth is 1.7 times the corresponding interface. Additionally, the launch time of the kernel function is equivalent to the corresponding interface, and thus the SDAA runtime system can support the efficient operation of mainstream frameworks and actual model training on the Shenwei AI acceleration card.
    Available online:  March 27, 2024 , DOI: 10.13328/j.cnki.jos.007081
    Abstract:
    Embedded systems are becoming increasingly complex, and the requirements analysis of their software systems has become a bottleneck in embedded system development. Device dependency and interleaving execution logic are typical characteristics of embedded software systems, necessitating effective requirement analysis methods to decouple the requirements based on device dependencies. Starting from the idea of environment-based modeling in requirement engineering, this study proposes a projection-based requirement analysis approach from system requirements to software requirements for embedded software systems, helping requirement engineers to effectively decouple the requirements. The study first summarizes the system requirement and software requirement descriptions of embedded software systems, defines the requirement decoupling strategies of embedded software systems based on interactive environment characteristics, and designs the specification process from system requirements to software requirements. A real case study is carried out in the spacecraft sun search system, and five representative case scenarios are quantitatively evaluated through two metrics of coupling and cohesion, which demonstrate the effectiveness of the proposed approach.
    Available online:  March 27, 2024 , DOI: 10.13328/j.cnki.jos.007090
    Abstract:
    Elephant flow identification is a fundamental task in network measurement. Currently, mainstream methods generally employ the Sketch data structure to quickly count network traffic and efficiently find elephant flows. However, under network traffic jitters, the rapid influx of numerous packets significantly decreases the identification accuracy of elephant flows. To this end, this study proposes RobustSketch, an elastic identification method for elephant flows that tolerates network traffic jitters. The method first designs a stretchable mice flow filter based on a cyclic Sketch chain, which adaptively increases or reduces the number of Sketches according to the real-time packet arrival rate. As a result, it always records all packets that arrive within the current period, ensuring accurate mice flow filtering even under network traffic jitters. Subsequently, this study designs a scalable elephant flow record table based on dynamic segmented hashing, which adaptively adds or removes segments according to the number of candidate elephant flows that pass the mice flow filter. The table can therefore record all candidate elephant flows while keeping high storage space utilization. Furthermore, the error bounds of the proposed mice flow filter and elephant flow record table are derived by theoretical analysis. Finally, RobustSketch is evaluated experimentally with real network traffic samples. Experimental results indicate that the identification accuracy of the proposed method is significantly higher than that of existing methods and stays above 99% even under network traffic jitters. Meanwhile, its average relative error is reduced by more than 2.7 times, which enhances the accuracy and robustness of elephant flow identification.
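    For readers unfamiliar with sketch-based counting, the example below shows a basic Count-Min sketch used to flag flows whose estimated size crosses a threshold. This is the generic fixed-size baseline, not RobustSketch: the cyclic Sketch chain and the dynamic segmented hash table described above are not implemented, and the class name, parameters, and flow identifiers are assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size counter array; estimates never undercount a flow."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent-looking hash per row, derived from SHA-256.
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))

def find_elephants(packets, threshold, width=64, depth=4):
    """Return flows whose estimated byte count reaches the threshold."""
    sketch = CountMinSketch(width, depth)
    elephants = set()
    for flow_id, size in packets:
        sketch.add(flow_id, size)
        if sketch.estimate(flow_id) >= threshold:
            elephants.add(flow_id)
    return elephants, sketch

# One large flow of 500 bytes hidden among 50 one-byte mice flows.
packets = [("flow-a", 100)] * 5 + [(f"mouse-{i}", 1) for i in range(50)]
elephants, sketch = find_elephants(packets, threshold=100)
print("flow-a" in elephants, sketch.estimate("flow-a"))
```

    Because the table size is fixed, a burst of packets raises collisions and overestimates; that is the failure mode under traffic jitter that an elastic design such as the one in the abstract targets.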
    Available online:  March 27, 2024 , DOI: 10.13328/j.cnki.jos.007091
    Abstract:
    Internet service providers employ routing protection algorithms to meet real-time, low-latency, and high-availability application needs. However, existing routing protection algorithms have the following three problems. (1) The failure protection ratio is generally low under the premise of not changing the traditional routing protocol forwarding mechanism. (2) The traditional routing protocol forwarding mechanism should be changed to pursue a high failure protection ratio, which is difficult to deploy in practice. (3) The optimal next hop and backup next hop cannot be utilized simultaneously, which causes poor network load balancing capability. For the three problems, this study proposes a routing protection algorithm based on the shortest path serialization graph, which does not need to change the forwarding mechanism, supports incremental deployment and adopts both optimal next hop and backup next hop without routing loops, with a high failure protection ratio. The proposed algorithm mainly includes the following two steps. (1) A sequence number for each node is calculated, and the shortest path sequencing graph is generated. (2) The shortest path serialization graph is generated based on the node sequence number and reverse order search rules, and the next hop set between node pairs is calculated according to the backup next hop calculation rules. Tests on real and simulated network topologies show that the proposed scheme has significant advantages over other routing protection schemes in the average number of backup next hops, failure protection ratio, and path stretch.
    Available online:  March 20, 2024 , DOI: 10.13328/j.cnki.jos.007095
    Abstract:
    In recent years, with the popularity of cloud services, more and more enterprises and individuals have stored their data in cloud databases. However, the convenience of cloud services also brings data security issues. One of the crucial problems is data confidentiality protection, which is to safeguard the sensitive data of users from being spied on or leaked. Fully encrypted databases have emerged to face this challenge. Compared with traditional databases, fully encrypted databases can encrypt data throughout the entire lifecycle of transmission, storage, and computation, thereby ensuring data confidentiality. Currently, there are still many challenges in encrypting data while supporting all SQL functionalities and maintaining high performance. This study comprehensively investigates the key techniques of encrypted computing in fully encrypted databases, classifies them by type, and compares them in terms of functionality, security, and performance. First, it introduces the architectures of fully encrypted databases, including the crypto-based architecture, the trusted execution environment (TEE)-based architecture, and the hybrid architecture. Then, the key techniques of each architecture are summarized. Finally, the challenges and opportunities of current research are discussed, with some open problems provided for future research.
    Available online:  March 20, 2024 , DOI: 10.13328/j.cnki.jos.007085
    Abstract:
    As a new distributed machine learning paradigm, federated learning makes full use of the computing power of many distributed clients and their local data to jointly train a machine learning model while meeting user privacy and data confidentiality requirements. In cross-device federated learning scenarios, the clients usually consist of thousands or even tens of thousands of mobile or terminal devices. Due to communication and computing cost constraints, the aggregation server selects only a few clients in each round of training. Meanwhile, several widely employed federated optimization algorithms adopt completely random client selection, which has been shown to leave considerable room for optimization. In recent years, how to efficiently and reliably select a suitable set from massive heterogeneous clients to participate in training, and thus optimize the resource consumption and model performance of federated learning protocols, has been extensively studied, but there is still no comprehensive survey of this key issue. Therefore, this study conducts a comprehensive survey of client selection algorithms for cross-device federated learning. Specifically, it provides a formal description of the client selection problem, then gives a classification of selection algorithms, and discusses and analyzes the algorithms one by one. Finally, some future research directions for client selection algorithms are explored.
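    The completely random selection mentioned above can be sketched in a few lines. This is a minimal illustration of the baseline, not of any algorithm surveyed in the paper; the function name `select_clients` and the participation fraction are assumptions chosen for the example.

```python
import random


def select_clients(client_ids, fraction, rng):
    """Uniformly sample a fraction of clients without replacement.

    At least one client is always selected, mirroring the common
    max(fraction * N, 1) convention used by random-selection baselines.
    """
    m = max(1, round(fraction * len(client_ids)))
    return rng.sample(client_ids, m)


# Hypothetical setting: 1000 registered devices, 1% participate per round.
rng = random.Random(0)
all_clients = list(range(1000))
round_clients = select_clients(all_clients, 0.01, rng)
```

    Because every client is equally likely to be chosen regardless of its data, bandwidth, or compute capacity, this baseline is the natural reference point against which more selective strategies are compared.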
    Available online:  March 20, 2024 , DOI: 10.13328/j.cnki.jos.007087
    [Abstract] (621) [HTML] (0) [PDF 7.80 M] (1709)
    Abstract:
    Financial risk prediction plays an important role in financial market regulation and financial investment, and has become a research hotspot in artificial intelligence and financial technology in recent years. Due to the complex investment, supply, and other relationships among financial event entities, existing research on financial risk prediction often employs various static and dynamic graph structures to model the relationships among financial entities. Meanwhile, convolutional graph neural networks and other methods are adopted to embed relevant graph structure information into the feature representation of financial entities, which enables the representation of both semantic and structural information related to financial risks. However, previous reviews of financial risk prediction focus only on studies based on static graph structures and ignore the fact that the relationships among entities in financial events change dynamically over time, which reduces the accuracy of risk prediction results. With the development of temporal graph neural networks, more and more studies have begun to pay attention to financial risk prediction based on dynamic graph structures, and a systematic and comprehensive review of these studies will help readers form a complete understanding of financial risk prediction research. According to the different methods of extracting temporal information from dynamic graphs, this study first reviews three different neural network models for temporal graphs. Then, based on different graph learning tasks, it introduces the research on financial risk prediction in four areas: stock price trend risk prediction, loan default risk prediction, fraud transaction risk prediction, and money laundering and tax evasion risk prediction. Finally, the difficulties and challenges facing existing temporal graph neural network models in financial risk prediction are summarized, and potential directions for future research are outlined.
    Available online:  March 20, 2024 , DOI: 10.13328/j.cnki.jos.007093
    [Abstract] (582) [HTML] (0) [PDF 7.17 M] (1576)
    Abstract:
    As a research hotspot in artificial intelligence in recent years, knowledge graphs have been applied in many real-world fields. However, as the application scenarios of knowledge graphs become increasingly diverse, static knowledge graphs, which do not change with time, have been found unable to fully adapt to scenarios with high-frequency knowledge updates. To this end, researchers have proposed the concept of temporal knowledge graphs, which contain temporal information. This study organizes existing temporal knowledge graph representation and reasoning models and constructs a theoretical framework that summarizes these models. On this basis, it briefly introduces and analyzes the current research progress of temporal knowledge graph representation and reasoning, and discusses future trends to help researchers develop and design better models.
    Available online:  March 13, 2024 , DOI: 10.13328/j.cnki.jos.007082
    Abstract:
    In recent years, service-oriented IoT architectures have received a lot of attention from academia and industry. By encapsulating IoT resources into intelligent IoT services, interconnecting these resource-constrained and capacity-evolving services and making them collaborate to support IoT applications has become a widely adopted and flexible mechanism. On edge devices with fluctuating capacity and varying resources, IoT services may experience QoS degradations or resource mismatches during execution, making it difficult for IoT applications to continue and possibly inducing failures. Therefore, quantitative monitoring of IoT services at runtime has become the key to guaranteeing the robustness of IoT applications. Different monitoring mechanisms have been proposed in recent literature, but they lack formal interpretation and suffer from strong domain dependence and empirical subjectivity. Based on formal methods such as signal temporal logic (STL), the problem of IoT service monitoring can be formulated as a temporal logic task to achieve runtime quantitative monitoring. However, STL and its extensions suffer from non-differentiability, loss of soundness, and inapplicability in dynamic environments. Moreover, existing works are inadequate for monitoring composite services, lacking integrity, linkage, and dynamics. To solve these problems, this study proposes a compositional signal temporal logic (CSTL) to achieve quantitative monitoring of different QoS constraints and time constraints upon intra-, inter-, and composite services. Specifically, CSTL adds an accumulative operator based on positively and negatively biased Riemann sums to emphasize the robust satisfaction of all sub-formulae over their entire time domains and to evaluate qualitative and quantitative constraint satisfaction for IoT service monitoring. Besides, CSTL adds a compositional operator based on constraint types and composite structures, as well as dynamic variables that can vary with the dynamic environment, to effectively monitor QoS variations and temporal violations of composite services. As a result, temporal and QoS constraints upon intra-, inter-, and composite services can be specified by CSTL formulae and formally interpreted with qualitative and quantitative satisfaction at runtime. Extensive evaluations show that the proposed CSTL performs better than baseline techniques in terms of expressiveness, applicability, and robustness.
    Available online:  March 13, 2024 , DOI: 10.13328/j.cnki.jos.007089
    Abstract:
    As the core foundation for ensuring network security, cryptography plays a crucial role in data protection, identity verification, encrypted communication, and other aspects. With the rapid popularization of 5G and Internet of Things technology, network security is facing unprecedented challenges, and the demand for cryptographic performance is growing explosively. GPUs can utilize thousands of parallel computing cores to accelerate complex computing problems, which suits the computationally intensive nature of cryptographic algorithms well. Therefore, researchers have extensively explored methods to accelerate various cryptographic algorithms on GPU platforms. Compared with platforms such as CPUs and FPGAs, GPUs have significant performance advantages. This study discusses the classification of cryptographic algorithms and the GPU platform architecture, and provides a detailed analysis of current research on various ciphers on GPU heterogeneous platforms. Additionally, it summarizes the technical challenges currently facing high-performance cryptography based on GPU platforms and offers prospects for future technological development. Through this in-depth study and summary, it aims to provide practitioners in cryptography engineering with a comprehensive reference on the latest research progress and application practices of GPU-based high-performance cryptography.
    Available online:  March 06, 2024 , DOI: 10.13328/j.cnki.jos.007083
    Abstract:
    Multi-modal medical image fusion provides a more comprehensive and accurate medical image description for medical diagnosis, surgical navigation, and other clinical applications by effectively combining human tissue structure and lesion information reflected by different modal datasets. This study aims to address partial spectral degradation, lack of edges and details, and insufficient color reproduction of adhesion lesion-invaded regions in current fusion methods. It proposes a novel multi-modal medical image fusion method to achieve multi-feature enhancement and color preservation in the multi-scale feature frequency domain decomposition filter domain. This method decomposes the source image into four parts: smoothing, texture, contour, and edge feature layers, each of which employs a specific fusion rule, and the fusion result is generated by image reconstruction. In particular, given the potential feature information contained in the smoothing layer, the study proposes a visual saliency decomposition strategy to explore the energy and partial fiber texture features with multi-scale and multi-dimensionality, enhancing the utilization of source image information. In the texture layer, the study introduces a texture enhancement operator to extract details and hierarchical information through spatial structure and information measurement, addressing the issue of distinguishing the invasion status of adherent lesion areas in current fusion methods. In addition, due to the lack of a public abdominal dataset, 403 sets of abdominal images are registered in this study for public access and download. Experiments are conducted on the public Atlas dataset and the abdominal dataset, and the proposed method is compared with six baseline methods. Compared to the most advanced methods, the results show that the similarity between the fused image and the source image is improved by 22.92%, and the edge retention, spatial frequency, and contrast ratio of fused images are improved by 35.79%, 28.79%, and 32.92%, respectively. In addition, the visual quality and computational efficiency of the proposed method are better than those of other methods.
    Available online:  February 28, 2024 , DOI: 10.13328/j.cnki.jos.007065
    Abstract:
    Graph data is ubiquitous in real-world applications, and graph neural networks (GNNs) have been widely used in graph data analysis. However, the performance of GNNs can be severely impacted by adversarial attacks on graph structures. Existing defense methods against adversarial attacks generally rely on low-rank graph structure reconstruction based on graph community preservation priors. However, existing graph structure adversarial defense methods cannot adaptively seek the true low-rank value for graph structure reconstruction, and low-rank graph structures are semantically mismatched with downstream tasks. To address these problems, this study proposes the over-parameterized graph neural network (OPGNN) method based on the implicit regularization effect of over-parameterization. In addition, it formally proves that this method can adaptively solve the low-rank graph structure problem and also proves that over-parameterized residual links on node deep representations can effectively address semantic mismatch. Experimental results on real datasets demonstrate that the OPGNN method is more robust than existing baseline methods, and the OPGNN framework is notably effective on different graph neural network backbones such as GCN, APPNP, and GPRGNN.
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007067
    Abstract:
    Code search methods based on deep learning perform the code search task by calculating the similarity between the representations of the code and the description statement. However, this approach does not consider the real probability distribution of relevance between the code and the description. To solve this problem, this study proposes a code search method based on a generative adversarial game that combines the correlation between the code and the description in the classical probability model with the feature extraction in the vector space model. The generative adversarial game is then adopted to apply the probability distribution between the code and the description to the alternate training of the generator and discriminator. Meanwhile, the code encoder and the description encoder are optimized, and high-quality code representations and description statement representations are generated for the code search task. Finally, experimental verification is carried out on a public dataset, and the results show that the proposed method improves the Recall@10, MRR@10, and NDCG@10 metrics by 8.4%, 32.5%, and 24.3%, respectively, compared to the DeepCS method.
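    For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute Recall@k, MRR@k, and NDCG@k. It assumes that each query has exactly one relevant code snippet with binary relevance, which may differ from the evaluation protocol of the paper; the function name `metrics_at_k` and the sample ranks are hypothetical.

```python
import math


def metrics_at_k(ranks, k=10):
    """Compute (Recall@k, MRR@k, NDCG@k) averaged over queries.

    ranks: 1-based rank of the single relevant snippet for each query,
           or None if it was not retrieved at all.
    With one relevant item per query, the ideal DCG is 1, so NDCG@k
    reduces to 1 / log2(rank + 1) for hits inside the top k.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    recall = len(hits) / n
    mrr = sum(1.0 / r for r in hits) / n
    ndcg = sum(1.0 / math.log2(r + 1) for r in hits) / n
    return recall, mrr, ndcg


# Hypothetical ranks for four queries: two hits, one miss, one outside the top 10.
recall10, mrr10, ndcg10 = metrics_at_k([1, 2, None, 20], k=10)
```

    Recall@k only records whether the relevant snippet appears in the top k, while MRR@k and NDCG@k also reward placing it nearer the top, which is why the three metrics can improve by quite different margins.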
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007062
    Abstract:
    Spoken language understanding is a key task in task-based dialogue systems, mainly composed of two sub-tasks: slot filling and intent detection. Currently, the mainstream method is to jointly model slot filling and intent detection. Although this method has achieved good results in both sub-tasks, joint modeling still suffers from error propagation in the interaction between intent detection and slot filling, as well as incorrect correspondence between multi-intent information and slot information in multi-intent scenarios. In response to these problems, this study proposes a joint model for multi-intent detection and slot filling based on graph attention networks (WISM). WISM establishes a word-level one-to-one mapping between fine-grained intents and slots to correct the incorrect correspondence between multi-intent information and slots. By constructing a word-intent-semantic slot interaction graph and utilizing a fine-grained graph attention network to establish bidirectional connections between the two tasks, it reduces error propagation during the interaction process. Experimental results on the MixSNIPS and MixATIS datasets show that, compared with the latest existing models, WISM improves semantic accuracy by 2.58% and 3.53%, respectively. The model not only improves accuracy but also verifies the one-to-one correspondence between multiple intents and semantic slots.
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007055
    Abstract:
    Detecting aligned double joint photographic experts group (JPEG) compression is a challenging task in digital image forensics. Previous studies have proposed methods that can effectively detect aligned double JPEG compression, but these methods mostly rely on features extracted during the JPEG decompression process. If an aligned double-compressed JPEG image is saved in BMP format, these methods are difficult to apply directly. To address this issue, this study proposes a quantization step estimation method based on dual thresholds, which allows for the acquisition of quantization tables and the extraction of features. Furthermore, the study defines a minimum error based on the unique properties of JPEG compression with a quality factor of 100; removing the minimum error from the features further improves the detection performance of the proposed method. Finally, the study extracts first-order relative error features based on the convergence properties of the de-quantized JPEG coefficients, which further enhances the detection performance of the proposed method at lower quality factors. Experimental results demonstrate that the proposed method outperforms current state-of-the-art algorithms at different quality factors.
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007064
    Abstract:
    A temporal graph is a type of graph in which each edge is associated with a timestamp. A seasonal-bursting subgraph is a dense subgraph characterized by burstiness over multiple time periods, which can be applied to activity discovery and group relationship analysis in social networks. Unfortunately, most previous studies on subgraph mining in temporal networks ignore the seasonal or bursting features of subgraphs. To this end, this study proposes a maximal ($\omega,\theta $)-dense subgraph model to represent a seasonal-bursting subgraph in temporal networks. Specifically, the maximal ($\omega,\theta $)-dense subgraph is a subgraph that accumulates its density at the fastest speed during at least $ \omega $ particular periods of length no less than $ \theta $ on the temporal graph. To compute all seasonal-bursting subgraphs efficiently, the study first models the mining problem as a mixed integer programming problem, which consists of finding the densest subgraph and the maximum burstiness segment. Corresponding solutions are then given for each subproblem. The study further devises two optimization strategies that exploit key-core and dynamic programming algorithms to boost performance. Experimental results show that the proposed model is indeed able to identify many seasonal-bursting subgraphs. The efficiency, scalability, and effectiveness of the proposed algorithms are also verified on five real-life datasets.
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007031
    Abstract:
    The utilization range of Internet of Things (IoT) devices is expanding. Model checking is an effective approach to improve the reliability and security of such devices. However, the commonly adopted model checking methods cannot well describe the cross-space movement and communication behavior common in such devices. To this end, this study proposes a modeling and verification method for the mobile and communication behavior of IoT devices to verify their spatio-temporal properties. Additionally, push/pull action and global communication mechanism are integrated into ambient calculus to propose the ambient calculus with global communication (ACGC) and provide a model checking algorithm for ACGC against the ambient logic. Then, the modeling language for mobility and communication (MLMC) is put forward to describe mobile and communication behavior of IoT devices. Additionally, a method to convert the MLMC-based description into an ACGC model is given. Furthermore, a model checking tool ACGCCk is implemented to verify whether the properties of IoT devices are satisfied. Meanwhile, some optimizations are conducted to accelerate the checking. Finally, the effectiveness of the proposed method is demonstrated by case study and experimental analysis.
    Available online:  February 05, 2024 , DOI: 10.13328/j.cnki.jos.007045
    [Abstract] (465) [HTML] (0) [PDF 1.63 M] (2060)
    Abstract:
    Network congestion control algorithms are the key factor in determining network transport performance. In recent years, expanding networks, growing network bandwidth, and increasing user requirements for network performance have brought challenges to the design of congestion control algorithms. To adapt to different network environments, many novel design ideas for congestion control algorithms have been proposed recently, which have greatly improved network performance and user experience. This study reviews innovative congestion control algorithm design ideas and classifies them into four major categories: reservation scheduling, direct measurement, machine learning-based learning, and iterative detection. It introduces the corresponding representative congestion control algorithms, and further compares and analyzes the advantages and disadvantages of the various congestion control ideas and methods. Finally, the study looks ahead to future development directions of congestion control to inspire research in this field.
    Available online:  January 31, 2024 , DOI: 10.13328/j.cnki.jos.007056
    Abstract:
    Time series forecasting models have been widely used in various domains of daily life, and attacks against these models bear on the security of data in applications. At present, adversarial attacks on time series mostly perform large-scale perturbation at the global level, which makes adversarial samples easy to perceive. At the same time, the effectiveness of adversarial attacks decreases significantly as the magnitude of the perturbation shrinks. Therefore, how to generate imperceptible adversarial samples while maintaining competitive attack performance is an urgent problem in the current field of adversarial attacks on time series forecasting. This study first proposes a local perturbation strategy based on sliding windows to narrow the perturbation interval of the adversarial sample. Second, it employs the differential evolutionary algorithm to find the optimal attack points and combines the segmentation function to partition the perturbation interval, further reducing the perturbation range and completing the semi-white-box attack. Comparison experiments with existing adversarial attack methods on several different deep learning models show that the proposed method can generate less perceptible adversarial samples and effectively change the prediction trend of the model. The proposed method achieves sound attack results in four challenging tasks, namely stock trading, electricity consumption, sunspot observation, and temperature prediction.
    Available online:  January 31, 2024 , DOI: 10.13328/j.cnki.jos.007052
    Abstract:
    The performance of image classification algorithms is limited by the diversity of visual information and the influence of background noise. Existing works usually apply cross-modal constraints or heterogeneous feature alignment algorithms to learn visual representations with strong discrimination. However, the difference in feature distribution caused by modal heterogeneity limits the effective learning of visual representations. To address this problem, this study proposes an image classification framework (CMIF) based on cross-modal semantic information inference and fusion and introduces the semantic description of images and statistical knowledge as privileged information. The study uses the privileged information learning paradigm to guide the mapping of image features from visual space to semantic space in the training stage, and a class-aware information selection (CIS) algorithm is proposed to learn the cross-modal enhanced representation of images. In view of the heterogeneous feature differences in representation learning, the partial heterogeneous alignment (PHA) algorithm is used to achieve cross-modal alignment of visual features and semantic features extracted from privileged information. To further suppress the interference caused by visual noise in semantic space, the CIS algorithm based on graph fusion is used to reconstruct the key information in the semantic representation, forming an effective supplement to the visual prediction information. Experiments on the cross-modal classification datasets VireoFood-172 and NUS-WIDE show that CMIF can learn robust semantic features of images and, as a general framework, achieves stable performance improvement on the convolution-based ResNet-50 and Transformer-based ViT image classification models.
    Available online:  January 31, 2024 , DOI: 10.13328/j.cnki.jos.007063
    Abstract:
    With the development of Internet information technology, large-scale graphs have widely emerged in social networks, computer networks, and biological information networks. In view of the storage and performance limitations of traditional graph data management technology when dealing with large-scale graphs, distributed management technology has become a hotspot in industry and academia. Core decomposition computes the core numbers of the vertices in a graph and plays a key role in many applications, including community search, protein structure analysis, and network structure visualization. Existing distributed core decomposition algorithms apply a broadcast message delivery mechanism based on the vertex-centric mode, which may generate a large amount of redundant communication and computation overhead and lead to memory overflow when processing large-scale graphs. To address these issues, this study proposes novel distributed core decomposition algorithms based on global activation and peeling calculation frameworks, respectively. In addition, several strategies are designed to improve algorithm performance. Based on the locality of the vertex core number, the study proposes a new message-pruning strategy and a new worker-centric computing mode, thereby improving the efficiency of the algorithms. To verify these strategies, this study deploys the proposed models and algorithms on the distributed cluster of the National Supercomputing Center in Changsha, and the effectiveness and efficiency of the proposed methods are evaluated through a large number of experiments on real and synthetic datasets. The total time performance of the algorithm is improved by 37% to 98%.
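    To make the notion of a core number concrete, the sketch below runs the classic single-machine peeling procedure: repeatedly remove a vertex of minimum remaining degree and record the largest such degree seen so far. It is a simple quadratic-time illustration, not the distributed algorithms proposed in the paper; the function name `core_numbers` and the example graph are hypothetical.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected simple graph.

    Peeling: always remove a vertex with the smallest remaining degree.
    The core number of a removed vertex is the maximum of the minimum
    degrees observed up to and including its removal.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Hypothetical graph: a triangle (1, 2, 3) with a pendant vertex 4 attached to 3.
example_cores = core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)])
```

    Here the pendant vertex has core number 1 and the triangle vertices have core number 2. The locality that the abstract refers to stems from the fact that a vertex's core number is determined by the core numbers of its neighbors, which is what allows distributed variants to prune messages.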
    Available online:  January 31, 2024 , DOI: 10.13328/j.cnki.jos.007066
    Abstract:
    Raft is one of the most popular distributed consensus protocols. Since it was proposed in 2014, Raft and its variants have been widely used in different kinds of distributed systems. To prove the correctness of the Raft protocol, developers use the TLA+ formal specification to model and verify its design. However, due to the gap between the abstract formal specification and practical implementation, distributed systems that implement the Raft protocol can still violate the protocol design and introduce intricate bugs. This study proposes a novel testing technique based on the TLA+ formal specification to unearth bugs in Raft implementations. Specifically, the study maps the formal specification to the corresponding system implementation and then uses the specification-defined state space to guide the testing of the implementation. To evaluate the feasibility and effectiveness of the proposed approach, the study applies it to two different Raft implementations and finds three previously unknown bugs.
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007051
    Abstract:
    A heterogeneous graph is a graph with multiple types of nodes and edges, also known as a heterogeneous information network, which is often used to model systems with rich features and association patterns in the real world. Link prediction between heterogeneous nodes is a fundamental task in network analysis. In recent years, the development of heterogeneous graph neural networks (HGNNs) has greatly advanced the task of link prediction, which is usually regarded as a feature similarity analysis between nodes or a binary classification problem based on paired node features. However, when learning node feature representations, existing HGNNs usually focus only on the associations between adjacent nodes or on meta-path-based structural information. This not only makes it difficult for these HGNNs to capture the semantic information of the ring structures inherent in heterogeneous graphs but also ignores the complementarity of structural information at different levels. To solve these issues, this study proposes a cascade graph convolution network based on multi-level graph structures (CGCN-MGS), which is composed of graph neural networks based on three graph structures of different levels: neighboring, meta-path, and ring structures. CGCN-MGS can mine rich and complementary information from multi-level features and improve the ability of the learned node features to represent the semantic and structural information of nodes. Experimental results on several benchmark datasets show that CGCN-MGS can achieve state-of-the-art performance on the link prediction of heterogeneous graphs.
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007080
    [Abstract] (622) [HTML] (0) [PDF 4.43 M] (1216)
    Abstract:
    Malware detection, including Windows and Android malware detection, is a hotspot of cyberspace security research. With the development of machine learning and deep learning, some outstanding algorithms from the fields of image recognition and natural language processing have been applied to malware detection. These algorithms have shown excellent learning performance when a large amount of data is available. However, some challenging problems in malware detection have not been solved effectively. For instance, conventional learning methods cannot achieve effective detection based on only a few samples of novel malware. Therefore, few-shot learning (FSL) is adopted to solve few-shot malware detection (FSMD) problems. This study derives the problem definition and the general process of FSMD from related research. According to their underlying principles, FSMD methods are divided into methods based on data augmentation, methods based on meta-learning, and hybrid methods combining multiple technologies. Then, the study discusses the characteristics of each FSMD method. Finally, the background, technology, and application prospects of FSMD are discussed.
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007042
    [Abstract] (585) [HTML] (0) [PDF 6.35 M] (2006)
    Abstract:
    In recent years, machine learning has been a research hotspot and has played an important role in various fields. However, as the amount of data continues to increase, the training time of machine learning algorithms is getting longer. Meanwhile, quantum computers demonstrate powerful computing ability. Therefore, researchers have tried to use quantum computing to address the long training time of machine learning, which has led to the emergence of quantum machine learning. Quantum machine learning algorithms have been proposed, including quantum principal component analysis, quantum support vector machine, and quantum deep learning. Additionally, experiments have shown that quantum machine learning algorithms have a significant acceleration effect, leading to a gradual upward trend in research on quantum machine learning. This study reviews research on quantum machine learning algorithms. First, the fundamental concepts of quantum computing are introduced. Then, five classes of quantum machine learning algorithms are presented, including quantum supervised learning, quantum unsupervised learning, quantum semi-supervised learning, quantum reinforcement learning, and quantum deep learning. Next, related applications of quantum machine learning are demonstrated, together with algorithm experiments. Finally, a summary and prospects for future study are discussed.
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007068
    Abstract:
    Currently, most of the published image steganalysis methods are designed for grayscale images, which cannot effectively detect color images widely used in social media. To solve this problem, this study proposes a color image steganalysis method based on central difference convolution and attention enhancement. The proposed method first designs a backbone flow consisting of three stages: preprocessing, feature extraction, and feature classification. In the preprocessing stage, the input color image is color channel-separated, and the residual images after SRM filtering are concatenated through each channel. In the feature extraction stage, the study constructs three convolutional blocks based on central difference convolution to extract deeper steganalysis feature maps. In the classification stage, the study uses global covariance pooling and two fully connected layers with dropout operation to classify the cover and stego images. Additionally, to further enhance the feature expression ability of the backbone flow at different stages, it introduces a residual spatial attention enhancement module and a channel attention enhancement module at the early and late stages of the backbone flow, respectively. Specifically, the residual spatial attention enhancement module first uses Gabor filter kernels to perform channel-separated convolution on the input image and then obtains the effective information of the residual feature map through the spatial attention mechanism. The channel attention enhancement module enhances the final feature classification ability of the model by obtaining the dependence relationship between channels. A large number of comparative experiments have been conducted, and the results show that the proposed method can significantly improve the detection performance of color image steganography and achieve the best results currently. 
In addition, the study also conducts corresponding ablation experiments to verify the rationality of the proposed network architecture.
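    The abstract does not spell out the central difference convolution operator. As a rough illustration only, the sketch below follows the formulation commonly used in the literature, out(p0) = Σ w(pn)·x(p0+pn) − θ·x(p0)·Σ w(pn); the function name, the `theta` parameter, and the single-channel nested-list layout are assumptions made here, not details taken from the paper.

```python
def central_difference_conv2d(image, kernel, theta=0.7):
    """Valid-mode 2D central difference convolution on nested lists.

    Assumed formulation (not confirmed by the abstract):
        out(p0) = sum_n w(pn) * x(p0 + pn) - theta * x(p0) * sum_n w(pn)
    With theta = 0 this reduces to a plain sliding-window weighted sum.
    """
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    ch, cw = kh // 2, kw // 2              # offset of the window centre
    kernel_sum = sum(sum(row) for row in kernel)
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            # Ordinary weighted sum over the window.
            vanilla = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            # Subtract the centre pixel's contribution, scaled by theta.
            center = image[i + ch][j + cw]
            row.append(vanilla - theta * center * kernel_sum)
        out.append(row)
    return out
```

    With `theta=1.0` a constant image maps to all zeros, which is the sense in which the operator emphasizes local differences (where weak stego noise lives) over absolute intensity.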
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007059
    Abstract:
    With the rapid development of big data and computing power, deep learning has made significant breakthroughs and quickly become a field with numerous practical application scenarios and active research topics. In response to the growing demand for developing deep learning tasks, deep learning frameworks have emerged. Acting as an intermediate layer between application scenarios and hardware platforms, deep learning frameworks facilitate the development of deep learning applications, enable users to efficiently construct diverse deep neural network (DNN) models, and adapt closely to various computing hardware to meet the computational needs of different architectures and environments. Because deep learning frameworks are fundamental software in the realm of artificial intelligence, any issue within them can have severe consequences. Even a single bug in the code can trigger widespread failures in models built on the framework, posing a serious threat to the safety of deep learning systems. As the first review focusing exclusively on the testing of deep learning frameworks, this study first introduces the development history and basic architectures of deep learning frameworks. Then, by examining 55 academic papers directly related to the testing of deep learning frameworks, it systematically analyzes and summarizes bug characteristics, key testing technologies, and testing methods based on various input forms. The study also explores how to combine key technologies to address research problems. Lastly, it summarizes the unresolved difficulties in testing deep learning frameworks and provides insights into promising future research directions. This study can offer a valuable reference and guidance to researchers working on deep learning framework testing, ultimately promoting the sustained development and maturity of deep learning frameworks.
    Available online:  January 24, 2024 , DOI: 10.13328/j.cnki.jos.007050
    Abstract:
    A directed acyclic graph (DAG)-based blockchain adopts a parallel topology and can significantly improve system performance compared with conventional chain-based blockchains with a serial topology. As a result, it has attracted wide attention from industry. However, the storage model and the consensus protocol of existing DAG-based blockchains are tightly coupled and thus lack the flexibility to meet diversified application demands. Furthermore, most DAG-based blockchains lack flexibility at the consensus protocol level and are limited to probabilistic consensus protocols, which makes it difficult to balance confirmation latency and security and is especially unfriendly to delay-sensitive applications. Therefore, this study presents an elastic DAG-based blockchain named ElasticDAG. The core idea is to decouple the storage model from the consensus protocol so that they can proceed in parallel and independently and flexibly adapt to diversified applications. To improve the throughput and liveness of the system, an adaptive block confirmation strategy and an epoch-based block ordering algorithm are designed for the storage model. To reduce transaction confirmation latency, a low-latency hybrid consensus protocol for DAG blockchains is designed. Experimental results demonstrate that the ElasticDAG prototype can achieve a throughput exceeding 11 Mb/s in a WAN with a confirmation latency of tens of seconds. Compared with OHIE and Haootia, ElasticDAG can reduce confirmation latency by 17 times and improve security from 91.04% to 99.999 914% while maintaining the same throughput and consensus latency.
    Available online:  January 17, 2024 , DOI: 10.13328/j.cnki.jos.007058
    Abstract:
    Due to the continuous advancements in the field of deep learning, there is growing interest in extending relational databases with collaborative query processing (CQP) techniques to handle advanced analytical queries involving structured and unstructured data. State-of-the-art CQP methods employ user-defined functions (UDFs) to implement deep neural network (NN) models for processing unstructured data while utilizing relational operations for structured data. UDF-based approaches simplify query composition, allowing users to submit analytical queries with a single SQL statement. However, they require manual selection of appropriate and efficient models based on desired performance metrics during ad-hoc data analysis, posing significant challenges to users. To address this issue, this research proposes a CQP technique based on declarative inference functions (DIF), which constructs a complete CQP framework by optimizing model selection, execution strategies, and device bindings across multiple query execution paths. Leveraging the cost model and optimization rules designed in this study, the query processor is capable of estimating the cost of different query plans and automatically selecting the optimal physical query plan. Experimental results on four datasets validate the effectiveness and efficiency of the proposed DIF-based CQP approach.
    Available online:  January 17, 2024 , DOI: 10.13328/j.cnki.jos.007043
    Abstract:
    The graphical user interface (GUI/UI) provides a visual bridge between an application and its end users, who use the application through interactive operations. With the development of mobile applications, GUIs, which combine aesthetics and interaction design, have become increasingly complex, and users are increasingly concerned about the accessibility and usability of applications. However, this complexity also brings great challenges to GUI design and implementation. Because of user-defined device settings and differences in device models and screen resolutions, UI display issues occur frequently. For example, owing to software or hardware compatibility problems, interfaces rendered on different devices often exhibit display issues such as text overlap, component occlusion, and missing images. These issues harm the usability and accessibility of applications and result in a poor user experience. Unfortunately, little is known about the causes of UI display issues in mobile applications. To address this challenge, this study collects 6729 screenshots of applications with UI display issues from the Baidu crowdtesting platform and 1016 screenshots of applications from issue reports on GitHub, and it identifies nine types of UI display issues using thematic analysis. By analyzing 1061 UI issue reports from GitHub and the corresponding defective code, the study summarizes the nature and causes of UI display issues. The research finds that (1) 62.1% of the screenshots in the crowdtesting dataset contain UI display issues; (2) UI display issues are largely caused by a mismatch between the font scaling setting and the adaptive settings of components; (3) the layout settings of the interface can lead to display issues; (4) if hardware acceleration is not turned on, the normal display of the interface is affected.
    Available online:  January 17, 2024 , DOI: 10.13328/j.cnki.jos.007049
    Abstract:
    In the white-box attack context, an attacker can access the implementation of a cryptographic algorithm, observe its dynamic execution and internal details, and modify it arbitrarily. In 2002, Chow et al. proposed the concept of white-box cryptography and presented white-box implementations of the AES and DES algorithms based on lookup tables, an approach known as the CEJO framework. A white-box implementation obfuscates an existing cryptographic algorithm, protects the key in software under white-box attacks, and preserves the correctness of the algorithm's results. SIMON is a lightweight block cipher that is widely used in Internet of Things devices because of its strong software and hardware performance, so studying its white-box implementation is of great practical significance. This study presents two white-box implementations of the SIMON algorithm. The first scheme (SIMON-CEJO) uses the classical CEJO framework and protects the lookup tables with network codings so as to hide the key. This scheme occupies 369.016 KB of memory. The security analysis shows that SIMON-CEJO can resist the BGE attack and the affine equivalence algorithm attack, but it fails to resist differential computation analysis. The second scheme (SIMON-Masking) uses the encoding method proposed by Battistello et al. to encode the plaintext and key information, and it exploits the homomorphism of the encoding to convert XOR and AND operations into modular multiplication and table lookup operations. Finally, the corresponding ciphertext is obtained by decoding. During execution, a Boolean mask is added to the AND operation. The randomness of the codings protects the real key information and improves the scheme's resistance to differential computation analysis and other attacks. 
SIMON-Masking occupies 655.81 KB of memory space, and the time complexity of the second-order differential computation analysis based on the Legendre symbol is O(n2klog2p). The comparison of the two schemes shows that the classical CEJO framework cannot effectively defend against differential computation analysis, whereas adopting the new coding and adding masks is an effective approach to white-box implementation.
    Available online:  January 10, 2024 , DOI: 10.13328/j.cnki.jos.007048
    Abstract:
    Database management systems (DBMSs) are the infrastructure for efficient storage, management, and analysis of data, playing a pivotal role in modern data-intensive applications. Vulnerabilities in DBMSs pose a great threat to the security of data and the operation of applications. Fuzzing is one of the most popular dynamic vulnerability detection techniques and has been applied to analyze DBMSs, uncovering many vulnerabilities. This study analyzes the requirements and the difficulties involved in testing a DBMS and proposes a foundational framework for DBMS fuzzing. It also analyzes the challenges encountered by DBMS fuzzers and identifies the dimensions in which they need support. It introduces typical DBMS fuzzers from the perspective of discovering different types of vulnerabilities and summarizes key techniques in DBMS fuzzing, including SQL statement synthesis, code coverage tracking, and test oracle construction. Several popular DBMS fuzzers are evaluated in terms of coverage, the syntactic and semantic correctness of the generated test cases, and the ability to find vulnerabilities. Finally, the study discusses the problems faced by current DBMS fuzzing research and practice and outlines future research directions in DBMS fuzzing.
    Available online:  January 10, 2024 , DOI: 10.13328/j.cnki.jos.007037
    Abstract:
    Ubiquitous computing for human-cyber-physical integration is becoming a new requirement and trend in software development. Based on this new computing paradigm, human-cyber-physical applications further extend software technology to the effective utilization of offline resources, including physical devices and human resources. As a typical human-cyber-physical scenario, collaboration between device and human resources in the physical world features resource selectivity, high task frequency, and worker dynamics. Traditional resource scheduling techniques cannot meet the scheduling requirements of this type of task (referred to as a DHRC task). Thus, this study proposes an optimal scheduling method for collaborative tasks between device and human resources. The method consists of two stages: device resource scheduling and human resource scheduling. In the device resource scheduling stage, a device resource scheduling algorithm based on NSGA-II is proposed to optimize task resource selection by jointly considering factors such as task distance, device load, and the number of workers near the device location. In the human resource scheduling stage, a human resource scheduling algorithm based on DPSO is put forward to optimize worker selection and the corresponding path planning according to factors such as worker location and collaboration dependency. Experiments in a simulated environment show that the first-stage algorithm is equivalent in efficiency and superior in utility to the baseline (the discrete particle swarm optimization algorithm), and that the second-stage algorithm is superior in both efficiency and utility to its baseline (the genetic algorithm improved by the tournament mechanism).
    Available online:  January 10, 2024 , DOI: 10.13328/j.cnki.jos.007040
    Abstract:
    Named entity recognition (NER) is a fundamental task in information extraction that aims to locate the boundaries of entities in a sentence and classify them. To address the fuzzy boundaries of nested entities in span-based detection models, this study proposes a nested NER model based on span boundary perception. First, it utilizes a bidirectional affine attention mechanism to capture the semantic relevance among word tokens and generates a span semantic representation matrix. Second, it designs a second-order diagonal neighborhood difference operator and establishes a span semantic difference mechanism to extract semantic difference information among spans. Additionally, a span boundary perception mechanism is introduced, which uses the local feature extraction ability of sliding windows to enhance the semantic differences at span boundaries and thereby locate entity spans accurately. The model is validated on three benchmark datasets: ACE04, ACE05, and Genia. The experimental results show that the proposed model outperforms related work in entity recognition accuracy. Additionally, the study conducts ablation experiments and case studies to verify the effectiveness of the proposed semantic difference mechanism and span boundary perception mechanism, providing new ideas and empirical evidence for further research on NER.
    Available online:  January 10, 2024 , DOI: 10.13328/j.cnki.jos.007060
    Abstract:
    To address the security of users’ private keys, this study proposes a user-oriented and practical private key protection framework that combines secret sharing with the edge computing model. Based on this framework, it designs a private key protection scheme for the SM2 public-key cryptographic system. In this scheme, a user’s SM2 private key is divided into two shares via a secret sharing scheme, and the shares are kept by the user’s device and the edge server, respectively. The public-key cryptographic task requested by Web3 applications is executed cooperatively by the user’s device and the edge server without recovering the original private key. If the user’s device or the edge server is attacked, a key updating protocol is executed between them to update the private key shares and invalidate the share that may have been leaked. Experimental results show that the computing time of the new scheme is acceptable for common real-world devices (smartphones, laptops, etc.).
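    The abstract does not say which secret sharing scheme is used or how shares are updated. Purely as an illustration of the two ideas (two-party sharing and share refresh), the sketch below uses additive sharing modulo a prime. The modulus `Q`, the function names, and the additive form are all assumptions made here; a real SM2 deployment would work modulo the SM2 group order, and the cooperative signing and decryption protocols are not shown.

```python
import secrets

# Stand-in prime modulus (2**255 - 19), assumed for illustration only.
# A real SM2 scheme would use the order of the SM2 curve group instead.
Q = 2**255 - 19


def split_key(d, q=Q):
    """Split private key d into two additive shares with d1 + d2 = d (mod q)."""
    d1 = secrets.randbelow(q)      # share held by the user's device
    d2 = (d - d1) % q              # share held by the edge server
    return d1, d2


def refresh_shares(d1, d2, q=Q):
    """Re-randomize both shares without changing the key they encode."""
    r = 1 + secrets.randbelow(q - 1)   # non-zero offset, so both shares change
    return (d1 + r) % q, (d2 - r) % q


def reconstruct(d1, d2, q=Q):
    """Recombine the shares. Used here only to check the sketch; the scheme
    described above never recovers the key in one place."""
    return (d1 + d2) % q
```

    After `refresh_shares`, an old share that leaked is no longer consistent with the other party's new share, which is the property a key updating protocol relies on.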
    Available online:  January 03, 2024 , DOI: 10.13328/j.cnki.jos.007057
    Abstract:
    Currently, sentiment analysis research is generally based on big data-driven models, which heavily rely on expensive annotation and computational costs. Therefore, research on sentiment analysis in low-resource scenarios is particularly urgent. However, existing research on sentiment analysis in low-resource scenarios mainly focuses on a single task, making it difficult for models to acquire external task knowledge. Therefore, this study constructs successive sentiment analysis in low-resource scenarios, aiming to allow models to learn multiple sentiment analysis tasks over time by continual learning methods. This can make full use of data from different tasks and learn sentiment information from different tasks, thus alleviating the problem of insufficient training data for a single task. There are two core problems with successive sentiment analysis in low-resource scenarios. One is preserving sentiment information for a single task, and the other is fusing sentiment information between different tasks. To solve these two problems, this study proposes continual attention modeling for successive sentiment analysis in low-resource scenarios. Sentiment masked Adapter (SMA) is first constructed, which is used to generate hard attention emotion masks for different tasks. This can preserve sentiment information for different tasks and mitigate catastrophic forgetting. Secondly, dynamic sentiment attention (DSA) is proposed, which dynamically fuses features extracted by different Adapters based on the current time step and task similarity. This can fuse sentiment information between different tasks. Experimental results on multiple datasets show that the proposed approach significantly outperforms the state-of-the-art benchmark approaches. 
Additionally, experimental analysis indicates that the proposed approach has the best sentiment information retention ability and sentiment information fusion ability compared to other benchmark approaches while maintaining high operational efficiency.
    Available online:  January 03, 2024 , DOI: 10.13328/j.cnki.jos.007044
    Abstract:
    Due to the complex features of multi-view data, multi-view outlier detection has become a very challenging research topic in outlier detection. There are three types of outliers in multi-view data, namely class outliers, attribute outliers, and class-attribute outliers. Most early multi-view outlier detection methods rely on a clustering assumption, which makes it difficult to detect outliers when the data has no clustering structure. In recent years, many multi-view outlier detection methods have replaced the clustering assumption with the multi-view consistent nearest neighbor assumption, but they still detect new data inefficiently. In addition, most existing multi-view outlier detection methods are unsupervised, so they are affected by outliers during model learning and do not work well on datasets with high outlier rates. To address these issues, this study proposes an intra-view reconstruction and cross-view generation network for effective multi-view outlier detection. The network detects all three types of outliers and consists of two modules: intra-view reconstruction and cross-view generation. By training on normal data, the proposed method can fully capture the features of each view in the normal data and better reconstruct and generate the corresponding views. In addition, a new outlier scoring method is proposed to calculate an outlier score for each sample so that new data can be detected efficiently. Extensive experimental results show that the proposed method significantly outperforms existing methods. To the best of our knowledge, this is the first work to apply a deep model based on generative adversarial networks to multi-view outlier detection.
    Available online:  December 27, 2023 , DOI: 10.13328/j.cnki.jos.007032
    Abstract:
    Federated learning has attracted much attention because it can address the data island problem. However, it also faces challenges such as the risk of privacy leakage and performance degradation due to model heterogeneity under non-independent and identically distributed data. To this end, this study proposes a personalized federated learning method based on Bregman divergence and differential privacy (FedBDP). This method employs Bregman divergence to measure the differences between local and global parameters and adopts it as a regularization term in the loss function, thereby reducing model differences and improving model accuracy. Meanwhile, adaptive differential privacy is utilized to perturb local model parameters, and an attenuation coefficient is defined to dynamically adjust the level of differential privacy noise in each round, which allocates the privacy noise reasonably and improves model usability. Theoretical analysis shows that FedBDP satisfies convergence conditions under both strongly convex and non-convex smooth functions. Experimental results demonstrate that FedBDP can maintain accuracy on the MNIST and CIFAR10 datasets while satisfying differential privacy.
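    For readers unfamiliar with the regularizer mentioned above, the sketch below computes a Bregman divergence from its standard definition, D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, and adds it to a task loss. The choice of generator φ(v) = ½‖v‖², the weight `mu`, and all function names are assumptions made for illustration; the abstract does not state which generator or weighting FedBDP uses.

```python
def bregman_divergence(phi, grad_phi, x, y):
    """Standard definition: D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    grad_y = grad_phi(y)
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_y, x, y))
    return phi(x) - phi(y) - inner


def half_sq_norm(v):
    """Generator phi(v) = 0.5 * ||v||^2 (an assumed choice)."""
    return 0.5 * sum(t * t for t in v)


def half_sq_norm_grad(v):
    """Gradient of the generator above: grad phi(v) = v."""
    return list(v)


def regularized_loss(task_loss, local_params, global_params, mu=0.1):
    """Task loss plus a Bregman penalty tying local parameters to global ones.
    mu is a hypothetical regularization weight."""
    penalty = bregman_divergence(
        half_sq_norm, half_sq_norm_grad, local_params, global_params
    )
    return task_loss + mu * penalty
```

    With this generator the divergence reduces to half the squared Euclidean distance between local and global parameters; other convex generators give other geometries.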
    Available online:  December 27, 2023 , DOI: 10.13328/j.cnki.jos.007036
    Abstract:
    Partitioned DM (deadline-monotonic) scheduling of sporadic real-time tasks is a classic research problem. This study proposes a partitioned scheduling algorithm PDM-FFD (partitioned deadline-monotonic first-fit decrease) with higher processor utilization for constrained-deadline sporadic tasks. In PDM-FFD, tasks are first sorted in non-decreasing order of relative deadline, then the first-fit strategy is used to select the processor core to which each task is allocated, and each core adopts the DM scheduling policy. Finally, a tighter schedulability test is obtained by analyzing the task interference time. This study proves that the speedup factor of PDM-FFD is $3 - (3\Delta + 1)/(m + \Delta )$ and the time complexity is ${\rm{O}}({n^2}) + {\rm{O}}(nm)$, where $\Delta =\displaystyle{\sum }_{{\tau }_{j}\in \tau }{C}_{j} \times {u}_{j}/{D}_{{\rm{max}}}$, ${\tau _j}$ belongs to the task set $\tau $, ${C_j}$ is the worst-case execution time, ${u_j}$ is the utilization, ${D_{{\rm{max}}}}$ is the maximum relative deadline, $n$ is the task number, and $m$ is the processor core number. The speedup factor of PDM-FFD is strictly less than $3 - 1/m$, which outperforms the existing multi-core partitioned scheduling algorithm FBB-FFD. Experiments show that PDM-FFD improves processor utilization by 18.5% compared to other available algorithms on a four-core processor. The PDM-FFD performance improves with the increasing processor core number, task set utilization, and task number. Due to high performance, PDM-FFD can be widely utilized in typical real-time systems such as resource-constrained spacecraft, autonomous vehicles, and industrial robots.
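    As a worked illustration of the formula quoted in the abstract, the sketch below evaluates Δ and the speedup factor 3 − (3Δ + 1)/(m + Δ) for a small task set. The task tuple layout (C, D, T), the reading of utilization as u = C/T, and the sample values are assumptions made here, not data from the paper.

```python
def delta(tasks):
    """Delta = sum_j (C_j * u_j) / D_max.

    Each task is a tuple (C, D, T): worst-case execution time, relative
    deadline, and minimum inter-arrival time. Utilization is assumed to
    be u = C / T.
    """
    d_max = max(d for _, d, _ in tasks)
    return sum(c * (c / t) for c, _, t in tasks) / d_max


def speedup_factor(tasks, m):
    """Speedup factor 3 - (3*Delta + 1) / (m + Delta) for m processor cores."""
    dlt = delta(tasks)
    return 3 - (3 * dlt + 1) / (m + dlt)


# Hypothetical constrained-deadline task set (D <= T for every task).
example_tasks = [(1, 4, 5), (2, 8, 10), (3, 10, 20)]
example_factor = speedup_factor(example_tasks, m=4)
```

    Rearranging 3 − (3Δ + 1)/(m + Δ) < 3 − 1/m gives Δ(3m − 1) > 0, which holds for any Δ > 0 and m ≥ 1, consistent with the "strictly less than" claim.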
    Available online:  December 20, 2023 , DOI: 10.13328/j.cnki.jos.007047
    Abstract:
    As the scale of open-source artificial intelligence (AI) systems expands, software development and maintenance become difficult. GitHub is one of the most important hosting platforms for open-source projects. Developers can easily participate in the development of open-source projects through the pull request system provided by GitHub. The description of a pull request helps the project's core team understand the content of the pull request and the intention of the developer, and it promotes acceptance of the pull request. At present, a considerable proportion of developers do not provide a description for their pull requests, which not only increases the workload of the core team but also hinders future maintenance of the project. This study proposes a method named PRSim to automatically generate descriptions for pull requests. The method extracts features including commit messages, comment updates, and code changes from pull requests, builds a syntax modification tree, and uses a tree-structured autoencoder to find other pull requests with similar code changes. Then, with the help of the description of a similar pull request, it summarizes commit messages and comment updates through an encoder-decoder network to generate the description of the new pull request. The experimental results show that PRSim achieves F1 scores of 36.47%, 27.69%, and 35.37% on ROUGE-1, ROUGE-2, and ROUGE-L, respectively, which are 34.3%, 75.2%, and 55.3% higher than LeadCM, 16.2%, 22.9%, and 16.8% higher than Attn+PG+RL, and 23.5%, 72.0%, and 24.8% higher than PRHAN.
    Available online:  December 06, 2023 , DOI: 10.13328/j.cnki.jos.007033
    Abstract:
    In natural scenes, logos such as trademarks and traffic signs are susceptible to shooting angle, carrier deformation, and scale changes, which reduces logo detection accuracy. Thus, this study proposes an attention guided logo detection and recognition network (AGLDN) to jointly optimize the model robustness for multi-scale and complex deformation. First, a logo synthesis dataset is established by image collection and mask generation of logo templates, image selection of logo background, and logo image generation. Then, based on RetinaNet and FPN, multi-scale features are extracted and high-level semantic feature mapping is formed. Finally, the attention mechanism guided network is employed to focus on the logo area, and the influence of logo deformation on feature robustness is suppressed to improve logo detection and recognition. Experimental results show that the proposed method can reduce the influence of scale changes and non-rigid deformation, and improve detection accuracy.
    Available online:  December 06, 2023 , DOI: 10.13328/j.cnki.jos.007034
    Abstract:
    Because it is not limited by the size of the state space, formal verification based on mechanized theorem proving is an important method for ensuring software correctness and avoiding serious losses from potential software bugs. The LLRB (left-leaning red-black) tree is a variant of the binary search tree whose structure adds a left-leaning constraint to the traditional red-black tree. During its verification, conventional proof strategies cannot be employed, so more manual intervention and effort are required. Thus, LLRB correctness verification is widely acknowledged as a challenging problem. To this end, based on the Isabelle verification framework for binary search tree algorithms, this study refines the additional-property part of the framework and provides a concrete verification scheme. The LLRB insertion and deletion operations are modeled functionally in Isabelle, with modular treatment of the LLRB invariants. Subsequently, the functional correctness is verified. This is the first mechanized verification of functional LLRB insertion and deletion algorithms in Isabelle. Compared with the existing Dafny verification of the LLRB algorithm, the number of theorems is reduced from 158 to 84, and no intermediate assertions need to be constructed, which alleviates the verification burden. Meanwhile, this study provides a reference for the functional modeling and verification of complex tree structure algorithms.
    Available online:  December 06, 2023 , DOI: 10.13328/j.cnki.jos.007035
    Abstract:
    Detecting latent topics in social media texts is a meaningful task, but short and informal posts cause serious data sparsity. Additionally, models based on variational auto-encoders (VAEs) ignore the social relationships among users during topic inference, because a VAE assumes that each input data point is independent. As a result, the inferred latent topic variables lack correlation information and the topics are incoherent. Social network structure information can not only provide clues for aggregating contextual messages but also indicate topic correlation among users. Therefore, this study proposes a microblog topic model based on message passing and a graph prior distribution. The model encodes richer context information with a graph convolutional network (GCN) and integrates the interaction relationships of users through the graph prior distribution during VAE topic inference, so as to better capture the complex correlations among multiple data points and mine social media topic information. Experiments on three real-world datasets validate the effectiveness of the proposed model.
    Available online:  November 29, 2023 , DOI: 10.13328/j.cnki.jos.007003
    Abstract:
    In the field of cyber security, the fake domains generated by a domain generation algorithm (DGA) are called DGA domains. Like real domains, they are usually random combinations of letters or digits, which makes DGA domains highly deceptive. Hackers take advantage of this disguise to carry out cyber attacks and bypass security detection. How to effectively detect DGA domains has become a research hotspot. Traditional statistical machine learning detection methods require the manual construction of domain feature sets. However, the quality of domain features constructed manually or semi-automatically varies, which affects detection accuracy. In view of the powerful automatic feature extraction and representation capability of deep neural networks, a DGA domain detection method based on multi-view contrastive learning (MCL4DGA) is proposed. Unlike existing methods, it combines attentional neural networks, convolutional neural networks, and recurrent neural networks to effectively capture global, local, and bidirectional multi-view feature dependencies of domain sequences. In addition, the self-supervision signals derived from contrastive learning enhance the expressiveness of the multi-view feature learning encoders and thus improve detection accuracy. The effectiveness of the proposed method is verified by experimental comparison with current methods on a real dataset.
    Available online:  November 29, 2023 , DOI: 10.13328/j.cnki.jos.007005
    Abstract:
    Nowadays, deep neural networks (DNNs) are widely used in autonomous driving, medical diagnosis, speech recognition, face recognition, and other safety-critical fields. Therefore, DNN testing is critical to ensuring the quality of DNNs. However, labeling test cases to judge whether the predictions of a DNN model are correct is costly. Selecting the test cases that reveal incorrect behavior of DNN models and labeling them first can help developers debug DNN models as early as possible, thus improving the efficiency of DNN testing and ensuring the quality of DNN models. This study proposes a test case selection method based on data mutation, namely DMS. In this method, a data mutation operator is designed and implemented to generate mutated models that simulate model defects and capture the dynamic bug-revealing patterns of test cases, so as to evaluate the bug-revealing ability of each test case. Experiments are conducted on 25 combinations of deep learning test sets and models. The results show that DMS is significantly better than existing test case selection methods in terms of both the proportion of bug-revealing cases and the diversity of bug-revealing directions in the selected samples. Specifically, taking the original test set as the candidate set, DMS identifies 53.85%–99.22% of all bug-revealing test cases when selecting 10% of the test cases. Moreover, when 5% of the test cases are selected, the cases selected by DMS cover almost all bug-revealing directions. Compared with the eight baseline methods, DMS finds 12.38%–71.81% more bug-revealing cases on average, which demonstrates the effectiveness of DMS in test case selection.
    Available online:  November 29, 2023 , DOI: 10.13328/j.cnki.jos.007007
    Abstract:
    In real-world settings where data sources are diverse and manual labeling is difficult, semi-supervised multi-view classification algorithms have important research significance in various fields. In recent years, semi-supervised multi-view classification algorithms based on graph neural networks have made great progress. However, most existing graph neural networks fuse multi-view information only in the classification stage and neglect the interaction between the views of the same sample during the training stage. To solve this issue, this study proposes a model for semi-supervised classification named multi-view interaction graph convolutional network (MIGCN). A Transformer Encoder module is introduced into the graph convolution layers trained on different views to adaptively acquire complementary information between the views of the same sample during the training stage. More importantly, the study introduces a consistency constraint loss that makes the similarity relationships among the final feature representations of different views as consistent as possible. This allows the graph convolutional network to better exploit the consistency and complementarity information between views during the classification stage and further improves the robustness of the fused multi-view features. Extensive experiments on several real-world multi-view datasets demonstrate that, compared with graph-based semi-supervised multi-view classification models, MIGCN can better learn the essential features of multi-view data and thereby improve the accuracy of semi-supervised multi-view classification.
    Available online:  November 22, 2023 , DOI: 10.13328/j.cnki.jos.006968
    Abstract:
    Apache Flink is one of the most popular stream computing platforms and has many applications in industry. Complex event processing (CEP) is one of the important usage scenarios of stream computation. Apache Flink defines and implements a language for complex event processing (referred to as FlinkCEP). FlinkCEP includes rich syntactic features: not only the usual features of filtering, connecting, and looping, but also the advanced features of iterative conditions and after-match skip strategies. The semantics of FlinkCEP is complex, and no language specification defines it precisely, so it can only be understood by inspecting the implementation details. This motivates the definition of a formal semantics for FlinkCEP so that developers can understand its behavior precisely. This study proposes an automaton model called data stream transducers (DST) for FlinkCEP, where data variables are used to capture the iterative conditions, data stream variables are used to store the outputs, and transition priorities are introduced to capture the after-match skip strategies. DST is leveraged to define the formal semantics of FlinkCEP and to design query evaluation algorithms based on the formal semantics. Moreover, a prototype CEP engine is implemented. Finally, test case sets that cover the syntactic features of FlinkCEP more comprehensively are generated and used to compare the prototype against the actual results of FlinkCEP on the Flink platform. The experimental results show that the proposed formal semantics conforms to the actual semantics of FlinkCEP in the vast majority of cases. Furthermore, the inconsistencies between the formal and the actual semantics are analyzed, and it is discovered that the Flink implementation of FlinkCEP may not handle group patterns correctly.
    Available online:  November 15, 2023 , DOI: 10.13328/j.cnki.jos.007002
    [Abstract] (365) [HTML] (0) [PDF 2.01 M] (1109)
    Abstract:
    Temporal knowledge graph reasoning aims to fill in missing links or facts in knowledge graphs, where each fact is associated with a specific timestamp. The dynamic variational framework based on the variational autoencoder is particularly effective for this task. By jointly modeling entities and relations using Gaussian distributions, this method not only offers high interpretability but also solves complex probability distribution problems. However, traditional variational autoencoder-based methods often suffer from overfitting during training, which limits their ability to accurately capture the semantic evolution of entities over time. To address this challenge, this study proposes a new temporal knowledge graph reasoning model based on a diffusion probability distribution approach. Specifically, the model uses a bi-directional iterative process to divide the entity semantic modeling process into multiple sub-modules. Each sub-module uses a forward noising transformation and a backward Gaussian sampling step to model a small-scale evolution of entity semantics. Through the joint modeling of multiple sub-modules, the model learns the dynamic representation of entity semantics in the metric space over time and thus models them more accurately than the variational autoencoder-based method. In terms of the MRR metric, the model improves on the variational autoencoder-based method by 4.18% and 1.87% on the Yago11k and Wikidata12k datasets and by 1.63% and 2.48% on the ICEWS14 and ICEWS05-15 datasets, respectively.
    Available online:  November 15, 2023 , DOI: 10.13328/j.cnki.jos.006993
    Abstract:
    Text-based person retrieval is a developing downstream task of cross-modal retrieval that derives from conventional person re-identification and plays a vital role in public safety and person search. It addresses the lack of query images in traditional person re-identification, and its main challenge is that it combines two different modalities and requires the model to learn both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, traditional methods usually split image features and text features mechanically and focus only on cross-modal alignment, which ignores the potential relations between the person image and its description and leads to inaccurate cross-modal alignment. To address these issues, this study proposes a novel relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct the self-attention matrix and the cross-modal attention matrix, in which an attention matrix is regarded as the distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal attention matrix and the cross-modal attention matrix, respectively. The element-by-element reconstruction of the intra-modal attention matrix uncovers the potential relationships within each modality. Moreover, by taking the cross-modal information as a bridge, the holistic reconstruction of the cross-modal attention matrix fully exploits the potential information from different modalities and narrows the semantic gap. Finally, the model is jointly trained with a cross-modal projection matching loss and a KL divergence loss, so that the two modalities reinforce each other. Quantitative and qualitative results on a public text-based person search dataset (CUHK-PEDES) demonstrate that the proposed method performs favorably against state-of-the-art text-based person search methods.
    Available online:  November 15, 2023 , DOI: 10.13328/j.cnki.jos.006997
    Abstract:
    Safety-critical embedded software usually has rigorous time constraints on its runtime behaviors, raising additional requirements for enforcing security properties. To protect the information flow security of embedded software and mitigate the limitations of the existing simplex verification approaches and their potential false positives, this study first proposes a new timed noninterference property, i.e., timed SIR-NNI, based on the security requirement of a realistic scenario. The study then presents an information flow security verification approach that unifies the verification of multiple timed noninterference properties, i.e., timed BNNI, timed BSNNI, and timed SIR-NNI. Based on the different timed noninterference requirements, the approach constructs the refined automata and test automata from the timed automata under verification. The study uses UPPAAL’s reachability analysis to implement the refinement relation check and the security verification. The verification tool, i.e., TINIVER, extracts timed automata from SysML sequence diagrams or C++ source code to conduct the verification procedure. The verification results of TINIVER on existing timed automata models and security properties demonstrate the usability of the proposed approach. The security verifications on the typical flight-mode switch models of the UAV flight control systems ArduPilot and PX4 demonstrate the practicability and scalability of the proposed approach. In addition, the approach is effective in mitigating the false positives of a state-of-the-art verification approach.
    Available online:  November 15, 2023 , DOI: 10.13328/j.cnki.jos.006995
    Abstract:
    Multi-view clustering has attracted more and more attention in the fields of image processing, data mining, and machine learning. Existing multi-view clustering algorithms have two shortcomings. The first is that, during graph construction, only the pairwise relationships within each view’s data are considered when generating the affinity matrix, so neighborhood relationships are not characterized; the second is that existing methods separate multi-view information fusion from clustering, which reduces clustering performance. Therefore, this study proposes a more accurate and robust joint spectral embedding multi-view clustering algorithm based on bipartite graphs. First, based on the idea of multi-view subspace clustering, bipartite graphs are constructed and similarity graphs are generated. The spectral embedding matrices of the similarity graphs are then used to perform graph fusion. Second, by considering the importance of each view during the fusion process, weight constraints are applied, and an indicator matrix is introduced to obtain the final clustering result. A model is proposed to optimize the bipartite graph, embedding matrix, and clustering indicator matrix within a single framework. In addition, a fast optimization strategy for solving the model is provided, which decomposes the optimization problem into small subproblems and solves them efficiently through iterative steps. The proposed algorithm and existing multi-view clustering algorithms are compared experimentally on real datasets. The results show that the proposed algorithm is more effective and robust on multi-view clustering problems than existing methods.
    Available online:  November 08, 2023 , DOI: 10.13328/j.cnki.jos.006994
    Abstract:
    With the development of mobile services’ computing and sensing abilities, spatial crowdsourcing, which is based on location information, has emerged. There are many challenges in improving the performance of task assignment, one of which is how to assign workers the tasks that they are interested in. Existing methods focus only on workers’ temporal preference and ignore the impact of spatial factors; they also focus only on long-term preference, ignore workers’ short-term preference, and suffer from inaccurate predictions caused by sparse historical data. This study analyzes the task assignment problem based on long-term and short-term spatio-temporal preference. By comprehensively considering workers’ preferences from both long-term and short-term perspectives, as well as in the temporal and spatial dimensions, the study improves the quality of task assignment in terms of success rate and completion efficiency. To improve the accuracy of spatio-temporal preference prediction, the study proposes a sliced imputation-based context-aware tensor decomposition algorithm (SICTD) to reduce the proportion of missing values in preference tensors, and calculates short-term spatio-temporal preference through the ST-HITS algorithm and the short-term active range of workers under spatio-temporal constraints. To maximize the total task reward and the workers’ average preference for completing tasks, the study designs a spatio-temporal preference-aware greedy algorithm and a Kuhn-Munkres (KM) algorithm to optimize the results of task assignment. Extensive experimental results on real datasets show the effectiveness of the long- and short-term spatio-temporal preference-aware task assignment framework. Compared with baselines, the RMSE of the proposed SICTD for temporal and spatial preference prediction decreases by 22.55% and 24.17%, respectively. In terms of task assignment, the proposed preference-aware KM algorithm significantly outperforms the baseline algorithms, with the workers’ total reward and average preference for completing tasks increasing on average by 40.86% and 22.40%, respectively.
    Available online:  November 08, 2023 , DOI: 10.13328/j.cnki.jos.007001
    Abstract:
    As an important production factor, data need to be exchanged between different entities to create value. In this process, data integrity needs to be ensured; in other words, data must not be tampered with without authorization, as tampering may lead to extremely serious consequences. Existing work preserves data evidence by combining distributed ledgers with data encryption and verification technology to ensure the integrity of the exchanged data during transmission, storage, and other related data processing phases. However, such work can hardly confirm the integrity of the data provided by the data supplier. Once the data supplier provides forged data, all subsequent integrity assurance becomes meaningless. Therefore, this study proposes a method for verifying the integrity of data services based on remote attestation. By using the trusted execution environment as the trust anchor, this method can measure and verify the integrity of the static code, execution process, and execution result of a specific data service. It also optimizes the integrity verification of a specific data service through program slicing, thus extending the scope of data integrity assurance to the point in time when the data supplier provides the data. A series of experiments are carried out on 25 data services of three real Java information systems to validate the proposed method.
    Available online:  November 01, 2023 , DOI: 10.13328/j.cnki.jos.006986
    Abstract:
    Distributed storage systems are receiving more and more attention in mobile network scenarios. Data placement, a key technology of distributed storage, is crucial to improving the success rate of distributed data storage. However, because of unstable wireless signals and fluctuating network bandwidth in mobile environments, traditional data placement strategies, such as the random placement strategy and the storage-aware placement strategy, have low data transmission success rates, since neither takes network bandwidth into account during data placement. To solve this problem for mobile distributed storage systems, this study proposes a bandwidth-aware adaptive data placement strategy (BADP). The main breakthrough is that BADP adopts the group mobility model to sense the network bandwidth of nodes and takes the network bandwidth as an important factor for data placement, thus selecting nodes with good performance to achieve adaptive data placement and improve the success rate of data transmission. BADP has three design features: (1) adopting the group mobility model to sense the network bandwidth of nodes; (2) managing node information in groups to reduce communication overhead, and using a heap to build a node selection tree; (3) selecting nodes with good performance through adaptive data placement to improve the success rate of data transmission. Experiments show that when the network changes dynamically, BADP improves the success rate of data transmission by at least 30.6% and 34.6% compared with the random placement strategy and the storage-aware placement strategy, respectively. At the same time, it consistently keeps communication overhead low.
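The abstract does not specify how the heap-based node selection tree is built, so the following is only a minimal sketch of the general idea of bandwidth-aware selection. The `Node` fields, the `select_nodes` function, and the storage threshold are all hypothetical names introduced for illustration; they are not taken from BADP.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # All fields are illustrative assumptions, not BADP's actual node record.
    node_id: str
    bandwidth_mbps: float    # bandwidth estimate, e.g. sensed per mobility group
    free_storage_gb: float   # remaining storage capacity


def select_nodes(nodes, k, min_free_gb=0.0):
    """Pick the k eligible nodes with the highest sensed bandwidth.

    heapq.nlargest maintains a bounded heap internally, so selection costs
    O(n log k) rather than a full sort of all candidate nodes.
    """
    eligible = [n for n in nodes if n.free_storage_gb >= min_free_gb]
    return heapq.nlargest(k, eligible, key=lambda n: n.bandwidth_mbps)


if __name__ == "__main__":
    candidates = [
        Node("a", 12.0, 40.0),
        Node("b", 55.0, 5.0),
        Node("c", 31.0, 80.0),
        Node("d", 47.0, 60.0),
    ]
    # Node "b" has the best bandwidth but too little free storage.
    print([n.node_id for n in select_nodes(candidates, k=2, min_free_gb=10.0)])
```

A storage-aware strategy would rank by `free_storage_gb` alone; ranking by bandwidth among nodes that still satisfy a storage threshold is one simple way to combine the two factors.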
    Available online:  November 01, 2023 , DOI: 10.13328/j.cnki.jos.006987
    [Abstract] (465) [HTML] (0) [PDF 2.25 M] (1067)
    Abstract:
    Internet users must resolve domain names through DNS before accessing network applications, so DNS security is the first gateway to ensuring the normal operation of the network. If the security of DNS cannot be effectively guaranteed, attackers can make the network unusable by attacking the DNS system, however strong the security protection of other network systems is. At present, malicious DNS incidents are still on the rise, and DNS attack detection and defense technology still cannot meet practical needs. From the perspective of the recursive servers that directly serve users’ DNS requests, this study comprehensively summarizes the security problems faced in the DNS resolution process through two classification methods, covering the security events caused by attacks or system vulnerabilities, the detection methods for these security events, and the corresponding defense and protection technologies. For each category of security events, detection methods, and defense technologies, the study analyzes the characteristics of typical methods and discusses future research directions in the DNS security field.
    Available online:  October 25, 2023 , DOI: 10.13328/j.cnki.jos.007000
    Abstract:
    The fuzzy C-means (FCM) clustering algorithm has become one of the commonly used image segmentation techniques owing to its low learning cost and algorithmic overhead. However, the conventional FCM clustering algorithm is sensitive to noise in images. Recently, many improved FCM algorithms have been proposed to increase the noise robustness of the conventional FCM clustering algorithm, but often at the cost of losing image detail. This study presents an improved FCM clustering algorithm based on Lie group theory and applies it to image segmentation. The proposed algorithm constructs matrix Lie group features for the pixels of an image, which summarize the low-level image features of each pixel and its relationship with the other pixels in the neighborhood window. In this way, the proposed method transforms the clustering problem of measuring the Euclidean distances between pixels into calculating the geodesic distances between the Lie group features of pixels on the Lie group manifold. To update the clustering centers and the fuzzy membership matrix on the Lie group manifold, the proposed method uses an adaptive fuzzy weighted objective function, which improves the generalization and stability of the algorithm. The effectiveness of the proposed method is verified by comparison with conventional FCM and several classic improved algorithms in experiments on three types of medical images.
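For reference, the conventional FCM formulation that this abstract takes as its starting point can be written as below. This is the standard textbook form with Euclidean distance, not the adaptive fuzzy weighted objective proposed in the study; the symbols $N$ (number of pixels), $C$ (number of clusters), $m > 1$ (fuzzifier), $x_i$ (pixel feature), $c_j$ (cluster center), and $u_{ij}$ (membership) are notation assumed here.

```latex
% Conventional FCM objective, minimized subject to memberships summing to one.
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2},
\qquad \sum_{j=1}^{C} u_{ij} = 1 \quad \text{for all } i.

% Alternating updates of memberships and cluster centers.
u_{ij} = \left[ \sum_{k=1}^{C}
  \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}
\right]^{-1},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}.
```

According to the abstract, the proposed method replaces the Euclidean distance $\lVert x_i - c_j \rVert$ with a geodesic distance between Lie group features, so the center update can no longer be a plain weighted mean as it is here.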
    Available online:  October 25, 2023 , DOI: 10.13328/j.cnki.jos.006966
    Abstract:
    Current authentication protocols based on usernames and passwords struggle to meet increasing security requirements. Specifically, users choose different passwords for different online services, which greatly increases their memory burden. In addition, password authentication offers low security and is exposed to many known attacks. To solve these problems, this study proposes a user-centric two-factor authentication key agreement protocol, UC-2FAKA, based on the Pointcheval-Sanders signature. First, to prevent the leakage of authentication factors, two-factor credentials combining passwords and biometrics are constructed based on the Pointcheval-Sanders signature, and the user’s identity is authenticated to the service provider (SP) by means of a zero-knowledge proof. Second, under a user-centric single sign-on (SSO) architecture, users request identity credentials by registering with an identity provider (IDP) and use them to log in to different SPs, which prevents the IDP or the SPs from tracking or linking users. Third, the Diffie-Hellman key exchange is used to authenticate SP identities and negotiate communication keys to secure subsequent communication. Finally, a comprehensive security analysis and performance comparison of the proposed protocol are carried out. The results show that the proposed protocol resists various known attacks and performs better in communication overhead and computational overhead.
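The key negotiation step mentioned above builds on the Diffie-Hellman exchange. The following is a minimal sketch of the textbook finite-field version only: the group parameters `P` and `G` and the function names are illustrative assumptions, the sketch omits the SP authentication that UC-2FAKA adds, and it should not be read as the protocol's actual instantiation or used as a secure implementation.

```python
import secrets

# Toy parameters chosen for illustration: a 127-bit Mersenne prime modulus
# and a small base. Real deployments use standardized, much larger groups
# or elliptic curves.
P = 2**127 - 1
G = 3


def keypair():
    """Return (private exponent, public value G^private mod P)."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    return private, pow(G, private, P)


def shared_secret(own_private, peer_public):
    """Both sides compute G^(a*b) mod P from their own exponent."""
    return pow(peer_public, own_private, P)


if __name__ == "__main__":
    user_priv, user_pub = keypair()
    sp_priv, sp_pub = keypair()
    # Only the public values cross the network; the secrets match.
    assert shared_secret(user_priv, sp_pub) == shared_secret(sp_priv, user_pub)
    print("shared key established")
```

Plain Diffie-Hellman gives no assurance about who is on the other end, which is why the exchange has to be bound to an authentication step, as the protocol does for SP identities.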
    Available online:  October 25, 2023 , DOI: 10.13328/j.cnki.jos.006970
    Abstract:
    Existing hypergraph network representation methods need to process the full batch of nodes and hyperedges in order to recursively expand neighbors across layers, which incurs huge computational costs and lowers generalization accuracy because of over-expansion. To solve this problem, this study proposes a hypergraph network representation method based on importance sampling. First, the method treats nodes and hyperedges as two sets of independent and identically distributed samples that follow specific probability measures and interprets the structural feature interactions of the hypergraph in integral form. Second, it designs a neighbor importance sampling rule with learnable parameters and calculates sampling probabilities based on the physical relations and features of nodes and hyperedges. A fixed number of objects are recursively sampled layer by layer to construct a smaller sampled adjacency matrix. Finally, the spatial features of the entire hypergraph are approximated using Monte Carlo methods. In addition, drawing on physics-informed neural networks, the sampling variance to be reduced is added to the hypergraph neural network as a physical constraint to obtain sampling rules with better generalization capability. Extensive experiments on multiple datasets show that the proposed method obtains more accurate hypergraph representations with a faster convergence rate.
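The Monte Carlo approximation step can be illustrated with a generic importance sampling estimator. This is a minimal sketch of the underlying statistical idea, not the learnable sampling rule from the study: the function name, the scalar "features", and the fixed probabilities are assumptions made for illustration.

```python
import random


def importance_sample_sum(values, probs, num_samples, rng=None):
    """Estimate sum(values) from a fixed number of sampled entries.

    Index i is drawn with probability probs[i] (probabilities must be
    positive and sum to 1), and each draw is reweighted by 1 / probs[i],
    which keeps the estimator unbiased for any valid choice of probs.
    """
    rng = rng or random.Random()
    drawn = rng.choices(range(len(values)), weights=probs, k=num_samples)
    return sum(values[i] / probs[i] for i in drawn) / num_samples


if __name__ == "__main__":
    neighbor_features = [1.0, 2.0, 3.0, 4.0]   # hypothetical scalar features
    uniform = [0.25] * 4
    estimate = importance_sample_sum(
        neighbor_features, uniform, num_samples=1000, rng=random.Random(0)
    )
    print(estimate, "vs exact", sum(neighbor_features))
```

The variance of the estimator depends on how closely `probs` tracks the magnitude of `values`; when the two are exactly proportional, every draw returns the true sum. This is why a sampling rule that adapts to node and hyperedge features, with the variance used as a training constraint, can plausibly help.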
    Available online:  October 18, 2023 , DOI: 10.13328/j.cnki.jos.006971
    Abstract:
    Fast vulnerability root cause analysis is crucial for patching vulnerabilities and has always been a hotspot in academia and industry. Existing vulnerability root cause analysis methods, which are based on statistical feature analysis of the execution records of a large number of test samples, suffer from problems such as random noise and missing instructions that carry important logical correlations. According to the test set measurements in this study, the proportion of random noise in the existing statistical methods exceeds 61%. To solve these problems, this study proposes a vulnerability root cause analysis method based on the local path graph, which extracts vulnerability-related information such as the inter-function call graph and the intra-function control flow transfer graph from the execution paths. The local path graph is used to eliminate irrelevant instructions (i.e., noise instructions), construct the logical relations among the points relevant to the vulnerability root cause, and add missing critical instructions. An automated root cause analysis system for binary software, LGBRoot, has been implemented. The effectiveness of the system has been evaluated on a dataset of 20 public CVE memory corruption vulnerabilities. The average time for single-sample root cause analysis is 12.4 seconds. The experimental data show that the system can automatically eliminate 56.2% of noise instructions, and repair as well as visualize the 20 logical structures of the root-cause-relevant points, speeding up vulnerability analysis for analysts.
    Available online:  October 18, 2023 , DOI: 10.13328/j.cnki.jos.006991
    Abstract:
    Conformance checking is one of the important scenarios in the field of process mining. Its goal is to determine whether the actual business behavior is consistent with the desired behavior and thereby provide a basis for business process management decisions. Traditional conformance checking methods suffer from an excess of metrics and low efficiency. In addition, the existing methods for checking the conformance between process text and process model rely heavily on expert-defined knowledge. Therefore, this study proposes a process text-oriented conformance checking method. First, the study generates graph traces based on the execution semantics of the process model and obtains structural features from the graph traces with a word vector model. Huffman trees are introduced at this stage to reduce the computational effort. Then, word vector representations of the process text and the activities are computed. The study also uses the Siamese mechanism to improve training efficiency. Finally, all the features of the text and the model are fused, and the consistency score between the text and the model is predicted using a fully connected layer. Experiments show that the mean absolute error of the proposed method is two percentage points lower than that of existing methods.
    Available online:  October 18, 2023 , DOI: 10.13328/j.cnki.jos.006976
    Abstract:
    Disassembling binary code is hard but necessary for improving the security of binary software. One of the major reasons for this difficulty is that compilers create many jump tables in the binary code for efficiency. Mainstream disassembly tools use various strategies to resolve the targets of jump tables. However, the implementation details of these strategies and their effectiveness are not well studied. To help researchers better understand the algorithm implementation and performance of disassembly tools, this study first systematically summarizes the strategies used by disassembly tools to resolve jump tables. The study then builds an automatic framework for testing jump tables, based on which a large-scale test suite on jump tables (2,410,455 jump tables) can be generated. Lastly, the study evaluates the performance of the disassembly tools in resolving jump tables on the test suite and manually analyzes the errors introduced by each strategy of the disassembly tools. In addition, thanks to the systematic summary of the algorithm implementations, the study finds six bugs in the implementation of the disassembly tools.
    Available online:  October 18, 2023 , DOI: 10.13328/j.cnki.jos.006977
    Abstract:
    Database performance is directly affected by the settings of database configuration parameters, so the quality of the parameter tuning method is important. However, traditional database parameter tuning methods have many limitations, such as the inability to make full use of historical tuning data and the waste of time and human resources. Counterfactual interpretation methods aim to change the original prediction into the expected prediction by making small modifications to the original data. Because such methods act as suggestions, they can be used for database configuration optimization, namely, making small modifications to the database configuration to improve database performance. Therefore, this study proposes a counterfactual interpretation method for database configuration optimization. For databases that perform poorly under specific load conditions, this method modifies the database configuration and generates the corresponding configuration counterfactuals to optimize database performance. The study conducts two kinds of experiments to evaluate the counterfactual interpretation method and to verify its effect on database optimization. The experimental results show that the proposed counterfactual interpretation method is better than other typical counterfactual interpretation methods on various evaluation indicators, and the generated counterfactuals effectively improve database performance.
    Available online:  October 11, 2023 , DOI: 10.13328/j.cnki.jos.006978
    [Abstract] (416) [HTML] (0) [PDF 3.90 M] (1097)
    Abstract:
    Many two-party threshold schemes for SM2 digital signatures have been proposed in recent years, which can significantly enhance the security of private keys for SM2 digital signatures. According to the method of key splitting, the published schemes can be divided into two types: multiplicative key splitting and additive key splitting. These schemes can be further subdivided according to how the signature random number is constructed. This study proposes a framework of two-party threshold schemes for SM2 digital signatures, which provides a secure basic calculation process for two-party threshold schemes and introduces a signature random number that can be constructed in various ways. By instantiating the proposed framework with various constructions of the random number, this study obtains a variety of two-party threshold schemes for SM2 digital signatures. The instantiations include 23 known two-party threshold schemes, as well as a variety of new schemes.
    Available online:  October 11, 2023 , DOI: 10.13328/j.cnki.jos.006990
    Abstract:
    Informatization 3.0, represented by deep mining and fusion applications of big data, is beginning, and software in the traditional static environment is evolving into complex software in the open and dynamic human-cyber-physical ternary environment. How to realize trusted, manageable, and controllable data interconnection on the untrusted and uncontrollable Internet is an urgent problem. The Internet of Data technical system, represented by digital object architecture and identifier resolution technology, provides a feasible solution to these challenges. To address problems such as low transmission efficiency, high coordination cost, and difficult security management in the process of data sharing on the Internet, this study proposes identifier resolution standard specifications for human-cyber-physical ternary environments. Moreover, to meet the demand that data resources owned by different entities be discoverable, accessible, understandable, trustworthy, and interoperable in the human-cyber-physical ternary environment, this study designs the identifier resolution protocol and implements an identifier resolution prototype system for human-cyber-physical ternary environments. Finally, the study tests the performance of the prototype system and verifies its validity by applying it to application scenarios.
    Available online:  October 11, 2023 , DOI: 10.13328/j.cnki.jos.006974
    Abstract:
    Functions are the smallest named units of aggregated behavior in most traditional programming languages. The readability of function names plays a vital role in programmers’ understanding of program functionality and of the interaction between different modules. Low-quality function names may confuse developers, introduce code smells, and result in software defects caused by API misuse. Therefore, a deep learning-based method for function name consistency checking and recommendation, named DMName, is proposed. First, for the given source code of the target function, the internal context, interactive context, sibling context, and closed context are constructed, and the context information tag sequence is obtained by merging them. The tag sequence is then converted into a sequence of context representation vectors using the word embedding technology FastText and fed into the encoder of a seq2seq model. The copy mechanism and the coverage mechanism are used to solve the out-of-vocabulary (OOV) problem and the repeated decoding problem, respectively. Finally, the vector sequence of the predicted target function name is output, and the consistency of the function name is predicted with a two-channel CNN classifier. If the function name is inconsistent, the recommended function name is obtained by direct mapping according to vector space similarity matching. The experimental results show that the F1-measure of DMName in function name consistency checking and recommendation reaches 82.65% and 73.31%, respectively, which is 2.01% and 2.96% higher than that of the current best method, DeepName. DMName is further verified on lancia, a large-scale open-source project on GitHub. A total of 16 function name inconsistency problems are found, and reasonable name recommendations are made, which further confirms the effectiveness of DMName.
    Available online:  October 11, 2023 , DOI: 10.13328/j.cnki.jos.006982
    Abstract:
    Static analysis tools often suffer from high false positive rates in their reported alarms, despite their ability to help developers detect potential defects early in the software development life cycle. To improve the usability of these tools, many automated warning identification techniques have been proposed to assist developers in classifying false positive alarms. However, existing approaches mainly use hand-engineered features or statement-level abstract syntax tree token sequences to represent the defective code and fail to capture the semantics of the reported alarms. To overcome these limitations, this study employs deep neural networks, with their powerful feature extraction and representation abilities, to generate code semantics from control flow graph paths for warning identification. The control flow graph abstractly represents the execution process of a given program. Thus, the path sequences generated from the control flow graph can guide the deep neural networks to learn semantic information about the potential defect more accurately. In this study, a pre-trained language model is fine-tuned to encode the path sequences and capture the semantic representations for model building. Finally, the study conducts extensive experiments on eight open-source projects to verify the effectiveness of the proposed approach by comparing it with state-of-the-art baselines.
    Available online:  October 11, 2023 , DOI: 10.13328/j.cnki.jos.006965
    Abstract:
    The major challenges facing traditional operating system (OS) design are the increasing number, diversity, and distribution scope of the resources to be managed and the frequent changes in system state. However, the structures of existing OSs have become the biggest obstacle to solving these problems because (1) the tight coupling and centralization of the structure lead to poor flexibility and scalability and to fragmented OS ecosystems, and (2) a unitary isolation mechanism such as kernel-user isolation creates conflicts between capabilities, e.g., security and performance. Therefore, this study combines hierarchical software bus (softbus) principles with isolation mechanisms to organize the OS and proposes a new OS model termed Yggdrasil. Yggdrasil decomposes an OS into component nodes connected by softbuses, and all communication between nodes is standardized as message passing via the softbus. To support the division of isolated states such as supervisor mode and of different software hierarchies, Yggdrasil introduces bridge nodes for cascading and controlled communication between softbuses, and it enhances the logical representation capability and scalability of the OS through a self-similar topology. Additionally, the simplicity and hierarchy of the softbus help to achieve decentralization. To verify the feasibility of Yggdrasil, the study builds the hierarchical softbus model for OS (HiBuOS) and demonstrates the feasibility of developing a new OS based on Yggdrasil’s ideas through three specific designs: (1) designing and planning a hierarchical softbus structure according to the scale and requirements of the target operating system; (2) selecting specific isolation and communication mechanisms to instantiate bridge nodes and softbuses; (3) realizing OS services based on the hierarchical softbus style. 
Finally, the evaluation shows that HiBuOS has notable potential and advantages in enhancing system scalability, security, performance, and ecosystem development, without incurring significant performance loss.
    Available online:  September 27, 2023 , DOI: 10.13328/j.cnki.jos.006972
    Abstract:
    Subset repair for inconsistent data is an important research problem in the field of data cleaning. Most existing methods are based on integrity constraint rules and perform subset repair by minimizing the number of deleted tuples. However, these methods take no account of the quality of the deleted tuples, so their repair accuracy is low. Therefore, this study proposes a subset repair method that combines rules and probabilities. The probability of inconsistent tuples is modeled so that the average probability of correct tuples is greater than that of wrong tuples, and the optimal subset repair, i.e., the one with the smallest sum of probabilities over the deleted tuples, is computed. In addition, to reduce the time overhead of calculating the probabilities of inconsistent tuples, this study proposes an efficient error detection method that reduces the number of inconsistent tuples. Experimental results on real and synthetic data verify that the proposed method outperforms the state-of-the-art subset repair method in terms of accuracy.
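    The objective described in this abstract (delete a set of tuples that resolves every conflict while minimizing the summed probability of the deleted tuples) can be illustrated with a small sketch. This is not the authors' algorithm: the probabilities, the pairwise conflict list, and the function name `optimal_subset_repair` are hypothetical, and the exhaustive search is only practical for toy inputs.

```python
from itertools import combinations

def optimal_subset_repair(probs, conflicts):
    """Return (deleted_ids, cost) for the cheapest conflict-free repair.

    probs: dict tuple_id -> assumed probability that the tuple is correct
    conflicts: list of (id_a, id_b) pairs that jointly violate a constraint
    """
    involved = sorted({t for pair in conflicts for t in pair})
    # Worst case: delete every tuple that takes part in a conflict.
    best_deleted = set(involved)
    best_cost = sum(probs[t] for t in involved)
    # Exhaustive search over candidate deletions (exponential; toy sizes only).
    for size in range(len(involved) + 1):
        for deleted in combinations(involved, size):
            chosen = set(deleted)
            # A valid repair removes at least one tuple from every conflict.
            if all(a in chosen or b in chosen for a, b in conflicts):
                cost = sum(probs[t] for t in chosen)
                if cost < best_cost:
                    best_deleted, best_cost = chosen, cost
    return best_deleted, best_cost

# Hypothetical data: t1 and t4 look reliable, t2 and t3 do not.
probs = {"t1": 0.9, "t2": 0.2, "t3": 0.3, "t4": 0.8}
conflicts = [("t1", "t2"), ("t1", "t3"), ("t3", "t4")]
deleted, cost = optimal_subset_repair(probs, conflicts)
print(sorted(deleted), cost)
```

    In this example three different repairs delete exactly two tuples, so a minimum-count criterion cannot distinguish between them; the probability-weighted objective picks the one that removes the two least trustworthy tuples.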
    Available online:  September 27, 2023 , DOI: 10.13328/j.cnki.jos.006973
    Abstract:
    In recent years, software system security has attracted increasing attention. Security threats in a system can be easily exploited by attackers, who typically use techniques such as password brute-force cracking, phishing, and SQL injection. Threat modeling is a method for structurally analyzing, identifying, and handling threats. Traditional testing mainly targets code defects and takes place late in software development, so it cannot make good use of the results of early threat modeling and analysis to build secure software. Moreover, threat modeling tools in industry lack the ability to generate security tests. To tackle this problem, this study proposes a framework that generates security test cases from threat models, and designs and implements a tool prototype. To facilitate testing, the study improves the traditional attack tree model and performs compliance checks on it. Test scenarios can be automatically generated from the model. The test scenarios are evaluated according to the probabilities of the attack nodes, and scenarios for threats with higher probabilities are tested first. The defense nodes are also evaluated, and the defense scheme with the higher profit is selected to mitigate the threats and thereby improve the system’s security design. By setting parameters for the attack nodes, test scenarios can be instantiated as test cases. In the early stage of software development, taking the threats identified by threat modeling as input, the framework and tool generate test cases that guide subsequent security development and test design, which improves the integration of security techniques into software design and development. A case study applies the framework and tool to test generation for very high security risks and shows their effectiveness.
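    The idea of deriving test scenarios from an attack tree and testing the most probable ones first can be sketched as follows. The tree encoding (dicts with `type`, `children`, `name`, `p`), the probabilities, and the assumption that attack steps succeed independently are all illustrative choices, not details taken from the study.

```python
from itertools import product

def scenarios(node):
    """Enumerate attack scenarios as (probability, [leaf step names])."""
    kind = node["type"]
    if kind == "leaf":
        return [(node["p"], [node["name"]])]
    child_sets = [scenarios(child) for child in node["children"]]
    if kind == "OR":
        # Any single child scenario is enough to reach the goal.
        return [s for child in child_sets for s in child]
    # AND: one scenario from every child must succeed together.
    combined = []
    for combo in product(*child_sets):
        prob, steps = 1.0, []
        for child_prob, child_steps in combo:
            prob *= child_prob          # assumes independent steps
            steps = steps + child_steps
        combined.append((prob, steps))
    return combined

# Hypothetical attack tree: steal credentials via SQL injection,
# or via phishing followed by password brute forcing.
tree = {
    "type": "OR",
    "children": [
        {"type": "leaf", "name": "SQL injection", "p": 0.3},
        {"type": "AND", "children": [
            {"type": "leaf", "name": "phishing", "p": 0.5},
            {"type": "leaf", "name": "password brute force", "p": 0.4},
        ]},
    ],
}

# Highest-probability scenarios are scheduled for testing first.
ranked = sorted(scenarios(tree), key=lambda s: s[0], reverse=True)
for prob, steps in ranked:
    print(prob, " -> ".join(steps))
```

    Turning a ranked scenario into a concrete test case would additionally require binding parameters (target URL, payload, account) to each step, which this sketch leaves out.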
    Available online:  September 27, 2023 , DOI: 10.13328/j.cnki.jos.006957
    Abstract:
    As one of the ten block cipher algorithms selected for the second round of the 2018 National Cryptographic Algorithm Design Contest, Feistel-based block cipher (FBC) is an efficient and lightweight block cipher algorithm with a four-branch and two-fold Feistel structure. In this study, the FBC algorithm is abstracted as the FBC model, and the pseudorandomness and super-pseudorandomness of the model are studied. It is assumed that the FBC round functions are independent random functions, and a method to find the minimal number of FBC rounds is provided, which will keep FBC indistinguishable from a random permutation. Finally, the study comes to the conclusion that under the chosen-plaintext attack, four rounds of FBC are indistinguishable from random permutation, so the model has pseudorandomness; under the adaptive chosen-plaintext and ciphertext attack, five rounds of FBC are indistinguishable from random permutation, so the model has super-pseudorandomness.
    Available online:  September 27, 2023 , DOI: 10.13328/j.cnki.jos.006998
    Abstract:
    Multimodal sentiment analysis is a task that uses subjective information from multiple modalities to analyze sentiment. Exploring how to effectively learn the interaction between modalities has always been an essential task in multimodal analysis. In recent research, it is found that the learning rate of different modalities is unbalanced, leading to the convergence of one modality while the rest of the modalities are under-fitting, which weakens the effect of multimodal collaborative decision-making. In order to combine multiple modalities more effectively and learn the multimodal sentiment features with rich expression, this study proposes a multimodal sentiment analysis method based on adaptive weight fusion. The method of adaptive weight fusion is divided into two phases. The first phase is to adaptively change the fusion weights of unimodal feature representations according to the difference of unimodal learning gradients to dynamically balance the modal learning rate. The study calls this phase balanced fusion (B-fusion). The second phase is to eliminate the impact of the fusion weights of B-fusion on task analysis, propose the modal attention to explore the contributions of modalities to the task, and dynamically allocate the fusion weight to each modality. The study calls this phase attention fusion (A-fusion). The experimental results show that the introduction of the B-fusion method into existing multimodal sentiment analysis methods can effectively improve the accuracy of sentiment analysis. The ablation experiment results show that adding the A-fusion method to B-fusion can effectively reduce the impact of B-fusion weights on the task, which is conducive to improving the analysis results of sentiment analysis. 
Compared with existing multimodal sentiment analysis models, the proposed method has a simpler structure, lower computational cost, and higher task accuracy, which shows that it is both efficient and effective in multimodal sentiment analysis tasks.
    Available online:  September 27, 2023 , DOI: 10.13328/j.cnki.jos.006999
    Abstract:
    Revealing the complex relations among emotions is an important fundamental problem in cognitive psychology. From the perspective of natural language processing, the key to exploring the relations among emotions lies in the embedded representation of emotion categories. Recently, there has been some interest in obtaining category representations in the emotion space that can characterize emotion relations. However, existing methods for emotion category representation have several drawbacks. For example, the dimensionality of the emotion category representation is fixed and depends on the selected dataset. To obtain better representations of the emotion categories, this study introduces a supervised contrastive learning representation method. In previous supervised contrastive learning, the similarity between samples depends on the similarity of their annotated labels. To better reflect the complex relations among different emotion categories, the study further proposes a partially similar supervised contrastive learning representation method, which allows samples of different emotion categories (e.g., anger and annoyance) to be partially similar to each other. Finally, the study conducts a series of experiments to compare the proposed method with five benchmark methods in representing the relations among emotion categories. The experimental results show that the proposed method achieves satisfactory results for emotion category representation.
    Available online:  September 20, 2023 , DOI: 10.13328/j.cnki.jos.006955
    Abstract:
    Detecting the human respiration waveform during sleep is crucial for intelligent health care and medical applications, because different respiration waveform patterns can be examined to analyze sleep quality and monitor respiratory diseases. Traditional respiration sensing methods based on contact devices cause various inconveniences to users. In contrast, contactless sensing methods are more suitable for continuous monitoring. However, the randomness of device deployment, sleep posture, and human motion during sleep severely restricts the application of contactless respiration sensing solutions in daily life. For this reason, the study proposes a detection method for the human respiration waveform in the sleep state based on impulse radio-ultra wide band (IR-UWB). On the basis of the periodic changes in the propagation path of the wireless pulse signal caused by the undulation of the human chest during respiration in the sleep state, the proposed method generates a fine-grained human respiration waveform and thereby achieves real-time output of the respiration waveform and high-precision respiratory rate estimation. Specifically, to obtain the position of the human chest during respiration from the received wireless radio-frequency (RF) signals, this study proposes an indicator, the respiration energy ratio, based on IR-UWB signals to estimate the target position. Then, it puts forward a vector projection method based on the in-phase/quadrature (I/Q) complex plane and a method of projection signal selection based on the circumferential position of the respiration vector to extract the characteristic human respiration waveform from the reflected signal. Finally, a variational encoder-decoder network is leveraged to achieve the fine-grained recovery of the respiratory waveform in the sleep state. 
Extensive experiments and tests are conducted under different conditions, and the results show that the human respiration waveforms monitored by the proposed method in the sleep state are highly similar to the actual waveforms captured by commercial respiratory belts. The average error of the proposed method in estimating the human respiratory rate is 0.229 bpm, indicating that the method can achieve high-precision detection of the human respiration waveform in the sleep state.
    Available online:  September 20, 2023 , DOI: 10.13328/j.cnki.jos.006956
    Abstract:
    Detecting out-of-distribution (OOD) samples, i.e., samples that fall outside the distribution of the training set, is essential for a safe and reliable machine learning system. Likelihood-based generative models are popular for OOD detection because they do not require sample labels during training. However, recent studies show that likelihoods sometimes fail to detect OOD samples, and the reasons for this failure and its solutions are underexplored, especially for text data. Therefore, this study investigates the reasons for the failure on text from the perspectives of the model and the data: insufficient generalization of the generative model and prior probability bias of the text. To tackle these problems, the study proposes a new OOD text detection method, namely Pobe. To address the insufficient generalization of the generative model, the study improves model generalization via KNN retrieval. Next, to address the prior probability bias of the text, the study designs a strategy that uses a pre-trained language model to calibrate the bias and mitigate its influence on OOD detection, and it demonstrates the effectiveness of the strategy according to Bayes’ theorem. Experimental results over a wide range of datasets show the effectiveness of the proposed method. Specifically, the average AUROC is over 99%, and FPR95 is below 1% on eight datasets.
    Available online:  September 13, 2023 , DOI: 10.13328/j.cnki.jos.006964
    Abstract:
    Domain names play an important role in cybercrime. Existing malicious domain name detection methods struggle to exploit rich topology and attribute information and also require a large amount of labeled data, which results in limited detection performance and high costs. To address this problem, this study proposes a malicious domain name detection method based on graph contrastive learning. Domain names and IP addresses are taken as two types of nodes in a heterogeneous graph, and the feature matrix of the corresponding nodes is built from their attributes. Three types of meta paths are constructed based on the inclusion relationship between domain names, the similarity measure, and the correspondence between domain names and IP addresses. In the pre-training stage, a contrastive learning model based on an asymmetric encoder is applied to avoid the damage to graph structure and semantics caused by graph data augmentation and to reduce the demand for computing resources. With the inductive graph neural network encoders HeteroSAGE and HeteroGAT, a node-centric mini-batch training strategy is adopted to explore the aggregation relationship between the target node and its neighbor nodes, which solves the poor applicability of transductive graph neural networks such as GCN in dynamic scenarios. The downstream classification task uses logistic regression and random forest algorithms for comparison. Experimental results on publicly available datasets show that detection performance is improved by two to six percentage points compared with that of related works.
    Available online:  September 13, 2023 , DOI: 10.13328/j.cnki.jos.006959
    Abstract:
    The openness and ease of use of Python make it one of the most commonly used programming languages. The PyPI ecosystem built around Python not only provides convenience for developers but has also become an important target for attackers exploiting vulnerabilities. Thus, once a Python vulnerability is discovered, it is critical to assess its impact scope accurately and comprehensively. However, current methods for assessing the impact scope of Python vulnerabilities mainly rely on dependency analysis at package granularity, which produces a large number of false positives. On the other hand, existing Python program analysis methods at function granularity have accuracy problems due to context insensitivity and also produce false positives when applied to assessing the impact scope of vulnerabilities. This study proposes a vulnerability impact scope assessment method for the PyPI ecosystem based on static analysis, namely PyVul++. First, it builds an index of the PyPI ecosystem; then it finds the candidate packages affected by the vulnerability through vulnerable function identification and confirms the vulnerable packages through the vulnerability trigger condition. PyVul++ realizes vulnerability impact scope assessment at function granularity, improves function-level call analysis for Python code, and outperforms other tools on the PyCG benchmark (accuracy of 86.71% and recall of 83.20%). PyVul++ is used to assess the impact scope of 10 Python CVE vulnerabilities on the PyPI ecosystem (385,855 packages); it finds more vulnerable packages and reduces false positives compared with other tools such as pip-audit. In addition, across these 10 assessment experiments, PyVul++ newly finds that 11 packages in the current PyPI ecosystem still reference unpatched vulnerable functions.
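    The difference between package-level and function-level impact assessment can be shown with a reachability check over a call graph. The sketch below is an assumption-laden illustration, not PyVul++ itself: the call graph is written by hand (a real tool would extract it by static analysis), and all package and function names are hypothetical.

```python
from collections import deque

def reaches(call_graph, entry, target):
    """Breadth-first search: can `entry` transitively call `target`?"""
    seen, queue = {entry}, deque([entry])
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False

# Hypothetical call graph: both applications depend on package `libx`,
# but only app_b reaches the vulnerable function.
call_graph = {
    "app_a.main": ["libx.safe_parse"],
    "app_b.main": ["libx.load"],
    "libx.load": ["libx.unsafe_eval"],
    "libx.safe_parse": [],
}
VULNERABLE = "libx.unsafe_eval"

affected = [
    pkg for pkg in ("app_a", "app_b")
    if reaches(call_graph, f"{pkg}.main", VULNERABLE)
]
print(affected)
```

    A package-level dependency analysis would flag both `app_a` and `app_b` because both depend on `libx`; the function-level check reports only the package that can actually reach the vulnerable function.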
    Available online:  September 13, 2023 , DOI: 10.13328/j.cnki.jos.006945
    Abstract:
    Jacobi computation is a kind of stencil computation, which has been widely applied in the field of scientific computing. The performance optimization of Jacobi computation is a classic topic, where loop tiling is an effective optimization method. The existing loop tiling methods mainly focus on the impact of tiling on parallel communication and program locality and fail to consider other factors such as load balancing and vectorization. This study analyzes and compares several tiling methods based on multi-core computing architecture and chooses an advanced hexagonal tiling as the main method to accelerate Jacobi computation. For tile size selection, this study proposes a hexagonal tile size selection algorithm called Hexagon_TSS by comprehensively considering the impact of tiling on load balancing, vectorization efficiency, and locality. The experimental results show that the L1 data cache miss rate can be reduced to 5.46% of original serial program computation in the best case by Hexagon_TSS, and the maximum speedup reaches 24.48. The proposed method also has excellent scalability.
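    For readers unfamiliar with the computation pattern, the following sketch shows a 1D three-point Jacobi stencil and a version whose spatial loop is split into tiles. It uses plain rectangular tiling of the space dimension only, as a much simpler stand-in for the hexagonal tiling discussed above, and it is not Hexagon_TSS; the function names and the tile size are illustrative.

```python
def jacobi_1d(u, steps):
    """Run `steps` Jacobi sweeps of a 3-point stencil with fixed boundaries."""
    cur = list(u)
    for _ in range(steps):
        nxt = cur[:]  # boundary values are carried over unchanged
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur

def jacobi_1d_tiled(u, steps, tile):
    """Same computation, with the spatial loop split into blocks of `tile` points."""
    cur = list(u)
    n = len(cur)
    for _ in range(steps):
        nxt = cur[:]
        for start in range(1, n - 1, tile):
            # Each tile reads only from `cur`, so tiles are independent
            # within a sweep and could be assigned to different cores.
            for i in range(start, min(start + tile, n - 1)):
                nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur

grid = [0, 0, 0, 9, 0, 0, 0]
print(jacobi_1d(grid, 1))
print(jacobi_1d_tiled(grid, 5, 3) == jacobi_1d(grid, 5))
```

    Tiling across the time dimension as well, which is where hexagonal shapes come in, requires handling dependencies between neighboring tiles and is beyond this sketch.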
    Available online:  September 13, 2023 , DOI: 10.13328/j.cnki.jos.006947
    Abstract:
    Software change prediction, aimed at identifying change-prone modules, can help software managers and developers allocate resources efficiently and reduce maintenance overhead. Extracting effective features from the code plays a vital role in the construction of accurate prediction models. In recent years, researchers have shifted from traditional hand-crafted features to semantic features with powerful representation capabilities, extracting them from abstract syntax tree (AST) node sequences to build prediction models. However, existing studies have ignored the structural information in the AST and the rich semantic information in the code, and how to extract the semantic features of the code remains a challenging problem. For this reason, the study proposes a change prediction method based on hybrid graph representation. To start with, the model combines the AST, control flow graph (CFG), data flow graph (DFG), and other structural information to construct a program graph representation of the code. Then, it uses a graph neural network to learn the semantic features of the program graph and uses the learned features to predict change-proneness. The model can integrate various kinds of semantic information to represent the code better. The effectiveness of the proposed method is verified by comparing it with the latest change prediction methods on various change datasets.
    Available online:  September 06, 2023 , DOI: 10.13328/j.cnki.jos.006960
    Abstract:
    Thanks to its low storage cost and high retrieval speed, graph-based unsupervised cross-modal hash learning has attracted much attention from academic and industrial researchers and has become an indispensable tool for cross-modal retrieval. However, the high computational complexity of graph structures prevents its use in large-scale multi-modal applications. This study attempts to solve two important challenges facing graph-based unsupervised cross-modal hash learning: 1) how to efficiently construct graphs in unsupervised cross-modal hash learning, and 2) how to handle the discrete optimization in cross-modal hash learning. To address these two problems, this study presents anchor-based cross-modal learning and a differentiable hash layer. Specifically, the study first randomly samples some image-text pairs from the training set as anchor sets and uses the anchor sets as agents to compute the graph matrix of each batch of data. The graph matrix is used to guide cross-modal hash learning, which remarkably reduces the space and time cost. Second, the proposed differentiable hash layer directly uses binary codes for computation during forward propagation and produces gradients to update the network during backpropagation without continuous-value relaxation, thus achieving better hash encoding performance. Finally, the study introduces a cross-modal ranking loss to take the ranking results into account during training and improve cross-modal retrieval accuracy. To verify the effectiveness of the proposed algorithm, the study compares it with 10 cross-modal hash algorithms on three general datasets.
    Available online:  September 06, 2023 , DOI: 10.13328/j.cnki.jos.006963
    Abstract:
    Aspect-level sentiment classification task, which aims to determine the sentiment polarity of a given aspect, has attracted increasing attention due to its broad applications. The key to this task is to identify contextual descriptions relevant to the given aspect and predict the aspect-related sentiment orientation of the author according to the context. Statistically, it is found that close to 30% of reviews convey a clear sentiment orientation without any explicit sentiment description of the given aspect, which is called implicit sentiment expression. Recent attention mechanism-based neural network methods have gained great achievement in sentiment analysis. However, this kind of method can only capture explicit aspect-related sentiment descriptions but fails to effectively explore and analyze implicit sentiment, and it often models aspect words and sentence contexts separately, which makes the expression of aspect words lack contextual semantics. To solve the above two problems, this study proposes an aspect-level sentiment classification method that integrates local aspect information and global sentence context information and improves the classification performance of the model by curriculum learning according to different classification difficulties of implicit and explicit sentiment sentences. Experimental results show that the proposed method not only has a high accuracy in identifying the aspect-related sentiment orientation of explicit sentiment sentences but also can effectively learn the sentiment categories of implicit sentiment sentences.
    Available online:  September 06, 2023 , DOI: 10.13328/j.cnki.jos.006969
    Abstract:
    As an essential component of real-time system design, priority is used to resolve conflicts in resource sharing and to design for safety. In real-time systems that introduce priorities, each task is assigned a priority, so low-priority tasks may be preempted by high-priority tasks at runtime, which creates a preemptive scheduling problem. Existing research on this problem lacks a modeling and automatic verification method that can visually represent the priorities of tasks and the dependencies between tasks. To this end, the preemptive priority timed automaton (PPTA) is proposed and the preemptive priority timed automata network (PPTAN) is introduced. First, the priority of a task is represented by adding priorities to the transitions of the timed automaton, and transitions are then used to link tasks that have dependencies, so that PPTA can model real-time tasks with priorities. Blocking locations are also added to the timed automaton, so that PPTAN can model the priority preemptive scheduling problem. Second, a model-based transformation method is proposed to map PPTA to the automatic verification tool UPPAAL. Finally, by modeling an example of a multi-core multi-task real-time system and comparing the result with other models, it is shown that this model is suitable not only for modeling the priority preemptive scheduling problem but also for accurately verifying and analyzing it.
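    The scheduling behavior that such a model has to capture can be made concrete with a tiny discrete-time simulation of fixed-priority preemptive scheduling on a single core. This is only an illustration of the problem being modeled, not of PPTA, PPTAN, or the UPPAAL mapping; the task set, the `simulate` function, and the convention that a larger number means a higher priority are assumptions.

```python
def simulate(tasks, horizon):
    """Simulate single-core fixed-priority preemptive scheduling.

    tasks: dict name -> (release_time, duration, priority); larger priority wins.
    Returns the name of the task running in each unit time slot (None if idle).
    """
    remaining = {name: spec[1] for name, spec in tasks.items()}
    timeline = []
    for t in range(horizon):
        # Tasks that have been released and still have work left.
        ready = [n for n, spec in tasks.items() if spec[0] <= t and remaining[n] > 0]
        if not ready:
            timeline.append(None)
            continue
        # The highest-priority ready task runs; a lower-priority task that
        # was running in the previous slot is thereby preempted.
        running = max(ready, key=lambda n: tasks[n][2])
        remaining[running] -= 1
        timeline.append(running)
    return timeline

# Hypothetical task set: "high" is released while "low" is still executing.
tasks = {"low": (0, 4, 1), "high": (2, 2, 5)}
timeline = simulate(tasks, 7)
print(timeline)
```

    In the resulting timeline the low-priority task is interrupted at time 2 and resumes only after the high-priority task completes, which is exactly the preemption that a verification model must represent.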
    Available online:  September 06, 2023 , DOI: 10.13328/j.cnki.jos.006979
    Abstract:
    When prototypical networks are directly applied to few-shot named entity recognition (FEW-NER), two problems arise. First, non-entities do not have strong semantic relationships with each other, so constructing prototypes for entities and non-entities in the same way makes non-entity prototypes fail to accurately represent the semantic characteristics of non-entities. Second, using only the average entity vector as the prototype makes it difficult to capture similar entities with different semantic features. To address these problems, this study proposes a FEW-NER method based on fine-grained prototypical networks (FNFP) to improve the annotation effect of FEW-NER. First, different non-entity prototypes are constructed for different query sets to capture the key semantic features of non-entities in sentences and obtain finer-grained prototypes, which improves the recognition of non-entities. Then, an inconsistent metric module is designed to measure the inconsistency between similar entities, and different metric functions are applied to entities and non-entities, so as to reduce the feature representation between similar samples and improve the feature representation of the prototype. Finally, a Viterbi decoder is introduced to capture the label transition relationship and optimize the final annotation sequence. The experimental results show that the proposed method improves on the compared methods on the large-scale FEW-NER dataset FEW-NERD, and its generalization ability in different domain scenarios is verified on a cross-domain dataset.
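    As background, the baseline that this abstract takes as its starting point (a plain prototypical network in which each class prototype is the mean of its support vectors) can be sketched in a few lines. The two-dimensional vectors and the labels below are made-up stand-ins for token embeddings; the sketch shows the averaged-prototype baseline, not FNFP.

```python
import math

def prototype(vectors):
    """Class prototype: the element-wise mean of the support vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(query, prototypes):
    """Assign the label of the nearest prototype under Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Hypothetical support set: two token embeddings per class.
support = {
    "PER": [[1.0, 0.0], [0.8, 0.2]],
    "LOC": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = {label: prototype(vecs) for label, vecs in support.items()}

print(classify([0.7, 0.3], prototypes))
```

    Because every class, including the non-entity class, is collapsed to a single mean vector here, semantically unrelated non-entity tokens end up sharing one prototype, which is the weakness the finer-grained prototypes are meant to address.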
    Available online:  September 06, 2023 , DOI: 10.13328/j.cnki.jos.006980
    Abstract:
    A large number of bug reports are generated during software development and maintenance, and they can help developers locate bugs. Information retrieval based bug localization (IRBL) analyzes the similarity between bug reports and source code files to locate bugs and achieves high accuracy at the file and function levels. However, because of the coarse localization granularity of IRBL, considerable labor and time are still needed to find bugs within suspicious files and function fragments. This study proposes STMTLocator, a statement-level software bug localization method based on historical bug information retrieval. First, it retrieves historical bug reports that are similar to the bug report of the program under test and extracts the bug statements from them. Then, it retrieves suspicious files according to the text similarity between the source code files and the bug report of the program under test, and extracts suspicious statements from these files. Finally, it calculates the similarity between the suspicious statements and the historical bug statements and sorts the suspicious statements in descending order of similarity to localize bug statements. To evaluate the bug localization performance of STMTLocator, comparative experiments are conducted on the Defects4J and JIRA datasets with Top@N, MRR, and other evaluation metrics. The experimental results show that the MRR of STMTLocator is nearly four times that of the static bug localization method BugLocator and that STMTLocator locates 7 more bug statements for Top@1. The average time STMTLocator takes to locate a bug version is 98.37% and 63.41% lower than that of the dynamic bug localization methods Metallaxis and DStar, respectively, and STMTLocator has the significant advantage of not requiring the construction and execution of test cases.
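    The final ranking step and the MRR metric mentioned in this abstract can be illustrated as follows. The abstract does not specify the similarity measure, so token-level Jaccard similarity is used here purely as an assumed stand-in; the statements and helper names are hypothetical as well.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two whitespace-tokenized statements."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rank_statements(suspicious, historical):
    """Sort suspicious statements by their best match to any historical bug statement."""
    scored = [(max(jaccard(s, h) for h in historical), s) for s in suspicious]
    return [s for _, s in sorted(scored, key=lambda pair: -pair[0])]

def mrr(rankings, truths):
    """Mean reciprocal rank of the first true bug statement in each ranking."""
    total = 0.0
    for ranking, truth in zip(rankings, truths):
        for pos, stmt in enumerate(ranking, start=1):
            if stmt in truth:
                total += 1.0 / pos
                break
    return total / len(rankings)

# Hypothetical data: one historical bug statement, three suspicious statements.
historical = ["if ( x == null ) return ;"]
suspicious = ["int y = 0 ;", "if ( y == null ) return ;", "return y ;"]
ranking = rank_statements(suspicious, historical)
print(ranking[0])
print(mrr([ranking, ["a", "b"]], [{"if ( y == null ) return ;"}, {"b"}]))
```

    Top@N can be computed from the same rankings by counting how many bugs have a true bug statement among the first N positions.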
    Available online:  August 30, 2023 , DOI: 10.13328/j.cnki.jos.006962
    Abstract:
    With the rapid development of Internet information technologies, the explosive growth of online learning resources has caused the problem of “information overload” and “learning disorientation”. In the absence of expert guidance, it is difficult for users to identify their learning demands and select the appropriate content from the vast amount of learning resources. Educational domain recommendation methods have received a lot of attention from researchers in recent years because they can provide personalized recommendations of learning resources based on the historical learning behaviors of users. However, the existing educational domain recommendation methods ignore the modeling of complex relationships among knowledge points in learning demand perception and fail to consider the dynamic changes of users’ learning demands, which leads to inaccurate learning resource recommendations. To address the above problems, this study proposes a knowledge point recommendation method based on static and dynamic learning demand perception, which models users’ learning behaviors under complex knowledge association by combining static perception and dynamic perception. For static learning demand perception, this study innovatively designs an attentional graph convolutional network based on the first-course-following meta-path guidance of knowledge points, which can accurately capture users’ static learning demands at the fine-grained knowledge point level by modeling the complex constraints of the first-course-following relationship between knowledge points and eliminating the interference of other non-learning demand factors. 
For dynamic learning demand perception, the method aggregates knowledge point embeddings to characterize users’ knowledge levels at different moments by taking courses as units and then uses a recurrent neural network to encode users’ knowledge level sequences, which can effectively explore the dynamic learning demands hidden in users’ knowledge level changes. Finally, this study fuses the obtained static and dynamic learning demands, models the compatibility between static and dynamic learning demands in the same framework, and promotes the complementarity of these two learning demands to achieve fine-grained and personalized knowledge point recommendations. Experiments show that the proposed method can effectively perceive users’ learning demands, provide personalized knowledge point recommendations on two publicly available datasets, and outperform the mainstream recommendation methods in terms of various evaluation metrics.
    Available online:  August 23, 2023 , DOI: 10.13328/j.cnki.jos.006951
    Abstract:
    Fact verification is intended to check whether a textual statement is supported by a given piece of evidence. Due to the structural dependence and implicit content of tables, the task of fact verification with tables as the evidence still faces many challenges. Existing work has either used logical expressions to parse statements based on tabular evidence or designed table-aware neural networks to encode statement-table pairs and thereby accomplish table-based fact verification tasks. However, these approaches fail to fully utilize the implicit tabular information behind the statements, which degrades the inference performance of the model. Moreover, Chinese statements based on tabular evidence have more complex syntax and semantics, which also adds to the difficulty of model inference. For this reason, the study proposes a method of fact verification with Chinese tabular data based on the capsule heterogeneous graph attention network (CapsHAN). This method can fully understand the structure and semantics of statements. On this basis, the tabular information implied by the statements is mined and utilized to effectively improve the accuracy of table-based fact verification tasks. Specifically, a heterogeneous graph is constructed by performing syntactic dependency parsing and named entity recognition on statements. Subsequently, the graph is learned by the heterogeneous graph attention network and the capsule graph neural network, and the obtained textual representation of the statements is concatenated with the textual representation of the encoded tables. Finally, the result is predicted. Further, this study also attempts to address the problem that datasets for fact verification based on Chinese tables are scarce and thus unable to support the performance evaluation of table-based fact verification methods. 
For this purpose, the study transforms the mainstream English table-based fact verification datasets TABFACT and INFOTABS into Chinese and constructs a dataset that is based on the uniform content label (UCL) national standard and specifically tailored to the characteristics of Chinese tabular data. This dataset, namely, UCLDS, takes Wikipedia infoboxes as evidence of manually annotated natural language statements and labels them into three classes: entailed, contradictory, and neutral. UCLDS outperforms the traditional datasets TABFACT and INFOTABS in supporting both single-table and multi-table inference. The experimental results on the above three Chinese benchmark datasets show that the proposed model outperforms the baseline model invariably, demonstrating its superiority for Chinese table-based fact verification tasks.
    Available online:  July 26, 2023 , DOI: 10.13328/j.cnki.jos.006940
    Abstract:
    Improving the accuracy of matching between natural language queries and highly structured programming language source code is a fundamental concern in code search, and accurate extraction of code features is one of the key challenges. The semantics expressed by a statement in code depend not only on the statement itself but also on its context, and the structural model of the code provides rich contextual information for understanding code functions. This study proposes a code search method based on function multigraph embedding. By using an early fusion strategy, the study fuses the data dependencies of code statements into a control flow graph and constructs a function multigraph to represent the code. Through data dependencies, the multigraph explicitly expresses the dependency relationships between indirect predecessor and successor nodes that are lacking in the control flow graph, and it enhances the contextual information of statement nodes. At the same time, in view of the edge heterogeneity of the multigraph, this study uses a relational graph convolutional network to extract the features of the code from the function multigraph. Experiments on a public dataset show that the proposed method can improve the MRR by more than 5% compared with the existing methods based on code text and structural models. The ablation experiments also show that the control flow graph contributes more to the search accuracy than the data dependence graph.
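    The early fusion step described above can be pictured with a minimal sketch. It assumes statements are identified by string ids and that control-flow and data-dependency edges are already available as (source, target) pairs; the function name `build_function_multigraph` and the edge labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_function_multigraph(cfg_edges, ddg_edges):
    """Merge control-flow and data-dependency edges into one typed multigraph.

    Each edge is a (source, target) pair of statement ids. The result maps a
    source statement to a list of (target, edge_type) pairs, so two statements
    may be linked by parallel edges of different types.
    """
    graph = defaultdict(list)
    for src, dst in cfg_edges:
        graph[src].append((dst, "control"))
    for src, dst in ddg_edges:
        graph[src].append((dst, "data"))
    return dict(graph)


# Hypothetical three-statement function:
#   s1: x = read()    s2: y = 0    s3: print(x)
cfg = [("s1", "s2"), ("s2", "s3")]
ddg = [("s1", "s3")]  # s3 uses x defined in s1, an indirect predecessor in the CFG
multigraph = build_function_multigraph(cfg, ddg)
```

    In this toy case the data edge from s1 to s3 makes explicit a relationship that the control flow graph only expresses indirectly through s2. A relational graph network would then treat the two edge types separately.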
    Available online:  October 18, 2017 , DOI:
    [Abstract] (2939) [HTML] (0) [PDF 525.21 K] (5253)
    Abstract:
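    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SigSoft Symposium on The Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 315-325. Original link: https://doi.org/10.1145/3106237.3106242. Readers who wish to cite this article should cite the original source.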
    Available online:  October 18, 2017 , DOI:
    [Abstract] (2876) [HTML] (0) [PDF 352.38 K] (6357)
    Abstract:
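    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SigSoft Symposium on The Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 303-314. Original link: https://doi.org/10.1145/3106237.3106239. Readers who wish to cite this article should cite the original source.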
    Available online:  September 11, 2017 , DOI:
    [Abstract] (3437) [HTML] (0) [PDF 276.42 K] (3588)
    Abstract:
    GitHub, a popular social-software-development platform, has fostered a variety of software ecosystems where projects depend on one another and practitioners interact with each other. Projects within an ecosystem often have complex inter-dependencies that impose new challenges in bug reporting and fixing. In this paper, we conduct an empirical study on cross-project correlated bugs, i.e., causally related bugs reported to different projects, focusing on two aspects: 1) how developers track the root causes across projects; and 2) how the downstream developers coordinate to deal with upstream bugs. Through manual inspection of bug reports collected from the scientific Python ecosystem and an online survey with developers, this study reveals the common practices of developers and the various factors in fixing cross-project bugs. These findings provide implications for future software bug analysis in the scope of ecosystem, as well as shed light on the requirements of issue trackers for such bugs.
    Available online:  June 21, 2017 , DOI:
    [Abstract] (3457) [HTML] (0) [PDF 169.43 K] (3453)
    Abstract:
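    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article has been accepted by IEEE Transactions on Software Engineering (2017) and is pending publication. Original link: http://ieeexplore.ieee.org/document/7792694. Readers who wish to cite this article should cite the original source.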
    Available online:  June 13, 2017 , DOI:
    [Abstract] (4664) [HTML] (0) [PDF 174.91 K] (3958)
    Abstract:
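    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 39th International Conference on Software Engineering, pp. 27-37, Buenos Aires, Argentina, May 20-28, 2017, IEEE Press, Piscataway, NJ, USA, ©2017, ISBN: 978-1-5386-3868-2. Original link: http://dl.acm.org/citation.cfm?id=3097373. Readers who wish to cite this article should cite the original source.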
    Available online:  January 25, 2017 , DOI:
    [Abstract] (3539) [HTML] (0) [PDF 254.98 K] (3353)
    Abstract:
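    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 871-882. DOI: https://doi.org/10.1145/2950290.2950364 Original link: http://dl.acm.org/citation.cfm?id=2950364. Readers who wish to cite this article should cite the original source.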
    Available online:  January 18, 2017 , DOI:
    [Abstract] (4007) [HTML] (0) [PDF 472.29 K] (3488)
    Abstract:
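    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 133-143, Seattle, WA, USA, November 2016. Original link: http://dl.acm.org/citation.cfm?id=2950327. Readers who wish to cite this article should cite the original source.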
    Available online:  January 04, 2017 , DOI:
    [Abstract] (3742) [HTML] (0) [PDF 293.93 K] (3066)
    Abstract:
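    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'16), pp. 810-821, November 13-18, 2016. Original link: https://doi.org/10.1145/2950290.2950310. Readers who wish to cite this article should cite the original source.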
    Available online:  January 04, 2017 , DOI:
    [Abstract] (4084) [HTML] (0) [PDF 244.61 K] (3582)
    Abstract:
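    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published at FSE 2016. Original link: http://dl.acm.org/citation.cfm?doid=2950290.2950313. Readers who wish to cite this article should cite the original source.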
    Available online:  December 12, 2016 , DOI:
    [Abstract] (3632) [HTML] (0) [PDF 358.69 K] (3562)
    Abstract:
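    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published at the FSE'16 conference, in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Original link: http://dl.acm.org/citation.cfm?id=2950340. Readers who wish to cite this article should cite the original source.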
    Available online:  September 30, 2016 , DOI:
    Abstract:
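    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. The article was published at the ASE 2016 conference (http://ase2016.org/). Original link: http://dl.acm.org/citation.cfm?id=2970366. Readers who wish to cite this article should cite the original source.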
    Available online:  September 09, 2016 , DOI:
    Abstract:
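    Recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering. Junjie's article was published at the ASE 2016 conference (http://ase2016.org/). Original link: http://dl.acm.org/citation.cfm?doid=2970276.2970300. Readers who cite this article should cite the original source.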
    Available online:  September 07, 2016 , DOI:
    Abstract:
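    Recommended by Prof. Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering. The original paper was published in ASE 2016, Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Full text: http://dx.doi.org/10.1145/2970276.2970307. Important: readers who cite this article should cite the original source.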
    Available online:  August 29, 2016 , DOI:
    Abstract:
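    Recommended by Prof. Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering. The paper was published in ACM Transactions on Software Engineering and Methodology (TOSEM, Vol. 25, No. 2, Article 13, May 2016) and was invited as a "Journal first" presentation at the ICSE 2016 main conference. Full text: http://dl.acm.org/citation.cfm?id=2876443. The authors are Zhou Minghui, Ma Xiujuan, Zhang Lu, and Mei Hong of Peking University, and Audris Mockus of the University of Tennessee. Important: readers who cite this article should cite the original source.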
  • Most Downloaded (Overall | By Year | By Issue)
    Click Rank (Overall | By Year | By Issue)

    2003,14(7):1282-1291, DOI:
    [Abstract] (37345) [HTML] (0) [PDF 832.28 K] (81083)
    Abstract:
    Sensor networks, which arise from the convergence of sensor, micro-electro-mechanical system (MEMS), and networking technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. Building on existing work, research hot spots including power-aware routing and media access control schemes are discussed in detail. Finally, taking application requirements into account, several future research directions are put forward.
    2010,21(3):427-437, DOI:
    [Abstract] (33073) [HTML] (0) [PDF 308.76 K] (39076)
    Abstract:
    Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its application to the automatic generation of SONGCI. In light of the characteristics of ancient Chinese poetry, this paper designs a coding method based on level and oblique tones, a fitness function weighted by syntax and semantics, a selection operator combining elitism and roulette, a partially mapped crossover operator, and a heuristic mutation operator. Tests show that the system built on the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic Chinese poetry generation.
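    As an illustration of the selection operator mentioned in this abstract, the following is a minimal generic sketch of elitism combined with roulette-wheel selection. It is not the paper's implementation: the function name, the `elite_count` parameter, and the assumption that fitness values are non-negative are all hypothetical choices made for the example.

```python
import random


def select_next_generation(population, fitness, elite_count=1, rng=None):
    """Elitism plus roulette-wheel selection (generic sketch).

    Assumes fitness(individual) returns a non-negative number.
    """
    rng = rng or random.Random()
    scores = [fitness(ind) for ind in population]

    # Elitism: copy the best individuals into the next generation unchanged.
    order = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    elites = [population[i] for i in order[:elite_count]]

    # Roulette wheel: fill the remaining slots by drawing individuals with
    # probability proportional to their fitness.
    n_rest = len(population) - len(elites)
    if sum(scores) > 0:
        rest = rng.choices(population, weights=scores, k=n_rest)
    else:
        # All fitness values are zero, so fall back to uniform sampling.
        rest = [rng.choice(population) for _ in range(n_rest)]
    return elites + rest


# Toy usage: individuals are strings and fitness is string length.
toy_population = ["a", "bbb", "cc", "dddd"]
next_generation = select_next_generation(
    toy_population, len, elite_count=1, rng=random.Random(0)
)
```

    Elitism guarantees that the best candidate found so far is never lost, while the roulette draw keeps weaker candidates in play with a smaller probability, which helps preserve diversity.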
    2011,22(1):71-83, DOI:10.3724/SP.J.1001.2011.03958
    [Abstract] (30028) [HTML] (0) [PDF 781.42 K] (55923)
    Abstract:
    Cloud computing is a fundamental change under way in the field of information technology. It represents a movement towards intensive, large-scale specialization. However, while it brings convenience and efficiency, it also poses great challenges in the field of data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the major security requirements in cloud computing, along with key security technologies, standards, and regulations, and provides a cloud computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
    2016,27(1):45-71, DOI:10.13328/j.cnki.jos.004914
    [Abstract] (29453) [HTML] (3335) [PDF 880.96 K] (31892)
    Abstract:
    Android is a modern and the most popular software platform for smartphones. According to reports, Android accounted for 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time. Apple, Microsoft, Blackberry and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
    2008,19(1):48-61, DOI:
    [Abstract] (28394) [HTML] (0) [PDF 671.39 K] (62076)
    Abstract:
    This paper summarizes the current state of research and recent progress in clustering algorithms. First, some representative clustering algorithms are analyzed and summarized from several aspects, such as the underlying ideas, key technologies, advantages, and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are carried out in terms of both accuracy and running efficiency; the clustering behavior of one algorithm on different data sets is analyzed and compared with the clustering of the same data set under different algorithms. Finally, by integrating the information from the above two aspects, the research hot spots, difficulties, and shortcomings of data clustering, as well as some open problems, are addressed. This work can serve as a valuable reference for data clustering and data mining.
    2009,20(5):1337-1348, DOI:
    [Abstract] (28280) [HTML] (0) [PDF 1.06 M] (45258)
    Abstract:
    This paper surveys the current technologies adopted in cloud computing as well as the systems used in enterprises. Cloud computing can be viewed from two different aspects. One is the cloud infrastructure, which is the building block for upper-layer cloud applications. The other is the cloud application itself. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters that contain a large number of cheap PC servers. Second, the applications are co-designed with the underlying infrastructure so that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software built on top of redundant hardware rather than by hardware alone. All these technologies serve two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes. Availability means that the services remain available even when quite a number of nodes fail. From this paper, readers can grasp the current status of cloud computing as well as its future trends.
    2009,20(2):271-289, DOI:
    [Abstract] (27187) [HTML] (0) [PDF 675.56 K] (44114)
    Abstract:
    Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After a brief summary of the EMO algorithms before 2003, the recent advances in EMO are discussed in detail and the current research directions are summarized. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have emerged. Furthermore, the essential characteristics of multi-objective optimization problems are investigated in depth. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints on the future research of EMO are proposed.
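    For readers unfamiliar with the traditional Pareto dominance relation that this survey contrasts with newer schemes, a minimal sketch follows. It assumes every objective is to be minimized and that solutions are given as tuples of objective values; the function names are illustrative only.

```python
def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (minimization).

    a dominates b when a is no worse than b in every objective and strictly
    better in at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy usage with two objectives: (3, 3) is dominated by (2, 2).
candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
front = pareto_front(candidates)
```

    With many objectives, almost every pair of solutions becomes mutually non-dominated under this relation, which is one common motivation for the alternative dominance schemes the abstract mentions.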
    2005,16(1):1-7, DOI:
    [Abstract] (22354) [HTML] (0) [PDF 614.61 K] (21311)
    Abstract:
    The paper offers some reflections on the following four aspects: 1) revealing the development history of software engineering technology from the perspective of the general laws by which things develop; 2) analyzing the construction of each abstraction layer of the virtual machine from the perspective of the intrinsic characteristics of software; 3) proposing the research content of the software engineering discipline and studying the pattern of industrialized software production from the perspective of software development; 4) exploring the development trend of software technology in light of the emergence of Internet technology.
    2010,21(8):1834-1848, DOI:
    [Abstract] (20884) [HTML] (0) [PDF 682.96 K] (57296)
    Abstract:
    This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail: sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation methods and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field through detailed comparison and analysis.
    2004,15(3):428-442, DOI:
    [Abstract] (20700) [HTML] (0) [PDF 1009.57 K] (17504)
    Abstract:
    With the rapid development of e-business, Web applications have evolved from localization to globalization, from B2C (business-to-customer) to B2B (business-to-business), and from a centralized fashion to a decentralized fashion. Web services are a new application model for decentralized computing and an effective mechanism for data and service integration on the Web. Thus, Web services have become a solution for e-business. It is important and necessary to carry out research on the new architecture of Web services, on their combination with other good techniques, and on the integration of services. In this paper, a survey is presented on various aspects of Web services research, from the basic concepts to the principal research problems and the underlying techniques, including data integration in Web services, Web service composition, semantic Web services, Web service discovery, Web service security, solutions for Web services in the P2P (peer-to-peer) computing environment, and grid services. This paper also presents a summary of the state of the art of these techniques, a discussion of future research topics, and the challenges of Web services.
    2005,16(5):857-868, DOI:
    [Abstract] (19870) [HTML] (0) [PDF 489.65 K] (30912)
    Abstract:
    Wireless sensor networks, a novel technology for acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the network, is a challenging one, and yet extremely crucial for many applications. In this paper, the performance evaluation criteria and the taxonomy for wireless sensor network self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed, and the directions of research in this area are introduced.
    2009,20(1):54-66, DOI:
    [Abstract] (19738) [HTML] (0) [PDF 1.41 M] (51063)
    Abstract:
    Network community structure is one of the most fundamental and important topological properties of complex networks: within communities the links between nodes are very dense, but between communities they are quite sparse. Network clustering algorithms, which aim to discover all natural network communities in given complex networks, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, the motivation, the state of the art, and the main issues of existing work related to discovering network communities, and tries to draw a comprehensive and clear outline of this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
    2012,23(4):962-986, DOI:10.3724/SP.J.1001.2012.04175
    [Abstract] (18943) [HTML] (0) [PDF 2.09 M] (32711)
    Abstract:
    Regarded as the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which makes failures of computers or data likely. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure and power costs. Therefore, fault tolerance, scalability, and power consumption of the distributed storage of a data center become key parts of cloud computing technology for ensuring data availability and reliability. In this paper, a survey is made of the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. First, many classical topologies of data center networks are introduced and compared. Second, current fault-tolerant storage techniques are discussed, with particular comparison of data replication and erasure code strategies. Third, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
    2012,23(1):32-45, DOI:10.3724/SP.J.1001.2012.04091
    [Abstract] (18741) [HTML] (0) [PDF 408.86 K] (31690)
    Abstract:
    In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be scaled out cost-effectively are needed to deal with big data. Relational data management techniques have a history of nearly 40 years. They now face the tough obstacle of scalability: relational techniques cannot easily handle large data. Meanwhile, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and expanded their application from Web search to territories that used to be occupied by relational database systems. They confront relational techniques with high availability, high scalability, and massively parallel processing capability. The relational community, after losing the big deal of Web search, has begun to learn from MapReduce, and MapReduce has also borrowed valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other; new data analysis platforms and a new data analysis ecosystem are emerging. Eventually, the two camps of techniques will find their right places in the new ecosystem of big data analysis.
    2009,20(3):524-545, DOI:
    [Abstract] (17431) [HTML] (0) [PDF 1.09 M] (23052)
    Abstract:
    Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide a direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
    2009,20(1):124-137, DOI:
    [Abstract] (17038) [HTML] (0) [PDF 1.06 M] (22833)
    Abstract:
    The appearance of many intelligent devices equipped for short-range wireless communications has boosted the rapid rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANETs) requires at least one path from the source node to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has attracted great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some typical current applications. Then it elaborates on popular research problems including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are stated briefly. Finally, the paper concludes and looks ahead to possible research focuses for opportunistic networks in the future.
    2004,15(8):1208-1219, DOI:
    [Abstract] (16520) [HTML] (0) [PDF 948.49 K] (14792)
    Abstract:
    With the explosive growth of network applications and complexity, the threat of Internet worms to network security has become increasingly serious. Especially in the Internet environment, the variety of propagation methods and the complexity of the application environment result in worms with a much higher frequency of outbreak, much deeper latency, and much wider coverage, and Internet worms have become a primary issue faced by malicious code researchers. In this paper, the concept and research situation of Internet worms, their functional components, and their execution mechanism are first presented; then the scanning strategies and propagation models are discussed; and finally the critical techniques of Internet worm prevention are given. Some major problems and research trends in this area are also addressed.
    2009,20(11):2965-2976, DOI:
    [Abstract] (16482) [HTML] (0) [PDF 442.42 K] (15860)
    Abstract:
    This paper studies uncertain graph data mining and especially investigates the problem of mining frequent subgraph patterns from uncertain graph data. A data model is introduced for representing uncertainties in graphs, and an expected support is employed to evaluate the significance of subgraph patterns. By using the apriori property of expected support, a depth-first search-based mining algorithm is proposed with an efficient method for computing expected supports and a technique for pruning the search space, which reduces the number of subgraph isomorphism tests needed for computing expected support from the exponential scale to the linear scale. Experimental results show that the proposed algorithm is 3 to 5 orders of magnitude faster than a naïve depth-first search algorithm, and is efficient and scalable.
    2009,20(5):1226-1240, DOI:
    [Abstract] (16409) [HTML] (0) [PDF 926.82 K] (16756)
    Abstract:
    This paper introduces the concrete details of combining automated reasoning techniques with planning methods, including planning as satisfiability using propositional logic, conformant planning using modal logic and disjunctive reasoning, planning as nonmonotonic logic, and flexible planning as fuzzy description logic. After considering the experimental results of the International Planning Competition and relevant papers, it concludes that planning methods based on automated reasoning techniques are helpful and can be adopted. It also discusses the challenges and possible research hot spots.
    2009,20(2):350-362, DOI:
    [Abstract] (16380) [HTML] (0) [PDF 1.39 M] (41269)
    Abstract:
    This paper presents a comprehensive survey of recommender system research, aiming to help readers understand this field. First, the research background is introduced, including commercial application demands, academic institutes, conferences, and journals. After describing the recommendation problem both formally and informally, a comparative study is conducted based on categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
    2003,14(10):1717-1727, DOI:
    [Abstract] (16247) [HTML] (0) [PDF 839.25 K] (15681)
    Abstract:
    Sensor networks integrate sensor techniques, embedded computing techniques, distributed computing techniques, and wireless communication techniques. They can be used for monitoring, sensing, collecting, and processing information about monitored objects and transferring the processed information to users. Sensor networks are a new research area of computer science and technology with broad application prospects, and both academia and industry are very interested in them. The concepts and characteristics of sensor networks and of the data in these networks are introduced, and the issues of sensor networks and of data management in sensor networks are discussed. Advances in research on sensor networks and on data management in sensor networks are also presented.
    2015,26(1):62-81, DOI:10.13328/j.cnki.jos.004701
    [Abstract] (16106) [HTML] (3819) [PDF 1.04 M] (27581)
    Abstract:
    Network abstraction has given rise to software-defined networking (SDN). SDN decouples the data plane from the control plane and simplifies network management. The paper starts with a discussion of the background to the emergence and development of SDN and outlines its architecture, which includes the data layer, the control layer, and the application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN. The characteristics of consistency, availability, and tolerance are especially analyzed. Moreover, the latest achievements in typical application scenarios are introduced. Future work is summarized at the end.
    2014,25(4):839-862, DOI:10.13328/j.cnki.jos.004558
    [Abstract] (15544) [HTML] (2693) [PDF 1.32 M] (20244)
    Abstract:
    Batch computing and stream computing are two important forms of big data computing. The research and discussions on batch computing in big data environment are comparatively sufficient. But how to efficiently deal with stream computing to meet many requirements, such as low latency, high throughput and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in the big data computing research. This paper provides a research of the data computing architecture and the key issues in stream computing in big data environments. Firstly, the research gives a brief summary of three application scenarios of stream computing in business intelligence, marketing and public service. It also shows distinctive features of the stream computing in big data environment, such as real time, volatility, burstiness, irregularity and infinity. A well-designed stream computing system always optimizes in system structure, data transmission, application interfaces, high-availability, and so on. Subsequently, the research offers detailed analyses and comparisons of five typical and open-source stream computing systems in big data environment. Finally, the research specifically addresses some new challenges of the stream big data systems, such as scalability, fault tolerance, consistency, load balancing and throughput.
    2012,23(1):1-20, DOI:10.3724/SP.J.1001.2012.04100
    [Abstract] (14592) [HTML] (0) [PDF 1017.73 K] (32374)
    Abstract:
    Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
    2009,20(10):2729-2743, DOI:
    [Abstract] (14473) [HTML] (0) [PDF 1.12 M] (11436)
    Abstract:
    In a multi-hop wireless sensor network (WSN), the sensors closest to the sink tend to deplete their energy faster than other sensors, which creates what is known as an energy hole around the sink. No more data can be delivered to the sink after an energy hole appears, so a considerable amount of energy is wasted and the network lifetime ends prematurely. This paper investigates the energy hole problem and, based on an improved corona model with levels, concludes that assigning transmission ranges to nodes in different coronas is an effective approach to achieving an energy-efficient network. It proves that finding the optimal transmission ranges for all areas is a multi-objective optimization problem (MOP), which is NP-hard. The paper proposes an ACO (ant colony optimization)-based distributed algorithm to prolong the network lifetime, which helps nodes in different areas adaptively find approximately optimal transmission ranges based on the node distribution. Furthermore, the simulation results indicate that the network lifetime under this solution approximates that obtained with the optimal list. Compared with existing algorithms, this ACO-based algorithm not only extends the network lifetime by more than a factor of two but also performs well under non-uniform node distribution.
    2012,23(5):1148-1166, DOI:10.3724/SP.J.1001.2012.04195
    [Abstract] (14389) [HTML] (0) [PDF 946.37 K] (18050)
    Abstract:
    With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
    2000,11(11):1460-1466, DOI:
    [Abstract] (14331) [HTML] (0) [PDF 520.69 K] (11795)
    Abstract:
    Intrusion detection has been a prominent topic in network security research in recent years. In this paper, the necessity of intrusion detection is first presented, and its concepts and models are described. Then, many intrusion detection techniques and architectures are summarized. Finally, the existing problems and future directions in this field are discussed.
    2013,24(8):1786-1803, DOI:10.3724/SP.J.1001.2013.04416
    [Abstract] (14045) [HTML] (0) [PDF 1.04 M] (17878)
    Abstract:
    Many application-specific NoSQL database systems have been developed to satisfy the new requirements of big data management. This paper surveys research on typical NoSQL databases based on the key-value data model. First, the characteristics of big data and the key technical issues in supporting big data management are introduced. Then frontier efforts and research challenges are given, including system architecture, data model, access mode, index, transaction, system elasticity, load balance, replica strategy, data consistency, flash cache, MapReduce-based data processing, and new-generation data management systems. Finally, research prospects are given.
    2002,13(7):1228-1237, DOI:
    [Abstract] (14040) [HTML] (0) [PDF 500.04 K] (14939)
    Abstract:
    Software architecture (SA) has recently emerged as one of the primary research areas in software engineering and as one of the key technologies for developing large-scale software-intensive systems and software product line systems. The history and major directions of SA are summarized, and a concept of SA is put forward based on an analysis and comparison of several classical definitions of SA. By summing up the activities concerning SA, two categories of SA research are identified, and advances in SA research are then introduced from seven aspects. Additionally, some shortcomings of SA research are discussed, and their causes are explained. Finally, the paper concludes with some significantly promising trends in SA research.
    2006,17(7):1588-1600, DOI:
    [Abstract] (13788) [HTML] (0) [PDF 808.73 K] (15185)
    Abstract:
    Routing technology at the network layer is pivotal in the architecture of wireless sensor networks. As an active branch of routing technology, cluster-based routing protocols excel in network topology management, energy minimization, data aggregation and so on. In this paper, cluster-based routing mechanisms for wireless sensor networks are analyzed. Cluster head selection, cluster formation and data transmission are three key techniques in cluster-based routing protocols. As viewed from the three techniques, recent representative cluster-based routing protocols are presented, and their characteristics and application areas are compared. Finally, the future research issues in this area are pointed out.
    2004,15(4):571-583, DOI:
    [Abstract] (13784) [HTML] (0) [PDF 1005.17 K] (10425)
    Abstract:
    For most peer-to-peer file-swapping applications, sharing is a voluntary action, and peers are not held responsible for irresponsible bartering history. This situation indicates that trust between participants cannot be established simply through traditional trust mechanisms. A reasonable approach to trust construction comes from social network analysis, in which trust relations between individuals are established based on the recommendations of other individuals. Current P2P trust models cannot guarantee the convergence of the iteration used for trust computation and take no account of model security problems such as Sybil attacks and slandering. This paper presents a novel recommendation-based global trust model and gives a distributed implementation method. Mathematical analyses and simulations show that, compared with the current global trust model, the proposed model is more robust against trust security problems and more complete in the iteration used to compute peer trust.
    2015,26(1):26-39, DOI:10.13328/j.cnki.jos.004631
    [Abstract] (13778) [HTML] (2592) [PDF 763.52 K] (17147)
    Abstract:
    In recent years, transfer learning has attracted a vast amount of attention and research. Transfer learning is a new machine learning method that applies knowledge from related but different domains to target domains. It relaxes the two basic assumptions of traditional machine learning: (1) the training data (also referred to as the source domain) and the test data (also referred to as the target domain) follow the independent and identically distributed (i.i.d.) condition; (2) there are enough labeled samples to learn a good classification model. It aims to solve problems in which there are few or even no labeled data in target domains. This paper surveys the research progress of transfer learning and introduces the authors' own work, especially the work on building transfer learning models by applying generative models at the concept level. Finally, the paper introduces applications of transfer learning, such as text classification and collaborative filtering, and suggests future research directions for transfer learning.
    2011,22(1):115-131, DOI:10.3724/SP.J.1001.2011.03950
    [Abstract] (13777) [HTML] (0) [PDF 845.91 K] (28745)
    Abstract:
    The Internet traffic model is a key issue for network performance management, Quality of Service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the traffic modeling area.
    2009,20(1):11-29, DOI:
    [Abstract] (13658) [HTML] (0) [PDF 787.30 K] (14844)
    Abstract:
    Constrained optimization problems (COPs) are mathematical programming problems frequently encountered in science and engineering applications. Solving COPs has become an important research area of evolutionary computation in recent years. In this paper, the state of the art of constrained optimization evolutionary algorithms (COEAs) is surveyed from two basic aspects of COEAs (i.e., constraint-handling techniques and evolutionary algorithms). In addition, this paper discusses some important issues of COEAs. More specifically, several typical algorithms are analyzed in detail. Based on the analyses, it is concluded that to obtain competitive results, a proper constraint-handling technique needs to be considered in conjunction with an appropriate search algorithm. Finally, the open research issues in this field are also pointed out.
    2008,19(zk):112-120, DOI:
    [Abstract] (13566) [HTML] (0) [PDF 594.29 K] (15141)
    Abstract:
    An ad hoc network is a collection of wireless mobile nodes dynamically forming a temporary network without the use of any existing network infrastructure or centralized administration. Due to the bandwidth constraints and dynamic topology of mobile ad hoc networks, multipath-supported routing is a very important research issue. In this paper, we present an entropy-based metric to support stability multipath on-demand routing (SMDR). The key idea of the SMDR protocol is to construct a new entropy metric and use it to select stable multipaths, reducing the number of route reconstructions so as to provide QoS guarantees in ad hoc networks whose topology changes continuously. Simulation results show that, with the proposed multipath routing protocol, packet delivery ratio, end-to-end delay, and routing overhead ratio can be improved in most cases. It is a feasible approach to multipath routing decisions.
    2013,24(1):50-66, DOI:10.3724/SP.J.1001.2013.04276
    [Abstract] (13448) [HTML] (0) [PDF 0.00 Byte] (17575)
    Abstract:
    As an important application of acceleration in the cloud, the distributed caching technology has received considerable attention in industry and academia. This paper starts with a discussion on the combination of cloud computing and distributed caching technology, giving an analysis of its characteristics, typical application scenarios, stages of development, standards, and several key elements, which have promoted its development. In order to systematically know the state of art progress and weak points of the distributed caching technology, the paper builds a multi-dimensional framework, DctAF. This framework is constituted of 6 dimensions through analyzing the characteristics of cloud computing and boundary of the caching techniques. Based on DctAF, current techniques have been analyzed and summarized; comparisons among several influential products have also been made. Finally, the paper describes and highlights the several challenges that the cache system faces and examines the current research through in-depth analysis and comparison.
    2003,14(9):1621-1628, DOI:
    [Abstract] (13277) [HTML] (0) [PDF 680.35 K] (20694)
    Abstract:
    Recommendation systems are one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures work poorly in this situation, so the quality of the recommendation system decreases dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method predicts the ratings of items that users have not rated by using the similarity of items, and then uses a new similarity measure to find the target users' neighbors. The experimental results show that this method can effectively alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
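The first step described above, predicting a missing rating from the similarity of items, can be sketched as follows. This is a minimal illustration and not the paper's algorithm: it assumes plain cosine similarity between item columns, treats a rating of 0 as "not rated", and the names `cosine` and `predict_rating` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_rating(ratings, user, item):
    """Predict ratings[user][item] from the user's ratings of similar items.

    ratings is a list of rows (users) by columns (items); 0 means "not rated"
    (an assumption of this sketch).
    """
    n_items = len(ratings[0])
    # Build one rating vector per item (a column of the matrix).
    columns = [[row[j] for row in ratings] for j in range(n_items)]
    num = den = 0.0
    for j in range(n_items):
        r = ratings[user][j]
        if j == item or r == 0:
            continue  # skip the target item and items the user has not rated
        s = cosine(columns[item], columns[j])
        num += s * r
        den += abs(s)
    # Similarity-weighted average of the user's known ratings.
    return num / den if den else 0.0
```

Filling the empty cells of a sparse matrix this way gives a denser matrix on which user neighbors could then be searched; the "new similarity measure" used for that second step is not specified in the abstract and is not sketched here.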
    2003,14(9):1635-1644, DOI:
    [Abstract] (13138) [HTML] (0) [PDF 622.06 K] (12577)
    Abstract:
    Computer forensics is the technology field that attempts to provide thorough, efficient, and secure means of investigating computer crime. Computer evidence must be authentic, accurate, complete, and convincing to juries. In this paper, the stages of computer forensics are presented, and the theories and implementation of forensics software are described. An example of forensic practice is also given. The deficiencies of computer forensics techniques and anti-forensics are also discussed. It is concluded that, as computer science and technology improve, forensics techniques will become more integrated and thorough.
    2002,13(10):1952-1961, DOI:
    [Abstract] (13094) [HTML] (0) [PDF 570.96 K] (12810)
    Abstract:
    The crucial technologies related to personalization are introduced in this paper, which include the representation and modification of user profile, the representation of resource, the recommendation technology, and the architecture of personalization. By comparing with some existing prototype systems, the key technologies about how to implement personalization are discussed in detail. In addition, three representative personalization systems are analyzed. At last, some research directions for personalization are presented.
    2008,19(8):1947-1964, DOI:
    [Abstract] (13027) [HTML] (0) [PDF 811.11 K] (10562)
    Abstract:
    Widespread deployment of interactive information visualization is difficult. Non-specialist users need a general development method and a toolkit that support the generic data structures suited to tree, network, and multi-dimensional data, special visualization and interaction techniques, and well-known generic information tasks. This paper presents a model-driven development method for interactive information visualization. First, an interactive information visualization interface model (IIVM) is proposed. Then, the development method for interactive information visualization based on IIVM is presented. The Daisy toolkit is introduced, which includes the Daisy model builder, the Daisy IIV generator, and a runtime framework with the Daisy library. Finally, an application example is given. Experimental results show that Daisy can provide a general solution for developing interactive information visualization.
    2008,19(8):1902-1919, DOI:
    [Abstract] (12998) [HTML] (0) [PDF 521.73 K] (14018)
    Abstract:
    Visual language techniques have exhibited more advantages in describing various software artifacts than one-dimensional textual languages during software development, ranging from the requirement analysis and design to testing and maintenance, as diagrammatic and graphical notations have been well applied in modeling system. In addition to an intuitive appearance, graph grammars provide a well-established foundation for defining visual languages with the power of precise modeling and verification on computers. This paper discusses the issues and techniques for a formal foundation of visual languages, reviews related practical graphical environments, presents a spatial graph grammar formalism, and applies the spatial graph grammar to defining behavioral semantics of UML diagrams and developing a style-driven framework for software architecture design.
    2012,23(1):82-96, DOI:10.3724/SP.J.1001.2012.04101
    [Abstract] (12988) [HTML] (0) [PDF 394.07 K] (15216)
    Abstract:
    Botnets are one of the most serious threats to the Internet. Researchers have done plenty of research and made significant progress. However, botnets keep evolving and have become more and more sophisticated. Due to the underlying security limitations of current systems and the Internet architecture, and the complexity of botnets themselves, how to effectively counter the global threat of botnets is still a very challenging issue. This paper first introduces the evolution of botnet propagation, attack, and command and control mechanisms. Then the paper summarizes recent advances in botnet defense research and categorizes them into five areas: botnet monitoring, botnet infiltration, analysis of botnet characteristics, botnet detection, and botnet disruption. The limitations of current botnet defense techniques, the evolving trends of botnets, and some possible directions for future research are also discussed.
    2008,19(7):1565-1580, DOI:
    [Abstract] (12794) [HTML] (0) [PDF 815.02 K] (16962)
    Abstract:
    Software defect prediction has been one of the active areas of software engineering since it emerged in the 1970s. It plays a very important role in the analysis of software quality and the balancing of software cost. This paper investigates and discusses the motivation, evolution, solutions, and challenges of software defect prediction technologies, and it also categorizes, analyzes, and compares representative prediction technologies. Some case studies of software defect distribution models are given to aid understanding.
    2010,21(2):231-247, DOI:
    [Abstract] (12764) [HTML] (0) [PDF 1.21 M] (16776)
    Abstract:
    In this paper, a framework is proposed for handling fault of service composition through analyzing fault requirements. Petri nets are used in the framework for fault detecting and its handling, which focuses on targeting the failure of available services, component failure and network failure. The corresponding fault models are given. Based on the model, the correctness criterion of fault handling is given to analyze fault handling model, and its correctness is proven. Finally, CTL (computational tree logic) is used to specify the related properties and enforcement algorithm of fault analysis. The simulation results show that this method can ensure the reliability and consistency of service composition.
    2017,28(1):1-16, DOI:10.13328/j.cnki.jos.005139
    [Abstract] (12625) [HTML] (3479) [PDF 1.75 M] (10002)
    Abstract:
    Knapsack problem (KP) is a well-known combinatorial optimization problem which includes 0-1 KP, bounded KP, multi-constraint KP, multiple KP, multiple-choice KP, quadratic KP, dynamic knapsack KP, discounted KP and other types of KPs. KP can be considered as a mathematical model extracted from variety of real fields and therefore has wide applications. Evolutionary algorithms (EAs) are universally considered as an efficient tool to solve KP approximately and quickly. This paper presents a survey on solving KP by EAs over the past ten years. It not only discusses various KP encoding mechanism and the individual infeasible solution processing but also provides useful guidelines for designing new EAs to solve KPs.
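The "individual infeasible solution processing" mentioned above is often handled in EAs for the 0-1 KP by a repair operator. The sketch below shows one common greedy variant; it is an illustrative assumption rather than a method taken from the survey, and the name `repair` and the value-to-weight ordering are hypothetical choices. It assumes all weights are positive.

```python
def repair(bits, values, weights, capacity):
    """Greedily repair an overweight 0-1 knapsack individual.

    Selected items are dropped in order of increasing value/weight ratio
    until the total weight fits within the capacity.
    """
    bits = list(bits)  # work on a copy so the caller's individual is unchanged
    total = sum(w for b, w in zip(bits, weights) if b)
    # Indices of selected items, least valuable per unit weight first.
    selected = sorted(
        (i for i, b in enumerate(bits) if b),
        key=lambda i: values[i] / weights[i],
    )
    for i in selected:
        if total <= capacity:
            break
        bits[i] = 0
        total -= weights[i]
    return bits
```

For example, with values `[60, 100, 120]`, weights `[10, 20, 30]`, and capacity 50, the all-ones individual weighs 60, so the item with the lowest ratio (the third) is dropped and the result is `[1, 1, 0]`.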
    2006,17(9):1848-1859, DOI:
    [Abstract] (12599) [HTML] (0) [PDF 770.40 K] (21401)
    Abstract:
    In recent years, there have been extensive studies and rapid progresses in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-art challenging issues and research trends for content information processing of Internet and other complex applications, this paper presents a survey on the up-to-date development in text categorization based on machine learning, including model, algorithm and evaluation. It is pointed out that problems such as nonlinearity, skewed data distribution, labeling bottleneck, hierarchical categorization, scalability of algorithms and categorization of Web pages are the key problems to the study of text categorization. Possible solutions to these problems are also discussed respectively. Finally, some future directions of research are given.
    2010,21(7):1620-1634, DOI:
    [Abstract] (12541) [HTML] (0) [PDF 765.23 K] (20287)
    Abstract:
    As an application of mobile ad hoc networks (MANET) on Intelligent Transportation Information System, the most important goal of vehicular ad hoc networks (VANET) is to reduce the high number of accidents and fatal consequences dramatically. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper introduces the characteristics and application fields of VANET briefly. Then, it discusses the characteristics, performance, and application areas with analysis and comparison of various categories of broadcast protocols in VANET. According to the characteristic of VANET and its application requirement, the paper proposes the ideas and breakthrough direction of information broadcast model design of inter-vehicle communication.
    2010,21(5):916-929, DOI:
    [Abstract] (12431) [HTML] (0) [PDF 944.50 K] (18176)
    Abstract:
    Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey on these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys various kinds of technologies proposed to cope with these two aspects of problems. Based on the analysis of the current state of research on data deduplication technologies, this paper makes several conclusions as follows: a) How to mine data characteristic information in data deduplication has not been completely solved, and how to use data characteristic information to effectively eliminate duplicate data also needs further study; b) From the perspective of storage system design, it still needs further study how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and reduce the additional system overheads caused by data deduplication techniques.
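The first category above, identical data detection, is commonly implemented by fingerprinting chunks with a cryptographic hash and storing each distinct chunk once. The following is a minimal sketch under stated assumptions: fixed-size chunking, SHA-256 fingerprints, and an in-memory dictionary as the chunk store. The function names `deduplicate` and `restore` are hypothetical and do not come from the surveyed work.

```python
import hashlib

def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, recipe): store maps a SHA-256 digest to the chunk bytes,
    and recipe is the ordered list of digests needed to rebuild the data.
    """
    store = {}
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy of a chunk
        recipe.append(digest)
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from the chunk store and recipe."""
    return b"".join(store[digest] for digest in recipe)
```

The sketch also hints at the reliability concern raised in the abstract: every repeated chunk depends on a single stored copy, so losing one entry of `store` damages every position in `recipe` that refers to it.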
    2009,20(6):1393-1405, DOI:
    [Abstract] (12279) [HTML] (0) [PDF 831.86 K] (19213)
    Abstract:
    Combinatorial testing can use a small number of test cases to test systems while preserving fault detection ability. However, the test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the areas of combinatorics and software engineering. This paper summarizes the research on this topic in recent years, including various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, mathematical methods for constructing test cases, computer search techniques for test generation, and fault localization techniques based on combinatorial testing.
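To make the idea of a combinatorial test criterion concrete, the sketch below checks pairwise (2-way) coverage: every pair of values for every pair of parameters must appear in at least one test. It is an illustrative checker only, not a generation algorithm from the surveyed work, and the name `uncovered_pairs` is hypothetical.

```python
from itertools import combinations, product

def uncovered_pairs(domains, tests):
    """Return the parameter-value pairs that no test in `tests` exercises.

    domains is a list of value lists, one per parameter; each test is a tuple
    giving one value per parameter. A pair is encoded as (i, a, j, b), meaning
    parameter i takes value a while parameter j takes value b.
    """
    required = set()
    for (i, di), (j, dj) in combinations(enumerate(domains), 2):
        for a, b in product(di, dj):
            required.add((i, a, j, b))
    covered = set()
    for test in tests:
        for i, j in combinations(range(len(test)), 2):
            covered.add((i, test[i], j, test[j]))
    return required - covered
```

For three Boolean parameters, the four tests `(0,0,0)`, `(0,1,1)`, `(1,0,1)`, `(1,1,0)` leave no pair uncovered, whereas exhaustive testing would need eight tests. Finding a smallest such suite in general is the hard generation problem the abstract refers to.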
    2008,19(10):2706-2719, DOI:
    [Abstract] (12182) [HTML] (0) [PDF 778.29 K] (12238)
    Abstract:
    Web search engine has become a very important tool for finding information efficiently from the massive Web data. With the explosive growth of the Web data, traditional centralized search engines become harder to catch up with the growing step of people's information needs. With the rapid development of peer-to-peer (P2P) technology, the notion of P2P Web search has been proposed and quickly becomes a research focus. The goal of this paper is to give a brief summary of current P2P Web search technologies in order to facilitate future research. First, some main challenges for P2P Web search are presented. Then, key techniques for building a feasible and efficient P2P Web search engine are reviewed, including system topology, data placement, query routing, index partitioning, collection selection, relevance ranking and Web crawling. Finally, three recently proposed novel P2P Web search prototypes are introduced.
  • Most Downloaded

    2003,14(7):1282-1291, DOI:
    [Abstract] (37345) [HTML] (0) [PDF 832.28 K] (81083)
    Abstract:
    A sensor network, formed by the convergence of sensor, micro-electro-mechanical system, and network technologies, is a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. Building on existing work, hot topics including power-aware routing and media access control schemes are discussed and presented in detail. Finally, taking application requirements into account, several future research directions are put forward.
    2008,19(1):48-61, DOI:
    [Abstract] (28394) [HTML] (0) [PDF 671.39 K] (62076)
    Abstract:
    The current state of research and recent progress in clustering algorithms are summarized in this paper. First, some representative clustering algorithms are analyzed and summarized from several aspects, such as the ideas behind the algorithms, key technologies, advantages, and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, simulation experiments are carried out in terms of both accuracy and running efficiency, and the clustering behavior of one algorithm on different data sets is analyzed and compared with the clustering of the same data set under different algorithms. Finally, the research hotspots, difficulties, and shortcomings of data clustering and some open problems are addressed by integrating the information from the two aspects above. This work can provide a valuable reference for data clustering and data mining.
    2010,21(8):1834-1848, DOI:
    [Abstract] (20884) [HTML] (0) [PDF 682.96 K] (57296)
    Abstract:
    This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail: sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation methods and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field through detailed comparison and analysis.
    2011,22(1):71-83, DOI:10.3724/SP.J.1001.2011.03958
    [Abstract] (30028) [HTML] (0) [PDF 781.42 K] (55923)
    Abstract:
    Cloud computing is a fundamental change happening in the field of information technology. It represents a movement towards intensive, large-scale specialization. However, while it brings convenience and efficiency, it also poses great challenges in the field of data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the major requirements in cloud computing, key security technologies, standards, regulations, and so on, and provides a cloud computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
    2009,20(1):54-66, DOI:
    [Abstract] (19738) [HTML] (0) [PDF 1.41 M] (51063)
    Abstract:
    Network community structure is one of the most fundamental and important topological properties of complex networks: links between nodes within a community are very dense, while links between communities are quite sparse. Network clustering algorithms, which aim to discover all natural network communities in given complex networks, are fundamentally important for both theoretical research and practical applications. They can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks, including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, the motivation, the state of the art, and the main issues of existing work on discovering network communities, and tries to draw a comprehensive and clear outline of this new and active research area. This work will hopefully benefit researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
    2009,20(5):1337-1348, DOI:
    [Abstract] (28280) [HTML] (0) [PDF 1.06 M] (45258)
    Abstract:
    This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is about the cloud infrastructure which is the building block for the up layer cloud application. The other is of course the cloud application. This paper focuses on the cloud infrastructure including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large scale clusters which contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software building on top of redundant hardware instead of mere hardware. All these technologies are for the two important goals for distributed system: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to very large scale even to thousands of nodes. Availability means that the services are available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
    2009,20(2):271-289, DOI:
    [Abstract] (27187) [HTML] (0) [PDF 675.56 K] (44114)
    Abstract:
    Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After briefly summarizing the EMO algorithms before 2003, the recent advances in EMO are discussed in detail, and the current research directions are summarized. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have emerged. Furthermore, the essential characteristics of multi-objective optimization problems are investigated in depth. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints on the future research of EMO are proposed.
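    As a reminder of what the traditional Pareto dominance mentioned in this abstract means, here is a minimal Python sketch. It assumes that every objective is minimized; the function names `dominates` and `pareto_front` are illustrative and do not come from the paper.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization assumed):
    a is no worse than b in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    candidates = [(1, 5), (2, 2), (3, 3), (5, 1)]
    # (3, 3) is dominated by (2, 2); the other three are mutually non-dominated.
    print(pareto_front(candidates))
```

    With many objectives, almost all candidates tend to become mutually non-dominated under this relation, which is one common motivation for the alternative dominance schemes the abstract refers to.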
    2009,20(2):350-362, DOI:
    [Abstract] (16380) [HTML] (0) [PDF 1.39 M] (41269)
    Abstract:
    This paper presents a comprehensive survey of recommender system research to help readers understand this field. First, the research background is introduced, including commercial application demands, academic institutes, conferences, and journals. After describing the recommendation problem both formally and informally, a comparison study is conducted based on categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
    2004,15(10):1493-1504, DOI:
    [Abstract] (9177) [HTML] (0) [PDF 937.72 K] (39748)
    Abstract:
    The graphics processing unit (GPU) has been developing rapidly in recent years, at a speed exceeding Moore's law, and as a result, various applications associated with computer graphics have advanced greatly. At the same time, the high processing power, parallelism, and programmability now available on contemporary GPUs provide an ideal platform for general-purpose computation. Starting from an introduction to the development history and the architecture of the GPU, the technical fundamentals of the GPU are described in the paper. Then, in the main part of the paper, the development of various applications of general-purpose computation on GPUs is introduced; among those applications, fluid dynamics, algebraic computation, database operations, and spectrum analysis are introduced in detail. The experience of our work on fluid dynamics is also given, and the development of software tools in this area is introduced. Finally, a conclusion is made, and the future development and the new challenges for both hardware and software in this area are discussed.
    2010,21(3):427-437, DOI:
    [Abstract] (33073) [HTML] (0) [PDF 308.76 K] (39076)
    Abstract:
    Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its application to the automatic generation of SONGCI. In light of the characteristics of ancient Chinese poetry, this paper designs a coding method based on level and oblique tones, a fitness function weighted by syntax and semantics, a selection operator combining elitism and roulette-wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system built on the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic Chinese poetry generation.
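    Two of the operators named in this abstract have well-known textbook forms, sketched below in Python. This is only an illustration under stated assumptions: the abstract does not give the paper's tone-based encoding or its fitness function, so the sketch works on toy permutations and takes an arbitrary non-negative fitness callable. The names `roulette_with_elitism` and `pmx` are hypothetical.

```python
import random


def roulette_with_elitism(population, fitness, n_elite=1, rng=random):
    """Build the next mating pool: copy the n_elite fittest individuals
    unchanged, then fill the rest by fitness-proportional (roulette) sampling.
    Assumes fitness values are non-negative and not all zero."""
    scores = [fitness(ind) for ind in population]
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    elites = [population[i] for _, i in ranked[:n_elite]]
    # random.choices performs weighted sampling with replacement.
    rest = rng.choices(population, weights=scores, k=len(population) - n_elite)
    return elites + rest


def pmx(parent_a, parent_b, lo, hi):
    """Partially mapped crossover on two permutations of the same genes.
    The slice parent_a[lo:hi] is copied into the child; every other position
    takes parent_b's gene, following the a -> b mapping while that gene
    already appears in the copied slice."""
    child = [None] * len(parent_a)
    child[lo:hi] = parent_a[lo:hi]
    segment = set(parent_a[lo:hi])
    for i in list(range(lo)) + list(range(hi, len(parent_a))):
        gene = parent_b[i]
        while gene in segment:
            gene = parent_b[parent_a.index(gene)]
        child[i] = gene
    return child


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [3, 7, 5, 1, 6, 8, 2, 4]
    print(pmx(a, b, 3, 6))  # the child is again a permutation of 1..8
    pool = roulette_with_elitism(["a", "bb", "ccc"], len, rng=random.Random(0))
    print(pool)  # the first entry is always the fittest individual
```

    Elitism guarantees that the best candidate found so far survives each generation, while roulette sampling keeps weaker candidates in play with lower probability.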
    2014,25(9):1889-1908, DOI:10.13328/j.cnki.jos.004674
    [Abstract] (11856) [HTML] (3411) [PDF 550.98 K] (38509)
    Abstract:
    This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
    2013,24(11):2476-2497, DOI:10.3724/SP.J.1001.2013.04486
    [Abstract] (10531) [HTML] (0) [PDF 1.14 M] (35586)
    Abstract:
    Probabilistic graphical models are powerful tools for compactly representing complex probability distributions, efficiently computing (approximate) marginal and conditional distributions, and conveniently learning parameters and hyperparameters in probabilistic models. As a result, they have been widely used in applications that require some sort of automated probabilistic reasoning, such as computer vision and natural language processing, as a formal approach to deal with uncertainty. This paper surveys the basic concepts and key results of representation, inference and learning in probabilistic graphical models, and demonstrates their uses in two important probabilistic models. It also reviews some recent advances in speeding up classic approximate inference algorithms, followed by a discussion of promising research directions.
    2012,23(4):962-986, DOI:10.3724/SP.J.1001.2012.04175
    [Abstract] (18943) [HTML] (0) [PDF 2.09 M] (32711)
    Abstract:
    Regarded as the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to failures of computers or data. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure and power costs. Therefore, fault tolerance, scalability, and power consumption of the distributed storage of a data center become key parts of cloud computing technology for ensuring data availability and reliability. In this paper, a survey is made of the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. Firstly, many kinds of classical data center network topologies are introduced and compared. Secondly, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are compared in particular. Thirdly, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
    2012,23(1):1-20, DOI:10.3724/SP.J.1001.2012.04100
    [Abstract] (14592) [HTML] (0) [PDF 1017.73 K] (32374)
    Abstract:
    Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
    2018,29(5):1471-1514, DOI:10.13328/j.cnki.jos.005519
    [Abstract] (6097) [HTML] (4261) [PDF 4.38 M] (32244)
    Abstract:
    Computer aided detection/diagnosis (CAD) can improve the accuracy of diagnosis, reduce false positives, and provide decision support for doctors. The main purpose of this paper is to analyze the latest development of computer aided diagnosis tools. Focusing on the four sites with the highest incidence of fatal cancer, major recent publications on CAD applications in different medical imaging areas are reviewed in this survey according to different imaging techniques and diseases. Furthermore, a multidimensional analysis is made of the research in terms of image data sets, algorithms, and evaluation methods. Finally, existing problems, research trends, and development directions in the field of medical image CAD systems are discussed.
    2016,27(1):45-71, DOI:10.13328/j.cnki.jos.004914
    [Abstract] (29453) [HTML] (3335) [PDF 880.96 K] (31892)
    Abstract:
    Android is a modern and the most popular software platform for smartphones. According to one report, Android accounted for 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time. Apple, Microsoft, Blackberry, and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
    2012,23(1):32-45, DOI:10.3724/SP.J.1001.2012.04091
    [Abstract] (18741) [HTML] (0) [PDF 408.86 K] (31690)
    Abstract:
    In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be scaled out cost-effectively are needed to deal with big data. Relational data management techniques have a history of nearly 40 years. They now face the tough obstacle of scalability: relational techniques cannot easily handle large data. In the meantime, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and expanded their application from Web search to territories that used to be occupied by relational database systems. They challenge relational techniques with high availability, high scalability, and massively parallel processing capability. The relational technique community, after losing the big deal of Web search, has begun to learn from MapReduce. MapReduce also borrows valuable ideas from the relational technique community to improve performance. Relational techniques and MapReduce compete with and learn from each other; new data analysis platforms and a new data analysis ecosystem are emerging. Eventually the two camps of techniques will find their right places in the new ecosystem of big data analysis.
    2021,32(2):349-369, DOI:10.13328/j.cnki.jos.006138
    [Abstract] (8033) [HTML] (7520) [PDF 2.36 M] (31002)
    Abstract:
    Few-shot learning is defined as learning models to solve problems from small samples. In recent years, under the trend of training models with big data, machine learning and deep learning have achieved success in many fields. However, in many real-world application scenarios, there is not a large amount of data or labeled data for model training, and labeling a large number of unlabeled samples costs a lot of manpower. Therefore, how to learn from a small number of samples has become a problem that currently deserves attention. This paper systematically reviews the current approaches to few-shot learning. It introduces each kind of corresponding model in three categories: fine-tune based, data augmentation based, and transfer learning based. Then, the data augmentation based approaches are subdivided into unlabeled data based, data generation based, and feature augmentation based approaches. The transfer learning based approaches are subdivided into metric learning based, meta-learning based, and graph neural network based methods. Next, the paper summarizes the few-shot datasets and the experimental results of the aforementioned models, followed by the current situation and challenges in few-shot learning. Finally, the future technological development of few-shot learning is discussed.
    2005,16(5):857-868, DOI:
    [Abstract] (19870) [HTML] (0) [PDF 489.65 K] (30912)
    Abstract:
    Wireless Sensor Networks, a novel technology about acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the networks, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criterion of the performance and the taxonomy for wireless sensor networks self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
    2011,22(1):115-131, DOI:10.3724/SP.J.1001.2011.03950
    [Abstract] (13777) [HTML] (0) [PDF 845.91 K] (28745)
    Abstract:
    The Internet traffic model is a key issue for network performance management, Quality of Service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the area of traffic modeling.
    2015,26(1):62-81, DOI:10.13328/j.cnki.jos.004701
    [Abstract] (16106) [HTML] (3819) [PDF 1.04 M] (27581)
    Abstract:
    Network abstraction gave birth to software-defined networking (SDN). SDN decouples the data plane from the control plane and simplifies network management. The paper starts with a discussion of the background of the birth and development of SDN, and outlines its architecture, which includes the data layer, control layer, and application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN. The characteristics of consistency, availability, and tolerance are analyzed in particular. Moreover, the latest achievements for typical scenarios are introduced. Future work is summarized at the end.
    2013,24(1):77-90, DOI:10.3724/SP.J.1001.2013.04339
    [Abstract] (11275) [HTML] (0) [PDF 0.00 Byte] (27520)
    Abstract:
    Task parallel programming model is a widely used parallel programming model on multi-core platforms. With the intention of simplifying parallel programming and improving the utilization of multiple cores, this paper provides an introduction to the essential programming interfaces and the supporting mechanism used in task parallel programming models and discusses issues and the latest achievements from three perspectives: Parallelism expression, data management and task scheduling. In the end, some future trends in this area are discussed.
    2017,28(4):959-992, DOI:10.13328/j.cnki.jos.005143
    [Abstract] (9180) [HTML] (4703) [PDF 3.58 M] (25583)
    Abstract:
    The development of mobile internet and the popularity of mobile terminals produce massive trajectory data of moving objects in the era of big data. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the moving patterns of vehicles in the city, and the changes of the atmospheric environment. However, trajectory data can also be exploited to disclose moving objects' private information (e.g., behaviors, hobbies, and social relationships). Accordingly, attackers can easily access moving objects' private information by digging into their trajectory data, such as activities and check-in locations. On another research front, quantum computation presents an important theoretical direction for mining big data due to its scalable and powerful storage and computing capacity. Applying quantum computing approaches to handle trajectory big data could make some complex problems solvable and achieve higher efficiency. This paper reviews the key technologies of processing trajectory data. First, the concept and characteristics of trajectory data are introduced, and the pre-processing methods, including noise filtering and data compression, are summarized. Then, the trajectory indexing and querying techniques, and the current achievements of mining trajectory data, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preserving with respect to trajectory data is provided. The supporting techniques of trajectory big data mining, such as processing frameworks and data visualization, are presented in detail. Some possible ways of applying quantum computation to trajectory data processing, as well as the implementation of some core trajectory mining algorithms by quantum computation, are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
    2011,22(6):1299-1315, DOI:10.3724/SP.J.1001.2011.03993
    [Abstract] (11350) [HTML] (0) [PDF 987.90 K] (23420)
    Abstract:
    An attribute-based encryption (ABE) scheme takes attributes as the public key and associates the ciphertext and the user’s secret key with attributes, so that it can support expressive access control policies. This dramatically reduces the cost of network bandwidth and of the sending node’s operations in fine-grained access control of data sharing. Therefore, ABE has broad application prospects in the area of fine-grained access control. After analyzing the basic ABE system and its two variants, Key-Policy ABE (KP-ABE) and Ciphertext-Policy ABE (CP-ABE), this study elaborates on the research problems relating to ABE systems, including access structure design for CP-ABE, attribute key revocation, key abuse, and multi-authority ABE, with an extensive comparison of their functionality and performance. Finally, this study discusses the open problems and main research directions in ABE.
    2009,20(3):524-545, DOI:
    [Abstract] (17431) [HTML] (0) [PDF 1.09 M] (23052)
    Abstract:
    Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
    2009,20(1):124-137, DOI:
    [Abstract] (17038) [HTML] (0) [PDF 1.06 M] (22833)
    Abstract:
    The appearance of plenty of intelligent devices equipped for short-range wireless communications has boosted the fast rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANET) requires at least one path to exist from the source to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communications between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has attracted great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. Then it elaborates on the popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are stated briefly. Finally, the paper concludes and looks ahead to possible research focuses for opportunistic networks.
    2014,25(1):37-50, DOI:10.13328/j.cnki.jos.004497
    [Abstract] (9831) [HTML] (3570) [PDF 929.87 K] (22011)
    Abstract:
    This paper surveys the state of the art of speech emotion recognition (SER) and presents an outlook on the trend of future SER technology. First, the survey summarizes and analyzes SER in detail from five perspectives: emotion representation models, representative emotional speech corpora, extraction of emotion-related acoustic features, SER methods, and applications. Then, based on the survey, the challenges faced by current SER research are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field, and presents a detailed comparison and analysis of these methods.
    2004,15(11):1583-1594, DOI:
    [Abstract] (9057) [HTML] (0) [PDF 1.57 M] (21928)
    Abstract:
    Uncertainty exists widely in the subjective and objective world. Among all kinds of uncertainty, randomness and fuzziness are the most important and fundamental. In this paper, the relationship between randomness and fuzziness is discussed. Uncertain states and their changes can be measured by entropy and hyper-entropy, respectively. Taking advantage of entropy and hyper-entropy, the uncertainty of chaos, fractals, and complex networks through their various evolution and differentiation is further studied. A simple and effective way is proposed to simulate uncertainty by means of knowledge representation, which provides a basis for the automation of both logical and image thinking with uncertainty. AI (artificial intelligence) with uncertainty is a new cross-discipline, which covers computer science, physics, mathematics, brain science, psychology, cognitive science, biology, and philosophy, and results in the automation of representation, processing, and thinking for uncertain information and knowledge.
    2018,29(10):2966-2994, DOI:10.13328/j.cnki.jos.005551
    [Abstract] (9929) [HTML] (5134) [PDF 610.06 K] (21604)
    Abstract:
    In recent years, the rapid development of Internet technology and Web applications has triggered an explosion of various data on the Internet, which generates a large amount of valuable knowledge. How to organize, represent, and analyze this knowledge has attracted much attention. The knowledge graph was thus developed to organize this knowledge in a semantic and visualized manner. Knowledge reasoning over knowledge graphs has since become one of the hot research topics and plays an important role in many applications such as vertical search and intelligent question answering. The goal of knowledge reasoning over knowledge graphs is to infer new facts or identify erroneous facts according to existing ones. Unlike traditional knowledge reasoning, knowledge reasoning over knowledge graphs is more diversified, due to the simplicity, intuitiveness, flexibility, and richness of knowledge representation in knowledge graphs. Starting with the basic concept of knowledge reasoning, this paper presents a survey of the recently developed methods for knowledge reasoning over knowledge graphs. Specifically, the research progress is reviewed in detail from two aspects: one-step reasoning and multi-step reasoning, each including rule based reasoning, distributed embedding based reasoning, neural network based reasoning, and hybrid reasoning. Finally, future research directions and the outlook of knowledge reasoning over knowledge graphs are discussed.
    2006,17(9):1848-1859, DOI:
    [Abstract] (12599) [HTML] (0) [PDF 770.40 K] (21401)
    Abstract:
    In recent years, there have been extensive studies and rapid progress in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-the-art challenging issues and research trends for content information processing of the Internet and other complex applications, this paper presents a survey of the up-to-date development in text categorization based on machine learning, including models, algorithms, and evaluation. It is pointed out that problems such as nonlinearity, skewed data distribution, the labeling bottleneck, hierarchical categorization, scalability of algorithms, and categorization of Web pages are the key problems in the study of text categorization. Possible solutions to these problems are also discussed. Finally, some future directions of research are given.
    2005,16(1):1-7, DOI:
    [Abstract] (22354) [HTML] (0) [PDF 614.61 K] (21311)
    Abstract:
    The paper offers some reflections on the following four aspects: 1) from the law of how things develop, it reveals the development history of software engineering technology; 2) from the natural characteristics of software, it analyzes the construction of every abstraction layer of the virtual machine; 3) from the perspective of software development, it proposes the research content of the software engineering discipline and studies the pattern of industrialized software production; 4) based on the emergence of Internet technology, it explores the development trend of software technology.
    2020,31(7):2245-2282, DOI:10.13328/j.cnki.jos.006037
    [Abstract] (3051) [HTML] (4437) [PDF 967.02 K] (20995)
    Abstract:
    Ultrasonography is the first choice of imaging examination and preoperative evaluation for thyroid and breast cancer. However, the ultrasonic characteristics of benign and malignant nodules commonly overlap. The diagnosis relies heavily on the operator's experience rather than on quantitative and stable methods. In recent years, medical imaging analysis based on computer technology has developed rapidly, and a series of landmark breakthroughs have been made, which provide effective decision support for medical imaging diagnosis. In this work, the research progress of computer vision and image recognition technologies in thyroid and breast ultrasound images is studied. The series of key technologies involved in the automatic diagnosis of ultrasound images forms the main line of the work. The major algorithms of recent years are summarized and analyzed, such as ultrasound image preprocessing, lesion localization and segmentation, and feature extraction and classification. Moreover, a multi-dimensional analysis is made of the algorithms, data sets, and evaluation methods. Finally, existing problems related to the automatic analysis of these two kinds of ultrasound imaging, as well as the research trends and development directions in the field of ultrasound image analysis, are discussed.
    2012,23(8):2058-2072, DOI:10.3724/SP.J.1001.2012.04237
    [Abstract] (10198) [HTML] (0) [PDF 800.05 K] (20781)
    Abstract:
    The distributed denial of service (DDoS) attack is a major threat to current networks. Based on the attack packet level, the study divides DDoS attacks into network-level DDoS attacks and application-level DDoS attacks. Next, the study analyzes the detection and control methods for these two kinds of DDoS attacks in detail, and it also analyzes the drawbacks of different control methods implemented at different network positions. Finally, the study analyzes the drawbacks of the current detection and control methods and the development trend of DDoS filter systems, and the corresponding technological challenges are also presented.
    2003,14(9):1621-1628, DOI:
    [Abstract] (13277) [HTML] (0) [PDF 680.35 K] (20694)
    Abstract:
    The recommendation system is one of the most important technologies in E-commerce. With the development of E-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures work poorly in this situation, which dramatically degrades the quality of recommendation systems. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method predicts the ratings of items that users have not rated using the similarity of items, and then uses a new similarity measure to find the target users' neighbors. The experimental results show that this method can effectively alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
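    The two-step idea in this abstract (first fill in unrated items from item similarity, then search for neighbors on the denser data) can be sketched in Python as follows. This is a rough illustration, not the paper's algorithm: the abstract does not specify its similarity measures, so plain cosine similarity is assumed in both steps, and the names `fill_missing` and `nearest_neighbors` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fill_missing(ratings):
    """ratings[user][item] is a number or None. Return a dense copy in which
    each None is replaced by a similarity-weighted average of the same user's
    known ratings on other items."""
    n_users, n_items = len(ratings), len(ratings[0])
    # Item vectors with unrated entries treated as 0 (a simplifying assumption).
    cols = [[ratings[u][i] or 0.0 for u in range(n_users)] for i in range(n_items)]
    sim = [[cosine(cols[i], cols[j]) for j in range(n_items)] for i in range(n_items)]
    filled = [row[:] for row in ratings]
    for u in range(n_users):
        rated = [j for j in range(n_items) if ratings[u][j] is not None]
        for i in range(n_items):
            if ratings[u][i] is None:
                weight = sum(sim[i][j] for j in rated)
                total = sum(sim[i][j] * ratings[u][j] for j in rated)
                filled[u][i] = total / weight if weight else 0.0
    return filled


def nearest_neighbors(filled, target, k):
    """Indices of the k users most similar to `target` on the filled matrix."""
    others = [u for u in range(len(filled)) if u != target]
    others.sort(key=lambda u: cosine(filled[target], filled[u]), reverse=True)
    return others[:k]


if __name__ == "__main__":
    ratings = [
        [5, 4, None],  # user 0 has not rated item 2
        [5, None, 1],  # user 1 has not rated item 1
        [1, 1, 5],
    ]
    dense = fill_missing(ratings)
    print(dense)
    print(nearest_neighbors(dense, target=0, k=1))
```

    Because neighbors are compared on the filled matrix, two users who share few co-rated items can still be compared, which is the sparsity problem the abstract describes.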
    2013,24(2):295-316, DOI:10.3724/SP.J.1001.2013.04336
    [Abstract] (9914) [HTML] (0) [PDF 0.00 Byte] (20313)
    Abstract:
    Under the new application mode, the traditional hierarchy data centers face several limitations in size, bandwidth, scalability, and cost. In order to meet the needs of new applications, data center network should fulfill the requirements with low-cost, such as high scalability, low configuration overhead, robustness and energy-saving. First, the shortcomings of the traditional data center network architecture are summarized, and new requirements are pointed out. Secondly, the existing proposals are divided into two categories, i.e. server-centric and network-centric. Then, several representative architectures of these two categories are overviewed and compared in detail. Finally, the future directions of data center network are discussed.
    2010,21(7):1620-1634, DOI:
    [Abstract] (12541) [HTML] (0) [PDF 765.23 K] (20287)
    Abstract:
    As an application of mobile ad hoc networks (MANET) on Intelligent Transportation Information System, the most important goal of vehicular ad hoc networks (VANET) is to reduce the high number of accidents and fatal consequences dramatically. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper introduces the characteristics and application fields of VANET briefly. Then, it discusses the characteristics, performance, and application areas with analysis and comparison of various categories of broadcast protocols in VANET. According to the characteristic of VANET and its application requirement, the paper proposes the ideas and breakthrough direction of information broadcast model design of inter-vehicle communication.
    2005,16(10):1743-1756, DOI:
    [Abstract] (10282) [HTML] (0) [PDF 545.62 K] (20248)
    Abstract:
    This paper presents a survey on the theory of provable security and its applications to the design and analysis of security protocols. It clarifies what provable security is, explains some basic notions involved in the theory of provable security, and illustrates the basic idea of the random oracle model. It also reviews the development and advances of provably secure public-key encryption and digital signature schemes, in the random oracle model or the standard model, as well as the applications of provable security to the design and analysis of session-key distribution protocols and their advances.
    2014,25(4):839-862, DOI:10.13328/j.cnki.jos.004558
    [Abstract] (15544) [HTML] (2693) [PDF 1.32 M] (20244)
    Abstract:
    Batch computing and stream computing are two important forms of big data computing. Batch computing in big data environments has been researched and discussed fairly thoroughly. However, how to handle stream computing efficiently while meeting requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, remain great challenges in big data computing research. This paper surveys the data computing architecture and the key issues of stream computing in big data environments. Firstly, it gives a brief summary of three application scenarios of stream computing: business intelligence, marketing, and public service. It also describes distinctive features of stream computing in big data environments, such as real time, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is optimized in system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems in big data environments. Finally, it addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
    2019,30(2):440-468, DOI:10.13328/j.cnki.jos.005659
    [Abstract] (8421) [HTML] (5732) [PDF 3.27 M] (19323)
    Abstract:
    In recent years, Deep Learning (DL) has been widely applied to Image Semantic Segmentation (ISS) owing to its state-of-the-art performance and high-quality results. This paper systematically reviews the contribution of DL to the field of ISS. Different methods of ISS based on DL (ISSbDL) are summarized. These methods are divided into ISS based on the Regional Classification (ISSbRC) and ISS based on the Pixel Classification (ISSbPC) according to the image segmentation characteristics and segmentation granularity. Then, the methods of ISSbPC are surveyed from two points of view: ISS based on Fully Supervised Learning (ISSbFSL) and ISS based on Weakly Supervised Learning (ISSbWSL). The representative algorithms of each method are introduced and analyzed, and the basic workflow, framework, advantages, and disadvantages of these methods are analyzed and compared in detail. In addition, the related experiments of ISS are analyzed and summarized, and the common datasets and performance evaluation indexes in ISS experiments are introduced. Finally, possible research directions and trends are given and analyzed.
    2023,34(2):625-654, DOI:10.13328/j.cnki.jos.006696
    [Abstract] (3091) [HTML] (3913) [PDF 3.04 M] (19248)
    Abstract:
    Source code bug (vulnerability) detection is the process of judging whether there are unexpected behaviors in program code. It is widely used in software engineering tasks such as software testing and software maintenance, and plays a vital role in software functional assurance and application security. Traditional vulnerability detection research is based on program analysis, which usually requires strong domain knowledge and complex calculation rules and faces the problem of state explosion. As a result, its detection performance is limited, and there is considerable room for improvement in its false positive and false negative rates. In recent years, the vigorous development of the open source community has accumulated massive amounts of data with open source code as the core. In this context, the feature learning capabilities of deep learning can automatically learn semantically rich code representations, thereby providing a new approach to vulnerability detection. This study collects the latest high-level papers in this field and systematically summarizes and explains the current methods from two aspects: vulnerability code datasets and deep learning vulnerability detection models. Finally, it summarizes the main challenges faced by research in this field and looks ahead to possible future research focuses.
    2010,21(7):1605-1619, DOI:
    [Abstract] (10007) [HTML] (0) [PDF 856.25 K] (19234)
    Abstract:
    The rapid development of the Internet leads to an increase in system complexity and uncertainty. Traditional network management cannot meet the resulting requirements and needs to evolve into fusion-based Cyberspace Situational Awareness (CSA). Based on an analysis of functional shortcomings and development requirements, this paper introduces CSA as well as its origin, conception, objective, and characteristics. Firstly, a CSA research framework is proposed and the research history is investigated, based on which the main aspects and the existing issues of the research are analyzed. Meanwhile, assessment methods are divided into three categories: mathematical models, knowledge reasoning, and pattern recognition. Then, this paper discusses CSA from three aspects: model, knowledge representation, and assessment methods, and goes into detail about the main ideas, assessment processes, merits, and shortcomings of novel methods. Many typical methods are compared. The current application research of CSA in fields such as security, transmission, survivability, and system evaluation is presented. Finally, this paper points out the development directions of CSA and offers conclusions from the perspectives of the issue system, technical system, and application system.
    2009,20(6):1393-1405, DOI:
    [Abstract] (12279) [HTML] (0) [PDF 831.86 K] (19213)
    Abstract:
    Combinatorial testing can use a small number of test cases to test systems while preserving fault detection ability. However, the test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the areas of combinatorics and software engineering. This paper summarizes the research work on this topic in recent years, including: various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, the mathematical methods for constructing test cases, the computer search techniques for test generation, and fault localization techniques based on combinatorial testing.
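    To make the idea of covering all parameter-value pairs with few test cases concrete, a small greedy heuristic for pairwise (2-way) coverage might look like the sketch below. It is only an illustrative search strategy written for this listing, not one of the specific techniques surveyed in the paper; it enumerates every full combination as a candidate, so it suits small inputs only. The names `pairs_of` and `greedy_pairwise` are hypothetical.

```python
from itertools import combinations, product

def pairs_of(test):
    """All ((param, value), (param, value)) pairs covered by one test case."""
    return {((i, test[i]), (j, test[j]))
            for i, j in combinations(range(len(test)), 2)}

def greedy_pairwise(domains):
    """Greedily pick test cases until every value pair is covered.

    domains: one list of possible values per parameter.
    Returns a list of test cases (tuples), usually far fewer than
    the full Cartesian product.
    """
    candidates = list(product(*domains))
    uncovered = set()
    for t in candidates:
        uncovered |= pairs_of(t)
    suite = []
    while uncovered:
        # Choose the candidate covering the most still-uncovered pairs.
        best = max(candidates, key=lambda t: len(pairs_of(t) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite

# Three boolean parameters: exhaustive testing needs 8 cases.
domains = [[0, 1], [0, 1], [0, 1]]
suite = greedy_pairwise(domains)
```

    The greedy choice gives no optimality guarantee, which is consistent with the NP-completeness noted in the abstract: finding a minimum-size suite is hard in general, so practical tools settle for good heuristics.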
    2013,24(4):825-842, DOI:10.3724/SP.J.1001.2013.04369
    [Abstract] (8698) [HTML] (0) [PDF 1.09 M] (19193)
    Abstract:
    The honeypot is a proactive defense technology, introduced by defenders to change the asymmetry of the game between network attack and defense. By deploying honeypots, i.e., security resources without any production purpose, defenders can lure attackers into illegally exploiting them, then capture and analyze the attack behaviors to understand the attack tools and methods and to learn the attackers' intentions and motivations. Honeypot technology has received sustained attention from the security community, has made considerable progress and found wide application, and has become one of the main technical means of Internet security threat monitoring and analysis. In this paper, the origin and evolution of honeypot technology are presented first. Next, the key mechanisms of honeypot technology are comprehensively analyzed, the development of honeypot deployment structures is reviewed, and the latest applications of honeypot technology in Internet security threat monitoring, analysis, and prevention are summarized. Finally, the problems of honeypot technology, development trends, and further research directions are discussed.
    2018,29(7):2092-2115, DOI:10.13328/j.cnki.jos.005589
    [Abstract] (10432) [HTML] (5435) [PDF 2.52 M] (18995)
    Abstract:
    Blockchain is a distributed public ledger technology that originates from the digital cryptocurrency Bitcoin. Its development has attracted wide attention in industry and academia. Blockchain has the advantages of decentralization, trustworthiness, anonymity, and immutability. It breaks through the limitations of traditional center-based technology and has broad development prospects. This paper introduces the research progress of blockchain technology and its application in the field of information security. Firstly, the basic theory and model of blockchain are introduced from five aspects: basic framework, key technology, technical features, application mode, and application area. Secondly, from the perspective of the current state of blockchain research in the field of information security, this paper summarizes the research progress of blockchain in authentication technology, access control technology, and data protection technology, and compares the characteristics of the various studies. Finally, the application challenges of blockchain technology are analyzed, and the development outlook of blockchain in the field of information security is highlighted. This study is intended to serve as a reference for future research.
    2017,28(1):160-183, DOI:10.13328/j.cnki.jos.005136
    [Abstract] (8887) [HTML] (4893) [PDF 3.12 M] (18936)
    Abstract:
    Image segmentation is the process of dividing an image into a number of regions with similar properties, and it is the preprocessing step for many image processing tasks. In recent years, domestic and foreign scholars have mainly focused on content-based image segmentation algorithms. Based on an extensive study of the existing literature and the latest achievements, this paper categorizes image segmentation algorithms into three types: graph theory based methods, pixel clustering based methods, and semantic segmentation methods. The basic ideas, advantages, and disadvantages of typical algorithms belonging to each category, especially the most recent image semantic segmentation algorithms based on deep neural networks, are analyzed, compared, and summarized. Furthermore, the paper introduces the datasets commonly used as benchmarks in image segmentation and the evaluation criteria for algorithms, and compares several image segmentation algorithms experimentally. Finally, some potential future research work is discussed.
    2011,22(3):381-407, DOI:10.3724/SP.J.1001.2011.03934
    [Abstract] (10563) [HTML] (0) [PDF 614.69 K] (18867)
    Abstract:
    The popularity of the Internet and the boom of the World Wide Web foster innovative changes in software technology that give birth to a new form of software—networked software, which delivers diversified and personalized on-demand services to the public. With the ever-increasing expansion of applications and users, the scale and complexity of networked software are growing beyond the information processing capability of human beings, which brings software engineers a series of challenges to face. In order to come to a scientific understanding of this kind of ultra-large-scale artificial complex systems, a survey research on the infrastructure, application services, and social interactions of networked software is conducted from a three-dimensional perspective of cyberization, servicesation, and socialization. Interestingly enough, most of them have been found to share the same global characteristics of complex networks such as “Small World” and “Scale Free”. Next, the impact of the empirical study on software engineering research and practice and its implications for further investigations are systematically set forth. The convergence of software engineering and other disciplines will put forth new ideas and thoughts that will breed a new way of thinking and input new methodologies for the study of networked software. This convergence is also expected to achieve the innovations of theories, methods, and key technologies of software engineering to promote the rapid development of software service industry in China.
    2018,29(1):42-68, DOI:10.13328/j.cnki.jos.005320
    [Abstract] (9963) [HTML] (4114) [PDF 2.54 M] (18675)
    Abstract:
    The Internet has penetrated into all aspects of human society and has greatly promoted social progress. At the same time, various forms of cybercrimes and network theft occur frequently, bringing great harm to our society and national security. Cyber security has become a major concern to the public and the government. As a large number of Internet functionalities and applications are implemented by software, software plays a crucial role in cyber security research and practice. In fact, almost all cyberattacks were carried out by exploiting vulnerabilities in system software or application software. It is increasingly urgent to investigate the problems of software security in the new age. This paper reviews the state of the art of malware, software vulnerabilities and software security mechanism, and analyzes the new challenges and trends that the software ecosystem is currently facing.
    2008,19(11):2803-2813, DOI:
    [Abstract] (9429) [HTML] (0) [PDF 319.20 K] (18642)
    Abstract:
    A semi-supervised clustering method based on the affinity propagation (AP) algorithm is proposed in this paper. AP takes as input measures of similarity between pairs of data points. Compared with existing clustering algorithms such as K-center clustering, AP is an efficient and fast clustering algorithm for large datasets. However, for datasets with complex cluster structures, it cannot produce good clustering results. The clustering performance of AP can be improved by using a priori known labeled data or pairwise constraints to adjust the similarity matrix. Experimental results show that the proposed method achieves this goal on complex datasets and outperforms the compared methods when there are a large number of pairwise constraints.
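    One plausible way to adjust an AP similarity matrix with pairwise constraints is sketched below. The abstract does not give the paper's exact adjustment rule, so the choices here are assumptions: similarities are negative squared Euclidean distances, must-link pairs are raised to the maximum similarity (0), and cannot-link pairs are pushed below the least similar pair in the data. The function names are hypothetical, and the adjusted matrix would still have to be passed to an AP implementation.

```python
def similarity_matrix(points):
    """Negative squared Euclidean distance between every pair of points."""
    return [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]

def apply_constraints(sim, must_link, cannot_link):
    """Return an adjusted copy of sim; the input matrix is not modified.

    must_link / cannot_link: iterables of (i, j) index pairs.
    Assumed rule (not taken from the paper): must-link pairs get the
    maximum similarity 0, cannot-link pairs get a value strictly below
    the smallest similarity in the original matrix.
    """
    adjusted = [row[:] for row in sim]
    lowest = min(min(row) for row in sim)
    for i, j in must_link:
        adjusted[i][j] = adjusted[j][i] = 0.0
    for i, j in cannot_link:
        adjusted[i][j] = adjusted[j][i] = 2 * lowest - 1.0
    return adjusted

points = [(0, 0), (1, 0), (5, 5)]
sim = similarity_matrix(points)
adjusted = apply_constraints(sim, must_link=[(0, 2)], cannot_link=[(0, 1)])
```

    Keeping the adjustment symmetric means both directions of a constrained pair carry the same evidence, so the message-passing step of AP sees a consistent view of each constraint.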
    2016,27(3):691-713, DOI:10.13328/j.cnki.jos.004948
    [Abstract] (9563) [HTML] (2421) [PDF 2.43 M] (18590)
    Abstract:
    Learning to rank (L2R) techniques try to solve sorting problems using machine learning methods, and have been well studied and widely used in various fields such as information retrieval, text mining, personalized recommendation, and biomedicine. The main task of L2R-based recommendation algorithms is to integrate L2R techniques into recommendation algorithms, studying how to organize large numbers of user and item features, build more suitable user models according to user preferences and requirements, and improve the performance and user satisfaction of recommendation algorithms. This paper surveys L2R-based recommendation algorithms of recent years, summarizes the problem definition, compares key technologies, and analyzes evaluation metrics and their applications. In addition, the paper discusses the future development trends of L2R-based recommendation algorithms.
    2009,20(8):2241-2254, DOI:
    [Abstract] (6965) [HTML] (0) [PDF 1.99 M] (18499)
    Abstract:
    Inspired by the idea of data fields, a community discovery algorithm based on topological potential is proposed. The basic idea is that a topological potential function is introduced to analytically model the virtual interaction among all nodes in a network and, by regarding each community as a local high-potential area, the community structure in the network can be uncovered by detecting all local high-potential areas bounded by low-potential nodes. Experiments on some real-world networks show that the algorithm requires no input parameters and can discover the intrinsic or even overlapping community structure in networks. The time complexity of the algorithm is O(m+n^(3/γ))~O(n^2), where n is the number of nodes to be explored, m is the number of edges, and 2<γ<3 is a constant.

Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address: 4# South Fourth Street, Zhong Guan Cun, Beijing 100190, Postal Code: 100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063