    2023,34(3):1010-1026, DOI: 10.13328/j.cnki.jos.006787
    Abstract:
    Government data governance is undergoing a transition from "physical data aggregation" to "logical semantic unification". The long-term "autonomy" of government information silos has led to a wide spectrum of metadata curation issues, such as attributes with the same name but different meanings, or attributes with different names but the same meaning. Instead of rebuilding or modifying legacy information systems, or physically aggregating data from isolated information systems, logical semantic unification solves this problem by unifying the semantic expression of the metadata in government information silos and thereby achieves standardized metadata governance. This work semantically aligns the metadata of each government information silo with the existing standard metadata. Specifically, the standard metadata names are viewed as semantic labels, and the semantic meaning of each column of the relations in a government information silo is identified, so as to establish the alignment between column names and standard metadata and achieve standardized governance of silo metadata.
    2023,34(3):1027-1048, DOI: 10.13328/j.cnki.jos.006789
    Abstract:
    Time series data is widely used in energy, manufacturing, finance, climate, and many other fields. Aggregate queries are quite common in time series analysis scenarios as a way to quickly obtain a summary of massive data. Storing pre-computed metadata is an effective way to accelerate aggregate queries. However, most existing time series databases slice data with fixed time windows, which requires real-time sorting and partitioning; in IoT applications with high write concurrency and throughput, these additional costs are unacceptable. This study proposes a physical metadata management solution in Apache IoTDB for accelerating aggregate queries, in which data are sliced according to the physical storage sharding of files. Both synchronous and asynchronous computation are adopted to prepare metadata ahead of queries while preserving write performance. Out-of-order data streams are another major challenge in IoTDB applications. This study abstracts files with overlapping time ranges into out-of-order file groups and provides metadata for each group. Aggregate queries are then rewritten into three sub-queries and executed efficiently on the physical metadata and the time series data. Experiments on various datasets show that the proposed solution improves the performance of aggregate queries and verify the validity of the different computation strategies.
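A minimal sketch of the general idea of answering a range aggregate from per-file metadata (this is an illustration only, not IoTDB's actual data structures; the field names and the split into fully/partially covered files are assumptions):

```python
# Illustrative sketch: per-file aggregate metadata lets a time-range SUM/COUNT
# query combine precomputed summaries for files fully inside the range and scan
# raw points only for files that partially overlap the range.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FileSummary:
    t_min: int                        # earliest timestamp in the file
    t_max: int                        # latest timestamp in the file
    count: int                        # number of points
    total: float                      # precomputed sum of values
    points: List[Tuple[int, float]]   # raw (timestamp, value) pairs

def range_sum(files: List[FileSummary], t_lo: int, t_hi: int) -> Tuple[float, int]:
    """Return (sum, count) of all points with t_lo <= t <= t_hi."""
    s, c = 0.0, 0
    for f in files:
        if f.t_max < t_lo or f.t_min > t_hi:
            continue                          # file entirely outside the range
        if t_lo <= f.t_min and f.t_max <= t_hi:
            s += f.total                      # fully covered: use metadata only
            c += f.count
        else:                                 # partially covered: scan raw points
            for t, v in f.points:
                if t_lo <= t <= t_hi:
                    s += v
                    c += 1
    return s, c
```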
    2023,34(3):1049-1064, DOI: 10.13328/j.cnki.jos.006791
    Abstract:
    With the development of the information society, data have grown larger in scale and richer in type. Data have become important strategic resources and a vital guarantee of scientific management for countries and enterprises. Nevertheless, as the amount of data generated in social life increases, a large amount of dirty data comes along with it, and data quality issues ensue. In the field of data science, how to detect errors accurately and comprehensively has always been a pain point. Although many traditional methods based on constraints or statistics are widely used, they are usually limited by prior knowledge and labor cost. Recently, some novel methods detect errors by using deep learning models to reason over time series data or analyze contextual data, and they achieve better performance. However, these methods tend to be applicable only to specific domains or specific types of errors and are not general enough for complex real-world cases. Based on the above observations, this study takes advantage of both traditional methods and deep learning models and proposes a comprehensive error detection method (CEDM) that can handle multiple types of errors from multiple views. First, under the pattern view, basic detection rules are constructed through statistical analysis with constraints along multiple dimensions, including attributes, cells, and tuples. Then, under the semantic view, data semantics are captured by word embeddings, and attribute relevance, cell dependency, and tuple similarity are analyzed; the basic rules are thereby extended and updated according to the semantic relations in different dimensions. Finally, errors of multiple types can be detected comprehensively and accurately from multiple views. Extensive experiments on real and synthetic datasets demonstrate that the proposed method outperforms state-of-the-art error detection methods and has higher generalization ability, being applicable to multiple domains and multiple error types.
    2023,34(3):1065-1086, DOI: 10.13328/j.cnki.jos.006793
    Abstract:
    Time series data generated by intelligent devices are growing rapidly and face serious data quality problems. The demand for time series data quality management and for data quality improvement based on data repairing techniques is increasingly urgent. Time series data are characterized by ordered time windows and strong associations between rows and columns, which poses greater challenges for expressing the data quality semantics of time series data. This study proposes the definition and construction of time series data quality rules that take the associations on both rows and columns into account. It extends the expressiveness of existing data quality rule systems in terms of time windows and multi-order expression operations. In addition, a discovery method for time series data quality rules is proposed. Experimental results on real time series datasets verify that the proposed method can effectively and efficiently discover hidden data quality rules from time series data, and that, by constructing predicates over the associated information on rows and columns, it achieves higher performance than existing data quality rule discovery methods.
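As a hedged illustration of what a window-based, row-associated quality rule can look like (the paper's actual rule language and discovery algorithm may differ; the column name and threshold below are made up):

```python
# Sketch of a simple row-window rule: between adjacent timestamps, the change
# of a column must stay within a bound; rows violating the bound are reported.
rows = [
    {"ts": 0, "speed": 10.0},
    {"ts": 1, "speed": 80.0},
    {"ts": 2, "speed": 12.0},
]

def violations(rows, max_jump=20.0):
    """Return timestamps whose 'speed' differs from the previous row by more than max_jump."""
    bad = []
    for prev, cur in zip(rows, rows[1:]):
        if abs(cur["speed"] - prev["speed"]) > max_jump:
            bad.append(cur["ts"])
    return bad

print(violations(rows))  # -> [1, 2]
```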
    2023,34(3):1087-1108, DOI: 10.13328/j.cnki.jos.006794
    Abstract:
    Entity matching determines whether records in two datasets refer to the same real-world entity and is indispensable for tasks such as big data integration, social network analysis, and web semantic data management. As deep learning technologies that have achieved great success in natural language processing and computer vision, pre-trained language models have also achieved better results than traditional methods in entity matching tasks and have attracted the attention of a large number of researchers. However, the performance of entity matching based on pre-trained language models is unstable, and the matching results cannot be explained, which brings great uncertainty to the application of this technology in big data integration. At the same time, existing interpretation methods for entity matching models are mainly model-agnostic methods oriented to machine learning, and their applicability to pre-trained language models is limited. Therefore, this study takes BERT-based entity matching models such as Ditto and JointBERT as examples and proposes three model interpretation methods for entity matching with pre-trained language models to solve this problem. (1) Since the serialization operation is sensitive to the order of relational data attributes, dataset meta-features and attribute similarity are used to generate attribute-ranking counterfactuals for misclassified samples. (2) As a supplement to traditional attribute importance measurement, the attention weights of the pre-trained language model are used to measure and visualize model processing. (3) Based on the serialized sentence vectors, k-nearest neighbor search is used to recall well-interpretable samples similar to the misclassified ones so as to enhance the low-confidence predictions of the pre-trained language model. Experiments on real public datasets show that, while improving model performance through the enhancement method, the proposed approach can reach 68.8% of the upper limit of fidelity in the attribute-order search space, which provides a decision explanation for entity matching with pre-trained language models. New perspectives such as attribute-order counterfactuals and attribute association understanding are also introduced.
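For readers unfamiliar with the serialization step, the following minimal sketch shows a Ditto-style serialization of two records into one text sequence for a BERT matcher (the exact special tokens used by a given implementation may differ, and the records are invented):

```python
# Ditto-style serialization: each record becomes a token sequence of
# "[COL] attribute [VAL] value" pairs; the two records are concatenated.
def serialize(record: dict) -> str:
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

left  = {"title": "iPhone 14 128GB", "brand": "Apple", "price": "799"}
right = {"title": "Apple iPhone 14 (128 GB)", "brand": "Apple", "price": "799.00"}

# Note that permuting the key order of `left` changes the input string, which is
# why attribute-order counterfactuals can flip low-confidence predictions.
pair = serialize(left) + " [SEP] " + serialize(right)
print(pair)
```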
    2023,34(3):1109-1125, DOI: 10.13328/j.cnki.jos.006795
    Abstract:
    Recently, many countries and regions have enacted data security policies, such as the General Data Protection Regulation proposed by the EU. The release of related laws and regulations has aggravated the problem of data silos, which makes it difficult to share data among data owners. Data federation is a possible solution to this problem: query tasks are computed jointly by multiple data owners without disclosing their original data, with the help of privacy computing technologies such as secure multi-party computation. This concept has become a research trend in recent years, and a series of representative systems, such as SMCQL and Conclave, have been proposed. However, for the fundamental join query in relational database systems, existing data federation systems still have the following problems. First, only a single type of join query is supported, which makes it difficult to meet query requirements under complex join conditions. Second, there is considerable room for performance improvement, because existing systems often call the security tool library directly, which incurs high running time and communication overhead. Therefore, a data federation join algorithm is proposed to address the above issues. The main contributions of this study are as follows. First, multi-party federated security operators are designed and implemented, which can support a variety of operations. Second, a federated θ-join algorithm and an optimization strategy are proposed to significantly reduce the security computation cost. Finally, the performance of the proposal is verified on the benchmark dataset TPC-H. The experimental results show that the proposed algorithm can reduce the runtime and communication overhead by 61.33% and 95.26%, respectively, compared with the existing data federation systems SMCQL and Conclave.
    2023,34(3):1126-1147, DOI: 10.13328/j.cnki.jos.006781
    Abstract:
    With the emergence and accumulation of massive data, data governance has become an important means to improve data quality and maximize data value. Error detection is crucial for improving data quality and has attracted a surge of interest from both industry and academia. Various detection methods tailored for a single data source have been proposed. Nevertheless, in many real-world scenarios, data is not centrally stored and managed. Different sources of correlated data can be employed to improve the accuracy of error detection. Unfortunately, due to privacy and security issues, cross-source data is often not allowed to be integrated centrally. To this end, this study proposes FeLeDetect, a cross-source data error detection method based on federated learning. First, a graph-based error detection model (GEDM) is presented to capture sufficient data features from each data source. Then, the study investigates a federated co-training algorithm (FCTA) to collaboratively train GEDM over different data sources without privacy leakage. Furthermore, a series of optimization methods are designed to reduce the communication cost of federated learning and the manual labeling effort. Extensive experiments on three real-life datasets demonstrate that GEDM achieves an average F1-score improvement of 10.3% in the local scenario and 25.2% in the centralized scenario, outperforming all five existing state-of-the-art single-source competitors, and that FeLeDetect further improves local GEDM by 23.2% F1-score on average.
    Available online:  March 15, 2023 , DOI: 10.13328/j.cnki.jos.006802
    Abstract:
    Multi-label text classification methods based on deep learning lack multi-granularity learning of text information and the utilization of constraint relations between labels. To solve these problems, this study proposes a multi-label text classification method with enhancing multi-granularity information relations. First, this method embeds text and labels in the same space by joint embedding and employs the BERT pre-trained model to obtain the implicit vector feature representation of text and labels. Then, three multi-granularity information relations enhancing modules including document-level information shallow label attention (DISLA) classification module, word-level information deep label attention (WIDLA) classification module, and label constraint relation matching auxiliary module are constructed. The first two modules carry out multi-granularity learning from shared feature representation: the shallow interactive learning between document-level text information and label information, and the deep interactive learning between word-level text information and label information. The auxiliary module improves the classification performance by learning the relation between labels. Finally, the comparison with current mainstream multi-label text classification algorithms on three representative datasets shows that the proposed method achieves the best performance on main indicators of Micro-F1, Macro-F1, nDCG@k, and P@k.
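A minimal numpy sketch of the word-level label attention idea described above, in which each label embedding attends over token representations to form a label-specific text feature (dimensions, scoring, and the sigmoid output layer are illustrative assumptions, not the paper's exact modules):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

tokens = np.random.randn(50, 768)   # word-level representations from the encoder
labels = np.random.randn(12, 768)   # label embeddings in the same space

scores = softmax(labels @ tokens.T)          # (12, 50): attention of each label over words
label_doc_repr = scores @ tokens             # (12, 768): label-specific text features
logits = (label_doc_repr * labels).sum(-1)   # one score per label
probs = 1 / (1 + np.exp(-logits))            # independent sigmoids for multi-label output
```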
    Available online:  March 08, 2023 , DOI: 10.13328/j.cnki.jos.006756
    Abstract:
    Density peaks clustering (DPC) is a density-based clustering algorithm that can intuitively determine the number of clusters, identify clusters of any shape, and automatically detect and exclude abnormal points. However, DPC still has some shortcomings: it only considers the global distribution, so its clustering performance is poor on datasets with large differences in cluster density; in addition, its point allocation strategy is likely to cause a domino effect. Hence, this study proposes a DPC algorithm based on representative points and K-nearest neighbors (KNN), namely RKNN-DPC. First, the KNN density is constructed, and representative points are introduced to describe the global distribution of samples, based on which a new local density is proposed. Then, the KNN information of samples is used in a weighted KNN allocation strategy to relieve the domino effect. Finally, a comparative experiment is conducted with five clustering algorithms on artificial and real datasets. The experimental results show that the RKNN-DPC algorithm can identify cluster centers more accurately and obtain better clustering results.
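For intuition, here is a common KNN-based local density used in DPC variants, sketched in numpy (the exact density definition in RKNN-DPC, which also incorporates representative points, may differ; sizes and k are arbitrary):

```python
import numpy as np

def knn_density(X: np.ndarray, k: int = 5) -> np.ndarray:
    """Local density: larger when the k nearest neighbors are closer."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                  # exclude self-distance
    knn_dist = np.sort(d, axis=1)[:, :k]                         # k nearest distances per point
    return np.exp(-knn_dist.mean(axis=1))

X = np.random.rand(200, 2)
rho = knn_density(X, k=8)
center_candidates = np.argsort(-rho)[:5]   # densest points are cluster-center candidates
```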
    Available online:  March 02, 2023 , DOI: 10.13328/j.cnki.jos.006763
    Abstract:
    The emergence of the dynamic link library (DLL) provides great convenience for developers and improves the interaction between the operating system (OS) and applications. However, the potential security problems of DLLs cannot be ignored. Determining how to mine DLL-hijacking vulnerabilities during the running of Windows installers is important for ensuring the security of the Windows OS. In this paper, the attribute features of numerous installers are collected and extracted, and a double-layer bi-directional long short-term memory (BiLSTM) neural network is applied for machine learning from the perspectives of the installer, the way the installer invokes DLLs, and the DLL file itself. The multi-dimensional features of the vulnerability dataset are extracted, and unknown DLL-hijacking vulnerabilities are mined. In the experiments, DLL-hijacking vulnerabilities are effectively detected in Windows installers, and 10 unknown vulnerabilities are discovered and assigned CNVD identifiers. In addition, the effectiveness and integrity of the method are further verified by comparison with other vulnerability analyzers.
    Available online:  March 02, 2023 , DOI: 10.13328/j.cnki.jos.006764
    Abstract:
    Entity resolution widely exists in data tasks such as data quality control, information retrieval, and data integration. Traditional entity resolution methods mainly focus on relational data, but with the development of big data technology, the proliferation of data in different modalities, including texts and images, has created application demands for cross-modal data. Hence, cross-modal entity resolution has become a fundamental problem in big data processing and analysis. This study reviews the research development of cross-modal entity resolution and introduces its definition and evaluation metrics. Then, taking the construction of inter-modal relationships and the maintenance of intra-modal relationships as the main thread, existing research results are surveyed. In addition, widely used methods are tested on different open datasets, and their differences and the reasons behind them are analyzed. Finally, the open problems in current research are summarized, on the basis of which future research trends are outlined.
    Available online:  March 02, 2023 , DOI: 10.13328/j.cnki.jos.006759
    Abstract:
    The graphical password mitigates the burden of memorizing traditional textual passwords and simplifies the process of entering passwords, which has been widely applied to user authentication of mobile devices in recent years. Existing graphical password authentication schemes face critical threats. First, graphical passwords are vulnerable to shoulder-surfing attacks, namely that users’ graphical passwords may be leaked if attackers capture their login information through eyes or cameras. More seriously, these schemes are subject to credential leakage attacks. In other words, as the server stores authentication credentials related to the graphical passwords of users to verify their identities, if attackers obtain these credentials, they can perform offline password guessing attacks to retrieve users’ graphical passwords. To solve the above problems, this study proposes a secure graphical password authentication scheme, dubbed GADL. GADL embeds random challenge values into the graphical passwords of users to resist shoulder-surfing attacks, and thus attackers cannot obtain users’ passwords even if they capture their login information. To address credential database leakage of the server, GADL utilizes a deterministic threshold blind signature technique to protect users’ graphical passwords. In this technique, multiple key servers are utilized to assist users in the credential generation, which ensures that attackers cannot perform offline guessing attacks to obtain any knowledge of users’ passwords even if they obtain users’ credentials. The security analysis given in this study proves that GADL is resistant to the aforementioned attacks. In addition, the comprehensive performance evaluation of GADL demonstrates its high performance in terms of computation, storage, and communication overhead and proves that it can be easily deployed on mobile devices.
    Available online:  March 02, 2023 , DOI: 10.13328/j.cnki.jos.006779
    Abstract:
    With the development of cloud computing and service architectures such as software as a service (SaaS) and function as a service (FaaS), data centers, as service providers, constantly face resource management challenges: the quality of service (QoS) should be guaranteed, and the resource cost should be controlled. Therefore, a method to accurately measure computing power consumption becomes a key research issue for improving resource utilization while keeping the load pressure within an acceptable range. Owing to mature virtualization technologies and developing parallel technologies, the traditional estimation metric, CPU utilization, fails to account for the interference caused by resource competition, which leads to accuracy loss. Moreover, hyper-threading (HT) processors are the mainstream processors in data centers, which makes it urgent to estimate the computing power of HT processors. To address this estimation challenge, this study proposes the APU method, which estimates the computing power consumption of HT processors based on an understanding of the HT running mechanism and thread behavior modeling. Considering that users with different authorities can access different system levels, two implementation schemes are put forward: one based on hardware support and the other based on the operating system (OS). The proposed method takes CPU utilization as its only input, without requiring other dimensions. Additionally, it reduces the development and deployment costs of new monitoring tools and does not rely on special hardware architectures, thereby making the method universal and easy to apply. Finally, SPEC benchmarks further prove the effectiveness of the method: the estimation errors of three benchmarks are reduced from 20%, 50%, and 20% to less than 5%. To further prove its applicability, the APU method is applied to ByteDance clusters, and its effects are shown in case studies.
    Available online:  February 22, 2023 , DOI: 10.13328/j.cnki.jos.006641
    Abstract:
    Abnormal behavior detection is one of the important functions of intelligent surveillance systems and plays an active role in ensuring public security. To improve the detection rate of abnormal behavior in surveillance videos, this study designs a semi-supervised abnormal behavior detection network based on a probabilistic memory model from the perspective of learning the distribution of normal behavior, in an attempt to deal with the great imbalance between normal and abnormal behavior data. The network takes an auto-encoding network as the backbone and uses the gap between the predicted future frame and the real frame to measure the intensity of the anomaly. When extracting spatiotemporal features, the backbone network employs three-dimensional causal convolution and temporally-shared fully connected layers to avoid future information leakage and ensure the temporal order of information. In terms of auxiliary modules, a probabilistic model and a memory module are designed from the perspectives of probability entropy and the diverse patterns of normal behavior data to improve the quality of video frame reconstruction in the backbone network. Specifically, the probabilistic model uses an autoregressive process to fit the input data distribution, which drives the model to converge to the low-entropy state of the normal distribution; the memory module stores prototypical features of normal behavior in the historical data to realize the coexistence of multi-modal data and avoid the reconstruction of abnormal video frames caused by excessive involvement of the backbone network. Finally, ablation experiments and comparison experiments with classic algorithms are carried out on public datasets to examine the effectiveness of the proposed algorithm.
    Available online:  February 22, 2023 , DOI: 10.13328/j.cnki.jos.006652
    Abstract:
    After years of technical development and attack-defense confrontation, the reinforcement technology for Android applications has matured to the extent that protection granularity has gradually developed from general dynamic Dalvik executable (DEX) modification to a highly customized Native-layer obfuscation mechanism. Client code protection is strengthened by continuously increasing reverse analysis difficulty and workload. For the newly emerged reinforcement technology of obfuscator low level virtual machine (OLLVM) obfuscation, this study proposes an automatic anti-obfuscation solution CiANa based on Capstone and flow-sensitive concolic execution. The Capstone engine is used to analyze the basic block and its instruction structure, thereby identifying the real blocks scattered in the control flow graph of program disassembly. Then, the execution sequence of the real blocks is determined by leveraging flow-sensitive concolic execution. Finally, the real block assembly instructions are repaired to obtain anti-obfuscated executable binary files. The comparative experimental results show that CiANa can recover the Android Native files under OLLVM obfuscation in the ARM/ARM64 architecture. As the first framework that offers effective anti-obfuscation and generates executable files for all versions (Debug/Release version) of OLLVM in the ARM/ARM64 architecture, CiANa provides necessary auxiliary support for reverse analysis.
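To make the first step concrete, here is a small sketch of using the Capstone engine to disassemble an AArch64 code region and split it into basic blocks at branch instructions (the byte string is a tiny hand-made snippet for illustration; CiANa's real analysis of real/fake blocks and its concolic execution are far more involved):

```python
from capstone import Cs, CS_ARCH_ARM64, CS_MODE_ARM

CODE = bytes.fromhex(
    "200080d2"   # mov  x0, #1
    "1f2003d5"   # nop
    "02000014"   # b    +8      (ends the first basic block)
    "c0035fd6"   # ret          (ends the second basic block)
)

md = Cs(CS_ARCH_ARM64, CS_MODE_ARM)
BRANCHES = {"b", "br", "bl", "blr", "ret", "cbz", "cbnz", "b.eq", "b.ne"}

blocks, current = [], []
for insn in md.disasm(CODE, 0x1000):
    current.append(f"{insn.address:#x}: {insn.mnemonic} {insn.op_str}")
    if insn.mnemonic in BRANCHES:   # a branch terminates the current basic block
        blocks.append(current)
        current = []
if current:
    blocks.append(current)

for i, blk in enumerate(blocks):
    print(f"block {i}:", blk)
```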
    Available online:  February 22, 2023 , DOI: 10.13328/j.cnki.jos.006757
    Abstract:
    Mixed precision has made many advances in deep learning and precision tuning and optimization. Extensive research shows that mixed precision optimization for stencil computation is challenging. Moreover, the research achievements secured by the polyhedral model in the field of automatic parallelization indicate that the model provides a good mathematical abstraction for loop nesting, on the basis of which loop transformations can be performed. This study designs and implements an automatic mixed precision optimizer for Stencil computation on the basis of polyhedral compilation technology. By performing iterative domain partitioning, data flow analysis, and scheduling tree transformation on the intermediate representation layers, this study implements the source-to-source automatic generation of mixed precision codes for Stencil computation for the first time. The experiments demonstrate that the code after automatic mixed precision optimization can give full play to its parallelism potential and improve the performance of the program by reducing precision redundancy. With high-precision computing as the benchmark, the maximum speedup is 1.76, and the geometric average speedup is 1.15 on the x86 architecture; on the new-generation Sunway architecture, the maximum speedup is 1.64, and the geometric average speedup is 1.20.
    Available online:  February 22, 2023 , DOI: 10.13328/j.cnki.jos.006758
    Abstract:
    With the increasingly powerful performance of neural network models, they are widely used to solve various computer-related tasks and show excellent capabilities. However, a clear understanding of the operation mechanism of neural network models is lacking. Therefore, this study reviews and summarizes the current research on the interpretability of neural networks. A detailed discussion is rendered on the definition, necessity, classification, and evaluation of research on model interpretability. With the emphasis on the focus of interpretable algorithms, a new classification method for the interpretable algorithms of neural networks is proposed, which provides a novel perspective for the understanding of neural networks. According to the proposed method, this study sorts out the current interpretable methods for convolutional neural networks and comparatively analyzes the characteristics of interpretable algorithms falling within different categories. Moreover, it introduces the evaluation principles and methods of common interpretable algorithms and expounds on the research directions and applications of interpretable neural networks. Finally, the problems confronted in this regard are discussed, and possible solutions to these problems are given.
    Available online:  February 15, 2023 , DOI: 10.13328/j.cnki.jos.006754
    Abstract:
    Tag-aware recommendation algorithms use tagged data to enhance the recommendation models’ understanding of user preferences and item attributes, which attract extensive attention in the field. Most existing methods, however, neglect the diversities of user concerns, item attributes, and tag semantics and interfere with the correlation inference of the three, which affects the recommendation results. Therefore, this paper introduces the disentangled graph collaborative filtering (DGCF) method into the tag-aware recommendation task and proposes a DGCF-based explainable tag-aware recommendation (DETRec) method. It disentangles the perspectives of users, items, and tags to provide explainable recommendation references. Specifically, DETRec utilizes a correlation graph construction module to model the user–item–tag correlations. Then, it employs a neighborhood routing mechanism and a message propagation mechanism to disentangle the nodes to form the sub-graphs of attributes and thereby describe the nodal correlations under different attributes. Finally, it generates recommendation references on the basis of these attribute sub-graphs. This study implements two types of DETRec instantiation: 1) DETRec based on a single graph (DETRec-S), which describes all correlations of user, item, and tag nodes in a single graph, and 2) DETRec based on multiple graphs (DETRec-M), which utilizes three bipartite graphs to describe the user–item, item–tag, and user–tag correlations separately. Extensive experiments on three public datasets demonstrate that the above two types of DETRec instantiation are significantly superior to the baseline model and generate the references corresponding to the recommendation results. Hence, DETRec is an effective explainable tag-aware recommendation algorithm.
    Available online:  February 15, 2023 , DOI: 10.13328/j.cnki.jos.006765
    Abstract:
    In recent years, software construction, operation, and evolution have encountered many new requirements, such as the need for efficient switching or configuration in development and testing environments, application isolation, resource consumption reduction, and higher efficiency of testing and deployment. These requirements pose great challenges to developers in developing and maintaining software. Container technology has the potential of releasing developers from the heavy workload of development and maintenance. Of particular note, Docker, as the de facto industrial standard for containers, has recently become a popular research area in the academic community. To help researchers understand the status and trends of research on Docker containers, this study conducts a systematic literature review by collecting 75 high-level papers in this field. First, quantitative methods are used to investigate the basic status of research on Docker containers, including research quantity, research quality, research areas, and research methods. Second, the first classification framework for research on Docker containers is presented in this study, and the current studies are systematically classified and reviewed from the dimensions of the core, platform, and support. Finally, the development trends of Docker container technology are discussed, and seven future research directions are summarized.
    Available online:  February 15, 2023 , DOI: 10.13328/j.cnki.jos.006762
    Abstract:
    The extended Berkeley packet filter (eBPF) mechanism in the Linux kernel can safely load user-provided untrusted programs into the kernel. In the eBPF mechanism, the verifier checks these programs and ensures that they will not cause the kernel to crash or access the kernel address space maliciously. In recent years, the eBPF mechanism has developed rapidly, and its verifier has become more complex as more and more new features are added. This study observes two problems of the complex eBPF verifier. One is the “false negative” problem: There are many bugs in the complex security check logic of the verifier, and attackers can leverage these bugs to design malicious eBPF programs that can pass the verifier to attack the kernel. The other is the “false positive” problem: Since the verifier adopts the static check method, only conservative checks can be performed due to the lack of runtime information. This may cause the originally safe program to fail the check of the verifier and only support limited semantics, which brings difficulties to the development of eBPF programs. Further analysis shows that the static simulation execution check mechanism in the eBPF verifier features massive codes, high complexity, and conservative analysis, which are the main reasons for security vulnerabilities and false positives. Therefore, this study proposes to replace the static simulation execution check mechanism in the eBPF verifier with a lightweight dynamic check method. The bugs and conservative checks that originally existed in the eBPF verifier due to simulation execution no longer exist, and hence, the above-mentioned “false negative” and “false positive” problems can be eliminated. Specifically, the eBPF program is run in a kernel sandbox, which dynamically checks the memory access of the program in the runtime to prevent it from accessing the kernel memory illegally. For efficient implementation of a lightweight kernel sandbox, the Intel protection keys for supervisor (PKS), a new hardware feature, is used to perform a zero-overhead memory access check, and an efficient interaction method between the kernel and the eBPF program in the sandbox is presented. The evaluation results show that this study can eliminate memory security vulnerabilities of the eBPF verifier (this type of vulnerability has accounted for more than 60% of the total vulnerabilities of the eBPF verifier since 2020). Moreover, in the scenario of high-throughput network packet processing, the performance overhead brought by the lightweight kernel sandbox is lower than 3%.
    Available online:  February 15, 2023 , DOI: 10.13328/j.cnki.jos.006750
    Abstract:
    Entity recognition is a key task of information extraction. With the development of information extraction technology, researchers turn the research direction from the recognition of simple entities to the recognition of complex ones. Complex entities usually have no explicit features, and they are more complicated in syntactic constructions and parts of speech, which makes the recognition of complex entities a great challenge. In addition, existing models widely use span-based methods to identify nested entities. As a result, they always have an ambiguity in the detection of entity boundaries, which affects recognition performance. In response to the above challenge and problem, this paper proposes an entity recognition model GIA-2DPE based on prior semantic knowledge and type embedding. The model uses keyword sequences of entity categories as prior semantic knowledge to improve the cognition of entities, utilizes type embedding to capture potential features of different entity types, and then combines prior knowledge with entity-type features through the gated interactive attention mechanism to assist in the recognition of complex entities. Moreover, the model uses 2D probability encoding to predict entity boundaries and combines boundary features and contextual features to enhance accurate boundary detection, thereby improving the performance of nested entity recognition. This study conducts extensive experiments on seven English datasets and two Chinese datasets. The results show that GIA-2DPE outperforms state-of-the-art models and achieves a 10.4% F1 boost compared with the baseline in entity recognition tasks on the ScienceIE dataset.
    Available online:  February 08, 2023 , DOI: 10.13328/j.cnki.jos.006651
    Abstract:
    Parallelization is one of the most effective blockchain scalability solutions, and the existing parallelization schemes can be classified into two categories, i.e., starlike structure and parallel structure, according to the network structure. However, the current research lacks the analyses of factors affecting the performance boundary and performance bottleneck in starlike sharding structure. To address this problem, this study abstracts a general starlike sharding structure of blockchains for the schemes adopting different starlike sharding structure, and the transaction process in this general structure is quantitatively modeled to derive the relationship between throughput and the number of shards in starlike sharding structure. According to the constructed model, there exists a performance limit in starlike sharding structure and an optimal sharding quantity to maximize the system throughput. An explicit functional relationship exists between the maximal throughput and the functional complexity of the mainchain. With the proposed throughput model, related blockchain systems can balance the number of shards and the functional complexity of the mainchain to reach the theoretical upper limit of system throughput with the consideration of their specific design. Therefore, the work of this study has significant guiding value in the design of the schemes adopting starlike parallelization.
    Available online:  February 08, 2023 , DOI: 10.13328/j.cnki.jos.006660
    Abstract:
    Various business relationships and routing policies exist among the autonomous systems (ASes) in an inter-domain routing system. Routing propagation violating the export policy agreements among the ASes is likely to cause route leaks, ultimately leading to serious consequences such as network interruption, traffic eavesdropping, and link overload. Verifying routing policy compliance is thus essential for ensuring the security and stability of the inter-domain routing system. However, the dual requirements of ASes for the autonomous configuration and privacy protection of local routing policies increase the difficulty in verifying routing policy compliance and consequently pose a hard problem that remains to be settled properly in the field of inter-domain routing security. This study proposes a blockchain-based verification method for inter-domain routing policy compliance. With blockchain and the cryptographic technology as trust endorsements, this method enables ASes to publish, interact, verify, and execute routing policy expectations in a safe and private manner. The authenticity of the routing propagation process is ensured by generating route attestations corresponding to routing updates. Thus, the verification of routing policy compliance is completed by multi-domain cooperation. A prototype system is implemented, and experiments and analyses are carried out on real routing data. The results show that the proposed method offers traceable verification of export policy compliance of routing propagation without leaking the business relationships and local routing policies among ASes, suppresses policy-violating routing propagation effectively with reasonable overhead, and maintains a remarkable ability to suppress policy-violating routing even in partial deployment scenarios.
    Available online:  January 18, 2023 , DOI: 10.13328/j.cnki.jos.006663
    Abstract:
    Distributed hash table (DHT) is widely used in distributed storage because of its efficient data addressing. Nevertheless, traditional DHT-based storage has to store data in specified nodes to achieve efficient data addressing, which restricts the application scope of the DHT technology severely. Taking heterogeneous storage networks for example, the storage space, bandwidth, and stability of nodes vary greatly. Choosing appropriate data storage nodes according to data features and the performance differences among the nodes can greatly improve the data access efficiency. However, the tight coupling between data and storage location disqualifies the traditional DHT-based storage from being applied to heterogeneous storage networks. Therefore, this study proposes a vRoute algorithm to decouple the data identifier from storage location in DHT-based storage. By building a distributed data index based on Bloom Filter, the vRoute algorithm allows data to be stored in any node of the network without reducing the efficiency of data addressing. It is implemented by extending the Kademlia algorithm, and its validity is verified theoretically. Finally, the simulation experiments show that vRoute achieves a data addressing efficiency close to that of the traditional DHT algorithm with low storage and network overhead.
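As a rough illustration of the kind of per-node data index vRoute builds on top of Kademlia, the sketch below shows a plain Bloom filter: a node can advertise a compact summary of the data identifiers it stores, so lookups can be routed without binding data to a fixed DHT position (sizes, hash choices, and key names are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1 << 16, k_hashes: int = 4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

index = BloomFilter()
index.add("file:video-0042")
print(index.might_contain("file:video-0042"))  # True
print(index.might_contain("file:video-9999"))  # False (with high probability)
```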
    Available online:  January 18, 2023 , DOI: 10.13328/j.cnki.jos.006649
    Abstract:
    In a published study, the problem of solving ε-NN by Turing reduction is studied. In other words, given a query point q, a point set P, and an approximation factor ε, the goal is to return an approximate nearest neighbor of q in P with an approximation ratio of at most 1+ε. Moreover, a Turing reduction algorithm with O(log n) query time complexity is proposed, where the query time is the number of times the oracle is invoked. This O(log n) query time is the lowest among all existing algorithms. However, the disadvantage of the proposed algorithm is the factor of O((d/ε)^d) in its preprocessing time complexity and space complexity; when the number of dimensions d is high or the approximation factor ε is small, this factor becomes unacceptable. Therefore, this study revises the reduction algorithm and analyzes its expected time and space complexity when the input point set follows a Poisson point process. As a result, the expected preprocessing time complexity is reduced to O(n log n), and the expected space complexity is reduced to O(n log n), while the expected query time complexity remains O(log n). In this sense, the future work raised in the published study is completed.
    Available online:  January 13, 2023 , DOI: 10.13328/j.cnki.jos.006818
    Abstract:
    With the development of IoT technology, IoT devices are widely applied in many areas of production and life. However, IoT devices also bring severe challenges to equipment asset management and security management. First, due to the diversity of device types and access modes, it is often difficult for administrators to know the types and operating status of the devices in their networks. Second, IoT devices are becoming the focus of cyber attacks because their limited computing and storage resources make it difficult to deploy traditional defense measures. Therefore, it is particularly important to identify the IoT devices in the network and to detect anomalies based on the identification results so as to ensure their normal operation. In recent years, academia has carried out a great deal of research on these issues. This paper systematically reviews the work related to IoT device identification and anomaly detection. In terms of device identification, existing research can be divided into passive identification methods and active identification methods according to whether packets are sent to the network. The passive identification methods are further examined according to their identification techniques, identification granularity, and application scenarios, and the active methods are examined according to their identification techniques, identification granularity, and detection granularity. In terms of anomaly detection, existing work can be divided into detection methods based on machine learning and detection methods based on normal behavior. On this basis, the challenges of IoT device identification and anomaly detection are summarized, and future development directions are proposed.
    Available online:  January 13, 2023 , DOI: 10.13328/j.cnki.jos.006537
    Abstract:
    The development of artificial intelligence brings more and more challenges to data hiding technology, and it is urgent to improve the security of existing steganography methods. In this study, a generative multiple adversarial steganography algorithm based on the U-Net structure is proposed to improve image data hiding ability. A generative multiple adversarial steganography network (GMASN), including a generative adversarial network, a steganalyzer optimization network, and a steganalysis network, is first constructed, and the anti-steganalysis ability of the steganographic image is improved through competition among the networks in the GMASN. At the same time, to address the problem that existing generative adversarial networks can only generate low-quality images randomly, a generative network based on the U-Net structure is designed to transfer the details of a reference image to the generated carrier image, so that carrier images with high visual quality can be generated purposefully. Moreover, the image discrimination loss, mean square error (MSE) loss, and steganalysis loss are dynamically combined so that the GMASN converges rapidly and stably. Experimental results show that the PSNR of the generated carrier image can reach 48.60 dB and that the discrimination rate between the generated carrier image and the steganographic image is 50.02%. The proposed algorithm can generate high-quality carrier images suitable for data hiding, enable the steganographic network to converge rapidly and stably, and effectively improve the security of image steganography.
    Available online:  January 13, 2023 , DOI: 10.13328/j.cnki.jos.006542
    Abstract:
    With the wide application of the global positioning system (GPS), more and more electric bicycles are equipped with GPS sensors. The massive trajectory data recorded by these sensors are of great value in many fields, such as the analysis of users' travel patterns and decision support for urban planners. However, the low-cost GPS sensors widely used on electric bicycles cannot provide high-precision positioning. Besides, map matching for electric bicycle track data is more complex and challenging because (1) there are many stay points in electric bicycle trajectories; (2) the sampling frequency is higher and the distance between adjacent track points is shorter; and (3) some roads are open only to electric bicycles, and the matching accuracy is sensitive to the quality of the road network. To solve the issues mentioned above, an adaptive and accurate road network map matching algorithm named KFTS-AMM is proposed, which consists of two main components: a segmented Kalman filtering based trajectory simplification (KFTS) algorithm and a segmented hidden Markov model based adaptive map matching (AMM) algorithm. Since the Kalman filter can be used for optimal state estimation, the trajectory simplification algorithm KFTS makes the trajectory curve smoother and reduces the impact of abnormal points on map matching accuracy by automatically fixing trajectory points during simplification. In addition, the matching algorithm AMM reduces the impact of invalid trajectory segments on map matching accuracy. Moreover, a stay-point identification and merging step is added to the processing of track data, which further improves accuracy. Extensive experiments on a real-world track dataset of electric bicycles in Zhengzhou show that the proposed KFTS-AMM outperforms the baselines in terms of accuracy and significantly speeds up the matching process by using the simplified track data.
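To illustrate the Kalman-filter smoothing idea behind KFTS, here is a minimal constant-velocity Kalman filter applied to one coordinate of a noisy track (a sketch only; the paper's state model, noise parameters, and segmentation logic may differ, and the noise values below are made up):

```python
import numpy as np

def kalman_smooth_1d(z, dt=1.0, q=1e-3, r=25.0):
    """z: noisy positions along one coordinate; returns filtered positions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])     # state transition (position, velocity)
    H = np.array([[1.0, 0.0]])                # only position is observed
    Q = q * np.eye(2)                         # process noise
    R = np.array([[r]])                       # measurement noise (GPS variance)
    x = np.array([[z[0]], [0.0]])             # initial state
    P = np.eye(2)
    out = []
    for zk in z:
        x = F @ x                             # predict
        P = F @ P @ F.T + Q
        y = np.array([[zk]]) - H @ x          # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
        x = x + K @ y                         # update
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0, 0])
    return np.array(out)

noisy = np.cumsum(np.ones(50)) + np.random.normal(0, 5, 50)  # straight path + GPS noise
smooth = kalman_smooth_1d(noisy)
```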
    Available online:  January 04, 2023 , DOI: 10.13328/j.cnki.jos.006646
    Abstract:
    SQL is a programming language widely used to operate relational databases. Many users (such as data analysts and junior programmers) encounter various difficulties when writing SQL queries due to a lack of programming experience and knowledge of SQL syntax. Research on automatically synthesizing SQL queries from <input-output> (I/O) example tables has therefore attracted more and more attention. The inductive SQL synthesis with positive and negative tuples (ISST) method proposed in this study can automatically synthesize SQL queries that meet users' expectations from I/O example tables that are edited by users and contain only a small number of tuples. The ISST method contains five main stages: constructing SQL query sketches, expanding the worksheet data, dividing the sets of positive and negative examples, inductively synthesizing selection predicates, and sorting after verification. The candidate SQL queries are verified on the online database PostgreSQL, and the candidate set of synthesized SQL queries is scored and sorted according to the principle of Occam's razor. The ISST method is implemented in Java and evaluated on a test set containing 28 samples. The results reveal that the ISST method correctly synthesizes 24 of the samples, taking an average of 2 seconds.
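A toy sketch of the predicate-induction idea (not ISST's actual algorithm): enumerate simple comparison predicates over a column and keep those that cover every positive tuple while excluding every negative tuple; the table, column, and labels below are invented for illustration.

```python
rows = [
    {"name": "alice", "age": 34, "label": "+"},   # tuples the user wants returned
    {"name": "bob",   "age": 41, "label": "+"},
    {"name": "carol", "age": 25, "label": "-"},   # tuples the user does not want
    {"name": "dave",  "age": 19, "label": "-"},
]

def induce_predicates(rows, column):
    """Return '>='-predicates on `column` consistent with all positive and negative tuples."""
    pos = [r[column] for r in rows if r["label"] == "+"]
    neg = [r[column] for r in rows if r["label"] == "-"]
    candidates = []
    for threshold in sorted({r[column] for r in rows}):
        if all(v >= threshold for v in pos) and not any(v >= threshold for v in neg):
            candidates.append(f"{column} >= {threshold}")
    return candidates

print(induce_predicates(rows, "age"))  # ['age >= 34'] -> SELECT * FROM t WHERE age >= 34
```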
    Available online:  January 04, 2023 , DOI: 10.13328/j.cnki.jos.006647
    Abstract:
    In recent years, graph neural networks (GNNs) have attracted wide attention due to their powerful and flexible representation ability. Considering the increasing scale of graph data and the limited GPU memory capacity, training GNNs with traditional general-purpose deep learning systems has become more challenging, and such training cannot give full play to the performance of GPU devices. Achieving efficient use of GPU hardware for GNN training is therefore one of the important research issues in this field. Traditional approaches employ sparse matrix multiplication for the calculation process of GNNs; when the GPU memory capacity is limited, the computation tasks are distributed to the devices through distributed matrix multiplication. Their shortcomings are mainly as follows: (1) sparse matrix multiplication ignores the sparse distribution of the graph data, which results in low computation efficiency; (2) these methods ignore the computation and memory access characteristics of GPUs and fail to utilize the hardware resources. To improve training efficiency, some studies propose to reduce the cost of each iteration and the storage requirements through graph sampling techniques, which also support flexible distributed scaling. Because of the introduced randomness and variance, however, these methods often affect model accuracy. Therefore, this study proposes a high-performance GNN training framework for multiple GPUs. Different GNN partition strategies for multiple GPUs are explored, and the influence of different graph ordering patterns on GPU performance during GNN computation is investigated to ensure model accuracy. Moreover, block-sparsity-aware optimization methods are put forward for GPU memory access. A prototype system is implemented in C++ and cuDNN. Experiments on four large-scale GNN datasets demonstrate that (1) the graph re-ordering method improves the GPU cache hit rate by around 40% and achieves a 2× computation speedup, and (2) compared with the existing system DGL, the proposed system achieves a total speedup of 5.8×.
    Available online:  January 04, 2023 , DOI: 10.13328/j.cnki.jos.006665
    Abstract:
    As the scale of business data increases, distributed online analytical processing (OLAP) is widely performed in business intelligence (BI), enterprise resource planning (ERP), user behavior analysis, and other application scenarios to support large-scale data analysis. Moreover, distributed OLAP overcomes the limitations of single-machine storage and stores data in memory to improve the performance of OLAP. However, after the in-memory distributed OLAP eliminates disk input/output (I/O), the join operation becomes one of its new performance bottlenecks. As a common practice in OLAP, the join operation involves a huge amount of data accessing and computation operations. By analyzing existing methods for the join operation, this study presents a graph structure indexing method that can accelerate the join operation and a new join method called LinkJoin based on it. Graph structure indexing stores the in-memory position of data in the form of a graph structure according to the join relationship specified by users. The join method based on graph structure indexing reduces the amount of data accessing and computation operations with a low complexity equivalent to that of the hash join. This study expands the state-of-the-art open-source in-memory OLAP system called MonetDB from a single-machine system to a distributed one and designs and implements a join method based on graph structure indexing on it. A series of designs and optimizations are also conducted in the aspects of graph indexing structure, columnar storage, and distributed execution engine to improve the distributed OLAP performance of the system. The test results show that in the TPC-H benchmark tests, the join operation based on graph structure indexing improves the performance on queries with join operations by 1.64 times on average and 4.1 times at most. For the join operation part of these queries, it enhances the performance by 9.8–22.1 times.
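To convey the flavor of a join driven by a precomputed structural index rather than a query-time hash table, the sketch below materializes, for every fact row, the position of its matching dimension row once, so a later "join" is a direct positional lookup (a simplification under assumed table names; LinkJoin's actual layout inside MonetDB is more elaborate):

```python
orders    = [{"order_id": 1, "cust_pos": None, "amount": 30},
             {"order_id": 2, "cust_pos": None, "amount": 75}]
customers = [{"cust_id": 10, "name": "alice"},
             {"cust_id": 20, "name": "bob"}]
order_cust = [10, 20]   # foreign-key value of each order row

# Build once: store, for every order row, the *position* of its customer row.
pos_of = {c["cust_id"]: i for i, c in enumerate(customers)}
for row, fk in zip(orders, order_cust):
    row["cust_pos"] = pos_of[fk]

# Query time: the "join" is a direct positional lookup per row, no hash table rebuild.
result = [(o["order_id"], customers[o["cust_pos"]]["name"], o["amount"]) for o in orders]
print(result)  # [(1, 'alice', 30), (2, 'bob', 75)]
```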
    Available online:  December 30, 2022 , DOI: 10.13328/j.cnki.jos.006724
    Abstract:
    The task of knowledge tracing is to trace the changes in students' knowledge states and predict their future learning performance according to their historical learning records. In recent years, knowledge tracing models based on the attention mechanism have clearly outperformed traditional knowledge tracing models in both flexibility and prediction performance. However, most existing deep models only consider the single-concept case and cannot directly deal with multi-concept exercises, even though multi-concept exercises are abundant in intelligent education systems. In addition, how to improve interpretability is one of the key challenges for deep knowledge tracing models. To solve these problems, this paper proposes a deep knowledge tracing model with multi-concept enhanced exercise embedding, which considers the relationships between the concepts in an exercise, introduces two novel multi-concept embedding methods, and combines educational psychology models with forgetting factors to improve prediction performance and interpretability. Experiments show that the proposed model outperforms existing models in predictive performance on large-scale real datasets, and the effectiveness of each module is verified.
    Available online:  December 30, 2022 , DOI: 10.13328/j.cnki.jos.006800
    Abstract:
    With the rapid growth and widening application of deep learning, the scale of DL training keeps increasing, and memory insufficiency has become one of the major bottlenecks threatening DL availability. Memory swapping is the key mechanism that alleviates this memory problem of DL training. This mechanism leverages the "time-varying" memory requirement of DL training by moving data between the memory of a specific computing accelerator and external storage. By replacing an accumulated memory requirement with an "instant" one, DL training can be durably supported on computing accelerators. This paper surveys the memory swapping mechanism for DL training from the perspective of the time-varying memory requirement. It first introduces a swapping-out mechanism that incorporates operator characteristics and a swapping-in mechanism that incorporates data dependency, then presents a joint, DL-performance-driven swapping decision mechanism, and finally offers prospects for future research in this field.
    Available online:  December 30, 2022 , DOI: 10.13328/j.cnki.jos.006777
    Abstract:
    Smoothed particle hydrodynamics (SPH) is a popular technique for fluid simulation in computer graphics. With the increasing demand for applications of SPH fluid simulation technology, many research achievements have emerged in recent years, including methods that improve the algorithms for incompressibility, viscosity, and surface tension, as well as strategies that enhance efficiency and stability. In addition, some researchers have focused on designing methods for complex scenarios and on building unified simulation frameworks to expand the application scope of SPH fluids. This paper summarizes and discusses the related work on SPH fluid simulation from the above aspects and looks ahead to future developments of the technology.
    Available online:  December 30, 2022 , DOI: 10.13328/j.cnki.jos.006804
    Abstract:
Stochastic configuration networks (SCNs) are a novel incremental learning model that differs from other randomized neural network models: the parameters of hidden-layer nodes are determined through a supervisory mechanism, which ensures fast convergence. Owing to their high learning efficiency, low human intervention, and strong generalization ability, SCNs have attracted broad research interest from domestic and foreign scholars and have developed rapidly since they were proposed in 2017. This paper presents, for the first time, a comprehensive survey of SCNs covering their basic theory, typical algorithmic variants, application fields, and future research directions. Firstly, the algorithmic principle, universal approximation capability, and advantages of SCNs are analyzed theoretically. Secondly, various SCN variants are summarized, such as DeepSCN, 2DSCN, robust SCN, ensemble SCN, distributed parallel SCN, and regularized SCN. Then, applications of SCNs in different fields are introduced, including hardware implementation, computer vision, biomedical data analysis, fault detection and diagnosis, and system modeling and prediction. Finally, the development potential of SCNs in convolutional neural network architectures, semi-supervised learning, unsupervised learning, multi-view learning, fuzzy neural networks, and recurrent neural networks is pointed out.
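A highly simplified numpy sketch of the incremental idea described above: hidden nodes with random weights are generated as candidates, a candidate is kept only if it is sufficiently correlated with the current residual (a stand-in for the supervisory inequality in the SCN literature, not its exact form), and the output weights are refit by least squares after each addition. Names, thresholds, and the acceptance score are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scn_like_fit(X, y, max_nodes=25, candidates=50, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    H = np.empty((X.shape[0], 0))            # hidden-layer outputs accepted so far
    params, residual, beta = [], y.copy(), np.zeros(0)
    for _ in range(max_nodes):
        best = None
        for _ in range(candidates):          # random candidate nodes
            w = rng.uniform(-1, 1, X.shape[1])
            b = rng.uniform(-1, 1)
            h = sigmoid(X @ w + b)
            score = (residual @ h) ** 2 / (h @ h)   # simplified supervisory score
            if best is None or score > best[0]:
                best = (score, w, b, h)
        if best[0] < tol:                    # no acceptable candidate: stop growing
            break
        params.append((best[1], best[2]))
        H = np.column_stack([H, best[3]])
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # refit output weights
        residual = y - H @ beta
    return params, beta

if __name__ == "__main__":
    X = np.random.default_rng(1).uniform(-1, 1, (200, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
    params, beta = scn_like_fit(X, y)
    print("hidden nodes added:", len(params))
```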
    Available online:  December 28, 2022 , DOI: 10.13328/j.cnki.jos.006662
    Abstract:
Recent studies on social recommendation have focused on the joint modeling of explicit and implicit relations in social networks and overlooked the fact that high-order implicit relations are not equally important to every user. The importance of high-order implicit relations to users with many neighbors differs greatly from their importance to users with few neighbors. In addition, due to the randomness of social relation construction, explicit relations are not always available. This study proposes a novel adaptive high-order implicit relations modeling (AHIRM) method, and the model consists of three components. Specifically, unreliable relations are filtered and potentially reliable relations are identified, which mitigates the adverse effects of unreliable relations and alleviates the data sparsity issue. Then, an adaptive random walk algorithm is designed to capture neighbors at different orders for each user according to normalized node centrality, construct high-order implicit relations among users, and ultimately reconstruct the social network. Finally, a graph convolutional network (GCN) is employed to aggregate information from neighbor nodes, and the user embeddings are updated accordingly to model the high-order implicit relations and further alleviate the data sparsity issue. The influences of social structure and personal preference are both considered during modeling, and the process of social influence propagation is simulated and preserved. The proposed model is compared with existing algorithms on the LastFM, Douban, and Gowalla datasets, and the results verify its effectiveness and rationality.
    Available online:  December 28, 2022 , DOI: 10.13328/j.cnki.jos.006644
    Abstract:
    Constraint Programming (CP) is one of the classical paradigms for representing and solving combinatorial problems. Extensional constraints, also called table constraints, are the most common type of constraints in CP, and most CP problems can be expressed by table constraints. In the problem-solving process, consistency algorithms are used to reduce the search space, and the simple table reduction (STR) algorithms are the most efficient consistency algorithms with table constraints, including Compact-Table (CT) and STRbit algorithms, both of which maintain the generalized arc consistency (GAC) during the search. In addition, the full pairwise consistency (fPWC) is stronger than GAC in the pruning capability, and the most efficient fPWC maintenance algorithm is the PW-CT algorithm. Over the years, many consistency algorithms with table constraints are proposed to improve the pruning capability and efficiency. Factor-decomposition encoding (FDE) recodes trivial problems, which enlarges the scale of the problem model to some extent, and as a result, maintaining a relatively weak GAC on a new model is equivalent to maintaining a strong fPWC on the original model. Currently, the appropriate STR algorithms for FDE are STRFDE and STR2 rather than CT as the CT algorithm may produce memory overflow. In the process of maintaining the consistency algorithm, it is necessary to call each constraint iteratively to perform its consistency algorithm to filter the search space. This process is called constraint propagation. The dynamic submission scheme is a parallel constraint propagation scheme, which can schedule constraint execution propagation algorithms in parallel, and it is particularly effective in large-scale problems. Therefore, this study proposes PSTRFDE for FDE by improving STRFDE and dynamic submission propagation algorithms. PSTRFDE can be embedded into the dynamic submission scheme to further improve the efficiency of constraint problem solving. Extensive experiments indicate that PSTRFDE can reduce the used memory compared with CT and STRbit, and compared with the results of STRFDE and STR2, the efficiency of PSTRFDE can be further improved. The research demonstrates that PSTRFDE is the most efficient filtering algorithm for FDE at present.
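A compact sketch (not the STRFDE/PSTRFDE algorithms themselves) of the core idea behind simple table reduction that the abstract builds on: tuples of a table constraint that are no longer valid under the current domains are dropped, and the domains are then shrunk to the values still supported by some remaining tuple, which is exactly generalized arc consistency on that constraint. Variable names and domains are illustrative.

```python
def str_propagate(variables, domains, table):
    """variables: list of names; domains: dict name -> set of values; table: allowed tuples."""
    valid = [t for t in table
             if all(val in domains[var] for var, val in zip(variables, t))]
    new_domains = {var: set() for var in variables}
    for t in valid:                              # collect the values still supported
        for var, val in zip(variables, t):
            new_domains[var].add(val)
    return valid, new_domains

if __name__ == "__main__":
    vars_ = ["x", "y", "z"]
    doms = {"x": {0, 1}, "y": {0, 1, 2}, "z": {1, 2}}
    allowed = [(0, 0, 1), (0, 2, 2), (1, 1, 0), (1, 2, 2)]
    table, doms = str_propagate(vars_, doms, allowed)
    print(table)   # (1, 1, 0) is removed because 0 is not in z's current domain
    print(doms)    # y loses value 1, since no remaining tuple supports it
```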
    Available online:  December 22, 2022 , DOI: 10.13328/j.cnki.jos.006643
    Abstract:
    This study proposes a new feature constrained distillation learning method for visual anomaly detection, which makes full use of the features of the teacher model to instruct the student model to efficiently identify abnormal images. Specifically, the Vision Transformer (ViT) model is introduced as the backbone network of anomaly detection tasks, and a central feature strategy is put forward to constrain the output features of the student network. Considering the strong feature expressiveness of the teacher network, the central feature strategy is developed to dynamically generate the feature representation centers of normal samples for the student network from the teacher network. In this way, the ability of the student network to describe the feature output of normal data is improved, and the feature difference between the student and teacher networks in abnormal data is widened. In addition, to minimize the difference between the student and teacher networks in the feature representation of normal images, the proposed method leverages the Gram loss function to constrain the relationship between the coding layers of the student network. Experiments are conducted on three general anomaly detection data sets and one real-world industrial anomaly detection data set, and the experimental results demonstrate that the proposed method significantly improves the performance of visual anomaly detection compared with the state-of-the-art methods.
    Available online:  December 22, 2022 , DOI: 10.13328/j.cnki.jos.006661
    Abstract:
    Although traditional watermarking attack methods can obstruct the correct extraction of watermark information, they reduce the visual quality of watermarked images greatly. Therefore, a novel imperceptible watermarking attack method based on residual learning is proposed. Specifically, a watermarking attack model based on a convolutional neural network is constructed for the end-to-end nonlinear learning between a watermarked image and an unwatermarked one. A mapping from the watermarked image to the unwatermarked one is thereby accomplished to achieve the purpose of watermarking attack. Then, a proper number of feature extraction blocks are selected according to the embedding region of watermark information to extract a feature map containing watermark information. As the difference between the two images is insignificant, the learning ability of the watermarking attack model is limited in the training process, making it difficult for the model to reach a convergence state. A residual learning mechanism is thus introduced to improve the convergence speed and learning ability of the watermarking attack model. The imperceptibility of the attacked image can be improved by reducing the difference between the residual image (the subtraction between the watermarked image and the extracted feature map) and the unwatermarked one. In addition, a dataset for training the watermarking attack model is constructed with the super-resolution dataset DIV2K2017 and the attacked robust color image watermarking algorithm based on quaternion exponent moments. The experimental results show the proposed watermarking attack model can attack a robust watermarking algorithm with a high bit error rate (BER) without compromising the visual quality of watermarked images.
    Available online:  December 22, 2022 , DOI: 10.13328/j.cnki.jos.006512
    Abstract:
    Existing image denoising algorithms have achieved decent performance on the images with the additive white Gaussian noise (AWGN), while their generalization ability is not good on real-world images with unknown noise. Motivated by the significant progress of deep convolution neural networks (CNNs) in image denoising, a novel two-scale blind image denoising algorithm is proposed based on self-supervised constraints. Firstly, the proposed algorithm leverages the denoised results from the small-scale network branch to provide additional useful information for the large-scale image denoising, so as to achieve favorable denoised results. Secondly, the used network is composed of a noise estimation subnetwork and an image non-blind denoising subnetwork. The noise estimation subnetwork is firstly used to predict noise map, and then image denoising is carried out through the non-blind denoising subnetwork under the guidance of the corresponding noise map. In view of the fact that the real noise image lacks the corresponding clean image as the label, an edge-preserving self-supervised constraint is proposed based on the total variation (TV) priori, which generalizes the network to different real noisy datasets by adjusting smoothing parameters. To keep the consistency of the image background, a background guidance module (BGM) is proposed, which builds a self-supervised constraint based on the information difference between multi-scale Gaussian blurred images and thus assists the network to complete image denoising. In addition, the structural similarity attention mechanism (SAM) is proposed to guide the network to pay attention to trivial structural details in images, so as to recover real denoised images with cleaner texture details. The relevant experimental results on the SIDD, DND, and Nam benchmarks indicate that the proposed self-supervised blind denoising algorithm is superior to some deep supervised denoising methods, and the generalization performance of the network is improved significantly.
    Available online:  December 22, 2022 , DOI: 10.13328/j.cnki.jos.006659
    Abstract:
    The order of label learning is crucial to a classifier chains method. Therefore, this study proposes a classifier chains method based on the association rules and topological sequence (TSECC). Specifically, a measurement strategy for label dependencies based on strong association rules is designed by leveraging frequent patterns. Then, a directed acyclic graph is constructed according to the dependency relationships among the labels to topologically sort all the vertices in the graph. Finally, the topological sequence obtained is used as the order of label learning to iteratively update each label’s classifier successively. In particular, to reduce the impact of “lonely” labels with no or low label dependencies on the prediction performance on the other labels, TSECC excludes “lonely” labels out of the topological sequence and uses a binary relevance model to train them separately. Experimental results on a variety of public multi-label datasets show that TSECC can effectively improve classification performance.
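A small sketch of the label-ordering step in a TSECC-style classifier chain: label dependencies (assumed here to come from mined strong association rules) form a directed acyclic graph, "lonely" labels with no dependencies are set aside for independent binary-relevance training, and the remaining labels are topologically sorted with Kahn's algorithm to fix the order in which chained classifiers are learned. Label names and rules below are hypothetical.

```python
from collections import deque

def chain_order(labels, edges):
    """edges: list of (a, b) meaning label a should be learned before label b."""
    involved = {a for a, b in edges} | {b for a, b in edges}
    lonely = [l for l in labels if l not in involved]      # trained separately
    indeg = {l: 0 for l in involved}
    succ = {l: [] for l in involved}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(sorted(l for l in involved if indeg[l] == 0))
    order = []
    while queue:                               # Kahn's topological sort
        l = queue.popleft()
        order.append(l)
        for m in succ[l]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order, lonely

if __name__ == "__main__":
    labels = ["beach", "sea", "sunset", "urban", "indoor"]
    rules = [("sea", "beach"), ("sunset", "beach")]   # hypothetical strong rules
    print(chain_order(labels, rules))  # (['sea', 'sunset', 'beach'], ['urban', 'indoor'])
```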
    Available online:  December 16, 2022 , DOI: 10.13328/j.cnki.jos.006655
    Abstract:
Improving the quality and diversity of generated samples has always been one of the main challenges in the field of generative adversarial networks (GANs). For this reason, a bidirectional constraint GAN (BCGAN) is proposed. Compared with traditional GAN variants, this network adds one more generator module to the architecture, and the two generators approach the data distribution of real samples from two different directions. According to the network architecture of BCGAN, this study then designs a new loss function and analyzes and proves it theoretically. During BCGAN training, the diversity of the generated samples is enriched by increasing the distance between the data distributions of the two generators' samples, and the discriminator's difference between these two distributions is reduced to stabilize training and thereby improve the quality of the generated samples. Finally, experiments are carried out on a synthetic dataset and three open challenge datasets. These experiments show that, compared with other generative methods, the proposed method fits the real data distribution better and effectively improves the quality and diversity of the generated samples; in addition, its training process is smoother and more stable.
    Available online:  December 16, 2022 , DOI: 10.13328/j.cnki.jos.006511
    Abstract:
Stable learning aims to leverage the knowledge obtained from a single training dataset to learn a robust prediction model that can accurately predict the labels of test data drawn from a different but related distribution. To achieve promising performance on test data with agnostic distributions, existing stable learning algorithms focus on eliminating the spurious correlations between the features and the class variable. However, these algorithms can only weaken part of the spurious correlations rather than eliminate them completely, and they may suffer from overfitting when learning the prediction model. To tackle these issues, this study proposes a stable learning algorithm based on sample reweighting and dual classifiers, which jointly optimizes the sample weights and the parameters of the dual classifiers to learn a robust prediction model. Specifically, to estimate the effects of all features on classification, the proposed algorithm balances the distribution of confounders by learning global sample weights so as to remove the spurious correlations between the features and the class variable. To eliminate the spurious correlations between some irrelevant features and the class variable and to weaken the influence of irrelevant features on the sample reweighting process, the algorithm selects and removes some irrelevant features before reweighting. To further improve the generalization ability of the model, the algorithm constructs two classifiers and learns a prediction model with an optimal hyperplane by minimizing the parameter difference between the two classifiers during training. Experiments on synthetic and real-world datasets validate the effectiveness of the proposed algorithm.
    Available online:  December 16, 2022 , DOI: 10.13328/j.cnki.jos.006504
    Abstract:
Layered key structures play an important role in quantum communication. In addition to using EPR and GHZ states to achieve layered quantum key distribution, asymmetric high-dimensional multi-particle entanglement also provides a new approach to this problem, and it is more efficient in the number of quantum channel uses than conventional quantum key distribution over bipartite links. This study introduces five layered key structures for three users and gives partitionable key structures for four and five users. For the various layered key structures introduced, the two methods above are compared to obtain the protocol with the highest idealized key rate for each key structure. When a quantum network has more than three users and its key structure is partitionable, it is proved that an idealized key rate of 1 can be achieved in each layer by using EPR and GHZ states. Finally, the partitionable key structures for four and five users are taken as examples to elaborate on the construction.
    Available online:  December 08, 2022 , DOI: 10.13328/j.cnki.jos.006508
    Abstract:
With the development of the Internet and service-oriented technology, a new type of Web application, the Mashup service, has become popular on the Internet and grown rapidly. How to find high-quality services among a large number of Mashup services has become a focus of attention. Studies have shown that discovering and clustering services with similar functions can effectively improve the accuracy and efficiency of service discovery. Current methods mainly focus on mining the hidden functional information in Mashup services and use specific clustering algorithms such as K-means for clustering. However, Mashup service documents are usually short texts, and traditional mining algorithms such as LDA can hardly represent short texts well or obtain satisfactory clustering results from them. To solve this problem, this study proposes a non-negative matrix factorization model combining tags and word embedding (TWE-NMF) to discover topics for Mashup services. The method first normalizes the Mashup services and then uses a Dirichlet process multinomial mixture model based on improved Gibbs sampling to automatically estimate the number of topics. Next, it combines word embedding and service tag information with non-negative matrix factorization to compute Mashup topic features, and a spectral clustering algorithm is used to cluster the Mashup services. Finally, the performance of the method is comprehensively evaluated. Compared with existing service clustering methods, the proposed method achieves significant improvements in evaluation metrics such as precision, recall, F-measure, purity, and entropy.
    Available online:  December 08, 2022 , DOI: 10.13328/j.cnki.jos.006509
    Abstract:
High-quality feature representations can boost the performance of object detection and other computer vision tasks. Modern object detectors resort to versatile feature pyramids to enrich the representation power but neglect that pathways of different directions should use different fusing operations to meet their different needs for information flow. This study proposes separated spatial semantic fusion (SSSF), which uses a channel attention block (CAB) in the top-down pathway to pass semantic information and a spatial attention block (SAB) with a bottleneck structure in the bottom-up pathway to pass precise location signals to the top level with fewer parameters and less computation than plain spatial attention without dimension reduction. SSSF is effective and generalizes well: it improves AP by more than 1.3% for object detection, improves the fusing operation of the top-down path for semantic segmentation by about 0.8% over plain addition, and boosts instance segmentation performance in all metrics for both bounding box AP and mask AP.
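A hedged PyTorch sketch of the two fusing operations just described; layer sizes, the exact wiring, and the up/downsampling choices are illustrative assumptions rather than the paper's configuration. The channel attention block gates the upsampled top-down features per channel, while the spatial attention block uses a channel-reducing bottleneck to gate the bottom-up features per location.

```python
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context per channel
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, top, lateral):
        up = nn.functional.interpolate(top, size=lateral.shape[-2:], mode="nearest")
        return lateral + up * self.fc(up)                   # semantics gated per channel

class SpatialAttentionBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction                         # bottleneck keeps it cheap
        self.attn = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, bottom, lateral):
        down = nn.functional.max_pool2d(bottom, kernel_size=2)
        return lateral + down * self.attn(down)             # locations gated spatially

if __name__ == "__main__":
    cab, sab = ChannelAttentionBlock(256), SpatialAttentionBlock(256)
    p5, p4 = torch.randn(1, 256, 16, 16), torch.randn(1, 256, 32, 32)
    print(cab(p5, p4).shape, sab(p4, torch.randn(1, 256, 16, 16)).shape)
```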
    Available online:  December 08, 2022 , DOI: 10.13328/j.cnki.jos.006514
    Abstract:
In smart healthcare, cloud computing and the Internet of Things are combined to enable real-time access to large-scale data. However, uploading the data to a remote cloud incurs additional communication cost and transmission delay. Fog computing has therefore been introduced into smart healthcare: fog servers assist the cloud server in storing and accessing data locally, which contributes to low latency and high mobility. Since medical data are highly sensitive, how to design a fog computing-based authentication protocol for smart healthcare has become a research hotspot; if the data are tampered with illegally, the consequences can be catastrophic. Hence, the authentication protocol should be secure against various attacks and realize secure data transmission among users, fog nodes, and cloud servers. This study analyzes two authentication schemes for smart healthcare and points out that the scheme of Hajian et al. cannot resist stolen-verifier attacks, denial-of-service attacks, impersonation attacks, node capture attacks, or session key disclosure attacks, and that the scheme of Wu et al. cannot resist offline password guessing attacks or impersonation attacks. Furthermore, a fog computing-based three-party authentication and key agreement protocol is proposed for smart healthcare. Its security is proved with the random oracle model, BAN logic, and heuristic analysis, showing that it is secure against known attacks. A performance comparison with related schemes shows that the proposed scheme is more suitable for fog computing-based smart healthcare.
    Available online:  December 08, 2022 , DOI: 10.13328/j.cnki.jos.006656
    Abstract:
    The observation of tumor location and growth is an important link in the formulation of tumor treatment plans. Intervention methods based on medical images can be employed to visually observe the status of the tumor in the patient in a non-invasive way, predict the growth of the tumor, and ultimately help physicians develop a treatment plan specific to the patient. This study proposes a new deep network model, namely the conditional adversarial spatiotemporal encoder model, to predict tumor growth. This model mainly consists of three parts: the tumor prediction generator, the similarity score discriminator, and conditions composed of the patient’s personal situations. The tumor prediction generator predicts the tumor in the next period according to the tumor images of two periods. The similarity score discriminator is used to calculate the similarity between the predicted tumor and the real one. In addition, this study adds the patient’s personal situations as conditions to the tumor growth prediction process. The proposed model is experimentally verified on two collected medical datasets. The experimental results achieve a recall rate of 76.10%, an accuracy rate of 91.70%, and a Dice coefficient of 82.4%, indicating that the proposed model can accurately predict the tumor images of the next period.
    Available online:  December 08, 2022 , DOI: 10.13328/j.cnki.jos.006654
    Abstract:
    Community is an important attribute of information networks. Community search, as an important content of information network analysis, aims to find a set of nodes that meet the conditions specified by the user. As heterogeneous information networks contain more comprehensive and richer structural and semantic information, community search in such networks has received extensive attention in recent years. However, the existing community search methods for heterogeneous information networks cannot be directly applied when the search conditions are complex. For this reason, this study defines community search under complex conditions and proposes search algorithms considering asymmetric meta-paths, constrained meta-paths, and prohibited node constraints. These three algorithms respectively use the meta-path completion strategy, the strategy of adjusting batch search with labeling, and the way of dividing complex search conditions to search communities. Moreover, two optimization algorithms respectively based on the pruning strategy and the approximate strategy are designed to improve the efficiency of the search algorithm with prohibited node constraints. A large number of experiments are performed on real datasets, and the experimental results verify the effectiveness and efficiency of the proposed algorithms.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006533
    Abstract:
Recommendation systems, which can effectively filter information based on user preferences, have been widely applied. The problems of cold start and data sparsity become increasingly serious with the explosive growth of the number of users. Multi-source data fusion, which can effectively improve recommendation accuracy under data sparsity and cold start conditions, is favored by researchers. Its main idea is to fuse auxiliary information about users from other domains to fill in missing values and thereby optimize the accuracy of the target recommendation service. Nevertheless, the relations between data introduce a more serious risk of privacy disclosure. To solve these problems, this study proposes a deep cross-domain recommendation model with privacy protection. In detail, a deep-learning-based collaborative recommendation method is designed that features multi-source data fusion and differential privacy protection. On the one hand, the method fuses auxiliary-domain information to improve recommendation accuracy and corrects the deviation of abnormal points to improve the performance of the recommender system; on the other hand, it adds noise during collaborative training based on the differential privacy model to address the data security problem in data fusion. To evaluate the long-tail effect in the recommendation system, this study also proposes a new metric, the discovery degree, which measures the ability of a recommendation algorithm to find users' invisible requirements. Performance comparison and analysis against existing algorithms show that, under the premise of privacy protection, the proposed method achieves better recommendation accuracy and diversity than existing methods and can effectively discover users' hidden needs.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006534
    Abstract:
To improve the CPU utilization of spacecraft computers, the new generation of spacecraft operating systems adopts a hybrid scheduling algorithm that includes both fixed-point starting tasks and sporadic tasks. Fixed-point starting tasks are often safety-critical; they need to be started at fixed time points and cannot be blocked during execution. When fixed-point starting tasks and sporadic tasks coexist, existing real-time locking protocols cannot guarantee that the blocking time of fixed-point starting tasks is zero. Therefore, on the basis of the classic priority ceiling protocol, this study proposes a real-time locking protocol based on the idea of blocking avoidance, which ensures that sporadic tasks' access to shared resources does not affect the execution of fixed-point starting tasks by making judgments in advance and setting virtual starting points. At the same time, temporarily raising the access priority of some resources reduces the cost caused by task preemption. This paper derives the worst-case blocking time of the proposed protocol and analyzes its performance through schedulability experiments. The experiments show that, in the case of short critical sections, the protocol can keep the schedulability loss caused by accessing shared resources below 27%.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006507
    Abstract:
As a new type of high-value computing system, cloud computing has been widely used in various industries. Classified protection 2.0 also requires dynamic trust verification when active immune trusted computing technology is applied. In the cloud computing mode, the virtual machine is the direct carrier through which users consume cloud services, and its trusted startup is the basis for the trustworthiness of the virtual machine operating environment. However, since a virtual machine runs on a physical node in the form of a process, its startup process is highly dynamic, and unexpected interference may occur between multiple virtual machine domains. Existing trusted startup schemes for virtual machines suffer from insufficient dynamic protection during the startup process and do not eliminate the unexpected interference between multiple virtual domains. To solve these problems, this study proposes a trusted virtual machine startup scheme based on non-interference theory. Firstly, based on non-interference theory, a run-time trustworthiness theorem for virtual machine processes is proposed; the definition of trusted virtual machine startup is then given, and the corresponding judgment theorem is proved. Secondly, according to this theorem, monitoring and control logic is designed on the basis of system calls, and the virtual machine startup process is actively measured and controlled. Finally, experimental evaluation shows that the proposed scheme can effectively eliminate unexpected interference between multiple virtual machines in complex cloud environments, ensure the dynamic trustworthiness of the virtual machine startup process, and greatly reduce the performance overhead.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006497
    Abstract:
The main purpose of facial action unit (AU) analysis is to identify the state of each facial action unit, and it can be applied to many scenarios such as lie detection, autonomous driving, and intelligent healthcare. In recent years, with the popularization of deep learning in computer vision, facial action unit analysis has attracted extensive attention. Facial action unit analysis can be divided into two tasks: facial action unit recognition and facial action unit intensity estimation. However, existing studies usually address only one of these problems. More importantly, these methods usually focus only on designing or learning complex feature representations and ignore the semantic correlations between facial action units. In fact, facial action units often have strong interrelationships, and how to effectively use such semantic knowledge for learning and reasoning is the key to facial action unit analysis. This study models the semantic relationships among facial action units by analyzing their co-occurrence and mutual exclusion in various facial behaviors, organizes the facial AUs in the form of a structured knowledge graph, and then proposes an AU semantic relationship embedded representation learning (SRERL) framework. Experiments are conducted on three benchmarks, BP4D, DISFA, and FERA2015, for both facial action unit analysis tasks. The experimental results show that the proposed method outperforms previous work and achieves state-of-the-art performance. Furthermore, experiments on the BP4D+ dataset and an occlusion evaluation on the BP4D dataset demonstrate the outstanding generalization and robustness of the proposed method.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006527
    Abstract:
BLAS (basic linear algebra subprograms) is an important module of high-performance extended math libraries and is widely used in scientific and engineering computing. Level 1 BLAS provides vector-vector operations, and Level 2 BLAS provides matrix-vector operations. This study designs and implements highly optimized Level 1 and Level 2 BLAS routines for SW26010-Pro, a domestic many-core processor. A reduction strategy among CPEs is designed based on the RMA communication mechanism, which improves the reduction efficiency of many Level 1 and Level 2 BLAS routines. For TRSV, TPSV, and other routines with data dependencies, a series of efficient parallelization algorithms are proposed. These algorithms maintain data dependencies through point-to-point synchronization and adopt an efficient task mapping mechanism suitable for triangular matrices, which effectively reduces the number of point-to-point synchronizations and improves execution efficiency. Adaptive optimization, vector compression, data reuse, and other techniques further improve the memory bandwidth utilization of the Level 1 and Level 2 BLAS routines. The experimental results show that the memory bandwidth utilization of the Level 1 BLAS routines reaches up to 95%, with an average above 90%, and that of the Level 2 BLAS routines reaches up to 98%, with an average above 80%. Compared with the widely used open-source linear algebra library GotoBLAS, the proposed Level 1 and Level 2 BLAS routines achieve average speedups of 18.78 times and 25.96 times, respectively. With the optimized routines, LQ decomposition, QR decomposition, and eigenvalue problems achieve an average speedup of 10.99 times.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006529
    Abstract:
Secure multi-party computation is one of the hot topics in the international cryptographic community, and the secure computation of the intersection-sum is a new problem in this area. The problem has important theoretical significance and practical value in industry, commerce, and healthcare. Existing solutions assume that the private sets are subsets of a universal set, leak the intersection cardinality, and have a nonzero failure probability. Based on the Paillier cryptosystem, this study designs three protocols for the intersection-sum problem that are secure in the semi-honest model. Protocol 1 privately computes the number of common identifiers (i.e., the cardinality of the user identifier intersection) and the sum of the integer values associated with these users; Protocols 2 and 3 privately compute the sum of the integer values associated with the intersection elements without leaking the intersection cardinality. The whole computation reveals no information about the private inputs beyond the intersection-sum itself. The protocols do not require the private sets to be subsets of a universal set and can therefore be applied in more scenarios. Using the simulation paradigm, it is proved that the protocols are secure in the semi-honest model, and their efficiency is tested by experiments.
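A toy sketch of the additively homomorphic building block such Paillier-based protocols rely on: a textbook Paillier cryptosystem with insecurely small primes, showing that the product of ciphertexts decrypts to the sum of the plaintexts. The identifier matching steps and the semi-honest security arguments of the actual protocols are not reproduced; primes and values are illustrative only.

```python
import math
import random

def keygen(p=1789, q=1867):                      # toy primes, far too small for real use
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                                    # standard simple choice of generator
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu, n)

def encrypt(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                   # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(sk, c):
    lam, mu, n = sk
    return ((pow(c, lam, n * n) - 1) // n) * mu % n

if __name__ == "__main__":
    pk, sk = keygen()
    values = [120, 45, 310]                      # e.g. amounts tied to intersecting users
    ciphertexts = [encrypt(pk, v) for v in values]
    product = math.prod(ciphertexts) % (pk[0] ** 2)
    print(decrypt(sk, product))                  # 475: the sum, recovered homomorphically
```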
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006516
    Abstract:
The Snowden incident revealed that certain cryptosystems have indeed been subverted. The elliptic curve digital signature algorithm (ECDSA) is widely used, for example for signing Bitcoin transactions, because of its short signature length at a given security level. However, whether ECDSA can be subverted and how to resist such attacks remain open questions. This study answers these questions positively. Firstly, it shows how to use a pseudorandom function (PRF) to compute the random value that replaces the randomness used in ECDSA; the subverted ECDSA enables an adversary to extract the signing private key from at most three consecutive signatures. Secondly, the hash value of the private key, the message, and the random signature component is used as a second random number to improve the ECDSA scheme, yielding a signature scheme that resists subversion attacks: even if an adversary replaces a component of the new signature algorithm, it cannot extract any information about the signing key. Finally, both the proposed algorithm and the existing algorithm are implemented, and the implementation demonstrates that the proposed scheme has advantages in computational complexity and efficiency.
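A toy worked example (on a textbook curve, not the paper's exact subversion) of why controlling ECDSA randomness is fatal: if two signatures reuse the same nonce k, anyone can recover k and then the private key d from the public signatures alone. The curve, keys, and messages below are tiny illustrative values.

```python
p, a, b, n = 17, 2, 2, 19            # curve y^2 = x^3 + 2x + 2 over F_17, group order 19
G = (5, 1)

def add(P, Q):
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0: return None
    if P == Q:
        s = (3 * P[0] * P[0] + a) * pow(2 * P[1], -1, p) % p
    else:
        s = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (s * s - P[0] - Q[0]) % p
    return (x, (s * (P[0] - x) - P[1]) % p)

def mul(k, P):
    R = None
    while k:                          # double-and-add scalar multiplication
        if k & 1: R = add(R, P)
        P, k = add(P, P), k >> 1
    return R

def sign(d, h, k):                    # deliberately takes the nonce k as an argument
    r = mul(k, G)[0] % n
    s = pow(k, -1, n) * (h + d * r) % n
    return r, s

d, k = 7, 5                           # private key and the (subverted) repeated nonce
(r, s1), (_, s2), (h1, h2) = sign(d, 3, k), sign(d, 8, k), (3, 8)
k_rec = (h1 - h2) * pow(s1 - s2, -1, n) % n
d_rec = (s1 * k_rec - h1) * pow(r, -1, n) % n
print(k_rec, d_rec)                   # prints 5 7: nonce and private key recovered
```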
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006517
    Abstract:
One of the main challenges of blockchain technology is to protect the privacy of transaction identities under an open ledger and multi-party consensus. At present, identity privacy protection schemes based on anonymous authentication and transaction mixing in public blockchains are difficult to popularize in industry because they lack supervision. Based on the identity privacy protection scheme in Monero, this study introduces the role of a regulator and designs a supervised privacy protection scheme for the transaction receiver based on one-time address encryption and zero-knowledge proofs. It also designs a linkable revocable ring signature scheme, built on linkable ring signatures and revocable ring signatures, to implement a supervised privacy protection scheme for the transaction sender based on autonomous mixing. The scheme not only protects the identity privacy of the participants but also supports offline transaction identity recovery by the regulator, achieving the regulatory goal of "controllable anonymity". Analysis and test results show that the algorithms run in milliseconds, which meets the performance requirements of blockchains in non-high-frequency transaction scenarios.
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006519
    Abstract:
General matrix multiplication (GEMM) is one of the most frequently used functions in scientific and engineering computation and is also a base function of many linear algebra libraries; its performance usually has an essential influence on the whole application. Moreover, because it is computationally intensive, its efficiency is often regarded as an important metric of the hardware platform. This study systematically optimizes dense GEMM on the domestic SW1621 processor. Based on an analysis of the baseline code, profiling of the various overheads, and utilization of the architectural features and instruction set, the optimization of DGEMM is carefully designed and carried out, including the blocking scheme, packing mechanism, kernel function implementation, and data prefetching. In addition, a code generator is developed that can generate different assembly and C code according to its input parameters. Using the code generator together with auto-tuning scripts, the optimal values of the tunable parameters can be found. After applying these optimizations and tuning, the proposed DGEMM achieves 85% of the peak performance of a single core with a single thread and 80% of the peak performance of the entire 16-core chip. The optimization of DGEMM not only improves the performance of BLAS on SW1621 but also provides an important reference for optimizing dense data computation on SW series multi-core machines.
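An illustrative numpy sketch of the cache-blocking idea behind such a DGEMM optimization; the block sizes here are arbitrary, and the real kernel on the target processor would be hand-tuned assembly with packing, prefetching, and vector instructions rather than this high-level loop nest.

```python
import numpy as np

def blocked_gemm(A, B, mc=64, kc=64, nc=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for j in range(0, N, nc):              # loop over panels of B / C columns
        for k in range(0, K, kc):          # loop over the shared dimension
            for i in range(0, M, mc):      # loop over panels of A / C rows
                # the "micro-kernel": a small block product sized to fit in cache
                C[i:i+mc, j:j+nc] += A[i:i+mc, k:k+kc] @ B[k:k+kc, j:j+nc]
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((200, 150)), rng.standard_normal((150, 180))
    print(np.allclose(blocked_gemm(A, B), A @ B))   # True
```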
    Available online:  November 30, 2022 , DOI: 10.13328/j.cnki.jos.006523
    Abstract:
Microservices are becoming the mainstream architecture of cloud-based software systems because of their agile development and rapid deployment. However, the structure of a microservice system is complex: it often contains hundreds of service instances, and the call relationships between services are extremely complicated. When an anomaly occurs in a microservice system, it is difficult to locate its root cause, so end-to-end request tracing has become standard practice in microservice systems. However, current distributed request tracing methods are intrusive to applications and rely heavily on the developers' expertise in request tracing. Besides, they cannot start or stop the tracing functionality at runtime. These defects not only increase the burden on developers but also restrict the adoption of distributed request tracing in practice. This study designs and implements a transparent request tracing system named Trace++, which can generate tracing code automatically and inject the generated code into a running application through dynamic code instrumentation. Trace++ is minimally intrusive to programs, transparent to developers, and able to start or stop tracing flexibly. In addition, its adaptive sampling method effectively reduces the cost of request tracing. Experiments on TrainTicket, a microservice system, show that Trace++ can accurately discover the dependencies between services, that its performance cost is close to that of source code instrumentation when tracing is enabled, and that it incurs no performance cost when tracing is stopped. Moreover, the adaptive sampling method preserves representative trace data while reducing the volume of trace data by 89.4%.
    Available online:  November 24, 2022 , DOI: 10.13328/j.cnki.jos.006498
    Abstract:
Fault localization is an essential precondition for program repair in software development. To this end, researchers have proposed automated fault localization (AFL) methods to facilitate the task. Such approaches take full advantage of information such as the execution traces and execution results of given test cases and significantly reduce the difficulty of fault localization. In competitive crowdsourcing software development, one task may receive multiple competing implementations (solutions). This study proposes a novel AFL approach for crowdsourcing software engineering. Its key insight is that, when locating faulty statements in a program, the competing implementations can be regarded as reference programs: for each statement in the buggy program, reference statements are searched for in the reference programs, and the suspiciousness score of the statement is computed by leveraging these references. Given a set of test cases and a buggy program, the tests are run and the initial suspiciousness score of each statement in the buggy program is computed with a widely used SBFL approach. After that, the suspiciousness score of each statement is adjusted according to its similarity with statements in the competing implementations. The proposed approach is evaluated on 118 real-world buggy programs that are accompanied by competing implementations. The evaluation results suggest that, compared with SBFL approaches, it reduces the cost of fault localization by more than 25%.
    Available online:  November 16, 2022 , DOI: 10.13328/j.cnki.jos.006496
    Abstract:
Vertical data partitioning logically stores database table attributes that satisfy certain semantic conditions in the same physical block, so as to reduce the cost of data access and improve the efficiency of query processing. Each query usually involves only some of a table's attributes, so a subset of the attributes suffices to produce exact query results. Reasonable vertical partitioning allows most queries to be answered without scanning the whole table, which reduces the amount of data accessed and improves query processing efficiency. Traditional vertical partitioning methods are mainly based on heuristic rules set by experts: the granularity of partitioning is coarse, and they cannot provide different partition optimizations according to workload characteristics. Moreover, when the workload or the number of attributes becomes large, the execution time of existing methods is too long to meet the performance requirements of online, real-time database tuning. Therefore, a method called spectral clustering based vertical partitioning (SCVP) is proposed for the online environment. The idea of a phased solution is adopted to reduce the time complexity of the algorithm and speed up partitioning. Firstly, SCVP reduces the solution space by adding constraints, that is, by generating initial partitions through spectral clustering. Secondly, SCVP designs an algorithm to search the solution space, that is, the initial partitions are refined by combining frequent itemset mining and greedy search. To further improve performance on high-dimensional attributes, an improved version of SCVP called SCVP-R (spectral clustering based vertical partitioning redesign) is proposed. SCVP-R optimizes the partition combiner component of SCVP by introducing a sympatric-competition mechanism, a double-elimination mechanism, and a loop mechanism. Experimental results on different datasets show that SCVP and SCVP-R have faster execution times and better performance than the current state-of-the-art vertical partitioning method.
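A small sketch of the initial partition generation idea under stated assumptions: attributes that are frequently accessed together in the workload get a high affinity, and spectral clustering on that affinity matrix yields candidate vertical fragments. The frequent-itemset refinement and the SCVP-R mechanisms are not reproduced; attribute names and the workload are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def affinity_from_workload(attributes, queries):
    """queries: list of (accessed_attribute_set, frequency)."""
    idx = {a: i for i, a in enumerate(attributes)}
    A = np.zeros((len(attributes), len(attributes)))
    for accessed, freq in queries:
        for x in accessed:
            for y in accessed:
                A[idx[x], idx[y]] += freq          # co-access weighted by query frequency
    return A

if __name__ == "__main__":
    attrs = ["id", "name", "age", "salary", "dept", "photo"]
    workload = [({"id", "name", "age"}, 50),
                ({"salary", "dept"}, 30),
                ({"age", "salary"}, 2),
                ({"id", "photo"}, 5)]
    A = affinity_from_workload(attrs, workload)
    labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                                random_state=0).fit_predict(A)
    for part in range(3):                          # print the attributes in each fragment
        print([a for a, l in zip(attrs, labels) if l == part])
```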
    Available online:  November 16, 2022 , DOI: 10.13328/j.cnki.jos.006506
    Abstract:
As software delivery increasingly emphasizes speed and reliability, continuous integration (CI) has attracted growing attention in recent years. Developers continually integrate working copies into the mainline, and each integration runs automated tests to verify whether the update introduces faults. However, as software grows, test suites contain more and more test cases, and as software evolves, the coverage and fault detection ability of test cases also change across CI cycles, so traditional test case prioritization techniques may become inapplicable. Techniques based on reinforcement learning can adjust prioritization strategies dynamically according to test feedback, but the existing reinforcement-learning-based techniques proposed in recent years do not comprehensively consider the information in the test suite during prioritization, which limits their effectiveness. This study proposes a new test case prioritization method for CI, called the pointer ranking method. The method uses features such as the historical information of test cases as inputs. In each CI cycle, the agent applies an attention mechanism over all candidate test cases and produces a prioritization result. After test execution, it obtains the updating direction from the feedback, constantly adjusting its prioritization strategy in the loop of prioritization, test execution, and test feedback, and finally achieves satisfactory prioritization performance. This study verifies the effectiveness of the proposed method on five large-scale datasets and explores the impact of history length on its performance, its effectiveness on datasets that contain only regression test cases, and its execution efficiency. The study draws the following conclusions. First, compared with existing techniques, the pointer ranking method can adjust its strategy as the software evolves and effectively enhances the fault detection ability of the test sequence in CI. Second, it is robust to the history length, and a small amount of historical information is enough for it to achieve optimal performance. Third, it handles both regression test cases and newly added test cases well. Finally, its time overhead is small. Considering its better and more stable prioritization performance, the pointer ranking method is highly competitive.
    Available online:  October 28, 2022 , DOI: 10.13328/j.cnki.jos.006495
    Abstract:
    Heterogeneous defect prediction (HDP) can effectively solve the problem that the source project and the target project use different features. It uses heterogeneous feature data from the source project to predict the defect tendency of the software module in the target project. At present, HDP has made certain achievements, but its overall performance is not satisfactory. Most previous HDP methods solve this problem by learning domain invariant feature subspace to reduce the difference between domains. However, the source domain and the target domain usually show huge heterogeneity, which makes the domain alignment effect not satisfied. The reason is that these methods ignore the potential knowledge that the classifier should generate similar classification probability distributions for the same category in the two domains, and fail to mine the intrinsic semantic information contained in the data. In addition, because the collection of training data in newly launched projects or historical legacy projects relies on expert knowledge, is time-consuming, laborious, and error-prone, the possibility of heterogeneous defect prediction is explored based on a small number of labeled modules in the target project. Based on these, a heterogeneous defect prediction method is proposed based on simultaneous semantic alignment (SHSSAN). On the one hand, it explores the implicit knowledge learned from the labeled source projects, so as to transfer relevance between categories and achieve implicit semantic information transfer. On the other hand, in order to learn the semantic representation of unlabeled target data, centroid matching is performed through target pseudo-labels to achieve explicit semantic alignment. At the same time, SHSSAN can effectively solve the class imbalance problem and the data linearly inseparable problem, and make full use of the label information in the target project. Experiments on public heterogeneous data sets containing 30 different projects show that compared with the current excellent CTKCCA, CLSUP, MSMDA, KSETE, and CDAA methods, the F-measure and AUC are increased by 6.96%, 19.68%, 19.43%, 13.55%, 9.32% and 2.02%, 3.62%, 2.96%, 3.48%, 2.47%, respectively.
    Available online:  October 26, 2022 , DOI: 10.13328/j.cnki.jos.006753
    Abstract:
    As a complement and extension of the terrestrial network, the satellite network contributes to the acceleration of bridging the digital divide between different regions and can expand the coverage and service range of the terrestrial network. However, the satellite network features highly dynamic topology, long transmission delay, and limited on-board computing and storage capacity. Hence, various technical challenges, including routing scalability and transmission stability, are encountered in the organic integration of the satellite network and the terrestrial network and the construction of a global space-ground integrated network (SGIN). Considering the research challenges of SGIN, this paper describes the international and domestic research progress of SGIN in terms of network architecture, routing, transmission, multicast-based content delivery, etc., and then discusses the research trends.
    Available online:  October 26, 2022 , DOI: 10.13328/j.cnki.jos.006755
    Abstract:
    The critical reliability and availability of distributed systems are threatened by crash recovery bugs caused by incorrect crash recovery mechanisms and their implementations. The detection of crash recovery bugs, however, can be extremely challenging since these bugs only manifest themselves when a node crashes under special timing conditions. This study presents a novel approach Deminer to automatically detect crash recovery bugs in distributed systems. Observations in the large-scale distributed systems show that node crashes that interrupt the execution of related I/O write operations, which store a piece of data (i.e., common data) in different places, e.g., different storage paths or nodes, are more likely to trigger crash recovery bugs. Therefore, Deminer detects crash recovery bugs by automatically identifying and injecting such error-prone node crashes under the usage guidance of common data. Deminer first tracks the usage of critical data in a correct run. Then, it identifies I/O write operation pairs that use the common data and predicts error-prone injection points of a node crash on the basis of the execution trace. Finally, Deminer tests the predicted injection points of the node crash and checks failure symptoms to expose and confirm crash recovery bugs. A prototype of Deminer is implemented and evaluated on the latest versions of four widely used distributed systems, i.e., ZooKeeper, HBase, YARN, and HDFS. The experimental results show that Deminer is effective in finding crash recovery bugs. Deminer has detected six crash recovery bugs.
    Available online:  October 26, 2022 , DOI: 10.13328/j.cnki.jos.006749
    Abstract:
Code change is a key activity in software evolution, and its quality has a large impact on software quality. Modeling and representing code changes is the basis of many software engineering tasks, such as just-in-time defect prediction and the recovery of software product traceability. Representation learning technologies for code changes have attracted extensive attention and have been applied to diverse tasks in recent years. This type of technology aims to represent the semantic information of code changes as low-dimensional dense real-valued vectors, that is, to learn distributed representations of code changes. Compared with conventional, manually designed code change features, such technologies offer the advantages of automatic learning, end-to-end training, and accurate representation. However, the field still faces challenges such as the difficulty of exploiting structural information and the absence of benchmark datasets. This study surveys and summarizes the recent progress of studies and applications of representation learning technologies for code changes in four parts. (1) It presents the general framework of code change representation learning and its applications. (2) It then reviews the currently available representation learning technologies for code changes and summarizes their respective advantages and disadvantages. (3) It summarizes and classifies the downstream applications of such technologies. (4) Finally, it discusses the challenges and potential opportunities facing representation learning for code changes and suggests directions for the future development of this type of technology.
    Available online:  October 26, 2022 , DOI: 10.13328/j.cnki.jos.006752
    Abstract:
    Most traditional information hiding methods embed secret data by modifying cover data, which inevitably leaves traces of modification in cover data, and hence, it is difficult to resist the detection of the existing steganalysis algorithms. Consequently, the technique of coverless information hiding emerges, which hides secret data without modifying cover data. To improve the hiding capacity and robustness of coverless information hiding, this study proposes a constructive data hiding method based on texture synthesis and recognition with image style transfer. Firstly, natural images and texture images of different categories are used to construct the content image database and the textural style image database, respectively. A mapping dictionary of binary codes is established according to the categories of natural images in the content image database. Secondly, the labeled textural image database should be constructed and input into the convolutional neural network as a training dataset, and the texture image recognition model can be obtained by iterative training. In this way, the secret data can be extracted from stego images at the receiving end. During secret data hiding, natural images are selected from the content image database according to to-be-embedded secret data fragments, which are synthesized to form a stego mosaic image. Then, a texture image is randomly selected from the textural style image database, and the stego texture image can be generated by the selected texture image and the stego mosaic image with the strategy of style transfer to achieve secret data hiding. During secret data extraction, the obtained texture image recognition model can accurately identify the original categories of stego texture images corresponding to natural images, and secret data can be finally extracted by reference to the mapping dictionary. The experimental results demonstrate that the proposed method can achieve the stego texture image with a satisfactory visual effect and a high hiding capacity, and it illustrates strong robustness to attacks such as JPEG compression and Gaussian noise.
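A tiny sketch of the mapping-dictionary idea described above: secret bits are split into fixed-size fragments, each fragment selects a natural-image category through a shared dictionary, and the receiver inverts the mapping after recognizing the categories in the stego texture image. The fragment size and category names are hypothetical placeholders, and the texture synthesis, style transfer, and recognition steps are not reproduced.

```python
FRAGMENT_BITS = 3
CATEGORIES = ["airplane", "bird", "car", "cat", "dog", "flower", "horse", "ship"]
TO_CATEGORY = {format(i, f"0{FRAGMENT_BITS}b"): c for i, c in enumerate(CATEGORIES)}
TO_BITS = {c: b for b, c in TO_CATEGORY.items()}

def hide(bits):
    padded = bits + "0" * (-len(bits) % FRAGMENT_BITS)        # pad to a whole fragment
    return [TO_CATEGORY[padded[i:i + FRAGMENT_BITS]]
            for i in range(0, len(padded), FRAGMENT_BITS)]

def extract(categories):
    return "".join(TO_BITS[c] for c in categories)

if __name__ == "__main__":
    secret = "10110100110"
    selected = hide(secret)             # categories of images to compose into the mosaic
    print(selected, extract(selected))  # recovered bits include the padding zeros
```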
    Available online:  October 26, 2022 , DOI: 10.13328/j.cnki.jos.006648
    Abstract:
    The many-objective evolutionary algorithm based on decomposition is the main approach to solving many-objective optimization problems, but its performance largely depends on the matching degree between the adopted reference vectors and the real Pareto front (PF). Existing decomposition-based many-objective evolutionary algorithms can hardly deal with all kinds of many-objective optimization problems with different PF at the same time. To solve this issue, this study proposes a many-objective evolutionary algorithm based on the curvature estimation (MaOEA-CE) of PF. The core of the proposed algorithm includes two aspects: Firstly, on the basis of PF curvature estimation, different reference vectors are generated in each iteration to gradually match the real PF of different kinds of problems. Secondly, with the estimated PF curvature, the appropriate aggregation function is used to select elite solutions and dynamically adjust the generated reference vector in the environmental selection, which can improve the convergence while maintaining the diversity of the population. Moreover, MaOEA-CE is compared with seven advanced many-objective algorithms on three mainstream problem sets for testing, i.e., DTLZ, WFG, and MaF, to verify its effectiveness. The experimental results prove that MaOEA-CE has strong competitiveness.
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006491
    Abstract:
    Many quantifiable state-out-of-bound software defects, such as access violations, memory exhaustion, and performance failures, are caused by a large quantity of input data. However, existing dependent-data identification and mutation optimization techniques for grey-box fuzzing mainly focus on fixed-length data formats and are not efficient at increasing the amount of accumulated data required to reach such cumulative buggy states. This study proposes a differential mutation method to accelerate feature-state optimization during directed fuzzing. By monitoring the seeds that update the maximum or minimum state value of the cumulative defect, the effective mutation offsets and contents are determined. The frequency and distribution of the effective mutation offsets are then leveraged to distinguish whether the feature value of the defect depends on a fixed field or on cumulative data in the input. The effective mutation content is reused as material when mutating cumulative inputs, which accelerates bug reproduction or directed testing. Based on this idea, this study implements the fuzzing tool Jigsaw. The evaluation results on the experimental data set show that the proposed dependency detection method can efficiently detect the type of input data that drives the feature value of cumulative defects, and that the mutation method significantly shortens the reproduction time of cumulative defects that require a large amount of special input data.
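    A rough sketch of the offset-distribution heuristic described above, assuming the byte offsets of effective mutations (those that pushed the monitored state value to a new extreme) have already been collected; the spread threshold and the classification rule are illustrative assumptions, not the tool's actual logic.
```python
def classify_dependency(effective_offsets, spread_threshold=0.3):
    # effective_offsets: byte offsets of mutations that set a new max/min of the
    # monitored cumulative state value. Offsets concentrated on a few positions
    # suggest the state depends on a fixed field; offsets spread across the input
    # suggest the state accumulates over the data. The 0.3 threshold is assumed.
    if not effective_offsets:
        return "unknown"
    distinct_ratio = len(set(effective_offsets)) / len(effective_offsets)
    return "cumulative" if distinct_ratio > spread_threshold else "fixed_field"
```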
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006493
    Abstract:
    In recent years, single-stage instance segmentation methods have made preliminary progress in real-world applications due to their high efficiency, but they still have two drawbacks compared with two-stage counterparts. (1) Low accuracy: the single-stage method does not perform multiple rounds of refinement, so its accuracy still falls short of real-world requirements. (2) Low flexibility: most existing single-stage methods are specifically designed models that are not compatible with object detectors. This paper presents an accurate and flexible framework for single-stage instance segmentation, which contains the following two key designs. (1) To improve the accuracy of instance segmentation, a grid dividing binarization algorithm is proposed, in which the bounding box region is first divided into several grid cells and instance segmentation is then performed on each grid cell. In this way, the original full-object segmentation task is simplified into sub-tasks over grid cells, which significantly reduces the complexity of feature representation and further improves instance segmentation accuracy. (2) To be compatible with object detectors, a plug-and-play module is designed, which can be seamlessly plugged into most existing object detection methods, thus enabling them to perform instance segmentation. The proposed method achieves excellent performance on public datasets such as MS COCO, outperforming most existing single-stage methods and even some two-stage methods.
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006494
    Abstract:
    Software-defined networking (SDN) is a new network architecture that separates the control and forwarding planes and can schedule and optimize network resources based on global information. Nevertheless, precise scheduling requires accurate measurement of information on the entire network, including the status of all switching devices and all link information in the topology. In-band network telemetry (INT) can collect such information while forwarding data packets, and configuring detection paths that cover the entire network is one of the key issues to be solved for INT. However, existing detection path configuration methods for INT have the following problems. (1) A large number of detection nodes must be deployed in advance, which increases maintenance overhead. (2) The detection path may be too long, so that the length of the detection packet exceeds the MTU of the network. (3) Redundant detection paths cause the traffic load introduced by the measurement to account for too large a share of the overall network traffic. (4) The recovery time of detection path adjustment under a dynamically changing topology is too long. To solve the above problems, an adaptive detection path configuration method for in-band network telemetry in SDN based on graph segmentation (ACGS) is proposed. The basic idea is to divide the network topology by graph segmentation, which restricts the detection path length by controlling the topology scale; to solve the Euler circuit in each divided subgraph so as to obtain a detection path that traverses every directed edge in the subgraph exactly once, which avoids the problems of too many detection nodes and high detection path redundancy; and to combine local adjustment with global adjustment to shorten the recovery time of the detection path when the topology changes dynamically. The experimental results prove that the ACGS method can realize INT detection path configuration in SDN with moderate detection path length, fewer detection nodes, lower detection path redundancy, and faster adjustment under a dynamically changing topology.
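    The Euler-circuit step can be sketched with Hierholzer's algorithm on a directed subgraph whose edges are the links to be probed. This is a generic sketch of the standard algorithm, not the paper's full configuration procedure, and it assumes the subgraph already satisfies the Eulerian condition (every node's in-degree equals its out-degree and the edges are connected).
```python
from collections import defaultdict

def euler_circuit(edges, start):
    # edges: list of directed (u, v) pairs of the subgraph to be probed.
    # Returns a node sequence that traverses every directed edge exactly once,
    # starting and ending at `start` (Hierholzer's algorithm).
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())   # follow an unused outgoing edge
        else:
            path.append(stack.pop())     # backtrack when no edges remain
    return path[::-1]
```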
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006526
    Abstract:
    The file hierarchy ciphertext-policy attribute-based encryption (FH-CP-ABE) scheme realizes multi-level file encryption with a single access policy, which saves the computation cost of encryption and decryption and the storage cost of ciphertext. Nevertheless, the existing file hierarchy CP-ABE scheme cannot support graded user access and suffers from unauthorized access. For this reason, a file hierarchy CP-ABE scheme that supports graded user access is proposed. In the proposed scheme, a graded user access tree is constructed, and the ciphertext subsections are reconstructed to support the access requirements of graded users, thus eliminating the possibility of unauthorized access by users. The security analysis shows that the proposed scheme can resist selective chosen-plaintext attacks. Both theoretical and experimental analyses show that the proposed scheme is more efficient in terms of computation and storage than related schemes.
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006460
    Abstract:
    A new architecture that supports multiple coordinators and multi-replica storage has emerged in distributed database systems, which brings new challenges to the correctness of transaction scheduling. The challenges are represented by new data anomalies caused by the lack of a central coordinator and data inconsistency caused by the multi-replica mechanism. Based on the definition of transaction isolation levels and consistency protocols for distributed systems, this study constructs a unified hybrid dependency graph model for transactional multi-level consistency in multi-coordinator and multi-replica distributed databases. The model provides a robust standard for evaluating the correctness of transaction scheduling, which can facilitate dynamic or static analysis of transaction scheduling in databases.
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006487
    Abstract:
    Image aesthetic assessment and emotional analysis aim to enable computers to identify the aesthetic and emotional responses of human beings caused by visual stimulations, respectively. Existing research usually treats them as two independent tasks. However, people’s aesthetic and emotional responses do not appear in isolation. On the contrary, from the perspective of psychological cognition, the two responses are interrelated and mutually influenced. Therefore, this study follows the idea of deep multi-task learning to deal with image aesthetic assessment and emotional analysis under a unified framework and explore their relationship. Specifically, a novel adaptive feature interaction module is proposed to correlate the backbone networks of the two tasks and achieve a unified prediction. In addition, a dynamic feature interaction mechanism is introduced to adaptively determine the degree of feature interaction between the tasks according to the feature dependencies. As the multi-task network updates structural parameters, the study, based on the inconsistency in complexity and convergence speed between the two tasks, proposes a novel gradient balancing strategy to ensure that the network parameters of each task can be smoothly learned under the unified prediction framework. Furthermore, the study constructs a large-scale unified image aesthetic and emotional dataset–UAE. According to the study, UAE is the first image collection containing both aesthetic and emotional labels. Finally, the model and codes of the proposed method as well as the UAE dataset have been released at https://github.com/zhenshen-mla/Aesthetic-Emotion-Dataset.
    Available online:  October 14, 2022 , DOI: 10.13328/j.cnki.jos.006489
    Abstract:
    Model-based reinforcement learning methods train a model to simulate the environment by using the collected samples and utilize the imaginary samples generated by the model to optimize the policy, thus they have potential to improve sample efficiency. Nevertheless, due to the shortage of training samples, the environment model is often inaccurate, and the imaginary samples generated by it would be deleterious for the training process. For this reason, a learnable weighting mechanism is proposed which can reduce the negative effect on the training process by weighting the generated samples. The effect of the imaginary samples on the training process is quantified through calculating the difference between the losses on the real samples before and after updating value and policy networks by the imaginary samples. The experimental results show that the reinforcement learning algorithm using the weighting mechanism is superior to existing model-based and model-free algorithms in multiple tasks.
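    A minimal sketch of the weighting idea described above, assuming a hypothetical agent interface with `evaluate_loss`, `save_parameters`, `load_parameters`, and `update` methods: the weight of an imaginary batch is derived from how much a virtual update on it changes the loss on real samples. The sigmoid mapping and the `beta` scale are illustrative assumptions, not the paper's exact formulation.
```python
import math

def imaginary_batch_weight(agent, real_batch, imaginary_batch, beta=1.0):
    # Loss on real data before and after a virtual update on the imaginary
    # batch; the update is then rolled back.
    loss_before = agent.evaluate_loss(real_batch)
    snapshot = agent.save_parameters()
    agent.update(imaginary_batch)
    loss_after = agent.evaluate_loss(real_batch)
    agent.load_parameters(snapshot)
    # Larger loss reduction on real samples -> weight closer to 1.
    return 1.0 / (1.0 + math.exp(-beta * (loss_before - loss_after)))
```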
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006438
    Abstract:
    Self-paced learning (SPL) is a learning regime inspired by the learning process of humans and animals that gradually incorporates samples into training set from easy to complex by assigning a weight to each training sample. SPL incorporates a self-paced regularizer into the objective function to control the learning process. At present, there are various forms of SP regularizers and different regularizers may lead to distinct learning performance. Mixture weighting regularizer has the characteristics of both hard weighting and soft weighting. Therefore, it is widely used in many SPL-based applications. However, the current mixture weighting method only considers logarithmic soft weighting, which is relatively simple. In addition, in comparison with soft weighting or hard weighting, more parameters are introduced in the mixture weighting scheme. In this study, an adaptive mixture weighting SP regularizer is proposed to overcome the above issues. On the one hand, the representation form of weights can be adjusted adaptively during the learning process; on the other hand, the SP parameters introduced by mixture weighting can be adapted according to the characteristics of sample loss distribution, so as to be fully free of the empirically adjusted parameters. The experimental results on action recognition and multimedia event detection show that the proposed method is able to adjust the weighting form and parameters adaptively.
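    For reference, one widely cited mixture weighting form from the self-paced learning literature (shown here only as an illustrative baseline of the scheme the abstract builds on, with `lam1 > lam2 > 0` as assumed pace parameters) assigns weight 1 to easy samples, 0 to hard ones, and a log-shaped soft weight in between.
```python
def mixture_weight(loss, lam1, lam2):
    # Mixture (hard + soft) self-paced weight for one sample, lam1 > lam2 > 0.
    # loss <= lam2  -> easy sample, full weight 1
    # loss >= lam1  -> hard sample, weight 0
    # otherwise     -> smooth transition between 1 and 0
    zeta = lam1 * lam2 / (lam1 - lam2)
    if loss <= lam2:
        return 1.0
    if loss >= lam1:
        return 0.0
    return zeta * (1.0 / loss - 1.0 / lam1)
```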
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006439
    Abstract:
    As the aging of the population becomes a more serious problem, more attention is paid to the safety of the elderly when they are at home alone. In order to provide early warning, alarm, and reporting of dangerous behaviors, several domestic and foreign research institutions are studying intelligent monitoring of the daily activities of the elderly from the robot's view. To promote the industrialization of these technologies, this work mainly studies how to automatically recognize the daily activities of the elderly, such as “drinking water”, “washing hands”, “reading a book”, and “reading a newspaper”. An investigation of daily activity videos of the elderly shows that the semantics of these activities are obviously fine-grained. For example, the semantics of “drinking water” and “taking medicine” are highly similar, and only a small number of video frames can accurately reflect their category semantics. To effectively address this problem of elderly behavior recognition, this work proposes a new multimodal multi-granularity graph convolutional network (MM-GCN), which applies graph convolution over four modalities, i.e., the skeleton (“point”), bone (“line”), frame (“frame”), and proposal (“segment”), to model the activities of the elderly and capture the semantics under the four granularities of “point-line-frame-proposal”. Finally, experiments are conducted to validate the activity recognition performance of the proposed method on ETRI-Activity3D (110000+ videos, 50+ classes), the largest daily activity dataset for the elderly. Compared with the state-of-the-art methods, the proposed MM-GCN achieves the highest recognition accuracy. In addition, to verify the robustness of MM-GCN on normal human action recognition tasks, experiments are also carried out on the NTU RGB+D benchmark, and the results show that MM-GCN is comparable to the SOTA methods.
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006440
    Abstract:
    Password hardening encryption (PHE) is an emerging primitive of recent years. By adding a third party that provides crypto services to the decryption process, it can resist offline guessing attacks launched by the server; it thus enhances the password authentication protocol and adds an encryption functionality. This paper presents an active attack by the server on the first scheme that introduced this primitive. The attack combines ideas from a cutting-edge threat called the algorithm substitution attack, which is undetectable and enables the server to launch an offline attack. This result shows that the original PHE scheme cannot resist attacks from a malicious server. The paper then summarizes the properties that a scheme resistant to algorithm substitution attacks should have, and presents a PHE scheme that can resist such attacks from a malicious server, together with simulation results. Finally, the paper concludes the results and gives some expectations for future systematic research on interactive protocols under algorithm substitution attacks.
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006541
    Abstract:
    Asymmetric flows generated by widely deployed address translation technology bring challenges to the design of load balancing systems. To solve the problem that software load balancers make insufficient use of multi-core processors and the hardware capabilities of network cards, an asymmetric flow load balancing method based on flow characteristics is proposed. Firstly, a packet dispatching algorithm is proposed to dispatch packets to the expected CPU core via hardware. Then, an elephant flow detection algorithm is constructed by analyzing the temporal and spatial characteristics of packet sequences. Finally, based on the detection results, a load balancing offloading method is proposed. The experimental results show that the asymmetric flow load balancing method correctly handles asymmetric flows while increasing the average throughput by 14.5%.
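    A minimal sketch of an elephant flow detector of the kind described above: per-flow bytes are counted within a simple tumbling time window and flows exceeding a byte threshold are flagged. The window length, threshold, and packet tuple layout are illustrative assumptions, not the paper's parameters.
```python
from collections import defaultdict

def detect_elephant_flows(packets, byte_threshold=1_000_000, window=1.0):
    # packets: iterable of (timestamp_seconds, flow_id, size_bytes), in time order.
    # Counters are reset whenever a window elapses (tumbling window).
    window_start = None
    counters = defaultdict(int)
    elephants = set()
    for ts, flow, size in packets:
        if window_start is None or ts - window_start > window:
            window_start, counters = ts, defaultdict(int)
        counters[flow] += size
        if counters[flow] > byte_threshold:
            elephants.add(flow)
    return elephants
```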
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006536
    Abstract:
    Cross-modal hashing can greatly improve the efficiency of cross-modal retrieval by mapping data of different modalities into more compact hash codes. Nevertheless, existing cross-modal hashing methods usually use a binary similarity matrix, which cannot accurately describe the semantic similarity relationships between samples and suffer from the squared complexity problem. In order to better mine the semantic similarity relationships of data, this study presents a label enhancement based discrete cross-modal hashing method (LEDCH). It first leverages the prior knowledge of transfer learning to generate the label distribution of samples, then constructs a stronger similarity matrix through the label distribution, and generates the hash codes by an efficient discrete optimization algorithm with a small quantization error. Finally, experimental results on two benchmark datasets validate the effectiveness of the proposed method on cross-modal retrieval tasks.
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006545
    Abstract:
    Recently, with the continuously rising realism requirements of movies, games, virtual reality applications, etc., the real-time rendering of translucent materials such as human organs and milk has become more and more important. For most current subsurface scattering calculation methods, it is difficult to estimate the scattering range correctly. To tackle this estimation issue, a new subsurface scattering calculation formula is proposed to accurately represent the maximum scattering distance. First, brute-force Monte Carlo photon tracking is simulated to obtain the reflectance profile. Second, a selected polynomial model is used to fit the reflectance profile so as to calculate the precise maximum scattering range at the shading point. Third, a new importance sampling scheme is proposed to reduce the number of Monte Carlo samples, thereby increasing the computational efficiency. In addition, the required parameters are provided only by the reflectance at the shading points and the mean free path of the material, so the rendering effect can be adjusted flexibly. Experimental results show that the proposed model avoids the previous misestimation of the scattering range and produces more accurate rendering results in complex reflectivity areas of the material, while the rendering rate meets real-time requirements.
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006432
    Abstract:
    In order to resolve the dilemma that the particle swarm optimization (PSO) algorithm cannot well balance exploration and exploitation, a density peak based multi-subpopulation particle swarm optimization algorithm with a dimensional reset strategy (DPMPSO) is proposed. In DPMPSO, the idea of relative distance originating from density peak clustering is first adopted and combined with the fitness value of particles to divide the whole swarm into two subpopulations: the top subpopulation and the bottom subpopulation. Secondly, a learning strategy focusing on local search is designed for the top subpopulation, and a learning strategy paying more attention to global search is designed for the bottom subpopulation, which can well balance exploration and exploitation. Finally, particles that fall into local optima are reset by dimension-wise crossover with the global optimum, which can not only effectively avoid premature convergence but also significantly reduce invalid iterations. The experimental results on 10 benchmark problems and the CEC2017 optimization problems demonstrate that DPMPSO performs significantly better than several representative PSO variants and other optimization algorithms.
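    The subpopulation split driven by the density-peak relative distance can be sketched as follows: each particle's delta is its distance to the nearest particle with better fitness, and fitness rank and delta rank are then combined to select the top subpopulation. The equal-weight rank combination and the split ratio below are assumptions for illustration, not the paper's exact rule.
```python
import numpy as np

def split_swarm(positions, fitness, top_ratio=0.5):
    # positions: (n, d) array of particle positions; fitness: (n,) array,
    # smaller is better (minimization).
    positions = np.asarray(positions, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    n = len(fitness)
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    delta = np.empty(n)
    for i in range(n):
        better = fitness < fitness[i]
        # relative distance: nearest better particle; the best particle gets
        # the maximum pairwise distance (density peak convention)
        delta[i] = dist[i, better].min() if better.any() else dist.max()
    fitness_rank = np.argsort(np.argsort(fitness))   # small = good fitness
    delta_rank = np.argsort(np.argsort(delta))       # large = far from better ones
    order = np.argsort(fitness_rank - delta_rank)    # assumed combination rule
    k = int(top_ratio * n)
    return order[:k], order[k:]   # indices of top and bottom subpopulations
```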
    Available online:  September 30, 2022 , DOI: 10.13328/j.cnki.jos.006433
    Abstract:
    With the free supervision signals/labels created by pretext tasks, self-supervised learning (SSL) can learn effective representations from unlabeled data, which has been verified in various downstream tasks. Existing pretext tasks usually first perform explicit linear or nonlinear transformations on the original view data to form multiple augmented views, and then learn the representation by predicting the corresponding transformations or maximizing the consistency among these views. It is found that such self-supervised augmentations (i.e., the augmentations of the data itself and the self-supervised labels) benefit not only the learning of unsupervised pretext tasks but also the supervised classification task. Nevertheless, few works focus on this at present; existing works either take the pretext tasks as auxiliaries of the downstream classification task and adopt multi-task learning, or jointly model the downstream task labels and self-supervised labels in a multi-label learning way. Actually, there are inherent differences between downstream and pretext tasks (e.g., semantics, task difficulty, etc.), which inevitably cause competition between them and bring risks to the learning of downstream tasks. To address this issue, this study proposes a simple yet effective SSL multi-view learning framework (SSL-MV), which avoids the interference of self-supervised labels with downstream labels by performing the same learning as the downstream tasks on the augmented data views. More interestingly, with multi-view learning, the proposed framework naturally owns an integrated inference ability, which significantly improves the performance of downstream supervised classification tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of SSL-MV.
    Available online:  September 23, 2022 , DOI: 10.13328/j.cnki.jos.006530
    Abstract:
    This study proposes a convolutional neural network (CNN) based Transformer to solve the panoptic segmentation task. The method draws on the inherent advantages of the CNN in image feature learning and avoids the increase in computation caused by transplanting the Transformer into vision tasks. The CNN-based Transformer is built on two basic structures: the projector, which performs the feature domain transformation, and the extractor, which is responsible for feature extraction. The effective combination of the projector and the extractor forms the framework of the CNN-based Transformer. Specifically, the projector is implemented by a lattice convolution that models the spatial relationships of the image by designing and optimizing the convolution filter configuration. The extractor is implemented by a chain network that improves feature extraction capabilities by stacking chain blocks. Considering the framework and the substantial function of panoptic segmentation, the CNN-based Transformer is successfully applied to the panoptic segmentation task. The experimental results on the MS COCO and Cityscapes datasets demonstrate that the proposed method has excellent performance.
    Available online:  September 23, 2022 , DOI: 10.13328/j.cnki.jos.006532
    Abstract:
    As an effective technique for obtaining black-box state machine models of software systems, model learning (a.k.a. automata learning) can be divided into active and passive learning. Given input and output alphabets, the minimal complete state machine of the target system can be obtained in polynomial time through active interaction with the black-box system. However, the equivalence query algorithm remains a major obstacle to the development and application of active automata learning tools. This study discusses the influence of counterexamples on learning algorithms based on the discrimination tree, defines comparison rules for hypotheses, and proposes two principles for constructing test cases. Following these principles, the Wp-method equivalence query algorithm is improved to produce better hypotheses and effectively reduce the number of queries and symbols. Based on LearnLib, three kinds of automata are used as experimental objects to verify the effectiveness of the principles and the improved algorithm.
    Available online:  September 23, 2022 , DOI: 10.13328/j.cnki.jos.006543
    Abstract:
    The security of traditional cryptographic algorithms is based on the black-box attack model, in which the attacker can only obtain the input and output of the cryptographic algorithm but not its internal details. In recent years, the concept of the white-box attack model has been proposed. In the white-box attack model, the attacker can not only obtain the input and output of the cryptographic algorithm but also directly observe or change its internal data. Redesigning existing cryptographic algorithms through white-box cryptography techniques, without changing their functions, so that they remain secure in a white-box attack environment is called the white-box implementation of existing cryptographic algorithms. Studying the design and analysis of white-box implementation schemes is of great significance for solving the issue of digital rights management. In recent years, a kind of side-channel analysis method for white-box implementation schemes has emerged. This kind of analysis method needs to know only a few internal details of a white-box implementation scheme to extract the key, and it therefore poses a practical threat to existing white-box implementation schemes. Analyzing existing white-box implementation schemes with such methods is of great practical significance for ensuring their security. A typical representative of this kind of analysis method is differential computation analysis (DCA), which is based on the principle of differential power analysis. This study analyzes the Bai-Wu white-box SM4 scheme with DCA. Based on research results on the statistical characteristics of uniformly random invertible matrices of order n over GF(2), an improved DCA (IDCA) is proposed, which significantly improves the analysis efficiency while keeping the success rate almost unchanged. The results also show that the Bai-Wu white-box SM4 scheme cannot guarantee security in the face of DCA and must therefore be further improved to meet the security requirements of practical scenarios.
    Available online:  September 20, 2022 , DOI: 10.13328/j.cnki.jos.006683
    Abstract:
    The ability to describe local geometric shapes is very important for the representation of irregular point clouds. However, existing networks still have difficulty effectively capturing accurate local shape information. This article simulates the depthwise separable convolution calculation method in point clouds and proposes a new type of convolution, namely dynamic cover convolution (DC-Conv), to aggregate local features. The core of DC-Conv is the space cover operator (SCOP), which constructs anisotropic spatial geometry in a local area to cover the local feature space and enhance the compactness of local features. DC-Conv captures local shapes by dynamically combining multiple SCOPs in the local neighborhood, where the attention coefficients of the SCOPs are adaptively learned from the point positions in a data-driven way. Experiments on the 3D point cloud shape recognition benchmark datasets ModelNet40, ModelNet10, and ScanObjectNN show that this method effectively improves the accuracy of 3D point cloud shape recognition and the robustness to sparse point clouds even in the single-scale case. Finally, sufficient ablation experiments verify the effectiveness of the method. The open source code is published at https://github.com/changshuowang/DC-CNN.
    Available online:  September 20, 2022 , DOI: 10.13328/j.cnki.jos.006664
    Abstract:
    Smart contracts running on the blockchain can hardly be modified after deployment, and their call and execution rely on a consensus procedure. Consequently, existing debugging methods that require the modification of the smart contract code or the interruption of execution cannot be directly applied to smart contracts. Since the running of a smart contract is composed of ordered execution of blockchain transactions, tracing the execution of the transactions is an effective approach to render the smart contract more debuggable. The major goal of tracing blockchain transaction execution is to unveil how a blockchain transaction produces such a result in execution. The execution of a blockchain transaction relies on the internal state of the blockchain, and this state is determined by the execution results of previous transactions, which results in transitive dependencies. Such dependencies and the characteristics of the execution environment the blockchain provides bring challenges to tracing. The tracing of blockchain transaction execution is mainly faced with three challenges: how to obtain enough information for tracing from the production environment in which the smart contract is deployed, how to obtain the dependencies among the blockchain transactions, and how to ensure the consistency between the result of tracing and the real execution online. This study proposes a tracing method for blockchain transaction execution based on recording and replay. By building a recording and replay mechanism in the contract container, the proposed method enables the recording of state reading and writing operations during transaction execution without modifying the contract code and interrupting the running of the smart contract. A transaction dependency analysis method based on state reading and writing is proposed to support the retracing of previous transactions linked by dependencies on demand. Moreover, a verification mechanism for reading and writing operation recording is designed to ensure that the consistency between the replaying execution and the real online execution can be verified. The tracing method can trace the execution of the blockchain transaction that calls the smart contract, which can be used in debugging of smart contracts. When loss is caused by the failure of smart contracts, the tracing result can be used as evidence. Experiments are conducted for a performance comparison between storing recorded reading and writing operations on chain and off chain. The advantages and effectiveness of the proposed method in tracing blockchain transaction execution are revealed by a case study.
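    The read-after-write dependency analysis described above can be sketched as follows, assuming each recorded transaction exposes the sets of state keys it read and wrote (the field names are hypothetical): a later transaction depends on the most recent earlier transaction that wrote a key it reads.
```python
def build_dependencies(transactions):
    # transactions: ordered list of dicts {"id": ..., "reads": set(), "writes": set()}
    # Returns, for each transaction id, the set of earlier transaction ids it
    # depends on via read-after-write relations on state keys.
    last_writer = {}
    deps = {tx["id"]: set() for tx in transactions}
    for tx in transactions:
        for key in tx["reads"]:
            if key in last_writer:
                deps[tx["id"]].add(last_writer[key])
        for key in tx["writes"]:
            last_writer[key] = tx["id"]
    return deps
```
    Following these dependency edges backwards yields the minimal set of earlier transactions that must be replayed before retracing a given transaction.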
    Available online:  September 20, 2022 , DOI: 10.13328/j.cnki.jos.006658
    [Abstract] (903) [HTML] (0) [PDF 3.35 M] (1471)
    Abstract:
    As data silos emerge and importance is attached to personal privacy protection, the application modes of centralized learning are restricted, whereas federated learning has attracted great attention since it appeared owing to the fact that it, as a distributed machine learning framework, can accomplish model training without leaking users’ data. As federated learning is increasingly widely applied, its security and privacy protection capability have also begun to be questioned. This study offers a systematic summary and analysis of the research achievements domestic and foreign researchers have made in recent years in the security and privacy of federated learning models. Specifically, this study outlines the background of federated learning, clarifies its definition and workflow, and analyzes its vulnerabilities. Then, the security threats and privacy risks against federated learning are systematically analyzed and compared respectively, and the existing defense methods are summarized. Finally, the prospects of this research area and the challenges ahead are presented.
    Available online:  September 20, 2022 , DOI: 10.13328/j.cnki.jos.006531
    Abstract:
    The chosen-ciphertext attack (CCA) security model can effectively capture active attacks in the real world. Existing CCA-secure cryptosystems are mainly designed abroad, and China lacks its own CCA-secure cryptosystems. Although there are general transformation approaches to achieving CCA security, they increase both computational overhead and communication overhead. Based on the SM9 encryption algorithm, this study proposes an identity-based broadcast encryption scheme with CCA security. The design is derived from SM9, and the sizes of the private key and the ciphertext are constant and independent of the number of receivers chosen in the data encryption phase. Specifically, the private key consists of one element, and the ciphertext is composed of three elements. If the GDDHE assumption holds, the proposed scheme is proved to have selective CCA security under the random oracle model. In order to achieve CCA security, a dummy identity is introduced in the encryption algorithm, and this identity can be used to answer decryption queries successfully. Analysis shows that the proposed scheme is comparable to existing efficient identity-based broadcast encryption schemes in terms of computational efficiency and storage efficiency.
    Available online:  September 16, 2022 , DOI: 10.13328/j.cnki.jos.006422
    Abstract:
    Compared with traditional software, deep learning software has a different structure. Even if a large amount of test data is used to test the deep learning software, the adequacy of testing is still hard to evaluate, and many unknown defects may remain. The deep forest is an emerging deep learning model that overcomes several shortcomings of deep neural networks, which require a lot of training data, high-performance computing platforms, and many hyperparameters. However, there has been no research on testing deep forests. Based on the structural characteristics of deep forests, this study proposes a set of testing coverage criteria, including random forest node coverage (RFNC), random forest leaf coverage (RFLC), cascade forest class coverage (CFCC), and cascade forest output coverage (CFOC). Moreover, DeepRanger, a coverage-oriented test data generation method based on a genetic algorithm, is proposed to automatically generate new test data and effectively improve the coverage of the model by the test data. Experiments are carried out on the MNIST dataset and gcForest, an open-source deep forest project. The experimental results show that the four proposed coverage criteria can effectively evaluate the adequacy of the test data set for the deep forest model. In addition, compared with a genetic algorithm based on random selection, DeepRanger, which is guided by coverage information, achieves higher testing coverage of the deep forest model under test.
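    A minimal sketch of a node-coverage style criterion such as RFNC, assuming a hypothetical forest interface in which each tree exposes a `node_count` attribute and a `decision_path(x)` iterator over the nodes visited by input `x`; the exact definitions used in the paper may differ.
```python
def random_forest_node_coverage(forest, test_data):
    # Fraction of decision-tree nodes in the forest visited by at least one test
    # input -- an analogue of neuron coverage for forest-based models.
    visited = set()
    total = 0
    for tree_id, tree in enumerate(forest):
        total += tree.node_count                    # assumed attribute
        for x in test_data:
            for node in tree.decision_path(x):      # assumed: nodes on x's path
                visited.add((tree_id, node))
    return len(visited) / total if total else 0.0
```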
    Available online:  September 16, 2022 , DOI: 10.13328/j.cnki.jos.006428
    Abstract:
    Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers’ emotions in commit messages and identifying developers’ opinions on Q&A forums. Nevertheless, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results in SE tasks, and misunderstanding of technical knowledge has been shown to be the main reason. Researchers have therefore started to customize SE-specific methods in supervised or distantly supervised ways. To assess the performance of these methods, researchers use SE-related annotated datasets to evaluate them in a within-dataset setting, that is, they train and test each method using data from the same dataset. However, an annotated dataset for a given SE-specific sentiment analysis task is not always available, and building a manually annotated dataset is time-consuming and not always feasible. An alternative is to use datasets extracted from the same platform for similar tasks or datasets extracted from other SE platforms. To verify the feasibility of these practices, existing methods need to be evaluated in within-platform and cross-platform settings, which refer to training and testing each method using data from the same platform but not the same dataset, and training and testing each classifier using data from different platforms, respectively. This study comprehensively evaluates existing SE-customized sentiment analysis methods in within-dataset, within-platform, and cross-platform settings. The experimental results provide actionable insights for both researchers and practitioners.
    Available online:  September 16, 2022 , DOI: 10.13328/j.cnki.jos.006396
    Abstract:
    A verifiable timed signature (VTS) scheme allows one to time-lock a signature on a known message for a given amount of time T such that, after performing a sequential computation for time T, anyone can extract the signature from the time-lock. Verifiability ensures that anyone can publicly check whether a time-lock contains a valid signature on the message without solving it first, and that the signature can indeed be obtained by solving it for time T. This study first proposes the notion of verifiable attribute-based timed signatures (VABTS) and then gives an instantiated VABTS scheme. The instantiated scheme can simultaneously support identity privacy preservation, dynamic user revocation, traceability, and timing, and it also solves the key escrow problem of attribute-based schemes. In addition, VABTS has many applications. This study lists two application scenarios: building a privacy-preserving payment channel network for permissioned blockchains and realizing fair privacy-preserving multi-party computation. Finally, the instantiated VABTS scheme is proved to be secure and efficient via formal security analysis and performance evaluation.
    Available online:  September 09, 2022 , DOI: 10.13328/j.cnki.jos.006525
    Abstract:
    With machine learning widely applied in the natural language processing (NLP) domain in recent years, the security of NLP tasks has naturally received growing concern. Existing studies have found that small modifications to examples may lead to wrong machine learning predictions, which is known as an adversarial attack. Textual adversarial attacks can effectively reveal the vulnerability of NLP models and thus help improve them. Nevertheless, existing textual adversarial attack methods all focus on designing complex adversarial example generation strategies, which bring only limited improvements to the success rate, and their highly invasive modifications degrade textual quality. Thus, a simple and effective method with high adversarial example quality is in demand. To solve this problem, the sememe-level sentence dilution algorithm (SSDA) and the dilution pool construction algorithm (DPCA) are proposed from the new perspective of improving the adversarial attack process. SSDA is a new process that can be freely embedded into the classical adversarial attack workflow. It first uses dilution pools constructed by DPCA to dilute the original examples and then generates adversarial examples from the diluted examples. It can not only improve the success rate of any adversarial attack method without restrictions on datasets or victim models but also obtain higher adversarial example quality than the original method. Experiments with different datasets, dilution pools, victim models, and textual adversarial attack methods verify the improvement brought by SSDA in the success rate and show that dilution pools constructed by DPCA can further enhance the dilution ability of SSDA. The experimental results demonstrate that SSDA reveals more vulnerabilities of models than classical methods, and that DPCA helps SSDA improve the success rate while achieving higher adversarial example quality.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006486
    Abstract:
    Network representation learning is regarded as a key technology for improving the efficiency of information network analysis. It maps network nodes to low-dimensional vectors in a latent space and efficiently preserves the structure and characteristics of the original network in these vectors. In recent years, many studies have explored network topology and node features intensively, and the application has borne fruit in many network analysis tasks. In fact, besides these two kinds of key information, the accompanying information that widely exists in networks reflects various complex relationships and plays an important role in a network's construction and evolution. In order to improve the efficiency of network representation learning, a novel model integrating the accompanying information, named NRLIAI, is proposed. The model employs variational auto-encoders (VAE) to propagate and process information: it aggregates and maps the network topology and node features with graph convolutional operators in the encoder, reconstructs the network in the decoder, and integrates the accompanying information to guide the network representation learning. The proposed model solves the problem that existing methods fail to utilize the accompanying information effectively. At the same time, the model possesses a generative ability, which enables it to reduce overfitting in the learning process. On several real-world network datasets, this study conducts extensive experiments comparing NRLIAI with existing methods through node classification and link prediction tasks, and the experimental results prove the feasibility of the proposed model.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006692
    [Abstract] (1318) [HTML] (0) [PDF 11.12 M] (1102)
    Abstract:
    With the rapid development of the Internet and the penetration of big data mining and applications into all walks of life, how to share and use massive data securely and efficiently has become a new hot research issue. Secure multi-party computation is one of the key technologies to solve this problem. It allows a group of participants to interact, compute a function together and get the output without revealing private inputs. Oblivious transfer is a privacy-protected two-party communication protocol in which a sender holds two messages to be sent, and a receiver selects one to receive, but after that, the sender knows nothing about which message the receiver gets, and the receiver cannot get any information about the unselected message. Oblivious transfer has become one of the key modules of secure multi-party computation, and its efficiency optimization can effectively promote the application of secure multi-party computation, especially for special two-party secure computation protocols such as private set intersection. This paper summarizes the classification of oblivious transfer and several common variants, and respectively describes the construction and research progress of the oblivious transfer protocol based on public key cryptography and oblivious transfer extension, which leads to the importance of the efficiency optimization research of oblivious transfer. At the same time, this paper comprehensively reviews the research progress of efficiency optimization of oblivious transfer and oblivious transfer extension from the perspectives of semi-honest adversary and malicious adversary. On the other hand, in practical application, this paper systematically summarizes the optimization technologies used in the engineering implementation of oblivious transfer and oblivious transfer extension protocols. Finally, this paper points out the main problems and future works of oblivious transfer and oblivious transfer extension protocols.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006697
    Abstract:
    App reviews are considered as a communication channel between users and developers to perceive user satisfaction. Users usually describe buggy features (i.e., User Actions) and App abnormal behaviors (i.e., Abnormal Behaviors) in forms of key phrases (e.g., "send a video" and "crash"), which could be buried with other trivial information (e.g., complaints) in the review texts. A fine-grained view about this information could facilitate the developers' understanding of feature requests or bug reports from users, and improve the quality of Apps. Existing pattern-based approaches to extract target phrases can only summarize the high-level topics/aspects of reviews, and suffer from low performance due to insufficient semantic understanding of reviews. This paper proposes a semantic-aware and fine-grained App Review bug mining approach (Arab) to extract User Actions and Abnormal Behaviors, and mine the correlations between them. We design a novel neural network model for extracting fine-grained target phrases, which combines textual descriptions and review attributes to better represent the semantics of reviews. Arab also clusters the extracted phrases based on their semantic relations and provides a visualization of the correlations between User Actions and Abnormal Behaviors. We evaluate Arab on 3,426 reviews from six Apps, and the results confirm the effectiveness of Arab in phrase extraction. We further conduct a case study with Arab on 301,415 reviews of 15 popular Apps to explore its potential application and examine its usefulness on large-scale data.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006700
    [Abstract] (853) [HTML] (0) [PDF 2.79 M] (1283)
    Abstract:
    Inspired by the human visual attention mechanism, salient object detection (SOD) aims to detect the most attractive and interesting object or region in a given scene. In recent years, with the development and popularization of depth cameras, depth maps have been successfully applied to various computer vision tasks and have also provided new ideas for the salient object detection task. The introduction of the depth map not only enables the computer to simulate the human visual system more comprehensively but also provides new solutions for difficult scenes, such as low contrast and complex backgrounds, by utilizing the structure and location information of the depth map. In view of the rapid development of the RGB-D SOD task in the era of deep learning, this paper sorts out and summarizes the existing related research from the perspective of key scientific problems and their solutions, and conducts quantitative analysis and qualitative comparison of different methods on the commonly used RGB-D SOD datasets. Finally, the challenges and prospects for future development trends are summarized.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006702
    Abstract:
    It is difficult to solve many-objective optimization problems (MaOPs) effectively by using traditional multi-objective evolutionary algorithms (MOEAs) based on the Pareto dominance relation. This study first proposes a dominance relation that combines the double distances of the PBI utility function without introducing extra parameters. Secondly, a diversity maintenance method based on the double distances is defined, which not only considers the double distances of an individual but also adaptively adjusts the weight of diversity according to the number of objectives of the MaOP, so as to better balance the convergence and diversity of the solution set in the many-objective space. Finally, the proposed dominance relation and diversity maintenance method are embedded into the framework of NSGA-II, and a many-objective evolutionary algorithm based on double distances (MaOEA/d2) is designed. MaOEA/d2 is compared with five other representative many-objective evolutionary algorithms on the DTLZ and WFG benchmark functions with 5, 10, 15, and 20 objectives in terms of the IGD and HV indicators. The empirical results show that MaOEA/d2 can obtain better convergence and diversity. Therefore, the proposed MaOEA/d2 is a promising many-objective evolutionary algorithm.
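    The two PBI distances underlying the proposed dominance relation can be written down directly. The sketch below computes them and shows one plausible way to compare two solutions with respect to a reference vector; the comparison rule itself is an illustrative assumption, not the paper's exact definition.
```python
import numpy as np

def pbi_distances(f, w):
    # f: objective vector (assumed translated by the ideal point); w: weight vector.
    # d1 is the projection length along w (convergence), d2 the perpendicular
    # distance to the line spanned by w (diversity).
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    f = np.asarray(f, dtype=float)
    d1 = float(np.dot(f, w))
    d2 = float(np.linalg.norm(f - d1 * w))
    return d1, d2

def double_distance_dominates(fa, fb, w):
    # Assumed comparison rule for illustration: a dominates b with respect to w
    # if it is no worse in both PBI distances and strictly better in at least one.
    a1, a2 = pbi_distances(fa, w)
    b1, b2 = pbi_distances(fb, w)
    return a1 <= b1 and a2 <= b2 and (a1 < b1 or a2 < b2)
```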
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006703
    Abstract:
    Multi-label learning is a very important machine learning paradigm. Traditional multi-label learning methods are designed in a supervised or semi-supervised manner and generally require accurate labeling of all or part of the data into multiple categories. In many practical applications, it is difficult to obtain label information with a large number of labels, which greatly restricts the promotion and application of multi-label learning. In contrast, label correlation, as a common form of weak supervision, has lower requirements on labeling information. How to use label correlation for multi-label learning is an important but unstudied problem. This paper proposes a method named weakly supervised multi-label learning using prior label correlation information (WSMLLC). The model restates the sample similarity using label correlation, obtains the label indicator matrix effectively, constrains the projection matrix of the data using prior information, and modifies the indicator matrix by introducing regression terms. Compared with existing methods, the outstanding advantage of WSMLLC is that it can realize the label assignment of multi-label samples with only label correlation priors. Experimental results show that WSMLLC has obvious advantages over current advanced multi-label learning methods in the case of complete loss of the label matrix.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006725
    Abstract:
    Participating media are ubiquitous in nature and are also major elements in many rendering applications such as special effects, digital games, and simulation systems. Physically based simulation and reproduction of their appearance can significantly boost the realism and immersion of 3D virtual scenes. However, both the underlying structures of participating media and the light propagation in them are very complex. Therefore, rendering with participating media is a difficult task and remains a hot topic in computer graphics. In order to simplify the treatment in rendering and lower the computational cost, classical methods for participating media rendering are based on two assumptions: independent scattering and local continuity. These two assumptions are also the building blocks of the classical radiative transfer equation (RTE). However, most participating media in nature do not satisfy these two assumptions, which results in a noticeable discrepancy between rendered images and real images. In recent years, these two assumptions have been relaxed by incorporating more physically accurate models of participating media, thus significantly improving the physical realism of participating media rendering. This survey analyzes and discusses existing non-classical participating media rendering techniques from two aspects: correlated media rendering and discrete media rendering. It focuses on the differences between classical and non-classical participating media rendering and describes the principles, advantages, and limitations of each method. Finally, it provides some future directions around non-classical participating media rendering that are worth delving into. We hope this survey can inspire researchers to study non-classical participating media rendering by addressing some critical issues, and that it can serve as a guide for engineers in industry to improve their renderers by considering non-classical participating media rendering.
    Available online:  July 22, 2022 , DOI: 10.13328/j.cnki.jos.006732
    Abstract:
    With the increasing trend of data scale expansion and structural diversification, how to use the heterogeneous co-processors in modern servers to provide a real-time and reliable parallel runtime environment for large-scale data processing has become a research hotspot in high-performance computing and databases. Modern servers equipped with multiple co-processors (GPUs) have become the preferred high-performance platform for analyzing large-scale and irregular graph data. The overall performance of existing graph computing systems and algorithms designed for multi-GPU server architectures (such as breadth-first traversal and shortest path algorithms) is already significantly better than that of multi-core CPU environments. However, data transmission between the GPUs in existing graph computing systems is limited by PCI-E bandwidth and latency, so performance does not grow linearly with the number of GPU devices, and serious delay jitter can even occur, which cannot satisfy the high scalability requirements of large-scale graph parallel computing systems. A series of benchmark experiments reveals that existing systems have the following two types of defects. 1) The hardware architecture of data links between modern GPU devices is being updated rapidly (e.g., NVLink-V1 and NVLink-V2), and link bandwidth and latency have been greatly improved; however, existing systems are still limited to PCI-E for data communication and cannot make full use of modern GPU link resources (including link topology, connectivity, and routing). 2) When dealing with irregular graph data, such systems often adopt a single data movement strategy between devices, which brings a lot of unnecessary data synchronization overhead between GPU devices via the PCI-E bus and results in excessive waiting overhead in local computing. Therefore, it is urgent to make full use of the various communication links between modern GPUs to design a highly scalable graph computing system. In order to achieve high scalability of multi-GPU graph computing, a fine-grained communication mechanism based on hybrid awareness is proposed. It perceives the architecture links in advance, uses modular data links and communication strategies for differently structured graph data, and finally selects the optimal data exchange method for large-scale graph data (structural data and application data). Based on the above optimization strategies, this paper proposes and designs a multi-GPU oriented graph parallel computing system named ChattyGraph. By optimizing data buffering and multi-GPU collaborative computing with OpenMP and NCCL, ChattyGraph can adaptively and efficiently support various graph parallel computing applications and algorithms on multi-GPU HPC platforms. Experiments with several real-world graphs on an 8-GPU NVIDIA DGX server show that ChattyGraph significantly improves graph computing efficiency and scalability and outperforms other advanced competitors, including WS-VR and Groute: the average computing efficiency is increased by 1.2-1.5X and the average speedup by 2-3X.
    Available online:  July 15, 2022 , DOI: 10.13328/j.cnki.jos.006412
    Abstract:
    Question matching is an important task in question answering systems. Current methods usually use neural networks to model the semantic matching degree of two sentences. However, in the field of law, questions often have sparse textual representations and professional legal terms, and the sentences contain insufficient legal knowledge, so general-domain deep learning text matching models are not effective in the legal question matching task. In order to make the model better understand the meaning of legal questions and model the knowledge of the legal field, this study first constructs a knowledge base of the legal field and then proposes a question matching model integrating legal knowledge (such as legal terms and statutes). Specifically, a legal dictionary covering five categories of legal disputes, including contract dispute, divorce, traffic accident, labor injury, and debt and creditor's rights, is constructed, and relevant legal articles are collected to build the legal knowledge base. In question matching, the legal terms and statutes corresponding to a question pair are first retrieved from the legal knowledge base, and the relationship among the question, legal terms, and statutes is then modeled simultaneously through a cross attention model to achieve more accurate question matching. Finally, experiments are carried out on multiple legal categories, and the results show that the proposed method can effectively improve the performance of question matching.
    Available online:  July 15, 2022 , DOI: 10.13328/j.cnki.jos.006413
    Abstract:
    Speech translation aims to translate the speech in one language into the speech or text in another language. Compared with the pipeline system, the end-to-end speech translation model has the advantages of low latency, less error propagation, and small storage, so it has attracted much attention. However, the end-to-end model not only requires to process the long speech sequence and extract the acoustic information, but also needs to learn the alignment relationship between the source speech and the target text, leading to modeling difficulty with poor performance. This study proposes an end-to-end speech translation model with cross-modal information fusion, which deeply combines text-based machine translation model with speech translation model. For the length inconsistency between the speech and the text, a redundancy filter is proposed to remove the redundant acoustic information, making the length of filtered acoustic representation consistent with the corresponding text. For learning the alignment relationship, the parameter sharing method is applied to embed the whole machine translation model into the speech translation model with multi-task training. Experimental results on public speech translation data sets show that the proposed method can significantly improve the model performance.
    Available online:  July 15, 2022 , DOI: 10.13328/j.cnki.jos.006414
    Abstract:
    In recent years, the prevalent research on big-data processing often deals with increased data scale and high data complexity. The frequent use of high-dimensional data poses challenges in applications, such as efficient query and fast access to the database in the system. Hence, it is critical to design an effective high-dimensional index to increase query throughput and decrease the memory footprint. Kraska et al. proposed the learned index, which has proved superior on real-world low-dimensional datasets. With the wide adoption of machine learning and deep learning in database management systems, more and more researchers aim to build learned indexes on high-dimensional datasets so as to improve query efficiency. However, current solutions fail to effectively utilize the distribution information of data and sometimes incur high overhead in the initialization of complex deep learning models. In this work, an improved high-dimensional learned index (IHDL index) is proposed based on the division of the data space and dimension reduction. Specifically, the index utilizes multiple linear models on the dataset and decreases the initialization overhead while maintaining high query accuracy. Experiments on a synthetic dataset and the OSM dataset verify its superiority in terms of initialization overhead, query throughput, and memory footprint.
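    The sketch below illustrates the general technique of combining dimension reduction with multiple linear models for a learned index: points are projected onto a 1-D key, and each segment of the sorted key range gets its own linear model mapping keys to positions. This is not the IHDL implementation; all names and design choices here are assumptions.

```python
import numpy as np

class TinyLearnedIndex:
    def __init__(self, points, n_segments=16):
        self.points = np.asarray(points, dtype=float)
        self.mean = self.points.mean(axis=0)
        # Reduce each point to a 1-D sort key via its first principal component.
        _, _, vt = np.linalg.svd(self.points - self.mean, full_matrices=False)
        self.direction = vt[0]
        keys = (self.points - self.mean) @ self.direction
        self.order = np.argsort(keys)
        self.sorted_keys = keys[self.order]
        # One linear model (key -> position) per segment of the sorted key range.
        self.segments = []
        for seg in np.array_split(np.arange(len(self.sorted_keys)), n_segments):
            slope, intercept = np.polyfit(self.sorted_keys[seg], seg.astype(float), 1)
            self.segments.append((self.sorted_keys[seg[0]], slope, intercept))

    def predict_position(self, point):
        key = (np.asarray(point, dtype=float) - self.mean) @ self.direction
        # Use the last segment whose starting key does not exceed the query key.
        idx = 0
        for i, (start, _, _) in enumerate(self.segments):
            if key >= start:
                idx = i
        _, slope, intercept = self.segments[idx]
        return int(np.clip(slope * key + intercept, 0, len(self.sorted_keys) - 1))

rng = np.random.default_rng(0)
data = rng.random((1000, 8))
index = TinyLearnedIndex(data)
print(index.predict_position(data[42]))  # approximate rank of the point along the key
```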
    Available online:  July 07, 2022 , DOI: 10.13328/j.cnki.jos.006403
    Abstract:
    Deep neural networks have been widely used in fields such as autonomous driving and smart healthcare. Like traditional software, deep neural networks inevitably contain defects, and wrong decisions may cause serious consequences. Therefore, the quality assurance of deep neural networks has received extensive attention. However, deep neural networks are quite different from traditional software: traditional software quality assurance methods cannot be directly applied to deep neural networks, and targeted quality assurance methods need to be designed. Software fault localization is one of the important methods to ensure software quality. The spectrum-based fault localization method has achieved good results for traditional software but cannot be directly applied to deep neural networks. In this study, based on traditional software fault localization methods, a spectrum-based fault localization approach named Deep-SBFL for deep neural networks is proposed. The approach first collects the neuron output information and the prediction results of the deep neural network as the spectrum. The spectrum is then further calculated as contribution information, which can be used to quantify the contribution of neurons to the prediction results. Finally, a suspiciousness formula for defect localization in deep neural networks is proposed. Based on the contribution information, the suspiciousness scores of neurons in the deep neural network are calculated and ranked to find the most likely defective neurons. To verify the effectiveness of the method, EInspect@n (the number of defects successfully located by inspecting the first n positions of the sorted list) and EXAM (the percentage of elements that must be checked before finding the defective elements) were evaluated on a deep neural network trained on the MNIST dataset. Experimental results show that this approach can effectively locate different types of defects in deep neural networks.
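    As a flavor of spectrum-based suspiciousness ranking over neurons, the sketch below uses the common Ochiai formula on per-neuron activation counters. Deep-SBFL's actual suspiciousness formula is defined in the paper and may differ; the counters, names, and numbers here are made up for illustration.

```python
import math

def ochiai(activated_fail, activated_pass, total_fail):
    """activated_fail/pass: times the neuron was 'active' on failing/passing inputs."""
    denom = math.sqrt(total_fail * (activated_fail + activated_pass))
    return activated_fail / denom if denom else 0.0

# spectrum[neuron] = (activated on failing inputs, activated on passing inputs)
spectrum = {"n0": (40, 5), "n1": (12, 30), "n2": (3, 50)}
total_fail = 50

ranking = sorted(spectrum, key=lambda n: ochiai(*spectrum[n], total_fail), reverse=True)
print(ranking)   # neurons ordered from most to least suspicious
```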
    Available online:  July 07, 2022 , DOI: 10.13328/j.cnki.jos.006404
    Abstract:
    Improving the efficiency of frequent itemset mining over big data is a hot research topic at present. With the continuous growth of data volume, the computing cost of traditional frequent itemset generation algorithms remains high. Therefore, this study proposes a fast frequent itemset mining algorithm based on Spark (Fmafibs for short). Taking advantage of bitwise operations, a novel pattern growth strategy is designed. Firstly, the algorithm converts each itemset into a bit string and exploits bitwise operations to generate candidate itemsets. Secondly, to improve the processing efficiency of long bit strings, a vertical grouping strategy is designed: candidate itemsets are obtained by joining the frequent itemsets of different groups within the same transaction and then aggregating and filtering them to get the final frequent itemsets. Fmafibs is implemented in the Spark environment. The experimental results on benchmark datasets show that the proposed method is correct and can significantly improve mining efficiency.
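    The core bit-string idea reads naturally in a few lines of Python: each item is mapped to a bit string over transactions, and the bitwise AND of two bit strings is the support set of the joined candidate. The toy data and driver code below are illustrative only; Fmafibs itself runs distributed on Spark.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
min_support = 2

# Build one bit string (as a Python int) per item: bit i set => item in transaction i.
bits = {}
for i, t in enumerate(transactions):
    for item in t:
        bits[item] = bits.get(item, 0) | (1 << i)

frequent = {frozenset([x]): b for x, b in bits.items() if bin(b).count("1") >= min_support}

# Join frequent 1-itemsets pairwise; the AND of their bit strings is the
# support set of the candidate 2-itemset.
for (x, bx), (y, by) in combinations(frequent.items(), 2):
    support = bin(bx & by).count("1")
    if support >= min_support:
        print(set(x | y), support)     # e.g. {'a', 'c'} 2
```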
    Available online:  July 07, 2022 , DOI: 10.13328/j.cnki.jos.006405
    Abstract:
    Sparsity has always been a primary challenge for recommendation systems. Information fusion recommendation can alleviate this problem by exploiting user preferences through comments, ratings, and trust information, so as to generate corresponding recommendations for target users. Fully learning user and item information is the key to building a successful recommendation system. Different users have different preferences for various items, and users' interest preferences and social circles change dynamically. A recommendation method combining deep learning and information fusion is proposed to solve the sparsity problem. In particular, a new deep learning model, the information fusion recommendation model combining attention CNN and GNN (ACGIF for short), is constructed. First, an attention mechanism is added to the CNN to process the comment information and learn personalized representations of users and items from it: the model learns the comment representation based on comment coding, learns the user/item representation in the comments through user/item coding, and adds a personalized attention mechanism to filter comments with different levels of importance. Then, the rating and trust information is processed through the GNN. For each user, the diffusion process begins with the initial embedding, combining the relevant features and the free user latent vectors that capture latent behavioral preferences. A layered influence propagation structure is designed to simulate how the user's latent embedding evolves as the social diffusion process continues. Finally, the preference vectors of the user for the item obtained from the first two parts are weighted and fused to obtain the final preference vector of the user for the item. The MAE and RMSE of the recommendation results are employed as the experimental evaluation indicators on four public datasets. The experimental results show that the proposed model achieves a better recommendation effect and running time compared with seven existing typical recommendation models.
    Available online:  July 07, 2022 , DOI: 10.13328/j.cnki.jos.006406
    Abstract:
    Generating coherent topic descriptions from the user comments on case-related topics plays a significant role in quickly understanding case-related news, and it can be regarded as a multi-document summarization task based on user comments. However, these comments contain lots of noise, the crucial information for generating summaries is scattered across different comments, and sequence-to-sequence models tend to generate irrelevant and incorrect summaries. Based on these observations, this paper presents a case-related topic summarization method based on a topic interaction graph, which reconstructs the user comments into a topic interaction graph. The motivation is that the graph can express the correlation between different user comments, which is useful for filtering the key information in user comments. Specifically, the case elements are first extracted from the user comments, and then the topic interaction graph is constructed, which takes the case elements as nodes and the sentences containing these case elements as the nodes' contents; then the graph transformer network is introduced to produce the representation of the graph. Finally, the summary is generated by a standard Transformer-based decoder. The experimental results on the collected case-related topic summarization corpus show that the proposed method effectively selects useful content and can generate coherent and factual topic summaries.
    Available online:  July 07, 2022 , DOI: 10.13328/j.cnki.jos.006408
    Abstract:
    The identification of opinion targets in microblogs is the basis of analyzing network public opinion involved in cases. At present, identification methods based on topic representation need to preset a fixed number of topics, and the final results rely on manual inference. In order to solve these problems, this study proposes a weakly supervised method that uses only a small number of labelled comments to automatically identify the opinion targets in microblogs. The specific implementation is as follows. Firstly, the comments are encoded and reconstructed twice based on a variational dual topic representation network to obtain rich topic features. Secondly, a small number of labelled comments are used to guide the topic representation network to automatically identify the opinion targets. Finally, the reconstruction loss of the dual topic representation and the classification loss of opinion target identification are optimized jointly by a joint training strategy to classify comments by opinion target automatically and mine target terms. Experiments are carried out on two datasets of case-related microblogs. The results show that the proposed model outperforms several baseline models in the classification of opinion targets, topic coherence, and diversity of target terms.
    Available online:  June 15, 2022 , DOI: 10.13328/j.cnki.jos.006399
    Abstract:
    Dense depth maps are essential in areas such as autonomous driving and robotics, but today's depth sensors can only produce sparse depth measurements, so it is necessary to complete them. Among all auxiliary modalities, RGB images are commonly used and easily obtained. Many current methods use RGB and sparse depth information in depth completion. However, most of them simply use channel concatenation or element-wise addition to fuse the information of the two modalities, without considering the confidence of each modality in different scenarios. This paper proposes a dynamic gated fusion module, which is guided by the sparse distribution of the input depth and the features of both the RGB and sparse depth branches, and thus fuses the two modal features more efficiently by generating dynamic weights. An efficient feature extraction structure is also designed according to the data characteristics of the different modalities. Comprehensive experiments show the effectiveness of each module. The proposed network uses a lightweight model to achieve state-of-the-art results on two challenging public datasets, KITTI depth completion and NYU depth v2, which shows that the method strikes a good balance between performance and speed.
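    A gated fusion of this kind can be sketched in a few lines of PyTorch: a learned per-pixel weight decides how much each modality contributes. The layer sizes and the way the validity mask enters the gate below are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # The gate sees both feature maps plus the sparsity mask of the input depth.
        self.gate = nn.Conv2d(channels * 2 + 1, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, depth_feat, valid_mask):
        w = torch.sigmoid(self.gate(torch.cat([rgb_feat, depth_feat, valid_mask], dim=1)))
        # Per-pixel, per-channel convex combination of the two modalities.
        return w * depth_feat + (1.0 - w) * rgb_feat

fusion = GatedFusion(channels=32)
rgb = torch.randn(1, 32, 64, 64)
depth = torch.randn(1, 32, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.95).float()   # sparse depth validity mask
print(fusion(rgb, depth, mask).shape)               # torch.Size([1, 32, 64, 64])
```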
    Available online:  June 15, 2022 , DOI: 10.13328/j.cnki.jos.006400
    Abstract:
    High-dimensional data are widely used in the real world. However, high-dimensional data usually contain plenty of redundant and noisy information, which accounts for the poor performance of many traditional clustering algorithms on such data. In practice, it is found that the cluster structure of high-dimensional data is often embedded in a lower-dimensional subspace. Therefore, dimension reduction has become a key technology for mining high-dimensional data, and among the many dimension reduction methods, graph-based methods have become a research hotspot. However, most graph-based dimension reduction algorithms suffer from the following two problems: (1) most of them need to calculate or learn an adjacency graph, which has high computational complexity; (2) the purpose of dimension reduction is not considered in the reduction process. To address these problems, a fast unsupervised dimension reduction algorithm based on maximum entropy (MEDR) is proposed, which combines linear projection and the maximum entropy clustering model to find the potentially optimal cluster structure of high-dimensional data embedded in a low-dimensional subspace through an effective iterative optimization algorithm. The MEDR algorithm does not need an adjacency graph as input in advance and has time complexity linear in the input data scale. A large number of experimental results on real datasets show that, compared with traditional dimension reduction methods, MEDR can find a better projection matrix for projecting high-dimensional data into the low-dimensional subspace, so that the projected data are conducive to clustering analysis.
    Available online:  June 15, 2022 , DOI: 10.13328/j.cnki.jos.006401
    Abstract:
    In order to reduce the labor cost in the process of bug localization, researchers have proposed various automated information retrieval-based bug localization (IRBL) models, including models leveraging traditional features and deep learning-based features. When evaluating the effectiveness of IRBL models, most existing studies neglect the following problems: the software version mismatch between bug reports and the corresponding source code files in the testing data, and/or the data leakage caused by the chronological order of bug reports when training and testing the models. This study investigates the performance of existing models in realistic experimental settings and analyzes the impact of version mismatch and data leakage on the real performance of each model. First, six traditional information retrieval-based models (Buglocator, BTRracer, BLUiR, AmaLgam, BLIA, and Locus) and one novel deep learning model (CodeBERT) are selected as the research objects. Then, an empirical analysis is conducted on eight open-source projects under five different experimental settings. The experimental results demonstrate that the effectiveness of directly applying CodeBERT to bug localization is not as good as expected, since its accuracy depends on the version and source code size of the test project. Second, the results also show that, compared with the traditional version-mismatch experimental setting, the traditional information retrieval-based models under the version-matching setting achieve improvements of up to 47.2% and 46.0% in terms of MAP and MRR, respectively. Meanwhile, the effectiveness of the CodeBERT model is also affected by both data leakage and version mismatch. This means that the effectiveness of traditional information retrieval-based bug localization has been underestimated, while the application of the deep learning-based CodeBERT to bug localization still needs more exploration.
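    For readers unfamiliar with the two metrics quoted above, the sketch below shows how MAP and MRR are typically computed for bug localization results: each bug report has a ranked list of files and a set of truly buggy files. The data is made up for illustration.

```python
def average_precision(ranked_files, buggy_files):
    hits, precision_sum = 0, 0.0
    for i, f in enumerate(ranked_files, start=1):
        if f in buggy_files:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(buggy_files) if buggy_files else 0.0

def reciprocal_rank(ranked_files, buggy_files):
    for i, f in enumerate(ranked_files, start=1):
        if f in buggy_files:
            return 1.0 / i
    return 0.0

results = [
    (["a.java", "b.java", "c.java"], {"b.java"}),
    (["x.java", "y.java", "z.java"], {"x.java", "z.java"}),
]
print(sum(average_precision(r, g) for r, g in results) / len(results))   # MAP
print(sum(reciprocal_rank(r, g) for r, g in results) / len(results))     # MRR
```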
    Available online:  June 06, 2022 , DOI: 10.13328/j.cnki.jos.006642
    Abstract:
    The computation offloading problem of multi-access edge computing (MEC) has become one of the research focuses. Current computation offloading schemes only consider the computation offloading problem in cloud-edge-end structures and do not take into account the attributes of public and private clouds. In this study, a novel computation offloading scheme is proposed, which considers the relationship between the public cloud and the private cloud in edge computing and regards the public cloud as a supplement to private cloud resources to alleviate the insufficient computing power caused by the limitations of private cloud resources. Moreover, a two-layer Stackelberg game is established to solve the computation offloading problem. The optimal strategies of each player are obtained by analyzing the strategies and profits of the public cloud, the private cloud, and the users, and the existence and uniqueness of the Nash equilibrium solution of the two-layer game are proved. The simulation results and analysis verify the feasibility of the computation offloading scheme based on the two-layer Stackelberg game. Compared with a computation offloading scheme based on a single-layer Stackelberg game, the proposed scheme is more efficient and more suitable for edge computing environments.
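    As a generic illustration of how a two-layer Stackelberg game is solved by backward induction (not the paper's specific model), the toy sketch below lets a leader fix a price, lets the follower best-respond with how much computation to offload, and then searches for the leader's profit-maximizing price. The utility functions are made up purely for illustration.

```python
import numpy as np

def follower_best_response(price):
    # User utility: benefit of offloading x in [0, 1] minus payment, concave in x.
    xs = np.linspace(0.0, 1.0, 1001)
    utility = np.sqrt(xs) - price * xs
    return xs[np.argmax(utility)]

def leader_profit(price, unit_cost=0.2):
    x = follower_best_response(price)
    return (price - unit_cost) * x

prices = np.linspace(0.05, 2.0, 200)
best_price = max(prices, key=leader_profit)
print(best_price, follower_best_response(best_price))   # equilibrium of the toy game
```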
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006650
    Abstract:
    Discourse structure analysis aims to understand the overall structure of an article and the semantic relationships between its various parts. As a research hotspot of natural language processing, it has developed rapidly in recent years. This study first summarizes the mainstream discourse structure analysis theories in English and Chinese and then introduces the research on the popular English and Chinese discourse corpora as well as their calculation models. On this basis, this study surveys the current work context of discourse structure analysis in Chinese and English and constructs its research framework. Moreover, the current research trends and focuses are summarized, and the application of discourse structure in downstream tasks is introduced briefly. Finally, this study points out the issues and challenges in the current Chinese discourse structure analysis to provide guidance and help for future research.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006645
    [Abstract] (677) [HTML] (0) [PDF 4.10 M] (1247)
    Abstract:
    Event extraction is to automatically extract event information in which users are interested from unstructured natural language texts and express it in a structured form. Event extraction is an important direction in natural language processing and understanding and is of high application value in different fields, such as government management of public affairs, financial business, and biomedicine. According to the degree of dependence on manually labeled data, the current event extraction methods based on deep learning are mainly divided into two categories: supervised learning and distantly-supervised learning. This article provides a comprehensive overview of current event extraction techniques in deep learning. Focusing on supervised methods such as CNN, RNN, GAN, GCN, and distant supervision, this study systematically summarizes the research in recent years. Additionally, the performance of different deep learning models is compared and analyzed in detail. Finally, the challenges facing event extraction are analyzed, and the research trends are forecasted.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006684
    Abstract:
    Distributed systems play an important role in the computing environment. The consensus protocol is used to guarantee the consistency between nodes. The design error in the consensus protocol might cause failure in the operation of the system and might bring catastrophic consequences to humans or the environment. Therefore, proving the correctness of the consensus protocol is very important. Formal verification can strictly prove the correctness of the target properties in the design model, which is suitable for verifying consensus protocols. However, with the scale of distributed systems increasing, it becomes more complicated to verify its correctness, which brings more challenges to the formal verification techniques. What method to use to formally verify the design of the consensus protocol and how to increase the verification scale are important research issues in the formal verification of the consensus protocol. This paper investigates the current research on the use of formal methods to verify consensus protocols, summarizes the key techniques and main methods, and proposes future research directions in this field.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006657
    Abstract:
    Lattice-based cryptanalysis, an analysis method using the algorithms solving hard lattice problems to analyze the security of public-key cryptosystems, has become one of the powerful mathematical tools for studying the security of the Rivest-Shamir-Adleman (RSA)-type cryptographic algorithms. The key point of this method is the construction of the lattice basis. There exists a general strategy for lattice basis construction. However, this general strategy fails to fully and flexibly utilize the algebraic structure of the RSA algorithm and its variants. In recent years, lattice-based cryptanalysis of RSA-type algorithms mostly focuses on introducing special techniques of lattice base construction on the basis of the general strategy. This study starts by outlining lattice-based cryptanalysis and the general strategy for lattice basis construction and summarizing several commonly used techniques of lattice basis construction. Subsequently, the main achievements in lattice-based cryptanalysis of the standard RSA algorithm are reviewed, and they involve factoring with known bits, small private exponent attacks, and partial key exposure attacks. Then, the special algebraic structures of several mainstream variants of the RSA algorithm and the techniques of lattice basis construction applicable to these variants are summarized. Finally, the available work on lattice-based cryptanalysis of the RSA algorithm and its variants is classified and summed up, and the prospects of the research and development of lattice-based cryptanalysis are presented.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006666
    [Abstract] (482) [HTML] (0) [PDF 1.19 M] (1062)
    Abstract:
    During software development, developers make extensive use of third-party libraries to relieve themselves of heavy burdens instead of reinventing common functions. There are dependencies between different third-party libraries, and incompatibilities between versions lead to errors when installing, loading, or invoking third-party libraries, resulting in system exceptions. This problem is called the dependency conflict (DC, also referred to as conflict dependency or CD) issue. The root cause of this issue is that the loaded third-party libraries fail to cover the required features (e.g., methods). DC issues often occur at a project's build time or runtime and are difficult to locate. Repairing DC issues requires developers to know about the differences among the versions of the third-party libraries they use, and the complex relationships between third-party libraries increase the difficulty of repair. In order to find DC issues before software runs, and to respond to and deal with system anomalies caused by DC issues at runtime, researchers have carried out various studies on these issues. This paper conducts a systematic review of this research topic from four aspects: the usage analysis of third-party libraries, the root causes of DC issues, the detection methods for DC issues, and common fixing strategies. Finally, potential research opportunities are discussed, and references are provided for researchers in this field.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006671
    [Abstract] (651) [HTML] (0) [PDF 1.26 M] (1186)
    Abstract:
    Inverse reinforcement learning (IRL), also known as inverse optimal control (IOC), is a subfield of imitation learning and reinforcement learning. In order to learn expert behavior, IRL methods infer a reward function from expert demonstrations and then adopt a reinforcement learning algorithm to find the desired behavior. In recent years, IRL methods have received a lot of attention and have been successfully used to solve a variety of tasks, such as navigation for vehicle investigation, trajectory planning, and robotic optimal control. First, the fundamental theories, including the formal definition of IRL, are presented. Then, the research progress of IRL methods is introduced, covering algorithms based on linear and non-linear reward functions, such as maximum margin approaches and maximum entropy approaches. In addition, from the frontier research directions of inverse reinforcement learning, representative algorithms are introduced and analyzed, including IRL approaches for incomplete expert demonstrations, multi-agent IRL approaches, IRL approaches for sub-optimal expert demonstrations, and guiding IRL approaches. Finally, the primary challenges and future developments of inverse reinforcement learning methods are summarized.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006672
    [Abstract] (394) [HTML] (0) [PDF 4.15 M] (1041)
    Abstract:
    The basic idea of the multiobjective evolutionary algorithm based on decomposition (MOEA/D) is to transform a multiobjective optimization problem into a set of subproblems (single-objective or multiobjective). Since MOEA/D was proposed in 2007, it has attracted extensive attention from scholars all over the world and has become one of the most representative multiobjective evolutionary algorithms. This paper summarizes the research progress on MOEA/D in the past thirteen years, including: (1) improvements of MOEA/D; (2) research on MOEA/D for many-objective optimization problems and constrained optimization problems; (3) applications of MOEA/D to real-world problems. Then, this paper experimentally compares the performance of several representative improved variants of MOEA/D. Finally, several potential research topics of MOEA/D are presented.
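    The decomposition idea at the core of MOEA/D can be illustrated with the widely used Tchebycheff scalarization: each weight vector defines one scalar subproblem. The toy bi-objective function and weight vectors below are illustrative, not from the survey.

```python
import numpy as np

def tchebycheff(objs, weight, ideal):
    # Scalarized value of an objective vector for one subproblem.
    return np.max(weight * np.abs(objs - ideal))

def evaluate(x):
    # Toy bi-objective problem on [0, 1]: f1 = x, f2 = (1 - x)^2.
    return np.array([x, (1.0 - x) ** 2])

weights = [np.array([w, 1.0 - w]) for w in np.linspace(0.05, 0.95, 10)]
ideal = np.array([0.0, 0.0])
candidates = np.linspace(0.0, 1.0, 101)

# For each subproblem (weight vector), keep the candidate minimizing its scalar value;
# together these solutions approximate the Pareto front.
for w in weights:
    best = min(candidates, key=lambda x: tchebycheff(evaluate(x), w, ideal))
    print(w, round(best, 2))
```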
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006677
    Abstract:
    Distributed ledger (DL) is usually considered a distributed data management architecture. It maintains data records (the ledgers) across distributed nodes based on consensus mechanisms and protocols. A DL system can trace all the information of data ownership, usage, and trading chains throughout the lifecycle of data production and transactions and protect the data from illegal use through tamper resistance and non-repudiation. It thus provides endorsement for data rights confirmation, protection, and auditing. Blockchain is a typical implementation of a DL system. With the emergence of digital economy applications such as digital currency and data asset trading, DL technologies have gained more and more attention. However, system performance is one of the key technical bottlenecks for the large-scale application of DL systems, and performance optimization has become a research focus of academia and industry. This paper investigates the methods, technologies, and typical solutions of DL performance optimization from four perspectives: DL system architecture, ledger data structure, consensus mechanism, and message communication.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006679
    Abstract:
    Deep learning (DL) systems have powerful learning and reasoning capabilities and are widely used in many fields, such as unmanned vehicles, speech processing, and intelligent robotics. Due to limited datasets and the dependence on manually labeled data, DL systems often fail to detect erroneous behaviors. Accordingly, the quality of DL systems has received widespread attention, especially in safety-critical fields. Since fuzzing has shown efficient fault-detecting ability in traditional programs, employing fuzzing to test DL systems has become a hot research field in recent years. This study presents a systematic review of fuzzing for DL systems, focusing on test case generation (including seed queue construction, seed selection, and seed mutation), test result determination, and coverage analysis, and then introduces commonly used datasets and metrics. Issues and opportunities for future research in this field are also discussed.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006680
    Abstract:
    In recent years, deep learning technology has made remarkable progress in many computer vision tasks, and more and more researchers have tried to apply it to the field of medical image processing, such as the segmentation of anatomical structures in high-throughput medical images (CT, MRI), which can improve the efficiency of image reading for doctors. For specific deep learning tasks in medical applications, training deep neural networks requires a large amount of labeled data, but in the medical field it is hard to obtain large amounts of data, even unlabeled data, from a single medical institution. Moreover, due to differences in medical equipment and acquisition protocols, the data from different medical institutions differ considerably. This large heterogeneity of data makes it difficult for a model trained on the data of some medical institutions to obtain reliable results on the data of another. In addition, the distribution of disease stages in a dataset is often very uneven, which may also reduce the reliability of the model. In order to reduce the impact of data heterogeneity and improve the generalization ability of models, domain adaptation and multi-site learning have gradually come into use. Domain adaptation, a research hotspot in transfer learning, is intended to transfer knowledge learned from the source domain to unlabeled target-domain data; federated learning on non-independent and identically distributed (non-iid) data aims to improve the robustness of models by learning a common representation on multiple datasets. This paper investigates, analyzes, and summarizes recent work on domain adaptation, multi-site learning, federated learning on non-iid data, and related datasets, providing references for related research.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006681
    Abstract:
    Heterogeneous information networks are a representation form of heterogeneous data. How to integrate the complex semantic information of heterogeneous data is one of the challenges faced by recommendation systems. A high-order embedding learning framework for heterogeneous information networks based on weak ties is constructed by using the rich semantic information and powerful information transmission capability of weak ties. It mainly includes three modules: initial information embedding, high-order information embedding aggregation, and recommendation prediction. The initial information embedding module first adopts a best trust path selection algorithm to avoid the information loss caused by sampling a fixed number of neighbors in a fully relational heterogeneous information network. Then, network nodes are effectively characterized by filtering the semantic information of each node with a newly defined multi-task shared feature importance measurement factor based on multi-head attention, combined with the interaction structure. The high-order information embedding aggregation module realizes the expression of high-order information by integrating the representational ability of weak ties with network embedding, and the hierarchical propagation mechanism of the heterogeneous information network is utilized to aggregate the features of sampled nodes into the nodes to be predicted. The recommendation prediction module uses an influence-based recommendation method with high-order information to complete the recommendation. The UI-HEHo framework is characterized by rich types of embedded nodes, the fusion of shared attributes, and implicit interaction information. Finally, experiments verify that UI-HEHo can effectively improve the accuracy of rating prediction, as well as the pertinence, novelty, and diversity of recommendation generation. Especially in application scenarios with sparse data, UI-HEHo achieves a good recommendation effect.
    Available online:  May 24, 2022 , DOI: 10.13328/j.cnki.jos.006520
    Abstract:
    In view of the fact that syntactic relationships are not fully utilized and argument roles are missing in event extraction, an event extraction method based on a dual attention mechanism (EEDAM) is proposed to improve the accuracy and recall rate of event extraction. Firstly, sentence coding is based on four embedded vectors, and dependency relations are introduced to construct a dependency relation graph so that the deep neural network can make full use of syntactic relations. Then, through a graph transformation attention network, new dependency arcs and aggregated node information are generated to capture long-range dependencies and potential interactions, a weighted attention network is integrated to capture the key semantic information in sentences, and sentence-level event arguments are extracted to improve the prediction ability of the model. Finally, key sentence detection and similarity ranking are used to fill in document-level arguments. The experimental results show that, compared with the optimal baseline, the joint multiple Chinese event extractor (JMCEE), the event extraction method based on the dual attention mechanism improves the accuracy, recall rate, and F1-score by 17.82%, 4.61%, and 9.80%, respectively, on the ACE2005 dataset. On a dataset of dam safety operation records, the accuracy, recall rate, and F1-score are 18.08%, 4.41%, and 9.93% higher than those of the optimal baseline JMCEE, respectively.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006535
    Abstract:
    Heterogeneous information networks can be used for modeling several applications in the real world. Their representation learning has received extensive attention from scholars. Most representation learning methods extract structural and semantic information based on meta-paths, and their effectiveness in network analysis has been proved. However, these methods ignore the internal information of nodes and the different degrees of importance of meta-path instances, and they can capture only local node information. Thus, this study proposes a heterogeneous network representation learning method fusing mutual information and multiple meta-paths. First, a meta-path internal encoding method called relational rotation encoding is used, which captures the structural and semantic information of the heterogeneous information network according to adjacent nodes and meta-path context nodes. It uses an attention mechanism to model the importance of each meta-path instance. Then, an unsupervised heterogeneous network representation learning method fusing mutual information maximization and multiple meta-paths is proposed, in which mutual information can capture both global and local information. Finally, experiments are conducted on two real datasets. Compared with current mainstream algorithms as well as some semi-supervised algorithms, the results show that the proposed method has better performance on node classification and clustering.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006634
    Abstract:
    Interference among wireless signals hinders the concurrent transmission of signals and reduces the throughput of wireless networks. Link scheduling is an effective way to improve the throughput and decrease the transmission delay of wireless networks, and the signal-to-interference-plus-noise ratio (SINR) model can accurately describe the inherent characteristics of wireless signal propagation and truly reflect the interference among wireless signals. Therefore, this study proposes an online distributed link scheduling (OLD_LS) algorithm with a constant approximation factor under the SINR model for dynamic wireless networks. Here, online means that nodes can join and leave the wireless network at any time, and this arbitrary behavior of nodes reflects the dynamic characteristics of wireless networks. The OLD_LS algorithm partitions the network region into hexagons to localize the global interference of the SINR model. In addition, a leader election (LE) subroutine for dynamic networks is designed in this study. It is shown that as long as the dynamic rate of nodes is less than 1/ε, LE elects a leader with high probability in time O(log n + log R), where ε is a constant satisfying ε ≤ 5(1 − 2^{1−α/2})/6, α is the path loss exponent, n is the number of senders, and R is the longest link length. To the best of our knowledge, the algorithm proposed in this study is the first online distributed link scheduling algorithm for dynamic wireless networks.
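    For reference, the following is a standard formulation of the SINR feasibility condition referred to above; the notation is the conventional one and may differ from the paper's exact definition.

```latex
% Link (u, v) is feasible iff the signal from sender u dominates noise plus the
% aggregate interference from the other concurrent senders in S.
\[
  \mathrm{SINR}(u, v) \;=\;
  \frac{P \cdot d(u, v)^{-\alpha}}
       {N + \sum_{w \in S \setminus \{u\}} P \cdot d(w, v)^{-\alpha}}
  \;\geq\; \beta
\]
% P: transmission power, d: Euclidean distance, alpha: path loss exponent,
% N: ambient noise, beta: the SINR threshold required for successful decoding.
```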
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006635
    [Abstract] (979) [HTML] (0) [PDF 9.23 M] (1017)
    Abstract:
    Network measurement is the basis of scenes including network performance monitoring, traffic management, and fault diagnosis, and in-band network telemetry (INT) has become the focus of network measurement research due to its timeliness, accuracy, and scalability. With the emergence and development of programmable data planes, many practical INT solutions have been proposed thanks to their rich information feedback and flexible function deployment. First, this study analyzes the principles and deployment challenges of typical INT solutions INT and AM-PM. Second, according to the optimization measures and extension of INT, it studies the characteristics of the optimization mechanism from the aspects of the data collection process and multi-tasking, as well as the feasibility of technology extension in terms of wireless networks, optical networks, and hybrid networks. Third, in view of the applications of INT in typical scenes, the characteristics of these INT applications are comparatively investigated from the perspectives of in-network performance sensing, network-level telemetry systems, traffic scheduling, and fault diagnosis. Finally, a research summary of INT is made, and the future research directions are predicted.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006636
    Abstract:
    This study proposes a new classical key recovery attack against schemes such as Feistel, Misty, and Type-1/2 generalized Feistel schemes (GFS), which creatively combines the birthday attack with the periodic property of Simon's algorithm. Although Simon's algorithm can recover the periodic value in polynomial time, this study requires the birthday bound to recover the corresponding periodic value in the classical setting. With this new attack, the key of a 5-round Feistel-F scheme can be recovered with time complexity O(2^{3n/4}) under O(2^{n/4}) chosen plaintexts and ciphertexts, and the corresponding memory complexity is O(2^{n/4}). Compared with the results of Isobe and Shibutani, the above result not only covers one more round but also requires lower memory complexity. For the Feistel-FK scheme, a 7-round key recovery attack is constructed. In addition, the above approach is applied to construct key recovery attacks against Misty schemes and Type-1/2 GFS. Specifically, key recovery attacks against the 5-round Misty L-F and Misty R-F schemes and against the 6-round Misty L-KF/FK and Misty R-KF/FK schemes are given; for the d-branch Type-1 GFS, a d^2-round key recovery attack is presented, and when d ≥ 6, the number of rounds of this key recovery attack is superior to those of existing key recovery attacks.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006637
    Abstract:
    Comment generation for software codes has been an important research task in the field of software engineering in the past few years. Several research efforts have achieved impressive results on the open-source datasets that contain copious <code snippet, comment> pairs. In the practice of software enterprises, however, the codes to be commented usually belong to a software project library, and it should be decided first on which code lines the comment generation can achieve better performance; moreover, the code snippets to be commented have different lengths and granularity. Thus, a code comment generation method is required, which can integrate commenting decisions and comment generation and is resistant to noise. To this end, CoComment, a software project-oriented code comment generation approach, is proposed in this study. This approach can automatically extract domain-specific basic concepts from software project documents and then uses code parsing and text matching to propagate and expand these concepts. On this basis, automatic code commenting decisions are made by locating code lines or segments related to these concepts, and corresponding natural language comments with high readability are generated upon the fusion of concepts and contexts with templates. Comparative experiments are conducted on three enterprise software projects containing more than 46000 manually annotated code comments. The experimental results demonstrate the proposed approach can effectively make code commenting decisions and generate more helpful code comments compared with existing methods, which provides an integrated solution to code commenting decisions and comment generation for software projects.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006638
    Abstract:
    With the rapid development of technologies such as the Internet of Things (IoT) and cloud computing, portable health clinics (PHCs) have been realized and widely used in telemedicine. Relying on the significant advantages of 5G communications, China has actively promoted the construction of smart healthcare and built a multi-function, high-quality telemedicine information service platform. The realization of telemedicine represented by PHCs is inseparable from the technical support of remote data-sharing systems. At present, remote data-sharing systems combining IoT and the cloud server (CS) have attracted wide attention due to their flexibility and efficiency, but their privacy and security issues are rarely studied. Considering the sensitivity of medical data, this paper studies the security and privacy issues in the PHC data-sharing system. As a result, in the PHC system, this study achieves the secure uploading of IoT awareness data, normalization of personalized ciphertexts, dynamic multi-user fine-grained access control, and efficient decryption operations, and it also presents formal security verification. The specific innovations of this study are as follows. (1) The classical proxy re-encryption (PRE) and attribute-based encryption algorithms are improved, and an IPRE-TO-FAME combined encryption mechanism is proposed to ensure the data-sharing security of the PHC system with cloud-edge collaboration. (2) To address the challenge of key updates caused by many highly distributed IoT terminals, this paper uses the idea of PRE to realize key updates through unilateral transformation without changing the keys of IoT terminals. Meanwhile, the re-encryption entities can be regarded as fully trusted in the application scenarios of this study, which is different from the conventional PRE mechanism, where the re-encryption entities are usually untrusted third-party servers. Therefore, the conventional PRE algorithm is improved, and an efficient improved PRE (IPRE) algorithm is put forward to adapt to the scenarios considered in this study. (3) The classical fast attribute-based message encryption (FAME) mechanism is improved to enable dynamic multi-user fine-grained access control, so that users can easily use portable intelligent devices to access data anytime and anywhere. The security proofs, theoretical analysis, and experimental results reveal that the proposed solution is highly secure and practical, which makes it an effective way to ensure secure PHC data sharing.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006615
    Abstract:
    Asynchronous programs utilize asynchronous non-blocking calls to achieve program concurrency and are widely applied in parallel and distributed systems. However, verifying asynchronous programs is very complex: the complexity of verifying both safety and liveness is EXPSPACE. This study proposes a program model of asynchronous programs and defines two problems of asynchronous programs, namely, ?-equivalence and ?-reachability. The two problems are proved to be NP-complete by reducing 3-CNF-SAT to them and further reducing them to the reachability problem of communication-free Petri nets. A case study shows that the two problems can be used to solve verification problems of asynchronous programs.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006616
    Abstract:
    The reliable functioning of safety-critical IT systems depends heavily on the correct execution of programs. Deductive program verification can be performed to guarantee the correct execution to a large extent. There are already a plethora of programming languages in use, and new languages oriented toward high-reliability scenarios are still being invented. As a result, it is difficult to devise a full-fledged logical system for each language to support the verification of programs and prove the soundness and completeness of the logical system with respect to the formal semantics of the language. Furthermore, language-independent verification techniques offer sound verification procedures parameterized over the formal semantics of programming languages. The specialization of the verification procedure with the formal semantics of a concrete programming language directly gives rise to a verification procedure for the language. This study proposes a language-independent verification technique based on big-step operational semantics. The technique features a unified procedure for sound reasoning about program structures that potentially cause unbounded behavior, such as iteration and recursion. In particular, the study employs a functional formalization of big-step operational semantics to support the explicit representation of the computation performed by the sub-structures of a program. This representation enables the exploitation of the auxiliary information provided for these sub-structures in the unified reasoning process. In addition, the study has proved the soundness and relative completeness of the proposed technique, evaluated the technique by verification examples in imperative and functional programming languages, and formalized all theoretical results and verification examples in the Coq proof assistant, and thus provides a basis for developing a language-independent program verifier with big-step operational semantics based on a proof assistant.
    Available online:  March 24, 2022 , DOI: 10.13328/j.cnki.jos.006617
    Abstract:
    With the development of the Internet, the 5th generation (5G) of mobile communication technology emerges. The 5G authentication and key agreement (5G-AKA) protocol is proposed mainly to achieve two-way authentication between users and service networks. However, recent research suggests that the protocol may be subject to information deciphering and message replay attacks. At the same time, it is found that some variants of the current 5G-AKA cannot satisfy the protocol’s unlinkability. Therefore, in response to these shortcomings, this study proposes an improvement plan called SM-AKA. SM-AKA is composed of two parallel sub-protocols in a novel way. In addition, through the flexible mode switching, lightweight sub-protocols (GUTI submodule) are frequently adopted, and the other sub-protocol (SUPI submodule) is used to deal with abnormalities caused by authentication. Therefore, this mechanism not only realizes the efficient authentication between users and networks but also improves the stability of the protocol. Furthermore, the freshness of variables has been effectively maintained to prevent the replay of messages, and strict encryption and decryption methods have further strengthened the security of the protocol. Finally, the study carries out a complete evaluation of SM-AKA. Through formal modeling, attack assumptions, and Tamarin derivation, it is proved that the plan can achieve the authentication and privacy goals, and the theoretical analysis has demonstrated the performance advantage of the protocol.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006594
    [Abstract] (557) [HTML] (0) [PDF 9.71 M] (1164)
    Abstract:
    The realization of safe and efficient behavior decision-making has become a challenging issue for autonomous driving. As autonomous driving industries develop vigorously, industrial professionals and academic members have proposed many autonomous driving behavior decision-making approaches. However, due to the influence of environmental uncertainties as well as requirements for effectiveness and high security of the decision, existing approaches fail to take all these factors into account. Therefore, this study proposes an autonomous driving behavior decision-making approach with the RoboSim model based on the Bayesian network. First, based on domain ontology, the study analyzes the semantic relationship between elements in autonomous driving scenarios and predicts the intention of dynamic entities in scenarios by the LSTM model, so as to provide driving scenario information for establishing the Bayesian network. Next, the autonomous driving behavior decision-making in specific scenarios is inferred by the Bayesian network, and the state transition of the RoboSim model is employed to carry the dynamic execution of behavior decision-making and eliminate the redundant operation of the Bayesian network, thus improving the efficiency of decision-making. The RoboSim model is platform-independent. In addition, it can simulate the decision-making cycle and support validation technologies in different forms. To ensure the safety of the behavior decision-making, this study uses a model checking tool UPPAAL to verify and analyze the RoboSim model. Finally, based on lane change and overtaking cases, this study validates the feasibility of the proposed approach and provides a feasible way to achieve safe and efficient autonomous driving behavior decision-making.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006614
    Abstract:
    Determinization of an automaton refers to the transformation of a nondeterministic automaton into a deterministic automaton recognizing the same language, which is one of the fundamental notions in automata theory. Determinization of ω automata serves as a basic step in the decision procedures of SnS, CTL*, and μ-calculus. Meanwhile, it is also the key to solving infinite games. Therefore, studies on the determinization of ω automata are of great significance. This study focuses on a kind of ω automata called Streett automata. Nondeterministic Streett automata can be transformed into equivalent deterministic Rabin or Parity automata. In the previous work, the study has obtained algorithms with optimal state complexity and optimal asymptotical performance, respectively. Furthermore, it is necessary to develop a tool supporting Streett determinization, so as to evaluate the actual effect of proposed algorithms and show the procedure of determinization visually. This study first introduces four different Streett determinization structures including μ-Safra trees, H-Safra trees, compact Streett Safra trees, and LIR-H-Safra trees. By H-Safra trees, which are optimal, and μ-Safra trees, deterministic Rabin transformation automata are obtained. In addition, deterministic parity transformation automata are constructed via another two structures, where LIR-H-Safra trees are asymptotically optimal. Furthermore, based on the open source software named graphical tool for omega-automata and logics (GOAL), the study develops a tool for Streett determinization and names it NS2DR&PT to support the four structures. Besides, corresponding test sets are constructed by randomly generating 100 Streett automata, and comparative experiments are carried out. Results show that the actual effect of state complexity in each structure is consistent with theoretical analysis. Moreover, the efficiency of different algorithms is compared and analyzed.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006492
    Abstract:
    For a given set of moving objects, a continuous k-nearest neighbor (CKNN) query q over moving objects is to quickly identify and monitor the k nearest objects as the objects and the query point evolve. In real life, many location-based applications in transportation, social networks, e-commerce, and other fields involve the basic problem of processing CKNN queries over moving objects. Most existing work on CKNN queries needs to determine a query range containing the k nearest neighbors through multiple iterations, and each iteration has to identify the number of objects in the current query range, which dominates the query cost. To address this issue, this study proposes a dual index called GGI, which consists of a grid index and a Gaussian mixture function that simulates the varying distribution of objects. The bottom layer of GGI employs a grid index to maintain moving objects, and the upper layer constructs a Gaussian mixture model to simulate the distribution of moving objects in two-dimensional space. Based on GGI, an incremental search algorithm called IS-CKNN is proposed to process CKNN queries. This algorithm directly determines a query region that contains at least k neighbors of q based on the Gaussian mixture model, which greatly reduces the number of iterations. When the objects and the query point evolve, an efficient incremental query strategy is further proposed, which maximizes the use of existing query results and reduces the computation of the current query. Finally, extensive experiments on one real dataset and two synthetic datasets confirm the superiority of the proposal.
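    The sketch below illustrates the general idea of using a Gaussian mixture density to pick an initial query region expected to hold at least k neighbors, instead of growing the region iteratively. The density-to-count conversion is a simplification for illustration and is not the paper's exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
objects = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 2, (500, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(objects)

def initial_radius(query, k, n_objects):
    # Local density (expected objects per unit area) near the query point.
    density = np.exp(gmm.score_samples(query.reshape(1, -1)))[0] * n_objects
    # Radius of a disk expected to contain roughly k objects at that density.
    return np.sqrt(k / (np.pi * density))

q = np.array([0.5, 0.5])
r = initial_radius(q, k=10, n_objects=len(objects))
in_range = np.sum(np.linalg.norm(objects - q, axis=1) <= r)
print(r, in_range)   # the region is then refined only if it holds fewer than k objects
```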
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006595
    Abstract:
    ARM has developed an M-Profile vector extension for the ARMv8.1-M microprocessor architecture, named ARM Helium, and claims that it can increase the machine learning performance of the ARM Cortex-M processor by up to 15 times. As the Internet of Things develops rapidly, the correct execution of microprocessors is important. In addition, the official instruction set manual provides the basis for developing chip simulators and on-chip applications and is thus a basic guarantee of program correctness. This study formalizes the semantic correctness of the vectorized machine learning instructions in the official manual of the ARMv8.1-M architecture using K Framework. Furthermore, the study automatically extracts the pseudocode describing the vectorized machine learning instruction operations from the manual and formalizes it as semantic rules. With the executable framework provided by K Framework, the correctness of the machine learning instructions in arithmetic operations is verified.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006612
    Abstract:
    The security of the trusted execution environment (TEE) has been concerned by Chinese and foreign researchers. Memory tag technology utilized in TEE helps to achieve finer-grained memory isolation and access control mechanisms. Nevertheless, prior works often rely on testing or empirical analysis to show their effectiveness, which fails to strongly guarantee the functional correctness and security properties. This study proposes a general formal model framework for memory tag-based access control and introduces a security analysis method in access control based on model checking. First, a general model framework for the access control model of TEE based on memory tag is constructed through a formal method, and those access control entities are formally defined. The defined rules include access control rules and tag update rules. Then abstract machines under the framework are incrementally designed and implemented with formal language B. In addition, the basic properties of the machines are formalized through invariant constraints. Next, a TEE implementation called TIMBER-V is used as an application case. The TIMBER-V access control model is constructed by instantiating these abstract machines, and the security properties are formally specified. Furthermore, the functional correctness and security of the models are verified based on model checking. Finally, this study simulates the specific attack scenarios, and these attacks are successfully detected. The evaluation results have proved the effectiveness of the security analysis method.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006538
    [Abstract] (578) [HTML] (0) [PDF 1.82 M] (1012)
    Abstract:
    Existing malware similarity measurement methods cannot cope with code obfuscation technology and lack the ability to model the complex relationships between malware samples. This study proposes a malware similarity measurement method based on a multiplex heterogeneous graph, called API relation graph enhanced multiplex heterogeneous proximity embedding (RG-MHPE), to solve the above problems. The method first uses the dynamic and static features of malware to construct the multiplex heterogeneous graph and then proposes an enhanced proximity embedding method based on relational paths to solve the problem that proximity embedding cannot be directly applied to similarity measurement over multiplex heterogeneous graphs. In addition, this study extracts knowledge from the API documents on the MSDN website, builds an API relation graph, and learns the similarity between Windows APIs, which effectively slows down the aging of the similarity measurement model. Finally, experimental results show that RG-MHPE achieves the best performance in both similarity measurement and model anti-aging ability.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006552
    Abstract:
    As emerging technologies develop rapidly, domain software puts forward new requirements for development efficiency. As a declarative programming language with concise syntax and well-defined semantics, Datalog can help developers solve complex problems rapidly and has therefore attracted wide attention in recent years. However, when solving real-world problems, existing single-machine Datalog engines are often limited by the size of memory and lack scalability. To solve these problems, this study designs and implements a Datalog engine based on out-of-core computation. The study first designs a series of operators supporting out-of-core computation to evaluate Datalog programs, and each program is converted into a C++ program using these operators. Next, the study designs a hash-based partition strategy and a minimum-replacement scheduling strategy based on search tree pruning; the corresponding partition files are then scheduled and computed to generate the final results. Based on this method, the study builds the prototype tool DDL (disk-based Datalog engine) and selects widely used real-world Datalog programs for experiments on both synthetic and real-world datasets. The experimental results show that DDL achieves good performance and high scalability.
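    A minimal sketch of the hash-based partitioning idea: every tuple of a relation is spilled to the partition file selected by hashing its join key, so tuples that may join always land in the same partition. The file naming scheme, the fixed partition count, and the CSV format are assumptions made for illustration, not details of DDL.

        # Sketch: spill a relation to disk in hash partitions so that tuples that
        # can join with each other always end up in the same partition file.
        import csv

        NUM_PARTITIONS = 8

        def partition_of(key, num_partitions=NUM_PARTITIONS):
            return hash(key) % num_partitions

        def spill_relation(tuples, key_index, prefix="edge"):
            """tuples: iterable of tuples; key_index: position of the join key."""
            writers, files = {}, {}
            for t in tuples:
                p = partition_of(t[key_index])
                if p not in writers:
                    files[p] = open(f"{prefix}.part{p}.csv", "w", newline="")
                    writers[p] = csv.writer(files[p])
                writers[p].writerow(t)
            for f in files.values():
                f.close()

        spill_relation([(1, 2), (2, 3), (3, 1)], key_index=0)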
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006618
    Abstract:
    Data races are common defects in multi-threaded programs. Traditional data race analysis methods fail to achieve both high recall and high precision, and their detection reports cannot locate the root cause of the defects. Owing to the advantages of Petri nets in accurate behavior description and rich analysis tools for modeling and analyzing concurrent systems, this study proposes a new data race detection method based on Petri net unfolding. First, a Petri net model of the program is built by analyzing and mining a program execution trace. Although mined from a single trace, the model implies other possible traces of the program, which reduces the false negative rate of traditional dynamic analysis methods while preserving performance. Then, an unfolding-based detection method for potential data races is proposed, which improves efficiency significantly compared with static analysis methods and clearly shows the triggering path of each data race. Finally, for the potential data races detected in the previous stage, a scheduling scheme is designed to replay the defects on the CalFuzzer platform, which eliminates false positives and ensures the authenticity of the detection results. A corresponding prototype system is developed, and the effectiveness of the proposed method is verified on open program instances.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006593
    [Abstract] (551) [HTML] (0) [PDF 8.51 M] (1000)
    Abstract:
    In recent years, deep reinforcement learning has been widely and successfully used in sequential decision making, and it has outstanding advantages in application scenarios with high-dimensional inputs and large state spaces. However, deep reinforcement learning faces limitations such as a lack of interpretability, inefficient initial training, and the cold start problem. To address these issues, this study proposes a dynamic decision framework that combines explicit knowledge reasoning with deep reinforcement learning. The framework embeds prior knowledge into agent training via explicit knowledge representation and lets the knowledge reasoning results intervene in the agent's decisions during reinforcement learning, so as to improve training efficiency and model interpretability. The explicit knowledge is divided into two kinds, namely heuristic acceleration knowledge and evasive safety knowledge. Heuristic acceleration knowledge intervenes in the agent's decisions during early training to speed up learning, while evasive safety knowledge keeps the agent from making catastrophic decisions and keeps the training process stable. Experimental results show that the proposed framework significantly improves training efficiency and model interpretability under different application scenarios and reinforcement learning algorithms.
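    The two kinds of knowledge intervention can be sketched as follows (a generic illustration under our own assumptions, not the paper's implementation): safety rules veto catastrophic actions, while heuristic rules steer action choice during early training.

        import random

        def choose_action(state, q_values, heuristic_rules, safety_rules,
                          episode, warmup_episodes=100, epsilon=0.1):
            """Pick an action, letting explicit knowledge intervene in two ways."""
            actions = list(range(len(q_values)))
            # Evasive safety knowledge: drop actions any safety rule forbids.
            allowed = [a for a in actions
                       if not any(rule(state, a) for rule in safety_rules)]
            if not allowed:
                allowed = actions  # fall back rather than deadlock
            # Heuristic acceleration knowledge: early in training, prefer the
            # action suggested by a heuristic rule if one fires.
            if episode < warmup_episodes:
                for rule in heuristic_rules:
                    suggestion = rule(state)
                    if suggestion in allowed:
                        return suggestion
            # Otherwise fall back to epsilon-greedy over the allowed actions.
            if random.random() < epsilon:
                return random.choice(allowed)
            return max(allowed, key=lambda a: q_values[a])

        # Example: two actions; the safety rule forbids action 1 in state "danger".
        safety = [lambda s, a: s == "danger" and a == 1]
        heuristics = [lambda s: 0]
        assert choose_action("danger", [0.2, 0.9], heuristics, safety, episode=0) == 0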
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006558
    Abstract:
    Feature requests are suggestions to improve existing features or requests for new features, proposed by software users on open platforms, and they reflect users' wishes and needs. Efficient and accurate analysis and processing of feature requests play a vital role in improving user satisfaction and product competitiveness. With users' active participation, feature requests have become an important source of software requirements. However, feature requests differ from traditional requirements in source, content, and form, so the methods for applying them to software development must also differ from those for traditional requirements. At present, much research focuses on applying feature requests to software development, e.g., feature request acquisition, classification, prioritization, quality management, developer recommendation, and locating relevant code. As related research continues to emerge, a review of feature request analysis and processing becomes increasingly necessary. This study analyzes 121 academic papers on how to analyze and process feature requests in the software development process and systematically organizes existing results from the perspective of applying feature requests to software development. In addition, the study summarizes the research topics on feature requests, describes how feature requests are applied in software development, and compares this process with the traditional requirements engineering process. Furthermore, it analyzes the research methods used in the different requirements engineering activities and points out their differences. Finally, future research directions for feature requests are discussed to provide guidance for researchers.
    Available online:  January 28, 2022 , DOI: 10.13328/j.cnki.jos.006592
    [Abstract] (1166) [HTML] (0) [PDF 4.38 M] (2008)
    Abstract:
    In recent years, artificial intelligence (AI) has developed rapidly. AI systems have penetrated people's lives and become an indispensable part of them. However, these systems require a large amount of data to train models, and disturbances in the data affect their results. Furthermore, as business scenarios become diversified and system scales grow, the trustworthiness of AI systems has attracted wide attention. Firstly, based on the trustworthiness attributes proposed by different organizations and scholars, this study introduces nine trustworthiness attributes of AI systems. Next, in terms of data, model, and result trustworthiness, the study discusses methods for measuring the data, model, and result trustworthiness of existing AI systems and designs an evidence collection method for AI trustworthiness. It then summarizes trustworthiness measurement theories and methods for AI systems. In addition, combining attribute-based software trustworthiness measurement methods with blockchain technologies, the study establishes a trustworthiness measurement framework for AI systems, which includes methods for trustworthiness attribute decomposition and evidence acquisition, a federated trustworthiness measurement model, and a blockchain-based trustworthiness measurement architecture for AI systems. Finally, it describes the opportunities and challenges of trustworthiness measurement technologies for AI systems.
    Available online:  October 20, 2021 , DOI: 10.13328/j.cnki.jos.006485
    [Abstract] (3491) [HTML] (0) [PDF 4.80 M] (3797)
    Abstract:
    Reinforcement learning is a technique that discovers optimal behavior strategies in a trial-and-error way, and it has become a general method for solving environmental interaction problems. However, as a machine learning method, reinforcement learning shares a common problem of machine learning: it is unexplainable. This limits its application in safety-sensitive fields such as medical treatment and transportation, and it leads to a lack of universally applicable solutions for environment simulation and task generalization. To address the problem, extensive research on explainable reinforcement learning (XRL) has emerged, yet the academic community still lacks a consistent understanding of XRL. Therefore, this study explores the basic problems of XRL and reviews existing work. To begin with, the study discusses the parent problem, explainable artificial intelligence, and summarizes its existing definitions. Next, it constructs a theoretical framework of interpretability to describe the problems common to XRL and explainable artificial intelligence: it distinguishes between intelligent algorithms and mechanical algorithms, defines interpretability, discusses the factors that affect interpretability, and classifies the intuitiveness of interpretability. Then, based on the characteristics of reinforcement learning, the study defines three problems unique to XRL, i.e., environment interpretation, task interpretation, and strategy interpretation. After that, the latest research on XRL is reviewed, and existing methods are systematically classified. Finally, future research directions for XRL are put forward.
    Available online:  October 20, 2021 , DOI: 10.13328/j.cnki.jos.006437
    [Abstract] (544) [HTML] (0) [PDF 5.58 M] (1126)
    Abstract:
    As a distributed storage solution with high performance and high scalability, key-value storage systems such as Redis, MongoDB, and Cassandra have been widely adopted in recent years. The multi-replica mechanism widely used in distributed storage systems improves throughput and reliability, but it also introduces extra overhead for coordination and replica consistency. For cross-region distributed systems, the long-distance coordination overhead may even become the performance bottleneck, reducing availability and throughput. Elsa, the distributed key-value storage system proposed in this study, is a coordination-free multi-master key-value store designed for cross-region architectures. While ensuring high performance and high scalability, Elsa adopts conflict-free replicated data types (CRDTs) to guarantee strong eventual consistency between replicas without coordination, reducing the coordination overhead between nodes. In this study, a cross-region distributed environment spanning 4 data centers and 8 nodes is set up on the Aliyun platform, and a large-scale distributed performance comparison experiment is carried out. The experimental results show that, in this cross-region environment, Elsa has obvious throughput advantages under highly concurrent contention workloads, reaching up to 7.37 times the throughput of a MongoDB cluster and 1.62 times that of a Cassandra cluster.
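    For readers unfamiliar with CRDTs, the sketch below shows a state-based grow-only counter, the textbook example of replicas converging by merging states without coordination; it is illustrative only and unrelated to the data types Elsa actually implements.

        class GCounter:
            """State-based grow-only counter CRDT: one slot per replica."""
            def __init__(self, replica_id):
                self.replica_id = replica_id
                self.counts = {}          # replica_id -> local increments

            def increment(self, n=1):
                self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

            def value(self):
                return sum(self.counts.values())

            def merge(self, other):
                """Commutative, associative, idempotent join: element-wise max."""
                for rid, c in other.counts.items():
                    self.counts[rid] = max(self.counts.get(rid, 0), c)

        a, b = GCounter("dc-beijing"), GCounter("dc-london")
        a.increment(3); b.increment(2)
        a.merge(b); b.merge(a)
        assert a.value() == b.value() == 5   # replicas converge without coordination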
    Available online:  October 20, 2021 , DOI: 10.13328/j.cnki.jos.006351
    Abstract:
    How to detect sudden events in social media data streams is a popular research topic in natural language processing. However, current methods for detecting emergencies suffer from low accuracy and low efficiency. To solve these problems, this paper proposes an emergency detection method based on word correlation characteristics, which can quickly detect emergency events from social network data streams so that decision makers can take timely and effective measures and the negative impact of emergencies can be reduced as much as possible. First, through noise filtering and sentiment filtering, microblog texts with negative sentiment are obtained. Then, based on the time information, the Weibo data are sliced into time windows, and for each word in each window the word frequency, user influence, and word frequency growth rate are computed; a burst computation method is used to extract burst words. Similar words are merged according to a word2vec model, and the feature similarity of burst words is used to build a burst word relationship graph. Finally, a multi-attribute spectral clustering algorithm is used to optimally partition the word relationship graph; as the time window slides, abnormal words are tracked, and sudden events are identified through the structural changes caused by sudden word changes in the subgraphs. The experimental results show that the proposed method achieves good detection performance on real-time blog post data streams. Compared with existing methods, it meets the needs of emergency detection: it can detect detailed information about sub-events and accurately detect information relevant to the events.
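    A minimal sketch of a per-window burst score that combines word frequency growth with user influence; the weights, the threshold, and the example words are assumptions made for illustration rather than the paper's exact formula.

        def burst_score(freq_now, freq_prev, user_influence, alpha=0.6, beta=0.4):
            """Combine the frequency growth rate with the influence of posting users."""
            growth = (freq_now - freq_prev) / (freq_prev + 1.0)   # smoothed growth
            return alpha * growth + beta * user_influence

        def burst_words(window_stats, threshold=1.0):
            """window_stats: {word: (freq_now, freq_prev, user_influence)}"""
            return {w for w, (f1, f0, infl) in window_stats.items()
                    if burst_score(f1, f0, infl) > threshold}

        print(burst_words({"地震": (120, 5, 0.9), "天气": (40, 38, 0.2)}))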
    Available online:  August 02, 2021 , DOI: 10.13328/j.cnki.jos.006410
    [Abstract] (778) [HTML] (0) [PDF 5.15 M] (1469)
    Abstract:
    Cyber-physical systems (CPSs) play an increasingly important role in social life. The on-demand orchestration of CPS resources relies on the software definition of those resources, and defining software interfaces depends on a full description of the capabilities of CPS resources. At present, the CPS field lacks a knowledge base that can describe resources and their capabilities, as well as an effective way to construct such a knowledge base. Based on textual descriptions of CPS resources, this study proposes to construct a CPS resource capability knowledge graph and designs a bottom-up automatic construction method. Given CPS resources, the method first extracts textual descriptions of the resources' capabilities from code and texts and generates a normalized expression of capability phrases based on a predefined representation pattern. Then, capability phrases are divided, aggregated, and abstracted based on the key components of the verb-object structure to generate a hierarchical abstract description of capabilities for different categories of resources. Finally, the CPS knowledge graph is constructed. Based on the Home Assistant platform, this study constructs a knowledge graph containing 32 resource categories and 957 resource capabilities. In the construction experiment, the results of manual construction and of automatic construction using the proposed method are compared and analyzed along different dimensions. Experimental results show that the proposed method is a feasible way to automatically construct the CPS resource capability knowledge graph; it helps reduce the workload of manual construction, supplements the description of resource services and capabilities in the CPS field, and improves knowledge completeness.
    Available online:  August 02, 2021 , DOI: 10.13328/j.cnki.jos.006407
    [Abstract] (601) [HTML] (0) [PDF 5.28 M] (1027)
    Abstract:
    Separation logic is an extension of classical Hoare logic for reasoning about pointers and dynamic data structures, and it has been extensively used in the formal analysis and verification of fundamental software, including operating system kernels. Automated constraint solving is one of the key means of automating separation logic-based verification of such programs. The verification of programs manipulating dynamic data structures usually involves both shape properties, e.g., singly or doubly linked lists and trees, and data constraints, e.g., sortedness and the invariance of data sets/multisets. This paper introduces COMPSPEN, a separation logic solver capable of simultaneously reasoning about the shape properties and data constraints of linear dynamic data structures. First, the theoretical foundations of COMPSPEN are introduced, including the definition of the separation logic fragment SLIDdata and the decision procedures for its satisfiability and entailment problems. Then, the implementation and architecture of the COMPSPEN tool are presented. Finally, experimental results for COMPSPEN are reported: 600 test cases are collected, and the performance of COMPSPEN is compared with state-of-the-art separation logic solvers, including ASTERIX, S2S, Songbird, and SPEN. The experimental results show that COMPSPEN is the only tool capable of solving separation logic formulas involving set data constraints; overall, it efficiently solves the satisfiability problem for separation logic formulas that combine shape properties with linear arithmetic data constraints on linear dynamic data structures, and it is also capable of solving the entailment problem.
    Available online:  October 18, 2017 , DOI:
    [Abstract] (2714) [HTML] (0) [PDF 525.21 K] (4326)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 315-325. Original article: https://doi.org/10.1145/3106237.3106242. Readers who wish to cite this work should cite the original publication.
    Available online:  October 18, 2017 , DOI:
    [Abstract] (2691) [HTML] (0) [PDF 352.38 K] (5516)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 303-314. Original article: https://doi.org/10.1145/3106237.3106239. Readers who wish to cite this work should cite the original publication.
    Available online:  September 11, 2017 , DOI:
    [Abstract] (3123) [HTML] (0) [PDF 276.42 K] (2495)
    Abstract:
    GitHub, a popular social-software-development platform, has fostered a variety of software ecosystems where projects depend on one another and practitioners interact with each other. Projects within an ecosystem often have complex inter-dependencies that impose new challenges in bug reporting and fixing. In this paper, we conduct an empirical study on cross-project correlated bugs, i.e., causally related bugs reported to different projects, focusing on two aspects: 1) how developers track the root causes across projects; and 2) how the downstream developers coordinate to deal with upstream bugs. Through manual inspection of bug reports collected from the scientific Python ecosystem and an online survey with developers, this study reveals the common practices of developers and the various factors in fixing cross-project bugs. These findings provide implications for future software bug analysis in the scope of ecosystem, as well as shed light on the requirements of issue trackers for such bugs.
    Available online:  June 21, 2017 , DOI:
    [Abstract] (3194) [HTML] (0) [PDF 169.43 K] (2624)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper has been accepted for publication in IEEE Transactions on Software Engineering (2017, to appear). Original article: http://ieeexplore.ieee.org/document/7792694. Readers who wish to cite this work should cite the original publication.
    Available online:  June 13, 2017 , DOI:
    [Abstract] (4396) [HTML] (0) [PDF 174.91 K] (3048)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 39th International Conference on Software Engineering, pp. 27-37, Buenos Aires, Argentina, May 20-28, 2017, IEEE Press, Piscataway, NJ, USA, 2017, ISBN: 978-1-5386-3868-2. Original article: http://dl.acm.org/citation.cfm?id=3097373. Readers who wish to cite this work should cite the original publication.
    Available online:  January 25, 2017 , DOI:
    [Abstract] (3269) [HTML] (0) [PDF 254.98 K] (2409)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), ACM, New York, NY, USA, pp. 871-882, DOI: https://doi.org/10.1145/2950290.2950364. Original article: http://dl.acm.org/citation.cfm?id=2950364. Readers who wish to cite this work should cite the original publication.
    Available online:  January 18, 2017 , DOI:
    [Abstract] (3703) [HTML] (0) [PDF 472.29 K] (2381)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 133-143, Seattle, WA, USA, November 2016. Original article: http://dl.acm.org/citation.cfm?id=2950327. Readers who wish to cite this work should cite the original publication.
    Available online:  January 04, 2017 , DOI:
    [Abstract] (3509) [HTML] (0) [PDF 293.93 K] (2204)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'16), pp. 810-821, November 13-18, 2016. Original article: https://doi.org/10.1145/2950290.2950310. Readers who wish to cite this work should cite the original publication.
    Available online:  January 04, 2017 , DOI:
    [Abstract] (3834) [HTML] (0) [PDF 244.61 K] (2499)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published at FSE 2016. Original article: http://dl.acm.org/citation.cfm?doid=2950290.2950313. Readers who wish to cite this work should cite the original publication.
    Available online:  December 12, 2016 , DOI:
    [Abstract] (3351) [HTML] (0) [PDF 358.69 K] (2517)
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published at FSE'16, in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Original article: http://dl.acm.org/citation.cfm?id=2950340. Readers who wish to cite this work should cite the original publication.
    Available online:  September 30, 2016 , DOI:
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. The paper was published at ASE 2016 (http://ase2016.org/). Original article: http://dl.acm.org/citation.cfm?id=2970366. Readers who wish to cite this work should cite the original publication.
    Available online:  September 09, 2016 , DOI:
    Abstract:
    Recommended by Professor Bai Ying of the CCF Technical Committee on Software Engineering. Junjie's paper was published at ASE 2016 (http://ase2016.org/). Original article: http://dl.acm.org/citation.cfm?doid=2970276.2970300. Readers who wish to cite this work should cite the original publication.
    Available online:  September 07, 2016 , DOI:
    Abstract:
    Recommended by Professor Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering. The paper was published in ASE 2016: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Full text: http://dx.doi.org/10.1145/2970276.2970307. Important note: readers who wish to cite this work should cite the original publication.
    Available online:  August 29, 2016 , DOI:
    Abstract:
    Recommended by Professor Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering. The paper was published in ACM Transactions on Software Engineering and Methodology (TOSEM, Vol. 25, No. 2, Article 13, May 2016) and was invited to the ICSE 2016 main conference as a "Journal First" presentation. Full text: http://dl.acm.org/citation.cfm?id=2876443. The authors include Zhou Minghui, Ma Xiujuan, Zhang Lu, and Mei Hong from Peking University, and Audris Mockus from the University of Tennessee. Important note: readers who wish to cite this work should cite the original publication.
  • Most Downloaded (Overall / Annual / By Issue)
    Click Rank (Overall / Annual / By Issue)

    2003,14(7):1282-1291, DOI:
    [Abstract] (36385) [HTML] (0) [PDF 832.28 K] (77042)
    Abstract:
    A sensor network, which is formed by the convergence of sensor, micro-electro-mechanical system, and network technologies, is a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecasted. Combined with existing work, research hot spots including power-aware routing and media access control schemes are discussed and presented in detail. Finally, taking account of application requirements, several future research directions are put forward.
    2010,21(3):427-437, DOI:
    [Abstract] (32038) [HTML] (0) [PDF 308.76 K] (36402)
    Abstract:
    Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports pioneering research on a genetic algorithm and its application to the automatic generation of SONGCI. In light of the characteristics of Chinese ancient poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette wheel, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system constructed on the basis of the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
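    The elitism-plus-roulette selection mentioned above can be sketched as follows (a generic genetic-algorithm illustration under our own assumptions, not the paper's operators):

        import random

        def select_next_generation(population, fitness, elite_k=2):
            """Keep the elite_k fittest individuals, fill the rest by roulette wheel."""
            ranked = sorted(population, key=fitness, reverse=True)
            next_gen = ranked[:elite_k]                       # elitism
            total = sum(fitness(ind) for ind in population)
            while len(next_gen) < len(population):            # roulette wheel
                pick, acc = random.uniform(0, total), 0.0
                for ind in population:
                    acc += fitness(ind)
                    if acc >= pick:
                        next_gen.append(ind)
                        break
            return next_gen

        # Toy usage: candidate verses scored by a stand-in fitness (their length).
        pop = ["candidate-a", "candidate-bb", "candidate-ccc", "candidate-d"]
        print(select_next_generation(pop, fitness=len, elite_k=2))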
    2011,22(1):71-83, DOI:10.3724/SP.J.1001.2011.03958
    [Abstract] (29002) [HTML] (0) [PDF 781.42 K] (51539)
    Abstract:
    Cloud computing is the fundamental change happening in the field of information technology and a representation of the movement towards intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency, but also great challenges in data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the main requirements in cloud computing, key security technologies, standards, regulations, etc., and provides a cloud computing security framework. The paper argues that changes in the above aspects will result in a technical revolution in the field of information security.
    2016,27(1):45-71, DOI:10.13328/j.cnki.jos.004914
    [Abstract] (28318) [HTML] (1239) [PDF 880.96 K] (28128)
    Abstract:
    Android is a modern and highly popular software platform for smartphones. According to reports, Android accounted for a huge 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever; Apple, Microsoft, Blackberry, and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
    2009,20(5):1337-1348, DOI:
    [Abstract] (27334) [HTML] (0) [PDF 1.06 M] (42693)
    Abstract:
    This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is the cloud infrastructure, which is the building block for the upper-layer cloud applications; the other is, of course, the cloud application itself. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters which contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure so that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software built on top of redundant hardware instead of by hardware alone. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes. Availability means that the services are available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
    2008,19(1):48-61, DOI:
    [Abstract] (27156) [HTML] (0) [PDF 671.39 K] (58959)
    Abstract:
    The state of research and recent progress in clustering algorithms are summarized in this paper. First, representative clustering algorithms are analyzed and summarized from several aspects, such as the ideas of the algorithms, key techniques, and advantages and disadvantages. Then, several typical clustering algorithms and well-known datasets are selected, simulation experiments are carried out in terms of both accuracy and running efficiency, and the behavior of each algorithm on different datasets is analyzed by comparing the clustering results of the same dataset under different algorithms. Finally, research hotspots, difficulties, and shortcomings of data clustering as well as some open problems are addressed by integrating the information from the two aspects above. This work can provide a valuable reference for data clustering and data mining.
    2009,20(2):271-289, DOI:
    [Abstract] (26282) [HTML] (0) [PDF 675.56 K] (40755)
    Abstract:
    Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in evolutionary computation community. After summarizing the EMO algorithms before 2003 briefly, the recent advances in EMO are discussed in details. The current research directions are concluded. On the one hand, more new evolutionary paradigms have been introduced into EMO community, such as particle swarm optimization, artificial immune systems, and estimation distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto-dominance come forth. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
    2005,16(1):1-7, DOI:
    [Abstract] (21459) [HTML] (0) [PDF 614.61 K] (18854)
    Abstract:
    The paper offers some reflections on the following four aspects: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the natural characteristics of software, analyzing the construction of each abstraction layer of the virtual machine; 3) from the perspective of software development, proposing the research content of the software engineering discipline and studying the pattern of industrialized software production; 4) based on the emergence of Internet technology, exploring the development trend of software technology.
    2004,15(3):428-442, DOI:
    [Abstract] (20096) [HTML] (0) [PDF 1009.57 K] (15021)
    Abstract:
    With the rapid development of e-business, Web applications have evolved from localized to global, from B2C (business-to-customer) to B2B (business-to-business), and from centralized to decentralized. Web services are a new application model for decentralized computing and an effective mechanism for data and service integration on the Web; thus, Web services have become a solution for e-business. It is important and necessary to carry out research on new architectures for Web services, on combinations with other mature techniques, and on the integration of services. This paper presents a survey on various aspects of Web services research, from the basic concepts to the principal research problems and the underlying techniques, including data integration in Web services, Web service composition, semantic Web services, Web service discovery, Web service security, solutions for Web services in the P2P (peer-to-peer) computing environment, and grid services. The paper also summarizes the state of the art of these techniques and discusses future research topics and the challenges facing Web services.
    2010,21(8):1834-1848, DOI:
    [Abstract] (19516) [HTML] (0) [PDF 682.96 K] (52353)
    Abstract:
    This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation methods and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field, with detailed comparisons and analyses.
    2005,16(5):857-868, DOI:
    [Abstract] (19432) [HTML] (0) [PDF 489.65 K] (28046)
    Abstract:
    Wireless Sensor Networks, a novel technology about acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the networks, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criterion of the performance and the taxonomy for wireless sensor networks self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
    2009,20(1):54-66, DOI:
    [Abstract] (18859) [HTML] (0) [PDF 1.41 M] (47677)
    Abstract:
    Network community structure is one of the most fundamental and important topological properties of complex networks, within which the links between nodes are very dense, but between which they are quite sparse. Network clustering algorithms, which aim to discover all natural network communities from given complex networks, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, the motivation, the state of the art, and the main issues of existing work related to discovering network communities, and tries to draw a comprehensive and clear outline for this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
    2012,23(4):962-986, DOI:10.3724/SP.J.1001.2012.04175
    [Abstract] (18105) [HTML] (0) [PDF 2.09 M] (28982)
    Abstract:
    Considered as the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to failures of computers or data. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure cost and power cost. Therefore, fault tolerance, scalability, and power consumption of the distributed storage of a data center become key parts of cloud computing technology, in order to ensure data availability and reliability. This paper surveys the state of the art of the key technologies in cloud computing in the following aspects: design of data center networks, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. Firstly, several classical topologies of data center networks are introduced and compared. Secondly, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are compared in particular. Thirdly, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
    2012,23(1):32-45, DOI:10.3724/SP.J.1001.2012.04091
    [Abstract] (18079) [HTML] (0) [PDF 408.86 K] (28591)
    Abstract:
    In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be expanded cost-effectively should be invented to deal with big data. Relational data management techniques have gone through a history of nearly 40 years and now encounter the tough obstacle of scalability: relational techniques cannot handle large data easily. In the meantime, non-relational techniques, with MapReduce as a typical representative, emerge as a new force and expand their applications from Web search to territories that used to be occupied by relational database systems. They confront relational techniques with high availability, high scalability, and massive parallel processing capability. The relational community, after losing the big deal of Web search, has begun to learn from MapReduce, while MapReduce also borrows valuable ideas from relational techniques to improve performance. Relational techniques and MapReduce compete with and learn from each other; a new data analysis platform and a new data analysis ecosystem are emerging. Eventually the two camps of techniques will find their right places in the new ecosystem of big data analysis.
    2009,20(3):524-545, DOI:
    [Abstract] (16994) [HTML] (0) [PDF 1.09 M] (20314)
    Abstract:
    Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
    2009,20(1):124-137, DOI:
    [Abstract] (16276) [HTML] (0) [PDF 1.06 M] (20343)
    Abstract:
    The appearance of plenty of intelligent devices equipped for short-range wireless communications boosts the fast rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of mobile ad hoc networks (MANETs) requires at least one path from the source to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages in a hop-by-hop way and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has captured great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. Then it elaborates on popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points such as communication middleware, cooperation and security problems, and new applications are stated briefly. Finally, the paper concludes and looks forward to possible research focuses for opportunistic networks in the future.
    2009,20(11):2965-2976, DOI:
    [Abstract] (16042) [HTML] (0) [PDF 442.42 K] (13644)
    Abstract:
    This paper studies uncertain graph data mining and especially investigates the problem of mining frequent subgraph patterns from uncertain graph data. A data model is introduced for representing uncertainties in graphs, and an expected support is employed to evaluate the significance of subgraph patterns. By using the apriori property of expected support, a depth-first search-based mining algorithm is proposed with an efficient method for computing expected supports and a technique for pruning the search space, which reduces the number of subgraph isomorphism tests needed for computing expected support from the exponential scale to the linear scale. Experimental results show that the proposed algorithm is 3 to 5 orders of magnitude faster than a naïve depth-first search algorithm, and is efficient and scalable.
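    For readers unfamiliar with the notion, the expected support can be stated with the standard possible-worlds definition from the uncertain graph mining literature (the notation below is ours and not necessarily the paper's):

        \operatorname{esup}(s) = \sum_{i=1}^{n} \Pr\big(s \sqsubseteq G_i\big), \qquad
        \Pr\big(s \sqsubseteq G_i\big) = \sum_{W \in \mathcal{W}(G_i)} \Pr(W)\,\mathbb{1}\big[\, s \text{ is subgraph-isomorphic to } W \,\big],

    where \mathcal{W}(G_i) denotes the set of possible worlds (implicated graphs) of the uncertain graph G_i. The apriori property exploited above is that expected support is anti-monotone: a pattern never has larger expected support than any of its subpatterns.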
    2004,15(8):1208-1219, DOI:
    [Abstract] (16012) [HTML] (0) [PDF 948.49 K] (12230)
    Abstract:
    With the explosive growth of network applications and complexity, the threat of Internet worms against network security becomes increasingly serious. Especially in the Internet environment, the variety of propagation paths and the complexity of application environments result in worms with much higher outbreak frequency, much deeper latency, and wider coverage, and Internet worms have become a primary issue faced by malicious code researchers. In this paper, the concept and research status of Internet worms, their function components, and their execution mechanism are first presented; then the scanning strategies and propagation models are discussed; and finally the critical techniques of Internet worm prevention are given. Some major problems and research trends in this area are also addressed.
    2009,20(5):1226-1240, DOI:
    [Abstract] (15831) [HTML] (0) [PDF 926.82 K] (14698)
    Abstract:
    This paper introduces the concrete details of combining automated reasoning techniques with planning methods, including planning as satisfiability using propositional logic, conformant planning using modal logic and disjunctive reasoning, planning as nonmonotonic logic, and flexible planning as fuzzy description logic. After considering the experimental results of the International Planning Competition and relevant papers, it concludes that planning methods based on automated reasoning techniques are helpful and can be adopted. It also discusses the challenges and possible future hotspots.
    2003,14(10):1717-1727, DOI:
    [Abstract] (15691) [HTML] (0) [PDF 839.25 K] (12946)
    Abstract:
    Sensor networks are an integration of sensor techniques, embedded computing techniques, distributed computing techniques, and wireless communication techniques. They can be used for testing, sensing, collecting, and processing information about monitored objects and for transferring the processed information to users. Sensor networks are a new research area of computer science and technology with broad application prospects, and both academia and industry are very interested in them. The concepts and characteristics of sensor networks and of the data in the networks are introduced, and the issues of sensor networks and of sensor network data management are discussed. The advances in research on sensor networks and on sensor network data management are also presented.
    2009,20(2):350-362, DOI:
    [Abstract] (15587) [HTML] (0) [PDF 1.39 M] (37624)
    Abstract:
    This paper makes a comprehensive survey of the recommender system research aiming to facilitate readers to understand this field. First the research background is introduced, including commercial application demands, academic institutes, conferences and journals. After formally and informally describing the recommendation problem, a comparison study is conducted based on categorized algorithms. In addition, the commonly adopted benchmarked datasets and evaluation methods are exhibited and most difficulties and future directions are concluded.
    2015,26(1):62-81, DOI:10.13328/j.cnki.jos.004701
    [Abstract] (15197) [HTML] (1024) [PDF 1.04 M] (23303)
    Abstract:
    Network abstraction brings about the naissance of software-defined networking (SDN). SDN decouples the data plane and the control plane and simplifies network management. The paper starts with a discussion of the background of the naissance and development of SDN and outlines its architecture, which includes the data layer, the control layer, and the application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN, and the characteristics of consistency, availability, and tolerance are especially analyzed. Moreover, the latest achievements for specific application scenarios are introduced. Future work is summarized in the end.
    2014,25(4):839-862, DOI:10.13328/j.cnki.jos.004558
    [Abstract] (14951) [HTML] (1110) [PDF 1.32 M] (17336)
    Abstract:
    Batch computing and stream computing are two important forms of big data computing. The research and discussion on batch computing in big data environments are comparatively sufficient, but how to efficiently deal with stream computing so as to meet requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, remain great challenges in big data computing research. This paper studies the computing architecture and the key issues of stream computing in big data environments. Firstly, it gives a brief summary of three application scenarios of stream computing in business intelligence, marketing, and public service, and shows the distinctive features of stream computing in big data environments, such as real time, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system always optimizes its system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it specifically addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
    2009,20(10):2729-2743, DOI:
    [Abstract] (14136) [HTML] (0) [PDF 1.12 M] (9785)
    Abstract:
    In a multi-hop wireless sensor network (WSN), the sensors closest to the sink tend to deplete their energy faster than other sensors, which is known as an energy hole around the sink. No more data can be delivered to the sink after an energy hole appears, while a considerable amount of energy is wasted and the network lifetime ends prematurely. This paper investigates the energy hole problem, and based on the improved corona model with levels, it concludes that the assignment of transmission ranges of nodes in different coronas is an effective approach for achieving energy-efficient network. It proves that the optimal transmission ranges for all areas is a multi-objective optimization problem (MOP), which is NP hard. The paper proposes an ACO (ant colony optimization)-based distributed algorithm to prolong the network lifetime, which can help nodes in different areas to adaptively find approximate optimal transmission range based on the node distribution. Furthermore, the simulation results indicate that the network lifetime under this solution approximates to that using the optimal list. Compared with existing algorithms, this ACO-based algorithm can not only make the network lifetime be extended more than two times longer, but also have good performance in the non-uniform node distribution.
    2012,23(1):1-20, DOI:10.3724/SP.J.1001.2012.04100
    [Abstract] (13889) [HTML] (0) [PDF 1017.73 K] (28597)
    Abstract:
    Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
    2012,23(5):1148-1166, DOI:10.3724/SP.J.1001.2012.04195
    [Abstract] (13855) [HTML] (0) [PDF 946.37 K] (15794)
    Abstract:
    With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
    2000,11(11):1460-1466, DOI:
    [Abstract] (13739) [HTML] (0) [PDF 520.69 K] (10048)
    Abstract:
    Intrusion detection is a highlighted topic of network security research in recent years. In this paper, first the necessity of intrusion detection is presented, and its concepts and models are described. Then, many intrusion detection techniques and architectures are summarized. Finally, the existing problems and the future direction in this field are discussed.
    2013,24(8):1786-1803, DOI:10.3724/SP.J.1001.2013.04416
    [Abstract] (13417) [HTML] (0) [PDF 1.04 M] (15008)
    Abstract:
    Many application-specific NoSQL database systems have been developed to satisfy the new requirements of big data management. This paper surveys research on typical NoSQL databases based on the key-value data model. First, the characteristics of big data and the key technical issues in supporting big data management are introduced. Then frontier efforts and research challenges are presented, including system architecture, data model, access mode, index, transaction, system elasticity, load balance, replica strategy, data consistency, flash cache, MapReduce-based data processing, and new-generation data management systems. Finally, research prospects are given.
    2004,15(4):571-583, DOI:
    [Abstract] (13375) [HTML] (0) [PDF 1005.17 K] (8726)
    Abstract:
    For most peer-to-peer file-swapping applications, sharing is a voluntary action, and peers are not held responsible for their irresponsible bartering history. This situation indicates that trust between participants cannot be established simply on the traditional trust mechanism. A reasonable approach to constructing trust comes from social network analysis, in which trust relations between individuals are built upon the recommendations of other individuals. Current P2P trust models cannot guarantee the convergence of the iterative trust computation and take no consideration of model security problems, such as sybil attacks and slandering. This paper presents a novel recommendation-based global trust model and gives a distributed implementation method. Mathematical analyses and simulations show that, compared with current global trust models, the proposed model is more robust against trust security problems and more complete in the iterative computation of peer trust.
    2008,19(zk):112-120, DOI:
    [Abstract] (13327) [HTML] (0) [PDF 594.29 K] (13418)
    Abstract:
    An ad hoc network is a collection of wireless mobile nodes dynamically forming a temporary network without the use of any existing network infrastructure or centralized administration. Due to bandwidth constraint and dynamic topology of mobile ad hoc networks, multipath supported routing is a very important research issue. In this paper, we present an entropy-based metric to support stability multipath on-demand routing (SMDR). The key idea of SMDR protocol is to construct the new metric-entropy and select the stability multipath with the help of entropy metric to reduce the number of route reconstruction so as to provide QoS guarantee in the ad hoc network whose topology changes continuously. Simulation results show that, with the proposed multipath routing protocol, packet delivery ratio, end-to-end delay, and routing overhead ratio can be improved in most of cases. It is an available approach to multipath routing decision.
    2006,17(7):1588-1600, DOI:
    [Abstract] (13287) [HTML] (0) [PDF 808.73 K] (13119)
    Abstract:
    Routing technology at the network layer is pivotal in the architecture of wireless sensor networks. As an active branch of routing technology, cluster-based routing protocols excel in network topology management, energy minimization, data aggregation and so on. In this paper, cluster-based routing mechanisms for wireless sensor networks are analyzed. Cluster head selection, cluster formation and data transmission are three key techniques in cluster-based routing protocols. As viewed from the three techniques, recent representative cluster-based routing protocols are presented, and their characteristics and application areas are compared. Finally, the future research issues in this area are pointed out.
    2002,13(7):1228-1237, DOI:
    [Abstract] (13278) [HTML] (0) [PDF 500.04 K] (12619)
    Abstract:
    Software architecture (SA) has recently emerged as one of the primary research areas in software engineering and one of the key technologies for developing large-scale software-intensive systems and software product lines. The history and the major directions of SA are summarized, and the concept of SA is presented based on analyzing and comparing several classical definitions of SA. Based on a summary of activities around SA, two categories of SA research are extracted, and advances in SA research are then introduced from seven aspects. Additionally, some disadvantages of SA research are discussed and their causes are explained. Finally, the paper concludes with some promising directions for research on SA.
    2011,22(1):115-131, DOI:10.3724/SP.J.1001.2011.03950
    [Abstract] (13195) [HTML] (0) [PDF 845.91 K] (26235)
    Abstract:
    The Internet traffic model is the key issue for network performance management, Quality of Service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issue and points out possible future research directions in traffic modeling area.
    2009,20(1):11-29, DOI:
    [Abstract] (13098) [HTML] (0) [PDF 787.30 K] (12881)
    Abstract:
    Constrained optimization problems (COPs) are mathematical programming problems frequently encountered in the disciplines of science and engineering application. Solving COPs has become an important research area of evolutionary computation in recent years. In this paper, the state of the art of constrained optimization evolutionary algorithms (COEAs) is surveyed from two basic aspects of COEAs (i.e., constraint-handling techniques and evolutionary algorithms). In addition, this paper discusses some important issues of COEAs. More specifically, several typical algorithms are analyzed in detail. Based on the analyses, it is concluded that to obtain competitive results, a proper constraint-handling technique needs to be considered in conjunction with an appropriate search algorithm. Finally, the open research issues in this field are also pointed out.
    2015,26(1):26-39, DOI:10.13328/j.cnki.jos.004631
    [Abstract] (13024) [HTML] (1043) [PDF 763.52 K] (12781)
    Abstract:
    In recent years, transfer learning has attracted a vast amount of attention and research. Transfer learning is a machine learning method that applies knowledge from related but different domains to target domains. It relaxes two basic assumptions of traditional machine learning: (1) that the training data (the source domain) and the test data (the target domain) are independent and identically distributed (i.i.d.); and (2) that there are enough labeled samples to learn a good classification model. It thereby aims to solve problems where there are few or even no labeled data in the target domain. This paper surveys the research progress of transfer learning and introduces the authors' own work, especially on building transfer learning models by applying generative models at the concept level. Finally, the paper introduces applications of transfer learning, such as text classification and collaborative filtering, and suggests future research directions.
    2013,24(1):50-66, DOI:10.3724/SP.J.1001.2013.04276
    [Abstract] (12991) [HTML] (0) [PDF 0.00 Byte] (15281)
    Abstract:
    As an important acceleration technique in the cloud, distributed caching has received considerable attention in industry and academia. This paper starts with a discussion of the combination of cloud computing and distributed caching technology, analyzing its characteristics, typical application scenarios, stages of development, standards, and several key elements that have promoted its development. In order to systematically assess the state-of-the-art progress and weak points of distributed caching technology, the paper builds a multi-dimensional framework, DctAF, constituted of 6 dimensions derived from analyzing the characteristics of cloud computing and the boundary of caching techniques. Based on DctAF, current techniques are analyzed and summarized, and several influential products are compared. Finally, the paper describes and highlights several challenges that cache systems face and examines current research through in-depth analysis and comparison.
    2008,19(8):1902-1919, DOI:
    [Abstract] (12710) [HTML] (0) [PDF 521.73 K] (12452)
    Abstract:
    Visual language techniques have exhibited more advantages than one-dimensional textual languages in describing various software artifacts during software development, ranging from requirement analysis and design to testing and maintenance, as diagrammatic and graphical notations have been widely applied in modeling systems. In addition to an intuitive appearance, graph grammars provide a well-established foundation for defining visual languages with the power of precise modeling and verification on computers. This paper discusses the issues and techniques for a formal foundation of visual languages, reviews related practical graphical environments, presents a spatial graph grammar formalism, and applies the spatial graph grammar to defining the behavioral semantics of UML diagrams and to developing a style-driven framework for software architecture design.
    2003,14(9):1621-1628, DOI:
    [Abstract] (12699) [HTML] (0) [PDF 680.35 K] (17826)
    Abstract:
    Recommendation systems are one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures perform poorly in this situation, causing the quality of recommendations to degrade dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method first predicts the ratings of items that users have not rated using item similarity, and then uses a new similarity measure to find the target user's neighbors. The experimental results show that this method can efficiently alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
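    As a rough illustration of the item-based rating prediction idea described above, the following Python sketch fills unrated entries with a similarity-weighted average of the user's other ratings. It is a minimal sketch assuming cosine similarity between item rating vectors; the paper's actual similarity measure and prediction formula are not reproduced here.

        import numpy as np

        def item_similarity(R):
            """Cosine similarity between the item columns of a user-item rating
            matrix R (0 denotes a missing rating); a stand-in for the paper's measure."""
            norms = np.linalg.norm(R, axis=0)
            norms[norms == 0] = 1.0
            return (R.T @ R) / np.outer(norms, norms)

        def predict_missing(R):
            """Fill each unrated entry with a similarity-weighted average of the
            user's ratings on other items (item-based collaborative filtering)."""
            S = item_similarity(R)
            P = R.astype(float)
            for u in range(R.shape[0]):
                rated = np.nonzero(R[u])[0]
                for i in np.where(R[u] == 0)[0]:
                    w = S[i, rated]
                    if w.sum() > 0:
                        P[u, i] = w @ R[u, rated] / w.sum()
            return P

        # Toy example: 3 users x 4 items, 0 = unrated.
        R = np.array([[5, 3, 0, 1],
                      [4, 0, 0, 1],
                      [1, 1, 5, 4]])
        print(predict_missing(R))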
    2008,19(8):1947-1964, DOI:
    [Abstract] (12611) [HTML] (0) [PDF 811.11 K] (8798)
    Abstract:
    Widespread deployment of interactive information visualization is difficult. Non-specialist users need a general development method and a toolkit that supports the generic data structures suited to tree, network, and multi-dimensional data, special visualization and interaction techniques, and well-known generic information tasks. This paper presents a model-driven development method for interactive information visualization. First, an interactive information visualization interface model (IIVM) is proposed. Then, the development method for interactive information visualization based on IIVM is presented. The Daisy toolkit is introduced, which includes the Daisy model builder, the Daisy IIV generator, and a runtime framework with the Daisy library. Finally, an application example is given. Experimental results show that Daisy can provide a general solution for the development of interactive information visualization.
    2002,13(10):1952-1961, DOI:
    [Abstract] (12563) [HTML] (0) [PDF 570.96 K] (10491)
    Abstract:
    The crucial technologies related to personalization are introduced in this paper, which include the representation and modification of user profile, the representation of resource, the recommendation technology, and the architecture of personalization. By comparing with some existing prototype systems, the key technologies about how to implement personalization are discussed in detail. In addition, three representative personalization systems are analyzed. At last, some research directions for personalization are presented.
    2010,21(2):231-247, DOI:
    [Abstract] (12498) [HTML] (0) [PDF 1.21 M] (15029)
    Abstract:
    In this paper, a framework is proposed for handling faults in service composition by analyzing fault requirements. Petri nets are used in the framework for fault detection and handling, focusing on the failure of available services, component failures, and network failures. The corresponding fault models are given. Based on these models, a correctness criterion of fault handling is given to analyze the fault handling model, and its correctness is proven. Finally, CTL (computational tree logic) is used to specify the related properties and the enforcement algorithm of fault analysis. The simulation results show that this method can ensure the reliability and consistency of service composition.
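    As a generic illustration of the kind of property CTL can express in this setting (a hypothetical example, not a formula taken from the paper), the requirement that every detected fault is eventually handled on all execution paths can be written as AG(faultDetected -> AF faultHandled): on every path, whenever faultDetected holds, faultHandled eventually holds on all continuations.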
    2003,14(9):1635-1644, DOI:
    [Abstract] (12463) [HTML] (0) [PDF 622.06 K] (10704)
    Abstract:
    Computer forensics is the technology field that attempts to provide thorough, efficient, and secure means to investigate computer crime. Computer evidence must be authentic, accurate, complete, and convincing to juries. In this paper, the stages of computer forensics are presented, and the theories and realization of forensics software are described. An example of forensic practice is also given. The deficiencies of computer forensics techniques and anti-forensics are also discussed. The conclusion is that, with the improvement of computer science and technology, forensics techniques will become more integrated and thorough.
    2012,23(1):82-96, DOI:10.3724/SP.J.1001.2012.04101
    [Abstract] (12332) [HTML] (0) [PDF 394.07 K] (12958)
    Abstract:
    Botnets are one of the most serious threats to the Internet. Researchers have done plenty of research and made significant progress. However, botnets keep evolving and have become more and more sophisticated. Due to the underlying security limitations of current systems and the Internet architecture, and the complexity of botnets themselves, how to effectively counter the global threat of botnets is still a very challenging issue. This paper first introduces the evolution of botnets' propagation, attack, command, and control mechanisms. Then it summarizes recent advances in botnet defense research and categorizes them into five areas: botnet monitoring, botnet infiltration, analysis of botnet characteristics, botnet detection, and botnet disruption. The limitations of current botnet defense techniques, the evolving trend of botnets, and some possible directions for future research are also discussed.
    2010,21(7):1620-1634, DOI:
    [Abstract] (12190) [HTML] (0) [PDF 765.23 K] (18466)
    Abstract:
    As an application of mobile ad hoc networks (MANET) on Intelligent Transportation Information System, the most important goal of vehicular ad hoc networks (VANET) is to reduce the high number of accidents and fatal consequences dramatically. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper introduces the characteristics and application fields of VANET briefly. Then, it discusses the characteristics, performance, and application areas with analysis and comparison of various categories of broadcast protocols in VANET. According to the characteristic of VANET and its application requirement, the paper proposes the ideas and breakthrough direction of information broadcast model design of inter-vehicle communication.
    2017,28(1):1-16, DOI:10.13328/j.cnki.jos.005139
    [Abstract] (12123) [HTML] (1096) [PDF 1.75 M] (7260)
    Abstract:
    The knapsack problem (KP) is a well-known combinatorial optimization problem, with variants including the 0-1 KP, bounded KP, multi-constraint KP, multiple KP, multiple-choice KP, quadratic KP, dynamic KP, discounted KP, and others. KP can be considered a mathematical model extracted from a variety of real-world fields and therefore has wide applications. Evolutionary algorithms (EAs) are universally considered an efficient tool for solving KP approximately and quickly. This paper presents a survey on solving KP with EAs over the past ten years. It not only discusses various KP encoding mechanisms and the handling of infeasible individuals, but also provides useful guidelines for designing new EAs to solve KPs.
    2008,19(7):1565-1580, DOI:
    [Abstract] (12043) [HTML] (0) [PDF 815.02 K] (14577)
    Abstract:
    Software defect prediction has been an active part of software engineering since it emerged in the 1970s. It plays a very important role in the analysis of software quality and the balancing of software cost. This paper investigates and discusses the motivation, evolution, solutions, and challenges of software defect prediction technologies, and categorizes, analyzes, and compares representative prediction technologies. Some case studies of software defect distribution models are given to aid understanding.
    2010,21(5):916-929, DOI:
    [Abstract] (11878) [HTML] (0) [PDF 944.50 K] (15959)
    Abstract:
    Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey of these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys the various technologies proposed to cope with these two aspects. Based on the analysis of the current state of research on data deduplication, this paper draws the following conclusions: a) how to mine data characteristic information in data deduplication has not been completely solved, and how to use such information to effectively eliminate duplicate data also needs further study; b) from the perspective of storage system design, it still needs further study how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and to reduce the additional system overheads they cause.
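    The first category above, identical data detection, is often illustrated with chunk fingerprinting. The following Python sketch is a minimal, assumption-laden example using fixed-size chunks and SHA-1 fingerprints; production systems typically rely on content-defined chunking, stronger fingerprint handling, and persistent index structures.

        import hashlib

        def dedup(data: bytes, chunk_size: int = 4096):
            """Split data into fixed-size chunks and store one copy per unique fingerprint."""
            store = {}    # fingerprint -> chunk bytes (the deduplicated store)
            recipe = []   # ordered fingerprints needed to reconstruct the original data
            for off in range(0, len(data), chunk_size):
                chunk = data[off:off + chunk_size]
                fp = hashlib.sha1(chunk).hexdigest()
                store.setdefault(fp, chunk)
                recipe.append(fp)
            return store, recipe

        data = b"abcd" * 10000 + b"unique tail"
        store, recipe = dedup(data)
        print(len(recipe), "chunks referenced,", len(store), "unique chunks stored")
        assert b"".join(store[fp] for fp in recipe) == data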
    2008,19(10):2706-2719, DOI:
    [Abstract] (11869) [HTML] (0) [PDF 778.29 K] (10493)
    Abstract:
    Web search engines have become very important tools for finding information efficiently in the massive Web data. With the explosive growth of Web data, traditional centralized search engines find it increasingly hard to keep pace with people's growing information needs. With the rapid development of peer-to-peer (P2P) technology, the notion of P2P Web search has been proposed and has quickly become a research focus. The goal of this paper is to give a brief summary of current P2P Web search technologies in order to facilitate future research. First, the main challenges for P2P Web search are presented. Then, key techniques for building a feasible and efficient P2P Web search engine are reviewed, including system topology, data placement, query routing, index partitioning, collection selection, relevance ranking, and Web crawling. Finally, three recently proposed novel P2P Web search prototypes are introduced.
    2004,15(12):1751-1763, DOI:
    [Abstract] (11826) [HTML] (0) [PDF 928.33 K] (6942)
    Abstract:
    This paper presents research work on the children's Turing test (CTT). The main difference between our test program and others is its knowledge-based character, which is supported by a massive commonsense knowledge base. The motivation, design, techniques, experimental results, and platform (including a knowledge engine and a conversation engine) of the CTT are described in this paper. Finally, some concluding thoughts about the CTT and AI are given.
    1999,10(11):1206-1211, DOI:
    [Abstract] (11750) [HTML] (0) [PDF 392.66 K] (5735)
    Abstract:
    In this paper, the authors discuss two important issues in rough set research: attribute reduction and value reduction. A new attribute reduction approach, which can achieve optimal attribute reduction, is presented based on the discernibility matrix and logic computation, and a multivariate decision tree can be obtained with this method. Some improvements to a widely used value reduction method are also achieved. In this way, the complexity of the acquired rule knowledge can be reduced effectively.
  • Full-Text Download Ranking (Overall | Annual | Per Issue)
    Abstract Click Ranking (Overall | Annual | Per Issue)

    2003,14(7):1282-1291, DOI:
    [Abstract] (36385) [HTML] (0) [PDF 832.28 K] (77042)
    Abstract:
    Sensor networks, formed by the convergence of sensor, micro-electro-mechanical systems (MEMS), and network technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. Combining with existing work, hot research topics including power-aware routing and medium access control schemes are discussed and presented in detail. Finally, taking account of application requirements, several future research directions are put forward.
    2008,19(1):48-61, DOI:
    [Abstract] (27156) [HTML] (0) [PDF 671.39 K] (58959)
    Abstract:
    The research status and recent progress of clustering algorithms are summarized in this paper. First, representative clustering algorithms are analyzed and categorized from several aspects, such as algorithm ideas, key technology, and advantages and disadvantages. Then, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are conducted in terms of both accuracy and running efficiency; the behavior of each algorithm on different data sets is analyzed, and the results of different algorithms on the same data set are compared. Finally, by integrating the two aforementioned aspects, the research hotspots, difficulties, and shortcomings of data clustering as well as some open problems are addressed. This work can provide a valuable reference for data clustering and data mining.
    2010,21(8):1834-1848, DOI:
    [Abstract] (19516) [HTML] (0) [PDF 682.96 K] (52353)
    Abstract:
    This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, sentiment retrieval and summarization. Then, the evaluation and corpus for sentiment analysis are introduced. Finally, the applications of sentiment analysis are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis.
    2011,22(1):71-83, DOI:10.3724/SP.J.1001.2011.03958
    [Abstract] (29002) [HTML] (0) [PDF 781.42 K] (51539)
    Abstract:
    Cloud computing is a fundamental change happening in the field of information technology, representing a movement towards intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency but also great challenges in data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of cloud computing. This paper describes the major requirements in cloud computing, security key technologies, standards, regulations, etc., and provides a cloud computing security framework. The paper argues that changes in the above aspects will result in a technical revolution in the field of information security.
    2009,20(1):54-66, DOI:
    [Abstract] (18859) [HTML] (0) [PDF 1.41 M] (47677)
    Abstract:
    Network community structure is one of the most fundamental and important topological properties of complex networks, within which the links between nodes are very dense, but between which they are quite sparse. Network clustering algorithms which aim to discover all natural network communities from given complex networks are fundamentally important for both theoretical researches and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, World Wide Webs and so on. This paper reviews the background, the motivation, the state of arts as well as the main issues of existing works related to discovering network communities, and tries to draw a comprehensive and clear outline for this new and active research area. This work is hopefully beneficial to the researchers from the communities of complex network analysis, data mining, intelligent Web and bioinformatics.
    2009,20(5):1337-1348, DOI:
    [Abstract] (27334) [HTML] (0) [PDF 1.06 M] (42693)
    Abstract:
    This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is about the cloud infrastructure which is the building block for the up layer cloud application. The other is of course the cloud application. This paper focuses on the cloud infrastructure including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large scale clusters which contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software building on top of redundant hardware instead of mere hardware. All these technologies are for the two important goals for distributed system: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to very large scale even to thousands of nodes. Availability means that the services are available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
    2009,20(2):271-289, DOI:
    [Abstract] (26282) [HTML] (0) [PDF 675.56 K] (40755)
    Abstract:
    Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in evolutionary computation community. After summarizing the EMO algorithms before 2003 briefly, the recent advances in EMO are discussed in details. The current research directions are concluded. On the one hand, more new evolutionary paradigms have been introduced into EMO community, such as particle swarm optimization, artificial immune systems, and estimation distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto-dominance come forth. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
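    For reference, the traditional Pareto dominance mentioned above is the standard textbook notion, stated here for minimization as general background rather than as a formula quoted from the paper: a solution x dominates a solution y iff f_i(x) <= f_i(y) for every objective i and f_j(x) < f_j(y) for at least one objective j.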
    2009,20(2):350-362, DOI:
    [Abstract] (15587) [HTML] (0) [PDF 1.39 M] (37624)
    Abstract:
    This paper makes a comprehensive survey of the recommender system research aiming to facilitate readers to understand this field. First the research background is introduced, including commercial application demands, academic institutes, conferences and journals. After formally and informally describing the recommendation problem, a comparison study is conducted based on categorized algorithms. In addition, the commonly adopted benchmarked datasets and evaluation methods are exhibited and most difficulties and future directions are concluded.
    2004,15(10):1493-1504, DOI:
    [Abstract] (8790) [HTML] (0) [PDF 937.72 K] (37526)
    Abstract:
    Graphics processing units (GPUs) have been developing rapidly in recent years at a speed exceeding Moore's law, and as a result, various applications associated with computer graphics have advanced greatly. At the same time, the high processing power, parallelism, and programmability available on contemporary GPUs provide an ideal platform for general-purpose computation. Starting from an introduction to the development history and architecture of GPUs, the technical fundamentals of GPUs are described in this paper. Then, in the main part of the paper, the development of various applications of general-purpose computation on GPUs is introduced, among which fluid dynamics, algebraic computation, database operations, and spectrum analysis are described in detail. The experience of our work on fluid dynamics is also given, and the development of software tools in this area is introduced. Finally, a conclusion is made, and future development and new challenges in both hardware and software in this subject are discussed.
    2010,21(3):427-437, DOI:
    [Abstract] (32038) [HTML] (0) [PDF 308.76 K] (36402)
    Abstract:
    Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its automatic generation of SONGCI. In light of the characteristics of ancient Chinese poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system constructed on the basis of the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
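    Of the operators named above, the partially mapped crossover (PMX) is a standard operator for permutation-encoded chromosomes. The sketch below shows a generic PMX in Python purely to illustrate the operator; the paper's poem-specific encoding, fitness function, and other operators are not reproduced, and the example permutations are arbitrary.

        import random

        def pmx(parent1, parent2, rng=random):
            """Generic partially mapped crossover (PMX) for permutation-encoded chromosomes."""
            n = len(parent1)
            a, b = sorted(rng.sample(range(n), 2))
            child = [None] * n
            child[a:b + 1] = parent1[a:b + 1]          # copy a slice from parent 1
            for i in range(a, b + 1):                  # place the displaced genes of parent 2
                gene = parent2[i]
                if gene in child[a:b + 1]:
                    continue
                pos = i
                while a <= pos <= b:                   # follow the PMX mapping to a free position
                    pos = parent2.index(parent1[pos])
                child[pos] = gene
            for i in range(n):                         # fill the remaining positions from parent 2
                if child[i] is None:
                    child[i] = parent2[i]
            return child

        print(pmx([1, 2, 3, 4, 5, 6, 7], [3, 7, 5, 1, 6, 2, 4], random.Random(0)))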
    2013,24(11):2476-2497, DOI:10.3724/SP.J.1001.2013.04486
    [Abstract] (9634) [HTML] (0) [PDF 1.14 M] (32389)
    Abstract:
    Probabilistic graphical models are powerful tools for compactly representing complex probability distributions, efficiently computing (approximate) marginal and conditional distributions, and conveniently learning parameters and hyperparameters in probabilistic models. As a result, they have been widely used in applications that require some sort of automated probabilistic reasoning, such as computer vision and natural language processing, as a formal approach to deal with uncertainty. This paper surveys the basic concepts and key results of representation, inference and learning in probabilistic graphical models, and demonstrates their uses in two important probabilistic models. It also reviews some recent advances in speeding up classic approximate inference algorithms, followed by a discussion of promising research directions.
    2014,25(9):1889-1908, DOI:10.13328/j.cnki.jos.004674
    [Abstract] (11168) [HTML] (1224) [PDF 550.98 K] (31990)
    Abstract:
    This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
    2012,23(4):962-986, DOI:10.3724/SP.J.1001.2012.04175
    [Abstract] (18105) [HTML] (0) [PDF 2.09 M] (28982)
    Abstract:
    Considered the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to the failure of computers or data. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure cost and power cost. Therefore, fault tolerance, scalability, and power consumption of the distributed storage of a data center become key issues in cloud computing technology, in order to ensure data availability and reliability. In this paper, a survey is made of the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. Firstly, several classical topologies of data center networks are introduced and compared. Secondly, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies are especially compared. Thirdly, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
    2012,23(1):1-20, DOI:10.3724/SP.J.1001.2012.04100
    [Abstract] (13889) [HTML] (0) [PDF 1017.73 K] (28597)
    Abstract:
    Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
    2012,23(1):32-45, DOI:10.3724/SP.J.1001.2012.04091
    [Abstract] (18079) [HTML] (0) [PDF 408.86 K] (28591)
    Abstract:
    In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed grows rapidly. Parallel techniques that can be expanded cost-effectively should be invented to deal with such big data. Relational data management techniques have gone through a history of nearly 40 years. They now encounter the tough obstacle of scalability, as relational techniques cannot handle large data easily. In the meantime, non-relational techniques, with MapReduce as a typical representative, emerge as a new force and expand their applications from Web search to territories that used to be occupied by relational database systems. They confront relational techniques with high availability, high scalability, and massive parallel processing capability. The relational technique community, after losing the big deal of Web search, has begun to learn from MapReduce, while MapReduce also borrows valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other; a new data analysis platform and eco-system are emerging. Finally, the two camps of techniques will find their right places in the new eco-system of big data analysis.
    2016,27(1):45-71, DOI:10.13328/j.cnki.jos.004914
    [Abstract] (28318) [HTML] (1239) [PDF 880.96 K] (28128)
    Abstract:
    Android is a modern and most popular software platform for smartphones. According to report, Android accounted for a huge 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever. Apple, Microsoft, Blackberry and Firefox trailed a long way behind. At the same time, increased popularity of the Android smartphones has attracted hackers, leading to massive increase of Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
    2005,16(5):857-868, DOI:
    [Abstract] (19432) [HTML] (0) [PDF 489.65 K] (28046)
    Abstract:
    Wireless Sensor Networks, a novel technology about acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the networks, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criterion of the performance and the taxonomy for wireless sensor networks self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
    2018,29(5):1471-1514, DOI:10.13328/j.cnki.jos.005519
    [Abstract] (5033) [HTML] (1455) [PDF 4.38 M] (27403)
    Abstract:
    Computer-aided detection/diagnosis (CAD) can improve the accuracy of diagnosis, reduce false positives, and provide decision support for doctors. The main purpose of this paper is to analyze the latest developments in computer-aided diagnosis tools. Focusing on the incidence sites of the four most fatal cancers, major recent publications on CAD applications in different medical imaging areas are reviewed in this survey according to different imaging techniques and diseases. Furthermore, a multi-dimensional analysis of the research is made in terms of image data sets, algorithms, and evaluation methods. Finally, existing problems, research trends, and development directions in the field of medical image CAD systems are discussed.
    2011,22(1):115-131, DOI:10.3724/SP.J.1001.2011.03950
    [Abstract] (13195) [HTML] (0) [PDF 845.91 K] (26235)
    Abstract:
    The Internet traffic model is a key issue for network performance management, quality of service management, and admission control. This paper first summarizes the primary characteristics and metrics of Internet traffic, and illustrates the significance and classification of traffic modeling. Next, it chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the traffic modeling area.
    2013,24(1):77-90, DOI:10.3724/SP.J.1001.2013.04339
    [Abstract] (10857) [HTML] (0) [PDF 0.00 Byte] (24946)
    Abstract:
    Task parallel programming model is a widely used parallel programming model on multi-core platforms. With the intention of simplifying parallel programming and improving the utilization of multiple cores, this paper provides an introduction to the essential programming interfaces and the supporting mechanism used in task parallel programming models and discusses issues and the latest achievements from three perspectives: Parallelism expression, data management and task scheduling. In the end, some future trends in this area are discussed.
    2015,26(1):62-81, DOI:10.13328/j.cnki.jos.004701
    [Abstract] (15197) [HTML] (1024) [PDF 1.04 M] (23303)
    Abstract:
    Network abstraction brings about the birth of software-defined networking (SDN). SDN decouples the data plane and the control plane, and simplifies network management. The paper starts with a discussion of the background of the emergence and development of SDN, and outlines its architecture, which includes the data layer, control layer, and application layer. Then the key technologies are elaborated according to the hierarchical architecture of SDN; the characteristics of consistency, availability, and tolerance are especially analyzed. Moreover, the latest achievements for typical application scenarios are introduced. Future work is summarized in the end.
    2017,28(4):959-992, DOI:10.13328/j.cnki.jos.005143
    [Abstract] (8464) [HTML] (1325) [PDF 3.58 M] (21363)
    Abstract:
    The development of mobile internet and the popularity of mobile terminals produce massive trajectory data of moving objects under the era of big data. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the moving patterns of vehicles in the city and the changes of atmospheric environment. However, trajectory data also can be exploited to disclose moving objects' privacy information (e.g., behaviors, hobbies and social relationships). Accordingly, attackers can easily access moving objects' privacy information by digging into their trajectory data such as activities and check-in locations. In another front of research, quantum computation presents an important theoretical direction to mine big data due to its scalable and powerful storage and computing capacity. Applying quantum computing approaches to handle trajectory big data could make some complex problem solvable and achieve higher efficiency. This paper reviews the key technologies of processing trajectory data. First the concept and characteristics of trajectory data is introduced, and the pre-processing methods, including noise filtering and data compression, are summarized. Then, the trajectory indexing and querying techniques, and the current achievements of mining trajectory data, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preserving with respect to trajectory data is provided. The supporting techniques of trajectory big data mining, such as processing framework and data visualization, are presented in detail. Some possible ways of applying quantum computation into trajectory data processing, as well as the implementation of some core trajectory mining algorithms by quantum computation are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
    2011,22(6):1299-1315, DOI:10.3724/SP.J.1001.2011.03993
    [Abstract] (10179) [HTML] (0) [PDF 987.90 K] (20517)
    Abstract:
    Attribute-Based encryption (ABE) scheme takes attributes as the public key and associates the ciphertext and user’s secret key with attributes, so that it can support expressive access control policies. This dramatically reduces the cost of network bandwidth and sending node’s operation in fine-grained access control of data sharing. Therefore, ABE has a broad prospect of application in the area of fine-grained access control. After analyzing the basic ABE system and its two variants, Key-Policy ABE (KP-ABE) and Ciphertext-Policy ABE (CP-ABE), this study elaborates the research problems relating to ABE systems, including access structure design for CP-ABE, attribute key revocation, key abuse and multi-authorities ABE with an extensive comparison of their functionality and performance. Finally, this study discusses the need-to-be solved problems and main research directions in ABE.
    2009,20(1):124-137, DOI:
    [Abstract] (16276) [HTML] (0) [PDF 1.06 M] (20343)
    Abstract:
    The appearance of plenty of intelligent devices equipped for short-range wireless communications boosts the fast rise of wireless ad hoc networks application. However, in many realistic application environments, nodes form a disconnected network for most of the time due to nodal mobility, low density, lossy link, etc. Conventional communication model of mobile ad hoc network (MANET) requires at least one path existing from source to destination nodes, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages in a hop-by-hop way, and implement communications between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, captures great interests from researchers. This paper first introduces the conceptions and theories of opportunistic networks and some current typical applications. Then it elaborates the popular research problems including opportunistic forwarding mechanism, mobility model and opportunistic data dissemination and retrieval. Some other interesting research points such as communication middleware, cooperation and security problem and new applications are stated briefly. Finally, the paper concludes and looks forward to the possible research focuses for opportunistic networks in the future.
    2009,20(3):524-545, DOI:
    [Abstract] (16994) [HTML] (0) [PDF 1.09 M] (20314)
    Abstract:
    Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
    2021,32(2):349-369, DOI:10.13328/j.cnki.jos.006138
    [Abstract] (6042) [HTML] (1941) [PDF 2.36 M] (19518)
    Abstract:
    Few-shot learning is defined as learning models to solve problems from small samples. In recent years, under the trend of training model with big data, machine learning and deep learning have achieved success in many fields. However, in many application scenarios in the real world, there is not a large amount of data or labeled data for model training, and labeling a large number of unlabeled samples will cost a lot of manpower. Therefore, how to use a small number of samples for learning has become a problem that needs to be paid attention to at present. This paper systematically combs the current approaches of few-shot learning. It introduces each kind of corresponding model from the three categories: fine-tune based, data augmentation based, and transfer learning based. Then, the data augmentation based approaches are subdivided into unlabeled data based, data generation based, and feature augmentation based approaches. The transfer learning based approaches are subdivided into metric learning based, meta-learning based, and graph neural network based methods. In the following, the paper summarizes the few-shot datasets and the results in the experiments of the aforementioned models. Next, the paper summarizes the current situation and challenges in few-shot learning. Finally, the future technological development of few-shot learning is prospected.
    2006,17(9):1848-1859, DOI:
    [Abstract] (11736) [HTML] (0) [PDF 770.40 K] (19346)
    Abstract:
    In recent years, there have been extensive studies and rapid progresses in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-art challenging issues and research trends for content information processing of Internet and other complex applications, this paper presents a survey on the up-to-date development in text categorization based on machine learning, including model, algorithm and evaluation. It is pointed out that problems such as nonlinearity, skewed data distribution, labeling bottleneck, hierarchical categorization, scalability of algorithms and categorization of Web pages are the key problems to the study of text categorization. Possible solutions to these problems are also discussed respectively. Finally, some future directions of research are given.
    2004,15(11):1583-1594, DOI:
    [Abstract] (8075) [HTML] (0) [PDF 1.57 M] (18969)
    Abstract:
    Uncertainty exists widely in the subjective and objective world. Among all kinds of uncertainty, randomness and fuzziness are the most important and fundamental. In this paper, the relationship between randomness and fuzziness is discussed. Uncertain states and their changes can be measured by entropy and hyper-entropy respectively. Taking advantage of entropy and hyper-entropy, the uncertainty of chaos, fractals, and complex networks arising from their various evolution and differentiation is further studied. A simple and effective way is proposed to simulate uncertainty by means of knowledge representation, which provides a basis for the automation of both logical and image thinking with uncertainty. AI (artificial intelligence) with uncertainty is a new cross-discipline, which covers computer science, physics, mathematics, brain science, psychology, cognitive science, biology, and philosophy, and results in the automation of representation, processing, and thinking for uncertain information and knowledge.
    2005,16(1):1-7, DOI:
    [Abstract] (21459) [HTML] (0) [PDF 614.61 K] (18854)
    Abstract:
    The paper offers some reflections from the following four perspectives: 1) from the law of the development of things, revealing the development history of software engineering technology; 2) from the standpoint of the natural characteristics of software, analyzing the construction of every abstraction layer of the virtual machine; 3) from the standpoint of software development, proposing the research content of the software engineering discipline and studying the pattern of industrialized software production; 4) based on the appearance of Internet technology, exploring the development trend of software technology.
    2012,23(8):2058-2072, DOI:10.3724/SP.J.1001.2012.04237
    [Abstract] (9616) [HTML] (0) [PDF 800.05 K] (18669)
    Abstract:
    The Distributed denial of service (DDoS) attack is a major threat to the current network. Based on the attack packet level, the study divides DDoS attacks into network-level DDoS attacks and application-level DDoS attacks. Next, the study analyzes the detection and control methods of these two kinds of DDoS attacks in detail, and it also analyzes the drawbacks of different control methods implemented in different network positions. Finally, the study analyzes the drawbacks of the current detection and control methods, the development trend of the DDoS filter system, and corresponding technological challenges are also proposed.
    2010,21(7):1620-1634, DOI:
    [Abstract] (12190) [HTML] (0) [PDF 765.23 K] (18466)
    Abstract:
    As an application of mobile ad hoc networks (MANET) on Intelligent Transportation Information System, the most important goal of vehicular ad hoc networks (VANET) is to reduce the high number of accidents and fatal consequences dramatically. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper introduces the characteristics and application fields of VANET briefly. Then, it discusses the characteristics, performance, and application areas with analysis and comparison of various categories of broadcast protocols in VANET. According to the characteristic of VANET and its application requirement, the paper proposes the ideas and breakthrough direction of information broadcast model design of inter-vehicle communication.
    2014,25(1):37-50, DOI:10.13328/j.cnki.jos.004497
    [Abstract] (9162) [HTML] (994) [PDF 929.87 K] (18381)
    Abstract:
    This paper surveys the state of the art of speech emotion recognition (SER), and presents an outlook on the trend of future SER technology. First, the survey summarizes and analyzes SER in detail from five perspectives, including emotion representation models, representative emotional speech corpora, emotion-related acoustic features extraction, SER methods and applications. Then, based on the survey, the challenges faced by current SER research are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, and presents detailed comparison and analysis between these methods.
    2003,14(9):1621-1628, DOI:
    [Abstract] (12699) [HTML] (0) [PDF 680.35 K] (17826)
    Abstract:
    Recommendation systems are one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures perform poorly in this situation, causing the quality of recommendations to degrade dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method first predicts the ratings of items that users have not rated using item similarity, and then uses a new similarity measure to find the target user's neighbors. The experimental results show that this method can efficiently alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
    2013,24(5):1078-1097, DOI:10.3724/SP.J.1001.2013.04390
    [Abstract] (11311) [HTML] (0) [PDF 1.74 M] (17807)
    Abstract:
    The control and data planes are decoupled in software-defined networking, which provide a new solution for research on new network applications and future Internet technologies. The development status of OpenFlow-based SDN technologies is surveyed in this paper. The research background of decoupled architecture of network control and data transmission in OpenFlow network is summarized first, and the key components and research progress including OpenFlow switch, controller, and SDN technologies are introduced. Moreover, current problems and solutions of OpenFlow-based SDN technologies are analyzed in four aspects. Combined with the development status in recent years, the applications used in campus, data center, network management and network security are summarized. Finally, future research trends are discussed.
    2005,16(10):1743-1756, DOI:
    [Abstract] (9524) [HTML] (0) [PDF 545.62 K] (17804)
    Abstract:
    This paper presents a survey on the theory of provable security and its applications to the design and analysis of security protocols. It clarifies what the provable security is, explains some basic notions involved in the theory of provable security and illustrates the basic idea of random oracle model. It also reviews the development and advances of provably secure public-key encryption and digital signature schemes, in the random oracle model or the standard model, as well as the applications of provable security to the design and analysis of session-key distribution protocols and their advances.
    2018,29(10):2966-2994, DOI:10.13328/j.cnki.jos.005551
    [Abstract] (8034) [HTML] (1904) [PDF 610.06 K] (17527)
    Abstract:
    In recent years, the rapid development of Internet technology and Web applications has triggered the explosion of various data on the Internet, which generates a large amount of valuable knowledge. How to organize, represent, and analyze this knowledge has attracted much attention. The knowledge graph was thus developed to organize knowledge in a semantic and visualized manner. Knowledge reasoning over knowledge graphs has then become one of the hot research topics and plays an important role in many applications such as vertical search and intelligent question answering. The goal of knowledge reasoning over a knowledge graph is to infer new facts or identify erroneous facts according to existing ones. Unlike traditional knowledge reasoning, knowledge reasoning over knowledge graphs is more diversified, due to the simplicity, intuitiveness, flexibility, and richness of knowledge representation in knowledge graphs. Starting with the basic concept of knowledge reasoning, this paper presents a survey of the recently developed methods for knowledge reasoning over knowledge graphs. Specifically, the research progress is reviewed in detail from two aspects: one-step reasoning and multi-step reasoning, each including rule-based reasoning, distributed embedding-based reasoning, neural network-based reasoning, and hybrid reasoning. Finally, future research directions and an outlook of knowledge reasoning over knowledge graphs are discussed.
    2013,24(2):295-316, DOI:10.3724/SP.J.1001.2013.04336
    [Abstract] (9553) [HTML] (0) [PDF 0.00 Byte] (17387)
    Abstract:
    Under the new application mode, the traditional hierarchy data centers face several limitations in size, bandwidth, scalability, and cost. In order to meet the needs of new applications, data center network should fulfill the requirements with low-cost, such as high scalability, low configuration overhead, robustness and energy-saving. First, the shortcomings of the traditional data center network architecture are summarized, and new requirements are pointed out. Secondly, the existing proposals are divided into two categories, i.e. server-centric and network-centric. Then, several representative architectures of these two categories are overviewed and compared in detail. Finally, the future directions of data center network are discussed.
    2014,25(4):839-862, DOI:10.13328/j.cnki.jos.004558
    [Abstract] (14951) [HTML] (1110) [PDF 1.32 M] (17336)
    Abstract:
    Batch computing and stream computing are two important forms of big data computing. The research and discussions on batch computing in big data environment are comparatively sufficient. But how to efficiently deal with stream computing to meet many requirements, such as low latency, high throughput and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in the big data computing research. This paper provides a research of the data computing architecture and the key issues in stream computing in big data environments. Firstly, the research gives a brief summary of three application scenarios of stream computing in business intelligence, marketing and public service. It also shows distinctive features of the stream computing in big data environment, such as real time, volatility, burstiness, irregularity and infinity. A well-designed stream computing system always optimizes in system structure, data transmission, application interfaces, high-availability, and so on. Subsequently, the research offers detailed analyses and comparisons of five typical and open-source stream computing systems in big data environment. Finally, the research specifically addresses some new challenges of the stream big data systems, such as scalability, fault tolerance, consistency, load balancing and throughput.
    2010,21(7):1605-1619, DOI:
    [Abstract] (9602) [HTML] (0) [PDF 856.25 K] (16886)
    Abstract:
    The rapid development of Internet leads to an increase in system complexity and uncertainty. Traditional network management can not meet the requirement, and it shall evolve to fusion based Cyberspace Situational Awareness (CSA). Based on the analysis of function shortage and development requirement, this paper introduces CSA as well as its origin, conception, objective and characteristics. Firstly, a CSA research framework is proposed and the research history is investigated, based on which the main aspects and the existing issues of the research are analyzed. Meanwhile, assessment methods are divided into three categories: Mathematics model, knowledge reasoning and pattern recognition. Then, this paper discusses CSA from three aspects: Model, knowledge representation and assessment methods, and then goes into detail about main idea, assessment process, merits and shortcomings of novel methods. Many typical methods are compared. The current application research of CSA in the fields of security, transmission, survivable, system evaluation and so on is presented. Finally, this paper points the development directions of CSA and offers the conclusions from issue system, technical system and application system.
    2009,20(6):1393-1405, DOI:
    [Abstract] (11546) [HTML] (0) [PDF 831.86 K] (16793)
    Abstract:
    Combinatorial testing can use a small number of test cases to test systems while preserving fault detection ability. However, the complexity of test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the area of combinatorics and software engineering. This paper summarizes the research works on this topic in recent years. They include: various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, the mathematical methods for constructing test cases, the computer search techniques for test generation and fault localization techniques based on combinatorial testing.
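    As a small illustration of one combinatorial test criterion discussed above, the following Python sketch greedily builds a pairwise (2-way) covering suite for a set of factors. It is a naive sketch that enumerates all candidate tests, intended only to show how a few tests can cover every value pair; the constructions and tools covered by the survey are far more sophisticated.

        from itertools import combinations, product

        def pairwise_suite(factors):
            """Greedily pick test cases until every value pair of any two factors is covered."""
            uncovered = {((i, a), (j, b))
                         for i, j in combinations(range(len(factors)), 2)
                         for a in factors[i] for b in factors[j]}
            suite = []
            while uncovered:
                # choose the candidate test covering the most still-uncovered pairs
                best = max(product(*factors),
                           key=lambda t: sum(((i, t[i]), (j, t[j])) in uncovered
                                             for i, j in combinations(range(len(t)), 2)))
                suite.append(best)
                uncovered -= {((i, best[i]), (j, best[j]))
                              for i, j in combinations(range(len(best)), 2)}
            return suite

        # 4 boolean factors: exhaustive testing needs 2^4 = 16 cases; pairwise needs far fewer.
        print(pairwise_suite([[0, 1]] * 4))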
    2020,31(7):2245-2282, DOI:10.13328/j.cnki.jos.006037
    [Abstract] (2458) [HTML] (1219) [PDF 967.02 K] (16743)
    Abstract:
    Ultrasonography is the first choice of imaging examination and preoperative evaluation for thyroid and breast cancer. However, the ultrasonic characteristics of benign and malignant nodules commonly overlap, and diagnosis heavily relies on the operator's experience rather than quantitative and stable methods. In recent years, medical image analysis based on computer technology has developed rapidly, and a series of landmark breakthroughs have been made, which provide effective decision support for medical imaging diagnosis. This work studies the research progress of computer vision and image recognition technologies for thyroid and breast ultrasound images. The key technologies involved in the automatic diagnosis of ultrasound images form the main line of the work: the major algorithms of recent years are summarized and analyzed, including ultrasound image preprocessing, lesion localization and segmentation, and feature extraction and classification. Moreover, a multi-dimensional analysis is made of the algorithms, data sets, and evaluation methods. Finally, existing problems related to the automatic analysis of these two kinds of ultrasound imaging are discussed, and research trends and development directions in the field of ultrasound image analysis are outlined.
    2009,20(8):2241-2254, DOI:
    [Abstract] (6417) [HTML] (0) [PDF 1.99 M] (16608)
    Abstract:
    Inspired by the idea of data fields, a community discovery algorithm based on topological potential is proposed. The basic idea is to introduce a topological potential function that analytically models the virtual interaction among all nodes in a network; by regarding each community as a local high-potential area, the community structure of the network can be uncovered by detecting all local high-potential areas bounded by low-potential nodes. Experiments on several real-world networks show that the algorithm requires no input parameters and can discover the intrinsic, and even overlapping, community structure in networks. The time complexity of the algorithm is O(m+n^(3/γ))~O(n^2), where n is the number of nodes to be explored, m is the number of edges, and 2<γ<3 is a constant.
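    The abstract does not spell out the potential function or the detection procedure. The sketch below assumes a Gaussian-like kernel over hop distances and assigns each node to the local potential maximum reached by steepest ascent over neighbors; it captures the general data-field idea rather than reproducing the paper's exact algorithm, and the value of sigma is an illustrative choice.
```python
# Topological-potential community sketch (assumed kernel and assignment rule).
import math
import networkx as nx

def topological_potential(G, sigma=1.0):
    """Potential of node v: sum over reachable u != v of exp(-(d(v,u)/sigma)^2)."""
    pot = {}
    for v in G:
        dist = nx.single_source_shortest_path_length(G, v)
        pot[v] = sum(math.exp(-(d / sigma) ** 2) for u, d in dist.items() if u != v)
    return pot

def communities_by_potential(G, sigma=1.0):
    pot = topological_potential(G, sigma)
    label = {}

    def root(v):
        # Follow strictly increasing potential until a local maximum is reached.
        if v in label:
            return label[v]
        higher = [u for u in G[v] if pot[u] > pot[v]]
        label[v] = v if not higher else root(max(higher, key=pot.get))
        return label[v]

    return {v: root(v) for v in G}

if __name__ == "__main__":
    G = nx.karate_club_graph()
    assignment = communities_by_potential(G, sigma=1.5)
    print(len(set(assignment.values())), "communities found")
```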
    2008,19(11):2803-2813, DOI:
    [Abstract] (8808) [HTML] (0) [PDF 319.20 K] (16600)
    Abstract:
    A semi-supervised clustering method based on the affinity propagation (AP) algorithm is proposed in this paper. AP takes as input measures of similarity between pairs of data points. Compared with existing clustering algorithms such as k-centers clustering, AP is an efficient and fast clustering algorithm for large datasets, but for datasets with complex cluster structures it cannot produce good clustering results. The proposed method improves the clustering performance of AP by using a priori known labeled data or pairwise constraints to adjust the similarity matrix. Experimental results show that the method indeed reaches its goal on complex datasets, and it outperforms the comparative methods when a large number of pairwise constraints are available.
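    As a rough illustration of the constraint-based similarity adjustment described above (the exact adjustment values, preference setting, and similarity measure here are assumptions, not the paper's scheme), one can raise the similarity of must-link pairs and lower that of cannot-link pairs before running standard affinity propagation on the precomputed similarity matrix:
```python
# Semi-supervised AP sketch: adjust a precomputed similarity matrix with
# pairwise constraints, then run scikit-learn's AffinityPropagation.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

def semi_supervised_ap(X, must_link=(), cannot_link=()):
    S = -pairwise_distances(X, metric="sqeuclidean")   # AP similarities
    hi, lo = S.max(), S.min() * 10                     # assumed adjustment values
    for i, j in must_link:                             # pull constrained pairs together
        S[i, j] = S[j, i] = hi
    for i, j in cannot_link:                           # push constrained pairs apart
        S[i, j] = S[j, i] = lo
    np.fill_diagonal(S, np.median(S))                  # preference = median similarity
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    return ap.fit_predict(S)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    labels = semi_supervised_ap(X, must_link=[(0, 1)], cannot_link=[(0, 39)])
    print(labels)
```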
    2009,20(8):2199-2213, DOI:
    [Abstract] (10054) [HTML] (0) [PDF 2.05 M] (16134)
    Abstract:
    This paper analyzes previous studies on applying P2P technology in the mobile Internet. It first introduces P2P technology and the concept of the mobile Internet, and presents the challenges and service patterns of P2P technology in the mobile Internet. Second, the architectures of P2P technology in the mobile Internet are described in terms of centralized, super-node, and ad hoc architectures, respectively. Furthermore, resource location algorithms and cross-layer optimizations are introduced based on two different terminal access patterns. Detailed analyses of the different key technologies are presented and their disadvantages are pointed out. Finally, this paper outlines future research directions.
    2009,20(3):567-582, DOI:
    [Abstract] (7971) [HTML] (0) [PDF 780.38 K] (16070)
    Abstract:
    Research on software quality models and software quality evaluation models has always been a hot topic in the area of software quality assurance and assessment. A great deal of domestic and foreign research has been done on building software quality models and quality assessment models, and certain accomplishments have been achieved in these areas. In recent years, platform building and systematization have become the trends in developing basic software based on operating systems. Therefore, the quality evaluation of the foundational software platform has become an essential issue to be solved. This article analyzes and summarizes the current development of research on software quality models and software quality assessment models, focusing on summarizing and depicting the development of quality evaluation of foundational software platforms. It also briefly discusses the future development of research on quality assessment of foundational software platforms, trying to establish a good foundation for it.
    2010,21(5):916-929, DOI:
    [Abstract] (11878) [HTML] (0) [PDF 944.50 K] (15959)
    Abstract:
    Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey of these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys the various technologies proposed to cope with these two problems. Based on the analysis of the current state of research on data deduplication technologies, this paper draws the following conclusions: a) how to mine data characteristic information in data deduplication has not been completely solved, and how to use data characteristic information to effectively eliminate duplicate data also needs further study; b) from the perspective of storage system design, further study is needed on how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and to reduce the additional system overheads they cause.
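    As an illustration of the first category (identical data detection), the following sketch splits data into fixed-size chunks, fingerprints each chunk with SHA-256, and stores each unique chunk only once. The chunk size and hash choice are illustrative assumptions; production systems typically use content-defined chunking and additional indexing to keep fingerprint lookups fast.
```python
# Fixed-size chunking + fingerprint index: a didactic dedup store.
import hashlib

class DedupStore:
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}        # fingerprint -> chunk bytes (stored once)
        self.recipes = {}       # object name -> ordered list of fingerprints

    def put(self, name, data: bytes):
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)   # duplicate chunks are not re-stored
            recipe.append(fp)
        self.recipes[name] = recipe

    def get(self, name) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

if __name__ == "__main__":
    store = DedupStore()
    payload = b"hello world" * 1000
    store.put("a.txt", payload)
    store.put("b.txt", payload)                 # fully duplicate object
    assert store.get("b.txt") == payload
    print(len(store.chunks), "unique chunks stored")
```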
    2012,23(5):1148-1166, DOI:10.3724/SP.J.1001.2012.04195
    [Abstract] (13855) [HTML] (0) [PDF 946.37 K] (15794)
    Abstract:
    With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
    2007,18(1):146-156, DOI:
    [Abstract] (9508) [HTML] (0) [PDF 728.16 K] (15691)
    Abstract:
    A new surrogate placement strategy, CCSP (capacity-constrained surrogate placement), is proposed to enhance the performance of content distribution networks (CDNs). CCSP aims to place surrogates in a manner that minimizes communication cost while at the same time maximizing system throughput. Unlike existing work on the resource allocation problem in communication networks, CCSP considers load distribution and processing capacity constraints on surrogates by modeling the underlying request-routing mechanism, thus guaranteeing that a CDN has minimum network resource consumption, maximum system throughput, and better load balancing among surrogates. An efficient greedy algorithm is developed for a simplified version of the CCSP problem in tree networks. The efficiency of the proposed algorithm is systematically analyzed through experimental simulations.
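    The paper's tree-network algorithm is not reproduced here; the following is a generic capacity-constrained greedy placement sketch under simplifying assumptions (a fixed number of surrogates to place, clients assigned to the nearest open surrogate with remaining capacity, cost measured as rate-weighted distance) that only conveys the flavor of such a greedy approach.
```python
# Generic greedy surrogate placement under capacity constraints (illustrative).
import math

def greedy_placement(dist, rates, capacities, k):
    """dist[c][s]: distance from client c to candidate site s;
    rates[c]: request rate of client c; capacities[s]: max rate site s can serve;
    k: number of surrogates to place. Returns the chosen site indices."""
    clients, sites = range(len(rates)), range(len(capacities))

    def cost(open_sites):
        # Assign clients (heaviest first) to the nearest open site with room.
        remaining = {s: capacities[s] for s in open_sites}
        total = 0.0
        for c in sorted(clients, key=lambda c: -rates[c]):
            feasible = [s for s in open_sites if remaining[s] >= rates[c]]
            if not feasible:
                return math.inf                 # infeasible assignment
            s = min(feasible, key=lambda s: dist[c][s])
            remaining[s] -= rates[c]
            total += rates[c] * dist[c][s]
        return total

    chosen = []
    for _ in range(k):
        best = min((s for s in sites if s not in chosen),
                   key=lambda s: cost(chosen + [s]))
        chosen.append(best)
    return chosen

if __name__ == "__main__":
    dist = [[1, 4], [3, 1], [2, 2]]
    print(greedy_placement(dist, rates=[5, 5, 3], capacities=[8, 8], k=2))
```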
    2013,24(4):825-842, DOI:10.3724/SP.J.1001.2013.04369
    [Abstract] (7828) [HTML] (0) [PDF 1.09 M] (15647)
    Abstract:
    Honeypot is a proactive defense technology introduced by defenders to change the asymmetric situation of the network attack-and-defense game. Through the deployment of honeypots, i.e., security resources without any production purpose, defenders can deceive attackers into illegally using the honeypots, and can capture and analyze the attack behaviors to understand the attack tools and methods and to learn the attackers' intentions and motivations. Honeypot technology has attracted sustained attention from the security community, made considerable progress, and gained wide application, becoming one of the main technical means of Internet security threat monitoring and analysis. In this paper, the origin and evolution of honeypot technology are presented first. Next, the key mechanisms of honeypot technology are comprehensively analyzed, the development of honeypot deployment structures is reviewed, and the latest applications of honeypot technology in Internet security threat monitoring, analysis, and prevention are summarized. Finally, the problems of honeypot technology, its development trends, and further research directions are discussed.
    2016,27(3):691-713, DOI:10.13328/j.cnki.jos.004948
    [Abstract] (8839) [HTML] (896) [PDF 2.43 M] (15635)
    Abstract:
    Learning to rank (L2R) techniques try to solve ranking problems using machine learning methods, and have been well studied and widely used in various fields such as information retrieval, text mining, personalized recommendation, and biomedicine. The main task of L2R-based recommendation algorithms is to integrate L2R techniques into recommendation algorithms and to study how to organize the large numbers of users and item features, build more suitable user models according to user preferences and requirements, and improve the performance and user satisfaction of recommendation algorithms. This paper surveys L2R-based recommendation algorithms of recent years, summarizes the problem definition, compares the key technologies, and analyzes the evaluation metrics and their applications. In addition, the paper discusses the future development trends of L2R-based recommendation algorithms.
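    As a minimal illustration of the pairwise flavor of L2R used in recommendation (the linear scoring model, logistic pairwise loss, and synthetic features below are assumptions for illustration, not taken from any surveyed algorithm): the model learns a weight vector so that items a user preferred score higher than items the user skipped, and recommendations are the top-scored items.
```python
# Pairwise learning-to-rank sketch with a linear scorer and logistic pairwise loss.
import numpy as np

def train_pairwise_l2r(item_features, pref_pairs, lr=0.1, epochs=200):
    """pref_pairs: list of (i, j) meaning item i should rank above item j."""
    w = np.zeros(item_features.shape[1])
    for _ in range(epochs):
        for i, j in pref_pairs:
            diff = item_features[i] - item_features[j]
            # Gradient of the logistic pairwise loss log(1 + exp(-w.diff)).
            grad = -diff / (1.0 + np.exp(w @ diff))
            w -= lr * grad
    return w

def recommend(w, item_features, top_k=3):
    scores = item_features @ w
    return np.argsort(-scores)[:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    items = rng.normal(size=(20, 5))                  # synthetic item features
    true_w = np.array([1.0, -0.5, 0.0, 2.0, 0.3])     # hidden "true" preference
    scores = items @ true_w
    pairs = [(i, j) for i in range(20) for j in range(20) if scores[i] > scores[j] + 0.5]
    w = train_pairwise_l2r(items, pairs)
    print("top items:", recommend(w, items))
```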