Abstract: In recent years, the localization and tracking of moving targets have been widely used in scenes including indoor navigation, smart homes, security monitoring, and smart medical services. Radio frequency (RF)-based contactless localization and tracking have attracted extensive attention from researchers. Among them, the commercial IR-UWB-based technology can achieve target localization and tracking at low costs and power consumption and has strong development potential. However, most of the existing studies have the following problems: 1) Limited tracking scenes. Modeling and processing methods are only for outdoor or relatively empty indoor scenes under ideal conditions. 2) Limited movement states of targets and unduly ideal modeling. 3) Low tracking accuracy caused by fake moving targets. To solve these problems, this study proposes a moving target tracking method using IR-UWB on the basis of understanding the composition of the received signal spectrum in multipath scenes. First, the dynamic components of the originally received signal spectrum are extracted. Then, the Gaussian blur-based multipath elimination and distance extraction algorithm is employed to eliminate multipath interference, which only retains primary reflection information directly related to the moving target and therefore accurately obtains the distance variation curve of the target. Subsequently, a multi-view fusion algorithm is proposed to fuse the distance information of the devices from different views to achieve accurate localization and tracking of a single freely moving target. In addition, a real-time moving target tracking system based on the low-cost commercial IR-UWB radar is established. The experimental results in the real indoor home scene show that the error between the center position of the human body estimated by the system and the real motion trajectory is always within 20 cm. 
Moreover, the system remains robust even if influencing factors such as the experimental environment, experimenter, activity speed, and equipment height are altered.
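The multi-view fusion step can be illustrated with textbook range-based trilateration. The sketch below is not the authors' algorithm: it assumes three or more radars at known 2D positions, each reporting one distance to the target, and solves the linearized range equations by least squares. The function name `trilaterate` and the example coordinates are hypothetical.

```python
import math


def trilaterate(anchors, distances):
    """Estimate a 2D position from >= 3 known anchor positions and ranges."""
    (x0, y0), d0 = anchors[0], distances[0]
    rows, rhs = [], []
    # Subtracting the first range equation from each other one gives a
    # linear row: 2(xi-x0)x + 2(yi-y0)y = d0^2 - di^2 + xi^2 - x0^2 + yi^2 - y0^2
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        rows.append((2 * (xi - x0), 2 * (yi - y0)))
        rhs.append(d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2)
    # Least squares via the 2x2 normal equations, solved with Cramer's rule.
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * v for r, v in zip(rows, rhs))
    b2 = sum(r[1] * v for r, v in zip(rows, rhs))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Hypothetical room layout: three radars and noise-free ranges to a target.
radars = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
target = (1.0, 2.0)
ranges = [math.dist(r, target) for r in radars]
estimate = trilaterate(radars, ranges)
print(estimate)
```

With noisy ranges, adding more radars gives more rows in the same least-squares system, which is one simple way in which several views can improve a position estimate.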
Abstract: In recent years, with the rapid development of blockchain, the types of cryptocurrencies and anonymous transactions have been increasingly diversified. How to make optimal decisions in the transaction type of cryptocurrency market is the concern of users. The users’ decision-making goal is to minimize transaction costs and maximize privacy while ensuring that transactions are packaged. The cryptocurrency trading market is complex, and cryptocurrency technologies differ greatly from each other. Existing studies focus on the Bitcoin market, and few of them discuss other anonymous currency markets such as Zcash and users’ anonymous demands. Therefore, this study proposes a game-based general cryptocurrency trading market model and explores the trading market and users’ decisions on transaction types and costs by combining the anonymous needs of users and employing game theory. Taking Zcash, the most representative optional cryptocurrency, as an example, it analyzes the trading market in combination with the CoinJoin transaction, simulates the trading process about how users and miners find the optimal strategy, and discusses the impact of block size, discount factors, and the number of users on the trading market and user behaviors. Additionally, the model is simulated in a variety of market types to conduct in-depth discussion of the experimental results. Taking a three-type trading market as an example, in the context of vicious fee competition in the trading market, when plnum = 75, θ= 0.4, st = 100, sz = 400, all users are inclined to choose CoinJoin in the early transaction stage (first 500 rounds). In the middle and late part of the market (1500–2000 rounds), 97% of users with a privacy sensitivity below 0.7 tend to choose CoinJoin, while 73% of users with a privacy sensitivity above 0.7 tend to choose shielded transactions. CoinJoin transactions and block sizes above 400 can alleviate the vicious competition of transaction fees to some extent. 
The proposed model can help researchers understand the game of different cryptocurrency trading markets, analyze user trading behavior, and reveal market operation rules.
Abstract: Code change is a key behavior in software evolution, and its quality has a large impact on software quality. Modeling and representing code changes is the basis of many software engineering tasks, such as just-in-time defect prediction and recovery of software product traceability. Representation learning technologies for code changes have attracted extensive attention and have been applied to diverse applications in recent years. This type of technology aims to represent the semantic information in code changes as low-dimensional dense real-valued vectors, namely, to learn the distributed representation of code changes. Compared with the conventional methods of manually designing code change features, such technologies offer the advantages of automatic learning, end-to-end training, and accurate representation. However, this field still faces some challenges, such as great difficulties in utilizing structural information and the absence of benchmark datasets. This study surveys and summarizes the recent progress of studies and applications of representation learning technologies for code changes, and it mainly consists of the following four parts. (1) The study presents the general framework of representation learning of code changes and its application. (2) Subsequently, it reviews the currently available representation learning technologies for code changes and summarizes their respective advantages and disadvantages. (3) Then, the downstream applications of such technologies are summarized and classified. (4) Finally, this study discusses the challenges and potential opportunities ahead of representation learning technologies for code changes and suggests the directions for the future development of this type of technology.
Abstract: In recent years, software construction, operation, and evolution have encountered many new requirements, such as the need for efficient switching or configuration in development and testing environments, application isolation, resource consumption reduction, and higher efficiency of testing and deployment. These requirements pose great challenges to developers in developing and maintaining software. Container technology has the potential of releasing developers from the heavy workload of development and maintenance. Of particular note, Docker, as the de facto industrial standard for containers, has recently become a popular research area in the academic community. To help researchers understand the status and trends of research on Docker containers, this study conducts a systematic literature review by collecting 75 high-level papers in this field. First, quantitative methods are used to investigate the basic status of research on Docker containers, including research quantity, research quality, research areas, and research methods. Second, the first classification framework for research on Docker containers is presented in this study, and the current studies are systematically classified and reviewed from the dimensions of the core, platform, and support. Finally, the development trends of Docker container technology are discussed, and seven future research directions are summarized.
Abstract: The emergence of the dynamic link library (DLL) provides great convenience for developers, which improves the interaction between the operating system (OS) and applications. However, the potential security problems of DLL cannot be ignored. Determining how to mine DLL-hijacking vulnerabilities during the running of Windows installers is important to ensure the security of Windows OS. In this paper, the attribute features of numerous installers are collected and extracted, and the double-layer bi-directional long short-term memory (BiLSTM) neural network is applied for machine learning from the perspectives of installers, the invocation modes of DLL from installers, and the DLL file itself. The multi-dimensional features of the vulnerability data set are extracted, and unknown DLL-hijacking vulnerabilities are mined. In experiments, DLL-hijacking vulnerabilities can be effectively detected from Windows installers, and 10 unknown vulnerabilities are discovered and assigned CNVD authorizations. In addition, the effectiveness and integrity of this method are further verified by comparison with other vulnerability analyzers.
Abstract: The critical reliability and availability of distributed systems are threatened by crash recovery bugs caused by incorrect crash recovery mechanisms and their implementations. The detection of crash recovery bugs, however, can be extremely challenging since these bugs only manifest themselves when a node crashes under special timing conditions. This study presents a novel approach Deminer to automatically detect crash recovery bugs in distributed systems. Observations in the large-scale distributed systems show that node crashes that interrupt the execution of related I/O write operations, which store a piece of data (i.e., common data) in different places, e.g., different storage paths or nodes, are more likely to trigger crash recovery bugs. Therefore, Deminer detects crash recovery bugs by automatically identifying and injecting such error-prone node crashes under the usage guidance of common data. Deminer first tracks the usage of critical data in a correct run. Then, it identifies I/O write operation pairs that use the common data and predicts error-prone injection points of a node crash on the basis of the execution trace. Finally, Deminer tests the predicted injection points of the node crash and checks failure symptoms to expose and confirm crash recovery bugs. A prototype of Deminer is implemented and evaluated on the latest versions of four widely used distributed systems, i.e., ZooKeeper, HBase, YARN, and HDFS. The experimental results show that Deminer is effective in finding crash recovery bugs. Deminer has detected six crash recovery bugs.
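The pairing step described above can be sketched as follows. This is an illustration only, not Deminer's implementation: the trace format (`Write` records with a step index, node, path, and data identifier) and the function name `find_injection_points` are hypothetical. It groups write operations by the data they store and reports the window between two consecutive writes of the same data to different locations as a candidate crash injection point.

```python
from collections import defaultdict
from typing import NamedTuple


class Write(NamedTuple):
    step: int      # position of the write in the recorded execution trace
    node: str      # node that performed the write
    path: str      # storage path written to
    data_id: str   # identifier of the data item being stored


def find_injection_points(trace):
    """Return (data_id, after_step, before_step) windows for node crashes."""
    by_data = defaultdict(list)
    for w in trace:
        by_data[w.data_id].append(w)

    points = []
    for data_id, writes in by_data.items():
        writes.sort(key=lambda w: w.step)
        # Pair consecutive writes of the same data; keep only pairs that
        # target different places (different node or different path).
        for first, second in zip(writes, writes[1:]):
            if (first.node, first.path) != (second.node, second.path):
                points.append((data_id, first.step, second.step))
    return sorted(points, key=lambda p: p[1])


trace = [
    Write(1, "n1", "/log/a", "x"),
    Write(3, "n2", "/log/b", "x"),   # same data, different node: a pair
    Write(5, "n1", "/meta", "y"),
    Write(6, "n1", "/meta", "y"),   # same location: not a pair
]
print(find_injection_points(trace))
```

A crash injected inside a reported window leaves the common data written in one place but not the other, which is the kind of inconsistent state the abstract identifies as error-prone.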
Abstract: By transferring the knowledge of the source domain to the target domain with similar tasks, domain adaptation aims to assist the latter to learn better. When the data label set of the target domain is a subset of the source domain labels, the domain adaptation of this type of scenario is called partial domain adaptation (PDA). Compared with general domain adaptation, although PDA is more general, it is more challenging with few related studies, especially with the lack of systematic reviews. To fill this gap, this study conducts a comprehensive review, analysis and summary of existing PDA methods, and provides an overview and reference of subject research for the relevant community. Firstly, an overview of the PDA background, concepts, and application fields is summarized. Secondly, according to the modeling characteristics, existing PDA methods are divided into two categories: promoting positive transfer and alleviating negative transfer, and this study reviews and analyzes them respectively. Then, the commonly used experimental benchmark datasets are categorized and summarized. Finally, the problems in existing PDA studies are analyzed to point out possible future development directions.
Abstract: In natural scenes, logos such as trademarks and traffic signs are susceptible to shooting angle, carrier deformation, and scale changes, which reduces logo detection accuracy. Thus, this study proposes an attention guided logo detection and recognition network (AGLDN) to jointly optimize the model robustness for multi-scale and complex deformation. First, a logo synthesis dataset is established by image collection and mask generation of logo templates, image selection of logo background, and logo image generation. Then, based on RetinaNet and FPN, multi-scale features are extracted and high-level semantic feature mapping is formed. Finally, the attention mechanism guided network is employed to focus on the logo area, and the influence of logo deformation on feature robustness is suppressed to improve logo detection and recognition. Experimental results show that the proposed method can reduce the influence of scale changes and non-rigid deformation, and improve detection accuracy.
Abstract: Unlimited by the state and space, the formal verification technology based on mechanized theorem proving is an important method to ensure software correctness and avoid serious loss from potential software bugs. The left-leaning red-black tree (LLRB) is a variant of the binary search tree, and its structure has an additional left-leaning constraint over the traditional red-black tree. During verification, conventional proof strategies cannot be employed, so more manual intervention and effort are required. Thus, LLRB correctness verification is widely acknowledged as a challenging problem. To this end, based on the Isabelle verification framework for the binary search tree algorithm, this study refines the additional property part of the framework and provides a concrete verification scheme. The LLRB insertion and deletion operations are functionally modeled in Isabelle, with modular treatment of the LLRB invariants. Subsequently, the functional correctness is verified. This is the first mechanized verification of functional LLRB insertion and deletion algorithms in Isabelle. Compared to the current Dafny verification of the LLRB algorithm, the number of theorems is reduced from 158 to 84, and there is no need to construct intermediate assertions, which alleviates the verification burden. Meanwhile, this study provides references for functional modeling and verification of complex tree structure algorithms.
Abstract: Detecting latent topics in social media texts is a meaningful task, and the short and informal posts will cause serious data sparsity. Additionally, models based on variational auto-encoders (VAEs) ignore the social relationships among users during topic inference and VAE assumes that each input data point is independent. This results in the lack of correlation information between the inferred latent topic variables and incoherent topics. Social network structure information can not only provide clues for aggregating contextual messages but also indicate topic correlation among users. Therefore, this study proposes to utilize the microblog topic model based on message passing and graph prior distribution. This model can encode richer context information by graph convolution network (GCN) and integrate the interactive relationship of users by graph prior distribution during VAE topic inference to better understand the complex correlation among multiple data points and mine social media topic information. The experiments on three actual datasets validate the effectiveness of the proposed model.
Abstract: In current real life where data sources are diverse, and manual labeling is difficult, semi-supervised multi-view classification algorithms have important research significance in various fields. In recent years, graph neural networks-based semi-supervised multi-view classification algorithms have achieved great progress. However, most of the existing graph neural networks carry out multi-view information fusion only in the classification stage, while neglecting the multi-view information interaction between the same sample during the training stage. To solve the above issue, this study proposes a model for semi-supervised classification, named multi-view interaction graph convolutional network (MIGCN). The Transformer Encoder module is introduced to the graph convolution layer trained on different views, which aims to adaptively acquire complementary information between different views for the same sample during the training stage. More importantly, the study introduces the consistency constraint loss to make the similar relationship of the final feature expressions of different views as similar as possible. This operation can make graph convolutional neural networks during the classification stage better utilize the consistency and complementarity information between different views reasonably, and then it can further improve the robust performance of the multi-view fusion feature. Extensive experiments on several real-world multi-view datasets demonstrate that compared with the graph-based semi-supervised multi-view classification model, MIGCN can better learn the essential features of multi-view data, thereby improving the accuracy of semi-supervised multi-view classification.
Abstract: In the field of cyber security, the mendacious domains generated by the domain generation algorithm (DGA) are called DGA domains. Similar to real domains, they are usually a random combination of characters or numbers, which makes DGA domains highly camouflaged. Hackers take advantage of the disguised nature of DGA domains to carry out cyber attacks, so as to bypass security detection. How to effectively detect DGA domains has become a research hotspot. Traditional statistical machine learning detection methods require the manual construction of domain feature sets. However, the quality of domain features constructed manually or semi-automatically varies, which affects the accuracy of detection. In view of the powerful automatic feature extraction and representation capability of deep neural networks, a DGA domain detection method based on multi-view contrastive learning (MCL4DGA) is proposed. Different from existing methods, it incorporates attentional neural networks, convolutional neural networks, and recurrent neural networks to effectively capture global, local, and bidirectional multi-view feature dependencies of domain sequences. Besides, the self-supervision signals derived by contrastive learning can enhance the expressiveness between multi-view feature learning encoders and thus improve the accuracy of detection. The effectiveness of the proposed method is verified by experimental comparison with current methods on a real dataset.
Abstract: Nowadays, deep neural network (DNN) is widely used in autonomous driving, medical diagnosis, speech recognition, face recognition, and other safety-critical fields. Therefore, DNN testing is critical to ensure the quality of DNN. However, labeling test cases to judge whether the DNN model predictions are correct is costly. Therefore, selecting test cases that reveal incorrect behavior of DNN models and labeling them earlier can help developers debug DNN models as soon as possible, thus improving the efficiency of DNN testing and ensuring the quality of DNN models. This study proposes a test case selection method based on data mutation, namely DMS. In this method, a data mutation operator is designed and implemented to generate a mutation model to simulate model defects and capture the dynamic pattern of test case bug-revealing, so as to evaluate the ability of test case bug-revealing. Experiments are conducted on the combination of 25 deep learning test sets and models. The results show that DMS is significantly better than the existing test case selection methods in terms of both the proportion of bug-revealing and the diversity of bug-revealing directions in the selected samples. Specifically, taking the original test set as the candidate set, DMS can filter out 53.85%–99.22% of all bug-revealing test cases when selecting 10% of the test cases. Moreover, when 5% of the test cases are selected, the selected cases by DMS can cover almost all bug-revealing directions. Compared with the eight comparison methods, DMS finds 12.38%–71.81% more bug-revealing cases on average, which proves the significant effectiveness of DMS in the task of test case selection.
Abstract: Apache Flink is one of the most popular stream computing platforms and has many applications in industry. Complex event processing (CEP) is one of the important usage scenarios of stream computation. Apache Flink defines and implements a language for complex event processing (referred to as FlinkCEP). FlinkCEP includes rich syntactic features, not only the usual features of filtering, connecting, and looping, but also the advanced features of iterative conditions and after-match skip strategies. The semantics of FlinkCEP is complex, and no language specification defines it precisely, so it can only be understood by checking the implementation details. This motivates the definition of formal semantics for FlinkCEP so that developers can understand its semantics precisely. This study proposes an automaton model called data stream transducers (DST) for FlinkCEP, where data variables are applied to capture the iterative conditions, data stream variables are adopted to store the outputs, and transition priorities are introduced to capture the after-match skip strategies. DST is leveraged to define the formal semantics of FlinkCEP and design the query evaluation algorithms based on the formal semantics. Moreover, a prototype of the CEP engine is implemented. Finally, test case sets are generated, which cover the syntactic features of FlinkCEP more comprehensively. They are utilized to conduct comparison experiments against the actual results of FlinkCEP on the Flink platform. The experimental results show that the proposed formal semantics of FlinkCEP conforms to the actual semantics of FlinkCEP in the vast majority of cases. Furthermore, the inconsistencies between the formal and the actual semantics are analyzed, and it is discovered that the Flink implementation of FlinkCEP may not deal with group patterns correctly.
Abstract: Temporal knowledge graph reasoning aims to fill in missing links or facts in knowledge graphs, where each fact is associated with a specific timestamp. The dynamic variational framework based on the variational autoencoder is particularly effective for this task. By jointly modeling entities and relations using Gaussian distributions, this method not only offers high interpretability but also solves complex probability distribution problems. However, traditional variational autoencoder-based methods often suffer from overfitting during training, which limits their ability to accurately capture the semantic evolution of entities over time. To address this challenge, this study proposes a new temporal knowledge graph reasoning model based on a diffusion probability distribution approach. Specifically, the model uses a bi-directional iterative process to divide the entity semantic modeling process into multiple sub-modules. Each sub-module uses a forward noisy transformation and a backward Gaussian sampling to model a small-scale evolution process of entity semantics. Through the joint modeling of multiple sub-modules, the model learns the dynamic representation of entity semantics in the metric space over time and thus achieves more accurate modeling than the variational autoencoder-based method. In terms of the MRR metric, the model improves over the variational autoencoder-based method by 4.18% and 1.87% on the Yago11k and Wikidata12k datasets and by 1.63% and 2.48% on the ICEWS14 and ICEWS05-15 datasets, respectively.
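For orientation, the forward noisy transformation and backward Gaussian sampling can be written in the standard denoising diffusion form. The abstract does not state the exact formulation used, so the equations below are the common textbook version and should be read as an assumption: $x_t$ denotes an entity representation at diffusion step $t$, $\beta_t$ a noise schedule, and $\mu_\theta$, $\sigma_t$ the learned mean and the step variance.

```latex
% Forward step: add a small amount of Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Backward step: sample the previous state from a learned Gaussian
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)
```

Each pair of forward and backward steps covers only a small change, which matches the abstract's description of sub-modules that each model a small-scale evolution of entity semantics.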
Abstract: Safety-critical embedded software usually has rigorous time constraints over the runtime behaviors, raising additional requirements for enforcing security properties. To protect the information flow security of embedded software and mitigate the limitations of the existing simplex verification approaches and their potential false positives, this study first proposes a new timed noninterference property, i.e., timed SIR-NNI, based on the security requirement of a realistic scenario. Then the study presents an information flow security verification approach that unifies the verification of multiple timed noninterference properties, i.e., timed BNNI, timed BSNNI, and timed SIR-NNI. Based on the different timed noninterference requirements, the approach constructs the refined automata and test automata from the timed automata under verification. The study uses UPPAAL’s reachability analysis to implement the refinement relation check and the security verification. The verification tool, i.e., TINIVER, extracts timed automata from SysML’s sequential diagrams or C++ source code to conduct the verification procedure. The verification results of TINIVER on existing timed automata models and security properties justify the usability of the proposed approach. The security verifications on the typical flight-mode switch models of the UAV flight control systems ArduPilot and PX4 demonstrate the practicability and scalability of the proposed approach. Besides, the approach is effective in mitigating the false positives of a state-of-the-art verification approach.
Abstract: Text-based person retrieval is a developing downstream task of cross-modal retrieval and derives from conventional person re-identification, which plays a vital role in public safety and person search. In view of the problem of lacking query images in traditional person re-identification, the main challenge of this task is that it combines two different modalities and requires that the model have the capability of learning both image content and textual semantics. To narrow the semantic gap between pedestrian images and text descriptions, the traditional methods usually split image features and text features mechanically and only focus on cross-modal alignment, which ignores the potential relations between the person image and description and leads to inaccurate cross-modal alignment. To address the above issues, this study proposes a novel relation alignment-based cross-modal person retrieval network. First, the attention mechanism is used to construct the self-attention matrix and the cross-modal attention matrix, in which the attention matrix is regarded as the distribution of response values between different feature sequences. Then, two different matrix construction methods are used to reconstruct the intra-modal attention matrix and the cross-modal attention matrix respectively. Among them, the element-by-element reconstruction of the intra-modal attention matrix can well excavate the potential relationships of intra-modal. Moreover, by taking the cross-modal information as a bridge, the holistic reconstruction of the cross-modal attention matrix can fully excavate the potential information from different modalities and narrow the semantic gap. Finally, the model is jointly trained with a cross-modal projection matching loss and a KL divergence loss, which helps achieve the mutual promotion between modalities. 
Quantitative and qualitative results on a public text-based person search dataset (CUHK-PEDES) demonstrate that the proposed method performs favorably against state-of-the-art text-based person search methods.
Abstract: Multi-view clustering has attracted more and more attention in the fields of image processing, data mining, and machine learning. Existing multi-view clustering algorithms have two shortcomings. First, in the process of graph construction, only the pairwise relationship between the data of each view is considered to generate an affinity matrix, which lacks the characterization of neighborhood relationships. Second, existing methods separate the process of multi-view information fusion from clustering, thereby reducing the clustering performance of the algorithm. Therefore, this study proposes a more accurate and robust joint spectral embedding multi-view clustering algorithm based on bipartite graphs. Firstly, based on the multi-view subspace clustering idea, bipartite graphs are constructed, and similar graphs are generated. Then the spectral embedding matrix of similar graphs is used to perform graph fusion. Secondly, by considering the importance of each view during the fusion process, weight constraints are applied, and an indicator matrix is introduced to obtain the final clustering result. A model is proposed to optimize the bipartite graph, embedding matrix, and clustering indicator matrix within a single framework. In addition, a fast optimization strategy for solving the model is provided, which decomposes the optimization problem into small module subproblems and efficiently solves them through iterative steps. The proposed algorithm and existing multi-view clustering algorithms have been experimentally analyzed on real data sets. Experimental results show that the proposed algorithm is more effective and robust in dealing with multi-view clustering problems compared with existing methods.
Abstract: The rapid advancement of sensor technology has resulted in a vast volume of traffic trajectory data, and trajectory anomaly detection has a wide range of applications in sectors including smart transportation, autonomous driving, and video surveillance. Trajectory anomaly detection, unlike other trajectory mining tasks like classification, clustering, and prediction, tries to find low-probability, uncertain, and unusual trajectory behavior. The types of anomalies, trajectory data labels, detection accuracy, and computational complexity are all frequent issues in trajectory anomaly detection. In view of the above problems, the research status and latest progress of trajectory anomaly detection technology in the past two decades are comprehensively reviewed. First, the characteristics of trajectory anomaly detection and the current research challenges are analyzed. Then, the existing trajectory anomaly detection algorithms are compared and analyzed based on the classification criteria such as the availability of trajectory labels, the principle of anomaly detection algorithms, and the working mode of offline or online algorithms. For each type of anomaly detection technology, the algorithm principle, representative method, complexity analysis and algorithm advantages and disadvantages are summarized and analyzed in detail. Then, the open source trajectory datasets, commonly used anomaly detection evaluation methods and anomaly detection tools are discussed. On this basis, the architecture of the trajectory anomaly detection system is presented, and a series of relatively complete trajectory mining processes from trajectory data collection to anomaly detection application are formed. Finally, the significant open issues in the domain of trajectory anomaly detection are discussed, as well as potential research trends and solutions.
Abstract: With the development of mobile services’ computing and sensing abilities, spatial crowdsourcing, which is based on location information, comes into being. There are many challenges to improving the performance of task assignments, one of which is how to assign workers the tasks that they are interested in. Existing research methods only focus on workers’ temporal preference but ignore the impact of spatial factors on workers’ preference, and they only focus on long-term preference but ignore workers’ short-term preference and face the problem of inaccurate predictions caused by sparse historical data. This study analyzes the task assignment problem based on long-term and short-term spatio-temporal preference. By comprehensively considering workers’ preferences from both long-term and short-term perspectives, as well as temporal and spatial dimensions, the quality of task assignment is improved in task assignment success rate and completion efficiency. In order to improve the accuracy of spatio-temporal preference prediction, the study proposes a sliced imputation-based context-aware tensor decomposition algorithm (SICTD) to reduce the proportion of missing values in preference tensors and calculates short-term spatio-temporal preference through the ST-HITS algorithm and short-term active range of workers under spatio-temporal constraints. In order to maximize the total task reward and the workers’ average preference for completing tasks, the study designs a spatio-temporal preference-aware greedy and Kuhn-Munkres (KM) algorithm to optimize the results of task assignment. Extensive experimental results on real datasets show the effectiveness of the long- and short-term spatio-temporal preference-aware task assignment framework. Compared with baselines, the RMSE prediction error of the proposed SICTD for temporal and spatial preferences is decreased by 22.55% and 24.17%, respectively. 
In terms of task assignment, the proposed preference-aware KM algorithm significantly outperforms the baseline algorithms: the workers’ total reward and average preference for completing tasks increase by 40.86% and 22.40% on average, respectively.
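The contrast between a greedy assignment and an optimal one, which the abstract above draws through its greedy and Kuhn-Munkres variants, can be illustrated with a minimal sketch. The score matrix, the function names, and the idea of folding reward and preference into one score per worker-task pair are assumptions made for illustration. The exhaustive search below only stands in for KM on tiny square instances; it is not the KM algorithm and not the paper's implementation.

```python
from itertools import permutations

def greedy_assign(score):
    """Take worker-task pairs in descending score order, using each worker and task once."""
    pairs = sorted(
        ((score[w][t], w, t) for w in range(len(score)) for t in range(len(score[0]))),
        key=lambda p: (-p[0], p[1], p[2]),  # ties broken by index for determinism
    )
    used_workers, used_tasks, assignment = set(), set(), []
    for _, w, t in pairs:
        if w not in used_workers and t not in used_tasks:
            assignment.append((w, t))
            used_workers.add(w)
            used_tasks.add(t)
    return sorted(assignment)

def optimal_assign(score):
    """Exhaustive search over all one-to-one assignments (square matrix assumed).

    KM finds the same optimum in polynomial time; brute force is only viable
    for very small instances and is used here purely as a reference."""
    n = len(score)
    best = max(permutations(range(n)),
               key=lambda perm: sum(score[w][perm[w]] for w in range(n)))
    return [(w, best[w]) for w in range(n)]

def total(score, assignment):
    return sum(score[w][t] for w, t in assignment)

# Hypothetical combined scores: rows are workers, columns are tasks.
score = [[0.9, 0.8],
         [0.8, 0.1]]
print(greedy_assign(score), total(score, greedy_assign(score)))
print(optimal_assign(score), total(score, optimal_assign(score)))
```

On this matrix the greedy pass locks in the single best pair first and is then forced into a poor second pair, while the exhaustive search finds the assignment with the larger total.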
Abstract: Aiming at the growing threat of distributed denial of service (DDoS) attacks under the rapid popularization of IPv6, this study proposes a two-stage DDoS defense mechanism, including a pre-detection stage to real-time monitor the early appearance of DDoS attacks and a deep-detection stage to accurately filter DDoS traffic after an alarm. First, the IPv6 traffic format is analyzed and the hexadecimal header fields are extracted from PCAP capture files as detection elements. Then, in the pre-detection stage, a lightweight binary convolutional neural network (BCNN) model is introduced and a two-dimensional traffic matrix is designed as model input, which can sensitively perceive the malicious situation caused by mixed DDoS traffic in the network as evidence of DDoS occurrence. After the alarm, the deep-detection stage will intervene with a one-dimensional convolutional neural network (1DCNN) model, which can specifically distinguish the mixed DDoS packets with one-dimensional packet vector as input to issue blocking policies. In the experiment, an IPv6-LAN topology is built and the proposed pure IPv6-DDoS traffic is generated by replaying the CIC-DDoS2019 public set through NAT 4to6. The results show that the proposed mechanism can effectively improve response speed, detection accuracy, and traffic filtering efficiency in DDoS defense. When DDoS traffic only takes 6% and 10% of the total network, BCNN can perceive the occurrence of DDoS with 90.9% and 96.4% accuracy, and the 1DCNN model can distinguish mixed DDoS packets with 99.4% accuracy at the same time.
Abstract: The smart contract is a decentralized application widely deployed on the blockchain platform, e.g., Ethereum. Due to the economic attributes, the vulnerabilities in smart contracts can potentially cause huge financial losses and destroy the stable ecology of Ethereum. Thus, it is crucial to detect the vulnerabilities in smart contracts before they are deployed to Ethereum. The existing smart contract vulnerability detection methods (e.g., Oyente and Secure) are mostly based on heuristic algorithms. The reusability of these methods is weak in different application scenarios. In addition, they are time-consuming and with low accuracy. In order to improve the effectiveness of vulnerability detection, this study proposes Scruple: a smart contract timestamp vulnerability detection approach based on learning data-flow path. It first obtains all possible propagation chains of timestamp vulnerabilities, then refines the propagation chains, uses a graph pre-training model to learn the relationship in the propagation chains, and finally detects whether a smart contract has timestamp vulnerabilities using the learned model. Compared with the existing detection methods, Scruple has a stronger vulnerability capture ability and generalization ability. Meanwhile, learning the propagation chain is not only well-directed but also can avoid an unnecessarily deep hierarchy of programs for the convergence of vulnerabilities. To verify the effectiveness of Scruple, this study uses real-world distinct smart contracts to compare Scruple with 13 state-of-the-art smart contract vulnerability detection methods. The experimental results show that Scruple can achieve 96% accuracy, 90% recall, and 93% F1-score in detecting timestamp vulnerabilities. In other words, the average improvement of Scruple over 13 methods using the three metrics is 59%, 46%, and 57% respectively. It means that Scruple has substantially improved in detecting timestamp vulnerabilities.
Abstract: As an important production factor, data need to be exchanged between different entities to create value. In this process, data integrity must be ensured; in other words, data must not be tampered with without authorization, or extremely serious consequences may follow. Existing work preserves data evidence by combining distributed ledgers with data encryption and verification technology to ensure the integrity of the data to be exchanged during transmission, storage, and other related data processing phases. However, such work can hardly confirm the integrity of the data provided by the data supplier. Once the data supplier provides forged data, all subsequent integrity assurance becomes meaningless. Therefore, this study proposes a method for verifying the integrity of data services based on remote attestation. By using the trusted execution environment as the trust anchor, this method can measure and verify the integrity of the static code, execution process, and execution result of a specific data service. It also optimizes the integrity verification of a specific data service through program slicing, thus extending the scope of data integrity assurance to the time point when the data supplier provides data. A series of experiments are carried out on 25 data services of three real Java information systems to validate the proposed method.
Abstract: In recent years, reinforcement learning methods based on environmental interactions have achieved great success in robotic applications, providing a practical and feasible solution for optimizing the behavior control strategies of robots. However, collecting interactive samples in the real world can lead to problems such as high cost and low efficiency. Therefore, the simulation environment is widely used in the training process of robot reinforcement learning. By obtaining a large number of training samples at a low cost in the virtual simulation environment for strategy training and transferring learning strategies to the real world, the security, reliability, and real-time problems in the real robot training process can be alleviated. However, due to the difference between the simulation environment and the real environment, it is often difficult to obtain ideal performance when directly transferring the strategy trained in the simulation environment to the real robot. To solve this problem, sim-to-real transfer reinforcement learning methods are proposed to reduce the environmental gap, so as to achieve effective strategy transfer. According to the direction of information flow in the process of transfer reinforcement learning and the different objects targeted by intelligent methods, this survey first proposes a sim-to-real transfer reinforcement learning framework, based on which the existing related work is then divided into three categories: the model optimization methods focusing on the real environment, the knowledge transfer methods focusing on the simulation environment, and the iterative policy promotion methods focusing on both simulation and real environments. Then, the representative technologies and related work in each category are described. Finally, the opportunities and challenges in this field are briefly discussed.
Abstract: In recent years, blockchain technology has attracted a lot of attention. As a distributed ledger technology, it has been applied to many fields due to its openness, transparency, and non-tamperability. However, as the number of users and access requirements rise, the performance bottleneck induced by the poor scalability of the existing blockchain architectures has restricted the application and promotion of blockchain technology. How to solve the scalability problem has become a hotspot issue in academia and industry. This study analyzes and summarizes the currently available blockchain scaling solutions. For this purpose, the study outlines the basic concept of blockchain and the origin of the scalability problem, defines the scalability problem, and proposes the metrics for scalability. Then, it presents a classification framework and reports the existing solutions in the manner of categorizing them into three classes: network scaling, on-chain scaling, and off-chain scaling. Different blockchain scalability solutions are analyzed for a comparison of their respective technical characteristics and a summary of their advantages and disadvantages. Finally, this study discusses the open issues that need to be addressed promptly and explores the future trends of blockchain technology.
Abstract: Under the new era of “human-machine-thing” ternary integration and ubiquitous computing, the software deployment and operation environment of “open and changeable”, “diverse needs”, and “complex scenarios” have put forward more requirements and higher expectations for the governance of open-source software library ecosystems. To further promote the construction of trusted software supply chain ecosystems and create an independent and controllable technical system based on the ubiquitous computing model, this study focuses on open-source software library ecosystems. It collects 348 authoritative papers in this field in the past two decades (2001–2023) and sorts out the research work of open-source software library management ecological governance technology. The study discusses the modeling and analysis, evolution and maintenance, quality assurance, and management of open-source software supply chain ecosystems, and summarizes the research status, problems, challenges and trends.
Abstract: As a typical distributed system, a blockchain depends on its underlying network, which strongly influences overall system performance and security. Blockchain networks differ from traditional P2P (peer-to-peer) networks in terms of security models, transmission protocols, and performance indicators. This study first systematically analyzes the blockchain network transmission process, i.e., connection establishment and data transmission, and lists the challenging issues. Second, state-of-the-art blockchain topology protocols and data transmission methods are thoroughly investigated and discussed from the perspectives of node heterogeneity, coding scheme, broadcast algorithm, relay network, etc. Meanwhile, typical cross-chain network implementations and network simulation tools are summarized. Finally, the study envisions possible future research trends in the realm of blockchain networks.
Abstract: Distributed storage systems are receiving more and more attention in mobile network scenarios. Data placement, a key technology of distributed storage, is crucial to improving the success rate of distributed data storage. However, due to unstable wireless signals and fluctuating network bandwidth in mobile environments, traditional data placement strategies, such as the random placement strategy and the storage-aware placement strategy, have low success rates of data transmission because neither takes network bandwidth into account during data placement. To solve this problem faced by mobile distributed storage systems, this study proposes a bandwidth-aware adaptive data placement strategy (BADP). The main breakthrough is that BADP adopts the group mobility model to sense the network bandwidth of nodes and takes the network bandwidth as an important factor for data placement, thus selecting nodes with good performance to achieve adaptive data placement and improve the success rate of data transmission. BADP consists of three design features: (1) adopting the group mobility model to sense the network bandwidth of nodes; (2) managing node information in groups to reduce communication overhead, and taking advantage of the heap to build a node selection tree; (3) selecting nodes with good performance using adaptive data placement to improve the success rate of data transmission. Experiments show that when the network changes dynamically, BADP improves the success rate of data transmission by at least 30.6% and 34.6% compared with the random placement strategy and the storage-aware placement strategy, respectively. At the same time, it consistently keeps communication overhead low.
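The heap-based node selection mentioned in design feature (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not BADP itself: the `Node` fields, the storage filter, and the rule "pick the k eligible nodes with the highest sensed bandwidth" are hypothetical, and the group mobility model that would produce the bandwidth estimates is not modeled.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    bandwidth_mbps: float  # assumed to come from some bandwidth-sensing step
    free_mb: int           # remaining storage on the node

def select_nodes(nodes, k, size_mb):
    """Return up to k nodes that can hold the data, highest bandwidth first."""
    eligible = [n for n in nodes if n.free_mb >= size_mb]
    # heapq.nlargest keeps a bounded heap internally, so the full list is never sorted.
    return heapq.nlargest(k, eligible, key=lambda n: n.bandwidth_mbps)

nodes = [
    Node("a", 10.0, 500),
    Node("b", 50.0, 100),   # fastest link, but too little free space
    Node("c", 30.0, 500),
    Node("d", 40.0, 500),
]
print([n.name for n in select_nodes(nodes, k=2, size_mb=200)])
```

A purely storage-aware strategy would treat nodes a, c, and d as equivalent here; ranking by bandwidth is what separates them.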
Abstract: Internet users need to resolve through DNS before accessing network applications. DNS security is the first portal to ensure the normal operation of the network. If the security of DNS cannot be effectively guaranteed, even if the level of security protection measures of other network systems is high, attackers can attack the DNS system to make the network unusable. At present, DNS malignant events still have an upward trend, and the development of DNS attack detection and defense technology still cannot meet practical needs. From the perspective of recursive servers that directly serve users’ DNS requests, this study comprehensively summarizes the security problems faced in the DNS process through two classification methods, including various security events caused by attacks or system vulnerabilities, different detection methods for various security events, and various defense and protection technologies. When summarizing various security events, detection and defense protection technologies, the study analyzes the characteristics of the corresponding typical methods and prospects for the future research direction of the DNS security field.
Abstract: GitHub is a well-known open-source software development community that supports developers using the issue tracking system in each open-source project on GitHub to address issues. During the discussion of an issue about a defect, the developer may point out issues from other projects correlated to the defect, which are called cross-project issues, so as to provide reference information for fixing the defect. However, there are more than 200 million open-source projects and 1.2 billion issues on the GitHub platform, making it time-consuming to identify and acquire cross-project issues manually. This study presents a cross-project issue recommendation method CPIRecom for open-source software defects. This study builds a pre-selection set by filtering issues based on the number of historical issue pairs and the time interval for reporting issues. Then, the study also proposes an accurate recommendation model, which extracts textual features based on the pre-trained model of BERT, analyzes features of projects, calculates the relevant probability between defects and issues from the pre-selection set based on a random forest classifier, and obtains the recommendation list according to the ranking. This study simulates the application of CPIRecom method on GitHub platform. The mean reciprocal rank of CPIRecom method reaches 0.603, and the Recall@5 reaches 0.715 on the simulative test set.
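For reference, the two ranking metrics reported above can be computed as in the sketch below. The abstract does not state the exact definitions used for CPIRecom, so the code assumes common ones: the reciprocal rank of the first relevant item averaged over queries, and per-query recall within the top k averaged over queries. All function names and the sample data are illustrative.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k):
    """Average share of each query's relevant items that appear in its top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        hits = len(set(ranked[:k]) & relevant)
        total += hits / len(relevant)
    return total / len(ranked_lists)

# Two hypothetical defects with their recommendation lists and true cross-project issues.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x", "q"}]
print(mean_reciprocal_rank(ranked, relevant))  # (1/2 + 1/1) / 2
print(recall_at_k(ranked, relevant, k=2))      # (1/1 + 1/2) / 2
```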
Abstract: The fuzzy C-means (FCM) clustering algorithm has become one of the commonly used image segmentation techniques owing to its low learning cost and algorithm overhead. However, the conventional FCM clustering algorithm is sensitive to noise in images. Recently, many improved FCM algorithms have been proposed to improve the noise robustness of the conventional FCM clustering algorithm, but often at the cost of losing image detail. This study presents an improved FCM clustering algorithm based on Lie group theory and applies it to image segmentation. The proposed algorithm constructs matrix Lie group features for the pixels of an image, which summarize the low-level image features of each pixel and its relationship with other pixels in the neighborhood window. In this way, the proposed method transforms the clustering problem of measuring the Euclidean distances between pixels into calculating the geodesic distances between Lie group features of pixels on the Lie group manifold. To update the clustering centers and the fuzzy membership matrix on the Lie group manifold, the proposed method uses an adaptive fuzzy weighted objective function, which improves the generalization and stability of the algorithm. The effectiveness of the proposed method is verified by comparing it with conventional FCM and several classic improved algorithms in experiments on three types of medical images.
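As background for the conventional algorithm that the abstract above improves on, the sketch below shows the standard FCM alternation between membership and center updates on 1-D data with absolute (Euclidean) distance. It does not implement the Lie group features, the geodesic distance, or the adaptive weighted objective of the proposed method; the data, the initial centers, and the fuzzifier m = 2 are illustrative assumptions.

```python
def update_memberships(points, centers, m=2.0):
    """u[j][i] = 1 / sum_k (d_i / d_k)^(2/(m-1)) for point j and center i."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # A point sitting exactly on a center belongs fully to that center.
            row = [0.0] * len(centers)
            row[dists.index(0.0)] = 1.0
        else:
            row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        memberships.append(row)
    return memberships

def update_centers(points, memberships, m=2.0):
    """Each center is the mean of all points weighted by membership^m."""
    centers = []
    for i in range(len(memberships[0])):
        weights = [row[i] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers

def fcm(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        memberships = update_memberships(points, centers, m)
        centers = update_centers(points, memberships, m)
    return centers, update_memberships(points, centers, m)

# Hypothetical 1-D intensities forming two well-separated groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers, memberships = fcm(points, centers=[0.0, 5.0])
print(centers)
```

Because every point contributes to every center, a single noisy value shifts all centers slightly, which is one way to see the noise sensitivity mentioned in the abstract.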
Abstract: This study focuses on Code Generation task that aims at generating relevant code fragments according to given natural language descriptions. In the process of software development, developers often encounter two scenarios. One is writing a large amount of repetitive and low-technical code for implementing common functionalities. The other is writing code that depends on specific task requirements, which may necessitate external resources such as documentation or other tools. Therefore, code generation has received a lot of attention among academia and industry for assisting developers in coding. It has also been one of the key concerns in the field of software engineering to make machines understand users’ requirements and write programs on their own. The recent development of deep learning techniques, especially pre-training models, makes the code generation task achieve promising performance. In this study, the current work on deep learning-based code generation is systematically reviewed and the current deep learning-based code generation methods are classified into three categories: methods based on code features, methods incorporated with retrieval, and methods incorporated with post-processing. The first category refers to the methods that use deep learning algorithms for code generation based on code features, and the second and third categories improve the performance of the methods in the first category. The existing research results of each category of methods are systematically reviewed, summarized, and commented. Besides, the study analyzes the corpus and the popular evaluation metrics used in the existing code generation work. Finally, it summarizes the overall literature review and provides a prospect for future research directions worthy of attention.
Abstract: The current authentication protocol based on username and password has been difficult to meet the increasing security requirements. Specifically, users choose different passwords to access different online services, which greatly increases the user’s memory burden. In addition, password authentication has low security and faces many known attacks. To solve such problems, this study proposes a user-centric two-factor authentication key agreement protocol UC-2FAKA based on the Pointcheval-Sanders signature. Firstly, to prevent the leakage of authentication factors, passwords, and biometric two-factor credentials are constructed based on the Pointcheval-Sanders signature. The identity is authenticated to the service provider (SP) in a zero-knowledge proof manner. Secondly, using a user-centric single sign-on (SSO) architecture, users can request identity credentials by registering with an identity provider (IDP) to log in different SPs to avoid IDP or SP tracking or linking users. Thirdly, the Diffie-Hellman key exchange is used to authenticate SP identities and negotiate communication keys to ensure subsequent communication security. Finally, comprehensive security analysis and performance comparison of the proposed protocol are carried out. The results show that the proposed protocol can resist various known attacks, and the proposed protocol performs better in communication overhead and computational overhead.
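The key negotiation step can be illustrated with a textbook Diffie-Hellman exchange. This is a toy sketch under loud assumptions: the modulus (the Mersenne prime 2^127 - 1) and base 3 are chosen only to keep the example short and are not secure parameters, the function names are hypothetical, and nothing here reflects the UC-2FAKA message flow. Plain Diffie-Hellman also provides no authentication by itself, so the SP authentication described in the abstract is outside the scope of this sketch.

```python
import secrets

# Toy parameters for illustration only; real deployments use standardized groups
# or elliptic curves with far larger security margins.
P = 2**127 - 1  # a Mersenne prime
G = 3           # base; assumed adequate for a demo, not vetted as a generator

def keypair():
    """Return (private exponent, public value G^private mod P)."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return private, pow(G, private, P)

def shared_secret(own_private, peer_public):
    """Both sides compute G^(a*b) mod P from their own secret and the peer's public value."""
    return pow(peer_public, own_private, P)

user_private, user_public = keypair()
sp_private, sp_public = keypair()

# Only the public values cross the network; the derived secrets match.
print(shared_secret(user_private, sp_public) == shared_secret(sp_private, user_public))
```

In practice the shared value would be passed through a key derivation function before being used as a session key.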
Abstract: Existing hypergraph network representation methods need to analyze the full batch nodes and hyperedges to recursively extend the neighbors across layers, which brings huge computational costs and leads to lower generalization accuracy due to over-expansion. To solve this problem, this study proposes a hypergraph network representation method based on importance sampling. First, the method treats nodes and hyperedges as two sets of independent identically distributed samples that satisfy specific probability measures and interprets the structural feature interactions of the hypergraph in an integral form. Second, it designs a neighbor importance sampling rule with learnable parameters and calculates sampling probabilities based on the physical relations and features of nodes and hyperedges. A fixed number of objects are recursively acquired layer by layer to construct a smaller sampled adjacency matrix. Finally, the spatial features of the entire hypergraph are approximated using Monte Carlo methods. In addition, with the advantage of physically informed neural networks, the sampling variance that needs to be reduced is added to the hypergraph neural network as a physical constraint to obtain sampling rules with better generalization capability. Extensive experiments on multiple datasets show that the method proposed in this study can obtain more accurate hypergraph representation results with a faster convergence rate.
Abstract: Fast vulnerability root cause analysis is crucial for patching vulnerabilities and has always been a hotspot in academia and industry. The existing vulnerability root cause analysis methods, which rely on statistical feature analysis of a large number of test sample execution records, suffer from problems such as random noise and missing important logically correlated instructions. According to the test set measurement in this study, the proportion of random noise in the existing statistical methods exceeds 61%. To solve the above problems, this study proposes a vulnerability root cause analysis method based on the local path graph, which extracts vulnerability-related information such as the inter-function call graph and intra-function control flow transfer graph from the execution paths. The local path graph is used to eliminate irrelevant instructions (i.e., noise instructions), construct the logical relations among the points relevant to the vulnerability root cause, and add missing critical instructions. An automated root cause analysis system for binary software, LGBRoot, has been implemented. The effectiveness of the system has been evaluated on a dataset of 20 public CVE memory corruption vulnerabilities. The average time for single-sample root cause analysis is 12.4 seconds. The experimental data show that the system can automatically eliminate 56.2% of noise instructions, and repair as well as visualize the 20 logical structures of the points relevant to the vulnerability root cause, speeding up vulnerability analysis by analysts.
Abstract: Disassembly of binary codes is hard but necessary for improving the security of binary software. One of the major reasons for the difficult binary disassembly is that the compilers create many jump tables in the binary code for efficiency. In order to solve the targets of the jump table, mainstream disassembly tools use various strategies. However, the details of the implementation of these strategies and their effectiveness are not well studied. To help researchers to well understand the algorithm implementation and performance of disassembly tools, this study first systematically summarizes the strategies used by disassembly tools to solve jump tables; then the study builds an automatic framework for testing jump tables, based on which a large-scale testsuite on jump tables (2410455 jump tables) can be generated. Lastly, this study evaluates the performance of the disassembly tools in solving jump tables on the testsuite and manually analyzes the errors introduced by each strategy of the disassembly tools. In addition, this study finds six bugs in the implementation of the disassembly tools benefiting from the systematic summary of the implementation of the disassembly tool algorithm.
Abstract: The database performance is affected by the database configuration parameters. The quality of parameter settings will directly affect the performance of the database. Therefore, the quality of the database parameter tuning method is important. However, traditional database parameter tuning methods have many limitations, such as the inability to make full use of historical parameter tuning data, wasting time and human resources, and so on. The counterfactual interpretation methods aim to change the original prediction to the expected prediction by making small modifications to the original data. The method plays a role of suggestion, and this can be used for database configuration optimization, namely, making small modifications to the database configuration to optimize the performance of the database. Therefore, this study proposes a counterfactual interpretation method for database configuration optimization. For databases with poor performance under specific load conditions, this method can modify the database configuration and generate corresponding database configuration counterfactuals to optimize database performance. This study conducts two kinds of experiments to evaluate the counterfactual interpretation method and verify the effect of optimizing the database. The experimental results show that the counterfactual interpretation methods proposed in this study are better than other typical counterfactual interpretation methods in terms of various evaluation indicators, and the generated counterfactuals can effectively improve database performance.
Abstract: Parallel computing has become the mainstream. Among all the parallel computing systems, synchronization is one of the critical designs and is imperative to fully utilize the hardware performance. In recent years, GPU, as the most widely used accelerator, has developed rapidly, and many applications have placed greater demands on GPU thread synchronization. However, current GPUs cannot support thread synchronization efficiently in many real-world applications. Although many approaches have been proposed to support GPU thread synchronization and much progress has been made, the unique architecture and parallel pattern of GPUs still lead to many challenges in GPU thread synchronization research. In this study, thread synchronization in GPU parallel programming is divided into different categories according to different synchronization purposes and granularity. Around the synchronization expression and execution, the key problems and challenges of synchronization on GPUs are firstly analyzed, i.e., being difficult to express efficiently, incurring frequent concurrency bugs, and low execution efficiency. Secondly, the study introduces the research on synchronization for thread contention and synchronization for thread cooperation on GPUs in academia and industry in recent years from two aspects of thread synchronization expression method and performance optimization method based on different GPU thread synchronization granularity. Then the existing research methods are analyzed. On this basis, the study points out the future research trends and development prospects of GPU thread synchronization and feasible research methods, providing a reference for researchers in this field.
Abstract: Conformance checking is one of the important scenarios in the field of process mining, and its goal is to determine whether the actual running business behavior is consistent with the desired behavior and then provide a basis for business process management decisions. Traditional methods of conformance checking face the problems of too many metrics and low efficiency. In addition, the existing methods for checking the conformance between process text and process model rely heavily on expert-defined knowledge. Therefore, this study proposes a process text-oriented conformance checking method. Firstly, the study generates graph traces based on the execution semantics of the process model and obtains the structural features from the graph traces with a word vector model. At the same time, Huffman trees are introduced to reduce the computational effort. Then, word vector representations of the process text and the activities are computed. The study also uses the Siamese mechanism to improve training efficiency. Finally, all the features of the text and the model are fused, and the consistency score between the text and the model is predicted using a fully connected layer. Experiments show that the mean absolute error of the method in this study is two percentage points lower than that of existing methods.
Abstract: The major challenges traditional operating system (OS) design faces are the increasing number, diversity, and distribution scope of resources to be managed and the frequent changes in system state. However, the structures of existing OSs have become the biggest obstacle to solving the above problems as (1) tight coupling and centralization of the structure lead to poor flexibility and scalability and separate OS ecology; (2) contradiction between various capabilities, e.g., security and performance, due to the unitary isolation mechanism such as kernel-user isolation. Therefore, this study combines the hierarchical software bus (softbus) principles with isolation mechanisms to organize the OS and proposes a new OS model termed Yggdrasil. Yggdrasil decomposes an OS into component nodes connected by softbuses, whose communications are standardized to message passing via the softbus. To support the division of isolated states such as supervisor mode and different software hierarchies, Yggdrasil introduces bridge nodes for cascading and controlled communication between softbuses, and enhances the logical representation capability and scalability of OS through self-similar topology. Additionally, the simplicity and hierarchy of the softbus help to achieve decentralization. To verify the feasibility of Yggdrasil, the study builds hierarchical softbus model for OS (HiBuOS) and demonstrates the feasibility of developing a new OS based on Yggdrasil’s ideas through three specific designs: (1) designing and planning a hierarchical softbus structure according to the scale and requirements of the target operating system; (2) selecting specific isolation and communication mechanisms to instantiate bridge nodes and softbuses; (3) realizing OS services based on the hierarchical softbus style. 
Finally, the evaluation shows that HiBuOS has notable potential and advantages to enhance system scalability, security, performance, and ecological development without significant performance loss.
Abstract: The informationization 3.0 represented by deep mining and fusion applications of big data is starting, and the software in the traditional static environment is evolving into complex software in the human-cyber-physical ternary environment which is open and dynamic. How to realize the trusted, manageable, and controllable data interconnection on the untrusted and uncontrollable Internet is an urgent problem to be solved. The Internet of Data technical system represented by digital object architecture and identifier resolution technology provides a feasible solution for these challenges. In order to solve the problems such as low transmission efficiency, high coordination cost, and security management issues in the process of data sharing on the Internet, this study proposes identifier resolution standard specifications for human-cyber-physical ternary environments. Moreover, to meet the demands that data resources owned by different entities need to be discoverable, accessible, understandable, trustworthy, and interoperable in the human-cyber-physical ternary environment, this study designs the identifier resolution protocol and implements the identifier/resolution prototype system for human-cyber-physical ternary environments. At last, this study tests the performance of the prototype system, and the validity of the system is verified by applying it to application scenarios.
Abstract: Static analysis tools often suffer from high false positive rates of reported alarms, despite their ability to aid developers in detecting potential defects early in the software development life cycle. To improve the availability of these tools, many automated warning identification techniques have been proposed to assist developers in classifying false positive alarms. However, existing approaches mainly focus on using hand-engineered features or statement-level abstract syntax tree token sequences to represent the defective code, failing to capture semantics from the reported alarms. To overcome the limitations of traditional approaches, this study employs deep neural networks with powerful feature extraction and representation abilities to generate code semantics from control flow graph paths for warning identification. The control flow graph abstractly represents the execution process of a given program. Thus, the generated path sequences of the control flow graph can guide the deep neural networks to learn semantic information about the potential defect more accurately. In this study, the pre-trained language model is fine-tuned to encode the path sequences and capture the semantic representations for model building. Finally, the study conducts extensive experiments on eight open-source projects to verify the effectiveness of the proposed approach by comparing it with the state-of-the-art baselines.
Abstract: Functions are the smallest named units of aggregated behavior in most traditional programming languages. The readability of function names plays a vital role in programmers’ understanding of program functionality and the interaction between different modules. Low-quality function names may confuse developers, introduce code smells, and eventually result in software defects caused by API misuse. Therefore, this study proposes DMName, a deep learning-based method for function name consistency checking and recommendation. First, for the given source code of the target function, the internal context, interactive context, sibling context, and closed context are constructed, and the context information tag sequence is obtained by merging them. Then, the tag sequence is converted into a sequence of context representation vectors by the word embedding technology FastText and input into the encoder of the seq2seq model. The copy mechanism and coverage mechanism are utilized to solve the out-of-vocabulary (OOV) problem and the repeated decoding problem, respectively. Next, the vector sequence of the predicted target function name is output, and the consistency of the function name is predicted with the help of a two-channel CNN classifier. If the function name is inconsistent, the recommended function name is obtained by direct mapping according to vector space similarity matching. The experimental results show that the F1-measure of DMName in function name consistency checking and recommendation reaches 82.65% and 73.31%, respectively, which is 2.01% and 2.96% higher than that of the current best method, DeepName. Finally, DMName is verified on lancia, a large-scale open-source project on GitHub. A total of 16 function name inconsistency problems are found, and reasonable name recommendations are made, which further confirms the effectiveness of DMName.
Abstract: Open source software has become a key infrastructure of modern society, supporting software development in almost every field. Through various kinds of code reuse such as install dependency, API call, project fork, file copy, and code clone, open source software forms an intricate supply (i.e., dependency) network, which is referred to as an open source software supply chain. On the one hand, software supply chains facilitate software development and have become the foundation of the software industry. On the other hand, risks from upstream software can affect downstream software along the supply chain, leading to a ripple effect in open source software supply chains. Open source software supply chains have attracted increasing attention from both academia and industry. To help advance researchers’ knowledge of open source software supply chains, this study provides a definition and research framework of open source software supply chains from a holistic perspective. Then, it conducts a systematic literature review on worldwide research and summarizes the status quo of research from three aspects: structure and evolution, risk propagation and management, and dependency management. Finally, the study summarizes the challenges and opportunities of future research on open source software supply chains.
Abstract: Many two-party threshold schemes for SM2 digital signatures have been proposed in recent years, which can significantly enhance the security of private keys for SM2 digital signatures. According to the method of key splitting, the published schemes can be divided into two types: multiplicative key splitting and additive key splitting. These schemes can be further subdivided according to how the signature random number is constructed. This study proposes a framework of two-party threshold schemes for SM2 digital signatures, which provides a secure basic calculation process and introduces a signature random number that can be constructed in various ways. By instantiating the framework with various constructions of the random number, this study obtains a variety of two-party threshold schemes for SM2 digital signatures. The instantiations include 23 known two-party threshold schemes as well as a variety of new schemes.
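The two key-splitting styles named in this abstract can be illustrated with plain modular arithmetic. The following is a minimal sketch under stated assumptions: the modulus `N` is a toy prime rather than the SM2 curve order, the function names are hypothetical, and which secret value a concrete scheme actually splits (the private key itself or a value derived from it) is not specified by the abstract.

```python
import secrets

N = 2**127 - 1  # toy prime modulus (assumption); not the SM2 curve order


def split_additive(d, n=N):
    """Split d into two shares with d = (d1 + d2) mod n."""
    d1 = secrets.randbelow(n - 1) + 1  # random share in [1, n-1]
    d2 = (d - d1) % n
    return d1, d2


def split_multiplicative(d, n=N):
    """Split d into two shares with d = (d1 * d2) mod n.

    n is prime, so every nonzero d1 has a modular inverse.
    """
    d1 = secrets.randbelow(n - 1) + 1
    d2 = (d * pow(d1, -1, n)) % n  # modular inverse, Python 3.8+
    return d1, d2


secret = 123456789
a1, a2 = split_additive(secret)
m1, m2 = split_multiplicative(secret)
print((a1 + a2) % N == secret, (m1 * m2) % N == secret)
```

Neither share alone reveals the secret; recombination requires both parties, which is the property a two-party threshold signature builds on.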
Abstract: With the rapid development of neural network technology, neural networks have been widely applied in safety-critical fields such as autonomous driving, intelligent manufacturing, and medical diagnosis. Thus, it is crucial to ensure the trustworthiness of neural networks. However, due to the vulnerability of neural networks, slight perturbations often lead to wrong results. Therefore, it is vital to use formal verification methods to ensure the safety and trustworthiness of neural networks. Current verification methods for neural networks are mainly concerned with the accuracy of the analysis and tend to ignore operational efficiency. When verifying the safety properties of complex networks, the large-scale state space may lead to problems such as infeasibility or unsolvability. To reduce the state space of neural networks and improve the verification efficiency, this study presents a divide-and-conquer formal verification method for neural networks that takes over-approximation errors into account. The method uses the reachability analysis technique to calculate the upper and lower bounds of nonlinear nodes and uses an improved symbolic linear relaxation method to reduce over-approximation errors during the boundary calculation of nonlinear nodes. The constraints of nodes are refined by calculating the direct and indirect effects of their over-approximation errors. Thereby, the original verification problem is split into a set of sub-problems whose mixed integer linear programming (MILP) formulation has a smaller number of constraints. The method is implemented as a tool named NNVerifier, which is verified and evaluated through experiments on four ReLU-based fully-connected benchmark networks trained on three classic datasets. The experimental results show that the verification efficiency of NNVerifier is 37.18% higher than that of the existing complete verification methods.
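To make the bound calculation for nonlinear nodes concrete, the sketch below propagates input intervals through one affine layer followed by ReLU. It uses plain interval arithmetic, which is coarser than the improved symbolic linear relaxation the abstract describes, and all function names are hypothetical. Neurons whose pre-activation interval straddles zero are the ones that typically need a binary variable in an MILP encoding.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate interval bounds through y = W x + b (weights is a list of rows).

    A positive weight takes the matching bound of its input; a negative
    weight takes the opposite bound.
    """
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = b + sum(w * (l if w >= 0 else u) for w, l, u in zip(row, lower, upper))
        hi = b + sum(w * (u if w >= 0 else l) for w, l, u in zip(row, lower, upper))
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


def unstable_neurons(lower, upper):
    """Indices whose pre-activation interval contains both signs."""
    return [i for i, (l, u) in enumerate(zip(lower, upper)) if l < 0 < u]


# Inputs in [-1, 1]^2, one layer with two neurons.
pre_lo, pre_hi = affine_bounds([-1.0, -1.0], [1.0, 1.0],
                               [[1.0, -1.0], [2.0, 1.0]], [0.0, 3.0])
print(pre_lo, pre_hi)                    # [-2.0, 0.0] [2.0, 6.0]
print(unstable_neurons(pre_lo, pre_hi))  # [0]
```

Tighter bounds leave fewer unstable neurons, which is why reducing over-approximation error shrinks the resulting sub-problems.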
Abstract: As one of the ten block cipher algorithms selected for the second round of the 2018 National Cryptographic Algorithm Design Contest, Feistel-based block cipher (FBC) is an efficient and lightweight block cipher algorithm with a four-branch and two-fold Feistel structure. In this study, the FBC algorithm is abstracted as the FBC model, and the pseudorandomness and super-pseudorandomness of the model are studied. It is assumed that the FBC round functions are independent random functions, and a method to find the minimal number of FBC rounds is provided, which will keep FBC indistinguishable from a random permutation. Finally, the study comes to the conclusion that under the chosen-plaintext attack, four rounds of FBC are indistinguishable from random permutation, so the model has pseudorandomness; under the adaptive chosen-plaintext and ciphertext attack, five rounds of FBC are indistinguishable from random permutation, so the model has super-pseudorandomness.
Abstract: Few-shot learning aims at simulating the ability of human beings to quickly learn new things from only a few samples, which is of great significance for deep learning tasks when samples are limited. However, in many practical tasks with limited computing resources, the model scale may still limit a wider application of few-shot learning. This study therefore raises the practical requirement of lightweight models for few-shot learning. As a widely used auxiliary strategy in deep learning, knowledge distillation transfers knowledge between models by using additional supervised information, which has practical application in both improving model accuracy and reducing model scale. This study first verifies the effectiveness of the knowledge distillation strategy in making few-shot learning models lightweight. Then, according to the characteristics of few-shot learning, two new distillation methods for few-shot learning are designed: (1) distillation based on image local features; (2) distillation based on auxiliary classifiers. Experiments on miniImageNet and TieredImageNet datasets demonstrate that the new distillation methods are significantly superior to traditional knowledge distillation in few-shot learning tasks. The source code is available from https://github.com/cjy97/FSLKD.
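For reference, the traditional logit-based knowledge distillation that this abstract uses as its baseline can be sketched as a temperature-softened KL divergence between teacher and student outputs. This is a minimal sketch of the classic formulation only; it is not the local-feature or auxiliary-classifier method proposed in the study, and the function names and default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2


print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))      # 0.0
print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]) > 0)  # True
```

In training, this term is usually added to the ordinary classification loss of the smaller student model.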
Abstract: Subset repair for inconsistent data is an important research problem in the field of data cleaning. Most of the existing methods are based on integrity constraint rules and adopt the principle of the minimum number of deleted tuples for subset repair. However, these methods take no account of the quality of deleted tuples, and the repair accuracy is low. Therefore, this study proposes a subset repair method combining rules and probabilities. The probability of inconsistent tuples is modeled so that the average probability of correct tuples is greater than that of wrong tuples, and the optimal subset repair with the smallest sum of the probability of deleted tuples is calculated. In addition, in order to reduce the time overhead of calculating the probability of inconsistent tuples, this study proposes an efficient error detection method to reduce the size of inconsistent tuples. Experimental results on real data and synthetic data verify that the proposed method outperforms the state-of-the-art subset repair method in terms of accuracy.
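The difference between deleting the fewest tuples and deleting the tuples with the smallest total probability can be seen on a toy example. The sketch below assumes a single functional dependency as the integrity constraint and treats each tuple's probability as its likelihood of being correct; both are illustrative assumptions, and the exhaustive search is exponential and stands in for whatever optimization the study actually uses.

```python
from itertools import combinations


def conflicts(rows, lhs, rhs):
    """Pairs of row indices that violate the functional dependency lhs -> rhs."""
    return [(i, j) for i, j in combinations(range(len(rows)), 2)
            if rows[i][lhs] == rows[j][lhs] and rows[i][rhs] != rows[j][rhs]]


def min_probability_repair(rows, probs, lhs, rhs):
    """Find the deletion set with the smallest total probability that
    leaves no violating pair. Brute force, for illustration only."""
    pairs = conflicts(rows, lhs, rhs)
    best, best_cost = None, float("inf")
    for k in range(len(rows) + 1):
        for deleted in combinations(range(len(rows)), k):
            chosen = set(deleted)
            if all(i in chosen or j in chosen for i, j in pairs):
                cost = sum(probs[i] for i in chosen)
                if cost < best_cost:
                    best, best_cost = chosen, cost
    return best, best_cost


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "10001", "city": "Boston"},
]
probs = [0.9, 0.2, 0.3]  # assumed probability that each tuple is correct
print(min_probability_repair(rows, probs, "zip", "city"))
```

Here a minimum-count repair would delete only the first tuple, while the probability-aware repair deletes the two low-probability tuples and keeps the likely-correct one.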
Abstract: In recent years, software system security issues have attracted increasing attention. Security threats in systems can be easily exploited by attackers, who usually use various attack techniques such as password brute-force cracking, phishing, and SQL injection. Threat modeling is a method of structurally analyzing, identifying, and handling threats. Traditional testing mainly focuses on code defects and takes place in the late stage of software development. It does not connect well with the results of early threat modeling and analysis for building secure software, and threat modeling tools in the industry lack the function of generating security tests. To tackle this problem, this study proposes a framework that generates security test cases from threat models and designs and implements a tool prototype. To facilitate testing, this study improves the traditional attack tree model and performs compliance checks. Test scenarios can be automatically generated from the model. The test scenarios are evaluated according to the probabilities of attack nodes, and the scenarios of threats with higher probabilities are tested first. The defense nodes are evaluated, and the defense scheme with the higher profit is selected to mitigate the threats, so as to improve the system’s security design. By setting parameters for attack nodes, test scenarios can be concretized into test cases. In the early stage of software development, with the threats identified by threat modeling as inputs, test cases can be generated through this framework and tool to guide subsequent security development and test design, which improves the integration of security technology into software design and development. The case study applies this framework and tool to test generation for very high security risks, which shows their effectiveness.
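The idea of deriving test scenarios from an attack tree and testing the most probable ones first can be sketched as follows. The tree encoding, the node names, and the assumption that leaf probabilities are independent and multiply along an AND node are all illustrative; the improved attack tree model of the study is not specified in the abstract.

```python
from itertools import product
from math import prod


def scenarios(node):
    """Return all attack scenarios under a node.

    Each scenario is a list of (leaf_name, probability) pairs. An OR node
    contributes each child's scenarios separately; an AND node combines one
    scenario from every child.
    """
    kind = node["type"]
    if kind == "leaf":
        return [[(node["name"], node["p"])]]
    child_sets = [scenarios(child) for child in node["children"]]
    if kind == "or":
        return [s for group in child_sets for s in group]
    if kind == "and":
        return [sum(combo, []) for combo in product(*child_sets)]
    raise ValueError(f"unknown node type: {kind}")


def ranked_scenarios(root):
    """Scenarios as (leaf names, probability), most probable first."""
    scored = [([name for name, _ in s], prod(p for _, p in s))
              for s in scenarios(root)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


tree = {"type": "or", "children": [
    {"type": "and", "children": [
        {"type": "leaf", "name": "phishing", "p": 0.5},
        {"type": "leaf", "name": "password_brute_force", "p": 0.4},
    ]},
    {"type": "leaf", "name": "sql_injection", "p": 0.3},
]}
for names, probability in ranked_scenarios(tree):
    print(names, probability)
```

Each ranked scenario could then be turned into a concrete test case by attaching parameters to its attack nodes.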
Abstract: Robots are increasingly entering people’s daily lives and are receiving growing attention in China and abroad. Security is one of the important characteristics of robotic systems, and enhancing it can protect robots from malicious attackers. The security of the robot operating system (ROS) is an important part of the security of robotic systems. Although researchers have done a lot of work on the security of ROSs in recent years, the topic has unfortunately not yet received enough attention. In order to draw more attention to the security of robotic systems and help people quickly understand the security solutions of the current mainstream ROS, this study systematically investigates and summarizes the security of ROSs. On the one hand, this study analyzes the security features of ROSs and discusses the known security problems in ROSs. On the other hand, this study categorizes and summarizes the research related to the security of ROSs in recent years and compares the security solutions of ROSs in terms of confidentiality, integrity, and availability. Finally, this study discusses future directions of security research on ROSs.
Abstract: Multimodal sentiment analysis is a task that uses subjective information from multiple modalities to analyze sentiment. Exploring how to effectively learn the interaction between modalities has always been an essential task in multimodal analysis. In recent research, it is found that the learning rate of different modalities is unbalanced, leading to the convergence of one modality while the rest of the modalities are under-fitting, which weakens the effect of multimodal collaborative decision-making. In order to combine multiple modalities more effectively and learn the multimodal sentiment features with rich expression, this study proposes a multimodal sentiment analysis method based on adaptive weight fusion. The method of adaptive weight fusion is divided into two phases. The first phase is to adaptively change the fusion weights of unimodal feature representations according to the difference of unimodal learning gradients to dynamically balance the modal learning rate. The study calls this phase balanced fusion (B-fusion). The second phase is to eliminate the impact of the fusion weights of B-fusion on task analysis, propose the modal attention to explore the contributions of modalities to the task, and dynamically allocate the fusion weight to each modality. The study calls this phase attention fusion (A-fusion). The experimental results show that the introduction of the B-fusion method into existing multimodal sentiment analysis methods can effectively improve the accuracy of sentiment analysis. The ablation experiment results show that adding the A-fusion method to B-fusion can effectively reduce the impact of B-fusion weights on the task, which is conducive to improving the analysis results of sentiment analysis. 
Compared with the existing multimodal sentiment analysis models, the proposed method has a simpler structure, lower computational consumption, and better task accuracy, which shows that the method has high efficiency and excellent performance in multimodal sentiment analysis tasks.
Abstract: Revealing the complex relations among emotions is an important fundamental study in cognitive psychology. From the perspective of natural language processing, the key to exploring the relations among emotions lies in the embedded representation of emotional categories. Recently, there has been some interest in obtaining a category representation in the emotion space that can characterize emotion relations. However, the existing methods for emotion category representations have several drawbacks. For example, the dimensionality of the emotion category representation is fixed and depends on the selected dataset. In order to obtain better representations for the emotion categories, this study introduces a supervised contrastive learning representation method. In previous supervised contrastive learning, the similarity between samples depends on the similarity of the annotated labels of the samples. In order to better reflect the complex relations among different emotion categories, the study further proposes a partially similar supervised contrastive learning representation method, which assumes that samples of different emotion categories (e.g., anger and annoyance) may also be partially similar to each other. Finally, the study organizes a series of experiments to verify the ability of the proposed method and five other benchmark methods to represent the relationship between emotion categories. The experimental results show that the proposed method achieves satisfactory results for the emotion category representations.
Abstract: The detection of the human respiration waveform in the sleep state is crucial for applications in intelligent health care and medical care because different respiration waveform patterns can be examined to analyze sleep quality and monitor respiratory diseases. Traditional respiration sensing methods based on contact devices cause various inconveniences to users. In contrast, contactless sensing methods are more suitable for continuous monitoring. However, the randomness of the device deployment, sleep posture, and human motion during sleep severely restricts the application of contactless respiration sensing solutions in daily life. For this reason, the study proposes a detection method for the human respiration waveform in the sleep state based on impulse radio-ultra wide band (IR-UWB). On the basis of the periodic changes in the propagation path of the wireless pulse signal caused by the undulation of the human chest during respiration in the sleep state, the proposed method generates a fine-grained human respiration waveform and thereby achieves the real-time output of the respiration waveform and high-precision respiratory rate estimation. Specifically, to obtain the position of the human chest during respiration from the received wireless radio-frequency (RF) signals, this study proposes an indicator based on IR-UWB signals, the respiration energy ratio, to estimate the target position. Then, it puts forward a vector projection method based on the in-phase/quadrature (I/Q) complex plane and a method of projection signal selection based on the circumferential position of the respiration vector to extract the characteristic human respiration waveform from the reflected signal. Finally, a variational encoder-decoder network is leveraged to achieve the fine-grained recovery of the respiratory waveform in the sleep state. 
Extensive experiments and tests are conducted under different conditions, and the results show that the human respiration waveforms monitored by the proposed method in the sleep state are highly similar to the actual waveforms captured by commercial respiratory belts. The average error of the proposed method in estimating the human respiratory rate is 0.229 bpm, indicating that the method can achieve high-precision detection of the human respiration waveform in the sleep state.
Abstract: Detecting out-of-distribution (OOD) samples, i.e., samples outside the training set distribution, is essential for a safe and reliable machine learning system. Likelihood-based generative models are popular for OOD detection because they do not require sample labels during training. However, recent studies show that likelihoods sometimes fail to detect OOD samples, and the reasons for the failure and the solutions are underexplored, especially for text data. Therefore, this study investigates the reasons for the failure on text from the perspectives of the model and the data: insufficient generalization of the generative model and prior probability bias of the text. To tackle these problems, the study proposes a new OOD text detection method, namely Pobe. To address the insufficient generalization of the generative model, the study improves the model generalization via KNN retrieval. Next, to address the prior probability bias of the text, the study designs a strategy that uses a pre-trained language model to calibrate the bias and mitigate its influence on OOD detection, and demonstrates the effectiveness of the strategy according to Bayes’ theorem. Experimental results over a wide range of datasets show the effectiveness of the proposed method. Specifically, the average AUROC is over 99%, and FPR95 is below 1% on eight datasets.
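The FPR95 figure reported in this abstract is the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are still accepted. A minimal sketch of that metric follows, assuming that a higher score means "more in-distribution" (for instance a likelihood); the function name and score convention are assumptions, not taken from the study.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as in-distribution at the threshold
    that keeps at least 95% of in-distribution samples.

    Higher score is assumed to mean more in-distribution.
    """
    ranked = sorted(id_scores, reverse=True)
    keep = (95 * len(ranked) + 99) // 100  # integer ceil of 0.95 * n
    threshold = ranked[keep - 1]
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)


id_scores = [float(v) for v in range(1, 21)]  # 20 in-distribution scores
ood_scores = [0.5, 1.5, 2.5, 10.0]
print(fpr_at_95_tpr(id_scores, ood_scores))   # 0.5
```

A value below 1% therefore means that almost no OOD text passes the detector while 95% of in-distribution text is retained.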
Abstract: Attendance may serve private purposes not associated with any organization, such as keeping a personal travel log, or business needs, in which case it is part of organizational attendance and sometimes associated with multiple organizations. Therefore, the recording, sharing, and analysis of attendance data require elaborate management. The HAO attendance system is a lightweight and mobile attendance platform. It takes the user and organization as two starting points and is driven by HAO intelligence consisting of human intelligence (HI), artificial intelligence (AI), and organizational intelligence (OI). This study builds the knowledge graph of the HAO attendance system and puts forward the closed-loop authority management structure of the HAO attendance system, supplemented by a privacy authority management method ranging from the coarse-grained to the fine-grained level, to ensure refined attendance management and protect the users’ privacy, thereby promoting the intelligent transformation of a new-generation attendance system. For organizational attendance analysis, a four-element scoring method and a four-element attendance reporting method are designed to calculate employee attendance scores, generate accurate and comprehensive attendance reports, provide decision-making support for organizations, and inspire the vitality of both organizations and individuals, so as to build intelligent organizations with organizational intelligence.
Abstract: The domain name plays an important role in cybercrimes. Existing malicious domain name detection methods not only struggle to exploit rich topology and attribute information but also require a large amount of labeled data, resulting in limited detection effects and high costs. To address this problem, this study proposes a malicious domain name detection method based on graph contrastive learning. The domain name and IP address are taken as two types of nodes in a heterogeneous graph, and the feature matrix of corresponding nodes is established according to their attributes. Three types of meta paths are constructed based on the inclusion relationship between domain names, the measure of similarity, and the correspondence between domain names and IP addresses. In the pre-training stage, a contrastive learning model based on an asymmetric encoder is applied to avoid the damage to graph structure and semantics caused by graph data augmentation operations and reduce the demand for computing resources. By using the inductive graph neural network encoders HeteroSAGE and HeteroGAT, a node-centric mini-batch training strategy is adopted to explore the aggregation relationship between the target node and its neighbor nodes, which solves the problem of poor applicability of transductive graph neural networks such as GCN in dynamic scenarios. The downstream classification detection task uses logistic regression and random forest algorithms for comparison. Experimental results on publicly available data sets show that detection performance is improved by two to six percentage points compared with that of related works.
Abstract: The openness and ease of use of Python make it one of the most commonly used programming languages. The PyPI ecosystem formed around Python not only provides convenience for developers but has also become an important target for attackers exploiting vulnerabilities. Thus, after Python vulnerabilities are discovered, it is critical to handle them by accurately and comprehensively assessing their impact scope. However, the current methods for assessing the impact scope of Python vulnerabilities mainly rely on dependency analysis at package granularity, which produces a large number of false positives. On the other hand, existing Python program analysis methods at function granularity have accuracy problems due to context insensitivity and also produce false positives when applied to assess the impact scope of vulnerabilities. This study proposes a vulnerability impact scope assessment method for the PyPI ecosystem based on static analysis, namely PyVul++. First, it builds the index of the PyPI ecosystem, then finds the candidate packages affected by the vulnerability through vulnerable function identification, and confirms the vulnerable packages through the vulnerability trigger condition. PyVul++ realizes vulnerability impact scope assessment at function granularity, improves the function-granularity call analysis for Python code, and outperforms other tools on the PyCG benchmark (accuracy of 86.71% and recall of 83.20%). PyVul++ is used to assess the impact scope of 10 Python CVE vulnerabilities on the PyPI ecosystem (385855 packages); it finds more vulnerable packages and reduces false positives compared with other tools such as pip-audit. In addition, in the 10 assessment experiments, PyVul++ newly finds that 11 packages in the current PyPI ecosystem still reference unpatched vulnerable functions.
Abstract: Forgetting is the biggest problem of artificial neural networks in incremental learning and is thus called “catastrophic forgetting”. In contrast, humans can continuously acquire new knowledge and retain most of the frequently used old knowledge. This ability of humans to perform continuous “incremental learning” without extensive forgetting is related to the partitioned learning structure and memory replay ability of the human brain. To simulate this structure and ability, the study proposes an incremental learning approach of “recency bias-avoiding self-learning mask (SLM)-based partitioned incremental learning”, or ASPIL for short. ASPIL involves the two stages of regional isolation and regional integration, which are alternately iterated to accomplish continuous incremental learning. Specifically, this study proposes the “Bayesian network (BN)-based sparse regional isolation” method to isolate the new learning process from the existing knowledge and thereby avoid interference with the existing knowledge. For regional integration, SLM and dual-branch fusion (GBF) methods are proposed. The SLM method can accurately extract new knowledge and improve the adaptability of the network to new knowledge, while the GBF method integrates the old and new knowledge to foster unified and high-precision cognition. During training, a regularization term for Margin Loss is proposed to avoid the “recency bias”, thereby further balancing the old knowledge against the new and avoiding a bias towards the new knowledge. To evaluate the effectiveness of the proposed method, this study also presents systematic ablation experiments performed on the standard incremental learning datasets CIFAR-100 and miniImageNet and compares the proposed method with a series of well-known state-of-the-art methods. 
The experimental results show that the method proposed in this study improves the memory ability of the artificial neural network and outperforms the latest well-known methods by more than 5.27% in average identification rate.
Abstract: Deep neural networks can be affected by well-designed backdoor attacks during training. Such an attack controls the model output at test time by injecting data carrying backdoor triggers into the training set. The attacked model performs normally on a clean test set but misclassifies an input into the attack target class when the backdoor trigger is recognized. The currently available backdoor attack methods have poor invisibility, and their attack success rates still leave room for improvement. A backdoor attack method based on singular value decomposition is proposed to address these limitations. The proposed method can be implemented in two ways. One is to directly set some singular values of the picture to zero; the resulting picture is compressed to a certain extent and can be used as an effective backdoor trigger. The other is to inject the singular vector information of the attack target class into the left and right singular vectors of the picture, which can also achieve an effective backdoor attack. The backdoor pictures obtained in these two ways are visually almost identical to the original picture. The experiments show that singular value decomposition can be effectively leveraged in backdoor attack algorithms to attack neural networks with considerably high success rates on multiple datasets.
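The first variant described in this abstract, setting some singular values of a picture to zero, can be sketched with NumPy. This is an illustration under assumptions: the input is a single-channel 2-D array, the smallest singular values are the ones zeroed, and the function name and the `keep` parameter are hypothetical. The abstract does not say which or how many singular values the actual method removes.

```python
import numpy as np


def truncate_singular_values(image, keep):
    """Zero all but the `keep` largest singular values of a 2-D array."""
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    s_truncated = s.copy()
    s_truncated[keep:] = 0.0           # singular values are sorted descending
    return (u * s_truncated) @ vt      # scale columns of u, then recombine


rng = np.random.default_rng(0)
picture = rng.random((8, 8))           # stand-in for a grayscale image
triggered = truncate_singular_values(picture, keep=3)
print(np.linalg.matrix_rank(triggered))           # 3
print(float(np.abs(picture - triggered).max()))   # size of the perturbation
```

For a color image, the same operation could be applied to each channel separately, which is again an assumption rather than a detail given in the abstract.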
Abstract: Detecting out-of-distribution (OOD) samples outside the training set distribution is crucial for deploying deep neural network (DNN) classifiers in the open environment. OOD sample detection is a binary classification problem, which is to classify the input samples into the in-distribution (ID) or OOD categories. However, the detector itself can in turn be bypassed by malicious adversarial attacks. OOD samples with such malicious perturbations are called adversarial OOD samples. Building robust OOD detectors to detect adversarial OOD samples is more challenging. Existing methods usually train the DNN on adversarial OOD samples within the neighborhood of auxiliary clean OOD samples to learn representations that are separable and robust to malicious perturbations. However, due to the distributional differences between the auxiliary OOD training set and the original ID training set, training on adversarial OOD samples is not effective enough to ensure the robustness of the ID boundary against adversarial perturbations. Adversarial ID samples generated within the neighborhood of (clean) ID samples are closer to the ID boundary and are also effective in improving the adversarial robustness of the ID boundary. This study proposes a semi-supervised adversarial training approach, DiTing, to build robust OOD detectors that detect both clean and adversarial OOD samples. This approach treats the adversarial ID samples as auxiliary near-OOD samples and trains on them jointly with other auxiliary clean and adversarial OOD samples to improve the robustness of OOD detection. Experiments show that DiTing has a significant advantage in detecting adversarial OOD samples generated by strong attacks while maintaining state-of-the-art performance in classifying clean ID samples and detecting clean OOD samples. Code is available at https://gitee.com/zhiyang3344/diting.
Abstract: Jacobi computation is a kind of stencil computation, which has been widely applied in the field of scientific computing. The performance optimization of Jacobi computation is a classic topic, where loop tiling is an effective optimization method. The existing loop tiling methods mainly focus on the impact of tiling on parallel communication and program locality and fail to consider other factors such as load balancing and vectorization. This study analyzes and compares several tiling methods based on multi-core computing architecture and chooses an advanced hexagonal tiling as the main method to accelerate Jacobi computation. For tile size selection, this study proposes a hexagonal tile size selection algorithm called Hexagon_TSS by comprehensively considering the impact of tiling on load balancing, vectorization efficiency, and locality. The experimental results show that the L1 data cache miss rate can be reduced to 5.46% of original serial program computation in the best case by Hexagon_TSS, and the maximum speedup reaches 24.48. The proposed method also has excellent scalability.
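As background for the tiling discussion in this abstract, the sketch below shows a one-dimensional three-point Jacobi sweep and the same sweep visited tile by tile. It uses simple rectangular blocking inside a single time step, not the hexagonal tiling across time steps or the Hexagon_TSS tile size selection that the study proposes; the function names and the three-point averaging stencil are assumptions chosen for brevity.

```python
def jacobi_step(u):
    """One 3-point Jacobi sweep; the two boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


def jacobi_step_tiled(u, tile):
    """Same sweep, visiting interior points in blocks of `tile` points.

    Every update reads only the old array `u`, so the tiles are independent
    and could be assigned to different cores.
    """
    new = u[:]
    last = len(u) - 1
    for start in range(1, last, tile):
        for i in range(start, min(start + tile, last)):
            new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new


grid = [0.0, 3.0, 6.0, 9.0, 0.0]
print(jacobi_step(grid))            # [0.0, 3.0, 6.0, 5.0, 0.0]
print(jacobi_step_tiled(grid, 2))   # same result
```

The tile size trades cache locality against load balance and vector length, which is the trade-off a tile size selection algorithm has to resolve.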
Abstract: In the fields of autonomous driving, augmented reality, and intelligent mobile robots, visual relocalization is a crucial fundamental issue. It refers to the issue of determining the position and attitude in an existing prior map according to the data captured in real time by visual sensors. In the last decades, visual relocalization has received extensive attention, and numerous kinds of prior map construction methods and visual relocalization methods have come to the fore. These efforts vary considerably and cover a wide scope, but technical overviews and summaries are still unavailable. Therefore, a survey of the field of visual relocalization is valuable both theoretically and practically. This study tries to construct a unified blueprint for visual relocalization methods and summarize related studies from the perspective of image data querying from large-scale map databases. This study surveys various types of construction methods for map databases and different feature matching, relocalization, and pose calculation approaches. It then summarizes the current mainstream datasets for visual relocalization and finally analyzes the challenges ahead and the potential development directions of visual relocalization.
Abstract: Software change prediction, aimed at identifying change-prone modules, can help software managers and developers allocate resources efficiently and reduce maintenance overhead. Extracting effective features from the code plays a vital role in the construction of accurate prediction models. In recent years, researchers have shifted from traditional hand-crafted features to semantic features with powerful representation capabilities for prediction, extracting these semantic features from abstract syntax tree (AST) node sequences to build models. However, existing studies have ignored the structural information in the AST and the rich semantic information in the code, and how to extract the semantic features of the code is still a challenging problem. For this reason, the study proposes a change prediction method based on hybrid graph representation. To start with, the model combines the AST, control flow graph (CFG), data flow graph (DFG), and other structural information to construct the program graph representation of the code. Then, it uses a graph neural network to learn the semantic features of the program graph and uses the features obtained to predict change-proneness. The model can integrate various semantic information to represent the code better. The effectiveness of the proposed method is verified by comparing it with the latest change prediction methods on various change datasets.
Abstract: Thanks to its low storage cost and high retrieval speed, graph-based unsupervised cross-modal hash learning has attracted much attention from academic and industrial researchers and has become an indispensable tool for cross-modal retrieval. However, the high computational complexity of graph structures hinders its use in large-scale multi-modal applications. This study mainly attempts to solve two important challenges facing graph-based unsupervised cross-modal hash learning: 1) How to efficiently construct graphs in unsupervised cross-modal hash learning? 2) How to handle the discrete optimization in cross-modal hash learning? To address these two problems, this study presents anchor-based cross-modal learning and a differentiable hash layer. To be specific, the study first randomly samples some image-text pairs from the training set as anchor sets and uses the anchor sets as the agent to compute the graph matrix of each batch of data. The graph matrix is used to guide cross-modal hash learning, thus remarkably reducing the space and time cost. Second, the proposed differentiable hash layer directly adopts binary coding for computation during network forward propagation and produces gradients to update the network without continuous-value relaxation during backpropagation, thus achieving better hash encoding performance. Finally, the study introduces a cross-modal ranking loss to consider the ranking results in the training process and improve the cross-modal retrieval accuracy. To verify the effectiveness of the proposed algorithm, the study compares the algorithm with 10 cross-modal hash algorithms on three general data sets.
Abstract: Aspect-level sentiment classification task, which aims to determine the sentiment polarity of a given aspect, has attracted increasing attention due to its broad applications. The key to this task is to identify contextual descriptions relevant to the given aspect and predict the aspect-related sentiment orientation of the author according to the context. Statistically, it is found that close to 30% of reviews convey a clear sentiment orientation without any explicit sentiment description of the given aspect, which is called implicit sentiment expression. Recent attention mechanism-based neural network methods have gained great achievement in sentiment analysis. However, this kind of method can only capture explicit aspect-related sentiment descriptions but fails to effectively explore and analyze implicit sentiment, and it often models aspect words and sentence contexts separately, which makes the expression of aspect words lack contextual semantics. To solve the above two problems, this study proposes an aspect-level sentiment classification method that integrates local aspect information and global sentence context information and improves the classification performance of the model by curriculum learning according to different classification difficulties of implicit and explicit sentiment sentences. Experimental results show that the proposed method not only has a high accuracy in identifying the aspect-related sentiment orientation of explicit sentiment sentences but also can effectively learn the sentiment categories of implicit sentiment sentences.
Abstract: As an essential component of real-time system design, priority is utilized to resolve conflicts in resource sharing and design for safety. For real-time systems that introduce priorities, each task is assigned a priority, which leads to the possibility of low-priority tasks being preempted by high-priority tasks at runtime, thus creating a preemptive scheduling problem for real-time systems. Existing research on this problem lacks a modeling and automatic verification method that can visually represent the priority of tasks and the dependencies between tasks. To this end, a preemptive priority timed automata (PPTA) is proposed and a preemptive priority timed automata network (PPTAN) is introduced. First, the priority of a task is represented by adding the priority of migration to the timed automata, and then the migration is adopted to correlate tasks with dependencies so that PPTA can be applied to model real-time tasks with priority. The blocking position is also added to the timed automata, so PPTAN can be used to model the priority preemptive scheduling problem. Second, a model-based transformation method is proposed to map the PPTA to the automatic verification tool UPPAAL. Finally, by modeling an example of a multi-core multi-task real-time system and comparing it with other models, it is shown that this model is not only suitable for modeling the priority preemptive scheduling problem but also for accurately verifying and analyzing it.
Abstract: When prototypical networks are directly applied to few-shot named entity recognition (FEW-NER), two problems arise. First, non-entities do not have strong semantic relationships with each other, so constructing prototypes for entities and non-entities in the same way makes non-entity prototypes fail to accurately represent the semantic characteristics of non-entities. Second, using only the average entity vector as the prototype makes it difficult to capture similar entities with different semantic features. To address these problems, this study proposes a FEW-NER method based on fine-grained prototypical networks (FNFP) to improve the annotation effect of FEW-NER. Firstly, different non-entity prototypes are constructed for different query sets to capture the key semantic features of non-entities in sentences and obtain finer-grained prototypes to improve the recognition effect of non-entities. Then, an inconsistent metric module is designed to measure the inconsistency between similar entities, and different metric functions are applied to entities and non-entities, so as to reduce the feature representation between similar samples and improve the feature representation of the prototype. Finally, a Viterbi decoder is introduced to capture the label transformation relationship and optimize the final annotation sequence. The experimental results show that the proposed method improves performance on the large-scale FEW-NER dataset FEW-NERD, and its generalization ability in different domain scenarios is verified on the cross-domain dataset.
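The baseline that this abstract criticizes, a prototype computed as the average support vector with nearest-prototype classification, can be sketched as follows. This shows only the plain prototypical-network step, not FNFP; the function names, the Euclidean metric, and the toy two-dimensional embeddings are assumptions for illustration.

```python
import math

def build_prototypes(support):
    """Average the support embeddings of each label into one prototype vector."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return prototypes

def classify(query, prototypes):
    """Return the label whose prototype is nearest to `query` (Euclidean)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))
```

With this scheme every non-entity token is pulled toward a single averaged "O" prototype, which is the weakness the abstract describes for semantically unrelated non-entities.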
Abstract: A large number of bug reports are generated during software development and maintenance, which can help developers locate bugs. Information retrieval-based bug localization (IRBL) analyzes the similarity between bug reports and source code files to locate bugs, achieving high accuracy at the file and function levels. However, due to the coarse localization granularity of IRBL, considerable labor and time are still required to find bugs within suspicious files and function fragments. This study proposes STMTLocator, a statement-level software bug localization method based on historical bug information retrieval. Firstly, it retrieves historical bug reports that are similar to the bug report of the program under test and extracts the bug statements from the historical bug reports. Then, it retrieves the suspicious files according to the text similarity between the source code files and the bug report of the program under test and extracts the suspicious statements from the suspicious files. Finally, it calculates the similarity between the suspicious statements and the historical bug statements and ranks them in descending order to localize bug statements. To evaluate the bug localization performance of STMTLocator, comparative experiments are conducted on the Defects4J and JIRA datasets with Top@N, MRR, and other evaluation metrics. The experimental results show that STMTLocator achieves an MRR nearly four times that of the static bug localization method BugLocator and locates 7 more bug statements for Top@1. The average time used by STMTLocator to locate a bug version is reduced by 98.37% and 63.41% compared with the dynamic bug localization methods Metallaxis and DStar, respectively, and STMTLocator has the significant advantage of not requiring the construction and execution of test cases.
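The final ranking step described above can be sketched as follows. This is a minimal illustration and not STMTLocator itself: the token-count cosine similarity, the use of the best match over all historical statements as the score, and all function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case a statement and split it into identifier-like tokens."""
    return re.findall(r"[a-z_][a-z0-9_]*", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0

def rank_statements(suspicious, historical_bugs):
    """Score each suspicious statement by its best match among the
    historical bug statements and return (score, statement) pairs,
    highest score first."""
    history = [Counter(tokenize(h)) for h in historical_bugs]
    scored = []
    for stmt in suspicious:
        vec = Counter(tokenize(stmt))
        score = max((cosine(vec, h) for h in history), default=0.0)
        scored.append((score, stmt))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

No test case is built or executed at any point, which matches the advantage over dynamic methods that the abstract claims.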
Abstract: Blockchain is the basis of the Internet of value. However, data and value silos arise from independent blockchain systems. Blockchain interoperability (also known as cross-chain operability) is essential for breaking inter-chain barriers and building a blockchain network. After differentiating between the blockchain interoperability in the narrow sense and that in the broad sense, this study redefines the former concept and abstracts out two primary operations: cross-chain reading and cross-chain writing. Subsequently, it summarizes three key technical problems that need to be resolved for achieving the blockchain interoperability in the narrow sense: cross-chain information transmission, cross-chain trust transfer, and cross-chain operation atomicity guarantee. Then, the study reviews the current research status of the three problems systematically and makes comparisons from multiple perspectives. Furthermore, it analyzes some representative holistic solutions from the perspective of the key technical problems. Finally, several research directions deserving of further exploration are also presented.
Abstract: As an essential mechanism of group collaboration in software development, code comments are widely used by developers to improve the efficiency of specific developing tasks. However, code comments do not directly affect the software operation, and developers often ignore them, which leads to poor quality of code comments and affects development efficiency. Quality issues of code comments hinder code understanding, bring misunderstanding, or even introduce bugs, which receive widespread attention from researchers. This study systematically analyzes the research work of global scholars on quality issues of code comments in recent years by literature review. It also summarizes related studies in three aspects: evaluation dimensions of code comment quality, indicators of code comment quality, and strategies to promote code comment quality and points out shortcomings, challenges, and suggestions for the current research.
Abstract: Kernel heap vulnerabilities are currently one of the main threats to operating system security. By triggering a vulnerability, user-space attackers can leak or modify sensitive kernel information, disrupt kernel control flow, and even gain root privilege. However, due to the rapid increase in the number and complexity of vulnerabilities, it often takes a long time from when a vulnerability is first reported to when the developer issues a patch, and the kernel mitigation mechanisms currently adopted are constantly being bypassed. Therefore, this study proposes an eBPF-based dynamic mitigation framework for kernel heap vulnerabilities, so as to reduce kernel security risks during the time window before a vulnerability is fixed. The framework adopts data object space randomization to assign random addresses to the data objects involved in vulnerability reports at each allocation. In addition, it takes full advantage of the dynamic and secure features of eBPF to inject space-randomized objects into the kernel at runtime, so the attacker cannot place any attack payload accurately, and the heap vulnerabilities become almost unexploitable. This study evaluates 40 real kernel heap vulnerabilities and collects 12 attacks that bypass the existing mitigation mechanisms for further analysis and tests. The results verify that the dynamic mitigation framework provides sufficient security. Performance tests show that even under severe conditions, the four types of data objects cause a performance loss of only about 1% and negligible memory loss to the system, and there is almost no additional performance loss when the number of protected objects increases. Compared with related work, the mechanism in this study has a wider scope of application and stronger security, and it does not require vulnerability patches issued by security experts. Furthermore, it can generate mitigation procedures according to vulnerability reports and has broad application prospects.
Abstract: Regular expressions are widely used in various areas of computer science. However, due to their complex syntax and heavy use of meta-characters, regular expressions are quite error-prone when defined and used by developers. Testing is a practical and effective way to ensure the semantic correctness of regular expressions. The most common method is to generate a set of character strings according to the tested expression and check whether they comply with the intended language. Most existing test data generation methods focus only on positive strings. However, empirical studies show that a majority of errors in actual development manifest as the defined language being smaller than the intended one, and such errors can only be detected by negative strings. This study investigates the generation of negative strings from regular expressions based on mutation. The study first obtains a set of mutants by injecting defects into the tested expression through mutation and then selects a negative character string in the complementary set of the language defined by the tested expression to reveal the error simulated by the corresponding mutant. In order to simulate complex defects and avoid the problem that negative strings cannot be obtained due to the specialization of mutants, a second-order mutation mechanism is adopted. Meanwhile, optimization techniques such as redundant mutant elimination and mutation operator selection are used to reduce the number of mutants, so as to control the size of the finally generated test set. The experimental results show that, compared with existing tools, the proposed algorithm generates negative test sets of moderate size with strong error detection ability.
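The core idea, choosing a string outside the tested expression's language that a given mutant nevertheless accepts, can be sketched with brute-force enumeration. This is an illustration under stated assumptions rather than the paper's algorithm: the function name, the small alphabet, the length bound, and the hand-written mutants are all hypothetical, and only Python's standard `re` module is used.

```python
import itertools
import re

def find_negative_string(original, mutant, alphabet="ab1", max_len=3):
    """Return the shortest string the mutant accepts but the original rejects.

    Such a string lies in the complement of the original's language, so it
    is a negative test string; returns None if none exists up to `max_len`.
    """
    orig_re, mut_re = re.compile(original), re.compile(mutant)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if mut_re.fullmatch(candidate) and not orig_re.fullmatch(candidate):
                return candidate
    return None
```

For the tested expression `[a-z]+`, the generalizing mutant `[a-z0-9]+` yields the negative string `"1"`. A specializing mutant such as `[a-z]+` for the tested expression `[a-z]*` yields nothing, because its language is a subset of the original's; this is the situation the abstract says second-order mutation is meant to avoid.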
Abstract: Fault localization collects and analyzes the runtime information of test case sets to evaluate the suspiciousness of each statement of being faulty. Test case sets are constructed by the data from the input domain and have two types, i.e., passing test cases and failing ones. Since failing test cases generally account for a very small portion of the input domain, and their distribution is usually random, the number of failing test cases is much fewer than that of passing ones. Previous work has shown that the lack of failing test cases leads to a class-imbalanced problem of test case sets, which severely hampers fault localization effectiveness. To address this problem, this study proposes a model-domain data augmentation approach using generative adversarial networks for fault localization. Based on the model domain (i.e., spectrum information of fault localization) rather than the traditional input domain (i.e., program input), this approach uses the generative adversarial network to synthesize the model-domain failing test cases covering the minimum suspicious set, so as to address the class-imbalanced problem from the model domain. The experimental results show that the proposed approach significantly improves the effectiveness of 12 representative fault localization approaches.
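To make "spectrum information" and "suspiciousness" concrete, the sketch below scores statements from a coverage matrix using the Ochiai formula, a commonly used spectrum-based metric. The abstract does not say which of the 12 evaluated approaches use this formula, so its choice here is an assumption, and the generative adversarial augmentation step is not shown.

```python
import math

def ochiai(coverage, outcomes):
    """Compute an Ochiai suspiciousness score per statement.

    coverage[t][s] is True if test t executed statement s;
    outcomes[t] is True if test t failed.
    """
    total_failed = sum(outcomes)
    scores = []
    for s in range(len(coverage[0])):
        # ef / ep: failing / passing tests that executed statement s
        ef = sum(1 for t, row in enumerate(coverage) if row[s] and outcomes[t])
        ep = sum(1 for t, row in enumerate(coverage) if row[s] and not outcomes[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores.append(ef / denom if denom else 0.0)
    return scores
```

Each row of `coverage` together with its outcome is one model-domain test case. With few failing rows, `ef` is small for every statement, which illustrates why class imbalance weakens the ranking and why synthesizing additional failing rows is attractive.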
Abstract: With the rapid development of Internet information technologies, the explosive growth of online learning resources has caused the problem of “information overload” and “learning disorientation”. In the absence of expert guidance, it is difficult for users to identify their learning demands and select the appropriate content from the vast amount of learning resources. Educational domain recommendation methods have received a lot of attention from researchers in recent years because they can provide personalized recommendations of learning resources based on the historical learning behaviors of users. However, the existing educational domain recommendation methods ignore the modeling of complex relationships among knowledge points in learning demand perception and fail to consider the dynamic changes of users’ learning demands, which leads to inaccurate learning resource recommendations. To address the above problems, this study proposes a knowledge point recommendation method based on static and dynamic learning demand perception, which models users’ learning behaviors under complex knowledge association by combining static perception and dynamic perception. For static learning demand perception, this study innovatively designs an attentional graph convolutional network based on the first-course-following meta-path guidance of knowledge points, which can accurately capture users’ static learning demands at the fine-grained knowledge point level by modeling the complex constraints of the first-course-following relationship between knowledge points and eliminating the interference of other non-learning demand factors. 
For dynamic learning demand perception, the method aggregates knowledge point embeddings to characterize users’ knowledge levels at different moments by taking courses as units and then uses a recurrent neural network to encode users’ knowledge level sequences, which can effectively explore the dynamic learning demands hidden in users’ knowledge level changes. Finally, this study fuses the obtained static and dynamic learning demands, models the compatibility between static and dynamic learning demands in the same framework, and promotes the complementarity of these two learning demands to achieve fine-grained and personalized knowledge point recommendations. Experiments show that the proposed method can effectively perceive users’ learning demands, provide personalized knowledge point recommendations on two publicly available datasets, and outperform the mainstream recommendation methods in terms of various evaluation metrics.
Abstract: The network protocol software is widely deployed and applied, and it provides diversified functions such as communication, transmission, control, and management in cyberspace. In recent years, its security has gradually attracted the attention of academia and industry. Timely finding and repairing network protocol software vulnerabilities has become an important topic. The features, such as diversified deployment methods, complex protocol interaction processes, and functional differences in multiple protocol implementations of the same protocol specification, make the vulnerability mining technique of network protocol software face many challenges. This study first classifies the vulnerability mining technologies of network protocol software and defines the connotation of existing key technologies. Secondly, this study summarizes the technical progress in four aspects of network protocol software vulnerability mining, including network protocol description method, mining object adaptation technology, fuzz testing technology, and vulnerability mining method based on program analysis. In addition, through comparative analysis, the technical advantages and evaluation dimensions of different methods are summarized. Finally, this study summarizes the technical status and challenges of network protocol software vulnerability mining and proposes five potential research directions.
Abstract: The effectiveness of a test suite in defect detection refers to the extent to which the test suite could detect the defects hidden in the software. How to evaluate this performance of a test suite is an important issue. Coverage and mutation score are two of the most important and widely used metrics for test suite effectiveness. To quantify the defect detection capability of a test suite, researchers have devoted a large amount of research effort to this issue and have made significant progress. However, inconsistent conclusions can be observed among the existing studies, and some challenges still call for prompt resolution in the area. This study systematically summarizes the research results achieved by scholars both in China and abroad in the field of the evaluation of test suite effectiveness over the years. To start with, it expounds the problems in the research on the evaluation of test suite effectiveness. Then, it outlines and analyzes the evaluation of test suite effectiveness based on coverage and mutation score and presents the application of the evaluation of test suite effectiveness in test suite optimization. Finally, the study points out the challenges faced by this line of research and suggests the directions of future research.
Abstract: Hypergraphs are generalized representations of ordinary graphs, which are common in many application areas, including the Internet, bioinformatics, and social networks. The independent set problem is a fundamental research problem in the field of graph analysis. Most of the traditional independent set algorithms are targeted for ordinary graph data, and how to achieve efficient maximum independent set mining on hypergraph data is an urgent problem to be solved. To address this problem, this study proposes a definition of hypergraph independent sets. Firstly, two properties of hypergraph independent set search are analyzed, and then a basic algorithm based on the greedy strategy is proposed. Then a pruning framework for hypergraph approximate maximum independent set search is proposed, i.e., a combination of exact pruning and approximate pruning, which reduces the size of the graph by the exact pruning strategy and speeds up the search by the approximate pruning strategy. In addition, four efficient pruning strategies are proposed in this study, and a theoretical proof of each pruning strategy is presented. Finally, experiments are conducted on 10 real hypergraph data sets, and the results show that the pruning algorithm can efficiently search for hypergraph maximum independent sets that are closer to the real results.
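A greedy baseline of the kind mentioned above might look like the sketch below. The abstract does not state the paper's definition of a hypergraph independent set, so this code assumes one in which no two selected vertices share a hyperedge; the minimum-degree visiting order and the function name are likewise assumptions, and none of the four pruning strategies is shown.

```python
def greedy_independent_set(hyperedges):
    """Greedily build a set of vertices in which no two share a hyperedge.

    Vertices are visited in ascending order of their number of distinct
    neighbours, a common greedy heuristic; ties are broken by vertex label.
    """
    neighbours = {}
    for edge in hyperedges:
        for v in edge:
            neighbours.setdefault(v, set()).update(u for u in edge if u != v)
    chosen, blocked = [], set()
    for v in sorted(neighbours, key=lambda x: (len(neighbours[x]), x)):
        if v not in blocked:
            chosen.append(v)
            blocked |= neighbours[v]  # neighbours of v can no longer be chosen
    return chosen
```

The result is maximal (no further vertex can be added) but not necessarily maximum, which is why the abstract speaks of approximate maximum independent set search.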
Abstract: Entity recognition is a key technology for information extraction. Compared with ordinary text, the entity recognition of Chinese medical text is often faced with a large number of nested entities. Previous methods of entity recognition often ignore the entity nesting rules unique to medical text and directly use sequence annotation methods. Therefore, a Chinese entity recognition method that incorporates entity nesting rules is proposed. This method transforms the entity recognition task into a joint training task of entity boundary recognition and boundary first-tail relationship recognition in the training process and filters the results by combining the entity nesting rules summarized from actual medical text in the decoding process. In this way, the recognition results are in line with the composition law of the nested combinations of inner and outer entities in the actual text. Good results have been achieved in public experiments on entity recognition of medical text. Experiments on the dataset show that the proposed method is significantly superior to the existing methods in terms of nested-type entity recognition performance, and the overall accuracy is increased by 0.5% compared with the state-of-the-art methods.
Abstract: As a privacy-preserving digital identity authentication technology, anonymous credentials not only authenticate the validity of the users’ digital identity but also protect the privacy of their identity. Anonymous credentials are widely applied in anonymous authentication, anonymous tokens, and decentralized digital identity systems. Existing anonymous credentials usually adopt the commitment-signature-proof paradigm, which requires that the adopted signature scheme should have the re-randomization property, such as CL signatures, PS signatures, and structure-preserving signatures (SPS). In practical applications, ECDSA, Schnorr, and SM2 are widely employed for digital identity authentication, but they lack the protection of user identity privacy. Therefore, it is of certain practical significance to construct anonymous credentials compatible with ECDSA, Schnorr, SM2, and other digital signatures, and protect identity privacy during the authentication. This study explores anonymous credentials based on SM2 digital signature. Pedersen commitment is utilized to commit the user attributes in the registration phase. Meanwhile, according to the structural characteristics of SM2, the signed message is H(m), and the equivalence between the Pedersen commitment message and the hash commitment message is proven. This study also employs ZKB++ technology to prove the equivalence of algebraic and non-algebraic statements. The commitment message is transformed to achieve the cross-domain proof and issue the users’ credentials based on the SM2 digital signature. In the showing phase of anonymous credentials, the zero-knowledge proof is combined to prove the possession of an SM2 signature and ensure the anonymity of credentials. This study provides the construction of an anonymous credential protocol based on SM2 digital signature and proves the security of this protocol. 
Finally, it also verifies the effectiveness and feasibility of the protocol by analyzing the computational complexity of the protocol and testing the algorithm execution efficiency.
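The Pedersen commitment used in the registration phase can be illustrated with toy parameters. This sketch works in a small multiplicative group modulo a prime rather than on the SM2 elliptic curve, and the constants `P`, `Q`, `G`, `H` and the function names are assumptions for illustration; the SM2 signature, the ZKB++ proof, and the cross-domain proof are not shown.

```python
# Toy parameters, far too small for real use: p = 23 is prime and g, h both
# generate the subgroup of prime order q = 11 in Z_p^*. In a real deployment
# nobody may know the discrete logarithm of h to base g; here it is trivially
# computable, so binding does not actually hold.
P, Q, G, H = 23, 11, 4, 9

def commit(message, randomness):
    """Pedersen commitment C = g^m * h^r mod p."""
    return (pow(G, message % Q, P) * pow(H, randomness % Q, P)) % P

def verify(commitment, message, randomness):
    """Check an opening (m, r) against a commitment."""
    return commitment == commit(message, randomness)
```

The product of two commitments is a commitment to the sum of the messages under the sum of the randomness values, an algebraic structure that zero-knowledge proofs about committed attributes typically rely on.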
Abstract: Since the Snowden revelations, threats from backdoor attacks represented by the algorithm substitution attack (ASA) have attracted wide concern. This kind of attack tampers with the algorithms run by cryptographic protocol participants in an undetectable manner and embeds backdoors to obtain secrets. Building a cryptographic reverse firewall (CRF) for protocol participants is a well-known and feasible approach against ASA. Identity-based encryption (IBE), as a widely applicable public key infrastructure, needs to be protected by appropriate CRF schemes. However, the existing work only realizes the CRF re-randomization, ignoring the security risk of sending users’ private keys directly to the third-party CRF. To address this problem, the formal definition and security model of the security properties of CRF applicable to IBE are proposed. Then, the formal definition of rerandomizable and key-malleable secure channel free IBE (RKM-SCF-IBE) and the method of transforming traditional IBE to RKM-SCF-IBE are presented. In addition, an approach to increasing anonymity is also given. Finally, a generic provably secure framework of CRF construction for IBE is proposed based on RKM-SCF-IBE, with several instantiations from classic IBE schemes in the standard model and simulation results with optimization methods. Compared with existing work, the proposed scheme is proven secure under a more complete security model, offers a generic approach to building CRF for IBE schemes, and clarifies the basic principles for constructing CRF for more expressive encryption schemes.
Abstract: Accurately extracting two types of information, namely elements and clauses, from contract texts can effectively improve contract review efficiency and provide facilitation services for all trading parties. However, current contract information extraction methods generally train single-task models to extract elements and clauses separately; they do not dig deep into the characteristics of contract texts and ignore the relevance among different tasks. Therefore, this study employs a deep neural network structure to study the correlation between the two tasks of element extraction and clause extraction and proposes a multitask learning method. Firstly, a primary multitask learning model is built for contract information extraction by combining the above two tasks. Then, the model is optimized, and an attention mechanism is adopted to further explore the correlation, yielding an attention-based dynamic multitask learning model. Finally, based on the above two methods, a dynamic multitask learning model with lexical knowledge is proposed for the complex semantic environment in contract texts. The experimental results show that the method can fully capture the shared features among tasks and yield better information extraction results than the single-task model. It can handle nested entities among elements and clauses in contract texts and realize the joint information extraction of contract elements and clauses. In addition, to verify the robustness of the proposed method, this study conducts experiments on public datasets in various fields, and the results show that the proposed method is superior to baseline methods.
Abstract: Adversarial texts are malicious samples that can cause deep learning classifiers to make errors. The adversary creates an adversarial text that can deceive the target model by adding subtle perturbations to the original text that are imperceptible to humans. The study of adversarial text generation methods can evaluate the robustness of deep neural networks and contribute to the subsequent robustness improvement of the model. Among the current adversarial text generation methods designed for Chinese text, few attack the robust Chinese BERT model as the target model. For Chinese text classification tasks, this study proposes an attack method against Chinese BERT, that is Chinese BERT Tricker. This method adopts a character-level word importance scoring method, important Chinese character positioning. Meanwhile, a word-level perturbation method for Chinese based on the masked language model with two types of strategies is designed to achieve the replacement of important words. Experimental results show that for the text classification tasks, the proposed method can significantly reduce the classification accuracy of the Chinese BERT model to less than 40% on two real datasets, and it outperforms other baseline methods in terms of multiple attack performance.
Abstract: Graph data, such as citation networks, social networks, and transportation networks, exist widely in the real world. Graph neural networks (GNNs) have attracted extensive attention due to their strong expressiveness and excellent performance in a variety of graph analysis applications. However, the excellent performance of GNNs relies on labeled data, which are difficult to obtain, and on complex network models with high computational costs. Knowledge distillation (KD) is introduced into GNNs to address the scarcity of labeled data and the high complexity of GNNs. KD is a method of training small models (student models) with soft-label supervision information from larger models (teacher models) to yield better performance and accuracy. How to apply KD technology to graph data has therefore become a research challenge, but a review of graph-based KD research is still lacking. Aiming to provide a comprehensive overview of graph-based KD, this study summarizes the existing studies and fills the review gap in this field. Specifically, this study first introduces the background knowledge of graphs and KD. Then, three types of graph-based knowledge distillation methods are comprehensively summarized, including graph knowledge distillation for deep neural networks (DNNs), graph knowledge distillation for GNNs, and self-KD-based graph knowledge distillation. Each type of method is further divided into knowledge distillation methods based on the output layer, the middle layer, and the constructed graph. Subsequently, the design ideas of various graph-based knowledge distillation algorithms are analyzed and compared, and the advantages and disadvantages of the algorithms are summarized with experimental results. In addition, the applications of graph-based knowledge distillation in computer vision, natural language processing, recommendation systems, and other fields are also listed.
Finally, the development of graph-based knowledge distillation is summarized and prospected. This study also discloses the references related to graph-based knowledge distillation on GitHub. Please refer to https://github.com/liujing1023/Graph-based-Knowledge-Distillation.
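The soft-label supervision mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the common temperature-scaled formulation (KL divergence between softened teacher and student outputs, scaled by the squared temperature); the function names and the default temperature are illustrative choices, not taken from the reviewed methods.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: this is the widely used soft-label loss, shown here only
    as an example of what "soft-label supervision" can mean.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.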
Abstract: As the modern software scale expands, software vulnerabilities bring a great threat to the security and reliability of computer systems and software, causing huge damage to people’s production and life. In recent years, as open source software (OSS) is widely used, the vulnerability issues of OSS have received much attention. Vulnerability awareness techniques can effectively help OSS users to identify vulnerabilities at an early stage for timely defense. Different from the vulnerability detection techniques for traditional software, the transparency and cooperativity of OSS vulnerabilities bring great challenges to vulnerability awareness. Therefore, various techniques are proposed by scholars and developers to perceive potential vulnerabilities and risks in OSS from the code and open source community, so as to find OSS vulnerabilities as early as possible and reduce the losses caused by the vulnerabilities. To boost the development of OSS vulnerability awareness techniques, this study conducts a systematic literature review of existing research works. The study selects 45 high-level papers on open source vulnerability awareness techniques, including code-based, open source community discussion-based, and patch-based vulnerability awareness techniques. The results of these papers are systematically summarized. In particular, this study proposes the category of techniques based on the OSS vulnerability life cycle for the first time according to the most recent publications, which supplements and improves the existing taxonomy of vulnerability awareness techniques. Finally, the study discusses the challenges in the field and predicts future research directions.
Abstract: As a new learning paradigm to solve the problem of label ambiguity, label distribution learning (LDL) has received much attention in recent years. To further improve the prediction performance of LDL, this study proposes an LDL based on deep forest and heterogeneous ensemble (LDLDF), which uses the cascade structure of deep forest to simulate deep learning models with multi-layer processing structure and combines multiple heterogeneous classifiers in the cascade layer to increase the diversity of ensemble. Compared with other existing LDL methods, LDLDF can process information layer by layer and learn better feature representations to mine rich semantic information in data, and it has better representation learning ability and generalization ability. In addition, by considering the degradation problem of deep models, LDLDF adopts a layer feature reuse mechanism to reduce the training error of the model, which effectively utilizes the prediction ability of each layer in the deep model. Sufficient experimental results show that LDLDF is superior to other methods.
Abstract: Object detection is widely used in various fields such as autonomous driving, industry, and medical care. Using the object detection algorithm to solve key tasks in different fields has gradually become the main method. However, the robustness of the object detection model based on deep learning is seriously insufficient under the attack of adversarial samples. It is easy to make the model prediction wrong by adding the adversarial samples constructed by small perturbations, which greatly limits the application of the object detection model in key security fields. In practical applications, the models are black-box models. Related research on black-box attacks against object detection models is relatively lacking, and there are many problems such as incomplete robustness evaluation, low attack success rate of black-box, and high resource consumption. To address the aforementioned issues, this study proposes a black-box object detection attack algorithm based on a generative adversarial network. The algorithm uses the generative network fused with an attention mechanism to output the adversarial perturbations and employs the alternative model loss and the category attention loss to optimize the generated network parameters, which can support two scenarios of target attack and vanish attack. A large number of experiments are conducted on the Pascal VOC and the MSCOCO datasets. The results demonstrate that the proposed method has a higher black-box transferable attack success rate and can perform transferable attacks between different datasets.
Abstract: Network management and monitoring are crucial topics in the network field, with the technologies used to achieve this being referred to as network measurement. In particular, network heavy hitter detection is an important technique of network measurement, and it is analyzed in this study. Heavy hitters are flows that exceed an established threshold in terms of occupied network resources (bandwidth or the number of packets transmitted). Detecting heavy hitters can contribute to quick anomaly detection and more efficient network operation. However, the implementation of heavy hitter detection is impacted by high-speed links. Traditional methods and software defined network (SDN)-based methods are two categories of heavy hitter detection methods that have been developed over time. This study reviews the related frameworks and algorithms, systematically summarizes the development and current status, and finally tries to predict future research directions of network heavy hitter detection.
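The threshold definition of a heavy hitter given above can be made concrete with a minimal sketch. It uses exact per-flow byte counting, which is an assumption made for clarity and not one of the reviewed algorithms; the function name `heavy_hitters` and the flow identifiers are hypothetical.

```python
from collections import Counter

def heavy_hitters(packets, threshold_fraction):
    """Return flows whose byte count exceeds a fraction of total traffic.

    packets: iterable of (flow_id, size_in_bytes) pairs.
    threshold_fraction: share of total bytes a flow must exceed.
    """
    bytes_per_flow = Counter()
    for flow_id, size in packets:
        bytes_per_flow[flow_id] += size
    total = sum(bytes_per_flow.values())
    limit = threshold_fraction * total
    return {flow: count for flow, count in bytes_per_flow.items() if count > limit}

# Hypothetical trace: flow "A" carries 3000 of 3400 bytes.
trace = [("A", 1500), ("B", 100), ("A", 1500), ("C", 200), ("B", 100)]
print(heavy_hitters(trace, 0.5))  # {'A': 3000}
```

Exact counting keeps one counter per flow, so its memory grows with the number of active flows. That cost is what makes the approach hard to sustain on high-speed links.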
Abstract: The transport layer is a key component in the network protocol stack, which is responsible for providing end-to-end services for applications between different hosts. Existing transport layer protocols such as TCP provide users with some basic security protection mechanisms, e.g., error controls and acknowledgments, which ensure the consistency of datagrams sent and received by applications between different hosts to a certain extent. However, these security protection mechanisms of the transport layer have serious flaws. For example, the sequence number of TCP datagrams is easy to guess and infer, and the calculation of the datagram’s checksum depends on the vulnerable ones’ complement sum algorithm. As a result, the existing transport layer security mechanisms cannot guarantee the integrity and security of the datagram, which allows a remote attacker to craft a fake datagram and inject it into the target network stream, thus poisoning the target network stream. The attack against the transport layer occurs at the basic layers of the network protocol stack, which can bypass the security protection mechanisms enforced at the upper application layer and thus cause serious damage to the network infrastructure. After investigating various attacks over network protocols and the related security vulnerabilities in recent years, this study proposes a method for enhancing the security of the transport layer based on lightweight chain verification, namely LightCTL. Based on the hash verification, LightCTL enables both sides of a TCP connection to create a mutually verifiable consensus on transport layer datagrams, so as to prevent attackers or middlemen from stealing and forging sensitive information. 
As a result, LightCTL can successfully foil various attacks against the network protocol stack, including TCP connection reset attacks based on sequence number inferring, TCP hijacking attacks, SYN flooding attacks, man-in-the-middle attacks, and datagram replay attacks. Besides, LightCTL does not need to modify the protocol stack of intermediate network devices such as routers. It only needs to modify the checksum and the related parts of the end protocol stack. Therefore, LightCTL can be easily deployed and significantly improves the security of network systems.
Abstract: Fact verification is intended to check whether a textual statement is supported by a given piece of evidence. Due to the structural dependence and implicit content of tables, the task of fact verification with tables as the evidence still faces many challenges. Existing literature has either used logical expressions to parse statements based on tabular evidence or designed table-aware neural networks to encode statement-table pairs and thereby accomplish table-based fact verification tasks. However, these approaches fail to fully utilize the implicit tabular information behind the statements, which leads to the degraded inference performance of the model. Moreover, Chinese statements based on tabular evidence have more complex syntax and semantics, which also adds to the difficulties in model inference. For this reason, the study proposes a method of fact verification with Chinese tabular data based on the capsule heterogeneous graph attention network (CapsHAN). This method can fully understand the structure and semantics of statements. On this basis, the tabular information implied by the statements is mined and utilized to effectively improve the accuracy of table-based fact verification tasks. Specifically, a heterogeneous graph is constructed by performing syntactic dependency parsing and named entity recognition of statements. Subsequently, the graph is learned and understood by the heterogeneous graph attention network and the capsule graph neural network, and the obtained textual representation of the statements is spliced together with the textual representation of the encoded tables. Finally, the result is predicted. Further, this study also attempts to address the problem that the datasets of fact verification based on Chinese tables are scarce and thus unable to support the performance evaluation of table-based fact verification methods. 
For this purpose, the study transforms the mainstream English table-based fact verification datasets TABFACT and INFOTABS into Chinese and constructs a dataset that is based on the uniform content label (UCL) national standard and specifically tailored to the characteristics of Chinese tabular data. This dataset, namely, UCLDS, takes Wikipedia infoboxes as evidence of manually annotated natural language statements and labels them into three classes: entailed, contradictory, and neutral. UCLDS outperforms the traditional datasets TABFACT and INFOTABS in supporting both single-table and multi-table inference. The experimental results on the above three Chinese benchmark datasets show that the proposed model outperforms the baseline model invariably, demonstrating its superiority for Chinese table-based fact verification tasks.
Abstract: The virtualization, high availability, high scheduling elasticity, and other characteristics of cloud infrastructure provide cloud databases with many advantages, such as the out-of-the-box feature, high reliability and availability, and pay-as-you-go model. Cloud databases can be divided into two categories according to the architecture design: cloud-hosted databases and cloud-native databases. Cloud-hosted databases, deploying the database system in the virtual machine environment on the cloud, offer the advantages of low cost, easy operation and maintenance, and high reliability. Besides, cloud-native databases take full advantage of the characteristic elastic scaling of the cloud infrastructure. The disaggregated compute and storage architecture is adopted to achieve the independent scaling of computing and storage resources and further increase the cost-performance ratio of the databases. However, the disaggregated compute and storage architecture poses new challenges to the design of database systems. This survey is an in-depth analysis of the architecture and technology of the cloud-native database system. Specifically, the architectures of cloud-native online transaction processing (OLTP) and online analytical processing (OLAP) databases are classified and analyzed, respectively, according to the difference in the resource disaggregation mode, and the advantages and limitations of each architecture are compared. Then, on the basis of the disaggregated compute and storage architectures, this study explores the key technologies of cloud-native databases in depth by functional modules. The technologies under discussion include those of cloud-native OLTP (data organization, replica consistency, main/standby synchronization, failure recovery, and mixed workload processing) and those of cloud-native OLAP (storage management, query processing, serverless-aware compute, data protection, and machine learning optimization). 
At last, the study summarizes the technical challenges for existing cloud-native databases and suggests the directions for future research.
Abstract: Software defect localization refers to the activity of finding program elements that are related to software failure. The existing defect localization techniques, however, can only produce localization results at the function or statement level. These coarse-grained localization results can affect the efficiency and effectiveness of manual debugging and automatic software defect repair. This study focuses on the fine-grained identification of specific code tokens that lead to software defects. The study establishes abstract syntax tree paths for code tokens and proposes a fine-grained defect localization model based on a pointer neural network to predict specific code tokens of defects and specific operation behaviors of repairing the tokens. A large number of defect patch data sets in open-source projects contain a large amount of trainable data, and the paths constructed based on abstract syntax trees can effectively capture the program’s structural information. Experimental results show that the model trained in this study can accurately predict defect code tokens and is significantly better than the baseline methods based on statistics and machine learning. In addition, in order to verify that fine-grained defect localization results can contribute to automatic defect repair, two kinds of program repair processes are designed based on the fine-grained defect localization results. The processes are implemented by using code completion tools to predict the correct token or by following heuristic rules to find appropriate code repair elements. The results show that both methods can effectively solve the overfitting problem in automatic software defect repair.
Abstract: The training of high-precision federated learning models consumes a large number of users’ local resources. The users who participate in the training can gain illegal profits by selling the jointly trained model without others’ permission. In order to protect the property rights of federated learning models, this study proposes a federated learning watermark based on backdoor (FLWB) by using the feature that deep learning backdoor technology maintains the accuracy of main tasks and only causes misclassification in a small number of trigger set samples. FLWB allows users who participate in the training to embed their own private watermarks in the local model and then map the private backdoor watermarks to the global model through the model aggregation in the cloud as the global watermark for federated learning. Then a stepwise training method is designed to enhance the expression effect of private backdoor watermarks in the global model so that FLWB can accommodate the private watermarks of the users without affecting the accuracy of the global model. Theoretical analysis proves the security of FLWB, and experiments verify that the global model can effectively accommodate the private watermarks of the users who participate in the training by only causing an accuracy loss of 1% of the main tasks through the stepwise training method. Finally, FLWB is tested by model compression and fine-tuning attacks. The results show that more than 80% of the watermarks can be retained when the model is compressed to 30% by FLWB, and more than 90% of the watermarks can be retained under four different fine-tuning attacks, which indicates the excellent robustness of FLWB.
Abstract: Internet transport-layer protocols rely on the feedback information provided by the acknowledgment (ACK) mechanism to achieve functions such as congestion control and reliable transmission. According to the evolution of Internet transmission protocols, the ACK mechanisms of transmission control are reviewed. The unsolved problems among the mechanisms are discussed. Based on the elements of “type-trigger-information”, the ACK mechanism based on demand and its design principle are proposed, and the coupling relationship between the ACK mechanism and other transmission protocol submodules (e.g., congestion control and packet loss recovery) is emphatically analyzed. Subsequently, according to the design principle, the TACK mechanism, a feasible ACK mechanism based on demand, is elaborated, and related concepts are systematically clarified. Finally, several meaningful research directions are provided according to the challenges encountered by the ACK mechanism based on demand.
Abstract: Code review is one of the best practices widely used in modern software development, which is crucial for ensuring software quality and strengthening engineering capability. Code review comments (CRCs) are one of the main and most important outputs of code reviews. CRCs are not only the reviewers’ perceptions of code quality but also the references for authors to fix code defects and improve quality. Nowadays, although a number of software organizations have developed guidelines for performing code reviews, there are still few effective methods for evaluating the quality of CRCs. To provide an explainable and automated quality evaluation of CRCs, this study conducts a series of empirical studies such as literature reviews and case analyses. Based on the results of the empirical studies, the study proposes a multi-label learning-based approach for evaluating the quality of CRCs. Experiments are carried out by using a large software enterprise-specific dataset that includes a total of 17 000 CRCs from 34 commercial projects. The results indicate that the proposed approach can effectively evaluate the quality attributes and grades of CRCs. The study also provides some modeling experiences such as CRC labeling and verification, so as to help software organizations struggling with code reviews better implement the proposed approach.
Abstract: As an important cornerstone of artificial intelligence, knowledge graphs can extract and represent a priori knowledge from massive data on the Internet, which greatly solves the bottleneck problem of the poor interpretability of cognitive decisions of intelligent systems and plays a key role in the construction and application of intelligent systems. As the application of knowledge graph technology continues to deepen, the knowledge graph completion that aims to solve the problem of the incompleteness of graphs is imminent. Link prediction is the task of predicting the missing entities and relations in the knowledge graph, which is indispensable in the construction and completion of the knowledge graph. The full exploitation of the hidden relations in the knowledge graph and the use of massive entities and relations for computation require the conversion of the symbolic representations of information into the numerical form, i.e., knowledge graph representation learning. Hence, link prediction-oriented knowledge graph representation learning has become a popular research topic in the field of knowledge graphs. This study systematically introduces the latest research progress of link prediction-oriented knowledge graph representation learning methods from the basic concepts of link prediction and representation learning. Specifically, the research progress is discussed in detail in terms of knowledge representation forms and algorithmic modeling methods. The development of the knowledge representation forms is used as a clue to introduce the mathematical modeling of link prediction tasks in the knowledge representation forms of binary relations, multi-relations, and hyper-relations. On the basis of the representation learning modeling, the existing methods are refined into four types of models: translation distance models, tensor decomposition models, traditional deep learning models, and graph neural network models. 
The implementation methods of each type are described in detail together with representative models for solving link prediction tasks with different relational metrics. The common datasets and criteria for link prediction are then introduced, and on this basis, the link prediction effects of the four types of knowledge representation learning models under the knowledge representation forms of binary relations, multi-relations, and hyper-relations are presented in a comparative analysis. Finally, the future development trends are given in terms of model optimization, knowledge representation forms, and problem scope.
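As an illustration of the translation distance models named above, the following minimal sketch scores a triple in the style of TransE, where a relation is treated as a translation from the head embedding to the tail embedding. The embeddings and entity names are made up for the example, and the choice of the L2 norm is an assumption.

```python
import math

def transe_score(head, relation, tail):
    """Negative L2 distance ||h + r - t||; a higher score means a more plausible triple."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_tails(head, relation, candidate_tails):
    """Rank candidate tail entities (name -> embedding) by descending score."""
    return sorted(
        candidate_tails,
        key=lambda name: transe_score(head, relation, candidate_tails[name]),
        reverse=True,
    )

# Hypothetical 2-dimensional embeddings.
head = [1.0, 0.0]
relation = [0.0, 1.0]
candidates = {"x": [1.0, 1.0], "y": [3.0, 3.0]}
print(rank_tails(head, relation, candidates))  # ['x', 'y']
```

Link prediction for a missing tail then amounts to ranking all candidate entities with this score. The other model families in the survey replace the scoring function but keep the same ranking step.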
Abstract: Driven by mature data mining technologies, the recommendation system has been able to efficiently utilize explicit and implicit information such as score data and behavior traces and then combine the information with complex and advanced deep learning technologies to achieve sound results. Meanwhile, its application requirements also drive the in-depth mining and utilization of basic data and the load reduction of technical requirements to become research hotspots. On this basis, a lightweight recommendation model, namely LG_APIF, is proposed, which uses the graph convolutional network (GCN) method to deeply integrate information. According to behavior memory, the model employs the Ebbinghaus forgetting curve to simulate the users’ interest change process and adopts linear regression and other relatively lightweight traditional methods to mine adaptive periods and other depth information of items. In addition, it analyzes users’ current interest distribution and calculates the interest value of the item to obtain users’ potential interest type. It further constructs the graph structure of the user-type-item triplet and uses GCN technology after load reduction to generate the final item recommendation list. The experiments have verified the effectiveness of the proposed method. Through the comparison with eight classical models on the datasets of Last.fm, Douban, Yelp, and MovieLens, it is found that the Precision, Recall, and NDCG of the proposed method are improved, with an average improvement of 2.11% on Precision, 1.01% on Recall, and 1.48% on NDCG.
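The idea of simulating interest change with a forgetting curve can be sketched as follows. The abstract does not give the formula used in LG_APIF, so this example assumes the common exponential form R = exp(-t / S); the `strength` parameter, the day-based time unit, and the item types are hypothetical.

```python
import math
from collections import defaultdict

def retention(elapsed_days, strength=7.0):
    """Exponential forgetting curve R = exp(-t / S); S is an assumed memory strength."""
    return math.exp(-elapsed_days / strength)

def interest_distribution(interactions, now, strength=7.0):
    """Weight each interaction by its retention and normalize per item type.

    interactions: iterable of (item_type, day_of_interaction) pairs.
    now: current day, in the same unit as day_of_interaction.
    """
    raw = defaultdict(float)
    for item_type, day in interactions:
        raw[item_type] += retention(now - day, strength)
    total = sum(raw.values())
    return {item_type: weight / total for item_type, weight in raw.items()} if total else {}
```

With this weighting, an interaction from today counts fully, while one from ten days ago contributes exp(-10/7), so recent behavior dominates the estimated interest distribution.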
Abstract: Database management systems are divided into transactional (OLTP) systems and analytical (OLAP) systems according to application scenarios. With the growing demand for real-time data analysis and the increasing popularity of mixed OLTP and OLAP tasks, the industry has begun to focus on database management systems that support hybrid transactional/analytical processing (HTAP). An HTAP database system not only needs to meet the requirements of high-performance transaction processing but also supports real-time analysis for data freshness. Therefore, it poses new challenges to the design and implementation of database systems. In recent years, some prototypes and products with diverse architectures and technologies have emerged in industry and academia. This study reviews the background and development status of HTAP databases and classifies current HTAP databases from the perspective of storage and computing. On this basis, this study summarizes the key technologies used in the storage and computing of HTAP systems from bottom to top. Under this framework, the design ideas, advantages and disadvantages, and applicable scenarios of various systems are introduced. In addition, according to the evaluation benchmarks and metrics of HTAP databases, this study also analyzes the relationship between the design of various HTAP databases and their performance as well as data freshness. Finally, this study combines cloud computing, artificial intelligence, and new hardware technologies to provide ideas for future research and development of HTAP databases.
Abstract: Adaptor signature, also known as scriptless script, is an important cryptographic technique that can be used to solve the problems of poor scalability and low transaction throughput in blockchain applications such as cryptocurrency. An adaptor signature can be seen as an extension of a digital signature on hard relations, and it ties together the authorization with witness extraction and has many advantages in blockchain applications, such as (1) low on-chain cost; (2) improved fungibility of transactions; (3) advanced functionality beyond the limitation of the blockchain’s scripting language. The SM2 signature is the Chinese national standard signature algorithm and has been widely used in various important information systems. This work designs an efficient SM2-based adaptor signature with batch proofs and gives security proofs under the random oracle model. Based on the structure of the SM2 signature, the scheme avoids generating the zero-knowledge proofs used in the pre-signing phase and is more efficient than existing ECDSA/SM2-based adaptor signatures. Specifically, the efficiency of pre-signature generation is increased by 4 times, and the efficiency of pre-signature verification is increased by 3 times. Then, based on the distributed SM2 signature, this work develops a distributed SM2-based adaptor signature, which can avoid the single point of failure and improve the security of the signing key. Finally, in real-world applications, this work gives a secure and efficient batch atomic swap protocol for one-to-many scenarios based on the SM2-based adaptor signature.
Abstract: Commonsense question answering is an essential natural language understanding task that aims to solve natural language questions automatically by using commonsense knowledge to obtain accurate answers. It has a broad application prospect in areas such as virtual assistants or social chatbots and contains crucial scientific issues such as knowledge mining and representation, language understanding and computation, and answer reasoning and generation. Therefore, it has received wide attention from industry and academia. This study first introduces the main datasets in commonsense question answering. Secondly, it summarizes the distinctions between different sources of commonsense knowledge in terms of construction methods, knowledge sources, and presentation forms. Meanwhile, the study focuses on the analysis and comparison of the state-of-the-art commonsense question answering models, as well as the characteristic methods fusing commonsense knowledge. Particularly, based on the commonalities and characteristics of commonsense knowledge in different question answering task scenarios, this study establishes a commonsense knowledge classification system containing attribute, semantic, causal, context, abstract, and intention. On this basis, it conducts prospective research on the construction of commonsense knowledge datasets, the collaboration mechanism of perceptual knowledge fusion and pre-trained language models, and corresponding commonsense knowledge pre-classification techniques. Furthermore, the study reports specifically on the performance changes in the above models under cross-dataset migration scenarios and their potential contributions in commonsense answer reasoning. 
On the whole, this study gives a comprehensive review of existing data and state-of-the-art technologies, as well as a pre-research for the construction of cross-data knowledge systems, technology migration, and generalization, so as to provide references for the further development of theories and technologies while reporting on the existing technologies in the field.
Abstract: With the development of modern information technology, people’s demand for high resolution and realistic visual perception of image display devices has increased, which has put forward higher requirements for computer software and hardware and brought many challenges to rendering technology in terms of performance and workload. Using machine learning technologies such as deep neural networks to improve the quality and performance of rendered images has become a popular research method in computer graphics, while upsampling low-resolution images through network inference to obtain clearer high-resolution images is an important way to improve image generation performance and ensure high-resolution details. The geometry buffers (G-buffers) generated by the rendering engine in the rendering process contain much semantic information, which helps the network learn scene information and features effectively and then improve the quality of upsampling results. In this study, a super-resolution method for rendered contents in low resolution based on deep neural networks is designed. In addition to the color image of the current frame, the method uses high-resolution G-buffers to assist in the calculation and reconstruct the high-resolution content details. The method also leverages a new strategy to fuse the features of high-resolution buffers and low-resolution images, which implements a multi-scale fusion of different feature information in a specific fusion module. Experiments demonstrate the effectiveness of the proposed fusion strategy and module, and the proposed method shows obvious advantages, especially in maintaining high-resolution details, when compared with other image super-resolution methods.
Abstract: The SMT solver is an important piece of system software. Therefore, bugs in the SMT solver may lead to the function failure of software relying on it and even bring security incidents. However, fixing bugs in the SMT solver is time-consuming because developers need to spend a lot of effort in understanding and finding the root cause of the bugs. Although many studies on software bug localization have been proposed, there is no systematic work to automatically locate bugs in the SMT solver. Therefore, this study proposes a bug localization method for the SMT solver based on multi-source spectrums, namely SMTLOC. First, for a given bug in the SMT solver, SMTLOC proposes an enumeration-based algorithm to mutate the formula that triggers the bug, generating a set of witness formulas that do not trigger the bug but have execution traces similar to that of the bug-triggering formula. Then, according to the execution traces of the witness formulas and the source code information of the SMT solver, SMTLOC develops a technique based on the coverage spectrum and historical spectrum to calculate the suspiciousness of files, thus locating the files that contain the bug. In order to evaluate the effectiveness of SMTLOC, 60 bugs in the SMT solver are collected. Experimental results show that SMTLOC is superior to the traditional spectrum bug localization method and can locate 46.67% of the bugs in TOP-5 files, and the localization effect is improved by 133.33%.
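Coverage-spectrum suspiciousness, as mentioned in the abstract above, can be sketched with a standard formula. The abstract does not state which formula SMTLOC uses, so this example substitutes the well-known Ochiai metric and ignores the historical spectrum. The bug-triggering formula plays the role of the failing run, the witness formulas play the role of passing runs, and the file names are hypothetical.

```python
import math

def ochiai_suspiciousness(failing_runs, passing_runs):
    """Score each file with Ochiai: ef / sqrt(total_failed * (ef + ep)).

    failing_runs, passing_runs: lists of sets of files covered by each run.
    ef / ep: number of failing / passing runs that cover the file.
    """
    files = set().union(*failing_runs, *passing_runs)
    total_failed = len(failing_runs)
    scores = {}
    for name in files:
        ef = sum(name in run for run in failing_runs)
        ep = sum(name in run for run in passing_runs)
        denominator = math.sqrt(total_failed * (ef + ep))
        scores[name] = ef / denominator if denominator else 0.0
    return scores

# Hypothetical coverage: one bug-triggering formula, two witness formulas.
failing = [{"solver.cpp", "rewriter.cpp", "parser.cpp"}]
passing = [{"solver.cpp", "parser.cpp"}, {"parser.cpp"}]
scores = ochiai_suspiciousness(failing, passing)
print(sorted(scores, key=scores.get, reverse=True))
# ['rewriter.cpp', 'solver.cpp', 'parser.cpp']
```

A file covered only by the failing run ranks highest, which is why witness formulas with traces close to the failing one help: they exonerate most of the shared files.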
Abstract: How to quickly and effectively mine valuable information from massive data to better guide decision-making is an important goal of big data analysis. Visual analysis is an important big data analysis method, and it takes advantage of the characteristics of human visual perception, utilizes visualization charts to present laws contained in complex data intuitively, and supports human-centered interactive data analysis. However, the visual analysis still faces several challenges, such as the high cost of data preparation, high latency of interaction response, high threshold for visual analysis, and low efficiency of interaction modes. To address the above challenges, researchers propose a series of methods to optimize the human-computer interaction mode of visual analysis systems and improve the intelligence of the system by leveraging data management and artificial intelligence techniques. This study systematically sorts out, analyzes, and summarizes these methods and puts forward the basic concept and key technical framework of intelligent data visualization analysis. Then, under the framework, the research progress of data preparation for visual analysis, intelligent data visualization, efficient visual analysis, and intelligent visual analysis interfaces both in China and abroad is reviewed and analyzed. Finally, this study looks forward to the future development trend of intelligent data visualization analysis.
Abstract: Machine learning methods can be combined well with software testing to enhance test effectiveness, but few scholars have applied them to test data generation. In order to further improve the efficiency of test data generation, a chained model combining support vector machine (SVM) and extreme gradient boosting (XGBoost) is proposed, and multi-path test data generation is realized by a genetic algorithm based on the chained model. Firstly, this study uses certain samples to train several sub-models (i.e., SVM and XGBoost) for predicting the state of path nodes, selects the optimal sub-models based on their prediction accuracy, and links the optimal sub-models in sequence according to the order of the path nodes, so as to form a chained model, namely chained SVM and XGBoost (C-SVMXGBoost). When using the genetic algorithm to generate test cases, the study uses the trained chained model instead of the instrumentation method to obtain the path covered by the test data (i.e., the predicted path), finds the path set whose predicted paths are similar to the target path, performs instrumentation verification on the predicted paths in this set, obtains accurate paths, and calculates fitness values. In the crossover and mutation process, excellent test cases with a large path level depth in the sample set are introduced for reuse to generate test data covering the target path. Finally, individuals with higher fitness during evolution are saved, and C-SVMXGBoost is updated, so as to further improve the test efficiency. Experiments show that C-SVMXGBoost is more suitable for solving the path prediction problem and improving the test efficiency than other chained models. Moreover, compared with the existing classical methods, the proposed method can increase the coverage rate by up to 15%. The mean number of evolutionary generations is also reduced, and the reduction can reach 65% on large programs.
Abstract: As an important technology in the field of artificial intelligence (AI), deep neural networks are widely used in various image classification tasks. However, existing studies have shown that deep neural networks have security vulnerabilities and are vulnerable to adversarial examples. At present, there is no systematic analysis of adversarial example detection for images. To improve the security of deep neural networks, this study builds on existing research and comprehensively introduces adversarial example detection methods in the field of image classification. First, the detection methods are divided into supervised detection and unsupervised detection according to how the detector is constructed, and each group is then classified into subclasses according to detection principles. Finally, the study summarizes the problems in adversarial example detection and provides suggestions and an outlook in terms of generalization and lightweight design, aiming to assist in AI security research.
Abstract: Depth ambiguity is an important challenge for multi-person three-dimensional (3D) pose estimation of single-frame images, and extracting contexts from an image has great potential for alleviating depth ambiguity. Current top-down approaches usually model key point relationships based on human detection, which not only easily results in key point shifting or mismatching but also affects the reliability of absolute depth estimation using human scale factor because of a coarse-grained human bounding box with large background noise. Bottom-up approaches directly detect human key points from an image and then restore the 3D human pose one by one. However, the approaches are at a disadvantage in relative depth estimation although the scene context can be obtained explicitly. This study proposes a new two-branch network, in which human context based on key point region proposal and scene context based on 3D space are extracted by top-down and bottom-up branches, respectively. The human context extraction method with noise resistance is proposed to describe the human by modeling key point region proposal. The dynamic sparse key point relationship for pose association is modeled to eliminate weak connections and reduce noise propagation. A scene context extraction method from a bird’s-eye-view is proposed. The human position layout in 3D space is obtained by modeling the image’s depth features and mapping them to a bird’s-eye-view plane. A network fusing human and scene contexts is designed to predict absolute human depth. The experiments are carried out on public datasets, namely MuPoTS-3D and Human3.6M, and results show that compared with those by the state-of-the-art models, the relative and absolute position accuracies of 3D key points by the proposed HSC-Pose are improved by at least 2.2% and 0.5%, respectively, and the position error of mean roots of the key points is reduced by at least 4.2 mm.
Abstract: As an automatic search tool, mixed integer linear programming (MILP) is widely used to search for differential, linear, integral, and other cryptographic properties of block ciphers. In this study, a new technique of constructing MILP models based on a dynamic selection strategy is proposed, which uses different constraint inequalities to describe the propagation of cryptographic properties under different conditions. Specifically, according to the different Hamming weights of the input division property, this study adopts different methods to construct MILP models of the division property propagation with linear layers. Finally, this technique is applied to search for integral distinguishers of uBlock and Saturnin algorithms. The experimental results show that the proposed technique can obtain an 8-round integral distinguisher with 32 more balance bits than the previous optimal integral distinguisher for the uBlock128 algorithm. In addition, this study gets 9- and 10-round integral distinguishers for uBlock128 and uBlock256 algorithms which are one round longer than the previous optimal integral distinguishers. For the Saturnin256 algorithm, the study finds a 9-round integral distinguisher which is one round longer than the previous optimal integral distinguisher.
Abstract: The hierarchical topic model is an important tool for organizing topic hierarchies. Most existing hierarchical topic models provide tree-structured prior distributions for document topics by introducing the nCRP construction method into the topic models, but they cannot acquire a topic hierarchy with clear domain meanings, referred to as a domain topic hierarchy. Meanwhile, domain topics exhibit not only hierarchical relationships but also sub-topic aspect sharing relationships under different parent topics. No model in the current research on topic relationships yields such a domain topic hierarchy. In order to automatically and effectively mine the hierarchical and correlated relationships of domain topics from domain texts, the following improvements are put forward. Firstly, this study improves the nCRP construction method through the topic sharing mechanism and proposes the nCRP+ hierarchical construction method to provide a tree-structured prior distribution with hierarchical topic aspect sharing for topics generated from topic models. Then, a reallocated hierarchical Dirichlet process (rHDP) model is developed based on the nCRP+ and HDP models. By employing the domain taxonomy, word semantics, and domain representation of topic words, the study defines domain knowledge, including the domain membership degree based on the voting mechanism, the semantic relevance between words and domain topics, and the contribution degree of hierarchical topic words. Finally, domain knowledge is used to improve the allocation processes of domain topics and topic words in the rHDP model, and the rHDP with domain knowledge (rHDP_DK) model is proposed to improve the sampling process. The experimental results show that hierarchical topic models based on nCRP+ are superior to those based on nCRP (hLDA and nHDP) and to the neural topic model (TSNTM) in terms of evaluation metrics. The topic hierarchy built by the rHDP_DK model is characterized by a clear domain topic hierarchy and explicit domain differences among related sub-topics. Furthermore, the model provides a general automatic mining framework for domain topic hierarchies.
Abstract: In multi-label learning, each sample is associated with multiple labels. The key task is how to use the correlation between labels when building the model. Multi-label deep forest (MLDF) algorithm attempts to mine the correlation between labels by using layer-by-layer representation learning under the framework of deep ensemble learning and use the obtained label probability representation to improve prediction accuracy. However, on the one hand, the label probability representation is highly correlated with the label information, which will lead to its low diversity. As the depth of the deep forest increases, the performance will decline. On the other hand, the calculation of label probability requires the storage of forest structures with all layers and the application of these structures one by one in the test stage, which will cause unbearable computational and storage overhead. To solve these problems, this study proposes interaction representation-based MLDF (iMLDF). iMLDF mines the structural information in the feature space from the decision path of the forest model, extracts the feature interaction in the decision tree path by using the random interaction trees, and obtains two interaction representations of feature confidence score and label probability distribution, respectively. On the one hand, iMLDF makes full use of the feature structural information in the forest model to enrich the relevant information between labels. On the other hand, it calculates all the representations through interaction expressions so that the algorithm does not need to store all the forest structures, which greatly improves computational efficiency. The experimental results show that iMLDF algorithm achieves better prediction performance, and the computational efficiency is improved by an order of magnitude compared with MLDF for datasets with massive samples.
Abstract: Graph partitioning is a basic task in distributed graph computing. It divides a large-scale graph into parts and allocates them to different machines in a cluster. The quality of graph partitioning has a great impact on the performance of distributed graph computing, and graph partitioning aims to minimize edge cuts while keeping the load balanced. Nowadays, graph data usually grow dynamically, which requires a partitioning method that can process dynamic incremental graphs while preserving partitioning quality. Although some dynamic graph partitioning algorithms have been presented recently, they cannot process real-time dynamic changes and obtain high-quality graph partitioning results simultaneously. In this study, a dynamic incremental graph partitioning algorithm based on vertex group redistribution (ED-IDGP) is proposed to solve the problem of large-scale dynamic incremental graph partitioning. In ED-IDGP, a dynamic processor is designed to process four different unit update types in real time, and the graph partitioning quality is further improved by executing a local optimizer near the dynamic change in the partition after each unit update. In the local optimizer of ED-IDGP, a vertex group search strategy based on the improved label propagation algorithm is used to search for the vertex group, and a vertex group movement gain formula is proposed to identify the most beneficial vertex group, which is then moved to the target partition for optimization. This study evaluates the performance and efficiency of the ED-IDGP algorithm from different perspectives and with different metrics on real datasets.
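The two partitioning objectives named above, edge cuts and load balance, can be made concrete with a small sketch. The graph, the vertex-to-partition assignment, and the imbalance definition (largest partition size divided by the average size) are illustrative assumptions. The vertex group movement gain formula of ED-IDGP is not given in the abstract and is not reproduced here.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(part, k):
    """Largest partition size divided by the average size (1.0 = perfectly balanced)."""
    sizes = Counter(part.values())
    average = len(part) / k
    return max(sizes.get(p, 0) for p in range(k)) / average

# Hypothetical graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
balanced = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Moving vertex 3 into partition 0 worsens both objectives.
moved = dict(balanced)
moved[3] = 0

print(edge_cut(edges, balanced), load_imbalance(balanced, 2))
print(edge_cut(edges, moved), load_imbalance(moved, 2))
```

A local optimizer of the kind the abstract describes would only accept moves that improve such metrics. The example shows a move that would be rejected.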
Abstract: As a new granular computing model, partition order product space can describe and solve problems from multiple views and levels. Its problem solving space is a lattice structure composed of multiple problem solving levels, and each problem solving level is composed of multiple one-level views. How to choose the problem solving level in the partition order product space is an NP-hard problem. Therefore, this study proposes a two-stage adaptive genetic algorithm (TSAGA) to find the problem solving level. First, real encoding is used to encode the problem solving level, and then the fitness function is defined according to the classification accuracy and granularity of the problem solving level. The first stage of the algorithm is based on a classical genetic algorithm, and some excellent problem solving levels are pre-selected as part of the initial population of the second stage, so as to optimize the problem solving space. In the second stage of the algorithm, an adaptive selection operator, adaptive crossover operator, and adaptive large-mutation operator are proposed, which can dynamically change with the number of iterations of the current population evolution, so as to further select the problem solving level in the optimized problem solving space. Experimental results demonstrate the effectiveness of the proposed method.
Abstract: The integration of machine learning and automatic reasoning is a new trend in artificial intelligence. Constraint satisfaction is a classic problem in artificial intelligence. A large number of scheduling, planning, and configuration problems in the real world can be modeled as constraint satisfaction problems, and efficient solving algorithms have always been a research hotspot. In recent years, many new methods of applying machine learning to solve constraint satisfaction problems have emerged. These methods based on “learning to reasoning” open up new directions for solving constraint satisfaction problems and show great development potential. They feature better adaptability, strong scalability, and online optimization. This study divides the current “learning to reasoning” methods into three categories: message-passing neural network-based, sequence-to-sequence-based, and optimization-based methods. Additionally, the characteristics of the methods and their solution effects on different problem sets are analyzed in detail. In particular, a comparative analysis is conducted on the relevant work for each type of method from multiple perspectives. Finally, the constraint solving methods based on “learning to reasoning” are summarized, and their future prospects are discussed.
Abstract: Enumerating minimal unsatisfiable subsets (MUS) is an important subproblem in the Boolean satisfiability problem. For an unsatisfiable problem, the MUS enumeration can reflect the key factors resulting in its unsatisfiability. However, enumerating MUS is extremely time-consuming, and different pruning schemes directly affect the size of the search space and the total number of iterations, thus affecting the algorithm efficiency. This study proposes a novel enhanced pruning scheme, accelerating by critical MSS (ABC), to accelerate the MUS enumeration. According to the relationship among maximal satisfiable subsets (MSS), minimal correction sets (MCS), and MUS, the concepts of cMSS and subMUS are put forward. Additionally, four properties are summarized, notably that each MUS must be a superset of subMUS, and the fact that MUS and MCS are hitting sets of each other is then exploited to avoid the time cost of solving hitting sets of MCS. When the subMUS is unsatisfiable, it is the only MUS, and the algorithm terminates early; otherwise, the node representing subMUS is pruned to avoid searching the non-solution space. The effectiveness of the proposed ABC scheme is proven theoretically, and the scheme is applied to the state-of-the-art algorithms MARCO and MARCO-MAM. Experimental results on SAT11 MUS benchmarks show that the proposed scheme can effectively prune the search space and improve the enumeration efficiency of MUS.
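The hitting-set relationship between MUS and MCS that the scheme relies on can be checked by brute force on a tiny formula. The sketch below is not the ABC scheme itself; the four-clause formula and all helper names are illustrative assumptions, and the exhaustive enumeration is only feasible for toy inputs.

```python
from itertools import combinations, product

# Hypothetical CNF, clauses as tuples of signed ints: 1 = x, -1 = not x, 2 = y.
CLAUSES = [(1,), (-1,), (1, 2), (-2,)]
N_VARS = 2

def satisfiable(idxs):
    """Brute-force SAT check of the clause subset with the given indices."""
    for bits in product([False, True], repeat=N_VARS):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in CLAUSES[i])
               for i in idxs):
            return True
    return False

def subsets(n):
    """All subsets of range(n) as frozensets."""
    for r in range(n + 1):
        for combo in combinations(range(n), r):
            yield frozenset(combo)

def minimal(sets):
    """Keep only the sets that have no proper subset in the collection."""
    return {s for s in sets if not any(t < s for t in sets)}

ALL = frozenset(range(len(CLAUSES)))
every = list(subsets(len(CLAUSES)))

# MUS: minimal unsatisfiable subsets of clauses.
muses = minimal({s for s in every if not satisfiable(s)})
# MCS: minimal sets whose removal leaves a satisfiable remainder (an MSS).
mcses = minimal({s for s in every if satisfiable(ALL - s)})

def minimal_hitting_sets(family):
    """Minimal clause sets that intersect every member of the family."""
    return minimal({s for s in every if all(s & member for member in family)})

print(sorted(sorted(m) for m in muses))
print(sorted(sorted(m) for m in mcses))
print(minimal_hitting_sets(mcses) == muses)
```

Here clause 1 belongs to every MUS, which corresponds to the singleton MCS {1}. Clauses that every MUS must contain are a natural candidate for what the abstract calls subMUS, although the abstract does not define it precisely.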
Abstract: Network traffic encryption not only protects corporate data and user privacy but also brings new challenges to malicious traffic detection. According to different ways of processing encrypted traffic, encrypted malicious traffic detection technology can be divided into active and passive detection. Active detection technology includes detection after traffic decryption and that based on searchable encryption technology. Its research focuses on privacy protection and detection efficiency improvement, and mainly analyzes the application of trusted execution environments and controllable transmission protocols. Passive detection technology is a method of identifying encrypted malicious traffic without perception for users and without performing any encryption or decryption operations. The research focuses on the selection and construction of features. It analyzes relevant detection methods from three types of features such as side channel features, plaintext features, and raw traffic, and then the experimental evaluation conclusions of relevant models are given. Finally, the feasibility of the research on the countermeasures of encrypted malicious traffic detection is analyzed from the perspectives of obfuscating traffic characteristics, interference learning algorithms, and hiding relevant information.
Abstract: Committee consensus and hybrid consensus elect a committee to validate blocks in place of all nodes, which can effectively speed up consensus and improve throughput. However, malicious attacks and bribes can easily lead to committee corruption, affect consensus results, and even cause system paralysis. Although existing work proposes reputation mechanisms to reduce the possibility of committee corruption, these mechanisms have high overhead and poor reliability and cannot reduce the impact of corruption on the system. Therefore, this study proposes a dynamic blockchain consensus with pre-validation (DBCP). DBCP realizes reliable reputation evaluation of the committee through pre-validation with little overhead, which can eliminate malicious nodes from the committee in time. If serious corruption has undermined the consensus result, DBCP transfers the authority of block validation to all nodes through dynamic consensus and eliminates the committee nodes that give wrong suggestions to avoid system paralysis. When the committee iterates to the high-credibility state, DBCP hands the authority of block validation back to the committee, and all nodes accept the consensus result from the committee without verifying the block, which speeds up the consensus. The experimental results show that the throughput of DBCP is two orders of magnitude higher than that of Bitcoin and similar to that of Byzcoin. In addition, DBCP can quickly deal with committee corruption within a block cycle, demonstrating better security than Byzcoin.
Abstract: How to improve the accuracy of matching between natural language query input and highly structured programming language source code is a fundamental concern in code search. Accurate extraction of code features is one of the key challenges to improving matching accuracy. The semantics expressed by statements in codes is not only relevant to themselves but also to their contexts. The structural model of the code provides rich contextual information for understanding code functions. This study proposes a code search method based on function multigraph embedding. By using an early fusion strategy, the study fuses the data dependencies of code statements into a control flow graph and constructs a function multigraph to represent the code. The multigraph explicitly expresses the dependency relationships of indirect predecessor and successor nodes that are lacking in the control flow graph through data dependencies and enhances the contextual information of statement nodes. At the same time, in view of the edge heterogeneity of the multigraph, this study uses the relational graph convolutional network to extract the features of the code from the function multigraph. Experiments on a public dataset show that the proposed method can improve the MRR by more than 5% compared with the existing methods based on code text and structural models. The ablation experiments also show that the control flow graph contributes more to the search accuracy than the data dependence graph.
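The early fusion of data dependencies into a control flow graph can be sketched with plain Python data structures. The five-statement function, the edge lists, and the relation labels "control" and "data" are illustrative assumptions. The relational graph convolutional network that consumes the multigraph is not shown.

```python
from collections import defaultdict

def build_function_multigraph(cfg_edges, data_dep_edges):
    """Merge control-flow and data-dependency edges into one typed multigraph.

    Returns a dict mapping each source node to a list of (successor, relation)
    pairs, so two nodes may be linked by several edges of different types.
    """
    graph = defaultdict(list)
    for src, dst in cfg_edges:
        graph[src].append((dst, "control"))
    for src, dst in data_dep_edges:
        graph[src].append((dst, "data"))
    return dict(graph)

# Hypothetical function, one node per statement:
#   s0: a = read()
#   s1: b = 0
#   s2: if a > 0:
#   s3:     b = a
#   s4: return b
cfg = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4)]
data = [(0, 2), (0, 3), (1, 4), (3, 4)]

g = build_function_multigraph(cfg, data)
for node, out_edges in sorted(g.items()):
    print(node, out_edges)
```

The data edge from s0 to s3 links statements that are not adjacent in the control flow graph, which is the indirect predecessor and successor information the abstract refers to. The pair s3 and s4 is joined by both a control edge and a data edge, which is why a multigraph with typed edges is needed.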
Abstract: Third-party library (TPL) detection is an upstream task in the domain of Android application security analysis, and its detection accuracy has a significant impact on its downstream tasks including malware detection, repackaged application detection, and privacy leakage detection. To improve detection accuracy and efficiency, this study proposes a package structure and signature-based TPL detection method, named LibPass, by leveraging the idea of pairwise comparison. LibPass combines primary module identification, TPL candidate identification, and fine-grained detection in a streamlined way. The primary module identification aims at improving detection efficiency by distinguishing the binary code of the main program from that of the introduced TPL. On this basis, a two-stage detection method consisting of TPL candidate identification and fine-grained detection is proposed. The TPL candidate identification leverages the stability of package structure features to deal with obfuscation of applications to improve detection accuracy and identifies candidate TPLs by rapidly comparing package structure signatures to reduce the number of pairwise comparisons, so as to improve the detection efficiency. The fine-grained detection accurately identifies the TPL of a specific version by a finer-grained but more costly pairwise comparison among candidate TPLs. In order to validate the performance and the efficiency of the detection method, three benchmark datasets are built to evaluate different detection capabilities, and experiments are conducted on these datasets. The experimental results are deeply analyzed in terms of detection performance, detection efficiency, and obfuscation resistance, and it is found that LibPass has high detection accuracy and efficiency and can deal with various common obfuscation operations.
Abstract: Memory error vulnerabilities (e.g., buffer overflow) are often caused by improper use of memory copy functions. The identification of memory copy functions in binary programs is beneficial for finding memory error vulnerabilities. However, current methods for identifying memory copy functions in binary programs mainly rely on static analysis to extract functions’ features, control flow, data flow, and other information, which results in high false positive and false negative rates. This study proposes a novel technique, namely CPSeeker, based on hybrid static and dynamic analysis to improve the effectiveness of identifying memory copy functions. CPSeeker combines the advantages of static analysis and dynamic analysis, collects the global static information and local execution information of functions in stages, and fuses the extracted information to identify memory copy functions in binary programs. The experimental results show that CPSeeker outperforms the state-of-the-art BootStomp, SaTC, CPYFinder, and Gemini in identifying memory copy functions, with an F1 value of 0.96, despite its increased runtime consumption. Furthermore, CPSeeker is not affected by the compilation environment (compiler version, compiler type, and compiler optimization level). In addition, CPSeeker performs better in actual firmware tests.
Abstract: The broad-learning-based dynamic fuzzy inference system (BL-DFIS) can automatically assemble simplified fuzzy rules and achieve high accuracy in classification tasks. However, when BL-DFIS works on large and complex datasets, it may generate too many fuzzy rules to achieve satisfactory identification accuracy, which adversely affects its interpretability. In order to circumvent this bottleneck, a fuzzy neural network called feature-augmented random vector functional-link neural network (FA-RVFLNN) is proposed in this study to achieve an excellent trade-off between classification performance and interpretability. In the proposed network, the RVFLNN with original data as input is taken as its primary structure, and BL-DFIS is taken as a performance supplement, which implies that FA-RVFLNN contains direct links to boost the performance of the whole system. The inference mechanism of the primary structure can be explained by a fuzzy logic operator (I-OR), owing to the use of Sigmoid activation functions in the enhancement nodes of this structure. Moreover, the original input data with clear meaning also help to explain the inference rules of the primary structure. With the support of direct links, FA-RVFLNN can learn more useful information through enhancement nodes, feature nodes, and fuzzy nodes. The experimental results indicate that FA-RVFLNN eases the problem of rule explosion caused by excessive enhancement nodes in the primary structure and improves the interpretability of BL-DFIS therein (the average number of fuzzy rules is reduced by about 50%), while remaining competitive in terms of generalization performance and network size.
Abstract: The mixed cooperative-competitive multi-agent system consists of controlled target agents and uncontrolled external agents. The target agents cooperate with each other and compete with external agents, so as to deal with the dynamic changes in the environment and the external agents and complete tasks. In order to train the target agents and make them learn the optimal policy for completing the tasks, the existing work proposes two kinds of solutions: (1) focusing on the cooperation between target agents, viewing the external agents as a part of the environment, and leveraging multi-agent reinforcement learning to train the target agents; these approaches cannot handle the uncertainty of or dynamic changes in the external agents’ policy; (2) focusing on the competition between target agents and external agents, modeling the competition as two-player games, and using a self-play approach to train the target agents; these approaches are only suitable for cases with one target agent and one external agent, and they are difficult to extend to a system consisting of multiple target agents and external agents. This study combines the two kinds of solutions and proposes a counterfactual regret advantage-based self-play approach. Specifically, first, based on the counterfactual regret minimization and counterfactual multi-agent policy gradient, the study designs a counterfactual regret advantage-based policy gradient approach that lets the target agents update their policy more accurately. Second, in order to deal with the dynamic changes in the external agents’ policy during the self-play process, the study leverages imitation learning, which takes the external agents’ historical decision-making trajectories as training data and imitates the external agents’ policy, so as to explicitly model the external agents’ behaviors. Third, based on the counterfactual regret advantage-based policy gradient and the modeling of external agents’ behaviors, this study designs a self-play training approach. This approach can obtain the optimal joint policy for training multiple target agents when the external agents’ policy is uncertain or dynamically changing. The study also conducts a set of experiments on the cooperative electromagnetic countermeasure, including three typical mixed cooperative-competitive tasks. The experimental results demonstrate that, compared with other approaches, the proposed approach improves the self-play performance by at least 78%.
Abstract: With the popularity of touch devices, pen + touch inputs have become mainstream input modes for mobile office work. However, existing applications mainly take only one of them as input, which limits users’ interaction space. In addition, existing pen + touch research mainly focuses on serial pen + touch cooperation and parallel processing of specific interactive tasks and does not systematically consider the parallel cooperation mechanism and intention correlation between different inputs. This study first proposes an interaction model based on pen + touch inputs and then defines pen + touch interaction primitives according to users’ behavioral habits in pen + touch cooperation, so as to extend the pen + touch interaction space. Furthermore, by using a partially observable Markov decision process (POMDP), the study develops a method of extracting pen + touch input intentions based on time sequence information, so as to incrementally extract the interaction intention of polysemantic interaction primitives. Finally, the study evaluates the advantages of pen + touch inputs through a user experiment.
Abstract: Code search is an important research topic in natural language processing and software engineering. Developing efficient code search algorithms can significantly improve code reuse and the working efficiency of software developers. The task of code search is to retrieve code fragments that meet the requirements from a massive code repository by taking a natural language description of the function of the code fragments as input. Although the sequence model-based code search method DeepCS has achieved promising results, it cannot capture the deep semantics of the code. GraphSearchNet, a code search method based on graph embedding, can alleviate this problem, but it does not perform fine-grained matching on codes and texts and ignores the global relationship between code graphs and text graphs. To address the above limitations, this study proposes a code search method based on a relational graph convolutional network, which encodes the constructed text graphs and code graphs, performs fine-grained matching on text queries and code fragments at the node level, and applies neural tensor networks to capture their global relationship. Experimental results on two public datasets show that the proposed method achieves higher search accuracy than the state-of-the-art baseline models DeepCS and GraphSearchNet.
Abstract: In order to perform knowledge mining and management, information systems need to process various forms of data, including stream data. Stream data have the characteristics of large data scale, fast generation speed, and strong timeliness of the knowledge contained in them. Therefore, it is very important for knowledge management of information systems to develop stream processing technology that supports real-time stream processing applications. Stream processing systems (SPSs) can be traced back to the 1990s, and they have undergone significant development since then. However, current diverse knowledge management needs and the new generation of hardware architectures have brought new challenges and opportunities for SPSs, and a series of technical research on stream processing ensues. This study introduces the basic requirements and development history of SPSs and then analyzes relevant technologies in the SPS field in terms of four aspects: programming interface, execution plan, resource scheduling, and fault tolerance. Finally, this study predicts the research directions and development trends of stream processing technology in the future.
Abstract: Hybrid transactional/analytical processing (HTAP) database systems have gained wide recognition among users because they fully support mixed workloads, i.e., transactions and analytical queries, in one system. Most HTAP database systems tend to maintain multiple data versions or additional replicas to accomplish online analytical processing (OLAP) without downgrading the write performance of online transactional processing (OLTP). This leads to a consistency problem between the data of TP and AP versions. Meanwhile, HTAP database systems face the core challenge of achieving efficient data sharing under resource isolation, and the data-sharing model integrates the trade-off between business requirements for performance and data freshness. To systematically explain the data-sharing model and optimization strategies of existing HTAP database systems, this study first utilizes the consistency models to define the data-sharing model and classifies the consistency models for HTAP data sharing into three categories, namely, linear consistency, sequential consistency, and session consistency, according to the differences between TP generated versions and AP query versions. After that, it examines the whole process of data-sharing models in depth from three core issues, i.e., data version number distribution, data version synchronization, and data version tracking, and provides the implementation methods of different consistency models. Furthermore, this study takes a dozen classic and popular HTAP database systems as examples for an in-depth interpretation of the implementation methods. Finally, it summarizes and analyzes the optimization strategies of the version synchronization, tracking, and recycling modules involved in the data-sharing process and predicts the optimization directions of the data-sharing models. It is concluded that the self-adaptability of the data synchronization scope, self-tuning of the data synchronization cycle, and freshness-bound constraint control under sequential consistency are possible means of achieving better performance and higher freshness in HTAP database systems.
Abstract: Security bug reports (SBRs) can describe critical security vulnerabilities in software products. SBR prediction has attracted the increasing attention of researchers to eliminate security attack risks of software products. However, in actual software development scenarios, a new company or new project may need software security bug prediction, without enough marked SBRs for building SBR prediction models in practice. A simple solution is employing the migration model, which means that marked data of other projects can be adopted to build the prediction model. Inspired by two recent studies in this field, this study puts forward a cross-project SBR prediction method integrating knowledge graphs, i.e., knowledge graph of security bug report prediction (KG-SBRP), based on the idea of security keyword filtering. The text information field in SBR is combined with common weakness enumeration (CWE) and common vulnerabilities and exposures (CVE) Details to build a triple rule entity. Then the entity is utilized to build a knowledge graph of security bugs and identify SBRs by combining the entity and relationship recognition. Finally, the data is divided into training sets and test sets for model fitting and performance evaluation. The built model conducts empirical research on seven SBR datasets with different scales. The results show that compared with the current main methods FARSEC and Keyword matrix, the proposed method can increase the performance index F1-score by an average of 11% under cross-project SBR prediction scenarios. In addition, the F1-score value can also grow by an average of 30% in SBR prediction scenarios within a project.
Abstract: Software product line testing is challenging. The similarity-based testing method can improve testing coverage and fault detection rate by increasing the diversity of test suites. Due to its excellent scalability and satisfactory testing effects, the method has become one of the most important test methods for software product lines. How to generate diverse test cases and how to maintain the diversity of test suites are two key issues in this test method. To handle the above issues, this study proposes a software product line test algorithm based on diverse SAT solvers and novelty search (NS). Specifically, the algorithm simultaneously uses two types of diverse SAT solvers to generate diverse test cases. In particular, in order to improve the diversity of stochastic local search SAT solvers, the study proposes a general strategy that is based on a probability vector to generate candidate solutions. Furthermore, two archiving strategies inspired by the idea of the NS algorithm are designed and applied to maintain both the global and local diversity of the test suites. Ablation and comparison experiments on 50 real software product lines verify the effectiveness of both the diverse SAT solvers and the two archiving strategies, as well as the superiority of the proposed algorithm over other state-of-the-art algorithms.
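As a rough illustration of the novelty search idea behind the archiving strategies, the sketch below scores a candidate test case by its mean distance to its k nearest neighbours in the current suite, which is the usual NS novelty measure. The bit-vector encoding of configurations, the normalized Hamming distance, the function names, and k=2 are assumptions made for this example; the abstract does not specify the algorithm's actual distance or archive update rules.

```python
def hamming(a, b):
    """Normalized Hamming distance between two equal-length feature selections."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def novelty(candidate, archive, k=2):
    """Mean distance from the candidate to its k nearest neighbours in the archive."""
    nearest = sorted(hamming(candidate, member) for member in archive)[:k]
    return sum(nearest) / len(nearest)


def try_replace_least_novel(candidate, archive, k=2):
    """Hypothetical archive update: swap in the candidate if it is more novel
    than the least novel member of the archive. Returns True on replacement."""
    # Novelty of each member is measured against the other members only.
    scores = [
        novelty(member, archive[:i] + archive[i + 1:], k)
        for i, member in enumerate(archive)
    ]
    worst = min(range(len(archive)), key=lambda i: scores[i])
    rest = archive[:worst] + archive[worst + 1:]
    if novelty(candidate, rest, k) > scores[worst]:
        archive[worst] = candidate
        return True
    return False


# Each tuple is a product configuration: 1 = feature selected, 0 = deselected.
suite = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1)]
replaced = try_replace_least_novel((1, 1, 0, 0), suite)
print(replaced, suite)
```

A fixed-size archive with a replace-the-least-novel rule is only one of several possible designs; it is shown here because it keeps the suite size constant while pushing its members apart.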
Abstract: Business process execution language (BPEL) is an executable web service composition language. Compared with traditional programs, BPEL programs are significantly different in terms of programming models and execution modes. These new features make it challenging to locate and fix faults of BPEL programs detected during the testing process. In addition, fault fixing techniques developed for traditional software cannot be used for BPEL programs directly. This study proposes a fault fixing technique for BPEL programs based on template matching, namely BPELRepair, from the perspective of mutation analysis. In order to overcome the high computational overhead of the mutation analysis-based fault fixing technique, a set of optimization strategies is proposed from three perspectives, namely patch generation, test case selection, and termination condition. A supporting tool is developed to improve the automation and efficiency of fault fixing for BPEL programs. An empirical study is used to evaluate the effectiveness of the proposed fault fixing technique and optimization strategies. The experimental results show that the proposed technique can successfully fix about 53% of faults of BPEL programs, and the proposed optimization strategies can significantly reduce the overhead in terms of search matching, patch program verification, test case execution, and fault fixing.
Abstract: Quantum computing is expected to solve many typical and difficult problems in theory. The rapid development of quantum computers in recent years is pushing the theory into practice. However, numerous errors in current hardware can cause incorrect computational results, which severely limit the ability of quantum computers to solve practical problems. Quantum computing system software lies between applications and hardware. In addition, tapping the full potential of the system software in mitigating hardware errors is crucial to realizing practical quantum computing in the near future. As a result, many research works on quantum computing system software have recently emerged. This study classifies them into three categories: compilers, runtime systems, and debuggers. Through an in-depth analysis of these works, the study sorts out the research status of quantum computing system software and reveals their important roles in mitigating hardware errors. This study also looks forward to future research directions.
Abstract: Autonomous driving software based on deep neural networks (DNNs) has become the most popular solution. Like traditional software, DNNs can also produce incorrect output or unexpected behaviors, and DNN-based autonomous driving software has caused serious accidents, which seriously threaten life and property safety. Therefore, how to effectively test DNN-based autonomous driving software has become an urgent problem. Since it is difficult to predict and understand the behavior of DNNs, traditional software testing methods are no longer applicable. Existing autonomous driving software testing methods generate test data by adding pixel-level perturbations to original images or modifying the whole image. The generated test data are quite different from the real images, and the perturbation-based methods are difficult to understand. To solve the above problem, this study proposes a test data generation method, namely interpretability analysis-based test data generation (IATG). Firstly, it uses the interpretation method for DNNs to generate visual explanations of decisions made by autonomous driving software and chooses objects in the original images that have significant impacts on the decisions. Then, it generates test data by replacing the chosen objects with other objects with the same semantics. The generated test data are more similar to the real image, and the process is more understandable. As an important part of the autonomous driving software’s decision-making module, the steering angle prediction model is used to conduct experiments. Experimental results show that the introduction of the interpretation method effectively enhances the ability of IATG to mislead the steering angle prediction model. 
Furthermore, when the misleading angle is the same, the test data generated by IATG are more similar to the real image than DeepTest; IATG has a stronger misleading ability than semSensFuzz, and the interpretation analysis based important object selection method of IATG can effectively improve the misleading ability of semSensFuzz.
Abstract: Runtime configuration brings flexibility and customizability to users in the utilization of software systems. However, its enormous scale and complex mechanisms also pose significant challenges. A large number of scholars and research institutions have probed into runtime configuration to improve the availability and adaptability of software systems in complex environments. This study develops an analytical framework of runtime configuration to provide a systematic overview of state-of-the-art research from three different stages, namely configuration analysis and comprehension, configuration defect detection and misconfiguration diagnosis, and configuration utilization. The study also summarizes the limitations and challenges faced by current research and outlines the research trend of runtime configuration, which is of guiding significance for future work.
Abstract: As critical Internet infrastructure, DNS brings many privacy and security risks due to its plaintext transmission. Many encryption technologies for DNS channel transmission, such as DoH, DoT, and DoQ, are committed to preventing DNS data from being leaked or tampered with and ensuring the reliability of DNS message sources. Firstly, this study analyzes the privacy and security problems of plaintext DNS from six aspects, including the DNS message format, data storage and management, and system architecture and deployment, and then summarizes the existing related technologies and protocols. Secondly, the implementation principles and the application statuses of the encryption protocols for DNS channel transmission are analyzed, and the performance of each encryption protocol under different network conditions is discussed with multi-angle evaluation indicators. Meanwhile, it discusses the privacy protection effects of the encryption technologies for DNS channel transmission through the limitations of the padding mechanism, the encrypted traffic identification, and the fingerprint-based encryption activity analysis. In addition, the problems and challenges faced by encryption technologies for DNS channel transmission are summarized from the aspects of the deployment specifications, the illegal use of encryption technologies by malicious traffic and its attack on them, the contradiction between privacy and network security management, and other factors affecting privacy and security after encryption. Relevant solutions are also presented. Finally, it summarizes the highlights of future research, such as the discovery of the encrypted DNS service, server-side privacy protection, the encryption between recursive resolvers and authoritative servers, and DNS over HTTP/3.
Abstract: Knowledge space theory, which uses mathematical language for the knowledge evaluation and learning guide of learners, belongs to the research field of mathematical psychology. Skills and problems are the two basic elements of knowledge space, and an in-depth study of the relationship between them is the inherent requirement of knowledge state description and knowledge structure analysis. In the existing knowledge space theory, no explicit bi-directional mapping between skills and problems has been established, which makes it difficult to put forward a knowledge structure analysis model under intuitive conceptual meanings. Moreover, the partial order relationship between knowledge states has not been clearly obtained, which is not conducive to depicting the differences between knowledge states and planning the learning path of learners. In addition, the existing achievements mainly focus on the classical knowledge space, without considering the uncertainties of data in practical problems. To this end, this study introduces formal concept analysis and fuzzy sets into knowledge space theory and builds the fuzzy concept lattice models for knowledge structure analysis. Specifically, fuzzy concept lattice models of knowledge space and closure space are presented. Firstly, the fuzzy concept lattice of knowledge space is constructed, and it is proved that the extents of all concepts form a knowledge space by the upper bounds of any two concepts. The idea of granule description is introduced to define the skill-induced atomic granules of problems, whose combinations can help determine whether a combination of problems is a state in the knowledge space. On this basis, a method to obtain the fuzzy concepts in the knowledge space from the problem combinations is proposed. Secondly, the fuzzy concept lattice of closure space is established, and it is proved that the extents of all concepts form the closure space by the lower bounds of any two concepts. 
Similarly, the problem-induced atomic granules of skills are defined, and their combinations can help determine whether a skill combination is the skills required by a knowledge state in the closure space. In this way, a method to obtain the fuzzy concepts in the closure space from the skill combinations is presented. Finally, the effects of the number of problems, the number of skills, the filling factor, and the analysis scale on the sizes of knowledge space and closure space are analyzed by some experiments. The results show that the fuzzy concepts in the knowledge space are different from any existing concept and cannot be derived from other concepts. The fuzzy concepts in the closure space are attribute-oriented one-sided fuzzy concepts in essence. In the formal context of two-valued skills, there is one-to-one correspondence between the states in knowledge space and closure space, but this relationship does not hold in the formal context of fuzzy skills.
Abstract: Logs are an important information carrier of a computer system, recording the states of events, and a log system is responsible for log generation, collection, and output. OpenHarmony is a new open-source, distributed operating system for smart devices in all scenarios of a fully-connected world. Prior to the work described in this study, many key subsystems of OpenHarmony, including the log system, had not been built. The open-source nature of OpenHarmony enables third-party developers to contribute core code. To address the lack of a log system in OpenHarmony, this paper does the following work: ① It analyzes the technical architecture, advantages, and disadvantages of today’s popular log systems. ② It clarifies the model specifications of the log system HiLog according to the interconnection feature of heterogeneous devices in OpenHarmony. ③ It designs and implements the first log system HiLog of OpenHarmony and contributes it to the OpenHarmony trunk. ④ It conducts comparative experiments on the key indicators of HiLog. The experimental data show that in terms of basic performance, the throughput of HiLog and Log is 1500 KB/s and 700 KB/s, respectively, which indicates that HiLog has a 114% improvement over the log system of Android. In terms of log persistence, the packet loss of HiLog is less than 6‰ with a compression rate of 3.5% for persistency, much lower than that of Log. In addition, HiLog also has some novel practical functions such as data protection and flow control.
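The 114% figure follows directly from the two reported throughputs. A quick check in Python, where the variable and function names are chosen for this example only:

```python
hilog_kb_s = 1500  # HiLog throughput reported in the abstract, in KB/s
log_kb_s = 700     # Log throughput reported in the abstract, in KB/s


def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, as a percentage."""
    return (new - old) / old * 100


improvement = relative_improvement(hilog_kb_s, log_kb_s)
print(f"{improvement:.0f}%")  # prints "114%"
```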
Abstract: The morphological changes in retina boundaries are important indicators of retinal diseases, and the subtle changes can be captured by images obtained by optical coherence tomography (OCT). The retinal layer boundary segmentation based on OCT images can assist in the clinical judgment of related diseases. In OCT images, due to the diverse morphological changes in retina boundaries, the key boundary-related information, such as contexts and saliency boundaries, is crucial to the judgment and segmentation of layer boundaries. However, existing segmentation methods lack the consideration of the above information, which results in incomplete and discontinuous boundaries. To solve the above problems, this study proposes a coarse-to-fine method for the segmentation of retinal layer boundary in OCT images based on the end-to-end deep neural networks and graph search (GS), which avoids the phenomenon of “faults” common in non-end-to-end methods. In coarse segmentation, the attention global residual network (AGR-Net), an end-to-end deep neural network, is proposed to extract the above key information in a more sufficient and effective way. Specifically, a global feature module (GFM) is designed to capture the global context information of OCT images by scanning from four directions of the images. After that, the channel attention module (CAM) and GFM are sequentially combined and embedded in the backbone network to realize saliency modeling of context information of the retina and its boundaries. This effort effectively solves the problem of wrong segmentation caused by retina deformation and insufficient information extraction in OCT images. In fine segmentation, a GS algorithm is adopted to remove isolated areas or holes from the coarse segmentation results obtained by AGR-Net. 
In this way, the boundary keeps a fixed topology, and it is continuous and smooth, which further optimizes the overall segmentation results and provides a more complete reference for medical clinical diagnosis. Finally, the performance of the proposed method is evaluated from different perspectives on two public datasets, and the method is compared with the latest methods. The comparative experiments show that the proposed method outperforms the existing methods in terms of segmentation accuracy and stability.
Abstract: Deep neural networks (DNNs) have made remarkable achievements in many fields, but related studies show that they are vulnerable to adversarial examples. The gradient-based attack is a popular adversarial attack and has attracted wide attention. This study investigates the relationship between gradient-based adversarial attacks and numerical methods for solving ordinary differential equations (ODEs). In addition, it proposes a new adversarial attack based on the Runge-Kutta (RK) method, a numerical method for solving ODEs. According to the prediction idea in the RK method, perturbations are added to the original examples first to construct predicted examples, and then the gradients of the loss functions with respect to the original and predicted examples are linearly combined to determine the perturbations to be added for the generation of adversarial examples. Different from the existing adversarial attacks, the proposed adversarial attack employs the prediction idea of the RK method to obtain the future gradient information (i.e., the gradient of the loss function with respect to the predicted examples) and uses it to determine the adversarial perturbations to be added. The proposed attack features good extensibility and can be easily applied to all available gradient-based attacks. Extensive experiments demonstrate that in contrast to the state-of-the-art gradient-based attacks, the proposed RK-based attack boasts higher success rates and better transferability.
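The predict-then-combine step described above can be sketched on a toy model. The code below is a minimal illustration under stated assumptions, not the paper's attack: a linear logistic classifier stands in for the DNN, the step uses a second-order (Heun-style) scheme, and the function names, the step size `eps`, and the combination weights `a` and `b` are all hypothetical.

```python
import math


def sign(v):
    """Sign of a number as -1, 0, or 1."""
    return (v > 0) - (v < 0)


def loss(x, y, w):
    """Logistic loss of a toy linear classifier (stand-in for a DNN)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))


def grad(x, y, w):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def rk2_sign_attack(x, y, w, eps=0.1, a=0.5, b=0.5):
    """One attack step that mixes the current and the predicted gradient."""
    g_now = grad(x, y, w)
    # Prediction: perturb the original example to build a predicted example.
    x_pred = [xi + eps * sign(gi) for xi, gi in zip(x, g_now)]
    # "Future" gradient: gradient of the loss at the predicted example.
    g_future = grad(x_pred, y, w)
    # Linear combination of both gradients decides the final perturbation.
    combined = [a * g0 + b * g1 for g0, g1 in zip(g_now, g_future)]
    return [xi + eps * sign(ci) for xi, ci in zip(x, combined)]


x, y, w = [1.0, -2.0], 1, [0.5, -0.25]
x_adv = rk2_sign_attack(x, y, w)
print(loss(x, y, w), loss(x_adv, y, w))
```

For this linear toy model both gradients have the same sign pattern, so the result coincides with a plain sign-gradient step; the example only shows the data flow. On a nonlinear network the two gradients generally differ, which is where the future gradient can change the perturbation.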
Abstract: Nowadays, deep neural networks (DNNs) have been widely used in various fields. However, research has shown that DNNs are vulnerable to attacks of adversarial examples (AEs), which seriously threaten the development and application of DNNs. Most of the existing adversarial defense methods need to sacrifice part of the original classification accuracy to obtain defense capability and strongly rely on the knowledge provided by the generated AEs, so they cannot balance the effectiveness and efficiency of defense. Therefore, based on manifold learning, this study proposes an origin hypothesis of AEs in attackable space from the feature space perspective and a trap-type ensemble adversarial defense network (Trap-Net). Trap-Net adds trap data to the training data based on the original model and uses the trap-type smoothing loss function to establish the seducing relationship between the target data and trap data, so as to generate trap-type networks. In order to address the problem that most adversarial defense methods sacrifice original classification accuracy, ensemble learning is used to ensemble multiple trap networks, so as to expand attackable target space defined by trap labels in the feature space and reduce the loss of the original classification accuracy. Finally, Trap-Net determines whether the input data are AEs by detecting whether the data hit the attackable target space. Experiments on MNIST, K-MNIST, F-MNIST, CIFAR-10, and CIFAR-100 datasets show that Trap-Net has strong defense generalization of AEs without sacrificing the classification accuracy of clean samples, and the results of experiments validate the adversarial origin hypothesis in attackable space. In the low-perturbation white-box attack scenario, Trap-Net achieves a detection rate of more than 85% for AEs. In the high-perturbation white-box attack and black-box attack scenarios, Trap-Net has a detection rate of almost 100% for AEs. 
Compared with other detection methods of AEs, Trap-Net is highly effective against white-box and black-box adversarial attacks, and it provides an efficient robustness optimization method for DNNs in adversarial environments.
Abstract: Dynamic memory allocators are fundamental components of modern applications. They manage free memory and handle user memory requests. Modern general-purpose dynamic memory allocators ensure the balance of performance and memory footprint. However, in view of different memory footprints and optimization goals in application scenarios, a general-purpose memory allocator is not the optimal solution. Special-purpose memory allocators for specific application scenarios usually can better satisfy system requirements. However, they are time-consuming and error-prone to implement. Developers often use a memory allocation framework to build special-purpose dynamic memory allocators. However, existing memory allocator frameworks suffer from poor abstraction ability and insufficient composability and customizability. For this reason, this study proposes a composable and customizable dynamic memory allocator framework, namely mortise, based on function composability by reviewing the dynamic memory allocation process from the perspective of functional programming. The framework abstracts system memory allocation as a composition of hierarchical functions of several decoupled memory allocations, and these functions can provide policies to ensure higher customizability and composability. Mortise is implemented by using standard C. To achieve zero performance overhead of hierarchical function composition, mortise uses the metaprogramming features offered by the C preprocessor. Developers can quickly build a memory allocator for targeted application scenarios by composing and customizing the hierarchical function of allocators. In order to prove the effectiveness of mortise, this study presents three different memory allocator instances, namely tlsfcc, hslab, and wfslab, by using mortise. 
Specifically, tlsfcc is designed for multi-core embedded application scenarios, which improves the parallel throughput by replacing the synchronization strategy; hslab is a core-aware slab-type allocator, which optimizes performance on heterogeneous hardware by customizing thread cache; wfslab is a low-latency and wait-free/lock-free allocator. This study runs benchmarks to compare these allocators with several existing memory allocators. The experiments are carried out on an 8-core x86/64 platform and an 8-core heterogeneous aarch64 embedded platform, and the experimental results show that tlsfcc achieves a mean speedup of 1.76 and 1.59 on the two platforms compared with the original tlsf allocator; hslab takes only 69.6% and 85.0% of the execution time of tcmalloc, which has a similar architecture; the worst-case memory request latency of wfslab is the smallest among all memory allocators in the experiment, including the state-of-the-art lock-free memory allocators mimalloc and snmalloc.
Abstract: Spoken language understanding (SLU), as a core component of task-oriented dialogue systems, aims to extract the semantic framework of user queries. In dialogue systems, the SLU component is responsible for identifying user requests and creating a semantic framework that summarizes user requests. SLU usually includes two subtasks: intent detection (ID) and slot filling (SF). ID is regarded as a semantic utterance classification problem that analyzes the semantics of utterance at the sentence level, while SF is viewed as a sequence labeling task that analyzes the semantics of utterance at the word level. Due to the close correlation between intents and slots, mainstream works employ joint models to exploit shared knowledge across tasks. However, ID and SF are two different tasks with strong correlation, and they represent sentence-level semantic information and word-level information of utterances respectively, which means that the information of the two tasks is heterogeneous and has different granularities. This study proposes a heterogeneous interactive structure for joint ID and SF, which adequately captures the relationship between sentence-level semantic information and word-level information in heterogeneous information for two correlative tasks by adopting self-attention and graph attention networks. Different from ordinary homogeneous structures, the proposed model is a heterogeneous graph architecture containing different types of nodes and links because a heterogeneous graph involves more comprehensive information and rich semantics and can better interactively represent the information between nodes with different granularities. In addition, this study utilizes a window mechanism to accurately represent word-level embedding to better accommodate the local continuity of slot labels. Meanwhile, the study analyzes the effect of combining the proposed model with the pre-trained model BERT. 
The experimental results of the proposed model on two public datasets show that the model achieves an accuracy of 97.98% and 99.11% on the ID task and an F1 score of 96.10% and 96.11% on the SF task, which are superior to the current mainstream methods.
Abstract: Software vulnerabilities are security defects of computer software systems, and they threaten the integrity, security, and reliability of modern software and application data. Manual vulnerability management is time-consuming and error-prone. Therefore, in order to better deal with the challenges of vulnerability management, researchers have proposed a variety of automated vulnerability management schemes, among which automated vulnerability repair has attracted wide attention recently. Automated vulnerability repair consists of three main functions: vulnerability cause localization, patch generation, and patch validation, and it aims to assist developers in repairing vulnerabilities. The existing work lacks systematic classification and discussion of vulnerability repair technology. To this end, this study gives a comprehensive insight into the theory, practice, applicable scenarios, advantages, and disadvantages of existing vulnerability repair methods and technologies and presents a review of automated vulnerability repair technologies, so as to promote the development of vulnerability repair technologies and deepen researchers’ understanding of vulnerability repair problems. The main contents of the study include: (1) sorting out and summarizing the repair methods of specific and general vulnerabilities according to different vulnerability types; (2) classifying and summarizing different repair methods based on technical principles; (3) summarizing the main challenges of vulnerability repair; (4) looking into the future development directions of vulnerability repair.
Abstract: A social law is a set of restrictions on the available actions of agents to establish some target properties in a multiagent system. In the strategic case, where the agents have individual rationality and private information, the social law synthesizing problem should be modeled as an algorithmic mechanism design problem instead of a common optimization problem. Minimal side effect is usually a basic requirement for social laws. From the perspective of game theory, minimal side effect closely relates to the concept of maximum social welfare, and synthesizing a social law with minimal side effect can be modeled as an efficient mechanism design problem. Therefore, this study needs not only to find the efficient social laws with maximum social welfare for the given target property but also to pay the agents so as to induce incentive compatibility and individual rationality. The study first designs an efficient mechanism based on the VCG mechanism, namely VCG-SLM, and proves that it satisfies all the required formal properties. However, as the computation of VCG-SLM is an FP^NP-complete problem, the study proposes an ILP-based implementation of this mechanism (VCG-SLM-ILP), transforms the computation of allocation and payment into ILPs based on the semantics of ATL, and rigorously proves its correctness, so as to effectively utilize mature industrial-grade integer programming solvers and solve the intractable mechanism computing problems.
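For readers unfamiliar with the VCG mechanism that VCG-SLM builds on, the following sketch shows the textbook version with the Clarke pivot rule over a small, explicitly enumerated set of outcomes. It is not VCG-SLM itself: the candidate laws, the agents' reported values, and the function names are hypothetical, and the sign convention here expresses transfers as charges, whereas the abstract describes payments made to the agents.

```python
def total_value(valuations, outcome, exclude=None):
    """Sum of reported values for an outcome, optionally leaving one agent out."""
    return sum(
        values[outcome]
        for agent, values in valuations.items()
        if agent != exclude
    )


def vcg(valuations, outcomes):
    """Pick the welfare-maximizing outcome and compute Clarke pivot charges."""
    chosen = max(outcomes, key=lambda o: total_value(valuations, o))
    charges = {}
    for agent in valuations:
        # Best welfare the other agents could reach if this agent were absent.
        best_without = max(
            total_value(valuations, o, exclude=agent) for o in outcomes
        )
        # Welfare the other agents actually get under the chosen outcome.
        others_at_chosen = total_value(valuations, chosen, exclude=agent)
        charges[agent] = best_without - others_at_chosen
    return chosen, charges


# Hypothetical reported values of two agents for two candidate social laws.
reports = {
    "a1": {"law_A": 5, "law_B": 2},
    "a2": {"law_A": 1, "law_B": 3},
}
chosen, charges = vcg(reports, ["law_A", "law_B"])
print(chosen, charges)  # prints: law_A {'a1': 2, 'a2': 0}
```

Each agent is charged the loss its presence imposes on the others, which is what makes truthful reporting a dominant strategy. Enumerating outcomes explicitly only works for tiny instances, which is consistent with the abstract's motivation for an ILP-based implementation.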
Abstract: The heterogeneous many-core architecture with an ultra-high energy efficiency ratio has become an important development trend of supercomputer architecture. However, the complexity of heterogeneous systems puts forward higher requirements for application development and optimization, and they face many technical challenges such as usability and programmability in the development process. The independently developed new-generation Sunway supercomputer is equipped with a homegrown heterogeneous many-core processor, SW26010Pro. To take full advantage of the performance of the new-generation many-core processors and support the development and optimization of emerging scientific computing applications, this study designs and implements an optimized compiler swLLVM oriented to the SW26010Pro platform. The compiler supports Athread and SDAA dual-mode heterogeneous programming models and provides multi-level storage hierarchy description and SIMD extensions for vector-like operations. In addition, it realizes control-flow vectorization, cost-based node combination, and compiler optimization for multi-level storage hierarchy according to the architecture characteristics of SW26010Pro. The experimental results show that the compiler optimization designed and implemented in this paper achieves significant performance improvements. The average speedup of control-flow vectorization and node combination and optimization is 1.23 and 1.11, respectively, and the memory access optimization achieves a maximum performance improvement of 2.49 times. Finally, a comprehensive evaluation of swLLVM is performed from multiple dimensions on the standard test set SPEC CPU2006. 
The results show that swLLVM reports an average increase of 9.04% in the performance of floating-point projects, 5.25% in overall performance, and 79.1% in compilation speed and an average decline of 0.12% in the performance of integer projects and 1.15% in the code size compared to SWGCC with the same optimization level.
Abstract: In recent years, RGB-D salient object detection methods have achieved better performance than RGB salient object detection models by virtue of the rich geometric structure and spatial position information in depth maps and have thus attracted considerable attention from the academic community. However, existing RGB-D detection models still face the challenge of further improving performance. The emerging Transformer is good at modeling global information, while the convolutional neural network (CNN) is good at extracting local details. Therefore, effectively combining the advantages of CNN and Transformer to mine global and local information will help to improve the accuracy of salient object detection. For this purpose, an RGB-D salient object detection method based on cross-modal interactive fusion and global awareness is proposed in this study. The Transformer network is embedded into U-Net to better extract features by combining the global attention mechanism with local convolution. First, with the help of the U-Net encoder-decoder structure, this study efficiently extracts multi-level complementary features and decodes them step by step to generate a salient feature map. Then, the Transformer module is used to learn the global dependency between high-level features to enhance the feature representation, and the progressive upsampling fusion strategy is used to process the input and reduce the introduction of noise information. Moreover, to reduce the negative impact of low-quality depth maps, the study also designs a cross-modal interactive fusion module to realize cross-modal feature fusion. Finally, experimental results on five benchmark datasets show that the proposed algorithm outperforms other recent algorithms.
Abstract: Federated learning is an effective method to solve the problem of data silos. When the server aggregates all gradients, it may compute the global gradient incorrectly out of laziness or self-interest, so it is necessary to verify the integrity of the global gradient. The existing schemes based on cryptographic algorithms incur excessive verification overhead. To solve these problems, this study proposes a rational and verifiable federated learning framework. Firstly, according to game theory, the prisoner contract and betrayal contract are designed to force the server to be honest. Secondly, the scheme uses a replication-based verification scheme to verify the integrity of the global gradient and supports offline clients. Finally, the analysis proves the correctness of the scheme, and the experiments show that, compared with the existing verification algorithms, the proposed scheme reduces the computing overhead of the client side to zero, reduces the number of communication rounds in one iteration from three to two, and has a training overhead that is inversely proportional to the offline rate of the client side.
Abstract: The ranking function method is the main method for the termination analysis of loops, and it indicates that loop programs can be terminated. In view of single-path linear constraint loop programs, this study presents a new method to analyze the termination of the loops. Based on the calculation of the normal space of the increasing function, this method considers the calculation of the ranking function in the original program space as that in the subspace. Experimental results show that the method can effectively verify the termination of most loop programs in the existing literature.
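As a textbook illustration of what a ranking function certifies (this is not the subspace method of the study, and the loop below is a hypothetical example chosen for simplicity), consider a single-path linear constraint loop and a candidate linear ranking function:

```latex
% Hypothetical loop: while (x >= 1 and y >= 1) { x := x - y; y := y }
% Candidate linear ranking function: f(x, y) = x
\begin{align*}
  \text{Bounded on the guard:}\quad
    & x \ge 1 \;\Longrightarrow\; f(x, y) = x \ge 1, \\
  \text{Strictly decreasing:}\quad
    & f(x, y) - f(x', y') = x - (x - y) = y \ge 1 .
\end{align*}
% Since f is bounded below by 1 and drops by at least 1 per iteration,
% the loop executes at most x iterations from any state satisfying the guard, so it terminates.
```

Any function that is bounded below on the loop guard and decreases by a fixed positive amount in every iteration proves termination in the same way.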
Abstract: Multi-behavior recommendation aims to utilize interactive data from multiple behaviors of users to improve recommendation performance. Existing multi-behavior recommendation methods generally directly exploit the multi-behavior data for the shared initialized user representations and involve the mining of user preferences and modeling of relationships among different behaviors in the tasks. However, these methods ignore the data imbalance under different interactive behaviors (the amount of interactive data varies greatly among different behaviors) and the information loss caused by the adaptation to the above two tasks. User preferences refer to the interests that users exhibit in different behaviors (e.g., browsing preferences), and the relationship among behaviors indicates a potential conversion from one behavior to another behavior (e.g., the conversion from browsing to purchasing). In multi-behavior recommendation, the mining of user preferences and the modeling of relationships among different behaviors can be regarded as a two-stage task. On the basis of the above considerations, the model of two-stage learning for multi-behavior recommendation (TSL-MBR for short) is proposed, which decouples the above two tasks with a two-stage strategy. In particular, the model retains the end-to-end structure and learns the two tasks by alternating training with fixed parameters. The first stage is to model user preferences under different behaviors. In this stage, the interactive data from all behaviors (without distinction as to behavior type) are first used to model the global preferences of users to alleviate the problem of data sparsity to the greatest extent. Then, the interactive data of each behavior are used to refine the behavior-specific user preference (local preference) and thus lessen the influence of the data imbalance among different behaviors. The second stage is to model the relationships among different behaviors. 
In this stage, the mining of user preferences and modeling of relationships among different behaviors are decoupled to relieve the information loss problem caused by adaptation to the two tasks. This two-stage model significantly improves the system’s ability to predict target behaviors. Extensive experimental results show that TSL-MBR can substantially outperform the state-of-the-art baseline models, achieving 103.01% and 33.87% of relative gains on average over the best baseline on the Tmall and Beibei datasets, respectively.
Abstract: Microservice architectures have been widely deployed and applied, which can greatly improve the efficiency of software system development, reduce the cost of system update and maintenance, and enhance the extensibility of software systems. However, microservices are characterized by frequent changes and heterogeneous fusion, which result in frequent faults, fast fault propagation, and great influence. Meanwhile, complex call dependency or logical dependency between microservices makes it difficult to locate and diagnose faults in a timely and accurate manner, which poses a challenge to the intelligent operation and maintenance of microservice architecture systems. The service dependency discovery technology identifies and deduces the call dependency or logical dependency between services from data during system running and constructs a service dependency graph, which helps to discover and locate faults and diagnose their causes in a timely and accurate manner during system running and supports intelligent operation and maintenance requirements such as resource scheduling and change management. This study first analyzes the problem of service dependency discovery in microservice systems and then summarizes the technical status of service dependency discovery from the perspective of three types of runtime data, namely monitoring data, system log data, and trace data. Then, based on the fault cause location, resource scheduling, and change management of the service dependency graph, the study discusses the application of service dependency discovery technology to intelligent operation and maintenance. Finally, the study discusses how service dependency discovery technology can accurately discover call dependency or logical dependency and use the service dependency graph to conduct change management, and it predicts future research directions.
Abstract: Reducing safe (generic) and repetitive replies is a challenging problem in open-domain multi-turn dialogue models. However, the existing open-domain dialogue models often ignore the guiding role of dialogue objectives and the question of how to introduce and select more accurate knowledge information from dialogue history and dialogue objectives. To address these issues, this study proposes a multi-turn dialogue model based on knowledge enhancement. Firstly, the model replaces the notional words in the dialogue history with semaphores and domain words, so as to eliminate ambiguity and enrich the dialogue text representation. Then, the knowledge-enhanced dialogue history and expanded triplet world knowledge are effectively integrated into the knowledge management and knowledge copy modules, so as to integrate information of knowledge, vocabularies, dialogue history, and dialogue objectives and generate diverse responses. The experimental results and visualization on two international benchmark open-domain Chinese dialogue corpora verify the effectiveness of the proposed model in both automatic evaluation and human judgment.
Abstract: Deep learning has achieved great success in image classification, natural language processing, and speech recognition. Data augmentation can effectively increase the scale and diversity of training data, thereby improving the generalization of deep learning models. However, for a given dataset, a well-designed data augmentation strategy relies heavily on expert experience and domain knowledge and requires repeated attempts, which is time-consuming and labor-intensive. In recent years, automated data augmentation has attracted widespread attention from the academic community and the industry through the automated design of data augmentation strategies. To solve the problem that existing automated data augmentation algorithms cannot strike a good balance between prediction accuracy and search efficiency, this study proposes an efficient automated data augmentation algorithm SGES AA based on a self-guided evolution strategy. First, an effective continuous vector representation method is designed for the data augmentation strategy, and then the automated data augmentation problem is converted into a search problem of continuous strategy vectors. Second, a strategy vector search method based on the self-guided evolution strategy is presented. By introducing historical estimation gradient information to guide the sampling and updating of exploration points, it can effectively avoid the local optimal solution while improving the convergence of the search process. The results of extensive experiments on image, text, and speech datasets show that the proposed algorithm is superior to or matches the current optimal automated data augmentation methods without significantly increasing the time consumption of searches.
Abstract: Migrating from monolithic systems to microservice systems is one of the mainstream options for the industry to realize the reengineering of legacy systems, and microservice architecture refactoring based on monolithic legacy systems is the key to realizing migration. Currently, academia mainly focuses on the research on microservice identification methods, and there are many industry practices of legacy systems refactored into microservices. However, systematic approaches and efficient and robust tools are insufficient. Therefore, based on earlier research on microservices identification and model-driven development method, this study presents MSA-Lab, an integrated design platform for microservice refactoring of monolithic legacy systems based on the model-driven development approach. MSA-Lab analyzes the method call sequence in the running log of the monolithic legacy system, identifies and clusters classes and data tables for constructing abstract microservices, and generates a system architecture design model including the microservice diagram and microservice sequence diagram. The model has two core components: MSA-Generator for automatic microservice identification and design model generation and MSA-Modeller for visualization, interactive modeling, and model syntax constraint checking of microservice static structure and dynamic behavior models. This study conducts experiments in the MSA-Lab platform for effectiveness, robustness, and function transformation completeness on four open-source projects and carries out performance comparison experiments with three same-type tools. The results show that the platform has excellent effectiveness and robustness, function transform completeness for running logs, and superior performance.
Abstract: Data replication is an important way to improve the availability of distributed databases. By placing multiple database replicas in different regions, the response speed of local reading and writing operations can be increased. Furthermore, increasing the number of replicas can improve the linear scalability of the read throughput. In view of these advantages, a number of multi-replica distributed database systems have emerged in recent years, including some mainstream systems from the industry such as Google Spanner, CockroachDB, TiDB, and OceanBase, as well as some excellent systems from academia such as Calvin, Aria, and Berkeley Anna. However, these multi-replica databases bring a series of challenges such as consistency maintenance, cross-node transactions, and transaction isolation while providing many benefits. This study summarizes the existing replication architecture, consistency maintenance strategy, cross-node transaction concurrency control, and other technologies. It also analyzes the differences and similarities between several representative multi-replica database systems in terms of distributed transaction processing. Finally, the study builds a cross-region distributed cluster environment on Alibaba Cloud and conducts multiple experiments to study the distributed transaction processing performance of these several representative systems.
Abstract: Recently, with the popularity of ubiquitous computing, intelligent sensing technology has become the focus of researchers, and non-contact sensing based on WiFi is more and more popular in academia and industry because of its excellent generality, low deployment cost, and great user experience. The typical non-contact sensing work based on WiFi includes gesture recognition, breath detection, intrusion detection, behavior recognition, etc. For real-life deployment of these works, one of the major challenges is to avoid the interference of irrelevant behaviors in other irrelevant areas, so it is necessary to judge whether the target is in a specific sensing area or not, which means that the system should be able to determine exactly which side of the boundary line the target is on. However, the existing work cannot find a way to accurately monitor a freely set boundary, which hinders the actual implementation of WiFi-based sensing applications. In order to solve this problem, based on the physical essence of electromagnetic wave diffraction and the Fresnel diffraction model, this study finds a signal feature, namely Rayleigh distribution in Fresnel diffraction model (RFD), when the target passes through the link (the line between the WiFi receiver and transmitter antennas) and reveals the mathematical relationship between the signal feature and human activity. Then, the study realizes a boundary monitoring algorithm through line crossing detection by using the link as the boundary and considering the waveform delay caused by antenna spacing and the features of automatic gain control (AGC) when the link is blocked. On this basis, the study also implements two practical applications, that is, intrusion detection system and home state detection system. The intrusion detection system achieves a precision of more than 89% and a recall rate of more than 91%, while the home state detection system achieves an accuracy of more than 89%.
While verifying the availability and robustness of the boundary monitoring algorithm, the study also shows the great potential of combining the proposed method with other WiFi-based sensing technologies and provides a direction for the actual deployment of WiFi-based sensing technologies.
Abstract: As challenges such as serious occlusions and deformations coexist, video segmentation with accurate robustness has become one of the hot topics in computer vision. This study proposes a video segmentation method with absorbing Markov chains and skeleton mapping, which progressively produces accurate object contours through the process of pre-segmentation—optimization—improvement. In the phase of pre-segmentation, based on the twin network and the region proposal network, the study obtains regions of interest for objects, constructs the absorbing Markov chains of superpixels in these regions, and calculates the labels of foreground/background of the superpixels. The absorbing Markov chains can perceive and propagate the object features flexibly and effectively and preliminarily pre-segment the target object from the complex scene. In the phase of optimization, the study designs the short-term and long-term spatial-temporal cue models to obtain the short-term variation and the long-term feature of the object, so as to optimize superpixel labels and reduce errors caused by similar objects and noise. In the phase of improvement, to reduce the artifacts and discontinuities of optimization results, this study proposes an automatic generation algorithm for foreground/background skeleton based on superpixel labels and positions and constructs a skeleton mapping network based on encoding and decoding, so as to learn the pixel-level object contour and finally obtain accurate video segmentation results. Many experiments on standard datasets show that the proposed method is superior to the existing mainstream video segmentation methods and can produce segmentation results with higher region similarity and contour accuracy.
Abstract: As trusted decentralized applications, smart contracts have attracted widespread attention, but their security vulnerabilities threaten their reliability. To this end, researchers have employed various advanced technologies (such as fuzz testing, machine learning, and formal verification) to develop vulnerability detection technologies, with good results. This study collects 84 related papers published by July 2021 to systematically sort out and analyze existing vulnerability detection technologies for smart contracts. First of all, vulnerability detection technologies are categorized according to their core methodologies. These technologies are analyzed from the aspects of implementation methods, vulnerability categories, and experimental data. Additionally, the differences between domestic and international research in these aspects are compared. Finally, after summarizing the existing technologies, the study discusses the challenges of vulnerability detection technologies and potential research directions.
Abstract: Efficient mobile charging scheduling is a key technology to build wireless rechargeable sensor networks (WRSN) which have long life cycle and sustainable operation ability. The existing charging methods based on reinforcement learning only consider the spatial dimension of mobile charging scheduling, i.e., the path planning of mobile chargers (MCs), while leaving out the temporal dimension of the problem, i.e., the adjustment of the charging duration, and thus these methods have suffered some performance limitations. This study proposes a dynamic spatiotemporal charging scheduling scheme based on deep reinforcement learning (SCSD) and establishes a deep reinforcement learning model for dynamic adjustment of charging sequence scheduling and charging duration. In view of the discrete charging sequence planning and continuous charging duration adjustment in mobile charging scheduling, the study uses DQN to optimize the charging sequence for nodes to be charged and calculates and dynamically adjusts the charging duration of the nodes. By optimizing the two dimensions of space and time respectively, the SCSD proposed in this study can effectively improve the charging performance while avoiding the power failure of nodes. Simulation experiments show that SCSD has significant performance advantages over several well-known typical charging schemes.
Abstract: With the development of deep learning and steganography, deep neural networks are widely used in image steganography, especially in a new research direction, namely embedding an image message in an image. The mainstream steganography of embedding an image message in an image based on deep neural networks requires cover images and secret images to be input into a steganographic model to generate stego-images. But recent studies have demonstrated that the steganographic model only needs secret images as input, and then the output secret perturbation is added to cover images, so as to embed secret images. This novel embedding method that does not rely on cover images greatly expands the application scenarios of steganography and realizes the universality of steganography. However, this method currently only verifies the feasibility of embedding and recovering secret images, and the more important evaluation criterion for steganography, namely concealment, has not been considered and verified. This study proposes a high-capacity universal steganography generative adversarial network (USGAN) model based on an attention mechanism. By using the attention module, the USGAN encoder can adjust the perturbation intensity distribution of the pixel position on the channel dimension in the secret image, thereby reducing the influence of the secret perturbation on the cover images. In addition, in this study, the CNN-based steganalyzer is used as the target model of USGAN, and the encoder learns to generate a secret adversarial perturbation through adversarial training with the target model so that the stego-image can become an adversarial example for attacking the steganalyzer at the same time. The experimental results show that the proposed model can not only realize a universal embedding method that does not rely on cover images but also further improves the concealment of steganography.
Abstract: How brains realize learning and perception is an essential question for both artificial intelligence and neuroscience communities. Since the existing artificial neural networks (ANNs) are different from the real brain in terms of structures and computing mechanisms, they cannot be directly used to explore the mechanisms of learning and dealing with perceptual tasks in the real brain. The dendritic neuron model is a computational model to model and simulate the information processing process of neuron dendrites in the brain and is closer to biological reality than ANNs. The use of the dendritic neural network model to deal with and learn perceptual tasks plays an important role in understanding the learning process in the real brain. However, current learning models based on dendritic neural networks mainly focus on simplified dendritic models and are unable to model the entire signal-processing mechanisms of dendrites. To solve this problem, this study proposes a learning model of the biophysically detailed neural network of medium spiny neurons (MSNs). The neural network can fulfill corresponding perceptual tasks through learning. Experimental results show that the proposed model can achieve high performance on the classical image classification task. In addition, the neural network shows strong robustness under noise interference. By further analyzing the network features, this study finds that the neurons in the network after learning show stimulus selectivity, which is a classical phenomenon in neuroscience. This indicates that the proposed model is biologically plausible and implies that stimulus selectivity is an essential property of the brain in fulfilling perceptual tasks through learning.
Abstract: The Olympic heritage is the treasure of the world. The integration of technology, culture, and art is crucial to the diversified presentation and efficient dissemination of the heritage of the Beijing Winter Olympics. As an important trend form of digital museums in the information era, online exhibition halls lay a good foundation in the research on individual digital museums and interactive technologies, but so far, no systematic, intelligent, interactive, and friendly system of the Winter Olympics digital museum has been built. This study proposes an online exhibition hall construction method with interactive feedback for the Beijing 2022 Winter Olympics. By constructing an interactive exhibition hall with intelligent virtual agent, it has further explored the role of interactive feedback in disseminating intangible cultural heritage in a knowledge dissemination-based digital museum. To explore the influence of audio-visual interactive feedback on spreading Olympic spiritual culture in the exhibition hall and improve the user experience, the study conducts a user experiment with 32 participants. The results show that the constructed exhibition hall can greatly promote the dissemination of Olympic culture and spirit, and the introduction of audio-visual interactive feedback in the exhibition hall can improve users’ perceptual control, thereby improving the user experience.
Abstract: Basic linear algebra subprogram (BLAS) is one of the most basic and important math libraries. The matrix-matrix operations covered in the level-3 BLAS functions are particularly significant for a standard BLAS library and are widely employed in many large-scale scientific and engineering computing applications. Additionally, level-3 BLAS functions are computing intensive functions and play a vital role in fully exploiting the computing performance of processors. Multi-core parallel optimization technologies are studied for level-3 BLAS functions on SW26010-Pro, a domestic processor. According to the memory hierarchy of SW26010-Pro, this study designs a multi-level blocking algorithm to exploit the parallelism of matrix operations. Then, a data-sharing scheme based on remote memory access (RMA) mechanism is proposed to improve the data transmission efficiency among CPEs. Additionally, it employs triple buffering and parameter tuning to fully optimize the algorithm and hide the memory access costs of direct memory access (DMA) and the communication overhead of RMA. Besides, the study adopts two hardware pipelines and several vectorized arithmetic/memory access instructions of SW26010-Pro and improves the floating-point computing efficiency of level-3 BLAS functions by writing assembly code manually for matrix-matrix multiplication, matrix equation solving, and matrix transposition. The experimental results show that level-3 BLAS functions can significantly improve the performance on SW26010-Pro by leveraging the proposed parallel optimization. The floating-point computing efficiency of single-core level-3 BLAS is up to 92% of the peak performance, while that of multi-core level-3 BLAS is up to 88% of the peak performance.
Abstract: In large-scale and complex software systems, requirement analysis and generation are accomplished through a top-down process, and the construction of tracking relationships between cross-level requirements is very important for project management, development, and evolution. The loosely coupled contribution approach of open-source systems requires each participant to easily understand the context and state of the requirements, which relies on cross-level requirement tracking. The issue description log is a common way of presenting requirements in open-source systems. It has no fixed template, and its content is diverse (including text, code, and debugging information). Furthermore, the terms can be freely used, and the gap in abstraction level between cross-level requirements is large, which brings great challenges to automatic tracking. In this paper, a correlation feedback method for key feature dimensions is proposed. Through static analysis of the project’s code structure, code-related terms and their correlation strength are extracted, and a code vocabulary base is constructed to alleviate the gap in abstraction level and the inconsistency of terminology between cross-level requirements. By measuring the importance of terms to requirement description and screening key feature dimensions on this basis, the inquiry statement is optimized to effectively reduce the noise of requirement description length, content form, and other aspects. Experiments with two scenarios on three open-source systems suggest that the proposed method outperforms baseline approaches in cross-level requirement tracking and improves F2 value to 29.01%, 7.75.1%, and 59.21% compared with vector space model (VSM), standard Rocchio, and trace bidirectional encoder representations from transformers (BERT), respectively.
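For readers unfamiliar with the F2 value reported above, the following minimal sketch computes the standard F-beta measure with beta = 2, which weights recall more heavily than precision. The function name `f_beta` and the sample precision and recall values are illustrative assumptions and are not taken from the study.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score; beta = 2 gives the F2 value.

    Returns 0.0 when both precision and recall are zero, to avoid
    dividing by zero.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


if __name__ == "__main__":
    # Illustrative values only: the same pair of numbers scores higher
    # when the larger one is the recall.
    print(f_beta(0.5, 1.0))  # recall-heavy case
    print(f_beta(1.0, 0.5))  # precision-heavy case
```

Because missing a true trace link is usually costlier than reviewing a false one, a recall-weighted measure such as F2 is a common choice in traceability evaluation.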
Abstract: Remaining process time prediction is important for preventing and intervening in abnormal business operations. For predicting the remaining time, existing approaches have achieved high accuracy through deep learning techniques. However, most of these techniques involve complex model structures, and the prediction results are difficult to be explained, namely, unexplainable issues. In addition, the prediction of the remaining time usually uses the key attribute, namely activity, or selects several other attributes as the input features of the predicted model according to the domain knowledge. However, a general feature selection method is missing, which may affect both prediction accuracy and model explainability. To tackle these two challenges, this study introduces a remaining process time prediction framework based on an explainable feature-based hierarchical (EFH) model. Specifically, a feature self-selection strategy is first proposed, and the attributes that have a positive impact on the prediction task are obtained as the input features of the model through the backward feature deletion based on priority and the forward feature selection based on feature importance. Then an EFH model is proposed. The prediction results of each layer are obtained by adding different features layer by layer, so as to explain the relationship between input features and prediction results. The study also uses the light gradient boosting machine (LightGBM) and long short-term memory (LSTM) algorithms to implement the proposed approach, and the framework is general and not limited to the algorithms selected in this study. Finally, the proposed approach is compared with other methods on eight real-life event logs. The experimental results show that the proposed approach can select effective features and improve prediction accuracy. In addition, the prediction results are explained.
Abstract: The uncertainty of tasks in mobile edge computing scenarios makes task offloading and resource allocation more complex and difficult. Therefore, a continuous offloading and resource allocation method of uncertain tasks in mobile edge computing is proposed. Firstly, a continuous offloading model of uncertain tasks in mobile edge computing is built, and the multi-batch processing technology based on duration slice partition is employed to address task uncertainty. A multi-device computing resource coordination mechanism is designed to improve the carrying capacity of computation-intensive tasks. Secondly, an adaptive strategy selection algorithm based on load balancing is put forward to avoid channel congestion and additional energy consumption caused by the over-allocation of computing resources. Finally, the uncertain task scenario model is simulated based on Poisson distribution, and experimental results show that the reduction of time slice length can reduce the total energy consumption of the system. In addition, the proposed algorithm can achieve task offloading and resource allocation more effectively and can reduce energy consumption by up to 11.8% compared with comparison algorithms.
Abstract: Emotional dialogue technology focuses on the “emotional quotient” of conversational robots, aiming to give the robots the ability to observe, understand and express emotions as humans do. This technology can be seen as the intersection of emotional computing and dialogue technology, and can simultaneously consider the “intelligent quotient” and “emotional quotient” of conversational robots to realize spiritual companionship, emotional comfort, and psychological guidance for users. Combined with the characteristics of emotions in dialogues, this study provides a comprehensive analysis of emotional dialogue technology: 1) Three important technical points including emotion recognition, emotion management, and emotion expression in dialogue scenarios are shown, and the technology of emotional dialogues in multimodal scenarios is expanded. 2) This study presents the latest research progress on technology points related to emotional dialogues and summarizes the main challenges and possible solutions correspondingly. 3) Data resources for emotional dialogue technologies are introduced. 4) The difficulty and prospect of emotional dialogue technology are pointed out.
Abstract: In a hybrid cloud environment, enterprise business applications and data are often transferred across different cloud services. For complex and diversified cloud service environments, most hybrid cloud applications adopt access control policies made around only access subjects and adjust the policies manually, which cannot meet the fine-grained dynamic access control requirements at different stages of the data life cycle. This study proposes AHCAC, an adaptive access control method oriented to data life cycle in a hybrid cloud environment. Firstly, a policy description approach based on key attributes is employed to unify the heterogeneous policies of the full life cycle of data under the hybrid cloud. In particular, the “stage” attribute is introduced to explicitly identify the life-cycle state of data, which is the basis for achieving fine-grained access control oriented to data life cycle. Secondly, in view of the similarity and consistency of access control policies within the same life-cycle stage, the policy distance is defined, and a hierarchical clustering algorithm based on the policy distance is proposed to construct the corresponding data access control policy in each life-cycle stage. Finally, when the life-cycle stage of data is changed, the adaptation and loading of policies of corresponding data stages in the policy evaluation are triggered through key attribute matching, which realizes the adaptive access control oriented to the data life cycle. This study also conducts experiments to verify the effectiveness and feasibility of the proposed method on OpenStack and the open-source policy evaluation engine Balana.
Abstract: With the increasingly powerful performance of neural network models, they are widely used to solve various computer-related tasks and show excellent capabilities. However, a clear understanding of the operation mechanism of neural network models is lacking. Therefore, this study reviews and summarizes the current research on the interpretability of neural networks. A detailed discussion is rendered on the definition, necessity, classification, and evaluation of research on model interpretability. With the emphasis on the focus of interpretable algorithms, a new classification method for the interpretable algorithms of neural networks is proposed, which provides a novel perspective for the understanding of neural networks. According to the proposed method, this study sorts out the current interpretable methods for convolutional neural networks and comparatively analyzes the characteristics of interpretable algorithms falling within different categories. Moreover, it introduces the evaluation principles and methods of common interpretable algorithms and expounds on the research directions and applications of interpretable neural networks. Finally, the problems confronted in this regard are discussed, and possible solutions to these problems are given.
Abstract: With the development of Internet of Things (IoT) technology, IoT devices are widely applied in many areas of production and life. However, IoT devices also bring severe challenges to equipment asset management and security management. First, due to the diversity of IoT device types and access modes, it is often difficult for network administrators to know the types and operating status of the IoT devices in the network. Second, IoT devices are becoming the focus of cyber attacks because their limited computing and storage resources make it difficult to deploy traditional defense measures. Therefore, it is important to recognize the IoT devices in the network through device identification and to detect anomalies based on the device identification results, so as to ensure the normal operation of IoT devices. In recent years, academia has carried out a great deal of research on these issues. This study systematically reviews the work related to IoT device identification and anomaly detection. In terms of device identification, existing research can be divided into passive identification methods and active identification methods according to whether data packets are sent to the network. The passive identification methods are further investigated according to the identification method, identification granularity, and application scenarios. The study also investigates the active identification methods according to the identification method, identification granularity, and detection granularity. In terms of anomaly detection, the existing work can be divided into detection methods based on machine learning algorithms and rule-matching methods based on behavioral norms. On this basis, challenges in IoT device identification and anomaly detection are summarized, and future development directions are proposed.
Abstract: Smoothed particle hydrodynamics (SPH) is a key technology for fluid simulation. With the growing demand for applications of SPH fluid simulation technology in production practice, many relevant studies have emerged in recent years that improve the visual realism, efficiency, and stability with which physical properties such as fluid incompressibility, viscosity, and surface tension are simulated. Additionally, some researchers focus on high-quality simulation in complex scenarios and on a unified simulation framework for multiple scenarios and materials, thereby enhancing the application efficiency of SPH fluid simulation technology. This study discusses and summarizes related research on SPH fluid simulation technology from the above aspects and outlines prospects for the technology.
Abstract: Stochastic configuration network (SCN), as an emerging incremental neural network model, is different from other randomized neural network methods. It can configure the parameters of hidden layer nodes through supervision mechanisms, thereby ensuring the fast convergence performance of SCN. Due to the advantages of high learning efficiency, low human intervention, and strong generalization ability, SCN has attracted a large number of national and international scholars and developed rapidly since it was proposed in 2017. In this study, SCN research is summarized from the aspects of basic theories, typical algorithm variants, application fields, and future research directions of SCN. Firstly, the algorithm principles, universal approximation capacity, and advantages of SCN are analyzed theoretically. Secondly, typical variants of SCN are studied, such as DeepSCN, 2DSCN, Robust SCN, Ensemble SCN, Distributed SCN, Parallel SCN, and Regularized SCN. Then, the applications of SCN in different fields, including hardware implementation, computer vision, medical data analysis, fault detection and diagnosis, and system modeling and prediction are introduced. Finally, the development potential of SCN in convolutional neural network architectures, semi-supervised learning, unsupervised learning, multi-view learning, fuzzy neural network, and recurrent neural network is pointed out.
Abstract: As a complement and extension of the terrestrial network, the satellite network contributes to the acceleration of bridging the digital divide between different regions and can expand the coverage and service range of the terrestrial network. However, the satellite network features highly dynamic topology, long transmission delay, and limited on-board computing and storage capacity. Hence, various technical challenges, including routing scalability and transmission stability, are encountered in the organic integration of the satellite network and the terrestrial network and the construction of a global space-ground integrated network (SGIN). Considering the research challenges of SGIN, this paper describes the international and domestic research progress of SGIN in terms of network architecture, routing, transmission, multicast-based content delivery, etc., and then discusses the research trends.
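Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SigSoft Symposium on The Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 315-325.
The original article is available at: https://doi.org/10.1145/3106237.3106242
Readers who wish to cite this article should cite the original source.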
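Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and the ACM SigSoft Symposium on The Foundations of Software Engineering (ESEC/FSE), ACM, September 2017, pp. 303-314.
The original article is available at: https://doi.org/10.1145/3106237.3106239
Readers who wish to cite this article should cite the original source.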
Abstract: GitHub, a popular social-software-development platform, has fostered a variety of software ecosystems where projects depend on one another and practitioners interact with each other. Projects within an ecosystem often have complex inter-dependencies that impose new challenges in bug reporting and fixing. In this paper, we conduct an empirical study on cross-project correlated bugs, i.e., causally related bugs reported to different projects, focusing on two aspects: 1) how developers track the root causes across projects; and 2) how the downstream developers coordinate to deal with upstream bugs. Through manual inspection of bug reports collected from the scientific Python ecosystem and an online survey with developers, this study reveals the common practices of developers and the various factors in fixing cross-project bugs. These findings provide implications for future software bug analysis in the scope of ecosystem, as well as shed light on the requirements of issue trackers for such bugs.
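Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 39th International Conference on Software Engineering, Pages 27-37, Buenos Aires, Argentina, May 20-28, 2017, IEEE Press, Piscataway, NJ, USA, ©2017, ISBN: 978-1-5386-3868-2
The original article is available at: http://dl.acm.org/citation.cfm?id=3097373
Readers who wish to cite this article should cite the original source.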
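Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 871-882. DOI: https://doi.org/10.1145/2950290.2950364
The original article is available at: http://dl.acm.org/citation.cfm?id=2950364
Readers who wish to cite this article should cite the original source.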
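Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Pages 133-143, Seattle WA, USA, November 2016.
The original article is available at: http://dl.acm.org/citation.cfm?id=2950327
Readers who wish to cite this article should cite the original source.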
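Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE'16), 810-821, November 13-18, 2016.
The original article is available at: https://doi.org/10.1145/2950290.2950310
Readers who wish to cite this article should cite the original source.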
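Abstract: This article is recommended by Prof. Bai Ying of the CCF Technical Committee on Software Engineering.
It was published at FSE'16, in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering.
The original article is available at: http://dl.acm.org/citation.cfm?id=2950340
Readers who wish to cite this article should cite the original source.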
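Abstract: Recommended by Prof. Bai Xiaoying (Tsinghua University) of the CCF Technical Committee on Software Engineering.
The original article was published in ASE 2016, Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Full text: http://dx.doi.org/10.1145/2970276.2970307
Important note: readers who cite this article should cite the original source.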
Abstract: Sensor networks, formed by the convergence of sensor, micro-electro-mechanical system, and networking technologies, are a novel technology for acquiring and processing information. In this paper, the architecture of wireless sensor networks is briefly introduced. Next, some valuable applications are explained and forecast. In light of existing work, hot topics including power-aware routing and media access control schemes are discussed in detail. Finally, taking application requirements into account, several future research directions are put forward.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its use in the automatic generation of SONGCI. In light of the characteristics of ancient Chinese poetry, this paper designs a coding method based on level and oblique tones, a fitness function weighted by syntax and semantics, a selection operator combining elitism and roulette-wheel selection, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system built on the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic Chinese poetry generation.
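As an illustration of the selection operator mentioned in the abstract above, the following minimal sketch combines elitism with roulette-wheel selection. It is a generic textbook form, not the paper's implementation: the function name, the `n_elite` parameter, and the assumption that fitness values are non-negative are all introduced here for illustration.

```python
import random


def select_next_generation(population, fitness, n_elite=2, rng=None):
    """Elitism plus roulette-wheel selection (assumes non-negative fitness)."""
    rng = rng or random.Random()
    scores = [fitness(ind) for ind in population]

    # Elitism: carry the n_elite fittest individuals over unchanged.
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    elites = [population[i] for _, i in ranked[:n_elite]]

    # Roulette wheel: fill the remaining slots, drawing each individual
    # with probability proportional to its fitness.
    n_rest = len(population) - len(elites)
    if sum(scores) > 0:
        rest = rng.choices(population, weights=scores, k=n_rest)
    else:
        # All fitness values are zero: fall back to uniform sampling.
        rest = rng.choices(population, k=n_rest)
    return elites + rest
```

Elitism guarantees that the best candidates found so far are never lost, while the roulette wheel keeps some weaker candidates in play so that crossover and mutation still have diverse material to work with.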
Abstract: Cloud Computing is a fundamental change happening in the field of Information Technology. It represents a movement towards intensive, large-scale specialization. While it brings convenience and efficiency, it also poses great challenges in the field of data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of Cloud Computing. This paper describes the major requirements in Cloud Computing, key security technologies, standards, regulations, etc., and provides a Cloud Computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: Android is a modern software platform and the most popular one for smartphones. According to reports, Android accounted for a huge 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time ever. Apple, Microsoft, Blackberry and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: This paper surveys the current technologies adopted in cloud computing as well as the systems used in enterprises. Cloud computing can be viewed from two different aspects. One is the cloud infrastructure, which is the building block for upper-layer cloud applications. The other is the cloud application itself. This paper focuses on the cloud infrastructure, including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large-scale clusters that contain a large number of cheap PC servers. Second, the applications are co-designed with the underlying infrastructure so that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software built on top of redundant hardware rather than by hardware alone. All these technologies serve the two important goals of distributed systems: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to a very large scale, even to thousands of nodes. Availability means that the services remain available even when quite a number of nodes fail. From this paper, readers will grasp the current status of cloud computing as well as its future trends.
Abstract: This paper summarizes the current state of research and recent progress in clustering algorithms. First, some representative clustering algorithms are analyzed and summarized from several aspects, such as the underlying ideas, key technologies, advantages, and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are carried out in terms of both accuracy and running efficiency; the clustering behavior of one algorithm on different data sets is analyzed and compared with the clustering of the same data set under different algorithms. Finally, research hotspots, difficulties, and shortcomings of data clustering, as well as some open problems, are addressed by integrating the information from the two aspects above. This work can serve as a valuable reference for data clustering and data mining.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in the evolutionary computation community. After a brief summary of the EMO algorithms before 2003, the recent advances in EMO are discussed in detail, and the current research directions are summarized. On the one hand, more new evolutionary paradigms have been introduced into the EMO community, such as particle swarm optimization, artificial immune systems, and estimation of distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto dominance have emerged. Furthermore, the essential characteristics of multi-objective optimization problems are investigated in depth. This paper also gives an experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
Abstract: This paper offers some reflections from the following four aspects: 1) from the laws governing how things develop, it reveals the development history of software engineering technology; 2) from the natural characteristics of software, it analyzes the construction of each abstraction layer of the virtual machine; 3) from the perspective of software development, it proposes the research content of the software engineering discipline and studies the pattern of industrialized software production; 4) based on the emergence of Internet technology, it explores the development trend of software technology.
Abstract: With the rapid development of e-business, web applications have evolved from localization to globalization, from B2C (business-to-customer) to B2B (business-to-business), and from a centralized fashion to a decentralized fashion. Web service is a new application model for decentralized computing, and it is also an effective mechanism for data and service integration on the web. Thus, web service has become a solution to e-business. It is important and necessary to carry out research on the new architecture of web services, on combinations with other good techniques, and on the integration of services. In this paper, a survey is presented on various aspects of web services research, from the basic concepts to the principal research problems and the underlying techniques, including data integration in web services, web service composition, semantic web service, web service discovery, web service security, the solution to web services in the P2P (Peer-to-Peer) computing environment, and the grid service, etc. This paper also presents a summary of the current state of the art of these techniques, a discussion of future research topics, and the challenges of web services.
Abstract: This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail: sentiment extraction, sentiment classification, and sentiment retrieval and summarization. Then, the evaluation and corpora for sentiment analysis are introduced. Finally, the applications of sentiment analysis are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field through detailed comparison and analysis.
Abstract: Wireless Sensor Networks, a novel technology about acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the networks, is a challenging one, and yet extremely crucial for many applications. In this paper, the evaluation criterion of the performance and the taxonomy for wireless sensor networks self-localization systems and algorithms are described, the principles and characteristics of recent representative localization approaches are discussed and presented, and the directions of research in this area are introduced.
Abstract: Network community structure is one of the most fundamental and important topological properties of complex networks: within communities the links between nodes are very dense, but between communities they are quite sparse. Network clustering algorithms, which aim to discover all natural network communities in a given complex network, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, the World Wide Web and so on. This paper reviews the background, the motivation, the state of the art as well as the main issues of existing work on discovering network communities, and tries to draw a comprehensive and clear outline of this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web and bioinformatics.
Abstract: Considered the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, sometimes up to millions, and stores petabytes or even exabytes of data, so failures of computers or data can easily occur. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure and power costs. Therefore, the fault tolerance, scalability, and power consumption of the distributed storage of a data center have become key parts of cloud computing technology for ensuring data availability and reliability. In this paper, a survey is made of the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. First, many kinds of classical data center network topologies are introduced and compared. Second, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies in particular are compared. Third, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
Abstract: In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed is growing rapidly. Parallel techniques that can be scaled cost-effectively are needed to deal with big data. Relational data management techniques have a history of nearly 40 years. They now face the tough obstacle of scalability: relational techniques cannot easily handle large data. Meanwhile, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and expanded their application from Web search to territories that used to be occupied by relational database systems. They confront relational techniques with high availability, high scalability, and massively parallel processing capability. The relational community, after losing the big deal of Web search, has begun to learn from MapReduce. MapReduce also borrows valuable ideas from the relational community to improve performance. Relational techniques and MapReduce compete with and learn from each other; a new data analysis platform and a new data analysis ecosystem are emerging. Eventually, the two camps of techniques will find their right places in the new ecosystem of big data analysis.
Abstract: Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide a direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
Abstract: The emergence of numerous intelligent devices equipped for short-range wireless communication has boosted the rapid rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of the mobile ad hoc network (MANET) requires at least one path from the source node to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communication between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has attracted great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some typical current applications. It then elaborates on the popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are stated briefly. Finally, the paper concludes and looks ahead to possible research focuses for opportunistic networks.
Abstract: With the explosive growth of network applications and complexity, the threat of Internet worms to network security is becoming increasingly serious. Especially in the Internet environment, the variety of propagation methods and the complexity of the application environment lead to worms with a much higher outbreak frequency, much deeper latency, and wider coverage, and Internet worms have become a primary issue faced by malicious code researchers. In this paper, the concept and research status of Internet worms, their functional components, and their execution mechanism are first presented; then the scanning strategies and propagation models are discussed; and finally the critical techniques of Internet worm prevention are given. Some major problems and research trends in this area are also addressed.
Abstract: This paper studies uncertain graph data mining and especially investigates the problem of mining frequent subgraph patterns from uncertain graph data. A data model is introduced for representing uncertainties in graphs, and an expected support is employed to evaluate the significance of subgraph patterns. By using the apriori property of expected support, a depth-first search-based mining algorithm is proposed with an efficient method for computing expected supports and a technique for pruning the search space, which reduces the number of subgraph isomorphism tests needed for computing expected support from the exponential scale to the linear scale. Experimental results show that the proposed algorithm is 3 to 5 orders of magnitude faster than a naïve depth-first search algorithm, and is efficient and scalable.
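To make the notion of expected support in the abstract above concrete, here is a brute-force sketch. The data model is an assumption made for illustration, not the paper's: each uncertain graph is a dictionary mapping an edge to an independent existence probability, a pattern's candidate embeddings are supplied as sets of edges, and expected support is taken as the mean probability that the pattern occurs in each graph (which follows from linearity of expectation if support is the fraction of graphs containing the pattern). The enumeration of possible worlds is exponential in the number of edges, which is the kind of naive cost a dedicated algorithm has to avoid.

```python
from itertools import product


def occurrence_probability(edge_probs, embeddings):
    """P(at least one embedding is fully present), by enumerating possible worlds."""
    edges = list(edge_probs)
    total = 0.0
    for present in product([False, True], repeat=len(edges)):
        # One possible world: the subset of edges that actually exist.
        world = {e for e, keep in zip(edges, present) if keep}
        p_world = 1.0
        for e, keep in zip(edges, present):
            p_world *= edge_probs[e] if keep else 1.0 - edge_probs[e]
        if any(emb <= world for emb in embeddings):
            total += p_world
    return total


def expected_support(database):
    """Mean occurrence probability over (edge_probs, embeddings) pairs."""
    return sum(occurrence_probability(g, embs) for g, embs in database) / len(database)
```

For example, in a triangle whose three edges each exist with probability 0.5, a two-edge path pattern has three candidate embeddings, and it occurs exactly in the worlds where at least two edges are present.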
Abstract: This paper introduces the concrete details of combining automated reasoning techniques with planning methods, including planning as satisfiability using propositional logic, conformant planning using modal logic and disjunctive reasoning, planning as nonmonotonic logic, and flexible planning as fuzzy description logic. After considering the experimental results of the International Planning Competition and relevant papers, it concludes that planning methods based on automated reasoning techniques are helpful and can be adopted. It also outlines the challenges and possible hotspots.
Abstract: Sensor networks are an integration of sensor techniques, embedded computing techniques, distributed computing techniques, and wireless communication techniques. They can be used for testing, sensing, collecting, and processing information about monitored objects and transferring the processed information to users. The sensor network is a new research area of computer science and technology with broad application prospects, and both academia and industry are very interested in it. The concepts and characteristics of sensor networks and of the data in the networks are introduced, and the issues of sensor networks and of their data management are discussed. Advances in research on sensor networks and their data management are also presented.
Abstract: This paper presents a comprehensive survey of recommender system research to help readers understand this field. First, the research background is introduced, including commercial application demands, academic institutes, conferences, and journals. After the recommendation problem is described both formally and informally, a comparative study is conducted based on categorized algorithms. In addition, the commonly adopted benchmark datasets and evaluation methods are presented, and the main difficulties and future directions are summarized.
Abstract: Network abstraction brought about the emergence of software-defined networking (SDN). SDN decouples the data plane from the control plane and simplifies network management. The paper starts with a discussion of the background to the emergence and development of SDN and outlines its architecture, which comprises the data layer, control layer, and application layer. The key technologies are then elaborated according to the hierarchical architecture of SDN. The characteristics of consistency, availability, and tolerance are analyzed in particular. Moreover, the latest achievements for typical application scenarios are introduced. Future work is summarized at the end.
Abstract: Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in big data environments are comparatively sufficient. However, how to handle stream computing efficiently to meet requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in big data computing research. This paper studies the data computing architecture and the key issues of stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing: business intelligence, marketing, and public service. It also shows the distinctive features of stream computing in big data environments, such as real-time processing, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is optimized in system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems for big data environments. Finally, it specifically addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: In a multi-hop wireless sensor network (WSN), the sensors closest to the sink tend to deplete their energy faster than other sensors, which is known as an energy hole around the sink. No more data can be delivered to the sink after an energy hole appears, so a considerable amount of energy is wasted and the network lifetime ends prematurely. This paper investigates the energy hole problem and, based on an improved corona model with levels, concludes that assigning transmission ranges to nodes in different coronas is an effective approach to achieving an energy-efficient network. It proves that finding the optimal transmission ranges for all areas is a multi-objective optimization problem (MOP), which is NP-hard. The paper proposes an ACO (ant colony optimization)-based distributed algorithm to prolong the network lifetime, which helps nodes in different areas adaptively find approximately optimal transmission ranges based on the node distribution. Furthermore, the simulation results indicate that the network lifetime under this solution approximates that obtained with the optimal list. Compared with existing algorithms, this ACO-based algorithm not only extends the network lifetime more than twofold, but also performs well under non-uniform node distributions.
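The energy hole described in the abstract above can be seen from a back-of-the-envelope calculation. The sketch below uses a simplified textbook corona model, not the paper's improved model with levels, and rests on stated assumptions: nodes are uniformly distributed, the field is divided into concentric coronas of equal width around the sink, every node generates one unit of data, and data is relayed inward one corona per hop. Under these assumptions corona i has area proportional to 2i - 1 and must transmit the data of every node at or beyond it, whose area is proportional to K^2 - (i - 1)^2.

```python
def per_node_load(num_coronas):
    """Relative traffic each node in corona i transmits, for i = 1..num_coronas.

    Assumes uniform density, equal-width coronas, one data unit per node,
    and inward relaying one corona per hop (a simplified model).
    """
    k = num_coronas
    return [(k * k - (i - 1) ** 2) / (2 * i - 1) for i in range(1, k + 1)]
```

With four coronas, a node in the innermost corona transmits sixteen times as much as a node in the outermost one, so it exhausts its battery first. This imbalance is what motivates adjusting transmission ranges per corona.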
Abstract: Context-Aware recommender systems, aiming to further improve performance accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: With the recent development of cloud computing, the importance of cloud databases has been widely acknowledged. Here, the features, influence and related products of cloud databases are first discussed. Then, research issues of cloud databases are presented in detail, which include data model, architecture, consistency, programming model, data security, performance optimization, benchmark, and so on. Finally, some future trends in this area are discussed.
Abstract: Intrusion detection is a highlighted topic of network security research in recent years. In this paper, first the necessity of intrusion detection is presented, and its concepts and models are described. Then, many intrusion detection techniques and architectures are summarized. Finally, the existing problems and the future direction in this field are discussed.
Abstract: Many specific application oriented NoSQL database systems are developed for satisfying the new requirement of big data management. This paper surveys researches on typical NoSQL database based on key-value data model. First, the characteristics of big data, and the key technique issues supporting big data management are introduced. Then frontier efforts and research challenges are given, including system architecture, data model, access mode, index, transaction, system elasticity, load balance, replica strategy, data consistency, flash cache, MapReduce based data process and new generation data management system etc. Finally, research prospects are given.
Abstract: Software architecture (SA) has recently emerged as one of the primary research areas in software engineering and as one of the key technologies for developing large-scale software-intensive systems and software product line systems. The history and the major directions of SA are summarized, and a concept of SA is proposed based on an analysis and comparison of several classical definitions of SA. By summing up the activities related to SA, two categories of SA research are identified, and the advances in SA research are subsequently introduced from seven aspects. Additionally, some shortcomings of SA research are discussed, and their causes are explained. Finally, the paper concludes with some particularly promising trends in SA research.
Abstract: For most peer-to-peer file-swapping applications, sharing is a voluntary action, and peers are not held accountable for irresponsible bartering histories. This situation indicates that trust between participants cannot be established simply through traditional trust mechanisms. A reasonable approach to trust construction comes from social network analysis, in which trust relations between individuals are established upon the recommendations of other individuals. Current P2P trust models cannot guarantee the convergence of the iteration used for trust computation, and take no account of model security problems such as Sybil attacks and slandering. This paper presents a novel recommendation-based global trust model and gives a distributed implementation method. Mathematical analyses and simulations show that, compared to the current global trust model, the proposed model is more robust against trust security problems and more complete with respect to the iteration for computing peer trust.
Abstract: The Internet traffic model is the key issue for network performance management, Quality of Service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the traffic modeling area.
Abstract: Routing technology at the network layer is pivotal in the architecture of wireless sensor networks. As an active branch of routing technology, cluster-based routing protocols excel in network topology management, energy minimization, data aggregation and so on. In this paper, cluster-based routing mechanisms for wireless sensor networks are analyzed. Cluster head selection, cluster formation and data transmission are three key techniques in cluster-based routing protocols. As viewed from the three techniques, recent representative cluster-based routing protocols are presented, and their characteristics and application areas are compared. Finally, the future research issues in this area are pointed out.
Abstract: An ad hoc network is a collection of wireless mobile nodes dynamically forming a temporary network without the use of any existing network infrastructure or centralized administration. Due to bandwidth constraint and dynamic topology of mobile ad hoc networks, multipath supported routing is a very important research issue. In this paper, we present an entropy-based metric to support stability multipath on-demand routing (SMDR). The key idea of SMDR protocol is to construct the new metric-entropy and select the stability multipath with the help of entropy metric to reduce the number of route reconstruction so as to provide QoS guarantee in the ad hoc network whose topology changes continuously. Simulation results show that, with the proposed multipath routing protocol, packet delivery ratio, end-to-end delay, and routing overhead ratio can be improved in most of cases. It is an available approach to multipath routing decision.
Abstract: Constrained optimization problems (COPs) are mathematical programming problems frequently encountered in the disciplines of science and engineering application. Solving COPs has become an important research area of evolutionary computation in recent years. In this paper, the state-of-the-art of constrained optimization evolutionary algorithms (COEAs) is surveyed from two basic aspects of COEAs (i.e., constraint-handling techniques and evolutionary algorithms). In addition, this paper discusses some important issues of COEAs. More specifically, several typical algorithms are analyzed in detail. Based on the analyses, it concluded that to obtain competitive results, a proper constraint-handling technique needs to be considered in conjunction with an appropriate search algorithm. Finally, the open research issues in this field are also pointed out.
Abstract: In recent years, transfer learning has attracted a vast amount of attention and research. Transfer learning is a new machine learning method that applies knowledge from related but different domains to target domains. It relaxes the two basic assumptions of traditional machine learning: (1) the training data (also referred to as the source domain) and the test data (also referred to as the target domain) follow the independent and identically distributed (i.i.d.) condition; (2) there are enough labeled samples to learn a good classification model. It thereby aims to solve problems in which there are few or even no labeled data in the target domains. This paper surveys the research progress of transfer learning and introduces the authors' own work, especially on building transfer learning models by applying generative models at the concept level. Finally, the paper introduces applications of transfer learning, such as text classification and collaborative filtering, and suggests future research directions for transfer learning.
Abstract: As an important application of acceleration in the cloud, the distributed caching technology has received considerable attention in industry and academia. This paper starts with a discussion on the combination of cloud computing and distributed caching technology, giving an analysis of its characteristics, typical application scenarios, stages of development, standards, and several key elements, which have promoted its development. In order to systematically know the state of art progress and weak points of the distributed caching technology, the paper builds a multi-dimensional framework, DctAF. This framework is constituted of 6 dimensions through analyzing the characteristics of cloud computing and boundary of the caching techniques. Based on DctAF, current techniques have been analyzed and summarized; comparisons among several influential products have also been made. Finally, the paper describes and highlights the several challenges that the cache system faces and examines the current research through in-depth analysis and comparison.
Abstract: The recommendation system is one of the most important technologies in E-commerce. With the development of E-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures work poorly in this situation, and the quality of the recommendation system decreases dramatically. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method uses the similarity of items to predict the ratings of items that users have not rated, and then uses a new similarity measure to find the target users' neighbors. The experimental results show that this method can effectively alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
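The rating-prediction step described in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity between item columns and a similarity-weighted average as the predictor; the abstract does not specify the paper's actual similarity measure, so these choices and all function names (`item_cosine`, `predict_rating`, `fill_missing`) are illustrative assumptions, and a rating of 0 is assumed to mean "unrated".

```python
import math

def item_cosine(ratings, i, j):
    """Cosine similarity of items i and j over users who rated both (0 = unrated)."""
    num = den_i = den_j = 0.0
    for row in ratings:
        if row[i] > 0 and row[j] > 0:
            num += row[i] * row[j]
            den_i += row[i] ** 2
            den_j += row[j] ** 2
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / math.sqrt(den_i * den_j)

def predict_rating(ratings, user, item):
    """Predict a missing rating as the similarity-weighted mean of the user's own ratings."""
    num = den = 0.0
    for other, r in enumerate(ratings[user]):
        if other == item or r == 0:
            continue
        s = item_cosine(ratings, item, other)
        num += s * r
        den += abs(s)
    # Fall back to 0 (still unrated) when no similar rated item exists.
    return num / den if den > 0 else 0.0

def fill_missing(ratings):
    """Return a new matrix in which unrated cells are replaced by predictions."""
    return [
        [r if r > 0 else predict_rating(ratings, u, i) for i, r in enumerate(row)]
        for u, row in enumerate(ratings)
    ]

# Rows are users, columns are items.
ratings = [[5, 3, 0], [4, 0, 4], [1, 1, 5]]
filled = fill_missing(ratings)
```

The filled matrix is denser than the original, so user-to-user similarity can then be computed over many more co-rated items, which is the point of predicting before searching for neighbors.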
Abstract: Visual language techniques have exhibited more advantages in describing various software artifacts than one-dimensional textual languages during software development, ranging from the requirement analysis and design to testing and maintenance, as diagrammatic and graphical notations have been well applied in modeling system. In addition to an intuitive appearance, graph grammars provide a well-established foundation for defining visual languages with the power of precise modeling and verification on computers. This paper discusses the issues and techniques for a formal foundation of visual languages, reviews related practical graphical environments, presents a spatial graph grammar formalism, and applies the spatial graph grammar to defining behavioral semantics of UML diagrams and developing a style-driven framework for software architecture design.
Abstract: Widespread deployment of interactive information visualization is difficult. Non-specialist users need a general development method and a toolkit that support the generic data structures suited to tree, network, and multi-dimensional data, special visualization techniques and interaction techniques, and well-known generic information tasks. This paper presents a model-driven development method for interactive information visualization. First, an interactive information visualization interface model (IIVM) is proposed. Then, the development method for interactive information visualization based on IIVM is presented. The Daisy toolkit is introduced, which includes the Daisy model builder, the Daisy IIV generator, and a runtime framework with the Daisy library. Finally, an application example is given. Experimental results show that Daisy can provide a general solution for the development of interactive information visualization.
Abstract: Computer forensics is the technology field that attempts to provide thorough, efficient, and secure means to investigate computer crime. Computer evidence must be authentic, accurate, complete, and convincing to juries. In this paper, the stages of computer forensics are presented, and the theories and the realization of forensics software are described. An example of forensic practice is also given. The deficiencies of computer forensics techniques and anti-forensics are also discussed. The paper concludes that, as computer science and technology improve, forensics techniques will become more integrated and thorough.
Abstract: The crucial technologies related to personalization are introduced in this paper, which include the representation and modification of user profile, the representation of resource, the recommendation technology, and the architecture of personalization. By comparing with some existing prototype systems, the key technologies about how to implement personalization are discussed in detail. In addition, three representative personalization systems are analyzed. At last, some research directions for personalization are presented.
Abstract: Botnets are one of the most serious threats to the Internet. Researchers have done plenty of research and made significant progress. However, botnets keep evolving and have become more and more sophisticated. Due to the underlying security limitation of current system and Internet architecture, and the complexity of botnet itself, how to effectively counter the global threat of botnets is still a very challenging issue. This paper first introduces the evolving of botnet’s propagation, attack, command, and control mechanisms. Then the paper summarizes recent advances of botnet defense research and categorizes into five areas: Botnet monitoring, botnet infiltration, analysis of botnet characteristics, botnet detection and botnet disruption. The limitation of current botnet defense techniques, the evolving trend of botnet, and some possible directions for future research are also discussed.
Abstract: In this paper, a framework is proposed for handling fault of service composition through analyzing fault requirements. Petri nets are used in the framework for fault detecting and its handling, which focuses on targeting the failure of available services, component failure and network failure. The corresponding fault models are given. Based on the model, the correctness criterion of fault handling is given to analyze fault handling model, and its correctness is proven. Finally, CTL (computational tree logic) is used to specify the related properties and enforcement algorithm of fault analysis. The simulation results show that this method can ensure the reliability and consistency of service composition.
Abstract: Software defect prediction has been one of the active areas of software engineering since it was developed in the 1970s. It plays a very important role in analyzing software quality and balancing software cost. This paper investigates and discusses the motivation, evolution, solutions, and challenges of software defect prediction technologies, and it also categorizes, analyzes, and compares representative prediction technologies. Some case studies of software defect distribution models are given to aid understanding.
Abstract: As an application of mobile ad hoc networks (MANET) on Intelligent Transportation Information System, the most important goal of vehicular ad hoc networks (VANET) is to reduce the high number of accidents and fatal consequences dramatically. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper introduces the characteristics and application fields of VANET briefly. Then, it discusses the characteristics, performance, and application areas with analysis and comparison of various categories of broadcast protocols in VANET. According to the characteristic of VANET and its application requirement, the paper proposes the ideas and breakthrough direction of information broadcast model design of inter-vehicle communication.
Abstract: The knapsack problem (KP) is a well-known combinatorial optimization problem that includes the 0-1 KP, bounded KP, multi-constraint KP, multiple KP, multiple-choice KP, quadratic KP, dynamic KP, discounted KP, and other types of KPs. KP can be considered a mathematical model abstracted from a variety of real-world fields and therefore has wide applications. Evolutionary algorithms (EAs) are widely considered an efficient tool for solving KP approximately and quickly. This paper presents a survey on solving KP by EAs over the past ten years. It not only discusses various KP encoding mechanisms and the processing of infeasible individuals but also provides useful guidelines for designing new EAs to solve KPs.
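As an illustration of what "encoding" and "infeasible solution processing" mean for the 0-1 KP, the following is a minimal sketch and not a method taken from the survey. It assumes a binary encoding (one bit per item) and a greedy repair operator that drops selected items in increasing value-to-weight order until the capacity constraint holds; the function names are hypothetical and all weights are assumed to be positive.

```python
def total(bits, xs):
    """Sum the entries of xs whose corresponding bit is set."""
    return sum(x for b, x in zip(bits, xs) if b)

def greedy_repair(bits, weights, values, capacity):
    """Drop selected items, worst value/weight ratio first, until the load fits."""
    repaired = list(bits)
    order = sorted(
        (i for i, b in enumerate(repaired) if b),
        key=lambda i: values[i] / weights[i],
    )
    load = total(repaired, weights)
    for i in order:
        if load <= capacity:
            break
        repaired[i] = 0
        load -= weights[i]
    return repaired

def fitness(bits, weights, values, capacity):
    """Fitness of an individual is the total value of its repaired (feasible) form."""
    return total(greedy_repair(bits, weights, values, capacity), values)

weights = [2, 3, 4, 5]
values = [3, 4, 5, 6]
overloaded = [1, 1, 1, 1]  # weighs 14, but the capacity below is only 5
repaired = greedy_repair(overloaded, weights, values, capacity=5)
```

Repair is only one way to treat infeasible individuals; penalty functions that lower the fitness of overweight solutions are a common alternative.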
Abstract: Data deduplication technologies can be divided into two categories: a) identical data detection techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey on these two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since data deduplication technologies can affect the reliability and performance of storage systems, this paper also surveys various kinds of technologies proposed to cope with these two aspects of problems. Based on the analysis of the current state of research on data deduplication technologies, this paper makes several conclusions as follows: a) How to mine data characteristic information in data deduplication has not been completely solved, and how to use data characteristic information to effectively eliminate duplicate data also needs further study; b) From the perspective of storage system design, it still needs further study how to introduce proper mechanisms to overcome the reliability limitations of data deduplication techniques and reduce the additional system overheads caused by data deduplication techniques.
Abstract: In recent years, there have been extensive studies and rapid progresses in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-art challenging issues and research trends for content information processing of Internet and other complex applications, this paper presents a survey on the up-to-date development in text categorization based on machine learning, including model, algorithm and evaluation. It is pointed out that problems such as nonlinearity, skewed data distribution, labeling bottleneck, hierarchical categorization, scalability of algorithms and categorization of Web pages are the key problems to the study of text categorization. Possible solutions to these problems are also discussed respectively. Finally, some future directions of research are given.
Abstract: Web search engine has become a very important tool for finding information efficiently from the massive Web data. With the explosive growth of the Web data, traditional centralized search engines become harder to catch up with the growing step of people's information needs. With the rapid development of peer-to-peer (P2P) technology, the notion of P2P Web search has been proposed and quickly becomes a research focus. The goal of this paper is to give a brief summary of current P2P Web search technologies in order to facilitate future research. First, some main challenges for P2P Web search are presented. Then, key techniques for building a feasible and efficient P2P Web search engine are reviewed, including system topology, data placement, query routing, index partitioning, collection selection, relevance ranking and Web crawling. Finally, three recently proposed novel P2P Web search prototypes are introduced.
Abstract: This paper presents research work on the children's Turing test (CTT). The main difference between our test program and others is its knowledge-based character, which is supported by a massive commonsense knowledge base. The motivation, design, techniques, experimental results, and platform (including a knowledge engine and a conversation engine) of the CTT are described in this paper. Finally, some concluding thoughts about the CTT and AI are given.
Abstract: The sensor network, which arises from the convergence of sensor, micro-electro-mechanical system, and network technologies, is a novel technology for acquiring and processing information. In this paper, the architecture of the wireless sensor network is briefly introduced. Next, some valuable applications are explained and forecast. Building on existing work, hot topics including power-aware routing and media access control schemes are discussed and presented in detail. Finally, taking application requirements into account, several future research directions are put forward.
Abstract: The current state of research and recent progress in clustering algorithms are summarized in this paper. First, some representative clustering algorithms are analyzed and summarized from several aspects, such as the underlying ideas, key techniques, advantages, and disadvantages. Second, several typical clustering algorithms and well-known data sets are selected, and simulation experiments are carried out to assess both accuracy and running efficiency; the behavior of one algorithm on different data sets is analyzed and compared with the clustering of the same data set under different algorithms. Finally, by integrating the information from these two aspects, the research hotspots, difficulties, and shortcomings of data clustering, as well as some open problems, are addressed. This work can serve as a valuable reference for data clustering and data mining.
Abstract: This paper surveys the state of the art of sentiment analysis. First, three important tasks of sentiment analysis are summarized and analyzed in detail, including sentiment extraction, sentiment classification, sentiment retrieval and summarization. Then, the evaluation and corpus for sentiment analysis are introduced. Finally, the applications of sentiment analysis are concluded. This paper aims to take a deep insight into the mainstream methods and recent progress in this field, making detailed comparison and analysis.
Abstract: Cloud Computing is the fundamental change happening in the field of Information Technology. It represents a movement towards intensive, large-scale specialization. On the other hand, it brings not only convenience and efficiency, but also great challenges in the field of data security and privacy protection. Currently, security is regarded as one of the greatest problems in the development of Cloud Computing. This paper describes the major requirements in Cloud Computing, key security technologies, standards, regulations, etc., and provides a Cloud Computing security framework. This paper argues that the changes in the above aspects will result in a technical revolution in the field of information security.
Abstract: Network community structure is one of the most fundamental and important topological properties of complex networks: links between nodes within a community are very dense, whereas links between communities are quite sparse. Network clustering algorithms, which aim to discover all natural network communities in given complex networks, are fundamentally important for both theoretical research and practical applications, and can be used to analyze the topological structures, understand the functions, recognize the hidden patterns, and predict the behaviors of complex networks including social networks, biological networks, the World Wide Web, and so on. This paper reviews the background, the motivation, the state of the art, and the main issues of existing work on discovering network communities, and tries to draw a comprehensive and clear outline of this new and active research area. This work is hopefully beneficial to researchers from the communities of complex network analysis, data mining, intelligent Web, and bioinformatics.
Abstract: This paper surveys the current technologies adopted in cloud computing as well as the systems in enterprises. Cloud computing can be viewed from two different aspects. One is about the cloud infrastructure which is the building block for the up layer cloud application. The other is of course the cloud application. This paper focuses on the cloud infrastructure including the systems and current research. Some attractive cloud applications are also discussed. Cloud computing infrastructure has three distinct characteristics. First, the infrastructure is built on top of large scale clusters which contain a large number of cheap PC servers. Second, the applications are co-designed with the fundamental infrastructure that the computing resources can be maximally utilized. Third, the reliability of the whole system is achieved by software building on top of redundant hardware instead of mere hardware. All these technologies are for the two important goals for distributed system: high scalability and high availability. Scalability means that the cloud infrastructure can be expanded to very large scale even to thousands of nodes. Availability means that the services are available even when quite a number of nodes fail. From this paper, readers will capture the current status of cloud computing as well as its future trends.
Abstract: Evolutionary multi-objective optimization (EMO), whose main task is to deal with multi-objective optimization problems by evolutionary computation, has become a hot topic in evolutionary computation community. After summarizing the EMO algorithms before 2003 briefly, the recent advances in EMO are discussed in details. The current research directions are concluded. On the one hand, more new evolutionary paradigms have been introduced into EMO community, such as particle swarm optimization, artificial immune systems, and estimation distribution algorithms. On the other hand, in order to deal with many-objective optimization problems, many new dominance schemes different from traditional Pareto-dominance come forth. Furthermore, the essential characteristics of multi-objective optimization problems are deeply investigated. This paper also gives experimental comparison of several representative algorithms. Finally, several viewpoints for the future research of EMO are proposed.
Abstract: This paper makes a comprehensive survey of the recommender system research aiming to facilitate readers to understand this field. First the research background is introduced, including commercial application demands, academic institutes, conferences and journals. After formally and informally describing the recommendation problem, a comparison study is conducted based on categorized algorithms. In addition, the commonly adopted benchmarked datasets and evaluation methods are exhibited and most difficulties and future directions are concluded.
Abstract: The graphics processing unit (GPU) has been developing rapidly in recent years at a speed exceeding Moore's law, and as a result, various applications associated with computer graphics have advanced greatly. At the same time, the high processing power, parallelism, and programmability now available on contemporary GPUs provide an ideal platform for general-purpose computation. Starting from an introduction to the development history and the architecture of the GPU, the technical fundamentals of the GPU are described in the paper. Then, in the main part of the paper, the development of various applications of general-purpose computation on GPU is introduced, and among those applications, fluid dynamics, algebraic computation, database operations, and spectrum analysis are introduced in detail. The experience of our work on fluid dynamics is also given, and the development of software tools in this area is introduced. Finally, a conclusion is drawn, and the future development and the new challenges in both hardware and software in this subject are discussed.
Abstract: Automatic generation of poetry has always been considered a hard nut to crack in natural language generation. This paper reports some pioneering research on a possible genetic algorithm and its use in the automatic generation of SONGCI. In light of the characteristics of Chinese ancient poetry, this paper designs a coding method based on level and oblique tones, a syntactically and semantically weighted fitness function, a selection operator combining elitism and roulette, a partially mapped crossover operator, and a heuristic mutation operator. As shown by tests, the system constructed on the basis of the computing model designed in this paper is basically capable of generating Chinese SONGCI with some aesthetic merit. This work represents progress in the field of automatic generation of Chinese poetry.
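A selection operator that combines elitism with roulette-wheel sampling, as named in this abstract, might look like the following minimal sketch. The abstract gives no implementation details, so the function name `select`, its parameters, and the toy fitness function are assumptions made for illustration; fitness values are assumed to be non-negative and not all zero.

```python
import random

def select(population, fitness, n_elite, rng):
    """Keep the n_elite fittest individuals, then fill the remaining slots by
    roulette-wheel sampling (probability proportional to fitness)."""
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:n_elite]
    weights = [fitness(ind) for ind in population]
    # random.Random.choices samples with replacement, proportionally to weights.
    rest = rng.choices(population, weights=weights, k=len(population) - n_elite)
    return elite + rest

# Toy example: individuals are strings and fitness is string length.
population = ["a", "bb", "ccc", "dddd"]
next_generation = select(population, len, n_elite=1, rng=random.Random(0))
```

Elitism guarantees that the best individual found so far survives unchanged, while the roulette wheel keeps some weaker individuals in play so that the population retains diversity.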
Abstract: This paper first introduces the key features of big data in different processing modes and their typical application scenarios, as well as corresponding representative processing systems. It then summarizes three development trends of big data processing systems. Next, the paper gives a brief survey on system supported analytic technologies and applications (including deep learning, knowledge computing, social computing, and visualization), and summarizes the key roles of individual technologies in big data analysis and understanding. Finally, the paper lays out three grand challenges of big data processing and analysis, i.e., data complexity, computation complexity, and system complexity. Potential ways for dealing with each complexity are also discussed.
Abstract: Probabilistic graphical models are powerful tools for compactly representing complex probability distributions, efficiently computing (approximate) marginal and conditional distributions, and conveniently learning parameters and hyperparameters in probabilistic models. As a result, they have been widely used in applications that require some sort of automated probabilistic reasoning, such as computer vision and natural language processing, as a formal approach to deal with uncertainty. This paper surveys the basic concepts and key results of representation, inference and learning in probabilistic graphical models, and demonstrates their uses in two important probabilistic models. It also reviews some recent advances in speeding up classic approximate inference algorithms, followed by a discussion of promising research directions.
Abstract: Considered the next-generation computing model, cloud computing plays an important role in scientific and commercial computing and draws great attention from both academia and industry. In a cloud computing environment, a data center consists of a large number of computers, usually up to millions, and stores petabytes or even exabytes of data, which may easily lead to failures of computers or data. The large number of computers not only poses great challenges to the scalability of the data center and its storage system, but also results in high hardware infrastructure and power costs. Therefore, the fault tolerance, scalability, and power consumption of the distributed storage of a data center become key parts of cloud computing technology for ensuring data availability and reliability. In this paper, a survey is made of the state of the art of the key technologies in cloud computing in the following aspects: design of the data center network, organization and arrangement of data, strategies to improve fault tolerance, and methods to save storage space and energy. First, many kinds of classical data center network topologies are introduced and compared. Second, current fault-tolerant storage techniques are discussed, and data replication and erasure code strategies in particular are compared. Third, the main current energy-saving technologies are addressed and analyzed. Finally, challenges in distributed storage are reviewed and future research trends are predicted.
Abstract: Context-aware recommender systems, which aim to further improve recommendation accuracy and user satisfaction by fully utilizing contextual information, have recently become one of the hottest topics in the domain of recommender systems. This paper presents an overview of the field of context-aware recommender systems from a process-oriented perspective, including system frameworks, key techniques, main models, evaluation, and typical applications. The prospects for future development and suggestions for possible extensions are also discussed.
Abstract: Android is a modern software platform for smartphones and the most popular one. According to reports, Android accounted for 81% of all smartphones in 2014 and shipped over 1 billion units worldwide for the first time. Apple, Microsoft, Blackberry and Firefox trailed a long way behind. At the same time, the increased popularity of Android smartphones has attracted hackers, leading to a massive increase in Android malware applications. This paper summarizes and analyzes the latest advances in Android security from multidimensional perspectives, covering Android architecture, design principles, security mechanisms, major security threats, classification and detection of malware, static and dynamic analyses, machine learning approaches, and security extension proposals.
Abstract: Computer-aided detection/diagnosis (CAD) can improve the accuracy of diagnosis, reduce false positives, and provide decision support for doctors. The main purpose of this paper is to analyze the latest development of computer-aided diagnosis tools. Focusing on the incidence sites of the four most fatal cancers, this survey reviews major recent publications on CAD applications in different medical imaging areas, organized by imaging technique and disease. Furthermore, a multidimensional analysis of the research is made in terms of image data sets, algorithms, and evaluation methods. Finally, existing problems, research trends, and development directions in the field of medical image CAD systems are discussed.
Abstract: In many areas such as science, simulation, the Internet, and e-commerce, the volume of data to be analyzed is growing rapidly. Parallel techniques that can be scaled out cost-effectively are needed to deal with big data. Relational data management techniques have a history of nearly 40 years. They now face the tough obstacle of scalability, in that relational techniques cannot easily handle large data. Meanwhile, non-relational techniques, with MapReduce as a typical representative, have emerged as a new force and have expanded their application from Web search to territories that used to be occupied by relational database systems. They challenge relational techniques with high availability, high scalability, and massively parallel processing capability. The relational technique community, after losing the big deal of Web search, has begun to learn from MapReduce. MapReduce also borrows valuable ideas from the relational technique community to improve performance. Relational techniques and MapReduce compete with and learn from each other; new data analysis platforms and a new data analysis ecosystem are emerging. Eventually, the two camps of techniques will find their right places in the new ecosystem of big data analysis.
Abstract: Wireless sensor networks, a novel technology for acquiring and processing information, have been proposed for a multitude of diverse applications. The problem of self-localization, that is, determining where a given node is physically or relatively located in the network, is challenging and yet extremely crucial for many applications. This paper describes the performance evaluation criteria and the taxonomy of self-localization systems and algorithms for wireless sensor networks, discusses the principles and characteristics of recent representative localization approaches, and introduces the directions of research in this area.
Abstract: The Internet traffic model is a key issue for network performance management, Quality of Service management, and admission control. The paper first summarizes the primary characteristics of Internet traffic, as well as the metrics of Internet traffic. It also illustrates the significance and classification of traffic modeling. Next, the paper chronologically categorizes the research activities of traffic modeling into three phases: 1) traditional Poisson modeling; 2) self-similar modeling; and 3) new research debates and new progress. Thorough reviews of the major research achievements of each phase are conducted. Finally, the paper identifies some open research issues and points out possible future research directions in the area of traffic modeling.
Abstract: The task parallel programming model is a widely used parallel programming model on multi-core platforms. With the intention of simplifying parallel programming and improving the utilization of multiple cores, this paper provides an introduction to the essential programming interfaces and the supporting mechanisms used in task parallel programming models and discusses issues and the latest achievements from three perspectives: parallelism expression, data management, and task scheduling. Finally, some future trends in this area are discussed.
Abstract: Network abstraction gave rise to software-defined networking (SDN). SDN decouples the data plane from the control plane and simplifies network management. The paper starts with a discussion of the background to the emergence and development of SDN and reviews its architecture, which includes the data layer, control layer, and application layer. Then the key technologies of each layer are elaborated according to the hierarchical architecture of SDN. The characteristics of consistency, availability, and tolerance are especially analyzed. Moreover, the latest achievements for specific scenarios are introduced. Future work is summarized at the end.
Abstract: Few-shot learning is defined as learning models to solve problems from small samples. In recent years, under the trend of training models with big data, machine learning and deep learning have achieved success in many fields. However, in many real-world application scenarios, there is not a large amount of data or labeled data for model training, and labeling a large number of unlabeled samples costs a lot of manpower. Therefore, how to learn from a small number of samples has become a problem that needs attention. This paper systematically reviews the current approaches to few-shot learning. It introduces the corresponding models in three categories: fine-tuning based, data augmentation based, and transfer learning based. The data augmentation based approaches are subdivided into unlabeled data based, data generation based, and feature augmentation based approaches. The transfer learning based approaches are subdivided into metric learning based, meta-learning based, and graph neural network based methods. The paper then summarizes the few-shot datasets and the experimental results of the aforementioned models. Next, the paper summarizes the current situation and challenges in few-shot learning. Finally, the future technological development of few-shot learning is discussed.
Abstract: The development of the mobile Internet and the popularity of mobile terminals produce massive trajectory data of moving objects in the era of big data. Trajectory data has spatio-temporal characteristics and rich information. Trajectory data processing techniques can be used to mine the patterns of human activities and behaviors, the moving patterns of vehicles in the city, and the changes of the atmospheric environment. However, trajectory data can also be exploited to disclose moving objects' private information (e.g., behaviors, hobbies, and social relationships). Accordingly, attackers can easily access moving objects' private information by digging into their trajectory data, such as activities and check-in locations. On another research front, quantum computation presents an important theoretical direction for mining big data due to its scalable and powerful storage and computing capacity. Applying quantum computing approaches to trajectory big data could make some complex problems solvable and achieve higher efficiency. This paper reviews the key technologies of processing trajectory data. First, the concept and characteristics of trajectory data are introduced, and the pre-processing methods, including noise filtering and data compression, are summarized. Then, the trajectory indexing and querying techniques and the current achievements in mining trajectory data, such as pattern mining and trajectory classification, are reviewed. Next, an overview of the basic theories and characteristics of privacy preservation with respect to trajectory data is provided. The supporting techniques of trajectory big data mining, such as processing frameworks and data visualization, are presented in detail. Some possible ways of applying quantum computation to trajectory data processing, as well as the implementation of some core trajectory mining algorithms by quantum computation, are also described. Finally, the challenges of trajectory data processing and promising future research directions are discussed.
Abstract: An attribute-based encryption (ABE) scheme takes attributes as the public key and associates the ciphertext and the user’s secret key with attributes, so that it can support expressive access control policies. This dramatically reduces the cost of network bandwidth and the sending node’s operations in fine-grained access control of data sharing. Therefore, ABE has broad application prospects in the area of fine-grained access control. After analyzing the basic ABE system and its two variants, Key-Policy ABE (KP-ABE) and Ciphertext-Policy ABE (CP-ABE), this study elaborates the research problems relating to ABE systems, including access structure design for CP-ABE, attribute key revocation, key abuse, and multi-authority ABE, with an extensive comparison of their functionality and performance. Finally, this study discusses the open problems and main research directions in ABE.
Abstract: Nowadays it has been widely accepted that the quality of software highly depends on the process that is carried out in an organization. As part of the effort to support software process engineering activities, the research on software process modeling and analysis is to provide an effective means to represent and analyze a process and, by doing so, to enhance the understanding of the modeled process. In addition, an enactable process model can provide direct guidance for the actual development process. Thus, the enforcement of the process model can directly contribute to the improvement of the software quality. In this paper, a systematic review is carried out to survey the recent development in software process modeling. 72 papers from 20 conference proceedings and 7 journals are identified as the evidence. The review aims to promote a better understanding of the literature by answering the following three questions: 1) What kinds of paradigms are existing methods based on? 2) What kinds of purposes does the existing research have? 3) What kinds of new trends are reflected in the current research? After providing the systematic review, we present our software process modeling method based on a multi-dimensional and integration methodology that is intended to address several core issues facing the community.
Abstract: The appearance of many intelligent devices equipped for short-range wireless communications has boosted the rapid rise of wireless ad hoc network applications. However, in many realistic application environments, nodes form a disconnected network most of the time due to nodal mobility, low density, lossy links, etc. The conventional communication model of the mobile ad hoc network (MANET) requires at least one path from the source node to the destination node, which results in communication failure in these scenarios. Opportunistic networks utilize the communication opportunities arising from node movement to forward messages hop by hop, and implement communications between nodes based on the "store-carry-forward" routing pattern. This networking approach, totally different from the traditional communication model, has attracted great interest from researchers. This paper first introduces the concepts and theories of opportunistic networks and some current typical applications. Then it elaborates on the popular research problems, including opportunistic forwarding mechanisms, mobility models, and opportunistic data dissemination and retrieval. Some other interesting research points, such as communication middleware, cooperation and security problems, and new applications, are stated briefly. Finally, the paper concludes and looks ahead to possible future research focuses for opportunistic networks.
Abstract: In recent years, there have been extensive studies and rapid progress in automatic text categorization, which is one of the hotspots and key techniques in the information retrieval and data mining field. Highlighting the state-of-the-art challenging issues and research trends for content information processing of the Internet and other complex applications, this paper presents a survey of the up-to-date development in text categorization based on machine learning, including models, algorithms, and evaluation. It is pointed out that nonlinearity, skewed data distribution, the labeling bottleneck, hierarchical categorization, scalability of algorithms, and categorization of Web pages are the key problems in the study of text categorization. Possible solutions to each of these problems are also discussed. Finally, some future directions of research are given.
Abstract: Uncertainty exists widely in the subjective and objective world. Among all kinds of uncertainty, randomness and fuzziness are the most important and fundamental. In this paper, the relationship between randomness and fuzziness is discussed. Uncertain states and their changes can be measured by entropy and hyper-entropy, respectively. Taking advantage of entropy and hyper-entropy, the uncertainty of chaos, fractals, and complex networks through their various evolutions and differentiations is further studied. A simple and effective way is proposed to simulate the uncertainty by means of knowledge representation, which provides a basis for the automation of both logical and image thinking with uncertainty. AI (artificial intelligence) with uncertainty is a new cross-discipline, which covers computer science, physics, mathematics, brain science, psychology, cognitive science, biology, and philosophy, and results in the automation of representation, processing, and thinking for uncertain information and knowledge.
Abstract: This paper offers reflections on four aspects: 1) revealing the development history of software engineering technology from the perspective of the general laws by which things develop; 2) analyzing the construction of each abstraction layer of the virtual machine from the perspective of the inherent characteristics of software; 3) proposing the research content of the software engineering discipline and studying the pattern of industrialized software production from the perspective of software development; 4) exploring the development trend of software technology in light of the emergence of Internet technology.
Abstract: The distributed denial-of-service (DDoS) attack is a major threat to the current network. Based on the attack packet level, the study divides DDoS attacks into network-level DDoS attacks and application-level DDoS attacks. Next, the study analyzes the detection and control methods for these two kinds of DDoS attacks in detail, and it also analyzes the drawbacks of different control methods implemented at different network positions. Finally, the study analyzes the drawbacks of the current detection and control methods and the development trend of DDoS filter systems, and presents the corresponding technological challenges.
Abstract: This paper surveys the state of the art of speech emotion recognition (SER) and presents an outlook on the trend of future SER technology. First, the survey summarizes and analyzes SER in detail from five perspectives: emotion representation models, representative emotional speech corpora, emotion-related acoustic feature extraction, SER methods, and applications. Then, based on the survey, the challenges faced by current SER research are summarized. This paper aims to provide deep insight into the mainstream methods and recent progress in this field, and presents a detailed comparison and analysis of these methods.
Abstract: In recent years, the rapid development of Internet technology and Web applications has triggered an explosion of various data on the Internet, which generates a large amount of valuable knowledge. How to organize, represent, and analyze this knowledge has attracted much attention. The knowledge graph was thus developed to organize this knowledge in a semantic and visualized manner. Knowledge reasoning over knowledge graphs has since become one of the hot research topics and plays an important role in many applications such as vertical search and intelligent question answering. The goal of knowledge reasoning over knowledge graphs is to infer new facts or identify erroneous facts according to existing ones. Unlike traditional knowledge reasoning, knowledge reasoning over knowledge graphs is more diversified, due to the simplicity, intuitiveness, flexibility, and richness of knowledge representation in knowledge graphs. Starting with the basic concept of knowledge reasoning, this paper presents a survey of the recently developed methods for knowledge reasoning over knowledge graphs. Specifically, the research progress is reviewed in detail from two aspects: one-step reasoning and multi-step reasoning, each including rule based reasoning, distributed embedding based reasoning, neural network based reasoning, and hybrid reasoning. Finally, future research directions and the outlook of knowledge reasoning over knowledge graphs are discussed.
Abstract: As an application of mobile ad hoc networks (MANET) to intelligent transportation information systems, the most important goal of vehicular ad hoc networks (VANET) is to dramatically reduce the high number of accidents and their fatal consequences. One of the most important factors that would contribute to the realization of this goal is the design of effective broadcast protocols. This paper briefly introduces the characteristics and application fields of VANET. Then, it analyzes and compares the characteristics, performance, and application areas of various categories of broadcast protocols in VANET. According to the characteristics of VANET and its application requirements, the paper proposes ideas and breakthrough directions for the design of information broadcast models for inter-vehicle communication.
Abstract: The recommendation system is one of the most important technologies in e-commerce. With the development of e-commerce, the numbers of users and commodities grow rapidly, resulting in extreme sparsity of user rating data. Traditional similarity measures work poorly in this situation, which dramatically decreases the quality of recommendation systems. To address this issue, a novel collaborative filtering algorithm based on item rating prediction is proposed. This method uses the similarity between items to predict ratings for items that users have not rated, and then uses a new similarity measure to find the target users' neighbors. The experimental results show that this method can effectively alleviate the extreme sparsity of user rating data and provide better recommendation results than traditional collaborative filtering algorithms.
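The two steps named in this abstract (fill in unrated items from item similarity, then search for neighbors on the filled data) can be illustrated with a minimal sketch. It rests on stated assumptions: the abstract does not define its "new similarity measure", so plain cosine similarity over co-rated entries stands in for it, and the function names `cosine`, `fill_missing`, and `nearest_users` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity over positions rated in both vectors (None = unrated)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    dot = sum(x * y for x, y in pairs)
    na = math.sqrt(sum(x * x for x, _ in pairs))
    nb = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (na * nb) if na and nb else 0.0

def fill_missing(ratings):
    """Predict each missing rating as a similarity-weighted mean of the
    same user's ratings on other items (item-based prediction)."""
    n_users, n_items = len(ratings), len(ratings[0])
    columns = [[ratings[u][i] for u in range(n_users)] for i in range(n_items)]
    filled = [row[:] for row in ratings]
    for u in range(n_users):
        for i in range(n_items):
            if ratings[u][i] is not None:
                continue
            num = den = 0.0
            for j in range(n_items):
                if j == i or ratings[u][j] is None:
                    continue
                s = cosine(columns[i], columns[j])
                num += s * ratings[u][j]
                den += abs(s)
            # Stay unrated when no similar rated item is available.
            filled[u][i] = num / den if den else None
    return filled

def nearest_users(filled, target, k=1):
    """Rank the other users by similarity to `target` on the filled matrix.
    Cosine is an assumption; the paper's own measure is not given here."""
    scores = [(cosine(filled[target], row), u)
              for u, row in enumerate(filled) if u != target]
    scores.sort(reverse=True)
    return [u for _, u in scores[:k]]

ratings = [
    [5, 4, None],  # user 0 has not rated item 2
    [5, 5, 1],
    [1, 1, 5],
]
filled = fill_missing(ratings)
neighbors = nearest_users(filled, target=0, k=1)
```

Because the neighbor search runs on the filled matrix, two users no longer need many co-rated items to be compared, which is the point of predicting the missing ratings first.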
Abstract: Ultrasonography is the first choice of imaging examination and preoperative evaluation for thyroid and breast cancer. However, the ultrasonic characteristics of benign and malignant nodules commonly overlap. The diagnosis relies heavily on the operator's experience rather than on quantitative and stable methods. In recent years, medical imaging analysis based on computer technology has developed rapidly, and a series of landmark breakthroughs have been made, which provide effective decision support for medical imaging diagnosis. In this work, the research progress of computer vision and image recognition technologies in thyroid and breast ultrasound images is studied. The work is organized around the series of key technologies involved in the automatic diagnosis of ultrasound images. The major algorithms of recent years are summarized and analyzed, such as ultrasound image preprocessing, lesion localization and segmentation, and feature extraction and classification. Moreover, a multi-dimensional analysis is made of the algorithms, data sets, and evaluation methods. Finally, existing problems related to the automatic analysis of these two kinds of ultrasound images are discussed, along with research trends and development directions in the field of ultrasound image analysis.
Abstract: This paper presents a survey of the theory of provable security and its applications to the design and analysis of security protocols. It clarifies what provable security is, explains some basic notions involved in the theory of provable security, and illustrates the basic idea of the random oracle model. It also reviews the development and advances of provably secure public-key encryption and digital signature schemes, in the random oracle model or the standard model, as well as the applications of provable security to the design and analysis of session-key distribution protocols and their advances.
Abstract: Under the new application mode, traditional hierarchical data centers face several limitations in size, bandwidth, scalability, and cost. In order to meet the needs of new applications, the data center network should fulfill requirements such as high scalability, low configuration overhead, robustness, and energy saving at low cost. First, the shortcomings of the traditional data center network architecture are summarized, and new requirements are pointed out. Second, the existing proposals are divided into two categories, i.e., server-centric and network-centric. Then, several representative architectures of these two categories are overviewed and compared in detail. Finally, the future directions of data center networks are discussed.
Abstract: Batch computing and stream computing are two important forms of big data computing. Research and discussion on batch computing in big data environments are comparatively sufficient. However, how to efficiently handle stream computing to meet requirements such as low latency, high throughput, and continuously reliable running, and how to build efficient stream big data computing systems, are great challenges in big data computing research. This paper presents a study of the data computing architecture and the key issues in stream computing in big data environments. First, it gives a brief summary of three application scenarios of stream computing: business intelligence, marketing, and public service. It also shows the distinctive features of stream computing in big data environments, such as real-time operation, volatility, burstiness, irregularity, and infinity. A well-designed stream computing system is optimized in system structure, data transmission, application interfaces, high availability, and so on. Subsequently, the paper offers detailed analyses and comparisons of five typical open-source stream computing systems in big data environments. Finally, it addresses some new challenges of stream big data systems, such as scalability, fault tolerance, consistency, load balancing, and throughput.
Abstract: The control and data planes are decoupled in software-defined networking (SDN), which provides a new solution for research on new network applications and future Internet technologies. The development status of OpenFlow-based SDN technologies is surveyed in this paper. The research background of the decoupled architecture of network control and data transmission in OpenFlow networks is summarized first, and the key components and research progress, including the OpenFlow switch, controller, and SDN technologies, are introduced. Moreover, current problems and solutions of OpenFlow-based SDN technologies are analyzed in four aspects. Combined with the development status in recent years, the applications used in campus networks, data centers, network management, and network security are summarized. Finally, future research trends are discussed.
Abstract: The rapid development of the Internet leads to an increase in system complexity and uncertainty. Traditional network management cannot meet the requirements and must evolve into fusion-based Cyberspace Situational Awareness (CSA). Based on an analysis of functional shortcomings and development requirements, this paper introduces CSA as well as its origin, concept, objectives, and characteristics. First, a CSA research framework is proposed and the research history is investigated, based on which the main aspects and the existing issues of the research are analyzed. Meanwhile, assessment methods are divided into three categories: mathematical models, knowledge reasoning, and pattern recognition. Then, this paper discusses CSA from three aspects: models, knowledge representation, and assessment methods, and goes into detail about the main ideas, assessment processes, merits, and shortcomings of novel methods. Many typical methods are compared. The current application research of CSA in the fields of security, transmission, survivability, system evaluation, and so on is presented. Finally, this paper points out the development directions of CSA and draws conclusions regarding the issue system, technical system, and application system.
Abstract: Combinatorial testing can use a small number of test cases to test systems while preserving fault detection ability. However, the test case generation problem for combinatorial testing is NP-complete. The efficiency and complexity of this testing method have attracted many researchers from the areas of combinatorics and software engineering. This paper summarizes the research on this topic in recent years, which includes various combinatorial test criteria, the relations between the test generation problem and other NP-complete problems, the mathematical methods for constructing test cases, the computer search techniques for test generation, and fault localization techniques based on combinatorial testing.
Abstract: A semi-supervised clustering method based on the affinity propagation (AP) algorithm is proposed in this paper. AP takes as input measures of similarity between pairs of data points. AP is an efficient and fast clustering algorithm for large datasets compared with existing clustering algorithms such as K-center clustering. However, for datasets with complex cluster structures, it cannot produce good clustering results. The clustering performance of AP can be improved by using a priori known labeled data or pairwise constraints to adjust the similarity matrix. Experimental results show that the method indeed reaches its goal for complex datasets, and that it outperforms the comparative methods when there are a large number of pairwise constraints.
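The idea of adjusting the similarity matrix with pairwise constraints can be sketched as follows. This is only an illustration under assumptions, not the paper's exact rule: similarity is taken as the negative squared Euclidean distance (a common choice for AP), must-link pairs are raised to the maximum similarity 0, and cannot-link pairs are pushed to a large negative penalty. The names `similarity_matrix` and `apply_constraints` and the default penalty are hypothetical.

```python
def neg_sq_dist(p, q):
    """Negative squared Euclidean distance, a common AP similarity."""
    return -sum((a - b) ** 2 for a, b in zip(p, q))

def similarity_matrix(points):
    """Dense pairwise similarity matrix for a small list of points."""
    n = len(points)
    return [[neg_sq_dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

def apply_constraints(sim, must_link=(), cannot_link=(), penalty=None):
    """Return a copy of `sim` adjusted by pairwise constraints.

    Assumed rule: must-link pairs get similarity 0 (the largest possible
    value here); cannot-link pairs get `penalty`, which defaults to a
    value far below every observed similarity.
    """
    adjusted = [row[:] for row in sim]
    if penalty is None:
        penalty = 10 * min(min(row) for row in sim) - 1.0
    for i, j in must_link:
        adjusted[i][j] = adjusted[j][i] = 0.0
    for i, j in cannot_link:
        adjusted[i][j] = adjusted[j][i] = penalty
    return adjusted

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
sim = similarity_matrix(points)
adjusted = apply_constraints(sim, must_link=[(0, 2)], cannot_link=[(0, 1)])
```

The adjusted matrix would then be passed to an ordinary AP implementation in place of the raw similarities; the message-passing step itself is unchanged in this sketch.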
Abstract: Inspired by the idea of data fields, a community discovery algorithm based on topological potential is proposed. The basic idea is that a topological potential function is introduced to analytically model the virtual interaction among all nodes in a network and, by regarding each community as a local high-potential area, the community structure in the network can be uncovered by detecting all local high-potential areas bounded by low-potential nodes. The experiments on some real-world networks show that the algorithm requires no input parameters and can discover the intrinsic or even overlapping community structure in networks. The time complexity of the algorithm is O(m + n^(3/γ)) ~ O(n^2), where n is the number of nodes to be explored, m is the number of edges, and 2<γ<3 is a constant.
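A rough sketch of the potential-based idea follows, under several assumptions that the abstract does not confirm: every node has unit mass, the potential of a node is the sum of exp(-(d/sigma)^2) over the hop distances d to all reachable nodes, and each node is assigned to the local maximum reached by repeatedly moving to its highest-potential neighbor. The abstract states that the algorithm needs no input parameters, whereas this sketch simply fixes `sigma` by hand. All function names are hypothetical.

```python
import math
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search: hop distance from `source` to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def topological_potential(adj, sigma=1.0):
    """Assumed Gaussian-form potential with unit node masses."""
    return {u: sum(math.exp(-(d / sigma) ** 2)
                   for d in hop_distances(adj, u).values())
            for u in adj}

def communities(adj, sigma=1.0):
    """Group nodes by the local potential maximum they climb to."""
    pot = topological_potential(adj, sigma)

    def climb(u):
        while True:
            best = max(adj[u], key=lambda v: pot[v], default=u)
            if pot[best] <= pot[u]:
                return u  # no neighbor is strictly higher: local maximum
            u = best

    groups = {}
    for u in adj:
        groups.setdefault(climb(u), set()).add(u)
    return groups

# Two triangles (0-1-2 and 3-4-5) joined through a bridge node 6.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 6],
    3: [4, 5, 6], 4: [3, 5], 5: [3, 4],
    6: [2, 3],
}
groups = communities(adj)
```

In this toy graph the local maxima are nodes 2 and 3, so each triangle forms one group. The bridge node 6 sits in the low-potential region between them; this sketch attaches it to a single side, while the overlapping-community handling mentioned in the abstract is not modeled.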
Abstract: The popularity of the Internet and the boom of the World Wide Web foster innovative changes in software technology that give birth to a new form of software, networked software, which delivers diversified and personalized on-demand services to the public. With the ever-increasing expansion of applications and users, the scale and complexity of networked software are growing beyond the information processing capability of human beings, which presents software engineers with a series of challenges. In order to reach a scientific understanding of this kind of ultra-large-scale artificial complex system, a survey of the infrastructure, application services, and social interactions of networked software is conducted from a three-dimensional perspective of cyberization, servicesation, and socialization. Interestingly, most of them have been found to share the same global characteristics of complex networks, such as “Small World” and “Scale Free”. Next, the impact of the empirical study on software engineering research and practice and its implications for further investigations are systematically set forth. The convergence of software engineering and other disciplines will put forth new ideas that will breed a new way of thinking and contribute new methodologies for the study of networked software. This convergence is also expected to achieve innovations in the theories, methods, and key technologies of software engineering to promote the rapid development of the software service industry in China.
Abstract: Image segmentation is the process of dividing an image into a number of regions with similar properties, and it is the preprocessing step for many image processing tasks. In recent years, domestic and foreign scholars have mainly focused on content-based image segmentation algorithms. Based on extensive research on the existing literature and the latest achievements, this paper categorizes image segmentation algorithms into three types: graph theory based methods, pixel clustering based methods, and semantic segmentation methods. The basic ideas, advantages, and disadvantages of the typical algorithms belonging to each category, especially the most recent image semantic segmentation algorithms based on deep neural networks, are analyzed, compared, and summarized. Furthermore, the paper introduces the datasets commonly used as benchmarks in image segmentation and the evaluation criteria for algorithms, and also compares several image segmentation algorithms experimentally. Finally, some potential future research work is discussed.
Abstract: The Internet has penetrated into all aspects of human society and has greatly promoted social progress. At the same time, various forms of cybercrimes and network theft occur frequently, bringing great harm to our society and national security. Cyber security has become a major concern to the public and the government. As a large number of Internet functionalities and applications are implemented by software, software plays a crucial role in cyber security research and practice. In fact, almost all cyberattacks were carried out by exploiting vulnerabilities in system software or application software. It is increasingly urgent to investigate the problems of software security in the new age. This paper reviews the state of the art of malware, software vulnerabilities and software security mechanism, and analyzes the new challenges and trends that the software ecosystem is currently facing.
Abstract: Research on software quality models and software quality evaluation models has always been a hot topic in the area of software quality assurance and assessment. A great amount of domestic and foreign research has been done on building software quality models and quality assessment models, and so far certain accomplishments have been achieved in these areas. In recent years, platform building and systematization have become the trends in developing basic software based on operating systems. Therefore, the quality evaluation of the foundational software platform has become an essential issue to be solved. This article analyzes and summarizes the current development of research on software quality models and software quality assessment models, focusing on summarizing and depicting the development process of quality evaluation of foundational software platforms. It also briefly discusses the future development of research on quality assessment of foundational software platforms, trying to establish a good foundation for it.
Abstract: A honeypot is a proactive defense technology introduced by defenders to change the asymmetric situation of the network attack-defense game. By deploying honeypots, i.e., security resources without any production purpose, defenders can deceive attackers into illegally using them, and then capture and analyze the attack behaviors to understand the attack tools and methods and to learn the attackers' intentions and motivations. Honeypot technology has received sustained attention from the security community, has made considerable progress and been widely applied, and has become one of the main technical means of Internet security threat monitoring and analysis. In this paper, the origin and evolution of honeypot technology are presented first. Next, the key mechanisms of honeypot technology are comprehensively analyzed, the development of the honeypot deployment structure is reviewed, and the latest applications of honeypot technology in Internet security threat monitoring, analysis, and prevention are summarized. Finally, the problems of honeypot technology, its development trends, and further research directions are discussed.
Abstract: Learning to rank (L2R) techniques try to solve sorting problems using machine learning methods, and have been well studied and widely used in various fields such as information retrieval, text mining, personalized recommendation, and biomedicine. The main task of L2R-based recommendation algorithms is to integrate L2R techniques into recommendation algorithms, and to study how to organize a large number of user and item features, build more suitable user models according to user preference requirements, and improve the performance and user satisfaction of recommendation algorithms. This paper surveys L2R-based recommendation algorithms of recent years, summarizes the problem definition, compares key technologies, and analyzes evaluation metrics and their applications. In addition, the paper discusses the future development trends of L2R-based recommendation algorithms.
Abstract: Data deduplication technologies can be divided into two categories: a) identical data detection
techniques, and b) similar data detection and encoding techniques. This paper presents a systematic survey on these
two categories of data deduplication technologies and analyzes their advantages and disadvantages. Besides, since
data deduplication technologies can affect the reliability and performance of storage systems, this paper also
surveys various kinds of technologies proposed to cope with these two aspects of problems. Based on the analysis of
the current state of research on data deduplication technologies, this paper draws the following conclusions:
a) How to mine data characteristic information in data deduplication has not been completely solved, and how to
use data characteristic information to effectively eliminate duplicate data also needs further study; b) From the
perspective of storage system design, how to introduce proper mechanisms to overcome the reliability limitations
of data deduplication techniques and reduce the additional system overheads they cause still needs further
study.