SCIE, Ei, INSPEC, JST, AJ, MR, CA, DBLP, etc.
Edited by: Editorial Board of Journal of Computer Science and Technology
Guo-Jie Li, Editor-in-Chief,
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Published by: SCIENCE PRESS, BEIJING, CHINA
Distributed by:
China: All Local Post Offices
Other Countries: Springer
Bug triaging, which routes bug reports to potential fixers, is an integral step in software development and maintenance. To make bug triaging more efficient, many researchers have proposed adopting machine learning and information retrieval techniques to identify suitable fixers for a given bug report. However, none of the existing proposals simultaneously takes into account the following three aspects that matter for the efficiency of bug triaging: 1) the textual content in the bug reports, 2) the metadata in the bug reports, and 3) the tossing sequence of the bug reports. To make use of all three aspects, we propose iTriage, which first adopts a sequence-to-sequence model to jointly learn the features of textual content and tossing sequence, and then uses a classification model to integrate the features from textual content, metadata, and tossing sequence. Evaluation results on three different open-source projects show that the proposed approach significantly improves the accuracy of bug triaging compared with the state-of-the-art approaches.
Docker has recently become the mainstream technology for providing reusable software artifacts. Developers can easily build and deploy their applications using Docker. Currently, a large number of reusable Docker images are publicly shared in online communities, and semantic tags can be created to help developers effectively reuse the images. However, the communities do not provide tagging services, and manual tagging is exhausting and time-consuming. This paper addresses the problem through a semi-supervised learning-based approach, named SemiTagRec. SemiTagRec contains four components: (1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository; (2) the extender, which introduces new tags as candidates based on tag correlation analysis; (3) the evaluator, which measures the candidate tags based on a logistic regression model; (4) the integrator, which calculates a final score by combining the results of the predictor and the evaluator, and then assigns the tags with high scores to the given Docker repositories. SemiTagRec adds the newly tagged repositories to the training data for the next round of training. In this way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary, to achieve a high accuracy of tag recommendation. Finally, the experimental results show that SemiTagRec outperforms the other approaches, and its accuracy in terms of Recall@5 and Recall@10 is 0.688 and 0.781, respectively.
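The Recall@5 and Recall@10 figures above can be made concrete with a small sketch. It assumes one common definition of Recall@k (the share of a repository's true tags that appear among the top-k recommended tags, averaged over repositories); the abstract does not state which variant the paper uses, and the function names and sample tags are hypothetical.

```python
def recall_at_k(ranked_tags, true_tags, k):
    """Share of the true tags that appear among the top-k ranked tags."""
    truth = set(true_tags)
    if not truth:
        raise ValueError("true_tags must not be empty")
    top_k = set(ranked_tags[:k])
    return len(top_k & truth) / len(truth)


def mean_recall_at_k(examples, k):
    """Average Recall@k over (ranked_tags, true_tags) pairs."""
    return sum(recall_at_k(r, t, k) for r, t in examples) / len(examples)


# Hypothetical recommendations for two repositories.
examples = [
    (["nginx", "web", "python", "db", "cache"], ["web", "db", "linux"]),
    (["redis", "cache", "db", "queue", "go"], ["redis", "cache"]),
]
score = mean_recall_at_k(examples, k=5)
```

Under this definition, a repository whose true tags never enter the top-k list contributes 0 to the average, so the metric rewards rankings that surface correct tags early.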
Missing checks for untrusted inputs used in security-sensitive operations are one of the major causes of various vulnerabilities. Efficiently detecting and repairing missing checks is essential for anticipating potential vulnerabilities and improving code reliability. We propose a systematic static analysis approach to detect missing checks for manipulable data used in security-sensitive operations of C/C++ programs and to recommend repair references. First, customized security-sensitive operations are located by lightweight static analysis. Then, the assailability of sensitive data used in security-sensitive operations is determined via taint analysis. Next, the existence and the risk degree of missing checks are assessed. Finally, the repair references for high-risk missing checks are recommended. We implemented the approach in an automated and cross-platform tool named Vanguard based on Clang/LLVM 3.6.0. Large-scale experimental evaluation on open-source projects has shown its effectiveness and efficiency. Furthermore, Vanguard has helped us uncover five known vulnerabilities and 12 new bugs.
Searching application programming interfaces (APIs) is very important for developers to reuse software projects. Existing natural language based API search mainly faces the following challenges. 1) More accurate results are required as software projects evolve to be more heterogeneous and complex. 2) The semantic relationships between APIs (e.g., inheritances between classes, and invocations between methods) need to be illustrated so that developers can better understand their usage scenarios. To deal with these issues, in this paper we propose GeAPI, a novel graph embedding based approach for API graph search and recommendation. First, we build a software project's API graph automatically from its source code and represent each API using graph embedding methods. Second, we search the API graph with a question in natural language, and return the corresponding subgraph, which is composed of relevant code elements and their associated relationships, as the best answer to the question. In experiments, we select three well-known open source projects, JodaTime, Apache Lucene and POI, as examples to perform API search tasks. The experimental results show that our approach GeAPI improves F1-score by 10% compared with the existing shortest path based API search approach, while reducing the average response time by a factor of about 60.
Mobile crowdsensing software systems can complete large-scale and complex sensing tasks with the help of the collective intelligence of large numbers of ordinary users. In this paper, we build a typical crowdsensing system, which can efficiently calibrate large numbers of smartphone barometer sensors. The barometer has now become a very common sensor on smartphones. It is very useful in many applications, such as positioning, environment sensing and activity detection. Unfortunately, most smartphone barometers today are not accurate enough, and it is rather challenging to efficiently calibrate a large number of smartphone barometers. Here, we try to achieve this goal by designing a crowdsensing-based smartphone calibration system, which is called CBSC. It makes use of low-power barometers on smartphones and needs few reference points and little human assistance. We propose a hidden Markov model for peer-to-peer calibration, and calibrate all the barometers by solving a minimum dominating set problem. The field studies show that CBSC achieves an accuracy within 0.1 hPa in 84% of cases. Compared with the traditional solutions, CBSC is more practical and its accuracy is satisfactory. The experience gained when building this system can also help the development of other crowdsensing-based systems.
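To illustrate the minimum dominating set step mentioned above, here is a minimal sketch of the standard greedy approximation. It is not claimed to be the solver used in CBSC. The graph is an assumption for illustration: nodes stand for phones, and an edge means two phones can calibrate each other peer to peer.

```python
def greedy_dominating_set(adj):
    """Greedy approximation of a minimum dominating set.

    adj maps each node to the set of its neighbours (undirected graph).
    Returns a list of nodes such that every node is chosen or adjacent
    to a chosen node.
    """
    undominated = set(adj)
    chosen = []
    while undominated:
        # Pick the node that dominates the most still-undominated nodes;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(adj), key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen


# Hypothetical contact graph between five phones.
contacts = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d"},
}
reference_phones = greedy_dominating_set(contacts)
```

In this reading, only the phones in the returned set would need a trusted reference; every other phone is adjacent to at least one of them. The greedy rule does not guarantee a minimum-size set, only a logarithmic-factor approximation.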
In current software defect prediction (SDP) research, most previous empirical studies only use datasets provided by the PROMISE repository, which may threaten the external validity of previous empirical results. Instead of SDP dataset sharing, SDP model sharing is a potential solution to alleviate this problem and can encourage researchers in the research community and practitioners in the industrial community to share more models. However, directly sharing models may result in privacy disclosure, for example through model inversion attacks. Since differential privacy (DP) mechanisms can prevent this attack when the privacy budget is carefully selected, we propose a novel method, DP-Share; to the best of our knowledge, we are the first to apply DP to privacy-preserving SDP model sharing. In particular, DP-Share first performs data preprocessing for the dataset, such as over-sampling for minority instances (i.e., defective modules) and conducting discretization for continuous features to optimize privacy budget allocation. Then, it uses a novel sampling strategy to create a set of training sets. Finally, it constructs decision trees based on these training sets, and these decision trees form a random forest (i.e., the model). The last phase of DP-Share uses Laplace and exponential mechanisms to satisfy the requirements of DP. In our empirical studies, we choose nine experimental subjects from real software projects. Then, we use AUC (area under ROC curve) as the performance measure and holdout as our model validation technique. After privacy and utility analysis, we find that DP-Share can achieve better performance than a baseline method DF-Enhance in most cases when using the same privacy budget. Moreover, we also provide guidelines to effectively use our proposed method. Our work attempts to fill the research gap in terms of differential privacy for SDP, which can encourage researchers and practitioners to share more SDP models and thereby effectively advance the state of the art of SDP.
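The Laplace mechanism named in the abstract above can be sketched generically. This is the textbook mechanism applied to a single count query, not DP-Share's tree construction. The function names, the sensitivity of 1 and the sample count are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-DP by adding Laplace(sensitivity / epsilon) noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical query: number of defective modules reaching a tree leaf.
rng = random.Random(0)
noisy = private_count(true_count=42, epsilon=0.5, rng=rng)
```

A smaller privacy budget epsilon gives a larger noise scale, which is the privacy and utility trade-off the abstract refers to when it says the budget must be carefully selected.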
Defect prediction assists the rational allocation of testing resources by detecting the potentially defective software modules before releasing products. When a project has no historical labeled defect data, cross project defect prediction (CPDP) is an alternative technique for this scenario. CPDP utilizes labeled defect data of an external project to construct a classification model to predict the module labels of the current project. Transfer learning based CPDP methods are the current mainstream. In general, such methods aim to minimize the distribution differences between the data of the two projects. However, previous methods mainly focus on the marginal distribution difference but ignore the conditional distribution difference, which will lead to unsatisfactory performance. In this work, we use a novel balanced distribution adaptation (BDA) based transfer learning method to narrow this gap. BDA simultaneously considers the two kinds of distribution differences and adaptively assigns different weights to them. To evaluate the effectiveness of BDA for CPDP performance, we conduct experiments on 18 projects from four datasets using six indicators (i.e., F-measure, g-means, Balance, AUC, EARecall, and EAF-measure). Compared with 12 baseline methods, BDA achieves average improvements of 23.8%, 12.5%, 11.5%, 4.7%, 34.2%, and 33.7% in terms of the six indicators respectively over four datasets.
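The weighting idea in the abstract above can be written compactly. The form below follows balanced distribution adaptation as it is commonly stated in the transfer learning literature. The symbols are notation assumed here, not taken from the abstract: $D$ is a distribution distance, $\mu$ is the balance factor, and subscripts $s$ and $t$ denote the source and target projects.

```latex
D(\mathcal{D}_s, \mathcal{D}_t) \approx (1 - \mu)\, D\big(P(x_s), P(x_t)\big)
  + \mu\, D\big(P(y_s \mid x_s), P(y_t \mid x_t)\big), \qquad \mu \in [0, 1]
```

With $\mu$ near 0 the marginal distribution difference dominates, and with $\mu$ near 1 the conditional distribution difference dominates.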
Software metrics are used to measure different attributes of software. To practically measure software attributes using these metrics, metric thresholds are needed. Many researchers have attempted to identify these thresholds based on personal experience. However, the resulting experience-based thresholds cannot be generalized due to the variability in personal experiences and the subjectivity of opinions. The goal of this paper is to propose an automated clustering framework based on the expectation maximization (EM) algorithm where clusters are generated using a simplified 3-metric set (LOC, LCOM, and CBO). Given these clusters, different threshold levels for software metrics are systematically determined such that each threshold reflects a specific level of software quality. The proposed framework comprises two major steps: the clustering step, where the software quality historical dataset is decomposed into a fixed set of clusters using the EM algorithm, and the threshold extraction step, where thresholds, specific to each software metric in the resulting clusters, are estimated using statistical measures such as the mean (μ) and the standard deviation (σ) of each software metric in each cluster. The paper's findings highlight the capability of EM-based clustering, using a minimum metric set, to group software quality datasets according to different quality levels.
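The threshold extraction step can be sketched as follows, assuming the cluster labels have already been produced by some clustering step such as EM. The function name, the sample LOC values and the use of the population standard deviation are assumptions. The abstract does not say how μ and σ are combined into a threshold, so the sketch stops at the per-cluster statistics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cluster_stats(values, labels):
    """Return {cluster_label: (mean, population std dev)} for one metric."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: (mean(vs), pstdev(vs)) for label, vs in groups.items()}


# Hypothetical LOC values for six classes and their cluster labels.
loc = [10, 20, 30, 100, 200, 300]
labels = [0, 0, 0, 1, 1, 1]
stats = cluster_stats(loc, labels)
```

A threshold level for a cluster could then be derived from its (μ, σ) pair. Which combination is appropriate is a design decision that the abstract leaves open.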
Time-division multiple access (TDMA) and code-division multiple access (CDMA) are two technologies used in digital cellular networks. The authentication protocols of TDMA networks have been proven to be vulnerable to side-channel analysis (SCA), giving rise to a series of powerful SCA-based attacks against unprotected subscriber identity module (SIM) cards. CDMA networks have two authentication protocols, cellular authentication and voice encryption (CAVE) based authentication protocol and authentication and key agreement (AKA) based authentication protocol, which are used in different phases of the networks. However, there has been no SCA attack for these two protocols so far. In this paper, in order to figure out if the authentication protocols of CDMA networks are sufficiently secure against SCA, we investigate the two existing protocols and their cryptographic algorithms. We find the side-channel weaknesses of the two protocols when they are implemented on embedded systems. Based on these weaknesses, we propose specific attack strategies to recover their authentication keys for the two protocols, respectively. We verify our strategies on an 8-bit microcontroller and a real-world SIM card, showing that the authentication keys can be fully recovered within a few minutes with a limited number of power measurements. The successful experiments demonstrate the correctness and the effectiveness of our proposed strategies and prove that the unprotected implementations of the authentication protocols of CDMA networks cannot resist SCA.
Processor specialization has become the development trend of the modern processor industry. It is quite possible that this will remain the mainstream in the coming decades of the semiconductor era. As the diversity of heterogeneous systems grows, organizing computation efficiently on systems with multiple kinds of heterogeneous processors is a challenging problem and will become the norm. In this paper, we analyze some state-of-the-art task scheduling algorithms of heterogeneous computing systems and propose a Degree of Node First (DONF) algorithm for task scheduling of fine-grained parallel programs on heterogeneous systems. The major innovations of DONF include: 1) simplifying task priority calculation for directed acyclic graph (DAG) based fine-grained parallel programs, which not only reduces the complexity of task selection but also enables the algorithm to solve the scheduling problem for dynamic DAGs; 2) building a novel communication model in the processor selection phase that makes the task scheduling much more efficient. They are achieved by exploring fine-grained parallelism via a dataflow program execution model, and validated through experimental results with a selected set of benchmarks. The results on synthesized and real-world application DAGs show very good performance. The proposed DONF algorithm significantly outperforms all the evaluated state-of-the-art heuristic algorithms in terms of scheduling length ratio (SLR) and efficiency.
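The scheduling length ratio (SLR) used for evaluation above can be illustrated with a short sketch. It assumes the definition that is common in the heterogeneous scheduling literature: the makespan divided by the length of the critical path when each task is charged its minimum computation cost over all processors and communication is ignored. The abstract does not spell out the definition, and the task names and costs are hypothetical.

```python
from functools import lru_cache


def scheduling_length_ratio(makespan, cost, succ):
    """SLR = makespan / critical-path length under minimum computation costs.

    cost maps task -> list of execution times, one per processor.
    succ maps task -> list of successor tasks in the DAG.
    """
    min_cost = {task: min(times) for task, times in cost.items()}

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Longest path (by minimum cost) starting at this task.
        tail = max((longest_from(s) for s in succ.get(task, ())), default=0)
        return min_cost[task] + tail

    critical_path = max(longest_from(task) for task in cost)
    return makespan / critical_path


# Hypothetical 4-task DAG on two processors.
cost = {"A": [2, 3], "B": [4, 1], "C": [5, 6], "D": [2, 2]}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
ratio = scheduling_length_ratio(makespan=18, cost=cost, succ=succ)
```

Under this definition the ratio cannot drop below 1, and lower values indicate schedules closer to the critical-path lower bound.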
This paper presents a rigidity-preserving morphing technique that blends a pair of 2D shapes in a controllable manner. The morphing is controllable in two aspects: 1) motion dynamics in the interpolation sequences can be effectively enhanced through an intuitive skeleton control and 2) not only the boundaries but also the interior features of the source and target shapes are precisely aligned during the morphing. We introduce a new compatible triangulation algorithm to decompose the source and target shapes into isomorphic triangles. Moreover, a robust and motion-controllable rigidity-preserving transformation scheme is proposed to blend the compatible triangulations, ultimately leading to a morphing sequence that is appearance-preserving and exhibits the desired motion dynamics. Our approach constitutes a powerful and easy-to-use morphing tool for two-dimensional animation. We demonstrate its versatility, effectiveness and visual accuracy through a variety of examples and comparisons to prior work.
In this paper, we present DEMC, a deep dual-encoder network to remove Monte Carlo noise efficiently while preserving details. Denoising Monte Carlo rendering is different from natural image denoising since inexpensive by-products (feature buffers) can be extracted in the rendering stage. Most of them are noise-free and can provide sufficient details for image reconstruction. However, these feature buffers also contain redundant information. Hence, the main challenge of this topic is how to extract useful information and reconstruct clean images. To address this problem, we propose a novel network structure, a dual-encoder network with a feature fusion sub-network, which first fuses the feature buffers, then encodes the fused feature buffers and a noisy image simultaneously, and finally reconstructs a clean image with a decoder network. Compared with the state-of-the-art methods, our model is more robust on a wide range of scenes, and generates satisfactory results significantly faster.
The reliability allowance of circuits tends to decrease with the increase of circuit integration and the application of new technology and materials, and gate-oriented hardening is an effective technique for improving circuit reliability under these conditions. Therefore, a parallel-structured genetic algorithm (GA), PGA, is proposed in this paper to locate reliability-critical gates so that targeted hardening can be performed. Firstly, we design a binary coding method for reliability-critical gates and build an ordered initial population consisting of dominant individuals to improve the quality of the initial population. Secondly, we construct an embedded parallel operation loop for directional crossover and directional mutation to compensate for the poor local search of the GA. Thirdly, in combination with a diversity protection strategy for the population, we design an elitism retention based selection method to boost the convergence speed and avoid being trapped in a local optimum. Finally, we present an ordered identification method oriented toward reliability-critical gates, using a scoring mechanism to retain the potential optimal solutions in each round and thereby improve the robustness of the proposed locating method. The simulation results on benchmark circuits show that the proposed method PGA is an efficient locating method for reliability-critical gates in terms of accuracy and convergence speed.
The designation of the cluster number K and the initial centroids is essential for the K-modes clustering algorithm. However, most of the improved methods based on K-modes specify the K value manually and generate the initial centroids randomly, which makes the clustering algorithm significantly dependent on human decisions and unstable in iteration time. To overcome this limitation, we propose a cohesive K-modes (CK-modes) algorithm to generate the cluster number K and the initial centroids automatically. Specifically, we construct a labeled property graph based on index-free adjacency to capture both the global and local cohesion of the nodes in a sample of the input dataset. The cohesive node calculated based on the property similarity is exploited to split the graph into a K-node tree that determines the K value, and then the initial centroids are selected from the split subtrees. Since the property graph construction and the cohesion calculation are only performed once, they account for a small amount of the execution time of the clustering operation with multiple iterations, but significantly accelerate the clustering convergence. Experimental validation on both real-world and synthetic datasets shows that the CK-modes algorithm outperforms the state-of-the-art algorithms.
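For readers unfamiliar with K-modes, the two building blocks that any initialization scheme feeds into can be sketched briefly. This shows standard K-modes (simple matching dissimilarity and per-attribute modes), not the graph-based CK-modes initialization. The function names and sample records are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Per-attribute most frequent value, used as the cluster centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign(records, centroids):
    """Index of the nearest centroid for each record."""
    return [
        min(range(len(centroids)), key=lambda i: matching_dissimilarity(r, centroids[i]))
        for r in records
    ]


records = [("red", "s"), ("red", "m"), ("blue", "m")]
centroids = [("red", "m"), ("blue", "m")]  # assumed given by some initialization
labels = assign(records, centroids)
mode = cluster_mode(records)
```

K-modes alternates between `assign` and recomputing each cluster's mode until assignments stop changing, which is why the choice of K and the initial centroids affects both the result and the number of iterations.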