Data science targets the data life cycle of real applications, studying phenomena at scales, complexities, and granularities never before possible. This data life cycle encompasses databases and data engineering, often leveraging statistical, machine learning, and artificial intelligence methods and, in many instances, using massive and heterogeneous collections of potentially noisy datasets. In this special section, we focus on the data-intensive components of data science pipelines and on solving problems in areas of interest to our community (e.g., data curation, optimization, performance, storage, and systems).
To promote recent work on scalable data science, we organized this special section in the Journal of Computer Science and Technology (JCST). We received XX papers from all over the world. First, the guest editors performed quick reviews and immediately rejected submissions of insufficient quality. Then, each remaining submission was reviewed by at least three invited international reviewers. All papers underwent two rounds of review, and the authors were asked to address all major and minor issues in their submissions during the review process. Eventually, we accepted seven high-quality submissions on the basis of clarity, novelty, significance, and relevance.
The first paper "GAM: A GPU-Accelerated Algorithm for MaxRS Queries in Road Networks" by Jian Chen et al. proposes a novel GPU-accelerated algorithm, GAM, that efficiently answers maximizing range sum (MaxRS) queries in road networks using a two-level framework. The framework first applies an effective multi-grained pruning technique to prune the cells derived from partitioning the road network, and then uses a GPU-friendly storage structure to compute the final result in the remaining cells.
The second paper "Experiments and Analyses of Anonymization Mechanisms for Trajectory Data Publishing" by Sun et al. systematically evaluates anonymized trajectory data, measuring individual privacy in terms of unicity and utility in terms of practical applications. The paper reveals how well current mechanisms actually preserve trajectory privacy against reidentification and how useful the anonymized trajectories remain.
The third paper "Efficient Partitioning Method for Optimizing the Compression on Array Data" by Han et al. utilizes header compression to address the problem of partitioning arrays so as to optimize compression performance. The paper designs a greedy strategy that helps find the partition point with the best compression performance.
The fourth paper "Discovering Cohesive Temporal Subgraphs with Temporal Density Aware Exploration" by Zhu et al. proposes a temporal subgraph model that discovers cohesive temporal subgraphs by capturing both their structural and temporal characteristics. The paper designs strategies to mine the temporal densest subgraphs efficiently by decomposing the temporal graph into a sequence of snapshots.
The fifth paper "Incremental User Identification Across Social Networks Based on User-Guider Similarity Index" by Kou et al. proposes an incremental method for identifying users across social networks based on a User-Guider Similarity Index. The paper first constructs a novel User-Guider Similarity Index to speed up the matching between users, and then applies a two-phase user identification strategy to identify users efficiently.
The sixth paper "An Exercise Collection Auto-Assembling Framework with Knowledge Tracing and Reinforcement Learning" by Zhao et al. introduces an exercise collection auto-assembling framework, in which the assembled exercise collection meets the teacher’s requirements on the difficulty index and the discrimination index. The paper designs a two-stage approach in which a knowledge tracing model predicts the students’ answers and a deep reinforcement learning model selects exercises that satisfy the query parameters.