Bimonthly    Since 1986
ISSN 1000-9000(Print)
CN 11-2296/TP
Publication Details
Edited by: Editorial Board of Journal of Computer Science and Technology
P.O. Box 2704, Beijing 100190, P.R. China
Sponsored by: Institute of Computing Technology, CAS & China Computer Federation
Undertaken by: Institute of Computing Technology, CAS
Distributed by:
China: All Local Post Offices
Other Countries: Springer
  • Table of Contents
      30 May 2022, Volume 37 Issue 3
    Special Section on Self-Learning with Deep Neural Networks
    Min-Ling Zhang (张敏灵), Xiu-Shen Wei (魏秀参), and Gao Huang (黄高)
    Journal of Computer Science and Technology, 2022, 37 (3): 505-506.  DOI: 10.1007/s11390-022-0002-y
    Connecting the Dots in Self-Supervised Learning: A Brief Survey for Beginners
    Peng-Fei Fang (方鹏飞), Xian Li (李贤), Yang Yan (燕阳), Shuai Zhang (章帅), Qi-Yue Kang (康启越), Xiao-Fei Li (李晓飞), and Zhen-Zhong Lan (蓝振忠)
    Journal of Computer Science and Technology, 2022, 37 (3): 507-526.  DOI: 10.1007/s11390-022-2158-x

    The artificial intelligence (AI) community has recently made tremendous progress in developing self-supervised learning (SSL) algorithms that can learn high-quality data representations from massive amounts of unlabeled data. These methods have brought great results even to fields outside of AI. Thanks to the joint efforts of researchers across various areas, new SSL methods come out daily. However, the sheer number of publications makes it difficult for beginners to see clearly how the subject is progressing. This survey bridges that gap by carefully selecting a small portion of papers that we believe are milestones or essential work. We see these works as the "dots" of SSL and connect them through how they evolved. Hopefully, by viewing the connections between these dots, readers will gain a high-level picture of the development of SSL across multiple disciplines, including natural language processing, computer vision, graph learning, audio processing, and protein learning.
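Many of the "dots" this survey connects share a contrastive objective; the InfoNCE loss used by methods such as SimCLR and MoCo is one of the most common. As a minimal, illustrative NumPy sketch (not code from the paper), two augmented views of the same sample form a positive pair and all other pairs act as negatives:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1[i] and z2[i] are two augmented "views" of the same sample;
    all other cross pairs serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
# Identical views -> diagonal dominates -> low loss; unrelated views -> higher loss.
aligned = info_nce_loss(z, z)
shuffled = info_nce_loss(z, rng.normal(size=(8, 16)))
```

Minimizing this loss pulls matched views together and pushes mismatched ones apart, which is the shared mechanism behind much of the contrastive branch of SSL.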

    Self-Supervised Task Augmentation for Few-Shot Intent Detection
    Peng-Fei Sun (孙鹏飞), Ya-Wen Ouyang (欧阳亚文), Ding-Jie Song (宋定杰), and Xin-Yu Dai (戴新宇)
    Journal of Computer Science and Technology, 2022, 37 (3): 527-538.  DOI: 10.1007/s11390-022-2029-5
    Few-shot intent detection is a practical and challenging task, because new intents emerge frequently and collecting large-scale data for them can be costly. Meta-learning, a promising technique for leveraging data from previous tasks to enable efficient learning of new tasks, has been a popular way to tackle this problem. However, existing meta-learning models have been shown to overfit when the meta-training tasks are insufficient. To overcome this challenge, we present STAM, a novel self-supervised task augmentation with meta-learning framework. First, we introduce task augmentation, which explores two different strategies and combines them to extend the meta-training tasks. Second, we devise two auxiliary losses that integrate self-supervised learning into meta-learning to learn more generalizable and transferable features. Experimental results show that STAM achieves consistent and considerable performance improvements over existing state-of-the-art methods on four datasets.
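The abstract does not spell out the two augmentation strategies, so the following is a purely hypothetical illustration of the general idea: an N-way K-shot episode sampler plus one cheap augmentation (permuting which class identity each support set carries) that yields extra meta-training tasks from the same data. All names and the corpus are invented for illustration.

```python
import random

def make_episode(data_by_intent, n_way=2, k_shot=2, seed=None):
    """Sample one N-way K-shot meta-training task from labelled utterances."""
    rng = random.Random(seed)
    intents = rng.sample(sorted(data_by_intent), n_way)
    return {i: rng.sample(data_by_intent[i], k_shot) for i in intents}

def augment_by_label_shuffle(episode, seed=None):
    """One simple (hypothetical) augmentation: permute which class identity
    each support set belongs to, yielding a new task over the same data."""
    rng = random.Random(seed)
    intents = list(episode)
    shuffled = intents[:]
    rng.shuffle(shuffled)
    return {new: episode[old] for old, new in zip(intents, shuffled)}

corpus = {
    "book_flight": ["book a flight", "fly me to NYC", "need a plane ticket"],
    "play_music": ["play some jazz", "put on a song", "start my playlist"],
    "set_alarm": ["wake me at 7", "set an alarm", "alarm for noon"],
}
task = make_episode(corpus, n_way=2, k_shot=2, seed=1)
aug_task = augment_by_label_shuffle(task, seed=2)
```

The augmented task reuses the sampled utterances under a relabeled class structure, so the meta-learner sees more distinct tasks without any extra data collection.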
    Self-Supervised Music Motion Synchronization Learning for Music-Driven Conducting Motion Generation
    Fan Liu (刘凡), De-Long Chen (陈德龙), Rui-Zhi Zhou (周睿志), Sai Yang (杨赛), and Feng Xu (许峰)
    Journal of Computer Science and Technology, 2022, 37 (3): 539-558.  DOI: 10.1007/s11390-022-2030-z
    The correlation between music and human motion has attracted widespread research attention. Although recent studies have successfully generated motion for singers, dancers, and musicians, few have explored motion generation for orchestral conductors. The generation of music-driven conducting motion should consider not only the basic music beats, but also mid-level music structures, high-level music semantic expressions, and hints for the different parts of an orchestra (strings, woodwind, etc.). However, most existing conducting motion generation methods rely heavily on human-designed rules, which significantly limits the quality of the generated motion. Therefore, we propose a novel Music Motion Synchronized Generative Adversarial Network (M2S-GAN), which generates motion according to automatically learned music representations. More specifically, M2S-GAN is a cross-modal generative network comprising four components: 1) a music encoder that encodes the music signal; 2) a generator that generates conducting motion from the music codes; 3) a motion encoder that encodes the motion; and 4) a discriminator that differentiates real from generated motions. These four components respectively imitate four key aspects of human conductors: understanding music, interpreting music, precision, and elegance. The music and motion encoders are first jointly trained with a self-supervised contrastive loss, which helps facilitate music-motion synchronization during the subsequent adversarial learning process. To verify the effectiveness of our method, we construct a large-scale dataset, named ConductorMotion100, which consists of an unprecedented 100 hours of conducting motion data. Extensive experiments on ConductorMotion100 demonstrate the effectiveness of M2S-GAN. Our proposed approach outperforms various comparison methods both quantitatively and qualitatively. Through visualization, we show that our approach can generate plausible, diverse, and music-synchronized conducting motion.
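The joint contrastive pre-training pushes time-aligned music and motion embeddings toward each other; a minimal NumPy sketch of the synchronization signal being optimized (cosine similarity over matched versus mismatched embedding pairs; illustrative only, not the paper's loss):

```python
import numpy as np

def sync_score(music_emb, motion_emb):
    """Mean cosine similarity between time-aligned music and motion embeddings;
    a cross-modal contrastive loss raises this for matched pairs and lowers it
    for mismatched ones. Shapes: (T, D) for T time steps."""
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    p = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(m * p, axis=1)))

rng = np.random.default_rng(0)
music = rng.normal(size=(16, 32))
# A motion stream that tracks the music scores high; an unrelated one does not.
matched = sync_score(music, music + 0.1 * rng.normal(size=(16, 32)))
mismatched = sync_score(music, rng.normal(size=(16, 32)))
```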
    Special Section of CVM 2022
    Shi-Min Hu (胡事民), Paul L. Rosin, and Tian-Jia Shao (邵天甲)
    Journal of Computer Science and Technology, 2022, 37 (3): 559-560.  DOI: 10.1007/s11390-022-0003-x
    A Comprehensive Review of Redirected Walking Techniques: Taxonomy, Methods, and Future Directions
    Yi-Jun Li (李奕君), Frank Steinicke, and Miao Wang (汪淼)
    Journal of Computer Science and Technology, 2022, 37 (3): 561-583.  DOI: 10.1007/s11390-022-2266-7
    Virtual reality (VR) allows users to explore and experience a computer-simulated virtual environment, so that VR users can be immersed in a totally artificial virtual world and interact with arbitrary virtual objects. However, the limited physical tracking space usually restricts the exploration of large virtual spaces, and VR users have to use special locomotion techniques to move from one location to another. Among these techniques, redirected walking (RDW) is one of the most natural locomotion techniques for solving this problem, as it builds on near-natural walking experiences. The core idea of RDW is to imperceptibly guide users along virtual paths that may differ from the paths they physically walk in the real world. Similarly, some RDW algorithms imperceptibly change the structure and layout of the virtual environment so that it fits into the tracking space. In this survey, we first present a taxonomy of existing RDW work. Based on this taxonomy, we compare and analyze both the contributions and the shortcomings of existing methods in detail, finding that view manipulation methods offer satisfactory visual effects but can interrupt the experience when users reach physical boundaries, while virtual environment manipulation methods provide users with consistent movement but have limited application scenarios. Finally, we discuss possible future research directions, indicating that combining artificial intelligence with this area will be effective and intriguing.
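The imperceptible guidance at the core of view manipulation is typically implemented with gains applied to the user's tracked motion. A minimal sketch of a rotation gain (the threshold range cited in the comment comes from the broader RDW literature, not from this survey):

```python
import numpy as np

def apply_rotation_gain(physical_yaw_delta, gain):
    """Map a physical head rotation to a (slightly different) virtual rotation.
    Rotation gains within roughly 0.67-1.24 are commonly reported as
    imperceptible to users."""
    return gain * physical_yaw_delta

def redirect_walk(physical_yaws, gain):
    """Accumulate the virtual heading from a stream of physical yaw readings."""
    virtual = [physical_yaws[0]]
    for prev, cur in zip(physical_yaws, physical_yaws[1:]):
        virtual.append(virtual[-1] + apply_rotation_gain(cur - prev, gain))
    return np.array(virtual)

# A user physically turns 90 degrees; with gain 1.2 they perceive a 108-degree
# virtual turn, so large virtual rotations fit into a smaller physical arc.
yaws = np.linspace(0.0, 90.0, 10)
virtual = redirect_walk(yaws, gain=1.2)
```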
    Probability-Based Channel Pruning for Depthwise Separable Convolutional Networks
    Han-Li Zhao (赵汉理), Kai-Jie Shi (史开杰), Xiao-Gang Jin (金小刚), Ming-Liang Xu (徐明亮), Hui Huang (黄辉), Wang-Long Lu (卢望龙), and Ying Liu (刘影)
    Journal of Computer Science and Technology, 2022, 37 (3): 584-600.  DOI: 10.1007/s11390-022-2131-8
    Channel pruning can reduce memory consumption and running time with minimal performance loss, and is one of the most important techniques in network compression. However, existing channel pruning methods mainly focus on pruning standard convolutional networks, and they rely heavily on time-consuming fine-tuning to achieve performance improvements. To this end, we present a novel, efficient probability-based channel pruning method for depthwise separable convolutional networks. Our method leverages a simple yet effective probability-based channel pruning criterion that takes the scaling and shifting factors of batch normalization layers into consideration. A novel shifting factor fusion technique is further developed to improve the performance of the pruned networks without requiring extra time-consuming fine-tuning. We apply the proposed method to five representative deep learning networks, namely MobileNetV1, MobileNetV2, ShuffleNetV1, ShuffleNetV2, and GhostNet, to demonstrate its efficiency. Extensive experimental results and comparisons on the publicly available CIFAR10, CIFAR100, and ImageNet datasets validate the feasibility of the proposed method.
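The criterion builds a probability from batch normalization factors. As a simplified sketch of the general idea (using only the scaling factor gamma, whereas the paper's criterion also involves the shifting factor beta), channels with a small normalized importance are pruned away:

```python
import numpy as np

def prune_by_bn_probability(gammas, keep_ratio=0.5):
    """Rank channels by a probability derived from |gamma| of a BN layer and
    keep the most important ones. A sketch of the general idea only; the
    paper's criterion also incorporates the BN shifting factors (beta)."""
    importance = np.abs(gammas)
    probs = importance / importance.sum()       # normalize to a distribution
    n_keep = max(1, int(len(gammas) * keep_ratio))
    keep = np.argsort(probs)[::-1][:n_keep]     # highest-probability channels
    return np.sort(keep)

# Channels 1, 3, and 5 have near-zero BN scaling factors, so a BN-based
# criterion treats their outputs as nearly constant and prunes them first.
gammas = np.array([0.9, 0.01, 0.5, 0.02, 0.7, 0.03])
kept = prune_by_bn_probability(gammas, keep_ratio=0.5)
```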
    A Comparative Study of CNN- and Transformer-Based Visual Style Transfer
    Hua-Peng Wei (魏华鹏), Ying-Ying Deng (邓盈盈), Fan Tang (唐帆), Xing-Jia Pan (潘兴甲), and Wei-Ming Dong (董未名)
    Journal of Computer Science and Technology, 2022, 37 (3): 601-614.  DOI: 10.1007/s11390-022-2140-7
    Vision Transformers have shown impressive performance on image classification tasks. Observing that most existing visual style transfer (VST) algorithms are based on texture-biased convolutional neural networks (CNNs), this raises the question of whether the shape-biased Vision Transformer can perform style transfer as well as CNNs. In this work, we focus on comparing and analyzing the shape bias of CNN- and transformer-based models from the view of VST tasks. For comprehensive comparisons, we propose three kinds of transformer-based visual style transfer (Tr-VST) methods (Tr-NST for optimization-based VST, Tr-WCT for reconstruction-based VST, and Tr-AdaIN for perceptual-based VST). By engaging three mainstream VST methods in the transformer pipeline, we show that transformer-based models pre-trained on ImageNet are not well suited to style transfer. Due to the strong shape bias of the transformer-based models, these Tr-VST methods cannot render style patterns. We further analyze the shape bias by considering the influence of the learned parameters and the structure design. The results prove that, with proper style supervision, the transformer can learn texture-biased features similar to those of CNNs. With the reduced shape bias in the transformer encoder, Tr-VST methods can generate higher-quality results than state-of-the-art VST methods.
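Tr-AdaIN transplants adaptive instance normalization (AdaIN) into the transformer pipeline. The standard AdaIN operation itself is compact; here is a NumPy sketch of that operation alone (not of the paper's transformer pipeline): content features are re-scaled to match the channel-wise statistics of the style features.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: shift/scale content features so each
    channel matches the mean and std of the style features. Shapes: (C, H, W)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(3, 8, 8))
style = rng.normal(loc=5.0, scale=2.0, size=(3, 8, 8))
out = adain(content, style)
```

After the transform, each output channel carries the style's first- and second-order statistics while retaining the content's spatial structure.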
    Local Homography Estimation on User-Specified Textureless Regions
    Zheng Chen (陈铮), Xiao-Nan Fang (方晓楠), and Song-Hai Zhang (张松海)
    Journal of Computer Science and Technology, 2022, 37 (3): 615-625.  DOI: 10.1007/s11390-022-2185-7
    This paper presents VideoInNet, a novel deep neural network for designated point tracking (DPT) in a monocular RGB video. More concretely, the aim is to track four designated points correlated by a local homography on a textureless planar region in the scene. DPT can be applied to augmented reality and video editing, especially in the field of video advertising. Existing methods predict the locations of the four designated points without appropriately considering the point correlation. To solve this problem, VideoInNet predicts the motion of the four designated points correlated by a local homography within a heatmap prediction framework. Our network refines the heatmaps of designated points through two stages. In the first stage, we introduce a context-aware and location-aware structure to learn a local homography for the designated plane in a supervised way. In the second stage, we introduce an iterative heatmap refinement module to improve the tracking accuracy. We propose a dataset focusing on textureless planar regions, named ScanDPT, for training and evaluation. We show that the error rate of VideoInNet is about 29% lower than that of the state-of-the-art approach on the first 120 frames of the testing videos in ScanDPT.
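The four designated points are tied together by a local homography; the following NumPy sketch shows how a 3x3 homography moves four corner points jointly (illustrative only, with a hypothetical pure-translation H), which is the correlation constraint the network exploits.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2D points through a 3x3 homography (projective transform)."""
    pts = np.hstack([points, np.ones((len(points), 1))])   # to homogeneous
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                  # back to Cartesian

# A pure-translation homography shifts all four designated corners rigidly;
# a general H would also encode rotation, scale, and perspective.
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -2.0],
              [0.0, 0.0, 1.0]])
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
moved = apply_homography(H, corners)
```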
    CGTracker: Center Graph Network for One-Stage Multi-Pedestrian-Object Detection and Tracking
    Xin Feng (冯欣), Hao-Ming Wu (吴浩铭), Yi-Hao Yin (殷一皓), and Li-Bin Lan (兰利彬)
    Journal of Computer Science and Technology, 2022, 37 (3): 626-640.  DOI: 10.1007/s11390-022-2204-8
    Most current online multi-object tracking (MOT) methods include two steps: object detection and data association, where the data association step relies on both object feature extraction and affinity computation. This often incurs additional computation cost and degrades the efficiency of MOT methods. In this paper, we combine the object detection and data association modules in a unified framework, while getting rid of the extra feature extraction process, to achieve a better speed-accuracy trade-off for MOT. Considering that pedestrians are the most common object category in real-world scenes and have particular characteristics in object relationships and motion patterns, we present a novel yet efficient one-stage pedestrian detection and tracking method, named CGTracker. In particular, CGTracker detects a pedestrian as the center point of the object and directly extracts the object features from the feature representation of the object center point, which are used to predict the axis-aligned bounding box. Meanwhile, the detected pedestrians are organized into an object graph to facilitate multi-object association, where the semantic features, displacement information, and relative positions of the targets between two adjacent frames are used to perform reliable online tracking. CGTracker achieves multiple object tracking accuracy (MOTA) of 69.3% and 65.3% at 9 FPS on MOT17 and MOT20, respectively. Extensive experimental results under widely used evaluation metrics demonstrate that our method is one of the best techniques on the leaderboards for the MOT17 and MOT20 challenges at the time of submission of this work.
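Center-point detection of this kind typically reads object locations off a predicted heatmap. A simplified sketch of that decoding step (a plain 3x3 peak search over a toy heatmap, not CGTracker's actual decoder):

```python
import numpy as np

def extract_centers(heatmap, threshold=0.5):
    """Read object centers off a detection heatmap: a pixel is a center if it
    exceeds the threshold and is the maximum of its 3x3 neighborhood."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    centers = []
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]
            if heatmap[y, x] >= threshold and heatmap[y, x] == window.max():
                centers.append((y, x))
    return centers

heatmap = np.zeros((6, 6))
heatmap[1, 2] = 0.9   # pedestrian A
heatmap[4, 4] = 0.8   # pedestrian B
heatmap[4, 5] = 0.4   # below threshold: suppressed
centers = extract_centers(heatmap)
```

In a one-stage tracker, the features at these center locations then serve double duty: regressing the bounding box and feeding the association graph.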
    Learn Robust Pedestrian Representation Within Minimal Modality Discrepancy for Visible-Infrared Person Re-Identification
    Yu-Jie Liu (刘玉杰), Wen-Bin Shao (邵文斌), and Xiao-Rui Sun (孙晓瑞)
    Journal of Computer Science and Technology, 2022, 37 (3): 641-651.  DOI: 10.1007/s11390-022-2146-1
    Visible-infrared person re-identification has attracted extensive attention from the community due to its great application prospects in video surveillance. There are huge modality discrepancies between visible and infrared images, caused by their different imaging mechanisms. Existing studies alleviate modality discrepancies by aligning modality distributions or extracting modality-shared features from the original images. However, they ignore a key solution: converting visible images directly to gray images, which is efficient and effective in reducing modality discrepancies. In this paper, we transform the cross-modality person re-identification task from visible-infrared images to gray-infrared images, which we name the minimal modality discrepancy. In addition, we propose a pyramid feature integration network (PFINet) that mines the discriminative refined features of pedestrian images and fuses high-level, semantically strong features to build a robust pedestrian representation. Specifically, PFINet first performs feature extraction from concrete to abstract and top-down semantic transfer to obtain multi-scale feature maps. Second, the multi-scale feature maps are fed into the discriminative-region response module, which emphasizes identity-discriminative regions through a spatial attention mechanism. Finally, the pedestrian representation is obtained by feature integration. Extensive experiments demonstrate the effectiveness of PFINet, which achieves a rank-1 accuracy of 81.95% and mAP of 74.49% in the multi-all evaluation mode of the SYSU-MM01 dataset.
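The discriminative-region response module relies on spatial attention. As a generic NumPy sketch of such a mechanism (channel-pooled and sigmoid-squashed; the module's actual design is not specified in the abstract), a per-pixel map re-weights every channel:

```python
import numpy as np

def spatial_attention(features):
    """Generic spatial attention: pool over channels, squash to (0, 1), and
    re-weight every channel by the resulting per-pixel map. Shape: (C, H, W)."""
    pooled = features.mean(axis=0)            # (H, W) channel-average response
    attn = 1.0 / (1.0 + np.exp(-pooled))      # sigmoid -> attention map
    return features * attn[None, :, :], attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5, 5))
weighted, attn = spatial_attention(feats)
```

Pixels with strong aggregate responses (e.g., identity-discriminative body regions) receive weights near 1, while weak regions are suppressed toward 0.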
    Element-Arrangement Context Network for Facade Parsing
    Yan Tao (陶琰), Yi-Teng Zhang (张翼腾), and Xue-Jin Chen (陈雪锦)
    Journal of Computer Science and Technology, 2022, 37 (3): 652-665.  DOI: 10.1007/s11390-022-2189-3
    Facade parsing aims to decompose a building facade image into semantic regions of facade objects. Considering each architectural element on a facade as a parameterized rectangle, we formulate the facade parsing task as object detection that allows overlapping and nesting, which supports structural 3D modeling and editing for further applications. In contrast to general object detection, the spatial arrangement regularity and appearance similarity between facade elements of the same category provide valuable context for accurate element localization. In this paper, we propose to exploit this arrangement regularity and appearance similarity in a detection framework. Our element-arrangement context network (EACNet) consists of two unidirectional attention branches, one capturing column context and the other row context, to aggregate element-specific features from multiple instances on the facade. We conduct extensive experiments on four public datasets (ECP, CMP, Graz50, and eTRIMS). The proposed EACNet achieves the highest mIoU among state-of-the-art methods (82.1% on ECP, 77.35% on Graz50, and 82.3% on eTRIMS). Both the quantitative and qualitative evaluation results demonstrate the effectiveness of our dual unidirectional attention branches for parsing facade elements.
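The two unidirectional branches aggregate context along rows and columns, exploiting the grid-like layout of facades. A toy sketch of that aggregation pattern, using plain means in place of learned attention:

```python
import numpy as np

def row_column_context(features):
    """Aggregate features along rows and columns: facade elements of the same
    category (windows, balconies) tend to align in rows and columns, so the
    per-row and per-column averages carry arrangement context back to every
    position. Shape: (C, H, W); learned attention is replaced by plain means."""
    row_ctx = features.mean(axis=2, keepdims=True)   # (C, H, 1) per-row context
    col_ctx = features.mean(axis=1, keepdims=True)   # (C, 1, W) per-column context
    return features + row_ctx + col_ctx              # broadcast back onto the map

feats = np.ones((2, 3, 4))
out = row_column_context(feats)
```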
    ARSlice: Head-Mounted Display Augmented with Dynamic Tracking and Projection
    Yu-Ping Wang (王瑀屏), Sen-Wei Xie (解森炜), Li-Hui Wang (王立辉), Hongjin Xu (徐鸿金), Satoshi Tabata, and Masatoshi Ishikawa
    Journal of Computer Science and Technology, 2022, 37 (3): 666-679.  DOI: 10.1007/s11390-022-2173-y
    Computed tomography (CT) generates cross-sectional images of the body, and visualizing these images has long been a challenging problem. The emergence of augmented and virtual reality technology has provided promising solutions. However, existing solutions suffer from tethered displays or wireless transmission latency. In this paper, we present ARSlice, a proof-of-concept prototype that can visualize CT images in an untethered manner without wireless transmission latency. Our ARSlice prototype consists of two parts, the user end and the projector end. By employing dynamic tracking and projection, the projector end tracks the user-end equipment and projects CT images onto it in real time. The user-end equipment is responsible for displaying these CT images in 3D space. Its main feature is that it is a pure optical device with light weight, low cost, and no energy consumption. Our experiments demonstrate that the ARSlice prototype provides a subset of the six degrees of freedom for the user, together with a high frame rate. By interactively visualizing CT images in 3D space, ARSlice can help untrained users better understand that CT images are slices of a body.
    Regular Paper
    NfvInsight: A Framework for Automatically Deploying and Benchmarking VNF Chains
    Tian-Ni Xu (徐天妮), Hai-Feng Sun (孙海锋), Di Zhang (张笛), Xiao-Ming Zhou (周小明), Xiu-Feng Sui (隋秀峰), Sa Wang (王卅), Qun Huang (黄群), and Yun-Gang Bao (包云岗)
    Journal of Computer Science and Technology, 2022, 37 (3): 680-698.  DOI: 10.1007/s11390-020-0434-1
    With the advent of virtualization techniques and software-defined networking (SDN), network function virtualization (NFV) shifts network functions (NFs) from hardware implementations to software appliances, between which exists a performance gap. How to narrow this gap is an essential issue of current NFV research. However, the cumbersomeness of deployment, the water pipe effect of virtual network function (VNF) chains, and the complexity of the system software stack together make it tough to figure out the cause of low performance in NFV systems. To pinpoint NFV system performance, we propose NfvInsight, a framework for automatically deploying and benchmarking VNF chains. Our framework tackles the challenges in NFV performance analysis with three components: chain graph generation, automatic deployment, and fine-granularity measurement. To the best of our knowledge, we make the first attempt to collect rules into a knowledge base for generating reasonable chain graphs. NfvInsight deploys the generated chain graphs automatically, which frees network operators from executing at least 391 lines of bash commands for a single test. To diagnose performance bottlenecks, NfvInsight collects metrics from multiple layers of the software stack. In particular, we collect the network stack latency distribution ingeniously, introducing less than 2.2% overhead. We showcase the convenience and usability of NfvInsight in finding bottlenecks for both VNF chains and the underlying system. Leveraging our framework, we find several design flaws in the network stack that make it unsuitable for packet forwarding inside a single server under NFV circumstances. Our optimization of these flaws yields up to a 3x performance improvement.
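A chain-graph knowledge base of this kind can be thought of as ordering rules over VNFs. A toy Python sketch with entirely hypothetical rules (the paper's actual rule set and graph generator are not shown in the abstract): enumerate candidate chains and keep only those that obey the rules.

```python
from itertools import permutations

def valid_chain(chain, must_precede):
    """Check a candidate VNF chain against ordering rules from a (hypothetical)
    knowledge base, e.g. 'the firewall must come before the NAT'."""
    pos = {vnf: i for i, vnf in enumerate(chain)}
    return all(pos[a] < pos[b]
               for a, b in must_precede
               if a in pos and b in pos)

def generate_chains(vnfs, must_precede):
    """Enumerate every ordering of the given VNFs that obeys all rules."""
    return [list(p) for p in permutations(vnfs) if valid_chain(p, must_precede)]

RULES = [("firewall", "nat"), ("nat", "load_balancer")]   # hypothetical rules
chains = generate_chains(["firewall", "nat", "load_balancer"], RULES)
```

Each surviving ordering is a reasonable chain graph that the deployment component could then instantiate and benchmark automatically.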
    Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application
    Rong-Yu Cao (曹荣禹), Yi-Xuan Cao (曹逸轩), Gan-Bin Zhou (周干斌), and Ping Luo (罗平)
    Journal of Computer Science and Technology, 2022, 37 (3): 699-718.  DOI: 10.1007/s11390-021-1076-7
    In this paper, we study the problem of extracting a variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures. The discovery of the logical document hierarchy is a vital step in supporting many downstream applications (e.g., passage-based retrieval and high-quality information extraction). However, long documents, containing hundreds or even thousands of pages and a variable-depth hierarchy, challenge the existing methods. To address these challenges, we develop a framework, namely Hierarchy Extraction from Long Document (HELD), which "sequentially" inserts each physical object at the proper position in the current tree. Determining whether each possible position is proper or not can be formulated as a binary classification problem. To further improve effectiveness and efficiency, we study several design variants of HELD, including the traversal order of the insertion positions, explicit versus implicit heading extraction, tolerance to insertion errors in predecessor steps, and so on. As for evaluation, we find that previous studies ignore the error in which the depth of a node is correct while its path to the root is wrong. Since such mistakes may seriously worsen the downstream applications, a new measure is developed for a more careful evaluation. Empirical experiments based on thousands of long documents from the Chinese financial market, the English financial market, and English scientific publications show that the HELD model with the "root-to-leaf" traversal order and explicit heading extraction is the best choice for achieving the trade-off between effectiveness and efficiency, with accuracies of 0.9726, 0.7291, and 0.9578 on the Chinese financial, English financial, and arXiv datasets, respectively. Finally, we show that the logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task. In summary, we conduct a systematic study of this task in terms of methods, evaluation, and applications.
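The sequential-insertion idea can be sketched with a simple stand-in rule in place of HELD's learned binary classifier: attach each incoming object under the nearest shallower predecessor on the current root-to-leaf path (here, depths are given rather than predicted).

```python
def insert_objects(objects):
    """Sequentially insert (text, depth) document objects into a hierarchy.
    A hand-written depth rule stands in for HELD's learned binary classifier
    over candidate insertion positions."""
    root = {"text": "ROOT", "depth": 0, "children": []}
    path = [root]                       # current root-to-leaf path
    for text, depth in objects:
        while path[-1]["depth"] >= depth:
            path.pop()                  # climb until a shallower ancestor
        node = {"text": text, "depth": depth, "children": []}
        path[-1]["children"].append(node)
        path.append(node)
    return root

toc = [("1 Intro", 1), ("1.1 Scope", 2), ("1.2 Terms", 2), ("2 Method", 1)]
tree = insert_objects(toc)
```

Note how this toy rule can place a node at the right depth but under the wrong parent, which is exactly the error class the paper's new evaluation measure is designed to catch.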
    6D Object Pose Estimation in Cluttered Scenes from RGB Images
    Xiao-Long Yang (杨小龙), Xiao-Hong Jia (贾晓红), Yuan Liang (梁缘), and Lu-Bin Fan (樊鲁宾)
    Journal of Computer Science and Technology, 2022, 37 (3): 719-730.  DOI: 10.1007/s11390-021-1311-2
    We propose a feature-fusion network for pose estimation directly from RGB images, without any depth information. First, we introduce a two-stream architecture consisting of segmentation and regression streams. The segmentation stream processes the spatial embedding features and obtains the corresponding image crop. These features are further coupled with the image crop in the fusion network. Second, we use an efficient perspective-n-point (E-PnP) algorithm in the regression stream to extract robust spatial features between 3D and 2D keypoints. Finally, we perform iterative refinement with an end-to-end mechanism to improve the estimation performance. We conduct experiments on two public datasets, YCB-Video and the challenging Occluded-LineMOD. The results show that our method outperforms state-of-the-art approaches in both speed and accuracy.
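PnP-style pose estimation inverts the pinhole projection model that links 3D keypoints to their 2D images. A NumPy sketch of that forward model and the reprojection error which iterative refinement drives down (camera intrinsics, pose, and keypoints here are hypothetical, and no actual E-PnP solver is implemented):

```python
import numpy as np

def project(points_3d, K, R, t):
    """Pinhole projection of 3D keypoints with intrinsics K and pose (R, t);
    this is the forward model that PnP-style pose estimation inverts."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                     # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective divide -> pixels

def reprojection_error(points_3d, points_2d, K, R, t):
    """Mean pixel error that an iterative refinement stage drives toward zero."""
    diff = project(points_3d, K, R, t) - points_2d
    return float(np.mean(np.linalg.norm(diff, axis=1)))

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])    # object 2 m in front of the camera
pts3d = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                  [0.0, 0.1, 0.0], [0.1, 0.1, 0.1]])
pts2d = project(pts3d, K, R, t)                # observations from the true pose
err = reprojection_error(pts3d, pts2d, K, R, t)
```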
    BADF: Bounding Volume Hierarchies Centric Adaptive Distance Field Computation for Deformable Objects on GPUs
    Xiao-Rui Chen (陈潇瑞), Min Tang (唐敏), Cheng Li (李澄), Dinesh Manocha, and Ruo-Feng Tong (童若锋)
    Journal of Computer Science and Technology, 2022, 37 (3): 731-740.  DOI: 10.1007/s11390-022-0331-x
    We present BADF (Bounding Volume Hierarchy Based Adaptive Distance Fields), a novel algorithm for accelerating the construction of ADFs (adaptive distance fields) of rigid and deformable models on graphics processing units. Our approach is based on constructing a bounding volume hierarchy (BVH), which we then use to generate an octree-based ADF. We exploit the coherence between successive frames and sort the grid points of the octree to accelerate the computation. Our GPU-based (graphics processing unit based) algorithm is about 20x-50x faster than current mainstream central processing unit based algorithms. BADF can construct the distance fields for deformable models with 60k triangles at interactive rates on an NVIDIA GeForce GTX 1060. Moreover, we observe a 3x speedup over prior GPU-based ADF algorithms.
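A distance field stores, at each grid point, the distance to the nearest surface; BADF's contribution is accelerating exactly this query with a BVH and an adaptive octree. A brute-force NumPy sketch of the underlying quantity (the surface is approximated by sample points, and the O(N*M) search stands in for the BVH traversal):

```python
import numpy as np

def distance_field(grid_points, surface_samples):
    """Brute-force distance field: for each query point, the distance to the
    nearest surface sample. BADF accelerates this nearest-surface query with
    a BVH and evaluates it adaptively on an octree; here we use an O(N*M)
    scan over point samples for clarity."""
    diffs = grid_points[:, None, :] - surface_samples[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.min(axis=1)

# Four samples on the unit-square boundary, queried at two points:
surface = np.array([[0.0, 0.5], [1.0, 0.5], [0.5, 0.0], [0.5, 1.0]])
queries = np.array([[0.5, 0.5],   # center: 0.5 away from every sample
                    [0.0, 0.5]])  # on the surface: distance 0
d = distance_field(queries, surface)
```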
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn
  Copyright ©2015 JCST, All Rights Reserved