Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 601-614.doi: 10.1007/s11390-022-2140-7

Special Issue: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia

• Special Section of CVM 2022 •

A Comparative Study of CNN- and Transformer-Based Visual Style Transfer

Hua-Peng Wei1 (魏华鹏), Ying-Ying Deng2 (邓盈盈), Fan Tang1,* (唐帆), Member, CCF, Xing-Jia Pan3 (潘兴甲), and Wei-Ming Dong2 (董未名), Member, CCF, ACM, IEEE        

    1School of Artificial Intelligence, Jilin University, Changchun 130012, China
    2National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    3Youtu Laboratory, Tencent Incorporated, Shanghai 200233, China
  • Received: 2022-01-05 Revised: 2022-04-12 Accepted: 2022-04-24 Online: 2022-05-30 Published: 2022-05-30
  • Contact: Fan Tang E-mail: tangfan@jlu.edu.cn
  • About author:Fan Tang is an assistant professor in the School of Artificial Intelligence, Jilin University, Changchun. He received his Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, in 2019. His research interests include computer graphics, computer vision, and machine learning.
  • Supported by:
    The work was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0106200, the National Natural Science Foundation of China under Grant Nos. 62102162, 61832016, U20B2070, and 6210070958, the CASIA-Tencent Youtu Joint Research Project, and the Open Projects Program of the National Laboratory of Pattern Recognition.

Vision Transformers have shown impressive performance on image classification tasks. Observing that most existing visual style transfer (VST) algorithms are based on texture-biased convolutional neural networks (CNNs), we raise the question of whether shape-biased vision transformers can perform style transfer as well as CNNs. In this work, we focus on comparing and analyzing the shape bias of CNN- and transformer-based models from the perspective of VST tasks. For a comprehensive comparison, we propose three kinds of transformer-based visual style transfer (Tr-VST) methods: Tr-NST for optimization-based VST, Tr-WCT for reconstruction-based VST, and Tr-AdaIN for perceptual-based VST. By adapting three mainstream VST methods to the transformer pipeline, we show that transformer-based models pre-trained on ImageNet are not well suited to style transfer: owing to their strong shape bias, these Tr-VST methods cannot render style patterns. We further analyze the shape bias by considering the influence of the learned parameters and the structure design. The results show that, with proper style supervision, a transformer can learn texture-biased features similar to those of a CNN. With reduced shape bias in the transformer encoder, Tr-VST methods generate higher-quality results than state-of-the-art VST methods.
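The perceptual-based branch (Tr-AdaIN) builds on adaptive instance normalization [3], which re-normalizes each channel of the content features to match the per-channel statistics of the style features. A minimal NumPy sketch of that operation, not the authors' implementation; the toy (channels, height, width) feature maps stand in for CNN or transformer encoder outputs:

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: shift and scale each content
    channel so its mean and standard deviation match the style's.
    Feature maps are shaped (channels, height, width)."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / (c_std + eps) + s_mean

# Toy features with deliberately different statistics.
rng = np.random.default_rng(0)
content = rng.normal(0.0, 1.0, size=(4, 8, 8))
style = rng.normal(3.0, 2.0, size=(4, 8, 8))
stylized = adain(content, style)
```

After the transform, each channel of `stylized` carries the style's first- and second-order statistics while retaining the spatial layout of the content, which is exactly the property the paper probes when asking whether transformer features can encode style this way.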

Key words: transformer; convolutional neural network; visual style transfer; comparative study
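The reconstruction-based branch (Tr-WCT) builds on the whitening-coloring transform [4], which decorrelates the content feature channels and then re-correlates them with the style covariance. A minimal NumPy sketch under the assumption of features flattened to (channels, positions); names and shapes are illustrative, not the paper's code:

```python
import numpy as np

def whiten_color(content_feat, style_feat, eps=1e-5):
    """Whitening-coloring transform: map the content covariance to the
    identity, then impose the style covariance and mean.
    Features are shaped (channels, positions)."""
    def center(f):
        mu = f.mean(axis=1, keepdims=True)
        return f - mu, mu

    fc, _ = center(content_feat)
    fs, mu_s = center(style_feat)

    # Whitening: eigendecompose the content covariance, scale by D^(-1/2).
    cov_c = fc @ fc.T / (fc.shape[1] - 1)
    w_c, v_c = np.linalg.eigh(cov_c)
    whitened = v_c @ np.diag((w_c + eps) ** -0.5) @ v_c.T @ fc

    # Coloring: scale by the style eigenvalues D^(1/2), restore the mean.
    cov_s = fs @ fs.T / (fs.shape[1] - 1)
    w_s, v_s = np.linalg.eigh(cov_s)
    return v_s @ np.diag((w_s + eps) ** 0.5) @ v_s.T @ whitened + mu_s

# Toy flattened features: 4 channels, 500 spatial positions.
rng = np.random.default_rng(1)
content = rng.normal(size=(4, 500))
mix = rng.normal(size=(4, 4))
style = mix @ rng.normal(size=(4, 500)) + 1.0
out = whiten_color(content, style)
```

Unlike AdaIN, WCT matches the full channel covariance rather than per-channel statistics, which is why the paper treats it as a separate, reconstruction-based family.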

[1] Gatys L A, Ecker A S, Bethge M. Image style transfer using convolutional neural networks. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.2414-2423. DOI: 10.1109/CVPR.2016.265.

[2] Kolkin N, Salavon J, Shakhnarovich G. Style transfer by relaxed optimal transport and self-similarity. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.10051-10060. DOI: 10.1109/CVPR.2019.01029.

[3] Huang X, Belongie S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. the 2017 IEEE International Conference on Computer Vision, Oct. 2017, pp.1501-1510. DOI: 10.1109/ICCV.2017.167.

[4] Li Y, Fang C, Yang J, Wang Z, Lu X, Yang M H. Universal style transfer via feature transforms. In Proc. the 31st International Conference on Neural Information Processing Systems, December 2017, pp.385-395.

[5] Deng Y, Tang F, Dong W, Sun W, Huang F, Xu C. Arbitrary style transfer via multi-adaptation network. In Proc. the 28th ACM International Conference on Multimedia, Oct. 2020, pp.2719-2727. DOI: 10.1145/3394171.3414015.

[6] Deng Y, Tang F, Dong W, Huang H, Ma C, Xu C. Arbitrary video style transfer via multi-channel correlation. In Proc. the 35th AAAI Conference on Artificial Intelligence, February 2021, pp.1210-1217.

[7] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L U, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, December 2017, pp.6000-6010.

[8] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. the 9th International Conference on Learning Representations, May 2021.

[9] Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-end object detection with transformers. In Proc. the 16th European Conference on Computer Vision, August 2020, pp.213-229. DOI: 10.1007/978-3-030-58452-8.

[10] Yang F, Yang H, Fu J, Lu H, Guo B. Learning texture transformer network for image super-resolution. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020, pp.5790-5799. DOI: 10.1109/CVPR42600.2020.00583.

[11] Lee K, Chang H, Jiang L, Zhang H, Tu Z, Liu C. ViTGAN: Training GANs with vision transformers. arXiv:2107.04589, 2021. https://arxiv.org/abs/2107.04589, Jan. 2022.

[12] Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R, Hu S M. PCT: Point cloud transformer. Computational Visual Media, June 2021, 7(2): 187-199. DOI: 10.1007/s41095-021-0229-5.

[13] Tuli S, Dasgupta I, Grant E, Griffiths T L. Are convolutional neural networks or transformers more like human vision? arXiv:2105.07197, 2021. https://arxiv.org/abs/2105.07197, Jan. 2022.

[14] Naseer M, Ranasinghe K, Khan S, Hayat M, Khan F, Yang M H. Intriguing properties of vision transformers. In Proc. the 35th Conference on Neural Information Processing Systems, December 2021.

[15] Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning deep features for discriminative localization. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.2921-2929. DOI: 10.1109/CVPR.2016.319.

[16] Jing Y, Yang Y, Feng Z, Ye J, Yu Y, Song M. Neural style transfer: A review. IEEE Trans. Visualization and Computer Graphics, 2020, 26(11): 3365-3385. DOI: 10.1109/TVCG.2019.2921336.

[17] Johnson J, Alahi A, Li F F. Perceptual losses for real-time style transfer and super-resolution. In Proc. the 14th European Conference on Computer Vision, Oct. 2016, pp.694-711. DOI: 10.1007/978-3-319-46475-6.

[18] Ulyanov D, Vedaldi A, Lempitsky V. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.4105-4113. DOI: 10.1109/CVPR.2017.437.

[19] An J, Huang S, Song Y, Dou D, Liu W, Luo J. ArtFlow: Unbiased image style transfer via reversible neural flows. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.862-871. DOI: 10.1109/CVPR46437.2021.00092.

[20] Park D Y, Lee K H. Arbitrary style transfer with style-attentional networks. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.5880-5888. DOI: 10.1109/CVPR.2019.00603.

[21] Li X, Liu S, Kautz J, Yang M H. Learning linear transformations for fast image and video style transfer. In Proc. the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2019, pp.3809-3817. DOI: 10.1109/CVPR.2019.00393.

[22] Wang Z, Zhao L, Chen H, Qiu L, Mo Q, Lin S, Xing W, Lu D. Diversified arbitrary style transfer via deep feature perturbation. In Proc. the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020, pp.7786-7795. DOI: 10.1109/CVPR42600.2020.00781.

[23] Wu X, Hu Z, Sheng L, Xu D. StyleFormer: Real-time arbitrary style transfer via parametric style composition. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, October 2021, pp.14618-14627. DOI: 10.1109/ICCV48922.2021.01435.

[24] Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I. Generative pretraining from pixels. In Proc. the 37th International Conference on Machine Learning, July 2020, pp.1691-1703.

[25] Xu Y, Wei H, Lin M, Deng Y, Sheng K, Zhang M, Tang F, Dong W, Huang F, Xu C. Transformers in computational visual media: A survey. Computational Visual Media, 2022, 8(1): 33-62. DOI: 10.1007/s41095-021-0247-3.

[26] Wang Y, Xu Z, Wang X, Shen C, Cheng B, Shen H, Xia H. End-to-end video instance segmentation with transformers. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.8741-8750. DOI: 10.1109/CVPR46437.2021.00863.

[27] Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W. Pre-trained image processing transformer. In Proc. the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2021, pp.12299-12310. DOI: 10.1109/CVPR46437.2021.01212.

[28] Kumar M, Weissenborn D, Kalchbrenner N. Colorization transformer. In Proc. the 9th International Conference on Learning Representations, May 2021.

[29] Liu S, Lin T, He D, Li F, Deng R, Li X, Ding E, Wang H. Paint transformer: Feed forward neural painting with stroke prediction. In Proc. the 2021 IEEE/CVF International Conference on Computer Vision, October 2021, pp.6598-6607. DOI: 10.1109/ICCV48922.2021.00653.

[30] Jiang Y, Chang S, Wang Z. TransGAN: Two pure transformers can make one strong GAN, and that can scale up. In Proc. the 35th Conference on Neural Information Processing Systems, Dec. 2021.

[31] Cordonnier J B, Loukas A, Jaggi M. On the relationship between self-attention and convolutional layers. In Proc. the 8th International Conference on Learning Representations, April 2020.

[32] Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, Zhang H, Lan Y, Wang L, Liu T. On layer normalization in the transformer architecture. In Proc. the 37th International Conference on Machine Learning, July 2020, pp.10524-10533.

[33] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In Proc. the 3rd International Conference on Learning Representations, May 2015.

[34] Dosovitskiy A, Brox T. Generating images with perceptual similarity metrics based on deep networks. In Proc. the 30th International Conference on Neural Information Processing Systems, December 2016, pp.658-666.

[35] Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, September 2014, pp.740-755. DOI: 10.1007/978-3-319-10602-1.

[36] Phillips F, Mackintosh B. Wiki Art Gallery, Inc.: A case for critical thinking. Issues in Accounting Education, 2011, 26(3): 593-608. DOI: 10.2308/iace-50038.

[37] Kingma D P, Ba J. Adam: A method for stochastic optimization. In Proc. the 3rd International Conference on Learning Representations, May 2015.

[38] Baker N, Lu H, Erlikhman G, Kellman P J. Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 2018, 14(12): Article No. e1006613. DOI: 10.1371/journal.pcbi.1006613.

[39] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.770-778. DOI: 10.1109/CVPR.2016.90.

[40] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A C, Li F F. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211-252. DOI: 10.1007/s11263-015-0816-y.

[41] Geirhos R, Rubisch P, Michaelis C, Bethge M, Wichmann F A, Brendel W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. the 7th International Conference on Learning Representations, May 2019.

[42] Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H. Training data-efficient image transformers & distillation through attention. In Proc. the 38th International Conference on Machine Learning, July 2021, pp.10347-10357.
