Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (3): 601-614. DOI: 10.1007/s11390-022-2140-7
Special Topic: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia
A Comparative Study of CNN- and Transformer-Based Visual Style Transfer
Hua-Peng Wei1 (魏华鹏), Ying-Ying Deng2 (邓盈盈), Fan Tang1,* (唐帆), Member, CCF, Xing-Jia Pan3 (潘兴甲), and Wei-Ming Dong2 (董未名), Member, CCF, ACM, IEEE
1. Background (Context): Recently, Transformer architectures built on multi-head self-attention have made remarkable progress in computer vision, especially in perception tasks such as image classification and detection. Unlike convolutional neural networks (CNNs), which emphasize stacking hierarchical local receptive fields, vision Transformer models focus more on long-range global dependencies within an image. Related studies indicate that, compared with the shape bias exhibited by Transformer models, CNNs are more inclined toward texture modeling. However, most existing work on CNNs and vision Transformers targets perception tasks such as classification and detection; few studies compare how the two architectures behave on generative tasks (e.g., style transfer) or examine the reasons for their differences.
2. Objective: Focusing on image stylization, this paper compares the shape and texture preferences of CNN and Transformer architectures on a generative task, and investigates whether the main differences between the two architectures stem from the model structure or from the model parameters.
3. Method: We introduce the Transformer architecture into three typical visual style transfer (VST) algorithms, namely NST (representing optimization-based VST), AdaIN (representing perception-based VST), and WCT (representing reconstruction-based VST), and obtain Transformer-based arbitrary image stylization methods. By controlling the training configurations of the Transformer encoder and the style transfer algorithms (same structure with different training objectives; same training objective with different structures), we perform a comparative analysis of CNN-based and Transformer-based VST methods.
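To make the setup concrete, below is a minimal, illustrative PyTorch sketch of an AdaIN-style arbitrary stylization pipeline in which the usual CNN (VGG) encoder is replaced by a ViT-like Transformer encoder. This is not the authors' released code; the module names (ViTEncoder, CNNDecoder), patch size, feature dimension, and layer counts are assumptions chosen only for illustration.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# AdaIN-style arbitrary style transfer with a ViT-like encoder and a CNN decoder.
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Patch embedding + Transformer blocks; returns a feature map of shape (B, C, H/8, W/8)."""
    def __init__(self, dim=256, depth=4, heads=8, patch=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)          # patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)             # no positional embedding here

    def forward(self, x):
        f = self.embed(x)                                        # (B, C, h, w)
        b, c, h, w = f.shape
        tokens = self.blocks(f.flatten(2).transpose(1, 2))       # (B, h*w, C)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive instance normalization: align channel-wise mean/std of content to style."""
    c_mean, c_std = content_feat.mean((2, 3), keepdim=True), content_feat.std((2, 3), keepdim=True) + eps
    s_mean, s_std = style_feat.mean((2, 3), keepdim=True), style_feat.std((2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean

class CNNDecoder(nn.Module):
    """Nearest-neighbor upsampling followed by convolutions (rather than transposed convolutions)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"), nn.Conv2d(dim, 128, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"), nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, f):
        return self.net(f)

# Usage: stylize a content image with a style image.
encoder, decoder = ViTEncoder(), CNNDecoder()
content, style = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
stylized = decoder(adain(encoder(content), encoder(style)))
print(stylized.shape)   # torch.Size([1, 3, 256, 256])
```

The same encoder could in principle be plugged into the optimization-based (NST) or reconstruction-based (WCT) pipelines; only the feature-transfer step in the middle would change.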
4. Results & Findings: In our comparative analysis, we find that VST methods using a pre-trained ViT fail to render style patterns from the style image in their stylized outputs. When the model is instead trained with a CNN-based perceptual loss, we obtain a Transformer encoder with a stronger texture bias; VST methods built on this encoder can successfully generate stylized images, with quality comparable to the original CNN-based VST methods. We also discuss the influence of some basic modules in the Transformer architecture, such as positional encoding and the upsampling strategy.
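The CNN-based perceptual supervision mentioned above can be sketched as follows. This is an assumed, commonly used AdaIN-style formulation (content and style losses computed on frozen VGG-19 features), not necessarily the paper's exact objective; the layer indices, the loss weight style_weight, and the omission of ImageNet normalization are simplifications for illustration.

```python
# Illustrative sketch of CNN-based perceptual supervision (assumed AdaIN-style losses).
import torch
import torch.nn as nn
import torchvision

# Frozen VGG-19 used purely as a loss network (the weights string requires torchvision >= 0.13).
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Slices up to relu1_1, relu2_1, relu3_1, relu4_1 in the standard VGG-19 layout.
slices = [vgg[:2], vgg[2:7], vgg[7:12], vgg[12:21]]

def vgg_feats(x):
    feats = []
    for s in slices:
        x = s(x)
        feats.append(x)
    return feats

def mean_std(f, eps=1e-5):
    # Channel-wise statistics over the spatial dimensions.
    return f.mean((2, 3)), f.std((2, 3)) + eps

def perceptual_loss(stylized, content, style, style_weight=10.0):
    fs, fc, fst = vgg_feats(stylized), vgg_feats(content), vgg_feats(style)
    loss_c = nn.functional.mse_loss(fs[-1], fc[-1])          # content term on deep features
    loss_s = 0.0
    for a, b in zip(fs, fst):                                 # style term on mean/std statistics
        (ma, sa), (mb, sb) = mean_std(a), mean_std(b)
        loss_s = loss_s + nn.functional.mse_loss(ma, mb) + nn.functional.mse_loss(sa, sb)
    return loss_c + style_weight * loss_s

# Usage: loss = perceptual_loss(stylized, content, style); loss.backward()
# Back-propagating this loss through the Transformer-based pipeline is what transfers
# the texture bias of the CNN loss network to the Transformer encoder.
```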
5. Conclusions: The results show that, owing to its strong shape bias, a pre-trained ViT is ineffective for mainstream VST methods. We demonstrate that the shape bias can be reduced by training with appropriate perceptual supervision. We also conclude that using a learnable positional embedding or no positional embedding yields similar results, whereas sinusoidal positional encoding does not, because it binds the learned style factors to positional information. In addition, we show that using a CNN as the upsampling module is a suitable choice for avoiding checkerboard artifacts and repetitive patterns.
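The positional-embedding options compared in the conclusion can be illustrated with the short sketch below; the token count and feature dimension are assumptions, and sinusoidal_pe follows the standard fixed sin/cos encoding of the original Transformer.

```python
# Illustrative sketch of the three positional-embedding options discussed above.
import math
import torch
import torch.nn as nn

def sinusoidal_pe(num_tokens, dim):
    # PE[p, 2i] = sin(p / 10000^(2i/dim)),  PE[p, 2i+1] = cos(p / 10000^(2i/dim))
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

num_tokens, dim = 32 * 32, 256                               # e.g., a 256x256 input with 8x8 patches
learned_pe = nn.Parameter(torch.zeros(1, num_tokens, dim))   # option 1: learnable embedding
fixed_pe = sinusoidal_pe(num_tokens, dim).unsqueeze(0)       # option 2: fixed sinusoidal encoding
                                                             # option 3: no positional embedding at all
tokens = torch.rand(1, num_tokens, dim)                      # patch tokens from the encoder
tokens = tokens + fixed_pe                                   # added before the Transformer blocks
```

Because the sinusoidal table is a fixed function of absolute token position, style statistics learned on top of it become tied to position, which matches the repetitive patterns reported above; a learnable embedding or no embedding avoids this binding. Likewise, the nearest-neighbor-plus-convolution decoder in the earlier sketch is one common way to realize the CNN upsampling recommended here.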
|