Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (3): 522-537. DOI: 10.1007/s11390-020-0305-9

Special Topic: Artificial Intelligence and Pattern Recognition; Computer Graphics and Multimedia

• Special Section of CVM 2020 •


A Comprehensive Pipeline for Complex Text-to-Image Synthesis

Fei Fang1, Fei Luo1, Hong-Pan Zhang1, Hua-Jian Zhou1, Alix L. H. Chow2, Chun-Xia Xiao1,*, Member, CCF, IEEE        

  1 School of Computer Science, Wuhan University, Wuhan 430072, China;
    2 Xiaomi Technology Co. LTD, Beijing 100085, China
  • Received: 2020-01-15 Revised: 2020-04-15 Online: 2020-05-28 Published: 2020-05-28
  • Contact: Chun-Xia Xiao E-mail: cxxiao@whu.edu.cn
  • About author:Fei Fang received her Bachelor's degree in computer science and technology from Zhengzhou University, Zhengzhou, in 2011, and her Master's degree in computer science and technology from Guangxi University, Nanning, in 2014. Currently, she is working toward her Ph.D. degree in computer science and technology at the School of Computer Science, Wuhan University, Wuhan. Her research interests are image processing and machine learning.
  • Supported by:
    This work was supported by the Key Technological Innovation Projects of Hubei Province of China under Grant No. 2018AAA062, the Wuhan Science and Technology Plan Project of Hubei Province of China under Grant No. 2017010201010109, the National Key Research and Development Program of China under Grant No. 2017YFB1002600, and the National Natural Science Foundation of China under Grant Nos. 61672390 and 61972298.

Objective: This paper studies how to automatically generate a complex scene image containing multiple objects and a background from an input English sentence, i.e., text-to-image synthesis.
Text-to-image synthesis is a challenging research topic in computer vision. Billions of images can be found on the Internet today, but their content does not necessarily meet a particular user's specific requirements, and users may even expect surreal images, so retrieval alone cannot guarantee the desired picture. Since text is the tool most people use to describe a scene or picture, our goal is to generate images that satisfy a user-provided textual description.
Simple images containing a single object are relatively easy to find on the Web. Complex scene images, in contrast, carry rich semantics: they usually consist of a background and several foreground objects with spatial relations among them, e.g., object A is next to object B, or object C lies in a certain region of the background. This makes scene images that fully match the input text hard to retrieve. We therefore analyze and recombine existing foreground object images and background images, and synthesize the required scene image according to the user's input text.
This work matters in two respects. First, generating images from text reduces the blindness and inefficiency of manual image retrieval, yielding a satisfactory image from the input text in one pass. Second, it improves the utilization of existing image resources.
Methods: The overall framework is shown in Fig.1. The pipeline consists of the following components. (1) Text processing. The input text is first parsed with the natural language processing (NLP) tool Stanford CoreNLP; this analysis tells the system which objects the generated image should contain and which constraints they must satisfy. The constraints include pairwise spatial relations between foreground objects and relations between foreground objects and background regions, and we represent them as a set of semantic triplets; a minimal parsing sketch follows.
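To make the triplet extraction concrete, below is a small Python sketch. It uses stanza, the Stanford NLP group's Python package, as a stand-in for the Stanford CoreNLP toolkit named above; the single dependency-pattern rule and the example sentence are illustrative, not the authors' exact grammar.

```python
# Illustrative semantic-triplet extraction from a dependency parse, using
# stanza as a stand-in for Stanford CoreNLP; the "nmod + case" rule is a
# simplification of the paper's parsing rules.
import stanza

nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

def extract_triplets(text):
    """Return (head_noun, preposition, dependent_noun) spatial triplets."""
    triplets = []
    for sent in nlp(text).sentences:
        words = sent.words
        for w in words:
            # "nmod" attaches a noun to another noun via a preposition,
            # e.g., "a dog on the grass".
            if w.deprel == "nmod" and w.head > 0:
                head = words[w.head - 1]
                case_ids = [c.id for c in words
                            if c.head == w.id and c.deprel == "case"]
                prep = [c.text for c in words if c.id in case_ids
                        or (c.deprel == "fixed" and c.head in case_ids)]
                if prep and head.upos == "NOUN" and w.upos == "NOUN":
                    triplets.append((head.lemma, " ".join(prep), w.lemma))
    return triplets

# Output is parse-dependent, e.g.:
# [('dog', 'on', 'grass'), ('grass', 'next to', 'car')]
print(extract_triplets("A brown dog on the grass next to a red car."))
```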
(2) Foreground object and background scene retrieval. The objects and the background needed in the result image are selected from two databases: a foreground object database of labeled and segmented objects, and a database of analyzed background images. Based on the text-processing result, we first retrieve the required foreground objects by name and attribute; we then run image captioning on the background image collection and retrieve a suitable background image by name and a set of rules. A retrieval sketch follows.
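As an illustration of the name-and-attribute lookup, here is a sketch with WordNet-based synonym expansion (WordNet is cited as [33]). The database schema, a list of records with category, attributes, and path fields, is a hypothetical stand-in for the authors' segmented COCO object database; the captioning-based background retrieval is not shown.

```python
# Hypothetical name/attribute retrieval over a segmented object database.
# Requires the NLTK WordNet corpus: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def noun_synonyms(name):
    """All lemma names of the noun synsets of `name`, e.g., car -> auto."""
    return {l.name().replace("_", " ")
            for s in wn.synsets(name, pos=wn.NOUN) for l in s.lemmas()}

def retrieve(db, name, attributes=()):
    names = noun_synonyms(name) | {name}
    hits = [o for o in db if o["category"] in names]
    for attr in attributes:                  # e.g., "red", "small"
        hits = [o for o in hits if attr in o["attributes"]]
    return hits

# Toy entries standing in for segmented COCO foreground objects.
db = [{"category": "dog", "attributes": ["brown"], "path": "obj/0001.png"},
      {"category": "car", "attributes": ["red"],   "path": "obj/0002.png"}]
print(retrieve(db, "automobile", ["red"]))   # matches the car via WordNet
```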
(3) Image generation via constrained MCMC optimization. This step places the objects at suitable positions with suitable sizes so that their states satisfy the textual description. Objects freshly retrieved from the database start from arbitrary initial positions, and each optimization step perturbs one object's position and size. We define a cost function over the objects' current states. For position, the cost measures whether objects overlap, whether they lie in a reasonable background region, and whether their relative positions are correct; for size, it mainly accounts for the objects' relative sizes and perspective effects. We optimize this cost function with Markov Chain Monte Carlo (MCMC) sampling, which lowers the cost and reaches a near-optimal layout within a short time; a minimal sketch of the sampler follows.
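The following is a minimal Metropolis-Hastings sketch of the layout step. The two cost terms (pairwise overlap plus one "left of" relation), the Gaussian proposal scales, and the fixed temperature are simplified stand-ins for the paper's full cost function and sampler.

```python
# Minimal Metropolis-Hastings layout sampler over axis-aligned boxes.
import math
import random

def overlap(a, b):
    """Overlap area of two boxes given as (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0.0) * max(dy, 0.0)

def cost(layout, relations):
    boxes = list(layout.values())
    c = sum(overlap(boxes[i], boxes[j])        # objects must not overlap
            for i in range(len(boxes)) for j in range(i + 1, len(boxes)))
    for a, rel, b in relations:                # e.g., ("dog", "left of", "car")
        if rel == "left of":                   # penalize a's right edge
            c += max(0.0, layout[a][0] + layout[a][2] - layout[b][0])
    return c

def mcmc_layout(layout, relations, iters=5000, temp=10.0):
    cur = cost(layout, relations)
    for _ in range(iters):
        name = random.choice(list(layout))     # perturb one object at a time
        x, y, w, h = old = layout[name]
        layout[name] = (x + random.gauss(0, 5), y + random.gauss(0, 5),
                        max(4.0, w * random.uniform(0.95, 1.05)),
                        max(4.0, h * random.uniform(0.95, 1.05)))
        new = cost(layout, relations)
        # Metropolis rule: always accept downhill moves, occasionally uphill.
        if new > cur and random.random() >= math.exp((cur - new) / temp):
            layout[name] = old                 # reject the proposal
        else:
            cur = new
    return layout

boxes = {"dog": (60, 50, 30, 20), "car": (40, 55, 50, 30)}
print(mcmc_layout(boxes, [("dog", "left of", "car")]))
```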
(4) Post-processing. Because the foreground objects and the background come from different source images, some post-processing is needed to composite the parts into a harmonious result. We use two region blending methods: Poisson-based blending and relighting-based blending. Poisson-based blending gives foreground object patches natural boundaries and harmonious colors, while relighting-based blending harmonizes the illumination. A small blending sketch follows.
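For the Poisson-based half of the post-processing, the sketch below uses OpenCV's seamlessClone, an implementation of Poisson image editing [37]; the file names and the placement center are illustrative, and the relighting-based blending is a separate step not shown here.

```python
# Poisson-based blending via OpenCV's seamlessClone. The object patch must
# fit inside the background at the chosen center.
import cv2
import numpy as np

fg = cv2.imread("object.png")                 # segmented foreground patch
bg = cv2.imread("background.jpg")             # retrieved background scene

mask = 255 * np.ones(fg.shape[:2], dtype=np.uint8)  # blend the whole patch
center = (200, 300)                           # (x, y) from the MCMC layout

composite = cv2.seamlessClone(fg, bg, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composite.jpg", composite)
```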
Results: Our generated images have the following merits. First, they are clear, high-resolution images with complete foreground objects and a clean background scene. Second, they are semantically consistent with the input sentence. Third, the objects and the scene are plausible, with all foreground objects placed at appropriate positions and sizes. Finally, the foreground objects and the background image are harmonized in color.
Fig.5 shows scene images generated by our method together with the corresponding input sentences, and Fig.6 shows the results of our user study. Most users judged that our method generates result images of higher quality and with better text-image matching. Our method scores lower on system usability because it takes somewhat longer than the trained Obj-GAN model to generate a result image for each input sentence.
Our method also has limitations. Owing to the complexity of natural language, we cannot yet handle all textual information, such as weather, time, and season. In addition, some foreground source images are taken from a top or bottom view, which makes the viewpoint of the corresponding foreground object inconsistent with that of the background image, as shown in Fig.7. A small viewpoint inconsistency causes no obvious artifact, but a few images differ considerably in viewpoint, and compositing them into one scene produces a noticeable viewpoint error.
Conclusions: Implementing our method requires rules from several aspects to control the synthesis quality. For example, foreground objects in the scene must not overlap, their spatial relations must conform to human observation habits, and resizing must respect perspective: nearer objects appear larger and farther ones smaller (a small numeric illustration follows). Through long-term research and experiments we finally obtain high-quality synthesized scene images, so that a user only needs to input one sentence to obtain the corresponding scene image.
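As a numeric illustration of the "nearer is larger" resizing rule, the sketch below scales an object's pixel height inversely with its scene depth, following the pinhole-camera approximation; the reference height and depths are made-up values.

```python
# Pinhole-camera approximation: projected size is inversely proportional
# to depth, so doubling an object's distance halves its pixel height.
def perspective_height(ref_height_px, ref_depth, depth):
    return ref_height_px * ref_depth / depth

print(perspective_height(120, 2.0, 4.0))  # 60.0: twice as far, half as tall
```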
In the future, we expect to handle more complex and meaningful text, such as a paragraph or even a story, and to generate more vivid and complex synthesized images or even videos. Compared with images, the spatio-temporal consistency of video content is more challenging to achieve. Videos generated from text are easier to disseminate and have high application value in education, office work, and self-media.


Abstract: Synthesizing a complex scene image with multiple objects and background according to text description is a challenging problem. It needs to solve several difficult tasks across the fields of natural language processing and computer vision. We model it as a combination of semantic entity recognition, object retrieval and recombination, and objects’ status optimization. To reach a satisfactory result, we propose a comprehensive pipeline to convert the input text to its visual counterpart. The pipeline includes text processing, foreground objects and background scene retrieval, image synthesis using constrained MCMC, and post-processing. Firstly, we roughly divide the objects parsed from the input text into foreground objects and background scenes. Secondly, we retrieve the required foreground objects from the foreground object dataset segmented from Microsoft COCO dataset, and retrieve an appropriate background scene image from the background image dataset extracted from the Internet. Thirdly, in order to ensure the rationality of foreground objects’ positions and sizes in the image synthesis step, we design a cost function and use the Markov Chain Monte Carlo (MCMC) method as the optimizer to solve this constrained layout problem. Finally, to make the image look natural and harmonious, we further use Poisson-based and relighting-based methods to blend foreground objects and background scene image in the post-processing step. The synthesized results and comparison results based on Microsoft COCO dataset prove that our method outperforms some of the state-of-the-art methods based on generative adversarial networks (GANs) in visual quality of generated scene images.

Key words: image synthesis, scene generation, text-to-image conversion, Markov Chain Monte Carlo (MCMC)

[1] Lin T Y, Maire M, Belongie S et al. Microsoft COCO: Common objects in context. In Proc. the 13th European Conference on Computer Vision, September 2014, pp.740-755.
[2] Krishna R, Zhu Y, Groth O et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017, 123(1): 32-73.
[3] Mansimov E, Parisotto E, Ba J L et al. Generating images from captions with attention. arXiv:1511.02793, 2015. https://arxiv.org/abs/1511.02793, October 2019.
[4] Reed S, Akata Z, Yan X et al. Generative adversarial text to image synthesis. arXiv:1605.05396, 2016. https://arxiv.org/abs/1605.05396, October 2019.
[5] Zhang H, Xu T, Li H et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.5907-5915.
[6] Lalonde J F, Hoiem D, Efros A A et al. Photo clip art. ACM Transactions on Graphics, 2007, 26(3): Article No. 3.
[7] Chen T, Cheng M M, Tan P et al. Sketch2Photo: Internet image montage. ACM Transactions on Graphics, 2009, 28(5): Article No. 124.
[8] Chen T, Tan P, Ma L Q et al. PoseShop: Human image database construction and personalized content synthesis. IEEE Transactions on Visualization and Computer Graphics, 2013, 19(5): 824-837.
[9] Fang F, Yi M, Feng H et al. Narrative collage of image collections by scene graph recombination. IEEE Transactions on Visualization and Computer Graphics, 2018, 24(9): 2559-2572.
[10] Zitnick C L, Parikh D. Bringing semantics into focus using visual abstraction. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.3009-3016.
[11] Zitnick C L, Parikh D, Vanderwende L. Learning the visual interpretation of sentences. In Proc. the IEEE International Conference on Computer Vision, December 2013, pp.1681-1688.
[12] Coyne B, Sproat R. WordsEye: An automatic text-to-scene conversion system. In Proc. the 28th Annual Conference on Computer Graphics and Interactive Techniques, August 2001, pp.487-496.
[13] Chang A, Savva M, Manning C D. Learning spatial knowledge for text to 3D scene generation. In Proc. the 2014 Conference on Empirical Methods in Natural Language Processing, October 2014, pp.2028-2038.
[14] Reed S, van den Oord A, Kalchbrenner N et al. Generating interpretable images with controllable structure. In Proc. the International Conference on Learning Representations, April 2017.
[15] Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial nets. In Proc. the Annual Conference on Neural Information Processing Systems, December 2014, pp.2672-2680.
[16] Reed S E, Akata Z, Mohan S et al. Learning what and where to draw. In Proc. the Annual Conference on Neural Information Processing Systems, December 2016, pp.217-225.
[17] Zhang H, Xu T, Li H et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962.
[18] Xu T, Zhang P, Huang Q et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1316-1324.
[19] Yin G, Liu B, Sheng L et al. Semantics disentangling for text-to-image generation. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.2327-2336.
[20] Zhou X, Huang S, Li B et al. Text guided person image synthesis. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3663-3672.
[21] Tan H, Liu X, Li X et al. Semantics-enhanced adversarial nets for text-to-image synthesis. In Proc. the IEEE International Conference on Computer Vision, October 2019, pp.10500-10509.
[22] Qiao T, Zhang J, Xu D et al. MirrorGAN: Learning text-to-image generation by redescription. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1505-1514.
[23] Johnson J, Gupta A, Li F F. Image generation from scene graphs. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp.1219-1228.
[24] Li W, Zhang P, Zhang L et al. Object-driven text-to-image synthesis via adversarial training. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12174-12182.
[25] Hinz T, Heinrich S, Wermter S. Generating multiple objects at spatially distinct locations. arXiv:1901.00686, 2019. https://arxiv.org/abs/1901.00686, October 2019.
[26] Xu K, Ba J, Kiros R et al. Show, attend and tell: Neural image caption generation with visual attention. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.2048-2057.
[27] Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3128-3137.
[28] Johnson J, Karpathy A, Li F F. DenseCap: Fully convolutional localization networks for dense captioning. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4565-4574.
[29] Krause J, Johnson J, Krishna R et al. A hierarchical approach for generating descriptive image paragraphs. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.3337-3345.
[30] Yao L, Torabi A, Cho K et al. Describing videos by exploiting temporal structure. In Proc. the IEEE International Conference on Computer Vision, December 2015, pp.4507-4515.
[31] Yu H, Wang J, Huang Z et al. Video paragraph captioning using hierarchical recurrent neural networks. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.4584-4593.
[32] Li A, Sun J, Ng J Y H et al. Generating holistic 3D scene abstractions for text-based image retrieval. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1942-1950.
[33] Fellbaum C. WordNet. In Theory and Applications of Ontology: Computer Applications, Poli R, Healy M, Kameas A (eds.), Springer Netherlands, 2010, pp.231-243.
[34] He K, Gkioxari G, Dollár P et al. Mask R-CNN. In Proc. the IEEE International Conference on Computer Vision, October 2017, pp.2980-2988.
[35] Laina I, Rupprecht C, Belagiannis V et al. Deeper depth prediction with fully convolutional residual networks. In Proc. the 4th International Conference on 3D Vision, October 2016, pp.239-248.
[36] Yeh Y T, Yang L, Watson M et al. Synthesizing open worlds with constraints using locally annealed reversible jump MCMC. ACM Transactions on Graphics, 2012, 31(4): Article No. 56.
[37] Pérez P, Gangnet M, Blake A. Poisson image editing. ACM Transactions on Graphics, 2003, 22(3): 313-318.
[38] Liao Z, Karsch K, Forsyth D. An approximate shading model for object relighting. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.5307-5314.
[39] Elder J H. Shape from contour: Computation and representation. Annual Review of Vision Science, 2018, 4(1): 423-450.
[40] Johnston S F. Lumo: Illumination for cel animation. In Proc. the 2nd International Symposium on Non-Photorealistic Animation and Rendering, June 2002, pp.45-52.
[41] Wu T P, Sun J, Tang C K et al. Interactive normal reconstruction from a single image. ACM Transactions on Graphics, 2008, 27(5): Article No. 119.
[42] Grosse R, Johnson M K, Adelson E H et al. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In Proc. the 12th IEEE International Conference on Computer Vision, September 2009, pp.2335-2342.
[43] Karsch K, Sunkavalli K, Hadap S et al. Automatic scene inference for 3D object compositing. ACM Transactions on Graphics, 2014, 33(3): Article No. 32.
[44] Godard C, Aodha M O, Brostow G J. Unsupervised monocular depth estimation with left-right consistency. In Proc. the IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.6602-6611.