• Acta Optica Sinica
  • Vol. 43, Issue 15, 1510002 (2023)
Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China
    DOI: 10.3788/AOS230758
    Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002
    References

    [1] Müller V C, Bostrom N. Future progress in artificial intelligence: a survey of expert opinion[M]. Müller V C. Fundamental issues of artificial intelligence. Synthese library, 376, 555-572(2016).

    [2] Došilović F K, Brčić M, Hlupić N. Explainable artificial intelligence: a survey[C], 210-215(2018).

    [3] Lu Y. Artificial intelligence: a survey on evolution, models, applications and future trends[J]. Journal of Management Analytics, 6, 1-29(2019).

    [4] Henry W P. Artificial intelligence[M](1984).

    [5] Huang C X, Wang G R, Zhou Z B et al. Reward-adaptive reinforcement learning: dynamic policy gradient optimization for bipedal locomotion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7686-7695(2023).

    [6] Huang C X, Zhang R H, Ouyang M Z et al. Deductive reinforcement learning for visual autonomous urban driving navigation[J]. IEEE Transactions on Neural Networks and Learning Systems, 32, 5379-5391(2021).

    [7] Wu J, Li G B, Han X G et al. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos[C], 1283-1291(2020).

    [8] Xie S R, Huang J N, Lei L X et al. NADPEx: an on-policy temporally consistent exploration method for deep reinforcement learning[EB/OL]. https://arxiv.org/abs/1812.09028

    [9] Garland M, le Grand S, Nickolls J et al. Parallel computing experiences with CUDA[J]. IEEE Micro, 28, 13-27(2008).

    [10] Kalaiselvi T, Sriramakrishnan P, Somasundaram K. Survey of using GPU CUDA programming model in medical image analysis[J]. Informatics in Medicine Unlocked, 9, 133-144(2017).

    [11] Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library[EB/OL]. https://arxiv.org/abs/1912.01703

    [12] Abadi M. TensorFlow: learning functions at scale[C](2016).

    [13] Wang G R, Lin L, Chen R C et al. Joint learning of neural transfer and architecture adaptation for image recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 33, 5401-5415(2022).

    [14] Chen T S, Lin L, Chen R Q et al. Knowledge-guided multi-label few-shot learning for general image recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 1371-1384(2022).

    [15] Wang K Z, Zhang D Y, Li Y et al. Cost-effective active learning for deep image classification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 27, 2591-2600(2017).

    [16] Wang X L, Lin L, Huang L C et al. Incorporating structural alternatives and sharing into hierarchy for multiclass object recognition and detection[C], 3334-3341(2013).

    [17] Wang K Z, Lin L, Zuo W M et al. Dictionary pair classifier driven convolutional neural networks for object detection[C], 2138-2146(2016).

    [18] Wang K Z, Yan X P, Zhang D Y et al. Towards human-machine cooperation: self-supervised sample mining for object detection[C], 1605-1613(2018).

    [19] Jiang C H, Xu H, Liang X D et al. Hybrid knowledge routed modules for large-scale object detection[EB/OL]. https://arxiv.org/abs/1810.12681

    [20] Xu H, Jiang C H, Liang X D et al. Reasoning-RCNN: unifying adaptive global reasoning into large-scale object detection[C], 6412-6421(2020).

    [21] Yang B B, Deng X C, Shi H et al. Continual object detection via prototypical task correlation guided gating mechanism[C], 9245-9254(2022).

    [22] Wu Y X, Zhang G W, Xu H et al. Auto-Panoptic: cooperative multi-component architecture search for panoptic segmentation[EB/OL]. https://arxiv.org/abs/2010.16119

    [23] Wu Y X, Zhang G W, Gao Y M et al. Bidirectional graph reasoning network for panoptic segmentation[C], 9077-9086(2020).

    [24] Yang J H, Xu R J, Li R Y et al. An adversarial perturbation oriented domain adaptation approach for semantic segmentation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 12613-12620(2020).

    [25] Jordan M I, Mitchell T M. Machine learning: trends, perspectives, and prospects[J]. Science, 349, 255-260(2015).

    [26] He K M, Zhang X Y, Ren S Q et al. Deep residual learning for image recognition[C], 770-778(2016).

    [27] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. https://arxiv.org/abs/1409.1556

    [28] Girshick R. Fast R-CNN[C], 1440-1448(2016).

    [29] Ren S Q, He K M, Girshick R B et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149(2015).

    [30] Strumbelj E, Kononenko I. An efficient explanation of individual classifications using game theory[J]. Journal of Machine Learning Research, 11, 1-18(2010).

    [31] Baehrens D, Schroeter T, Harmeling S et al. How to explain individual classification decisions[J]. Journal of Machine Learning Research, 11, 1803-1831(2010).

    [32] Oussidi A, Elhassouny A. Deep generative models: survey[C](2018).

    [33] Pan Z Q, Yu W J, Yi X K et al. Recent progress on generative adversarial networks (GANs): a survey[J]. IEEE Access, 7, 36322-36333(2019).

    [34] Harshvardhan G M, Gourisaria M K, Pandey M et al. A comprehensive survey and analysis of generative models in machine learning[J]. Computer Science Review, 38, 100285(2020).

    [35] Cao H Q, Tan C, Gao Z Y et al. A survey on generative diffusion model[EB/OL]. https://arxiv.org/abs/2209.02646

    [36] Wiatrak M, Albrecht S V, Nystrom A. Stabilizing generative adversarial networks: a survey[EB/OL]. https://arxiv.org/abs/1910.00927

    [37] Wang K F, Gou C, Duan Y J et al. Generative adversarial networks: introduction and outlook[J]. IEEE/CAA Journal of Automatica Sinica, 4, 588-598(2017).

    [38] He K M, Chen X L, Xie S N et al. Masked autoencoders are scalable vision learners[C], 15979-15988(2022).

    [39] Wang G R, Tang Y S, Lin L et al. Semantic-aware auto-encoders for self-supervised representation learning[C], 9654-9665(2022).

    [40] Guo Z Y, Zhang R R, Qiu L T et al. Joint-MAE: 2D-3D joint masked autoencoders for 3D point cloud pre-training[EB/OL]. https://arxiv.org/abs/2302.14007

    [41] Tan Q Y, Liu N H, Huang X et al. S2GAE: self-supervised graph autoencoders are generalizable learners with graph masking[C], 787-795(2023).

    [42] Bao H B, Dong L, Piao S H et al. BEiT: BERT pre-training of image transformers[EB/OL]. https://arxiv.org/abs/2106.08254

    [43] Xia L H, Huang C, Huang C Z et al. Automated self-supervised learning for recommendation[EB/OL]. https://arxiv.org/abs/2303.07797

    [44] Chen A, Zhang K, Zhang R R et al. PiMAE: point cloud and image interactive masked autoencoders for 3D object detection[EB/OL]. https://arxiv.org/abs/2303.08129

    [45] Ren S C, Wei F Y, Albanie S et al. DeepMIM: deep supervision for masked image modeling[EB/OL]. https://arxiv.org/abs/2303.08817

    [46] Wei C, Fan H Q, Xie S N et al. Masked feature prediction for self-supervised visual pre-training[C], 14648-14658(2022).

    [47] He K M, Fan H Q, Wu Y X et al. Momentum contrast for unsupervised visual representation learning[C], 9726-9735(2020).

    [48] Chen X L, Fan H Q, Girshick R et al. Improved baselines with momentum contrastive learning[EB/OL]. https://arxiv.org/abs/2003.04297

    [49] Chen X L, Xie S N, He K M. An empirical study of training self-supervised vision transformers[C], 9620-9629(2022).

    [50] Chen X L, He K M. Exploring simple Siamese representation learning[C], 15745-15753(2021).

    [51] Grill J B, Strub F, Altché F et al. Bootstrap your own latent: a new approach to self-supervised learning[EB/OL]. https://arxiv.org/abs/2006.07733

    [52] Cheng Z Z, Yang Q X, Sheng B. Deep colorization[C], 415-423(2016).

    [53] Xiao Y, Zhou P Y, Zheng Y et al. Interactive deep colorization using simultaneous global and local inputs[C], 1887-1891(2019).

    [54] Zhang R, Isola P, Efros A A. Colorful image colorization[M]. Leibe B, Matas J, Sebe N, et al. Computer vision–ECCV 2016. Lecture notes in computer science, 9907, 649-666(2016).

    [55] Larsson G, Maire M, Shakhnarovich G. Learning representations for automatic colorization[M]. Leibe B, Matas J, Sebe N, et al. Computer vision–ECCV 2016. Lecture notes in computer science, 9908, 577-593(2016).

    [56] Zhang R, Zhu J Y, Isola P et al. Real-time user-guided image colorization with learned deep priors[EB/OL]. https://arxiv.org/abs/1705.02999

    [57] He M M, Chen D D, Liao J et al. Deep exemplar-based colorization[J]. ACM Transactions on Graphics, 37, 1-16(2018).

    [58] Ledig C, Theis L, Huszár F et al. Photo-realistic single image super-resolution using a generative adversarial network[C], 105-114(2017).

    [59] Sharif S M A, Ali Naqvi R, Ali F et al. DarkDeblur: learning single-shot image deblurring in low-light condition[J]. Expert Systems With Applications, 222, 119739(2023).

    [60] Li B C, Li X, Lu Y T et al. HST: hierarchical Swin Transformer for compressed image super-resolution[M]. Karlinsky L, Michaeli T, Nishino K. Computer vision–ECCV 2022 workshops. Lecture notes in computer science, 13802(2022).

    [61] Liang J Y, Cao J Z, Sun G L et al. SwinIR: image restoration using Swin Transformer[C], 1833-1844(2021).

    [62] Zamir S W, Arora A, Khan S et al. Multi-stage progressive image restoration[C], 14816-14826(2021).

    [63] Yang F Z, Yang H, Fu J L et al. Learning texture transformer network for image super-resolution[C], 5790-5799(2020).

    [64] Dai T, Cai J R, Zhang Y B et al. Second-order attention network for single image super-resolution[C], 11057-11066(2020).

    [65] Liu X, Wu Q Y, Zhou H et al. Audio-driven co-speech gesture video generation[EB/OL]. https://arxiv.org/abs/2212.02350

    [66] Cui R P, Cao Z, Pan W S et al. Deep gesture video generation with learning on regions of interest[J]. IEEE Transactions on Multimedia, 22, 2551-2563(2020).

    [67] Saunders B, Camgoz N C, Bowden R. AnonySign: novel human appearance synthesis for sign language video anonymisation[C](2022).

    [68] Natarajan B, Elakkiya R. Dynamic GAN for high-quality sign language video generation from skeletal poses using generative adversarial networks[J]. Soft Computing, 26, 13153-13175(2022).

    [69] Ferstl Y, Neff M, McDonnell R. Multi-objective adversarial gesture generation[C](2019).

    [70] Zeng D, Liu H, Lin H et al. Talking face generation with expression-tailored generative adversarial network[C], 1716-1724(2020).

    [71] Zhou H, Liu Y, Liu Z W et al. Talking face generation by adversarially disentangled audio-visual representation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 9299-9306(2019).

    [72] Zhang B W, Qi C Y, Zhang P et al. MetaPortrait: identity-preserving talking head generation with fast personalized adaptation[EB/OL]. https://arxiv.org/abs/2212.08062

    [73] Zhou H, Sun Y S, Wu W et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C], 4174-4184(2021).

    [74] Zeng D, Zhao S T, Zhang J J et al. Expression-tailored talking face generation with adaptive cross-modal weighting[J]. Neurocomputing, 511, 117-130(2022).

    [75] Chen L L, Maddox R K, Duan Z Y et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C], 7824-7833(2020).

    [76] Zhang Z M, Li L C, Ding Y et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C], 3660-3669(2021).

    [77] Mildenhall B, Srinivasan P P, Tancik M et al. NeRF: representing scenes as neural radiance fields for view synthesis[M]. Vedaldi A, Bischof H, Brox T, et al. Computer vision–ECCV 2020. Lecture notes in computer science, 12346, 405-421(2020).

    [78] Park K, Sinha U, Barron J T et al. Nerfies: deformable neural radiance fields[C], 5845-5854(2022).

    [79] Niemeyer M, Geiger A. GIRAFFE: representing scenes as compositional generative neural feature fields[C], 11448-11459(2021).

    [80] Pumarola A, Corona E, Pons-Moll G et al. D-NeRF: neural radiance fields for dynamic scenes[C], 10313-10322(2021).

    [81] Martin-Brualla R, Radwan N, Sajjadi M S M et al. NeRF in the wild: neural radiance fields for unconstrained photo collections[C], 7206-7215(2021).

    [82] Chan E R, Monteiro M, Kellnhofer P et al. Pi-GAN: periodic implicit generative adversarial networks for 3D-aware image synthesis[C], 5795-5805(2021).

    [83] Chan E R, Lin C Z, Chan M A et al. Efficient geometry-aware 3D generative adversarial networks[C], 16102-16112(2022).

    [84] Li Z Q, Niklaus S, Snavely N et al. Neural scene flow fields for space-time view synthesis of dynamic scenes[C], 6494-6504(2021).

    [85] Oechsle M, Peng S Y, Geiger A. UNISURF: unifying neural implicit surfaces and radiance fields for multi-view reconstruction[C], 5569-5579(2022).

    [86] Weng L. What are diffusion models?[EB/OL]. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

    [87] Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial nets[EB/OL]. https://arxiv.org/abs/1406.2661

    [88] Chen Z L, Yang Z F, Wang X X et al. Multivariate-information adversarial ensemble for scalable joint distribution matching[EB/OL]. https://arxiv.org/abs/1907.03426

    [89] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks[EB/OL]. https://arxiv.org/abs/1701.04862

    [90] Arjovsky M, Chintala S, Bottou L. Wasserstein GAN[EB/OL]. https://arxiv.org/abs/1701.07875

    [91] Kingma D P, Welling M. Auto-encoding variational Bayes[EB/OL]. https://arxiv.org/abs/1312.6114

    [92] Rezende D, Mohamed S. Variational inference with normalizing flows[EB/OL]. https://arxiv.org/abs/1505.05770

    [93] Dinh L, Sohl-Dickstein J, Bengio S. Density estimation using Real NVP[EB/OL]. https://arxiv.org/abs/1605.08803

    [94] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[EB/OL]. https://arxiv.org/abs/2006.11239

    [95] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis[EB/OL]. https://arxiv.org/abs/2105.05233

    [96] Ramesh A, Dhariwal P, Nichol A et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. https://arxiv.org/abs/2204.06125

    [97] Rombach R, Blattmann A, Lorenz D et al. High-resolution image synthesis with latent diffusion models[C], 10674-10685(2022).

    [98] Nichol A Q, Dhariwal P, Ramesh A et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[EB/OL]. https://arxiv.org/abs/2112.10741

    [99] Saharia C, Chan W, Saxena S et al. Photorealistic text-to-image diffusion models with deep language understanding[EB/OL]. https://arxiv.org/abs/2205.11487

    [100] Radford A, Kim J W, Hallacy C et al. Learning transferable visual models from natural language supervision[EB/OL]. https://arxiv.org/abs/2103.00020

    [101] van den Oord A, Vinyals O. Neural discrete representation learning[EB/OL]. https://arxiv.org/abs/1711.00937

    [102] Ruiz N, Li Y Z, Jampani V et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[EB/OL]. https://arxiv.org/abs/2208.12242

    [103] Gal R, Alaluf Y, Atzmon Y et al. An image is worth one word: personalizing text-to-image generation using textual inversion[EB/OL]. https://arxiv.org/abs/2208.01618

    [104] Kumari N, Zhang B L, Zhang R et al. Multi-concept customization of text-to-image diffusion[EB/OL]. https://arxiv.org/abs/2212.04488

    [105] Avrahami O, Lischinski D, Fried O. Blended diffusion for text-driven editing of natural images[C], 18187-18197(2022).

    [106] Hertz A, Mokady R, Tenenbaum J et al. Prompt-to-prompt image editing with cross attention control[EB/OL]. https://arxiv.org/abs/2208.01626

    [107] Kawar B, Zada S, Lang O et al. Imagic: text-based real image editing with diffusion models[EB/OL]. https://arxiv.org/abs/2210.09276

    [108] Ho J, Chan W, Saharia C et al. Imagen video: high definition video generation with diffusion models[EB/OL]. https://arxiv.org/abs/2210.02303

    [109] Ho J, Salimans T, Gritsenko A A et al. Video diffusion models[EB/OL]. https://arxiv.org/abs/2204.03458

    [110] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[M]. Navab N, Hornegger J, Wells W M, et al. Medical image computing and computer-assisted intervention–MICCAI 2015. Lecture notes in computer science, 9351, 234-241(2015).

    [111] Mei K F, Patel V M. VIDM: video implicit diffusion models[EB/OL]. https://arxiv.org/abs/2212.00235

    [112] Yu S, Sohn K, Kim S et al. Video probabilistic diffusion models in projected latent space[EB/OL]. https://arxiv.org/abs/2302.07685

    [113] Singer U, Polyak A, Hayes T et al. Make-A-Video: text-to-video generation without text-video data[EB/OL]. https://arxiv.org/abs/2209.14792

    [114] Luo Z X, Chen D Y, Zhang Y Y et al. VideoFusion: decomposed diffusion models for high-quality video generation[EB/OL]. https://arxiv.org/abs/2303.08320

    [115] Esser P, Chiu J, Atighehchian P et al. Structure and content-guided video synthesis with diffusion models[EB/OL]. https://arxiv.org/abs/2302.03011

    [116] Wu J Z, Ge Y X, Wang X T et al. Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation[EB/OL]. https://arxiv.org/abs/2212.11565

    [117] Lin T Y, Maire M, Belongie S et al. Microsoft COCO: common objects in context[M]. Fleet D, Pajdla T, Schiele B, et al. Computer vision–ECCV 2014. Lecture notes in computer science, 8693, 740-755(2014).

    [118] Heusel M, Ramsauer H, Unterthiner T et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[EB/OL]. https://arxiv.org/abs/1706.08500

    [119] Zhou Y F, Zhang R Y, Chen C Y et al. LAFITE: towards language-free training for text-to-image generation[EB/OL]. https://arxiv.org/abs/2111.13792

    [120] Ramesh A, Pavlov M, Goh G et al. Zero-shot text-to-image generation[EB/OL]. https://arxiv.org/abs/2102.12092

    [121] Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. https://arxiv.org/abs/1212.0402

    [122] Xiong W, Luo W H, Ma L et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks[C], 2364-2373(2018).

    [123] Unterthiner T, van Steenkiste S, Kurach K et al. Towards accurate generative models of video: a new metric & challenges[EB/OL]. https://arxiv.org/abs/1812.01717

    [124] Salimans T, Goodfellow I, Zaremba W et al. Improved techniques for training GANs[EB/OL]. https://arxiv.org/abs/1606.03498

    [125] Tulyakov S, Liu M Y, Yang X D et al. MoCoGAN: decomposing motion and content for video generation[C], 1526-1535(2018).

    [126] Yan W, Zhang Y Z, Abbeel P et al. VideoGPT: video generation using VQ-VAE and transformers[EB/OL]. https://arxiv.org/abs/2104.10157

    [127] Tian Y, Ren J, Chai M L et al. A good image generator is what you need for high-resolution video synthesis[EB/OL]. https://arxiv.org/abs/2104.15069

    [128] Yu S, Tack J, Mo S et al. Generating videos with dynamics-aware implicit generative adversarial networks[EB/OL]. https://arxiv.org/abs/2202.10571