• Acta Optica Sinica
  • Vol. 43, Issue 15, 1510002 (2023)
Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China
    DOI: 10.3788/AOS230758
    Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002
    References

    [1] Müller V C, Bostrom N. Future progress in artificial intelligence: a survey of expert opinion[M]. Müller V C. Fundamental issues of artificial intelligence. Synthese library, 376, 555-572(2016).

    [2] Došilović F K, Brčić M, Hlupić N. Explainable artificial intelligence: a survey[C], 210-215(2018).

    [3] Lu Y. Artificial intelligence: a survey on evolution, models, applications and future trends[J]. Journal of Management Analytics, 6, 1-29(2019).

    [4] Henry W P. Artificial intelligence[M](1984).

    [5] Huang C X, Wang G R, Zhou Z B et al. Reward-adaptive reinforcement learning: dynamic policy gradient optimization for bipedal locomotion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 7686-7695(2023).

    [6] Huang C X, Zhang R H, Ouyang M Z et al. Deductive reinforcement learning for visual autonomous urban driving navigation[J]. IEEE Transactions on Neural Networks and Learning Systems, 32, 5379-5391(2021).

    [7] Wu J, Li G B, Han X G et al. Reinforcement learning for weakly supervised temporal grounding of natural language in untrimmed videos[C], 1283-1291(2020).

    [8] Xie S R, Huang J N, Lei L X et al. NADPEx: an on-policy temporally consistent exploration method for deep reinforcement learning[EB/OL]. https://arxiv.org/abs/1812.09028

    [9] Garland M, le Grand S, Nickolls J et al. Parallel computing experiences with CUDA[J]. IEEE Micro, 28, 13-27(2008).

    [10] Kalaiselvi T, Sriramakrishnan P, Somasundaram K. Survey of using GPU CUDA programming model in medical image analysis[J]. Informatics in Medicine Unlocked, 9, 133-144(2017).

    [11] Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library[EB/OL]. https://arxiv.org/abs/1912.01703

    [12] Abadi M. TensorFlow: learning functions at scale[C](2016).

    [13] Wang G R, Lin L, Chen R C et al. Joint learning of neural transfer and architecture adaptation for image recognition[J]. IEEE Transactions on Neural Networks and Learning Systems, 33, 5401-5415(2022).

    [14] Chen T S, Lin L, Chen R Q et al. Knowledge-guided multi-label few-shot learning for general image recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 1371-1384(2022).

    [15] Wang K Z, Zhang D Y, Li Y et al. Cost-effective active learning for deep image classification[J]. IEEE Transactions on Circuits and Systems for Video Technology, 27, 2591-2600(2017).

    [16] Wang X L, Lin L, Huang L C et al. Incorporating structural alternatives and sharing into hierarchy for multiclass object recognition and detection[C], 3334-3341(2013).

    [17] Wang K Z, Lin L, Zuo W M et al. Dictionary pair classifier driven convolutional neural networks for object detection[C], 2138-2146(2016).

    [18] Wang K Z, Yan X P, Zhang D Y et al. Towards human-machine cooperation: self-supervised sample mining for object detection[C], 1605-1613(2018).

    [19] Jiang C H, Xu H, Liang X D et al. Hybrid knowledge routed modules for large-scale object detection[EB/OL]. https://arxiv.org/abs/1810.12681

    [20] Xu H, Jiang C H, Liang X D et al. Reasoning-RCNN: unifying adaptive global reasoning into large-scale object detection[C], 6412-6421(2020).

    [21] Yang B B, Deng X C, Shi H et al. Continual object detection via prototypical task correlation guided gating mechanism[C], 9245-9254(2022).

    [22] Wu Y X, Zhang G W, Xu H et al. Auto-Panoptic: cooperative multi-component architecture search for panoptic segmentation[EB/OL]. https://arxiv.org/abs/2010.16119

    [23] Wu Y X, Zhang G W, Gao Y M et al. Bidirectional graph reasoning network for panoptic segmentation[C], 9077-9086(2020).

    [24] Yang J H, Xu R J, Li R Y et al. An adversarial perturbation oriented domain adaptation approach for semantic segmentation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 12613-12620(2020).

    [25] Jordan M I, Mitchell T M. Machine learning: trends, perspectives, and prospects[J]. Science, 349, 255-260(2015).

    [26] He K M, Zhang X Y, Ren S Q et al. Deep residual learning for image recognition[C], 770-778(2016).

    [27] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. https://arxiv.org/abs/1409.1556

    [28] Girshick R. Fast R-CNN[C], 1440-1448(2016).

    [29] Ren S Q, He K M, Girshick R B et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149(2015).

    [30] Strumbelj E, Kononenko I. An efficient explanation of individual classifications using game theory[J]. Journal of Machine Learning Research, 11, 1-18(2010).

    [31] Baehrens D, Schroeter T, Harmeling S et al. How to explain individual classification decisions[J]. Journal of Machine Learning Research, 11, 1803-1831(2010).

    [32] Oussidi A, Elhassouny A. Deep generative models: survey[C](2018).

    [33] Pan Z Q, Yu W J, Yi X K et al. Recent progress on generative adversarial networks (GANs): a survey[J]. IEEE Access, 7, 36322-36333(2019).

    [34] Harshvardhan G M, Gourisaria M K, Pandey M et al. A comprehensive survey and analysis of generative models in machine learning[J]. Computer Science Review, 38, 100285(2020).

    [35] Cao H Q, Tan C, Gao Z Y et al. A survey on generative diffusion model[EB/OL]. https://arxiv.org/abs/2209.02646

    [36] Wiatrak M, Albrecht S V, Nystrom A. Stabilizing generative adversarial networks: a survey[EB/OL]. https://arxiv.org/abs/1910.00927

    [37] Wang K F, Gou C, Duan Y J et al. Generative adversarial networks: introduction and outlook[J]. IEEE/CAA Journal of Automatica Sinica, 4, 588-598(2017).

    [38] He K M, Chen X L, Xie S N et al. Masked autoencoders are scalable vision learners[C], 15979-15988(2022).

    [39] Wang G R, Tang Y S, Lin L et al. Semantic-aware auto-encoders for self-supervised representation learning[C], 9654-9665(2022).

    [40] Guo Z Y, Zhang R R, Qiu L T et al. Joint-MAE: 2D-3D joint masked autoencoders for 3D point cloud pre-training[EB/OL]. https://arxiv.org/abs/2302.14007

    [41] Tan Q Y, Liu N H, Huang X et al. S2GAE: self-supervised graph autoencoders are generalizable learners with graph masking[C], 787-795(2023).

    [42] Bao H B, Dong L, Piao S H et al. BEiT: BERT pre-training of image transformers[EB/OL]. https://arxiv.org/abs/2106.08254

    [43] Xia L H, Huang C, Huang C Z et al. Automated self-supervised learning for recommendation[EB/OL]. https://arxiv.org/abs/2303.07797

    [44] Chen A, Zhang K, Zhang R R et al. PiMAE: point cloud and image interactive masked autoencoders for 3D object detection[EB/OL]. https://arxiv.org/abs/2303.08129

    [45] Ren S C, Wei F Y, Albanie S et al. DeepMIM: deep supervision for masked image modeling[EB/OL]. https://arxiv.org/abs/2303.08817

    [46] Wei C, Fan H Q, Xie S N et al. Masked feature prediction for self-supervised visual pre-training[C], 14648-14658(2022).

    [47] He K M, Fan H Q, Wu Y X et al. Momentum contrast for unsupervised visual representation learning[C], 9726-9735(2020).

    [48] Chen X L, Fan H Q, Girshick R et al. Improved baselines with momentum contrastive learning[EB/OL]. https://arxiv.org/abs/2003.04297

    [49] Chen X L, Xie S N, He K M. An empirical study of training self-supervised vision transformers[C], 9620-9629(2022).

    [50] Chen X L, He K M. Exploring simple Siamese representation learning[C], 15745-15753(2021).

    [51] Grill J B, Strub F, Altché F et al. Bootstrap your own latent: a new approach to self-supervised learning[EB/OL]. https://arxiv.org/abs/2006.07733

    [52] Cheng Z Z, Yang Q X, Sheng B. Deep colorization[C], 415-423(2016).

    [53] Xiao Y, Zhou P Y, Zheng Y et al. Interactive deep colorization using simultaneous global and local inputs[C], 1887-1891(2019).

    [54] Zhang R, Isola P, Efros A A. Colorful image colorization[M]. Leibe B, Matas J, Sebe N, et al. Computer vision–ECCV 2016. Lecture notes in computer science, 9907, 649-666(2016).

    [55] Larsson G, Maire M, Shakhnarovich G. Learning representations for automatic colorization[M]. Leibe B, Matas J, Sebe N, et al. Computer vision–ECCV 2016. Lecture notes in computer science, 9908, 577-593(2016).

    [56] Zhang R, Zhu J Y, Isola P et al. Real-time user-guided image colorization with learned deep priors[EB/OL]. https://arxiv.org/abs/1705.02999

    [57] He M M, Chen D D, Liao J et al. Deep exemplar-based colorization[J]. ACM Transactions on Graphics, 37, 1-16(2018).

    [58] Ledig C, Theis L, Huszár F et al. Photo-realistic single image super-resolution using a generative adversarial network[C], 105-114(2017).

    [59] Sharif S M A, Ali Naqvi R, Ali F et al. DarkDeblur: learning single-shot image deblurring in low-light condition[J]. Expert Systems With Applications, 222, 119739(2023).

    [60] Li B C, Li X, Lu Y T et al. HST: hierarchical Swin Transformer for compressed image super-resolution[M]. Karlinsky L, Michaeli T, Nishino K. Computer vision–ECCV 2022 workshops. Lecture notes in computer science, 13802(2022).

    [61] Liang J Y, Cao J Z, Sun G L et al. SwinIR: image restoration using Swin Transformer[C], 1833-1844(2021).

    [62] Zamir S W, Arora A, Khan S et al. Multi-stage progressive image restoration[C], 14816-14826(2021).

    [63] Yang F Z, Yang H, Fu J L et al. Learning texture transformer network for image super-resolution[C], 5790-5799(2020).

    [64] Dai T, Cai J R, Zhang Y B et al. Second-order attention network for single image super-resolution[C], 11057-11066(2020).

    [65] Liu X, Wu Q Y, Zhou H et al. Audio-driven co-speech gesture video generation[EB/OL]. https://arxiv.org/abs/2212.02350

    [66] Cui R P, Cao Z, Pan W S et al. Deep gesture video generation with learning on regions of interest[J]. IEEE Transactions on Multimedia, 22, 2551-2563(2020).

    [67] Saunders B, Camgoz N C, Bowden R. AnonySign: novel human appearance synthesis for sign language video anonymisation[C](2022).

    [68] Natarajan B, Elakkiya R. Dynamic GAN for high-quality sign language video generation from skeletal poses using generative adversarial networks[J]. Soft Computing, 26, 13153-13175(2022).

    [69] Ferstl Y, Neff M, McDonnell R. Multi-objective adversarial gesture generation[C](2019).

    [70] Zeng D, Liu H, Lin H et al. Talking face generation with expression-tailored generative adversarial network[C], 1716-1724(2020).

    [71] Zhou H, Liu Y, Liu Z W et al. Talking face generation by adversarially disentangled audio-visual representation[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 9299-9306(2019).

    [72] Zhang B W, Qi C Y, Zhang P et al. MetaPortrait: identity-preserving talking head generation with fast personalized adaptation[EB/OL]. https://arxiv.org/abs/2212.08062

    [73] Zhou H, Sun Y S, Wu W et al. Pose-controllable talking face generation by implicitly modularized audio-visual representation[C], 4174-4184(2021).

    [74] Zeng D, Zhao S T, Zhang J J et al. Expression-tailored talking face generation with adaptive cross-modal weighting[J]. Neurocomputing, 511, 117-130(2022).

    [75] Chen L L, Maddox R K, Duan Z Y et al. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss[C], 7824-7833(2020).

    [76] Zhang Z M, Li L C, Ding Y et al. Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset[C], 3660-3669(2021).

    [77] Mildenhall B, Srinivasan P P, Tancik M et al. NeRF: representing scenes as neural radiance fields for view synthesis[M]. Vedaldi A, Bischof H, Brox T, et al. Computer vision–ECCV 2020. Lecture notes in computer science, 12346, 405-421(2020).

    [78] Park K, Sinha U, Barron J T et al. Nerfies: deformable neural radiance fields[C], 5845-5854(2022).

    [79] Niemeyer M, Geiger A. GIRAFFE: representing scenes as compositional generative neural feature fields[C], 11448-11459(2021).

    [80] Pumarola A, Corona E, Pons-Moll G et al. D-NeRF: neural radiance fields for dynamic scenes[C], 10313-10322(2021).

    [81] Martin-Brualla R, Radwan N, Sajjadi M S M et al. NeRF in the wild: neural radiance fields for unconstrained photo collections[C], 7206-7215(2021).

    [82] Chan E R, Monteiro M, Kellnhofer P et al. Pi-GAN: periodic implicit generative adversarial networks for 3D-aware image synthesis[C], 5795-5805(2021).

    [83] Chan E R, Lin C Z, Chan M A et al. Efficient geometry-aware 3D generative adversarial networks[C], 16102-16112(2022).

    [84] Li Z Q, Niklaus S, Snavely N et al. Neural scene flow fields for space-time view synthesis of dynamic scenes[C], 6494-6504(2021).

    [85] Oechsle M, Peng S Y, Geiger A. UNISURF: unifying neural implicit surfaces and radiance fields for multi-view reconstruction[C], 5569-5579(2022).

    [86] Weng L. What are diffusion models?[EB/OL]. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

    [87] Goodfellow I, Pouget-Abadie J, Mirza M et al. Generative adversarial nets[EB/OL]. https://arxiv.org/abs/1406.2661

    [88] Chen Z L, Yang Z F, Wang X X et al. Multivariate-information adversarial ensemble for scalable joint distribution matching[EB/OL]. https://arxiv.org/abs/1907.03426

    [89] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks[EB/OL]. https://arxiv.org/abs/1701.04862

    [90] Arjovsky M, Chintala S, Bottou L. Wasserstein GAN[EB/OL]. https://arxiv.org/abs/1701.07875

    [91] Kingma D P, Welling M. Auto-encoding variational Bayes[EB/OL]. https://arxiv.org/abs/1312.6114

    [92] Rezende D, Mohamed S. Variational inference with normalizing flows[EB/OL]. https://arxiv.org/abs/1505.05770

    [93] Dinh L, Sohl-Dickstein J, Bengio S. Density estimation using Real NVP[EB/OL]. https://arxiv.org/abs/1605.08803

    [94] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[EB/OL]. https://arxiv.org/abs/2006.11239

    [95] Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis[EB/OL]. https://arxiv.org/abs/2105.05233

    [96] Ramesh A, Dhariwal P, Nichol A et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. https://arxiv.org/abs/2204.06125

    [97] Rombach R, Blattmann A, Lorenz D et al. High-resolution image synthesis with latent diffusion models[C], 10674-10685(2022).

    [98] Nichol A Q, Dhariwal P, Ramesh A et al. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models[EB/OL]. https://arxiv.org/abs/2112.10741

    [99] Saharia C, Chan W, Saxena S et al. Photorealistic text-to-image diffusion models with deep language understanding[EB/OL]. https://arxiv.org/abs/2205.11487

    [100] Radford A, Kim J W, Hallacy C et al. Learning transferable visual models from natural language supervision[EB/OL]. https://arxiv.org/abs/2103.00020

    [101] van den Oord A, Vinyals O. Neural discrete representation learning[EB/OL]. https://arxiv.org/abs/1711.00937

    [102] Ruiz N, Li Y Z, Jampani V et al. DreamBooth: fine tuning text-to-image diffusion models for subject-driven generation[EB/OL]. https://arxiv.org/abs/2208.12242

    [103] Gal R, Alaluf Y, Atzmon Y et al. An image is worth one word: personalizing text-to-image generation using textual inversion[EB/OL]. https://arxiv.org/abs/2208.01618

    [104] Kumari N, Zhang B L, Zhang R et al. Multi-concept customization of text-to-image diffusion[EB/OL]. https://arxiv.org/abs/2212.04488

    [105] Avrahami O, Lischinski D, Fried O. Blended diffusion for text-driven editing of natural images[C], 18187-18197(2022).

    [106] Hertz A, Mokady R, Tenenbaum J et al. Prompt-to-prompt image editing with cross attention control[EB/OL]. https://arxiv.org/abs/2208.01626

    [107] Kawar B, Zada S, Lang O et al. Imagic: text-based real image editing with diffusion models[EB/OL]. https://arxiv.org/abs/2210.09276

    [108] Ho J, Chan W, Saharia C et al. Imagen video: high definition video generation with diffusion models[EB/OL]. https://arxiv.org/abs/2210.02303

    [109] Ho J, Salimans T, Gritsenko A A et al. Video diffusion models[EB/OL]. https://arxiv.org/abs/2204.03458

    [110] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[M]. Navab N, Hornegger J, Wells W M, et al. Medical image computing and computer-assisted intervention–MICCAI 2015. Lecture notes in computer science, 9351, 234-241(2015).

    [111] Mei K F, Patel V M. VIDM: video implicit diffusion models[EB/OL]. https://arxiv.org/abs/2212.00235

    [112] Yu S, Sohn K, Kim S et al. Video probabilistic diffusion models in projected latent space[EB/OL]. https://arxiv.org/abs/2302.07685

    [113] Singer U, Polyak A, Hayes T et al. Make-A-Video: text-to-video generation without text-video data[EB/OL]. https://arxiv.org/abs/2209.14792

    [114] Luo Z X, Chen D Y, Zhang Y Y et al. VideoFusion: decomposed diffusion models for high-quality video generation[EB/OL]. https://arxiv.org/abs/2303.08320

    [115] Esser P, Chiu J, Atighehchian P et al. Structure and content-guided video synthesis with diffusion models[EB/OL]. https://arxiv.org/abs/2302.03011

    [116] Wu J Z, Ge Y X, Wang X T et al. Tune-A-Video: one-shot tuning of image diffusion models for text-to-video generation[EB/OL]. https://arxiv.org/abs/2212.11565

    [117] Lin T Y, Maire M, Belongie S et al. Microsoft COCO: common objects in context[M]. Fleet D, Pajdla T, Schiele B, et al. Computer vision–ECCV 2014. Lecture notes in computer science, 8693, 740-755(2014).

    [118] Heusel M, Ramsauer H, Unterthiner T et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[EB/OL]. https://arxiv.org/abs/1706.08500

    [119] Zhou Y F, Zhang R Y, Chen C Y et al. LAFITE: towards language-free training for text-to-image generation[EB/OL]. https://arxiv.org/abs/2111.13792

    [120] Ramesh A, Pavlov M, Goh G et al. Zero-shot text-to-image generation[EB/OL]. https://arxiv.org/abs/2102.12092

    [121] Soomro K, Zamir A R, Shah M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. https://arxiv.org/abs/1212.0402

    [122] Xiong W, Luo W H, Ma L et al. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks[C], 2364-2373(2018).

    [123] Unterthiner T, van Steenkiste S, Kurach K et al. Towards accurate generative models of video: a new metric & challenges[EB/OL]. https://arxiv.org/abs/1812.01717

    [124] Salimans T, Goodfellow I, Zaremba W et al. Improved techniques for training GANs[EB/OL]. https://arxiv.org/abs/1606.03498

    [125] Tulyakov S, Liu M Y, Yang X D et al. MoCoGAN: decomposing motion and content for video generation[C], 1526-1535(2018).

    [126] Yan W, Zhang Y Z, Abbeel P et al. VideoGPT: video generation using VQ-VAE and transformers[EB/OL]. https://arxiv.org/abs/2104.10157

    [127] Tian Y, Ren J, Chai M L et al. A good image generator is what you need for high-resolution video synthesis[EB/OL]. https://arxiv.org/abs/2104.15069

    [128] Yu S, Tack J, Mo S et al. Generating videos with dynamics-aware implicit generative adversarial networks[EB/OL]. https://arxiv.org/abs/2202.10571