Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002

Fig. 1. Overview of generative adversarial network (GAN) principle
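As a minimal illustration of the adversarial principle summarized in Fig. 1, the sketch below alternates one discriminator update and one generator update on toy 2-D data. The PyTorch setup, network sizes, and synthetic "real" batch are illustrative assumptions, not the models surveyed in this review.

```python
# Minimal GAN training step on toy 2-D data; a hedged sketch of the adversarial
# principle only -- the networks, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 8, 2, 64
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(batch, data_dim) + 3.0      # stand-in for real samples
fake = G(torch.randn(batch, latent_dim))       # generator maps noise to samples

# Discriminator update: push D(real) toward 1 and D(fake) toward 0.
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator update: fool the discriminator, i.e. push D(G(z)) toward 1.
g_loss = bce(D(fake), torch.ones(batch, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```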

Fig. 2. Overview of variational auto-encoder (VAE)
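For the VAE of Fig. 2, the two ingredients worth making concrete are the reparameterization trick and the ELBO loss (reconstruction term plus KL divergence to the prior). The sketch below shows both; the linear encoder/decoder and toy dimensions are placeholder assumptions.

```python
# Minimal VAE forward pass and negative-ELBO loss on toy vectors (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

data_dim, latent_dim = 16, 4
enc = nn.Linear(data_dim, 2 * latent_dim)   # outputs mean and log-variance
dec = nn.Linear(latent_dim, data_dim)

x = torch.rand(32, data_dim)                # stand-in batch
mu, logvar = enc(x).chunk(2, dim=-1)

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
x_hat = dec(z)

# Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))
recon = F.mse_loss(x_hat, x, reduction="sum")
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon + kl
loss.backward()
```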

Fig. 3. Overview of flow-based generative model
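The flow-based model of Fig. 3 rests on the change-of-variables identity log p(x) = log p(z) + log|det ∂z/∂x|. The sketch below applies it with a single element-wise affine bijection; real flows stack many coupling layers, so treat this purely as a minimal illustration.

```python
# Change-of-variables log-likelihood for a flow-based model; a minimal sketch
# using one element-wise affine bijection (real flows stack many such layers).
import math
import torch

data_dim = 4
s = torch.zeros(data_dim, requires_grad=True)   # learnable log-scales
b = torch.zeros(data_dim, requires_grad=True)   # learnable shifts

x = torch.randn(64, data_dim) * 2.0 + 1.0       # stand-in data batch

# Invertible map to the base space: z = (x - b) * exp(-s)
z = (x - b) * torch.exp(-s)
log_det = -s.sum()                              # log |det dz/dx| (same for every sample)

# log p(x) = log N(z; 0, I) + log |det dz/dx|
log_base = -0.5 * (z.pow(2) + math.log(2 * math.pi)).sum(dim=1)
nll = -(log_base + log_det).mean()
nll.backward()                                  # gradients flow to s and b
```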

Fig. 4. Overview of diffusion model
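For the diffusion model of Fig. 4, a standard DDPM-style training step consists of noising a clean sample at a random timestep via the closed-form forward process and regressing the injected noise. The sketch below shows this step; the noise schedule, timestep embedding, and tiny MLP are simplifying assumptions rather than the architectures discussed in the review.

```python
# DDPM-style training step: noise a clean sample at a random timestep with the
# closed-form forward process, then regress the injected noise.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, data_dim, batch = 1000, 16, 32
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative products, \bar{alpha}_t

eps_model = nn.Sequential(nn.Linear(data_dim + 1, 64), nn.ReLU(), nn.Linear(64, data_dim))

x0 = torch.rand(batch, data_dim)                      # stand-in clean samples
t = torch.randint(0, T, (batch,))
eps = torch.randn_like(x0)

# Forward (noising) process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
abar = alphas_bar[t].unsqueeze(1)
xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps

# Denoising objective: predict the injected noise from (x_t, t)
t_embed = (t.float() / T).unsqueeze(1)
loss = F.mse_loss(eps_model(torch.cat([xt, t_embed], dim=1)), eps)
loss.backward()
```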

Fig. 5. Comparison of different generative models

Fig. 6. Overview of image and video generation models
Fig. 7. Class-conditioned image generation[95]
Fig. 8. Text-conditioned image generation[96]
Fig. 9. Text-to-image generation results of Stable Diffusion[97]
Fig. 10. Weight-tuning-based image customization[102]
Fig. 11. Token-learning-based image customization[103]
Fig. 12. Image customization with multi-concept composition[104]
Fig. 13. Mask-region-based text-to-image editing[105]
Fig. 14. Prompt-editing-based text-to-image editing[106]
Fig. 15. Embedding-interpolation-based text-to-image editing[107]
Fig. 16. Generated videos of Imagen Video[108]
Fig. 17. VIDM framework, which uses two diffusion models to generate video content and motion information, respectively[111]
Fig. 18. PVDM framework, which represents a video as three two-dimensional latent variables so that a two-dimensional diffusion model can be used for training[112]
Fig. 19. Text-to-video generation results of Make-A-Video[113]
Fig. 20. VideoFusion framework, which uses a pre-trained text-to-image diffusion model to generate the base frame and trains a residual noise generator on video data[114]
Fig. 21. Video editing based on input image or text prompt[115]
Fig. 22. One-shot text-to-video generation[116]
Table 1. FID comparison of different text-to-image pre-trained models on the MS-COCO dataset
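For context on Table 1, the Fréchet inception distance (FID) measures the distance between Gaussian fits to the Inception features of real and generated images (lower is better). Its standard form, stated here for reference, is

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the feature means and covariances of the real and generated image sets, respectively.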
Table 2. Performance comparison of different class-conditioned video generation methods on the UCF-101 and Sky Time-lapse datasets
