• Acta Optica Sinica
  • Vol. 43, Issue 15, 1510002 (2023)
Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China
    DOI: 10.3788/AOS230758
    Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002
    Fig. 1. Overview of generative adversarial network (GAN) principle
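    Note: the GAN principle sketched in Fig. 1 is usually formalized as a minimax game between a generator G and a discriminator D; in standard textbook notation (not necessarily that of the figure),
    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],
    where D is trained to separate real from generated samples and G is trained to fool D.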
    Fig. 2. Overview of variational auto-encoder (VAE)
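    Note: the VAE in Fig. 2 is trained by maximizing the evidence lower bound (ELBO); in standard notation (not necessarily that of the figure),
    \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right),
    where q_\phi is the encoder, p_\theta the decoder, and p(z) a simple prior such as \mathcal{N}(0, I).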
    Fig. 3. Overview of flow-based generative model
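    Note: the flow-based model in Fig. 3 maps data x to a latent z = f(x) through an invertible network and is trained by exact maximum likelihood via the change-of-variables formula (standard formulation),
    \log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|.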
    Fig. 4. Overview of diffusion model
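    Note: the diffusion model in Fig. 4 corrupts data with Gaussian noise in a forward process and learns the reverse denoising process; in the common DDPM parameterization (standard notation, not necessarily the figure's),
    q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right), \qquad L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right],
    where x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon and \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).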
    Fig. 5. Comparison of different generative models
    Fig. 6. Overview of image and video generation models
    Fig. 7. Class-conditioned image generation[95]
    Fig. 8. Text-conditioned image generation[96]
    Fig. 9. Text-to-image generation results of Stable Diffusion[97]
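    Note: as an illustration of how results like those in Fig. 9 are obtained in practice, the following minimal sketch runs text-to-image generation with the Hugging Face diffusers library; the checkpoint name, prompt, and sampling settings are placeholders and may differ from the setup of Ref. [97].

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a publicly released Stable Diffusion checkpoint (model ID is an assumption, not from the paper).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # The text prompt conditions the latent diffusion model through its text encoder.
    prompt = "a photograph of an astronaut riding a horse"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("astronaut.png")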
    Fig. 10. Weight-tuning-based image customization[102]
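    Note: weight-tuning-based customization, as in Fig. 10, typically fine-tunes the denoising network on a few images of the target concept while adding a class-prior preservation term against overfitting; a DreamBooth-style objective (an assumption about Ref. [102], whose exact formulation may differ) is
    L = \mathbb{E}_{x, c, \epsilon, t}\!\left[\| \epsilon - \epsilon_\theta(x_t, t, c) \|^2\right] + \lambda\, \mathbb{E}_{x', c_{\mathrm{pr}}, \epsilon', t'}\!\left[\| \epsilon' - \epsilon_\theta(x'_{t'}, t', c_{\mathrm{pr}}) \|^2\right],
    where c is a prompt containing a rare identifier token for the new concept and c_{\mathrm{pr}} is a generic prompt for its class.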
    Fig. 11. Token-learning-based image customization[103]
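    Note: token-learning-based customization, as in Fig. 11, keeps the diffusion model frozen and optimizes only the embedding v of a new pseudo-token representing the concept; a textual-inversion-style objective (an assumption about Ref. [103]) is
    v^* = \arg\min_{v} \mathbb{E}_{z, y, \epsilon, t}\!\left[\| \epsilon - \epsilon_\theta(z_t, t, c_\phi(y; v)) \|^2\right],
    where c_\phi(y; v) denotes the text encoding of a prompt y whose placeholder token uses the learned embedding v.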
    Fig. 12. Image customization with multi-concept composition[104]
    Fig. 13. Mask-region-based text-to-image editing[105]
    Fig. 14. Prompt-editing-based text-to-image editing[106]
    Fig. 15. Embedding-interpolation-based text-to-image editing[107]
    Fig. 16. Generated videos of Imagen Video[108]
    Fig. 17. VIDM framework, which uses two diffusion models to generate video content and motion information, respectively[111]
    Fig. 18. PVDM framework, which represents a video as three two-dimensional latent variables and can thus be trained with a two-dimensional diffusion model[112]
    Fig. 19. Text-to-video generation results of Make-A-Video[113]
    Fig. 20. VideoFusion framework, which uses a pre-trained text-to-image diffusion model to generate the base frame and trains a residual noise generator on video data[114]
    Fig. 21. Video editing based on input image or text prompt[115]
    Fig. 22. One-shot text-to-video generation[116]
    Method         FID↓
    LAFITE[119]    26.94
    DALL·E[120]    17.89
    LDM[100]       12.63
    GLIDE[96]      12.24
    DALL·E 2[97]   10.39
    Imagen[98]      7.27
    Table 1. FID comparison of different text-to-image pre-trained models on MS-COCO dataset
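    Note: FID (Fréchet inception distance) compares the Inception-v3 feature statistics of generated and real images, so lower is better; the standard definition (not specific to this paper) is
    \mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),
    where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature mean and covariance of real and generated images, respectively.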
    Method            FVD↓ (Sky Time-lapse, 256×256)    FVD↓ (UCF-101, 256×256)    FID↓ (UCF-101, 128×128)    IS↑ (UCF-101, 128×128)
    MoCoGAN[125]      206.6                             1821.4                     -                          12.42
    VideoGPT[126]     222.7                             2880.6                     -                          24.69
    MoCoGAN-HD[127]   164.1                             1729.6                     838                        32.36
    DIGAN[128]        83.1                              471.9                      655                        29.71
    VIDM[111]         57.4                              294.7                      306                        53.34
    PVDM[112]         55.4                              343.6                      -                          74.40
    Table 2. Performance comparison of different class-to-video generation methods on UCF-101 dataset and Sky Time-lapse dataset
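    Note: FVD (Fréchet video distance) applies the same Fréchet-distance formulation as FID to spatio-temporal features from a pretrained video network (typically I3D), so lower is better; IS (inception score) is higher-is-better and is standardly defined as
    \mathrm{IS} = \exp\!\left(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\!\left(p(y \mid x) \,\|\, p(y)\right)\right),
    where p(y|x) is a pretrained classifier's label distribution for a generated sample x and p(y) its marginal over generated samples.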