• Acta Optica Sinica
  • Vol. 43, Issue 15, 1510002 (2023)
Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China
    DOI: 10.3788/AOS230758
    Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002
    Fig. 1. Overview of generative adversarial network (GAN) principle
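    Note: the GAN principle sketched in Fig. 1 is usually formalized as a minimax game between a generator G and a discriminator D; in standard textbook notation (not necessarily that of the figure),
    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],
    where D is trained to separate real from generated samples and G is trained to fool D.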
    Fig. 2. Overview of variational auto-encoder (VAE)
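    Note: the VAE in Fig. 2 is trained by maximizing the evidence lower bound (ELBO); in standard notation (not necessarily that of the figure),
    \log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}\!\left(q_\phi(z|x) \,\|\, p(z)\right),
    where q_\phi is the encoder, p_\theta the decoder, and p(z) a simple prior such as \mathcal{N}(0, I).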
    Fig. 3. Overview of flow-based generative model
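    Note: the flow-based model in Fig. 3 maps data x to a latent z = f(x) through an invertible network and is trained by exact maximum likelihood via the change-of-variables formula (standard formulation),
    \log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|.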
    Fig. 4. Overview of diffusion model
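    Note: the diffusion model in Fig. 4 corrupts data with Gaussian noise in a forward process and learns the reverse denoising process; in the common DDPM parameterization (standard notation, not necessarily the figure's),
    q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right), \qquad L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right],
    where x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon and \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).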
    Fig. 5. Comparison of different generative models
    Fig. 6. Overview of image and video generation models
    Fig. 7. Class-conditioned image generation[95]
    Fig. 8. Text-conditioned image generation[96]
    Fig. 9. Text-to-image generation results of Stable Diffusion[97]
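    Note: as an illustration of how results like those in Fig. 9 are obtained in practice, the following minimal sketch runs text-to-image generation with the Hugging Face diffusers library; the checkpoint name, prompt, and sampling settings are placeholders and may differ from the setup of Ref. [97].

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a publicly released Stable Diffusion checkpoint (model ID is an assumption, not from the paper).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # The text prompt conditions the latent diffusion model through its text encoder.
    prompt = "a photograph of an astronaut riding a horse"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("astronaut.png")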
    Fig. 10. Weight-tuning-based image customization[102]
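    Note: weight-tuning-based customization, as in Fig. 10, typically fine-tunes the denoising network on a few images of the target concept while adding a class-prior preservation term against overfitting; a DreamBooth-style objective (an assumption about Ref. [102], whose exact formulation may differ) is
    L = \mathbb{E}_{x, c, \epsilon, t}\!\left[\| \epsilon - \epsilon_\theta(x_t, t, c) \|^2\right] + \lambda\, \mathbb{E}_{x', c_{\mathrm{pr}}, \epsilon', t'}\!\left[\| \epsilon' - \epsilon_\theta(x'_{t'}, t', c_{\mathrm{pr}}) \|^2\right],
    where c is a prompt containing a rare identifier token for the new concept and c_{\mathrm{pr}} is a generic prompt for its class.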
    Fig. 11. Token-learning-based image customization[103]
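    Note: token-learning-based customization, as in Fig. 11, keeps the diffusion model frozen and optimizes only the embedding v of a new pseudo-token representing the concept; a textual-inversion-style objective (an assumption about Ref. [103]) is
    v^* = \arg\min_{v} \mathbb{E}_{z, y, \epsilon, t}\!\left[\| \epsilon - \epsilon_\theta(z_t, t, c_\phi(y; v)) \|^2\right],
    where c_\phi(y; v) denotes the text encoding of a prompt y whose placeholder token uses the learned embedding v.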
    Fig. 12. Image customization with multi-concept composition[104]
    Fig. 13. Mask-region-based text-to-image editing[105]
    Fig. 14. Prompt-editing-based text-to-image editing[106]
    Fig. 15. Embedding-interpolation-based text-to-image editing[107]
    Fig. 16. Generated videos of Imagen Video[108]
    Fig. 17. VIDM framework, which uses two diffusion models to generate video content and motion information, respectively[111]
    Fig. 18. PVDM framework, which represents a video as three two-dimensional latent variables and can thus be trained with a two-dimensional diffusion model[112]
    Fig. 19. Text-to-video generation results of Make-A-Video[113]
    Fig. 20. VideoFusion framework, which uses a pre-trained text-to-image diffusion model to generate the base frame and trains a residual noise generator on video data[114]
    Fig. 21. Video editing based on input image or text prompt[115]
    Fig. 22. One-shot text-to-video generation[116]
    Method         FID↓
    LAFITE[119]    26.94
    DALL·E[120]    17.89
    LDM[100]       12.63
    GLIDE[96]      12.24
    DALL·E 2[97]   10.39
    Imagen[98]      7.27
    Table 1. FID comparison of different text-to-image pre-trained models on MS-COCO dataset
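    Note: FID (Fréchet inception distance) compares the Inception-v3 feature statistics of generated and real images, so lower is better; the standard definition (not specific to this paper) is
    \mathrm{FID} = \| \mu_r - \mu_g \|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right),
    where (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) are the feature mean and covariance of real and generated images, respectively.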
    Method            FVD↓ (Sky Time-lapse, 256×256)    FVD↓ (UCF-101, 256×256)    FID↓ (UCF-101, 128×128)    IS↑ (UCF-101, 128×128)
    MoCoGAN[125]      206.6                             1821.4                     -                          12.42
    VideoGPT[126]     222.7                             2880.6                     -                          24.69
    MoCoGAN-HD[127]   164.1                             1729.6                     838                        32.36
    DIGAN[128]        83.1                              471.9                      655                        29.71
    VIDM[111]         57.4                              294.7                      306                        53.34
    PVDM[112]         55.4                              343.6                      -                          74.40
    Table 2. Performance comparison of different class-to-video generation methods on UCF-101 dataset and Sky Time-lapse dataset
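    Note: FVD (Fréchet video distance) applies the same Fréchet-distance formulation as FID to spatio-temporal features from a pretrained video network (typically I3D), so lower is better; IS (inception score) is higher-is-better and is standardly defined as
    \mathrm{IS} = \exp\!\left(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\!\left(p(y \mid x) \,\|\, p(y)\right)\right),
    where p(y|x) is a pretrained classifier's label distribution for a generated sample x and p(y) its marginal over generated samples.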