The Strongest Approach Yet for Visual Text Video Generation! Text-Animator Really Delivers! (USTC & Tencent et al.)

Published 2024-10-26

We place great value on original articles. To respect intellectual property and avoid potential copyright issues, we provide a summary here for your initial reference. For the full, more detailed content, please visit the author's official WeChat account page.

Summary of Text-Animator: Controllable Visual Text Video Generation


Text-Animator is an innovative approach for generating visual text in videos while maintaining the structural consistency of the generated textual content. It addresses a persistent challenge in Text-to-Video (T2V) generation: current methods struggle to render visual text directly within videos. Whereas traditional methods focus on summarizing scene semantics and depicting actions, Text-Animator emphasizes accurate depiction of visual text structure through a text embedding injection module. It further incorporates a camera control module and a text optimization module to stabilize the generated visual text across frames.

Main Contributions

  • Introduction of Text-Animator, a novel method for generating visual text in videos, marking a first attempt to address this specific problem.
  • Development of a text embedding injection module that precisely portrays the structure of visual text, along with a camera control and text optimization module for accurate motion control of the camera and visual text.
  • Experimental results demonstrating the superiority of Text-Animator in terms of the accuracy of generated visual text compared to existing methods.

Methodology

Text-Animator includes a text embedding injection module, camera control module, text glyph and position optimization module, and a 3D-UNet module. The approach utilizes a diffusion model framework for T2V generation, with a neural network trained to predict the noise added to a sequence of latent features. The model injects dual control signals during the denoising process to enhance video stability. Text-Animator also incorporates camera pose information to control text movement and ensure consistency with scene content.
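The underlying T2V diffusion objective described above (a network trained to predict the noise added to latent features) can be sketched minimally as follows. This is a generic epsilon-prediction formulation, not code from the paper; the function names and the NumPy stand-in for the latent tensors are illustrative assumptions.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion step (generic sketch, not the paper's code):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where x0 is a clean latent, eps is Gaussian noise, and
    alpha_bar_t is the cumulative noise schedule at timestep t."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoising_loss(eps_pred, eps):
    """Standard epsilon-prediction MSE objective: the denoiser
    (a 3D-UNet in Text-Animator's setting) is trained to recover
    the injected noise from the noisy latent."""
    return float(np.mean((eps_pred - eps) ** 2))
```

In Text-Animator's setting, the denoiser would additionally be conditioned on the dual control signals (text glyph/position and camera pose) during each denoising step; that conditioning is omitted here for brevity.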

Experiments

The method was implemented with AnimateDiffV3 as the base T2V model, with the camera control network and the text/position control network trained using datasets and procedures from prior work. Performance was evaluated on a subset of the LAION dataset using sentence accuracy, normalized edit distance (NED), Frechet Inception Distance (FID), prompt similarity, and frame similarity. Text-Animator showed significant improvements in generating accurate visual text and in keeping the text structure stable across frames.
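Of the metrics above, normalized edit distance (NED) is straightforward to illustrate: it is the Levenshtein distance between the recognized text and the ground truth, divided by the longer string's length. The sketch below is a generic implementation for illustration; the exact normalization used in the paper's evaluation is not specified here and may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))           # dp[j] = distance(a[:i], b[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i        # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def ned(pred: str, gt: str) -> float:
    """Normalized edit distance in [0, 1]; lower is better."""
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, `ned("OPEN", "OPEN")` is 0.0 (a perfect match), while a single-character error in a four-letter word yields 0.25.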

Conclusion

Text-Animator represents a significant step forward in integrating text elements into generated videos effectively. By introducing a dual control mechanism that synchronizes text animations with video movements, the method enhances the unity and coherence between text elements and video scenes. Extensive quantitative and visual experiments verify that Text-Animator outperforms existing T2V and hybrid T2I/I2V methods in terms of video quality and fidelity of text representation, paving the way for further exploration and innovation in multimedia content generation.

AI生成未来