More Text, Image or Video to Video from the comfort of your own home

You still just need a 4090 with 24GB of VRAM (less if you're patient)

AI-generated video of the distracted boyfriend meme, created in ComfyUI using CogVideoX

Another day and another model release

CogVideoX released version 1.5 of its model a couple of days ago.

The authors say:

The CogVideoX model is designed to generate high-quality videos based on detailed and highly descriptive prompts. The model performs best when provided with refined, granular prompts, which enhance the quality of video generation. It can handle both text-to-video (t2v) and image-to-video (i2v) conversions.

They recommend using multimodal LLMs to help formulate the descriptive video prompts. I haven't tried that yet, but I will experiment by creating a workflow with Florence-2.
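Until a captioning model is wired in, a hand-rolled template can get you partway toward the granular prompts the authors recommend. Here's a minimal sketch; the function name, field names, and wording are my own illustration, not anything from the CogVideoX docs:

```python
def build_video_prompt(subject: str, action: str, setting: str,
                       style: str = "cinematic, detailed") -> str:
    """Expand a short idea into the longer, more descriptive prompt
    style that CogVideoX reportedly responds best to.

    All field names and boilerplate phrasing here are illustrative.
    """
    return (
        f"A {style} video of {subject} {action} in {setting}. "
        "The camera holds a steady medium shot; lighting is natural "
        "and motion is smooth and continuous."
    )

prompt = build_video_prompt(
    subject="a man in a plaid shirt",
    action="turning to look over his shoulder",
    setting="a busy city street",
)
print(prompt)
```

A multimodal LLM would fill the same role more flexibly, expanding a one-line idea (or an input image) into this kind of detailed description automatically.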

My early testing

As you can see in the example above, which uses an image-to-video workflow built on the CogvideoXWrapper nodes in ComfyUI, the model does a pretty good job of bringing an image to life.

Resolution

Version 1.5 supports higher resolutions than the first release, but I have stuck with 480 x 720 for my tests.

Speed

CogVideoX is comparable to Mochi 1 on a beefy consumer GPU. It takes about 3 minutes to render 49 frames of video at 25 generation steps, and time increases with both frame count and step count.
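If you assume render time scales roughly linearly with frames times steps (a back-of-the-envelope model of my own, not a benchmark), you can estimate run times from that 49-frame / 25-step baseline:

```python
# Observed baseline from my own tests: ~3 minutes for 49 frames at 25 steps.
BASELINE_SECONDS = 180
BASELINE_FRAMES = 49
BASELINE_STEPS = 25

def estimate_render_seconds(frames: int, steps: int) -> float:
    """Rough estimate assuming cost is linear in frames * steps.
    Real runs will deviate (VRAM pressure, offloading, etc.)."""
    per_unit = BASELINE_SECONDS / (BASELINE_FRAMES * BASELINE_STEPS)
    return frames * steps * per_unit

# Doubling the frame count at the same step count doubles the estimate.
print(round(estimate_render_seconds(98, 25)))  # ~360 seconds
```

Treat the output as a planning aid only; in practice, longer clips can slow down more than linearly once you start offloading to system RAM.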

What about other video generation workflows?

As mentioned in the AI Video generation on consumer hardware post, there are several workflows available: video-to-video with ControlNet, Prompt Travel to guide generations, and character sheets to keep characters consistent. But you've always had to fight against artifacts and unexpected surprises on any given frame. At this stage, it's not an either/or situation; you can combine all of these techniques to create longer-form storytelling with these tools.