Testing Local AI Video Generation: A Short Film Experiment with Wan2.2 5B
Introduction
I've been curious about recent advances in AI video generation, particularly local models like Wan2.2 5B, so I decided to create a short film to test its capabilities and document the technical process. The result is an 89-second anime-style narrative about human-AI collaboration, generated entirely with local AI video models.
The Experiment
Model Choice: Hardware Constraints Matter
The first challenge was selecting a model that fits within the available GPU memory (a sample invocation follows the list):
- My setup: Wan-AI/Wan2.2-TI2V-5B model (5 billion parameters)
- For larger GPUs: Wan-AI/Wan2.2-T2V-A14B (a mixture-of-experts model with 14 billion active parameters)
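For reference, here is roughly how a single clip is generated with the reference generate.py script from the official Wan2.2 repository. This is a minimal sketch based on the upstream README; flag names may change between releases, and the offloading options trade speed for lower VRAM use:

```bash
# One-time setup: fetch the code and the 5B checkpoint.
git clone https://github.com/Wan-Video/Wan2.2 && cd Wan2.2
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B

# Generate one clip at the 1280x704 resolution used for the film.
# --offload_model and --t5_cpu keep VRAM usage within consumer-GPU limits.
python generate.py \
  --task ti2v-5B \
  --size "1280*704" \
  --ckpt_dir ./Wan2.2-TI2V-5B \
  --offload_model True \
  --t5_cpu \
  --prompt "Anime style, young software engineer coding on a bustling street"
```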
The Creative Process
Story Concept: "Human + AI Ascension"
The narrative follows three characters from the year 2000 to 2025:
- Alex: A software engineer (brown hair, green hoodie)
- Sam: A physicist (curly red hair, lab coat)
- Maya: A designer/artist (brown hair, creating beautiful hand-drawn illustrations)
The story arc moves from their early careers, through concerns about AI displacement, to ultimately finding collaborative success with AI systems.
Technical Implementation
- Scene Generation: 14 distinct scenes, each with carefully crafted prompts
- Style Consistency: Anime aesthetic throughout to maintain visual coherence
- Prompt Engineering: Each scene's prompt included specific details about:
  - character appearance and continuity
  - dynamic camera movements
  - background activity and atmosphere
  - lighting transitions
  - color schemes
Sample Prompts
Here are a few examples of the detailed prompts used:
Scene 1 - Young Software Engineer:
"Anime style, dynamic scene with camera zooming in, young software engineer Alex with messy brown hair and green hoodie coding enthusiastically on laptop, bustling urban street with people walking past, food trucks, street vendors, dynamic background activity, multiple moving elements, fingers flying over keyboard, vibrant blues and greens, bright energetic lighting, fluid camera movement"
Scene 14 - Jubilant Celebration:
"Anime style, massive jubilant crowds celebrating in city squares, anime-style characters of all ages cheering and raising hands in victory, confetti falling, no screens or text visible, diverse anime crowds celebrating human-AI collaboration successes through pure celebration and joy, brilliant golden celebratory lighting, festival atmosphere, sense of collective triumph and hope for the future, pure anime art style with no digital displays, emphasize anime character designs"
Technical Pipeline
Video Production Workflow
- Generation: Created 10 iterations of each of the 14 scenes (140 videos total; see the seed-loop sketch below)
- Selection: Chose the best take of each scene
- Text Overlays: Used FFmpeg's drawtext filter to render the title cards:
```bash
ffmpeg -i black.mp4 \
  -vf "drawtext=text='Meet Alex - the coder':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:enable='between(t,0,2)'" \
  meet_alex.mp4
```
- Assembly: Concatenated the selected scenes with FFmpeg's concat demuxer (see the sketch below)
- Thumbnail Generation: Created a preview grid for review:
```bash
ffmpeg -i output.mp4 -vf "fps=1/2,scale=256:141,tile=8x6" thumbnail_grid.png
```
The thumbnail grid proved invaluable for quality assessment: uploading it to a large multimodal LLM helped identify the best takes and spot consistency issues across the narrative. Sketches of the scriptable steps in this pipeline follow.
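To script the generation step, a simple loop over seeds per scene works. This is a hypothetical sketch: the prompts/ directory layout, the output naming, and the --base_seed/--save_file flags are assumptions modeled on the upstream CLI, so verify them against your checkout:

```bash
# Hypothetical batch driver: 10 seeds for each scene prompt file.
# Flag names (--base_seed, --save_file) are assumptions; confirm with
# `python generate.py --help` in your checkout.
mkdir -p out
for scene in prompts/scene_*.txt; do
  name=$(basename "$scene" .txt)
  for seed in $(seq 1 10); do
    python generate.py \
      --task ti2v-5B \
      --size "1280*704" \
      --ckpt_dir ./Wan2.2-TI2V-5B \
      --base_seed "$seed" \
      --prompt "$(cat "$scene")" \
      --save_file "out/${name}_seed${seed}.mp4"
  done
done
```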
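For the assembly step, FFmpeg's concat demuxer joins the selected clips without re-encoding, provided every clip shares the same codec, resolution, and frame rate (scenes.txt here is a hypothetical file list):

```bash
# scenes.txt lists the selected clips in playback order, e.g.:
#   file 'meet_alex.mp4'
#   file 'scene_01_seed3.mp4'
# -c copy avoids re-encoding; all clips must share codec settings.
ffmpeg -f concat -safe 0 -i scenes.txt -c copy final.mp4
```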
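The grid review can also be automated. The sketch below assumes an OpenAI-compatible vision endpoint and model name; neither is specified in the original workflow, where the grid was uploaded by hand:

```bash
# Hypothetical automation: send the 8x6 grid to a multimodal model
# and ask for per-tile quality notes. Endpoint and model are assumptions.
B64=$(base64 -w0 thumbnail_grid.png)
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text",
       "text": "This is an 8x6 grid of frames from an anime short. Flag tiles with character-consistency problems or artifacts."},
      {"type": "image_url",
       "image_url": {"url": "data:image/png;base64,${B64}"}}
    ]
  }]
}
EOF
```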
Quality Observations
Character Consistency: Maintaining character appearance across scenes proved challenging. Anime style helped reduce artifacts compared to photorealistic approaches.
Motion Quality: The model handled dynamic scenes well, with convincing camera movements and background activity.
Artifacts: While present, artifacts were minimized by the anime aesthetic and careful prompt engineering.
Results
Final Output
- Duration: 89 seconds
- Resolution: 1280x704
- Format: MP4 (H.264)
- Audio: None (visual-only experiment)
Video Preview
All scenes were generated locally with the Wan-AI/Wan2.2-TI2V-5B model.