Testing Local AI Video Generation: A Short Film Experiment with Wan2.2 5B
Introduction
I've been curious about recent advances in AI video generation, particularly local models like Wan2.2 5B, so I decided to create a short film to test its capabilities and document the technical process. The result is an 89-second anime-style narrative about human-AI collaboration, generated entirely with local AI video models.
The Experiment
Model Choice: Hardware Constraints Matter
The first challenge was selecting a model that fits within the available GPU memory (a sample invocation follows the list):
- My setup: Wan-AI/Wan2.2-TI2V-5B model (5 billion parameters)
- For larger GPUs: Wan-AI/Wan2.2-T2V-A14B (a mixture-of-experts model with 14 billion active parameters)
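For reference, here is roughly how a single clip is generated with the reference generate.py script from the official Wan2.2 repository. This is a minimal sketch based on the upstream README; flag names may change between releases, and the offloading options trade speed for lower VRAM use:

```bash
# One-time setup: fetch the code and the 5B checkpoint.
git clone https://github.com/Wan-Video/Wan2.2 && cd Wan2.2
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B

# Generate one clip at the 1280x704 resolution used for the film.
# --offload_model and --t5_cpu keep VRAM usage within consumer-GPU limits.
python generate.py \
  --task ti2v-5B \
  --size "1280*704" \
  --ckpt_dir ./Wan2.2-TI2V-5B \
  --offload_model True \
  --t5_cpu \
  --prompt "Anime style, young software engineer coding on a bustling street"
```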
The Creative Process
Story Concept: "Human + AI Ascension"
The narrative follows three characters from the year 2000 to 2025:
- Alex: A software engineer (brown hair, green hoodie)
- Sam: A physicist (curly red hair, lab coat)
- Maya: A designer/artist (brown hair, creating beautiful hand-drawn illustrations)
The story arc moves from their early careers, through concerns about AI displacement, to ultimately finding collaborative success with AI systems.
Technical Implementation
- Scene Generation: 14 distinct scenes, each with carefully crafted prompts
- Style Consistency: Anime aesthetic throughout to maintain visual coherence
- Prompt Engineering: Each scene's prompt included specific details about:
  - character appearance and continuity
  - dynamic camera movements
  - background activity and atmosphere
  - lighting transitions
  - color schemes
Sample Prompts
Here are a few examples of the detailed prompts used:
Scene 1 - Young Software Engineer:
"Anime style, dynamic scene with camera zooming in, young software engineer Alex with messy brown hair and green hoodie coding enthusiastically on laptop, bustling urban street with people walking past, food trucks, street vendors, dynamic background activity, multiple moving elements, fingers flying over keyboard, vibrant blues and greens, bright energetic lighting, fluid camera movement"
Scene 14 - Jubilant Celebration:
"Anime style, massive jubilant crowds celebrating in city squares, anime-style characters of all ages cheering and raising hands in victory, confetti falling, no screens or text visible, diverse anime crowds celebrating human-AI collaboration successes through pure celebration and joy, brilliant golden celebratory lighting, festival atmosphere, sense of collective triumph and hope for the future, pure anime art style with no digital displays, emphasize anime character designs"
Technical Pipeline
Video Production Workflow
- Generation: Created 10 iterations of each of the 14 scenes (140 videos total; see the seed-loop sketch below)
- Selection: Chose the best take of each scene
- Text Overlays: Used FFmpeg's drawtext filter to render the title cards:
```bash
ffmpeg -i black.mp4 \
  -vf "drawtext=text='Meet Alex - the coder':fontcolor=white:fontsize=60:x=(w-text_w)/2:y=(h-text_h)/2:enable='between(t,0,2)'" \
  meet_alex.mp4
```
- Assembly: Concatenated the selected scenes with FFmpeg's concat demuxer (see the sketch below)
- Thumbnail Generation: Created a preview grid for review:
```bash
ffmpeg -i output.mp4 -vf "fps=1/2,scale=256:141,tile=8x6" thumbnail_grid.png
```
The thumbnail grid proved invaluable for quality assessment: uploading it to a large multimodal LLM helped identify the best takes and spot consistency issues across the narrative. Sketches of the scriptable steps in this pipeline follow.
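To script the generation step, a simple loop over seeds per scene works. This is a hypothetical sketch: the prompts/ directory layout, the output naming, and the --base_seed/--save_file flags are assumptions modeled on the upstream CLI, so verify them against your checkout:

```bash
# Hypothetical batch driver: 10 seeds for each scene prompt file.
# Flag names (--base_seed, --save_file) are assumptions; confirm with
# `python generate.py --help` in your checkout.
mkdir -p out
for scene in prompts/scene_*.txt; do
  name=$(basename "$scene" .txt)
  for seed in $(seq 1 10); do
    python generate.py \
      --task ti2v-5B \
      --size "1280*704" \
      --ckpt_dir ./Wan2.2-TI2V-5B \
      --base_seed "$seed" \
      --prompt "$(cat "$scene")" \
      --save_file "out/${name}_seed${seed}.mp4"
  done
done
```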
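For the assembly step, FFmpeg's concat demuxer joins the selected clips without re-encoding, provided every clip shares the same codec, resolution, and frame rate (scenes.txt here is a hypothetical file list):

```bash
# scenes.txt lists the selected clips in playback order, e.g.:
#   file 'meet_alex.mp4'
#   file 'scene_01_seed3.mp4'
# -c copy avoids re-encoding; all clips must share codec settings.
ffmpeg -f concat -safe 0 -i scenes.txt -c copy final.mp4
```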
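The grid review can also be automated. The sketch below assumes an OpenAI-compatible vision endpoint and model name; neither is specified in the original workflow, where the grid was uploaded by hand:

```bash
# Hypothetical automation: send the 8x6 grid to a multimodal model
# and ask for per-tile quality notes. Endpoint and model are assumptions.
B64=$(base64 -w0 thumbnail_grid.png)
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text",
       "text": "This is an 8x6 grid of frames from an anime short. Flag tiles with character-consistency problems or artifacts."},
      {"type": "image_url",
       "image_url": {"url": "data:image/png;base64,${B64}"}}
    ]
  }]
}
EOF
```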
Quality Observations
Character Consistency: Maintaining character appearance across scenes proved challenging. Anime style helped reduce artifacts compared to photorealistic approaches.
Motion Quality: The model handled dynamic scenes well, with convincing camera movements and background activity.
Artifacts: While present, artifacts were minimized by the anime aesthetic and careful prompt engineering.
Results
Final Output
- Duration: 89 seconds
- Resolution: 1280x704
- Format: MP4 (H.264)
- Audio: None (visual-only experiment)
Video Preview
All scenes were generated locally with the Wan-AI/Wan2.2-TI2V-5B model.