graph TD
    START[User Input: Avatar & Audio Files]

    subgraph AUDIO_SILO[Audio Processing Silo]
        A1[Input MP3]
        A2[Get Transcription with Timings]
        A3[Extract Clean Audio]
        A4[Phoneme Data Output]
    end

    subgraph IMAGE_SILO[Image Processing Silo]
        I1[Input Avatar PNG]
        I2[Resize to 1024x1024]
        I3[GPT Vision Analysis]
        I4[Feature Map Output]
    end

    subgraph VIDEO_SILO[Video Assembly Silo]
        V1[Combine Components]
        V2[Generate Raw Video]
        V3[Cut and Clean Video]
        V4[Final MP4 Output]
    end

    START --> A1
    START --> I1

    A1 --> A2
    A2 --> A4
    A1 --> A3
    A3 --> A4

    I1 --> I2
    I2 --> I3
    I3 --> I4

    A4 --> V1
    I4 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> V4

    V4 --> END[Final Talking Head Video]

    style AUDIO_SILO fill:#f9f,stroke:#333,stroke-width:2px
    style IMAGE_SILO fill:#bbf,stroke:#333,stroke-width:2px
    style VIDEO_SILO fill:#bfb,stroke:#333,stroke-width:2px