graph TD
START["User Input: Avatar & Audio Files"]
subgraph AUDIO_SILO[Audio Processing Silo]
A1[Input MP3]
A2[Get Transcription with Timings]
A3[Extract Clean Audio]
A4[Phoneme Data Output]
end
subgraph IMAGE_SILO[Image Processing Silo]
I1[Input Avatar PNG]
I2[Resize to 1024x1024]
I3[GPT Vision Analysis]
I4[Feature Map Output]
end
subgraph VIDEO_SILO[Video Assembly Silo]
V1[Combine Components]
V2[Generate Raw Video]
V3[Cut and Clean Video]
V4[Final MP4 Output]
end
START --> A1
START --> I1
A1 --> A2
A2 --> A4
A1 --> A3
A3 --> A4
I1 --> I2
I2 --> I3
I3 --> I4
A4 --> V1
I4 --> V1
V1 --> V2
V2 --> V3
V3 --> V4
V4 --> END[Final Talking Head Video]
style AUDIO_SILO fill:#f9f,stroke:#333,stroke-width:2px
style IMAGE_SILO fill:#bbf,stroke:#333,stroke-width:2px
style VIDEO_SILO fill:#bfb,stroke:#333,stroke-width:2px