graph TD
    START([Start]) --> IN["Input: MP3 & Avatar Image"]
    IN --> CONV[Convert MP3 to WAV]
    IN --> IMG[Analyze Avatar Reference Image]

    subgraph AudioProcessing[Audio Processing Silo]
        CONV --> TRANS[Generate Detailed Transcription]
        TRANS --> WAVE[Create Audio Waveform]
        WAVE --> BEAT[Extract Beat Points]
    end

    subgraph VisualAnalysis[Visual Analysis Silo]
        IMG --> VREF[Vision Analysis of Avatar]
        WAVE --> WANA[Vision Analysis of Waveform]
    end

    subgraph DataStructuring[Data Structuring Silo]
        BEAT --> PROC[Process Data via LLM]
        WANA --> PROC
        VREF --> PROC
        PROC --> JSON[Format to JSON]
    end

    JSON --> OUT([Output: Structured Lipsync Data])

    style AudioProcessing fill:#f9f,stroke:#333
    style VisualAnalysis fill:#bbf,stroke:#333
    style DataStructuring fill:#bfb,stroke:#333