What Shall We Build Next?

Let me break down the ScriptMaster subagent flow according to the guidelines:

A) SUBAGENT SUMMARY: 
ScriptMaster takes a user's topic/description and generates an optimized, well-structured voice-over script specifically designed for talking head avatar videos.

B) FINAL TASK OUTPUT:
A text file containing a 100-300 word script that:
- Is formatted for voice-over delivery (including pauses and emphasis)
- Is structured with a clear introduction, body, and conclusion
- Is written in a conversational tone suitable for avatar delivery
- Includes timing estimates for each section
- Contains pronunciation guides for complex terms
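
The per-section timing estimates listed above could be derived from word counts. The following is a minimal sketch, assuming a conversational pace of 150 words per minute (an assumption; the plan does not state a pace). The function names and section labels are hypothetical.

```python
# Assumed speaking pace; the plan does not specify one.
WORDS_PER_MINUTE = 150


def estimate_seconds(text: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimate narration time for a piece of text from its word count."""
    return len(text.split()) / wpm * 60


def section_timings(sections: dict[str, str]) -> dict[str, float]:
    """Return estimated seconds per section, rounded to one decimal place."""
    return {name: round(estimate_seconds(body), 1) for name, body in sections.items()}


def within_word_limit(sections: dict[str, str], low: int = 100, high: int = 300) -> bool:
    """Check the 100-300 word target from the output spec."""
    total = sum(len(body.split()) for body in sections.values())
    return low <= total <= high
```

A script could be rejected or trimmed before it reaches the voice-over step whenever `within_word_limit` returns False.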

C) SUBAGENT INPUT:
1. User's topic/description
2. Optional style preferences (tone, formality level)
3. Target duration (if specified)
4. Any specific keywords or points to include

D) SUBAGENT TASK SUMMARY:
Input > #216 Research Topic Deeply > #223 Powerful LLM Prompt-to-Text Response > #171 Write Voice Over Script Based On Instructions > Output

Specifically:
1. #216 will gather comprehensive topic research
2. #223 will structure the research into a preliminary script outline
3. #171 will convert the outline into the final voice-over script

E) SILOS:
SILO 1: RESEARCH PHASE
- Input: User's topic/description
- Skill: #216 Research Topic Deeply
- Output: Research summary (1000-3000 characters)

SILO 2: SCRIPT STRUCTURING
- Input: Research summary from Silo 1
- Skill: #223 Powerful LLM Prompt-to-Text Response
- Output: Structured script outline with sections marked

SILO 3: FINAL SCRIPT GENERATION
- Input: Structured outline from Silo 2
- Skill: #171 Write Voice Over Script Based On Instructions
- Output: Final voice-over script file

The silos are designed to ensure:
1. Comprehensive research forms the foundation
2. Proper structure is established
3. Final script is optimized for voice-over delivery

Each silo's output feeds directly into the next silo's input, creating a smooth workflow that builds towards the final optimized script.
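
The chaining described above, where each silo's output becomes the next silo's input, can be sketched as a simple loop. `run_skill` is a hypothetical stand-in for however the platform actually invokes a skill; its real API is not shown in this plan.

```python
from typing import Callable

# Hypothetical stand-in for the platform's skill runner: (skill_id, input) -> output.
SkillRunner = Callable[[int, str], str]

# Skill IDs in silo order: research, structuring, final script.
SCRIPTMASTER_SILOS = [216, 223, 171]


def run_scriptmaster(topic: str, run_skill: SkillRunner) -> str:
    """Feed each silo's output into the next silo's input."""
    payload = topic
    for skill_id in SCRIPTMASTER_SILOS:
        payload = run_skill(skill_id, payload)
    return payload
```

Passing the runner in as a parameter keeps the sketch testable with a fake runner that simply records which skills were called and in what order.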

Let me break down the VoiceForge subagent according to the specified format:

A) SUBAGENT SUMMARY: 
VoiceForge converts a text script into a high-quality voice-over MP3 file, with appropriate pacing and natural speech patterns suitable for a talking head video.

B) FINAL TASK OUTPUT: 
A single MP3 file URL containing the voice-over narration, with clear audio quality suitable for lip-syncing (44.1kHz sample rate, 16-bit depth, mono channel). A 100-300 word script typically yields roughly 1-2 minutes of audio at a conversational pace.

C) SUBAGENT INPUT:
- Primary Input: Text script (100-300 words) from ScriptMaster
- Optional Input: Voice style preferences (if any specified in original user prompt)

D) SUBAGENT TASK SUMMARY:
The flow works like this:

1. Input text script > #170 (Turn Script Into Voice Over MP3) > First pass MP3
2. First pass MP3 > #198 (Get Transcription Of MP3 With Timings) > Timing verification
3. Timing data > #223 (Powerful LLM Prompt-to-Text Response) > Analysis of pacing/quality
4. If timing analysis shows issues > Return to step 1 with adjusted script
5. Final output: Verified MP3 URL
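
The generate, verify, and regenerate loop in steps 1-5 can be sketched as follows. All four callables are hypothetical stand-ins for the skills named above, and the cap of three attempts is an assumption, since the plan does not bound the loop.

```python
from typing import Callable


def generate_verified_voiceover(
    script: str,
    synthesize: Callable[[str], str],           # stands in for skill #170, returns an MP3 URL
    transcribe: Callable[[str], list],          # stands in for skill #198, returns timing data
    pacing_ok: Callable[[list], bool],          # stands in for the #223 analysis verdict
    adjust_script: Callable[[str, list], str],  # hypothetical: rewrites the script after a failed check
    max_attempts: int = 3,                      # assumed cap; the plan does not bound the loop
) -> str:
    """Generate an MP3, verify its timing, and regenerate until it passes or attempts run out."""
    for _ in range(max_attempts):
        mp3_url = synthesize(script)
        timings = transcribe(mp3_url)
        if pacing_ok(timings):
            return mp3_url
        script = adjust_script(script, timings)
    raise RuntimeError(f"No acceptable voice-over after {max_attempts} attempts")
```

Bounding the loop avoids regenerating audio indefinitely when the analysis step never approves a result.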

E) SILOS:
Silo 1: VOICE GENERATION
- Input: Raw script text
- Skill: #170 Turn Script Into Voice Over MP3
- Output: Initial MP3 URL

Silo 2: QUALITY VERIFICATION
- Input: Initial MP3 URL
- Skill: #198 Get Transcription Of MP3 (With Timings)
- Skill: #223 Powerful LLM Prompt-to-Text Response
- Output: Quality analysis with timing verification

Silo 3: FINAL DELIVERY
- Input: Verified MP3 or request for regeneration
- Action: Either approve MP3 for final output or loop back to Silo 1
- Output: Final approved MP3 URL

This workflow does more than generate a voice-over: it also verifies the audio's quality and timing for lip-sync, which is crucial for the final talking head video.

Let me break down the AvatarVision subagent in detail:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, themed AI avatar image that matches the video's topic and style, ensuring the avatar is front-facing with clear facial features suitable for lip-sync animation.

B) FINAL TASK OUTPUT:
A single 1024x1024 PNG file of a front-facing avatar with clear facial features, particularly a well-defined mouth area, saved with a transparent background to allow for animation flexibility.

C) SUBAGENT INPUT:
1. User's topic/theme description
2. Style preferences for avatar (professional, casual, specific profession, etc.)
3. Any specific facial feature requirements

D) SUBAGENT TASK SUMMARY:
Input > #223 (Generate optimal avatar prompt) > #182 (Create initial avatar) > #176 (Analyze avatar for facial features) > #221 (Recreate if needed) > #191 (Resize if needed) > Final PNG Output

Detailed Flow:
1. Use #223 (Powerful LLM) to convert user requirements into an optimized image generation prompt that emphasizes:
   - Front-facing position
   - Clear facial features
   - Well-defined mouth area
   - Theme-appropriate styling
   
2. Use #182 (Create Dalle Image) to generate the initial transparent avatar
   
3. Use #176 (Analyze Image) to verify:
   - Face is properly front-facing
   - Mouth area is well-defined
   - Features are clear enough for animation
   
4. IF analysis shows issues:
   Use #221 (Recreate Image) with refined prompt based on analysis
   
5. IF size adjustment needed:
   Use #191 (Resize Image) to ensure 1024x1024 dimensions

E) SILOS:

SILO 1: PROMPT ENGINEERING
- Input: User requirements
- Skill: #223 Powerful LLM
- Output: Optimized avatar generation prompt

SILO 2: IMAGE GENERATION
- Input: Optimized prompt
- Skill: #182 Create Dalle Image
- Output: Initial avatar PNG

SILO 3: QUALITY CONTROL
- Input: Initial avatar PNG
- Skill: #176 Analyze Image
- Output: Analysis report
- Conditional: If issues found, trigger SILO 4

SILO 4: REFINEMENT (if needed)
- Input: Analysis report + original prompt
- Skill: #221 Recreate Image
- Output: Refined avatar PNG

SILO 5: FORMATTING (if needed)
- Input: Final/Refined avatar PNG
- Skill: #191 Resize Image
- Output: Properly sized avatar PNG

Each silo is designed to ensure the final avatar meets the specific requirements needed for a talking head video, with particular emphasis on facial features that will work well with lip-sync animation.
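
The conditional silos above (refinement and formatting run only when needed) can be sketched as follows. The callables are hypothetical stand-ins for the listed skills, and the `Analysis` shape is an assumption; skill #176 is described as returning text, so a real implementation would need to parse that text into something like this.

```python
from dataclasses import dataclass, field
from typing import Callable

TARGET_SIZE = (1024, 1024)  # from the final output spec


@dataclass
class Avatar:
    url: str
    size: tuple[int, int]


@dataclass
class Analysis:
    # Hypothetical structured form of the #176 report.
    issues: list[str] = field(default_factory=list)


def build_avatar(
    requirements: str,
    make_prompt: Callable[[str], str],                    # stands in for #223
    create_image: Callable[[str], Avatar],                # stands in for #182
    analyze: Callable[[Avatar], Analysis],                # stands in for #176
    recreate: Callable[[str, Analysis], Avatar],          # stands in for #221
    resize: Callable[[Avatar, tuple[int, int]], Avatar],  # stands in for #191
) -> Avatar:
    """Run silos 1-3 always; run silos 4 and 5 only when their conditions hold."""
    prompt = make_prompt(requirements)
    avatar = create_image(prompt)
    report = analyze(avatar)
    if report.issues:               # SILO 4: refinement, only if issues were found
        avatar = recreate(prompt, report)
    if avatar.size != TARGET_SIZE:  # SILO 5: formatting, only if the size is off
        avatar = resize(avatar, TARGET_SIZE)
    return avatar
```

This sketch refines at most once; re-running the analysis on the refined image would be a natural extension.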

Let me break down the LipSyncWizard subagent specifically:

A) SUBAGENT SUMMARY:
LipSyncWizard processes an audio file to generate precise phoneme timing data and corresponding viseme mappings, creating a structured animation data file that determines when and how the avatar's mouth and facial features should move in sync with the speech.

B) FINAL TASK OUTPUT:
A JSON data file containing timestamped phoneme-to-viseme mappings, including:
- Precise timing markers (in milliseconds)
- Corresponding mouth shapes (visemes)
- Optional head movement markers
- Basic facial expression triggers

C) SUBAGENT INPUT:
- MP3 voice-over file URL
- Transcription with word-level timing
- Avatar image reference (for understanding mouth position boundaries)

D) SUBAGENT TASK SUMMARY:
1. Extract detailed audio analysis:
   - Input MP3 > #198 Get Transcription Of MP3 (With Timings) > Detailed transcript
   - Input MP3 > #178 Convert 1-20 MP3s to wav > WAV file
   - WAV file > #179 Create Visual Waveform Of 60 second Wav/mp3 File > Waveform data

2. Generate phoneme timing data:
   - Input WAV > #219 Cut Wav/mp3 Audio into Multiple Pieces/Samples > Individual phoneme segments
   - Waveform + segments > #176 Analyze An Image With GPT Vision & Return Text > Amplitude analysis

3. Create animation mapping:
   - Transcription + timing data > #223 Powerful LLM Prompt-to-Text Response > Initial viseme mapping
   - Avatar image > #176 Analyze An Image With GPT Vision & Return Text > Mouth position constraints
   - Combined data > #185 Write Text (Or Copy) From Inputted Text > Final JSON animation data

E) SILOS:

SILO 1: AUDIO ANALYSIS
- Input: MP3 file
- Skills: #198, #178, #179
- Output: WAV file, transcription, waveform

SILO 2: PHONEME EXTRACTION
- Input: WAV file, waveform
- Skills: #219, #176
- Output: Phoneme timing data

SILO 3: ANIMATION MAPPING
- Input: Timing data, avatar image
- Skills: #223, #176, #185
- Output: Final JSON animation data

This workflow creates a precise phoneme-to-viseme mapping that can be used by the video generation component to create naturally synchronized lip movements.
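
A phoneme-to-viseme mapping of the kind described above can be sketched as a lookup table plus a timing split. The table below is illustrative only (ARPAbet-style symbols, made-up viseme labels), and splitting a word's duration evenly across its phonemes is a simplifying assumption; real phoneme durations vary.

```python
# Illustrative phoneme-to-viseme table; real viseme sets vary by renderer.
PHONEME_TO_VISEME = {
    "M": "closed", "B": "closed", "P": "closed",
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "AA": "open", "AE": "open", "AH": "open",
    "OW": "rounded", "UW": "rounded",
    "IY": "wide", "EH": "wide",
}


def word_to_cues(phonemes: list[str], start_ms: int, end_ms: int) -> list[dict]:
    """Split a word's time span evenly across its phonemes and label each slice with a viseme."""
    if not phonemes:
        return []
    span = end_ms - start_ms
    cues = []
    for i, phoneme in enumerate(phonemes):
        cues.append({
            "start_ms": start_ms + span * i // len(phonemes),
            "end_ms": start_ms + span * (i + 1) // len(phonemes),
            "viseme": PHONEME_TO_VISEME.get(phoneme, "neutral"),  # fallback for unmapped phonemes
        })
    return cues
```

The word-level timings would come from the transcription step, while the phoneme list per word would need a pronunciation source that this plan does not name.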

Let me break down Subagent 5 (VideoAssemblerPro) according to the guidelines:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI avatar image, voice-over audio, and lip-sync data to create a synchronized talking head video where the avatar's mouth and facial movements match the audio.

B) FINAL TASK OUTPUT:
MP4 video file (1920x1080), duration matching input audio length, featuring the AI avatar with synchronized lip movements and natural head movements, with clear audio quality at 48kHz.

C) SUBAGENT INPUT:
- PNG file URL of the AI-generated avatar image
- MP3 file URL of the voice-over audio
- JSON/text file containing lip-sync timing data
- Original script text (for backup reference)

D) SUBAGENT TASK SUMMARY:
This subagent requires multiple parallel processes that then converge:

1. Audio Processing Silo:
Input MP3 > #196 (Extract audio) > #198 (Get transcription with timings) > timing data

2. Avatar Processing Silo:
Input Avatar PNG > #191 (Resize image to 1920x1080) > processed avatar

3. Final Assembly:
Processed avatar + MP3 + timing data > #168 (Generate Talking Head Video) > Final MP4
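
The "parallel processes that then converge" shape above can be sketched with a thread pool. The three callables are hypothetical stand-ins for the silos; how the platform actually schedules skills is not described in this plan.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def assemble_video(
    mp3_url: str,
    avatar_url: str,
    process_audio: Callable[[str], dict],     # stands in for the audio silo (#198 timing extraction)
    prepare_avatar: Callable[[str], str],     # stands in for the avatar silo (#191 resize)
    render: Callable[[str, str, dict], str],  # stands in for final assembly (#168)
) -> str:
    """Run the two independent silos concurrently, then hand both results to final assembly."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(process_audio, mp3_url)
        avatar_future = pool.submit(prepare_avatar, avatar_url)
        timing = audio_future.result()   # blocks until audio processing finishes
        avatar = avatar_future.result()  # blocks until avatar preparation finishes
    return render(avatar, mp3_url, timing)
```

Threads suit this case because each silo mostly waits on a remote skill call rather than doing local computation.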

E) SILOS:

SILO 1: AUDIO PROCESSING
Purpose: Prepare audio and extract precise timing data
Input: MP3 URL
Steps:
1. #196 - Extract MP3 Audio From MP4 File (only needed when the audio arrives inside an MP4; a plain MP3 input can skip this step)
2. #198 - Get Transcription Of MP3 (With Timings)
Output: Clean MP3 + timing data

SILO 2: AVATAR PREPARATION
Purpose: Ensure avatar is properly sized/formatted
Input: PNG URL
Steps:
1. #191 - Resize Image (to 1920x1080)
Output: Properly formatted avatar PNG
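
Since the avatar from AvatarVision is a 1024x1024 square and the target frame is 1920x1080, a plain resize would stretch the face. Whether skill #191 preserves aspect ratio is not stated, so the following is only a sketch of one way to compute a distortion-free fit, with the avatar centered on the larger canvas. The function name and return keys are hypothetical.

```python
def fit_within(src_w: int, src_h: int, dst_w: int = 1920, dst_h: int = 1080) -> dict:
    """Scale (src_w, src_h) to fit inside (dst_w, dst_h) without distortion, centered."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    return {
        "width": new_w,
        "height": new_h,
        "offset_x": (dst_w - new_w) // 2,  # left padding on the canvas
        "offset_y": (dst_h - new_h) // 2,  # top padding on the canvas
    }
```

For a 1024x1024 input this yields a 1080x1080 image with 420 pixels of padding on each side.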

SILO 3: FINAL ASSEMBLY
Purpose: Combine all elements into final video
Input: Processed avatar PNG + Clean MP3 + Timing data
Steps:
1. #168 - Generate Talking Head Video From MP3 & transcription
Output: Final MP4 video file

The key insight here is that although the original subagent description suggested we needed new skills for lip-sync, existing Skill #168 already has this capability built in when it is given proper timing data from Skill #198. This makes the workflow much more efficient than originally conceived.