What Shall We Build Next?

Let me break down the ScriptMaster subagent in detail:

A) SUBAGENT SUMMARY: 
ScriptMaster takes a user's topic description and generates an optimized, natural-sounding voice-over script specifically designed for talking head videos, with appropriate pacing, tone, and length (100-300 words).

B) FINAL TASK OUTPUT: 
A text file containing a professionally formatted voice-over script, structured with proper timing breaks, approximately 100-300 words in length, optimized for natural speech patterns and talking head delivery.

C) SUBAGENT INPUT:
1. User's topic/theme description
2. Optional style preferences (tone, formality, etc.)
3. Optional length preference (within 100-300 word range)

E) SUBAGENT TASK SUMMARY:
The workflow involves three main chained steps:

1. Research Phase:
INPUT: User topic description >
SKILL #216 (Research Topic Deeply) >
OUTPUT: Comprehensive research summary

2. Initial Script Generation:
INPUT: Research summary >
SKILL #171 (Write Voice Over Script Based On Instructions) >
OUTPUT: First draft of voice-over script

3. Script Optimization:
INPUT: First draft script >
SKILL #190 (Write or rewrite text based on instructions), with a specific prompt for talking head optimization >
OUTPUT: Final optimized script

F) SILOS:
This subagent operates in three distinct silos:

SILO 1: RESEARCH
• Purpose: Gather comprehensive topic information
• Input: User topic description
• Skill: #216 (Research Topic Deeply)
• Output: Research summary (1000-3000 characters)

SILO 2: SCRIPT CREATION
• Purpose: Transform research into initial script
• Input: Research summary from Silo 1
• Skill: #171 (Write Voice Over Script Based On Instructions)
• Output: Initial script draft

SILO 3: OPTIMIZATION
• Purpose: Optimize script for talking head delivery
• Input: Initial script from Silo 2
• Skill: #190 (Write or rewrite text based on instructions)
• Output: Final optimized script

Each silo's output becomes the input for the next silo, creating a refined progression toward the final script. This structure ensures thorough research, proper script formatting, and optimization specifically for talking head delivery.
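The silo-to-silo hand-off described above can be sketched as a simple function chain. This is a minimal illustration only: the `research`, `draft`, and `optimize` functions are hypothetical stand-ins for skills #216, #171, and #190, whose real interfaces are not described here, and `word_count_ok` is an assumed helper for the 100-300 word target.

```python
from typing import Callable, List

# Hypothetical stand-in type for a skill: text in, text out.
# The real skill interfaces are not specified in this document.
Skill = Callable[[str], str]


def run_chain(topic: str, skills: List[Skill]) -> str:
    """Feed each silo's output into the next silo, in order."""
    data = topic
    for skill in skills:
        data = skill(data)
    return data


def word_count_ok(script: str, low: int = 100, high: int = 300) -> bool:
    """Check the 100-300 word target stated for the final script."""
    return low <= len(script.split()) <= high


# Stub skills that only tag their input so the data flow is visible.
def research(topic: str) -> str:      # stands in for skill #216
    return f"research({topic})"


def draft(summary: str) -> str:       # stands in for skill #171
    return f"draft({summary})"


def optimize(script: str) -> str:     # stands in for skill #190
    return f"optimized({script})"


result = run_chain("solar power", [research, draft, optimize])
print(result)  # optimized(draft(research(solar power)))
```

In a real run, `word_count_ok` would be applied to the final script before it is handed to the next subagent.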

SubAgent #1 - Diagram


Let me break down the VoiceForge subagent workflow according to the guidelines:

A) SUBAGENT SUMMARY: 
VoiceForge converts a text script into a high-quality voice-over MP3 file, optimizing the script if needed for length constraints and ensuring proper audio output formatting.

B) FINAL TASK OUTPUT:
An MP3 file URL containing the voice-over audio, formatted for use in a talking head video, with a duration of roughly 40 seconds to 2 minutes (a 100-300 word input script at about 150 words per minute) and clear audio quality suitable for lip-sync processing.
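The expected audio length follows from the script's word count and the speaking rate. The sketch below estimates it; the default of 150 words per minute is an assumption (the document does not state a rate), and `estimate_duration_seconds` is a hypothetical helper, not part of any skill.

```python
def estimate_duration_seconds(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate spoken duration from a word count.

    The 150 words-per-minute default is an assumed conversational pace,
    not a figure taken from the voice skill itself.
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count * 60.0 / words_per_minute


shortest = estimate_duration_seconds(100)
longest = estimate_duration_seconds(300)
print(f"{shortest:.0f}-{longest:.0f} s")  # 40-120 s
```

A slower voice setting lengthens the result proportionally, so the rate is worth passing explicitly when a speaking pace preference is provided.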

C) SUBAGENT INPUT:
- Primary Input: Text script (100-300 words) from ScriptMaster subagent
- Additional Parameters (if provided):
  * Voice style preference
  * Speaking pace/tempo requirements
  * Any specific pronunciation guides

E) SUBAGENT TASK SUMMARY:
The workflow follows this sequence:

1. Script Length Verification and Formatting:
   - Use #190 (Write or rewrite text based on instructions) to verify script length and format if needed
   * Input: Original script
   * Output: Formatted script ready for voice conversion

2. Primary Voice Generation:
   - Use #170 (Turn Script Into Voice Over MP3) to create the voice-over
   * Input: Formatted script
   * Output: Initial MP3 URL

3. Audio Quality Check:
   - Use #179 (Create Visual Waveform Of 60 second Wav/mp3 File) to analyze audio quality
   * Input: Initial MP3 URL
   * Output: Waveform visualization

4. Audio Analysis:
   - Use #176 (Analyze An Image With GPT Vision & Return Text) to review waveform
   * Input: Waveform visualization
   * Output: Audio quality analysis

5. If needed, Audio Optimization:
   - If issues detected, use #178 (Convert 1-20 MP3s to wav) for format conversion
   * Input: MP3 URL
   * Output: Final optimized audio file URL

F) SILOS:
SILO 1: SCRIPT PREPARATION
- Input: Raw script
- Skill #190: Format/verify script
- Output: Formatted script

SILO 2: VOICE GENERATION
- Input: Formatted script
- Skill #170: Generate voice-over
- Output: Initial MP3

SILO 3: QUALITY ASSURANCE
- Input: Initial MP3
- Skill #179: Generate waveform
- Skill #176: Analyze waveform
- Skill #178: Optimize if needed
- Output: Final MP3 URL

This workflow ensures high-quality voice-over generation with built-in quality checks and optimization steps, preparing the audio specifically for use in talking head video generation.

SubAgent #2 - Diagram


Let me break down the AvatarVision subagent following the requested format:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, themed AI avatar image that matches the video's topic and maintains consistent visual quality for lip-sync animation, with particular attention to facial features and head positioning.

B) FINAL TASK OUTPUT: 
A 1024x1024 PNG file with a transparent background, showing a professional-looking head-and-shoulders avatar with clear facial features (especially the mouth area), a neutral expression, and clean edges for animation.

C) SUBAGENT INPUT:
1. Topic/theme description from user
2. Style preferences (professional, casual, specific profession, etc.)
3. Any specific facial feature requirements

E) SUBAGENT TASK SUMMARY:
The workflow chains together as follows:

1. Start with user input > #223 (Powerful LLM) to generate optimal image generation prompts
2. Generated prompt > #222 (Make Image With Text) to create initial avatar
3. Initial avatar URL > #176 (Analyze Image with GPT Vision) to verify facial features
4. If needed, URL > #221 (Recreate New Image) to refine/adjust
5. Final image > #191 (Resize Image) to ensure 1024x1024 format
6. Output: Final themed-avatar-image.png

F) SILOS:
SILO 1: PROMPT ENGINEERING
- Input: User's topic/theme description
- Skill: #223 (Powerful LLM)
- Output: Optimized image generation prompt
- Purpose: Ensures prompt will generate suitable avatar

SILO 2: INITIAL GENERATION
- Input: Optimized prompt
- Skill: #222 (Make Image With Text)
- Output: Initial avatar PNG
- Purpose: Creates base avatar image

SILO 3: QUALITY CONTROL
- Input: Initial avatar PNG
- Skill: #176 (Analyze Image with GPT Vision)
- Purpose: Verifies facial features suitable for animation
- If needed, triggers SILO 4

SILO 4: REFINEMENT (Conditional)
- Input: Initial avatar + analysis feedback
- Skill: #221 (Recreate New Image)
- Output: Refined avatar
- Purpose: Improves initial generation if needed

SILO 5: FORMATTING
- Input: Final avatar (from either SILO 2 or 4)
- Skill: #191 (Resize Image)
- Output: 1024x1024 transparent PNG
- Purpose: Ensures correct format for video generation

This structured approach ensures high-quality avatar generation with proper verification and refinement steps, producing an image suitable for lip-sync animation.
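The conditional path through the five silos can be sketched as follows. All five callables are hypothetical placeholders for skills #223, #222, #176, #221, and #191, and the `Review` type is an assumed shape for the vision skill's verdict; none of these signatures come from the skills themselves.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    """Assumed shape of the vision skill's verdict (hypothetical)."""
    suitable: bool
    feedback: str


def build_avatar(
    description: str,
    make_prompt: Callable[[str], str],           # stands in for #223
    make_image: Callable[[str], str],            # stands in for #222
    review_image: Callable[[str], Review],       # stands in for #176
    recreate_image: Callable[[str, str], str],   # stands in for #221
    resize_image: Callable[[str, int, int], str] # stands in for #191
) -> str:
    prompt = make_prompt(description)            # SILO 1: prompt engineering
    image_url = make_image(prompt)               # SILO 2: initial generation
    review = review_image(image_url)             # SILO 3: quality control
    if not review.suitable:                      # SILO 4: only when needed
        image_url = recreate_image(image_url, review.feedback)
    return resize_image(image_url, 1024, 1024)   # SILO 5: formatting


# Stubbed run in which the first image fails review.
final = build_avatar(
    "friendly science teacher",
    make_prompt=lambda d: f"portrait of {d}",
    make_image=lambda p: "initial.png",
    review_image=lambda url: Review(False, "mouth is obscured"),
    recreate_image=lambda url, feedback: "refined.png",
    resize_image=lambda url, w, h: f"{w}x{h}:{url}",
)
print(final)  # 1024x1024:refined.png
```

SILO 5 receives whichever image survived review, which matches its stated input of "from either SILO 2 or 4".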

SubAgent #3 - Diagram


Let me break down the LipSyncWizard subagent in detail:

A) SUBAGENT SUMMARY: 
LipSyncWizard analyzes an audio file to generate precise phoneme timing data and maps it to the corresponding viseme (mouth shape) positions, creating a synchronized animation data stream for avatar lip movement.

B) FINAL TASK OUTPUT:
A JSON data structure containing time-coded phoneme-to-viseme mappings, including:
- Timestamps (in milliseconds)
- Phoneme identifiers
- Corresponding viseme positions (mouth shapes)
- Optional head movement coordinates
File format: .json with structured timing data
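One possible shape for that JSON output is sketched below. The field names (`start_ms`, `end_ms`, `phoneme`, `viseme`, `head_offset`) and the viseme labels are illustrative assumptions; the document does not define a schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class VisemeFrame:
    """One time-coded entry; all field names are hypothetical."""
    start_ms: int                               # timestamp in milliseconds
    end_ms: int
    phoneme: str                                # phoneme identifier
    viseme: str                                 # mouth-shape label
    head_offset: Optional[List[float]] = None   # optional head movement


frames = [
    VisemeFrame(0, 120, "HH", "open"),
    VisemeFrame(120, 260, "AH", "open_wide", head_offset=[0.0, 1.5]),
]

payload = {"frames": [asdict(frame) for frame in frames]}
text = json.dumps(payload, indent=2)
print(text)
```

Frames without head movement serialize `head_offset` as `null`, so a consumer can treat the field as optional.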

C) SUBAGENT INPUT:
- MP3 voice-over file URL
- Transcription with word timing data
- List of supported visemes for the avatar system

E) SUBAGENT TASK SUMMARY:
Input > #198 (Get Transcription of MP3 with Timings) > #178 (Convert MP3 to WAV) > #179 (Create Visual Waveform) > #176 (Analyze Waveform with GPT Vision) > #223 (LLM Process Phoneme Data) > JSON Output

F) SILOS:
The subagent operates in three distinct silos:

SILO 1: AUDIO ANALYSIS
1. Generate precise transcription with timings (#198)
2. Convert MP3 to WAV format (#178)
3. Create visual waveform for amplitude analysis (#179)

SILO 2: PHONEME EXTRACTION
1. Analyze waveform with GPT Vision (#176) to identify speech patterns
2. Map transcription timings to waveform patterns
3. Use LLM (#223) to convert speech patterns to phoneme sequences

SILO 3: VISEME MAPPING
1. Use LLM (#223) to convert phoneme data to viseme positions
2. Generate timing data for mouth movements
3. Create final JSON structure with all synchronized data

This breakdown ensures accurate phoneme detection and proper viseme mapping while maintaining precise timing synchronization throughout the process. The final JSON output can then be used by VideoAssemblerPro to create the synchronized talking head animation.

Note: This approach uses existing skills creatively to approximate phoneme detection, though a dedicated phoneme detection skill would be ideal for future implementations.
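For illustration, the viseme-mapping step in SILO 3 can be approximated with a plain lookup table instead of the LLM skill (#223) that the plan calls for. The table below is a small, assumed subset of ARPAbet-style phonemes, and the viseme labels are hypothetical; a real avatar system would supply its own list of supported visemes.

```python
from typing import Dict, List, Tuple

# Assumed, partial phoneme-to-viseme table; the labels are hypothetical.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "closed", "B": "closed", "M": "closed",   # lips pressed together
    "F": "lip_teeth", "V": "lip_teeth",            # lower lip on upper teeth
    "AA": "open", "AE": "open",                    # open jaw
    "OW": "round", "UW": "round",                  # rounded lips
}

TimedPhoneme = Tuple[int, int, str]  # (start_ms, end_ms, phoneme)
TimedViseme = Tuple[int, int, str]   # (start_ms, end_ms, viseme)


def to_visemes(phonemes: List[TimedPhoneme], default: str = "rest") -> List[TimedViseme]:
    """Map timed phonemes to visemes, merging adjacent identical shapes."""
    frames: List[TimedViseme] = []
    for start_ms, end_ms, phoneme in phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, default)
        if frames and frames[-1][2] == viseme and frames[-1][1] == start_ms:
            # Same mouth shape continues without a gap: extend the last frame.
            frames[-1] = (frames[-1][0], end_ms, viseme)
        else:
            frames.append((start_ms, end_ms, viseme))
    return frames


timed = [(0, 80, "M"), (80, 200, "AA"), (200, 260, "P"), (260, 320, "B")]
print(to_visemes(timed))
# [(0, 80, 'closed'), (80, 200, 'open'), (200, 320, 'closed')]
```

Merging back-to-back identical shapes keeps the animation data compact, since "P" followed by "B" does not require a new mouth position.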



Let me break down the VideoAssemblerPro subagent following the requested format:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI avatar image, voice-over audio, and lip-sync data to create a synchronized talking head video with natural mouth movements and facial expressions.

B) FINAL TASK OUTPUT: 
An MP4 video file (16:9 aspect ratio) featuring the AI avatar speaking with synchronized lip movements, approximately matching the duration of the input audio file, with the avatar centered in frame against a neutral background.

C) SUBAGENT INPUT:
- PNG URL of the AI-generated avatar image
- MP3 URL of the voice-over audio
- Text transcription with precise timing data for lip-sync
- Optional: Movement/animation instructions

E) SUBAGENT TASK SUMMARY:
The flow requires multiple parallel processes that merge into the final output:

1. Audio Processing Track:
Input MP3 > #198 Get Transcription Of MP3 (With Timings) > Timing Data
Input MP3 > #196 Extract MP3 Audio From MP4 File > Clean Audio

2. Image Processing Track:
Input Avatar PNG > #191 Resize Image (to 1024x1024) > Processed Avatar

3. Final Assembly:
(Timing Data + Clean Audio + Processed Avatar) > #168 Generate Talking Head Video From MP3 & transcription > Final MP4

F) SILOS:
SILO 1 - AUDIO PROCESSING
- Purpose: Process audio and extract precise timing data
- Input: MP3 URL
- Skills: #198, #196
- Output: Timing data + cleaned audio

SILO 2 - IMAGE PROCESSING
- Purpose: Prepare avatar image for video generation
- Input: PNG URL
- Skills: #191
- Output: Properly sized avatar image

SILO 3 - VIDEO ASSEMBLY
- Purpose: Combine all elements into final talking head video
- Input: All processed components
- Skills: #168
- Output: Final MP4 video

This workflow takes advantage of existing skills while accommodating the need for precise lip-sync and avatar animation. The #168 skill (Generate Talking Head Video) serves as the crucial final step that brings everything together into a cohesive talking head video.
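The parallel tracks merging into a final assembly step can be sketched with a thread pool. The four callables are hypothetical placeholders for skills #198, #196, #191, and #168; their real inputs and outputs are not specified here, so the stubs below only show how the results are joined.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def assemble_video(
    mp3_url: str,
    png_url: str,
    transcribe: Callable[[str], Any],             # stands in for #198
    clean_audio: Callable[[str], str],            # stands in for #196
    resize: Callable[[str, int, int], str],       # stands in for #191
    render: Callable[[str, str, Any], str],       # stands in for #168
) -> str:
    # Run the audio and image tracks concurrently, then merge the results.
    with ThreadPoolExecutor(max_workers=3) as pool:
        timing_future = pool.submit(transcribe, mp3_url)
        audio_future = pool.submit(clean_audio, mp3_url)
        avatar_future = pool.submit(resize, png_url, 1024, 1024)
        timing = timing_future.result()
        audio = audio_future.result()
        avatar = avatar_future.result()
    # Final assembly needs all three processed components.
    return render(avatar, audio, timing)


video = assemble_video(
    "voice.mp3",
    "avatar.png",
    transcribe=lambda url: [("hello", 0, 400)],
    clean_audio=lambda url: "clean.mp3",
    resize=lambda url, w, h: f"{w}x{h}:{url}",
    render=lambda avatar, audio, timing: f"video({avatar},{audio},{len(timing)} words)",
)
print(video)  # video(1024x1024:avatar.png,clean.mp3,1 words)
```

Calling `result()` on each future blocks until that track finishes, so the assembly step never starts with a missing component.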