Let me break down the ScriptMaster subagent in detail:

A) SUBAGENT SUMMARY: 
ScriptMaster takes a user's topic description and generates an optimized, well-structured voice-over script specifically designed for talking head videos, incorporating natural speech patterns and appropriate pacing.

B) FINAL TASK OUTPUT: 
A text file containing a 100-300 word script, formatted with proper punctuation and natural speech breaks, optimized for voice-over recording, including clear paragraph breaks and timing markers.

C) SUBAGENT INPUT:
1. User's topic/theme description
2. Target length preference (defaulting to 100-300 words if not specified)
3. Any specific tone/style requirements (defaulting to conversational if not specified)
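The defaulting rules for these three inputs can be sketched as follows. This is a minimal illustration, not the platform's actual input schema: `ScriptRequest` and `build_request` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ScriptRequest:
    """Hypothetical container for the three ScriptMaster inputs."""
    topic: str
    word_range: Tuple[int, int] = (100, 300)  # default target length in words
    tone: str = "conversational"              # default tone/style


def build_request(
    topic: str,
    word_range: Optional[Tuple[int, int]] = None,
    tone: Optional[str] = None,
) -> ScriptRequest:
    """Apply the documented defaults when length or tone is not specified."""
    if not topic.strip():
        raise ValueError("a topic/theme description is required")
    request = ScriptRequest(topic=topic.strip())
    if word_range is not None:
        request.word_range = word_range
    if tone:
        request.tone = tone
    return request


print(build_request("home solar panels"))
```

Only the topic is mandatory; the other two fields fall back to the defaults listed above.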

E) SUBAGENT TASK SUMMARY:
The workflow follows this sequence:

Initial Input > #216 Research Topic Deeply > #223 Powerful LLM Prompt-to-Text Response > #171 Write Voice Over Script Based On Instructions > Final Script Output

Specifically:
1. First uses #216 to gather comprehensive topic research
2. Then uses #223 to structure this research into a preliminary script outline
3. Finally uses #171 to convert this into a properly formatted voice-over script
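The sequence above amounts to a three-stage chain in which each skill's output is the next skill's input. A minimal sketch, assuming each skill can be modelled as a text-in, text-out callable (the `Skill` type, `run_script_pipeline`, and the stub skills are hypothetical stand-ins, not the platform's API):

```python
from typing import Callable, Dict

# Assumption: a skill takes text in and returns text out.
Skill = Callable[[str], str]


def run_script_pipeline(topic: str, skills: Dict[int, Skill]) -> str:
    """Chain skills #216 -> #223 -> #171, feeding each output forward."""
    research = skills[216](topic)    # Research Topic Deeply
    outline = skills[223](research)  # Powerful LLM Prompt-to-Text Response
    script = skills[171](outline)    # Write Voice Over Script Based On Instructions
    return script


# Stand-in skills so the sketch runs without the real platform.
stub_skills: Dict[int, Skill] = {
    216: lambda topic: f"research({topic})",
    223: lambda research: f"outline({research})",
    171: lambda outline: f"script({outline})",
}

print(run_script_pipeline("solar power", stub_skills))
```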

F) SILOS:
This subagent operates in three distinct silos:

SILO 1: RESEARCH
- Purpose: Gather comprehensive topic information
- Skill: #216 (Research Topic Deeply)
- Input: User's topic description
- Output: 1000-3000 character research summary

SILO 2: SCRIPT STRUCTURING
- Purpose: Convert research into structured narrative
- Skill: #223 (Powerful LLM Prompt-to-Text Response)
- Input: Research summary from Silo 1
- Output: Structured script outline

SILO 3: VOICE-OVER OPTIMIZATION
- Purpose: Format for natural speech and timing
- Skill: #171 (Write Voice Over Script Based On Instructions)
- Input: Structured script from Silo 2
- Output: Final voice-over ready script

Each silo consumes the output of the previous one: the research becomes a structured outline, and the outline becomes the final voice-over script. The result is a script that is both content-rich and naturally speakable in the talking head video format.


Here's my complete analysis and workflow for SUBAGENT 2 "VoiceForge":

A) SUBAGENT SUMMARY: 
Converts a written script into high-quality voice-over audio, with additional quality checks and optimizations for the talking head video use case.

B) FINAL TASK OUTPUT: 
A single MP3 file URL containing clear voice-over audio, optimized for lip-syncing, with normalized volume levels. For the 100-300 word input script this is typically about 40 seconds to 2 minutes of audio, assuming a conversational pace of roughly 150 words per minute.
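The clip length follows from the script's word count and the speaking rate. A rough estimate, assuming about 150 words per minute (an assumption introduced here, not a figure from the workflow; the voice produced by skill #170 may be faster or slower):

```python
def estimate_duration_seconds(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate spoken duration; 150 wpm is an assumed conversational pace."""
    if word_count < 0 or words_per_minute <= 0:
        raise ValueError("word_count must be >= 0 and words_per_minute > 0")
    return word_count * 60.0 / words_per_minute


for words in (100, 300):
    print(words, "words ->", estimate_duration_seconds(words), "seconds")
```

Under that assumption, a 100-word script runs about 40 seconds and a 300-word script about 120 seconds; a slower delivery lengthens both proportionally.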

C) SUBAGENT INPUT:
- Primary Input: Text script (from ScriptMaster subagent)
- Secondary Input: Any voice preference parameters (gender, accent, style, etc.)

E) SUBAGENT TASK SUMMARY:
The workflow requires several chained steps to ensure optimal audio quality for lip-syncing:

1. Initial Voice Generation:
text script > #170 (Turn Script Into Voice Over MP3) > initial MP3 URL

2. Audio Quality Check & Optimization:
MP3 URL > #178 (Convert MP3 to WAV) > WAV URL
WAV URL > #179 (Create Visual Waveform) > waveform JPEG
waveform JPEG > #176 (Analyze Image With GPT Vision) > audio quality analysis text

3. Volume Normalization (if needed based on analysis):
If quality check indicates issues:
WAV URL > #219 (Cut WAV into pieces) > multiple WAV segments
multiple WAV segments > #178 (Convert MP3 to WAV) > final normalized MP3 URL

F) SILOS:
SILO 1: Voice Generation
- Purpose: Create initial voice-over
- Input: Text script
- Skills: #170
- Output: Initial MP3

SILO 2: Quality Verification
- Purpose: Analyze audio quality
- Input: MP3 from Silo 1
- Skills: #178, #179, #176
- Output: Quality analysis and WAV file

SILO 3: Audio Optimization (conditional)
- Purpose: Normalize and optimize audio if needed
- Input: WAV from Silo 2
- Skills: #219, #178
- Output: Final MP3 URL

This workflow produces audio suitable for lip-syncing, with consistent volume levels and clear pronunciation. It stays efficient by running the optimization steps only when the quality analysis flags a problem.
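The conditional third silo can be sketched as a simple branch on the quality report. This is an illustration under assumptions: `QualityReport`, `produce_voice_over`, and the three callables are hypothetical stand-ins for Silos 1 to 3, not real platform calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QualityReport:
    """Hypothetical result of Silo 2: the WAV copy plus any issues found."""
    wav_url: str
    issues: List[str] = field(default_factory=list)


def produce_voice_over(
    script: str,
    generate: Callable[[str], str],          # Silo 1: script -> MP3 URL
    verify: Callable[[str], QualityReport],  # Silo 2: MP3 URL -> report
    optimize: Callable[[str], str],          # Silo 3: WAV URL -> final MP3 URL
) -> str:
    mp3_url = generate(script)
    report = verify(mp3_url)
    if not report.issues:
        return mp3_url  # quality is fine, so Silo 3 is skipped
    return optimize(report.wav_url)


# Stub run: the verifier reports one issue, so the optimizer is invoked.
result = produce_voice_over(
    "Hello there.",
    generate=lambda s: "initial.mp3",
    verify=lambda url: QualityReport("initial.wav", ["uneven volume"]),
    optimize=lambda wav: "final.mp3",
)
print(result)
```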


Let me break down the AvatarVision subagent following the requested format:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, thematically appropriate AI avatar image that will serve as the base for the talking head video, ensuring the avatar matches the topic and style requirements while maintaining professional quality suitable for lip-syncing.

B) FINAL TASK OUTPUT:
A URL pointing to a 1024x1024 transparent PNG of a professionally rendered AI avatar (head-and-shoulders shot) with clear facial features, especially around the mouth, uniform lighting, and a cleanly removed background.

C) SUBAGENT INPUT:
- Primary topic/theme description from user
- Style preferences for avatar (gender, age, profession, etc.)
- Any specific visual requirements (like "wearing a lab coat" for science videos)

E) SUBAGENT TASK SUMMARY:
1. #223 (Powerful LLM Prompt-to-Text Response) 
→ Takes user input and generates optimal image generation prompts
→ Creates three variations of prompts optimized for avatar creation

2. #222 (Make Image Especially With Text) x3
→ Generates three different versions of the avatar using the optimized prompts
→ Each with slightly different angles/expressions to ensure optimal output

3. #176 (Analyze An Image With GPT Vision & Return Text) x3
→ Analyzes each generated avatar for quality and suitability
→ Checks for clear facial features, proper framing, and professional appearance

4. #182 (Create Dalle Image Transparent Square)
→ Takes the best prompt (based on analysis) and generates final 1024x1024 transparent PNG
→ Ensures clean background removal and proper head/shoulders framing

F) SILOS:

SILO 1: PROMPT ENGINEERING
- Input: User requirements
- Skill: #223
- Output: 3x optimized prompts

SILO 2: IMAGE GENERATION & TESTING
- Input: 3x prompts from Silo 1
- Skills: #222 (x3 parallel runs)
- Skills: #176 (x3 parallel analyses)
- Output: Best performing prompt + quality assessment

SILO 3: FINAL RENDER
- Input: Best prompt from Silo 2
- Skill: #182
- Output: Final transparent PNG avatar

This workflow ensures multiple attempts at creating the perfect avatar while maintaining quality control through analysis, ultimately delivering a production-ready avatar image suitable for the talking head video.
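The generate, analyze, and select loop in Silo 2 can be sketched as a best-of-N choice. Note the assumption: skill #176 returns text, so the `score` callable below stands in for some hypothetical step that turns that analysis into a number. `pick_best_prompt`, `render`, and `score` are illustrative names only.

```python
from typing import Callable, List, Tuple


def pick_best_prompt(
    prompts: List[str],
    render: Callable[[str], str],   # stand-in for #222: prompt -> image URL
    score: Callable[[str], float],  # stand-in for #176 plus a scoring step
) -> Tuple[str, float]:
    """Render each prompt, score the result, and return the best prompt."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    scored = [(prompt, score(render(prompt))) for prompt in prompts]
    return max(scored, key=lambda pair: pair[1])


# Stub run with fixed scores so the sketch executes without the platform.
fake_scores = {"img-front": 0.7, "img-angled": 0.9, "img-smiling": 0.8}
best = pick_best_prompt(
    ["front", "angled", "smiling"],
    render=lambda prompt: f"img-{prompt}",
    score=lambda image: fake_scores[image],
)
print(best)
```

The winning prompt is then what Silo 3 would pass to #182 for the final render.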


Let me break down the LipSyncWizard subagent in detail:

A) SUBAGENT SUMMARY: 
A specialized agent that analyzes audio files to generate precise phoneme timing data and maps these to corresponding viseme (mouth shape) positions, creating a structured animation dataset for lip-sync animation.

B) FINAL TASK OUTPUT:
A JSON data structure containing:
- Timestamped phoneme sequences
- Corresponding viseme mappings
- Head position/movement keyframes
- Facial expression keyframes
Each entry carries timing markers with millisecond precision.
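One possible shape for this JSON structure is sketched below. Every field name, the viseme labels, and the phoneme-to-viseme lookup are assumptions made for illustration; the workflow does not specify a schema.

```python
import json
from typing import Dict, List

# Hypothetical phoneme-to-viseme lookup; the labels are illustrative only.
VISEME_FOR_PHONEME: Dict[str, str] = {
    "HH": "open_slight",
    "AH": "open_mid",
    "L": "tongue_up",
    "OW": "rounded",
}


def build_animation_data(phonemes: List[dict]) -> dict:
    """Map timestamped phonemes to visemes and wrap them in one structure."""
    visemes = [
        {
            "start_ms": p["start_ms"],
            "end_ms": p["end_ms"],
            "viseme": VISEME_FOR_PHONEME.get(p["phoneme"], "neutral"),
        }
        for p in phonemes
    ]
    return {
        "phonemes": phonemes,
        "visemes": visemes,
        "head_keyframes": [],        # left empty in this sketch
        "expression_keyframes": [],  # left empty in this sketch
    }


# Made-up timings for the word "hello".
sample = [
    {"start_ms": 0, "end_ms": 80, "phoneme": "HH"},
    {"start_ms": 80, "end_ms": 180, "phoneme": "AH"},
    {"start_ms": 180, "end_ms": 260, "phoneme": "L"},
    {"start_ms": 260, "end_ms": 420, "phoneme": "OW"},
]
print(json.dumps(build_animation_data(sample), indent=2))
```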

C) SUBAGENT INPUT:
- Voice-over MP3 file URL
- Transcription text with word timing data
- Avatar image reference (for facial reference points)

D) SUBAGENT TASK SUMMARY:
1. Convert MP3 to WAV for precise audio analysis
   (#178 - Convert 1-20 MP3s to wav)

2. Generate visual waveform for amplitude analysis
   (#179 - Create Visual Waveform Of 60 second Wav/mp3 File)

3. Extract precise audio timing data
   (#198 - Get Transcription Of MP3 With Timings)

4. Analyze waveform patterns
   (#176 - Analyze An Image With GPT Vision & Return Text)
   Used to analyze the waveform for amplitude patterns

5. Generate structured timing data
   (#223 - Powerful LLM Prompt-to-Text Response)
   Used to convert all collected data into final JSON structure

E) SILOS:

SILO 1: AUDIO PREPROCESSING
- Input: MP3 URL
- Skill #178: Convert to WAV
- Skill #179: Generate waveform
- Output: WAV file + Waveform image

SILO 2: TIMING EXTRACTION
- Input: WAV file
- Skill #198: Get precise transcription/timing
- Skill #176: Analyze waveform patterns
- Output: Timing data + Amplitude patterns

SILO 3: DATA STRUCTURING
- Input: Timing data + Amplitude patterns
- Skill #223: Convert to structured JSON
- Output: Final JSON animation data structure

This workflow addresses the technical challenge of generating precise lip-sync data by breaking down the audio analysis into distinct phases, each contributing specific data points needed for the final animation structure. The combination of audio conversion, waveform analysis, and structured data generation creates a comprehensive lip-sync dataset that can be used by the video assembly component.


Let me analyze and break down the VideoAssemblerPro subagent in detail:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI-generated avatar image, voice-over audio, and lip-sync data to create a synchronized talking head video where the avatar's mouth and facial movements match the audio speech patterns.

B) FINAL TASK OUTPUT:
MP4 video file (16:9 aspect ratio) featuring the avatar speaking with synchronized lip movements, with duration matching the input audio file length, at 30fps, with the avatar centered in frame against a clean background.
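The output spec implies some simple arithmetic: the number of frames follows from the audio duration at 30fps, and centering the avatar means computing an offset inside the frame. A small sketch, assuming a 1920x1080 frame (one common 16:9 size, not stated in the spec) and a 1024x1024 avatar:

```python
import math
from typing import Tuple


def frame_count(duration_seconds: float, fps: int = 30) -> int:
    """Frames needed to cover the audio; rounding guards against float noise."""
    return math.ceil(round(duration_seconds * fps, 6))


def centered_offset(frame_w: int, frame_h: int, image_w: int, image_h: int) -> Tuple[int, int]:
    """Top-left position that centers the image inside the frame."""
    return (frame_w - image_w) // 2, (frame_h - image_h) // 2


print(frame_count(40.0))                          # 40 s of audio at 30fps
print(centered_offset(1920, 1080, 1024, 1024))    # assumed 16:9 frame size
```

With these assumed dimensions the avatar nearly fills the frame height, leaving 28 pixels above and below.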

C) SUBAGENT INPUT:
- PNG file URL of the AI-generated avatar image
- MP3 file URL of the voice-over audio
- Text transcription with timing data (for lip sync matching)

E) SUBAGENT TASK SUMMARY:
1. Input validation and preparation:
- #198 (Get Transcription Of MP3) to generate precise timing data from the voice-over
- #178 (Convert MP3 to WAV) to prepare audio for processing

2. Avatar processing:
- #191 (Resize Image) to ensure avatar is properly sized for video frame
- #221 (Recreate New Image) to optimize avatar for animation

3. Final video assembly:
- #168 (Generate Talking Head Video From MP3 & transcription) to create the final synchronized video

F) SILOS:

SILO 1: AUDIO PREPARATION
Input: Original MP3 voice-over
> #198 Get Transcription Of MP3
> #178 Convert MP3 to WAV
Output: WAV file + detailed transcription with timing data

SILO 2: AVATAR PREPARATION
Input: Original avatar PNG
> #191 Resize Image (to video frame specs)
> #221 Recreate New Image (optimization)
Output: Processed avatar image ready for animation

SILO 3: VIDEO GENERATION
Input: Processed avatar + WAV + transcription
> #168 Generate Talking Head Video
Output: Final MP4 video with synchronized lip movements

The workflow is designed to prepare both the audio and visual elements separately before combining them in the final video generation step, ensuring optimal quality and synchronization.
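Because the audio and avatar preparation silos do not depend on each other, they could in principle run concurrently before the final step. The sketch below shows one way to arrange that with Python's standard `concurrent.futures`; `assemble_video` and the three callables are hypothetical stand-ins for the silos, and running them in parallel is a design choice made here rather than something the workflow prescribes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def assemble_video(
    mp3_url: str,
    avatar_url: str,
    prepare_audio: Callable[[str], dict],        # Silo 1 stand-in (#198, #178)
    prepare_avatar: Callable[[str], str],        # Silo 2 stand-in (#191, #221)
    generate_video: Callable[[str, dict], str],  # Silo 3 stand-in (#168)
) -> str:
    """Prepare audio and avatar side by side, then combine them."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(prepare_audio, mp3_url)
        avatar_future = pool.submit(prepare_avatar, avatar_url)
        audio = audio_future.result()    # re-raises any preparation error
        avatar = avatar_future.result()
    return generate_video(avatar, audio)


# Stub run so the sketch executes without the real platform.
video_url = assemble_video(
    "voice.mp3",
    "avatar.png",
    prepare_audio=lambda url: {"wav": "voice.wav", "transcript": "hello"},
    prepare_avatar=lambda url: "avatar_resized.png",
    generate_video=lambda avatar, audio: f"video({avatar},{audio['wav']})",
)
print(video_url)
```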