What Shall We Build Next?

Let me break down the ScriptMaster subagent in detail:

A) SUBAGENT SUMMARY: 
ScriptMaster generates a concise, well-structured video script optimized for voice-over delivery by first researching the topic thoroughly and then crafting a conversational script that fits within the 100-300 word target length.

B) FINAL TASK OUTPUT:
A single text file containing a professionally formatted voice-over script of 100-300 words, structured with clear paragraphs, proper punctuation, and natural speaking patterns suitable for voice-over recording.

C) SUBAGENT INPUT:
- User's topic/description
- Any specific tone/style preferences
- Target length preference (within 100-300 word range)

E) SUBAGENT TASK SUMMARY:
Here's the detailed flow using specific skills:

1. Research Phase:
Input > #216 (Research Topic Deeply) > Comprehensive research data

2. Keyword Enhancement:
Research data > #218 (Brainstorm Related Keywords) > Enhanced topic understanding

3. Script Generation:
Enhanced topic understanding > #171 (Write Voice Over Script Based On Instructions) > Draft script

4. Script Refinement:
Draft script > #190 (Write or rewrite text based on instructions) [with specific instruction to optimize for voice-over delivery] > Final polished script

F) SILOS:
SILO 1: RESEARCH & PREPARATION
- Skill #216: Deep research of topic
- Skill #218: Keyword analysis for topic coverage
Output: Comprehensive research document

SILO 2: SCRIPT CREATION
- Skill #171: Initial script generation
- Skill #190: Script optimization
Output: Final voice-over ready script

This workflow ensures that the script is:
1. Well-researched and accurate
2. Properly structured for voice-over
3. Within the target length
4. Natural and conversational in tone
5. Ready for the next subagent (VoiceForge) to process

The double-pass approach (using both #171 and #190) ensures that the script is both content-rich and optimized for voice-over delivery, which is crucial for the final talking head video quality.
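The four-step chain above can be sketched as a sequential pipeline with a length check at the end. This is a minimal illustration, not the platform's actual API: `run_skill` is a hypothetical callable standing in for however skills #216, #218, #171, and #190 are really invoked, and `stub_skill` exists only so the example runs.

```python
from typing import Callable

MIN_WORDS, MAX_WORDS = 100, 300  # target range stated for ScriptMaster

# Skill IDs in the order the flow above applies them.
SCRIPT_CHAIN = [216, 218, 171, 190]


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def within_target(text: str) -> bool:
    """True when the script fits the 100-300 word range."""
    return MIN_WORDS <= word_count(text) <= MAX_WORDS


def run_script_chain(topic: str, run_skill: Callable[[int, str], str]) -> str:
    """Feed each skill's output into the next one, then check the length.

    run_skill(skill_id, payload) is a hypothetical hook; the real platform
    call is not shown in the source.
    """
    payload = topic
    for skill_id in SCRIPT_CHAIN:
        payload = run_skill(skill_id, payload)
    if not within_target(payload):
        raise ValueError(
            f"script has {word_count(payload)} words, expected 100-300"
        )
    return payload


def stub_skill(skill_id: int, payload: str) -> str:
    """Stand-in skill for demonstration only: the final step emits 120 words."""
    if skill_id == 190:
        return " ".join(["word"] * 120)
    return f"{payload} -> #{skill_id}"
```

Calling `run_script_chain("solar panels", stub_skill)` walks the four skills in order and raises `ValueError` if the final script falls outside the range, which gives the next subagent a guaranteed input size.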

SubAgent #1 - Diagram

Let me break down the VoiceForge subagent flow in detail:

A) SUBAGENT SUMMARY: 
VoiceForge converts a text script into a high-quality voice-over MP3 file, ensuring proper audio formatting for subsequent lip-sync processing.

B) FINAL TASK OUTPUT:
An MP3 file URL containing the voice-over audio, formatted for lip-sync processing: clear pronunciation, appropriate pacing (typically 130-150 words per minute), consistent audio levels (-14 LUFS), and proper encoding (44.1 kHz, 320 kbps).
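The pacing figure above implies a range of audio durations for the 100-300 word scripts that ScriptMaster hands over. The arithmetic below is a rough sketch that assumes a constant speaking rate; real text-to-speech output will vary. The function names are illustrative only.

```python
def estimated_seconds(words: int, words_per_minute: float) -> float:
    """Seconds needed to speak `words` at a constant rate."""
    return words * 60 / words_per_minute


def duration_range(
    min_words: int = 100,
    max_words: int = 300,
    slow_wpm: float = 130,
    fast_wpm: float = 150,
) -> tuple[float, float]:
    """Shortest case: fewest words at the fastest pace.
    Longest case: most words at the slowest pace."""
    shortest = estimated_seconds(min_words, fast_wpm)
    longest = estimated_seconds(max_words, slow_wpm)
    return shortest, longest


shortest, longest = duration_range()
print(f"{shortest:.0f}s to {longest:.0f}s")
```

Under these assumptions the audio runs from roughly 40 to 138 seconds. Since the title of skill #179, used later in this flow, mentions a 60-second file, longer scripts may be worth checking against that skill; whether the title reflects a hard limit is not stated in the source.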

C) SUBAGENT INPUT:
- Primary Input: Text script (100-300 words) from ScriptMaster subagent
- Secondary Input: Voice style preferences (if any specified in original user prompt)

E) SUBAGENT TASK SUMMARY:
The flow proceeds as:

1. Initial Input (script) > 
2. #170 (Turn Script Into Voice Over MP3) > 
3. #198 (Get Transcription Of MP3 With Timings) [to verify audio quality and timing] >
4. #179 (Create Visual Waveform Of 60 second Wav/mp3 File) [to verify audio levels] >
5. #176 (Analyze An Image With GPT Vision & Return Text) [to analyze waveform for quality control] >
Final Output: Verified high-quality MP3 URL

F) SILOS:
SILO 1: AUDIO GENERATION
- Input: Text script
- Skill: #170 Turn Script Into Voice Over MP3
- Output: Initial MP3 URL

SILO 2: QUALITY VERIFICATION
- Input: MP3 from Silo 1
- Skills: 
  * #198 Get Transcription With Timings
  * #179 Create Visual Waveform
  * #176 Analyze Waveform
- Output: Quality verification data

SILO 3: FINAL OUTPUT PREPARATION
- Input: Verified MP3 from Silo 2
- Action: If quality checks pass, proceed with original MP3
- Output: Final MP3 URL with verified quality for lip-sync processing

This structure ensures not just voice generation but also quality control through multiple verification steps, which is crucial for successful lip-sync processing in later stages.
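One way to act on the transcription from skill #198 is to compare it with the original script and pass the MP3 only when the two agree closely. The sketch below is an assumption about how such a gate could work, not a description of the platform: the 0.9 threshold and all function names are hypothetical, and the comparison uses Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9  # hypothetical cutoff; tune against real transcriptions


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def transcript_similarity(script: str, transcript: str) -> float:
    """Word-level similarity between script and transcription (0.0 to 1.0)."""
    matcher = SequenceMatcher(
        None, normalize(script), normalize(transcript), autojunk=False
    )
    return matcher.ratio()


def passes_quality_gate(
    script: str, transcript: str, threshold: float = MATCH_THRESHOLD
) -> bool:
    """True when the spoken audio matches the script closely enough."""
    return transcript_similarity(script, transcript) >= threshold
```

`autojunk=False` is set because `SequenceMatcher` otherwise discounts frequently repeated items in sequences of 200 or more elements, which would distort the score for scripts near the 300-word limit.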

SubAgent #2 - Diagram

Here's my complete analysis and workflow for the AvatarVision subagent:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, themed AI avatar image specifically designed for talking head videos, ensuring the avatar matches the video's topic and maintains professional quality suitable for lip-syncing.

B) FINAL TASK OUTPUT: 
A square (1024x1024) PNG file with a transparent background, showing a professional-looking head-and-shoulders avatar with clear facial features (especially around the mouth area), suitable for lip-sync animation.

C) SUBAGENT INPUT:
- Topic/theme of the video
- Style preferences for avatar (gender, age, profession, etc.)
- Any specific visual requirements (clothing, accessories, background elements)

E) SUBAGENT TASK SUMMARY:
The workflow requires multiple stages to ensure the highest quality avatar:

1. Initial Avatar Generation:
input > #223 (LLM prompt crafting) > refined prompt for avatar
refined prompt > #222 (Make Image With Text) > initial avatar image

2. Avatar Analysis & Refinement:
initial avatar > #176 (Analyze Image With GPT Vision) > analysis of facial features
analysis > #223 (LLM prompt improvement) > refined prompt
refined prompt > #222 (Make Image With Text) > improved avatar

3. Final Processing:
improved avatar > #191 (Resize Image to 1024x1024) > resized avatar
resized avatar > #182 (Create Transparent Background) > final avatar PNG

F) SILOS:
SILO 1: PROMPT ENGINEERING
- Input: Raw topic/theme/requirements
- Skill #223: Craft optimal prompt for avatar generation
- Output: Refined prompt

SILO 2: INITIAL GENERATION
- Input: Refined prompt
- Skill #222: Generate initial avatar
- Output: First avatar attempt

SILO 3: QUALITY CONTROL
- Input: Initial avatar
- Skill #176: Analyze facial features
- Skill #223: Improve prompt based on analysis
- Skill #222: Generate improved avatar
- Output: Better avatar

SILO 4: FINAL PROCESSING
- Input: Improved avatar
- Skill #191: Resize to correct dimensions
- Skill #182: Create transparent background
- Output: Final avatar PNG

This workflow ensures we get a high-quality avatar that:
1. Is properly themed to the video content
2. Has clear facial features for animation
3. Is correctly sized and formatted
4. Has a transparent background for video processing
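The generate, analyze, and regenerate cycle in Silos 1-3 can be sketched as a small loop. This is only an illustration under stated assumptions: the three callables are hypothetical stand-ins for skills #223, #222, and #176, whose real interfaces are not shown in the source, and the resize and transparency steps of Silo 4 are left out.

```python
from typing import Callable


def refine_avatar(
    request: str,
    craft_prompt: Callable[[str], str],   # stand-in for skill #223
    make_image: Callable[[str], str],     # stand-in for skill #222, returns an image URL
    analyze_image: Callable[[str], str],  # stand-in for skill #176, returns analysis text
    passes: int = 1,
) -> str:
    """Generate an avatar, then improve it `passes` times using image analysis."""
    prompt = craft_prompt(request)
    image_url = make_image(prompt)
    for _ in range(passes):
        analysis = analyze_image(image_url)
        # Fold the analysis back into the prompt before regenerating.
        prompt = craft_prompt(
            f"{prompt}\n\nImprove based on this analysis: {analysis}"
        )
        image_url = make_image(prompt)
    return image_url
```

With `passes=1` the loop reproduces the flow described above: one initial image, one analysis, and one improved image. Raising `passes` would repeat the quality-control silo at the cost of extra generation calls.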

SubAgent #3 - Diagram

Let me break down the LipSyncWizard subagent following the requested format:

A) SUBAGENT SUMMARY: 
LipSyncWizard analyzes an audio file to generate precise phoneme timing data and maps it to corresponding viseme (mouth shape) configurations for realistic lip-sync animation.

B) FINAL TASK OUTPUT:
A structured JSON data file containing:
- Timestamp markers (in milliseconds)
- Corresponding phoneme identifications
- Matching viseme configurations
- Head position/movement data
- Emotional markers/facial expressions
- Amplitude/energy levels for emphasis

C) SUBAGENT INPUT:
- Voice-over MP3 file URL
- Transcription with word-level timing
- Avatar image reference (for mouth shape mapping)

E) SUBAGENT TASK SUMMARY:
The workflow proceeds as:

1. Convert MP3 to WAV for analysis
#178 - Convert 1-20 MP3s to wav

2. Generate detailed audio analysis
#179 - Create Visual Waveform Of 60 second Wav/mp3 File
#180 - Extract Beatpoints & Tempo of MP3
#198 - Get Transcription Of MP3 (With Timings)

3. Process transcription data
#223 - Powerful LLM Prompt-to-Text Response
(Used to convert transcription into phoneme sequences)

4. Analyze audio patterns
#188 - Extract 10 audio stems from mp3
(To separate speech components for better analysis)

5. Generate timing data
#219 - Cut Wav/mp3 Audio into Multiple Pieces/Samples
(To create precise phoneme segments)

F) SILOS:

SILO 1: AUDIO PREPARATION
- Input: Original MP3
- Skills: #178, #179
- Output: WAV file + waveform analysis

SILO 2: SPEECH ANALYSIS
- Input: WAV file
- Skills: #180, #198, #188
- Output: Detailed speech pattern data

SILO 3: PHONEME MAPPING
- Input: Speech pattern data
- Skills: #223 (for phoneme identification), #219
- Output: JSON timing/phoneme data

This subagent structure ensures comprehensive audio analysis and precise phoneme-to-viseme mapping, which is crucial for realistic lip-sync animation in the final video output.

Note: This workflow makes the best use of the available skills, but more precise viseme mapping and facial animation data would likely require additional custom skills.
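To make the JSON output described above more concrete, here is a minimal sketch of the phoneme-to-viseme mapping step. Everything in it is an assumption for illustration: the phoneme symbols are ARPAbet-style, the viseme names and groupings are hypothetical (real viseme sets differ between animation systems), and only the timestamp, phoneme, and viseme fields are covered, not head movement, emotion, or amplitude.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative groupings only: sounds that share a mouth shape share a viseme.
PHONEME_TO_VISEME = {
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "F": "lip_to_teeth", "V": "lip_to_teeth",
    "AA": "open_wide", "AE": "open_wide",
    "OW": "rounded", "UW": "rounded",
}
DEFAULT_VISEME = "neutral"  # fallback for phonemes not in the table


@dataclass
class LipSyncEvent:
    start_ms: int
    end_ms: int
    phoneme: str
    viseme: str


def build_events(phoneme_timings: list[tuple[str, int, int]]) -> list[LipSyncEvent]:
    """Turn (phoneme, start_ms, end_ms) tuples into viseme-tagged events."""
    return [
        LipSyncEvent(start, end, phoneme,
                     PHONEME_TO_VISEME.get(phoneme, DEFAULT_VISEME))
        for phoneme, start, end in phoneme_timings
    ]


def to_json(events: list[LipSyncEvent]) -> str:
    """Serialize events into the structured JSON file format."""
    return json.dumps({"events": [asdict(e) for e in events]}, indent=2)
```

A call such as `to_json(build_events([("M", 0, 80), ("AA", 80, 200)]))` yields one JSON object per phoneme segment, with millisecond timestamps that a downstream video step can consume.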

I'll analyze SUBAGENT 5: "VideoAssemblerPro" and break it down according to your requirements.

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI avatar image, voice-over audio, and lip-sync data to create a synchronized talking head MP4 video with natural-looking mouth movements.

B) FINAL TASK OUTPUT:
A square MP4 video file (1080x1080, 30 fps) featuring a single talking head avatar whose lip movements are synchronized with the audio track, with clear audio quality and a duration matching the input audio file.

C) SUBAGENT INPUT:
- PNG URL of the AI-generated avatar image
- MP3 URL of the voice-over audio
- Text transcription with timestamps (for lip-sync mapping)

E) SUBAGENT TASK SUMMARY:
1. Input MP3 > #198 (Get Transcription Of MP3 With Timings) > Detailed timestamp data
2. MP3 URL + Transcription > #168 (Generate Talking Head Video From MP3 & transcription) > Initial MP4
3. Initial MP4 > #199 (Add Images & Videos On Top Of Existing MP4) [overlaying the custom avatar PNG] > Final MP4

F) SILOS:

SILO 1: AUDIO PREPARATION
- Input: MP3 voice-over file
- Skill: #198 Get Transcription Of MP3 (With Timings)
- Output: Precise transcription with timestamps

SILO 2: BASE VIDEO GENERATION
- Input: MP3 + Transcription from Silo 1
- Skill: #168 Generate Talking Head Video From MP3 & transcription
- Output: Base talking head MP4

SILO 3: AVATAR INTEGRATION
- Input: Base MP4 + Custom Avatar PNG
- Skill: #199 Add Images & Videos On Top Of Existing MP4
- Output: Final rendered MP4 with custom avatar

Note: The original subagent description called for new skills to handle lip-sync data and video assembly, but I've reconfigured the flow to use existing skills that can accomplish similar results. Skill #168 already handles basic lip-sync, and #199 can overlay the custom avatar. This approach may not produce animation as sophisticated as a dedicated lip-sync system would, but it creates a functional talking head video using the available skills.
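Because the output is specified at 30 fps with a duration matching the audio, millisecond timestamps from the transcription can be related to video frames with simple arithmetic. The sketch below shows that conversion only; whether skill #168 exposes frame-level control is not stated in the source, and the function names are hypothetical.

```python
import math

FPS = 30  # frame rate given in the output specification


def frame_for_timestamp(ms: int, fps: int = FPS) -> int:
    """Index of the frame on screen at `ms` milliseconds (frame 0 starts at 0 ms)."""
    return ms * fps // 1000


def total_frames(audio_seconds: float, fps: int = FPS) -> int:
    """Frames needed so the video is at least as long as the audio."""
    return math.ceil(audio_seconds * fps)
```

Integer arithmetic in `frame_for_timestamp` avoids floating-point drift over long clips, and rounding up in `total_frames` keeps the last fraction of a second of audio from being cut off.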