Let me break down the ScriptMaster subagent in detail:

A) SUBAGENT SUMMARY: 
ScriptMaster generates a professionally-written, conversational video script optimized for voice-over delivery, based on the user's topic description and intended content.

B) FINAL TASK OUTPUT: 
A text file containing a 100-300 word voice-over script, formatted with proper punctuation and natural speech patterns, optimized for AI voice synthesis, including clear paragraph breaks and any necessary pronunciation guides.
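
The 100-300 word target can be checked mechanically before the script moves on to voice synthesis. The following is a minimal sketch, assuming a plain whitespace word count is accurate enough; the function name `check_script_length` and its parameters are illustrative and not part of any listed skill.

```python
def check_script_length(script: str, min_words: int = 100,
                        max_words: int = 300) -> tuple[bool, int]:
    """Return (is_within_range, word_count) for a voice-over script.

    Words are counted by splitting on whitespace. This is an assumption:
    a voice engine may read numbers or abbreviations as several words.
    """
    word_count = len(script.split())
    return min_words <= word_count <= max_words, word_count


# Example: a repeated sentence used only as filler text.
draft = "Welcome to our quick tour of solar power. " * 20
ok, count = check_script_length(draft)
print(f"{count} words, within range: {ok}")
```

A script that fails the check could be sent back to the rewrite step (#190) with an explicit length instruction, although the source does not describe such a loop.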

C) SUBAGENT INPUT:
- User's topic/description
- Desired length (if specified)
- Any specific tone/style requirements
- Target audience information (if provided)

D) SUBAGENT TASK SUMMARY:
The workflow follows this sequence:

1. Research Phase:
#216 (Research Topic Deeply) > Returns comprehensive research summary
↓
2. Keyword Enhancement:
#218 (Brainstorm Related Keywords) > Returns relevant keywords and context
↓
3. Initial Script Generation:
#171 (Write Voice Over Script Based On Instructions) > Returns draft script
↓
4. Script Optimization:
#190 (Write or rewrite text based on instructions) > Returns polished script
[This final step specifically optimizes for voice synthesis compatibility]
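
The four-step sequence above amounts to a simple chain in which each step consumes the previous step's output. The sketch below models that chain under stated assumptions: the real skills (#216, #218, #171, #190) are platform calls whose signatures are not given here, so they are represented by hypothetical text-to-text functions passed in by the caller, and the keyword step is assumed to work from the research summary.

```python
from typing import Callable

# Hypothetical stand-in type: every skill is treated as text in, text out.
Skill = Callable[[str], str]


def run_script_master(topic: str, research: Skill, keywords: Skill,
                      draft: Skill, polish: Skill) -> str:
    """Chain the four steps in the order listed in the task summary."""
    summary = research(topic)                      # 1. Research Phase (#216)
    context = keywords(summary)                    # 2. Keyword Enhancement (#218)
    first_draft = draft(f"{summary}\n{context}")   # 3. Initial Script Generation (#171)
    return polish(first_draft)                     # 4. Script Optimization (#190)


# Usage with stub functions that only tag their input.
final_script = run_script_master(
    "solar power",
    research=lambda t: f"research({t})",
    keywords=lambda s: f"keywords({s})",
    draft=lambda d: f"draft({d})",
    polish=lambda p: f"polished({p})",
)
print(final_script)
```

Passing the skills in as arguments keeps the sketch independent of how the platform actually invokes them.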

E) SILOS:
SILO 1: RESEARCH & CONTEXT
- Input: User's topic description
- Skill #216: Deep research of topic
- Skill #218: Keyword analysis
- Output: Research summary + keyword context

SILO 2: SCRIPT CREATION
- Input: Research summary + keywords
- Skill #171: Initial script generation
- Output: Draft script

SILO 3: OPTIMIZATION
- Input: Draft script
- Skill #190: Voice-over optimization
- Output: Final polished script

This structure ensures thorough research, proper context gathering, and multiple refinement stages to create a script that's both content-rich and optimized for voice synthesis and delivery.

The separation into silos allows for quality control at each stage and ensures that the final script is both well-researched and technically optimized for the next stages in the larger workflow (voice synthesis and avatar animation).

SubAgent #1 - Diagram

I'll analyze SUBAGENT 2: "VoiceForge" and break it down according to the guidelines:

A) SUBAGENT SUMMARY: 
VoiceForge converts a text script into a high-quality voice-over MP3 file, ensuring proper audio formatting and quality for lip-sync compatibility in the final talking head video.

B) FINAL TASK OUTPUT:
MP3 audio file (URL) containing clear voice-over narration of the script, with consistent volume levels, minimal background noise, and proper timing for lip-sync integration (typically 44.1 kHz stereo; a bit depth such as 16-bit applies to the decoded WAV audio, not to the MP3 itself).

C) SUBAGENT INPUT:
- Text script (100-300 words) from ScriptMaster
- Voice style preference (if any, defaulting to neutral professional voice)

D) SUBAGENT TASK SUMMARY:
The workflow follows these steps:

1. Initial Conversion:
text script > #170 (Turn Script Into Voice Over MP3) > initial MP3 URL

2. Audio Analysis & Quality Check:
MP3 URL > #198 (Get Transcription Of MP3 With Timings) > transcription with timing
MP3 URL > #179 (Create Visual Waveform Of 60 second Wav/mp3 File) > waveform image
waveform image > #176 (Analyze An Image With GPT Vision & Return Text) > audio quality analysis

3. Audio Processing (if needed based on analysis):
MP3 URL > #178 (Convert 1-20 MP3s to wav) > WAV file
WAV file > #219 (Cut Wav/mp3 Audio into Multiple Pieces/Samples) > processed audio segments
[Note: This step only executes if the quality analysis indicates issues]

E) SILOS:
SILO 1: INITIAL AUDIO GENERATION
- Input: Text script
- Skill: #170 (Turn Script Into Voice Over MP3)
- Output: Initial MP3 URL

SILO 2: QUALITY VERIFICATION
- Input: Initial MP3 URL
- Skills: #198, #179, #176
- Output: Quality analysis results

SILO 3: AUDIO REFINEMENT (Conditional)
- Input: MP3 URL (if quality check fails)
- Skills: #178, #219
- Output: Final processed audio file

The subagent concludes when either:
a) The initial MP3 passes quality verification
OR
b) The audio refinement process completes successfully

The final output is always a single MP3 file URL that meets the quality standards required for lip-sync integration in the final talking head video.
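
The two exit conditions can be expressed as a single branch. This is a sketch only: `synthesize`, `passes_quality_check`, and `refine` are hypothetical stand-ins for skill #170, the #198/#179/#176 analysis, and the #178/#219 processing respectively, and each is assumed to take and return plain values such as URLs.

```python
from typing import Callable


def run_voice_forge(script: str,
                    synthesize: Callable[[str], str],
                    passes_quality_check: Callable[[str], bool],
                    refine: Callable[[str], str]) -> str:
    """Return one audio URL, refining only when the quality check fails."""
    mp3_url = synthesize(script)        # Silo 1: initial audio generation
    if passes_quality_check(mp3_url):   # Silo 2: quality verification
        return mp3_url                  # exit condition (a)
    return refine(mp3_url)              # Silo 3: exit condition (b)


# Usage with stubs: the check fails, so the refined URL is returned.
url = run_voice_forge(
    "Hello and welcome.",
    synthesize=lambda text: "initial.mp3",
    passes_quality_check=lambda u: False,
    refine=lambda u: "refined.mp3",
)
print(url)
```

Note that the refinement skills listed above produce WAV segments, so a real implementation would still need a step that returns a single MP3 URL; the stub hides that detail.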

SubAgent #2 - Diagram

Let me break down the AvatarVision subagent flow in detail:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, themed AI avatar image that matches the video's topic and style, ensuring it's optimized for talking head animation with clear facial features and appropriate framing.

B) FINAL TASK OUTPUT: 
A 1024x1024 PNG file of a front-facing avatar with clear facial features, a neutral expression, good lighting, and themed styling, saved with a transparent background for animation compatibility.

C) SUBAGENT INPUT:
- Topic/theme of the video
- Style preferences for avatar (e.g., professional, casual, specific profession/character type)
- Any specific visual requirements (gender, age range, ethnicity if specified)

D) SUBAGENT TASK SUMMARY:
Input > #223 (Generate optimized prompt) > #222 (Initial avatar generation) > #176 (Quality check) > #221 (Refinement if needed) > #191 (Final size optimization) > Output

Detailed Flow:
1. Use #223 (Powerful LLM) to convert user requirements into an optimized image generation prompt that emphasizes:
   - Front-facing position
   - Clear facial features
   - Neutral expression
   - Professional lighting
   - Themed styling elements
   - Background removal compatibility

2. Use #222 (Make Image) to generate initial avatar using the optimized prompt

3. Use #176 (Analyze Image) to verify:
   - Facial clarity
   - Proper positioning
   - Theme appropriateness
   - Animation suitability

4. IF quality check fails, use #221 (Recreate Image) to refine based on specific feedback

5. Use #191 (Resize Image) to ensure final 1024x1024 dimension
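
The five steps above form a generate, check, refine, resize sequence. A minimal sketch follows, with every skill represented by a hypothetical callable supplied by the caller. Two details are assumptions rather than anything stated in the flow: `analyze` returns `None` when the image passes and a feedback string otherwise, and refinement is capped by `max_refinements`.

```python
from typing import Callable, Optional


def run_avatar_vision(requirements: str,
                      build_prompt: Callable[[str], str],
                      make_image: Callable[[str], str],
                      analyze: Callable[[str], Optional[str]],
                      recreate: Callable[[str, str], str],
                      resize: Callable[[str, int, int], str],
                      max_refinements: int = 1) -> str:
    """Generate an avatar, refine it if the check fails, then resize it."""
    prompt = build_prompt(requirements)       # step 1 (#223)
    image = make_image(prompt)                # step 2 (#222)
    for _ in range(max_refinements):
        feedback = analyze(image)             # step 3 (#176); None means passed
        if feedback is None:
            break
        image = recreate(image, feedback)     # step 4 (#221), only on failure
    return resize(image, 1024, 1024)          # step 5 (#191)


# Usage with stubs: the first image is rejected once, then accepted.
avatar = run_avatar_vision(
    "friendly chef, professional style",
    build_prompt=lambda r: f"front-facing portrait, neutral expression, {r}",
    make_image=lambda p: "avatar_v1.png",
    analyze=lambda img: None if img == "avatar_v2.png" else "face partly cropped",
    recreate=lambda img, fb: "avatar_v2.png",
    resize=lambda img, w, h: f"{img}@{w}x{h}",
)
print(avatar)
```

Capping the refinements keeps the flow from looping indefinitely if the quality check never passes.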

E) SILOS:
SILO 1: PROMPT ENGINEERING
- Input: Raw user requirements
- Skill: #223 
- Output: Optimized generation prompt

SILO 2: IMAGE GENERATION
- Input: Optimized prompt
- Skill: #222
- Output: Initial avatar image

SILO 3: QUALITY ASSURANCE
- Input: Initial avatar image
- Skills: #176 > #221 (if needed)
- Output: Verified/refined avatar

SILO 4: OPTIMIZATION
- Input: Verified avatar
- Skill: #191
- Output: Final sized avatar PNG

This workflow ensures consistent, high-quality avatar generation with appropriate checks and refinements for talking head animation compatibility.

SubAgent #3 - Diagram

Let me break down the LipSyncWizard subagent:

A) SUBAGENT SUMMARY: 
LipSyncWizard analyzes voice-over audio to generate precise phoneme timing data and maps it to corresponding viseme (mouth shape) positions for synchronized avatar animation.

B) FINAL TASK OUTPUT: 
A JSON data structure containing:
- Timestamp-mapped phoneme data (millisecond precision)
- Corresponding viseme positions (mouth shapes)
- Head movement coordinates (x,y,z rotation values)
- Facial expression markers
Format: {timing: ms, phoneme: string, viseme: int, head_pos: [x,y,z], expression: float}
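
One way to make the format above concrete is a small typed record that serializes to JSON. The sketch below assumes the output is a list of such records, one per phoneme; the class name `AnimationFrame` and the helper `frames_to_json` are illustrative, and the viseme numbering is left open because the format does not define it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnimationFrame:
    timing: int            # milliseconds from the start of the audio
    phoneme: str
    viseme: int            # mouth-shape index; numbering scheme not specified
    head_pos: list[float]  # [x, y, z] rotation values
    expression: float


def frames_to_json(frames: list[AnimationFrame]) -> str:
    """Serialize frames to a JSON array using the field names from the format."""
    return json.dumps([asdict(frame) for frame in frames])


# Example values are invented purely to show the shape of one record.
print(frames_to_json([AnimationFrame(120, "AH", 2, [0.0, 2.5, 0.0], 0.3)]))
```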

C) SUBAGENT INPUT:
- Voice-over MP3 URL
- Transcription with word-level timing
- Avatar reference image (to understand mouth/face structure)

D) SUBAGENT TASK SUMMARY:
1. Convert MP3 to WAV for analysis
   > #178 (Convert MP3 to WAV)

2. Generate detailed audio waveform
   > #179 (Create Visual Waveform)

3. Extract precise audio timing data
   > #198 (Get Transcription with Timings)

4. Analyze audio characteristics
   > #180 (Extract Beatpoints & Tempo)

5. Generate viseme mapping data
   > #223 (LLM to process audio analysis and generate viseme mappings)

E) SILOS:

SILO 1: AUDIO PREPARATION
- Input: MP3 URL
- Skill #178: Convert to WAV
- Skill #179: Generate waveform
- Output: WAV file + waveform visualization

SILO 2: TIMING ANALYSIS
- Input: WAV from Silo 1
- Skill #198: Get precise transcription
- Skill #180: Extract beatpoints and tempo
- Output: Detailed timing data

SILO 3: VISEME MAPPING
- Input: Timing data + Avatar reference
- Skill #223: Process data and generate viseme mappings
- Output: Final JSON animation data

Note: The limitation here is that while we can generate the precise timing data, creating accurate viseme mappings requires a custom solution beyond the current skill set. The LLM (#223) can help structure the data, but additional development would be needed for precise mouth shape calculations.
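
To illustrate the kind of custom step the note refers to, here is a deliberately naive sketch. It assumes that a phoneme list is already available for each word (for example from a pronunciation dictionary, which is not among the listed skills) and spreads those phonemes evenly across the word's time span. The lookup table and its viseme IDs are invented for illustration and do not follow any animation standard.

```python
# Illustrative phoneme-to-viseme table; IDs and groupings are made up.
VISEME_TABLE = {
    "M": 1, "B": 1, "P": 1,   # closed lips
    "AA": 2, "AH": 2,         # open mouth
    "OW": 3, "UW": 3,         # rounded lips
    "F": 4, "V": 4,           # teeth on lower lip
}
REST_VISEME = 0  # fallback for phonemes missing from the table


def map_word_to_visemes(phonemes: list[str], start_ms: int,
                        end_ms: int) -> list[dict]:
    """Spread a word's phonemes evenly across its word-level time span."""
    if not phonemes:
        return []
    step = (end_ms - start_ms) / len(phonemes)
    return [
        {
            "timing": round(start_ms + index * step),
            "phoneme": phoneme,
            "viseme": VISEME_TABLE.get(phoneme, REST_VISEME),
        }
        for index, phoneme in enumerate(phonemes)
    ]


# A three-phoneme word spoken between 0 ms and 300 ms.
print(map_word_to_visemes(["M", "AA", "P"], 0, 300))
```

Even spacing ignores the fact that vowels usually last longer than consonants, which is one reason precise mouth-shape timing needs the additional development mentioned in the note.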

Let me break down the VideoAssemblerPro subagent following your guidelines:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI avatar image with voice-over audio and lip-sync data to create a synchronized talking head video.

B) FINAL TASK OUTPUT:
An MP4 video file (1920x1080) featuring a speaking AI avatar with synchronized lip movements and natural head movements, duration matching the input audio file length, with clear audio quality at 48kHz.
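
The requirement that the video duration match the audio length can be verified once both durations are known. The check below is a sketch: it assumes the durations have already been read from the files by some other means, and the 0.1 second default tolerance is an arbitrary illustrative value rather than a figure from the specification.

```python
def durations_match(audio_seconds: float, video_seconds: float,
                    tolerance: float = 0.1) -> bool:
    """Return True when the video length is within `tolerance` of the audio length."""
    return abs(audio_seconds - video_seconds) <= tolerance


# Example: a 42.0 s voice-over against a 42.04 s rendered video.
print(durations_match(42.0, 42.04))
```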

C) SUBAGENT INPUT:
- PNG file URL of the AI-generated avatar image
- MP3 file URL of the voice-over audio
- Text transcription with timing data (for lip-sync alignment)

D) SUBAGENT TASK SUMMARY:
Input > #198 Get Transcription Of MP3 (With Timings) > #168 Generate Talking Head Video From MP3 & transcription > #199 Add Images & Videos On Top Of Existing MP4 > Final MP4 Output

The flow works like this:
1. First, get precise timing data from the audio using #198
2. Use #168 to generate the base talking head video
3. Use #199 to enhance the video with any additional overlay elements

E) SILOS:
SILO 1: AUDIO PREPARATION
- Input: MP3 voice-over file
- Skill: #198 (Get Transcription Of MP3 With Timings)
- Output: Precise transcription with timing data

SILO 2: BASE VIDEO GENERATION
- Input: MP3 + transcription
- Skill: #168 (Generate Talking Head Video)
- Output: Base MP4 with synchronized lip movements

SILO 3: VIDEO ENHANCEMENT
- Input: Base MP4 + avatar PNG
- Skill: #199 (Add Images & Videos On Top Of Existing MP4)
- Output: Final enhanced MP4 video

Note: This workflow assumes skill #168 has built-in lip-sync capabilities. If it doesn't, we would need an additional specialized lip-sync processing step before video generation, but this isn't currently available in the skill list.

The main advantage of this approach is that it uses proven skills (#168, #198, #199) rather than requiring new unproven ones, while still achieving the core objective of creating a synchronized talking head video.