What Shall We Build Next?

I'll analyze SubAgent 1 "ScriptMaster" in detail:

A) SUBAGENT SUMMARY:
ScriptMaster generates a well-structured, conversational voice-over script (100-300 words) by researching the topic thoroughly and crafting content specifically optimized for spoken delivery.

B) FINAL TASK OUTPUT:
A single text file containing a polished, ready-to-record voice-over script of 100-300 words, formatted with proper punctuation and natural speech patterns, saved as plain text.
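
The 100-300 word target can be checked mechanically before the script is handed on. The sketch below is only an illustration: the function names are hypothetical, and it assumes that a "word" is any whitespace-separated token.

```python
def word_count(script: str) -> int:
    """Count whitespace-separated tokens in the script."""
    return len(script.split())


def is_valid_length(script: str, min_words: int = 100, max_words: int = 300) -> bool:
    """Return True when the script falls inside the 100-300 word target."""
    return min_words <= word_count(script) <= max_words
```

A script that fails this check could be sent back through the refinement step with an instruction to shorten or expand it.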

C) SUBAGENT INPUT:
- User's topic/description for the video
- Optional style preferences (tone, length, target audience)

E) SUBAGENT TASK SUMMARY:
The subagent will follow this sequence:

1. Research Phase:
user_input > #216 (Research Topic Deeply) > research_data

2. Keyword Enhancement:
research_data > #218 (Brainstorm Related Keywords) > enhanced_keywords

3. Script Generation:
[research_data + enhanced_keywords] > #171 (Write Voice Over Script Based On Instructions) > draft_script

4. Script Refinement:
draft_script > #190 (Write or rewrite text based on instructions) with a specific prompt for voice-over optimization > final_script
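
The four-step sequence above can be sketched as a simple chain of function calls. The platform's real skill-invocation API is not shown in this document, so every function below is a hypothetical stand-in that returns placeholder text; only the order of the steps and the names of the intermediate values come from the sequence itself.

```python
# Hypothetical stand-ins for the platform skills. The real skills are run by
# the agent platform; these stubs only return placeholder strings.
def research_topic(topic: str) -> str:                  # skill #216
    return f"research notes on {topic}"


def brainstorm_keywords(research: str) -> str:          # skill #218
    return f"keywords from [{research}]"


def write_script(research: str, keywords: str) -> str:  # skill #171
    return f"draft using [{research}] and [{keywords}]"


def rewrite_text(text: str, instructions: str) -> str:  # skill #190
    return f"{text} (rewritten: {instructions})"


def run_scriptmaster(user_input: str) -> str:
    """Chain the four steps in the order listed above."""
    research_data = research_topic(user_input)
    enhanced_keywords = brainstorm_keywords(research_data)
    draft_script = write_script(research_data, enhanced_keywords)
    final_script = rewrite_text(draft_script, "optimize for voice-over delivery")
    return final_script
```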

F) SILOS:
SILO 1: RESEARCH & PREPARATION
- Input: User's topic/description
- Skill #216: Research topic deeply
- Skill #218: Brainstorm related keywords
- Output: Research data + enhanced keywords

SILO 2: SCRIPT CREATION
- Input: Research data + keywords
- Skill #171: Generate initial voice-over script
- Output: Draft script

SILO 3: OPTIMIZATION
- Input: Draft script
- Skill #190: Refine for voice-over delivery
- Output: Final polished script

This flow combines thorough research, keyword integration, and a voice-over-specific refinement pass, producing a script that is both informative and natural-sounding when spoken. The refinement pass (Skill #190) tailors the content for audio delivery rather than for reading.

SubAgent #1 - Diagram

Let me break down the VoiceForge subagent in detail:

A) SUBAGENT SUMMARY: 
VoiceForge converts a text script into a high-quality voice-over MP3 file, with proper pacing and natural speech patterns suitable for an avatar talking head video.

B) FINAL TASK OUTPUT: 
A single MP3 file (URL) containing the voice-over narration of the script, with a duration typically between 30 seconds and 5 minutes (based on the 100-300 word limit of the input script), saved in a standard web audio format (44.1kHz, 128-320kbps).
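
The expected duration can be sanity-checked with simple arithmetic. The sketch below assumes a speaking rate of 150 words per minute, which is an illustrative figure and not one stated by the subagent; the function name is also hypothetical.

```python
def estimated_duration_seconds(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate narration length from word count at an assumed speaking rate."""
    return word_count * 60.0 / words_per_minute


for words in (100, 300):
    print(words, "words ->", estimated_duration_seconds(words), "seconds")
```

At that assumed rate, 100 words come to 40 seconds and 300 words to 120 seconds, so clips near the 5-minute end of the range would imply a much slower pace or long inserted pauses.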

C) SUBAGENT INPUT:
- Primary Input: Text script (100-300 words) from ScriptMaster subagent
- Optional Input: Voice style preferences or specifications (if provided in original user prompt)

E) SUBAGENT TASK SUMMARY:
The workflow follows this sequence:

1. Input text script > #190 (Write or rewrite text based on instructions) to optimize the script for voice-over delivery (adding pauses, adjusting punctuation) > optimized script

2. Optimized script > #170 (Turn Script Into Voice Over MP3) > initial MP3 URL

3. Initial MP3 URL > #198 (Get Transcription Of MP3 With Timings) > transcription with timing verification

4. Transcription > #223 (Powerful LLM Prompt-to-Text Response) to verify that timing and pacing match requirements > verification report

5. If verification passes > output the final MP3 URL
   If verification fails > return to step 1 with adjusted parameters
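
The pass/fail branch in step 5 amounts to a retry loop. The sketch below shows one way to structure it, with the three stages passed in as callables because the real skill API is not shown here. The `max_attempts` cap is an added assumption: the workflow as written does not say how many retries are allowed, and without a cap a script that never verifies would loop forever.

```python
from typing import Callable, Optional


def generate_with_verification(
    script: str,
    optimize: Callable[[str, int], str],   # stands in for skill #190
    synthesize: Callable[[str], str],      # stands in for skill #170
    verify: Callable[[str], bool],         # stands in for skills #198 and #223
    max_attempts: int = 3,                 # assumed cap, not part of the original plan
) -> Optional[str]:
    """Return the first MP3 URL that passes verification, or None."""
    for attempt in range(1, max_attempts + 1):
        optimized = optimize(script, attempt)  # step 1, adjusted on each retry
        mp3_url = synthesize(optimized)        # step 2
        if verify(mp3_url):                    # steps 3-4
            return mp3_url                     # step 5: verification passed
    return None                                # no attempt passed verification
```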

F) SILOS:
SILO 1: SCRIPT OPTIMIZATION
- Input: Raw script
- Process: Skill #190
- Output: Voice-optimized script

SILO 2: AUDIO GENERATION
- Input: Optimized script
- Process: Skill #170
- Output: MP3 file

SILO 3: QUALITY VERIFICATION
- Input: Generated MP3
- Process: Skills #198 and #223
- Output: Verification and timing check

The subagent includes verification steps to ensure the voice-over meets quality standards before passing to the next subagent in the main workflow. This is crucial since the audio forms the foundation for the lip-sync animation in later steps.

SubAgent #2 - Diagram

Here's my complete analysis and workflow for the AvatarVision subagent:

A) SUBAGENT SUMMARY: 
AvatarVision is a specialized image generation subagent that creates a high-quality, contextually appropriate AI avatar image matching the topic and style of the video content, with particular attention to facial features and head-and-shoulders framing suitable for a talking head video.

B) FINAL TASK OUTPUT: 
A single 1024x1024 transparent PNG file of a professional-looking AI avatar with clear facial features, proper head/shoulders framing, and thematic styling relevant to the video content.
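
The size and transparency requirements can be partly checked from the PNG header alone, without decoding the image. The sketch below reads the IHDR chunk and accepts the file only if it is 1024x1024 and uses a colour type with an alpha channel (4 or 6). The function names are hypothetical, and the check has limits: it does not confirm that any pixel is actually transparent, and it rejects palette images that carry transparency in a separate tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes):
    """Return (width, height, color_type) from the IHDR chunk of a PNG."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: width, height, bit depth, colour type.
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, color_type


def looks_like_avatar_png(data: bytes, size: int = 1024) -> bool:
    """True for a size x size PNG whose colour type includes an alpha channel."""
    width, height, color_type = read_png_header(data)
    has_alpha = color_type in (4, 6)  # greyscale+alpha or truecolour+alpha
    return (width, height) == (size, size) and has_alpha
```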

C) SUBAGENT INPUT:
1. Topic/theme of the video content
2. Style preferences for the avatar (if any)
3. Any specific requirements about profession/expertise the avatar should convey

D) SUBAGENT TASK SUMMARY:
The workflow follows these steps:

1. Generate optimal prompt using LLM
Input: Raw topic/theme/style preferences
Skill: #223 (Powerful LLM Prompt-to-Text Response)
Output: Optimized image generation prompt

2. Create initial avatar image
Input: Optimized prompt from step 1
Skill: #182 (Create Dalle Image (Transparent Square))
Output: Initial 1024x1024 transparent PNG

3. Analyze generated image
Input: PNG from step 2
Skill: #176 (Analyze An Image With GPT Vision & Return Text)
Output: Analysis text of image quality/suitability

4. If needed, regenerate with refined prompt
Input: Analysis feedback
Skill: #223 (Powerful LLM Prompt-to-Text Response)
Output: Refined prompt
Then repeat skill #182 with refined prompt

E) SILOS:
SILO 1: PROMPT ENGINEERING
- Input: Raw requirements
- Skill #223: Generate optimal prompt
- Output: Refined prompt

SILO 2: IMAGE GENERATION
- Input: Optimized prompt
- Skill #182: Generate transparent avatar
- Output: PNG file

SILO 3: QUALITY CONTROL
- Input: Generated PNG
- Skill #176: Analyze image
- Output: Quality assessment
- If needed, return to SILO 1 for refinement

The workflow includes a feedback loop where if the generated image doesn't meet quality standards (assessed by #176), it can return to the prompt engineering stage for refinement. This ensures the final avatar is optimally suited for talking head video use.

Key considerations built into this workflow:
- Ensures transparent background for video compatibility
- Focuses on head/shoulders framing
- Maintains professional appearance while incorporating thematic elements
- Provides quality control check
- Allows for refinement if needed

SubAgent #3 - Diagram

Let me break down the LipSyncWizard subagent in detail:

A) SUBAGENT SUMMARY: 
LipSyncWizard analyzes an audio file to generate precise phoneme timing data and maps it to corresponding viseme (mouth shape) positions, creating a structured animation data file that synchronizes speech with avatar mouth movements.

B) FINAL TASK OUTPUT:
A JSON data structure containing:
- Timestamp markers (in milliseconds)
- Corresponding phoneme identifiers
- Mapped viseme positions (mouth shapes)
- Optional head movement coordinates
- Audio amplitude data for expression intensity
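
One possible shape for that JSON structure is sketched below. The field names (`time_ms`, `phoneme`, `viseme`, `amplitude`, `head`), the viseme labels, and the sample values are all hypothetical; the description above lists what the data contains but does not fix a schema.

```python
import json
from typing import Optional, Tuple


def make_frame(time_ms: int, phoneme: str, viseme: str, amplitude: float,
               head: Optional[Tuple[float, float]] = None) -> dict:
    """Build one lip-sync entry; head movement is included only when given."""
    frame = {
        "time_ms": time_ms,      # timestamp marker in milliseconds
        "phoneme": phoneme,      # phoneme identifier
        "viseme": viseme,        # mapped mouth shape
        "amplitude": amplitude,  # expression intensity
    }
    if head is not None:
        frame["head"] = {"x": head[0], "y": head[1]}  # optional head movement
    return frame


frames = [
    make_frame(0, "HH", "neutral", 0.40),
    make_frame(80, "AH", "open", 0.75, head=(0.0, 1.5)),
    make_frame(190, "M", "closed", 0.55),
]
payload = json.dumps({"frames": frames}, indent=2)
print(payload)
```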

C) SUBAGENT INPUT:
- MP3 voice-over file URL
- Transcription of the audio (with timing markers)

E) SUBAGENT TASK SUMMARY:
1. Convert MP3 to WAV for precise audio analysis
   > #178 (Convert MP3 to WAV)

2. Generate visual waveform for amplitude analysis
   > #179 (Create Visual Waveform)
   > #176 (Analyze Image with GPT Vision) to extract amplitude patterns

3. Get precise transcription with timestamps
   > #198 (Get Transcription with Timings)

4. Extract audio characteristics:
   > #180 (Extract Beatpoints & Tempo) for natural head movement timing
   > #188 (Extract audio stems) to isolate clear voice track

F) SILOS:

SILO 1: AUDIO PREPROCESSING
- Input: MP3 URL
- Skill #178: Convert to WAV
- Skill #188: Extract vocal stem
- Output: Clean WAV file for analysis

SILO 2: AMPLITUDE MAPPING
- Input: WAV from Silo 1
- Skill #179: Generate waveform
- Skill #176: Analyze waveform
- Output: Amplitude data points

SILO 3: SPEECH TIMING
- Input: Clean vocal WAV
- Skill #198: Get precise transcription
- Skill #180: Extract rhythm points
- Output: Timestamped phoneme data

SILO 4: DATA COMPILATION
- Input: All silo outputs
- Skill #223: Use LLM to structure final JSON
- Output: Final lip-sync data file

Note: This workflow prepares the timing and audio analysis that lip-sync data depends on, but the final step of converting that data into viseme mappings (mouth position data) would require a custom skill. The current skill set does not include that capability.
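
As a rough illustration of what such a custom step might do, the sketch below maps timed phonemes to mouth shapes with a lookup table. It assumes ARPAbet-style phoneme labels and an invented set of viseme names; neither comes from the skill list, and a real mapping would need to cover the full phoneme inventory of the transcription tool.

```python
# Hypothetical phoneme-to-viseme table covering only a few phonemes.
PHONEME_TO_VISEME = {
    "P": "closed", "B": "closed", "M": "closed",      # lips pressed together
    "F": "teeth_on_lip", "V": "teeth_on_lip",         # upper teeth on lower lip
    "AA": "open", "AE": "open", "AH": "open",         # open-mouth vowels
    "OW": "rounded", "UW": "rounded",                 # rounded-lip vowels
}
DEFAULT_VISEME = "neutral"  # fallback for phonemes missing from the table


def map_visemes(timed_phonemes):
    """Convert (time_ms, phoneme) pairs into (time_ms, viseme) pairs."""
    return [
        (time_ms, PHONEME_TO_VISEME.get(phoneme, DEFAULT_VISEME))
        for time_ms, phoneme in timed_phonemes
    ]
```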

Let me break down the VideoAssemblerPro subagent according to your format:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI-generated avatar image with voice-over audio and lip-sync data to create a synchronized talking head video where the avatar's mouth and facial movements match the spoken audio.

B) FINAL TASK OUTPUT: 
An MP4 video file (16:9 aspect ratio) featuring the AI avatar speaking with synchronized lip movements, approximately matching the duration of the input audio file (typically 1-5 minutes), with the avatar centered in frame against a clean background.

C) SUBAGENT INPUT:
- PNG URL of the AI-generated avatar image
- MP3 URL of the voice-over audio
- Text transcription with timing data (for lip-sync mapping)

E) SUBAGENT TASK SUMMARY:
1. First Silo - Audio Processing:
- Input MP3 > #198 (Get Transcription of MP3 With Timings) > Detailed phoneme timing data
- Input MP3 > #196 (Extract MP3 Audio From MP4 File) > Clean audio file for sync

2. Second Silo - Image Processing:
- Input Avatar PNG > #191 (Resize Image) [to 1024x1024] > Properly sized avatar
- Resized Avatar > #176 (Analyze Image With GPT Vision) > Facial feature mapping data

3. Third Silo - Video Generation:
- Input all processed components > #168 (Generate Talking Head Video From MP3 & transcription) > Raw MP4
- Raw MP4 > #194 (Cut Small Section From MP4 Video) > Clean final MP4
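
Because the avatar is a 1024x1024 square and the output is a 16:9 frame, the avatar has to be scaled and offset to sit centered. The arithmetic is sketched below, assuming a 1920x1080 frame; that resolution and the function name are illustrative assumptions, since the plan specifies only the aspect ratio.

```python
def center_square_in_frame(frame_w: int, frame_h: int, avatar_size: int):
    """Scale a square avatar to the frame's shorter side and center it.

    Returns (scale, x_offset, y_offset, scaled_size) in pixels.
    """
    scaled_size = min(frame_w, frame_h)       # largest square that fits
    scale = scaled_size / avatar_size
    x_offset = (frame_w - scaled_size) // 2   # equal margins left and right
    y_offset = (frame_h - scaled_size) // 2   # equal margins top and bottom
    return scale, x_offset, y_offset, scaled_size


print(center_square_in_frame(1920, 1080, 1024))
```

For a 1920x1080 frame the avatar is scaled to 1080x1080 and placed 420 pixels from the left edge, which leaves equal margins on both sides.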

F) SILOS:
SILO 1: AUDIO PREPARATION
Purpose: Process audio for precise timing
Input: Original MP3
Output: Clean audio file + timing data

SILO 2: IMAGE PREPARATION
Purpose: Optimize avatar image for animation
Input: Original PNG
Output: Processed avatar image with mapping

SILO 3: VIDEO ASSEMBLY
Purpose: Generate and clean final video
Input: All processed components
Output: Final talking head MP4

Note: This workflow uses existing skills where possible, but the lip-sync generation itself might require specialized skills that are not currently in the list. Skill #168 serves as the primary video generator here and might need enhancement for more sophisticated lip-sync capabilities.