I'll analyze ScriptMaster and break it down according to your requirements:

A) SUBAGENT SUMMARY: 
ScriptMaster generates an optimized, engaging voice-over script from a user's topic description, incorporating research and best practices for spoken-word content.

B) FINAL TASK OUTPUT: 
A text file containing a 100-300 word voice-over script, formatted specifically for audio narration, with natural speech patterns, appropriate pacing breaks, and clear pronunciation guides where needed.

C) SUBAGENT INPUT:
- Primary user topic/description
- Optional style preferences (tone, length, target audience)
- Optional technical requirements (max duration, specific terminology)

E) SUBAGENT TASK SUMMARY:
The optimal flow would be:

user input > #216 Research Topic Deeply > #223 Powerful LLM Prompt-to-Text Response > #171 Write Voice Over Script Based On Instructions > final script output

This chain ensures:
1. Deep research gathers comprehensive topic information
2. LLM processes research into initial script structure
3. Specialized voice-over writing skill finalizes the script

F) SILOS:
SILO 1: RESEARCH & PREPARATION
- Input: User topic description
- Skill: #216 Research Topic Deeply
- Output: Comprehensive research summary

SILO 2: INITIAL SCRIPT STRUCTURING
- Input: Research summary
- Skill: #223 Powerful LLM Prompt-to-Text Response
- Output: Initial script draft with structure and key points

SILO 3: VOICE-OVER OPTIMIZATION
- Input: Initial script draft
- Skill: #171 Write Voice Over Script Based On Instructions
- Output: Final voice-over ready script

This arrangement ensures each stage builds upon the previous one, with clear handoffs between silos. The research phase ensures accuracy, the LLM structures the content appropriately, and the final voice-over optimization makes it suitable for spoken delivery.

The final output from Silo 3 will be ready to pass to the next subagent in the larger workflow (VoiceForge).
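
The three-silo handoff above can be sketched as a plain sequential chain. This is only an illustration, not the platform's API: `research_topic`, `draft_script`, and `write_voice_over` are hypothetical stand-ins for skills #216, #223, and #171, and they return placeholder strings instead of calling any real service.

```python
from typing import Callable, List

# Hypothetical stand-ins for the three skills (assumptions, not real APIs).
def research_topic(topic: str) -> str:
    """Stand-in for skill #216 (Research Topic Deeply)."""
    return f"research summary for: {topic}"

def draft_script(research: str) -> str:
    """Stand-in for skill #223 (Powerful LLM Prompt-to-Text Response)."""
    return f"draft based on [{research}]"

def write_voice_over(draft: str) -> str:
    """Stand-in for skill #171 (Write Voice Over Script Based On Instructions)."""
    return f"voice-over script from [{draft}]"

def run_chain(user_input: str, silos: List[Callable[[str], str]]) -> str:
    """Feed each silo's output to the next silo as its input."""
    result = user_input
    for silo in silos:
        result = silo(result)
    return result

SCRIPTMASTER_SILOS = [research_topic, draft_script, write_voice_over]
final_script = run_chain("history of the bicycle", SCRIPTMASTER_SILOS)
print(final_script)
```

Keeping the silos in a list makes the handoff order explicit and lets a stage be swapped without touching the others.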

SubAgent #1 - Diagram

I'll analyze SUBAGENT 2: "VoiceForge" and break it down according to your requirements.

A) SUBAGENT SUMMARY:
VoiceForge converts a text script into a high-quality voice-over MP3 file, ensuring proper audio formatting and quality for subsequent lip-sync processing.

B) FINAL TASK OUTPUT:
The URL of an MP3 file containing clear voice-over audio, with a duration between 30 seconds and 5 minutes (based on a 100-300 word script), sampled at 44.1 kHz in stereo, with consistent volume levels suitable for lip-sync processing.
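
As a rough check on the duration range, narration length can be estimated from the word count. The sketch below assumes a pace of 150 words per minute, which is an assumption and not a figure from this plan; `estimate_duration_seconds` is a hypothetical helper name.

```python
def estimate_duration_seconds(word_count: int, words_per_minute: float = 150.0) -> float:
    """Estimate narration length; the default pace is an assumed value."""
    if word_count < 0 or words_per_minute <= 0:
        raise ValueError("word_count must be >= 0 and words_per_minute > 0")
    return word_count * 60.0 / words_per_minute

for words in (100, 300):
    print(words, "words ->", estimate_duration_seconds(words), "seconds")
# 100 words -> 40.0 seconds
# 300 words -> 120.0 seconds
```

Under that assumed pace a 100-300 word script runs 40 to 120 seconds. A 300-word script would only reach 5 minutes at about 60 words per minute.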

C) SUBAGENT INPUT:
- Primary input: Text script (100-300 words) from ScriptMaster
- Optional input: Voice style preferences (if any specified in original user request)

E) SUBAGENT TASK SUMMARY:
The workflow needs multiple steps to ensure optimal audio quality:

1. Initial Voice Generation:
input: text script > #170 (Turn Script Into Voice Over MP3) > MP3 URL

2. Audio Quality Check & Processing:
MP3 URL > #178 (Convert MP3 to WAV) > WAV URL > #179 (Create Visual Waveform) > Waveform JPEG URL > #176 (Analyze Image With GPT Vision) > Audio quality analysis text

3. Final Audio Preparation:
If quality check passes > original MP3 becomes final output
If quality check fails > repeat #170 with adjusted parameters

F) SILOS:
SILO 1: VOICE GENERATION
- Purpose: Create initial voice-over
- Skill: #170
- Input: Text script
- Output: MP3 URL

SILO 2: QUALITY VERIFICATION
- Purpose: Verify audio quality
- Skills: #178 > #179 > #176
- Input: Initial MP3
- Output: Quality analysis text

SILO 3: FINALIZATION
- Purpose: Ensure final output meets requirements
- Input: Quality analysis + MP3
- Output: Final verified MP3 URL

This structure ensures we not only generate the voice-over but also verify its quality for optimal lip-sync processing in later stages. The quality check step is crucial as poor audio quality could negatively impact the final talking head animation.
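
The generate, check, and retry behaviour described above can be sketched as a bounded loop. Everything here is hypothetical: `generate` stands in for skill #170, `passes_quality_check` stands in for the #178 > #179 > #176 chain, and `max_attempts` is an assumed limit, since the plan does not say how many retries are allowed.

```python
from typing import Callable, Optional

def generate_verified_audio(
    script: str,
    generate: Callable[[str, int], str],
    passes_quality_check: Callable[[str], bool],
    max_attempts: int = 3,  # assumed retry limit, not specified in the plan
) -> Optional[str]:
    """Return the first MP3 URL that passes the check, or None if none does."""
    for attempt in range(1, max_attempts + 1):
        # The attempt number lets the generator adjust its parameters on retries.
        mp3_url = generate(script, attempt)
        if passes_quality_check(mp3_url):
            return mp3_url
    return None

# Placeholder implementations so the sketch runs without any external service.
def fake_generate(script: str, attempt: int) -> str:
    return f"https://example.invalid/voice-{attempt}.mp3"

def fake_check(mp3_url: str) -> bool:
    return mp3_url.endswith("-2.mp3")  # pretend only the second attempt is clean

final_url = generate_verified_audio("Hello there.", fake_generate, fake_check)
print(final_url)
```

Bounding the loop avoids regenerating audio indefinitely when the quality check can never pass.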

SubAgent #2 - Diagram

Let me break down the AvatarVision subagent following the requested format:

A) SUBAGENT SUMMARY: 
AvatarVision generates a high-quality, thematically appropriate AI avatar image that will serve as the base for the talking head video, ensuring the avatar matches the content topic and maintains professional visual quality.

B) FINAL TASK OUTPUT:
A single 1024x1024 PNG file with a transparent background, showing a professional-looking head-and-shoulders AI avatar with clear facial features suitable for lip-sync animation.

C) SUBAGENT INPUT:
- User's topic/theme description
- Style preferences for avatar (if any)
- Content context from ScriptMaster's script (to ensure avatar matches content theme)

E) SUBAGENT TASK SUMMARY:
1. Input > Skill #223 (Powerful LLM) to convert user requirements into specific image generation prompts
2. Generated prompt > Skill #182 (Create Dalle Image Transparent Square) to create initial avatar
3. Initial avatar > Skill #176 (Analyze Image with GPT Vision) to verify avatar quality/suitability
4. If needed based on analysis > Skill #221 (Recreate New Image From Image URL) to refine/improve
5. Final avatar image > Skill #191 (Resize Image) to ensure exact 1024x1024 dimensions
Output: Final transparent PNG avatar image URL

F) SILOS:
SILO 1 - PROMPT ENGINEERING
• Input: User requirements + script context
• Skill #223: Generate optimal image prompt
• Output: Refined image generation prompt

SILO 2 - IMAGE GENERATION
• Input: Refined prompt from Silo 1
• Skill #182: Generate transparent avatar
• Output: Initial avatar PNG

SILO 3 - QUALITY CONTROL
• Input: Initial avatar from Silo 2
• Skill #176: Analyze avatar quality
• If needed, Skill #221: Recreate/refine avatar
• Output: Quality-verified avatar

SILO 4 - FORMAT OPTIMIZATION
• Input: Quality-verified avatar
• Skill #191: Ensure correct dimensions
• Output: Final 1024x1024 transparent PNG avatar

This workflow ensures we get a high-quality, appropriate avatar that's technically suitable for the subsequent video generation steps while maintaining strict quality control at each stage.
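
The format requirement (1024x1024 PNG with transparency) can be checked cheaply by reading the PNG header. The sketch below uses only the standard library and the fixed layout of the PNG signature and IHDR chunk. The helper names are hypothetical, and the check is deliberately narrow: it confirms an alpha-capable color type, not that the background pixels are actually transparent, and it does not cover palette images that carry transparency in a tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
ALPHA_COLOR_TYPES = {4, 6}  # 4 = grayscale + alpha, 6 = RGBA

def read_png_header(data: bytes):
    """Return (width, height, color_type) from the leading IHDR chunk."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG with a leading IHDR chunk")
    # IHDR payload starts at byte 16: width, height, bit depth, color type.
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, color_type

def is_valid_avatar(data: bytes, size: int = 1024) -> bool:
    """True when the image is size x size and has an alpha channel."""
    width, height, color_type = read_png_header(data)
    return width == size and height == size and color_type in ALPHA_COLOR_TYPES

# Build a header-only sample (no pixel data) to exercise the check.
sample = PNG_SIGNATURE + struct.pack(">I4sIIBBBBB", 13, b"IHDR", 1024, 1024, 8, 6, 0, 0, 0)
print(is_valid_avatar(sample))  # True
```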

SubAgent #3 - Diagram

Let me break down the LipSyncWizard subagent in detail:

A) SUBAGENT SUMMARY: 
LipSyncWizard analyzes an audio file to create precise phoneme timing data and maps it to corresponding viseme (mouth shape) positions, generating animation data that will drive the avatar's lip movements and facial expressions.

B) FINAL TASK OUTPUT: 
A JSON file containing timestamped phoneme-to-viseme mapping data, including:
- Timestamp markers (in milliseconds)
- Corresponding phoneme identifiers
- Mouth shape/viseme positions (x,y coordinates)
- Optional head movement/facial expression markers
- Audio amplitude data for emphasis
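
One possible shape for that JSON file is sketched below. The field names (`time_ms`, `phoneme`, `viseme_xy`, `amplitude`, `expression`) and all sample values are hypothetical; the plan lists what the file should contain but does not define a schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional

@dataclass
class VisemeFrame:
    time_ms: int                      # timestamp marker in milliseconds
    phoneme: str                      # phoneme identifier
    viseme_xy: List[float]            # mouth shape position as (x, y)
    amplitude: float                  # audio amplitude, used for emphasis
    expression: Optional[str] = None  # optional head/facial expression marker

# Illustrative values only, not real analysis output.
frames = [
    VisemeFrame(time_ms=0, phoneme="HH", viseme_xy=[0.50, 0.40], amplitude=0.2),
    VisemeFrame(time_ms=80, phoneme="AH", viseme_xy=[0.50, 0.55], amplitude=0.7,
                expression="brow_raise"),
]

payload = json.dumps({"frames": [asdict(frame) for frame in frames]}, indent=2)
print(payload)
```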

C) SUBAGENT INPUT:
- MP3 voice-over file URL
- Transcription with word timing data
- Basic configuration parameters for animation style

E) SUBAGENT TASK SUMMARY:
1. Convert MP3 to WAV for analysis
   > Skill #178 (Convert MP3 to WAV)

2. Get precise audio transcription with timing
   > Skill #198 (Get Transcription with Timings)

3. Generate audio waveform for amplitude analysis
   > Skill #179 (Create Visual Waveform)
   > Skill #176 (Analyze Image with GPT Vision) to extract amplitude data

4. Extract beat/tempo data for natural head movements
   > Skill #180 (Extract Beatpoints & Tempo)

5. Use LLM to convert transcription+timing into phoneme sequence
   > Skill #223 (Powerful LLM) to map words to phoneme sequences

6. Generate final JSON mapping using collected data
   > Skill #223 (Powerful LLM) to compile all data into structured format

F) SILOS:
SILO 1: AUDIO PREPARATION
- Input: MP3 URL
- Skill #178 > WAV file
- Skill #198 > Timestamped transcription
- Output: WAV file + transcription

SILO 2: MOVEMENT ANALYSIS
- Input: WAV file
- Skill #179 > Waveform image
- Skill #176 > Amplitude data
- Skill #180 > Beat/tempo data
- Output: Movement timing data

SILO 3: PHONEME MAPPING
- Input: Transcription
- Skill #223 > Phoneme sequence
- Output: Timestamped phonemes

SILO 4: FINAL COMPILATION
- Input: All previous silo outputs
- Skill #223 > Final JSON compilation
- Output: Complete lip-sync animation data file

Silo 1 prepares the audio. Silos 2 and 3 each work from Silo 1's outputs and do not depend on each other. Silo 4 then merges everything into the final animation data format required by the video renderer.
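
The dependencies implied by each silo's Input line can be written down as a small graph and scheduled with the standard library's `graphlib` (Python 3.9+). The silo keys below are hypothetical labels for the four silos listed above.

```python
from graphlib import TopologicalSorter

# Each silo maps to the set of silos whose output it needs.
SILO_DEPENDENCIES = {
    "audio_preparation": set(),
    "movement_analysis": {"audio_preparation"},   # needs the WAV file
    "phoneme_mapping": {"audio_preparation"},     # needs the transcription
    "final_compilation": {"audio_preparation", "movement_analysis", "phoneme_mapping"},
}

sorter = TopologicalSorter(SILO_DEPENDENCIES)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # silos whose inputs are all available
    batches.append(ready)
    sorter.done(*ready)

print(batches)
# [['audio_preparation'], ['movement_analysis', 'phoneme_mapping'], ['final_compilation']]
```

Silos that land in the same batch have no dependency on each other, so they could run in parallel.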

Let me break down the VideoAssemblerPro subagent following your guidelines:

A) SUBAGENT SUMMARY: 
VideoAssemblerPro combines an AI-generated avatar image, voice-over audio, and lip-sync data to create a synchronized talking head video where the avatar's mouth and facial movements match the audio precisely.

B) FINAL TASK OUTPUT:
MP4 video file (1080x1080 square format) featuring the AI avatar with synchronized lip movements matching the voice-over audio, duration matching the input audio file length (typically 30-180 seconds), with high-quality 48kHz audio.

C) SUBAGENT INPUT:
- PNG URL of the AI-generated avatar image (transparent background)
- MP3 URL of the voice-over audio
- Text transcription with precise timing data for lip-sync
- (Optional) Animation parameters for head movements

E) SUBAGENT TASK SUMMARY:
1. Input avatar PNG > #191 Resize Image (to ensure 1080x1080) > Resized avatar
2. Input MP3 > #198 Get Transcription of MP3 (With Timings) > Transcription with word-level timing data
3. Input MP3 + transcription + avatar > #168 Generate Talking Head Video From MP3 & transcription > Initial MP4
4. Initial MP4 > #194 Cut Small Section From MP4 Video (trim any excess) > Final MP4

F) SILOS:

SILO 1: AVATAR PREPARATION
- Input: Original avatar PNG
- Skill: #191 (Resize Image)
- Output: Properly sized avatar PNG (1080x1080)

SILO 2: AUDIO ANALYSIS
- Input: Voice-over MP3
- Skill: #198 (Get Transcription of MP3)
- Output: Precise timing data for lip-sync

SILO 3: VIDEO GENERATION
- Input: 
  * Sized avatar from Silo 1
  * Voice-over MP3, plus timing data from Silo 2
- Skill: #168 (Generate Talking Head Video)
- Output: Raw talking head MP4

SILO 4: VIDEO FINALIZATION
- Input: Raw MP4 from Silo 3
- Skill: #194 (Cut Small Section)
- Output: Final trimmed MP4 with precise timing
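
The trim step in Silo 4 comes down to comparing the rendered video length with the audio length. The sketch below is only an illustration under two assumptions that the plan does not state: any excess sits at the end of the video, and a small `tolerance` (hypothetical parameter) is acceptable without trimming.

```python
from typing import Optional, Tuple

def trim_window(
    video_seconds: float,
    audio_seconds: float,
    tolerance: float = 0.05,  # assumed acceptable overhang in seconds
) -> Optional[Tuple[float, float]]:
    """Return the (start, end) section to keep, or None when no trim is needed."""
    if video_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("durations must be positive")
    if video_seconds - audio_seconds <= tolerance:
        return None  # video is already as short as the audio (or shorter)
    return (0.0, audio_seconds)

print(trim_window(62.4, 60.0))   # (0.0, 60.0)
print(trim_window(60.02, 60.0))  # None
```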

Note: This workflow assumes that skill #168 (Generate Talking Head Video) handles lip-sync animation itself. If it does not, additional lip-sync skills would be needed, and the current skill list does not include one.