Let me break down the ScriptMaster subagent based on your guidelines:

A) SUBAGENT SUMMARY: 
ScriptMaster takes a user's topic description and generates a professionally structured voice-over script optimized for talking head videos, incorporating both research-based content and natural speaking patterns.

B) FINAL TASK OUTPUT:
A text file containing a 100-300 word voice-over script, formatted with proper punctuation and natural speech patterns, optimized for AI voice synthesis, and structured with clear introduction, body, and conclusion sections.

C) SUBAGENT INPUT:
- Primary user topic/description (text)
- Optional style preferences (casual/formal/educational)
- Optional target duration (in seconds/minutes)

E) SUBAGENT TASK SUMMARY:
The workflow requires three chained steps:

1. Research Phase:
#216 (Research Topic Deeply) 
INPUT: user's topic description
OUTPUT: comprehensive research summary

2. Keyword Enhancement:
#218 (Brainstorm Related Keywords)
INPUT: research summary from step 1
OUTPUT: relevant keywords and phrases to incorporate

3. Final Script Generation:
#171 (Write Voice Over Script Based On Instructions)
INPUT: combined research summary + keywords + specific instructions for voice-over formatting
OUTPUT: final voice-over script
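
The three steps form a simple linear chain, where each skill's output feeds the next. The sketch below is minimal and assumes a hypothetical `run_skill(skill_number, text)` callable that invokes a skill and returns its text output. The platform's real invocation API is not described in this plan, and the instruction wording is illustrative.

```python
from typing import Callable, Optional

# Hypothetical runner: run_skill(skill_number, text_input) -> text output.
SkillRunner = Callable[[int, str], str]

# Illustrative formatting instructions for #171 (drawn from SILO 2 of this plan).
SCRIPT_INSTRUCTIONS = (
    "Write a 100-300 word voice-over script for a talking head video. "
    "Use natural speech patterns, a clear intro/body/conclusion structure, "
    "and punctuation suited to AI voice synthesis."
)


def script_master(topic: str, run_skill: SkillRunner, style: Optional[str] = None) -> str:
    """Chain #216 -> #218 -> #171 and return the final script text."""
    research = run_skill(216, topic)       # step 1: research summary
    keywords = run_skill(218, research)    # step 2: keywords drawn from the research
    parts = [SCRIPT_INSTRUCTIONS]
    if style:                              # optional casual/formal/educational preference
        parts.append(f"Style: {style}")
    parts.append(f"Research summary:\n{research}")
    parts.append(f"Keywords to include:\n{keywords}")
    return run_skill(171, "\n\n".join(parts))  # step 3: final script
```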

F) SILOS:
SILO 1 - RESEARCH & PREPARATION
- Skill #216: Deep research of topic
- Skill #218: Keyword enhancement

SILO 2 - SCRIPT CREATION
- Skill #171: Final script generation with specific instructions:
  * Include natural speech patterns
  * Format for lip-sync compatibility
  * Structure with clear intro/body/conclusion
  * Maintain 100-300 word limit
  * Optimize for AI voice synthesis
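
Of these instructions, the 100-300 word limit is the one that can be checked mechanically before the script is passed to VoiceForge. The following is a minimal sketch. It counts whitespace-separated tokens and uses an assumed speaking rate of 140 words per minute, a figure that does not come from the plan.

```python
MIN_WORDS, MAX_WORDS = 100, 300
ASSUMED_WPM = 140  # assumed average speaking rate for synthetic voice-over


def check_script(script: str) -> tuple[int, float]:
    """Return (word_count, estimated_seconds); raise ValueError outside 100-300 words."""
    count = len(script.split())  # whitespace-separated tokens
    if not MIN_WORDS <= count <= MAX_WORDS:
        raise ValueError(f"script has {count} words; expected {MIN_WORDS}-{MAX_WORDS}")
    return count, count / ASSUMED_WPM * 60
```

At this assumed rate, the 100-300 word range corresponds to roughly 43 to 129 seconds of audio.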

The workflow ensures comprehensive research is done first, then enhanced with relevant keywords, before being transformed into a natural-sounding script that's optimized for AI voice synthesis and talking head video generation.

This approach is intended to ensure that the final script is both well-researched and properly formatted for the later stages of the larger workflow (voice synthesis and video generation).

Let me break down the VoiceForge subagent in detail:

A) SUBAGENT SUMMARY:
VoiceForge converts a text script into high-quality voice-over audio, ensuring proper formatting and optimal audio quality for lip-sync compatibility.

B) FINAL TASK OUTPUT:
An MP3 file URL containing clear, well-paced voice-over audio with consistent volume levels. For a 100-300 word script at typical speaking rates, the audio runs roughly 40 seconds to 2.5 minutes. It should be sampled at 44.1 kHz, with a target bitrate of 192 kbps or higher to support clean lip-sync.

C) SUBAGENT INPUT:
- Primary Input: Text script (100-300 words) from ScriptMaster
- Additional Input: Voice style preferences (if any)

E) SUBAGENT TASK SUMMARY:
The flow requires multiple steps to ensure optimal audio quality:

1. Initial Voice Generation:
Input (text script) > #170 (Turn Script Into Voice Over MP3) > MP3_URL

2. Audio Quality Check:
MP3_URL > #179 (Create Visual Waveform Of 60 second Wav/mp3 File) > Waveform_Image (#179 covers 60 seconds of audio, so longer voice-overs are only partly checked)
Waveform_Image > #176 (Analyze An Image With GPT Vision & Return Text) > Audio_Quality_Analysis

3. If Audio Quality Analysis indicates issues:
- Convert format: MP3_URL > #178 (Convert 1-20 MP3s to wav) > WAV_URL
- Extract crucial timing: WAV_URL > #198 (Get Transcription Of MP3 With Timings) > Timing_Data
- Regenerate if needed: Return to step 1 with adjusted pacing based on timing data
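
Because step 3 loops back to step 1, the flow needs a stopping rule. The following is a minimal sketch of the generate, check, and regenerate loop. Each skill is passed in as a hypothetical callable, and the quality verdict is assumed to be parsed from the #176 text into a `passes` flag. The three-attempt cap, the 140 wpm target, and the pacing-hint wording are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# (word, start_seconds, end_seconds), the assumed shape of #198 output
Timing = tuple[str, float, float]


@dataclass
class QualityReport:
    passes: bool     # hypothetical verdict parsed from the #176 text analysis
    notes: str = ""


def pacing_hint(timings: list[Timing], target_wpm: int = 140) -> str:
    """Summarize the measured speaking rate as a hint for regeneration."""
    if not timings:
        return f"target about {target_wpm} wpm"
    duration = timings[-1][2] - timings[0][1]
    wpm = len(timings) / duration * 60 if duration > 0 else 0.0
    return f"measured {wpm:.0f} wpm; target about {target_wpm} wpm"


def voice_forge(
    script: str,
    tts: Callable[[str, Optional[str]], str],        # #170: (script, hint) -> MP3 URL
    check_quality: Callable[[str], QualityReport],    # #179 + #176: MP3 URL -> report
    to_wav: Callable[[str], str],                      # #178: MP3 URL -> WAV URL
    get_timings: Callable[[str], list[Timing]],       # #198: WAV URL -> word timings
    max_attempts: int = 3,
) -> str:
    """Return the first MP3 URL that passes the quality check."""
    hint: Optional[str] = None
    for _ in range(max_attempts):
        mp3_url = tts(script, hint)
        if check_quality(mp3_url).passes:
            return mp3_url
        # Quality issue: measure pacing from the WAV timings, then regenerate.
        hint = pacing_hint(get_timings(to_wav(mp3_url)))
    raise RuntimeError(f"no acceptable audio after {max_attempts} attempts")
```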

F) SILOS:
SILO 1: VOICE GENERATION
- Purpose: Initial voice creation
- Skill: #170
- Input: Text script
- Output: Initial MP3

SILO 2: QUALITY VERIFICATION
- Purpose: Ensure audio meets lip-sync requirements
- Skills: #179, #176, #178
- Input: Initial MP3
- Output: Quality analysis and WAV format

SILO 3: TIMING OPTIMIZATION
- Purpose: Verify proper pacing and timing
- Skill: #198
- Input: WAV file
- Output: Timing data for pacing verification; the final MP3 URL is the #170 output that passes the quality check

This structure ensures we get consistently high-quality voice output suitable for lip-syncing, with built-in quality checks and optimization steps.

Let me break down the AvatarVision subagent following the requested format:

A) SUBAGENT SUMMARY:
AvatarVision generates a high-quality, themed AI avatar image that matches the video's topic and style, optimized for talking head animation purposes.

B) FINAL TASK OUTPUT:
A square (1024x1024) PNG file with a transparent background, showing a professional-looking avatar head and shoulders with clean edges, a neutral expression, and clear facial features suitable for animation.

C) SUBAGENT INPUT:
- Topic/theme of the video
- Style preferences for avatar (professional, casual, specific profession, etc.)
- Any specific visual requirements (age, gender, ethnicity, clothing, etc.)

E) SUBAGENT TASK SUMMARY:
The workflow chains together as follows:

1. Input description > #223 (Powerful LLM) to enhance/optimize the avatar prompt
2. Enhanced prompt > #222 (Make Image With Text) to generate initial avatar attempt
3. Initial avatar URL > #176 (Analyze Image with GPT Vision) to verify suitability
4. If needed based on analysis > #221 (Recreate Image) to refine/improve
5. Final image > #191 (Resize Image) to ensure 1024x1024 dimension
6. Output: Final transparent PNG avatar image URL (no listed skill removes the background, so transparency must come from the #222/#221 output)
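
Steps 5 and 6 promise a 1024x1024 PNG with a transparent background. Both properties can be read from the PNG's IHDR header without an image library. The sketch below uses only the standard library and assumes the image bytes have already been downloaded from the URL. It treats color types 4 (grayscale + alpha) and 6 (RGBA) as transparent-capable and ignores `tRNS`-based transparency.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
ALPHA_COLOR_TYPES = {4, 6}  # grayscale + alpha, truecolor + alpha


def check_avatar_png(data: bytes, size: int = 1024) -> None:
    """Raise ValueError unless `data` is a size x size PNG with an alpha channel."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    if len(data) < 33:  # signature (8) + chunk length/type (8) + IHDR data (13) + CRC (4)
        raise ValueError("truncated PNG header")
    # The first chunk must be IHDR: 4-byte length, 4-byte type, then 13 data bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing IHDR chunk")
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    if (width, height) != (size, size):
        raise ValueError(f"expected {size}x{size}, got {width}x{height}")
    if color_type not in ALPHA_COLOR_TYPES:
        raise ValueError(f"color type {color_type} has no alpha channel")
```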

F) SILOS:
SILO 1 - PROMPT OPTIMIZATION
- Input: Raw topic/style description
- Skill: #223 (Powerful LLM)
- Output: Optimized image generation prompt

SILO 2 - IMAGE GENERATION
- Input: Optimized prompt
- Skill: #222 (Make Image With Text)
- Output: Initial avatar image URL

SILO 3 - QUALITY CONTROL
- Input: Initial avatar URL
- Skill: #176 (Analyze Image with GPT Vision)
- Output: Analysis of avatar suitability
- Conditional: If analysis indicates issues, use #221 (Recreate Image)

SILO 4 - FINAL FORMATTING
- Input: Best avatar image
- Skill: #191 (Resize Image) to 1024x1024
- Output: Final formatted transparent PNG

This structured approach ensures we get a high-quality, appropriately-themed avatar that's properly formatted for the subsequent animation process, with quality control measures built in.

Let me break down the LipSyncWizard subagent in detail:

A) SUBAGENT SUMMARY:
LipSyncWizard processes an audio file to generate precise phoneme timing data and maps it to viseme (visual phoneme) positions for accurate lip-sync animation of an AI avatar.

B) FINAL TASK OUTPUT:
A structured JSON data file containing:
- Timestamped phoneme sequences
- Corresponding viseme mappings
- Head movement timing markers
- Facial expression change points
- Audio amplitude data for emphasis
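
The plan lists what the JSON file contains but not its shape. The dataclasses below show one possible layout that covers the five listed items. Every field name is an assumption for illustration, not a format that #168 or any animation engine is known to accept.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VisemeEvent:
    phoneme: str   # ARPAbet symbol, e.g. "AA1"
    viseme: str    # mouth-shape label mapped from the phoneme
    start: float   # seconds from the start of the audio
    end: float


@dataclass
class ExpressionChange:
    time: float
    expression: str  # e.g. "neutral", "smile"


@dataclass
class LipSyncData:
    audio_url: str
    visemes: list[VisemeEvent] = field(default_factory=list)
    head_moves: list[float] = field(default_factory=list)             # timestamps (s)
    expressions: list[ExpressionChange] = field(default_factory=list)
    amplitude: list[float] = field(default_factory=list)              # per-window loudness

    def to_json(self) -> str:
        # asdict() recursively converts the nested dataclasses to dicts.
        return json.dumps(asdict(self), indent=2)
```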

C) SUBAGENT INPUT:
- Voice-over MP3 file URL
- Transcription with word-level timing
- Avatar image reference (for motion constraints)

E) SUBAGENT TASK SUMMARY:
1. Convert MP3 to WAV for analysis
#178 - Convert 1-20 MP3s to wav

2. Generate detailed audio analysis
#198 - Get Transcription Of MP3 (With Timings)
#179 - Create Visual Waveform Of 60 second Wav/mp3 File
#180 - Extract Beatpoints & Tempo of MP3

3. Process transcription for phoneme mapping
#185 - Write Text (Or Copy) From Inputted Text [to convert transcription into phoneme sequences]

4. Analyze the waveform for emphasis points
#176 - Analyze An Image With GPT Vision & Return Text [to analyze waveform]

5. Compile final data structure
#223 - Powerful LLM Prompt-to-Text Response [to format all data into required JSON structure]
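
Step 3 is where phonemes become mouth shapes. The following is a minimal sketch of that mapping. It assumes the phoneme sequence arrives as timed ARPAbet symbols. The grouping is a simplified table loosely modeled on common 15-viseme sets, not the one any particular animation engine uses, and unknown symbols fall back to a rest pose.

```python
# Simplified phoneme -> viseme groups (an assumption, loosely modeled on common viseme sets).
VISEME_GROUPS = {
    "PP": ["P", "B", "M"],
    "FF": ["F", "V"],
    "TH": ["TH", "DH"],
    "DD": ["T", "D"],
    "kk": ["K", "G"],
    "CH": ["CH", "JH", "SH", "ZH"],
    "SS": ["S", "Z"],
    "nn": ["N", "NG", "L"],
    "RR": ["R", "ER"],
    "aa": ["AA", "AE", "AH", "AW", "AY"],
    "E": ["EH", "EY"],
    "ih": ["IH", "IY", "Y"],
    "oh": ["AO", "OW", "OY"],
    "ou": ["UH", "UW", "W"],
}
PHONEME_TO_VISEME = {p: v for v, group in VISEME_GROUPS.items() for p in group}


def to_visemes(phonemes: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Map (phoneme, start, end) tuples to viseme keyframes, merging adjacent repeats."""
    frames: list[tuple[str, float, float]] = []
    for phoneme, start, end in phonemes:
        base = phoneme.rstrip("012")                 # drop stress digits: "AA1" -> "AA"
        viseme = PHONEME_TO_VISEME.get(base, "sil")  # unknown symbols -> rest pose
        prev = frames[-1] if frames else None
        if prev and prev[0] == viseme and abs(prev[2] - start) < 1e-6:
            frames[-1] = (viseme, prev[1], end)      # extend the touching keyframe
        else:
            frames.append((viseme, start, end))
    return frames
```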

F) SILOS:
SILO 1 - AUDIO PREPROCESSING
- Input: MP3 URL
- Skill: #178 (Convert to WAV)
- Output: WAV file URL

SILO 2 - AUDIO ANALYSIS
- Input: WAV URL
- Skills: #198, #179, #180
- Outputs: 
  * Detailed transcription with timing
  * Waveform visualization
  * Beat/tempo data

SILO 3 - PHONEME MAPPING
- Input: Transcription
- Skill: #185
- Output: Phoneme sequence data

SILO 4 - EMPHASIS DETECTION
- Input: Waveform image
- Skill: #176
- Output: Amplitude analysis

SILO 5 - DATA COMPILATION
- Inputs: All previous outputs
- Skill: #223
- Output: Final JSON structure

Each silo operates sequentially, with the final silo combining all data into the structured format required for avatar animation.
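
SILO 4 infers emphasis indirectly, by asking GPT Vision to read a waveform image. According to its name, #179 is also limited to 60 seconds of audio. If the WAV from SILO 1 can be read directly, per-window RMS loudness is a swapped-in alternative; it is not part of the plan. The sketch below uses Python's standard `wave` module. It assumes 16-bit PCM, a 0.1-second window, and a threshold of 1.5 times the median RMS.

```python
import array
import math
import statistics
import sys
import wave


def emphasis_points(source, window_s: float = 0.1, factor: float = 1.5) -> list[float]:
    """Return start times (s) of windows whose RMS exceeds factor x the median RMS.

    `source` is a path or binary file object accepted by wave.open; 16-bit PCM only.
    Multi-channel audio is treated as interleaved samples.
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM WAV")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV samples are stored little-endian
    frames_per_window = max(1, round(rate * window_s))
    step = frames_per_window * channels
    rms = []
    for i in range(0, len(samples), step):
        chunk = samples[i:i + step]
        rms.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    if not rms:
        return []
    threshold = factor * statistics.median(rms)
    return [round(k * frames_per_window / rate, 3) for k, v in enumerate(rms) if v > threshold]
```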

Let me break down the VideoAssemblerPro subagent following your guidelines:

A) SUBAGENT SUMMARY: 
A specialized video assembly system that combines an AI avatar image, voice-over audio, and lip-sync data to create a synchronized talking head video with natural-looking mouth movements.

B) FINAL TASK OUTPUT:
MP4 video file (1920x1080, 16:9 aspect ratio) featuring the AI avatar with synchronized lip movements matching the voice-over audio, duration matching the input audio file length.

C) SUBAGENT INPUT:
- PNG URL of the AI-generated avatar image
- MP3 URL of the voice-over audio
- Text transcription with timing data (for lip-sync matching)

E) SUBAGENT TASK SUMMARY:
Input > #198 Get Transcription Of MP3 (With Timings) > #178 Convert MP3 to WAV > #168 Generate Talking Head Video From MP3 & transcription > #191 Resize Image (if needed to match 16:9) > Output

The flow works like this:
1. First, get word-level timing data from the MP3 using the #198 transcription
2. Convert audio to WAV format for downstream processing
3. Generate the base talking head video using the core animation system
4. Resize if needed to ensure proper video dimensions
5. Return final MP4
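
Whether resizing is needed depends on how the square avatar fits a 16:9 frame. The following is a minimal sketch of that arithmetic. It assumes the avatar is scaled uniformly (never stretched) and centered with padding. The plan does not say whether #191 or #168 accepts padding instructions.

```python
def fit_into_frame(src_w: int, src_h: int, frame_w: int = 1920,
                   frame_h: int = 1080) -> tuple[int, int, int, int]:
    """Return (new_w, new_h, pad_x, pad_y) to center the image without distortion."""
    scale = min(frame_w / src_w, frame_h / src_h)        # largest uniform scale that fits
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (frame_w - new_w) // 2, (frame_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y
```

For the 1024x1024 AvatarVision output, this yields a 1080x1080 image centered with a 420-pixel bar on each side.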

F) SILOS:
SILO 1 - AUDIO PREPARATION
- Input: MP3 URL
- Skill #198: Generate precise transcription with timings
- Skill #178: Convert to WAV format
- Output: WAV file + timing data

SILO 2 - VIDEO GENERATION
- Input: Avatar PNG, WAV file, timing data
- Skill #168: Generate base talking head video
- Skill #191: Resize to final dimensions if needed
- Output: Final MP4 video

NOTE: While this workflow uses existing skills, the current skill set may be too limited to produce truly naturalistic lip-sync. The #168 skill (Generate Talking Head Video) does much of the heavy lifting here, and its lip-sync quality should be carefully evaluated. If better lip-sync quality is required, additional specialized skills may need to be developed. Similarly, #191 is an image-resizing skill and cannot resize the MP4 output, so any 16:9 adjustment has to be applied to the avatar PNG before #168 runs.