How To Article from Video Guide

I want to record a video on a "how to" topic and turn it into an article with text and images. For example, I will record a video about "how to make software mockups with GIMP" and then want it turned into an article. I want this to be a very media-rich article with lots and lots of screenshots, so the idea is that in the video I will explain what I am doing (e.g. "the first thing we do is open GIMP, click on new image, and choose an image size of 1000x1000"), and then we extract the transcription from the video.

Next we review the transcription and its timestamps and decide on the core steps of the how-to task (e.g. step 1: download GIMP, step 2: create a new image). We then look at the timestamps where those steps are mentioned and extract a frame from the video at each one (so if I say "let me open the image now" at 23.39, we extract an image frame from the video at that point). My transcription file is what sets these specific timings.

My software works best when it has a fixed number of actions to take, so I suggest we (fairly arbitrarily) go with six steps. This means six images embedded and six steps written in the article. It also means an LLM will need to review the transcription and decide which are the six steps if the video narrator does not explicitly list six, or lists more or fewer.

First, pre-entered info...

final-output2

subagents-extracted2

OVERVIEW OF SUBAGENTS:
The system uses four specialized subagents, working sequentially, to transform a video into a written tutorial:
1. Transcription Processor - Creates timestamped transcript
2. Step Identification & Analysis - Structures content into 6 steps
3. Image Extraction & Processing - Captures and optimizes visuals
4. Article Assembly & Formatting - Produces final markdown document

SUBAGENT 1: TRANSCRIPTION PROCESSOR
Name: Video-to-Text Transcription Agent
Final Output: Timestamped JSON transcript
Method: Extracts audio track, processes speech to text with timestamps
Sequence:
1. Extract MP3 from video file
2. Process audio through speech-to-text API
3. Generate timestamped transcript
4. Output JSON formatted transcript with timing markers
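
As a rough illustration of step 1 in the sequence above, the sketch below builds an ffmpeg command that saves a video's audio track as an MP3. It assumes ffmpeg is the extraction tool, which the outline does not specify; the function name and the file name `tutorial.mp4` are hypothetical. It only builds the argument list, so it runs without ffmpeg installed.

```python
from pathlib import Path
from typing import List, Optional


def build_audio_extract_command(video_path: str, audio_path: Optional[str] = None) -> List[str]:
    """Return an ffmpeg argument list that saves the audio track as MP3.

    Assumption: ffmpeg is used for extraction; the outline names no tool.
    """
    if audio_path is None:
        # Default to the same file name with an .mp3 extension.
        audio_path = str(Path(video_path).with_suffix(".mp3"))
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-codec:a", "libmp3lame",  # encode the audio as MP3
        "-q:a", "2",               # variable-bitrate quality setting
        audio_path,
    ]


cmd = build_audio_extract_command("tutorial.mp4")  # hypothetical file name
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once ffmpeg is available on the machine.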

SUBAGENT 2: STEP IDENTIFICATION & ANALYSIS
Name: Content Structure Analysis Agent
Final Output: JSON file with 6 key steps and associated timestamps
Method: Analyzes transcript to identify main tutorial steps
Sequence:
1. Process transcript content
2. Identify major tutorial segments
3. Extract 6 key steps with timestamps
4. Structure data in JSON format
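
Because the later subagents rely on exactly six steps, it may help to validate the step file before passing it on. The sketch below is one possible check. The field names `steps`, `title`, and `timestamp_seconds` are assumed for illustration and are not defined anywhere in this outline.

```python
from typing import Dict, List

REQUIRED_STEPS = 6  # the brief fixes the article at six steps


def validate_steps(payload: Dict, video_duration: float) -> List[Dict]:
    """Check a hypothetical step file: six titled steps, timestamps in order."""
    steps = payload.get("steps", [])
    if len(steps) != REQUIRED_STEPS:
        raise ValueError(f"expected {REQUIRED_STEPS} steps, got {len(steps)}")
    previous = -1.0
    for number, step in enumerate(steps, start=1):
        if not step.get("title"):
            raise ValueError(f"step {number} has no title")
        seconds = float(step["timestamp_seconds"])
        if not 0 <= seconds <= video_duration:
            raise ValueError(f"step {number} timestamp is outside the video")
        if seconds <= previous:
            raise ValueError(f"step {number} timestamp is not after the previous step")
        previous = seconds
    return steps


# Placeholder data: six evenly spaced steps in a 120-second video.
sample = {
    "steps": [
        {"title": f"Step {n}", "timestamp_seconds": n * 15.0} for n in range(1, 7)
    ]
}
validated = validate_steps(sample, video_duration=120.0)
print(len(validated))
```

If the LLM returns five or seven steps, the check raises an error, so the request can be retried before any frames are extracted.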

SUBAGENT 3: IMAGE EXTRACTION & PROCESSING
Name: Visual Content Processor
Final Output: Six optimized PNG screenshots
Method: Extracts frames, processes for quality, standardizes format
Sequence:
1. Extract frames at identified timestamps
2. Analyze frame quality with GPT Vision
3. Resize and optimize images
4. Output standardized PNG files
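
Step 1 of the sequence above could be sketched as follows, again assuming ffmpeg as the tool (the outline does not name one). The function name, the `frames` output directory, and the `step_N.png` naming scheme are hypothetical. The quality analysis and resizing steps are not covered here.

```python
from typing import List, Sequence


def build_frame_commands(
    video_path: str, timestamps: Sequence[float], out_dir: str = "frames"
) -> List[List[str]]:
    """Return one ffmpeg command per timestamp, each saving a single PNG frame.

    Assumptions: timestamps are in seconds and out_dir already exists.
    """
    commands = []
    for number, seconds in enumerate(timestamps, start=1):
        output = f"{out_dir}/step_{number}.png"
        commands.append([
            "ffmpeg",
            "-y",                      # overwrite existing output
            "-ss", f"{seconds:.2f}",   # seek to the timestamp before decoding
            "-i", video_path,
            "-frames:v", "1",          # write exactly one video frame
            output,
        ])
    return commands


commands = build_frame_commands("tutorial.mp4", [23.39, 61.0])  # hypothetical values
for command in commands:
    print(" ".join(command))
```

Narration often runs slightly ahead of the action on screen, so a small positive offset on each timestamp may give a more representative screenshot. That offset is a tuning choice and is not part of the outline.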

SUBAGENT 4: ARTICLE ASSEMBLY & FORMATTING
Name: Content Integration Agent
Final Output: Markdown formatted tutorial article
Method: Combines text and images into structured document
Sequence:
1. Format step descriptions
2. Insert optimized images
3. Apply markdown formatting
4. Generate final tutorial document
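
A minimal sketch of the assembly sequence above: it pairs each step with its image and emits Markdown. The step fields `title` and `description`, the heading layout, and the image paths are assumptions for illustration, not a format the outline prescribes.

```python
from typing import Dict, List


def assemble_article(title: str, steps: List[Dict[str, str]], image_paths: List[str]) -> str:
    """Combine step text and image paths into one Markdown document."""
    if len(steps) != len(image_paths):
        raise ValueError("each step needs exactly one image")
    lines = [f"# {title}", ""]
    for number, (step, image) in enumerate(zip(steps, image_paths), start=1):
        heading = f"Step {number}: {step['title']}"
        lines.append(f"## {heading}")
        lines.append("")
        lines.append(step["description"])
        lines.append("")
        lines.append(f"![{heading}]({image})")  # the heading doubles as alt text
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


article = assemble_article(
    "How to Make Software Mockups with GIMP",
    [{"title": "Create a new image", "description": "Open GIMP and create a 1000x1000 image."}],
    ["frames/step_1.png"],  # hypothetical path
)
print(article)
```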

subagentnumber

Then, agent response...

subagent1refined-new-json-core

G) TASKFLOW SUMMARY

Based on the subagent description provided, here are the three core skills/tasks identified in the workflow:

1. ID: #196
Title task: Extract MP3 Audio From MP4 File
Description: Processes an MP4 video file and extracts the audio track, saving it as a standalone MP3 file while maintaining audio quality and synchronization
Input(s) involved: [mp4_url]
Output token name: mp3_audio_file

2. ID: #207
Title task: Get Transcription From MP4 Video URL
Description: Processes the extracted audio file through speech recognition to generate a raw timestamped transcription of all spoken content, maintaining timing accuracy
Input(s) involved: [mp3_audio_file]
Output token name: raw_timestamped_transcript

3. ID: #223
Title task: Powerful LLM Prompt-to-Text Response
Description: Takes raw timestamped transcript and structures it into a clean JSON format with consistent timestamp:text pairs for each spoken segment
Input(s) involved: [raw_timestamped_transcript, json_formatting_prompt]
Output token name: formatted_json_transcript

Each of these tasks corresponds directly to the three silos identified in the subagent structure, creating a clear progression from video input to JSON output, with each task handling a specific transformation of the data.

The complete workflow follows this sequence:
[mp4_url] → #196 → [mp3_audio_file] → #207 → [raw_timestamped_transcript] → #223 → [formatted_json_transcript]
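
The chain above can be checked mechanically: every input a task needs should be either supplied up front or produced by an earlier task. The sketch below does that with a simplified task record (`id`, `inputs`, `output`), which is a hypothetical shape rather than the platform's own. It also assumes `json_formatting_prompt` is supplied externally, which the summary does not state.

```python
from typing import Dict, Iterable, List


def check_chain(tasks: List[Dict], external_inputs: Iterable[str]) -> List[int]:
    """Return task IDs in order, raising if a task needs a token nobody produced."""
    available = set(external_inputs)
    order = []
    for task in tasks:
        missing = [name for name in task["inputs"] if name not in available]
        if missing:
            raise ValueError(f"task #{task['id']} is missing inputs: {missing}")
        available.add(task["output"])  # later tasks may consume this token
        order.append(task["id"])
    return order


TASKS = [
    {"id": 196, "inputs": ["mp4_url"], "output": "mp3_audio_file"},
    {"id": 207, "inputs": ["mp3_audio_file"], "output": "raw_timestamped_transcript"},
    {
        "id": 223,
        "inputs": ["raw_timestamped_transcript", "json_formatting_prompt"],
        "output": "formatted_json_transcript",
    },
]

order = check_chain(TASKS, external_inputs={"mp4_url", "json_formatting_prompt"})
print(order)  # [196, 207, 223]
```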

subagent1mermaid

Mermaid Image - https://static.aiz.ac/1734452867-mermaid/mermaid1.png

prompts-required

subagent1-json-array

[
  {
    "taskID": 196,
    "semanticTitleOfTask": "MP4 to MP3 Audio Extraction",
    "taskDescription": "Processes an MP4 video file and extracts the audio track, saving it as a standalone MP3 file while maintaining audio quality and synchronization",
    "inputDescription": "A single input consisting of the URL or file path to the MP4 video that needs audio extraction",
    "inputRequired": ["mp4_url"],
    "outputDescription": "A complete MP3 audio file extracted from the input video, maintaining original audio quality and timing",
    "outputName": "mp3_audio_file",
    "promptInstruction": ""
  },
  {
    "taskID": 207,
    "semanticTitleOfTask": "Speech-to-Text Transcription Generation",
    "taskDescription": "Processes the MP3 audio through speech recognition to generate a raw timestamped transcription of all spoken content",
    "inputDescription": "The MP3 audio file extracted from the previous task, containing the speech that needs to be transcribed",
    "inputRequired": ["mp3_audio_file"],
    "outputDescription": "A raw text transcript with timestamps marking each segment of speech throughout the audio",
    "outputName": "raw_timestamped_transcript",
    "promptInstruction": ""
  },
  {
    "taskID": 223,
    "semanticTitleOfTask": "JSON Transcript Formatter",
    "taskDescription": "Transforms raw timestamped transcript into a structured JSON format with precise timestamp:text pairs for each spoken segment",
    "inputDescription": "The raw timestamped transcript from the previous task that needs to be formatted into JSON structure",
    "inputRequired": ["raw_timestamped_transcript"],
    "outputDescription": "A formatted JSON file containing transcript segments with timestamps in MM:SS format and corresponding spoken text",
    "outputName": "formatted_json_transcript",
    "promptInstruction": "Please format the following timestamped transcription into a clean JSON structure. Each entry should contain exactly two fields: a timestamp and the corresponding spoken text. Follow these specific requirements:\n1. Format timestamps as \"MM:SS\" (minutes:seconds)\n2. Break the transcript into segments of roughly 3-8 seconds each\n3. Ensure each text segment is complete and coherent\n4. Remove any filler words, ums, ahs, or false starts\n5. Maintain chronological order\n6. Use this exact JSON structure: {\"segments\": [{\"timestamp\": \"00:00\",\"text\": \"spoken content here\"}]}\n7. Ensure all text is properly escaped for JSON formatting\n8. Maintain accuracy of technical terms and instructions from the original transcript"
  }
]
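
Task 223 delegates the formatting to an LLM, but the target shape described in its prompt is simple enough to illustrate deterministically. The sketch below converts start times in seconds to "MM:SS" and emits the `{"segments": [...]}` structure. The input shape (a list of `(start_seconds, text)` pairs) and both function names are assumptions.

```python
import json
from typing import Dict, List, Tuple


def to_mm_ss(seconds: float) -> str:
    """Format a time in seconds as MM:SS, discarding fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(raw_segments: List[Tuple[float, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Build the segments structure from (start_seconds, text) pairs."""
    ordered = sorted(raw_segments, key=lambda item: item[0])  # chronological order
    return {
        "segments": [
            {"timestamp": to_mm_ss(start), "text": text.strip()}
            for start, text in ordered
        ]
    }


raw = [(83.0, "Choose an image size of 1000x1000."), (23.39, "Let me open the image now.")]
formatted = format_segments(raw)
print(json.dumps(formatted, indent=2))  # json.dumps handles string escaping
```

Note that "MM:SS" drops sub-second precision: a phrase spoken at 23.39 seconds becomes "00:23". If frame extraction needs finer timing, keeping the original seconds alongside the formatted timestamp may be worth considering.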

subagent1-json-array-mermaid

<!DOCTYPE html>
<html>
<head>
  <title>Mermaid Diagram Example</title>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mermaid/9.4.3/mermaid.min.js"></script>
</head>
<body>
  <div class="mermaid">
graph TD
    subgraph Subagent: Transcription Processor
        input[Input MP4 URL] --> task196[Task #196: MP4 to MP3 Audio Extraction]
        task196 --> mp3[MP3 Audio File]
        mp3 --> task207[Task #207: Speech-to-Text Transcription]
        task207 --> raw[Raw Timestamped Transcript]
        raw --> task223[Task #223: JSON Transcript Formatter]
        task223 --> json[Formatted JSON Transcript]
    end
    style input fill:#e6f3ff,stroke:#3399ff
    style mp3 fill:#f2ffe6,stroke:#66cc00
    style raw fill:#f2ffe6,stroke:#66cc00
    style json fill:#ffe6e6,stroke:#ff3333
    style task196 fill:#f0f0f0,stroke:#666666
    style task207 fill:#f0f0f0,stroke:#666666
    style task223 fill:#f0f0f0,stroke:#666666
  </div>