I want to record a video on a "how to" topic and turn it into an article with text and images. For example, I will record a video about "how to make software mockups with GIMP" and then want it turned into an article. I want this to be a very media-rich article with lots of screenshots, so in the video I will explain what I am doing (e.g. "the first thing we do is open GIMP, click on new image and choose an image size of 1000x1000"), and we will then extract the transcription, with timestamps, from the video. Next we review the transcription and its timestamps and decide on the core steps of the how-to task (e.g. step 1: download GIMP, step 2: create a new image). We then look at the timestamps where those steps are mentioned and extract an image at each one (so if I say "let me open the image now" at 23.39, we would extract a frame from the video at that point). The transcription file is what sets these specific timings. My software works best when it has a fixed number of actions to take, so I suggest we (fairly arbitrarily) go with six steps. This means six images embedded and six steps written in the article. It also means an LLM will need to review the transcription and decide which six steps to use when the video narrator does not explicitly list six steps, or covers more or fewer.
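
To make the frame-extraction part concrete, here is a rough Python sketch of how the six chosen steps could be turned into six screenshots. It is only an illustration under some assumptions I have not settled yet: that ffmpeg is the extraction tool, that timestamps are stored as seconds, and that the steps have already been picked (by the LLM or by hand). The names `Step`, `frame_command`, `plan_extraction` and `extract_frames` are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import List

STEP_COUNT = 6  # the fixed number of steps suggested above


@dataclass
class Step:
    title: str        # e.g. "Create a new image"
    timestamp: float  # position in the video, assumed to be in seconds


def frame_command(video: str, step: Step, index: int, out_dir: str = "frames") -> List[str]:
    """Build the ffmpeg call that grabs a single frame at the step's timestamp."""
    out_path = (Path(out_dir) / f"step-{index}.png").as_posix()
    return [
        "ffmpeg", "-y",                  # overwrite an existing image
        "-ss", f"{step.timestamp:.2f}",  # seek to the timestamp before decoding
        "-i", video,
        "-frames:v", "1",                # write exactly one video frame
        out_path,
    ]


def plan_extraction(video: str, steps: List[Step], out_dir: str = "frames") -> List[List[str]]:
    """Return one ffmpeg command per step, insisting on exactly six steps."""
    if len(steps) != STEP_COUNT:
        raise ValueError(f"expected {STEP_COUNT} steps, got {len(steps)}")
    return [frame_command(video, s, i, out_dir) for i, s in enumerate(steps, start=1)]


def extract_frames(video: str, steps: List[Step], out_dir: str = "frames") -> None:
    """Run the planned commands; requires ffmpeg to be installed and on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in plan_extraction(video, steps, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `extract_frames("gimp-mockups.mp4", steps)` with a list of six `Step` objects would write `frames/step-1.png` through `frames/step-6.png`, which the article can then embed next to the matching step text. Keeping `plan_extraction` separate from the code that runs ffmpeg makes it possible to check the six-step rule and the timings without touching the video.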