Video Visualizer from Text

I have a very simple agent sequence I want to run: generating 3 videos that best represent an input piece of text content. Essentially I want to input a block of text, like a line from a video script or content from an article (variable1). Then an LLM reviews that content and returns 3 different prompts that might work well to represent the given copy. Each of these 3 prompts is turned into an image (3 different images in total), each image is turned into a video animation, and finally all 3 video animations are returned. I guess we could have subagent 1 generating the 3 prompts (very simple), subagent 2 generating the 3 images, subagent 3 generating the 3 videos from those images, and then maybe subagent 4 is just an LLM that ingests the output from subagent 3 and returns the 3 video URLs as JSON.
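The chain above can be sketched as plain orchestration code. This is a minimal sketch only: the four functions below are hypothetical stubs standing in for whichever LLM, image, and video APIs end up being used, and the names and URL shapes are placeholders, not real services.

```python
import json

def generate_prompts(text: str) -> list[str]:
    """Subagent 1 (stub): ask an LLM for 3 visual prompts for the input copy."""
    return [f"Visual prompt {i + 1} for: {text}" for i in range(3)]

def generate_image(prompt: str) -> str:
    """Subagent 2 (stub): turn one prompt into an image, return its URL."""
    return f"https://example.com/images/{abs(hash(prompt)) % 100000}.png"

def generate_video(image_url: str) -> str:
    """Subagent 3 (stub): animate one image into a video, return its URL."""
    return image_url.replace("/images/", "/videos/").replace(".png", ".mp4")

def run_pipeline(variable1: str) -> str:
    """Subagent 4: collect the chain's output and return the 3 video URLs as JSON."""
    prompts = generate_prompts(variable1)          # 3 prompts
    images = [generate_image(p) for p in prompts]  # 3 images
    videos = [generate_video(u) for u in images]   # 3 videos
    return json.dumps({"videos": videos})

print(run_pipeline("A lone lighthouse weathering a storm"))
```

Each stub maps one-to-one onto a subagent, so swapping a stub for a real API call doesn't change the orchestration shape.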


subagent1

subagentX-refined

subagentXmermaid
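The flow described above could be expressed in Mermaid roughly as follows; this is a hand-drawn sketch of the pipeline, not the source of the rendered diagram linked below.

```mermaid
flowchart TD
    A[Input text: variable1] --> B[Subagent 1: LLM generates 3 prompts]
    B --> C[Subagent 2: 3 prompts to 3 images]
    C --> D[Subagent 3: 3 images to 3 video animations]
    D --> E[Subagent 4: LLM returns 3 video URLs as JSON]
```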

https://static.aiz.ac/1738242798-mermaid/mermaid-1.png