graph TD
    A[Start] --> B[Check Video Input]
    B --> C[Extract Audio Track]
    C --> D[Convert Audio to MP3]
    D --> E[Initialize Speech-to-Text]
    E --> F[Process Audio Chunks]
    F --> G[Generate Raw Transcript]
    G --> H[Add Timestamps]
    H --> I[Format to JSON]
    I --> J[Validate JSON Structure]
    J --> K[Save Timestamped Transcript]
    K --> L[End]

    subgraph Input Validation
        B
    end

    subgraph Audio Processing
        C
        D
    end

    subgraph Speech Recognition
        E
        F
        G
    end

    subgraph Timestamp Addition
        H
    end

    subgraph JSON Formatting
        I
        J
        K
    end