I want to input a YouTube video and get back a collection of music samples (.wav files in a zip). I think I should specify 3 start times (e.g. 5 seconds, 130 seconds, 190 seconds). The agent would then rip the audio from YouTube, extract stems (instrumentals, vocals), and extract (say) 16 beats' worth of audio beginning at each of the 3 timings I've entered. This gives us 3 separate audio sections to sample, and separating each of them into vocals and instrumentals gives us 6 separate audio files to sample from (although we will use the same initial tempo and beat points for all 6).
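As a rough sketch of the download and stem-extraction step, the commands could be built like this. The tool choices are my assumptions, not decisions: yt-dlp for ripping the audio (it needs ffmpeg installed to convert to WAV) and Demucs for splitting vocals from instrumentals. The function names and the `work` folder layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(url, workdir="work"):
    """Return the two command lines: download audio, then split stems.

    Assumes yt-dlp and demucs are installed and on PATH (both are
    assumptions of this sketch, not tools chosen in the brief).
    """
    work = Path(workdir)
    # -x extracts audio only; --audio-format wav converts it via ffmpeg.
    download = [
        "yt-dlp", "-x", "--audio-format", "wav",
        "-o", str(work / "full.%(ext)s"),
        url,
    ]
    # --two-stems=vocals yields a vocals file and a no-vocals
    # (instrumental) file under the output folder.
    separate = [
        "demucs", "--two-stems=vocals",
        "-o", str(work / "stems"),
        str(work / "full.wav"),
    ]
    return [download, separate]


def run_all(commands):
    """Run each command in order, stopping on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping the command building separate from `run_all` means the command lines can be inspected or logged before anything is downloaded.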
Next we will want to use the beat points and tempo to extract samples, potentially using SoX. We would chop each section based on beat length (at 90 BPM, 1 beat is 60 / 90 ≈ 0.67 seconds) and beat position, producing a mix of 1-, 2-, 4- and 8-beat samples that each begin on a beat point.
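The chopping arithmetic above could be sketched as follows. This assumes each entered start time already sits on a beat (in practice it may need snapping to the nearest detected beat) and that slices of a given length are cut back to back without overlap. The names `Slice`, `plan_slices` and `sox_trim` are hypothetical; the only SoX feature used is the `trim START DURATION` effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    section: int     # 1-based index of the section (1..3)
    beats: int       # slice length in beats (1, 2, 4 or 8)
    start: float     # seconds from the start of the stem file
    duration: float  # slice length in seconds


def beat_seconds(bpm):
    """Length of one beat in seconds, e.g. 90 BPM -> 60 / 90."""
    return 60.0 / bpm


def plan_slices(section_starts, bpm, section_beats=16, lengths=(1, 2, 4, 8)):
    """List every slice for one stem: all lengths, cut back to back."""
    beat = beat_seconds(bpm)
    plan = []
    for idx, sec_start in enumerate(section_starts, 1):
        for n in lengths:
            # A 16-beat section holds 16 one-beat slices, 8 two-beat, etc.
            for k in range(section_beats // n):
                plan.append(Slice(idx, n, sec_start + k * n * beat, n * beat))
    return plan


def sox_trim(src, dst, s):
    """Build the SoX command that cuts one slice out of src into dst."""
    return ["sox", src, dst, "trim", f"{s.start:.3f}", f"{s.duration:.3f}"]


plan = plan_slices([5, 130, 190], bpm=90)
print(len(plan), "slices per stem")
```

With these defaults each section yields 16 + 8 + 4 + 2 = 30 slices, so 90 per stem and 180 across vocals and instrumentals, which is the number any later filtering step would have to cut down.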
This will likely give us far too many samples, so we may want to filter them down at the end (to "the best 48 samples"). We could do this with a library like Librosa, filtering on some measure (e.g. average amplitude). Perhaps we could run 1-3 analyses on each sample, score each result, and rank the samples by their total score, or to keep things simple just rank them on a single score (average amplitude?). We will likely need to calculate how many samples we are going to get, decide how (or if) to rank them, and then generate them accordingly. I am not sure whether we should filter or just do a bulk import.
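To make the simple single-score option concrete, here is a minimal sketch that ranks samples by RMS amplitude and keeps the top 48. It deliberately uses only the Python standard library instead of Librosa so it runs without extra installs, and it assumes the samples are 16-bit PCM WAV files; `librosa.feature.rms` would be one alternative for the scoring. The function names are hypothetical.

```python
import math
import sys
import wave
from array import array


def rms_of_wav(path):
    """Return the RMS amplitude (0.0 to 1.0) of a 16-bit PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        return 0.0
    # All channels are pooled together; good enough for a loudness rank.
    mean_square = sum(s * s for s in samples) / len(samples)
    return math.sqrt(mean_square) / 32768.0


def rank_top(paths, keep=48):
    """Score every file and return the loudest `keep` as (score, path)."""
    scored = [(rms_of_wav(p), str(p)) for p in paths]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:keep]
```

One caveat with a pure loudness score: it would tend to favour dense instrumental slices over sparse vocal ones, so ranking vocals and instrumentals separately (say 24 each) might give a more balanced set.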
Finally we will want to save all the files to a server and zip them up.
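The zipping part could be as small as the sketch below, using Python's standard `zipfile` module. The folder layout and the name `zip_samples` are hypothetical, and getting the zip onto a server is left out because the brief does not say which server or transfer method is intended.

```python
import zipfile
from pathlib import Path


def zip_samples(sample_dir, zip_path):
    """Zip every .wav under sample_dir and return how many were added."""
    sample_dir = Path(sample_dir)
    wavs = sorted(sample_dir.rglob("*.wav"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for wav in wavs:
            # Store paths relative to sample_dir so the archive
            # does not leak the server's folder structure.
            zf.write(wav, arcname=wav.relative_to(sample_dir).as_posix())
    return len(wavs)
```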