A strip is a container that holds the information needed to display a piece of material on the screen, such as its playback position, length, and screen position.
Each strip is converted into the final video in a way that suits its material type.
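As a rough illustration of the strip idea, the sketch below models a strip as a plain object and looks up which strips are active at a given time. The type name, field names, and the activeStrips helper are all hypothetical and are not taken from the project.

```typescript
// Hypothetical strip shape; field names are illustrative only.
type StripKind = "video" | "image" | "audio" | "text";

interface Strip {
  kind: StripKind;
  start: number;  // position on the timeline, in seconds
  length: number; // how long the strip plays, in seconds
  x: number;      // screen position (unused for audio)
  y: number;
}

// Returns the strips that should be drawn or heard at a given timeline time.
function activeStrips(strips: Strip[], time: number): Strip[] {
  return strips.filter((s) => time >= s.start && time < s.start + s.length);
}

const demoStrips: Strip[] = [
  { kind: "video", start: 0, length: 5, x: 0, y: 0 },
  { kind: "text", start: 2, length: 2, x: 100, y: 50 },
  { kind: "audio", start: 4, length: 10, x: 0, y: 0 },
];

// At t = 3 s only the video and text strips are active.
console.log(activeStrips(demoStrips, 3).map((s) => s.kind));
```

Treating the end of a strip as exclusive means two strips placed back to back never overlap on the frame where one hands over to the other.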
FFmpeg accepts video, image sequences, or audio as input.
To add text, you have to generate an image that contains the text.
Video
So all images and videos are first converted into a single image sequence, which is then passed to FFmpeg. The renderer uses three.js, which can handle an image as a Texture and a video as a VideoTexture.
The image capture step uses ccapture to ensure that every playback frame is converted to an image. All frame images are then encoded into a WebM video.
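The capture loop can be sketched roughly as follows. The FrameCapturer interface only mirrors the CCapture.js methods start, capture, stop, and save. The renderAt callback and the captureAllFrames function are hypothetical names, and the sketch assumes each frame can be drawn synchronously, which leaves out the waiting needed when a video is seeked.

```typescript
// Minimal interface mirroring the CCapture.js methods used in this sketch.
interface FrameCapturer {
  start(): void;
  capture(canvas: object): void;
  stop(): void;
  save(): void;
}

// Steps through the timeline at a fixed rate so that no frame is skipped or duplicated.
// `renderAt` is a hypothetical callback that draws the scene for a timeline time.
function captureAllFrames(
  capturer: FrameCapturer,
  canvas: object,
  renderAt: (time: number) => void,
  durationSec: number,
  fps: number
): number {
  const totalFrames = Math.round(durationSec * fps);
  capturer.start();
  for (let frame = 0; frame < totalFrames; frame++) {
    renderAt(frame / fps);    // draw the scene for this exact frame time
    capturer.capture(canvas); // hand the finished frame to the capturer
  }
  capturer.stop();
  capturer.save();
  return totalFrames;
}
```

In a browser the capturer would presumably be created with something like new CCapture({ format: "webm", framerate: fps }), and renderAt would update the three.js scene and call the renderer before each capture.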
Audio
Next, merge the image sequence and the audio. This is the FFmpeg part.
For a video strip, extract the required range of audio from the video.
For an audio strip, cut the audio to the required range.
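Both cases come down to cutting a time range out of a source file. A minimal sketch of building the FFmpeg arguments for that is shown below. The audioCutArgs function and the file names are hypothetical, while -ss, -t, -i, and -vn are standard FFmpeg options.

```typescript
// Builds FFmpeg arguments that cut `lengthSec` seconds of audio,
// starting at `offsetSec`, out of a video or audio source file.
function audioCutArgs(
  src: string,
  offsetSec: number,
  lengthSec: number,
  out: string
): string[] {
  return [
    "-ss", String(offsetSec), // seek to the start of the range
    "-t", String(lengthSec),  // read only this many seconds
    "-i", src,
    "-vn",                    // drop any video stream and keep audio only
    out,
  ];
}

console.log(audioCutArgs("clip.mp4", 1.5, 3, "clip.wav").join(" "));
// prints "-ss 1.5 -t 3 -i clip.mp4 -vn clip.wav"
```

The same arguments work for a video source and an audio source, since -vn has no effect when the input has no video stream.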
Finally, use filter_complex to combine all the audio, placing each track at its strip's start time.
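One possible way to build such a filter graph is to delay each track with adelay and then mix the results with amix. The sketch below is an assumption about how this could be done, not the project's actual filter. AudioPlacement, buildMixFilter, and the aout label are hypothetical names, and stereo input is assumed.

```typescript
interface AudioPlacement {
  inputIndex: number; // index of the FFmpeg input (-i) that holds this strip's audio
  startSec: number;   // where the strip begins on the timeline, in seconds
}

// Builds a filter_complex string that shifts each track to its strip time and mixes them.
function buildMixFilter(tracks: AudioPlacement[]): string {
  if (tracks.length === 0) {
    throw new Error("at least one audio track is required");
  }
  const delayed = tracks.map((t, i) => {
    const ms = Math.round(t.startSec * 1000);
    // adelay takes one delay per channel; two channels (stereo) are assumed here.
    return `[${t.inputIndex}:a]adelay=${ms}|${ms}[a${i}]`;
  });
  const labels = tracks.map((_, i) => `[a${i}]`).join("");
  const mix = `${labels}amix=inputs=${tracks.length}:duration=longest[aout]`;
  return delayed.concat(mix).join(";");
}

console.log(buildMixFilter([
  { inputIndex: 1, startSec: 0 },
  { inputIndex: 2, startSec: 2.5 },
]));
// prints "[1:a]adelay=0|0[a0];[2:a]adelay=2500|2500[a1];[a0][a1]amix=inputs=2:duration=longest[aout]"
```

The resulting string would be passed to FFmpeg after -filter_complex, with -map "[aout]" selecting the mixed track as the audio of the final video.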