Streaming Segment Parser
This was my intern project at Amazon Prime Video.
Background - ISO-BMFF Parser
When a customer starts playing a video, the frontend client continuously requests data from the backend server. The raw data is just a series of bytes produced by video compression and encoding. A decoder processes this raw data to produce images and audio. The ISO-BMFF parser serves as the bridge between the data and the decoder. It transforms the data in two stages: data -> box -> segment.
Box Parsing
The first stage of the ISO-BMFF parser transforms the original data into basic structures called "boxes". A box is an object with two mandatory attributes, type and size. Other attributes may be included depending on the specific box type. One special box, of type mdat, is the container for media data. A box can have child boxes, which gives a nested structure.
aligned(8) Box {
    unsigned int(32) size;
    unsigned int(32) type;
    boxes: []    // child boxes (if any)
    // other attributes, depending on the box type
}
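To make the layout above concrete, here is a minimal sketch of reading box headers from a byte buffer. It assumes the common case of a 32-bit big-endian size followed by a four-character type, and it deliberately ignores the special size values 0 and 1 (box runs to the end of the file, and 64-bit size). The names `parse_boxes`, `make_box`, and `CONTAINER_TYPES` are mine for illustration, not the parser's real API, and the container list is only a small subset.

```python
import struct

# Box types treated as containers here; an illustrative subset, not the full list.
CONTAINER_TYPES = {"moov", "trak", "mdia", "minf", "stbl", "moof", "traf"}


def parse_boxes(data, start=0, end=None):
    """Parse consecutive boxes in data[start:end] into nested dicts."""
    end = len(data) if end is None else end
    boxes = []
    offset = start
    while offset + 8 <= end:
        # Header: 32-bit big-endian size, then a four-character type code.
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        box_type = raw_type.decode("ascii")
        if size < 8 or offset + size > end:
            raise ValueError(f"bad size {size} for box {box_type!r}")
        box = {"type": box_type, "size": size, "offset": offset, "boxes": []}
        if box_type in CONTAINER_TYPES:
            # Child boxes occupy the bytes after this box's 8-byte header.
            box["boxes"] = parse_boxes(data, offset + 8, offset + size)
        boxes.append(box)
        offset += size
    return boxes


def make_box(box_type, payload=b""):
    """Helper for building test data: header plus payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


# A moof containing a traf containing a tfhd, followed by an mdat.
data = make_box("moof", make_box("traf", make_box("tfhd", b"\x00" * 8)))
data += make_box("mdat", b"\x01\x02\x03")
top_level = parse_boxes(data)
```

Because the size field covers the whole box including its header, the parser can skip any box it does not understand by jumping `size` bytes ahead.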
A typical ISO file structure looks like:
Segment Parsing
The boxes described above are further parsed into segments, either initialization segments or media segments. A video file starts with one initialization segment, followed by one or multiple media segments. Each media segment contains a series of media samples, and each sample stores a pointer to the section of media data it covers.
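The sketch below illustrates the two ideas in this paragraph: grouping top-level boxes into an initialization segment plus media segments, and representing each sample as a pointer (offset and size) into the media data. It makes two simplifying assumptions that are mine: every media segment starts at a moof box, and the sample sizes are already known (in a real file they would be read from metadata boxes rather than passed in by hand). All class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    offset: int  # byte offset of the sample's data within the buffer
    size: int    # length of the sample's data in bytes


@dataclass
class MediaSegment:
    samples: list = field(default_factory=list)


def group_top_level(box_types):
    """Split a flat list of top-level box types into init and media segments."""
    init, media = [], []
    for box_type in box_types:
        if box_type == "moof":
            media.append([box_type])      # assumption: each moof opens a media segment
        elif media:
            media[-1].append(box_type)    # e.g. the mdat that follows the moof
        else:
            init.append(box_type)         # everything before the first moof
    return init, media


def build_media_segment(data_offset, sample_sizes):
    """Lay samples out back to back, starting at data_offset."""
    segment = MediaSegment()
    offset = data_offset
    for size in sample_sizes:
        segment.samples.append(Sample(offset=offset, size=size))
        offset += size
    return segment


def sample_bytes(data, sample):
    """Resolve a sample's pointer into the actual media bytes."""
    return data[sample.offset:sample.offset + sample.size]


init, media = group_top_level(["ftyp", "moov", "moof", "mdat", "moof", "mdat"])
segment = build_media_segment(data_offset=8, sample_sizes=[3, 4, 5])
```

Storing offsets instead of copying bytes keeps the segment object small; the media data is only touched when a sample is handed to the decoder.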
Proposed design
What’s the problem?
In the current implementation of the ISO-BMFF parser, the parsing process is fully synchronous. This design falls short in two respects:
- Unnecessary latency is introduced because the parser has to wait for the entire segment to be downloaded before it starts parsing, and the decoder can do nothing while the parser is running.
- The video decoder can work with individual frames, so there is no need to construct a complete media segment before decoding.
How to improve?
To improve performance, the new segment parser is designed to be asynchronous. Intuitively, the parsing process "overlaps" with downloading and decoding.
- Whenever a chunk of data is downloaded, it is sent to the parser. Parsing can start even before the segment has finished downloading.
- Whenever a media sample is constructed within the parser, it is passed to the decoder. The decoder no longer has to wait until an entire segment is completed.
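The overlap described above can be sketched as a parser that is fed chunks of arbitrary size and hands back each sample as soon as all of its bytes have arrived. This is only an illustration under my own assumptions: the sample sizes are taken as already known from metadata parsed earlier, and `StreamingSampleParser` and `feed` are hypothetical names.

```python
class StreamingSampleParser:
    """Emits samples incrementally as chunks of media data arrive."""

    def __init__(self, sample_sizes):
        self._sizes = list(sample_sizes)  # assumed known before media data arrives
        self._next = 0                    # index of the next sample to emit
        self._buffer = bytearray()        # bytes received but not yet emitted

    def feed(self, chunk):
        """Add a downloaded chunk; return the samples completed by it."""
        self._buffer.extend(chunk)
        ready = []
        while (self._next < len(self._sizes)
               and len(self._buffer) >= self._sizes[self._next]):
            size = self._sizes[self._next]
            ready.append(bytes(self._buffer[:size]))
            del self._buffer[:size]       # drop the bytes just handed out
            self._next += 1
        return ready


parser = StreamingSampleParser([3, 2, 4])
emitted = []
for chunk in (b"ab", b"cde", b"fgh", b"i"):
    # Each call stands in for one network chunk arriving.
    emitted.append(parser.feed(chunk))
```

Chunk boundaries do not have to line up with sample boundaries: a chunk may complete zero, one, or several samples, and leftover bytes simply wait in the buffer for the next chunk.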
Implementation
The proposed framework is implemented using an event-based architecture.
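As a rough picture of what an event-based parser can look like, the sketch below has the downloader push chunks in while consumers subscribe to a "box" event that fires each time a complete box is available. The event name, the `EventEmitter` and `EventedBoxParser` classes, and the `on`/`emit`/`push` methods are all assumptions for illustration, not the real implementation.

```python
import struct
from collections import defaultdict


class EventEmitter:
    """Minimal publish/subscribe helper."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        self._handlers[event].append(handler)

    def emit(self, event, *args):
        for handler in self._handlers[event]:
            handler(*args)


class EventedBoxParser(EventEmitter):
    """Buffers incoming chunks and emits a "box" event per complete box."""

    def __init__(self):
        super().__init__()
        self._buffer = bytearray()

    def push(self, chunk):
        self._buffer.extend(chunk)
        while len(self._buffer) >= 8:
            size, raw_type = struct.unpack_from(">I4s", self._buffer, 0)
            if size < 8:
                raise ValueError(f"invalid box size {size}")
            if len(self._buffer) < size:
                return  # box not fully downloaded yet; wait for the next chunk
            payload = bytes(self._buffer[8:size])
            del self._buffer[:size]
            self.emit("box", raw_type.decode("ascii"), payload)


parser = EventedBoxParser()
seen = []
parser.on("box", lambda box_type, payload: seen.append((box_type, payload)))

stream = struct.pack(">I4s", 12, b"moof") + b"meta"
stream += struct.pack(">I4s", 10, b"mdat") + b"xy"
for i in range(0, len(stream), 5):
    parser.push(stream[i:i + 5])  # deliver the stream in 5-byte chunks
```

With this shape, the downloader and the decoder never call each other directly: one side pushes bytes, the other reacts to events, which is what lets the three stages overlap.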
Performance Analysis
To compare the performance of the old and new parser implementations, experiments are performed on three different types of segments: an audio segment, a video segment, and a trailer segment (non-encrypted video). The experiment configuration is as follows:
For each type of segment, 500 independent experiments are performed. Note that for the old parser, tFirst is the same as tLast as the entire segment is parsed as a whole.
Results of the audio segment:
Results of the video segment:
Results of the trailer segment: