Streaming Segment Parser

Published: October 14, 2019

Background - ISO-BMFF Parser
- Box Parsing
- Segment Parsing
Proposed design
Performance Analysis

This was my intern project at Amazon Prime Video.

Background - ISO-BMFF Parser

When a customer starts playing a video, the frontend client continuously requests data from the backend server. The raw data is simply a series of bytes produced by video compression and encoding. A decoder processes the raw data in order to display images and play audio. The ISO-BMFF parser serves as the bridge between the data and the decoder. It transforms the data in two stages: data -> box -> segment.

Box Parsing

The first stage of the ISO-BMFF parser transforms the original data into basic structures called “boxes”. A box is an object with two mandatory attributes, type and size. Other attributes may be present depending on the specific box type. One special box, whose type is mdat, is the container for media data. A box can have child boxes, which gives the file a nested structure.

aligned(8) Box {
    unsigned int (32) size ;
    unsigned int (32) type ;
    boxes: [] // child boxes (if existing)
    // other attributes
}
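As a rough illustration of the definition above, here is a minimal Python sketch that reads the size and type fields and recurses into child boxes. It is not the parser from this project: the `parse_boxes` and `make_box` helpers and the short list of container types are assumptions made for the example, and the special size values 0 (box runs to the end of the file) and 1 (64-bit size follows) are simply rejected.

```python
import struct

# Box types that only hold child boxes (a small subset, chosen for illustration).
CONTAINER_TYPES = {"moov", "trak", "mdia", "minf", "stbl", "moof", "traf"}


def parse_boxes(data, start=0, end=None):
    """Parse the consecutive boxes in data[start:end] into nested dicts."""
    end = len(data) if end is None else end
    boxes = []
    offset = start
    while offset + 8 <= end:
        # Header: 32-bit big-endian size, then a 4-character type code.
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        box_type = raw_type.decode("ascii")
        if size < 8 or offset + size > end:
            raise ValueError(f"unsupported or truncated box {box_type!r} at {offset}")
        box = {"type": box_type, "size": size, "offset": offset, "boxes": []}
        if box_type in CONTAINER_TYPES:
            # Child boxes start right after the 8-byte header.
            box["boxes"] = parse_boxes(data, offset + 8, offset + size)
        boxes.append(box)
        offset += size
    return boxes


def make_box(box_type, payload=b""):
    """Hypothetical helper: build a box as header followed by payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


sample_file = (
    make_box("ftyp", b"isom")
    + make_box("moov", make_box("trak"))
    + make_box("mdat", b"\x00" * 16)
)
parsed = parse_boxes(sample_file)
```

The size field counts the header as well as the payload, which is why the loop advances by `size` rather than by `size + 8`.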

A typical ISO file structure looks like:

iso-file

Segment Parsing

The boxes described above are further parsed into segments, which are either initialization segments or media segments. A video file starts with one initialization segment, followed by one or multiple media segments. Each media segment contains a series of media samples, and each sample stores a pointer to the section of media data it covers.
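The idea that a sample stores a pointer to its media data rather than the data itself can be sketched as follows. The `Sample` and `MediaSegment` classes and the `build_media_segment` helper are hypothetical names for this example, and the sample sizes are passed in directly, whereas a real parser would read them from the metadata boxes of the segment.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    offset: int  # where the sample's bytes start inside the media data
    size: int    # number of bytes that belong to the sample

    def read(self, media_data):
        # The sample only points at its bytes; they are sliced out on demand.
        return media_data[self.offset:self.offset + self.size]


@dataclass
class MediaSegment:
    media_data: bytes
    samples: list = field(default_factory=list)


def build_media_segment(media_data, sample_sizes):
    """Lay the samples out back to back over the media data (an assumption)."""
    if sum(sample_sizes) > len(media_data):
        raise ValueError("sample sizes exceed the available media data")
    segment = MediaSegment(media_data)
    offset = 0
    for size in sample_sizes:
        segment.samples.append(Sample(offset, size))
        offset += size
    return segment


segment = build_media_segment(b"AAABBCCCC", [3, 2, 4])
payloads = [sample.read(segment.media_data) for sample in segment.samples]
```

Keeping only an offset and a size per sample means the media data is held once, no matter how many samples refer to it.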

segment-structures

Proposed design

What’s the problem?

In the current implementation of the ISO-BMFF parser, the parsing process is fully synchronous. This design falls short in two respects:

- Unnecessary latency is introduced because the parser has to wait for the entire segment to be downloaded before it starts parsing, and the decoder can do nothing while the parser is running.
- The video decoder is capable of handling individual frames, so there is no need to construct a complete media segment.

How to improve?

To improve performance, the new segment parser is designed to be asynchronous. Intuitively, the parsing process “overlaps” with downloading and decoding.

- Whenever a chunk of data is downloaded, it is sent to the parser. Parsing can start even before the segment has finished downloading.
- Whenever a media sample is constructed within the parser, it is passed to the decoder. The decoder no longer has to wait until an entire segment is completed.
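The overlap described above can be sketched with an incremental parser that buffers incoming chunks and hands out whatever is already complete. This is a simplified illustration, not the project code: the `StreamingBoxParser` class is a hypothetical name, it works at the granularity of whole top-level boxes rather than individual samples, and the 10-byte chunk size is arbitrary.

```python
import struct


class StreamingBoxParser:
    """Accepts data chunk by chunk and returns boxes as soon as they are complete."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk):
        self._buffer.extend(chunk)
        completed = []
        while len(self._buffer) >= 8:
            size, raw_type = struct.unpack_from(">I4s", self._buffer, 0)
            if size < 8:
                raise ValueError("unsupported box size")
            if len(self._buffer) < size:
                break  # the rest of this box has not been downloaded yet
            payload = bytes(self._buffer[8:size])
            completed.append((raw_type.decode("ascii"), payload))
            del self._buffer[:size]  # drop the consumed box from the buffer
        return completed


def make_box(box_type, payload):
    """Hypothetical helper: build a box as header followed by payload."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("ascii")) + payload


stream = make_box("moof", b"meta") + make_box("mdat", b"frame-1frame-2")
parser = StreamingBoxParser()
arrivals = []  # (chunk number, box type) in the order boxes became available
for index in range(0, len(stream), 10):
    chunk_number = index // 10 + 1
    for box_type, payload in parser.feed(stream[index:index + 10]):
        arrivals.append((chunk_number, box_type))
```

In this run the first box is available after the second of four chunks, so downstream work could begin while the remaining chunks are still in flight.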

old-and-new-parser

Implementation

The proposed framework is implemented using an event-based architecture.
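One way to picture an event-based pipeline is a small publish/subscribe bus that connects the downloader, the parser, and the decoder. The sketch below is an assumption about the general shape only: the `EventBus` class, the `"chunk"` and `"sample"` event names, and the fixed 4-byte sample size are all made up for the example.

```python
from collections import defaultdict

SAMPLE_SIZE = 4  # hypothetical fixed sample size, for illustration only


class EventBus:
    """Minimal synchronous publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def on(self, event, handler):
        self._handlers[event].append(handler)

    def emit(self, event, payload):
        for handler in list(self._handlers[event]):
            handler(payload)


def attach_parser(bus):
    """Subscribe a toy parser that turns "chunk" events into "sample" events."""
    pending = bytearray()

    def on_chunk(chunk):
        pending.extend(chunk)
        while len(pending) >= SAMPLE_SIZE:
            sample = bytes(pending[:SAMPLE_SIZE])
            del pending[:SAMPLE_SIZE]
            bus.emit("sample", sample)  # hand the sample over immediately

    bus.on("chunk", on_chunk)


log = []
bus = EventBus()
bus.on("chunk", lambda chunk: log.append(("downloaded", bytes(chunk))))
attach_parser(bus)
bus.on("sample", lambda sample: log.append(("decoded", sample)))

# The downloader side: every received chunk is published as an event.
for chunk in (b"AAAAB", b"BBBCC", b"CC"):
    bus.emit("chunk", chunk)
```

The resulting log alternates between download and decode entries, which is the interleaving the asynchronous design aims for. The components also know only about the bus, not about each other.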

event-based-impl

Performance Analysis

To compare the performance of the old and new parser implementations, experiments are performed on three types of segments: an audio segment, a video segment, and a trailer segment (non-encrypted video). The experiment configuration is as follows:

experiment-config

For each type of segment, 500 independent experiments are performed. Note that for the old parser, tFirst is the same as tLast because the entire segment is parsed as a whole.
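A measurement of this kind could be sketched as below. The sketch assumes that tFirst is the time until the first sample is available and tLast is the time until the last one is; that reading, the `measure_batch` and `measure_streaming` helpers, and the toy parsers are assumptions for illustration, not the harness used in these experiments.

```python
import time


def parse_all(data, sample_size=4):
    """Toy batch parser: returns every sample at once."""
    return [data[i:i + sample_size] for i in range(0, len(data), sample_size)]


def parse_iter(data, sample_size=4):
    """Toy streaming parser: yields samples one at a time."""
    for i in range(0, len(data), sample_size):
        yield data[i:i + sample_size]


def measure_batch(parser, data):
    start = time.perf_counter()
    samples = parser(data)
    delivered = time.perf_counter() - start
    # All samples become available at the same moment.
    return delivered, delivered, len(samples)


def measure_streaming(parser, data):
    start = time.perf_counter()
    t_first = t_last = None
    count = 0
    for _ in parser(data):
        t_last = time.perf_counter() - start
        if t_first is None:
            t_first = t_last  # time until the first sample is available
        count += 1
    return t_first, t_last, count


data = bytes(4000)
batch_first, batch_last, batch_count = measure_batch(parse_all, data)
stream_first, stream_last, stream_count = measure_streaming(parse_iter, data)
```

With a batch parser the two timestamps coincide by construction, while a streaming parser can report a first-sample time that is earlier than its last-sample time.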

Results of the audio segment:

performance-audio

Results of the video segment:

performance-video

Results of the trailer segment:

performance-trailer

Yiqi Yan