CDI (Cloud Digital Interface) Baseline Profile Audio Format Specification

Document version

01.00

Document owner

Amazon Web Services

Summary

This document specifies the format of audio payloads sent to and received from the CDI SDK.

Scope

The specifications in this document apply to audio streams carried through the AVM (Audio, Video, Metadata) API (Application Programming Interface) of the AWS CDI SDK (Software Development Kit).

Status

current

Compatibility

CDI SDK version 1.0 and later support this version of the CDI baseline profile. Minor release number changes will only contain clarifications and corrections and therefore will maintain backwards compatibility with the SDK. The major number will be incremented when new features are added or any other incompatibility is introduced. A corresponding update to the CDI SDK will be required to support the changes documented in these future baseline profile documents.

Configuration Structure

The CdiAvmConfig structure is used to pass media format information through the CDI SDK for each stream. It contains three parts: a URI, a data array, and a data array size. The URI for this specification is defined to be https://cdi.elemental.com/specs/baseline-audio .

The bytes of the array are the ASCII characters forming <name>=<value> pairs. Each pair is terminated by a semicolon, and each entry is separated from the next by a space character. data_size is the total number of characters comprising the string of names, values, and separators. No terminating carriage return or NUL character shall be included.
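The layout described above can be sketched in a few lines of Python. This is only an illustration of the byte format, not the SDK's implementation: the helper names build_config_data and parse_config_data are hypothetical, and the parameter names and values in the usage lines are placeholders rather than names defined by this profile.

```python
def build_config_data(params):
    """Serialize (name, value) pairs as 'name=value;' entries separated by spaces."""
    text = " ".join(f"{name}={value};" for name, value in params)
    data = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    return data, len(data)       # no trailing NUL or carriage return is added


def parse_config_data(data, data_size):
    """Split the first data_size bytes back into a list of (name, value) pairs."""
    text = data[:data_size].decode("ascii")
    pairs = []
    for entry in text.split(";"):
        entry = entry.strip()
        if entry:
            name, _, value = entry.partition("=")
            pairs.append((name, value))
    return pairs


# Placeholder names and values, used only to show the layout.
data, data_size = build_config_data([("alpha", "1"), ("beta", "two")])
print(data, data_size)  # b'alpha=1; beta=two;' 18
```

In practice, applications using baseline media types would normally rely on the SDK's own helper functions rather than hand-building this string.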

Supported parameter names and allowed values are:

The number of channels in the group is inferred from the order value as specified in the "Quantity of Audio Channels in group" column of ST 2110-30-2017, Table 1.

Note: The CDI SDK provides CdiAvmMakeBaselineConfiguration() for generating an appropriate CdiAvmConfig structure to pass to CdiAvmTxPayload() and CdiAvmParseBaselineConfiguration() for parsing CdiAvmConfig structures from the receive payload callback function. These functions alleviate the need for application programs to deal directly with the CdiAvmConfig structure for CDI baseline media types, including audio.

The remaining audio transport parameters of ST 2110-30 are fixed. CDI is standardized on 24-bit linear audio encapsulation as documented in IETF RFC 3190, named "L24" in that specification. Refer to RFC 3190 for the full details. The salient details are enumerated here:

Stream Contents

Each audio channel group must reside in its own stream within a connection. Individual streams have their own order, rate, and language but all of the channels in the stream’s group share all of these characteristics.

Payload Format

Before transmission, the application must place data into memory buffers in the format described here so that the receiving end can consume it; the application is solely responsible for this formatting. The receiving application should expect payloads in this same format. The time span of each audio payload is not required to match that of the associated video frames. To limit the buffering the receiver needs to align audio with its video, the time difference between the audio and video PTP timestamps must not exceed 200 ms when audio leads video or 50 ms when video leads audio. Payloads may contain as little as 1 ms and as much as 50 ms of audio. In the sample diagrams below, memory addresses increase going left to right, top to bottom.
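The timing limits above can be checked before a payload is handed to the SDK. The following is a minimal sketch, assuming three bytes per 24-bit sample (L24) and assuming the caller has already derived the audio lead from the PTP timestamps. All function and parameter names are hypothetical and are not part of the CDI SDK.

```python
BYTES_PER_SAMPLE = 3   # 24-bit linear audio (L24)
MIN_PAYLOAD_MS = 1     # shortest allowed payload
MAX_PAYLOAD_MS = 50    # longest allowed payload
MAX_AUDIO_LEAD_MS = 200
MAX_VIDEO_LEAD_MS = 50


def payload_duration_ms(payload_bytes, channels, rate_hz):
    """Duration covered by a payload, given the group's channel count and rate."""
    frame_bytes = BYTES_PER_SAMPLE * channels  # one sample from every channel
    if payload_bytes % frame_bytes:
        raise ValueError("payload does not hold whole samples for every channel")
    frames = payload_bytes // frame_bytes
    return frames * 1000 / rate_hz


def duration_is_allowed(payload_bytes, channels, rate_hz):
    """True when the payload covers between 1 ms and 50 ms of audio."""
    duration = payload_duration_ms(payload_bytes, channels, rate_hz)
    return MIN_PAYLOAD_MS <= duration <= MAX_PAYLOAD_MS


def skew_is_allowed(audio_lead_ms):
    """audio_lead_ms > 0: audio leads video; audio_lead_ms < 0: video leads audio."""
    return -MAX_VIDEO_LEAD_MS <= audio_lead_ms <= MAX_AUDIO_LEAD_MS


print(payload_duration_ms(864, channels=6, rate_hz=48000))  # 1.0
```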

Since all samples are 24 bits deep, each sample comprises three bytes.

            +------------------------+------------------------+------------------------+
1 sample:   | most significant byte  |      middle byte       | least significant byte |
            +------------------------+------------------------+------------------------+
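As an illustration of this byte order, the sketch below packs one sample into three bytes with the most significant byte first. It assumes samples are signed two's-complement integers, as L24 is defined in RFC 3190; the helper names are hypothetical.

```python
def pack_sample(value):
    """Encode a signed 24-bit sample as three bytes, most significant byte first."""
    # int.to_bytes raises OverflowError if value is outside the 24-bit range.
    return value.to_bytes(3, byteorder="big", signed=True)


def unpack_sample(data):
    """Decode three bytes (most significant first) into a signed integer."""
    if len(data) != 3:
        raise ValueError("a sample is exactly three bytes")
    return int.from_bytes(data, byteorder="big", signed=True)


print(pack_sample(0x123456).hex())  # 123456
```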

Samples from the group's channels are interleaved. Channel order within the group is as specified in Table 1 of ST 2110-30-2017.
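Interleaving can be sketched as follows: for each sample index, one three-byte sample is written for every channel, in the group's channel order, before moving to the next index. This is a minimal illustration assuming signed 24-bit integer samples; the function name interleave is hypothetical and the caller is assumed to supply the channels in the order required by Table 1.

```python
def interleave(channels):
    """Interleave per-channel sample lists into one payload buffer.

    channels: list of equal-length lists of signed 24-bit integers,
    already arranged in the group's channel order.
    """
    if len({len(ch) for ch in channels}) != 1:
        raise ValueError("need at least one channel, all of equal length")
    out = bytearray()
    for frame in zip(*channels):          # one sample index at a time
        for sample in frame:              # channels in group order
            out += sample.to_bytes(3, byteorder="big", signed=True)
    return bytes(out)


# Two channels, two samples each: output order is ch0[0], ch1[0], ch0[1], ch1[1].
stereo = interleave([[1, 2], [3, 4]])
print(stereo.hex())  # 000001000003000002000004
```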

Example 1

The simplest example is one group consisting of a monaural channel. The configuration structure for this example is:

Data for a 1 ms payload contains 288 bytes (96 samples * 3 bytes per sample) and would be formatted as shown:

            +-----------+-----------+-----------+--        -+-----------+
1 payload:  | sample 0  | sample 1  | sample 2  |    ...    | sample 95 |
            +-----------+-----------+-----------+-        --+-----------+
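The size arithmetic in this example can be written as a small helper. This is a sketch only: the function name is hypothetical, and it assumes the 96 samples per millisecond shown above correspond to a 96 kHz sampling rate.

```python
def payload_size_bytes(rate_hz, channels, duration_ms):
    """Bytes needed for duration_ms of 24-bit audio at rate_hz for all channels."""
    samples_per_channel = rate_hz * duration_ms // 1000
    return samples_per_channel * channels * 3  # three bytes per sample


print(payload_size_bytes(rate_hz=96000, channels=1, duration_ms=1))  # 288
```

The same calculation applies to multichannel groups by changing the channels argument.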

Example 2

Multiple channels require interleaving of the samples. This example has six channels as a 5.1 surround group type. The configuration structure is:

For 1 ms, the payload size is 864 bytes (48 samples * 3 bytes per sample * 6 channels) formatted in memory like:

            +--------------+--------------+--------------+--------------+--------------+--------------+
            | sample L0    | sample R0    | sample C0    | sample LFE0  | sample Ls0   | sample Rs0   |
            +--------------+--------------+--------------+--------------+--------------+--------------+
            | sample L1    | sample R1    | sample C1    | sample LFE1  | sample Ls1   | sample Rs1   |
1 payload:  +--------------+--------------+--------------+--------------+--------------+--------------+
                                                        ...
            +--------------+--------------+--------------+--------------+--------------+--------------+
            | sample L47   | sample R47   | sample C47   | sample LFE47 | sample Ls47  | sample Rs47  |
            +--------------+--------------+--------------+--------------+--------------+--------------+

The packetizer on the transmit side breaks payloads up, if necessary, to fit into network packets, which directly affects the scatter-gather list entries on the receive side. It ensures that each entry contains only whole samples from all of the channels in the group. In other words, scatter-gather entry boundaries always fall between the last byte of the last channel of the group and the first byte of the first channel of the group.
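The boundary rule can be illustrated with a short sketch that splits a payload into chunks whose lengths are always a multiple of the group's frame size (three bytes times the number of channels). This is not the SDK's packetizer; the function name and the max_entry_bytes parameter are hypothetical and stand in for whatever packet size limit applies.

```python
def split_on_frame_boundaries(payload, channels, max_entry_bytes):
    """Split payload into chunks that each hold only whole frames.

    A frame is one 3-byte sample from every channel in the group.
    """
    frame_bytes = 3 * channels
    if len(payload) % frame_bytes:
        raise ValueError("payload does not hold whole frames")
    frames_per_entry = max_entry_bytes // frame_bytes
    if frames_per_entry == 0:
        raise ValueError("max_entry_bytes is smaller than one frame")
    step = frames_per_entry * frame_bytes  # largest whole-frame size that fits
    return [payload[i:i + step] for i in range(0, len(payload), step)]


# A 1 ms, six-channel, 48 kHz payload (864 bytes) with an arbitrary 500-byte limit.
entries = split_on_frame_boundaries(bytes(864), channels=6, max_entry_bytes=500)
print([len(e) for e in entries])  # [486, 378]
```

A receiver that relies on this property can process each entry independently without carrying partial samples over to the next one.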

References

  1. IETF RFC 3190: https://tools.ietf.org/html/rfc3190

  2. SMPTE ST 2110-30:2017, Professional Media Over Managed IP Networks: PCM Digital Audio, 27 Nov. 2017, doi: 10.5594/SMPTE.ST2110-30.2017: https://ieeexplore.ieee.org/document/8167392