rfc9628.original | rfc9628.txt | |||
---|---|---|---|---|
AVTCore Working Group J. Uberti | Internet Engineering Task Force (IETF) J. Uberti | |||
Internet-Draft S. Holmer | Request for Comments: 9628 S. Holmer | |||
Intended status: Standards Track M. Flodman | Category: Standards Track M. Flodman | |||
Expires: 12 December 2021 D. Hong | ISSN: 2070-1721 D. Hong | |||
J. Lennox | J. Lennox | |||
8x8 / Jitsi | 8x8 / Jitsi | |||
10 June 2021 | August 2024 | |||
RTP Payload Format for VP9 Video | RTP Payload Format for VP9 Video | |||
draft-ietf-payload-vp9-16 | ||||
Abstract | Abstract | |||
This specification describes an RTP payload format for the VP9 video | This specification describes an RTP payload format for the VP9 video | |||
codec. The payload format has wide applicability, as it supports | codec. The payload format has wide applicability as it supports | |||
applications from low bit-rate peer-to-peer usage, to high bit-rate | applications from low bitrate peer-to-peer usage to high bitrate | |||
video conferences. It includes provisions for temporal and spatial | video conferences. It includes provisions for temporal and spatial | |||
scalability. | scalability. | |||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This is an Internet Standards Track document. | |||
provisions of BCP 78 and BCP 79. | ||||
Internet-Drafts are working documents of the Internet Engineering | ||||
Task Force (IETF). Note that other groups may also distribute | ||||
working documents as Internet-Drafts. The list of current Internet- | ||||
Drafts is at https://datatracker.ietf.org/drafts/current/. | ||||
Internet-Drafts are draft documents valid for a maximum of six months | This document is a product of the Internet Engineering Task Force | |||
and may be updated, replaced, or obsoleted by other documents at any | (IETF). It represents the consensus of the IETF community. It has | |||
time. It is inappropriate to use Internet-Drafts as reference | received public review and has been approved for publication by the | |||
material or to cite them other than as "work in progress." | Internet Engineering Steering Group (IESG). Further information on | |||
Internet Standards is available in Section 2 of RFC 7841. | ||||
This Internet-Draft will expire on 12 December 2021. | Information about the current status of this document, any errata, | |||
and how to provide feedback on it may be obtained at | ||||
https://www.rfc-editor.org/info/rfc9628. | ||||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2021 IETF Trust and the persons identified as the | Copyright (c) 2024 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents (https://trustee.ietf.org/ | Provisions Relating to IETF Documents | |||
license-info) in effect on the date of publication of this document. | (https://trustee.ietf.org/license-info) in effect on the date of | |||
Please review these documents carefully, as they describe your rights | publication of this document. Please review these documents | |||
and restrictions with respect to this document. Code Components | carefully, as they describe your rights and restrictions with respect | |||
extracted from this document must include Simplified BSD License text | to this document. Code Components extracted from this document must | |||
as described in Section 4.e of the Trust Legal Provisions and are | include Revised BSD License text as described in Section 4.e of the | |||
provided without warranty as described in the Simplified BSD License. | Trust Legal Provisions and are provided without warranty as described | |||
in the Revised BSD License. | ||||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 | 1. Introduction | |||
2. Conventions, Definitions and Acronyms . . . . . . . . . . . . 3 | 2. Conventions | |||
3. Media Format Description . . . . . . . . . . . . . . . . . . 3 | 3. Media Format Description | |||
4. Payload Format . . . . . . . . . . . . . . . . . . . . . . . 5 | 4. Payload Format | |||
4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 5 | 4.1. RTP Header Usage | |||
4.2. VP9 Payload Descriptor . . . . . . . . . . . . . . . . . 6 | 4.2. VP9 Payload Descriptor | |||
4.2.1. Scalability Structure (SS): . . . . . . . . . . . . . 11 | 4.2.1. Scalability Structure (SS) | |||
4.3. Frame Fragmentation . . . . . . . . . . . . . . . . . . . 13 | 4.3. Frame Fragmentation | |||
4.4. Scalable encoding considerations . . . . . . . . . . . . 13 | 4.4. Scalable Encoding Considerations | |||
4.5. Examples of VP9 RTP Stream . . . . . . . . . . . . . . . 13 | 4.5. Examples of VP9 RTP Stream | |||
4.5.1. Reference picture use for scalable structure . . . . 14 | 4.5.1. Reference Picture Use for Scalable Structure | |||
5. Feedback Messages and Header Extensions . . . . . . . . . . . 14 | 5. Feedback Messages and Header Extensions | |||
5.1. Reference Picture Selection Indication (RPSI) . . . . . . 15 | 5.1. Reference Picture Selection Indication (RPSI) | |||
5.2. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 15 | 5.2. Full Intra Request (FIR) | |||
5.3. Layer Refresh Request (LRR) . . . . . . . . . . . . . . . 15 | 5.3. Layer Refresh Request (LRR) | |||
6. Payload Format Parameters . . . . . . . . . . . . . . . . . . 16 | 6. Payload Format Parameters | |||
6.1. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 18 | 6.1. SDP Parameters | |||
6.1.1. Mapping of Media Subtype Parameters to SDP . . . . . 18 | 6.1.1. Mapping of Media Subtype Parameters to SDP | |||
6.1.2. Offer/Answer Considerations . . . . . . . . . . . . . 19 | 6.1.2. Offer/Answer Considerations | |||
7. Media Type Definition . . . . . . . . . . . . . . . . . . . . 19 | 7. Media Type Definition | |||
8. Security Considerations . . . . . . . . . . . . . . . . . . . 21 | 8. Security Considerations | |||
9. Congestion Control . . . . . . . . . . . . . . . . . . . . . 21 | 9. Congestion Control | |||
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 | 10. IANA Considerations | |||
11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 22 | 11. References | |||
12. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 | 11.1. Normative References | |||
12.1. Normative References . . . . . . . . . . . . . . . . . . 22 | 11.2. Informative References | |||
12.2. Informative References . . . . . . . . . . . . . . . . . 23 | Acknowledgments | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 | Authors' Addresses | |||
1. Introduction | 1. Introduction | |||
This specification describes an RTP [RFC3550] payload specification | This document describes an RTP [RFC3550] payload specification | |||
applicable to the transmission of video streams encoded using the VP9 | applicable to the transmission of video streams encoded using the VP9 | |||
video codec [VP9-BITSTREAM]. The format described in this document | video codec [VP9-BITSTREAM]. The format described in this document | |||
can be used both in peer-to-peer and video conferencing applications. | can be used both in peer-to-peer and video conferencing applications. | |||
The VP9 video codec was developed by Google, and is the successor to | The VP9 video codec was developed by Google and is the successor to | |||
its earlier VP8 [RFC6386] codec. Above the compression improvements | its earlier VP8 [RFC6386] codec. Above the compression improvements | |||
and other general enhancements above VP8, VP9 is also designed in a | and other general enhancements to VP8, VP9 is also designed in a way | |||
way that allows spatially-scalable video encoding. | that allows spatially scalable video encoding. | |||
2. Conventions, Definitions and Acronyms | 2. Conventions | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and | |||
"OPTIONAL" in this document are to be interpreted as described in BCP | "OPTIONAL" in this document are to be interpreted as described in | |||
14 [RFC2119] [RFC8174] when, and only when, they appear in all | BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all | |||
capitals, as shown here. | capitals, as shown here. | |||
3. Media Format Description | 3. Media Format Description | |||
The VP9 codec can maintain up to eight reference frames, of which up | The VP9 codec can maintain up to eight reference frames, of which up | |||
to three can be referenced by any new frame. | to three can be referenced by any new frame. | |||
VP9 also allows a frame to use another frame of a different | VP9 also allows a frame to use another frame of a different | |||
resolution as a reference frame. (Specifically, a frame may use any | resolution as a reference frame. (Specifically, a frame may use any | |||
references whose width and height are between 1/16th that of the | references whose width and height are between 1/16th that of the | |||
current frame and twice that of the current frame, inclusive.) This | current frame and twice that of the current frame, inclusive.) This | |||
allows internal resolution changes without requiring the use of key | allows internal resolution changes without requiring the use of | |||
frames. | keyframes. | |||
These features together enable an encoder to implement various forms | These features together enable an encoder to implement various forms | |||
of coarse-grained scalability, including temporal, spatial and | of coarse-grained scalability, including temporal, spatial, and | |||
quality scalability modes, as well as combinations of these, without | quality scalability modes, as well as combinations of these, without | |||
the need for explicit scalable coding tools. | the need for explicit scalable coding tools. | |||
Temporal layers define different frame rates of video; spatial and | Temporal layers define different frame rates of video; spatial and | |||
quality layers define different and possibly dependent | quality layers define different and possibly dependent | |||
representations of a single input frame. Spatial layers allow a | representations of a single input frame. Spatial layers allow a | |||
frame to be encoded at different resolutions, whereas quality layers | frame to be encoded at different resolutions, whereas quality layers | |||
allow a frame to be encoded at the same resolution but at different | allow a frame to be encoded at the same resolution but at different | |||
qualities (and thus with different amounts of coding error). VP9 | qualities (and, thus, with different amounts of coding error). VP9 | |||
supports quality layers as spatial layers without any resolution | supports quality layers as spatial layers without any resolution | |||
changes; hereinafter, the term "spatial layer" is used to represent | changes; hereinafter, the term "spatial layer" is used to represent | |||
both spatial and quality layers. | both spatial and quality layers. | |||
This payload format specification defines how such temporal and | This payload format specification defines how such temporal and | |||
spatial scalability layers can be described and communicated. | spatial scalability layers can be described and communicated. | |||
Temporal and spatial scalability layers are associated with non- | Temporal and spatial scalability layers are associated with non- | |||
negative integer IDs. The lowest layer of either type has an ID of | negative integer IDs. The lowest layer of either type has an ID of 0 | |||
0, and is sometimes referred to as the "base" temporal or spatial | and is sometimes referred to as the "base" temporal or spatial layer. | |||
layer. | ||||
Layers are designed, and MUST be encoded, such that if any layer, and | Layers are designed, and MUST be encoded, such that if any layer, and | |||
all higher layers, are removed from the bitstream along either the | all higher layers, are removed from the bitstream along either the | |||
spatial or temporal dimension, the remaining bitstream is still | spatial or temporal dimension, the remaining bitstream is still | |||
correctly decodable. | correctly decodable. | |||
For terminology, this document uses the term "frame" to refer to a | For terminology, this document uses the term "frame" to refer to a | |||
single encoded VP9 frame for a particular resolution/quality, and | single encoded VP9 frame for a particular resolution/quality, and | |||
"picture" to refer to all the representations (frames) at a single | "picture" to refer to all the representations (frames) at a single | |||
instant in time. A picture thus consists of one or more frames, | instant in time. Thus, a picture consists of one or more frames, | |||
encoding different spatial layers. | encoding different spatial layers. | |||
Within a picture, a frame with spatial layer ID equal to SID, where | Within a picture, a frame with spatial-layer ID equal to SID, where | |||
SID > 0, can depend on a frame of the same picture with a lower | SID > 0, can depend on a frame of the same picture with a lower | |||
spatial layer ID. This "inter-layer" dependency can result in | spatial-layer ID. This "inter-layer" dependency can result in | |||
additional coding gain compared to the case where only traditional | additional coding gain compared to the case where only traditional | |||
"inter-picture" dependency is used, where a frame depends on | "inter-picture" dependency is used, where a frame depends on a | |||
previously coded frame in time. For simplicity, this payload format | previously coded frame in time. For simplicity, this payload format | |||
assumes that, within a picture and if inter-layer dependency is used, | assumes that, within a picture and if inter-layer dependency is used, | |||
a spatial layer SID frame can depend only on the immediately previous | a spatial-layer SID frame can depend only on the immediately previous | |||
spatial layer SID-1 frame, when S > 0. Additionally, if inter- | spatial-layer SID-1 frame, when S > 0. Additionally, if inter- | |||
picture dependency is used, a spatial layer SID frame is assumed to | picture dependency is used, a spatial-layer SID frame is assumed to | |||
only depend on a previously coded spatial layer SID frame. | only depend on a previously coded spatial-layer SID frame. | |||
Given above simplifications for inter-layer and inter-picture | Given the above simplifications for inter-layer and inter-picture | |||
dependencies, a flag (the D bit described below) is used to indicate | dependencies, a flag (the D bit described below) is used to indicate | |||
whether a spatial layer SID frame depends on the spatial layer SID-1 | whether a spatial-layer SID frame depends on the spatial-layer SID-1 | |||
frame. Given the D bit, a receiver only needs to additionally know | frame. Given the D bit, a receiver only needs to additionally know | |||
the inter-picture dependency structure for a given spatial layer | the inter-picture dependency structure for a given spatial-layer | |||
frame in order to determine its decodability. Two modes of | frame in order to determine its decodability. Two modes of | |||
describing the inter-picture dependency structure are possible: | describing the inter-picture dependency structure are possible: | |||
"flexible mode" and "non-flexible mode". An encoder can only switch | "flexible mode" and "non-flexible mode". An encoder can only switch | |||
between the two on the first packet of a key frame with temporal | between the two on the first packet of a keyframe with a temporal- | |||
layer ID equal to 0. | layer ID equal to 0. | |||
In flexible mode, each packet can contain up to 3 reference indices, | In flexible mode, each packet can contain up to three reference | |||
which identify all frames referenced by the frame transmitted in the | indices, which identify all frames referenced by the frame | |||
current packet for inter-picture prediction. This (along with the D | transmitted in the current packet for inter-picture prediction. This | |||
bit) enables a receiver to identify if a frame is decodable or not | (along with the D bit) enables a receiver to identify if a frame is | |||
and helps it understand the temporal layer structure. Since this is | decodable or not and helps it understand the temporal-layer | |||
signaled in each packet it makes it possible to have very flexible | structure. Since this is signaled in each packet, it makes it | |||
temporal layer hierarchies, and scalability structures which are | possible to have very flexible temporal-layer hierarchies and | |||
changing dynamically. | scalability structures, which are changing dynamically. | |||
In non-flexible mode, frames are encoded using a fixed, recurring | In non-flexible mode, frames are encoded using a fixed, recurring | |||
pattern of dependencies; the set of pictures that recur in this | pattern of dependencies; the set of pictures that recur in this | |||
pattern is known as a Picture Group (PG). In this mode, the inter- | pattern is known as a "Picture Group" (or "PG"). In this mode, the | |||
picture dependencies (the reference indices) of the Picture Group | inter-picture dependencies (the reference indices) of the PG MUST be | |||
MUST be pre-specified as part of the scalability structure (SS) data. | pre-specified as part of the Scalability Structure (SS) data. Each | |||
Each packet has an index to refer to one of the described pictures in | packet has an index to refer to one of the described pictures in the | |||
the PG, from which the pictures referenced by the picture transmitted | PG from which the pictures referenced by the picture transmitted in | |||
in the current packet for inter-picture prediction can be identified. | the current packet for inter-picture prediction can be identified. | |||
(Note: A "Picture Group", as used in this document, is not the same | Note: A "Picture Group" or "PG", as used in this document, is not the | |||
thing as the term "Group of Pictures" as it is traditionally used in | same thing as the term "Group of Pictures" as it is traditionally | |||
video coding, i.e. to mean an independently-decoadable run of | used in video coding, i.e., to mean an independently decodable run of | |||
pictures beginning with a keyframe.) | pictures beginning with a keyframe. | |||
The SS data can also be used to specify the resolution of each | The SS data can also be used to specify the resolution of each | |||
spatial layer present in the VP9 stream for both flexible and non- | spatial layer present in the VP9 stream for both flexible and non- | |||
flexible modes. | flexible modes. | |||
4. Payload Format | 4. Payload Format | |||
This section describes how the encoded VP9 bitstream is encapsulated | This section describes how the encoded VP9 bitstream is encapsulated | |||
in RTP. To handle network losses usage of RTP/AVPF [RFC4585] is | in RTP. To handle network losses, usage of RTP/AVPF [RFC4585] is | |||
RECOMMENDED. All integer fields in the specifications are encoded as | RECOMMENDED. All integer fields in the specifications are encoded as | |||
unsigned integers in network octet order. | unsigned integers in network octet order. | |||
4.1. RTP Header Usage | 4.1. RTP Header Usage | |||
The general RTP payload format for VP9 is depicted below. | The general RTP payload format for VP9 is depicted below. | |||
0 1 2 3 | 0 1 2 3 | |||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
skipping to change at page 6, line 4 | skipping to change at line 231 | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| : | | | : | | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | |||
| | | | | | |||
+ | | + | | |||
: VP9 payload : | : VP9 payload : | |||
| | | | | | |||
| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| : OPTIONAL RTP padding | | | : OPTIONAL RTP padding | | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
Figure 1 | ||||
The VP9 payload descriptor will be described in Section 4.2; the VP9 | Figure 1: General RTP Payload Format for VP9 | |||
payload is described in [VP9-BITSTREAM]. OPTIONAL RTP padding MUST | ||||
NOT be included unless the P bit is set. | ||||
Marker bit (M): MUST be set to 1 for the final packet of the highest | See Section 4.2 for more information on the VP9 payload descriptor; | |||
spatial layer frame (the final packet of the picture), and 0 | the VP9 payload is described in [VP9-BITSTREAM]. OPTIONAL RTP | |||
otherwise. Unless spatial scalability is in use for this picture, | padding MUST NOT be included unless the P bit is set. | |||
this will have the same value as the E bit described below. Note | ||||
this bit MUST be set to 1 for the target spatial layer frame if a | Marker bit (M): This bit MUST be set to 1 for the final packet of | |||
stream is being rewritten to remove higher spatial layers. | the highest spatial-layer frame (the final packet of the picture), | |||
and 0 otherwise. Unless spatial scalability is in use for this | ||||
picture, this bit will have the same value as the E bit described | ||||
in Section 4.2. Note this bit MUST be set to 1 for the target | ||||
spatial-layer frame if a stream is being rewritten to remove | ||||
higher spatial layers. | ||||
Payload Type (PT): In line with the policy in Section 3 of | Payload Type (PT): In line with the policy in Section 3 of | |||
[RFC3551], applications using the VP9 RTP payload profile MUST | [RFC3551], applications using the VP9 RTP payload profile MUST | |||
assign a dynamic payload type number to be used in each RTP | assign a dynamic payload type number to be used in each RTP | |||
session and provide a mechanism to indicate the mapping. See | session and provide a mechanism to indicate the mapping. See | |||
Section 6.1 for the mechanism to be used with the Session | Section 6.1 for the mechanism to be used with the Session | |||
Description Protocol (SDP) [RFC8866]. | Description Protocol (SDP) [RFC8866]. | |||
Timestamp: The RTP timestamp [RFC3550] indicates the time when the | Timestamp: The RTP timestamp [RFC3550] indicates the time when the | |||
input frame was sampled, at a clock rate of 90 kHz. If the input | input frame was sampled, at a clock rate of 90 kHz. If the input | |||
picture is encoded with multiple layer frames, all of the frames | picture is encoded with multiple-layer frames, all of the frames | |||
of the picture MUST have the same timestamp. | of the picture MUST have the same timestamp. | |||
If a frame has the VP9 show_frame field set to 0 (i.e., it is | If a frame has the VP9 show_frame field set to 0 (i.e., it is | |||
meant only to populate a reference buffer, without being output) | meant only to populate a reference buffer without being output), | |||
its timestamp MAY alternatively be set to be the same as the | its timestamp MAY alternatively be set to be the same as the | |||
subsequent frame with show_frame equal to 1. (This will be | subsequent frame with show_frame equal to 1. (This will be | |||
convenient for playing out pre-encoded content packaged with VP9 | convenient for playing out pre-encoded content packaged with VP9 | |||
"superframes", which typically bundle show_frame==0 frames with a | "superframes", which typically bundle show_frame==0 frames with a | |||
subsequent show_frame==1 frame.) Every frame with show_frame==1, | subsequent show_frame==1 frame.) Every frame with show_frame==1, | |||
however, MUST have a unique timestamp modulo the 2^32 wrap of the | however, MUST have a unique timestamp modulo the 2^32 wrap of the | |||
field. | field. | |||
The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number, | The remaining RTP Fixed Header Fields (V, P, X, CC, sequence number, | |||
SSRC and CSRC identifiers) are used as specified in Section 5.1 of | SSRC, and CSRC identifiers) are used as specified in Section 5.1 of | |||
[RFC3550]. | [RFC3550]. | |||
4.2. VP9 Payload Descriptor | 4.2. VP9 Payload Descriptor | |||
In flexible mode (with the F bit below set to 1), the first octets | In flexible mode (with the F bit below set to 1), the first octets | |||
after the RTP header are the VP9 payload descriptor, with the | after the RTP header are the VP9 payload descriptor, with the | |||
following structure. | following structure. | |||
0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
skipping to change at page 7, line 21 | skipping to change at line 294 | |||
M: | EXTENDED PID | (RECOMMENDED) | M: | EXTENDED PID | (RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
L: | TID |U| SID |D| (Conditionally RECOMMENDED) | L: | TID |U| SID |D| (Conditionally RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ -\ | +-+-+-+-+-+-+-+-+ -\ | |||
P,F: | P_DIFF |N| (Conditionally REQUIRED) - up to 3 times | P,F: | P_DIFF |N| (Conditionally REQUIRED) - up to 3 times | |||
+-+-+-+-+-+-+-+-+ -/ | +-+-+-+-+-+-+-+-+ -/ | |||
V: | SS | | V: | SS | | |||
| .. | | | .. | | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
Figure 2 | Figure 2: Flexible Mode Format for VP9 Payload Descriptor | |||
In non-flexible mode (with the F bit below set to 0), the first | In non-flexible mode (with the F bit below set to 0), the first | |||
octets after the RTP header are the VP9 payload descriptor, with the | octets after the RTP header are the VP9 payload descriptor, with the | |||
following structure. | following structure. | |||
0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
|I|P|L|F|B|E|V|Z| (REQUIRED) | |I|P|L|F|B|E|V|Z| (REQUIRED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
I: |M| PICTURE ID | (RECOMMENDED) | I: |M| PICTURE ID | (RECOMMENDED) | |||
skipping to change at page 7, line 43 | skipping to change at line 316 | |||
M: | EXTENDED PID | (RECOMMENDED) | M: | EXTENDED PID | (RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
L: | TID |U| SID |D| (Conditionally RECOMMENDED) | L: | TID |U| SID |D| (Conditionally RECOMMENDED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
| TL0PICIDX | (Conditionally REQUIRED) | | TL0PICIDX | (Conditionally REQUIRED) | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
V: | SS | | V: | SS | | |||
| .. | | | .. | | |||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
Figure 3 | Figure 3: Non-flexible Mode Format for VP9 Payload Descriptor | |||
I: Picture ID (PID) present. When set to one, the OPTIONAL PID MUST | I: Picture ID (PID) present. When set to 1, the OPTIONAL PID MUST | |||
be present after the mandatory first octet and specified as below. | be present after the mandatory first octet and specified as below. | |||
Otherwise, PID MUST NOT be present. If the V bit was set in the | Otherwise, PID MUST NOT be present. If the V bit was set in the | |||
stream's most recent start of a keyframe (i.e. the SS field was | stream's most recent start of a keyframe (i.e., the SS field was | |||
present) and the F bit is set to 0 (i.e. non-flexible scalability | present) and the F bit is set to 0 (i.e., non-flexible scalability | |||
mode is in use), then this bit MUST be set on every packet. | mode is in use), then this bit MUST be set on every packet. | |||
P: Inter-picture predicted frame. When set to zero, the frame does | P: Inter-picture predicted frame. When set to 0, the frame does not | |||
not utilize inter-picture prediction. In this case, up-switching | utilize inter-picture prediction. In this case, up-switching to a | |||
to a current spatial layer's frame is possible from directly lower | current spatial layer's frame is possible from a directly lower | |||
spatial layer frame. P SHOULD also be set to zero when encoding a | spatial-layer frame. P SHOULD also be set to 0 when encoding a | |||
layer synchronization frame in response to an LRR | layer synchronization frame in response to a Layer Refresh Request | |||
[I-D.ietf-avtext-lrr] message (see Section 5.3). When P is set to | (LRR) [RFC9627] message (see Section 5.3). When P is set to 0, | |||
zero, the TID field (described below) MUST also be set to 0 (if | the TID field (described below) MUST also be set to 0 (if | |||
present). Note that the P bit does not forbid intra-picture, | present). Note that the P bit does not forbid intra-picture, | |||
inter-layer prediction from earlier frames of the same picture, if | inter-layer prediction from earlier frames of the same picture, if | |||
any. | any. | |||
L: Layer indices present. When set to one, the one or two octets | L: Layer indices present. When set to 1, the one or two octets | |||
following the mandatory first octet and the PID (if present) is as | following the mandatory first octet and the PID (if present) is as | |||
described by "Layer indices" below. If the F bit (described | described by "Layer indices" below. If the F bit (described | |||
below) is set to 1 (indicating flexible mode), then only one octet | below) is set to 1 (indicating flexible mode), then only one octet | |||
is present for the layer indices. Otherwise if the F bit is set | is present for the layer indices. Otherwise, if the F bit is set | |||
to 0 (indicating non-flexible mode), then two octets are present | to 0 (indicating non-flexible mode), then two octets are present | |||
for the layer indices. | for the layer indices. | |||
F: Flexible mode. F set to one indicates flexible mode and if the P | F: Flexible mode. When set to 1, this indicates flexible mode; if | |||
bit is also set to one, then the octets following the mandatory | the P bit is also set to 1, then the octets following the | |||
first octet, the PID, and layer indices (if present) are as | mandatory first octet, the PID, and layer indices (if present) are | |||
described by "Reference indices" below. This MUST only be set to | as described by "Reference indices" below. This bit MUST only be | |||
1 if the I bit is also set to one; if the I bit is set to zero, | set to 1 if the I bit is also set to 1; if the I bit is set to 0, | |||
then this MUST also be set to zero and ignored by receivers. | then this bit MUST also be set to 0 and ignored by receivers. | |||
(Flexible mode's Reference indices are defined as offsets from the | (Flexible mode's Reference indices are defined as offsets from the | |||
Picture ID field, so they would have no meaning if I were not | Picture ID field, so they would have no meaning if I were not | |||
set.) The value of this F bit MUST only change on the first | set.) The value of the F bit MUST only change on the first packet | |||
packet of a key picture. A key picture is a picture whose base | of a key picture. A "key picture" is a picture whose base | |||
spatial layer frame is a key frame, and which thus completely | spatial-layer frame is a keyframe, and thus one which completely | |||
resets the encoder state. This packet will have its P bit equal | resets the encoder state. This packet will have its P bit equal | |||
to zero, SID or L bit (described below) equal to zero, and B bit | to 0, SID or L bit (described below) equal to 0, and B bit | |||
(described below) equal to 1. | (described below) equal to 1. | |||
B: Start of a frame. MUST be set to 1 if the first payload octet of | B: Start of a frame. This bit MUST be set to 1 if the first payload | |||
the RTP packet is the beginning of a new VP9 frame, and MUST NOT | octet of the RTP packet is the beginning of a new VP9 frame; | |||
be 1 otherwise. Note that this frame might not be the first frame | otherwise, it MUST NOT be 1. Note that this frame might not be | |||
of a picture. | the first frame of a picture. | |||
E: End of a frame. MUST be set to 1 for the final RTP packet of a | E: End of a frame. This bit MUST be set to 1 for the final RTP | |||
VP9 frame, and 0 otherwise. This enables a decoder to finish | packet of a VP9 frame, and 0 otherwise. This enables a decoder to | |||
decoding the frame, where it otherwise may need to wait for the | finish decoding the frame, where it otherwise may need to wait for | |||
next packet to explicitly know that the frame is complete. Note | the next packet to explicitly know that the frame is complete. | |||
that, if spatial scalability is in use, more frames from the same | Note that, if spatial scalability is in use, more frames from the | |||
picture may follow; see the description of the B bit above. | same picture may follow; see the description of the B bit above. | |||
V: Scalability structure (SS) data present. When set to one, the | V: Scalability Structure (SS) data present. When set to 1, the | |||
OPTIONAL SS data MUST be present in the payload descriptor. | OPTIONAL SS data MUST be present in the payload descriptor. | |||
Otherwise, the SS data MUST NOT be present. | Otherwise, the SS data MUST NOT be present. | |||
Z: Not a reference frame for upper spatial layers. If set to 1, | Z: Not a reference frame for upper spatial layers. If set to 1, | |||
indicates that frames with higher spatial layers SID+1 and greater | indicates that frames with higher spatial layers SID+1 and greater | |||
of the current and following pictures do not depend on the current | of the current and following pictures do not depend on the current | |||
spatial layer SID frame. This enables a decoder which is | spatial-layer SID frame. This enables a decoder that is targeting | |||
targeting a higher spatial layer to know that it can safely | a higher spatial layer to know that it can safely discard this | |||
discard this packet's frame without processing it, without having | packet's frame without processing it, without having to wait for | |||
to wait for the "D" bit in the higher-layer frame (see below). | the D bit in the higher-layer frame (see below). | |||
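As an illustration only, the flag bits of the mandatory first octet could be unpacked as in the sketch below. It assumes the bit order I, P, L, F, B, E, V, Z from most to least significant bit, as drawn in Figures 2 and 3. The names Vp9Flags and parse_first_octet are hypothetical and are not defined by this specification.

```python
from dataclasses import dataclass


@dataclass
class Vp9Flags:
    i: bool  # Picture ID present
    p: bool  # inter-picture predicted frame
    l: bool  # layer indices present
    f: bool  # flexible mode
    b: bool  # start of a frame
    e: bool  # end of a frame
    v: bool  # SS data present
    z: bool  # not a reference frame for upper spatial layers


def parse_first_octet(octet: int) -> Vp9Flags:
    """Unpack the eight flag bits, most significant bit first (assumed order)."""
    bits = [((octet >> shift) & 1) == 1 for shift in range(7, -1, -1)]
    flags = Vp9Flags(*bits)
    # F is only meaningful when I is set; receivers ignore it otherwise.
    if not flags.i:
        flags.f = False
    return flags
```

A receiver would typically call this once per packet and then use the I, L, F, P, and V flags to decide which optional fields to read next.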
The mandatory first octet is followed by the extension data fields | The mandatory first octet is followed by the extension data fields | |||
that are enabled: | that are enabled: | |||
M: The most significant bit of the first octet is an extension flag. | M: The most significant bit of the first octet is an extension flag. | |||
The field MUST be present if the I bit is equal to one. If M is | The field MUST be present if the I bit is equal to one. If M is | |||
set, the PID field MUST contain 15 bits; otherwise, it MUST | set, the PID field MUST contain 15 bits; otherwise, it MUST | |||
contain 7 bits. See PID below. | contain 7 bits. See PID below. | |||
Picture ID (PID): Picture ID represented in 7 or 15 bits, depending | Picture ID (PID): Picture ID represented in 7 or 15 bits, depending | |||
on the M bit. This is a running index of the pictures, where the | on the M bit. This is a running index of the pictures, where the | |||
sender increments the value by 1 for each picture it sends. (Note | sender increments the value by 1 for each picture it sends. | |||
however that because a middlebox can discard pictures where | (Note, however, that because a middlebox can discard pictures | |||
permitted by the scalability structure, Picture IDs as received by | where permitted by the SS, Picture IDs as received by a receiver | |||
a receiver might not be contiguous.) This field MUST be present | might not be contiguous.) This field MUST be present if the I bit | |||
if the I bit is equal to one. If M is set to zero, 7 bits carry | is equal to one. If M is set to 0, 7 bits carry the PID; else, if | |||
the PID; else if M is set to one, 15 bits carry the PID in network | M is set to 1, 15 bits carry the PID in network byte order. The | |||
byte order. The sender may choose between a 7- or 15-bit index. | sender may choose between a 7- or 15-bit index. The PID SHOULD | |||
The PID SHOULD start on a random number, and MUST wrap after | start on a random number and MUST wrap after reaching the maximum | |||
reaching the maximum ID (0x7f or 0x7fff depending on the index | ID (0x7f or 0x7fff depending on the index size chosen). The | |||
size chosen). The receiver MUST NOT assume that the number of | receiver MUST NOT assume that the number of bits in the PID stays | |||
bits in PID stay the same through the session. If this field | the same through the session. If this field transitions from 7 | |||
transitions from 7-bits to 15-bits, the value is zero-extended | bits to 15 bits, the value is zero-extended (i.e., the value after | |||
(i.e. the value after 0x6e is 0x006f); if the field transitions | 0x6e is 0x006f); if the field transitions from 15 bits to 7 bits, | |||
from 15 bits to 7 bits, it is truncated (i.e. the value after | it is truncated (i.e., the value after 0x1bbe is 0xbf). | |||
0x1bbe is 0xbf). | ||||
In the non-flexible mode (when the F bit is set to 0), this PID is | In the non-flexible mode (when the F bit is set to 0), this PID is | |||
used as an index to the picture group (PG) specified in the SS | used as an index to the PG specified in the SS data below. In | |||
data below. In this mode, the PID of the key frame corresponds to | this mode, the PID of the keyframe corresponds to the first | |||
the first specified frame in the PG. Then subsequent PIDs are | specified frame in the PG. Then subsequent PIDs are mapped to | |||
mapped to subsequently specified frames in the PG (modulo N_G, | subsequently specified frames in the PG (modulo N_G, specified in | |||
specified in the SS data below), respectively. | the SS data below), respectively. | |||
All frames of the same picture MUST have the same PID value. | All frames of the same picture MUST have the same PID value. | |||
Frames (and their corresponding pictures) with the VP9 show_frame | Frames (and their corresponding pictures) with the VP9 show_frame | |||
field equal to 0 MUST have distinct PID values from subsequent | field equal to 0 MUST have distinct PID values from subsequent | |||
pictures with show_frame equal to 1. Thus, a Picture as defined | pictures with show_frame equal to 1. Thus, a picture (as defined | |||
in this specification is different than a VP9 Superframe. | in this specification) is different than a VP9 superframe. | |||
All frames of the same picture MUST have the same value for | All frames of the same picture MUST have the same value for | |||
show_frame. | show_frame. | |||
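The M bit and PID rules above can be sketched as follows. This is a minimal, non-normative example: it assumes the caller has already checked that the I bit is set and passes the octets that follow the first octet. The function names parse_picture_id and next_picture_id are hypothetical.

```python
def parse_picture_id(data: bytes) -> tuple[int, int, int]:
    """Return (pid, pid_bits, octets_consumed) for the M/PID field."""
    if not data:
        raise ValueError("truncated Picture ID")
    if data[0] & 0x80:
        # M = 1: 15-bit PID carried in network byte order.
        if len(data) < 2:
            raise ValueError("truncated 15-bit Picture ID")
        return ((data[0] & 0x7F) << 8) | data[1], 15, 2
    # M = 0: 7-bit PID.
    return data[0] & 0x7F, 7, 1


def next_picture_id(pid: int, pid_bits: int) -> int:
    """Increment the PID by 1, wrapping after 0x7f or 0x7fff."""
    return (pid + 1) & ((1 << pid_bits) - 1)
```

Because the width of the field may change during a session, the sketch returns the width alongside the value instead of caching it.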
Layer indices: This information is optional but RECOMMENDED whenever | Layer indices: This information is optional but RECOMMENDED whenever | |||
encoding with layers. For both flexible and non-flexible modes, | encoding with layers. For both flexible and non-flexible modes, | |||
one octet is used to specify a layer frame's temporal layer ID | one octet is used to specify a layer frame's temporal-layer ID | |||
(TID) and spatial layer ID (SID) as shown both in Figure 2 and | (TID) and spatial-layer ID (SID) as shown both in Figure 2 and | |||
Figure 3. Additionally, a bit (U) is used to indicate that the | Figure 3. Additionally, a bit (U) is used to indicate that the | |||
current frame is a "switching up point" frame. Another bit (D) is | current frame is a "switching up point" frame. Another bit (D) is | |||
used to indicate whether inter-layer prediction is used for the | used to indicate whether inter-layer prediction is used for the | |||
current frame. | current frame. | |||
In the non-flexible mode (when the F bit is set to 0), another | In the non-flexible mode (when the F bit is set to 0), another | |||
octet is used to represent temporal layer 0 index (TL0PICIDX), as | octet is used to represent temporal-layer 0 index (TL0PICIDX), as | |||
depicted in Figure 3. The TL0PICIDX is present so that all | depicted in Figure 3. The TL0PICIDX is present so that all | |||
minimally required frames - the base temporal layer frames - can | minimally required frames (the base temporal-layer frames) can be | |||
be tracked. | tracked. | |||
The TID and SID fields indicate the temporal and spatial layers | The TID and SID fields indicate the temporal and spatial layers | |||
and can help middleboxes and endpoints quickly identify which | and can help middleboxes and endpoints quickly identify which | |||
layer a packet belongs to. | layer a packet belongs to. | |||
TID: The temporal layer ID of current frame. In the case of non- | TID: The temporal-layer ID of the current frame. In the case of | |||
flexible mode, if PID is mapped to a picture in a specified PG, | non-flexible mode, if a PID is mapped to a picture in a | |||
then the value of TID MUST match the corresponding TID value of | specified PG, then the value of the TID MUST match the | |||
the mapped picture in the PG. | corresponding TID value of the mapped picture in the PG. | |||
U: Switching up point. If this bit is set to 1 for the current | U: Switching up point. If this bit is set to 1 for the current | |||
picture with temporal layer ID equal to TID, then "switch up" | picture with a temporal-layer ID equal to TID, then "switch up" | |||
to a higher frame rate is possible as subsequent higher | to a higher frame rate is possible as subsequent higher | |||
temporal layer pictures will not depend on any picture before | temporal-layer pictures will not depend on any picture before | |||
the current picture (in coding order) with temporal layer ID | the current picture (in coding order) with temporal-layer ID | |||
greater than TID. | greater than TID. | |||
SID: The spatial layer ID of current frame. Note that frames | SID: The spatial-layer ID of the current frame. Note that frames | |||
with spatial layer SID > 0 may be dependent on decoded spatial | with spatial-layer SID > 0 may be dependent on decoded spatial- | |||
layer SID-1 frame within the same picture. Different frames of | layer SID-1 frame within the same picture. Different frames of | |||
the same picture MUST have distinct spatial layer IDs, and | the same picture MUST have distinct spatial-layer IDs, and | |||
frames' spatial layers MUST appear in increasing order within | frames' spatial layers MUST appear in increasing order within | |||
the frame. | the frame. | |||
D: Inter-layer dependency used. MUST be set to one if and only | D: Inter-layer dependency is used. D MUST be set to 1 if and | |||
if the current spatial layer SID frame depends on spatial layer | only if the current spatial-layer SID frame depends on spatial- | |||
SID-1 frame of the same picture, otherwise MUST be set to zero. | layer SID-1 frame of the same picture; otherwise, it MUST be | |||
For the base layer frame (with SID equal to 0), this D bit MUST | set to 0. For the base-layer frame (with SID equal to 0), the | |||
be set to zero. | D bit MUST be set to 0. | |||
TL0PICIDX: 8 bits temporal layer zero index. TL0PICIDX is only | TL0PICIDX: 8 bits temporal-layer zero index. TL0PICIDX is only | |||
present in the non-flexible mode (F = 0). This is a running | present in the non-flexible mode (F = 0). This is a running | |||
index for the temporal base layer pictures, i.e., the pictures | index for the temporal base-layer pictures, i.e., the pictures | |||
with TID set to 0. If TID is larger than 0, TL0PICIDX | with a TID set to 0. If the TID is larger than 0, TL0PICIDX | |||
indicates which temporal base layer picture the current picture | indicates which temporal base-layer picture the current picture | |||
depends on. TL0PICIDX MUST be incremented by 1 when TID is | depends on. TL0PICIDX MUST be incremented by 1 when the TID is | |||
equal to 0. The index SHOULD start on a random number, and | equal to 0. The index SHOULD start on a random number and MUST | |||
MUST restart at 0 after reaching the maximum number 255. | restart at 0 after reaching the maximum number 255. | |||
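The layer indices described above could be read as in the following sketch. It assumes the octet layout drawn in Figures 2 and 3: TID in the three most significant bits, then U, then three bits of SID, then D in the least significant bit. The function name parse_layer_indices and the dictionary keys are hypothetical.

```python
def parse_layer_indices(data: bytes, flexible: bool) -> dict:
    """Parse the layer indices octet, plus TL0PICIDX in non-flexible mode."""
    needed = 1 if flexible else 2
    if len(data) < needed:
        raise ValueError("truncated layer indices")
    octet = data[0]
    info = {
        "tid": octet >> 5,           # temporal-layer ID (3 bits)
        "u": bool(octet & 0x10),     # switching up point
        "sid": (octet >> 1) & 0x07,  # spatial-layer ID (3 bits)
        "d": bool(octet & 0x01),     # inter-layer dependency used
    }
    if not flexible:
        # TL0PICIDX is only present when F = 0.
        info["tl0picidx"] = data[1]
    return info


def next_tl0picidx(value: int) -> int:
    """Advance the running index, restarting at 0 after 255."""
    return (value + 1) & 0xFF
```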
Reference indices: When P and F are both set to one, indicating a | Reference indices: When P and F are both set to 1, indicating a non- | |||
non-key frame in flexible mode, then at least one reference index | keyframe in flexible mode, then at least one reference index MUST | |||
MUST be specified as below. Additional reference indices (total | be specified as below. Additional reference indices (a total of | |||
of up to 3 reference indices are allowed) may be specified using | up to three reference indices are allowed) may be specified using | |||
the N bit below. When either P or F is set to zero, then no | the N bit below. When either P or F is set to 0, then no | |||
reference index is specified. | reference index is specified. | |||
P_DIFF: The reference index (in 7 bits) specified as the relative | P_DIFF: The reference index (in 7 bits) specified as the relative | |||
PID from the current picture. For example, when P_DIFF=3 on a | PID from the current picture. For example, when P_DIFF=3 on a | |||
packet containing the picture with PID 112 means that the | packet containing the picture with PID 112 means that the | |||
picture refers back to the picture with PID 109. This | picture refers back to the picture with PID 109. This | |||
calculation is done modulo the size of the PID field, i.e., | calculation is done modulo the size of the PID field, i.e., | |||
either 7 or 15 bits. A P_DIFF value of 0 is invalid. | either 7 or 15 bits. A P_DIFF value of 0 is invalid. | |||
N: 1 if there is additional P_DIFF following the current P_DIFF. | N: 1 if there is additional P_DIFF following the current P_DIFF. | |||
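The P_DIFF arithmetic above can be illustrated with a short sketch. It assumes each reference index octet carries P_DIFF in the seven most significant bits and N in the least significant bit, as drawn in Figure 2. The function name parse_reference_indices is hypothetical.

```python
def parse_reference_indices(data: bytes, pid: int, pid_bits: int):
    """Return (referenced PIDs, octets consumed) for a flexible-mode packet."""
    refs = []
    offset = 0
    while True:
        if len(refs) == 3:
            raise ValueError("more than three reference indices")
        if offset >= len(data):
            raise ValueError("truncated reference indices")
        octet = data[offset]
        offset += 1
        p_diff = octet >> 1
        if p_diff == 0:
            raise ValueError("a P_DIFF value of 0 is invalid")
        # The subtraction is done modulo the size of the PID field.
        refs.append((pid - p_diff) % (1 << pid_bits))
        if not octet & 0x01:  # N = 0: no further P_DIFF follows
            break
    return refs, offset
```

With a 7-bit PID of 112 and a single P_DIFF of 3, the sketch yields the referenced PID 109, matching the example in the text.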
4.2.1. Scalability Structure (SS): | 4.2.1. Scalability Structure (SS) | |||
The scalability structure (SS) data describes the resolution of each | The SS data describes the resolution of each frame within a picture | |||
frame within a picture as well as the inter-picture dependencies for | as well as the inter-picture dependencies for a PG. If the VP9 | |||
a picture group (PG). If the VP9 payload descriptor's "V" bit is | payload descriptor's V bit is set, the SS data is present in the | |||
set, the SS data is present in the position indicated in Figure 2 and | position indicated in Figures 2 and 3. | |||
Figure 3. | ||||
+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+ | |||
V: | N_S |Y|G|-|-|-| | V: | N_S |Y|G|-|-|-| | |||
+-+-+-+-+-+-+-+-+ -\ | +-+-+-+-+-+-+-+-+ -\ | |||
Y: | WIDTH | (OPTIONAL) . | Y: | WIDTH | (OPTIONAL) . | |||
+ + . | + + . | |||
| | (OPTIONAL) . | | | (OPTIONAL) . | |||
+-+-+-+-+-+-+-+-+ . - N_S + 1 times | +-+-+-+-+-+-+-+-+ . - N_S + 1 times | |||
| HEIGHT | (OPTIONAL) . | | HEIGHT | (OPTIONAL) . | |||
+ + . | + + . | |||
| | (OPTIONAL) . | | | (OPTIONAL) . | |||
+-+-+-+-+-+-+-+-+ -/ | +-+-+-+-+-+-+-+-+ -/ | |||
G: | N_G | (OPTIONAL) | G: | N_G | (OPTIONAL) | |||
+-+-+-+-+-+-+-+-+ -\ | +-+-+-+-+-+-+-+-+ -\ | |||
N_G: | TID |U| R |-|-| (OPTIONAL) . | N_G: | TID |U| R |-|-| (OPTIONAL) . | |||
+-+-+-+-+-+-+-+-+ -\ . - N_G times | +-+-+-+-+-+-+-+-+ -\ . - N_G times | |||
| P_DIFF | (OPTIONAL) . - R times . | | P_DIFF | (OPTIONAL) . - R times . | |||
+-+-+-+-+-+-+-+-+ -/ -/ | +-+-+-+-+-+-+-+-+ -/ -/ | |||
Figure 4 | Figure 4: VP9 Scalability Structure | |||
N_S: N_S + 1 indicates the number of spatial layers present in the | N_S: N_S + 1 indicates the number of spatial layers present in the | |||
VP9 stream. | VP9 stream. | |||
Y: Each spatial layer's frame resolution present. When set to one, | Y: Each spatial layer's frame resolution is present. When set to 1, | |||
the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be | the OPTIONAL WIDTH (2 octets) and HEIGHT (2 octets) MUST be | |||
present for each layer frame. Otherwise, the resolution MUST NOT | present for each layer frame. Otherwise, the resolution MUST NOT | |||
be present. | be present. | |||
G: PG description present flag. | G: The PG description present flag. | |||
-: Bit reserved for future use. MUST be set to zero and MUST be | -: A bit reserved for future use. It MUST be set to 0 and MUST be | |||
ignored by the receiver. | ignored by the receiver. | |||
N_G: N_G indicates the number of pictures in a Picture Group (PG). | N_G: N_G indicates the number of pictures in a PG. If N_G is | |||
If N_G is greater than 0, then the SS data allows the inter- | greater than 0, then the SS data allows the inter-picture | |||
picture dependency structure of the VP9 stream to be pre-declared, | dependency structure of the VP9 stream to be pre-declared, rather | |||
rather than indicating it on the fly with every packet. If N_G is | than indicating it on the fly with every packet. If N_G is | |||
greater than 0, then for N_G pictures in the PG, each picture's | greater than 0, then for N_G pictures in the PG, each picture's | |||
temporal layer ID (TID), switch up point (U), and the Reference | temporal-layer ID (TID), switch up point (U), and Reference | |||
indices (P_DIFFs) are specified. | indices (P_DIFFs) are specified. | |||
The first picture specified in the PG MUST have TID set to 0. | The first picture specified in the PG MUST have a TID set to 0. | |||
G set to 0 or N_G set to 0 indicates that either there is only one | G set to 0 or N_G set to 0 indicates that either there is only one | |||
temporal layer (for non-flexible mode) or no fixed inter-picture | temporal layer (for non-flexible mode) or no fixed inter-picture | |||
dependency information is present (for flexible mode) going | dependency information is present (for flexible mode) going | |||
forward in the bitstream. | forward in the bitstream. | |||
Note that for a given picture, all frames follow the same inter- | Note that for a given picture, all frames follow the same inter- | |||
picture dependency structure. However, the frame rate of each | picture dependency structure. However, the frame rate of each | |||
spatial layer can be different from each other and this can be | spatial layer can be different from each other; this can be | |||
described with the use of the D bit described above. The | described with the use of the D bit described above. The | |||
specified dependency structure in the SS data MUST be for the | specified dependency structure in the SS data MUST be for the | |||
highest frame rate layer. | highest frame rate layer. | |||
In a scalable stream sent with a fixed pattern, the SS data SHOULD be | In a scalable stream sent with a fixed pattern, the SS data SHOULD be | |||
included in the first packet of every key frame. This is a packet | included in the first packet of every key frame. This is a packet | |||
with P bit equal to zero, SID or L bit equal to zero, and B bit equal | with the P bit equal to 0, SID or L bit equal to 0, and B bit equal | |||
to 1. The SS data MUST only be changed on the picture that | to 1. The SS data MUST only be changed on the picture that | |||
corresponds to the first picture specified in the previous SS data's | corresponds to the first picture specified in the previous SS data's | |||
PG (if the previous SS data's N_G was greater than 0). | PG (if the previous SS data's N_G was greater than 0). | |||
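To make the layout in Figure 4 concrete, the sketch below walks the SS data octet by octet. It is a non-normative illustration that assumes the field widths drawn in the figure: N_S in three bits, then Y, G, and three reserved bits; 2-octet WIDTH and HEIGHT; a 1-octet N_G; and, for each picture, three bits of TID, one bit of U, two bits of R, and two reserved bits, followed by R single-octet P_DIFF values. The function name parse_ss and the dictionary keys are hypothetical.

```python
import struct


def parse_ss(data: bytes):
    """Return (ss, octets_consumed) for SS data starting at data[0]."""
    try:
        head = data[0]
        n_s = head >> 5
        has_resolutions = bool(head & 0x10)  # Y bit
        has_pg = bool(head & 0x08)           # G bit
        offset = 1
        ss = {"num_spatial_layers": n_s + 1, "resolutions": [], "pictures": []}
        if has_resolutions:
            # WIDTH and HEIGHT repeat N_S + 1 times, in network byte order.
            for _ in range(n_s + 1):
                width, height = struct.unpack_from("!HH", data, offset)
                ss["resolutions"].append((width, height))
                offset += 4
        if has_pg:
            n_g = data[offset]
            offset += 1
            for _ in range(n_g):
                octet = data[offset]
                offset += 1
                r = (octet >> 2) & 0x03
                p_diffs = list(data[offset:offset + r])
                if len(p_diffs) != r:
                    raise ValueError("truncated SS data")
                offset += r
                ss["pictures"].append(
                    {"tid": octet >> 5, "u": bool(octet & 0x10), "p_diffs": p_diffs}
                )
        return ss, offset
    except (IndexError, struct.error):
        raise ValueError("truncated SS data") from None
```

The reserved bits are ignored on input, which matches the requirement that receivers ignore them.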
4.3. Frame Fragmentation | 4.3. Frame Fragmentation | |||
VP9 frames are fragmented into packets, in RTP sequence number order, | VP9 frames are fragmented into packets in RTP sequence number order: | |||
beginning with a packet with the B bit set, and ending with a packet | beginning with a packet with the B bit set and ending with a packet | |||
with the E bit set. There is no mechanism for finer-grained access | with the E bit set. There is no mechanism for finer-grained access | |||
to parts of a VP9 frame. | to parts of a VP9 frame. | |||
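A receiver-side view of this fragmentation rule might look like the sketch below. It is a simplified illustration that assumes packets arrive in RTP sequence number order with no loss and that the payload descriptor has already been stripped from each payload. The B and E masks (0x08 and 0x04) follow the first-octet bit order drawn in Figure 2, and the function name reassemble_frames is hypothetical.

```python
B_BIT = 0x08  # start of a frame (assumed position in the first octet)
E_BIT = 0x04  # end of a frame (assumed position in the first octet)


def reassemble_frames(packets):
    """Join (first_octet, frame_bytes) pairs into complete VP9 frames."""
    frames = []
    current = None
    for first_octet, chunk in packets:
        if first_octet & B_BIT:
            current = bytearray()  # a new frame begins with this packet
        if current is None:
            continue  # no frame start seen yet; skip this fragment
        current += chunk
        if first_octet & E_BIT:
            frames.append(bytes(current))
            current = None
    return frames
```

A production receiver would also need to handle reordering and loss, which this sketch deliberately leaves out.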
4.4. Scalable encoding considerations | 4.4. Scalable Encoding Considerations | |||
In addition to the use of reference frames, VP9 has several | In addition to the use of reference frames, VP9 has several | |||
additional forms of inter-frame dependencies, largely involving | additional forms of inter-frame dependencies, largely involving | |||
probability tables for the entropy and tree encoders. In VP9 syntax, | probability tables for the entropy and tree encoders. In VP9 syntax, | |||
the syntax element "error_resilient_mode" resets this additional | the syntax element "error_resilient_mode" resets this additional | |||
inter-frame data, allowing a frame's syntax to be decoded | inter-frame data, allowing a frame's syntax to be decoded | |||
independently. | independently. | |||
Due to the requirements of scalable streams, a VP9 encoder producing | Due to the requirements of scalable streams, a VP9 encoder producing | |||
a scalable stream needs to ensure that a frame does not depend on a | a scalable stream needs to ensure that a frame does not depend on a | |||
previous frame (of the same or a previous picture) that can | previous frame (of the same or a previous picture) that can | |||
legitimately be removed from the stream. Thus, a frame that follows | legitimately be removed from the stream. Thus, a frame that follows | |||
a frame that might be removed (in full decode order) MUST be encoded | a frame that might be removed (in full decode order) MUST be encoded | |||
with "error_resilient_mode" set to true. | with "error_resilient_mode" set to true. | |||
For spatially-scalable streams, this means that | For spatially scalable streams, this means that | |||
"error_resilient_mode" needs to be turned on for the base spatial | "error_resilient_mode" needs to be turned on for the base spatial | |||
layer; it can however be turned off for higher spatial layers, | layer; however, it can be turned off for higher spatial layers, | |||
assuming they are sent with inter-layer dependency (i.e. with the "D" | assuming they are sent with inter-layer dependency (i.e., with the D | |||
bit set). For streams that are only temporally-scalable without | bit set). For streams that are only temporally scalable without | |||
spatial scalability, "error_resilient_mode" can additionally be | spatial scalability, "error_resilient_mode" can additionally be | |||
turned off for any picture that immediately follows a temporal layer | turned off for any picture that immediately follows a temporal-layer | |||
0 frame. | 0 frame. | |||
4.5. Examples of VP9 RTP Stream | 4.5. Examples of VP9 RTP Stream | |||
4.5.1. Reference picture use for scalable structure | 4.5.1. Reference Picture Use for Scalable Structure | |||
As discussed in Section 3, the VP9 codec can maintain up to eight | As discussed in Section 3, the VP9 codec can maintain up to eight | |||
reference frames, of which up to three can be referenced or updated | reference frames, of which up to three can be referenced or updated | |||
by any new frame. This section illustrates one way that a scalable | by any new frame. This section illustrates one way that a scalable | |||
structure (with three spatial layers and three temporal layers) can | structure (with three spatial layers and three temporal layers) can | |||
be constructed using these reference frames. | be constructed using these reference frames. | |||
+==========+=========+============+=========+ | +==========+=========+============+=========+ | |||
| Temporal | Spatial | References | Updates | | | Temporal | Spatial | References | Updates | | |||
+==========+=========+============+=========+ | +==========+=========+============+=========+ | |||
skipping to change at page 14, line 40 ¶ | skipping to change at line 633 ¶ | |||
+----------+---------+------------+---------+ | +----------+---------+------------+---------+ | |||
| 1 | 2 | 2,4 | 5 | | | 1 | 2 | 2,4 | 5 | | |||
+----------+---------+------------+---------+ | +----------+---------+------------+---------+ | |||
| 2 | 0 | 3 | 6 | | | 2 | 0 | 3 | 6 | | |||
+----------+---------+------------+---------+ | +----------+---------+------------+---------+ | |||
| 2 | 1 | 4,6 | 7 | | | 2 | 1 | 4,6 | 7 | | |||
+----------+---------+------------+---------+ | +----------+---------+------------+---------+ | |||
| 2 | 2 | 5,7 | - | | | 2 | 2 | 5,7 | - | | |||
+----------+---------+------------+---------+ | +----------+---------+------------+---------+ | |||
Table 1: Example scalability structure | Table 1: Example Scalability Structure | |||
This structure is constructed such that the "U" bit can always be | This structure is constructed such that the U bit can always be set. | |||
set. | ||||
5. Feedback Messages and Header Extensions | 5. Feedback Messages and Header Extensions | |||
5.1. Reference Picture Selection Indication (RPSI) | 5.1. Reference Picture Selection Indication (RPSI) | |||
The reference picture selection index is a payload-specific feedback | The reference picture selection index is a payload-specific feedback | |||
message defined within the RTCP-based feedback format. The RPSI | message defined within the RTCP-based feedback format. The RPSI | |||
message is generated by a receiver and can be used in two ways. | message is generated by a receiver and can be used in two ways: | |||
Either it can signal a preferred reference picture when a loss has | either it can signal a preferred reference picture when a loss has | |||
been detected by the decoder -- preferably then a reference that the | been detected by the decoder (preferably a reference that the decoder | |||
decoder knows is perfect -- or, it can be used as positive feedback | knows is perfect) or it can be used as positive feedback information | |||
information to acknowledge correct decoding of certain reference | to acknowledge correct decoding of certain reference pictures. The | |||
pictures. The positive feedback method is useful for VP9 used for | positive feedback method is useful for VP9 used for point-to-point | |||
point to point (unicast) communication. The use of RPSI for VP9 is | (unicast) communication. The use of RPSI for VP9 is preferably | |||
preferably combined with a special update pattern of the codec's two | combined with a special update pattern of the codec's two special | |||
special reference frames -- the golden frame and the altref frame -- | reference frames -- the golden frame and the altref frame -- in which | |||
in which they are updated in an alternating leapfrog fashion. When a | they are updated in an alternating leapfrog fashion. When a receiver | |||
receiver has received and correctly decoded a golden or altref frame, | has received and correctly decoded a golden or altref frame, and that | |||
and that frame had a Picture ID in the payload descriptor, the | frame had a Picture ID in the payload descriptor, the receiver can | |||
receiver can acknowledge this simply by sending an RPSI message back | acknowledge this simply by sending an RPSI message back to the | |||
to the sender. The message body (i.e., the "native RPSI bit string" | sender. The message body (i.e., the "native RPSI bit string" in | |||
in [RFC4585]) is simply the (7 or 15 bit) Picture ID of the received | [RFC4585]) is simply the (7- or 15-bit) Picture ID of the received | |||
frame. | frame. | |||
Note: because all frames of the same picture must have the same | Note: because all frames of the same picture must have the same | |||
inter-picture reference structure, there is no need for a message to | inter-picture reference structure, there is no need for a message to | |||
specify which frame is being selected. | specify which frame is being selected. | |||
5.2. Full Intra Request (FIR) | 5.2. Full Intra Request (FIR) | |||
The Full Intra Request (FIR) [RFC5104] RTCP feedback message allows a | The Full Intra Request (FIR) [RFC5104] RTCP feedback message allows a | |||
receiver to request a full state refresh of an encoded stream. | receiver to request a full state refresh of an encoded stream. | |||
Upon receipt of an FIR request, a VP9 sender MUST send a picture with | Upon receipt of a FIR request, a VP9 sender MUST send a picture with | |||
a keyframe for its spatial layer 0 layer frame, and then send frames | a keyframe for its spatial-layer 0 layer frame and then send frames | |||
without inter-picture prediction (P=0) for any higher layer frames. | without inter-picture prediction (P=0) for any higher-layer frames. | |||
5.3. Layer Refresh Request (LRR) | 5.3. Layer Refresh Request (LRR) | |||
The Layer Refresh Request (LRR) [I-D.ietf-avtext-lrr] allows a | The Layer Refresh Request (LRR) [RFC9627] allows a receiver to | |||
receiver to request a single layer of a spatially or temporally | request a single layer of a spatially or temporally encoded stream to | |||
encoded stream to be refreshed, without necessarily affecting the | be refreshed without necessarily affecting the stream's other layers. | |||
stream's other layers. | ||||
+---------------+---------------+ | +---------------+---------------+ | |||
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| | |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| | |||
+---------------+---------+-----+ | +---------------+---------+-----+ | |||
| RES | TID | RES | SID | | | RES | TID | RES | SID | | |||
+---------------+---------+-----+ | +---------------+---------+-----+ | |||
Figure 5 | Figure 5: LRR Index Format | |||
Figure 5 shows the format of LRR's layer index fields for VP9 | Figure 5 shows the format of an LRR's layer index fields for VP9 | |||
streams. The two "RES" fields MUST be set to 0 on transmission and | streams. The two "RES" fields MUST be set to 0 on transmission and | |||
ingnored on reception. See Section 4.2 for details on the TID and | ignored on reception. See Section 4.2 for details on the TID and SID | |||
SID fields. | fields. | |||
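The layer index layout in Figure 5 can be sketched as below. The example assumes each octet holds five reserved bits followed by a 3-bit value, with TID in the first octet and SID in the second, as the figure draws it. The function names pack_lrr_layer_index and unpack_lrr_layer_index are hypothetical.

```python
def pack_lrr_layer_index(tid: int, sid: int) -> bytes:
    """Build the 16-bit layer index; both RES fields are sent as 0."""
    if not 0 <= tid <= 7 or not 0 <= sid <= 7:
        raise ValueError("TID and SID must each fit in 3 bits")
    return bytes([tid, sid])


def unpack_lrr_layer_index(data: bytes):
    """Return (tid, sid), ignoring the RES bits on reception."""
    if len(data) < 2:
        raise ValueError("truncated layer index")
    return data[0] & 0x07, data[1] & 0x07
```

Masking on reception means a receiver does not reject a layer index whose reserved bits happen to be nonzero.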
Identification of a layer refresh frame can be derived from the | Identification of a layer refresh frame can be derived from the | |||
reference IDs of each frame by backtracking the dependency chain | reference IDs of each frame by backtracking the dependency chain | |||
until reaching a point where only decodable frames are being | until reaching a point where only decodable frames are being | |||
referenced. Therefore it's recommended for both the flexible and the | referenced. Therefore, it's recommended for both the flexible and | |||
non-flexible mode that, when switching up points are being encoded in | the non-flexible mode that, when switching up points are being | |||
response to a LRR, those packets should contain layer indices and the | encoded in response to an LRR, those packets contain layer indices | |||
reference field(s) so that the decoder or a selective forwarding | and the reference field or fields so that the decoder or selective | |||
middleboxes [RFC7667] can make this derivation. | forwarding middleboxes [RFC7667] can make this derivation. | |||
Example: | Example: | |||
LRR {1,0}, {2,1} is sent by an MCU when it is currently relaying | LRR {1,0}, {2,1} is sent by a Multipoint Control Unit (MCU) when it | |||
{1,0} to a receiver and which wants to upgrade to {2,1}. In response | is currently relaying {1,0} to a receiver and which wants to upgrade | |||
the encoder should encode the next frames in layers {1,1} and {2,1} | to {2,1}. In response, the encoder should encode the next frames in | |||
by only referring to frames in {1,0}, or {0,0}. | layers {1,1} and {2,1} by only referring to frames in {1,0}, or | |||
{0,0}. | ||||
In the non-flexible mode, periodic upgrade frames can be defined by | In the non-flexible mode, periodic upgrade frames can be defined by | |||
the layer structure of the SS, thus periodic upgrade frames can be | the layer structure of the SS; thus, periodic upgrade frames can be | |||
automatically identified by the picture ID. | automatically identified by the Picture ID. | |||
6. Payload Format Parameters | 6. Payload Format Parameters | |||
This payload format has three optional parameters, "max-fr", "max- | This payload format has three optional parameters: max-fr, max-fs, | |||
fs", and "profile-id". | and profile-id. | |||
The max-fr and max-fs parameters are used to signal the capabilities | The max-fr and max-fs parameters are used to signal the capabilities | |||
of a receiver implementation. If the implementation is willing to | of a receiver implementation. If the implementation is willing to | |||
receive media, both parameters MUST be provided. These parameters | receive media, both parameters MUST be provided. These parameters | |||
MUST NOT be used for any other purpose. A media sender SHOULD NOT | MUST NOT be used for any other purpose. A media sender SHOULD NOT | |||
send media with a frame rate or frame size exceeding the max-fr and | send media with a frame rate or frame size exceeding the max-fr and | |||
max-fs values signaled. (There may be scenarios, such as pre-encoded | max-fs values signaled. (There may be scenarios, such as pre-encoded | |||
media or selective forwarding middleboxes [RFC7667], where a media | media or selective forwarding middleboxes [RFC7667], where a media | |||
sender does not have media available that fits within a receivers | sender does not have media available that fits within a receiver's | |||
max-fs and max-fr value; in such scenarios, a sender MAY exceed the | max-fs and max-fr value; in such scenarios, a sender MAY exceed the | |||
signaled values.) | signaled values.) | |||
max-fr: The value of max-fr is an integer indicating the maximum | max-fr: The value of max-fr is an integer indicating the maximum | |||
frame rate in units of frames per second that the decoder is | frame rate in units of frames per second that the decoder is | |||
capable of decoding. | capable of decoding. | |||
max-fs: The value of max-fs is an integer indicating the maximum | max-fs: The value of max-fs is an integer indicating the maximum | |||
frame size in units of macroblocks that the decoder is capable of | frame size in units of macroblocks that the decoder is capable of | |||
decoding. | decoding. | |||
The decoder is capable of decoding this frame size as long as the | The decoder is capable of decoding this frame size as long as the | |||
width and height of the frame in macroblocks are less than | width and height of the frame in macroblocks are less than | |||
int(sqrt(max-fs * 8)) - for instance, a max-fs of 1200 (capable of | int(sqrt(max-fs * 8)); for instance, a max-fs of 1200 (capable of | |||
supporting 640x480 resolution) will support widths and heights up | supporting 640x480 resolution) will support widths and heights up | |||
to 1552 pixels (97 macroblocks). | to 1552 pixels (97 macroblocks). | |||
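The max-fs arithmetic above can be checked with a short sketch. This is illustrative only and is not part of the specification. The function names are hypothetical, the 16-pixel macroblock side is inferred from the "1552 pixels (97 macroblocks)" example, and the per-dimension bound is treated as inclusive to match that example.

```python
import math

# Assumption: 16x16-pixel macroblocks, inferred from 1552 px = 97 macroblocks.
MACROBLOCK_SIDE_PX = 16


def max_dimension_macroblocks(max_fs: int) -> int:
    """Per-dimension limit in macroblocks: int(sqrt(max-fs * 8))."""
    return math.isqrt(max_fs * 8)


def frame_fits(width_px: int, height_px: int, max_fs: int) -> bool:
    """Hypothetical check of a frame against a signaled max-fs value."""
    width_mb = math.ceil(width_px / MACROBLOCK_SIDE_PX)
    height_mb = math.ceil(height_px / MACROBLOCK_SIDE_PX)
    limit = max_dimension_macroblocks(max_fs)
    # Assumption: the bound is inclusive, as in the 97-macroblock example.
    return (
        width_mb * height_mb <= max_fs
        and width_mb <= limit
        and height_mb <= limit
    )
```

With max-fs set to 1200, max_dimension_macroblocks returns 97, so a 640x480 frame (40 x 30 = 1200 macroblocks) fits, while a frame 1568 pixels wide (98 macroblocks) does not.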
profile-id: The value of profile-id is an integer indicating the | profile-id: The value of profile-id is an integer indicating the | |||
default coding profile, the subset of coding tools that may have | default coding profile (the subset of coding tools that may have | |||
been used to generate the stream or that the receiver supports). | been used to generate the stream or that the receiver supports). | |||
Table 2 lists all of the profiles defined in section 7.2 of | Table 2 lists all of the profiles defined in Section 7.2 of | |||
[VP9-BITSTREAM] and the corresponding integer values to be used. | [VP9-BITSTREAM] and the corresponding integer values to be used. | |||
If no profile-id is present, Profile 0 MUST be inferred. (The | If no profile-id is present, Profile 0 MUST be inferred. (The | |||
profile-id parameter was added relatively late in the development | profile-id parameter was added relatively late in the development | |||
of this specification, so some existing implementations may not | of this specification, so some existing implementations may not | |||
send it.) | send it.) | |||
Informative note: See Table 3 for capabilities of coding profiles | Informative note: See Table 3 for capabilities of coding profiles | |||
defined in section 7.2 of [VP9-BITSTREAM]. | defined in Section 7.2 of [VP9-BITSTREAM]. | |||
A receiver MUST ignore any parameter unspecified in this | A receiver MUST ignore any parameter unspecified in this | |||
specification. | specification. | |||
+=========+============+ | +=========+============+ | |||
| Profile | profile-id | | | Profile | profile-id | | |||
+=========+============+ | +=========+============+ | |||
| 0 | 0 | | | 0 | 0 | | |||
+---------+------------+ | +---------+------------+ | |||
| 1 | 1 | | | 1 | 1 | | |||
+---------+------------+ | +---------+------------+ | |||
| 2 | 2 | | | 2 | 2 | | |||
+---------+------------+ | +---------+------------+ | |||
| 3 | 3 | | | 3 | 3 | | |||
+---------+------------+ | +---------+------------+ | |||
Table 2: Table of | Table 2: Comparison of | |||
profile-id integer | profile-id to VP9 | |||
values representing | Profile Integer | |||
the VP9 profile | ||||
corresponding to the | ||||
set of coding tools | ||||
supported. | ||||
+=========+===========+=================+==========================+ | +=========+===========+=================+==========================+ | |||
| Profile | Bit Depth | SRGB Colorspace | Chroma Subsampling | | | Profile | Bit Depth | SRGB Colorspace | Chroma Subsampling | | |||
+=========+===========+=================+==========================+ | +=========+===========+=================+==========================+ | |||
| 0 | 8 | No | YUV 4:2:0 | | | 0 | 8 | No | YUV 4:2:0 | | |||
+---------+-----------+-----------------+--------------------------+ | +---------+-----------+-----------------+--------------------------+ | |||
| 1 | 8 | Yes | YUV 4:2:2,4:4:0 or 4:4:4 | | | 1 | 8 | Yes | YUV 4:2:2,4:4:0 or 4:4:4 | | |||
+---------+-----------+-----------------+--------------------------+ | +---------+-----------+-----------------+--------------------------+ | |||
| 2 | 10 or 12 | No | YUV 4:2:0 | | | 2 | 10 or 12 | No | YUV 4:2:0 | | |||
+---------+-----------+-----------------+--------------------------+ | +---------+-----------+-----------------+--------------------------+ | |||
| 3 | 10 or 12 | Yes | YUV 4:2:2,4:4:0 or 4:4:4 | | | 3 | 10 or 12 | Yes | YUV 4:2:2,4:4:0 or 4:4:4 | | |||
+---------+-----------+-----------------+--------------------------+ | +---------+-----------+-----------------+--------------------------+ | |||
Table 3: Table of profile capabilities. | Table 3: Profile Capabilities | |||
6.1. SDP Parameters | 6.1. SDP Parameters | |||
6.1.1. Mapping of Media Subtype Parameters to SDP | 6.1.1. Mapping of Media Subtype Parameters to SDP | |||
The media type video/VP9 string is mapped to fields in the Session | The media type video/vp9 string is mapped to fields in the Session | |||
Description Protocol (SDP) [RFC8866] as follows: | Description Protocol (SDP) [RFC8866] as follows: | |||
* The media name in the "m=" line of SDP MUST be video. | * The media name in the "m=" line of SDP MUST be video. | |||
* The encoding name in the "a=rtpmap" line of SDP MUST be VP9 (the | * The encoding name in the "a=rtpmap" line of SDP MUST be VP9 (the | |||
media subtype). | media subtype). | |||
* The clock rate in the "a=rtpmap" line MUST be 90000. | * The clock rate in the "a=rtpmap" line MUST be 90000. | |||
* The parameters "max-fr" and "max-fs" MUST be included in the | * The parameters max-fr and max-fs MUST be included in the "a=fmtp" | |||
"a=fmtp" line of SDP if the receiver wishes to declare its | line of SDP if the receiver wishes to declare its receiver | |||
receiver capabilities. These parameters are expressed as a media | capabilities. These parameters are expressed as a media subtype | |||
subtype string, in the form of a semicolon separated list of | string in the form of a semicolon-separated list of | |||
parameter=value pairs. | parameter=value pairs. | |||
* The OPTIONAL parameter profile-id, when present, SHOULD be | * The OPTIONAL parameter profile-id, when present, SHOULD be | |||
included in the "a=fmtp" line of SDP. This parameter is expressed | included in the "a=fmtp" line of SDP. This parameter is expressed | |||
as a media subtype string, in the form of a parameter=value pair. | as a media subtype string in the form of a parameter=value pair. | |||
When the parameter is not present, a value of 0 MUST be inferred | When the parameter is not present, a value of 0 MUST be inferred | |||
for profile-id. | for profile-id. | |||
6.1.1.1. Example | 6.1.1.1. Example | |||
An example of media representation in SDP is as follows: | An example of media representation in SDP is as follows: | |||
m=video 49170 RTP/AVPF 98 | m=video 49170 RTP/AVPF 98 | |||
a=rtpmap:98 VP9/90000 | a=rtpmap:98 VP9/90000 | |||
a=fmtp:98 max-fr=30;max-fs=3600;profile-id=0 | a=fmtp:98 max-fr=30;max-fs=3600;profile-id=0 | |||
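As a rough illustration of how a receiver might read the "a=fmtp" line shown above, the following sketch parses the semicolon-separated parameter=value pairs. The function name parse_vp9_fmtp is hypothetical. The sketch assumes well-formed integer values, drops parameters that this specification does not define, and infers profile-id 0 when the parameter is absent.

```python
# Parameters defined by this payload format; anything else is ignored.
KNOWN_PARAMETERS = ("max-fr", "max-fs", "profile-id")


def parse_vp9_fmtp(line: str) -> dict:
    """Hypothetical parser for a VP9 "a=fmtp" SDP attribute line."""
    prefix, _, params = line.strip().partition(" ")
    if not prefix.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    result = {}
    for item in params.split(";"):
        key, sep, value = item.strip().partition("=")
        # Unspecified parameters are skipped rather than rejected.
        if sep and key in KNOWN_PARAMETERS:
            result[key] = int(value)
    # Profile 0 is inferred when profile-id is not present.
    result.setdefault("profile-id", 0)
    return result
```

Applied to the example line, this yields max-fr 30, max-fs 3600, and profile-id 0.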
skipping to change at page 19, line 20 ¶ | skipping to change at line 839 ¶ | |||
* The parameter identifying a media format configuration for VP9 is | * The parameter identifying a media format configuration for VP9 is | |||
profile-id. This media format configuration parameter MUST be | profile-id. This media format configuration parameter MUST be | |||
used symmetrically; that is, the answerer MUST either maintain | used symmetrically; that is, the answerer MUST either maintain | |||
this configuration parameter or remove the media format (payload | this configuration parameter or remove the media format (payload | |||
type) completely if it is not supported. | type) completely if it is not supported. | |||
* The max-fr and max-fs parameters are used declaratively to | * The max-fr and max-fs parameters are used declaratively to | |||
describe receiver capabilities, even in the Offer/Answer model. | describe receiver capabilities, even in the Offer/Answer model. | |||
The values in an answer are used to describe the answerer's | The values in an answer are used to describe the answerer's | |||
capabilities, and thus their values are set independently of the | capabilities; thus, their values are set independently of the | |||
values in the offer. | values in the offer. | |||
* To simplify the handling and matching of these configurations, the | * To simplify the handling and matching of these configurations, the | |||
same RTP payload type number used in the offer SHOULD also be used | same RTP payload type number used in the offer SHOULD also be used | |||
in the answer and in a subsequent offer, as specified in | in the answer and in a subsequent offer, as specified in | |||
[RFC3264]. An answer or subsequent offer MUST NOT contain the | [RFC3264]. An answer or subsequent offer MUST NOT contain the | |||
payload type number used in the offer unless the profile-id value | payload type number used in the offer unless the profile-id value | |||
is exactly the same as in the original offer. However, max-fr and | is exactly the same as in the original offer. However, max-fr and | |||
max-fs parameters MAY be changed in subsequent offers and answers, | max-fs parameters MAY be changed in subsequent offers and answers, | |||
with the same payload type number, if an endpoint wishes to change | with the same payload type number, if an endpoint wishes to change | |||
its declared receiver capabilities. | its declared receiver capabilities. | |||
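The payload type rule in the last bullet can be sketched as a simple check. This is an illustration under stated assumptions, not a complete Offer/Answer implementation: the function name is hypothetical, and each description is reduced to a mapping from payload type number to profile-id. The max-fr and max-fs values are deliberately left out because they may change independently.

```python
def reused_payload_types_valid(offer_profiles: dict, answer_profiles: dict) -> bool:
    """Hypothetical check: a payload type number from the offer may only
    reappear in an answer (or subsequent offer) with the same profile-id."""
    return all(
        offer_profiles[payload_type] == profile_id
        for payload_type, profile_id in answer_profiles.items()
        if payload_type in offer_profiles
    )
```

For example, an offer of {98: 0} answered with {98: 0} passes, an answer of {98: 2} fails, and an answer that uses a different payload type number such as {99: 2} passes.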
7. Media Type Definition | 7. Media Type Definition | |||
This registration is done using the template defined in [RFC6838] and | This registration uses the template defined in [RFC6838] and | |||
following [RFC4855]. | following [RFC4855]. | |||
Type name: | Type name: video | |||
video | ||||
Subtype name: | ||||
VP9 | ||||
Required parameters: | ||||
N/A. | ||||
Optional parameters: | Subtype name: VP9 | |||
There are three optional parameters, "max-fr", "max-fs", and | ||||
"profile-id". See Section 6 for their definition. | ||||
Encoding considerations: | Required parameters: N/A | |||
This media type is framed in RTP and contains binary data; see | ||||
Section 4.8 of [RFC6838]. | ||||
Security considerations: | Optional parameters: There are three optional parameters: max-fr, | |||
See Section 8 of RFC xxxx. | max-fs, and profile-id. See Section 6 for their definition. | |||
[RFC Editor: Upon publication as an RFC, please replace "XXXX" | Encoding considerations: This media type is framed in RTP and | |||
with the number assigned to this document and remove this note.] | contains binary data; see Section 4.8 of [RFC6838]. | |||
Interoperability considerations: | Security considerations: See Section 8 of RFC 9628. | |||
None. | ||||
Published specification: | Interoperability considerations: None | |||
VP9 bitstream format [VP9-BITSTREAM] and RFC XXXX. | ||||
[RFC Editor: Upon publication as an RFC, please replace "XXXX" | Published specification: VP9 bitstream format [VP9-BITSTREAM] and | |||
with the number assigned to this document and remove this note.] | RFC 9628. | |||
Applications which use this media type: | Applications that use this media type: For example, video over IP, | |||
For example: Video over IP, video conferencing. | video conferencing. | |||
Fragment identifier considerations: | Fragment identifier considerations: N/A | |||
N/A. | ||||
Additional information: | Additional information: None | |||
None. | ||||
Person & email address to contact for further information: | Person & email address to contact for further information: Jonathan | |||
Jonathan Lennox <jonathan.lennox@8x8.com> | Lennox <jonathan.lennox@8x8.com> | |||
Intended usage: | Intended usage: COMMON | |||
COMMON | ||||
Restrictions on usage: | Restrictions on usage: This media type depends on RTP framing; | |||
This media type depends on RTP framing, and hence is only defined | hence, it is only defined for transfer via RTP [RFC3550]. | |||
for transfer via RTP [RFC3550]. | ||||
Author: | Author: Jonathan Lennox <jonathan.lennox@8x8.com> | |||
Jonathan Lennox <jonathan.lennox@8x8.com> | ||||
Change controller: | Change controller: IETF AVTCore Working Group delegated from the | |||
IETF AVTCore Working Group delegated from the IESG. | IESG. | |||
8. Security Considerations | 8. Security Considerations | |||
RTP packets using the payload format defined in this specification | RTP packets using the payload format defined in this specification | |||
are subject to the security considerations discussed in the RTP | are subject to the security considerations discussed in the RTP | |||
specification [RFC3550], and in any applicable RTP profile such as | specification [RFC3550], and in any applicable RTP profile such as | |||
RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/ | RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/ | |||
SAVPF [RFC5124]. However, as "Securing the RTP Protocol Framework: | SAVPF [RFC5124]. However, as "Securing the RTP Framework: Why RTP | |||
Why RTP Does Not Mandate a Single Media Security Solution" [RFC7202] | Does Not Mandate a Single Media Security Solution" [RFC7202] | |||
discusses, it is not an RTP payload format's responsibility to | discusses, it is not an RTP payload format's responsibility to | |||
discuss or mandate what solutions are used to meet the basic security | discuss or mandate what solutions are used to meet the basic security | |||
goals like confidentiality, integrity and source authenticity for RTP | goals like confidentiality, integrity, and source authenticity for | |||
in general. This responsibility lays on anyone using RTP in an | RTP in general. This responsibility lies with anyone using RTP in an | |||
application. They can find guidance on available security mechanisms | application. They can find guidance on available security mechanisms | |||
in Options for Securing RTP Sessions [RFC7201]. Applications SHOULD | in "Options for Securing RTP Sessions" [RFC7201]. Applications SHOULD | |||
use one or more appropriate strong security mechanisms. The rest of | use one or more appropriate strong security mechanisms. | |||
this security consideration section discusses the security impacting | ||||
properties of the payload format itself. | ||||
Implementations of this RTP payload format need to take appropriate | Implementations of this RTP payload format need to take appropriate | |||
security considerations into account. It is extremely important for | security considerations into account. It is extremely important for | |||
the decoder to be robust against malicious or malformed payloads and | the decoder to be robust against malicious or malformed payloads and | |||
ensure that they do not cause the decoder to overrun its allocated | ensure that they do not cause the decoder to overrun its allocated | |||
memory or otherwise mis-behave. An overrun in allocated memory could | memory or otherwise misbehave. An overrun in allocated memory could | |||
lead to arbitrary code execution by an attacker. The same applies to | lead to arbitrary code execution by an attacker. The same applies to | |||
the encoder, even though problems in encoders are typically rarer. | the encoder, even though problems in encoders are (typically) rarer. | |||
This RTP payload format and its media decoder do not exhibit any | This RTP payload format and its media decoder do not exhibit any | |||
significant non-uniformity in the receiver-side computational | significant non-uniformity in the receiver-side computational | |||
complexity for packet processing, and thus are unlikely to pose a | complexity for packet processing; thus, they are unlikely to pose a | |||
denial-of-service threat due to the receipt of pathological data. | denial-of-service threat due to the receipt of pathological data. | |||
Nor does the RTP payload format contain any active content. | Nor does the RTP payload format contain any active content. | |||
9. Congestion Control | 9. Congestion Control | |||
Congestion control for RTP SHALL be used in accordance with RFC 3550 | Congestion control for RTP SHALL be used in accordance with | |||
[RFC3550], and with any applicable RTP profile; e.g., RFC 3551 | [RFC3550], and with any applicable RTP profile, e.g., [RFC3551]. The | |||
[RFC3551]. The congestion control mechanism can, in a real-time | congestion control mechanism can, in a real-time encoding scenario, | |||
encoding scenario, adapt the transmission rate by instructing the | adapt the transmission rate by instructing the encoder to encode at a | |||
encoder to encode at a certain target rate. Media aware network | certain target rate. Media-aware network elements MAY use the | |||
elements MAY use the information in the VP9 payload descriptor in | information in the VP9 payload descriptor in Section 4.2 to identify | |||
Section 4.2 to identify non-reference frames and discard them in | non-reference frames and discard them in order to reduce network | |||
order to reduce network congestion. Note that discarding of non- | congestion. Note that discarding of non-reference frames cannot be | |||
reference frames cannot be done if the stream is encrypted (because | done if the stream is encrypted (because the non-reference marker is | |||
the non-reference marker is encrypted). | encrypted). | |||
10. IANA Considerations | 10. IANA Considerations | |||
The IANA is requested to register the media type registration "video/ | IANA has registered the media type registration "video/vp9" as | |||
vp9" as specified in Section 7. The media type is also requested to | specified in Section 7. The media type has also been added to the | |||
be added to the IANA registry for "RTP Payload Format MIME types" | "RTP Payload Format Media Types" <https://www.iana.org/assignments/ | |||
<http://www.iana.org/assignments/rtp-parameters>. | rtp-parameters> subregistry of the "Real-Time Transport Protocol | |||
(RTP) Parameters" registry. | ||||
11. Acknowledgments | ||||
Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo, | ||||
Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert | ||||
commented on the development of this document and provided helpful | ||||
comments and feedback. | ||||
12. References | ||||
12.1. Normative References | 11. References | |||
[I-D.ietf-avtext-lrr] | 11.1. Normative References | |||
Lennox, J., Hong, D., Uberti, J., Holmer, S., and M. | ||||
Flodman, "The Layer Refresh Request (LRR) RTCP Feedback | ||||
Message", Work in Progress, Internet-Draft, draft-ietf- | ||||
avtext-lrr-07, 2 July 2017, | ||||
<https://www.ietf.org/archive/id/draft-ietf-avtext-lrr- | ||||
07.txt>. | ||||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, | Requirement Levels", BCP 14, RFC 2119, | |||
DOI 10.17487/RFC2119, March 1997, | DOI 10.17487/RFC2119, March 1997, | |||
<https://www.rfc-editor.org/info/rfc2119>. | <https://www.rfc-editor.org/info/rfc2119>. | |||
[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model | [RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model | |||
with Session Description Protocol (SDP)", RFC 3264, | with Session Description Protocol (SDP)", RFC 3264, | |||
DOI 10.17487/RFC3264, June 2002, | DOI 10.17487/RFC3264, June 2002, | |||
<https://www.rfc-editor.org/info/rfc3264>. | <https://www.rfc-editor.org/info/rfc3264>. | |||
skipping to change at page 23, line 28 ¶ | skipping to change at line 995 ¶ | |||
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC | |||
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, | |||
May 2017, <https://www.rfc-editor.org/info/rfc8174>. | May 2017, <https://www.rfc-editor.org/info/rfc8174>. | |||
[RFC8866] Begen, A., Kyzivat, P., Perkins, C., and M. Handley, "SDP: | [RFC8866] Begen, A., Kyzivat, P., Perkins, C., and M. Handley, "SDP: | |||
Session Description Protocol", RFC 8866, | Session Description Protocol", RFC 8866, | |||
DOI 10.17487/RFC8866, January 2021, | DOI 10.17487/RFC8866, January 2021, | |||
<https://www.rfc-editor.org/info/rfc8866>. | <https://www.rfc-editor.org/info/rfc8866>. | |||
[RFC9627] Lennox, J., Hong, D., Uberti, J., Holmer, S., and M. | ||||
Flodman, "The Layer Refresh Request (LRR) RTCP Feedback | ||||
Message", RFC 9627, DOI 10.17487/RFC9627, August 2024, | ||||
<https://www.rfc-editor.org/info/rfc9627>. | ||||
[VP9-BITSTREAM] | [VP9-BITSTREAM] | |||
Grange, A., de Rivaz, P., and J. Hunt, "VP9 Bitstream & | Grange, A., de Rivaz, P., and J. Hunt, "VP9 Bitstream & | |||
Decoding Process Specification", Version 0.6, 31 March | Decoding Process Specification", Version 0.6, 31 March | |||
2016, | 2016, | |||
<https://storage.googleapis.com/downloads.webmproject.org/ | <https://storage.googleapis.com/downloads.webmproject.org/ | |||
docs/vp9/vp9-bitstream-specification- | docs/vp9/vp9-bitstream-specification- | |||
v0.6-20160331-draft.pdf>. | v0.6-20160331-draft.pdf>. | |||
12.2. Informative References | 11.2. Informative References | |||
[RFC3551] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and | [RFC3551] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and | |||
Video Conferences with Minimal Control", STD 65, RFC 3551, | Video Conferences with Minimal Control", STD 65, RFC 3551, | |||
DOI 10.17487/RFC3551, July 2003, | DOI 10.17487/RFC3551, July 2003, | |||
<https://www.rfc-editor.org/info/rfc3551>. | <https://www.rfc-editor.org/info/rfc3551>. | |||
[RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. | [RFC3711] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. | |||
Norrman, "The Secure Real-time Transport Protocol (SRTP)", | Norrman, "The Secure Real-time Transport Protocol (SRTP)", | |||
RFC 3711, DOI 10.17487/RFC3711, March 2004, | RFC 3711, DOI 10.17487/RFC3711, March 2004, | |||
<https://www.rfc-editor.org/info/rfc3711>. | <https://www.rfc-editor.org/info/rfc3711>. | |||
skipping to change at page 24, line 23 ¶ | skipping to change at line 1043 ¶ | |||
[RFC7202] Perkins, C. and M. Westerlund, "Securing the RTP | [RFC7202] Perkins, C. and M. Westerlund, "Securing the RTP | |||
Framework: Why RTP Does Not Mandate a Single Media | Framework: Why RTP Does Not Mandate a Single Media | |||
Security Solution", RFC 7202, DOI 10.17487/RFC7202, April | Security Solution", RFC 7202, DOI 10.17487/RFC7202, April | |||
2014, <https://www.rfc-editor.org/info/rfc7202>. | 2014, <https://www.rfc-editor.org/info/rfc7202>. | |||
[RFC7667] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 7667, | [RFC7667] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 7667, | |||
DOI 10.17487/RFC7667, November 2015, | DOI 10.17487/RFC7667, November 2015, | |||
<https://www.rfc-editor.org/info/rfc7667>. | <https://www.rfc-editor.org/info/rfc7667>. | |||
Acknowledgments | ||||
Alex Eleftheriadis, Yuki Ito, Won Kap Jang, Sergio Garcia Murillo, | ||||
Roi Sasson, Timothy Terriberry, Emircan Uysaler, and Thomas Volkert | ||||
commented on the development of this document and provided helpful | ||||
feedback. | ||||
Authors' Addresses | Authors' Addresses | |||
Justin Uberti | Justin Uberti | |||
Google, Inc. | Google, Inc. | |||
747 6th Street South | 747 6th Street South | |||
Kirkland, WA 98033 | Kirkland, WA 98033 | |||
United States of America | United States of America | |||
Email: justin@uberti.name | Email: justin@uberti.name | |||
Stefan Holmer | Stefan Holmer | |||
Google, Inc. | Google, Inc. | |||
Kungsbron 2 | Kungsbron 2 | |||
SE-111 22 Stockholm | SE-111 22 Stockholm | |||
Sweden | Sweden | |||
Email: holmer@google.com | Email: holmer@google.com | |||
Magnus Flodman | Magnus Flodman | |||
Google, Inc. | Google, Inc. | |||
Kungsbron 2 | Kungsbron 2 | |||
SE-111 22 Stockholm | SE-111 22 Stockholm | |||
Sweden | Sweden | |||
Email: mflodman@google.com | Email: mflodman@google.com | |||
Danny Hong | Danny Hong | |||
Google, Inc. | Google, Inc. | |||
1585 Charleston Road | 1585 Charleston Road | |||
Mountain View, CA 94043 | Mountain View, CA 94043 | |||
United States of America | United States of America | |||
Email: dannyhong@google.com | Email: dannyhong@google.com | |||
Jonathan Lennox | Jonathan Lennox | |||
8x8, Inc. / Jitsi | 8x8, Inc. / Jitsi | |||
Jersey City, NJ 07302 | Jersey City, NJ 07302 | |||
United States of America | United States of America | |||
Email: jonathan.lennox@8x8.com | Email: jonathan.lennox@8x8.com | |||
End of changes. 144 change blocks. | ||||
386 lines changed or deleted | 353 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. |