libjxl

FORK: libjxl patches used on blog
git clone https://git.neptards.moe/blog/libjxl.git

format_overview.md (13625B)


      1 # JPEG XL Format Overview
      2 
      3 This document gives an overview of the JPEG XL file format and codestream,
      4 its features, and the underlying design rationale.
      5 The aim of this document is to provide general insight into the
      6 format capabilities and design, thus helping developers
      7 better understand how to use the `libjxl` API.
      8 
      9 ## Codestream and File Format
     10 
     11 The JPEG XL format is defined in ISO/IEC 18181. This standard consists of
     12 four parts:
     13 
     14 *   18181-1: Core codestream
     15 *   18181-2: File format
     16 *   18181-3: Conformance testing
     17 *   18181-4: Reference implementation
     18 
     19 ### Core codestream
     20 
     21 The core codestream contains all the data necessary to decode and display
     22 still image or animation data. This includes basic metadata like image dimensions,
     23 the pixel data itself, colorspace information, orientation, upsampling, etc.
     24 
     25 ### File format
     26 
     27 The JPEG XL file format can take two forms:
     28 
     29 *   A 'naked' codestream. In this case, only the image/animation data itself is
     30 stored, and no additional metadata can be included. Such a file starts with the
     31 bytes `0xFF0A` (the JPEG marker for "start of JPEG XL codestream").
     32 *   An ISOBMFF-based container. This is a box-based container that includes a
     33 JPEG XL codestream box (`jxlc`), and can optionally include other boxes with
     34 additional information, such as Exif metadata. In this case, the file starts with
     35 the bytes `0x0000000C 4A584C20 0D0A870A`.
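
The two forms can be told apart from the leading bytes alone. The following is a minimal sketch, using only the signatures quoted above; the function name `classify_jxl` and its return strings are illustrative, not part of the `libjxl` API.

```python
# Signatures as quoted in the text above.
CODESTREAM_SIGNATURE = bytes.fromhex("FF0A")
CONTAINER_SIGNATURE = bytes.fromhex("0000000C4A584C200D0A870A")


def classify_jxl(header: bytes) -> str:
    """Classify the leading bytes of a file (hypothetical helper).

    Returns 'container', 'codestream', or 'unknown'.
    """
    if header.startswith(CONTAINER_SIGNATURE):
        return "container"
    if header.startswith(CODESTREAM_SIGNATURE):
        return "codestream"
    return "unknown"
```

In `libjxl` itself, the decoder API function `JxlSignatureCheck` performs this kind of check, and it additionally reports when too few bytes are available to decide.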
     36 
     37 ### Conformance testing
     38 
     39 This part of the standard defines precision bounds and test cases for conforming
     40 decoders, to verify that they implement all coding tools correctly and accurately.
     41 
     42 ### Reference implementation
     43 
     44 The `libjxl` software is the reference implementation of JPEG XL.
     45 
     46 
     47 ## Metadata versus Image Data
     48 
     49 JPEG XL makes a clear separation between metadata and image data.
     50 Everything that is needed to correctly display an image is
     51 considered to be image data, and is part of the core codestream. This includes
     52 elements that have traditionally been considered 'metadata', such as ICC profiles
     53 and Exif orientation. The goal is to reduce the ambiguity and potential for
     54 incorrect implementations that can be caused by having a 'black box' codestream
     55 that only contains numerical pixel data, requiring applications to figure out how
     56 to correctly interpret the data (i.e. apply color transforms, upsampling,
     57 orientation, blending, cropping, etc.). By including this functionality in the
     58 codestream itself, the decoder can provide output in a normalized way
     59 (e.g. in RGBA, orientation already applied, frames blended and coalesced),
     60 which simplifies applications and makes them less error-prone.
     61 
     62 The remaining metadata, e.g. Exif or XMP, can be stored in the container format,
     63 but it does not influence image rendering. In the case of Exif orientation,
     64 this field has to be ignored by applications, since the orientation in the
     65 codestream always takes precedence (and will already have been applied
     66 transparently by the decoder). This means that stripping metadata can be done
     67 without affecting the displayed image.
     68 
     69 
     70 ## Codestream Features
     71 
     72 ### Color Management
     73 
     74 In JPEG XL, images always have a fully defined colorspace, i.e. it is always
     75 unambiguous how to interpret the pixel values. There are two options:
     76 
     77 *   Pixel data is in a specified (non-XYB) colorspace, and the decoder will produce
     78 a pixel buffer in this colorspace plus an ICC profile that describes that
     79 colorspace. Mathematically lossless encoding can only use this option.
     80 *   Pixel data is in the XYB colorspace, which is an absolute colorspace.
     81 In this case, the decoder can produce a pixel buffer directly in a desired
     82 display space like sRGB, Display-P3 or Rec.2100 PQ.
     83 
     84 The image header always contains a colorspace; however, its meaning depends on
     85 which of the above two options was used:
     86 
     87 *   In the first case (non-XYB), the signaled colorspace defines the
     88 interpretation of the pixel data.
     89 *   In the second case (XYB), the signaled colorspace is merely a _suggestion_
     90 of a target colorspace to represent the image in, i.e. it is the colorspace
     91 the original image was in, that has a sufficiently wide gamut and a
     92 suitable transfer curve to represent the image data with high fidelity
     93 using a limited bit depth representation.
     94 
     95 Colorspaces can be signaled in two ways in JPEG XL:
     96 
     97 *   CICP-style Enum values: This is a very compact representation that
     98 covers most or all of the common colorspaces. The decoder can convert
     99 XYB to any of these colorspaces without requiring an external color management
    100 library.
    101 *   ICC profiles: Arbitrary ICC profiles can also be used, including
    102 CMYK ones. The ICC profile data gets compressed. In this case, external
    103 color management software (e.g. lcms2 or skcms) has to be used for color
    104 conversions.
    105 
    106 ### Frames
    107 
    108 A JPEG XL codestream contains one or more frames. In the case of animation,
    109 these frames have a duration and can be looped (infinitely or a number of times).
    110 Zero-duration frames are possible and represent different layers of the image.
    111 
    112 Frames can have a blendmode (Replace, Add, Alpha-blend, Multiply, etc.) and
    113 they can use any previous frame as a base.
    114 They can be smaller than the image canvas, in which case the pixels outside the
    115 crop are copied from the base frame. They can be positioned at an arbitrary
    116 offset from the image canvas; this offset can also be negative and frames can
    117 also be larger than the image canvas, in which case parts of the frame will
    118 be invisible and only the intersection with the image canvas will be shown.
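
The visible part of a frame is the intersection of the frame rectangle with the canvas rectangle. As a rough sketch of that arithmetic (the function and parameter names are hypothetical, and this is not how `libjxl` structures the computation internally):

```python
from typing import Optional, Tuple

# (x0, y0, x1, y1) in canvas coordinates, with x1 and y1 exclusive.
Rect = Tuple[int, int, int, int]


def visible_region(canvas_w: int, canvas_h: int,
                   frame_x: int, frame_y: int,
                   frame_w: int, frame_h: int) -> Optional[Rect]:
    """Return the part of the frame that falls inside the canvas, or None."""
    x0 = max(frame_x, 0)
    y0 = max(frame_y, 0)
    x1 = min(frame_x + frame_w, canvas_w)
    y1 = min(frame_y + frame_h, canvas_h)
    if x0 >= x1 or y0 >= y1:
        return None  # the frame lies entirely outside the canvas
    return (x0, y0, x1, y1)
```

For example, a 50x50 frame at offset (-10, -20) on a 100x100 canvas contributes only the region from (0, 0) to (40, 30).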
    119 
    120 By default, the decoder blends and coalesces frames: a sequence of zero-duration
    121 frames yields only a single output frame, and every output frame has the
    122 size of the image canvas and either no duration (in case of a still image)
    123 or a non-zero duration (in case of animation).
    124 
    125 ### Pixel Data
    126 
    127 Every frame contains pixel data encoded in one of two modes:
    128 
    129 *   VarDCT mode: In this mode, variable-sized DCT transforms are applied
    130 and the image data is encoded in the form of DCT coefficients. This mode is
    131 always lossy, but it can also be used to losslessly represent an existing
    132 (already lossy) JPEG image, in which case only the DCT8x8 is used.
    133 *   Modular mode: In this mode, only integer arithmetic is used, which
    134 enables lossless compression. However, this mode can also be used for lossy
    135 compression. Multiple transformations can be used to improve compression or to
    136 obtain other desirable effects: reversible color transforms (RCTs),
    137 (delta) palette transforms, and a modified non-linear Haar transform
    138 called Squeeze, which facilitates (but does not require) lossy compression
    139 and enables progressive decoding.
    140 
    141 Internally, the VarDCT mode uses Modular sub-bitstreams to encode
    142 various auxiliary images, such as the "LF image" (a 1:8 downscaled version
    143 of the image that contains the DC coefficients of DCT8x8 and low-frequency
    144 coefficients of the larger DCT transforms), extra channels besides the
    145 three color channels (e.g. alpha), and weights for adaptive quantization.
    146 
    147 In addition, both modes can separately encode additional 'image features' that
    148 are rendered on top of the decoded image:
    149 
    150 *   Patches: rectangles from a previously decoded frame (which can be a
    151 'hidden' frame that is not displayed but only stored to be referenced later)
    152 can be blended using one of the blendmodes on top of the current frame.
    153 This allows the encoder to identify repeating patterns (such as letters of
    154 text) and encode them only once, using patches to insert the pattern in
    155 multiple spots. These patterns are encoded in a previous frame, making
    156 it possible to add Modular-encoded pixels to a VarDCT-encoded frame or
    157 vice versa.
    158 *   Splines: centripetal Catmull-Rom splines can be encoded, with a color
    159 and a thickness that can vary along the arclength of the curve.
    160 Although the current encoder does not use this bitstream feature yet, we
    161 anticipate that it can be useful to complement DCT-encoded data, since
    162 thin lines are hard to represent faithfully using the DCT.
    163 *   Noise: luma-modulated synthetic noise can be added to an image, e.g.
    164 to emulate photon noise, in a way that avoids poor compression due to
    165 high frequency DCT coefficients.
    166 
    167 Finally, both modes can also optionally apply two filtering methods to
    168 the decoded image, which both have the goal of reducing block artifacts
    169 and ringing:
    170 
    171 *   Gabor-like transform ('Gaborish'): a small (3x3) blur that gets
    172 applied across block and group boundaries, reducing blockiness. The
    173 encoder applies the inverse sharpening transform before encoding,
    174 effectively getting the benefits of lapped transforms without the
    175 disadvantages.
    176 *   Edge-preserving filter ('EPF'): similar to a bilateral filter,
    177 this smoothing filter avoids blurring edges while reducing ringing.
    178 The strength of this filter is signaled and can locally be adapted.
    179 
    180 ### Groups
    181 
    182 In both modes (Modular and VarDCT), the frame data is signaled as
    183 a sequence of groups. These groups can be decoded independently,
    184 and the frame header contains a table of contents (TOC) with bitstream
    185 offsets for the start of each group. This enables parallel decoding,
    186 and also partial decoding of a region of interest or a progressive preview.
    187 
    188 In VarDCT mode, all groups have dimensions 256x256 (or smaller at the
    189 right and bottom borders). First the LF image is encoded, also in
    190 256x256 groups (corresponding to 2048x2048 pixels, since this data
    191 corresponds to the 1:8 image). This means there is always a basic
    192 progressive preview available in VarDCT mode.
    193 Optionally, the LF image can be encoded separately in a (hidden)
    194 LF frame, which can itself recursively be encoded in VarDCT mode
    195 and have its own LF frame. This makes it possible to represent huge
    196 images while still having an overall preview that can be efficiently
    197 decoded.
    198 Then the HF groups are encoded, corresponding to the remaining AC
    199 coefficients. The HF groups can be encoded in multiple passes for
    200 more progressive refinement steps; the coefficients of all passes
    201 are added. Unlike JPEG progressive scan scripts, JPEG XL allows
    202 signaling any amount of detail in any part of the image in any pass.
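
To get a feel for the numbers, the group counts for a given image size follow directly from the dimensions stated above. This is a small sketch with hypothetical names; it only counts groups and says nothing about how they are laid out in the bitstream.

```python
GROUP_DIM = 256              # VarDCT group size in pixels
LF_GROUP_DIM = GROUP_DIM * 8  # one LF group covers 2048x2048 pixels


def ceil_div(a: int, b: int) -> int:
    """Integer division rounding up."""
    return -(-a // b)


def vardct_group_counts(width: int, height: int):
    """Return (number of 256x256 groups, number of LF groups) for an image."""
    groups = ceil_div(width, GROUP_DIM) * ceil_div(height, GROUP_DIM)
    lf_groups = ceil_div(width, LF_GROUP_DIM) * ceil_div(height, LF_GROUP_DIM)
    return groups, lf_groups
```

A 4000x3000 image, for instance, is covered by 16 x 12 = 192 groups and 2 x 2 = 4 LF groups.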
    203 
    204 In Modular mode, groups can have dimensions 128x128, 256x256, 512x512
    205 or 1024x1024. If the Squeeze transform was used, the data will
    206 be split into three parts: the Global groups (the top of the Laplacian
    207 pyramid that fits in a single group), the LF groups (the middle part
    208 of the Laplacian pyramid that corresponds to the data needed to
    209 reconstruct the 1:8 image) and the HF groups (the base of the Laplacian
    210 pyramid), where the HF groups are again possibly encoded in multiple
    211 passes (up to three: one for the 1:4 image, one for the 1:2 image,
    212 and one for the 1:1 image).
    213 
    214 In case of a VarDCT image with extra channels (e.g. alpha), the
    215 VarDCT groups and the Modular groups are interleaved in order to
    216 allow progressive previews of all the channels.
    217 
    218 The default group order is to encode the LF and HF groups in
    219 scanline order (top to bottom, left to right), but this order
    220 can be permuted arbitrarily. This allows, for example, a center-first
    221 ordering or a saliency-based ordering, causing the bitstream
    222 to prioritize progressive refinements in a different way.
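
As an illustration of such a permutation, the sketch below computes a center-first order for a grid of groups. The distance measure and the function name are assumptions made for this example; an actual encoder may use a different heuristic.

```python
def center_first_order(cols: int, rows: int):
    """Return scanline group indices sorted by distance from the grid center.

    A group's scanline index is row * cols + col. Ties are broken by the
    scanline index so the result is deterministic.
    """
    cx = (cols - 1) / 2
    cy = (rows - 1) / 2

    def key(i: int):
        col, row = i % cols, i // cols
        return ((col - cx) ** 2 + (row - cy) ** 2, i)

    return sorted(range(cols * rows), key=key)
```

For a 3x3 grid this puts the middle group (index 4) first, then the four edge-adjacent groups, then the corners.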
    223 
    224 
    225 ## File Format Features
    226 
    227 Besides the image data itself (stored in the `jxlc` codestream box),
    228 the optional container format allows storing additional information.
    229 
    230 ### Metadata
    231 
    232 Three types of metadata can be included in a JPEG XL container:
    233 
    234 *   Exif (`Exif`)
    235 *   XMP (`xml `)
    236 *   JUMBF (`jumb`)
    237 
    238 This metadata can contain information about the image, such as copyright
    239 notices, GPS coordinates, camera settings, etc.
    240 If it contains rendering-impacting information (such as Exif orientation),
    241 the information in the codestream takes precedence.
    242 
    243 ### Compressed Metadata
    244 
    245 The container allows the above metadata to be stored either uncompressed
    246 (e.g. plaintext XML in the case of XMP) or Brotli-compressed.
    247 In the latter case, the box type is `brob` (Brotli-compressed Box) and
    248 the first four bytes of the box contents define the actual box type
    249 (e.g. `xml `) it represents.
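
A reader of the container therefore has to look inside a `brob` box to learn which kind of metadata it holds. The sketch below assumes the standard ISOBMFF box header (a 32-bit big-endian size followed by a four-byte type, where size 1 means a 64-bit size follows and size 0 means the box runs to the end of the data). The helper names are hypothetical, and actual decompression would need a Brotli library, which is left out here.

```python
import struct


def read_box(data: bytes, offset: int = 0):
    """Parse one box; return (box_type, payload, offset_of_next_box)."""
    size, box_type = struct.unpack_from(">I4s", data, offset)
    header_len = 8
    if size == 1:
        # A 64-bit size follows the type field.
        (size,) = struct.unpack_from(">Q", data, offset + 8)
        header_len = 16
    elif size == 0:
        # The box extends to the end of the data.
        size = len(data) - offset
    end = offset + size
    return box_type, data[offset + header_len:end], end


def effective_box_type(box_type: bytes, payload: bytes):
    """Return (actual_type, content, is_brotli_compressed).

    For a `brob` box the first four payload bytes name the wrapped box type
    and the rest is the Brotli-compressed content.
    """
    if box_type == b"brob":
        return payload[:4], payload[4:], True
    return box_type, payload, False
```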
    250 
    251 ### JPEG Bitstream Reconstruction Data
    252 
    253 JPEG XL can losslessly recompress existing JPEG files.
    254 The general design philosophy still applies in this case:
    255 all the image data is stored in the codestream box, including the DCT
    256 coefficients of the original JPEG image and possibly an ICC profile or
    257 Exif orientation.
    258 
    259 In order to allow bit-identical reconstruction of the original JPEG file
    260 (not just the image but the actual file), additional information is needed,
    261 since the same image data can be encoded in multiple ways as a JPEG file.
    262 The `jbrd` box (JPEG Bitstream Reconstruction Data) contains this information.
    263 Typically it is relatively small. Using the image data from the codestream,
    264 the JPEG bitstream reconstruction data, and possibly other metadata boxes
    265 that were present in the JPEG file (Exif/XMP/JUMBF), the exact original
    266 JPEG file can be reconstructed.
    267 
    268 This box is not needed to display a recompressed JPEG image; it is only
    269 needed to reconstruct the original JPEG file.
    270 
    271 ### Frame Index
    272 
    273 The container can optionally store a `jxli` box, which contains an index
    274 of offsets to keyframes of a JPEG XL animation. It is not needed to display
    275 the animation, but it does facilitate efficient seeking.
    276 
    277 ### Partial Codestream
    278 
    279 The codestream can optionally be split into multiple `jxlp` boxes;
    280 conceptually, this is equivalent to a single `jxlc` box that contains the
    281 concatenation of all partial codestream boxes.
    282 This makes it possible to create a file that starts with
    283 the data needed for a progressive preview of the image, followed by
    284 metadata, followed by the remaining image data.
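
The concatenation can be sketched as follows, assuming the container has already been parsed into (box type, payload) pairs in file order. One detail is not stated in this document and is taken here as an assumption from the container specification: each `jxlp` payload begins with a 4-byte big-endian index whose highest bit marks the last partial box. The function name is hypothetical.

```python
def assemble_codestream(boxes) -> bytes:
    """Rebuild the codestream from parsed (box_type, payload) pairs."""
    parts = []
    for box_type, payload in boxes:
        if box_type == b"jxlc":
            # A single codestream box already holds everything.
            return payload
        if box_type == b"jxlp":
            # Assumption: 4-byte index prefix, high bit set on the last box.
            index = int.from_bytes(payload[:4], "big")
            parts.append(payload[4:])
            if index & 0x80000000:
                break
    return b"".join(parts)
```

Other boxes, such as `Exif`, are simply skipped, which is what allows metadata to sit between the preview data and the rest of the image data.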