dr_libs/wip/dr_opus_log.txt

This is a log of my development of dr_opus - a single file public domain Opus decoder in C. The purpose of this log is
to track the development of the project and log how I solved various problems.
References:
RFC 6716 - https://tools.ietf.org/html/rfc6716 - Definition of the Opus Audio Codec
RFC 7845 - https://tools.ietf.org/html/rfc7845 - Ogg Encapsulation for the Opus Audio Codec
The main reference for development is RFC 6716. An important detail of this specification is in Section 1:
The primary normative part of this specification is provided by the
source code in Appendix A. Only the decoder portion of this software
is normative, though a significant amount of code is shared by both
the encoder and decoder. Section 6 provides a decoder conformance
test. The decoder contains a great deal of integer and fixed-point
arithmetic that needs to be performed exactly, including all rounding
considerations, so any useful specification requires domain-specific
symbolic language to adequately define these operations.
Additionally, any conflict between the symbolic representation and
the included reference implementation must be resolved. For the
practical reasons of compatibility and testability, it would be
advantageous to give the reference implementation priority in any
disagreement. The C language is also one of the most widely
understood, human-readable symbolic representations for machine
behavior. For these reasons, this RFC uses the reference
implementation as the sole symbolic representation of the codec.
What this is basically saying is that the source code should be considered the main point of reference for the format of
the Opus bit stream. We cannot, however, just willy-nilly take code from the reference implementation because that would
cause licensing problems. However, I'm going to make a prediction before I even begin: I think that with RFC 6716, a bit of
intuition, problem solving and help from others, it can be figured out. Indeed, I think RFC 6716 may even be enough, and
that perhaps the authors of said specification refer to the source code as a self-defence mechanism to guard themselves
from possible subtle errors in the specification. If I'm wrong, the log below will track it.
Let's begin...
2018/11/30 6:55 PM
==================
- Inserted skeleton code with licensing information. Dual licensed as Public Domain _or_ MIT, whichever you prefer. API
is undecided as of yet.
- Decided on supporting C89. Not yet decided on how the standard library will be used.
- Deciding for now to use the init() style of API from dr_wav and dr_mp3 which takes a pre-allocated decoder object. May
need to change this to dr_flac style open() (or perhaps new()/delete()) if the size of the structure is too big.
- dr_wav, etc. use booleans for results status so I'm copying that for consistency.
- Added boilerplate for standard sized types (copied from mini_al).
- Added a comment at the top of the file to show that this isn't complete (even though it's in the "wip" folder!).
- Added boilerplate for read and seek callbacks (copied from dr_flac).
- Added shell APIs:
  - dropus_init()
    - Not requiring onSeek for the moment because I'm hoping that we can make that optional somehow, with restriction.
  - dropus_uninit()
    - Does nothing at the moment, but in future will need to close a file when the decoder is opened against a file.
- Added some common overrideable macros (taken from dr_flac):
  - DROPUS_ASSERT
  - DROPUS_COPY_MEMORY
  - DROPUS_ZERO_MEMORY
  - DROPUS_ZERO_OBJECT
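A minimal sketch of what "overrideable" could mean for these macros, assuming the usual #ifndef pattern: each macro is only defined when the application has not already defined it before including the file. The macro names come from the list above; the bodies, the example_state structure and example_state_reset() are hypothetical and not taken from dr_flac or dr_opus.

```c
#include <assert.h>
#include <string.h>

/* Each macro is only defined if the application has not supplied its own version. */
#ifndef DROPUS_ASSERT
#define DROPUS_ASSERT(expression)         assert(expression)
#endif
#ifndef DROPUS_COPY_MEMORY
#define DROPUS_COPY_MEMORY(dst, src, sz)  memcpy((dst), (src), (sz))
#endif
#ifndef DROPUS_ZERO_MEMORY
#define DROPUS_ZERO_MEMORY(p, sz)         memset((p), 0, (sz))
#endif
#ifndef DROPUS_ZERO_OBJECT
#define DROPUS_ZERO_OBJECT(p)             DROPUS_ZERO_MEMORY((p), sizeof(*(p)))
#endif

/* Hypothetical structure, used only to demonstrate the macros. */
typedef struct
{
    int channels;
    int sampleRate;
} example_state;

/* Clears every member of the structure to zero. */
static void example_state_reset(example_state* pState)
{
    DROPUS_ASSERT(pState != NULL);
    DROPUS_ZERO_OBJECT(pState);
}
```

An application that wants to avoid the C standard library would define the four macros itself before including the header.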
- Decided to use the C standard library for now. This simplifies things like memset() and assert(). Currently using the
  following headers:
  - stddef.h
  - stdint.h
  - stdlib.h
  - string.h
  - assert.h
- Added some infrastructure for future APIs: dropus_init_file() and dropus_init_memory()
  - In dr_flac, dr_wav and dr_mp3, a mistake was made with how the internal file handle is managed by the decoder. In
    these libraries the pUserData member of the decoder structure is recycled to hold the FILE pointer. dr_opus is
    changing this to be its own member of the main decoder structure.
  - The internal members for use with dropus_init_memory() use the same system as dr_mp3. See drmp3.memory.
- Added dropus_init_file()
  - All init() APIs call dropus_init_internal() which is where the main initialization work is done.
  - This is using the stdio FILE API.
  - dropus_uninit() has been updated to close the file.
  - Added dropus_fopen() for compiler compatibility. Taken from dr_flac, but added the "mode" parameter for consistency
    with fopen().
  - Added dropus_fclose() for namespace consistency with dropus_fopen().
- Added dropus_init_memory()
  - onRead and onSeek callbacks taken from dr_mp3.
- INITIAL COMMIT
2018/12/01 12:46 PM
===================
- Downloading all official test vectors in preparation for future testing.
2018/12/01 03:58 PM
===================
- Decided to figure out the different encapsulations for Opus so I can figure out a design for data flow.
  - Ogg: https://tools.ietf.org/html/rfc7845
  - Matroska: https://wiki.xiph.org/MatroskaOpus
  - WebM: https://www.webmproject.org/docs/container/
  - MPEG-TS: https://wiki.xiph.org/OpusTS
  - Mysterious encapsulation used by the official test vectors. Looking at this it's hard to tell what they are using for
    this, but I'm assuming it's not documented. Will need to figure that one out for myself later.
- Since WebM is a subset of Matroska, can we combine those together to avoid code duplication?
- Upon reading into Ogg and MPEG-TS encapsulation (not looked into Matroska in detail yet) it looks like handling large
channel counts is going to be a pain. It can support up to 255 channels, which are made up of multiple streams. My
instinct is telling me that we won't be able to use a fixed sized data structure and instead might require dynamic
allocations. Darn...
- Leaning towards a two-tier architecture - a low-level decoder for raw Opus streams, and a high-level API for encapsulated
streams. Opus packets support only 1 or 2 channels which means the size of the structure _can_ be fixed.
- Currently I'm thinking something like the following:
  - struct dropus_stream - low-level Opus stream decoding
  - struct dropus - high-level encapsulated Opus stream decoding
    - struct dropus_ogg (Ogg Opus)
    - struct dropus_mk (Matroska)
    - struct dropus_ts (MPEG-TS)
    - struct dropus_tv (That annoying non-standard and undocumented encapsulation used by the test vectors)
- The maximum number of samples per frame (per-channel) is 960 because:
  - The maximum frame size for wideband (16kHz) is 60 ms: 16*60 = 960
  - The maximum frame size for fullband (48kHz) is 20 ms: 48*20 = 960
  - To account for stereo frames we would need to multiply 960 by 2: 960*2 = 1920
  - This is small enough that it can be stored statically inside the dropus_stream structure.
- It looks like low level Opus streams won't be able to do efficient seeking without an encapsulation.
- An Opus stream is made up of packets. Each packet has 1 or more frames. Each frame can be made up of 1 or 2 channels.
- The size of the packet in bytes is unknown at the Opus stream level - it depends on the encapsulation to know the
size of the packet. This could be a bit of a problem because we may need to know the size of a frame in order to know
how much data we need from the client. I'm wondering if the low-level API should look something like the following:
- dropus_result dropus_stream_decode_packet(dropus_stream* pOpusStream, const void* pCompressedData, size_t compressedDataSize);
- This will require the caller to have information on the size of the packet.
- Added struct dropus_stream_packet, dropus_stream_frame.
- Each packet contains a byte that defines the structure of the packet. It's called the TOC byte
(Section 3.1. The TOC Byte) in RFC 6716.
- Added skeleton implementations for dropus_stream_init() and dropus_stream_decode_packet().
- Added helpers for decoding the different parts of TOC byte.
- dropus_toc_config(), dropus_toc_s() and dropus_toc_c()
- Added enum for the three different modes as outlined in Section 3.1 of RFC 6716. Also added an API to extract this
mode from the config component of the TOC: dropus_toc_config_mode() and dropus_toc_mode().
- Added API to extract sample rate from the TOC: dropus_toc_config_sample_rate() and dropus_toc_sample_rate()
- Added API to extract the size of an Opus frame in PCM frames from the TOC:
- dropus_toc_config_frame_size_in_pcm_frames() and dropus_toc_frame_size_in_pcm_frames().
- COMMIT
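A rough sketch of how the TOC helpers above might look, assuming the bit layout from RFC 6716 Section 3.1 (config in the top 5 bits, then the stereo flag "s", then the 2-bit frame count code "c"). The function names come from the notes above, but the signatures, the example_mode enum and the lookup tables are assumptions for illustration, not the actual dr_opus code.

```c
#include <assert.h>

typedef enum
{
    example_mode_silk,
    example_mode_hybrid,
    example_mode_celt
} example_mode;

/* TOC layout: | config (5 bits) | s (1 bit) | c (2 bits) | */
static unsigned int dropus_toc_config(unsigned char toc) { return (toc & 0xF8u) >> 3; }
static unsigned int dropus_toc_s(unsigned char toc)      { return (toc & 0x04u) >> 2; }
static unsigned int dropus_toc_c(unsigned char toc)      { return (toc & 0x03u); }

/* Configs 0..11 are SILK-only, 12..15 are Hybrid and 16..31 are CELT-only. */
static example_mode dropus_toc_config_mode(unsigned int config)
{
    assert(config < 32);
    if (config < 12) return example_mode_silk;
    if (config < 16) return example_mode_hybrid;
    return example_mode_celt;
}

/* Sample rate implied by the audio bandwidth of each config (NB, MB, WB, SWB, FB). */
static unsigned long dropus_toc_config_sample_rate(unsigned int config)
{
    static const unsigned long rates[32] = {
         8000,  8000,  8000,  8000,   /* SILK NB    */
        12000, 12000, 12000, 12000,   /* SILK MB    */
        16000, 16000, 16000, 16000,   /* SILK WB    */
        24000, 24000,                 /* Hybrid SWB */
        48000, 48000,                 /* Hybrid FB  */
         8000,  8000,  8000,  8000,   /* CELT NB    */
        16000, 16000, 16000, 16000,   /* CELT WB    */
        24000, 24000, 24000, 24000,   /* CELT SWB   */
        48000, 48000, 48000, 48000    /* CELT FB    */
    };
    assert(config < 32);
    return rates[config];
}

/* Size of one Opus frame in PCM frames at the rate returned above. */
static unsigned long dropus_toc_config_frame_size_in_pcm_frames(unsigned int config)
{
    /* Frame durations in microseconds so that 2.5 ms stays an integer. */
    static const unsigned long silkUs[4]   = {10000, 20000, 40000, 60000};
    static const unsigned long hybridUs[2] = {10000, 20000};
    static const unsigned long celtUs[4]   = { 2500,  5000, 10000, 20000};
    unsigned long us;

    switch (dropus_toc_config_mode(config)) {
        case example_mode_silk:   us = silkUs[config & 3];   break;
        case example_mode_hybrid: us = hybridUs[config & 1]; break;
        default:                  us = celtUs[config & 3];   break;
    }
    return dropus_toc_config_sample_rate(config) * us / 1000000UL;
}
```

For example, config 11 (SILK wideband, 60 ms) and config 31 (CELT fullband, 20 ms) both come out at 960 PCM frames, which matches the maximum worked out earlier in this entry.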
2018/12/02 09:45 AM
===================
- Compiling for the first time. Targeting C89 so need to get it all working properly right from the start.
- "test" folder added to my local repository. Not version controlled yet as I'm not sure I want to include all of
the test stuff in the repository. Will definitely not be version controlling the test vectors. Perhaps use a submodule?
- Have decided to not support the mysterious encapsulation format used by the test vectors in dr_opus itself. Instead
I'm just doing it manually in the test program.
- After attempting to build the C++ version, it appears Clang will attempt to #include stdint.h from Visual Studio's
standard library which fails. May need to come up with a way to avoid the dependency on stdint.h. Same thing also
happens with stdlib.h.
- Have cleaned up the sized types stuff for Clang to avoid the dependency on stdint, however errors are still happening
due to the use of stdlib.h. Leaving the Clang C++ issue for now. It's technically a Clang issue rather than dr_opus.
- COMMIT
- Have decided to version control test programs, but not test vectors. Instead I'm putting a note in a readme about
where to download and place the test vectors.
- In my experience I have found that having a testing infrastructure set up earlier saves a lot of time in the long run
so the next thing I have decided to work on is a test program. It makes sense to test against the official test vectors,
so the first thing is to figure out this undocumented encapsulation format...
Test Vector Encapsulation Format
--------------------------------
- All test vectors seem to start with two bytes of 0x00.
- This does not feel like a magic number or identifier.
- Noticed that first 4 bytes add up to a relatively small number. A counter of some kind in little endian format?
- A pattern can be seen in testvector02.bit:
- The bytes 00 00 00 XX can be seen repeating. This indicates some kind of blocking.
- From the first 4 bytes you see the hex value of 0x0000001E which equals 30 in decimal
- Counting forward 30 bytes brings us 4 bytes before the next set of 00 00 00 XX bytes. Definitely starting to feel like
the first 4 bytes are a byte counter.
- Skipping past those four bytes and looking at the next 00 00 00 XX bytes we get 0x0000001D which is 29 in decimal.
- Again, counting forward by that number of bytes brings us to 4 bytes prior to the next 00 00 00 XX block. First 4
bytes are definitely a byte counter.
- It makes sense that each of these blocks is an Opus packet.
- Still don't know what the second set of 4 bytes could represent... Does it matter for our purposes? All we care about
is the raw Opus bit stream.
- A Google search for "opus test vector format" brings up the following result (a PDF document):
- https://www.ietf.org/proceedings/82/slides/codec-4.pdf
- From the "Force Multipliers" page:
- "The range coder has 32 bits of state which must match between the encoder and decoder"
- "Provides a checksum of all encoding and decoding decisions"
- "opus_demo bitstreams include the range value with every packet and test for mismatches"
- Willing to bet that the extra 4 bytes are that "32 bits of state". This is actually good for dr_opus.
- I think the format is just a list of Opus packets with 8 bytes of header data for each one:
    4 bytes: Opus packet size in bytes
    4 bytes: 32 bits of range coder state
    [Opus Packet]
--------------------------------
- The test program can either hard code a list of vector files or we can enumerate over files in known directories. I'm
preferring the latter to make it easier to add new vectors. It's also consistent with the way dr_flac does it which has
been pretty good in the past.
- Started work on a "common" lib for test programs. First thing is the new file iterator.
- File iterator now done, but only supporting Windows currently.
- Have implemented the logic for loading test vectors and the byte counter (and I'm assuming the range state) is actually
in big-endian, not little-endian. I'm an idiot!
- Copying endian management APIs from dr_flac over to dr_opus.
- Basic infrastructure for testing against the standard test vectors is now done.
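The packet enumeration described above can be sketched roughly as follows, assuming the layout worked out in this entry: a 4-byte big-endian packet size, a 4-byte big-endian value presumed to be the range coder state, then the packet bytes. All names here (example_tv_packet, example_read_be32, example_tv_next_packet) are hypothetical and are not the actual test program code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    size_t packetSize;                  /* Size of the Opus packet in bytes. */
    unsigned long rangeState;           /* Presumed final range coder state. */
    const unsigned char* pPacketData;   /* Points into the caller's buffer.  */
} example_tv_packet;

/* Reads a 32-bit big-endian value regardless of host endianness. */
static unsigned long example_read_be32(const unsigned char* p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] <<  8) |  (unsigned long)p[3];
}

/*
Reads the packet starting at *pOffset and advances *pOffset past it.
Returns 1 on success, or 0 at the end of the data or if the data is truncated.
*/
static int example_tv_next_packet(const unsigned char* pData, size_t dataSize, size_t* pOffset, example_tv_packet* pPacket)
{
    size_t offset;
    unsigned long packetSize;

    assert(pData != NULL && pOffset != NULL && pPacket != NULL);

    offset = *pOffset;
    if (offset > dataSize || dataSize - offset < 8) {
        return 0;   /* Not enough room for the 8-byte header. */
    }

    packetSize          = example_read_be32(pData + offset);
    pPacket->rangeState = example_read_be32(pData + offset + 4);
    offset += 8;

    if (packetSize > dataSize - offset) {
        return 0;   /* The header claims more bytes than are available. */
    }

    pPacket->packetSize  = (size_t)packetSize;
    pPacket->pPacketData = pData + offset;
    *pOffset = offset + (size_t)packetSize;
    return 1;
}
```

Calling example_tv_next_packet() in a loop until it returns 0 walks every packet in a test vector that has been loaded into memory.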
- COMMIT
2018/12/03 6:15 PM
==================
- The packet enumeration in the test program seems to work so far, but I'm still not 100% sure that my assumptions are
correct. Adding more validation to dropus_stream_decode_packet() to increase my certainty. The next stage is to
implement all of the main packet validation.
- Question for later. How are "Discontinuous Transmission (DTX) or lost packet" handled? Should
dropus_stream_decode_packet() return an error in this case?
- Code for validating packet codes 0, 1 and 2 is now in, however untested so far. Code 3 is a bit more complicated and
right now I need to eat dinner...
- COMMIT
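For reference, a sketch of what validation for packet codes 0, 1 and 2 could look like, based on the frame packing rules in RFC 6716 Section 3.2. The function names and signatures are hypothetical, code 3 is deliberately left out, and other constraints from the RFC (such as the maximum frame length) are not checked here.

```c
#include <assert.h>
#include <stddef.h>

/*
Decodes a 1 or 2 byte frame length (RFC 6716 Section 3.2.1). Returns the number
of bytes used by the length field, or 0 if there is not enough data.
*/
static size_t example_decode_frame_length(const unsigned char* pData, size_t dataSize, size_t* pFrameSize)
{
    assert(pData != NULL && pFrameSize != NULL);

    if (dataSize < 1) {
        return 0;
    }
    if (pData[0] < 252) {
        *pFrameSize = pData[0];     /* 0 means no frame (DTX or lost packet). */
        return 1;
    }
    if (dataSize < 2) {
        return 0;
    }
    *pFrameSize = (size_t)pData[1] * 4 + pData[0];
    return 2;
}

/*
Works out the frame sizes for packet codes 0, 1 and 2. Returns 1 on success, or
0 if the packet is invalid or uses code 3.
*/
static int example_parse_frame_sizes(const unsigned char* pPacket, size_t packetSize, size_t frameSizes[2], size_t* pFrameCount)
{
    size_t payloadSize;
    size_t lengthBytes;

    assert(pPacket != NULL && frameSizes != NULL && pFrameCount != NULL);

    if (packetSize < 1) {
        return 0;   /* Every packet needs at least the TOC byte. */
    }
    payloadSize = packetSize - 1;

    switch (pPacket[0] & 0x03) {
        case 0:     /* One frame that fills the rest of the packet. */
            frameSizes[0] = payloadSize;
            *pFrameCount  = 1;
            return 1;

        case 1:     /* Two frames of equal size, so the payload must be even. */
            if ((payloadSize & 1) != 0) {
                return 0;
            }
            frameSizes[0] = payloadSize / 2;
            frameSizes[1] = payloadSize / 2;
            *pFrameCount  = 2;
            return 1;

        case 2:     /* Two frames, with the size of the first one coded explicitly. */
            lengthBytes = example_decode_frame_length(pPacket + 1, payloadSize, &frameSizes[0]);
            if (lengthBytes == 0) {
                return 0;
            }
            payloadSize -= lengthBytes;
            if (frameSizes[0] > payloadSize) {
                return 0;   /* First frame is larger than what remains. */
            }
            frameSizes[1] = payloadSize - frameSizes[0];
            *pFrameCount  = 2;
            return 1;

        default:    /* Code 3 is not handled in this sketch. */
            return 0;
    }
}
```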
2018/12/03 6:40 PM
==================
- Code 3 CBR is done (no VBR yet). Still untested.
- Code 3 VBR is done. Still untested.
- Testing and debugging against the standard test vectors coming next.
- Fixed a few compiler warnings.
- COMMIT
2019/03/08 6:28 PM
==================
- Starting on range decoding. From figure 4 in RFC 6716 the diagram shows that both the SILK and CELT decoders draw
their data from the Range Decoder. Therefore the Range Decoder needs to be figured out before SILK and CELT decoding.
I have no prior knowledge of range coding, so hopefully RFC 6716 has enough information for a complete enough
implementation for the purpose of Opus decoding.
- Section 4.1.1 of RFC 6716 is how to initialize the decoder. It references a variable called "rng", but doesn't
specify its possible ranges. Setting to a 32-bit unsigned integer for now.
- In this section, it says "or containing zero if there are no bytes in this Opus frame". I'm assuming we do this
by looping over each frame in order...
- Just noticed that the end of section 4.1.1 says "rng > 2**23". So "rng" needs to be 32-bits.
- And now I've just noticed that section 4.1 explicitly says the following: "Both val and rng are 32-bit unsigned integer
values." Problem solved.
- Having read over the range coding stuff in RFC 6716 I had some difficulty figuring out where symbol frequencies are
being stored. They're not stored, but rather specified in the specification on a per-context basis. In the spec there
are references to "PDFs" which look like this: "{f[0], f[1]}/ft". Example: "{1, 1}/2". Simple, now that I've finally
figured it out...
- I can therefore not do the range decoder independent of some higher level context. I will start with the SILK decoder
since that's the next section...
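To make the PDF notation concrete, here is a small sketch of how a table like "{f[0], f[1]}/ft" could be turned into the cumulative bounds fl[k] and fh[k] that Section 4.1.2 of RFC 6716 uses, and how a symbol index k would be found for a given value "fs". The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
Finds the index k such that fl[k] <= fs < fh[k], where fl[k] is the sum of all
frequencies before k and fh[k] = fl[k] + f[k]. Returns "count" if fs is not
below the total of all frequencies (ft).
*/
static unsigned int example_pdf_lookup(const unsigned int* pFreqs, unsigned int count, unsigned int fs, unsigned int* pFl, unsigned int* pFh)
{
    unsigned int k;
    unsigned int fl = 0;

    assert(pFreqs != NULL && pFl != NULL && pFh != NULL);

    for (k = 0; k < count; k += 1) {
        unsigned int fh = fl + pFreqs[k];
        if (fs < fh) {
            *pFl = fl;
            *pFh = fh;
            return k;
        }
        fl = fh;    /* The upper bound of this symbol is the lower bound of the next. */
    }

    return count;
}
```

With the PDF {1, 1}/2 the value fs = 0 maps to k = 0 and fs = 1 maps to k = 1. With a made-up PDF of {1, 2, 5}/8 the ranges are [0,1), [1,3) and [3,8).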
2019/03/09 6:52 AM
==================
- I think I'm getting a feel for the requirements for the range decoding API. I've decided for now to use a structure
for this called dropus_range_decoder.
- dropus_range_decoder_init() implemented based on section 4.1.1.
- Added stub for dropus_stream_decode_frame(). I think the idea is that we iterate over each frame in the packet and
decode them in order? Not fully sure yet how frame transitions will work, so may need to rethink this.
- I'm still not sure if the range decoder is supposed to be initialized once per frame or once per packet...
- After more testing it turns out my test program wasn't reporting errors properly. *Sigh*. These bugs were originating
from the packet decoding section.
- COMMIT
- Section 4.1.2 mentions that there's two steps to decoding a symbol. The first step seems easy enough. Implemented in
dropus_range_decoder_fs().
- The second step requires updating "val" and "rng" of the range decoder based on (fl[k], fh[k] and ft). Implemented in
dropus_range_decoder_update(pRangeDecoder, fl, fh, ft).
- Normalization of the range decoder is easy (RFC 6716 - 4.1.2.1). Implemented in dropus_range_decoder_normalize().
- I think at this point we have the foundations done for the range decoder. Still more to do for CELT, but I think I
now have enough to hack away a bit more at the SILK-specific data.
- COMMIT
- I'm still not 100% sure how range decoding works - I know it depends on the context of the specific thing being
decoded, but not sure how the index "k" is handled. The first part of the SILK frame is the VAD flags PDF={1,1}/2.
Do I use the value of k (which I think is 0 or 1) to know whether or not the bit is set? I am going to assume this
for now.
2019/03/10 7:40 AM
==================
- At the moment I am using what feels like a very cumbersome way of decoding "k" from the range decoder. I'm calling
dropus_range_decoder_fs(), followed by a function called dropus_range_decoder_k() to retrieve "k", and then
dropus_range_decoder_update() at the end of it. I can simplify this into a single function, I think, which I'm
calling dropus_range_decoder_decode() which will perform the "fs" decode, perform the update, then return "k".
- As I'm writing dropus_range_decoder_decode() I've realized that I can simplify my dropus_range_decoder_update()
function to do both the "k" lookup and the update from section 4.1.2. It just needs to be changed to take
the "context" probabilities and "fs" and to return "k".
- This _seems_ to work. Or at least, the results are consistent with my first attempt. Committing.
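A self-contained sketch of the single-call decode described above, following my reading of RFC 6716 Sections 4.1.1 (initialization), 4.1.2 (symbol decode and update) and 4.1.2.1 (normalization). The example_rd_* names and the structure layout are hypothetical and this is not the dr_opus implementation. It takes the PDF as an array of frequencies and computes ft as their sum.

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    const unsigned char* pData;
    size_t dataSize;
    size_t readPos;
    unsigned int leftoverBit;   /* Low bit of the most recently read byte. */
    unsigned long val;          /* Values never exceed 32 bits. */
    unsigned long rng;
} example_range_decoder;

/* Returns the next input byte, or 0 once the frame data has run out. */
static unsigned int example_rd_next_byte(example_range_decoder* pRD)
{
    if (pRD->readPos < pRD->dataSize) {
        return pRD->pData[pRD->readPos++];
    }
    return 0;
}

/* Section 4.1.2.1: pull in 8 bits at a time until rng > 2**23. */
static void example_rd_normalize(example_range_decoder* pRD)
{
    while (pRD->rng <= 0x800000UL) {
        unsigned int b   = example_rd_next_byte(pRD);
        unsigned int sym = (pRD->leftoverBit << 7) | (b >> 1);
        pRD->leftoverBit = b & 1;
        pRD->rng <<= 8;
        pRD->val = ((pRD->val << 8) + (255 - sym)) & 0x7FFFFFFFUL;
    }
}

/* Section 4.1.1: initialize from the first byte, then normalize. */
static void example_rd_init(example_range_decoder* pRD, const unsigned char* pData, size_t dataSize)
{
    unsigned int b0;

    assert(pRD != NULL);
    pRD->pData    = pData;
    pRD->dataSize = dataSize;
    pRD->readPos  = 0;

    b0 = example_rd_next_byte(pRD);
    pRD->rng         = 128;
    pRD->val         = 127 - (b0 >> 1);
    pRD->leftoverBit = b0 & 1;
    example_rd_normalize(pRD);
}

/* Section 4.1.2: compute fs, look up k, update val and rng, normalize, return k. */
static unsigned int example_rd_decode(example_range_decoder* pRD, const unsigned int* pFreqs, unsigned int count)
{
    unsigned long ft = 0;
    unsigned long fl = 0;
    unsigned long fh = 0;
    unsigned long scale;
    unsigned long q;
    unsigned long fs;
    unsigned int k;

    assert(pRD != NULL && pFreqs != NULL && count > 0);
    for (k = 0; k < count; k += 1) {
        ft += pFreqs[k];
    }
    assert(ft > 0);

    scale = pRD->rng / ft;
    q     = pRD->val / scale + 1;
    fs    = ft - ((q < ft) ? q : ft);

    /* Find k such that fl[k] <= fs < fh[k]. */
    for (k = 0; k < count; k += 1) {
        fh = fl + pFreqs[k];
        if (fs < fh) {
            break;
        }
        fl = fh;
    }
    assert(k < count);

    pRD->val -= scale * (ft - fh);
    if (fl > 0) {
        pRD->rng = scale * (fh - fl);
    } else {
        pRD->rng -= scale * (ft - fh);
    }

    example_rd_normalize(pRD);
    return k;
}
```

With the PDF {1, 1}/2, a frame made up entirely of 0x00 bytes decodes k = 0 and a frame made up entirely of 0xFF bytes decodes k = 1.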
- Added dropus_Q13() for doing the stereo weight lookup (RFC 6716 - Table 7).
- Section 4.2.7.1 mentions that the weights are linearly interpolated with the weights from the previous frame. What
factor do I use for the interpolation?
- Finished initial work on RFC 6716 - Section 4.2.7.1. Still don't know how the interpolation part works. I'm hoping
this is explained later in the spec.
- COMMIT
- Completed initial implementation of RFC 6716 - Section 4.2.7.2.
2020/03/29 6:33 AM
==================
- After a whole year, time to take another look at this. Only problem is, I've forgotten everything! Let's do a
quick pass on the high level stuff to get it consistent with dr_flac. Two mistakes I made with dr_flac: Not having
a low-level API that works on the FLAC frame level; and returning booleans instead of result codes. The first one
is already being handled, but let's update the high level APIs to use result codes.
- High level APIs now updated to return result codes.
- People are going to want to log errors: added dropus_result_description() to retrieve a human readable description
of a result code for logging and error reporting purposes.
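A sketch of the general shape such a function might take. Only the name dropus_result_description() appears in this log, so the result type, the result codes and the description strings below are all made up for illustration and use an example_ prefix.

```c
/* Hypothetical result codes. The real dr_opus codes may differ. */
typedef int example_result;
#define EXAMPLE_SUCCESS         0
#define EXAMPLE_ERROR          -1
#define EXAMPLE_INVALID_ARGS   -2
#define EXAMPLE_BAD_DATA       -3

/* Maps a result code to a human readable string for logging. Never returns NULL. */
static const char* example_result_description(example_result result)
{
    switch (result) {
        case EXAMPLE_SUCCESS:      return "No error";
        case EXAMPLE_ERROR:        return "Unknown error";
        case EXAMPLE_INVALID_ARGS: return "Invalid argument";
        case EXAMPLE_BAD_DATA:     return "Invalid or corrupt data";
        default:                   return "Unknown error";
    }
}
```

Returning a string for every input, including unrecognized codes, means callers can pass the result straight to a logging function without checking for NULL.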
- I've had reports of people having compilation errors with inlining before. Need to bring over the INLINE macro from
dr_flac to ensure we don't get more reports about that.
- I've had people request the ability to customize whether or not public APIs are marked as static or extern in other
libraries, so adding the DROPUS_API decoration. It's better to do this sooner rather than later, because this happened with
miniaudio recently which was a ~40K line library at the time, and it was a nightmare to update!
- There will be a requirement to allocate memory for high level APIs. Support for allocation macros and allocation
callbacks needs to be added now so we can avoid breaking APIs later.
- There have been requests in the past to support loading files from a wchar_t encoded path. Adding support by pulling
in custom fopen() and wfopen() implementations from dr_flac.
- Have not yet compiled on VC6 so let's do a pass on that to make it easier for us later.
- VC6 is now compiling clean. Time to check GCC and Clang in strict C89 mode.
- Getting compiler errors about some Linux specific code for handling byte swaps. Removed and replaced with a cross-
platform solution.
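One way a cross-platform byte swap can be written in strict C89 without compiler intrinsics or platform headers is sketched below. This is an assumption about the general approach, not the code that went into dr_opus, and the example_ names are hypothetical.

```c
/* Detects host endianness at run time by inspecting the first byte of an int. */
static int example_is_little_endian(void)
{
    int n = 1;
    return (*(char*)&n) == 1;
}

/* Reverses the byte order of a 32-bit value using only shifts and masks. */
static unsigned long example_swap_endian_uint32(unsigned long n)
{
    return ((n & 0xFF000000UL) >> 24) |
           ((n & 0x00FF0000UL) >>  8) |
           ((n & 0x0000FF00UL) <<  8) |
           ((n & 0x000000FFUL) << 24);
}

/* Converts a 32-bit value that was loaded from big-endian data to host order. */
static unsigned long example_be2host_32(unsigned long n)
{
    if (example_is_little_endian()) {
        return example_swap_endian_uint32(n);
    }
    return n;
}
```

Because it only uses shifts and masks, the swap behaves the same on every compiler, at the cost of possibly being slower than a dedicated byte swap instruction.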
- From what I can see it looks like our boilerplate is now up to date with dr_flac, dr_wav and dr_mp3. Time to get onto
some actual Opus decoding.
- Looks like a typo in dropus_range_decoder_update() for handling the `fl[k] > 0` case. Changed this, but now getting
a division by zero. Debugging...
- And fixed. The last paragraph in RFC 6716 - Section 4.1.2 is the part I was missing:
After the updates, implemented by ec_dec_update() (entdec.c), the
decoder normalizes the range using the procedure in the next section,
and returns the index k.
- Reading through dropus_stream_decode_packet() to refresh my memory and some stuff needs cleaning up. Important from
now on to distinguish between PCM frames and Opus frames. All variables referring to one or the other need to be
prefixed with "pcm" or "opus", depending on their use. Example: opusFrameCount / pcmFrameCount.
- Added infrastructure for handling CELT and Hybrid frames when the time comes. For now still focused on getting SILK
frames working.
- Reviewing SILK decoding logic to refresh my memory and it looks like I've misinterpreted the encoding for per-frame
LBRR flags. When there's more than one SILK frame, the first LBRR flag is just used to determine whether or not
the per-frame LBRR flags are present. I was _always_ reading the LBRR flags regardless of the value of the primary
LBRR flag.
2020/03/30 7:30 AM
==================
- The decoding of SILK frames needs to be put in its own function in order to handle LBRR and regular SILK frames.
Looking at the spec, there's some complicated branching logic for determining what needs to be decoded. I've put in
a rough draft as a start, but it needs some work. Will continue on this later.