Fork me on GitHub


Generic Audio Concepts

Audio sources, instances, voices, mix busses.

Audio Source and Instance

SoLoud uses two kinds of classes for the sounds. Audio sources contain all the information related to the sound in question, such as wave sample data, while audio instances contain information about an "instance" of the sound.

As an analogue, if you think of an old vinyl record, the audio source is the record, and you can put as many playheads - the instances - on the record. All of the playheads can also move at different speeds, output to a different pan position and volume, as well as different filter settings.

Audio sources and back-ends.

Back end

SoLoud itself "only" performs audio mixing, resource handling and bookkeeping. For it to be useful, it needs one or more sound source and a back end. Some other audio systems use the term 'sink' for the back-ends (as in audio source and audio sink).

Examples of back-ends would be winmm, oss, portaudio, wasapi and SDL audio. SoLoud comes with several back-ends, and is designed to make back-ends relatively easy to implement.

Different back-ends have different characteristics, such as how much latency they introduce.


One audio stream can contain one or more channels. Typical audio sources are either mono (containing one channel) or stereo (containing two channels), but surround sound audio sources may practically have any number of channels.

In module music (such as mod, s3m, xm, it), "channel" means one of the concurrent sounds played, regardless of speaker configuration. Confusing? Yes.


SoLoud can play audio from several sound sources at once (or, in fact, several times from the same sound source at the same time). Each of these sound instances is a "voice". The number of concurrent voices is limited, as having unlimited voices would cause performance issues, as well as lead to unnecessary clipping.

The default number of concurrent voices - maximum number of "streams" - is 16, but this can be adjusted at runtime. The hard maximum number is 4095, but if more are required, SoLoud can be modified to support more. But seriously, if you need more than 4095 sounds at once, you're probably going to make some serious changes in any case.

If all channels are already playing and the application requests another sound to play, SoLoud finds the oldest voice and kills it. Since this may be your background music, you can protect channels from being killed by using the soloud.setProtect() call.

SoLoud also supports virtual voices, so things are a bit more complicated - basically you can have thousands of voices playing, but only the most audible ones are actually played.

Virtual Voices

SoLoud lets you play way more voices than will actually be played to the user. This can be useful if you, for instance, populate a 3d world with hundreds of audio sources.

SoLoud will sort the voices based on their current volume level and only mixes the most audible sounds. The number of active voices can be set at runtime. Protected voices are always played.

The maximum number of virtual voices is currently 1024, but can be increased up to a hard limit of 4095 by editing soloud.h and recompiling. Higher virtual voice counts than that will require some refactoring.

Voice Group

Sometimes it is important to be able to command several voices at the same time so that they are synchronized; for instance, when cross-fading between two versions of the same song.

Problem unpausing two voices. Delay may vary, 40ms used as an example.

Even if you try to unpause the two voices at the same time, it's possible, due to the multithreaded nature of audio, that the audio engine interrupts you between these two calls and your sounds get unpaused to different audio buffers.

SoLoud's solution to this are voice groups. Voice groups can be commanded the same way as single voices, but instead of affecting just one voice, SoLoud performs the command on all of the voices in the group in one atomic operation.


Audio hardware always has a limited dynamic range. If you think of a signed 16-bit variable, for instance, you can only store values from -32768 to 23767 in it; if you try to put values outside this range in, things tend to break. Same goes for audio.

Results of different clippers.

SoLoud handles all audio as floats, but performs clipping before passing the samples out, so all values are in the -1..1 range. There's two ways SoLoud can perform the clipping; the most straightforward is simply to set all values outside this range to the border value, or alternatively a roundoff calculation can be performed, which "compresses" the loud sounds. The more quiet sounds are largely unchanged, while the loud end gets less precision. The roundoff clipper is used by default.

The roundoff clipper does, however, alter the signal and thus "damages" the sound. A more proper way of doing things would be to use the basic clipper and adjust the global volume to avoid clipping. The roundoff clipper is, however, easier to use.


The real world has continuous signals, which would require infinite amount of storage to store (unless you can figure out some kind of complicated mathematical formula that represents the signal). So, we store discrete samples of signals instead. These samples have traditionally been 8, 16 or 24 bit, but high-end audio is tending towards floating point samples.

SoLoud also uses floating point samples internally. First and foremost, it makes everything much simpler, and second, modern computing devices (even mobile!) have become fast enough that this is not really a performance issue anymore.

Floating point samples also take more space than, for instance, 16 bit samples, but memory and storage sizes have also grown enough to make this a feasible approach. Nothing stops the audio sources from keeping data in a more "compressed" format and performing on-the-fly conversion to float, if memory requirements are a concern.

Sample Rate

The sample rate represents the number of samples used, per second. Typical sample rates are 8000Hz, 22050Hz, 44100Hz and 48000Hz. Higher the sample rates mean clearer sound, but also bigger files, more memory and higher processing power requirements.

Due to limitations in human hearing, 44100Hz is generally considered sufficient. Some audiophiles disagree, but then again, some audiophiles buy gold-plated USB cables.


Hertz, SI unit of frequency. 0.1Hz means "once per 10 seconds", 1Hz means "once per second", 10Hz means "10 times per second", and 192kHz means "192000 times per second".

Play Speed

In addition to a base sample rate, which represents the "normal" playing speed, SoLoud includes a "relative play speed" option. This simply changes the sample rate. However, if you replace your sounds with something that has a different "base" sample rate, using the relative play speed will retain the effect of playing the sound slower (and lower) or faster (and higher).

Relative Play Speed

SoLoud lets you change the relative play speed of samples. Please note that asking for a higher relative play speed is always more expensive than a lower one.

Playing samples at higher or lower sample rate than they're intended can also cause resampling issues.


SoLoud has to perform resampling when mixing. In an ideal case, all of the sources and the destination sample rate are the same, and no resampling is needed, but this is often not true.

Different resamplers (Point / linear / catmull-rom). Red is the ideal signal.

Currently, SoLoud supports "linear interpolation", which calculates linear interpolation of samples, as well as "point sample" resampling, which means it simply skips or repeats samples as needed.

Picking the resampler is done by editing the soloud.h file.

The linear interpolation resampler is used by default.

Higher quality resamplers are planned.


Where the sound is coming from in the stereo sound, ranging from left speaker only to right speaker only. SoLoud uses an algorithm to calculate the left/right channel volume so that the overall volume is retained across the field. You can also set the left/right volumes directly, if needed.


SoLoud uses throwaway handles to control sounds. The handle is an integer, and internally tracks the voice and sound id, as well as an "uniqueness" value.

If you try to use a handle after the sound it represents has stopped, the operation is quietly discarded (or if you're requesting information, some kind of generic value is returned, usually 0). You can also query the validity of a handle.


Audio latency generally means the time it takes from triggering a sound to the sound actually coming out of the speakers. The smaller the latency, the better.

Unfortunately, there's always some latency. The primary source of latency (that a programmer can have any control over) is the size of audio buffer. Generally speaking, the smaller the buffer, the lower the latency, but at the same time, the smaller the buffer, the more likely the system hits buffer underruns (ie, the play head marches on but there's no data ready to be played) and the sound breaks down horribly.

Assuming there's no other sources of latency (and there quite likely is), with 2048 sample buffer and 44100Hz playback, the latency is around 46 milliseconds, which is tolerable in most computer game use cases. A 100ms latency is already easily noticeable. In music terms, when playing drums, 40ms is too much, but when playing slowly evolving atmospheric pads, 200ms is fine.


Audio streams can also be modified on the fly for various effects. Typical uses are different environmental effects such as echoes or reverb, or low pass (bassy sound) / high pass (tinny sound) filters, but basically any kind of modification can be done; the primary limitations are processor power, imagination, and the developer's skill in digital signal processing.

SoLoud lets you hook several filters to a single audio stream, as well as to the global audio output. By default, you can use up to four filters, but this can be easily changed by editing SoLoud.h file and rebuilding the library.

SoLoud also support STFT filters, where the samples are converted to frequency domain for modification and back for playback.

Mixing Bus

In addition to mixing audio streams together at the "global" level, SoLoud includes mixing busses which let you mix together groups of audio streams. These serve several purposes.

The most typical use would be to let the user change the volume of different kinds of audio sources - music, sound effects, speech. In this case, you would have one mixing bus for each of these audio source groups, and simply change the volume on the mixing bus, instead of hunting down every sound separately.

When using environmental effects filters, you most likely won't want the background music to get filtered; the easiest way to handle this is to apply the filters to the mixing bus that plays the sound effects. This will also save on processing power, as you don't need to apply the environmental audio filters on every sound effect separately.

It's also possible that you have some very complex audio sources, such as racing cars. In this case it makes sense to place all the audio streams that play from one car into a mixing bus, and then adjust the panning (or, eventually, 3d position) of the mixing bus.

Additional feature of the mixing busses in SoLoud is that you can request visualization data from a bus, instead of just from the global scope.


SoLoud also contains a special kind of audio source called Queue. This can be used to queue other audio sources. Once one stops, the next one starts playing.

This can be useful when chaining sounds, like having a never ending song by queuing random patterns.

Copyright©2013-2020 Jari Komppa