Rooms!

About Impulse Responses and Convolution

What's this all about?

When your are given an audio signal S and would like it to sound as if it were rendered in a specific room, then there are two options:

Create a model of that room by measuring the geometries of all its interiors and the reflection characteristics of all surfaces, and let the computer run a simulation of the propagation of the audio waves of S.
Use an impulse response of the room to transform S like it would have been transformed when rendered in the room. The impulse response can be recorded by playing a test sound, or it can be calculated by a simulation based on a model as in 1.

Here we are concerned with topic 2.

Why and how does this work?

Under usual circumstances (i.e. not extremely high volume levels), the propagation of sound is a linear process. When a sender plays with twice the volume, then the receiver receives twice the volume. When two senders at the same location play two signals, then the receiver receives the sum of these signals.

Consequently, to calculate how a signal S is transformed by a room, resulting in a signal T(S), it is sufficient to know how some 'basic' signal is transformed by the room. If you can decompose S into a linear combination of (appropriately time-translated copies of) the basic signal, then you get T(S) by calculating the corresponding linear combination of the (correspondingly time-translated) transformations of the basic signal.

So what is an impulse response, anyway? And convolution?

When we work with a digital audio signal in the computer, then we deal with a discretised approximation of an analog function, represented by values sampled with some frequency (like 44.100 Hz). In this context, the natural basic signal is one with a single peek value; a function that has value 1 at time 0 and is zero elsewhere. Any sampled function can obviously and trivially be represented as the sum of suitable time-translations of this basic signal.

This test signal with one single peek is kind of a loud click or an explosion sound; this is the impulse. An impulse response (IR) of a given room is how this test signal sounds in this room. To get an impulse response of the room, you record this expolosion-like sound in the room. Which, by the way, is why the icon of Rooms! shows a little bomb, propagating acoustical waves, and a microphone.

The mathematical operation that needs to be performed to calculate T(S) from S and the IR is called convolution. Principally, this is an easy operation, but doing it sufficiently fast and doing it continuously on streaming sound data is not so easy. Happily, you don't have to care for this.

The base case: One sender, one receiver – mono in, mono IR, mono out

Following important fact was not emphasised in the previous explanations: An IR is not a universal property of the room; it depends on the sender location (where the impulse was emitted) and on the receiver location (where the impulse response was recorded):

A single-channel impulse reponse is valid only for one position of the sound source and one position of the listener. The convolution result captures the reflection characteristics of the room, but it does not contain stereo information.

This means, with a single-channel IR you can faithfully reproduce how sound (musical instruments, human voices, whatever) emitted at one location in the corresponding room is perceived by one ear or a mono microphone at one location in the room.

The easy case: One sender, two receivers – mono in, stereo IR, stereo out

To get stereo audio, you need two IRs, or an IR with two channels, which must be recorded with a stereo microphone or two mono microphones placed at a suitable distance. Both IRs / both channels have to be recorded in one go, as even small delay and phase differences are essential for a correct stereo experience. Most IRs that you can find in the web are stereo sound files.

Still this is for only one location of the sound source, and it is not for two listeners at independent locations in the room, but for the two ears of one listener at their slightly different positions in space:

A stereo impulse reponse is valid only for one position of the sound source and one position of the listener. The convolution result captures the reflection characteristics of the room, and it contains stereo information that was coded in the IR.

If you would like to simulate the experience of a listener at a different hearing position, then you need a different pair of mono IRs / a different stereo IR.

Practically, the mono input signal is convolved with the left IR to give the left part of the stereo output, and it is convolved with the right IR to give the right part of the stereo output. In other words, these are two instances of the "base" case, one for each ear.

The ideal case: n senders, two receivers – mono in, n stereo IRs, stereo out

Now assume that you have several senders, for example the musicians in an orchestra. This gives n mono input signals, and for a faithful reproduction you need n IRs. The IR for musician x must be created by emitting the impulse at the location of x and recording the IR at the location of the (still one) listener.

A set of n stereo impulse reponses, each recorded with a different sender location, but for the same receiver location, is valid for n sound sources at the respective positions and one position of the listener. The sum of the convolution results captures the reflection characteristics of the room, and it contains stereo information that was coded in the IRs.

If you would like to simulate the experience of a listener at a different hearing position, then you need a different pair of n mono IRs / different n stereo IRs.

Practically, each mono input signal is convolved with the corresponding stereo IR, and the stereo results are added up to give the stereo output sum signal. This corresponds to n instances of the "easy" case, one for each sound source.

The real case: n senders, two receivers – mono in, not n stereo IRs, stereo out

In real life, you do not have n IRs. As stated, you would need an IR for each musician of an orchestra, and for a pipe organ you would have to record an IR for each of the pipes. And this is just for one listener position. Hence, it is not feasible to faithfully simulate the sound in these cases, at least not by the IR convolution method. Often you don't have n input signals anyway, but only a sum signal.

So what can be done? As explained above, in the "ideal" case you convolve a mono audio signal with a stereo IR to get a stereo signal. But if you handle the sound of an orchestra or a pipe organ that way, this would be as if the orchestra or organ sound would be rendered by a single loudspeaker in the room that belongs to the IR. There would be no spatial information corresponding to the various positions of the musicians or the organ pipes.

Luckily, most input signals are present in stereo. So, as a compromise, it is common to convolve the left input channel with the left IR and the right input channel with the right IR. There is no strict physical justification for this, and it does not give a faithful reproduction of the acoustical situation. However, it is a way to retain spatial information of the source signal while at the same time adding the characteristics of the room. The result is pleasing to the ear as it does not sound flat but spatial. Sometimes this is called "Parallel Stereo", see https://www.liquidsonics.com.

"Parallel Stereo" convolution reverb is a way to retain information about the sound sources' locations if you don't have separate sound signals or don't have multiple IRs. It is not a faithful acoustical simulation, but it gives enjoyable results.

Some systems also support an approach called "True Stereo", see above text in the LiquidSonics knowledge base. The name does not make much sense, and there is no physical justification for this either. This method makes use of two IRs to supply a stereo input signal with room information. Like "Parallel Stereo", it does not give a faithful reproduction of the acoustical situation.

"True Stereo" convolution reverb is another way to retain information about the sound sources' locations if you don't have separate sound signals or don't have multiple IRs. It is not a faithful acoustical simulation, but it gives enjoyable results. A pair of stereo IRs is needed.

The convolution modes of Rooms!

In Rooms! up to version 6.7, there are the following convolution modes, where input and output as well as the IR always comprise two channels (which can have equal data in case of a mono signal):

Stereo: Convolves the left input channel with the left IR channel and the right input channel with the right IR channel, and outputs the results to the left and the right channel, respectively. If both the input and the IR are mono, then so is the output, and we have the "base" case. If the input is mono, and the IR is stereo, then the output is stereo, and we have the "easy" case. If both the input and the IR are stereo, then the output is stereo, and we have the "Parallel Stereo" case. It is possible to realise the "ideal" case of n senders by starting several instances of the Rooms Audio Unit in a host DAW app, selecting a suitable stereo IR in each of these and supplying each instance with a mono input signal.
Mono L: Convolves the left input channel with the left IR channel and outputs the result to both channels. May be useful in a DAW setup.
Mono R: Ditto for the right channel.
Mono LR: This is new in Rooms! 6.7. It creates a mono signal by calculating the average of the left and the right input channel before it proceeds like the Stereo mode. May be useful when the input is stereo but you want to drop its spatial information in favour of the spatial information of the IR.

Concerning the required amount of processing: Mono L and Mono R each require one convolution, while Stereo and Mono LR need two convolutions. As the mixing step of Mono LR does not matter much, Stereo and Mono LR are about twice as needy as Mono L and Mono R.

The Stereo as well as the Mono LR modes can also be used to realise the "True Stereo" case by using two instances of the Audio Unit in a DAW, selecting a suitable stereo IR in each of them, and feeding one with the left and the other with the right channel of the input. Ideally, this allows some kind of parallel computation.

Rooms! 6.8 shall have a more flexible and powerful approach and will also allow to realise "True Stereo" within one instance of the app or Audio Unit, which does not involve the ability to parallelise but increases the ease of use.