A spatial localization actor

Background

Spatial localization of a sound source in 3D is extremely useful for VR and interactive applications in general. Besides simply adding realism to virtual worlds, spatialized sound can provide sonic cues that convey spatially-interpreted information in situations where visual cues are lacking, insufficient, or simply unavailable. For instance, in a VR application, a user may be navigating through a dataset and need to determine the next navigational motion relative to the position of a reference point in space. However, the user's full visual acuity and attention may already be fully occupied in the continuous interpretation of the data. Here, a localized sound beacon can function much like a lighthouse, guiding the user sonically toward the new navigational point.

Also, spatial cues can be used to dissociate, or disambiguate, groups of sound sources with similar sonic character, such as when one is in the midst of a group of people talking. By spreading out each sound source into a unique position in sonic space, the sound sources may then be perceived individually, even if they are almost identical in character. This application is highly useful for telecollaborative sessions involving three or more people. Here, the conversational flow may continue freely, even if more than one person is speaking.

In either case, by exploiting the natural ability of the auditory system to perceive and separate individual sound sources by their spatial position, completely new functionality is enabled. And, at the very least, virtual objects can be made to emit sounds that actually seem like they are coming from their location in virtual space.

Technical: Algorithm

The LocalizeActor's processing algorithm is implemented to balance localization performance against computational complexity. Primary consideration was given to the premise that vss algorithm plug-ins should generally be able to run in real time without requiring specialized hardware to do so. Close behind is the preference that at least two or three instances of the algorithm be able to run in real time. This ruled out HRTF-based processing, at least given the current state of available hardware platforms. HRTF algorithms allow sounds to be localized in full 3-D space, including overhead and behind, using just two loudspeakers or headphones, but at the expense of heavy processing (specifically, long convolutions), which currently requires dedicated DSP cards.

Therefore, a simpler type of processing was employed, using a modified cross-term cancellation method. This method exploits the sensitivity of the auditory system to inter-aural amplitude and delay differences to "steer" the sound's perceived direction and distance of emanation. It can spread the perceived planar soundfield emanating from a pair of speakers far beyond the segment between the speakers, to a full plane in front of the listener. With two more speakers added behind the listener, the soundfield plane extends in every azimuth direction around the listener. With eight speakers positioned, for example, at the inside vertices of a cubic space centered around the listener, the soundfield becomes a volume, so that sounds may be differentiated overhead as well as underneath. The algorithm scales in complexity with the number of output channels, and in every configuration provides a dramatic improvement over simple amplitude panning at a relatively small computational expense compared to full-blown HRTFs.
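To make the cross-term cancellation idea concrete, here is a minimal sketch of a recursive canceller for one stereo speaker pair. It illustrates the technique rather than the vss implementation; the function name, sample rate, and defaults are assumptions, with the gain of -0.6 and delay of 0.2 ms borrowed from the tuning defaults listed under Usage below.

    import numpy as np

    def crossterm_cancel(left, right, a=-0.6, delay_ms=0.2, sample_rate=44100):
        """Recursive cross-term canceller for one stereo speaker pair (sketch).

        Each output channel recursively receives an attenuated, delayed copy of
        the *opposite* output channel, exploiting inter-aural amplitude and
        delay sensitivity to widen the perceived soundfield beyond the segment
        between the two speakers.  Stable for |a| < 1.
        """
        d = max(1, int(round(delay_ms * 1e-3 * sample_rate)))  # delay in samples
        out_l = np.zeros(len(left))
        out_r = np.zeros(len(right))
        for n in range(len(left)):
            fb_l = out_r[n - d] if n >= d else 0.0  # delayed opposite-channel output
            fb_r = out_l[n - d] if n >= d else 0.0
            out_l[n] = left[n] + a * fb_l
            out_r[n] = right[n] + a * fb_r
        return out_l, out_r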

The algorithm is implemented according to the following system block diagram:

The input signal arrives from the output of a source generator or processor actor and is processed by the localization algorithm, in this case into four output loudspeaker channels. (The four-channel case represented here is easily reducible to the two-channel case by deleting the lower two loudspeaker channels. It may also be extended to eight channels by repeating the portion to the right of the panner for the lower four loudspeaker channels.) The localization is controlled by two parameters, Distance and Direction. The Distance parameter varies from 0 to 1, where 0 corresponds to no distance (i.e., the object is in your head, between your ears), and 1 corresponds to the object being at infinity, or at least at the sonic equivalent of the "clipping plane". Direction varies between -1 and 1, with 0 corresponding to "straight ahead", -0.5 to "straight left", +0.5 to "straight right", and +/-1 to "directly behind".

[Figure: mapping of the Direction parameter around the listener]
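For reference, the Direction convention shown in the figure amounts to a simple linear mapping onto azimuth; a one-line sketch (the function name is purely illustrative):

    def direction_to_azimuth_deg(direction):
        """Map the normalized Direction parameter [-1, 1] onto azimuth in
        degrees: 0 -> 0 (straight ahead), -0.5 -> -90 (left),
        +0.5 -> +90 (right), +/-1 -> +/-180 (directly behind)."""
        return max(-1.0, min(1.0, direction)) * 180.0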

The input first passes through a distance modeler Hd, set up to increasingly attenuate the amplitude and dampen the high frequencies of the input as Distance increases. This promotes the illusion that a sound source is receding as the Distance grows. The signal is then distributed through a panner, in response to Direction, into 4 channels (or 2 or 8, depending on how vss is initialized). The panner is a constant-power type, so that the perceived loudness of the signal is independent of Direction. The front and rear left-right signal pairs are then passed through recursive cross-term cancelers to perform the stereo-to-planar spreading. The amount of spreading is controlled by Distance in the 4- and 8-channel cases, and by both Distance and Direction in the 2-channel case. The spreading is reduced with increasing Distance, until it disappears at a Distance of 0.5. This produces a strong near-field distance cue which varies continuously from a between-the-ears experience at zero Distance to, at Distance = 0.5, a definite positioning of the source along a circle circumscribing the speaker locations in space. (With eight channels, the source appears along the surface of a circumscribing sphere.) For two channels, the circle is "squashed", and all localization behind the listener is made to pass through the head. Processing through Hd then becomes the dominant distance cue at Distances greater than 0.5.
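The sketch below illustrates the first two stages of this chain: a crude distance modeler standing in for Hd, and a constant-power panner over four speakers. The speaker placement, filter form, and roll-off curves are assumptions chosen for illustration, not the actual vss filters.

    import numpy as np

    # Assumed 4-speaker layout in normalized Direction units (illustrative,
    # not taken from the vss documentation): front-left, front-right,
    # rear-right, rear-left.
    SPEAKER_DIRS = [-0.25, 0.25, 0.75, -0.75]

    def distance_model(x, distance, sample_rate=44100):
        """Stand-in for Hd: attenuate and low-pass the input more strongly
        as Distance (0..1) grows, so the source seems to recede."""
        gain = 1.0 - 0.9 * distance                     # amplitude roll-off
        cutoff_hz = 500.0 + 19500.0 * (1.0 - distance)  # darker when distant
        alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
        y, state = np.zeros(len(x)), 0.0
        for n in range(len(x)):
            state += alpha * (x[n] - state)             # one-pole low-pass
            y[n] = gain * state
        return y

    def constant_power_gains(direction, speaker_dirs=SPEAKER_DIRS):
        """Constant-power panning gains (sum of squares == 1) for one source
        Direction in [-1, 1], panning between the two adjacent speakers."""
        order = sorted(range(len(speaker_dirs)), key=lambda i: speaker_dirs[i])
        dirs = [speaker_dirs[i] for i in order]
        gains = [0.0] * len(speaker_dirs)
        for k in range(len(dirs)):
            lo, hi = dirs[k], dirs[(k + 1) % len(dirs)]
            span = (hi - lo) % 2.0           # arc length (Direction wraps at +/-1)
            offset = (direction - lo) % 2.0  # source position along that arc
            if offset <= span:
                frac = offset / span
                gains[order[k]] = np.cos(frac * np.pi / 2.0)
                gains[order[(k + 1) % len(dirs)]] = np.sin(frac * np.pi / 2.0)
                break
        return gains

In this sketch, each output channel would be its panning gain times the Hd-processed signal, with the front and rear left-right pairs each feeding a cross-term canceller like the one sketched earlier.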

Usage

To use the localization actor, load localize.so and create an actor of type LocalizeActor.

LocalizeActor messages

In addition to the messages understood by all processor actors, the LocalizeActor understands the following messages:

LocalizeActor handler messages

In addition to the messages understood by all handlers, the handler for the LocalizeActor algorithm understands the following messages:
SetInput hSound hHandler
Set the input to come from the handler hHandler.
SetInput hSound
Set the input to zero (i.e., silence the input).
SetDirection hSound x time
Set the perceived direction of the sound source, relative to the listener, to x. Range is normalized to [-1,1], with 0 mapping to straight ahead, -0.5 mapping to directly left, +0.5 mapping to directly right, and +/-1 mapping to directly behind. If time is specified, modulate to the new direction over the specified duration.
SetDistance hSound x time
Set the perceived distance of the sound source, relative to the listener, to x. Range is normalized to [0,1], with 0 mapping to "no distance", i.e., the sound source is "in the head, between the ears", and 1 mapping to infinite distance, i.e., the object is at the clipping plane. If time is specified, modulate to the new distance over the specified duration.
For tuning purposes only
SetA hSound x time
Tune the cross-cancellation gain, setting it to x. Range is [-1,1]; the default is roughly -0.6. If time is specified, modulate to the new gain over the specified duration.
SetT hSound x time
Tune the cross-cancellation delay, setting it to x milliseconds. Range is [0,1] milliseconds; the default is roughly 0.2. If time is specified, modulate to the new delay over the specified duration.
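For intuition, these two tuning parameters correspond, in the illustrative canceller sketch from the Algorithm section above (not in the actual vss internals), to the feedback gain a and the feedback delay. For example:

    import numpy as np

    # Assumes the crossterm_cancel() sketch defined in the Algorithm section.
    sample_rate = 44100
    t = np.arange(sample_rate) / sample_rate
    tone = 0.1 * np.sin(2.0 * np.pi * 440.0 * t)  # 1-second test tone

    # Hard-pan the tone to the left speaker, then widen it with the canceller
    # using the documented defaults: gain ~ -0.6, delay ~ 0.2 ms.
    wide_l, wide_r = crossterm_cancel(tone, np.zeros_like(tone),
                                      a=-0.6, delay_ms=0.2,
                                      sample_rate=sample_rate)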