#398: Audio Objects for Narrative 360 VR with Dolby Atmos

Sound is probably the second most important part of creating a compelling immersive VR experience, but it's also usually put off to the end as an afterthought. The visual system is so dominant that this is not a huge surprise, but I thought that it'd be worth focusing on trends in immersive sound over the next four episodes of the Voices of VR podcast, starting with Dolby Atmos' solution for creating spatialized sound with audio objects.

Just as Unity and Unreal are able to create 3D sound environments for interactive games, Dolby Atmos has a Pro Tools plug-in with a 3D interface that allows you to mix 3D audio objects for narrative, 360-video content. I had a chance to catch up with Dolby Labs' Director of Virtual and Augmented Reality Joel Susal to talk about their audio spatialization solution, how it's being used in both Hollywood blockbusters and cutting-edge narrative VR experiences, and why it's important to have granular control over individual audio object files rather than just relying upon four-channel ambisonic sound fields for 360-video productions.

LISTEN TO THE VOICES OF VR PODCAST

I’ve explored audio in VR in previous episodes including in episode #222 with Jaunt’s Arthur van Hoff, episode #80 with AJ Campbell, and in episode #232 with Two Big Ears & their Spatial Workstation, which has since been bought by Facebook.

Here’s a trailer with a demo of a Dolby Atmos sound mix

Here are the highlights from the Sundance panel on Immersive Sound for Virtual Reality

Subscribe on iTunes

Donate to the Voices of VR Podcast Patreon

Music: Fatality & Summer Trip

Rough Transcript

[00:00:05.452] Kent Bye: The Voices of VR Podcast. My name is Kent Bye, and welcome to The Voices of VR Podcast. So on today's episode, I'm going to kick off a brief little mini series for the next four episodes, really diving deep into audio within virtual reality. So I'm going to start off with Dolby Atmos and what Dolby Atmos is and what it can do. So I have Joel Susal, who is the director of virtual and augmented reality technologies at Dolby Labs. So I've talked on the podcast before about different solutions with ambisonics and talked about different plugins that are available for game engines. But if you're producing audio for a movie, then there's a whole different process of being able to mix and change that audio later. And that's where Dolby Atmos really comes in. And so we'll be talking about their solution, how you're able to work with it, and how to just kind of wrap your head around the differences between doing audio for virtual and augmented reality versus just a 2D experience. So that's what we'll be covering on today's episode of the Voices of VR podcast. But first, a quick word from our sponsor. Today's episode is brought to you by The Virtual Reality Company. VRC is at the intersection of technology and entertainment, creating interactive storytelling experiences. The thing that's unique about VRC is that they have strategic partnerships with companies like D-BOX, which is a haptic chair that takes immersion and presence to the next level. So they're making these digital out-of-home experiences for movie studios and original content. For more information, check out thevrcompany.com. So this interview with Joel Susal happened on the expo floor of GDC, which took place in mid-March in San Francisco. So with that, let's go ahead and dive right in.

[00:01:54.672] Joel Susal: Hi, my name is Joel Susal. I'm the director of virtual and augmented reality at Dolby Labs, which means I run our virtual and augmented reality business. And what we've done is we've taken our Dolby Atmos technology that's been used to create nine of the top ten grossing films in the last two years, and we've adapted those same tools and workflows for use in linear virtual reality content. What I mean by linear virtual reality content is that it is a piece that is timeline-based. So it's predictable what's going to be happening, say, 30 seconds in. What's not predictable necessarily is where the viewer is looking. And so these types of experiences can be captured with immersive cameras or they can be generated with game engines. But what our solution brings is a maturity that's been used by the best content creators arguably in the world by virtue of the success in cinema. And the technology at its core is based on what are called audio objects. And audio objects are kind of what you'd think they are, which is rather than defining sounds how people used to in terms of channels, like a 5.1 where you have a left channel, a right channel, a center channel, etc., Dolby Atmos instead allows you to define sounds independent of those fixed locations, and rather you can place the sounds arbitrarily at any location wherever those sounds belong. And more importantly than just the precision of the audio object is the notion of being able to also ascribe characteristics or behaviors to those audio objects. So you can define things like trajectories or size, but also you can define in VR whether a given object should head track or not. So, to give an example, most things within a VR movie are going to be scene-based, meaning they're going to be emitted from something that's happening within that movie around the viewer. But oftentimes, content creators also want to include things like a musical score, or the voice of a narrator, or the voice of the good angel on your right shoulder and the evil demon on your left shoulder. And those voices or musical scores shouldn't change as you move throughout the environment. So that's an example of how you might have some objects that are scene-based and some that are head-based. And with an object-based solution like Dolby Atmos, it's very simple for someone to specify whether a given object is head-based or scene-based. And that becomes important also when you're trying to deliver this experience, because we also have a delivery technology called Dolby Digital Plus that takes the output of the Atmos mixing session and transmits it to the playback device in a very low bitrate, low complexity, but very high quality fashion. For example, if you have a piece of content where you want to mix head-relative objects with scene-relative objects, right? You're, say, at a trade show, and there are people walking by, but you want to add the voice of a narrator that, again, shouldn't change as you turn your head. Using a non-object-based solution, you would need to send two audio streams. You'd need to send the stream that rotates with your head and the stream that doesn't rotate with your head. And so now all of a sudden, well, you're sending two streams, so that's doubling the bitrate that you're sending. You also have to decode two streams, so now you're doubling the computational load on the playback device.
And also, if that given piece of content is going to be played back on multiple different services, well, each service might implement the mix of those two decodes differently. So now you have a potential inconsistency. So as a content creator, you want to be sure that when you create this piece of art that incorporates both head-relative and scene-relative objects, that when it's played back, it's going to be played back consistently in the way that you've curated. And so the Dolby Atmos solution allows you to specify those characteristics on a per-object basis. That's literally an additional bit of metadata that gets attached to each object, so it doesn't bloat the bitrate. It does not impact computational complexity on the decode, and it's always decoded and rendered by a certified Dolby implementation, so you know exactly what you're getting, which is the level of quality that's been part of why Hollywood has embraced our technology so vehemently over the years. So our tools now, they pull from the lineage of cinematic Dolby Atmos, but they've been adapted for virtual reality. You can audition a binaural virtualized version of your mix on the fly as you're mixing. You can move the objects on the fly. We have multiple different panner UIs, so you can do a top-down view or an equirectangular view. And more importantly, what we have is an end-to-end solution that allows us to implement new features for content creators and deliver those features, again, without bloating the bitrate, without bloating the computational complexity, and while promoting consistency, essentially immediately. So with this end-to-end solution, we enable new and important features for content creators and simultaneously enable the proper playback of those features instantaneously with our playback partners.
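
To make the audio-object idea concrete, here's a minimal sketch of what an object-based mix with per-object head-locked metadata might look like. This is purely illustrative Python of my own, not Dolby's actual format or the Atmos API; the data layout, names, and axis conventions are all assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One sound source in an object-based mix (illustrative, not Dolby's format)."""
    name: str
    position: tuple        # (x, y, z) in the scene, listener at the origin
    head_locked: bool      # True: stays fixed relative to the head (narrator, score)

def rotate_yaw(pos, yaw_rad):
    """Rotate a point around the vertical (z) axis."""
    x, y, z = pos
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return (c * x - s * y, s * x + c * y, z)

def render_positions(objects, head_yaw_deg):
    """Where each object should appear relative to the listener's current head pose.

    Scene-locked objects are counter-rotated by the head yaw; head-locked
    objects ignore the head pose entirely -- that is the single bit of
    metadata the interview describes.
    """
    yaw = math.radians(head_yaw_deg)
    out = {}
    for obj in objects:
        out[obj.name] = obj.position if obj.head_locked else rotate_yaw(obj.position, -yaw)
    return out

mix = [
    AudioObject("motorcycle", (0.0, 5.0, 0.0), head_locked=False),  # in the scene, ahead
    AudioObject("narrator",   (0.3, 1.0, 0.1), head_locked=True),   # pinned to the head
]
print(render_positions(mix, head_yaw_deg=90))
# The motorcycle swings around as you turn your head; the narrator stays put.
```

The point is that the head-locked versus scene-locked choice travels with each object as a single flag of metadata, rather than requiring a second pre-mixed audio stream.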

[00:07:09.122] Kent Bye: And so I understand how ambisonic microphone works. I kind of get how Unity will, in a scene, be able to put sound sources in a 3D space and those become those objects where, you know, the sound is coming from those. And so with an ambisonic microphone, it's like four channels and it's capturing a sound field and you can then translate that into the head transfer function. So as you turn your head, you're kind of mimicking what that sounds like. But object-based, within a game engine editor, you're actually kind of taking individual sound files and putting them in a specific spatial location that then, when you turn your head, then it's coming in. So I guess I'm still a little confused as to, like, how are you capturing the audio and then what are you doing to spatialize it?

[00:07:50.825] Joel Susal: Sure. Good question. So, yeah, an ambisonic microphone is very nice because you can capture, quite simply, a 3D sound field, as you point out. One of the challenges of ambisonics is that once you've captured that sound field, for the most part, it's very opaque. What that means is that it's hard to dissect. It's essentially like a cake that's been baked. You can't dissect the ingredients. You can't get the flour out of the cake once it's been baked. And so our solution allows ingestion of ambisonics. Typically, we use that to populate, say, the ambience or the background of a given scene. But what we found and what content creators are telling us again and again is that they're generally not satisfied with solely the ambisonic capture, because oftentimes there's going to be someone who sneezes in the background or the motorcycle or the siren that goes by. And all of a sudden, while you may think that the viewers want to feel like they were there, the end product that you want to create, especially if your goal is to create a premium experience, is to make it better than being there. So, you know, at a basketball game, for example, sitting at half court. Well, you'd think maybe you want to feel like you were actually sitting at half court. But in reality, what you want is you want to hear the swoosh of the ball in the net every time, which you don't always hear from half court. And you don't want to hear necessarily the people talking behind you about something that was unrelated to the game. And so, what you want from a premium experience is actually a fantastical or a super-real, hyper-real representation. And that's where the art and technology come into play. And so, a simple ambisonic microphone capture solution is insufficient, because you want to put the tools in the hands of someone who's going to craft that hyper-real experience. You want to put them in front of the right tools, and that's where Dolby and Dolby Atmos really shine. We build those tools that are used by the best content creators to enable those superlative experiences. With regard to game engines, they're kind of the opposite, in a way, of ambisonics, because, as you point out, game engines are also based on audio objects. But oftentimes the way a game engine works is that you drop a sound into some kind of rules engine, a game engine or an audio engine, and those engines oftentimes implement physics engines or things that try to mimic reality. And that's great for a game, but again, oftentimes, if you want to create a premium experience or a highly crafted experience, some content creators are not satisfied or not comfortable leaving it up to the game engine and the physics rules that are implemented in the game engine to render the final product. In many cases, they want more control. And that's, again, where the Dolby Atmos technologies shine, because those technologies have given them the same control they've required when creating movies and episodic television. These are the same types of people, by the way, who've won Oscars for sound design. And they've won those because of the highly crafted, highly artistic output of what they create. And the sophistication of the tools required is where we've uniquely added value.

[00:11:01.437] Kent Bye: And so, you know, how do you actually capture the audio of the scene then? Is there someone with a boom mic that's kind of going around and recording each of the actors? Do they have lapel mics? You know, do they need to have the position of where they're at? Or, you know, talk about some of the process of actually capturing the audio to ingest into the Dolby Atmos system.

[00:11:18.069] Joel Susal: Sure. So, of course, yes, you can use an ambisonic microphone. We recommend, in addition, and of course this depends on the type of content that you're capturing, but as pristine a capture as you can. So, yes, lapel mics are great. If it's a musical scene, you know, tapping into the soundboard is, I'd say, ideal. And so, what Atmos provides is, essentially, once you've captured all of those pristine feeds, it provides the tool, the canvas, if you will, for the sound designer to then place those sounds where they belong, accentuate or diminish them, and define other metadata characteristics about how they should perform when playing back. And it does so in a way that allows real-time listening and auditioning. It provides essentially a guaranteed level of playback, so you know exactly what it's going to sound like on any device that bears the Dolby Atmos playback solution. And it does that in a way that we, over time, we will again introduce new features that will immediately be available on all of our playback partners.

[00:12:20.580] Kent Bye: Now, I imagine that because these are objects, I kind of imagine them as being on a degree from 0 to 360 degrees from you, but there's also a distance for how far away it is, and then there's also a distance for how high or low it is. So are you able to control the XYZ coordinates of each of these sound objects?

[00:12:38.404] Joel Susal: Absolutely. So that's fundamental and at the core of Dolby Atmos. When Dolby Atmos first launched in the theater, it brought the upper hemisphere for the first time with the inclusion of what we call elevation speakers, speakers overhead. And so since its launch, Dolby Atmos has always supported what we call positive Z or height in terms of objects. With regard to virtual reality, we've now completed that sphere. So in addition to elevation, you also have the potential for underfoot sounds. And yes, to answer your question, distance is also another critical characteristic as you want to build these immersive soundscapes. And so our solution provides full flexibility in terms of positioning. Beyond just the positioning, though, there are other characteristics that you can define on a per-object basis, such as the reverb model that you'd like to use. In other words, in some cases, you may want to bypass reverb altogether. In some cases, you may want to give a drier sound that is akin to a sound being closer to you. And so our solution, again, the flexibility that it provides is for us to be able to define on a per-object basis all of these parameters that content creators are demanding.
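
As a rough illustration of the positioning being described here, this is how a direction-plus-distance description of a sound maps to x/y/z coordinates, including negative elevation for underfoot sounds. The function and axis conventions are my own assumptions for illustration, not how the Atmos tools actually expose object positions.

```python
import math

def spherical_to_xyz(azimuth_deg, elevation_deg, distance):
    """Convert a 'direction plus distance' description of a sound into x/y/z.

    azimuth_deg:   0 = straight ahead, 90 = to the right, 180 = behind
    elevation_deg: +90 = directly overhead, -90 = directly underfoot
    distance:      metres from the listener
    (Axis conventions are assumed for illustration; any object-based mixer
    defines its own.)
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)   # right
    y = distance * math.cos(el) * math.cos(az)   # forward
    z = distance * math.sin(el)                  # up (negative = below)
    return (x, y, z)

print(spherical_to_xyz(45, 30, 2.0))   # up and to the front-right
print(spherical_to_xyz(0, -90, 1.0))   # one metre directly underfoot
```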

[00:13:53.292] Kent Bye: Now in VR, you have the capability of kind of modeling a full 3D model of a room space. Is Dolby Atmos also doing like reflections off of walls? And are you actually modeling to the extent of the size of the room that you're in?

[00:14:06.643] Joel Susal: Yeah, of course. So we do have a room model. It's part of our rendering solution. And there are, again, parameters that help to define that. I'll point out another interesting use case of Dolby Atmos that's uniquely enabled by our solution, which is the fact that because it is built on top of the wider Dolby Atmos technology, it enjoys a lot of the momentum that may not necessarily have been VR related, but is very compelling nonetheless. So for example, all of our tools for virtual reality are built and optimized for headphone playback. But the same mix that is optimized for headphones is also compatible, and you can play it back on any Atmos-equipped AVR that you can buy today at your favorite retailer. What that means is that while you have this mix that's optimized for headphones, if you wanted to enjoy that same piece of content either in a high-end home theater system or a high-end gaming room, or a VR cave or a retail installation, either like at the lobby of the movie theater or in the mall, you can essentially, for free and without any additional mixing, upgrade your experience to an over-speaker experience. And I know it sounds a little funny in VR, but there are a few critical benefits to playing back over speakers. One, frankly, is that it's just much louder than headphones. You can move a lot more air with a subwoofer than you can with any pair of headphones. And so that visceral feeling and presence that, frankly, you get when you're at a concert and you can feel the bass is actually much more presence-inducing than a headphone bass experience. So, number one, it's loud, which is important. Number two, because you're no longer blocking your ears, you are able to, in fact, socialize with others who are in the same room as you. And in fact, at CES, we partnered with Jaunt and showed simultaneous playback: there were five subjects in a room, all surrounded by the same set of speakers and all watching the same synchronized movie, yet each subject, each person, could choose where to look. So they were choosing visually where they wanted to look, yet they were all surrounded by the same set of speakers. And if you think about it, no matter which way someone is looking, the sound is going to be consistent for them and accurate and correct for them. In other words, if someone's looking at the northeast corner of the room, well, the speaker that's in that corner of the room is going to be front and center for them. But if there's someone else that chooses to look in the exact opposite direction, well, that same speaker that was front and center for the first subject is now directly behind the other subject. So what it allows is simultaneous viewing of a single piece of content without your ears being blocked. So now all of a sudden you have the beginnings of what could be a physically proximate but social experience, a shared experience in VR. And then finally, I'd say that the other benefit of this solution was just the safety and comfort that people felt, knowing if someone was talking about them, for example, or being able to hear if the baby is crying or the front door opens or someone calls your name. So there's this notion of being present enough in your physical world to feel safe enough that you feel comfortable actually getting more immersed in the virtual world. And that's just an example of another unique benefit of the Dolby Atmos solution.
It just automatically works in that environment.

[00:17:34.354] Kent Bye: So what does the interface for this look like? Is it similar to something like Unity where it's a full 3D model where you're moving objects around in a 3D space?

[00:17:42.677] Joel Susal: Yeah, so there are two aspects to our solution. There's the content creation tools and the playback solution. The UI is associated with the content creation piece, and it's built on top of Pro Tools. It is a Pro Tools plugin. It's a suite, so there's a number of different UI components to it, but the way that someone pans an object is they select the track in Pro Tools that they want to move, and that helps define the object, and they can instantiate a panner UI for each track. And within that panner plugin is where you can essentially drag the object in either an XYZ projection, or we also support equirectangular panning. So if you have a spherical video that's been flattened into an equirectangular projection, you can now pan your objects in that projection as well, which is something that I think makes a lot of sense. It took a little bit of thinking at first because it takes some getting used to understanding what an equirectangular projection looks like and behaves like, but it's actually much more of a WYSIWYG, a what-you-see-is-what-you-get type of mixing paradigm.
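
For a sense of why equirectangular panning feels WYSIWYG, here's a small sketch of the mapping between a point dragged onto a flattened 360 frame and a panning direction. The coordinate conventions here are assumptions for illustration only, not the actual Pro Tools panner implementation.

```python
def equirect_to_direction(u, v, width, height):
    """Map a point picked on an equirectangular 360-video frame to a panning direction.

    u, v are pixel coordinates (0, 0 at the top-left). The full frame width
    spans 360 degrees of azimuth and the full height spans 180 degrees of
    elevation, which is what makes 'drag the object onto the picture'
    panning feel what-you-see-is-what-you-get. Conventions are assumed,
    not taken from the actual plugin.
    """
    azimuth_deg = (u / width) * 360.0 - 180.0      # -180..+180, 0 = frame centre
    elevation_deg = 90.0 - (v / height) * 180.0    # +90 top of frame, -90 bottom
    return azimuth_deg, elevation_deg

# Dragging an object onto the centre of a 3840x1920 frame points it straight ahead:
print(equirect_to_direction(1920, 960, 3840, 1920))   # (0.0, 0.0)
```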

[00:18:53.968] Kent Bye: And so what is the licensing model for this technology?

[00:18:57.888] Joel Susal: Yeah, so our content creation tools are in beta right now, so we're working with a select group of partners. And I can't comment on the pricing, but generally our goal is not to charge content creators a lot of money. We want the best content creators to use our tools to create the best content that they can. And the other aspect of our solution is a playback solution, where our decoder and renderer gets integrated into one of our customers' applications. And we have a model where we make money when our customers make money, which is what our customers have deeply appreciated in this world where no one's quite making much money to begin with, at this point anyway.

[00:19:42.371] Kent Bye: Yeah, taking a step back, I'm just wondering at a higher level if this is something where you would buy the suite and have the Pro Tools integration and use it as much as you want, or if it would be something more along the lines of, you know, for each specific piece of content that you create, it has its own unique licensing and based upon the views or, you know, like,

[00:20:01.606] Joel Susal: So our model currently is designed for multiple pieces of content. We're thinking about a per-title pricing scheme. But what I will say is that while our tools always output the Dolby Atmos version of the content created with them, we also output in other formats like B-format. And so when you're done mixing using the Dolby Atmos production suite, you have both a Dolby output as well as a B-format output. And that allows you to distribute your piece to, I'd say, a large number of distributors. Of course, the value that we provide is the delta between those two experiences, the B-format versus the bona fide Atmos rendition. And that delta is the value that we provide in the market with our playback solution.

[00:20:51.407] Kent Bye: Now is this something that's like mixed down to like a single WAV file or I'd imagine that it's probably more along the lines of like a project with the original source files and then have some sort of way to dynamically track your head and then mix it live.

[00:21:04.802] Joel Susal: Well, that's where it gets really interesting because any channel-based solution, including B-format or ambisonics, is less flexible than an audio object-based solution. The reason being that with an audio object-based solution, you can define behaviors of objects with a very small amount of metadata. But if you have a channel-based configuration, including B-format, you can't mix two groups of things together if you intend to then separate them downstream. So again, an example, if you have head-tracking objects and non-head-tracking objects, you simply can't mix those ahead of time. And so now all of a sudden you do have to send two different streams if you want to implement that feature. And so we feel strongly that the robustness of having an audio object-based solution end-to-end is far superior to any other alternative including ambisonics, B-format or any channel-based configuration. And that flexibility, frankly, is where we see VR, the direction of VR. We're just at the very tip of the iceberg in terms of what the types of experiences people want to define. And having a robust solution with an end-to-end content creation as well as a playback environment is really where we see tremendous value and where Dolby can provide unique and meaningful value for the industry.
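
To see why a channel-based stream can't separate head-locked from scene-locked content once it's mixed, here's a minimal sketch of a first-order ambisonic (B-format) yaw rotation, the kind of operation a player applies as you turn your head. Sign conventions vary between toolchains, so treat this as illustrative only.

```python
import math

def rotate_bformat_yaw(w, x, y, z, yaw_rad):
    """Rotate a first-order ambisonic (B-format) sound field around the vertical axis.

    w, x, y, z are the four channel sample values. Only X and Y mix; W and Z
    are unchanged by a yaw rotation. The key point from the interview: this
    rotation applies to the entire field at once, so anything already mixed
    into these four channels (a narrator, a score) turns with your head too --
    you'd need a second, un-rotated stream to keep it head-locked.
    """
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return w, c * x - s * y, s * x + c * y, z

# One sample of a field with a source mostly to the front (X) and a little overhead (Z):
print(rotate_bformat_yaw(0.7, 1.0, 0.0, 0.2, math.radians(90)))
```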

[00:22:34.175] Kent Bye: Is this something that's integrated into the Samsung Milk VR video player? Because I think that at Sundance, there was Perspective, which I think was just a normal MOV file. And so I'm just curious about the audio data, whether it's bundled with that file, or how it actually is working in the Gear VR.

[00:22:52.726] Joel Susal: Yeah, I can't comment. What I can say is Milk VR is not released with Dolby Atmos support as of today.

[00:23:00.233] Kent Bye: OK, but it sounds like at Sundance, you were showing Perspective and talking about it. So this was an experience that was shot from the first-person perspective. And so I'm just curious if you could talk a bit about how they were using the technology.

[00:23:14.899] Joel Susal: Yeah, so as I mentioned, our tools do output into other formats, like B-format. And so when someone creates a piece of content using our tools, you may get an Atmos version as well as a B-format version out of those tools. We feel strongly about, and in fact the success of our business is based on, the noticeable difference between those two renditions. And so we feel strongly that the Atmos version is significantly better. So at Sundance we did have versions of Perspective that were enabled with the Atmos playback solution. And that solution is, I'd say, noticeably different in a number of ways, but one critical way is the ability to very precisely localize sound, which is critical in a frame that's happening all around you, when you need to understand your environment in order to know where to look and with what to engage.

[00:24:09.707] Kent Bye: So if content creators are out there and they want to get their hands on this technology and start using it in different productions, then what should be their next steps?

[00:24:18.587] Joel Susal: So as I mentioned, we just launched our beta program this week, and we have a really strong group of initial users, but we're always looking for more. And I would suggest that they contact us. They can email vrcontent at dolby.com. And that's one alias where we can ingest your interest and make sure that we reach out and get you in the pipeline for our production tools.

[00:24:45.073] Kent Bye: And finally, what do you see as kind of the ultimate potential of virtual reality and what it might be able to enable?

[00:24:50.397] Joel Susal: Oh, that's a fun one. Well, I think, you know, we're at the beginning. This is hacking your senses, primarily vision and audio at the moment. But you can imagine if you can hack one's senses to the point where it's indistinguishable from reality, then it's a little dystopian. But it's certainly a very valuable technology you would have on your hands. You would not need to travel to go and experience some of the great things in life. So, you know, that's the potential of where it could go. Frankly, I think VR is going to lead significantly into augmented reality, which also shows great promise, tremendous promise, because not only is it useful for deeply immersive experiences, but it also has the potential to just make us better-performing people. We would not forget names and would have relevant facts and figures and information at our fingertips or in front of our eyes when we need it, which, I'd say, helps us perform better, which is something that's a very exciting future as well. Awesome, yeah, thank you. No, thank you very much, I've appreciated the time.

[00:26:00.272] Kent Bye: So that was Joel Susal. He's the Director of Augmented and Virtual Reality Technologies at Dolby Labs, and he was talking about Dolby Atmos. So I have a number of different takeaways about this interview. First of all, I think it's great that there is a proprietary solution out there that's really trying to address this issue. And for high-end premium content, people with big budgets, people with big Hollywood productions, I think this technology is going to make sense. Taking the audio for 360 videos to the next level is, I think, the primary use case for Dolby Atmos. I'm really impressed with the ability to dynamically change the material properties after the fact. So I think the big thing that I learned from this is that when you do field recordings with ambisonics, you're essentially stuck with whatever you have. So while that's good for kind of establishing a room tone, to just kind of get a baseline for what the room actually sounds like, for the actual main content it sounds like you really need to be micing those individual sources and then positioning them within the space. It sounds like Hollywood has already adopted a lot of this Dolby Atmos technology. It's already in theaters, and there are going to be super expensive home theater systems. I expect that at some point most people will probably be using headphones when they're doing VR in their home. Maybe high-end systems at home, but probably mostly digital out-of-home entertainment experiences might use something like Dolby Atmos with a whole array of speakers. And so I think the real value would probably be shared social spaces, where you can go through a VR experience with other people at the same time and have this overall sound field that just gives you a deeper level of immersion and presence. So I think the thing that the community doesn't have at this point is an open standard alternative to Dolby Atmos. Dolby Atmos is really an early mover in starting to define the format of this technology, but in order to decode it and use it in other places, you're going to have to pay a licensing fee. And producing something like this in a standard format and exporting it from Unreal Engine or Unity, there is no open standard for that yet. And so we're kind of left with no standard way to deliver both the original waveforms and a way to kind of mux them all together and be responsive as people are moving their heads around. So it's a lot of complicated, sophisticated technology. And it really reminds me of this sort of open-versus-closed debate that has been ongoing throughout many different episodes of the podcast, with Neil Trevett of the Khronos Group making the point that any good open standard has a proprietary competitor. And I think that Dolby Atmos is the proprietary solution that's out there, and it's the best that's out there. And we're yet to see an emerging open standard competitor to that. And I think that will probably emerge here shortly at some point. It might be something that Unity and Unreal are thinking about. If not, they should be, because I think that we kind of need a standard way to have alternatives to be able to do some of these things that Dolby Atmos is making possible. And it sounds like Dolby is probably in talks with Samsung about being able to integrate this within the Milk VR player. It just makes sense.
I think there were a number of different experiences at Sundance this year that were using this very special build of Milk VR that already had it integrated, because I just experienced a number of 360 videos that had that object-oriented sound within them. I wasn't aware of it at the time, so I didn't know to be listening for it, but coming this year at Sundance, I'll certainly be trying to discern if I can really hear the difference between a Dolby Atmos experience and something that doesn't have one. You kind of have to experience the same experience, I think, to really tell the difference. Sometimes it's really difficult to discern the differences between these different audio plugins and spatialization approaches without hearing the exact same content and being able to listen to it back to back to notice some of the subtle nuances. So this week, I'll be continuing on with the audio theme. I'll have an interview with the founder of OSSIC headphones, as well as Pete Moss, who is the VR dude at Unity and has an audio background, so we talk about audio a little bit as well. There's also the University of North Carolina professor Ming Lin, who has been looking into actually simulating the audio. So just like a physics engine simulates the physics, there'd be some sort of audio engine that's actually generating the audio based upon the material properties and everything else. That, I think, is going to be the future of audio, especially when it comes to game engines, at least. So, that's it for today's episode, and I just wanted to thank you for joining me. And if you'd like to support the podcast, then please do spread the word, tell your friends. Check me out on Twitter, at Kent Bye, if you want to send me a note or get in touch. And you can also send me some tips over at Patreon, at patreon.com slash Voices of VR.
