AJ Campbell is the founder of VRSFX, and he was inspired to pursue 3D audio for 360-degree virtual reality experiences after seeing Beck's Hello Again concert experience. He noticed the microphone rig used in that production and decided that, with the right microphone hardware, he could write software to create a fully 360-degree, binaural experience.
AJ contacted Jeff Anderson of 3Dio Sound and learned that Jeff was working on the Free Space Omni-Binaural Microphone; AJ became one of the first people to buy one. He has been working on a Unity plug-in that uses the head tracking of the Oculus Rift to crossfade between the four binaural audio recordings, an approach well suited to immersive, binaural audio for 360-degree video productions.
He talks about his low-resolution, 4-node approach with 90-degree spacing, which fades between four analog binaural audio feeds. This is much less processor-intensive than calculating object-based, 3D spherical audio and gives satisfactory results in many use cases. He's still experimenting with the best crossfade to provide a seamless, smooth listening experience as you turn your head, but expects to finish the Unity plug-in soon. For more information, sign up for VRSFX's e-mail list.
Theme music: “Fatality” by Tigoolio
Rough Transcript
[00:00:05.452] Kent Bye: The Voices of VR Podcast.
[00:00:11.977] AJ Campbell: My name is AJ Campbell, and I founded VRSFX.com in order to make 360 3D audio available to game developers. I've got an Omni binaural microphone here, and it's pretty wacky looking. It's got eight ears on it, four pairs, one facing in each direction, north, south, east, west. And so it just looks like a big mallet full of ears.
[00:00:35.090] Kent Bye: Great. And so what inspired you to start to get into this field of audio within virtual reality?
[00:00:42.040] AJ Campbell: Yeah, so back in February, I think it was, I saw the Beck 360 experience. It's called Hello Again. I think it's hello-again.com. It's a really cool 360 video where Beck is on a circular stage and there are 360 cameras and microphones hovering around him, hung from pulleys the whole time. And it's a really amazing experience. And I saw the microphones in particular. I've got a background in software and audio. And it occurred to me from seeing their microphone rig that we could use the same kind of thing in game development but it would require some extra software to be written. So I got a hold of a similar microphone rig and I wrote the software in order to make it drag-and-drop easy for game developers on the Unity platform and hopefully on the Unreal platform pretty soon too.
[00:01:26.995] Kent Bye: And so in video, if you have a binaural audio microphone, you just have two ears. And that's good enough because you can always have it oriented to whatever position you're showing the video. But in this case, you have perhaps a 360-degree video, and you're using these four pairs of ears to then extrapolate depending on where the head tracking is at. Is that correct? Maybe talk a bit about that process in terms of that translation from mapping the head tracking to panning between all the different audio feeds.
[00:01:59.502] AJ Campbell: Sure, yeah, so with this many ears on one microphone, I can record the entire soundscape in 3D in any direction. Because, say we have a pair of ears that's facing north, we have a pair of ears that's facing west. If the user is facing north and then they turn to their left, then we do a crossfade, we switch the volume from totally north, and then we bring in the west channel, and we fade out the north channel at the same time. So based on the volume you're hearing, it actually affects your perception of what direction you hear the sound coming from.
[00:02:35.039] Kent Bye: I see. And so there's an approach of doing it completely in a game engine, a 3D game engine, and doing 3D object-oriented modeling. And I guess the question I have is the difference between when you start to tilt the head and fully positional tracking. What are the implications? Because this is something that looks like it's a DK1 type of tracking, just tracking when you're turning your head left and right but not necessarily leaning over to the left or right or leaning forward. So how do you handle that?
[00:03:05.271] AJ Campbell: Sure, that's exactly right. So the first part of that is the 3D rendered sound that most of the time is used in video games is extremely accurate for direction, because they recalculate the positional qualities of the sound signal in real time based on your facing. So you might turn only like a quarter of a degree to the left, and it'll recalculate the very next frame where the sound is supposed to sound like it's coming from, which is cool, but it's also very processor intensive, which is a big problem, especially in VR, because we have a lot more processor power requirements to keep the frame rate at, hopefully eventually, 120 frames per second. We don't necessarily have the processor power available to spend on real-time audio rendering at the same time. And so my method actually creates similar results to the real-time rendering stuff with a lot higher efficiency, because all of the audio tracks are pre-rendered and we're not recalculating everything about the signal that makes it positional, we're just changing the value of the volume. So it requires almost no processor power in comparison to the other stuff. And surprisingly, even though there are much fewer nodes between, say, the north face and the west face, a real-time 3D renderer for sound will capture accurate positional quality at every point in between those, but there's always a distance from one to the next that they slide between. And so this is kind of the same idea, we just have fewer nodes that we're sliding from one to the next. So instead of sliding a tiny percentage of a degree, we're sliding a complete 90 degrees. It works the same way, it's just: what's the resolution that you need in order for it to sound realistic? What we found out is that you need very low resolution in order for it to sound realistic. You can have two nodes that are 90 degrees apart and it'll still sound real. So the other part is the pitch and the roll. So this thing will yaw. If you turn your head to the left, turn your head to the right, it works great. If you tilt your head to the left or the right, or if you tilt your head forward and back, this rig by itself will not capture accurate audio. But the funny thing about that is there are certain head orientations that don't affect the sound at all. For example, if a sound is right in front of you, and you tilt your head to the left or right, it's not going to change the sound signal whatsoever. And so, based on what we know about where we need those tilts, we are able to use a couple extra microphones in addition to this one here, in order to capture not just all around you, but up above you and below you as well. And so we're developing the software for that right now. It's not quite done yet, but it's going to be done very soon.
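To make the 4-node idea concrete, here is a minimal Python sketch (illustrative only, not VRSFX's actual Unity plug-in code; the node names and function are hypothetical) of how a listener's yaw could be mapped to gains for four pre-rendered binaural feeds spaced 90 degrees apart:

```python
# Pre-rendered binaural feeds, one pair of ears per compass direction.
# Angles are in degrees, measured clockwise from north.
NODE_ANGLES = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}

def node_gains(yaw_degrees):
    """Return a per-node volume gain for a listener facing `yaw_degrees`.

    Only the two nodes adjacent to the listener's facing get non-zero
    gain; the others stay silent. Uses a simple linear crossfade between
    the two adjacent 90-degree nodes.
    """
    yaw = yaw_degrees % 360.0
    gains = {name: 0.0 for name in NODE_ANGLES}

    names = ["north", "east", "south", "west"]
    lower_index = int(yaw // 90.0)          # node just "behind" the facing
    upper_index = (lower_index + 1) % 4     # next node around the circle

    # Fraction of the way from the lower node to the upper node.
    t = (yaw - lower_index * 90.0) / 90.0

    gains[names[lower_index]] = 1.0 - t     # fade the first node out
    gains[names[upper_index]] = t           # fade the next node in
    return gains

# Facing 45 degrees, halfway between north and east:
print(node_gains(45.0))  # {'north': 0.5, 'east': 0.5, 'south': 0.0, 'west': 0.0}
```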
[00:05:39.431] Kent Bye: And so because you have like a 90 degree resolution, if you say turn to 45 degrees, does that mean that you can take a crossfade of half of one and the half of the other and it kind of blends together and still makes sense? Or do you kind of do a step function shift into, you know, you flip over?
[00:05:58.491] AJ Campbell: It's a great question. So we've experimented with a few types of crossfades. The first one was just your standard run-of-the-mill linear crossfade, where halfway between one node and the next, you fade the first one out 50%, and then you fade the second one in 50%, so they're exactly equal in the middle. And then, you know, if they're both fading at the same rate, you get a pretty good effect, but you also have a slight fluctuation in the total volume of the sound, which is a problem. So we're experimenting with a variety of different crossfade scales to figure out which one sounds the most realistic.
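One common alternative to the linear fade he describes is an equal-power crossfade, which keeps the combined acoustic power constant through the transition. The sketch below is a generic audio-mixing technique, not necessarily the curve VRSFX settled on; it just shows why the linear fade dips in the middle:

```python
import math

def linear_crossfade(t):
    """Linear crossfade: the gains sum to 1, but combined power dips mid-fade."""
    return 1.0 - t, t

def equal_power_crossfade(t):
    """Equal-power crossfade: the squared gains sum to 1 for every t,
    so the total acoustic power stays constant across the transition."""
    angle = t * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

# At the midpoint of the fade (t = 0.5):
a, b = linear_crossfade(0.5)
print(a**2 + b**2)   # 0.5 -> roughly a 3 dB dip in total power

a, b = equal_power_crossfade(0.5)
print(a**2 + b**2)   # 1.0 -> no dip
```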
[00:06:37.814] Kent Bye: As you're talking about this, I'm just thinking about the use cases. Most use cases, if you're listening to a concert, you're not going to be doing a lot of crazy leaning and tilting of your head in odd places. When you're in a concert, you don't typically do that in real life. Why would you do that in virtual reality? And so I'd imagine that this is covering 80% to 90% of the major use cases for where you would want to just set up a microphone like this to capture the live audio of a live event.
[00:07:03.783] AJ Campbell: That's exactly right. So, for example, I did a music video in collaboration with JumpVR last month, and they are using a different kind of 3D audio method. Theirs works great too, but they were curious about how mine works. So we did a music video where we A-B'd theirs versus mine. And most people, when they sit in that music video experience, they mostly don't really think to tilt or roll their head so much. It's an experience where the natural thing to do is just turn to the left, turn to the right. And since this microphone by itself captures all of that, most people don't notice that there's anything wrong up above or below. So for example, if you, with my microphone, if you take your head and point the top of your head right at a sound source and wiggle it around a little bit, you can kind of hear some blemishes in the audio. And that's the part that I'm working on fixing with our software right now. But most people didn't even know that that was there because they don't think to point the top of their head at the artist that they're listening to, right?
[00:08:02.058] Kent Bye: Yeah, it's like hacking the sound of like, you know, finding the thing that's going to break the sort of the illusion there. We don't sort of think to do that, but it's interesting to think about using these microphones, what those edges are. And so like, what has been your process to kind of QA or figure out what the limits are?
[00:08:19.153] AJ Campbell: Well, so I've brought this mic into a couple of different studios, and we've done diagnostics where we watch the EQ and run low frequency oscillators, various frequencies, to see what the output is from multiple channels as we pan from one channel to the next, in order to discover what's the optimal crossfade, you know, things like that. There's a lot of diagnostic work that's gone into getting the software right. And we're not 100% there yet, but we're getting close.
[00:08:52.274] Kent Bye: And so what's the next steps for this? Is there going to be a prototype or a Kickstarter? Or is this going straight to market? Or how are these made available for people?
[00:09:00.912] AJ Campbell: Well, so the microphone is actually already available for sale. I didn't build this mic, it was actually... When I saw that Beck 360 video, I was completely passionate about getting involved, and I was doing research on the parts so I could build one for myself. But then I found out that the guy who did the audio for the Beck 360 experience has a microphone company, and he had an Omni model in the works. And that's where this came from. So I'm more of a software guy. He's obviously more of a hardware guy. I decided, you know, let him be in charge of that part of it. So I called him up and I think I was his first customer for this microphone. It's a really great rig. His name is Jeff Anderson and he runs a company called 3Dio Sound.
[00:09:40.797] Kent Bye: I see, and so are you building a Unity plugin that you're going to be selling then?
[00:09:44.518] AJ Campbell: Correct, yeah. I actually had my Unity plugin ready about a month ago. It's a funny story, I had a system crash, I had it backed up, but the backup was corrupted somehow, and then I called up Unity because I had sent a copy to them, and somehow it was corrupted on their end too. We're not sure how all of that happened. but I am very close to being done recreating the plugin from scratch. I've been a few days away from finishing the code for almost a couple weeks now, but you know, every time I come to an event and people are asking me to jump in on a video project or a game project, and so it's hard to juggle all this stuff. I'm actually trying to build a team around this right now so that, you know, it's not just me getting all this done. I have quite a few people in the forums asking me, when's your plugin going to be available? I'm going to have it available as soon as possible.
[00:10:29.957] Kent Bye: So one of the acronyms that I see a lot in this space is HRTF, and so maybe you could describe what that is and what that means.
[00:10:37.363] AJ Campbell: Yeah, so that's the kind of technical term for the method we were talking about a little bit before. It's used pretty often in AAA games, where you have a regular mono sound signal, and it wasn't recorded with a binaural mic, it has no spatial cues attached to it whatsoever. You need to recreate those spatial cues in real time. And so with HRTFs, they take actual samples from binaural dummy head microphones, and they calculate what's different about that signal versus a regular mono signal. And based on that information, and because they take samples from literally every position around the head in a 360 experience, and based on how many nodes they have, that's how spatially accurate the HRTF is. So a lot of them have hundreds of nodes all around the head. There are some more advanced ones now that have thousands of nodes all around the head. So they're very high resolution. But that's kind of the direction everybody's going with that, with the real-time stuff. They're trying to add more nodes. And when you do that, it sounds more and more accurate. But if you record with a binaural mic, it also sounds very accurate. And I don't know that a lot of people realized, if you go the opposite direction and you have fewer nodes instead of more, it actually still sounds just as good.
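As a rough illustration of what an HRTF renderer does per sound source (a simplified sketch, not any particular engine's or middleware's API), the renderer convolves the mono signal with the left and right head-related impulse responses measured for the direction nearest the source:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right head-related impulse
    response (HRIR) pair measured for one direction around the head.

    `mono`, `hrir_left`, and `hrir_right` are 1-D NumPy sample arrays.
    Returns a (num_samples, 2) stereo array carrying the spatial cues.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)

# A real-time renderer repeats this (or an FFT-based equivalent) for
# every 3D source whenever the listener or source moves, interpolating
# between the nearest measured directions -- which is why the cost
# grows with both the node count and the number of active sources.
```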
[00:11:50.796] Kent Bye: Yeah, because you're kind of recording at a 48 kHz frequency and you're able to, you know, if the update rate is only 120 Hz, then I guess there's a lot of samples. I don't know if that is translating in terms of like, if that's related to latency at all, that the frequency recording rate of live audio in that way, do you know what I mean?
[00:12:10.540] AJ Campbell: Yeah, it's not so much the... I mean, there are latency issues with the sample rate of an audio signal itself, but we're talking about, with an HRTF, that's real-time post-processing for the audio signal. So it takes the regular audio signal and it actually uses either CPU power or, if you have an off-board sound card, it can use the... processor that's on the sound card instead, to calculate in real time how that signal needs to be tweaked. Not just every frame, it actually happens multiple times a frame because the physics engine can update more than once per frame, typically. But every time the physics engine updates, you can recalculate based on the position that you know an object has moved. If it's moved a quarter of an inch to the left, you know to recalculate the audio signal in order to compensate for that change.
[00:12:58.130] Kent Bye: And what are some of the other big open problems when it comes to audio and virtual reality that you want to help address?
[00:13:04.840] AJ Campbell: Well, that's the big one. So when you do all those calculations every frame or multiple times per frame, it potentially eats up a lot of CPU power. And because we really, really need all of that CPU power handy just for the physics engine so that the frame rate stays high, I wanted to be able to use lots of 3D sounds, which are totally necessary in VR. You want all your sound to sound 3D because that's what gives you a convincing immersive experience of presence. If you use dozens of 3D sounds at the same time, and potentially there are, I can picture situations where that's going to happen in VR, that by itself, the audio and nothing else, can bring a CPU to its knees. And so, you know, it's kind of a, you can have one or the other thing right now, where you can have really fast frame rates, or you can have 3D sound. And I wanted to have both, and that's kind of why I started working on this stuff.
[00:13:53.531] Kent Bye: And so do you see like using this to do samples of audio that you're then layering on top of each other? You know, like instead of recording all live in a live environment, they're actually going into a sound booth and doing recordings of sounds and then layering on top of each other.
[00:14:09.745] AJ Campbell: Yeah, absolutely, and that's kind of, in a real-time gaming environment, that's usually how it works. And so that's the beauty of my plugin, you can have one sample, or you can have 12 samples all playing back simultaneously, and unlike the real-time rendering stuff, my stuff doesn't affect the processor output.
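A minimal sketch of why layering pre-rendered binaural samples stays cheap (illustrative only, not the plug-in's internals): playback is just a per-sample gain and sum, with no positional recalculation as the listener moves.

```python
import numpy as np

def mix_prerendered_stems(stems, gains):
    """Mix pre-rendered binaural stems by scaling and summing them.

    `stems` is a list of (num_samples, 2) arrays, all the same length,
    that already carry their spatial cues from the recording; `gains`
    is one volume per stem. No positional processing happens here, so
    the cost is a multiply-add per sample no matter how many stems play
    back simultaneously or how the listener turns.
    """
    out = np.zeros_like(stems[0], dtype=np.float64)
    for stem, gain in zip(stems, gains):
        out += gain * stem
    return out
```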
[00:14:27.430] Kent Bye: Great. And finally, when it comes to virtual reality in general, what do you see as the ultimate potential for what it can provide?
[00:14:34.316] AJ Campbell: Oh, it's so huge. I mean, everybody's focused on the gaming environments right now, I think, because that's the market that's blossomed as a result of all of the hardware becoming accessible price-wise. But the larger market, I think, is social media and where that overlaps with gaming, because there are millions of people playing social games right now. They're accustomed to creating digital presences for themselves, but a lot of them haven't even experienced what it's like to do that in VR for the first time. So when those people start to realize that it's so much better when you actually feel like you're there, I think that we're going to see eventually billions of people with VR headsets sharing an experience with their friends in real time.
[00:15:18.062] Kent Bye: Great. Well, thank you. Sure.