Arthur van Hoff is the CTO of Jaunt, and I had a chance to catch up with him at Oculus Connect 2 to go over some of the most important specifications for their Jaunt ONE 360-degree camera, formerly known as “Neo.”
Arthur talks about how they’re using cloud computing to interpolate between 24 different cameras using digital lightfield technology in order to eventually render out 8K worth of pixels for each eye at 60 frames per second. In this interview he explains exactly how they’re using lightfield technology, and how they eventually want to explore lightfield rendering.
Each of the cameras also has an accelerometer that could eventually be used for more advanced image stabilization, but right now they have their own set of algorithms for image stabilization.
Arthur also talks about their ambisonic sound solution to render out sound fields in combination with HRTFs, as well as how they’re also using Dolby Atmos Audio Technology, which is more object-oriented.
He also talks about how the ultimate end-product for Jaunt is the content in collaboration with their partners, and he talks about what types of doors their latest round of $65 million is going to open when it comes to securing talent and content producers to create immersive, 360-degree video content.
Become a Patron! Support The Voices of VR Podcast Patreon
Theme music: “Fatality” by Tigoolio
[00:00:05.412] Kent Bye: The Voices of VR Podcast.
[00:00:12.045] Arthur van Hoff: My name is Arthur van Hoff, I'm the CTO and one of the founders of Jaunt. We make cinematic VR. We've built an entire end-to-end pipeline, all the way from capture to delivery, for how to create high-quality cinematic VR experiences. And here at the show we're actually showing the Jaunt ONE camera, formerly known as NEO. It's exciting, it's the first time out in the open, and I'm on the panel tomorrow speaking about how to make movies in VR.
[00:00:41.918] Kent Bye: Great. So, yeah, when you're looking at cameras in VR, what are some of the, I guess, the most important specifications or specs that are featured in this camera?
[00:00:50.920] Arthur van Hoff: Well, you need to first decide whether you want to do stereoscopic or monoscopic. With six GoPros, you can do monoscopic, but not stereoscopic. You need more angles if you want to do stereoscopic. So, our camera contains 24 modules, but most importantly, they're much better quality sensors than GoPros. So they are fully synchronized, much bigger sensors, much bigger pixels, much better low-light performance. And also a global shutter. With a rolling shutter, motion between cameras is really hard because the shutter may be at different positions as the object moves by. With a global shutter, the picture is taken all at once across all cameras, and that makes a huge difference. And then finally, one of the defining factors for the end resolution is the angular resolution of your sensor. So our sensor is in the order of 20 pixels per degree, so you can do about 8K per eye in terms of native resolution. And that's not counting doing super resolution, stacking images over time, for example.
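As a back-of-envelope check of the figures above (all values come from the interview and are treated as approximate):

```python
# Rough check of the angular-resolution claim: 20 pixels per degree
# across a full 360-degree panorama gives the "about 8K per eye" figure.
pixels_per_degree = 20        # stated angular resolution of the sensor rig
horizontal_fov_deg = 360      # full panorama
native_width = pixels_per_degree * horizontal_fov_deg
print(native_width)           # 7200 pixels across, i.e. roughly "8K"
```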
[00:01:56.524] Kent Bye: So it is stereoscopic, I'm assuming. And is there any specific processing that you have to do in order to make that happen?
[00:02:03.455] Arthur van Hoff: Yes, so the camera comes with a whole pipeline. There's tools for setting it up and then there's tools for media management and then there's tools for uploading the data into a cloud where the data then gets stitched and processed. The way we do that is we treat the camera as a light field camera. So we capture as many angles as possible and then later we synthesize a left and a right eye view so that you get sort of the highest possible quality stereoscopic accuracy.
[00:02:29.452] Kent Bye: So yeah, maybe talk a bit about that digital light field. That may be a new concept or confusing for some people. What does that actually mean?
[00:02:36.394] Arthur van Hoff: So a light field is defined as all the light passing through a volume of space, which by definition is an infinite amount of information. So you can never capture a light field. You can only capture part of it. But any sensor system that has more than one lens will capture a light field. It's a sampling of a light field. You have two lenses, you have two points of view. We have 24 points of view all around, so we capture a fairly dense light field at 60 to 120 frames per second. And then with that data you can actually sort of interpolate between cameras as well. So although there might be 16 cameras on the equator, we can actually double that number or quadruple that number by interpolation. And as a result you get very good stereo acuity because you can kind of imagine the light falling into your eye if your eye was in any position inside the camera. And then from that we generate a video that you can then later watch on your phone. So we don't actually do light field display, light field playback, because that's very difficult still, it's very challenging. But we do light field capture in order to do high quality stereo playback.
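A toy sketch of where the "doubled or quadrupled" viewpoints come from: a synthetic in-between view blended from two adjacent cameras. This is a naive blend for illustration only; Jaunt's actual light-field interpolation reprojects rays using calibration and scene geometry.

```python
import numpy as np

def midpoint_view(img_a, img_b, t=0.5):
    """Naively blend two neighboring camera frames into a synthetic
    in-between viewpoint (illustrative, not Jaunt's algorithm)."""
    return (1 - t) * img_a + t * img_b

# Stand-ins for two adjacent cameras on the equator of the rig.
cam_0 = np.zeros((4, 4, 3))
cam_1 = np.ones((4, 4, 3))
virtual = midpoint_view(cam_0, cam_1)   # an extra "virtual" viewpoint
print(virtual[0, 0])                    # [0.5 0.5 0.5]
```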
[00:03:44.816] Kent Bye: So if you have like 24 samples of discrete videos, I guess I have trouble understanding how you're taking like this light field information, which I kind of, in my mind, think of as kind of like a point cloud. I don't know if that's correct or not, but what's actually happening to that source video in order to make it sort of feel more stereoscopic?
[00:04:02.214] Arthur van Hoff: Well, what you do is you take all the calibration data of the lenses and the camera and then you compute, you know, a pixel in a camera basically represents a ray of light, you know, that's falling into the camera lens. And that's one of the angles of light that you can use. Right now you don't get all the angles because you're only sampling from 24 lenses. But you get a lot. I mean, we get about three gigapixels per second. So there's a lot of data to process. But from that data and with interpolation, you can synthesize almost any angle inside the camera. And then what you do is you basically say, well, what would my left eye see if I was looking in this direction? And what would my right eye see if I was looking in that direction? given the interocular distance, and given the convergence distance, and then you can compute a pixel value. So that you have to do 8K per eye, 60 frames per second, that's a lot of work. So that's kind of why you do it in the cloud, because you need a server farm to do that.
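A quick consistency check of the "three gigapixels per second" figure against the 24 modules, assuming the 60 fps capture rate mentioned earlier:

```python
# How much each sensor contributes per frame if the rig as a whole
# produces ~3 gigapixels per second (60 fps assumed here).
total_px_per_sec = 3e9
cameras = 24
fps = 60
px_per_camera_frame = total_px_per_sec / (cameras * fps)
print(f"{px_per_camera_frame / 1e6:.2f} Mpx per sensor per frame")
```

That works out to roughly 2 megapixels per sensor per frame, which is plausible for the class of sensor described.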
[00:05:03.907] Kent Bye: Right, and then what is the output? Is it a 360-degree video that, when people actually wear it, has the left eye and the right eye?
[00:05:11.143] Arthur van Hoff: It's essentially like a map of the world. It's a spherical projected image for your left eye and a spherical projected image for your right eye. And the resolution is really determined by how much you can decode. On a high-end PC we can do 4K per eye. That's already pushing it. And it's not so much the fact that you can't decode two 4K video streams, it's more the number of pixels that you're pushing to the GPU. The memory bandwidth at 60 frames per second is the limiting factor. Now for a phone, for example, we probably do 4K for both eyes, right, like on a Gear VR. And that you can achieve if you use H.265 and a high-end phone. But that resolution is actually pretty close to the angular resolution of the screen. So you're actually doing a pretty good job at that point. The one problem is, though, that you're also decoding the stuff that's behind you that you're not seeing. So you're kind of doing a bunch of extra work. So the next stage is to create differential decoders that decode only the pieces that you want to look at.
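To see why memory bandwidth rather than decoding becomes the bottleneck, here is the raw upload cost of two 4K eye buffers at 60 fps, assuming uncompressed 8-bit RGBA textures (an assumption for illustration, not a stated spec):

```python
# Pixel data pushed to the GPU per second for stereo 4K at 60 fps,
# assuming 8-bit RGBA (4 bytes per pixel).
width, height = 3840, 2160
bytes_per_pixel = 4   # RGBA8 (assumed)
eyes = 2
fps = 60
gb_per_sec = width * height * bytes_per_pixel * eyes * fps / 1e9
print(f"{gb_per_sec:.1f} GB/s of pixel data to the GPU")  # ~4.0 GB/s
```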
[00:06:11.092] Kent Bye: Right, because, you know, people probably aren't going to be whipping their head around, and even if they did, there's a limit for how fast you could do a 180, I'd imagine, so you could kind of make some probabilistic curves in terms of how much processing you would need to do, I'd imagine.
[00:06:25.767] Arthur van Hoff: Yeah, that's true. And you know, you can kind of look at the physiology of a human being, how quickly you can look around, and it's actually pretty quick. So you can't just do naive camera switching, where you go, oh, let me switch to a different stream, because that would be way too slow. So you need to be a little bit more clever than that. But it's possible to make some optimizations there, yeah.
[00:06:44.028] Kent Bye: And in terms of like traditional 360 video, there's usually like you can look and see a stitching line. So what have you been able to do to kind of eliminate having to have these lines where it's pretty clear that it's from different cameras?
[00:06:56.533] Arthur van Hoff: So we have an algorithm now that works with the NEO cameras to such an extent that you don't have any stitching effects unless you're really close to the camera, like within three feet. Then it becomes a little more difficult. We can still do that, but it requires some manual work. But otherwise, stitching is now completely automatic in the cloud. You just press the button to record, upload the data, and off you go.
[00:07:22.149] Kent Bye: And in terms of audio, I know that audio is a huge component as well, and so are you recording, like, omnidirectional binaural audio, or what's your audio solution like?
[00:07:30.374] Arthur van Hoff: So we typically use a tetrahedral microphone to record a sound field, and then use ambisonics to render that on the device. But we also support Dolby Atmos. So we have a partnership with Dolby, and we can actually play back Dolby Atmos on your Gear VR, for example. And that actually is far superior to ambisonics. It creates a really immersive sound experience, and it really adds value. Like, we did Black Mass, which is kind of a horror piece, and when we added ambisonic sound, it got a lot scarier.
[00:08:03.797] Kent Bye: Yeah, and so, you know, for somebody who's not familiar with ambisonics, like, how many different microphones are recording, and what are they recording? And then, I guess it's like the challenge of capturing a sound field. And so, how are you doing that?
[00:08:16.988] Arthur van Hoff: So, ambisonics is an old technology that's been around for decades, but it basically means that you take sound from multiple directions, a sound field, just like a light field that we just discussed for the camera, and then you reduce that down to sort of its essence. The way it's typically represented is in B-format, which is one omnidirectional W channel, which records the ambient sound, and then an X, Y, and a Z channel, which record the differential of the W channel. And then the reason that's interesting is that it's actually very easy to rotate. You can just apply a very simple function that, when you rotate your head, counter-rotates the sound field, and then renders it back out. Now, at that point you have directional audio, but you still don't have binaural audio. So what you then do is you apply a head-related transfer function, which kind of mimics the shape of your pinna, your ear, in order to create the externalization of the sound. So you get sound that sounds like it's outside your head, rather than inside your head. And that's kind of the next step that you need to do. And we do that all in real time on every device.
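The "very simple function" for counter-rotating the sound field can be sketched as below. Channel ordering and sign conventions differ between formats (FuMa vs. AmbiX); this sketch assumes W, X, Y, Z with X forward and Y left, and is illustrative rather than production code.

```python
import numpy as np

def rotate_bformat_yaw(w, x, y, z, head_yaw_rad):
    """Counter-rotate a first-order B-format sound field by the head's
    yaw angle. Only the X and Y channels mix under a yaw rotation."""
    theta = -head_yaw_rad                 # counter-rotate against the head
    c, s = np.cos(theta), np.sin(theta)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return w, x_rot, y_rot, z             # W (omni) and Z (height) unchanged

# A source dead ahead (all directional energy in X). After the listener
# turns 90 degrees to the left, the energy moves into the Y channel
# (negative Y here, i.e. to the listener's right, under this convention).
w, x, y, z = 1.0, 1.0, 0.0, 0.0
_, xr, yr, _ = rotate_bformat_yaw(w, x, y, z, np.pi / 2)
print(round(xr, 6), round(yr, 6))
```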
[00:09:24.915] Kent Bye: And if there are sound sources of people that are speaking, do you also have to do separate recordings of those sources to be able to position them, and then measure the distance of that to be able to do additional processing?
[00:09:37.538] Arthur van Hoff: So you can, there's plugins for Pro Tools to do ambisonic sound mixing, there's also the Dolby Atmos editing suite, which you can use, So a lot of concerts, for example, like the Paul McCartney concert that we did, that's produced in Dolby Atmos. And then you use object sounds, so you basically have Paul McCartney, the drummer, the guitarist, all these different objects that you place around the user, and that then get rendered for that particular device, it's in our case the headset.
[00:10:06.464] Kent Bye: So, yeah, what is the difference for the Dolby Atmos sound, then? Like, we described the ambisonics as essentially the same process, or what's different that they're doing?
[00:10:14.612] Arthur van Hoff: Atmos is object sound, so you define, you know, an object that makes a sound, like a plane flying overhead, and then you're given an XYZ location over time, so you can see that object moving. And then you have to figure out, well, if I have a 5.1 speaker setup, how do I best represent that object on that speaker setup? And that's kind of what Atmos is trying to do, especially for theater sound, right? Because modern movie theaters have lots of different speaker arrangements. but we basically use a subset of that which is used only for binaural rendering. So we take the object sounds and then render them out for a headphone and it creates this really good externalization and that's kind of a proprietary Dolby codec feature.
[00:11:00.700] Kent Bye: And do you imagine that most of this audio is going to continue to be in headphones or do you foresee a time when there's going to be like an array of speakers that are around in a room where you can move your head around and the sound is coming from all over?
[00:11:12.588] Arthur van Hoff: Well, if you come to our office, you can see that and it's pretty awesome. It's actually better to hear it on speakers than on a headphone. But, you know, that's quite a setup and I don't think people can afford the setup that we have at the office.
[00:11:26.191] Kent Bye: But it would be potentially kind of like if people wanted to set up like a whole advanced, you know, theater experience and then somehow sync the multiple headsets, you could get a room of people together and all watch the same experience and have like the sound field there.
[00:11:40.462] Arthur van Hoff: That's possible, that's a good idea. I mean, we've thought about doing stuff like that, and we can, we have the technology to do it, but, you know, those are one-off events, and we do that for a special occasion, but not, in general, what we're targeting is people with headphones, typically.
[00:11:54.192] Kent Bye: Yeah, and there was a recent announcement of another round of funding, a pretty significant investment from Disney, and I'm sure other investors as well, and so, as you're moving forward, what type of things can we expect to see from Jaunt with this camera?
[00:12:08.417] Arthur van Hoff: Well, our end product is content, so we're hoping to produce a ton of really high quality VR experiences that are going to blow your socks off. And that's kind of our goal at the moment. And the money that we raise is really sort of more fuel in the tank for us to deliver that. Plus, this is a strategic round, so we've got money from Disney, from EMP, which is affiliated with Creative Artists Agency in Hollywood, and from China Media Capital. And all those are people that have really good relations with the creative community and will open doors for us to get to artists that we otherwise wouldn't be able to reach.
[00:12:44.785] Kent Bye: And in terms of moving a camera, I know that locomotion with 360 video can be challenging because if there's any shake, it can be very nauseating for the viewer. And so do you have image stabilization solutions that can be used with this camera? Are you kind of recommending that when people shoot 360 video, they should just keep the camera in one spot?
[00:13:04.208] Arthur van Hoff: Well, keeping the camera still is very helpful. The camera that we've built is a professional camera, so it's 20 pounds, it's pretty heavy, and that helps in stabilization. We also have various rigs that we've used for drones, for example, where you use a horizon lock stabilizer or Steadicam equipment that works with the camera. But we can also do it in post, although we haven't done that yet. Every camera module in our camera has an accelerometer, so we can actually get 24 accelerometer data points every frame. So we can use that. We haven't had the need to do that yet, but that's definitely something we will end up doing at some point.
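One hypothetical way the 24 per-module accelerometers could feed post stabilization, as Arthur suggests, is to fuse the readings for each frame into a single rig-motion estimate. This is a purely illustrative sketch, not Jaunt's pipeline:

```python
import numpy as np

# 24 simulated 3-axis accelerometer readings for one frame
# (random stand-in data; a real rig would supply these per frame).
readings = np.random.default_rng(0).normal(0.0, 0.05, size=(24, 3))

# Averaging fuses the modules into one rig-motion estimate per frame,
# which a post stabilizer could then integrate and compensate for.
rig_accel = readings.mean(axis=0)
print(rig_accel.shape)   # (3,)
```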
[00:13:45.496] Kent Bye: Oculus has said that the Rift CV1's frame rate is going to be around 90 Hz, and you'd mentioned 60 frames per second, but also 120. Can it record at both speeds, and what's the trade-off that you're having in order to do that?
[00:13:58.687] Arthur van Hoff: Well, you can record at any speed up to 120 frames per second, so 90 would work too, but the challenge is a little bit that lots of devices have different frame rates, so what frame rate do you target? It turns out that the majority of our users are phone users, and they have a 60 Hz frame rate. So, typically we tend to record at 60 frames per second. You barely notice the frame rate difference on 90 frames per second. I'm sure that if you're a movie buff you might be able to tell, but you still get rotational frames as you move your head at 90 frames per second. It's just that every third frame is duplicated. I'm sure you can notice that, but I'm not able to notice that.
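The "every third frame is duplicated" remark can be checked by mapping 60 fps content onto a 90 Hz display, where each refresh shows the content frame at index floor(i * 60 / 90):

```python
# Which content frame each of the first nine 90 Hz display refreshes
# shows when the video was recorded at 60 fps.
display_hz, content_fps = 90, 60
shown = [i * content_fps // display_hz for i in range(9)]
print(shown)   # [0, 0, 1, 2, 2, 3, 4, 4, 5]
```

One refresh in three repeats the previous content frame, matching the trade-off described above.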
[00:14:43.672] Kent Bye: And how much data is coming through here? Like how much per second or how do you measure what the capacity is and then how much you could record with the sort of internal storage?
[00:14:56.373] Arthur van Hoff: So that depends on the size of your external storage. So we can record up to 20 hours with the current storage technology that we have. The end result is a video that is usually targeted about 15 to 20 megabits per second. So you can stream it over Wi-Fi. And that's kind of our target, is to get it so that you can click and start right away.
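To put the 15 to 20 megabits per second delivery target in perspective, here is the size of one minute of finished video at the top of that range:

```python
# File size of one minute of delivered video at a 20 Mbit/s target bitrate.
bitrate_mbps = 20
seconds = 60
megabytes = bitrate_mbps * 1e6 * seconds / 8 / 1e6
print(f"{megabytes:.0f} MB per minute")   # 150 MB per minute
```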
[00:15:16.405] Kent Bye: Right. And I know for recording raw, if you were to record a minute of footage, how big would that be?
[00:15:24.639] Arthur van Hoff: Well, we don't record RAW because that's impractical, and it's also unnecessary because we do a lot of post-processing. So what you're going to see is always a synthesized image. So recording RAW would add maybe some quality, but it's total overkill. I mean, this is a computational camera, right? It's not a traditional camera where every pixel is displayed, you know. We use every pixel, but not to display it to the end user. So we display mostly computed pixels.
[00:15:56.573] Kent Bye: Great. And finally, what do you see as the ultimate potential of virtual reality, and what do you hope that what you're doing at Jaunt is going to help bring about?
[00:16:06.985] Arthur van Hoff: I think the next sort of steps for what we're doing in terms of cinematic VR is to do things where you can move around, you know, sort of do volumetric video or light fields rendering. Those are challenging, you know, also from a bandwidth perspective, but I think that's sort of the next steps. I think there's going to be a coming together of gaming and entertainment, you know, where entertainment is almost like a game. You're immersed in the content and you can have some interaction with the content. I'm not sure how much, because I also want to have a laid-back experience. But that's something that we'll learn over time. OK, great. Well, thank you. Thanks.
[00:16:46.460] Kent Bye: And thank you for listening. If you'd like to support the Voices of VR podcast, then please consider becoming a patron at patreon.com slash voicesofvr.