#1464: Snap’s Head of Spectacles Software Engineering Shares Technical Breakdown of Design Tradeoffs

I interviewed Daniel Wagner, Snap’s Head of Spectacles Software Engineering, at the Snap Lens Fest about the Snap Spectacles. See more context in the rough transcript below.

Here’s the presentation Wagner gave at Lens Fest that I based my interview on.

This is a listener-supported podcast through the Voices of VR Patreon.

Music: Fatality

Rough Transcript

[00:00:05.458] Kent Bye: The Voices of VR Podcast. Hello, my name is Kent Bye, and welcome to the Voices of VR Podcast. It's a podcast that looks at the future of spatial computing. You can support the podcast at patreon.com slash voicesofvr. So continuing my series looking at the different announcements around Snap Spectacles, as well as the Snap ecosystem, today's episode I have a chance to speak to Daniel Wagner, who heads up Snap Spectacles software engineering across the glasses, the mobile phone, and the cloud. Daniel actually gave a really great technical talk that was streamed on YouTube. I'll link the talk down in the show notes if you want to see it raw. But I had a chance to dive into some of the more interesting parts of that talk, covering the displays that they're using, the industrial design to deal with the limits of power consumption, and then how to reduce the latency. He gave a whole breakdown of the dual-processor architecture, how the same Snapdragon chip is used on both sides, but they're able to do this real magical architecture to reduce the motion-to-photon latency down to 13 milliseconds, when normally it would be more like 80 to 100 milliseconds if you didn't do this kind of wizardry and prediction. So we dig into some of the nuances of that. I wanted to have this kind of super wonky technical breakdown because a lot of times I see folks within the XR industry just look at the top-line specs: okay, 46-degree field of view, 37 pixels per degree, around 45 minutes' worth of battery, and for a lot of people that's kind of the end of the story. Okay, the field of view isn't wide enough and the battery life isn't viable for most of the experiences that we're used to. So I think, for one, this is going to enable completely new types of use cases, and maybe they're more in short bursts. Maybe it's something that's still intentional wear, where you have to put it on, which is quite annoying. You want to get to the point where you're wearing something all day and you don't have to think about it. That's kind of where the Ray-Ban Meta smart glasses are at this point. But there are design considerations, like being standalone: they don't want external processing, they don't want a puck. So they're choosing constraints that lean toward a form factor you would want to wear all day, even though this one is super bulky and kind of awkward and not a great aesthetic choice. There are different trade-offs, because they wanted to have an extra-wide field of view. They have their own waveguide company that they acquired to enable these specific innovations, like the one-step 2D pupil expansion that they have with the waveguides, which he dives into in much more detail in the talk. Essentially, they need that extra width in order to have the 46-degree field of view. So it's a trade-off between how good you want it to look from the outside versus what your experience is from the inside, even though you may look a little dorky or geeky wearing these super bulky glasses. But they're also trying to miniaturize everything, use metamaterials, and do all these other kinds of wizardry innovations to push the state of the art forward so they can get down to this glasses form factor.
But we dive deep into all the different technical wonky details in this conversation. And if you want even more details, then I also highly recommend checking out the talk that he gave at Lens Fest at Snap. I'll also include the link down below so you can check that out as well. So that's what we're covering on today's episode of the Voices of VR podcast. So this interview with Daniel happened on Thursday, September 19th, 2024. So with that, let's go ahead and dive right in.

[00:03:24.996] Daniel Wagner: So I'm Daniel. I'm a senior director here at Snap. I'm heading the software development for our Spectacles program. Yeah, before that, I was at DAQRI, also doing software. And it's a lot of fun.

[00:03:39.404] Kent Bye: Yeah, maybe you could give a bit more context as to your background and your journey into this space.

[00:03:43.473] Daniel Wagner: Yeah. So I did my PhD thesis on augmented reality on mobile phones. Back in that day, there was no 3D graphics or anything like that on phones. So it was a lot of hacking and programming, implementing rendering engines and so on, which was a lot of fun. And then after my PhD thesis, we got funding for a bigger lab, which developed this together with a commercial partner. And then at some point, Qualcomm got very interested in this and acquired part of that company and the tech. And this is how I joined Qualcomm and worked there on augmented reality, specifically for Qualcomm-powered phones, for, I think, six years. And then I joined DAQRI, which opened an office for us in Vienna, where we started working on six-DoF tracking and latency, and over time also more on enterprise and industrial use cases. And at some point Snap acquired some of that IP, and this is how I joined Snap and started focusing on computer vision, and these days on the whole software development for Spectacles.

[00:04:51.724] Kent Bye: OK, well, yesterday at Lens Fest, you gave a really great technical breakdown of a lot of the trade-offs of the path towards the fifth generation of Spectacles. And so maybe you could just do a brief recap of the different generations and iterations and what you learned from each of those different prototypes as Snap has been developing the Spectacles as a form factor.

[00:05:10.973] Daniel Wagner: Yeah, sure. So I need to start by saying I wasn't there for the first two generations of Spectacles, which were monocular capture cameras. So one camera and a button, and it just captures a video or a photo. I joined right when the third generation started, which had stereo cameras. And I think this was the first one where we also did significantly more computer vision. That wasn't my team, but I was able to get some of these learnings. And one thing, for example, that we and also the team back then weren't aware of is how flexible these glasses are. So even when you just press a button, the glasses deform a little bit, and that violates the calibration that you have done in the factory. So that team started doing a lot of runtime calibration to get that stereo setup working properly, also using the IMU to get more robust tracking. All those were very valuable learnings going into the fourth generation, which was the first one with a display. This was a huge step up for Snap, because this was the first one where you could run lenses directly on the glasses, meaning you suddenly needed a full operating system where you could run lenses that third-party folks wrote and installed on these glasses. You needed to deal with thermals, because suddenly these lenses were drawing so much more power. We had on-device real-time six-DoF tracking and scene understanding, and later on also added a bit of hand tracking. But it soon became clear that these glasses were too limited. They were very lightweight and small, that's what people liked, but the field of view was too small and the compute capacity was too small. So with the fifth generation, which we just launched, we made a huge step up. It's hard to give numbers, but we now have two processors rather than one, and each one of them is multiple times more powerful than the one in the V4. And with that, we added many new capabilities. Now we have real-time scene understanding. We have a really good hand-tracking solution with gestures. We invested a lot of time into our UI, and also into the mobile phone application. Yeah.

[00:07:15.064] Kent Bye: OK, great. Yeah, so that kind of brings us up to today, for what was just announced a couple of days ago at the Snap Partner Summit: the fifth generation of Spectacles. So I've been in the XR industry for 10 years now, covering different aspects of VR and AR, with the HoloLens and the approach that they had with optical see-through, Magic Leap that has kind of come and gone at this point, and you were working at DAQRI, which was for a long time working on more enterprise use cases. And so most of the industry at this point is doing mixed reality pass-through with, like, a VR headset screen, but with camera-based input and then overlaying the digital bits on top of that. But it seems like Snap with the Spectacles is going back to optical see-through and also the standalone form factor, which is distinct from anything else that's out there right now. It's kind of a unique product in its own right. And so maybe you could just give a little bit more context for why optimize for that specific use case of standalone, no external processing, no external puck, no external battery, but all on the head with very limited power and compute, but trying to fit everything in and do the best you could for creating this kind of standalone form factor.

[00:08:22.711] Daniel Wagner: Yes, you brought up a very good point there. I mean, all these devices have their own strengths and weaknesses. If you look at video pass-through, it is awesome for mixing real and virtual in complex ways, because you can manipulate every single pixel and you have very wide fields of view. But at the same time, it really seals you off from the environment. And this is what people also see with the Apple Vision Pro, which was launched predominantly as a pass-through device, rather than the Quest 3, where pass-through is still like an add-on that you use every now and then, but predominantly it is a VR headset. The Apple Vision Pro is meant to be AR, and many people say that they feel sealed off from the real environment. I think if your goal is to really work in the real environment, then optical see-through is the way you want to go. I'm not saying that waveguides are necessarily the final answer here, but optical see-through really keeps you in the real environment, because you see the real world with your eyes directly, rather than a reproduction of it, which will always be limited in resolution or dynamic range or latency. As I also mentioned in my talk yesterday, ideally you would have a latency of zero, because that's what you are used to when you don't wear glasses. And that, or at least an imperceptible latency, is really very, very hard to achieve. These new VR/MR headsets do a great job there, but you can still notice it. Of course, the optical see-through devices have difficulties of their own. Either they are very large, if you go for a birdbath, which is what many others, like I think XREAL, use. They have very nice colors, but they need relatively large space for their optical engines. The promise of the waveguides is that they can be very compact; the optical combiner is literally just a fraction of a millimeter thick. But on the other side, it has to be flat, at least with today's technology. There is research into making them curved, but it's very hard. And then you have all these difficulties with the nanostructures, because most of them are diffractive, which means they are wavelength-dependent. That means that different colors behave slightly differently within the waveguide, and because of that, you often see these rainbow effects and these color uniformity issues, which are getting better and better, but it's still very hard. But we believe that if you want AR all day, then something see-through is the way to go. And from today's perspective, we believe that waveguides are the most promising path. What we also believe is that in the long run you want something like a glasses form factor, not a helmet, not a headset, not an external compute unit. Because if you had the choice with or without a compute unit, purely from a form factor perspective, you would say, why would I want a compute unit if I can have it without one? So we believe that is the long-term goal. And Snap long ago decided that we are not interested in these stopgaps. We see no value in saying let's make an excellent compute unit now, and then later you won't need it anymore. We always said we want something that has the form factor we want to aim at, and I think people will be surprised how soon these will become available. Obviously our current one is not there yet; that's why it's also predominantly a developer device. I mean, people can use it for other things as well, but in terms of how we put it out and how we support it, it is mostly a developer device.
We also think that right now there isn't a big enough market for pure AR glasses yet; the value versus the cost is not there yet, and the technology is also not there yet. But these things will change over the next few years, and it might happen faster than many think.

[00:11:57.552] Kent Bye: And so one of the things I got from your talk was this series of trade-offs, where you have to balance all these different things, from power consumption to latency to this dual compute architecture, and also the waveguides and field of view. So you were talking about the fact that you're using one-step 2D pupil expansion rather than two-step, which would be like 15% of the lens size; you're getting up to around 35% of the lens size, which means that you're able to have a wider field of view, increasing from 26 degrees in the fourth generation up to 46 degrees in this generation. So that's a great expansion. And then one of the trade-offs is that in order to have that bigger field of view, you have to have glasses that are a little bit wider and a little bit bulkier, which gets away from the normal form factor that people have. So there's kind of an aesthetic trade-off, where the glasses have to be a little bit wider in order to have the experiential benefit of a wider field of view. So maybe you could just talk about some of those different innovations that you had to do with some of your own custom waveguide technology that you developed in-house in order to achieve that.

[00:12:59.540] Daniel Wagner: Yeah, so first I should say I'm not a waveguide expert, but I think what people sometimes forget is that the display is not the only thing in there. So if you look at our glasses, it is right that with this 2D expansion, which I think we are one of the very few companies doing, we have an advantage. And we also think we are pretty good at making really small projectors, but there's so much more in there. If you think about it, we also have two cameras on each side, and then we have one processor on each side, and a battery on each side, and then a lot of sensors like IMUs. And then these also need to be connected. So that's why they maybe appear a little bit more clunky than people would like them to be. But there's a lot of data going from one side to the other. That is the one thing. And the other is, even accepting that they will never be as rigid as I, as a computer vision person, would like them to be, we still want to make them such that they are not too flexible. That's why they need to have a certain mass so they don't deform too easily. On the waveguide side, I think we'll continue seeing improvements there. If you, for example, look at Magic Leap 2, which I think is 70 degrees field of view, I think they are one of the very few others who also did 2D expansion. I mean, they have the advantage that their optical engines are generally larger because they have more real estate, and they don't have to also squeeze in the batteries and the computing there. But as with all these things, as the optical components get smaller, and with the way our optical engineers can work with these metamaterials and all the other things, which I know very little about, we will see the field of view become larger, which is what many people focus on. But there are other aspects we also need to improve. For example, we put a lot of effort into improving the efficiency of our waveguides. I'm sure we could have made a larger field of view, but that would have cost more energy just to light up the larger field of view. Then we would have had to sacrifice efficiency, and for us it was really important to have a product that can work outdoors, and with the small batteries, efficiency was super, super important. So that was one of the driving factors why, for example, the field of view is not larger.

[00:15:07.165] Kent Bye: OK. And it seems like this is an unfolding process where you're continuing to iterate, many different iterations now with the fifth generation, and that now you're moving to a dual processing architecture. It was mentioned that it was a Qualcomm Snapdragon, but there wasn't any specific name of what it was. I'm assuming that it's either unannounced or you can't talk about that yet. Or are there any details you can share about what the processors within the Spectacles are?

[00:15:32.219] Daniel Wagner: Yes, we are not saying which specific model it is. It is one that, as far as I know, only we use with a special firmware. And that is everything I can say about this one.

[00:15:42.563] Kent Bye: OK, yeah, it's a pattern that Qualcomm will sometimes make announcements later of whatever is in this hardware. So I expect to see more information on that here at some point. So OK, so it's a dual processor architecture, so you're kind of splitting the processing between the two sides. And in order to distribute the power, you said there was something like 1.8 watts per processor, and then 5 watts total that you have as a budget for the processing. So maybe you could just describe both the dual processing architecture, but also Snap OS, which has to kind of tie everything together.

[00:16:14.844] Daniel Wagner: Yes, sure. So these processors, they are identical, so it's the same on each side, which was great for us because it gave us the flexibility to really make decisions on our own about what functionality we put where. There's always the dream that high-level software architects have that anything can run anywhere and you just move stuff around, but in practice you're very much bound by the input and the output that the processor has to process. And that's why we decided that the one side would mostly be processing the camera data. So all our four cameras connect to one processor, and all the computer vision runs on that one processor. So the six-DoF tracking, the SLAM mapping, the scene understanding, the hand tracking, all that runs on one processor. Because, as I also mentioned yesterday, sending data over a bus to another processor is expensive. So at almost no time, I think, do we ever send an image from one side to the other, because that would be very costly; image data is costly. So only the output, the six-DoF pose, which is a very compact update, but at 800 Hz, goes from one side to the other. And then on the other side, we do all the display-related functions. That means this is where we execute the lenses, the TypeScript code, where we do all the rendering, and where we do the late-stage warping. So when we get an image out of the renderer, it doesn't go to the screen directly as it maybe does on a PC, but instead it goes to this late-stage warping engine, which we also call the compositor. It can combine multiple outputs. For example, the corrections that you want to do for latency reduction on the hands will usually be very different than the ones for the head, because they move independently, or almost independently. So if you do a late-stage correction for the head, then you don't want to apply that to your hand, because it might have moved differently. So that means there are all these outputs, which might come from different renderers or at least different render layers, that all go into this compositor, which merges them together with the very latest IMU data and then converts them into a format that the screen prefers. Because we have a color-sequential display, which means it shows red, green, and blue one after the other with an offset of a few milliseconds, we need to make corrections separately for each color plane, because as you move your head, you will see red at a different time than green, so your head might have moved in between. So each color is corrected separately, and they are also sent one after the other to the display, so that each one can be used by the display as soon as possible. So after we have done the correction for red, it immediately goes out, and while it is going out and being used by the display, we start doing the correction for green. And when the red one is fully sent, we start sending the green one, and while that happens, we do the blue, and then it repeats for the next frame.
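
To make the scheduling Daniel describes a bit more concrete, here is a minimal TypeScript sketch of per-color-plane late-stage reprojection. Every type and function name is a hypothetical illustration, not Snap's actual API; the point is only the ordering, where each color field gets its own, freshest pose prediction just before it is warped and scanned out.

```typescript
// Minimal sketch, assuming hypothetical interfaces for predictor/compositor.
type Vec3 = [number, number, number];
type Quat = [number, number, number, number];
type Pose = { position: Vec3; orientation: Quat };

// Opaque handle to a rendered surface; a real compositor works on GPU memory.
interface Framebuffer {}

interface PosePredictor {
  // Predict the head pose `msAhead` milliseconds into the future,
  // based on the latest 800 Hz IMU samples.
  predict(msAhead: number): Pose;
}

interface ColorPlane {
  name: "red" | "green" | "blue";
  msUntilVisible: number; // offset until this field actually lights up
}

function composeColorSequentialFrame(
  frame: Framebuffer,
  renderPose: Pose, // the pose the frame was originally rendered with
  predictor: PosePredictor,
  planes: ColorPlane[],
  warpPlane: (fb: Framebuffer, from: Pose, to: Pose, plane: ColorPlane) => void,
  sendPlane: (plane: ColorPlane) => void
): void {
  for (const plane of planes) {
    // Re-predict with the shortest horizon available for this specific field,
    // so red, green, and blue are each corrected for where the head will be
    // when that field is actually shown.
    const displayPose = predictor.predict(plane.msUntilVisible);
    warpPlane(frame, renderPose, displayPose, plane);
    // Start transmitting this field immediately; the next field is corrected
    // while this one is still going out to the display.
    sendPlane(plane);
  }
}
```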

[00:19:02.830] Kent Bye: Yeah, in terms of the operating system, were you starting with any Linux-based core, or is this all homegrown from scratch?

[00:19:10.011] Daniel Wagner: No, it's not completely homegrown. That would be awkward. There are good base operating systems, and I won't comment on which one it is here. But we took a base operating system like Linux or Android or one of these. But we soon noticed that all these standard operating systems are predominantly for 2D usage. So they have their own compositors, window managers, and stuff like that. So we had to basically remove all these things so that we could go down to the core, so that we could fully control the GPU and the DPU, the display processing unit, which is a Qualcomm term for the unit that basically takes the image from the GPU and is then responsible for sending it to the display. So we can control all of this directly, to get the latency and the stability that we need. You need to consider that with these timings, as I talked about yesterday, 360 Hz, that means you have about 3 ms per color plane. For a standard operating system that is a very short time. So it takes quite some effort and very tight control to make these things really stable, because if you miss this time bound, then suddenly you see a glitch on the display. So you need to do a lot of very low-level programming to do this. And because of that, we really ripped away everything that we didn't need and then built these things ourselves.
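
For reference, the per-plane budget follows directly from the field rate he mentions; a quick back-of-envelope in TypeScript:

```typescript
// Back-of-envelope arithmetic for the "about 3 ms per color plane" figure:
// at a 360 Hz color-field rate, each red, green, or blue field has roughly
// 2.8 ms of budget for correction and scanout before the next one is due,
// which is why missing a deadline shows up immediately as a visible glitch.
const colorFieldHz = 360;
const budgetPerFieldMs = 1000 / colorFieldHz;
console.log(budgetPerFieldMs.toFixed(2)); // "2.78"
```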

[00:20:26.394] Kent Bye: I'm going to pull up a graphic that you showed yesterday, because there was quite an involved architecture for this dual processing architecture that you were showing. And at the end of the day, it's a 13-millisecond motion-to-photon latency, and I think it's all of these higher-rate numbers that you have that get you down to that 13 milliseconds. But essentially, you're pulling in the camera data at around 30 hertz and the IMU at around 800 hertz. You have a 6DoF tracker that is then sending those 800-hertz poses to the other computer vision models. Then you're sending 800 hertz of poses. And so maybe we'll start there. What is that pose data? Because 6DoF, is that the person's body? Is that other bodies in the scene? How many points? What does that mean, these 6DoF poses?

[00:21:12.091] Daniel Wagner: Yes. So this example talks about what is usually called the motion-to-photon latency, which means that it is driven by the IMU, which means it's basically the head motion. We don't have units where you have IMUs on your hands. That would be very helpful for getting the hand latency better under control, which all devices, also the Apple Vision Pro and the Quest and everybody, are challenged with, because you don't want to tell people you have to wear a ring or a wristband or whatever. But on the head, we have the IMU, and that gives us these updates, which on the one hand are of very high frequency, like 800 Hz, but are also very low latency. So when an IMU sample comes out of the IMU, it is usually only about one or two milliseconds old. Whereas when you get an image from a camera, just considering the exposure time and the readout time, because you need to shuffle that image from the camera sensor into the processor, and then it has to go through the ISP, the image signal processor, until you actually have it to process, it is usually already, like, whatever, 10 milliseconds or 20 milliseconds old. So you're really dealing with old information there. The IMU gives you much lower latency from when the sample was taken to when you can use it. And these 13 milliseconds are really how quickly we can take that IMU data, which, as I said, is already one or two milliseconds old when we get it, shuffle it through our algorithms that do this late-stage reprojection, get it into the DPU, get it over the MIPI bus from the processor to the display, and then make the display show it. So this is where those 13 milliseconds are. We know we can get it down a little bit more. We are working hard on this. But as I said, this is where it becomes really difficult, and our engineers are working on chipping off every fraction of a millisecond there.
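
As a rough illustration of where that budget could go, here is a placeholder decomposition in TypeScript. Only the 13 ms total and the one-to-two-millisecond IMU sample age come from the conversation; the individual stage durations are assumptions chosen to be consistent with it, not measured numbers.

```typescript
// Illustrative (not measured) breakdown of the IMU-driven motion-to-photon path.
// The stage durations below are placeholder assumptions, not official figures.
const imuSampleAgeMs = 1.5;   // an IMU sample is ~1-2 ms old when it arrives
const lateStageWarpMs = 2.5;  // per-color-plane reprojection in the compositor
const dpuAndMipiMs = 3.0;     // display processing unit plus MIPI transfer
const displayScanoutMs = 6.0; // until the panel actually emits the photons
const motionToPhotonMs =
  imuSampleAgeMs + lateStageWarpMs + dpuAndMipiMs + displayScanoutMs;
console.log(motionToPhotonMs); // 13 – matching the quoted motion-to-photon latency
```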

[00:22:53.976] Kent Bye: Yeah, so it sounds like you're kind of splitting up the compute between each of these different sides. And I guess one of the things you said was that you want to minimize the amount of information that is going from one side to the other. But there are computer vision modules on the left side, and then other rendering aspects happening on the right side, so it sounds like you still have to send some information from one side to the other. Because in order to make use of the output from one side, I would imagine it still has to be sent over. But it sounds like it's taking some of that image data and reducing it down to more compact information. And then is it sending that as an input over to the right side, to then do more rendering processes in order to actually produce the output?

[00:23:34.505] Daniel Wagner: Yes, as I said, our architecture, how we split the processing between the processors, allows us to at least not have to send any large data around. So no rendering goes from what we call the apps processor over to the CV processor, and usually no camera data goes from the CV processor over to the apps processor. So these big blocks of data, which can easily be multiple megabytes each, we don't have to send. But we have to send these pose updates. And that is another cost, because these are at 800 hertz. Even just waking up the bus that often and sending data over costs us a lot of power. But there are certain limits to what you can do here.
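
To give a sense of scale, here is an illustrative sketch of what such a pose packet and its bus traffic might look like. The field names and sizes are assumptions for the example, not Snap's actual wire format.

```typescript
// Hypothetical shape of the 800 Hz pose traffic between the two processors:
// poses rather than images. A pose update is tens of bytes; a camera frame
// or rendered surface would be megabytes.
interface PoseUpdate {
  timestampUs: number;                            // when the IMU sample was taken
  position: [number, number, number];             // metres
  orientation: [number, number, number, number];  // unit quaternion
  angularVelocity: [number, number, number];      // extra motion data that helps
  linearVelocity: [number, number, number];       // the predictor on the other side
}

// Roughly 60 bytes per update with float32 fields and a 64-bit timestamp.
const bytesPerUpdate = 8 + (3 + 4 + 3 + 3) * 4;
const updatesPerSecond = 800;
const busBytesPerSecond = bytesPerUpdate * updatesPerSecond; // 48,000 ≈ 48 KB/s
console.log(busBytesPerSecond); // tiny bandwidth, but waking the bus 800x/s still costs power
```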

[00:24:12.396] Kent Bye: Right, right. So the left camera path you're talking about, there's the camera, the IMU, the 6DOF tracking, other computer vision modules, as well as hand tracking and scene understanding that are happening on the left side. So maybe you can just elaborate on some of these other things that you're doing on the left side.

[00:24:28.755] Daniel Wagner: Sure. So, as I said, the one thing we already talked about is the poses. And poses are very compact. In their minimal form, it would just be six floating-point numbers, which could be just a few dozen bytes. In practice, we send a little bit more, because we also send some of the motion data over to the other side, which then helps with the predictions that the late-stage reprojector has to do. But poses are very compact. Another thing that is sent over basically all the time is, for example, the hand tracking data, but there we send over the 3D hand skeleton, which is again very compact. We have, I think, about two dozen joints per hand, by and large, and we really only send over the coordinates, the rotations of the joints, and all that. So this is also much more compact than sending over a whole image. What is larger, for example, is when we do scene understanding and build a depth map, but those we don't send over that often. So we don't do depth estimation at 60 or 30 hertz. That is usually not necessary, because this is mostly for scene understanding, and the scene doesn't change that much. What changes is the position of the user, and we can separate those. So when we send over a depth map and the person has moved in between, we can basically compensate for that motion without losing the validity of that depth map.
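
Here is a hypothetical sketch of what those payloads could look like in TypeScript; the structures, names, and the ~24-joint reading of "about two dozen joints" are illustrative assumptions, not Snap's actual data formats.

```typescript
// Illustrative data shapes for the other compact payloads mentioned above:
// a hand skeleton of roughly two dozen joints per hand, and an occasional
// depth map tagged with the head pose it was captured at, so the receiving
// side can compensate for motion since capture. Not Snap's actual structures.
type Vec3 = [number, number, number];
type Quat = [number, number, number, number];

interface HandFrame {
  timestampUs: number;
  joints: { position: Vec3; rotation: Quat }[]; // ~24 joints per hand
}

interface DepthMapPacket {
  timestampUs: number;
  capturePose: { position: Vec3; rotation: Quat }; // head pose at capture time
  width: number;
  height: number;
  depthMetres: Float32Array; // the only "large" payload, and it is sent rarely
}

// Because the scene changes slowly and only the user moves, the receiver can
// keep an old depth map valid by applying the relative transform
//   T_current_from_capture = inverse(currentPose) * capturePose
// to the reconstructed points, instead of requesting a fresh map at 30-60 Hz.
```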

[00:25:46.305] Kent Bye: OK, and then on the right-hand side, you were talking about how it's receiving the poses at a rate of 800 hertz. There's a 60-hertz pose predictor, there's a renderer that's rendering out at 60 frames per second, and then it's being reprojected at 360 hertz. And then there's the rendering and late-stage reprojection for latency reduction, the 360-hertz color planes, the SnapML, the audio processing, and lens execution inside the power sandbox. And then I'd love to hear any other comments on all of these other things that are happening on the right-hand side.

[00:26:17.266] Daniel Wagner: Sure. So on the right-hand side, what mostly drives the clock is the rendering, which in our case can vary. So per default it runs at 60 Hz, but sometimes you don't need that. For example, if you look at Lens Explorer, which is this menu with all the lenses inside, if you don't interact with it, it doesn't really change what it shows. It just changes which perspective you look at it from. But for that, we don't have to re-render it. Our compositor can actually do these updates without having to re-render Lens Explorer. But once we interact with it, we update Lens Explorer at 60 Hz. But the lens can decide whether it wants to run at a higher or lower rate. So a lens can also say, I want to run at 120 Hz. Unless it is very simple content, you will quickly use too much power, so that's something you usually don't do. But a lens can also decide to run at a lower rate. So for example, we have this chess lens, which, if you don't interact with it, lowers its frame rate, because each one of these pieces is rendered at very high quality and it just uses a lot of power, and there's really no need to render all the time at a fixed frame rate. That might also be something which is different from classic computing: in our system, the main goal is to conserve power. So when we can save power by doing less, we'll just do a little bit less, but always to the extent that the user doesn't perceive it. So as I said, the main driver is probably the render engine, which is basically the counterpart of Lens Studio, which you have on a PC to design lenses. We have our render engine, which we call Lens Core, which is the render engine that executes these lenses. And that, by default, runs at 60 Hz. And in order to do its job, when it starts rendering a new frame, it queries these poses from the pose predictor, which gets these updates from the other side. And then it says, I want to render a frame which will be visible, because it can talk to the compositor and say, if I start working now, when will that be visible on the screen? Then the compositor says, well, if you start now, your next chance to show something is in, let's say, whatever, 27 milliseconds. So then the renderer goes to the predictor and says, hey, I need a pose for where we will be in 27 milliseconds, and then uses that pose to render. Then the rendering, if it runs at 60 hertz, means it has 16 milliseconds until it is done. And then the frame goes into the compositor to prepare to go to the screen. And if that prediction the predictor made had been perfect, then nothing would have to be done. But 27 milliseconds, let's say, is a time span in which human motion can do something that is not predictable. Like, you might tap your glasses and suddenly have a strong motion that was not foreseeable. So because of that, the compositor also goes to the pose predictor and says, hey, I'm going to show something on the screen in, let's say, 10 milliseconds, or 15 or whatever, give me a prediction for that. And in between, because these IMU samples come at 800 hertz, so every 1.25 milliseconds we get a new update, the pose predictor now has better knowledge of what happened in between and can make better predictions. The shorter you have to predict, the better your prediction will be. So the compositor says, well, give me a prediction for what happens in 15 milliseconds, and then, let's say, for red, and then it corrects red.
And then it says, OK, now give me another one for 15, because after that it will be green, and so on. So in practice, that's how we reach these 13 milliseconds, which is less time than it actually takes to render a frame, because we only have to do this correction and not the full 3D rendering anymore.
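
Putting that loop together, here is a compressed TypeScript sketch of the render-then-correct flow. The interfaces are hypothetical, and the millisecond figures simply echo the example above rather than real measurements.

```typescript
// Minimal sketch, assuming hypothetical renderer/compositor/predictor interfaces.
type Pose = {
  position: [number, number, number];
  orientation: [number, number, number, number];
};
interface Framebuffer {}

interface PosePredictor {
  predict(msAhead: number): Pose; // backed by 800 Hz IMU-driven pose updates
}

interface Compositor {
  msUntilNextScanout(): number; // "if I start rendering now, when is it visible?"
  correctAndSend(frame: Framebuffer, renderPose: Pose, predictor: PosePredictor): void;
}

interface LensRenderer {
  render(pose: Pose): Framebuffer; // executes the lens; ~16 ms budget at 60 Hz
}

function renderOneFrame(
  renderer: LensRenderer,
  compositor: Compositor,
  predictor: PosePredictor
): void {
  // 1. Ask how far in the future this frame will actually become visible.
  const horizonMs = compositor.msUntilNextScanout(); // e.g. ~27 ms
  // 2. Render against the pose predicted for that moment.
  const renderPose = predictor.predict(horizonMs);
  const frame = renderer.render(renderPose);
  // 3. Hand off to the compositor, which re-predicts with a much shorter
  //    horizon for each color plane and warps the frame just before scanout,
  //    absorbing whatever the long-horizon prediction got wrong.
  compositor.correctAndSend(frame, renderPose, predictor);
}
```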

[00:29:47.032] Kent Bye: Yeah, and you said it would usually take 80 to 100 milliseconds if you weren't doing all of this kind of magic.

[00:29:53.021] Daniel Wagner: Yes, if we didn't do that, then this pipeline becomes much longer, because then the starting point is the camera frame itself. And as already mentioned, by the time the frame even arrives in the CPU to be processed, or in the DSP or wherever you process it, it is already quite old. Let's assume you have 15 milliseconds of exposure time. We put the reference time of the frame always in the middle of the exposure, which means that by the time the image is done exposing, it's already seven and a half milliseconds old. Then it needs to be copied from the sensor into the RAM of the processor. Then the ISP will process it. When it comes out of the ISP, the CV algorithms will process it. Then those results need to be shipped over to the other processor. Then that processor will do the rendering. And then the compositor will do its work. And then we send it over MIPI to the display, and then it will be shown. If you do nothing to shorten that, then you end up at about 80 to 100 milliseconds. Which, as mentioned, is totally unacceptable. You would never want to use that. It would look really awkward and give you headaches.
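
For a sense of how those stages could add up, here is an illustrative tally in TypeScript. Only the mid-exposure figure comes directly from the explanation above; the other stage durations are placeholder assumptions chosen to land in the quoted range.

```typescript
// Illustrative arithmetic for the uncorrected, camera-driven path. All stage
// durations except the 7.5 ms mid-exposure reference are placeholder guesses.
const halfExposureMs = 7.5;     // reference time sits at mid-exposure of 15 ms
const sensorReadoutMs = 10;     // copy the frame from the sensor into RAM
const ispMs = 5;                // image signal processor
const computerVisionMs = 20;    // tracking / CV on the camera-side processor
const interProcessorMs = 2;     // ship the results to the apps processor
const renderMs = 16;            // one 60 Hz frame of rendering
const composeAndDisplayMs = 20; // compositor, MIPI transfer, display scanout
const uncorrectedLatencyMs =
  halfExposureMs + sensorReadoutMs + ispMs + computerVisionMs +
  interProcessorMs + renderMs + composeAndDisplayMs;
console.log(uncorrectedLatencyMs); // ≈ 80 ms, the low end of the 80-100 ms quoted
```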

[00:30:55.367] Kent Bye: Great. Well, I know you're going to be running off to go be a judge and a juror on the Lensathon that's been happening here over the last couple of days. But I did want to ask one final question, which is, what do you think the ultimate potential of spatial computing and this type of AR form factor might be and what it might be able to enable?

[00:31:13.246] Daniel Wagner: So I think today we are definitely too limited by the existing technology for mass adoption. That's why we position the new Spectacles mostly as a device for developers and enthusiasts, because the technology isn't there yet to really wear them all the time without taking them off. At the same time, we also think that the market isn't there yet, because there isn't enough value yet. There are not enough services yet that are meaningful to people, because somebody needs to pay for those. Somebody needs to pay for the NRE, for the research, somebody needs to pay for the production, somebody needs to pay for the services running them. I think it will take a few more years until these things are there. But I do feel there will be a time when most people will wear AR glasses, smart glasses, like most people have a smartphone today. I'm not sure one will replace the other, the same as smartwatches didn't replace smartphones. I think we will not necessarily see the glasses do things better than the phone, but they will enable doing other things. And maybe eventually the phones will fade out, but I think there will be a very long time where people will just have both.

[00:32:22.972] Kent Bye: Awesome. Well, Danny, thanks so much for joining me today to give a little bit more context for the architecture and for what you've been able to achieve with the Snap Spectacles. I think a lot of people in the XR industry may look at just the top-line specs and make a lot of judgments around this or that. But I feel like there are a lot of reasons for the kinds of trade-offs that you've chosen, and it's creating something that's very unique. So I'm very excited to see where it continues to go in the future, and, of course, over time, to see all the different baseline specs that we have right now increase. But yeah, it seems like there's certainly a lot of technological innovation that I learned about through your talk. And I appreciated taking a little bit more time to break it down a little bit more here on the podcast. So thank you so much. Thank you, too. Thanks again for listening to this episode of the Voices of VR podcast. That's a part of my larger series doing a deep dive into both the announcements around Snap Spectacles as well as the AR ecosystem at Snap. What I do here at the Voices of VR podcast is fairly unique. I really like to lean into oral history, to capture the stories of people who are on the front lines, but also to have my own experiences and to try to give a holistic picture of what's happening not only with the company, but also with the ecosystem of developers that they've been able to cultivate. And so for me, I find the most valuable information comes from the independent artists and creators and developers who are at the front lines of pushing the edges of what this technology can do, and from listening to what their dreams and aspirations are for where this technology is going to go in the future. So I feel like that's a little bit of a different approach than what anybody else is doing. But it also takes a lot of time and energy to go to these places, do these interviews, and put it all together in this type of production. So if you find value in that, then please do consider becoming a member of the Patreon. Just $5 a month will go a long way toward helping me sustain this type of coverage. And if you could give more, $10 or $20 or $50 a month, that has also been a huge help in allowing me to continue to bring this coverage. So you can become a member and donate today at patreon.com slash voicesofvr. Thanks for listening.
