#136: Martin Breidt on the Uncanny Valley & Facial Tracking within a VR Head-Mounted Display by Oculus Research

Dr. Martin Breidt is a research technician at the Max Planck Institute for Biological Cybernetics. His bio page says that he’s part of the Cognitive Engineering group where they “develop and use systems from Computer Vision, Computer Graphics, Machine Learning with methods from psychophysics in order to investigate fundamental cognitive processes.”

Martin only had time for a very quick 5-minute chat, but this was enough time for him to give me some pointers to his research on the uncanny valley effect as well as to some work that is being done to capture facial animations while wearing a VR HMD. This led me to learn a lot more about the research that Oculus is doing in order to capture human expressions while wearing a VR HMD.

Martin named Hao Li as doing some very important work in being able to predict facial expressions from partial information based upon statistical models. Hao is an assistant professor of Computer Science at the University of Southern California, and he has a paper titled “Unconstrained Realtime Facial Performance Capture” at the upcoming Conference on Computer Vision and Pattern Recognition (CVPR). Here’s the abstract.

We introduce a realtime facial tracking system specifically designed for performance capture in unconstrained settings using a consumer-level RGB-D sensor. Our framework provides uninterrupted 3D facial tracking, even in the presence of extreme occlusions such as those caused by hair, hand-to-face gestures, and wearable accessories. Anyone’s face can be instantly tracked and the users can be switched without an extra calibration step. During tracking, we explicitly segment face regions from any occluding parts by detecting outliers in the shape and appearance input using an exponentially smoothed and user-adaptive tracking model as prior. Our face segmentation combines depth and RGB input data and is also robust against illumination changes. To enable continuous and reliable facial feature tracking in the color channels, we synthesize plausible face textures in the occluded regions. Our tracking model is personalized on-the-fly by progressively refining the user’s identity, expressions, and texture with reliable samples and temporal filtering. We demonstrate robust and high-fidelity facial tracking on a wide range of subjects with highly incomplete and largely occluded data. Our system works in everyday environments and is fully unobtrusive to the user, impacting consumer AR applications and surveillance.

Here’s a video that goes along with the Unconstrained Realtime Facial Performance Capture paper for CVPR 2015
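To make the occlusion-handling idea from the abstract a bit more concrete, here is a minimal, hypothetical sketch of one ingredient: an exponentially smoothed per-pixel appearance model used as a prior, with pixels that deviate strongly from it flagged as occluders (hair, hands, accessories). This is not Hao Li’s implementation; the frame format, smoothing factor, and threshold are my own assumptions, and the real system additionally fuses depth data and synthesizes face textures in the occluded regions.

```python
import numpy as np

def update_appearance_model(model, frame, visible_mask, alpha=0.1):
    """Exponentially smooth the per-pixel appearance model, but only where the
    face is currently visible, so occluders don't leak into the prior."""
    if model is None:
        return frame.astype(np.float64)
    blended = (1.0 - alpha) * model + alpha * frame
    model = model.copy()
    model[visible_mask] = blended[visible_mask]
    return model

def segment_visible(model, frame, threshold=30.0):
    """Flag pixels whose color strays far from the smoothed model as outliers
    (hair, hands, accessories); returns True where the face is still visible."""
    error = np.linalg.norm(frame.astype(np.float64) - model, axis=-1)
    return error <= threshold

def track_stream(frames, alpha=0.1, threshold=30.0):
    """Toy loop over roughly aligned RGB face crops (H x W x 3 uint8 arrays)."""
    model, masks = None, []
    for frame in frames:
        if model is None:
            visible = np.ones(frame.shape[:2], dtype=bool)
        else:
            visible = segment_visible(model, frame, threshold)
        model = update_appearance_model(model, frame, visible, alpha)
        masks.append(visible)
    return masks
```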

Hao Li is also the lead author of an upcoming SIGGRAPH 2015 paper that is able to capture human expressions even while the user is wearing a VR HMD.

Facial Performance Sensing Head-Mounted Display
Hao Li, Laura Trutoiu, Pei-Lun Hsieh, Tristan Trutna, Lingyu Wei, Kyle Olszewski, Chongyang Ma, Aaron Nicholls
ACM Transactions on Graphics, Proceedings of the 42nd ACM SIGGRAPH Conference and Exhibition 2015, 08/2015

Three of the co-authors of the paper work at Oculus Research: Laura Trutoiu, Tristan Trutna, and Aaron Nicholls. Laura was supposed to present at the IEEE VR panel on “Social Interactions in Virtual Reality: Challenges and Potential,” but she was unable to make the trip to southern France. She was going to talk about faces in VR, and had the following description of her talk:

Faces provide a rich source of information and compelling social interactions will require avatar faces to be expressive and emotive. Tracking the face within the constraints of the HMD and accurately animating facial expressions and speech raise hardware and software challenges. Real-time animation further imposes an extra constraint. We will discuss early research in making facial animation within the HMD constraints a reality. Facial analysis suitable for VR systems could not only provide important non-verbal cues about the human intent to the system, but could also be the basis for sophisticated facial animation in VR. While believable facial synthesis is already very demanding, we believe that facial motion analysis under the constraints of an immersive real-time VR system is the main challenge that needs to be solved.

The implications of being able to capture human expressions within VR are going to be huge for social and telepresence experiences. It’s pretty clear that Facebook and Oculus have a lot of interest in solving this difficult problem, and it looks like we’ll start to see some of the breakthroughs that have been made at SIGGRAPH in August 2015, if not sooner.

As a sneak peek, one of Hao Li’s students, Chongyang Ma, had the following photo on his website that shows an Oculus Rift HMD with an attached camera rig for doing facial capture.

[Image: 2015_fp_thumbnail — Oculus Rift HMD with an attached camera rig for facial capture]

Okay. Back to this very brief interview that I did with Martin at IEEE VR. Here’s the description of Martin’s presentation at the IEEE VR panel on social interactions in VR:

Self-Avatars: Body Scans to Stylized Characters
In VR, avatars are arguably the most natural paradigm for social interaction between humans. Immediately, the question of what such avatars really should look like arises. Although 3D scanning systems have become more widespread, such a semi-realistic reproduction of the physical appearance of a human might not be the most effective choice; we argue that a certain amount of carefully controlled stylization of an avatar’s appearance might not only help in coping with the inherent limitations of immersive real-time VR systems, but also be more effective at achieving task-specific goals with such avatars.

Martin mentions a paper titled Face Reality: Investigating the Uncanny Valley for Virtual Faces that he wrote with Rachel McDonnell for SIGGRAPH 2010.

Here’s the introduction to that paper:

The Uncanny Valley (UV) has become a standard term for the theory that near-photorealistic virtual humans often appear unintentionally eerie or creepy. This UV theory was first hypothesized by robotics professor Masahiro Mori in the 1970s [Mori 1970] but is still taken seriously today by movie and game developers as it can stop audiences feeling emotionally engaged in their stories or games. It has been speculated that this is due to audiences feeling a lack of empathy towards the characters. With the increase in popularity of interactive drama video games (such as L.A. Noire or Heavy Rain), delivering realistic conversing virtual characters has now become very important in the real-time domain. Video game rendering techniques have advanced to a very high quality; however, most games still use linear blend skinning due to the speed of computation. This causes a mismatch between the realism of the appearance and animation, which can result in an uncanny character. Many game developers opt for a stylised rendering (such as cel-shading) to avoid the uncanny effect [Thompson 2004]. In this preliminary work, we begin to study the complex interaction between rendering style and perceived trust, in order to provide guidelines for developers for creating plausible virtual characters.

It has been shown that certain psychological responses, including emotional arousal, are commonly generated by deceptive situations [DePaulo et al. 2003]. Therefore, we used deception as a basis for our experiments to investigate the UV theory. We hypothesised that deception ratings would correspond to empathy, and that highly realistic characters would be rated as more deceptive than stylised ones.

He mentions the famous graph by Masahiro Mori, the robotics researcher who first proposed the concept back in 1970 in the journal Energy. That article was originally in Japanese, but I found this translation of it.

I have noticed that, as robots appear more humanlike, our sense of their familiarity increases until we come to a valley. I call this relation the “uncanny valley.”

Martin isn’t completely convinced that the conceptualization of the uncanny valley that Mori envisioned back in 1970 is necessarily the correct one. He’s interested in continuing to research and empirically measure the uncanny valley effect through experiments, and he hopes to eventually come up with a data-driven model of what works when stylizing virtual humans within VR environments so that they best match our expectations and feel comfortable. At the moment, this job is being done through the artistic intuitions of directors and artists within game development studios, but Martin says that this isn’t scalable for everyone. So he intends to continue researching and better understanding this uncanny valley effect.

Rough Transcript

[00:00:05.452] Kent Bye: The Voices of VR Podcast.

[00:00:11.935] Martin Breidt: My name is Martin Breidt. I work at the Max Planck Institute for Biological Cybernetics in Tübingen, in the department of Professor Bülthoff. My main focus is on data-driven facial animation, and more recently we've looked into different representations of avatars. So what I showed today was a rough overview of what we did in the past, building statistical models for markerless facial motion capture, as well as looking into different aspects of representing an avatar in VR and CG in general. So part of that involved looking at different rendering styles, and most recently we looked into different styles for avatars. So do you really want to look like yourself in VR, or would you like an enhanced body maybe? And if you want an enhanced body, how are we going to achieve that?

[00:01:07.814] Kent Bye: And so it looked like you were starting to put numbers to kind of the uncanny valley effect and sort of having a range of different shaders that give a spectrum from non-real to realistic. And so I'm curious about, by the process of trying to actually scientifically study the uncanny valley effect, what type of things and conclusions you're able to come to from that?

[00:01:27.222] Martin Breidt: Okay. Yeah, you're referring to a work I did together with Rachel McDonnell from Trinity College, which was published at SIGGRAPH a few years ago. And there we tried to investigate the rather vague phenomenon of the uncanny valley. Everybody knows about it, but it's kind of hard to measure. And we set out to measure this by using the same, the identical facial animation and applying different CG shaders to the face. And what we found was indeed, as predicted by Professor Mori in the 70s, even with the intensified version for a moving stimulus, that if we're in the middle ground of a range between very unrealistic, abstract to a very realistic rendering, we do get a drop in subjective ratings from our participants on very subjective impressions like how appealing is the avatar or how trustworthy does the avatar appear. So there are other studies that also have started to investigate the uncanny valley, and we believe there is some truth to it. We don't know whether it's really the sketch that Mori drew, but there is something to consider when representing humans in computer graphics: if you don't fully succeed in very high photorealism, people have wrong expectations about what they're seeing. They're not sure, is this a human, is this a puppet? And this mismatch of expectations and category boundaries will cause people to dislike what they're seeing.

[00:02:59.665] Kent Bye: And in terms of doing facial tracking, I know that, you know, when someone's wearing an Oculus Rift head-mounted display, the face is occluded, and so I'm just curious about, you know, how are you going about trying to actually track the full face when it may not be viewable by the tracking sensor, or if you have to do some sort of sensor fusion then?

[00:03:17.179] Martin Breidt: Yeah, this is indeed a tricky problem, because basically half the face is occluded. I haven't seen a good proposal yet. There is a very recent work by Hao Li published at CVPR where he's using, if I understand it correctly, statistical face models. So basically, from the little bit that you can see, you're trying to infer the rest of the face. I could also imagine that people take advantage of the proximity of the HMD to actually measure what's going on underneath. I mean, there will be some problems because parts of the face won't be as mobile as without an HMD. We've already seen solutions for eye tracking, and I can imagine that at least some part of the facial motion can be recovered despite the fact that it's occluded by the HMD. But we will have to rethink the approaches.
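As a rough illustration of the statistical-model idea Martin describes here, the following is a small hypothetical sketch: a linear face model (mean shape plus a learned basis) is fit by regularized least squares to only the vertices that remain visible below the HMD, and the recovered coefficients are then used to reconstruct the entire face, including the occluded upper half. The model, dimensions, and noise are made up for illustration and are not taken from any of the papers mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear face model: face = mean + basis @ coefficients.
# A real statistical model would be learned from a database of 3D face scans.
N_VERTS, N_COEFFS = 2000, 25
mean_face = rng.normal(size=(N_VERTS, 3))
basis = rng.normal(size=(N_VERTS * 3, N_COEFFS))

def infer_full_face(visible_idx, visible_points, lam=1e-2):
    """Fit model coefficients from the visible (lower-face) vertices only,
    then reconstruct every vertex, including those hidden by the HMD."""
    A = basis.reshape(N_VERTS, 3, N_COEFFS)[visible_idx].reshape(-1, N_COEFFS)
    b = (visible_points - mean_face[visible_idx]).reshape(-1)
    # Ridge regularization keeps the estimate close to the mean face when
    # the visible region under-constrains the coefficients.
    coeffs = np.linalg.solve(A.T @ A + lam * np.eye(N_COEFFS), A.T @ b)
    return mean_face + (basis @ coeffs).reshape(N_VERTS, 3)

# Toy usage: pretend the lower 40% of vertices are visible below the HMD.
visible_idx = np.arange(int(0.4 * N_VERTS))
observed = mean_face[visible_idx] + 0.05 * rng.normal(size=(len(visible_idx), 3))
full_face = infer_full_face(visible_idx, observed)
```

In practice the basis would come from real scan or blendshape data, and the fit would also have to account for rigid head pose and sensor noise.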

[00:04:08.360] Kent Bye: And you're just coming out of a social interaction panel here at IEEE VR, and I'm just curious some of the big topics or discussions that were happening here.

[00:04:17.394] Martin Breidt: Well, social interaction is a big topic, and to me a lot revolves around the type of task that you're trying to solve in the social interaction question. A common theme I've experienced here at IEEE VR is remote collaboration in terms of assembly or disassembly of objects, for example, in the aerospace industry. That seems to be a frequently raised application scenario. From where I come from, we're more interested in basic human perception. So what we're looking at is how people are able to solve very fundamental tasks together. So we have colleagues that look into two people collaborating on simple transportation tasks. And there's this big problem of asymmetry in the interaction, meaning that not necessarily all participants of a social interaction have the same access to hardware, tracking devices, or display devices. And that's something that needs to be taken into account when designing these systems.

[00:05:23.019] Kent Bye: Great. And finally, what's next for you in terms of your research and what you hope to help solve in terms of some of the big open problems that are left in this realm of social interaction and facial tracking?

[00:05:33.352] Martin Breidt: Right now, I'm really excited about the project we're working on, which is about stylization of avatars. Traditionally, the film and game industry has relied on artistic intuition and director input when creating these avatars. But as VR becomes more and more widespread, that solution doesn't really scale. Not everybody can hire a creative person to do a character design, and probably not everybody should be designing his or her own avatar anyway. So right now we're looking into data-driven ways of making that stylization happen in order to improve the effectiveness of an avatar for a given task. Okay, great. Well, thank you. Thank you very much.
