Virtual U: Defeating Face Liveness Detection by Building Virtual Models
From Your Public Photos
Yi Xu, True Price, Jan-Michael Frahm, Fabian Monrose
Department of Computer Science, University of North Carolina at Chapel Hill
{yix, jtprice, jmf, fabian}@cs.unc.edu
Abstract
In this paper, we introduce a novel approach to bypass
modern face authentication systems. More specifically,
by leveraging a handful of pictures of the target user
taken from social media, we show how to create realistic,
textured, 3D facial models that undermine the security
of widely used face authentication solutions. Our frame-
work makes use of virtual reality (VR) systems, incor-
porating along the way the ability to perform animations
(e.g., raising an eyebrow or smiling) of the facial model,
in order to trick liveness detectors into believing that the
3D model is a real human face. The synthetic face of the
user is displayed on the screen of the VR device, and as
the device rotates and translates in the real world, the 3D
face moves accordingly. To an observing face authenti-
cation system, the depth and motion cues of the display
match what would be expected for a human face.
We argue that such VR-based spoofing attacks con-
stitute a fundamentally new class of attacks that point
to serious weaknesses in camera-based authentication
systems: Unless they incorporate other sources of verifi-
able data, systems relying on color image data and cam-
era motion are prone to attacks via virtual realism. To
demonstrate the practical nature of this threat, we con-
duct thorough experiments using an end-to-end imple-
mentation of our approach and show how it undermines
the security of several face authentication solutions that
include both motion-based and liveness detectors.
1 Introduction
Over the past few years, face authentication systems have
become increasingly popular as an enhanced security
feature in both mobile devices and desktop computers.
As the underlying computer vision algorithms have ma-
tured, many application designers and nascent specialist
vendors have jumped in and started to offer solutions for
mobile devices with varying degrees of security and us-
ability. Other more well-known players, like Apple and
Google, are poised to enter the market with their own
solutions, having already acquired several facial recog-
nition software companies1. While the market is seg-
mented based on the type of technology offered (e.g.,
2D facial recognition, 3D recognition, and facial analyt-
ics/face biometric authentication), Gartner research esti-
mates that the overall market will grow to over $6.5 bil-
lion in 2018 (compared to roughly $2 billion today) [13].
With this push to market, improving the accuracy of
face recognition technologies remains an active area of
research in academia and industry. Google’s FaceNet
system, which achieved near-perfect accuracy on the La-
beled Faces in the Wild dataset [47], exemplifies one
such effort. Additionally, recent advances with deep
learning algorithms [38, 53] show much promise in
strengthening the robustness of the face identification
and authentication techniques used today. Indeed, state-
of-the-art face identification systems can now outper-
form their human counterparts [36], and this high accu-
racy is one of the driving factors behind the increased use
of face recognition systems.
However, even given the high accuracy of modern face
recognition technologies, their application in face au-
thentication systems has left much to be desired. For
instance, at the Black Hat security conference in 2009,
Duc and Minh [10] demonstrated the weaknesses of pop-
ular face authentication systems from commodity ven-
dors like Lenovo, Asus, and Toshiba. Amusingly, Duc
and Minh [10] were able to reliably bypass face-locked
computers simply by presenting the software with pho-
tographs and fake pictures of faces. Essentially, the secu-
rity of these systems rested solely on the problem of face
detection, rather than face authentication. This widely
publicized event led to subsequent integration of more
robust face authentication protocols. One prominent ex-
ample is Android OS, which augmented its face authen-
1See, for example, “Apple Acquires Face Recognition, Expression
Analysis Firm Emotient,” TechTimes, Jan. 2016; “Google Acquires
Facial Recognition Software Company PittPatt,” WSJ, 2011.
tication approach in 2012 to require users to blink while
authenticating (i.e., as a countermeasure to still-image
spoofing attacks). Unfortunately, this approach was also
shown to provide little protection, and can be easily by-
passed by presenting the system with two alternating im-
ages — one with the user’s eyes open, and one with her
eyes closed.2 These attacks underscore the fact that face
authentication systems require robust security features
beyond mere recognition in order to foil spoofing attacks.
Loosely speaking, three types of such spoofing attacks
have been used in the past, to varying degrees of success:
(i) still-image-based spoofing, (ii) video-based spoofing,
and (iii) 3D-mask-based spoofing. As the name suggests,
still-image-based spoofing attacks present one or more
still images of the user to the authentication camera; each
image is either printed on paper or shown with a digi-
tized display. Video-based spoofing, on the other hand,
presents a pre-recorded video of the victim’s moving face
in an attempt to trick the system into falsely recognizing
motion as an indication of liveness. The 3D-mask-based
approach, wherein 3D-printed facial masks are used, was
recently explored by Erdogmus and Marcel [11].
As is the typical case in the field of computer se-
curity, the cleverness of skilled, motivated adversaries
drove system designers to incorporate defensive tech-
niques in the biometric solutions they develop. This
cat-and-mouse game continues to play out in the realm
of face authentication systems, and the current recom-
mendation calls for the use of well-designed face live-
ness detection schemes (that attempt to distinguish a real
user from a spoofed one). Indeed, most modern systems
now require more active participation compared to sim-
ple blink detection, often asking the user to rotate her
head or raise an eyebrow during login. Motion-based
techniques that check, for example, that the input cap-
tured during login exhibits sufficient 3D behavior, are
also an active area of research in face authentication.
One such example is the recent work of Li et al. [34]
that appeared in CCS’2015. In that work, the use of
liveness detection was proposed as a solution to thwart-
ing video-based attacks by checking the consistency of
the recorded data with inertial sensors. Such a detection
scheme relies on the fact that as a camera moves relative
to a user’s stationary head, the facial features it detects
will also move in a predictable way. Thus, a 2D video
of the victim would have to be captured under the exact
same camera motion in order to fool the system.
As mentioned in [34], 3D-printed facial reconstruc-
tions offer one option for defeating motion-based live-
ness detection schemes. In our view, a more realizable
approach is to present the system with a 3D facial mesh
in a virtual reality (VR) environment. Here, the motion
2https://www.youtube.com/watch?v=zYxphDK6s3I
of the authenticating camera is tracked, and the VR sys-
tem internally rotates and translates the mesh to match.
In this fashion, the camera observes exactly the same
movement of facial features as it would for a real face,
fulfilling the requirements for liveness detection. Such
an attack defeats color-image- and motion-based face au-
thentication on a fundamental level because, with suffi-
cient effort, a VR system can display an environment that
is essentially indistinguishable from real-world input.
In this paper, we show that it is possible to undermine
modern face authentication systems using one such at-
tack. Moreover, we show that an accurate facial model
can be built using only a handful of publicly accessible
photos — collected, for example, from social network
websites — of the victim. From a pragmatic point of
view, we are confronted with two main challenges: i) the
number of photos of the target may be limited, and ii) for
each available photo, the illumination setting is unknown
and the user’s pose and expression are not constrained.
To overcome these challenges, we leverage robust, pub-
licly available 3D face reconstruction methods from the
field of computer vision, and adapt these techniques to fit
our needs. Once a credible synthetic model of a user is
obtained, we then employ entry-level virtual reality dis-
plays to defeat the state of the art in liveness detection.
The rest of the paper is laid out as follows: §2 provides
background and related work related to face authentica-
tion, exploitation of users’ online photos, and 3D facial
reconstruction. §3 outlines the steps we take to perform
our VR-based attack. In §4, we evaluate the performance
of our method on 5 commercial face authentication sys-
tems and, additionally, on a proposed state-of-the-art sys-
tem for liveness detection. We suggest steps that could
be taken to mitigate our attack in §5, and we address the
implications of our successful attack strategy in §6.
2 Background and Related Work
Before delving into the details of our approach, we first
present pertinent background information needed to un-
derstand the remainder of this paper.
First, we note that given the three prominent classes of
spoofing attacks mentioned earlier, it should be clear that
while still-image-based attacks are the easiest to perform,
they can be easily countered by detecting the 3D struc-
ture of the face. Video-based spoofing is more difficult to
accomplish because facial videos of the target user may
be harder to come by; moreover, such attacks can also
be successfully defeated, for example, using the recently
suggested techniques of Li et al. [34] (which we discuss
in more detail later). 3D-mask-based approaches, on the
other hand, are harder to counter. That said, building
a 3D mask is arguably more time-consuming and also
requires specialized equipment. Nevertheless, because
of the threat this attack vector poses, much research has
gone into detecting the textures of 3D masks [11].
2.1 Modern Defenses Against Spoofing
Just as new types of spoofing attacks have been intro-
duced to fool face authentication systems, so too have
more advanced methods for countering these attacks
been developed. Nowadays, the most popular liveness
detection techniques can be categorized as either texture-
based approaches, motion-based approaches, or liveness
assessment approaches. We discuss each in turn.
Texture-based approaches [11, 25, 37, 40, 54, 60] at-
tempt to identify spoofing attacks based on the assump-
tion that a spoofed face will have a distinctly different
texture from a real face. Specifically, they assume that
due to properties of its generation, a spoofed face (irre-
spective of whether it is printed on paper, shown on a
display, or made as a 3D mask) will be different from
a real face in terms of shape, detail, micro-textures, res-
olution, blurring, gamma correction, and shading. That
is, these techniques rely on perceived limitations of im-
age displays and printing techniques. However, with the
advent of high-resolution displays (e.g., 5K), the differ-
ence in visual quality between a spoofed image and a
living face is hard to notice. Another limitation is that
these techniques often require training on every possible
spoofing material, which is not practical for real systems.
Motion-based approaches [3, 27, 29, 32, 57] detect
spoofing attacks by using motion of the user’s head to
infer 3D shape. Techniques such as optical flow and
focal-length analysis are typically used. The basic as-
sumption is that structures recovered from genuine faces
usually contain sufficient 3D information, whereas struc-
tures from fake faces (photos) are usually planar in depth.
For instance, the approach of Li et al. [34] checks the
consistency of movement between the mobile device’s
internal motion sensors and the observed change in head
pose computed from the recorded video taken while the
claimant attempts to authenticate herself to the device.
Such 3D reasoning provides a formidable defense against
both still-image and video-based attacks.
Lastly, liveness assessment techniques [19, 30, 31, 49]
require the user to perform certain tasks during the au-
thentication stage. For the systems we evaluated, the
user is typically asked to follow certain guidelines dur-
ing registration, and to perform a random series of ac-
tions (e.g., eye movement, lip movement, and blinking)
at login. The requested gestures help to defeat contem-
porary spoofing attacks.
Take-away: For real-world systems, liveness detec-
tion schemes are often combined with motion-based ap-
proaches to provide better security protection than either
can provide on their own. With these ensemble tech-
niques, traditional spoofing attacks can be reliably de-
tected. For that reason, the combination of motion-based
systems and liveness detectors has gained traction and
is now widely adopted in many commercial systems, in-
cluding popular face authentication systems offered by
companies like KeyLemon, Rohos, and Biomids. For the
remainder of this paper, we consider this combination as
the state of the art in defenses against spoofing attacks
for face authentication systems.
2.2 Online Photos and Face Authentication
It should come as no surprise that personal photos from
online social networks can compromise privacy. Major
social network sites advise users to set privacy settings
for the images they upload, but the vast majority of these
photos are often accessible to the public or set to ‘friend-
only’ viewing [14, 26, 35]. Users also do not have di-
rect control over the accessibility of photos of themselves
posted by other users, although they can remove (‘un-
tag’) the association of such photos with their account.
A notable use of social network photos for online se-
curity is Facebook’s social authentication (SA) system
[15], an extension of CAPTCHAs that seeks to bolster
identity verification by requiring the user to identify pho-
tos of their friends. While this method does require more
specific knowledge than general CAPTCHAs, Polakis
et al. [42] demonstrated that facial recognition could be
applied to a user’s public photos to discover their social
relationships and solve 22% of SA tests automatically.
Given that one’s online photo presence is not entirely
controlled by the user alone — but by their collective
social circles — many avenues exist for an attacker to
uncover the facial appearance of a user, even when the
user makes private their own personal photos. In an ef-
fort to curb such easy access, work by Ilia et al. [17] has
explored the automatic privatization of user data across
a social network. This method uses face detection and
photo tags to selectively blur the face of a user when the
viewing party does not have permission to see the photo.
In the future, such an approach may help decrease the
public accessibility of users’ personal photos, but it is
unlikely that an individual’s appearance can ever be com-
pletely obfuscated from attackers across all social media
sites and image stores on the Internet.
Clearly, the availability of online user photos is a boon
for an adversary tasked with the challenge of undermin-
ing face authentication systems. The most germane work
on this front is that of Li et al. [33]. There, the au-
thors proposed an attack that defeated commonly used
face authentication systems by using photos of the target
user gathered from online social networks. Li et al. [33]
reported that 77% of the users in their test set were vul-
nerable to their proposed attack. However, their work
is targeted at face recognition systems that do not in-
corporate face liveness detection. As noted in §2, in
modern face authentication software, sophisticated live-
ness detection approaches are already in use, and these
techniques thwart still-image spoofing attacks of the kind
performed by Li et al. [33].
2.3 3D Facial Reconstruction
Constructing a 3D facial model from a small number
of personal photos involves the application of powerful
techniques from the field of computer vision. Fortu-
nately, there exists a variety of reconstruction approaches
that make this task less daunting than it may seem at first
blush, and many techniques have been introduced for fa-
cial reconstruction from single images [4, 23, 24, 43],
videos [20, 48, 51], and combinations of both [52]. For
pedagogical reasons, we briefly review concepts that
help the reader better understand our approach.
The most popular facial model reconstruction ap-
proaches can be categorized into three classes: shape
from shading (SFS), structure from motion (SFM) com-
bined with dense stereoscopic depth estimation, and sta-
tistical facial models. The SFS approach [24] uses a
model of scene illumination and reflectance to recover
face structure. Using this technique, a 3D facial model
can be reconstructed from only a single input photo. SFS
relies on the assumption that the brightness level and gra-
dient of the face image reveal the 3D structure of the
face. However, the constraints of the illumination model
used in SFS require a relatively simple illumination set-
ting and, therefore, cannot typically be applied to real-
world photo samples, where the configuration of the light
sources is unknown and often complicated.
As an alternative, the structure from motion approach
[12] makes use of multiple photos to triangulate spatial
positions of 3D points. It then leverages stereoscopic
techniques across the different viewpoints to recover the
complete 3D surface of the face. With this method, the
reconstruction of a dense and accurate model often re-
quires many consistent views of the surface from differ-
ent angles; moreover, non-rigid variations (e.g., facial ex-
pressions) in the images can easily cause SFM methods
to fail. In our scenario, these requirements make such an
approach less usable: for many individuals, only a lim-
ited number of images might be publicly available on-
line, and the dynamic nature of the face makes it difficult
to find multiple images having a consistent appearance
(i.e., the exact same facial expression).
Unlike SFS and SFM, statistical facial models [4, 43]
seek to perform facial reconstruction on an image using
a training set of existing facial models. The basis for this
type of facial reconstruction is the 3D morphable model
(3DMM) of Blanz and Vetter [6, 7], which learns the
principal variations of face shape and appearance that
occur within a population, then fits these properties to
images of a specific face. Training the morphable mod-
els can be performed either on a controlled set of im-
ages [8, 39] or from internet photo-collections [23]. The
underlying variations fall on a continuum and capture
both expression (e.g., a frowning-to-smiling spectrum)
and identity (e.g., a skinny-to-heavy or a male-to-female
spectrum). In 3DMM and its derivatives, both 3D shape
and texture information are cast into a high-dimensional
linear space, which can be analyzed with principal com-
ponent analysis (PCA) [22]. By optimizing over the
weights of different eigenvectors in PCA, any particu-
lar human face model can be approximated. Statistical
facial models have been shown to be very robust and only re-
quire a few photos for high-precision reconstruction. For
instance, the approach of Baumberger et al. [4] achieves
good reconstruction quality using only two images.
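To make the PCA machinery behind such statistical models concrete, the following sketch (illustrative only, not tied to any particular 3DMM release; the array shapes and the number of retained axes are assumptions) shows how principal axes of shape variation could be learned from a stack of aligned training shapes and then used to approximate a new face:

```python
import numpy as np

def learn_shape_basis(training_shapes, k=40):
    """
    training_shapes: (M, 3V) array, each row a 3D face shape (V vertices, flattened).
    Returns the mean shape and the top-k principal axes of variation.
    """
    mean = training_shapes.mean(axis=0)
    centered = training_shapes - mean
    # SVD of the centered data gives the principal components in the rows of Vt.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:k].T                      # basis as columns, shape (3V, k)

def approximate(shape, mean, basis):
    """Project a new face shape onto the learned axes and reconstruct it."""
    weights = basis.T @ (shape - mean)
    return mean + basis @ weights, weights
```

Fitting a specific face then reduces to searching for the weight vector that best explains the observed data.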
To make the process fully automatic, recent 3D fa-
cial reconstruction approaches have relied on a few fa-
cial landmark points instead of operating on the whole
model. These landmarks can be accurately detected us-
ing the supervised descent method (SDM) [59] or deep
convolutional networks [50]. By first identifying these
2D features in an image and then mapping them to points
in 3D space, the entire 3D facial surface can be effi-
ciently reconstructed with high accuracy. In this process,
the main challenge is the localization of facial landmarks
within the images, especially contour landmarks (along
the cheekbones), which are half-occluded in non-frontal
views; we introduce a new method for solving this prob-
lem when multiple input images are available.
The end result of 3D reconstruction is an untextured
(i.e., lacking skin color, eye color, etc.) facial surface.
Texturing is then applied using source image(s), creating
a realistic final face model. We next detail our process for
building such a facial model from a user’s publicly avail-
able internet photos, and we outline how this model can
be leveraged for a VR-based face authentication attack.
3 Our Approach
A high-level overview of our approach for creating a syn-
thetic face model is shown in Figure 1. Given one or
more photos of the target user, we first automatically ex-
tract the landmarks of the user’s face (stage 1). These
landmarks capture the pose, shape, and expression of the
user. Next, we estimate a 3D facial model for the user,
optimizing the geometry to match the observed 2D land-
marks (stage 2). Once we have recovered the shape of
the user’s face, we use a single image to transfer texture
information to the 3D mesh. Transferring the texture is
non-trivial since parts of the face might be self-occluded
Figure 1: Overview of our proposed approach.
(e.g., when the photo is taken from the side). The tex-
ture of these occluded parts must be estimated in a man-
ner that does not introduce too many artifacts (stage 3).
Once the texture is filled, we have a realistic 3D model
of the user’s face based on a single image.
However, despite its realism, the output of stage 3 is
still not able to fool modern face authentication systems.
The primary reason for this is that modern face authenti-
cation systems use the subject’s gaze direction as a strong
feature, requiring the user to look at the camera in order
to pass the system. Therefore, we must also automati-
cally correct the direction of the user’s gaze on the tex-
tured mesh (stage 4). The adjusted model can then be de-
formed to produce animation for different facial expres-
sions, such as smiling, blinking, and raising the eyebrows
(stage 5). These expressions are often used as liveness
clues in face authentication systems, and as such, we
need to be able to automatically reproduce them on our
3D model. Finally, we output the textured 3D model into
a virtual reality system (stage 6).
Using this framework, an adversary can bypass both
the face recognition and liveness detection components
of modern face authentication systems. In what follows,
we discuss the approach we take to solve each of the var-
ious challenges that arise in our six-staged process.
3.1 Facial Landmark Extraction
Starting from multiple input photos of the user, our first
task is to perform facial landmark extraction. Follow-
ing the approach of Zhu et al. [63], we extract 68 2D
facial landmarks in each image using the supervised de-
scent method (SDM) [59]. SDM successfully identifies
facial landmarks under relatively large pose differences
(±45° yaw, ±90° roll, ±30° pitch). We chose
the technique of Zhu et al. [63] because it achieves a me-
dian alignment error of 2.7 pixels on well-known datasets
[1] and outperforms other commonly used techniques
(e.g., [5]) for landmark extraction.
Figure 2: Examples of facial landmark extraction
For our needs, SDM works well on most online im-
ages, even those where the face is captured at a low res-
olution (e.g., 40 × 50 pixels). It does, however, fail on
a handful of the online photos we collected (less than
5%) where the pose is beyond the tolerance level of the
algorithm. If this occurs, we simply discard the image.
In our experiments, the landmark extraction results are
manually checked for correctness, although an automatic
scoring system could potentially be devised for this task.
Example landmark extractions are shown in Figure 2.
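As a rough illustration of this stage (not our SDM implementation), the following sketch uses dlib's off-the-shelf 68-point landmark predictor as a stand-in, producing the same kind of per-image output — a 68 x 2 array of pixel coordinates — that the rest of the pipeline consumes. The model file name is an assumption.

```python
import dlib
import numpy as np

# Stand-in for SDM: dlib's 68-point landmark predictor yields the same
# kind of 2D landmark set used by the later reconstruction stages.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # model file assumed present

def extract_landmarks(image):
    """Return a (68, 2) array of landmark coordinates, or None if no face is found."""
    faces = detector(image, 1)
    if len(faces) == 0:
        return None            # e.g., pose beyond tolerance; discard the photo
    shape = predictor(image, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float64)
```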
3.2 3D Model Reconstruction
The 68 extracted 2D point landmarks from each of the
N input images provide us with a set of coordinates
$s_{i,j} \in \mathbb{R}^2$, with $1 \le i \le 68$, $1 \le j \le N$. The projection of
the 3D points $S_{i,j} \in \mathbb{R}^3$ on the face onto the image coordinates
$s_{i,j}$ follows what is called the “weak perspective
projection” (WPP) model [16], computed as follows:

$$s_{i,j} = f_j \, P \, R_j \, (S_{i,j} + t_j), \qquad (1)$$

where $f_j$ is a uniform scaling factor; $P$ is the projection
matrix $\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$; $R_j$ is a $3 \times 3$ rotation matrix defined by
the pitch, yaw, and roll, respectively, of the face relative
to the camera; and $t_j \in \mathbb{R}^3$ is the translation of the face
with respect to the camera. Among these parameters,
only $s_{i,j}$ and $P$ are known, and so we must estimate the
others.
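For illustration, a minimal numpy sketch of the WPP projection in Eq. (1) follows; the Euler-angle convention for $R_j$ is an assumption.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])          # orthographic projection matrix from Eq. (1)

def rotation(pitch, yaw, roll):
    """3x3 rotation matrix from Euler angles (radians); the convention is an assumption."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_wpp(S, f, R, t):
    """Weak perspective projection of 3D landmarks S (n x 3): s = f * P * R * (S + t)."""
    return (f * (P @ (R @ (S + t).T))).T   # returns n x 2 image coordinates
```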
Fortunately, a large body of work exists on the shape
statistics of human faces. Following Zhu et al. [63],
we capture face characteristics using the 3D Morphable
Model (3DMM) [39] with an expression extension pro-
posed by Chu et al. [9]. This method characterizes varia-
tions in face shape for a population using principal com-
ponent analysis (PCA), with each individual’s 68 3D
point landmarks being concatenated into a single feature
vector for the analysis. These variations can be split into
two categories: constant factors related to an individual’s
distinct appearance (identity), and non-constant factors
related to expression. The identity axes capture charac-
teristics such as face width, brow placement, or lip size,
while the expression axes capture variations like smiling
versus frowning. Example axes for variations in expres-
sion are shown in Figure 3.
More formally, for any given individual, the 3D coordinates
$S_{i,j}$ on the face can be modeled as

$$S_{i,j} = \bar{S}_i + A^{id}_i \alpha^{id} + A^{exp}_i \alpha^{exp}_j, \qquad (2)$$

where $\bar{S}_i$ is the statistical average of $S_{i,j}$ among the individuals
in the population, $A^{id}_i$ is the set of principal
axes of variation related to identity, and $A^{exp}_i$ is the set
of principal axes related to expression. $\alpha^{id}$ and $\alpha^{exp}_j$ are
the identity and expression weight vectors, respectively,
that determine person-specific facial characteristics and
expression-specific facial appearance. We obtain $\bar{S}_i$ and
$A^{id}_i$ using the 3D Morphable Model [39] and $A^{exp}_i$ from
FaceWarehouse [8].
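A direct sketch of Eq. (2) is given below, assuming the mean shape and the identity and expression bases have already been loaded from a 3DMM and an expression database; the array shapes are assumptions.

```python
import numpy as np

def model_landmarks(S_mean, A_id, A_exp, alpha_id, alpha_exp):
    """
    Eq. (2): S = S_mean + A_id @ alpha_id + A_exp @ alpha_exp,
    returning the 68 model landmarks as a (68, 3) array.
    S_mean: (68, 3) average landmark positions
    A_id:   (68*3, K_id) identity principal axes (e.g., from a 3DMM)
    A_exp:  (68*3, K_exp) expression principal axes (e.g., from FaceWarehouse)
    """
    offset = A_id @ alpha_id + A_exp @ alpha_exp
    return S_mean + offset.reshape(68, 3)
```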
Figure 3: Illustration of identity axes (heavy-set to thin) and
expression axes (pursed lips to open smile).
When combining Eqs. (1) and (2), we inevitably run
into the so-called “correspondence problem.” That is,
given each identified facial landmark $s_{i,j}$ in the input im-
age, we need to find the corresponding 3D point $S_{i,j}$
on the underlying face model. For landmarks such as
the corners of the eyes and mouth, this correspondence
is self-evident and consistent across images. However,
for contour landmarks marking the edge of the face in
an image, the associated 3D point on the user’s facial
model is pose-dependent: When the user is directly fac-
ing the camera, their jawline and cheekbones are fully in
view, and the observed 2D landmarks lie on the fiducial
boundary on the user’s 3D facial model. When the user
rotates their face left (or right), however, the previously
observed 2D contour landmarks on the left (resp. right)
side of the face shift out of view. As a result, the observed
2D landmarks on the edge of the face correspond to 3D
points closer to the center of the face. This 3D point dis-
placement must be taken into account when recovering
the underlying facial model.
Qu et al. [44] deal with contour landmarks using con-
straints on surface normal direction, based on the obser-
vation that points on the edge of the face in the image
will have surface normals perpendicular to the viewing
direction. However, this approach is less robust because
the normal direction cannot always be accurately esti-
mated and, as such, requires careful parameter tuning.
Zhu et al. [63] proposed a “landmark marching” scheme
that iteratively estimates 3D head pose and 2D contour
landmark position. While their approach is efficient and
robust against different face angles and surface shapes,
it can only handle a single image and cannot refine the
reconstruction result using additional images.
Our solution to the correspondence problem is to
model 3D point variance for each facial landmark using a
pre-trained Gaussian distribution (see Appendix A). Un-
like the approach of Zhu et al. [63] which is based on
single image input, we solve for pose, perspective, ex-
pression, and neutral-expression parameters over all im-
ages jointly. From this, we obtain a neutral-expression
model $S_i$ of the user’s face. A typical reconstruction, $S_i$,
is presented in Figure 4.
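Conceptually, this joint estimation can be posed as a nonlinear least-squares problem over a shared identity vector and per-image pose, scale, and expression parameters. The sketch below (reusing the rotation, project_wpp, and model_landmarks helpers from the earlier sketches) is a simplified illustration only: it minimizes 2D reprojection error, omits the Gaussian contour-landmark variance term and any regularization, and applies a 2D translation after projection rather than the 3D $t_j$ of Eq. (1).

```python
import numpy as np
from scipy.optimize import least_squares

def fit_face(images_landmarks, S_mean, A_id, A_exp, k_id=40, k_exp=10):
    """
    Simplified joint fit: one shared identity vector plus per-image pose,
    scale, translation, and expression, minimizing 2D reprojection error
    against the detected landmarks.
    images_landmarks: list of (68, 2) arrays, one per photo.
    """
    n = len(images_landmarks)
    per_image = 3 + 1 + 2 + k_exp        # pitch/yaw/roll, scale, 2D translation, expression

    def unpack(x):
        alpha_id = x[:k_id]
        rest = x[k_id:].reshape(n, per_image)
        return alpha_id, rest

    def residuals(x):
        alpha_id, rest = unpack(x)
        res = []
        for j, s_obs in enumerate(images_landmarks):
            pitch, yaw, roll, f, tx, ty = rest[j, :6]
            alpha_exp = rest[j, 6:]
            S = model_landmarks(S_mean, A_id, A_exp, alpha_id, alpha_exp)
            s_proj = project_wpp(S, f, rotation(pitch, yaw, roll), np.zeros(3))
            s_proj = s_proj + np.array([tx, ty])   # 2D translation after projection
            res.append((s_proj - s_obs).ravel())
        return np.concatenate(res)

    x0 = np.zeros(k_id + n * per_image)
    x0[k_id + 3::per_image] = 1.0          # initialize each per-image scale f to 1
    return least_squares(residuals, x0)
```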
Figure 4: 3D facial model (right) built from facial landmarks
extracted from 4 images (left).
3.3 Facial Texture Patching
Given the 3D facial model, the next step is to patch the
model with realistic textures that can be recognized by
the face authentication systems. Due to the appearance
variation across social media photos, we have to achieve
this by mapping the pixels in a single captured photo
onto the 3D facial model, which avoids the challenges
of mixing different illuminations of the face. However,
this still leaves many of the regions without texture, and
those untextured spots will be noticeable to modern face
authentication systems. To fill these missing regions, the
naïve approach is to utilize the vertical symmetry of the
face and fill the missing texture regions with their sym-
metrical complements. However, doing so would lead
to strong artifacts at the boundary of missing regions. A
realistic textured model should be free of these artifacts.
To lessen the presence of these artifacts, one approach
is to iteratively average the color of neighboring vertices
as a color trend and then mix this trend with texture de-
tails [45]. However, such an approach over-simplifies
the problem and fails to realistically model the illumina-
tion of facial surfaces. Instead, we follow the suggestion
of Zhu et al. [63] and estimate facial illumination using
spherical harmonics [61], then fill in texture details with
Poisson editing [41]. In this way, the output model will
appear to have a more natural illumination. Sadly, we
cannot use their approach directly as it reconstructs a pla-
nar normalized face, instead of a 3D facial model, and so
we must extend their technique to the 3D surface mesh.
The idea we implemented for improving our initial
textured 3D model was as follows: Starting from the
single photo chosen as the main texture source, we first
estimate and subsequently remove the illumination con-
ditions present in the photo. Next, we map the textured
facial model onto a plane via a conformal mapping, then
impute the unknown texture using 2D Poisson editing.
We further extend their approach to three dimensions and
perform Poisson editing directly on the surface of the fa-
cial model. Intuitively, the idea behind Poisson editing
is to keep the detailed texture in the editing region while
enforcing the texture’s smoothness across the boundary.
This process is defined mathematically as

$$\Delta f = \Delta g, \quad \text{s.t.} \quad f|_{\partial\Omega} = f^{0}|_{\partial\Omega}, \qquad (3)$$

where $\Omega$ is the editing region, $f$ is the editing result, $f^{0}$
is the known original texture value, and $g$ is the texture
value in the editing region that is unknown and needs to
be patched with its reflection complement. On a 3D surface
mesh, every vertex is connected with 2 to 8 neighbors.
Transforming Eq. 3 into discrete form, we have

$$|N_p|\, f_p - \sum_{q \in N_p \cap \Omega} f_q = \sum_{q \in N_p \cap \partial\Omega} f^{0}_q + (\Delta g)_p, \qquad (4)$$

where $N_p$ is the neighborhood of point $p$ on the mesh.
Our enhancement is a natural extension of the Poisson
editing method suggested in the seminal work of Pérez
et al. [41], although no formulation was given for 3D.
By solving Eq. 4 instead of projecting the texture onto a
plane and solving Eq. 3, we obtain more realistic texture
on the facial model, as shown in Figure 5.
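A sketch of solving Eq. (4) as a sparse linear system over the mesh vertices inside the editing region is shown below; the data structures (an adjacency list and a precomputed guidance Laplacian) are assumptions, not our exact implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def poisson_edit_mesh(values, neighbors, omega, guide_laplacian):
    """
    Discrete Poisson editing on a mesh, in the form of Eq. (4).
    values:          (V,) known texture values f0 for all vertices (used on the boundary)
    neighbors:       list of lists; neighbors[p] are the vertices adjacent to p
    omega:           set of vertex indices inside the editing region
    guide_laplacian: (V,) Laplacian of the guidance texture g, i.e. (Δg)_p
    Returns new values on omega; vertices outside omega keep their original values.
    """
    idx = {p: k for k, p in enumerate(sorted(omega))}   # local index inside omega
    m = len(idx)
    A = sp.lil_matrix((m, m))
    b = np.zeros(m)
    for p, k in idx.items():
        A[k, k] = len(neighbors[p])                     # |N_p| f_p
        b[k] = guide_laplacian[p]                       # (Δg)_p
        for q in neighbors[p]:
            if q in idx:
                A[k, idx[q]] -= 1.0                     # - f_q for interior neighbors
            else:
                b[k] += values[q]                       # + f0_q for boundary neighbors
    f = spla.spsolve(A.tocsr(), b)
    out = values.copy()
    for p, k in idx.items():
        out[p] = f[k]
    return out
```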
Figure 5: Naïve symmetrical patching (left); Planar Poisson
editing (middle); 3D Poisson editing (right).
3.4 Gaze Correction
We now have a realistic 3D facial model of the user.
Yet, we found that models at stage 3 were unable to
bypass most well-known face recognition systems. Dig-
ging deeper into the reasons why, we observed that most
recognition systems rely heavily on gaze direction during
authentication, i.e., they fail closed if the user is not look-
ing at the device. To address this, we introduce a simple,
but effective, approach to correct the gaze direction of
our synthetic model (Figure 1, Stage 4).
The idea is as follows. Since we have already re-
constructed the texture of the facial model, we can syn-
thesize the texture data in the eye region. These data
contain the color information from the sclera, cornea,
and pupil and form a three-dimensional distribution in
the RGB color space. We estimate this color distribution
with a 3D Gaussian function whose three principal
components can be computed as $(b_1, b_2, b_3)$ with
weights $(\sigma_1, \sigma_2, \sigma_3)$, $\sigma_1 \ge \sigma_2 \ge \sigma_3 > 0$. We perform the
same analysis for the eye region of the average face
model obtained from 3DMM [39], whose eye is looking
straight towards the camera, and we similarly obtain
principal color components $(b^{std}_1, b^{std}_2, b^{std}_3)$ with weights
$(\sigma^{std}_1, \sigma^{std}_2, \sigma^{std}_3)$, $\sigma^{std}_1 \ge \sigma^{std}_2 \ge \sigma^{std}_3 > 0$. Then, we convert
the eye texture from the average model into the eye
texture of the user. For a texture pixel $c$ in the eye region
of the average texture, we convert it to

$$c_{convert} = \sum_{i=1}^{3} \frac{\sigma_i}{\sigma^{std}_i} \left( c \cdot b^{std}_i \right) b_i. \qquad (5)$$
In effect, we align the color distribution of the average
eye texture with the color distribution of the user’s eye
texture. By patching the eye region of the facial model
with this converted average texture, we realistically cap-
ture the user’s eye appearance with forward gaze.
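The conversion in Eq. (5) amounts to aligning two PCA decompositions of eye-region pixels. A minimal sketch follows; the mean-centering step is an added assumption for numerical stability and is not part of Eq. (5) itself.

```python
import numpy as np

def transfer_eye_color(avg_eye_pixels, user_eye_pixels):
    """
    Align the color distribution of the average model's eye texture with the
    user's eye texture (sketch of Eq. (5)). Both inputs are (N, 3) RGB arrays.
    """
    def pca(pixels):
        mean = pixels.mean(axis=0)
        cov = np.cov((pixels - mean).T)
        w, v = np.linalg.eigh(cov)            # eigenvalues in ascending order
        order = np.argsort(w)[::-1]
        sigmas = np.sqrt(np.maximum(w[order], 1e-12))
        return mean, sigmas, v[:, order]      # principal components as columns

    mu_std, sig_std, B_std = pca(avg_eye_pixels)
    mu_usr, sig_usr, B_usr = pca(user_eye_pixels)

    # Project each average-texture pixel onto the standard axes (c . b_i^std),
    # rescale by the ratio of standard deviations, and re-express it along the
    # user's color axes b_i.
    coords = (avg_eye_pixels - mu_std) @ B_std
    converted = mu_usr + (coords * (sig_usr / sig_std)) @ B_usr.T
    return np.clip(converted, 0, 255)         # assumes 8-bit RGB values
```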
3.5 Adding Facial Animations
Some of the liveness detection methods that we test re-
quire that the user performs specific actions in order to
unlock the system. To mimic these actions, we can sim-
ply animate our facial model using a pre-defined set of
facial expressions (e.g., from FaceWarehouse [8]). Re-
call that in deriving Eq. 2, we have already computed
the weight for the identity axis $\alpha^{id}$, which captures the
user-specific face structure in a neutral expression. We
can adjust the expression of the model by substituting a
specific, known expression weight vector $\alpha^{exp}_{std}$ into Eq. 2.
By interpolating the model’s expression weight from 0 to
$\alpha^{exp}_{std}$, we are able to animate the 3D facial model to smile,
laugh, blink, and raise the eyebrows (see Figure 6).
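As a sketch, this interpolation can reuse the shape composition of Eq. (2) (via the model_landmarks helper above), with the fitted identity weights held fixed and the expression weights scaled from zero to the target vector.

```python
import numpy as np

def animate_expression(S_mean, A_id, A_exp, alpha_id, alpha_exp_target, n_frames=30):
    """
    Animate the reconstructed model by interpolating the expression weights
    from the neutral expression (0) to a target expression vector (e.g., a
    'smile' vector from a blendshape set such as FaceWarehouse), while keeping
    the fitted identity weights alpha_id fixed.
    """
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        alpha_exp = t * alpha_exp_target
        frames.append(model_landmarks(S_mean, A_id, A_exp, alpha_id, alpha_exp))
    return frames    # list of (68, 3) landmark configurations to drive the mesh
```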
Figure 6: Animated expressions. From left to right: smiling,
laughing, closing the eyes, and raising the eyebrows.
3.6 Leveraging Virtual Reality
While the previous steps were necessary to recover a re-
alistic, animated model of a targeted user’s face, our driv-
ing insight is that virtual reality systems can be lever-
aged to display this model as if it were a real, three-
dimensional face. This VR-based spoofing constitutes
a fundamentally new class of attacks that exploit weak-
nesses in camera-based authentication systems.
In the VR system, the synthetic 3D face of the user
is displayed on the screen of the VR device, and as the
device rotates and translates in the real world, the 3D
face moves accordingly. To an observing face authen-
tication system, the depth and motion cues of the dis-
play exactly match what would be expected for a hu-
man face. Our experimental VR setup consists of custom
3D-rendering software displayed on a Nexus 5X smart
phone. Given the ubiquity of smart phones in modern
society, our implementation is practical and comes at
no additional hardware cost to an attacker. In practice,
any device with similar rendering capabilities and iner-
tial sensors could be used.
On smart phones, accelerometers and gyroscopes
work in tandem to provide the device with a sense of
self-motion. An example use case is detecting when the
device is rotated from a portrait view to a landscape view,
and rotating the display in response. However, these sen-
sors are not able to recover absolute translation — that
is, the device is unable to determine how its position has
changed in 3D space. This presents a challenge because
without knowledge of how the device has moved in 3D
space, we cannot move our 3D facial model in a realistic
fashion. As a result, the observed 3D facial motion will
not agree with the device’s inertial sensors, causing our
method to fail on methods like that of Li et al. [34] that
use such data for liveness detection.
Fortunately, it is possible to track the 3D position of
a moving smart phone using its outward-facing camera
with structure from motion (see §2.3). Using the cam-
era’s video stream as input, the method works by tracking
points in the surrounding environment (e.g., the corners
of tables) and then estimating their position in 3D space.
At the same time, the 3D position of the camera is re-
covered relative to the tracked points, thus inferring the
camera’s change in 3D position. Several computer vision
approaches have been recently introduced to solve this
problem accurately and in real time on mobile devices
[28, 46, 55, 56]. In our experiments, we make use of a
printed marker3 placed on a wall in front of the camera,
rather than tracking arbitrary objects in the surrounding
scene; however, the end result is the same. By incorpo-
rating this module into our proof of concept, the perspec-
tive of the viewed model due to camera translation can be
simulated with high consistency and low latency.4
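A hypothetical sketch of such a marker-based tracking module using OpenCV's solvePnP is given below: given the detected 2D corners of a printed marker of known size and the phone camera's intrinsics, the camera's rotation and translation relative to the marker can be recovered each frame and fed to the renderer. The marker size and coordinate conventions are assumptions.

```python
import cv2
import numpy as np

MARKER_SIZE = 0.10   # marker edge length in meters (assumed)
object_points = np.array([[0, 0, 0],
                          [MARKER_SIZE, 0, 0],
                          [MARKER_SIZE, MARKER_SIZE, 0],
                          [0, MARKER_SIZE, 0]], dtype=np.float32)

def camera_pose_from_marker(corners_2d, camera_matrix, dist_coeffs):
    """corners_2d: (4, 2) detected marker corners in the current video frame."""
    ok, rvec, tvec = cv2.solvePnP(object_points, corners_2d.astype(np.float32),
                                  camera_matrix, dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    camera_position = -R.T @ tvec      # camera center in marker coordinates
    return R, camera_position          # feed into the renderer to move the 3D face
```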
An example setup for our attack is shown in Figure
7. The VR system consists of a Nexus 5X unit using
its outward-facing camera to track a printed marker in
the environment. On the Nexus 5X screen, the system
displays a 3D facial model whose perspective is always
consistent with the spatial position and orientation of the
authentication device. The authenticating camera views
the facial model on the VR display, and it is successfully
duped into believing it is viewing the real face of the user.
Figure 7: Example setup using virtual reality to mimic 3D
structure from motion. The authentication system observes a
virtual display of a user’s 3D facial model that rotates and trans-
lates as the device moves. To recover the 3D translation of the
VR device, an outward-facing camera is used to track a marker
in the surrounding environment.
3See Goggle Paper at http://gogglepaper.com/
4Specialized VR systems such as the Oculus Rift could be used to
further improve the precision and latency of camera tracking. Such ad-
vanced, yet easily obtainable, hardware has the potential to deliver even
more sophisticated VR attacks compared to what is presented here.
4 Evaluation
We now demonstrate that our proposed spoofing method
constitutes a significant security threat to modern face
authentication systems. Using real social media photos
from consenting users, we successfully broke five com-
mercial authentication systems with a practical, end-to-
end implementation of our approach. To better under-
stand the threat, we further systematically run lab exper-
iments to test the capabilities and limitations of our pro-
posed method. Moreover, we successfully test our pro-
posed approach with the latest motion-based liveness de-
tection approach by Li et al. [34], which is not yet avail-
able in commercial systems.
Participants
We recruited 20 volunteers for our tests of commercial
face authentication systems. The volunteers were re-
cruited by word of mouth and span graduate students and
faculty in two separate research labs. Consultation with
our IRB departmental liaison revealed that no applica-
tion was needed. There was no compensation for par-
ticipating in the lab study. The ages of the participants
range between 24 and 44 years, and the sample consists
of 6 females and 14 males. The participants come from
a variety of ethnic backgrounds (as stated by the volun-
teers): 6 are of Asian descent, 4 are Indian, 1 is African-
American, 1 is Hispanic, and 8 are Caucasian. With their
consent, we collected public photos from the users’ Face-
book and Google+ social media pages; we also collected
any photos we could find of the users on personal or com-
munity web pages, as well as via image search on the
web. The smallest number of photos we collected for an
individual was 3, and the largest number was 27. The
average number of photos was 15, with a standard de-
viation of approximately 6 photos. No private informa-
tion about the subjects was recorded besides storage of the
photographs they consented to. Any images of subjects
displayed in this paper are shown with the consent of that
particular volunteer.
For our experiments, we manually extracted the region
around the user’s face in each image. An adversary could
also perform this action automatically using tag infor-
mation on social media sites, when available. One in-
teresting aspect of social media photos is they may cap-
ture significant physical changes of users over time. For
instance, one of our participants lost 20 pounds in the
last 6 months, and our reconstruction had to utilize im-
ages from before and after this change. Two other users
had frequent changes in facial hair styles – beards, mous-
taches, and clean-shaven – all of which we used for our
reconstruction. Another user had only uploaded 2 pho-
tos to social media in the past 3 years. These varieties all
present challenges for our framework, both for initially
reconstructing the user’s face and for creating a likeness
that matches their current appearance.
Industry-leading Solutions
We tested our approach on five advanced commercial
face authentication systems: KeyLemon5, Mobius6, True
Key [18], BioID [21], and 1U App7. Table 1 summarizes
the training data required by each system when learning a
user’s facial appearance, as well as the approximate num-
ber of users for each system, when available. All systems
incorporate some degree of liveness detection into their
authentication protocol. KeyLemon and the 1U App re-
quire users to perform an action such as blinking, smil-
ing, rotating the head, and raising the eyebrows. In ad-
dition, the 1U App requests these actions in a random
fashion, making it more resilient to video-based attacks.
BioID, Mobius and True Key are motion-based systems
and detect 3D facial structure as the user turns their head.
It is also possible that these five systems employ other
advanced liveness detection approaches, such as texture-
based detection schemes, but such information has not
been made available to the public.
Methodology
System       | Training Method  | # Installs / Reviews
KeyLemon (3) | Single video     | ∼100,000
Mobius (2)   | 10 still images  | 18 reviews
True Key (1) | Single video     | 50,000-100,000
BioID (2)    | 4 videos         | unknown
1U App (1)   | 1 still image    | 50-100
Table 1: Summary of the face authentication systems evaluated.
The second column lists how each system acquires training data
for learning a user’s face, and the third column shows the approximate
number of installations or reviews each system has received
according to (1) the Google Play Store, (2) the iTunes store, or
(3) softpedia.com, as indicated in parentheses after each system
name. BioID is a relatively new app and does not yet have customer
reviews on iTunes.
All participants were registered with the 5 face authen-
tication systems under indoor illumination. The average
length of time spent by each of the volunteers to register
across all systems was 20 minutes. As a control, we first
verified that all systems were able to correctly identify
the users in the same environment. Next, before testing
our method using textures obtained via social media, we
evaluated whether our system could spoof the recogni-
tion systems using photos taken in this environment. We
5http://www.keylemon.com
6http://www.biomids.com
7http://www.1uapps.com
thus captured one front-view photo for each user under
the same indoor illumination and then created their 3D
facial model with our proposed approach. We found that
these 3D facial models were able to spoof each of the
5 candidate systems with a 100% success rate, which is
shown in the second column of Table 2.
Following this, we reconstructed each user’s 3D fa-
cial model using the images collected from public online
sources. As a reminder, any source image can be used as
the main image when texturing the model. Since not all
textures will successfully spoof the recognition systems,
we created textured reconstructions from all source im-
ages and iteratively presented them to the system (in or-
der of what we believed to be the best reconstruction, fol-
lowed by the second best, and so on) until either authen-
tication succeeded or all reconstructions had been tested.
Findings
We summarize the spoofing success rate for each system
in Table 2. Except for the 1U system, all facial recogni-
tion systems were successfully spoofed for the majority
of participants when using social media photos, and all
systems were spoofed using indoor, frontal view photos.
Out of our 20 participants, there were only 2 individu-
als for whom none of the systems was spoofed via the
social-media-based attack.
Looking into the social media photos we collected of
our participants, we observe a few trends among our re-
sults. First, we note that moderate- to high-resolution
photos lend substantial realism to the textured models.
In particular, photos taken by professional photographers
(e.g., wedding photos or family portraits) lead to high-
quality facial texturing. Such photos are prime targets
for facial reconstruction because they are often posted by
other users and made publicly available. Second, we note
that group photos provide consistent frontal views of in-
dividuals, albeit with lower resolution. In cases where
high-resolution photos are not available, such frontal
views can be used to accurately recover a user’s 3D fa-
cial structure. These photos are easily accessible via
friends of users, as well. Third, we note that the least
spoof-able users were not those who necessarily had a
low number of personal photos, but rather users who had
few forward-facing photos and/or no photos with suffi-
ciently high resolution. From this observation, it seems
that creating a realistic texture for user recognition is the
primary factor in determining whether a face authentica-
tion method will be fooled by our approach. Only a small
number of photos are necessary in order to defeat facial
recognition systems.
We found that our failure to spoof the 1U App, as well
as our lower performance on BioID, using social me-
dia photos was directly related to the poor usability of
System   | Indoor Spoof % | Social Media Spoof % | Social Media Avg. # Tries
KeyLemon | 100%           | 85%                  | 1.6
Mobius   | 100%           | 80%                  | 1.5
True Key | 100%           | 70%                  | 1.3
BioID    | 100%           | 55%                  | 1.7
1U App   | 100%           | 0%                   | —
Table 2: Success rate for 5 face authentication systems using a
model built from (second column) an image of the user taken in
an indoor environment and (third and fourth columns) images
obtained on users’ social media accounts. The fourth column
shows the average number of attempts needed before success-
fully spoofing the target user.
those systems. Specifically, we found the systems have
a very high false rejection rate when live users attempt
to authenticate themselves in different illumination con-
ditions. To test this, we had 5 participants register their
faces indoors on the 4 mobile systems.8 We then had
each user attempt to log in to each system 10 times in-
doors and 10 times outdoors on a sunny day, and we
counted the number of accepted logins in each environ-
ment for each system. True Key and Mobius, which we
found were easier to defeat, correctly authenticated the
users 98% and 100% of the time for indoor logins, re-
spectively, and 96% and 100% of the time for outdoor
logins. Meanwhile, the indoor/outdoor login rates of
BioID and the 1U App were 50%/14% and 96%/48%,
respectively. The high false rejection rates under outdoor
illumination show that the two systems have substantial
difficulty with their authentication when the user’s envi-
ronment changes. Our impression is that 1U’s single-
image user registration simply lacks the training data
necessary to accommodate different illumination set-
tings. BioID is very sensitive to a variety of factors in-
cluding head rotation and illumination, which leads to
many false rejections. (Possibly realizing this, the mak-
ers of BioID therefore grant the user 3 trials per login
attempt.) Even so, as evidenced by the second column
in Table 2, our method still handily defeats the liveness
detection modules of these systems given images of the
user in the original illumination conditions, which sug-
gests that all the systems we tested are vulnerable to our
VR-based attack.
Our findings also suggest that our approach is able to
successfully handle significant changes in facial expres-
sion, illumination, and for the most part, physical charac-
teristics such as weight and facial hair. Moreover, the ap-
proach appears to generalize to users regardless of gen-
der or ethnicity. Given that it has been shown to work on a var-
ied collection of real-world data, we believe that the at-
8As it is a desktop application, KeyLemon was excluded.
tack presented herein represents a realistic security threat
model that could be exploited in the present day.
Next, to gain a deeper understanding of the realism
of this threat, we take a closer look at what conditions
are necessary for our method to bypass the various face
authentication systems we tested. We also consider what
main factors contribute to the failure cases of our method.
4.1 Evaluating System Robustness
To further understand the limitations of the proposed
spoofing system, we test its robustness against resolu-
tion and viewing angle, which are two important factors
for the social media photos users upload. Specifically,
we answer the question: what is the minimum resolu-
tion and maximum head rotation allowed in an uploaded
photo before it becomes unusable for spoofing attacks
like ours? We further explore how low-resolution frontal
images can be used to improve our success rates when
high-resolution side-view images are not available.
4.1.1 Blurry, Grainy Pictures Still Say A Lot
To assess our ability to spoof face authentication systems
when provided only low-resolution images of a user’s
face, we texture the 3D facial models of our sample users
using an indoor, frontal view photo. This photo is then
downsampled at various resolutions such that the dis-
tance between the user’s chin and forehead ranges be-
tween 20 and 50 pixels. Then, we attempt to spoof
the True Key, BioID, and KeyLemon systems with fa-
cial models textured using the down-sampled photos.9 If
we are successful at a certain resolution, that implies that
that resolution leaks the user’s identity information to our
spoofing system. The spoofing success rate for various
image resolutions is shown in Figure 8.
The result indicates that our approach robustly spoofs
face authentication systems when the height of the face in
the image is at least 50 pixels. If the resolution of an up-
loaded photo is less than 30 pixels, the photo is likely of
too low-resolution to reliably encode useful features for
identifying the user. In our sample set, 88% of users had
more than 6 online photos with a chin-to-forehead dis-
tance greater than 100 pixels, which easily satisfies the
resolution requirement of our proposed spoofing system.
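The downsampling used in this experiment can be reproduced with a simple rescaling step; the sketch below is an illustration rather than our exact procedure, scaling a photo so that the chin-to-forehead distance spans a chosen number of pixels.

```python
import cv2

def downsample_face(image, chin_to_forehead_px, target_height_px):
    """
    Rescale a photo so the face (measured chin-to-forehead from the extracted
    landmarks) spans target_height_px pixels, e.g. 20-50, then scale back up so
    downstream processing sees the original image size with reduced detail.
    """
    scale = target_height_px / float(chin_to_forehead_px)
    small = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (image.shape[1], image.shape[0]),
                      interpolation=cv2.INTER_LINEAR)
```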
4.1.2 A Little to the Left, a Little to the Right
To identify the robustness of the proposed system against
head rotation, we first evaluate the maximum yaw angle
allowed for our system to spoof baseline systems using a
single image.
9We skip analysis of Mobius because its detection method is similar
to True Key, and our method did not perform as well on True Key. We
also do not investigate the robustness of our method in the 1U system
because of our inability to spoof this system using online photos.
Figure 8: Spoofing success rate with texture taken from photos
of different resolution.
For all 20 sample users, we collect multi-
ple indoor photos with yaw angle varying from 5 degrees
(approximately frontal view) to 40 degrees (significantly
rotated view). We then perform 3D reconstruction for
each image, for each user, on the same three face au-
thentication systems. The spoofing success rate for a
single input image as a function of head rotation is il-
lustrated in Figure 9 (left). It can be seen that the pro-
posed method successfully spoofs all the baseline sys-
tems when the input image has a largely frontal view. As
yaw angle increases, it becomes more difficult to infer
the user’s frontal view from the image, leading to a de-
creased spoofing success rate.
4.1.3 For Want of a Selfie
The results of Figure 9 (left) indicate that our success rate
falls dramatically if given only a single image with a yaw
angle larger than 20 degrees. However, we argue that
these high-resolution side-angle views can serve as base
images for facial texturing if additional low-resolution
frontal views of the user are available. We test this hy-
pothesis by taking, for each user, the rotated images from
the previous section along with 1 or 2 low-resolution
frontal view photos (chin-to-forehead distance of 30 pix-
els). We then reconstruct each user’s facial model and
use it to spoof our baseline systems. Alone, the pro-
vided low-resolution images provide insufficient texture
for spoofing, and the higher-resolution side view does
not provide adequate facial structure. As shown in Fig-
ure 9 (right), by using the low-resolution front views to
guide 3D reconstruction and then using the side view for
texturing, the spoofing success rate for large-angle head
rotation increases substantially. From a practical stand-
point, low-resolution frontal views are relatively easy to
obtain, since they can often be found in publicly posted
group photos.
Figure 9: Spoofing success rate with different yaw angles. Left: Using only a single image at the specified angle. Right: Supple-
menting the single image with low-resolution frontal views, which aid in 3D reconstruction.
4.2 Seeing Your Face Is Enough
Our approach not only defeats existing commercial sys-
tems having liveness detection — it fundamentally un-
dermines the process of liveness detection based on color
images, entirely. To illustrate this, we use our method
to attack the recently proposed authentication approach
of Li et al. [34], which obtains a high rate of success
in guarding against video-based spoofing attacks. This
system adds another layer to motion-based liveness de-
tection by requiring that the movement of the face in the
captured video be consistent with the data obtained from
the motion sensor of the device. Fortunately, as discussed
in §3, the data consistency requirement is automatically
satisfied with our virtual reality spoofing system because
the 3D model rotates in tandem with the camera motion.
Central to Li et al. [34]’s approach is to build a classi-
fier that evaluates the consistency of captured video and
motion sensor data. In turn, the learned classifier is used
to distinguish real faces from spoofed ones. Since their
code and training samples have not been made public,
we implemented our own version of Li et al. [34]’s live-
ness detection system and trained a classifier with our
own training data. We refer the reader to [34] for a full
overview of the method.
Following the methodology of [34], we capture video
samples (and inertial sensor data) of ∼4 seconds from
the front-facing camera of a mobile phone. In each sam-
ple, the phone is held at a distance of 40cm from the
subject and moved back-and-forth 20cm to the left and
right. We capture 40 samples of real subjects moving
the phone in front of their face, 40 samples where a pre-
recorded video of a user is presented to the camera, and
30 samples where the camera is presented with a 3D re-
construction of a user in our VR environment. For train-
ing, we use a binary logistic regression classifier trained
on 20 samples from each class, with the other samples
used for testing. Due to the relatively small size of our
training sets, we repeat our classification experiments 4
times, with random train/test splits in each trial, and we
report the average performance over all four trials.
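For concreteness, a rough sketch of this evaluation protocol follows. It is not the authors' code: the consistency features are assumed to be precomputed per sample (following [34]), label 1 denotes a real user, and scikit-learn's logistic regression stands in for the binary classifier. Training configurations that exclude a class (e.g., Real+Video) would simply omit that class from the fit while still scoring its held-out samples.

import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate(features, is_real, sample_class,
             n_train_per_class=20, n_trials=4, seed=0):
    """features: (M, D) consistency features; is_real: (M,) 1 = real user,
    0 = spoof; sample_class: (M,) strings in {'real', 'video', 'vr'} used
    for the per-class split and for reporting Table 3-style counts."""
    rng = np.random.default_rng(seed)
    per_trial = []
    for _ in range(n_trials):
        train, test = [], []
        # Random per-class split in each trial, as in the text.
        for cls in np.unique(sample_class):
            idx = rng.permutation(np.flatnonzero(sample_class == cls))
            train.extend(idx[:n_train_per_class])
            test.extend(idx[n_train_per_class:])
        clf = LogisticRegression().fit(features[train], is_real[train])
        preds = clf.predict(features[test])
        # Count held-out samples of each class that are accepted as "real".
        per_trial.append({cls: int(np.sum(preds[sample_class[test] == cls] == 1))
                          for cls in np.unique(sample_class)})
    return per_trial  # average across trials to obtain averaged counts
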
Training Data   | Real       | Video     | VR
Real+Video      | 19.50 / 20 | 0.25 / 20 | 9.75 / 10
Real+Video+VR   | 14.00 / 20 | 0.00 / 20 | 5.00 / 10
Real+VR         | 14.75 / 20 |     —     | 5.00 / 10
Table 3: Number of testing samples classified as real users.
Values in the first column represent true positive rates, and the
second and third columns represent false positives. Each row
shows the classification results after training on the classes in
the first column. The results were averaged over four trials.
The results of our experiments are shown in Table 3.
For each class (real user data, video spoof data, and VR
data), we report the average number (over 4 trials) of test
samples classified as real user data. We experiment with
three different training configurations, which are listed in
the first column of the table. The first row shows the re-
sults when using real user data as positive samples and
video spoof data as negative samples. In this case, it
can easily be seen that the real-versus-video identifica-
tion is almost perfect, matching the results of [34]. How-
ever, our VR-based attack is able to spoof this training
configuration nearly 100% of the time. The second and
third rows of Table 3 show the classification performance
when VR spoof data is included in the training data. In
both cases, our approach defeats the liveness detector in
50% of trials, and the real user data is correctly identified
as such less than 75% of the time.
All three training configurations clearly point to the
fact that our VR system presents motion features that are
close to real user data. Even if the liveness detector of
[34] is specifically trained to look for our VR-based at-
tack, 1 out of every 2 attacks will still succeed, with the
false rejection rate also increasing. Any system using
this detector will need to allow multiple log-in attempts
to account for the decreased recall rate; allowing multi-
ple log-in attempts, however, gives our method more
opportunities to succeed. Overall, the results indicate
that the proposed VR-based attack successfully spoofs
Li et al. [34]’s approach, which is to our knowledge the
state of the art in motion-based liveness detection.
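The cost of extra attempts compounds quickly. Assuming, for illustration, that attempts are independent and using the roughly 50% per-attempt success rate observed above, the probability that at least one of $k$ spoofing attempts succeeds is
$$
P(\text{success within } k \text{ attempts}) = 1 - (1 - p)^{k}, \qquad p \approx 0.5 \;\Rightarrow\; 1 - 0.5^{3} \approx 0.88 \ \text{for } k = 3.
$$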
5 Defense in Depth
While current facial authentication systems succumb to
our VR-based attack, several features could be added to
these systems to confound our approach. Here, we detail
three such features, namely random projection of light
patterns, detection of minor skin tone fluctuations related
to pulse, and the use of illuminated infrared (IR) sensors.
Of these, the first two could still be bypassed with addi-
tional adversary effort, while the third presents a signif-
icantly different hardware configuration that would re-
quire non-trivial alterations to our method.
Light Projection The principle of using light projec-
tion for liveness detection is simple: Using an outward-
facing light source (e.g., the flashlight commonly in-
cluded on camera-equipped mobile phones), flash light
on the user’s face at random intervals. If the observed
change in illumination does not match the random pat-
tern, then face authentication fails. The simplicity of this
approach makes it appealing and easily implementable;
however, an adversary could modify our proposed ap-
proach to detect the random flashes of light and, with
low latency, subsequently add rendered light to the VR
scene. Random projections of structured light [62], e.g.,
checkerboard patterns and lines, would increase the diffi-
culty of such an attack, since the 3D-rendering system
would have to quickly and accurately render the projected
illumination patterns on the model. However, structured
light projection requires specialized hardware that is typi-
cally not found on smartphones and similar devices,
which decreases the feasibility of this mitigation.
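As a rough illustration of the unstructured variant of this defense (our sketch, not a deployed system; the correlation threshold and names are placeholders), the verifier could emit a random on/off flash pattern and require the observed face-region brightness to track it:

import numpy as np

# Illustrative flash-challenge check: the verifier picks a random 0/1 flash
# pattern (one entry per frame) and measures the mean brightness of the
# detected face region in each captured frame. A live face lit by the device
# flash should brighten in step with the challenge; a display replaying a
# pre-rendered scene should not, unless the attacker re-renders the lighting
# with very low latency.
def flash_challenge_passes(flash_pattern, face_brightness, min_correlation=0.7):
    pattern = np.asarray(flash_pattern, dtype=float)
    observed = np.asarray(face_brightness, dtype=float)
    r = np.corrcoef(pattern, observed)[0, 1]
    return r >= min_correlation
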
Pulse Detection Recent computer vision research [2,
58] has explored the prospect of video magnification,
which transforms micro-scale fluctuations over time into
strong visual changes. One such application is the detec-
tion of human pulse from a standard video of a human
face. The method detects small, periodic color changes
related to pulse in the region of the face and then am-
plifies this effect such that the face appears to undergo
strong changes in brightness and hue. This amplification
could be used as an additional method for liveness detec-
tion by requiring that the observed face have a detectable
pulse. Similar ideas have been applied to fingerprint sys-
tems that check for blood flow using light emitted from
beneath a prism. Of course, an attacker using our pro-
posed approach could simply add subtle color variation
to the 3D model to approximate this effect. Nevertheless,
such a method would provide another layer of defense
against spoofed facial models.
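A bare-bones version of such a check (a sketch of the general idea only, not the Eulerian magnification method of [58]; the band limits and threshold are illustrative) simply asks whether the mean green-channel intensity of the face region contains a dominant periodic component in the human heart-rate band:

import numpy as np

# Simplified pulse-presence test: look for a strong spectral peak in the
# 0.7-4.0 Hz band (roughly 42-240 beats per minute) of the per-frame mean
# green intensity of the face region.
def has_pulse(green_means, fps, band=(0.7, 4.0)):
    signal = np.asarray(green_means, dtype=float)
    signal = signal - signal.mean()          # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not np.any(in_band):
        return False
    # Require the heart-rate band to carry a clearly dominant component.
    return spectrum[in_band].max() > 3.0 * np.median(spectrum[1:])
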
Infrared Illumination Microsoft released Windows
Hello as a more personal way to sign into Windows 10
devices with just a look or a touch. The interface
supports biometric authentication using the face, iris,
or fingerprint. The platform relies on Intel's RealSense
camera to provide an IR-based, rather than color-based, fa-
cial authentication method. In principle, their approach
works in the same way as contemporary face authentica-
tion methods, but instead uses an IR camera to capture
a video of the user’s face. The attack presented in this
paper would fail to bypass this approach because typi-
cal VR displays are not built to project IR light; how-
ever, specialized IR display hardware could potentially
be used to overcome this limitation.
One limiting factor that may make IR-based tech-
niques less common (especially on mobile devices) is
the requirement for additional hardware to support this
enhanced form of face authentication. Indeed, as of this
writing, only a handful of personal computers support
Windows Hello.10 Nevertheless, the use of infrared illu-
mination offers intriguing possibilities for the future.
Takeaway In our opinion, it is highly unlikely that ro-
bust facial authentication systems will be able to op-
erate using solely web/mobile camera input. Given
the widespread nature of high-resolution personal online
photos, today’s adversaries have a goldmine of informa-
tion at their disposal for synthetically creating fake face
data. Moreover, even if a system is able to robustly de-
tect a certain type of attack – be it using a paper printout,
a 3D-printed mask, or our proposed method – generaliz-
ing to all possible attacks will increase the possibility of
false rejections and therefore limit the overall usability of
the system. The strongest facial authentication systems
will need to incorporate non-public imagery of the user
that cannot be easily printed or reconstructed (e.g., a skin
heat map from special IR sensors).
6 Discussion
Our work outlines several important lessons for both the
present state and the future state of security, particularly
as it relates to face authentication systems. First, our ex-
ploitation of social media photos to perform facial re-
construction underscores the notion that online privacy
of one’s appearance is tantamount to online privacy of
other personal information, such as age and location.
10See “PC platforms that support Windows Hello” for more info.
The ability of an adversary to recover an individual’s fa-
cial characteristics through online photos is an immedi-
ate and very serious threat, albeit one that clearly can-
not be completely neutralized in the age of social media.
Therefore, it is prudent that face recognition tools be-
come increasingly robust against such threats in order to
remain a viable security option in the future.
At a minimum, it is imperative that face authentica-
tion systems be able to reject synthetic faces with low-
resolution textures, as we show in our evaluations. Of
more concern, however, is the increasing threat of virtual
reality, as well as computer vision, as an adversarial tool.
It appears to us that the designers of face authentication
systems have assumed a rather weak adversarial model
wherein attackers may have limited technical skills and
be limited to inexpensive materials. This practice is
risky, at best. Unfortunately, VR itself is quickly becom-
ing commonplace, cheap, and easy-to-use. Moreover,
VR visualizations are increasingly convincing, making
it easier and easier to create realistic 3D environments
that can be used to fool visual security systems. As such,
it is our belief that authentication mechanisms of the fu-
ture must aggressively anticipate and adapt to the rapid
developments in the virtual and online realms.
Appendix
A Multi-Image Facial Model Estimation
In §3.2, we outline how to associate 2D facial landmarks
with corresponding 3D points on an underlying facial
model. Contour landmarks pose a substantial difficulty
for this 2D-to-3D correspondence problem because the
associated set of 3D points for these features is pose-
dependent. Zhu et al. [63] compensate for this phe-
nomenon by modeling contour landmarks with parallel
curved line segments and iteratively optimizing head ori-
entation and 2D-to-3D correspondence. For a specific
head orientation $R_j$, the corresponding landmark points
on the 3D model are found using an explicit function
based on rotation angle:
$$
\begin{aligned}
s_{i,j} &= f_j P R_j \left( S_{i',j} + t_j \right) \\
S_{i',j} &= \bar{S}_{i'} + A^{id}_{i'} \alpha^{id} + A^{exp}_{i'} \alpha^{exp}_j \\
i' &= \mathrm{land}(i, R_j),
\end{aligned}
\tag{6}
$$
where $\mathrm{land}(i, R_j)$ is the pre-calculated mapping function that computes the position of landmark $i$ on the 3D model when the orientation is $R_j$. Ideally, the first equation in Eq. (6) should hold for all the landmark points in all the images. However, this is not the case due to the alignment error introduced by landmark extraction. Generally, contour landmarks introduce more error than corner landmarks, and this approach actually leads to inferior results when multiple input images are used.
Therefore, in contrast to Zhu et al. [63], we compute the 3D facial model with Maximum a Posteriori (MAP) estimation. We assume the alignment error of each 3D landmark independently follows a Gaussian distribution. Then, the most probable parameters $\theta := (\{f_j\}, \{R_j\}, \{t_j\}, \{\alpha^{exp}_j\}, \alpha^{id})$ can be estimated by minimizing the cost function
$$
\theta = \arg\min_{\theta} \left\{ \sum_{i=1}^{68} \sum_{j=1}^{N} \frac{1}{(\sigma^{s}_{i})^{2}} \left\| s_{i,j} - f_j P R_j \left( S_{i',j} + t_j \right) \right\|^{2} + \sum_{j=1}^{N} (\alpha^{exp}_j)^{\top} \Sigma^{-1}_{exp} \alpha^{exp}_j + (\alpha^{id})^{\top} \Sigma^{-1}_{id} \alpha^{id} \right\}.
\tag{7}
$$
Here, $S_{i',j}$ is computed using Eq. (6). $\Sigma_{id}$ and $\Sigma_{exp}$ are the covariance matrices of $\alpha^{id}$ and $\alpha^{exp}_j$, which can be obtained from the pre-existing face model. $(\sigma^{s}_{i})^{2}$ is the variance of the alignment error of the $i$-th landmark and is obtained from a separate training set consisting of 20 images with hand-labeled landmarks. Eq. (7) can be computed efficiently, yielding the estimated identity weight $\alpha^{id}$, with which we can compute the neutral-expression model $S_i \,(= \bar{S}_i + A^{id}_i \alpha^{id})$.
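For intuition, the following sketch solves a heavily simplified instance of Eq. (7): a single image with known pose $(f, R, t)$, expression weights fixed to zero, and an orthographic projection $P$. Under these assumptions the cost is quadratic in $\alpha^{id}$, so the MAP estimate reduces to a generalized ridge-regression solution. The array shapes and variable names below are ours, not the paper's.

import numpy as np

def fit_identity(landmarks, mean_shape, id_basis, f, R, t, sigma_s, Sigma_id):
    """landmarks: (68, 2) detected 2D landmarks; mean_shape: (68, 3);
    id_basis: (68, 3, K) identity basis per landmark; sigma_s: (68,) per-
    landmark alignment standard deviations; Sigma_id: (K, K) prior covariance."""
    P = np.eye(2, 3)              # orthographic projection onto the image plane
    M = f * P @ R                 # (2, 3) combined projection for this image
    K = id_basis.shape[2]

    # One 2-row block per landmark: residual_i = s_i - M (S_bar_i + A_i alpha + t),
    # each block weighted by 1/sigma_i so that squaring recovers the 1/sigma_i^2 term.
    A = np.zeros((2 * 68, K))
    b = np.zeros(2 * 68)
    for i in range(68):
        w = 1.0 / sigma_s[i]
        A[2 * i:2 * i + 2] = w * (M @ id_basis[i])                 # (2, K)
        b[2 * i:2 * i + 2] = w * (landmarks[i] - M @ (mean_shape[i] + t))

    # Normal equations of Eq. (7) restricted to alpha_id:
    #   (A^T A + Sigma_id^{-1}) alpha_id = A^T b
    alpha_id = np.linalg.solve(A.T @ A + np.linalg.inv(Sigma_id), A.T @ b)
    return alpha_id
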
References
[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying
framework. International Journal of Computer Vision (IJCV), 56
(3):221–255, 2004.
[2] G. Balakrishnan, F. Durand, and J. Guttag. Detecting pulse from
head motions in video. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3430–3437,
2013.
[3] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection method
for face recognition based on optical flow field. In Image Analysis
and Signal Processing, International Conference on, pages 233–
236, 2009.
[4] C. Baumberger, M. Reyes, M. Constantinescu, R. Olariu,
E. De Aguiar, and T. Oliveira Santos. 3d face reconstruction
from video using 3d morphable model and silhouette. In Graph-
ics, Patterns and Images (SIBGRAPI), Conference on, pages 1–8,
2014.
[5] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar.
Localizing parts of faces using a consensus of exemplars. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on,
35(12):2930–2940, 2013.
[6] V. Blanz and T. Vetter. A morphable model for the synthesis
of 3d faces. In Proceedings of the 26th annual conference on
Computer graphics and interactive techniques, pages 187–194.
ACM Press/Addison-Wesley Publishing Co., 1999.
[7] V. Blanz and T. Vetter. Face recognition based on fitting a 3d
morphable model. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 25(9):1063–1074, 2003.
[8] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWare-
house: A 3D facial expression database for visual computing. Vi-
sualization and Computer Graphics, IEEE Transactions on, 20
(3):413–425, 2014.
[9] B. Chu, S. Romdhani, and L. Chen. 3d-aided face recognition
robust to expression and pose variations. In Computer Vision and
Pattern Recognition (CVPR), Conference on, pages 1907–1914,
2014.
[10] N. Duc and B. Minh. Your face is not your password. In Black
Hat Conference, volume 1, 2009.
[11] N. Erdogmus and S. Marcel. Spoofing face recognition with 3d
masks. Information Forensics and Security, IEEE Transactions
on, 9(7):1084–1097, 2014.
[12] D. Fidaleo and G. Medioni. Model-assisted 3d face reconstruc-
tion from video. In Analysis and modeling of faces and gestures,
pages 124–138. Springer, 2007.
[13] Gartner. Gartner backs biometrics for enterprise mobile authen-
tication. Biometric Technology Today, Feb. 2014.
[14] S. Golder. Measuring social networks with digital photograph
collections. In Proceedings of the nineteenth ACM conference on
Hypertext and hypermedia, pages 43–48, 2008.
[15] M. Hicks. A continued commitment to security, 2011. URL
https://www.facebook.com/notes/facebook/
a-continued-commitment-to-security/
486790652130/.
[16] R. Horaud, F. Dornaika, and B. Lamiroy. Object pose: The link
between weak perspective, paraperspective, and full perspective.
International Journal of Computer Vision, 22(2):173–189, 1997.
[17] P. Ilia, I. Polakis, E. Athanasopoulos, F. Maggi, and S. Ioanni-
dis. Face/off: Preventing privacy leakage from photos in social
networks. In Proceedings of the 22nd ACM Conference on Com-
puter and Communications Security, pages 781–792, 2015.
[18] Intel Security. True Key™ by Intel Security: Security white pa-
per 1.0, 2015. URL https://b.tkassets.com/shared/
TrueKey-SecurityWhitePaper-v1.0-EN.pdf.
[19] H.-K. Jee, S.-U. Jung, and J.-H. Yoo. Liveness detection for em-
bedded face recognition system. International Journal of Biolog-
ical and Medical Sciences, 1(4):235–238, 2006.
[20] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3d face align-
ment from 2d videos in real-time. In Automatic Face and Gesture
Recognition (FG), 2015 11th IEEE International Conference and
Workshops on, volume 1, pages 1–8. IEEE, 2015.
[21] O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face
detection using the Hausdorff distance. In Audio-and video-based
biometric person authentication, pages 90–95. Springer, 2001.
[22] I. Jolliffe. Principal component analysis. Wiley Online Library,
2002.
[23] I. Kemelmacher-Shlizerman. Internet based morphable model. In
Proceedings of the IEEE International Conference on Computer
Vision, pages 3256–3263, 2013.
[24] I. Kemelmacher-Shlizerman and R. Basri. 3D face reconstruction
from a single image using a single reference face shape. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 33(2):
394–405, 2011.
[25] G. Kim, S. Eum, J. K. Suhr, D. I. Kim, K. R. Park, and J. Kim.
Face liveness detection based on texture and frequency analy-
ses. In Biometrics (ICB), 5th IAPR International Conference on,
pages 67–72, 2012.
[26] H.-N. Kim, A. El Saddik, and J.-G. Jung. Leveraging personal
photos to inferring friendships in social network services. Expert
Systems with Applications, 39(8):6955–6966, 2012.
[27] S. Kim, S. Yu, K. Kim, Y. Ban, and S. Lee. Face liveness detec-
tion using variable focusing. In Biometrics (ICB), 2013 Interna-
tional Conference on, pages 1–6, 2013.
[28] K. Kolev, P. Tanskanen, P. Speciale, and M. Pollefeys. Turn-
ing mobile phones into 3d scanners. In Computer Vision and
Pattern Recognition (CVPR), IEEE Conference on, pages 3946–
3953, 2014.
[29] K. Kollreider, H. Fronthaler, and J. Bigun. Evaluating liveness by
face images and the structure tensor. In Automatic Identification
Advanced Technologies, Fourth IEEE Workshop on, pages 75–80.
IEEE, 2005.
[30] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun. Real-time
face detection and motion analysis with application in liveness
assessment. Information Forensics and Security, IEEE Transac-
tions on, 2(3):548–558, 2007.
[31] K. Kollreider, H. Fronthaler, and J. Bigun. Verifying liveness by
multiple experts in face biometrics. In Computer Vision and Pat-
tern Recognition Workshops, IEEE Computer Society Conference
on, pages 1–6, 2008.
[32] A. Lagorio, M. Tistarelli, M. Cadoni, C. Fookes, and S. Sridha-
ran. Liveness detection based on 3d face shape analysis. In Bio-
metrics and Forensics (IWBF), International Workshop on, pages
1–4, 2013.
[33] Y. Li, K. Xu, Q. Yan, Y. Li, and R. H. Deng. Understanding
osn-based facial disclosure against face authentication systems.
In Proceedings of the ACM Symposium on Information, Com-
puter and Communications Security (ASIACCS), pages 413–424.
ACM, 2014.
[34] Y. Li, Y. Li, Q. Yan, H. Kong, and R. H. Deng. Seeing your
face is not enough: An inertial sensor-based liveness detection for
face authentication. In Proceedings of the 22nd ACM Conference
on Computer and Communications Security, pages 1558–1569,
2015.
[35] Y. Liu, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Ana-
lyzing facebook privacy settings: user expectations vs. reality. In
Proceedings of the 2011 ACM SIGCOMM conference on Internet
measurement conference, pages 61–70. ACM, 2011.
[36] C. Lu and X. Tang. Surpassing human-level face verifica-
tion performance on LFW with GaussianFace. arXiv preprint
arXiv:1404.3840, 2014.
[37] J. Määttä, A. Hadid, and M. Pietikainen. Face spoofing detection
from single images using micro-texture analysis. In Biometrics
(IJCB), International Joint Conference on, pages 1–7, 2011.
[38] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recog-
nition. In Proceedings of the British Machine Vision Conference
(BMVC), 2015.
[39] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A
3d face model for pose and illumination invariant face recogni-
tion. In Proceedings of the 6th IEEE International Conference
on Advanced Video and Signal based Surveillance (AVSS) for Se-
curity, Safety and Monitoring in Smart Environments, 2009.
[40] B. Peixoto, C. Michelassi, and A. Rocha. Face liveness detection
under bad illumination conditions. In Image Processing (ICIP),
18th IEEE International Conference on, pages 3557–3560, 2011.
[41] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM
Transactions on Graphics (TOG), 22(3):313–318, 2003.
[42] I. Polakis, M. Lancini, G. Kontaxis, F. Maggi, S. Ioannidis, A. D.
Keromytis, and S. Zanero. All your face are belong to us: Break-
ing Facebook's social authentication. In Proceedings of the 28th
Annual Computer Security Applications Conference, pages 399–
408, 2012.
[43] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Fast, robust
and automatic 3d face model reconstruction from videos. In Ad-
vanced Video and Signal Based Surveillance (AVSS), 11th IEEE
International Conference on, pages 113–118, 2014.
[44] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Adaptive con-
tour fitting for pose-invariant 3d face shape reconstruction. In
Proceedings of the British Machine Vision Conference (BMVC),
pages 1–12, 2015.
[45] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Realistic tex-
ture extraction for 3d face models robust to self-occlusion. In
IS&T/SPIE Electronic Imaging. International Society for Optics
and Photonics, 2015.
[46] T. Schops, T. Sattler, C. Hane, and M. Pollefeys. 3d modeling
on the go: Interactive 3d reconstruction of large-scale scenes on
mobile devices. In 3D Vision (3DV), International Conference
on, pages 291–299, 2015.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified
embedding for face recognition and clustering. arXiv preprint
arXiv:1503.03832, 2015.
[48] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of
high-fidelity facial performances using monocular videos. ACM
Transactions on Graphics (TOG), 33(6):222, 2014.
[49] L. Sun, G. Pan, Z. Wu, and S. Lao. Blinking-based live face
detection using conditional random fields. In Advances in Bio-
metrics, pages 252–260. Springer, 2007.
[50] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cas-
cade for facial point detection. In Computer Vision and Pattern
Recognition (CVPR), IEEE Conference on, pages 3476–3483,
2013.
[51] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz.
Total moving face reconstruction. In Computer Vision–ECCV
2014, pages 796–812. Springer, 2014.
[52] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman.
What makes Tom Hanks look like Tom Hanks. In Proceedings of
the IEEE International Conference on Computer Vision, pages
3952–3960, 2015.
[53] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Clos-
ing the gap to human-level performance in face verification. In
Computer Vision and Pattern Recognition (CVPR), IEEE Confer-
ence on, pages 1701–1708, 2014.
[54] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection
from a single image with sparse low rank bilinear discriminative
model. In European Conference on Computer Vision (ECCV),
pages 504–517. 2010.
[55] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and
M. Pollefeys. Live metric 3d reconstruction on mobile phones. In
Proceedings of the IEEE International Conference on Computer
Vision, pages 65–72, 2013.
[56] J. Ventura, C. Arth, G. Reitmayr, and D. Schmalstieg. Global lo-
calization from monocular slam on a mobile phone. Visualization
and Computer Graphics, IEEE Transactions on, 20(4):531–539,
2014.
[57] T. Wang, J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness
detection using 3d structure recovered from a single camera. In
Biometrics (ICB), International Conference on, pages 1–6, 2013.
[58] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and
W. T. Freeman. Eulerian video magnification for revealing subtle
changes in the world. ACM Transactions on Graphics (TOG), 31
(4), 2012.
[59] X. Xiong and F. De la Torre. Supervised descent method and its
applications to face alignment. In Computer Vision and Pattern
Recognition (CVPR), IEEE Conference on, pages 532–539, 2013.
[60] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with
component dependent descriptor. In Biometrics (ICB), Interna-
tional Conference on, pages 1–6, 2013.
[61] L. Zhang and D. Samaras. Face recognition from a single training
image under arbitrary unknown lighting using spherical harmon-
ics. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 28(3):351–363, 2006.
[62] L. Zhang, B. Curless, and S. M. Seitz. Rapid shape acquisition us-
ing color structured light and multi-pass dynamic programming.
In 3D Data Processing Visualization and Transmission, First In-
ternational Symposium on, pages 24–36, 2002.
[63] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and
expression normalization for face recognition in the wild. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 787–796, 2015.
  • 4. USENIX Association 25th USENIX Security Symposium  499 of the threat this attack vector poses, much research has gone into detecting the textures of 3D masks [11]. 2.1 Modern Defenses Against Spoofing Just as new types of spoofing attacks have been intro- duced to fool face authentication systems, so too have more advanced methods for countering these attacks been developed. Nowadays, the most popular liveness detection techniques can be categorized as either texture- based approaches, motion-based approaches, or liveness assessment approaches. We discuss each in turn. Texture-based approaches [11, 25, 37, 40, 54, 60] at- tempt to identify spoofing attacks based on the assump- tion that a spoofed face will have a distinctly different texture from a real face. Specifically, they assume that due to properties of its generation, a spoofed face (irre- spective of whether it is printed on paper, shown on a display, or made as a 3D mask) will be different from a real face in terms of shape, detail, micro-textures, res- olution, blurring, gamma correction, and shading. That is, these techniques rely on perceived limitations of im- age displays and printing techniques. However, with the advent of high-resolution displays (e.g., 5K), the differ- ence in visual quality between a spoofed image and a living face is hard to notice. Another limitation is that these techniques often require training on every possible spoofing material, which is not practical for real systems. Motion-based approaches [3, 27, 29, 32, 57] detect spoofing attacks by using motion of the user’s head to infer 3D shape. Techniques such as optical flow and focal-length analysis are typically used. The basic as- sumption is that structures recovered from genuine faces usually contain sufficient 3D information, whereas struc- tures from fake faces (photos) are usually planar in depth. For instance, the approach of Li et al. [34] checks the consistency of movement between the mobile device’s internal motion sensors and the observed change in head pose computed from the recorded video taken while the claimant attempts to authenticate herself to the device. Such 3D reasoning provides a formidable defense against both still-image and video-based attacks. Lastly, liveness assessment techniques [19, 30, 31, 49] require the user to perform certain tasks during the au- thentication stage. For the systems we evaluated, the user is typically asked to follow certain guidelines dur- ing registration, and to perform a random series of ac- tions (e.g., eye movement, lip movement, and blinking) at login. The requested gestures help to defeat contem- porary spoofing attacks. Take-away: For real-world systems, liveness detec- tion schemes are often combined with motion-based ap- proaches to provide better security protection than either can provide on their own. With these ensemble tech- niques, traditional spoofing attacks can be reliably de- tected. For that reason, the combination of motion-based systems and liveness detectors has gained traction and is now widely adopted in many commercial systems, in- cluding popular face authentication systems offered by companies like KeyLemon, Rohos, and Biomids. For the remainder of this paper, we consider this combination as the state of the art in defenses against spoofing attacks for face authentication systems. 2.2 Online Photos and Face Authentication It should come as no surprise that personal photos from online social networks can compromise privacy. 
Major social network sites advise users to set privacy settings for the images they upload, but the vast majority of these photos are often accessible to the public or set to ‘friend- only’ viewing’ [14, 26, 35]. Users also do not have di- rect control over the accessibility of photos of themselves posted by other users, although they can remove (‘un- tag’) the association of such photos with their account. A notable use of social network photos for online se- curity is Facebook’s social authentication (SA) system [15], an extension of CAPTCHAs that seeks to bolster identity verification by requiring the user to identify pho- tos of their friends. While this method does require more specific knowledge than general CAPTCHAs, Polakis et al. [42] demonstrated that facial recognition could be applied to a user’s public photos to discover their social relationships and solve 22% of SA tests automatically. Given that one’s online photo presence is not entirely controlled by the user alone — but by their collective social circles — many avenues exist for an attacker to uncover the facial appearance of a user, even when the user makes private their own personal photos. In an ef- fort to curb such easy access, work by Ilia et al. [17] has explored the automatic privatization of user data across a social network. This method uses face detection and photo tags to selectively blur the face of a user when the viewing party does not have permission to see the photo. In the future, such an approach may help decrease the public accessibility of users’ personal photos, but it is unlikely that an individual’s appearance can ever be com- pletely obfuscated from attackers across all social media sites and image stores on the Internet. Clearly, the availability of online user photos is a boon for an adversary tasked with the challenge of undermin- ing face authentication systems. The most germane on this front is the work of Li et al. [33]. There, the au- thors proposed an attack that defeated commonly used face authentication systems by using photos of the target user gathered from online social networks. Li et al. [33] reported that 77% of the users in their test set were vul-
  • 5. 500  25th USENIX Security Symposium USENIX Association nerable to their proposed attack. However, their work is targeted at face recognition systems that do not in- corporate face liveness detection. As noted in §2, in modern face authentication software, sophisticated live- ness detection approaches are already in use, and these techniques thwart still-image spoofing attacks of the kind performed by Li et al. [33]. 2.3 3D Facial Reconstruction Constructing a 3D facial model from a small number of personal photos involves the application of powerful techniques from the field of computer vision. Fortu- nately, there exists a variety of reconstruction approaches that make this task less daunting than it may seem on first blush, and many techniques have been introduced for fa- cial reconstruction from single images [4, 23, 24, 43], videos [20, 48, 51], and combinations of both [52]. For pedagogical reasons, we briefly review concepts that help the reader better understand our approach. The most popular facial model reconstruction ap- proaches can be categorized into three classes: shape from shading (SFS), structure from motion (SFM) com- bined with dense stereoscopic depth estimation, and sta- tistical facial models. The SFS approach [24] uses a model of scene illumination and reflectance to recover face structure. Using this technique, a 3D facial model can be reconstructed from only a single input photo. SFS relies on the assumption that the brightness level and gra- dient of the face image reveals the 3D structure of the face. However, the constraints of the illumination model used in SFS require a relatively simple illumination set- ting and, therefore, cannot typically be applied to real- world photo samples, where the configuration of the light sources is unknown and often complicated. As an alternative, the structure from motion approach [12] makes use of multiple photos to triangulate spatial positions of 3D points. It then leverages stereoscopic techniques across the different viewpoints to recover the complete 3D surface of the face. With this method, the reconstruction of a dense and accurate model often re- quires many consistent views of the surface from differ- ent angles; moreover, non-rigid variations (e.g., facial ex- pressions) in the images can easily cause SFM methods to fail. In our scenario, these requirements make such an approach less usable: for many individuals, only a lim- ited number of images might be publicly available on- line, and the dynamic nature of the face makes it difficult to find multiple images having a consistent appearance (i.e., the exact same facial expression). Unlike SFS and SFM, statistical facial models [4, 43] seek to perform facial reconstruction on an image using a training set of existing facial models. The basis for this type of facial reconstruction is the 3D morphable model (3DMM) of Blanz and Vetter [6, 7], which learns the principal variations of face shape and appearance that occur within a population, then fits these properties to images of a specific face. Training the morphable mod- els can be performed either on a controlled set of im- ages [8, 39] or from internet photo-collections [23]. The underlying variations fall on a continuum and capture both expression (e.g., a frowning-to-smiling spectrum) and identity (e.g., a skinny-to-heavy or a male-to-female spectrum). 
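For concreteness, the sketch below expresses this linear model in code: a face is the population mean shape plus weighted combinations of identity and expression axes, as formalized later in Eq. (2). All array sizes and weight values here are illustrative placeholders; an actual fit would load the trained bases from a morphable model and an expression database (cf. [39], [8]).

```python
import numpy as np

# Minimal sketch of the linear statistical face model behind 3DMM-style
# reconstruction. All arrays below are illustrative placeholders, not real
# data: a working system would load the mean shape and principal axes from a
# trained morphable model and an expression database.

N_VERTS = 53490            # number of mesh vertices (example value)
N_ID, N_EXP = 199, 29      # number of identity / expression axes (example values)

mean_shape = np.zeros(3 * N_VERTS)              # average face, flattened (x1,y1,z1,...)
id_basis   = np.zeros((3 * N_VERTS, N_ID))      # principal axes of identity variation
exp_basis  = np.zeros((3 * N_VERTS, N_EXP))     # principal axes of expression variation

def synthesize_face(alpha_id, alpha_exp):
    """Return an (N_VERTS, 3) array of vertex positions for the given
    identity and expression weight vectors."""
    shape = mean_shape + id_basis @ alpha_id + exp_basis @ alpha_exp
    return shape.reshape(-1, 3)

# One hypothetical person: fixed identity weights, varying expression weights.
alpha_id = 0.1 * np.random.randn(N_ID)
neutral  = synthesize_face(alpha_id, np.zeros(N_EXP))          # neutral expression
smiling  = synthesize_face(alpha_id, 1.5 * np.eye(N_EXP)[0])   # push along one expression axis
```

Because identity and expression are separated into different weight vectors, the same identity weights can later be reused with different expression weights to animate the reconstructed face.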
In 3DMM and its derivatives, both 3D shape and texture information are cast into a high-dimensional linear space, which can be analyzed with principal com- ponent analysis (PCA) [22]. By optimizing over the weights of different eigenvectors in PCA, any particu- lar human face model can be approximated. Statistical facial models have shown to be very robust and only re- quire a few photos for high-precision reconstruction. For instance, the approach of Baumberger et al. [4] achieves good reconstruction quality using only two images. To make the process fully automatic, recent 3D fa- cial reconstruction approaches have relied on a few fa- cial landmark points instead of operating on the whole model. These landmarks can be accurately detected us- ing the supervised descent method (SDM) [59] or deep convolutional networks [50]. By first identifying these 2D features in an image and then mapping them to points in 3D space, the entire 3D facial surface can be effi- ciently reconstructed with high accuracy. In this process, the main challenge is the localization of facial landmarks within the images, especially contour landmarks (along the cheekbones), which are half-occluded in non-frontal views; we introduce a new method for solving this prob- lem when multiple input images are available. The end result of 3D reconstruction is a untextured (i.e., lacking skin color, eye color, etc.) facial surface. Texturing is then applied using source image(s), creating a realistic final face model. We next detail our process for building such a facial model from a user’s publicly avail- able internet photos, and we outline how this model can be leveraged for a VR-based face authentication attack. 3 Our Approach A high-level overview of our approach for creating a syn- thetic face model is shown in Figure 1. Given one or more photos of the target user, we first automatically ex- tract the landmarks of the user’s face (stage –). These landmarks capture the pose, shape, and expression of the user. Next, we estimate a 3D facial model for the user, optimizing the geometry to match the observed 2D land- marks (stage —). Once we have recovered the shape of the user’s face, we use a single image to transfer texture information to the 3D mesh. Transferring the texture is non-trivial since parts of the face might be self-occluded
(e.g., when the photo is taken from the side). The texture of these occluded parts must be estimated in a manner that does not introduce too many artifacts (stage 3). Once the texture is filled, we have a realistic 3D model of the user's face based on a single image.

Figure 1: Overview of our proposed approach.

However, despite its realism, the output of stage 3 is still not able to fool modern face authentication systems. The primary reason for this is that modern face authentication systems use the subject's gaze direction as a strong feature, requiring the user to look at the camera in order to pass the system. Therefore, we must also automatically correct the direction of the user's gaze on the textured mesh (stage 4). The adjusted model can then be deformed to produce animation for different facial expressions, such as smiling, blinking, and raising the eyebrows (stage 5). These expressions are often used as liveness clues in face authentication systems, and as such, we need to be able to automatically reproduce them on our 3D model. Finally, we output the textured 3D model into a virtual reality system (stage 6).

Using this framework, an adversary can bypass both the face recognition and liveness detection components of modern face authentication systems. In what follows, we discuss the approach we take to solve each of the various challenges that arise in our six-stage process.

3.1 Facial Landmark Extraction

Starting from multiple input photos of the user, our first task is to perform facial landmark extraction. Following the approach of Zhu et al. [63], we extract 68 2D facial landmarks in each image using the supervised descent method (SDM) [59]. SDM successfully identifies facial landmarks under relatively large pose differences (±45° yaw, ±90° roll, ±30° pitch). We chose the technique of Zhu et al. [63] because it achieves a median alignment error of 2.7 pixels on well-known datasets [1] and outperforms other commonly used techniques (e.g., [5]) for landmark extraction.

Figure 2: Examples of facial landmark extraction.

For our needs, SDM works well on most online images, even those where the face is captured at a low resolution (e.g., 40 × 50 pixels). It does, however, fail on a handful of the online photos we collected (less than 5%) where the pose is beyond the tolerance level of the algorithm. If this occurs, we simply discard the image. In our experiments, the landmark extraction results are manually checked for correctness, although an automatic scoring system could potentially be devised for this task. Example landmark extractions are shown in Figure 2.

3.2 3D Model Reconstruction

The 68 landmarks extracted from each of the N input images provide us with a set of 2D coordinates s_{i,j} ∈ R^2, with 1 ≤ i ≤ 68, 1 ≤ j ≤ N. The projection of the corresponding 3D points S_{i,j} ∈ R^3 on the face onto the image coordinates s_{i,j} follows what is called the "weak perspective projection" (WPP) model [16], computed as follows:

    s_{i,j} = f_j P R_j (S_{i,j} + t_j),    (1)

where f_j is a uniform scaling factor; P is the projection matrix \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}; R_j is a 3×3 rotation matrix defined by
the pitch, yaw, and roll, respectively, of the face relative to the camera; and t_j ∈ R^3 is the translation of the face with respect to the camera. Among these parameters, only s_{i,j} and P are known, and so we must estimate the others.

Fortunately, a large body of work exists on the shape statistics of human faces. Following Zhu et al. [63], we capture face characteristics using the 3D Morphable Model (3DMM) [39] with an expression extension proposed by Chu et al. [9]. This method characterizes variations in face shape for a population using principal component analysis (PCA), with each individual's 68 3D point landmarks being concatenated into a single feature vector for the analysis. These variations can be split into two categories: constant factors related to an individual's distinct appearance (identity), and non-constant factors related to expression. The identity axes capture characteristics such as face width, brow placement, or lip size, while the expression axes capture variations like smiling versus frowning. Example axes for variations in expression are shown in Figure 6.

More formally, for any given individual, the 3D coordinates S_{i,j} on the face can be modeled as

    S_{i,j} = \bar{S}_i + A^{id}_i α^{id} + A^{exp}_i α^{exp}_j,    (2)

where \bar{S}_i is the statistical average of S_{i,j} among the individuals in the population, A^{id}_i is the set of principal axes of variation related to identity, and A^{exp}_i is the set of principal axes related to expression. α^{id} and α^{exp}_j are the identity and expression weight vectors, respectively, that determine person-specific facial characteristics and expression-specific facial appearance. We obtain \bar{S}_i and A^{id}_i using the 3D Morphable Model [39] and A^{exp}_i from Face Warehouse [8].

Figure 3: Illustration of identity axes (heavy-set to thin) and expression axes (pursed lips to open smile).

When combining Eqs. (1) and (2), we inevitably run into the so-called "correspondence problem." That is, given each identified facial landmark s_{i,j} in the input image, we need to find the corresponding 3D point S_{i′,j} on the underlying face model. For landmarks such as the corners of the eyes and mouth, this correspondence is self-evident and consistent across images. However, for contour landmarks marking the edge of the face in an image, the associated 3D point on the user's facial model is pose-dependent: When the user is directly facing the camera, their jawline and cheekbones are fully in view, and the observed 2D landmarks lie on the fiducial boundary on the user's 3D facial model. When the user rotates their face left (or right), however, the previously observed 2D contour landmarks on the left (resp. right) side of the face shift out of view. As a result, the observed 2D landmarks on the edge of the face correspond to 3D points closer to the center of the face. This 3D point displacement must be taken into account when recovering the underlying facial model.

Qu et al. [44] deal with contour landmarks using constraints on surface normal direction, based on the observation that points on the edge of the face in the image will have surface normals perpendicular to the viewing direction. However, this approach is less robust because the normal direction cannot always be accurately estimated and, as such, requires careful parameter tuning. Zhu et al. [63] proposed a "landmark marching" scheme that iteratively estimates 3D head pose and 2D contour landmark position.
While their approach is efficient and robust against different face angles and surface shapes, it can only handle a single image and cannot refine the reconstruction result using additional images. Our solution to the correspondence problem is to model 3D point variance for each facial landmark using a pre-trained Gaussian distribution (see Appendix A). Un- like the approach of Zhu et al. [63] which is based on single image input, we solve for pose, perspective, ex- pression, and neutral-expression parameters over all im- ages jointly. From this, we obtain a neutral-expression model Si of the user’s face. A typical reconstruction, Si, is presented in Figure 4. Figure 4: 3D facial model (right) built from facial landmarks extracted from 4 images (left).
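To make this fitting step concrete, the sketch below shows the weak perspective projection of Eq. (1) and the variance-weighted landmark residual that a multi-image fit of this kind would minimize. The Euler-angle rotation parameterization and the per-landmark variances are illustrative assumptions, not details taken from our actual implementation.

```python
import numpy as np

# Sketch of the weak perspective projection (WPP) in Eq. (1) and the
# per-landmark reprojection residual used when fitting pose, scale, identity,
# and expression parameters jointly over all input images.

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])              # orthographic projection matrix

def euler_to_R(pitch, yaw, roll):
    """Compose a 3x3 rotation matrix from pitch/yaw/roll angles (radians)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw),   np.sin(yaw)
    cz, sz = np.cos(roll),  np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project_wpp(S, f, R, t):
    """Eq. (1): s = f * P * R * (S + t), for an (n, 3) array of model points S."""
    return (f * (P @ (R @ (S + t).T))).T      # -> (n, 2) image coordinates

def reprojection_cost(s_obs, S_model, f, R, t, sigma_sq):
    """Variance-weighted landmark residual for one image; summing this over
    all images (plus priors on the weight vectors) gives a MAP-style objective."""
    err = s_obs - project_wpp(S_model, f, R, t)
    return np.sum(np.sum(err**2, axis=1) / sigma_sq)
```

In a full fit, this residual is evaluated once per image and minimized jointly over the per-image pose parameters and the shared identity weights, which is what allows multiple low-quality views to reinforce one another.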
3.3 Facial Texture Patching

Given the 3D facial model, the next step is to patch the model with realistic textures that can be recognized by the face authentication systems. Due to the appearance variation across social media photos, we have to achieve this by mapping the pixels in a single captured photo onto the 3D facial model, which avoids the challenges of mixing different illuminations of the face. However, this still leaves many of the regions without texture, and those untextured spots will be noticeable to modern face authentication systems. To fill these missing regions, the naïve approach is to utilize the vertical symmetry of the face and fill the missing texture regions with their symmetrical complements. However, doing so would lead to strong artifacts at the boundary of missing regions. A realistic textured model should be free of these artifacts.

To lessen the presence of these artifacts, one approach is to iteratively average the color of neighboring vertices as a color trend and then mix this trend with texture details [45]. However, such an approach over-simplifies the problem and fails to realistically model the illumination of facial surfaces. Instead, we follow the suggestion of Zhu et al. [63] and estimate facial illumination using spherical harmonics [61], then fill in texture details with Poisson editing [41]. In this way, the output model appears to have a more natural illumination. Sadly, we cannot use their approach directly as it reconstructs a planar normalized face, instead of a 3D facial model, and so we must extend their technique to the 3D surface mesh.

The idea we implemented for improving our initial textured 3D model was as follows: Starting from the single photo chosen as the main texture source, we first estimate and subsequently remove the illumination conditions present in the photo. Next, we map the textured facial model onto a plane via a conformal mapping, then impute the unknown texture using 2D Poisson editing. We further extend their approach to three dimensions and perform Poisson editing directly on the surface of the facial model. Intuitively, the idea behind Poisson editing is to keep the detailed texture in the editing region while enforcing the texture's smoothness across the boundary. This process is defined mathematically as

    Δf = Δg,  s.t.  f|_{∂Ω} = f^0|_{∂Ω},    (3)

where Ω is the editing region, f is the editing result, f^0 is the known original texture value, and g is the texture value in the editing region that is unknown and needs to be patched with its reflection complement. On a 3D surface mesh, every vertex is connected with 2 to 8 neighbors. Transforming Eq. (3) into discrete form, we have

    |N_p| f_p − Σ_{q ∈ N_p ∩ Ω} f_q = Σ_{q ∈ N_p ∩ ∂Ω} f^0_q + (Δg)_p,    (4)

where N_p is the neighborhood of point p on the mesh. Our enhancement is a natural extension of the Poisson editing method suggested in the seminal work of Pérez et al. [41], although no formulation was given for 3D. By solving Eq. (4) instead of projecting the texture onto a plane and solving Eq. (3), we obtain more realistic texture on the facial model, as shown in Figure 5.

Figure 5: Naïve symmetrical patching (left); planar Poisson editing (middle); 3D Poisson editing (right).
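As an illustration of Eq. (4), the following sketch assembles and solves the discrete Poisson system for a single scalar texture channel over the editing region of a mesh. The mesh connectivity, the guide field g (e.g., the mirrored, illumination-normalized texture), and the region Ω are assumed inputs, and the data structures are simplified relative to a full implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def poisson_edit_on_mesh(neighbors, known_tex, guide_tex, region):
    """Solve Eq. (4) for one texture channel on a surface mesh.

    neighbors: list of lists; neighbors[v] = vertex ids adjacent to vertex v
    known_tex: (V,) known texture values f^0, valid outside the editing region
    guide_tex: (V,) guide values g, defined on the region and its 1-ring
    region:    set of vertex ids Omega whose texture is to be filled
    Returns the solved texture values for the vertices in sorted(region).
    """
    idx = {v: k for k, v in enumerate(sorted(region))}   # vertex id -> row index
    n = len(idx)
    A = sp.lil_matrix((n, n))
    b = np.zeros(n)
    for v, row in idx.items():
        Nv = neighbors[v]
        A[row, row] = len(Nv)                            # |N_p| f_p term
        # (Δg)_p = |N_p| g_p - sum_q g_q  (discrete Laplacian of the guide field)
        b[row] = len(Nv) * guide_tex[v] - sum(guide_tex[q] for q in Nv)
        for q in Nv:
            if q in idx:                                  # interior neighbor: unknown f_q
                A[row, idx[q]] = -1.0
            else:                                         # boundary neighbor: Dirichlet term f^0_q
                b[row] += known_tex[q]
    return spla.spsolve(A.tocsr(), b)
```

Running this once per color channel keeps the high-frequency detail of the mirrored texture while forcing the filled region to blend smoothly into the known texture at its boundary.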
3.4 Gaze Correction

We now have a realistic 3D facial model of the user. Yet, we found that models at stage 3 were unable to bypass most well-known face recognition systems. Digging deeper into the reasons why, we observed that most recognition systems rely heavily on gaze direction during authentication, i.e., they fail closed if the user is not looking at the device. To address this, we introduce a simple, but effective, approach to correct the gaze direction of our synthetic model (Figure 1, stage 4).

The idea is as follows. Since we have already reconstructed the texture of the facial model, we can synthesize the texture data in the eye region. These data contain the color information from the sclera, cornea, and pupil and form a three-dimensional distribution in the RGB color space. We estimate this color distribution with a 3D Gaussian function whose three principal components can be computed as (b_1, b_2, b_3) with weights (σ_1, σ_2, σ_3), σ_1 ≥ σ_2 ≥ σ_3 > 0. We perform the same analysis for the eye region of the average face model obtained from 3DMM [39], whose eye is looking straight towards the camera, and we similarly obtain principal color components (b^{std}_1, b^{std}_2, b^{std}_3) with weights (σ^{std}_1, σ^{std}_2, σ^{std}_3), σ^{std}_1 ≥ σ^{std}_2 ≥ σ^{std}_3 > 0. Then, we convert the eye texture from the average model into the eye texture of the user. For a texture pixel c in the eye region of the average texture, we convert it to

    c_{convert} = Σ_{i=1}^{3} (σ_i / σ^{std}_i) (c · b^{std}_i) b_i.    (5)

In effect, we align the color distribution of the average eye texture with the color distribution of the user's eye texture. By patching the eye region of the facial model with this converted average texture, we realistically capture the user's eye appearance with forward gaze.
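The sketch below implements a mean-centered variant of the color transfer in Eq. (5): principal color axes are estimated for both eye regions, and the average-model eye pixels are re-expressed in the user's color distribution. The input pixel arrays are placeholders standing in for RGB samples from each eye region.

```python
import numpy as np

# Mean-centered variant of the gaze-texture color transfer sketched in Eq. (5).
# avg_eye_px / user_eye_px are (n, 3) arrays of RGB values sampled from the
# average model's eye region and the user's reconstructed eye region.

def color_pca(pixels):
    """Return (std devs, principal axes as rows, mean color) of an RGB sample."""
    mu = pixels.mean(axis=0)
    cov = np.cov(pixels - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                 # sort descending
    return np.sqrt(eigvals[order]), eigvecs[:, order].T, mu

def transfer_eye_color(avg_eye_px, user_eye_px):
    """Map average-model eye pixels into the user's eye color distribution."""
    sig_std, B_std, mu_std = color_pca(avg_eye_px)    # forward-gazing average eye
    sig_usr, B_usr, mu_usr = color_pca(user_eye_px)   # user's reconstructed eye
    out = np.zeros_like(avg_eye_px, dtype=float)
    for i in range(3):
        # project onto the i-th average axis, rescale, re-express on the user axis
        coeff = (avg_eye_px - mu_std) @ B_std[i]
        out += (sig_usr[i] / sig_std[i]) * np.outer(coeff, B_usr[i])
    return out + mu_usr
```

The transferred pixels keep the geometry of the forward-gazing average eye (pupil centered, looking at the camera) while taking on the user's iris and sclera coloring.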
  • 9. 504  25th USENIX Security Symposium USENIX Association 3.5 Adding Facial Animations Some of the liveness detection methods that we test re- quire that the user performs specific actions in order to unlock the system. To mimic these actions, we can sim- ply animate our facial model using a pre-defined set of facial expressions (e.g., from FaceWarehouse [8]). Re- call that in deriving in Eq. 2, we have already computed the weight for the identity axis αid, which captures the user-specific face structure in a neutral expression. We can adjust the expression of the model by substituting a specific, known expression weight vector αexp std into Eq. 2. By interpolating the model’s expression weight from 0 to αexp std , we are able to animate the 3D facial model to smile, laugh, blink, and raise the eyebrows (see Figure 6). Figure 6: Animated expressions. From left to right: smiling, laughing, closing the eyes, and raising the eyebrows. 3.6 Leveraging Virtual Reality While the previous steps were necessary to recover a re- alistic, animated model of a targeted user’s face, our driv- ing insight is that virtual reality systems can be lever- aged to display this model as if it were a real, three- dimensional face. This VR-based spoofing constitutes a fundamentally new class of attacks that exploit weak- nesses in camera-based authentication systems. In the VR system, the synthetic 3D face of the user is displayed on the screen of the VR device, and as the device rotates and translates in the real world, the 3D face moves accordingly. To an observing face authen- tication system, the depth and motion cues of the dis- play exactly match what would be expected for a hu- man face. Our experimental VR setup consists of custom 3D-rendering software displayed on a Nexus 5X smart phone. Given the ubiquity of smart phones in modern society, our implementation is practical and comes at no additional hardware cost to an attacker. In practice, any device with similar rendering capabilities and iner- tial sensors could be used. On smart phones, accelerometers and gyroscopes work in tandem to provide the device with a sense of self-motion. An example use case is detecting when the device is rotated from a portrait view to a landscape view, and rotating the display, in response. However, these sen- sors are not able to recover absolute translation — that is, the device is unable to determine how its position has changed in 3D space. This presents a challenge because without knowledge of how the device has moved in 3D space, we cannot move our 3D facial model in a realistic fashion. As a result, the observed 3D facial motion will not agree with the device’s inertial sensors, causing our method to fail on methods like that of Li et al. [34] that use such data for liveness detection. Fortunately, it is possible to track the 3D position of a moving smart phone using its outward-facing camera with structure from motion (see §2.3). Using the cam- era’s video stream as input, the method works by tracking points in the surrounding environment (e.g., the corners of tables) and then estimating their position in 3D space. At the same time, the 3D position of the camera is re- covered relative to the tracked points, thus inferring the camera’s change in 3D position. Several computer vision approaches have been recently introduced to solve this problem accurately and in real time on mobile devices [28, 46, 55, 56]. 
In our experiments, we make use of a printed marker3 placed on a wall in front of the camera, rather than tracking arbitrary objects in the surrounding scene; however, the end result is the same. By incorporating this module into our proof of concept, the perspective of the viewed model due to camera translation can be simulated with high consistency and low latency.4 An example setup for our attack is shown in Figure 7. The VR system consists of a Nexus 5X unit using its outward-facing camera to track a printed marker in the environment. On the Nexus 5X screen, the system displays a 3D facial model whose perspective is always consistent with the spatial position and orientation of the authentication device. The authenticating camera views the facial model on the VR display, and it is successfully duped into believing it is viewing the real face of the user.

Figure 7: Example setup using virtual reality to mimic 3D structure from motion. The authentication system observes a virtual display of a user's 3D facial model that rotates and translates as the device moves. To recover the 3D translation of the VR device, an outward-facing camera is used to track a marker in the surrounding environment.

3See Goggle Paper at http://gogglepaper.com/
4Specialized VR systems such as the Oculus Rift could be used to further improve the precision and latency of camera tracking. Such advanced, yet easily obtainable, hardware has the potential to deliver even more sophisticated VR attacks compared to what is presented here.
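The following sketch illustrates the outward-facing tracking component under simplifying assumptions: an ArUco tag stands in for the printed marker, and the camera intrinsics and marker size are placeholder calibration values. It assumes an OpenCV build that includes the aruco module (exact function names vary slightly across OpenCV versions).

```python
import cv2
import numpy as np

# Estimate the phone's pose from a printed fiducial marker seen by the
# outward-facing camera, so the renderer can move the virtual camera and keep
# the displayed face consistent with the device's real-world motion.

MARKER_SIZE = 0.10                                   # printed marker edge length (meters, placeholder)
K = np.array([[900.0, 0, 640.0],                     # camera intrinsics (placeholder calibration)
              [0, 900.0, 360.0],
              [0, 0, 1.0]])
DIST = np.zeros(5)                                   # assume negligible lens distortion

# 3D corner coordinates of the marker in its own frame (z = 0 plane).
OBJ_PTS = 0.5 * MARKER_SIZE * np.array(
    [[-1, 1, 0], [1, 1, 0], [1, -1, 0], [-1, -1, 0]], dtype=np.float32)

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def marker_pose_from_frame(frame_gray):
    """Return (R, t): rotation and translation of the marker in the camera
    frame, or None if no marker is visible."""
    corners, ids, _ = cv2.aruco.detectMarkers(frame_gray, aruco_dict)
    if ids is None:
        return None
    ok, rvec, tvec = cv2.solvePnP(OBJ_PTS, corners[0].reshape(-1, 2), K, DIST)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec

# Inverting the returned transform gives the camera's position in the marker's
# frame; feeding that to the renderer moves the synthetic head opposite to the
# phone's motion, so the authenticating camera observes consistent parallax.
```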
  • 10. USENIX Association 25th USENIX Security Symposium  505 4 Evaluation We now demonstrate that our proposed spoofing method constitutes a significant security threat to modern face authentication systems. Using real social media photos from consenting users, we successfully broke five com- mercial authentication systems with a practical, end-to- end implementation of our approach. To better under- stand the threat, we further systematically run lab exper- iments to test the capabilities and limitations of our pro- posed method. Moreover, we successfully test our pro- posed approach with the latest motion-based liveness de- tection approach by Li et al. [34], which is not yet avail- able in commercial systems. Participants We recruited 20 volunteers for our tests of commercial face authentication systems. The volunteers were re- cruited by word of mouth and span graduate students and faculty in two separate research labs. Consultation with our IRB departmental liaison revealed that no applica- tion was needed. There was no compensation for par- ticipating in the lab study. The ages of the participants range between 24 and 44 years, and the sample consists of 6 females and 14 males. The participants come from a variety of ethnic backgrounds (as stated by the volun- teers): 6 are of Asian descent, 4 are Indian, 1 is African- American, 1 is Hispanic, and 8 are Caucasian. With their consent, we collected public photos from the users’ Face- book and Google+ social media pages; we also collected any photos we could find of the users on personal or com- munity web pages, as well as via image search on the web. The smallest number of photos we collected for an individual was 3, and the largest number was 27. The average number of photos was 15, with a standard de- viation of approximately 6 photos. No private informa- tion about the subjects was recorded beside storage of the photographs they consented too. Any images of subjects displayed in this paper was done with the consent of that particular volunteer. For our experiments, we manually extracted the region around user’s face in each image. An adversary could also perform this action automatically using tag infor- mation on social media sites, when available. One in- teresting aspect of social media photos is they may cap- ture significant physical changes of users over time. For instance, one of our participants lost 20 pounds in the last 6 months, and our reconstruction had to utilize im- ages from before and after this change. Two other users had frequent changes in facial hair styles – beards, mous- taches, and clean-shaven – all of which we used for our reconstruction. Another user had only uploaded 2 pho- tos to social media in the past 3 years. These varieties all present challenges for our framework, both for initially reconstructing the user’s face and for creating a likeness that matches their current appearance. Industry-leading Solutions We tested our approach on five advanced commercial face authentication systems: KeyLemon5, Mobius6, True Key [18], BioID [21], and 1U App7. Table 1 summarizes the training data required by each system when learning a user’s facial appearance, as well as the approximate num- ber of users for each system, when available. All systems incorporate some degree of liveness detection into their authentication protocol. KeyLemon and the 1U App re- quire users to perform an action such as blinking, smil- ing, rotating the head, and raising the eyebrows. 
In addition, the 1U App requests these actions in a random fashion, making it more resilient to video-based attacks. BioID, Mobius and True Key are motion-based systems and detect 3D facial structure as the user turns their head. It is also possible that these five systems employ other advanced liveness detection approaches, such as texture-based detection schemes, but such information has not been made available to the public.

System       | Training Method  | # Installs
KeyLemon (3) | Single video     | ∼100,000
Mobius (2)   | 10 still images  | 18 reviews
True Key (1) | Single video     | 50,000–100,000
BioID (2)    | 4 videos         | unknown
1U App (1)   | 1 still image    | 50–100

Table 1: Summary of the face authentication systems evaluated. The second column lists how each system acquires training data for learning a user's face, and the third column shows the approximate number of installations or reviews each system has received according to (1) the Google Play Store, (2) the iTunes store, or (3) softpedia.com. BioID is a relatively new app and does not yet have customer reviews on iTunes.

Methodology

All participants were registered with the 5 face authentication systems under indoor illumination. The average length of time spent by each of the volunteers to register across all systems was 20 minutes. As a control, we first verified that all systems were able to correctly identify the users in the same environment. Next, before testing our method using textures obtained via social media, we evaluated whether our system could spoof the recognition systems using photos taken in this environment.

5http://www.keylemon.com
6http://www.biomids.com
7http://www.1uapps.com

We
thus captured one front-view photo for each user under the same indoor illumination and then created their 3D facial model with our proposed approach. We found that these 3D facial models were able to spoof each of the 5 candidate systems with a 100% success rate, which is shown in the second column of Table 2.

Following this, we reconstructed each user's 3D facial model using the images collected from public online sources. As a reminder, any source image can be used as the main image when texturing the model. Since not all textures will successfully spoof the recognition systems, we created textured reconstructions from all source images and iteratively presented them to the system (in order of what we believed to be the best reconstruction, followed by the second best, and so on) until either authentication succeeded or all reconstructions had been tested.

Findings

We summarize the spoofing success rate for each system in Table 2. Except for the 1U system, all facial recognition systems were successfully spoofed for the majority of participants when using social media photos, and all systems were spoofed using indoor, frontal view photos. Out of our 20 participants, there were only 2 individuals for whom none of the systems was spoofed via the social-media-based attack.

System   | Indoor Spoof % | Social Media Spoof % | Avg. # Tries
KeyLemon | 100%           | 85%                  | 1.6
Mobius   | 100%           | 80%                  | 1.5
True Key | 100%           | 70%                  | 1.3
BioID    | 100%           | 55%                  | 1.7
1U App   | 100%           | 0%                   | —

Table 2: Success rate for 5 face authentication systems using a model built from (second column) an image of the user taken in an indoor environment and (third and fourth columns) images obtained on users' social media accounts. The fourth column shows the average number of attempts needed before successfully spoofing the target user.

Looking into the social media photos we collected of our participants, we observe a few trends among our results. First, we note that moderate- to high-resolution photos lend substantial realism to the textured models. In particular, photos taken by professional photographers (e.g., wedding photos or family portraits) lead to high-quality facial texturing. Such photos are prime targets for facial reconstruction because they are often posted by other users and made publicly available. Second, we note that group photos provide consistent frontal views of individuals, albeit with lower resolution. In cases where high-resolution photos are not available, such frontal views can be used to accurately recover a user's 3D facial structure. These photos are easily accessible via friends of users, as well. Third, we note that the least spoof-able users were not those who necessarily had a low number of personal photos, but rather users who had few forward-facing photos and/or no photos with sufficiently high resolution. From this observation, it seems that creating a realistic texture for user recognition is the primary factor in determining whether a face authentication method will be fooled by our approach. Only a small number of photos are necessary in order to defeat facial recognition systems.

We found that our failure to spoof the 1U App, as well as our lower performance on BioID, using social media photos was directly related to the poor usability of those systems. Specifically, we found the systems have a very high false rejection rate when live users attempt to authenticate themselves in different illumination conditions.
To test this, we had 5 participants register their faces indoors on the 4 mobile systems.8 We then had each user attempt to log in to each system 10 times in- doors and 10 times outdoors on a sunny day, and we counted the number of accepted logins in each environ- ment for each system. True Key and Mobius, which we found were easier to defeat, correctly authenticated the users 98% and 100% of the time for indoor logins, re- spectively, and 96% and 100% of the time for outdoor logins. Meanwhile, the indoor/outdoor login rates of BioID and the 1U App were 50%/14% and 96%/48%, respectively. The high false rejection rates under outdoor illumination show that the two systems have substantial difficulty with their authentication when the user’s envi- ronment changes. Our impression is that 1U’s single- image user registration simply lacks the training data necessary to accommodate to different illumination set- tings. BioID is very sensitive to a variety of factors in- cluding head rotation and illumination, which leads to many false rejections. (Possibly realizing this, the mak- ers of BioID therefore grant the user 3 trials per login attempt.) Even so, as evidenced by the second column in Table 2, our method still handily defeats the liveness detection modules of these systems given images of the user in the original illumination conditions, which sug- gests that all the systems we tested are vunerable to our VR-based attack. Our findings also suggest that our approach is able to successfully handle significant changes in facial expres- sion, illumination, and for the most part, physical charac- teristics such as weight and facial hair. Moreover, the ap- proach appears to generalize to users regardless of gen- der or ethnicity. Given that it has shown to work on a var- ied collection of real-world data, we believe that the at- 8As it is a desktop application, KeyLemon was excluded.
  • 12. USENIX Association 25th USENIX Security Symposium  507 tack presented herein represents a realistic security threat model that could be exploited in the present day. Next, to gain a deeper understanding of the realism of this threat, we take a closer look at what conditions are necessary for our method to bypass the various face authentication systems we tested. We also consider what main factors contribute to the failure cases of our method. 4.1 Evaluating System Robustness To further understand the limitations of the proposed spoofing system, we test its robustness against resolu- tion and viewing angle, which are two important factors for the social media photos users upload. Specifically, we answer the question: what is the minimum resolu- tion and maximum head rotation allowed in an uploaded photo before it becomes unusable for spoofing attacks like ours? We further explore how low-resolution frontal images can be used to improve our success rates when high-resolution side-view images are not available. 4.1.1 Blurry, Grainy Pictures Still Say A Lot To assess our ability to spoof face authentication systems when provided only low-resolution images of a user’s face, we texture the 3D facial models of our sample users using an indoor, frontal view photo. This photo is then downsampled at various resolutions such that the dis- tance between the user’s chin and forehead ranges be- tween 20 and 50 pixels. Then, we attempt to spoof the True Key, BioId, and KeyLemon systems with fa- cial models textured using the down-sampled photos.9 If we are successful at a certain resolution, that implies that that resolution leaks the user’s identity information to our spoofing system. The spoofing success rate for various image resolutions is shown in Figure 8. The result indicates that our approach robustly spoofs face authentication systems when the height of the face in the image is at least 50 pixels. If the resolution of an up- loaded photo is less than 30 pixels, the photo is likely of too low-resolution to reliably encode useful features for identifying the user. In our sample set, 88% of users had more than 6 online photos with a chin-to-forehead dis- tance greater than 100 pixels, which easily satisfies the resolution requirement of our proposed spoofing system. 4.1.2 A Little to the Left, a Little to the Right To identify the robustness of the proposed system against head rotation, we first evaluate the maximum yaw angle allowed for our system to spoof baseline systems using a 9We skip analysis of Mobius because its detection method is similar to True Key, and our method did not perform as well on True Key. We also do not investigate the robustness of our method in the 1U system because of our inability to spoof this system using online photos. Figure 8: Spoofing success rate with texture taken from photos of different resolution. single image. For all 20 sample users, we collect multi- ple indoor photos with yaw angle varying from 5 degrees (approximately frontal view) to 40 degrees (significantly rotated view). We then perform 3D reconstruction for each image, for each user, on the same three face au- thentication systems. The spoofing success rate for a single input image as a function of head rotation is il- lustrated in Figure 9 (left). It can be seen that the pro- posed method successfully spoofs all the baseline sys- tems when the input image has a largely frontal view. 
As yaw angle increases, it becomes more difficult to infer the user’s frontal view from the image, leading to a de- creased spoofing success rate. 4.1.3 For Want of a Selfie The results of Figure 9 (left) indicate that our success rate falls dramatically if given only a single image with a yaw angle larger than 20 degrees. However, we argue that these high-resolution side-angle views can serve as base images for facial texturing if additional low-resolution frontal views of the user are available. We test this hy- pothesis by taking, for each user, the rotated images from the previous section along with 1 or 2 low-resolution frontal view photos (chin-to-forehead distance of 30 pix- els). We then reconstruct each user’s facial model and use it to spoof our baseline systems. Alone, the pro- vided low-resolution images provide insufficient texture for spoofing, and the higher-resolution side view does not provide adequate facial structure. As shown in Fig- ure 9 (right), by using the low-resolution front views to guide 3D reconstruction and then using the side view for texturing, the spoofing success rate for large-angle head rotation increases substantially. From a practical stand- point, low-resolution frontal views are relatively easy to obtain, since they can often be found in publicly posted group photos.
Figure 9: Spoofing success rate with different yaw angles. Left: Using only a single image at the specified angle. Right: Supplementing the single image with low-resolution frontal views, which aid in 3D reconstruction.

4.2 Seeing Your Face Is Enough

Our approach not only defeats existing commercial systems having liveness detection — it fundamentally undermines the process of liveness detection based on color images, entirely. To illustrate this, we use our method to attack the recently proposed authentication approach of Li et al. [34], which obtains a high rate of success in guarding against video-based spoofing attacks. This system adds another layer to motion-based liveness detection by requiring that the movement of the face in the captured video be consistent with the data obtained from the motion sensor of the device. Fortunately, as discussed in §3, the data consistency requirement is automatically satisfied with our virtual reality spoofing system because the 3D model rotates in tandem with the camera motion.

Central to Li et al. [34]'s approach is to build a classifier that evaluates the consistency of captured video and motion sensor data. In turn, the learned classifier is used to distinguish real faces from spoofed ones. Since their code and training samples have not been made public, we implemented our own version of Li et al. [34]'s liveness detection system and trained a classifier with our own training data. We refer the reader to [34] for a full overview of the method.

Following the methodology of [34], we capture video samples (and inertial sensor data) of ∼4 seconds from the front-facing camera of a mobile phone. In each sample, the phone is held at a distance of 40cm from the subject and moved back-and-forth 20cm to the left and right. We capture 40 samples of real subjects moving the phone in front of their face, 40 samples where a pre-recorded video of a user is presented to the camera, and 30 samples where the camera is presented with a 3D reconstruction of a user in our VR environment. For training, we use a binary logistic regression classifier trained on 20 samples from each class, with the other samples used for testing. Due to the relatively small size of our training sets, we repeat our classification experiments 4 times, with random train/test splits in each trial, and we report the average performance over all four trials.

Training Data | Real       | Video     | VR
Real+Video    | 19.50 / 20 | 0.25 / 20 | 9.75 / 10
Real+Video+VR | 14.00 / 20 | 0.00 / 20 | 5.00 / 10
Real+VR       | 14.75 / 20 | —         | 5.00 / 10

Table 3: Number of testing samples classified as real users. Values in the first column (Real) represent true positives, and the second and third columns (Video, VR) represent false positives. Each row shows the classification results after training on the classes in the first column. The results were averaged over four trials.

The results of our experiments are shown in Table 3. For each class (real user data, video spoof data, and VR data), we report the average number (over 4 trials) of test samples classified as real user data. We experiment with three different training configurations, which are listed in the first column of the table. The first row shows the results when using real user data as positive samples and video spoof data as negative samples. In this case, it can easily be seen that the real-versus-video identification is almost perfect, matching the results of [34].
How- ever, our VR-based attack is able to spoof this training configuration nearly 100% of the time. The second and third rows of Table 3 show the classification performance when VR spoof data is included in the training data. In both cases, our approach defeats the liveness detector in 50% of trials, and the real user data is correctly identified as such less than 75% of the time. All three training configurations clearly point to the fact that our VR system presents motion features that are close to real user data. Even if the liveness detector of [34] is specifically trained to look for our VR-based at- tack, 1 out of every 2 attacks will still succeed, with the false rejection rate also increasing. Any system using
  • 14. USENIX Association 25th USENIX Security Symposium  509 this detector will need to require multiple log-in attempts to account for the decreased recall rate; allowing multi- ple log-in attempts, however, allows our method more opportunties to succeed. Overall, the results indicate that the proposed VR-based attack successfully spoofs Li et al. [34]’s approach, which is to our knowledge the state of the art in motion-based liveness detection. 5 Defense in Depth While current facial authentication systems succumb to our VR-based attack, several features could be added to these systems to confound our approach. Here, we detail three such features, namely random projection of light patterns, detection of minor skin tone fluctuations related to pulse, and the use of illuminated infrared (IR) sensors. Of these, the first two could still be bypassed with addi- tional adversary effort, while the third presents a signif- icantly different hardware configuration that would re- quire non-trivial alterations to our method. Light Projection The principle of using light projec- tion for liveness detection is simple: Using an outward- facing light source (e.g., the flashlight commonly in- cluded on camera-equipped mobile phones), flash light on the user’s face at random intervals. If the observed change in illumination does not match the random pat- tern, then face authentication fails. The simplicity of this approach makes it appealing and easily implementable; however, an adversary could modify our proposed ap- proach to detect the random flashes of light and, with low latency, subsequently add rendered light to the VR scene. Random projections of structured light [62], i.e., checkerboard patterns and lines, would increase the diffi- culty of such an attack, as the 3D-rendering system must be able to quickly and accurately render the projected illumination patterns on a model. However, structured light projection requires specialized hardware that typi- cally is not found on smart phones and similar devices, which decreases the feasibility of this mitigation. Pulse Detection Recent computer vision research [2, 58] has explored the prospect of video magnification, which transforms micro-scale fluctuations over time into strong visual changes. One such application is the detec- tion of human pulse from a standard video of a human face. The method detects small, periodic color changes related to pulse in the region of the face and then am- plifies this effect such that the face appears to undergo strong changes in brightness and hue. This amplification could be used as an additional method for liveness detec- tion by requiring that the observed face have a detectable pulse. Similar ideas have been applied to fingerprint sys- tems that check for blood flow using light emitted from beneath a prism. Of course, an attacker using our pro- posed approach could simply add subtle color variation to the 3D model to approximate this effect. Nevertheless, such a method would provide another layer of defense against spoofed facial models. Infrared Illumination Microsoft released Windows Hello as a more personal way to sign into Windows 10 devices with just a look or a touch. The new interface supports biometric authentication that includes face, iris, or fingerprint authentication. The platform includes In- tel’s RealSense IR-based, rather than a color-based, fa- cial authentication method. 
In principle, their approach works in the same way as contemporary face authentica- tion methods, but instead uses an IR camera to capture a video of the user’s face. The attack presented in this paper would fail to bypass this approach because typi- cal VR displays are not built to project IR light; how- ever, specialized IR display hardware could potentially be used to overcome this limitation. One limiting factor that may make IR-based tech- niques less common (especially on mobile devices) is the requirement for additional hardware to support this enhanced form of face authentication. Indeed, as of this writing, only a handful of personal computers support Windows Hello.10 Nevertheless, the use of infrared illu- mination offers intriguing possibilities for the future. Takeaway In our opinion, it is highly unlikely that ro- bust facial authentication systems will be able to op- erate using solely web/mobile camera input. Given the widespread nature of high-resolution personal online photos, today’s adversaries have a goldmine of informa- tion at their disposal for synthetically creating fake face data. Moreover, even if a system is able to robustly de- tect a certain type of attack – be it using a paper printout, a 3D-printed mask, or our proposed method – generaliz- ing to all possible attacks will increase the possibility of false rejections and therefore limit the overall usability of the system. The strongest facial authentication systems will need to incorporate non-public imagery of the user that cannot be easily printed or reconstructed (e.g., a skin heat map from special IR sensors). 6 Discussion Our work outlines several important lessons for both the present state and the future state of security, particularly as it relates to face authentication systems. First, our ex- ploitation of social media photos to perform facial re- construction underscores the notion that online privacy of one’s appearance is tantamount to online privacy of other personal information, such as age and location. 10See “PC platforms that support Windows Hello” for more info.
The ability of an adversary to recover an individual's facial characteristics through online photos is an immediate and very serious threat, albeit one that clearly cannot be completely neutralized in the age of social media. Therefore, it is prudent that face recognition tools become increasingly robust against such threats in order to remain a viable security option in the future.

At a minimum, it is imperative that face authentication systems be able to reject synthetic faces with low-resolution textures, as we show in our evaluations. Of more concern, however, is the increasing threat of virtual reality, as well as computer vision, as an adversarial tool. It appears to us that the designers of face authentication systems have assumed a rather weak adversarial model wherein attackers may have limited technical skills and be limited to inexpensive materials. This practice is risky, at best. Unfortunately, VR itself is quickly becoming commonplace, cheap, and easy to use. Moreover, VR visualizations are increasingly convincing, making it easier and easier to create realistic 3D environments that can be used to fool visual security systems. As such, it is our belief that authentication mechanisms of the future must aggressively anticipate and adapt to the rapid developments in the virtual and online realms.

Appendix A: Multi-Image Facial Model Estimation

In §3.2, we outline how to associate 2D facial landmarks with corresponding 3D points on an underlying facial model. Contour landmarks pose a substantial difficulty for this 2D-to-3D correspondence problem because the associated set of 3D points for these features is pose-dependent. Zhu et al. [63] compensate for this phenomenon by modeling contour landmarks with parallel curved line segments and iteratively optimizing head orientation and 2D-to-3D correspondence. For a specific head orientation R_j, the corresponding landmark points on the 3D model are found using an explicit function based on rotation angle:

    s_{i,j} = f_j P R_j (S_{i′,j} + t_j)
    S_{i′,j} = \bar{S}_{i′} + A^{id}_{i′} α^{id} + A^{exp}_{i′} α^{exp}_j
    i′ = land(i, R_j),    (6)

where land(i, R_j) is the pre-calculated mapping function that computes the position of landmark i on the 3D model when the orientation is R_j. Ideally, the first equation in Eq. (6) should hold for all the landmark points in all the images. However, this is not the case due to the alignment error introduced by landmark extraction. Generally, contour landmarks introduce more error than corner landmarks, and this approach actually leads to inferior results when multiple input images are used.

Therefore, different from Zhu et al. [63], we compute the 3D facial model with Maximum a Posteriori (MAP) estimation. We assume the alignment error of each 3D landmark independently follows a Gaussian distribution. Then, the most probable parameters θ := ({f_j}, {R_j}, {t_j}, {α^{exp}_j}, α^{id}) can be estimated by minimizing the cost function

    \hat{θ} = \arg\min_θ { Σ_{i=1}^{68} Σ_{j=1}^{N} (1 / (σ^s_i)^2) ||s_{i,j} − f_j P R_j (S_{i′,j} + t_j)||^2 + Σ_{j=1}^{N} (α^{exp}_j)^T Σ^{-1}_{exp} α^{exp}_j + (α^{id})^T Σ^{-1}_{id} α^{id} }.    (7)

Here, S_{i′,j} is computed using Eq. (6). Σ_{id} and Σ_{exp} are covariance matrices of α^{id} and α^{exp}_j, which can be obtained from the pre-existing face model. (σ^s_i)^2 is the variance of the alignment error of the i-th landmark and is obtained from a separate training set consisting of 20 images with hand-labeled landmarks. Eq.
References

[1] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision (IJCV), 56(3):221–255, 2004.
[2] G. Balakrishnan, F. Durand, and J. Guttag. Detecting pulse from head motions in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3430–3437, 2013.
[3] W. Bao, H. Li, N. Li, and W. Jiang. A liveness detection method for face recognition based on optical flow field. In Image Analysis and Signal Processing, International Conference on, pages 233–236, 2009.
[4] C. Baumberger, M. Reyes, M. Constantinescu, R. Olariu, E. De Aguiar, and T. Oliveira Santos. 3D face reconstruction from video using 3D morphable model and silhouette. In Graphics, Patterns and Images (SIBGRAPI), Conference on, pages 1–8, 2014.
[5] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(12):2930–2940, 2013.
[6] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.
[7] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(9):1063–1074, 2003.
[8] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D facial expression database for visual computing. Visualization and Computer Graphics, IEEE Transactions on, 20(3):413–425, 2014.
[9] B. Chu, S. Romdhani, and L. Chen. 3D-aided face recognition robust to expression and pose variations. In Computer Vision and Pattern Recognition (CVPR), Conference on, pages 1907–1914, 2014.
[10] N. Duc and B. Minh. Your face is not your password. In Black Hat Conference, volume 1, 2009.
[11] N. Erdogmus and S. Marcel. Spoofing face recognition with 3D masks. Information Forensics and Security, IEEE Transactions on, 9(7):1084–1097, 2014.
[12] D. Fidaleo and G. Medioni. Model-assisted 3D face reconstruction from video. In Analysis and Modeling of Faces and Gestures, pages 124–138. Springer, 2007.
[13] Gartner. Gartner backs biometrics for enterprise mobile authentication. Biometric Technology Today, Feb. 2014.
[14] S. Golder. Measuring social networks with digital photograph collections. In Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, pages 43–48, 2008.
[15] M. Hicks. A continued commitment to security, 2011. URL https://www.facebook.com/notes/facebook/a-continued-commitment-to-security/486790652130/.
[16] R. Horaud, F. Dornaika, and B. Lamiroy. Object pose: The link between weak perspective, paraperspective, and full perspective. International Journal of Computer Vision, 22(2):173–189, 1997.
[17] P. Ilia, I. Polakis, E. Athanasopoulos, F. Maggi, and S. Ioannidis. Face/Off: Preventing privacy leakage from photos in social networks. In Proceedings of the 22nd ACM Conference on Computer and Communications Security, pages 781–792, 2015.
[18] Intel Security. True Key™ by Intel Security: Security white paper 1.0, 2015. URL https://b.tkassets.com/shared/TrueKey-SecurityWhitePaper-v1.0-EN.pdf.
[19] H.-K. Jee, S.-U. Jung, and J.-H. Yoo. Liveness detection for embedded face recognition system. International Journal of Biological and Medical Sciences, 1(4):235–238, 2006.
[20] L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3D face alignment from 2D videos in real-time. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.
[21] O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face detection using the Hausdorff distance. In Audio- and Video-Based Biometric Person Authentication, pages 90–95. Springer, 2001.
[22] I. Jolliffe. Principal Component Analysis. Wiley Online Library, 2002.
[23] I. Kemelmacher-Shlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.
[24] I. Kemelmacher-Shlizerman and R. Basri. 3D face reconstruction from a single image using a single reference face shape. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(2):394–405, 2011.
[25] G. Kim, S. Eum, J. K. Suhr, D. I. Kim, K. R. Park, and J. Kim. Face liveness detection based on texture and frequency analyses. In Biometrics (ICB), 5th IAPR International Conference on, pages 67–72, 2012.
[26] H.-N. Kim, A. El Saddik, and J.-G. Jung. Leveraging personal photos to inferring friendships in social network services. Expert Systems with Applications, 39(8):6955–6966, 2012.
[27] S. Kim, S. Yu, K. Kim, Y. Ban, and S. Lee. Face liveness detection using variable focusing. In Biometrics (ICB), 2013 International Conference on, pages 1–6, 2013.
[28] K. Kolev, P. Tanskanen, P. Speciale, and M. Pollefeys. Turning mobile phones into 3D scanners. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 3946–3953, 2014.
[29] K. Kollreider, H. Fronthaler, and J. Bigun. Evaluating liveness by face images and the structure tensor. In Automatic Identification Advanced Technologies, Fourth IEEE Workshop on, pages 75–80. IEEE, 2005.
[30] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun. Real-time face detection and motion analysis with application in liveness assessment. Information Forensics and Security, IEEE Transactions on, 2(3):548–558, 2007.
[31] K. Kollreider, H. Fronthaler, and J. Bigun. Verifying liveness by multiple experts in face biometrics. In Computer Vision and Pattern Recognition Workshops, IEEE Computer Society Conference on, pages 1–6, 2008.
[32] A. Lagorio, M. Tistarelli, M. Cadoni, C. Fookes, and S. Sridharan. Liveness detection based on 3D face shape analysis. In Biometrics and Forensics (IWBF), International Workshop on, pages 1–4, 2013.
[33] Y. Li, K. Xu, Q. Yan, Y. Li, and R. H. Deng. Understanding OSN-based facial disclosure against face authentication systems. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS), pages 413–424. ACM, 2014.
[34] Y. Li, Y. Li, Q. Yan, H. Kong, and R. H. Deng. Seeing your face is not enough: An inertial sensor-based liveness detection for face authentication. In Proceedings of the 22nd ACM Conference on Computer and Communications Security, pages 1558–1569, 2015.
[35] Y. Liu, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Analyzing Facebook privacy settings: User expectations vs. reality. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference, pages 61–70. ACM, 2011.
[36] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. arXiv preprint arXiv:1404.3840, 2014.
[37] J. Määttä, A. Hadid, and M. Pietikainen. Face spoofing detection from single images using micro-texture analysis. In Biometrics (IJCB), International Joint Conference on, pages 1–7, 2011.
[38] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), 2015.
[39] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) for Security, Safety and Monitoring in Smart Environments, 2009.
[40] B. Peixoto, C. Michelassi, and A. Rocha. Face liveness detection under bad illumination conditions. In Image Processing (ICIP), 18th IEEE International Conference on, pages 3557–3560, 2011.
[41] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions on Graphics (TOG), 22(3):313–318, 2003.
[42] I. Polakis, M. Lancini, G. Kontaxis, F. Maggi, S. Ioannidis, A. D. Keromytis, and S. Zanero. All your face are belong to us: Breaking Facebook's social authentication. In Proceedings of the 28th Annual Computer Security Applications Conference, pages 399–408, 2012.
[43] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Fast, robust and automatic 3D face model reconstruction from videos. In Advanced Video and Signal Based Surveillance (AVSS), 11th IEEE International Conference on, pages 113–118, 2014.
[44] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Adaptive contour fitting for pose-invariant 3D face shape reconstruction. In Proceedings of the British Machine Vision Conference (BMVC), pages 1–12, 2015.
[45] C. Qu, E. Monari, T. Schuchert, and J. Beyerer. Realistic texture extraction for 3D face models robust to self-occlusion. In IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2015.
[46] T. Schops, T. Sattler, C. Hane, and M. Pollefeys. 3D modeling on the go: Interactive 3D reconstruction of large-scale scenes on mobile devices. In 3D Vision (3DV), International Conference on, pages 291–299, 2015.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. arXiv preprint arXiv:1503.03832, 2015.
[48] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Transactions on Graphics (TOG), 33(6):222, 2014.
[49] L. Sun, G. Pan, Z. Wu, and S. Lao. Blinking-based live face detection using conditional random fields. In Advances in Biometrics, pages 252–260. Springer, 2007.
[50] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 3476–3483, 2013.
[51] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz. Total moving face reconstruction. In Computer Vision – ECCV 2014, pages 796–812. Springer, 2014.
[52] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. What makes Tom Hanks look like Tom Hanks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3952–3960, 2015.
[53] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 1701–1708, 2014.
[54] X. Tan, Y. Li, J. Liu, and L. Jiang. Face liveness detection from a single image with sparse low rank bilinear discriminative model. In European Conference on Computer Vision (ECCV), pages 504–517, 2010.
[55] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys. Live metric 3D reconstruction on mobile phones. In Proceedings of the IEEE International Conference on Computer Vision, pages 65–72, 2013.
[56] J. Ventura, C. Arth, G. Reitmayr, and D. Schmalstieg. Global localization from monocular SLAM on a mobile phone. Visualization and Computer Graphics, IEEE Transactions on, 20(4):531–539, 2014.
[57] T. Wang, J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection using 3D structure recovered from a single camera. In Biometrics (ICB), International Conference on, pages 1–6, 2013.
[58] H.-Y. Wu, M. Rubinstein, E. Shih, J. Guttag, F. Durand, and W. T. Freeman. Eulerian video magnification for revealing subtle changes in the world. ACM Transactions on Graphics (TOG), 31(4), 2012.
[59] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 532–539, 2013.
[60] J. Yang, Z. Lei, S. Liao, and S. Z. Li. Face liveness detection with component dependent descriptor. In Biometrics (ICB), International Conference on, pages 1–6, 2013.
[61] L. Zhang and D. Samaras. Face recognition from a single training image under arbitrary unknown lighting using spherical harmonics. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(3):351–363, 2006.
[62] L. Zhang, B. Curless, and S. M. Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. In 3D Data Processing Visualization and Transmission, First International Symposium on, pages 24–36, 2002.
[63] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 787–796, 2015.