Introduction

This section discusses the significance of face anti-spoofing, highlights the shortcomings of traditional approaches and early deep learning methods, and explores the use of auxiliary supervision, central difference convolution, and semantic information as ways to enhance the effectiveness of face anti-spoofing techniques.

The importance of face anti-spoofing

With the continuous development of science and technology, biometric information is increasingly used in authentication systems as a replacement for traditional character passwords that must be memorized. Owing to their unique characteristics, human faces, as one of the most important biometric traits, have been widely used in various interactive systems, such as mobile phone unlocking, account control, permission access, and mobile payment, to provide more convenient operation. However, the presence of fake or spoofed faces limits the reliability of face interactive systems. Printed-photo and replayed-video attacks can easily fool existing face recognition systems, leading to incorrect judgments. To ensure the effectiveness of face recognition, it is necessary to design a robust and easy-to-deploy face anti-spoofing system.

Limitation of traditional methods and early deep learning methods

In recent years, face anti-spoofing has attracted significant attention from researchers. Early research relied on hand-crafted features such as LBP [1], LBP-TOP [2], HOG [3, 4], and SURF [5], but these methods lack robustness and generalization capability. Hand-crafted features are not specifically designed for face anti-spoofing and may not accurately represent the underlying data. In addition, they perform poorly on high-definition images and when detailed, invariant information is absent.

To overcome these limitations, researchers have turned to deep learning methods [6], which have shown effectiveness in extracting discriminative features and improving the generalization capability of face anti-spoofing systems. Convolutional neural networks (CNNs) have been utilized to design face anti-spoofing networks, leveraging their powerful feature extraction capabilities. However, existing methods often rely on fine-tuning image classification networks, which may not capture the essential features distinguishing spoof and real faces accurately. Binary supervision used in these networks can lead to overfitting and difficulty in generalizing to external datasets.

Auxiliary supervision, central difference convolution and semantic information

To address these challenges, researchers have proposed auxiliary supervision, such as depth map supervision [7] and rPPG [8], to guide network learning and enhance the performance of face anti-spoofing. Most high-performance models rely on multi-frame input [7, 9,10,11] because traditional convolution extracts insufficient spatial features from a single frame. However, multi-frame models are difficult to deploy in actual production environments.

To extract more spatial features from a single frame, Central Difference Convolution (CDC) [12,13,14,15] has been introduced. CDC combines traditional convolution with an LBP-like central difference operation to extract gradient information and improves the representational ability of traditional convolution. A large number of ablation experiments have shown that central difference convolution is more suitable for face anti-spoofing tasks and better describes invariant details. Across diverse environments, central difference convolution is more likely than traditional convolution to extract inherent deceptive patterns, such as lattice artifacts.

Fig. 1

Overview of the network structure: UCDCN is composed of a backbone and a classifier. ConvBlock and CDCBlock are explained in Fig. 3; Pool and Up refer to the pooling and upsampling layers

Furthermore, many high-performance face anti-spoofing networks use domain knowledge from other visual tasks, such as NAS [12] and De-X [16], to improve their performance. Given the importance of semantic information in segmentation tasks, we transfer domain knowledge from segmentation to face anti-spoofing: we adopt a UNet++-style structure to gather multi-level image features and enhance the network's ability to capture semantic information.

Summary

In summary, this research is motivated by the importance of face anti-spoofing in various interactive systems. Existing methods face challenges in terms of reliability, robustness, and generalization capability. The research aims to design a robust and easy-to-deploy face anti-spoofing system by leveraging auxiliary supervision, central difference convolution, and domain knowledge transfer. As a result, this article mainly includes the following contributions:

  1. Utilizing central difference convolution instead of traditional convolution to construct a face anti-spoofing network structure improves the characterization of invariant details in diverse environments.

  2. We leverage domain knowledge from image segmentation and propose a multi-level feature fusion network structure to enhance the model's ability to capture semantic information. To the best of our knowledge, this is the first time that domain knowledge from image segmentation has been applied to face anti-spoofing. We name the designed network structure UCDCN, as shown in Fig. 1.

  3. We redefine the loss function and training strategy to prevent overfitting.

  4. The network structure we designed demonstrates excellent performance on both internal and external datasets with minimal training, demonstrating the effectiveness of our model.

Related work

In this section, we provide an overview of existing research on face anti-spoofing, covering texture-based methods, time-based methods, and depth map auxiliary supervision.

Texture-based method

Most previous methods for face anti-spoofing are based on hand-crafted features and color texture analysis, such as LBP [1], LBP-TOP [2], HOG [3, 4], and SURF [5]; they rely on texture differences between live and spoof faces and classify them with traditional SVM [17] or LDA [18] classifiers. LBP, LBP-TOP, HOG, SIFT, and SURF are not specifically designed for face anti-spoofing, and their feature extraction ability is relatively limited. To overcome this, Jianwei Yang et al. [19] exploited the powerful feature-learning ability of CNNs and introduced them to face anti-spoofing tasks. The CNN model achieved impressive results through supervised training with a binary softmax loss. However, because face anti-spoofing depends on a vast amount of detailed information, traditional CNNs tend to miss detailed cues such as moiré stripes, lattice artifacts, and phone borders, and thus fail to learn valid and generalizable cues. Furthermore, some studies [7] have indicated that face anti-spoofing models using binary CNNs are prone to overfitting, and their performance is susceptible to environmental changes such as lighting, posture, and deception medium. Some high-precision texture-based models have abandoned binary softmax supervision and instead use auxiliary supervision to guide model training toward generalizable cues. For instance, some researchers [20] have used a learn-to-learn network to extract meta patterns as discriminative information instead of hand-crafted features. In addition, depth map-based supervision [7] has been explored, showing significantly improved performance compared to previous binary supervision models.

Time-based method

The time-based method is one of the earliest schemes applied to face anti-spoofing. Some studies have used multi-frame input to capture facial motion cues such as blinking [21] and lip motion [22]. However, such methods can be easily fooled by paper-cut attacks, in which the eye and lip regions of a printed photo are cut out. Although the structure of a living face differs significantly from these spoof faces, it remains challenging to design effective time-based face anti-spoofing systems. Some researchers [23] have implemented face anti-spoofing by comparing Fourier spectra between consecutive frames of a time series, but this approach relies heavily on accurate face localization and performs poorly on high-definition images. Time-based approaches [24, 25] require multiple frames as input, and multi-frame detection models are more challenging to deploy in production environments than single-frame models. Therefore, it is crucial to design robust single-frame face anti-spoofing models.

Table 1 Study contributions along with research gaps

Depth map auxiliary supervision

Estimating face depth from a single RGB facial image is a highly challenging computer vision problem that plays a crucial role in face anti-spoofing. Previous research has explored different approaches to tackle this challenge. Atoum et al. [26] first utilized pseudo-depth labels to guide a multi-scale fully convolutional network, while Zitong et al. [27] proposed a pyramid supervision technique to capture both local details and global semantics. Wang et al. [28] introduced a Generative Adversarial Network (GAN) to transfer RGB face images to the depth domain, and [29] provides a new method for multi-view data processing. Yahang et al. [30] introduced facial depth as well as the boundary of the spoof medium, moiré patterns, and reflection artifacts to imitate human decision-making. Wang Yu et al. [31] proposed a face anti-spoofing method based on client identity information using a Siamese network and employed depth maps as auxiliary information to improve performance. Jie Jiang et al. [32] employed GCBlock to better mine face depth information for auxiliary supervision. These advances have been made possible by the widespread use of convolutional neural networks in facial analysis. In recent years, face 3D reconstruction techniques, such as the 3D Morphable Model (3DMM) proposed by Booth et al. [33], have significantly contributed to the development of face anti-spoofing methods. Moreover, Jianzhu et al. introduced 3DDFAv2 [34, 35], which fits dense 3D face models to face images with a CNN and enables efficient processing on CPU, reducing the time required for dataset processing. In this study, we leverage 3DDFAv2 [34, 35] to generate depth maps for live faces and use flat zero matrices as ground truth for various spoof faces, including printing attacks and replay attacks. The study contributions along with research gaps are summarized in Table 1.

Proposed method

In this section, we provide a detailed description of the network structure, derive the CDC formula, and thoroughly explain the loss function.

Network structure

An overview of the proposed network is shown in Fig. 1. The entire network can be divided into two parts: the first part is the backbone, which mainly focuses on estimating the depth of the input face, and the second part is the classifier, which utilizes the estimated depth information to obtain the final classification.

Central difference convolution

Convolution is a fundamental operation used in various computer vision tasks, such as feature extraction, dimension transformation, and scale transformation. However, face anti-spoofing differs from traditional image classification: distinguishing between living and spoof faces is challenging because the differences between them are subtle. Many researchers have pointed out that traditional CNNs have difficulty capturing the crucial information that differentiates living from spoof faces. To address this, we use the Central Difference Convolution (CDC) operation in our network. Through differencing, CDC obtains gradient information that traditional convolution lacks, incorporating prior knowledge of the three-dimensional differences between real and fake faces. CDC consists of two steps, sampling and aggregation, and differs from traditional convolution in the aggregation step. We express the mathematical description of CDC as Eq. 1:

$$\begin{aligned} y(p_0) = \sum \limits _{p_n \in {\mathcal {R}}} w(p_n) \cdot \big (x(p_0 + p_n) - x(p_0)\big ) \end{aligned}$$
(1)

where \(p_0\) denotes the current position on the input and output feature maps, and \(p_n\) is the position computed in \({\mathcal {R}}\):

$$\begin{aligned} \begin{aligned} y(p_0)&= \underbrace{\theta \cdot \sum \limits _{p_n \in {\mathcal {R}}} w(p_n) \cdot \big (x(p_0 + p_n) - x(p_0)\big )}_{\text {central difference convolution}} \\&\quad + \underbrace{(1 - \theta ) \cdot \sum \limits _{p_n \in {\mathcal {R}}} w(p_n) \cdot x(p_0 + p_n)}_{\text {vanilla convolution}} \\&= \underbrace{\sum \limits _{p_n \in {\mathcal {R}}} w(p_n) \cdot x(p_0 + p_n)}_{\text {vanilla convolution}} + \underbrace{\theta \cdot \Big ( - x(p_0) \cdot \sum \limits _{p_n \in {\mathcal {R}}} w(p_n)\Big )}_{\text {central difference term}} \end{aligned} \end{aligned}$$
(2)
Table 2 Notations and their descriptions

The specific implementation of the central difference convolution is shown in Fig. 2. For the input feature map, the convolutional kernel is first applied as in traditional convolution. Then, the kernel is summed over its w and h dimensions, and the resulting kernel is convolved with the input feature map. Finally, the second feature map is subtracted from the first to obtain the final output feature map.

Fig. 2

Central difference convolution, convergence of gradients pointing to the central direction

Following the research of Zitong Yu et al. [13], we used the \(\theta \) parameter to obtain the weighted combination of the central difference convolution with the traditional convolution, resulting in the final central difference convolution. The mathematical formula is shown in Eq. 2. It should be noted that when \(\theta =0\), the CDC operation reduces to the traditional convolution.
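As a concrete illustration, the following is a minimal PyTorch sketch of a CDC layer consistent with Eq. 2; the class name and the default \(\theta =0.7\) are assumptions for illustration rather than values fixed by this paper.

```python
import torch.nn as nn
import torch.nn.functional as F


class CDConv2d(nn.Module):
    """Central difference convolution: vanilla convolution combined with a
    theta-weighted central-difference term, following Eq. 2 (sketch)."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, theta=0.7):
        super().__init__()
        # padding = kernel_size // 2 is assumed so the two outputs align spatially
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)          # vanilla convolution term of Eq. 2
        if self.theta == 0:
            return out_vanilla              # theta = 0 reduces to plain convolution
        # Sum the kernel over its spatial (h, w) dimensions and convolve with x:
        # this realises the -x(p0) * sum_n w(p_n) term of Eq. 2.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_center = F.conv2d(x, kernel_sum, stride=self.conv.stride, padding=0)
        return out_vanilla - self.theta * out_center
```

Summing the kernel over its spatial dimensions turns the \(-x(p_0)\cdot \sum _{p_n} w(p_n)\) term into an inexpensive 1\(\times \)1 convolution, so CDC adds little computational overhead compared with vanilla convolution.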

UCDCN

Our network is divided into two parts: a backbone responsible for extracting image features and regressing the depth information, and a classifier that uses the extracted depth information to achieve final classification. The backbone of our network draws inspiration from UNet++ and employs a multi-layer feature fusion approach. This approach combines low-level texture features with high-level abstract features and maximizes the utilization of fine-grained details captured by the Central Difference Convolution (CDC), which represents essential information. Figure 3 illustrates the structure of our network where each ConvBlock consists of two consecutive sets of CDCs, BatchNorm [37], and ReLU [38]. The CDC block includes a CDC layer followed by a Sigmoid layer. As mentioned earlier, the Sigmoid layer constrains the output values within the range of [0, 1], which corresponds to the actual pixel values in the image. This constraint aids in computing the loss function using normalized labels, facilitating accurate training of the network.
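Based on this description, the two blocks could be sketched as follows; the sketch reuses the CDConv2d class from the previous listing, and the channel arguments are placeholders rather than the exact widths used in UCDCN.

```python
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two consecutive CDC -> BatchNorm -> ReLU stages, as described for Fig. 3."""

    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.body = nn.Sequential(
            CDConv2d(in_ch, out_ch, theta=theta), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            CDConv2d(out_ch, out_ch, theta=theta), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class CDCBlock(nn.Module):
    """A CDC layer followed by a Sigmoid, so outputs lie in [0, 1]
    like the normalized depth labels."""

    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.cdc = CDConv2d(in_ch, out_ch, theta=theta)
        self.act = nn.Sigmoid()

    def forward(self, x):
        return self.act(self.cdc(x))
```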

Fig. 3

ConvBlock and CDCBlock implementation details

Loss function

This subsection elaborates on the loss function \({\mathcal {L}}\) that we have designed, inspired by previous research [9, 13]. Our loss function combines a depth loss and a classification loss, targeting both the depth estimation and classification tasks. The notations used in this article are listed in Table 2.

Depth map loss

Our depth loss \({{\mathcal {L}}_{depth}}\) comprises two components. The first component is the SmoothL1Loss given in Eqs. 3, 4, and 5, which is commonly used in regression analysis. This loss function combines the advantages of the Mean Absolute Error (MAE) and Mean Squared Error (MSE) losses, making it less sensitive to outliers than MSE and able to prevent gradient explosion in certain cases. The loss function can be represented as Eq. 3:

$$\begin{aligned} loss (x,y) = L = {\{ {l_1},...,{l_N}\} ^T} \end{aligned}$$
(3)

where

$$\begin{aligned} {l_n} = \left\{ \begin{aligned}&0.5{({x_n} - {y_n})^2}/\beta ,{\text { }}if\;|{x_n} - {y_n}| < \beta \\&|{x_n} - {y_n}| - 0.5\beta ,{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(4)

where \(\beta \) is the threshold for switching between the L1 and L2 losses,

$$\begin{aligned} \ell _{absolute} (x,y) = \left\{ \begin{aligned}&mean(L),{\text { }}if\;reduction = mean \\&sum(L),{\text { }\text { }}if\;reduction = sum \\ \end{aligned} \right. \end{aligned}$$
(5)

and the first component of the depth map loss, \(\ell _{absolute}\), is given by Eq. 5.

Contrast depth loss

SmoothL1Loss, while effective in regression tasks, does not consider the varying weights of pixel points in different regions that contribute to the overall regression. This limitation prevents the model from effectively capturing detailed information present in different facial regions. To overcome this limitation, we introduce a contrast depth loss that builds upon SmoothL1Loss and enhances the model’s capability to capture fine-grained details by incorporating region-specific contributions into the loss function. As depicted in Fig. 1, different facial regions exhibit distinct features, such as a prominent nose bulge or noticeable depressions in the cheeks and eyes. These features provide strong cues for distinguishing between genuine and fake faces. Following the approach of Zezheng Wang et al., we utilize a convolutional kernel, as shown in Fig. 4.

Fig. 4

\(K_i^{contrast}\) in contrast depth loss

To incorporate the contrast depth loss into our framework, we utilize the convolutional kernels \(K_i^{contrast}\) to perform convolution operations on both the ground truth depth labels and the predicted depth map. Since the contrast depth loss acts as a fine-tuning term rather than a standalone loss function, we directly compute its value using MSE. The entire process can be represented by Eq. 6:

$$\begin{aligned} {\ell _{contrast}}(x,y) = MSE(K_i^{contrast} \odot x,K_i^{contrast} \odot y) \end{aligned}$$
(6)

Therefore, the loss function for our depth estimation is expressed as Eq. 7:

$$\begin{aligned} {{\mathcal {L}}_{depth}}(x,y) = {\ell _{absolute}}(x,y) + {\ell _{contrast}}(x,y) \end{aligned}$$
(7)
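A minimal sketch of the combined depth loss is given below; the exact set of contrast kernels and the SmoothL1 \(\beta \) value are assumptions based on Fig. 4 and common practice, not values prescribed by the paper.

```python
import torch
import torch.nn.functional as F


def contrast_depth_kernels(device):
    """Eight 3x3 kernels K_i^{contrast}, each comparing the centre pixel with one
    of its eight neighbours (an assumption consistent with Fig. 4)."""
    kernels = []
    for idx in range(9):
        if idx == 4:                      # skip the centre position itself
            continue
        k = torch.zeros(3, 3)
        k[1, 1] = -1.0
        k[idx // 3, idx % 3] = 1.0
        kernels.append(k)
    return torch.stack(kernels).unsqueeze(1).to(device)   # shape (8, 1, 3, 3)


def depth_loss(pred, target, beta=1.0):
    """L_depth = l_absolute + l_contrast (Eqs. 3-7); pred and target are
    single-channel depth maps of shape (N, 1, H, W) with values in [0, 1]."""
    l_abs = F.smooth_l1_loss(pred, target, beta=beta, reduction="mean")
    kernels = contrast_depth_kernels(pred.device)
    l_contrast = F.mse_loss(F.conv2d(pred, kernels, padding=1),
                            F.conv2d(target, kernels, padding=1))
    return l_abs + l_contrast
```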

Classifier loss

Face anti-spoofing remains fundamentally a binary classification task, and thus, binary cross-entropy loss is frequently utilized to supervise this task. However, a considerable body of research [6, 39,40,41] has demonstrated that binary cross-entropy loss is highly susceptible to causing overfitting in face anti-spoofing tasks, which is a key factor contributing to poor model generalization. Moreover, given the diverse range of spoofing techniques and media, the number of negative samples in face anti-spoofing datasets, i.e., the number of spoof faces, far exceeds the number of positive samples representing genuine faces. The large proportion of negative samples in the dataset accounts for a significant portion of the total loss, and is a key reason why the face anti-spoofing task tends to focus on classifying negative samples, leading the model optimization direction to deviate from our intended objective. Uneven data distribution is a common characteristic of face anti-spoofing datasets, and therefore, for the classifier, we need to address the following two challenges.

  1. To mitigate the impact of overfitting associated with softmax on model performance.

  2. To address the impact of uneven data distribution on model performance.

In this study, motivated by the application of focal loss to object detection in Ref. [42], we adopt focal loss as the classification loss function. To explain its effectiveness in binary classification, we first recall the limitations of the conventional binary cross-entropy loss, given in Eq. 8:

$$\begin{aligned} CE(p,y) = \left\{ \begin{aligned}&- \log (p),{\text { }\text { }\text { }\text { }\text { }\text { }}if{\text { }}y = 1 \\&- \log (1 - p),{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(8)

where \(y=1\) denotes a living face sample and p is the prediction probability of the classifier; for notational convenience, we define \(p_t\) as follows:

$$\begin{aligned} {p_t} = \left\{ \begin{aligned}&p,{\text { }\text { }\text { }\text { }\text { }\text { }}if{\text { }}y = 1 \\&1 - p,{\text { }}otherwise \\ \end{aligned} \right. \end{aligned}$$
(9)

Thus, we obtain \(CE(p,y) = CE(p_t) = -\log (p_t)\). As in most scenarios, we add a weight parameter \({\alpha _t}\) to handle the class imbalance problem, giving the \(\alpha \)-balanced CE loss \(CE({p_t}) = - {\alpha _t}\log ({p_t})\), where \({\alpha _t} \in [0,1]\). As spoof faces typically constitute a significantly larger proportion of the dataset, the evaluation metrics can be artificially inflated if the model tends to predict spoof faces more frequently. This phenomenon can give rise to an illusion of exceptional model performance and is often the primary cause of overfitting. To address this issue and achieve a more balanced loss between living and spoof faces, we introduce a modulating factor \({(1 - {p_t})^\gamma }\), resulting in the focal loss formulation of Eq. 10:

$$\begin{aligned} FL({p_t}) = - {(1 - {p_t})^\gamma }\log ({p_t}) \end{aligned}$$
(10)

Figure 5 displays the focal loss curves for various values of the \(\gamma \) coefficient, which regulates the behavior of the focal loss mechanism as follows.

Table 3 The details of the datasets for face anti-spoofing
Fig. 5

Focal loss as a function of the probability \(p_t\): the curve for \(\gamma =0\) corresponds to the traditional CE loss, and curves for several other \(\gamma \) values are also shown

  1. When a sample is misclassified, there are two cases. When the ground truth is 1 and the prediction is close to 0, i.e., \(p \rightarrow 0\), then by Eq. 9 \(p_t \rightarrow 0\) and \(FL(p_t)\) is large. When the ground truth is 0 and the prediction is close to 1, \(p \rightarrow 1\), then \(p_t \rightarrow 0\) and \(FL(p_t)\) is still large.

  2. The focusing parameter \(\gamma \) smoothly adjusts the rate at which easy examples are down-weighted. When \(\gamma =0\), FL is equivalent to CE, and as \(\gamma \) increases, the effect of the modulating factor likewise increases. Based on the experimental results of previous work, we also set \(\gamma =2\) for the face anti-spoofing task.

Adding the \({\alpha _t}\) parameter, we can obtain the final focal loss as our classifier loss in Eq. 11:

$$\begin{aligned} FL({p_t}) = - {\alpha _t}{(1 - {p_t})^\gamma }\log ({p_t}) \end{aligned}$$
(11)

In summary, we use focal loss to enhance the model's ability to correctly classify difficult samples and to reduce the impact of unbalanced data distribution on model performance. Finally, the total loss of our model can be expressed as Eq. 12, where \({{\mathcal {L}}_{classify}}(x,y)=FL({p_t})\):

$$\begin{aligned} {\mathcal {L}}(x,y) = {{\mathcal {L}}_{depth}}(x,y) + {{\mathcal {L}}_{classify}}(x,y) \end{aligned}$$
(12)
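The classification and total losses can be sketched as follows; the sigmoid classifier head and the \(\alpha =0.25\) default are assumptions, while \(\gamma =2\) follows the text, and depth_loss is the sketch given earlier.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Binary focal loss of Eq. 11; `labels` is a float tensor with 1 for live
    and 0 for spoof. alpha = 0.25 is the common default from the focal-loss
    paper and is an assumption here; gamma = 2 follows the text."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    p_t = p * labels + (1 - p) * (1 - labels)                   # Eq. 9
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)       # alpha-balancing
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


def total_loss(depth_pred, depth_gt, cls_logits, cls_labels):
    """L = L_depth + L_classify (Eq. 12), reusing depth_loss from the earlier sketch."""
    return depth_loss(depth_pred, depth_gt) + focal_loss(cls_logits, cls_labels)
```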
Fig. 6

Data augmentation visualization; from left to right: random brightness, random erasing, and random angles. Depth map labels change only with angle changes

Implementation details

Datasets

In our experiments, three databases were used: OULU-NPU [43], SiW [7], and Replay-Attack [44]. Details of all databases are given in Table 3. OULU-NPU is a high-resolution database consisting of 4950 real-access and attack videos; it contains four protocols to validate the generality of the model. SiW contains more living subjects as well as three testing protocols. Replay-Attack is a database containing low-resolution videos.

Pre-processing stage

Our proposed method works on cropped face images due to the different resolutions of various devices. Jianwei Yang et al. [19] point out that a certain amount of background is helpful to the model, so we use RetinaFace [45] to crop a square region isotropically, setting the crop size to 1.2 times the face area. The input images are normalized and scaled to 128\(\times \)128 in accordance with the ImageNet [46] standard, after which the corresponding depth maps are generated using 3DDFAv2 [34], as described earlier.
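A minimal sketch of this pre-processing step is given below; the face box is assumed to come from a detector such as RetinaFace, and the border-clipping behaviour is an assumption.

```python
import cv2
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def crop_face(image_bgr, box, scale=1.2, size=128):
    """Crop an isotropic square region of `scale` times the detected face box
    (x1, y1, x2, y2), resize it to `size` x `size`, and apply ImageNet
    normalization. Returns a float array of shape (size, size, 3)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = max(x2 - x1, y2 - y1) * scale / 2.0        # isotropic square crop
    h, w = image_bgr.shape[:2]
    left, top = int(max(cx - half, 0)), int(max(cy - half, 0))
    right, bottom = int(min(cx + half, w)), int(min(cy + half, h))
    face = cv2.resize(image_bgr[top:bottom, left:right], (size, size))
    face = cv2.cvtColor(face, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return (face - IMAGENET_MEAN) / IMAGENET_STD
```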

Data augmentation

Unlike conventional image classification tasks, the data augmentation techniques employed for face anti-spoofing tasks require the incorporation of real-world scenarios, such as occlusions, changes in lighting conditions, variations in angle, and so on. In this regard, we utilized various data augmentation methods, including random erasing to simulate partial occlusions of the face, random brightness adjustments to simulate changes in lighting conditions, and random horizontal flips and rotations to simulate alterations in facial angles. Notably, the depth map labels only change proportionally with variations in the facial angle, and Fig. 6 provides a visualization of our augmentation strategy.
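The following sketch illustrates such a joint augmentation pipeline on image/depth tensor pairs; the probabilities and parameter ranges are illustrative assumptions, not the exact values used in our experiments.

```python
import random

import torchvision.transforms.functional as TF
from torchvision import transforms


def augment(image, depth):
    """Jointly augment a face tensor (3, H, W) and its depth label (1, H, W).
    Geometric changes (flip, rotation) are applied to both tensors so the depth
    label follows the face; brightness and erasing affect only the image."""
    if random.random() < 0.5:                              # random horizontal flip
        image, depth = TF.hflip(image), TF.hflip(depth)
    angle = random.uniform(-15.0, 15.0)                    # random rotation angle
    image, depth = TF.rotate(image, angle), TF.rotate(depth, angle)
    image = TF.adjust_brightness(image, random.uniform(0.7, 1.3))  # lighting change
    image = transforms.RandomErasing(p=0.5)(image)         # simulate partial occlusion
    return image, depth
```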

Fig. 7

Model pruning, reducing model complexity to avoid the risk of overfitting

Training strategies

Our proposed method was implemented using the PyTorch framework. The training section comprises two stages: freezing the classifier during training of the backbone for depth regression, and then freezing the weights of the backbone for training the classifier. This approach offers the advantage that, at the early stage of end-to-end training, the backbone may not have sufficient ability to estimate valid depth information, and backpropagation of the classifier may affect the weight update of the backbone, thereby reducing the robustness of training to some extent. Therefore, we first train the backbone separately until its performance is good enough, then train the classifier separately, and finally perform end-to-end training to improve the model’s performance. We trained the backbone for 100,000 steps, the classifier for 2500 steps, and the end-to-end training for 10,000 steps using the POLY learning rate decay strategy. We set the learning rate to 0.0001 both in the backbone stage and the end-to-end training, and 0.001 in the classifier stage.
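The staged schedule can be sketched as follows; the Adam optimizer, the POLY power, and the `backbone`/`classifier` attribute names are assumptions, and `run_steps` stands in for the actual optimization loop.

```python
import torch


def poly_lr(base_lr, step, max_steps, power=0.9):
    """POLY learning-rate decay; the power value is an assumption."""
    return base_lr * (1.0 - step / max_steps) ** power


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def staged_training(model, run_steps):
    """Three-stage schedule described above. `model` is assumed to expose
    `backbone` and `classifier` sub-modules; `run_steps(optimizer, n_steps,
    base_lr)` is a placeholder for the optimization loop using poly_lr."""
    # Stage 1: backbone only (depth regression), classifier frozen.
    set_trainable(model.classifier, False)
    set_trainable(model.backbone, True)
    opt = torch.optim.Adam(model.backbone.parameters(), lr=1e-4)
    run_steps(opt, 100_000, 1e-4)

    # Stage 2: classifier only, backbone frozen.
    set_trainable(model.backbone, False)
    set_trainable(model.classifier, True)
    opt = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
    run_steps(opt, 2_500, 1e-3)

    # Stage 3: end-to-end fine-tuning with the total loss L.
    set_trainable(model, True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    run_steps(opt, 10_000, 1e-4)
```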

Table 4 The number of images on three databases
Table 5 The results of intra-testing on four protocols of OULU-NPU
Table 6 The result on three databases

During different training phases, we will monitor different loss values. In the backbone stage, we will monitor the depth loss \({{\mathcal {L}}_{depth}}\). In the classifier stage, we will monitor the classification loss \({{\mathcal {L}}_{classify}}\) and accuracy. Finally, during end-to-end training, we will monitor the total loss function \({\mathcal {L}}\).

Fig. 8

Imperfect depth map estimated by model, live face on the left and spoof face on the right

Table 7 Different classifier performance on datasets. The convolution of each classifier is replaced by CDC
Table 8 The performance of different classifier on protocol 2
Table 9 The four protocols on OULU-NPU

Evaluation metrics

In the OULU-NPU [43] database, we followed the original protocols and metrics for a fair comparison. The following evaluation metrics were used for all databases.

  1. The Attack Presentation Classification Error Rate (APCER [47]) measures the rate at which spoof faces are misclassified as live;

  2. The Bona Fide Presentation Classification Error Rate (BPCER [47]) measures the rate at which living faces are misclassified as spoof;

  3. The Average Classification Error Rate (ACER [47]) is computed as the average of the APCER and the BPCER, as in Eq. 13 (a minimal computation sketch follows this list):

    $$\begin{aligned} ACER=\frac{APCER+BPCER}{2} \end{aligned}$$
    (13)
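A minimal sketch of these metrics is shown below; note that under the ISO protocol APCER is normally computed per attack type (with the worst case reported), whereas this simplified version pools all attacks together.

```python
import numpy as np


def error_rates(scores, labels, threshold=0.5):
    """APCER, BPCER and ACER (Eq. 13). `labels`: 1 = bona fide (live), 0 = attack;
    `scores`: predicted liveness score in [0, 1]."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pred_live = scores >= threshold
    apcer = np.mean(pred_live[labels == 0])      # attacks wrongly accepted as live
    bpcer = np.mean(~pred_live[labels == 1])     # live faces wrongly rejected
    acer = (apcer + bpcer) / 2.0                 # Eq. 13
    return apcer, bpcer, acer
```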

Experiments and results

We utilized the entire training set to train our model and subsequently evaluated the performance on the entire test set to ascertain its effectiveness on the whole database. The number of images in the test and training sets was counted for the three databases, as shown in Table 4. All images were subjected to rigorous data cleaning, and each image was resized to 128\(\times \)128. The results of our experiments using the full training and test sets are presented in Table 6. In Table 7, we provide supplementary results for different datasets using distinct classification heads. The results were classified with a threshold of 0.5, and most of the models achieved over 98% accuracy, as demonstrated in D. These results are highly encouraging.

In addition, we provide metrics under the OULU protocol in Table 5 to demonstrate the effectiveness of our proposed method. UCDCN-\(\text {L}_3\) shows similar performance to the original CDCN, while UCDCN-\(\text {L}_2\) exhibits higher performance than the original CDCN.

Visualization and analysis

In the proposed architecture, estimated depth maps serve as input to the classifier and are supervised by the focal loss. These maps provide information on the importance of different facial areas to the classifier. During the training process, as shown in Fig. 9, the estimated depth maps and corresponding classification results are displayed. The input image size is 128\(\times \)128, and the estimated depth map size is also 128\(\times \)128. An advantage of this approach is that the backbone can calculate the point-to-point regression loss between the estimated depth map and ground truth.

Through the visualization, it is evident that the depth map estimated by the model is highly accurate, detailed, and adaptable to various angles. This fine-grained depth information effectively reduces the classifier’s burden, leading to a robust architecture that is not affected by variations in luminance or angles.

Although the performance of the backbone affects the classifier, the classifier is able to compensate for the shortcomings of the backbone to some extent. Since the backbone cannot estimate near-perfect depth information for each image, in Fig. 8, we demonstrate the imperfect depth map estimated by the model. Therefore, it is also important to allow the classifier to autonomously determine the depth information threshold to distinguish between spoof and living faces for some imperfect depth maps. In Table 8, we present the performance of different classifiers under OULU-NPU protocol II.

Our model maximizes the utilization of the 3D shape of the face, which involves obtaining gradient information using CDC, a crucial difference from the flat spoofing face. It is important to note that CDC is not only utilized in the depth regression task, but we also replace vanilla convolution with CDC in the downstream classification task, as depth maps also possess 3D information. To gain a better understanding of the features extracted by the model in the depth regression stage, we implement a visualization of the middle layer.

Fig. 9

Depth map visualization, “depth:1” means live face, “depth:0” means spoof face

Fig. 10

The top row shows the living face and part of the output features from the first convolution layer; the bottom row shows the same for the spoof face

Fig. 11

The output feature maps of the first convolution layer in the Decoder; the images are arranged in the same order as in Fig. 10

As depicted in Fig. 10, we visualize the output features of the first convolution layer, where the leftmost column presents the original input images. The top and bottom rows show the living and spoof faces, respectively, along with their corresponding output features from the same channel of the same convolution layer. It is evident that the features extracted from the living face better differentiate the foreground from the background and provide a clearer edge description, whereas the spoof face is less distinguishable from the background. In addition, the model effectively emphasizes regions with obvious gradients, such as the eyes and mouth corners, for living faces. This further indicates that CDC can extract the most essential gradient information from living faces and distinguish them from flat spoof faces.

As shown in Fig. 11, we visualize the first layer of convolution features in the Decoder following the same order as the visualized image in Fig. 10. We observed significant differences in the output features of the living face and the spoof face in the Decoder. The second and fourth columns of the figure demonstrate that the model effectively regresses the depth information of the face in the Decoder. For the living face, the depth map is very clear and accurately portrays the nose tip and other facial parts. However, in contrast, the regression depth information for the spoof face is nearly zero.

Furthermore, the model’s regression pattern for the background remains consistent for both the living face and the spoof face. The disparity lies in the incorporation of image segmentation domain knowledge. In the case of the living face, a face-shaped region is present in the center, signifying that the utilization of depth map regression enhances the model’s emphasis on the facial region and captures features that differentiate the face from the background. As a result, the background alone has minimal impact on the model’s performance, whether it is a living face or a spoof face, thereby leading to an improvement in the model’s performance to some extent.

Fig. 12

The degree of contribution of different hierarchical features to the model

Model pruning

As shown in Figs. 7 and 1, our model incorporates features from multiple layers, but each layer's contribution to the model differs during inference. In Fig. 12, we utilize Eq. 14 to calculate the variance of each hierarchical parameter in Fig. 7, in order to assess the extent to which different hierarchical features contribute to the model. It is evident that \(X^{0,1}\), \(X^{1,1}\), and \(X^{0,2}\) have the lowest variance; these are the intermediate-layer features highlighted in Fig. 13. From the feature visualization in Fig. 14, it can be seen that the output features of \(X^{1,1}\) are very sparse, while \(X^{0,1}\) and \(X^{0,2}\) output zero features. These findings indicate that the outer structure used by UCDCN for depth estimation makes a significant contribution to the model output, while the inner structure plays a minor role. This insight presents the potential for model pruning and parameter reduction:

$$\begin{aligned} {y_i} = \log \left( \frac{{{\theta _i} \cdot e}}{{\min \Theta }}\right) \end{aligned}$$
(14)
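A small sketch of this computation follows; interpreting \(\theta _i\) as the variance of each layer's weights is our reading of the text and is therefore an assumption.

```python
import math


def layer_contributions(named_parameters):
    """Compute y_i = log(theta_i * e / min(Theta)) from Eq. 14, taking theta_i as
    the variance of each layer's weight tensor."""
    variances = {name: float(p.detach().var())
                 for name, p in named_parameters if p.numel() > 1}
    theta_min = min(variances.values())
    return {name: math.log(v * math.e / theta_min) for name, v in variances.items()}

# Example usage: layer_contributions(model.backbone.named_parameters())
```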
Fig. 13

The shade of color indicates the level of contribution. The features in the red dashed box have a low level of contribution, and the higher up the middle layer, the lower the contribution of the features

Fig. 14

Intermediate layer output feature visualization. The top two groups represent \(X^{0,1}\),\(X^{0,2}\) output features from left to right. The bottom group represents \(X^{1,1}\) output features

Training strategy analysis

Instead of directly conducting end-to-end training, our model is first trained on the regression part so that it can effectively regress the depth information of the face, which is then used for classification, making our model more interpretable. In contrast, end-to-end training would feed abstract features into the classifier, which may achieve correct classification but is difficult to explain and does not align well with human reasoning. Figure 15 illustrates the difference between the end-to-end and non-end-to-end training mechanisms. It can be observed that end-to-end training can also generate depth information, but its description of detailed information is poor. This is because, during end-to-end training, reducing the classification loss lowers the contribution of the regression loss to the overall loss, thereby reducing the model's capability to regress the depth map. Therefore, separate training of the regression part is used to obtain a more detailed depth map.

Fig. 15

The first row displays the depth map generated by separately training the regression part, while the second row displays the depth map generated by directly training the entire network end-to-end

Comparison and performance

Model pruning offers two advantages: removing neurons that contribute little to the model and reducing the number of parameters, thus avoiding overfitting on small databases. The OULU-NPU protocols have different data allocation methods, resulting in a small amount of data for each protocol. To mitigate the risk of overfitting, we adopt pruning to reduce the complexity of the model. We remove the last fused layer, as shown in Fig. 7, and name the result UCDCN-\(\text {L}_3\); removing one more layer from UCDCN-\(\text {L}_3\) yields UCDCN-\(\text {L}_2\). Our purpose is to further compress the complexity and parameter count of the model; the difference is that UCDCN-\(\text {L}_2\) uses the parameters of UCDCN-\(\text {L}_3\) as pretrained weights and is trained iteratively for 10,000 steps. We perform intra-testing on the OULU-NPU database, following the four protocols in Table 9, and report the APCER, BPCER, and ACER in Table 5. It can be seen that \(L_2\), using \(L_3\) as a pretrained model, achieves better results. This further suggests that there are redundant intermediate-layer neurons in \(L_3\), which contain redundant information that may lead to poor model performance. In addition, \(L_3\) with a linear classifier has 11.26 FLOPs while \(L_2\) has 5.79 FLOPs, and its peak frame rate can reach 653 fps on an RTX 3090 graphics card, which greatly reduces the difficulty of deployment in production environments. Our model shows improvements over previous methods on the different protocols, which supports the validity of the proposed approach.

Discussion

The datasets we employ are based on presentation media and assume that the depth of a spoofed face is zero. However, in reality, facial changes and scenes are highly complex, with intricate pitch-angle variations, diverse luminance changes, and various attack presentation methods such as different angles and distances, degrees of photo bending, and more. Consequently, the depth of such a spoofed face is not exactly zero; however, its depth information is significantly lower than that of a genuine living face.

In general, we have made improvements in data acquisition to ensure high-quality data. In addition, we have enhanced the network structure to enable the model to effectively utilize features at different levels. Furthermore, we have refined our training strategy and loss function to mitigate the risks of overfitting. We believe that methods from different domains within computer vision tasks can be adapted and applied to face anti-spoofing tasks, thereby enhancing the performance of the model. Overall, we firmly believe that depth-based face anti-spoofing is a promising and valuable area for further research.

Conclusion and future work

In this paper, we propose a novel face anti-spoofing network structure based on central difference convolution. The structure comprises a backbone for depth estimation and a classifier for binary prediction. We leverage various domain knowledge, including UNet++ multi-layer feature fusion from image segmentation, focal loss from object detection to address data imbalance and hard-sample classification, and a face depth estimation method for constructing depth maps. Furthermore, we design a multi-task model training strategy that facilitates both depth map regression and classification. Our network achieves satisfactory metrics on Replay-Attack, OULU-NPU, and SiW, and it is easy to deploy due to the reduced number of model parameters. This easily deployable face anti-spoofing network can provide rapid and effective protection in scenarios where real-time decision-making is crucial, which is essential for applications such as financial transactions and identity verification, enabling the timely identification and interception of potential fraudulent activities.

However, our method might generalize poorly to unknown attack types, so it is crucial to enhance its generalization capacity. Recently proposed zero/few-shot learning [48,49,50,51] can quickly adapt the model to new attacks by learning from both the predefined attacks and very few collected samples of the new attack. The Domain Effective Fast Adaptive nEtwork (DEFAEK) [52], based on the optimization-based meta-learning paradigm, effectively and quickly adapts to new tasks. On the other hand, Adv-APG [53] regards face anti-spoofing as a unified framework of attack and defense systems and optimizes the defense system against unseen attacks via adversarial training with the attack system. To avoid identity-biased and domain-biased features, UDG-FS [54] proposes a novel Split-Rotation-Merge module to build identity-agnostic local representations. These studies give us inspiration for future research, and we will improve our model along these directions.