1 Introduction

The process of deinterlacing converts a stream of interlaced frames within a video sequence to progressive frames [1], to ensure their playback on today's progressive display devices. Such video processing has been widely studied in the recent literature [28], as the interlaced video format is still preferred in acquisition systems when high-fidelity motion accuracy is needed. Deinterlacing requires the display device to buffer one or more fields and recombine them into a full progressive frame. There are various methods to deinterlace a video, and each method produces its own artifacts, due to the temporal lack of information and the dynamics of the video sequence.

Spatial deinterlacers [2, 4, 7, 9] use the information from the current field to interpolate the missing field lines. The most common types of spatial deinterlacing methods are line averaging and directional spatial interpolation. Edge-based line averaging interpolates along the edge direction, by comparing the gradients of various directions. The interpolation accuracy of edge-based line averaging is increased by an efficient estimation of the directional spatial correlations of neighboring pixels. Usually, spatial deinterlacing methods have a low computational cost.

However, one disadvantage of spatial deinterlacing is that this class of methods is suboptimal, since motion activity is not considered in the interpolation; moreover, these algorithms fail to remove flickering artifacts.

Motion adaptive methods, such as the ones proposed in [5, 6, 8], use consecutive fields to analyze the characteristics of motion in order to choose the appropriate interpolation scheme. In such deinterlacers, dynamic areas are interpolated spatially and static segments are interpolated temporally. The best class of deinterlacers is the motion-compensated one [3, 10]. In these schemes, the motion trajectory is estimated and the interpolation of the missing fields is done along the motion flow. However, motion-compensated deinterlacers need massive computational resources. To reduce their complexity, block-based motion estimation is used at the expense of blocking artifacts and some unreliable motion information [11], which severely degrade the visual quality of the reconstructed video sequences.

A single-field interpolation algorithm based on block-wise autoregression, which considers the mutual influence between the missing high-resolution pixels and the given interlaced, low-resolution pixels in a sliding window, is introduced in [4]. A method that selects different interpolation techniques by classifying each missing pixel into two categories according to local region gradient features is discussed in [5]. Further, a statistical approach that uses Bayes theory to model the image residuals as Gaussian and Laplacian distributions is used in [6] to estimate the missing pixels. A method that improves the accuracy of motion vectors for video deinterlacing by selectively using optical flow results to assist block-based motion estimation is proposed in [12], at a high computational cost. The computational load of block-based compensation can be reduced using predictive area search algorithms, which estimate the motion vectors (MV) of the current block using the MVs of previous blocks [13]. Neural networks and fuzzy logic can also be used as deinterlacing solutions. A way to exploit fuzzy reasoning to reinforce contours, improving an edge-adaptive deinterlacing algorithm without an excessive increase in computational complexity, is discussed in [14]. Another fuzzy-logic approach to deinterlacing is a fuzzy-bilateral filtering method whose range and domain filters are based on a fuzzy metric [2, 8].

In this paper, in order to reduce blocking artifacts and hence improve the Quality of Experience (QoE) of human viewers, we propose to use block-based motion estimation on smooth areas, while on highly textured areas optical-flow-based pixel velocity is used [15], since this method is free of blocking effects. To improve the frame reconstruction quality, visual saliency-guided interpolation of the estimated temporal field is used. The use of visual saliency [16] as a trigger for the spatio-temporal interpolator has two advantages. First, for non-salient regions no motion estimation is performed and the areas are spatially interpolated, which greatly reduces the complexity of the proposed deinterlacer.

The second advantage is a corollary of the first: the computing resources, devoted mainly to the motion estimation process, can be spent entirely on the region-of-interest area.

The paper is organized as follows: Sect. 2 introduces the notion of visual saliency and presents some existing saliency models. In particular, a focus is made on the graph-based visual saliency (GBVS) model, which outperforms the reference model, as well as on the spectral residual visual saliency (SRVS) model. Then, Sect. 3 describes the proposed saliency-based spatio-temporal video deinterlacing method. Experimental results obtained with the proposed method for different video sequences are presented in Sect. 4, together with a comparison between deinterlacing processes using different saliency models. Simulation results are presented and discussed. Finally, conclusions are drawn in Sect. 5.

2 Visual saliency

Visual saliency is defined in [17] as the distinct subjective perceptual quality which makes some items in the world stand out from their neighbors and immediately grab our attention. The visual saliency process allows a human observer to focus her/his attention on one or more visual stimuli in a scene depending on semantic features such as orientation, motion or color.

Fig. 1 Flowchart diagram of the spectral saliency model

It constitutes one of the most important properties of the human visual system (HVS), with numerous applications in digital imaging, including content-aware video coding, segmentation and image resizing [18, 19]. To model human visual attention, several visual saliency models have recently been proposed in the literature [20–22]. Generally, these models compute a so-called visual saliency map, a topographically arranged map that represents the visually salient parts, also called regions of interest (ROI), of a visual scene. Among the different existing saliency models, the one proposed by Itti et al. [17, 23] is the most popular.

The Itti algorithm exploits three low-level semantic features of an image: color, orientation and intensity. These features are extracted from the image to establish feature maps. Finally, the saliency map is computed from these feature maps after normalization and pooling.

In [16], the authors propose a Graph-Based Visual Saliency (GBVS) model which improves the model developed by Itti et al. The GBVS model relies on a fully connected graph between feature maps at multiple spatial scales. It is shown that the GBVS model outperforms the Itti model in predicting human visual attention while viewing natural images. However, the computational complexity of the GBVS model constitutes a significant drawback for deinterlacing implementation purposes. Hence, other low-complex saliency models have been considered to replace the GBVS one.

Among these, we retain the so-called spectral residual visual saliency (SRVS) model described in [10, 15]. Figure 1 represents the flowchart of the spectral residual saliency model's computation. The model relies on spectral residual saliency detection, an approach developed in computer vision to simulate the behavior of pre-attentive visual search. Unlike traditional image statistical models, it analyzes the log spectrum of each image of the video sequence and estimates the corresponding spectral residual. Then, the spectral residual is transformed back to the spatial domain to obtain the saliency map. This method explores the properties of the background areas, rather than the target objects. The procedure can be detailed as follows.

Given the luminance component (Y) of a field, the amplitude spectrum, noted A(f), and the phase spectrum, noted P(f), are first evaluated from the real and the imaginary parts of the two-dimensional Fourier transform of the luminance component:

$$\begin{aligned} A(f) = \sqrt{{\mathrm{Re}[\textit{F}(Y)]}^2+{\mathrm{Im}[\textit{F}(Y)]}^2} \end{aligned}$$
(1)
$$\begin{aligned} P(f) = \tan ^{-1} \frac{\mathrm{Im}\left[ \textit{F(Y)}\right] }{\mathrm{Re}\left[ \textit{F(Y)}\right] } \end{aligned}$$
(2)

where \(\textit{F}\) represents the Fourier transform. The log spectrum L(f) is then obtained by:

$$\begin{aligned} L(f) = \mathrm{log}[A(f)] \end{aligned}$$
(3)

The average spectrum can be approximated by convolving the log spectrum with a matrix \(h_n(f)\):

$$\begin{aligned} A_{s}(f) = h_{n}(f)*L(f) \end{aligned}$$
(4)

where \(h_n(f)\) is an \(n\times n\) averaging matrix with all entries equal to \(1/n^2\).

Finally, the spectral residual R(f) consists of the statistical singularities specific to the input image and is obtained, for each frame of a video sequence, as the difference between the log spectrum and the averaged spectrum:

$$\begin{aligned} R(f) = L(f) - A_{s}(f) \end{aligned}$$
(5)

The spectral residual is then converted to the saliency map S(x) using the inverse two-dimensional Fourier transform. The resulting saliency map contains primarily the non-trivial part of the visual scene. The value at each point of the saliency map is squared to indicate the estimation error. For better visual effects, the saliency map is traditionally smoothed with a Gaussian filter g(x) with a typical variance of 8:

$$\begin{aligned} S(x) = g(x)*\left| \textit{F}^{-1}\left[ \mathrm{exp}\left( R(f) + iP(f)\right) \right] \right| ^2. \end{aligned}$$
(6)
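To make the above procedure concrete, the following is a minimal NumPy/SciPy sketch of Eqs. (1)–(6); the function name, the filter size n = 3 and the magnitude-squared inverse transform follow common implementations of spectral residual detection and are assumptions rather than details prescribed by the text.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(Y, n=3, sigma=8):
    """Spectral residual saliency map (Eqs. 1-6) for a luminance field Y."""
    F = np.fft.fft2(Y.astype(np.float64))
    A = np.abs(F)                      # amplitude spectrum, Eq. (1)
    P = np.angle(F)                    # phase spectrum, Eq. (2)
    L = np.log(A + 1e-12)              # log spectrum, Eq. (3)
    A_s = uniform_filter(L, size=n)    # local average via n x n mean filter, Eq. (4)
    R = L - A_s                        # spectral residual, Eq. (5)
    # Back to the spatial domain: recombine the residual amplitude with the
    # phase, take the squared magnitude and smooth with a Gaussian (Eq. 6).
    S = np.abs(np.fft.ifft2(np.exp(R + 1j * P))) ** 2
    return gaussian_filter(S, sigma=sigma)
```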

3 Saliency-based deinterlacing

The flowchart of the proposed algorithm is depicted in Fig. 2. As the field interpolation model depends on the saliency map, the first step of our algorithm is the computation of the spatial saliency of the current field to be deinterlaced. To compute the so-called saliency map, the two models described in Sect. 2 have been considered: the GBVS model proposed in [16] and the spectral saliency model. The obtained saliency map, denoted in the following by S (depicted in Fig. 3 in the case of the GBVS model) and consisting of gray values \(S(i,j) \in \{0\ldots 255\}\), will trigger, along with the texture type, the interpolation used for the current field. In addition, a Canny edge detector is applied on the current field to obtain the edge mask C.

Fig. 2 Flowchart diagram of the proposed deinterlacing algorithm

Further, the current field is partitioned into blocks of fixed size \(B^2\), each block being categorized according to whether it belongs to the salient region, as follows: the block \(b_n\) of size \(B^2\) is said to be salient/important, i.e.:

$$\begin{aligned} S_{b_n} = \Sigma _{i=1}^{B}\Sigma _{j=1}^{B} s_n(i,j)/B^2 \end{aligned}$$
(7)

if the mean \(S_{b_n}\) of the entire collocated block \(s_n\) within the saliency map S is higher than a given threshold \(T_s\); otherwise, the block is classified as smooth.
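As an illustration, a block can be tested against Eq. (7) as in the short sketch below; the helper name and the array-slicing convention are hypothetical, and the values B = 8 and T_s = 20 are those reported in Sect. 4.

```python
def is_salient_block(S, top, left, B=8, T_s=20):
    """Classify a BxB block as salient if its mean saliency exceeds T_s (Eq. 7)."""
    s_n = S[top:top + B, left:left + B]   # collocated block in the saliency map S
    return s_n.mean() > T_s               # S_{b_n} > T_s -> salient, else smooth
```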

Also, for each block \(b_n\) belonging to the current field \(f_n\), its number of edge pixels is derived as in Eq. (8), by counting the pixels lying on contours in the collocated block \(c_n\) of the mask field C obtained with the Canny filter:

Fig. 3 Saliency map obtained for (a) the 10th frame of the “Foreman” sequence, (b) the 21st frame of the “Salesman” sequence

$$\begin{aligned} CE_{b_n} = \Sigma _{i=1}^{B}\Sigma _{j=1}^{B} c_n(i,j) \end{aligned}$$
(8)

where \(\textit{CE}_{b_n}\) is the number of edge pixels identified in block \(b_n\). The block \(b_n\) is classified as highly textured if \(\textit{CE}_{b_n}\) is significant with respect to the block size \(B^2\), i.e.:

$$\begin{aligned} \textit{CE}_{b_n}> T_b \end{aligned}$$
(9)

(\(T_b\) is a threshold depending on \(B^2\)), or smooth, if Eq. (9) does not hold.

If the block \(b_n\) belongs to a salient region and its number of contours is significant (as in Eq. 9), optical flow-based motion estimation is applied; for salient blocks with few contours, block-based estimation is used instead.
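The resulting three-way routing, combining the saliency test of Eq. (7), the edge test of Eq. (9) and the spatial interpolation of non-salient blocks described next, could be sketched as follows; the function and mode names are illustrative.

```python
def choose_interpolator(S, C, top, left, B=8, T_s=20, T_b=32):
    """Route a block to an interpolation mode from its saliency and edge content."""
    block_saliency = S[top:top + B, left:left + B].mean()   # mean saliency, Eq. (7)
    edge_count = C[top:top + B, left:left + B].sum()         # Eq. (8), C is a 0/1 Canny mask
    if block_saliency <= T_s:
        return "spatial_ela"        # non-salient block: 5-tap edge-line averaging
    if edge_count > T_b:            # Eq. (9): highly textured salient block
        return "optical_flow_mc"
    return "block_based_mc"         # smooth salient block
```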

If the block \(b_n\) does not belong to a salient region, simple spatial 5-tap edge-line averaging (Fig. 4) is used to obtain the deinterlaced block \(\hat{b}_n(i,j)\), i.e.:

$$\begin{aligned} \hat{b}_{n}(i,j) = \frac{b_{n}(i-1,j+x_0) + b_{n}(i+1,j-x_0)}{2},\qquad \end{aligned}$$
(10)

where the exact value of \(x_0\) is given by the minimization:

$$\begin{aligned}&\left| b_n(i-1,j+x_0) - b_n(i+1,j-x_0)\right| \nonumber \\&\quad = \min _{x_0 \in \left\{ -2,-1,0,1,2\right\} }{\left| b_n(i-1,j+x_0) - b_n(i+1,j-x_0)\right| }.\nonumber \\ \end{aligned}$$
(11)
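A direct, unoptimized rendering of the 5-tap ELA interpolation of Eqs. (10)–(11) is given below; boundary handling at the field borders is omitted for brevity and the function name is illustrative.

```python
def ela_5tap(field, i, j):
    """5-tap edge-line average (Eqs. 10-11) for the missing pixel at (i, j)."""
    offsets = [-2, -1, 0, 1, 2]
    # Pick the direction x0 minimizing the absolute difference between the
    # upper-line and lower-line pixels (Eq. 11) ...
    best = min(offsets,
               key=lambda x0: abs(int(field[i - 1, j + x0]) - int(field[i + 1, j - x0])))
    # ... and average along that direction (Eq. 10).
    return (int(field[i - 1, j + best]) + int(field[i + 1, j - best])) / 2.0
```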

For the salient blocks, the motion vectors (MV) are obtained in the backward and forward directions for the current field, by applying either the OF-based estimation proposed by Liu in [5] or simple block-based ME.

We assume that the motion trajectory is linear; hence, the obtained forward motion vectors (MVs) are split into backward (MVB) and forward (MVF) motion vector fields for the current field \(f_n\). As a block in \(f_n\) may have zero or more than one MV passing through it, the corresponding \(\mathrm{MV}^n\) for the block \(b_n \in f_n\) is obtained by minimizing the Euclidean distance between \(b_n\)'s center, \((y_{n,0},x_{n,0})\), and the lines of the passing MVs. In this minimization, we consider only the MVs obtained for the blocks in the neighborhood of the collocated block \(b_{n-1}\) in the previous field \(f_{n-1}\) (thus, a total of nine MVs, obtained for \(b_{n-1}\) and the blocks adjacent to \(b_{n-1} \in f_{n-1}\)), as these MVs are expected to be the most correlated with the one of the current block, e.g., belonging to the same moving object.

Considering that the motion vector MV corresponding to the collocated block \(b_{n-1} \in f_{n-1}\) lies on the line:

$$\begin{aligned} \frac{y-y_{n-1,0}}{\mathrm{MV}_y}=\frac{x-x_{n-1,0}}{\mathrm{MV}_x} \end{aligned}$$
(12)

where \((y_{n-1,0},x_{n-1,0})\) is the center of \(b_{n-1}\), and \(\mathrm{MV}_x\), respectively \(\mathrm{MV}_y\), measures the displacement along the x, respectively y, axis, the distances from the center \((y_{n,0},x_{n,0})\) of the current block \(b_n\) to the candidate MV lines are obtained as:

Fig. 4 5-Tap Edge Line Average (ELA): the five interpolation directions are represented with different dashed lines. Gray nodes correspond to original pixels from the upper and lower lines (solid lines); the black node is the interpolated pixel

$$\begin{aligned} D_{k \in \{1,\ldots ,9\}}= \frac{\left| \mathrm{MV}_{k,x} y_{n,0} - \mathrm{MV}_{k,y} x_{n,0} + \mathrm{MV}_{k,y} x_{n-1,k} - \mathrm{MV}_{k,x} y_{n-1,k} \right| }{\sqrt{{\mathrm{MV}_{k,x}}^2 + {\mathrm{MV}_{k,y}}^2}}. \end{aligned}$$
(13)

\(\mathrm{MV}^n\), the motion vector assigned to the current block \(b_n\), is the candidate whose distance to the center of \(b_n\), \((y_{n,0},x_{n,0})\), is minimal, i.e., \(D_n = \min (D_{k \in \{1,\ldots ,9\}})\). Hence, an \(\mathrm{MV}^n\) is generated for each block and used as the motion estimate, in the x and y directions, for every pixel of that block.
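A possible implementation of the candidate selection of Eqs. (12)–(13) is sketched below, assuming the nine candidate MVs and the centers of their originating blocks in \(f_{n-1}\) are already available; the handling of the degenerate zero-vector case, not discussed in the text, falls back here to the point-to-point distance.

```python
import numpy as np

def select_block_mv(candidates, centers_prev, center_cur):
    """Pick, among the nine neighbouring MVs of f_{n-1}, the one whose motion line
    passes closest to the centre of the current block (Eqs. 12-13)."""
    y0, x0 = center_cur
    best_mv, best_d = None, np.inf
    for (mvy, mvx), (yc, xc) in zip(candidates, centers_prev):
        norm = np.hypot(mvx, mvy)
        if norm == 0:                     # zero vector: the line degenerates to a point
            d = np.hypot(y0 - yc, x0 - xc)
        else:                             # point-to-line distance, Eq. (13)
            d = abs(mvx * y0 - mvy * x0 + mvy * xc - mvx * yc) / norm
        if d < best_d:
            best_mv, best_d = (mvy, mvx), d
    return best_mv
```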

The forward and backward MVs for each block are obtained as:

$$\begin{aligned} \mathrm{MV}_{B}^{n} = \frac{-\mathrm{MV}^n}{2}, \quad \mathrm{MV}_{F}^n = \frac{\mathrm{MV}^n}{2}. \end{aligned}$$
(14)

The backward prediction of \(b_n\), denoted by \(\mathcal F^{n}_{\mathrm{MV}_{B}}\), is obtained as:

$$\begin{aligned} \mathcal F^{n}_{\mathrm{MV}_{B}}(i,j) = b_{n-1}\left( i + \mathrm{MV}_{By}^n, j + \mathrm{MV}_{Bx}^n\right) , \end{aligned}$$
(15)

and the forward prediction of the current block, \(\mathcal F^{n}_{\mathrm{MV}_{F}}\), is obtained as:

$$\begin{aligned} \mathcal F^{n}_{\mathrm{MV}_{F}}(i,j) = b_{n+1}\left( i + \mathrm{MV}_{Fy}^n, j + \mathrm{MV}_{Fx}^n\right) . \end{aligned}$$
(16)

The motion-compensated block \(\hat{b}^n\) to be further used for the deinterlacing of \(b_n\) is obtained as average of the backward, \(\mathcal F^{n}_{\mathrm{MV}_{B}}\), and forward, \(\mathcal F^{n}_{\mathrm{MV}_{F}}\), predictions:

$$\begin{aligned} \hat{b}^n(i,j) = \frac{\mathcal F^{n}_{\mathrm{MV}_{B}}(i,j) + \mathcal F^{n}_{\mathrm{MV}_{F}}(i,j)}{2}. \end{aligned}$$
(17)
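Under the assumption of integer-pel motion vectors (the text does not specify how the half-pel components of Eq. (14) are rounded), the backward/forward compensation of Eqs. (14)–(17) could look as follows; the function name and block addressing are illustrative.

```python
def motion_compensate_block(prev_field, next_field, mv, top, left, B=8):
    """Average of the backward and forward predictions of block b_n (Eqs. 14-17)."""
    mvy, mvx = mv                               # MV^n, vertical and horizontal components
    by, bx = -mvy // 2, -mvx // 2               # backward MV, Eq. (14); odd MVs are floored
    fy, fx = mvy // 2, mvx // 2                 # forward MV, Eq. (14)
    back = prev_field[top + by: top + by + B, left + bx: left + bx + B]   # Eq. (15)
    fwd = next_field[top + fy: top + fy + B, left + fx: left + fx + B]    # Eq. (16)
    return (back.astype(float) + fwd.astype(float)) / 2.0                 # Eq. (17)
```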

Finally, the deinterlaced block is found in a saliency-based motion-compensated manner, as:

$$\begin{aligned} \hat{b}_n(i,j) = \frac{b_n(i-1,j+x_0)+ b_n(i+1,j-x_0) + s_n(i,j)\hat{b}^n(i,j)}{s_n(i,j)+2}, \end{aligned}$$
(18)

where \(s_n(i,j)\) is the corresponding saliency value within the saliency map S, acting as a weight for the motion-compensated interpolation, and \(x_0\) is obtained by the edge-line minimization in (11).
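The saliency-weighted blending of Eq. (18) then reduces, per missing pixel, to a few lines; the signature below is hypothetical and assumes b_hat holds the motion-compensated prediction of Eq. (17) and x0 the ELA direction of Eq. (11).

```python
def saliency_weighted_pixel(field, S_map, b_hat, i, j, x0):
    """Blend ELA and motion compensation using the saliency value s_n(i, j) (Eq. 18)."""
    ela_sum = int(field[i - 1, j + x0]) + int(field[i + 1, j - x0])
    s = float(S_map[i, j])                     # saliency weight of the missing pixel
    return (ela_sum + s * b_hat[i, j]) / (s + 2.0)
```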

4 Experimental results

To objectively and comprehensively assess the performance of the proposed deinterlacing approach, our method has been tested on several CIF (\(352\times 288\); “Foreman”, “Hall”, “Mobile”, “Stefan” and “News”) and QCIF (\(176\times 144\); “Carphone” and “Salesman”) video sequences. These well-known test sequences have been chosen for their different texture content and motion dynamics. Such spatio-temporal characteristics can be quantified by computing the relative spatial information (SI) and temporal information (TI) of these video contents, as described in [24]. Figure 5 shows the relative amount of spatial and temporal information for the selected test scenes. We can note that they span a large portion of the SI–TI plane, as desired. Moreover, the two (SI, TI) pairs located in the top right part of the diagram correspond to the “Mobile” and “Stefan” CIF sequences, which are known to contain very high spatio-temporal activity.

Fig. 5 Spatial–temporal information diagram for the test video scene set

Fig. 6 Progressive to interlaced format frame conversion, by removing the dashed lines

The selected video sequences were originally in progressive format. To generate interlaced content, the even lines of the even frames and the odd lines of the odd frames were removed, as shown in Fig. 6. In this way, objective quality measurements could be performed, using the original progressive frames as references.
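For reproducibility, the progressive-to-interlaced conversion of Fig. 6 and the luminance PSNR used as the quality metric can be sketched as below; the 0-based frame indexing and the NaN marking of removed lines are implementation choices, not part of the original description.

```python
import numpy as np

def to_interlaced(frames):
    """Remove even lines of even-indexed frames and odd lines of odd-indexed frames (Fig. 6)."""
    fields = []
    for k, frame in enumerate(frames):
        field = frame.astype(float).copy()
        field[k % 2::2, :] = np.nan        # removed lines, to be re-interpolated by the deinterlacer
        fields.append(field)
    return fields

def y_psnr(reference, reconstructed):
    """Peak signal-to-noise ratio on the luminance component, 8-bit range."""
    mse = np.mean((reference.astype(float) - reconstructed.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```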

In our experimental framework, the GBVS model is first considered. We have used \(8\times 8\) \((B=8)\) pixel blocks with a \(16 \times 16\) \((S = 16)\) motion estimation search window, applied to the salient blocks \(b_n\) having a small number of contours (\(CE_{b_n} < T_b\)). The saliency threshold \(T_s\) was set to 20, and the edge threshold \(T_b\) to 32 (i.e., at least half of the block pixels lie on contours).

The tests were run on 50 frames of each sequence. The deinterlacing performance of our method is presented in terms of the peak signal-to-noise ratio (PSNR) computed on the luminance component. The efficiency of our proposed method, denoted in the following by SGAD, is compared in Table 1 to Vertical Average (VA), Edge Line Average (ELA), Temporal Field Average (TFA), Adaptive Motion Estimation (AME) and Motion-Compensated Deinterlacing (MCD), which are the most common implementations in deinterlacing systems. Moreover, the proposed algorithm is compared to the work in [27], denoted by EPMC, to [28], denoted by SMCD, and to the methods proposed in [29] (high-fidelity motion-estimation-based deinterlacer), [30] (adaptive motion-compensated interpolator with overlapped motion estimation) and [31] (hybrid low-complexity motion-compensated deinterlacer), which are all motion-compensation-based algorithms with different complexity degrees (these latter results are reported as in the corresponding references, NC denoting the non-communicated ones). In the present case, the GBVS model is considered.

To visually illustrate the results of the proposed method, two deinterlaced frames are shown in Fig. 7.

Table 1 Y-PSNR results (in dB)
Fig. 7 Deinterlacing result for (a) the 10th frame of the “Foreman” sequence, (b) the 21st frame of the “Salesman” sequence

As can be seen in the presented results, our proposed method using the GBVS model has an average PSNR gain of \(\approx 4.5\) dB with respect to a wide range of deinterlacers. Our framework has been implemented in Matlab (8.0.0.783 (R2012b)) and the tests have been run on a quad-core Intel PC at 4 GHz. Due to the independent block-based processing, the proposed deinterlacing approach lends itself to a distributed/parallel implementation, which would greatly reduce the computation time of a sequential implementation. Moreover, as the proposed algorithm adapts the motion estimation to the region's saliency, through the threshold \(T_s\) used for motion computation, only \(\approx 1/3\) of the field regions are motion processed (as can be seen in Fig. 3). This parameterization thus drastically decreases the complexity attached to motion-compensated schemes, while preserving their advantages where the viewer's attention is focused.

However, it is known that the GBVS model requires considerable computational resources. Such computational complexity can be a severe issue, particularly for real-time applications. For lower complexity, we propose in what follows to replace the GBVS model by the SRVS one to compute the saliency values. Because optical flow is also complex to implement, only block-based MC is used in the low-complexity approach; the rest of the algorithm is left unchanged. The performance of the low-complexity SGAD algorithm using the SRVS model is evaluated by comparing the PSNR and computation time values with those of the GBVS-based method on the video data set. Table 2 summarizes the results, which include the average PSNR value, the average number of blocks on which the ELA algorithm is used (denoted \(N_e\)), the average number of blocks on which block-based MC is used (denoted \(N_m\)), and the average total computation time, denoted CT (saliency estimation followed by adaptive block processing).

Table 2 Performances of the modified low-complexity SGAD algorithm

First, we note that the average PSNR values are lower than the ones obtained with the GBVS-based method. The quality loss is mostly due to the artifacts introduced by the block-based MC process, as opposed to optical flow, but also to the saliency model performance: the GBVS model yields the best results, unfortunately at the expense of processing time (it takes about 1 min to extract the saliency map). Nevertheless, we verify that the low-complexity algorithm offers performances which are mostly similar to conventional deinterlacing techniques in terms of video quality. Concerning the total processing time, it varies between 4.49 and 20.1 s. This time must be set against that required for the GBVS-based version, which varies approximately from 100 s for QCIF video contents to 350–400 s for CIF ones. Such a time penalty for the initial version of our algorithm is mainly due to the optical flow computation, especially if highly textured salient regions are present in the initial scene. Hence, the modified version of the SGAD algorithm is suitable for real-time deinterlacing while maintaining a satisfactory, though slightly reduced, video quality. By contrast, the GBVS version is better suited for storage applications, for which deinterlacing time is not an issue; it is thus mainly designed for adapting content from interlaced cameras to progressive display devices. To conclude, it should be noted that further gains can be expected for the proposed SGAD method, as the code is not yet optimized.

5 Conclusion

In this paper, a spatial saliency-guided motion-compensated method for video deinterlacing has been proposed. Our approach is an efficient deinterlacing tool, able to adapt the interpolation method depending both on the region of interest and on its texture content. Experiments show that the proposed algorithm generates high-quality results, with a PSNR gain of more than 4.5 dB, on average, compared to other deinterlacing approaches. Furthermore, the proposed method demonstrates the possibility of improving image quality while simultaneously reducing execution time, based on the saliency map. Finally, we have presented two variants: the first one for storage applications (in this case, deinterlacing time is not a critical issue, so it is mainly designed for high-quality conversion from interlaced cameras to progressive display devices), and the other one with lower but still acceptable video quality, which is suitable for real-time deinterlacers.