程序代写代做代考 c++ GPU Bayesian algorithm ()

()

EUROGRAPHICS 2010 / T. Akenine-Möller and M. Zwicker
(Guest Editors)

Volume 29 (2010), Number 2

Shared Sampling for Real-Time Alpha Matting

Eduardo S. L. Gastal1 and Manuel M. Oliveira1,2

1Instituto de Informática, UFRGS 2Camera Culture Group, MIT Media Lab

Abstract
Image matting aims at extracting foreground elements from an image by means of color and opacity (alpha) esti-
mation. While a lot of progress has been made in recent years on improving the accuracy of matting techniques,
one common problem persisted: the low speed of matte computation. We present the first real-time matting tech-
nique for natural images and videos. Our technique is based on the observation that, for small neighborhoods,
pixels tend to share similar attributes. Therefore, independently treating each pixel in the unknown regions of a
trimap results in a lot of redundant work. We show how this computation can be significantly and safely reduced by
means of a careful selection of pairs of background and foreground samples. Our technique achieves speedups of
up to two orders of magnitude compared to previous ones, while producing high-quality alpha mattes. The quality
of our results has been verified through an independent benchmark. The speed of our technique enables, for the
first time, real-time alpha matting of videos, and has the potential to enable a new class of exciting applications.

Categories and Subject Descriptors (according to ACM CCS): I.4.6 [Image Processing and Computer Vision]:
Segmentation—Pixel classification

1. Introduction

Extraction and compositing of foreground objects are funda-
mental image and video editing operations. The process of
digital matting aims at accurately extracting foreground ob-
jects from images and videos. For this, matting techniques
need to estimate foreground (F) and background (B) colors
for all pixels belonging to an image I, along with opacity (α)
values. These values are related by the compositing Equa-
tion 1, where the observed color Cp of pixel p is expressed
as a linear combination of Fp and Bp, with interpolation pa-
rameter αp:

Cp = αp Fp +(1−αp)Bp. (1)

For natural images, F and B are not constrained to a particu-
lar subset of values. Thus, all variables on the right-hand side
of Equation 1 are unknown, making the problem of comput-
ing an alpha matte considerably harder.

Due to the highly ill-posed nature of the matting prob-
lem, most existing approaches require additional constraints
in the form of user input, either as trimaps or scribbles.
This user-supplied information identifies pixels for which
the opacity value αi is known to be 1 or 0, i.e., known fore-
ground and known background pixels, respectively. The re-
maining unconstrained pixels are marked as unknown. The

goal of a digital matting algorithm is then to compute the
values of αp, Fp, and Bp for all pixels labeled as unknown.

Matting techniques can be classified according to the un-
derlying method used for solving the matte [WC07a], which
can be based on sampling, pixel affinities, or a combination
of the two. Most recent matting algorithms [SRR09,LLW08,
RRG08,WC07b] fall in one of the last two categories, where
local affinities are employed in optimization steps for solv-
ing or refining the matte. This usually requires the solution
of large linear systems, whose sizes are directly proportional
to the number of unknown pixels in I and, thus, can be-
come quite big. Furthermore, these optimization procedures
solve for α independently of F and B, thus requiring an ad-
ditional step for reconstructing F and, if necessary, also B.
Consequently, state-of-the-art techniques take from seconds
to minutes to generate alpha mattes for typical images (with
about 1 Megapixels), making the matte creation process a
tedious task. Long offline computations have also prevented
the use of natural scenes in real-time matting applications,
such as live broadcasting.

We present the first real-time matting technique for nat-
ural images and videos. Our approach is based on the key
observation that pixels in a small neighborhood tend to have
highly similar values for (α,F,B) triplets. Thus, a significant

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.
Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and
350 Main Street, Malden, MA 02148, USA.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

Figure 1: Example of alpha matte extraction and compositing using our technique. (Left) Image (800× 563 pixels) from the
training dataset provided by [RRW∗09]. (Center) Alpha matte computed with our technique in 0.043 seconds using a trimap
with 40% of unknown pixels. (Right) Composite of the extracted foreground on a new background.

amount of computation used to obtain the matte for neigh-
boring pixels is in fact redundant and can be safely elimi-
nated. We show how to avoid such unnecessary computation
by carefully distributing the work among neighboring pixels,
which will then share their results. Since the operations per-
formed by the pixels are independent, they can be performed
in parallel on modern GPUs. As a result, our approach can
generate high-quality mattes up to 100 times faster than the
previous techniques. The quality of our results have been
confirmed by the independent image-matting benchmark of
Rhemann et al. [RRW∗09]. According to this benchmark,
our technique ranked second among current state-of-the-art
techniques. Figure 1 shows an example of an alpha matte
extracted with our technique in 0.043 seconds for a chal-
lenging example taken from the training dataset provided by
Rhemann et al. [RRW∗09]. When extended with an addi-
tional optimization step described in Section 4.2, our ap-
proach ranks first in the same benchmark, while still per-
forming at interactive rates.

We also introduce a new objective function for identify-
ing good pairs of background and foreground colors for any
given pixel p (Equation 7). Our new function takes into ac-
count spatial, photometric, and probabilistic information ex-
tracted from the image. Such a function allows our approach
to achieve high-quality results while still operating on a con-
siderably small discrete search space.

Due to its speed, our technique has the potential to enable
new and exciting real-time applications that have not been
previously possible. We illustrate this potential by showing
the first real-time alpha matting demonstration for natural-
scene videos, and by providing real-time feedback to users
during interactive alpha-matting extraction sessions.

2. Related Work

Sampling-based approaches make the assumption that the
true foreground and background colors of an unknown pixel
can be explicitly estimated by analyzing nearby known pix-
els (i.e., pixels in the trimap’s known regions), which are

called background or foreground samples. The first tech-
nique to use sampling for estimating the alpha values of
unknown pixels was proposed by Mishima [Mis94]. Earlier
systems also include the work of Ruzon and Tomasi [RT00],
where alpha values are estimated using simple color distri-
butions. Bayesian matting [CCSS01] models the foreground
and background distributions with spatially-varying sets of
Gaussians and solves the matte using a maximum a posteri-
ori (MAP) estimation.

Affinity-based approaches solve for α independent of the
estimation of foreground and background colors. The Pois-
son matting algorithm [SJTS04] observes that if the fore-
ground and background colors are locally smooth, the gra-
dient of the matte can be estimated from the gradient of the
image. The matte is then found by solving Poisson equa-
tions, with boundary conditions defined by the trimap. The
Random Walks method of Grady et al. [GSAW05] prop-
agates user constraints to the entire image by minimizing
a quadratic cost function. The Closed-form matting tech-
nique [LLW08] solves the matte by minimizing a cost func-
tion derived from careful analysis of the matting problem.

Combined approaches mix an initial sampling step with
α-propagation methods. The recent Robust matting ap-
proach of Wang and Cohen [WC07b] uses an initial sam-
pling step to adapt an energy function that is then minimized
using random walks. The authors discuss a way to calculate
the confidence of the collected samples, and the computa-
tion only uses high confidence samples. The work of Rhe-
mann et al. [RRG08] improves on this idea by proposing
new weights for the confidence metric. Furthermore, they
improve the search for suitable foreground samples by as-
suming that the foreground object is spatially connected.

Some algorithms use additional information to constrain
the matting problem. This extra information usually requires
special conditions, such as capturing multiple images with
different focal depths [MMP∗05] or acquiring depth infor-
mation using a time-of-flight sensor [CTP∗08].

Interactive alpha matting of images is typically performed

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

using a two-step iterative process: first, the user refines the
needed constraints (trimap or scribbles), which will then be
used for matte generation in a subsequent step. This process
is repeated until the user is satisfied with the quality of the
matte. Any delay between successive evaluations of the first
step are enough to make this a time-consuming and tedious
process. The system proposed by Wang et al. [WAC07]
tries to avoid such a delay by noting that the user mod-
ifies the system constraints in a localized fashion. There-
fore, the matte only needs to be (re)computed for a small
portion of the image at a time. However, as noted by Rhe-
mann et al. [RRRAS08], long or complex boundaries are
still monotonous and time-consuming to trace.

For segmentation and matting of videos, Bai and Sapiro
[BS07] use the geodesic distance — based on the short-
est path on a weighted graph — to interactively make soft
segmentation and matting of images and videos. The recent
work by Bai et al. [BWSS09] uses local classifiers to prop-
agate a user defined segmentation across time. The authors
extend the work in [LLW08] by adding a temporal coher-
ence term to the cost function for generating mattes for of-
fline video sequences. None of these techniques, however,
are suitable for real-time alpha-matting of videos.

3. Real-Time Alpha Matting

Our technique for real-time alpha matting is based on the
fundamental observation that pixels in a small neighbor-
hood often have highly similar values for their true (α,F,B)
triplets. From this, it follows that: (i) the initial collection
of samples gathered by nearby pixels differ only by a small
number of elements; and (ii) close-by pixels usually select
the same or very similar pairs for their best foreground and
background colors. This leads to the conclusion that per-
forming the alpha matte computation for each pixel indepen-
dently of its neighbors results in a large amount of redundant
work, which can be safely avoided without compromising
the quality of the matte. In fact, as demonstrated in Section 4,
it is possible to achieve speedups of up to two orders of mag-
nitude while still obtaining high-quality results.

Our technique takes as input an image I (or video se-
quence) and its corresponding trimap(s), and consists of the
following steps:

1. Expansion of Known Regions: extrapolates “known
foreground” and “known background” regions of the
trimap into the “unknown” regions;

2. Sample Selection and Matte Computation: tries to
identify, for each pixel in the unknown regions, the best
pair of foreground and background samples. It also com-
putes an alpha matte from the obtained samples;

3. Local Smoothing: locally smooths the resulting matte
while maintaining its distinct features.

3.1. Expansion of Known Regions

A trimap T segments an input image (or video frame) into
three non-overlapping pixel regions: known foreground (Tf ),
known background (Tb) and unknown (Tu). The idea behind
expanding known regions is to exploit the affinity of neigh-
boring pixels to reduce the size of the unknown regions.
Thus, let Dimage(p,q) and Dcolor(p,q) be, respectively, the
image-space and color-space distances between two pixels p
and q. The expansion process consists of checking for each
pixel p ∈ Tu if there exists a pixel q ∈ Tr (r = { f ,b}) such
that Dimage(p,q) ≤ ki, Dcolor(p,q) ≤ kc, and Dimage(p,q) is
minimal for p. In such a case, pixel p is labeled as belonging
to region Tr based on its affinity to pixel q ∈ Tr. The value of
the parameter ki depends on the unknown region size; thus,
larger images might require larger values of ki. We found
that ki = 10 pixels and kc = 5/256 units (measured as Eu-
clidean distance in the RGB unit cube) produce good results
for typical images. These values were used for all examples
shown in the paper and supplemental materials.

3.2. Sample Selection and Matte Computation

For each remaining pixel p ∈ Tu, our goal is to find an
(α,F,B) triplet that better models p. For this, a sampling
strategy inspired by the work of Wang and Cohen [WC07b]
is used, but it differs from theirs in some fundamental as-
pects. For any given pixel p ∈ Tu, Wang and Cohen’s idea
is to collect a large number of foreground and background
samples in a neighborhood around p. Such samples are con-
sidered candidates for estimating the alpha value of p. Their
assumption is that the true foreground and background col-
ors should be close to the ones of some of the collected
samples. This is a reasonable assumption when the initial
sample set is large. For instance, in their approach, Wang
and Cohen analyze 400 pairs of foreground and background
colors for each pixel p [WC07b]. Unfortunately, the use of
larger sample sets requires a significant amount of compu-
tation. The next sections show that this computational cost
can be significantly reduced by exploiting affinities among
neighboring pixels. Furthermore, a new and improved met-
ric for electing the best samples is presented, which takes
into account image parameters that were not considered in
the sample-selection process described in [WC07b].

We minimize the amount of work involved in finding the
best pairs of foreground and background samples for a set of
neighbor pixels by leveraging their high affinity. For this, we
first divide this task among the various pixels in the neigh-
borhood (sample gathering), which then share and refine
their results (sample refinement):

Sample Gathering: in this stage, each pixel p ∈ Tu selects
the best (F,B) pair from a small set of samples in its neigh-
borhood. Those sets are gathered in a manner that guaran-
tees that sample sets from neighboring pixels are disjoint. p
gathers at most kg background and kg foreground samples

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

Figure 2: The red arrows starting at p define the paths
for searching for background and foreground samples for p.
The selected (closest) samples are marked in orange. Pixel
q explores a different path (in blue). Foreground samples are
shown as squares, and background samples as circles.

(Figure 2), resulting in at most k2g tested pairs, from which
the best candidate is selected.

Sample Refinement: in this stage, each pixel p ∈ Tu ana-
lyzes the choices (without recombining them) of its (at most)
kr spatially closest pixels in Tu. Thus, while in practice p
performs (at most) k2g + kr pair evaluations, due to the affin-
ity among neighbor pixels, this is roughly equivalent to per-
forming (at most) k2g×kr pair comparisons. According to our
experience, values of kg = 4 and kr = 200 produce very good
results. For these values, the actual number of performed pair
comparisons is 216 (i.e., 16 + 200), while its net effect ap-
proximates a total of 3,200 (i.e., 16 × 200) comparisons.
Sections 3.2.1 and 3.2.2 present the details of the sample
gathering and sample refinement sub-steps.

3.2.1. Sample Gathering

In this stage, each pixel p ∈ Tu looks for possible foreground
and background samples along kg line segments starting at
p (Figure 2). These segments divide the plane of the image
into kg disjoint sectors containing equal planar angles. The
slope of the first line segment associated to p is defined by
an initial orientation θ ∈

[

0, π2
]

measured with respect to the
horizontal line. Such an angle takes a different value for each
pixel q ∈ Tu in a 3×3 window (Figure 2). The orientation of
the other segments is given by an angular increment θinc =

kg

. Starting from p and following a particular line segment
yields at most one background and at most one foreground
sample — the ones closer to p along the segment. Thus, p
must find its best pair among, at most, k2g sample pairs.

We introduce a new objective function that combines pho-
tometric, spatial, and probabilistic elements to select the best
sample pair for a pixel from the initial set of k2g pairs. The
proposed approach is the first to comprehensively consider
all these aspects. Thus, let fi and b j be a pair of foreground
and background samples, whose colors are F i and B j, re-
spectively. Next, we will derive an objective function (Equa-
tion 7) for identifying the best sample-pair for each pixel
p ∈ Tu. Before describing this function, its required build-
ing blocks (minimization of chromatic distortion and image
space statistics) will be presented.

Minimization of chromatic distortion: Similar to what has
been described by Wang and Cohen [WC07b], our approach
favors the selection of pairs of foreground and background
colors that can model the color of pixel p as a linear com-
bination of themselves. This is modeled by the chromatic
distortion Mp, whose value should be small for a good pair
of candidate colors:

Mp(F
i
,B j) = ‖Cp − (α̂p F

i
+(1− α̂p)B

j
)‖ (2)

where Cp is the color of p, and α̂p is the estimated al-
pha value for p, obtained using the color space projection
of Cp onto the line defined by F

i and B j . Wang and Co-
hen [WC07b] further divide Equation 2 by ‖F i − B j‖ —
which we do not do — to enforce a wide separation of the
foreground from the background colors. While their obser-
vation regarding linear interpolation is intuitively sound ac-
cording to Equation 1, this second criterion is oversimplify-
ing and does not address the fundamental issues involved
in the computation of a matte. As such, it does not rep-
resent a good measure for comparison of candidate pairs.
Thus, although a small Mp(F

i,B j) is necessary for accu-
rately representing the alpha value of p, it is not a suffi-
cient condition to elect a good sample pair. For this reason,
we propose a new color metric derived from two previously
made observations: (i) Omer and Werman [OW04] showed
that small pixel neighborhoods tend to form locally linear
clusters in color space. This is particularly true for small
windows located over image edges; and (ii) Mitsunaga et
al. [MYT95] showed that if the foreground and background
gradient norms ‖∇F‖ and ‖∇B‖ are relatively small com-
pared to ‖∇α‖, then the image gradient ∇I is directly pro-
portional to ∇α.

Based on these observations, one concludes that in the un-
known region of the trimap — where ‖∇α‖ is potentially
very large — the locally-linear color variations observed by
Omer and Werman [OW04] are primarily caused by varia-
tions in α. Thus, all colors from pixels in a small local win-
dow are situated approximately along the line in color space
spanned by the true foreground and background colors F and
B. This means that a good sample pair should minimize the
chromatic distortion not only for p, but for all pixels in a
small window around p. Thus, a good sample pair should
minimize the least squares residual defined by the neighbor-
hood affinity term:

Np(fi,b j) = ∑
q∈Ωp

Mq(F
i
,B j)2 (3)

where Ωp is the pixel neighborhood of p, consisting of all
pixels in a 3 × 3 window centered at p, and Mq is the oper-
ator defined in Equation 2, evaluated at pixel q.

Image space statistics: In addition to color space informa-
tion, image space statistics should play a key role in iden-
tifying good pair of samples. These image parameters were
not considered in the sample selection process of [WC07b],
where only color space metrics were used. We will now

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

use such information to estimate the probability of pixel p
belonging to the foreground, which will be later used in
Equation 6 to enforce a meaningful alpha value. Thus, let
Dp(s) = Dimage(s, p) = ‖s− p‖ be the image space distance
from a sample s to the current pixel p. We define the energy
Ep(s) to reach a foreground or background sample s from
the current pixel p as the squared path integral of ∇I along
the image space line segment L connecting p and s:

Ep(s) =

L
‖∇I ·dr‖2 =

∫ s

p

∇I ·

(

s− p
‖s− p‖

)

2

. (4)

The energy Ep(s) is directly proportional to the projected
length of ∇I onto the normalized direction of integration
(s− p). Thus, if the linear path from s to p crosses image
regions where ‖∇I‖ is large (e.g., at image edges), greater
energy will be required to reach s.

An estimate of the probability of p belonging to the fore-
ground, according to the energy Ep(s), can be obtained as:

PFp =
min j(Ep(b j))

mini(Ep(fi))+min j(Ep(b j))
. (5)

Thus, if the minimum energy required to reach a foreground
sample is much lower than the minimum energy required to
reach a background sample, PFp will be close to one — i.e.,
pixel p has a high probability of belonging to the foreground.

Intuitively, we want the alpha value α̂p (computed from fi
and b j — Equation 2) to correlate with the probability PFp
of pixel p belonging to the foreground. Thus, a good sample
pair should also minimize the function Ap:

Ap(fi,b j) = PFp +(1−2 PFp) α̂p. (6)

Indeed, for a given pair of samples (fi,b j), when PFp = 0,
Ap(fi,b j) = α̂p. Thus, minimizing Ap(fi,b j) also minimizes
the value of α̂p. Likewise, when PFp = 1, Ap(fi,b j) =
(1− α̂p), so minimizing Ap(fi,b j) maximizes α̂p. Finally,
if PFp = 0.5, Ap(fi,b j) = 0.5, and the value of α̂p has no
effect on the minimization. Ap(fi,b j) will be used as one of
the terms in the final objective function (Equation 7).

The Objective function: The resulting objective function
that combines photometric and spatial affinity, as well as
probabilistic information for selecting good pairs of back-
ground and foreground samples can be expressed as:

gp(fi,b j) = Np(fi,b j)
eN Ap(fi,b j)

eA Dp(fi)
e f Dp(bi)

eb . (7)

Here, Np(fi,b j) minimizes chromatic distortion in the 3×3
neighborhood around p. Ap(fi,b j) enforces that the com-
puted alpha matte values correlate with the probability of
pixel p belonging to the foreground. Dp(fi) and Dp(bi) en-
force the spatial affinity criterion: the background and fore-
ground samples should be as close as possible to p. The con-
stants e{N,A, f ,b} define the penalties for having a high value
in any of these metrics. In practice, we found that values of
eN = 3, eA = 2, e f = 1 and eb = 4 tend to produce good re-
sults. Thus, the best pair of foreground and background sam-
ples (f̂p, b̂p) for pixel p is obtained by evaluating gp(fi,b j)

for all possible sample-pairs:

(f̂p, b̂p) = argmin f,b gp(fi,b j). (8)

Let (F
g
p ,B

g
p) be the corresponding colors of the best pair

(f̂p, b̂p) for p, obtained in the gathering stage. We then com-
pute σ2f and σ

2
b as:

σ2f =
1
N ∑q∈Ω f

∥Cq −F
g
p

2
,

σ2b =
1
N ∑q∈Ωb

∥Cq −B
g
p

2 (9)

where Ω f and Ωb are 5×5 pixel neighborhoods centered at
f̂p and b̂p, respectively, and N = 25. The values σ2f and σ

2
b

measure the local color variations in the neighborhoods Ω f
and Ωb assuming small univariate Gaussian color distribu-
tions around f̂p and b̂p. We will use σ2f and σ

2
b in the sample-

refinement step. Hence, the output of the sample gathering
stage is a tuple τgp = (F

g
p , B

g
p, σ2f , σ

2
b) for each pixel p ∈ Tu.

3.2.2. Sample Refinement

For small values of kg, the number of samples analyzed by
pixel p ∈ Tu during the sample gathering stage is often not
enough to reliably estimate either an alpha value or the true
foreground and background colors. Thus, a more extensive
search is performed by sharing the best results obtained by
all pixels in a neighborhood around p in Tu.

In the sample-refinement stage, each pixel p compares
its own choice of best sample-pair with the choices of its
(at most) kr spatially closest pixels q ∈ Tu. Among those,
the three tuples with the lowest values of Mp(F

g
q ,B

g
q) are

then averaged to create a new tuple τ̃gp = (F̃
g
p , B̃

g
p, σ̃2f , σ̃

2
b)

for p. The purpose of this averaging is to reduce the occur-
rence of noise in the resulting alpha matte. This procedure
is supported by the observation that neighbor pixels tend
to have similar values of alpha, as well as background and
foreground colors (i.e., neighbor pixels tend to present high
affinity). Therefore, by averaging the best few values in a
given neighborhood, the occurrence of noise is reduced.

The output of the sample-refinement stage for pixel p∈ Tu
is another tuple τrp = (F

r
p , B

r
p, α

r
p, f

r
p), where:

Frp =

{

Cp if

∥Cp − F̃
g
p

2
≤ σ̃2f


g
p otherwise

, (10)

Brp =

{

Cp if

∥Cp − B̃
g
p

2
≤ σ̃2b


g
p otherwise

, (11)

αrp =
(Cp −B

r
p) · (F

r
p −B

r
p)

∥Frp −B
r
p

2
, (12)

f rp =

{

exp
{

−λ Mp(F̃
g
p , B̃

g
p)

}

if Frp 6= B
r
p

ε if Frp = B
r
p

. (13)

Here, the superscript r represents quantities computed in the
sample-refinement stage. The intuition behind the computa-
tion of Frp is that if the color Cp of pixel p is sufficiently close

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

to the average color F̃
g
p of the best three foreground sam-

ples computed during the gathering stage, then Frp should be
taken as Cp, thus keeping the original color. The case for
Brp is similar. The alpha value α

r
p is computed as the rela-

tive length of the projection of vector (Cp −B
r
p) onto vector

(Frp −B
r
p), defined by the computed foreground and back-

ground colors. Thus, αrp represents the opacity of p. Finally,
f rp expresses the confidence of p in its candidate foreground
and background colors Frp and B

r
p. This confidence mea-

sure should decrease fast (but not too fast) as the foreground
and background colors F̃

g
p and B̃

g
p fail to properly model the

color Cp of p. According to our experience, a value of λ = 10
produces good results. Additionally, if Cp is close to both F̃

g
p

and B̃
g
p, the α value of p cannot be accurately estimated and

the confidence f rp is set to a small value ε = 10
−8.

For completeness, output tuples for pixels outside the un-
known region are also defined; thus, for pixels v ∈ Tf , we
have τrv = (F

r
v = Cv, B

r
v = Cv, α

r
v = 1, f

r
v = 1); and for pix-

els v∈ Tb, we have τ
r
v = (F

r
v =Cv, B

r
v =Cv, α

r
v = 0, f

r
v = 1).

3.3. Local Smoothing

Although the sample-selection process takes into account
affinities among localized groups of pixels, this is not
enough to prevent discontinuities in the resulting matte.
Thus, an additional step is used to ensure the local smooth-
ness of the final alpha values, while maintaining its dis-
tinct features. This is achieved by computing, for each pixel
p ∈ Tu, a weighted average of the tuples τrq of the closest m
neighbors of p in image space (we use a value of m = 100).
Such neighbors can come from either Tu, Tf , or Tb. Let Ψp
be such a neighborhood for pixel p. The weights are defined
in such way that details in the matte are preserved.

The final foreground and background colors Fp and Bp of
p are computed as:

Wc(p,q) =

{

G
(

Dimage(p,q)
)

f rq

∣αrp −α
r
q

∣ if p 6= q

G
(

Dimage(p,q)
)

f rq if p = q
,

Fp =
∑q∈Ψp

[

Wc(p,q) αrq F
r
q

]

∑q∈Ψp
[

Wc(p,q) αrq
] , (14)

Bp =
∑q∈Ψp

[

Wc(p,q) (1−αrq) B
r
q

]

∑q∈Ψp
[

Wc(p,q) (1−αrq)
] (15)

where G is a normalized Gaussian function with variance
σ2 = m/9π pixels. The set Ψp of the closest m pixels to p
approximates an image space circle with area m = πr2. The
farthest pixels in Ψp have a distance of r to p and should
have weights close to zero in the Gaussian (i.e., r = 3σ).
Thus, m = πr2 = π(3σ)2, which solves to σ2 = m/9π.

The weight Wc(p,q) blends the foreground (background)
colors of pixels p and q taking into account: (i) Spatial affin-
ity – the colors of two pixels that are far apart in image

space should not be averaged. This is modeled by the Gaus-
sian function; (ii) Confidence values – pixels with low con-
fidence in their foreground and background samples should
not propagate their uncertainty. (iii) Difference in alpha val-
ues – by minimizing ‖∇F‖ and ‖∇B‖ where the estimated
‖∇α̂‖ is large, we make the final ∇α ≈ ∇I; Furthermore,
whenever Cq = B

r
q, α

r
q = 0 regardless the value of F

r
q ; con-

versely, whenever Cq = F
r
q , α

r
q = 1 regardless the value of B

r
q

(Equation 12). Thus, we multiply Frq by α
r
q in Equation 14 to

denote that the confidence of pixel q in its foreground color
Frq is directly proportional to α

r
q. Similarly, we multiply B

r
q

by (1−αrq) in Equation 15 to denote that the confidence of
pixel q in its background color Brq is inversely proportional
to αrq.

We can now compute the confidence fp of pixel p in
its final foreground and background colors Fp and Bp. To
do so, we first define in Equation 17 the mean foreground-
background distance DFB (in color space) for the neighbor-
hood Ψp. This mean is weighted by WFB(q) (Equation 16)
which is directly proportional to the confidence f rq of q and
is maximized for values of αrq = 0.5 — where the confidence
of q in both Frq and B

r
q is potentially maximal — while being

zero for αq = {0,1}. DFB will be used next to compute the
final confidence fp.

WFB(q) = f
r
q α

r
q

(

1−αrq
)

, (16)

DFB(Ψp) =
∑q∈Ψp

[

WFB(q)

∥Frq −B
r
q

]

∑q∈Ψp WFB(q)
. (17)

The final confidence fp of pixel p in its final foreground
and background colors Fp and Bp is modeled by Equa-
tion 18. Here, the first term expresses the ratio of the distance
‖Fp −Bp‖ to the mean foreground-background distance in
the neighborhood Ψp (clamped to the range [0,1]). This ra-
tio tries to detect pixels whose final foreground and back-
ground colors deviate from those in the neighborhood Ψp.
The second term is analogous to Equation 13 (λ = 10).

fp = min

(

1,
‖Fp −Bp‖
DFB(Ψp)

)

exp

{

−λ Mp(Fp,Bp)
}

. (18)

Having Fp, Bp and fp, we can now compute the final al-
pha value αp of pixel p. In order to do so, we first define the
low frequency alpha αlp (Equation 20) as the weighted av-
erage of alpha values in the neighborhood Ψp. The weights
Wα(p,q) are proportional to the confidence f

r
q of q and in-

versely proportional to the image space distance of p and q.
Additionally, greater weights are given for pixels lying in Tf
or Tb (i.e., known pixels).

Wα(p,q) = f
r
q G

(

Dimage(p,q)
)

+δ(q /∈ Tu), (19)

αlp =
∑q∈Ψp

[

Wα(p,q) αrq
]

∑q∈Ψp Wα(p,q)
(20)

where δ is a boolean function returning 1 when q /∈ Tu; or 0

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

otherwise. G is the Gaussian function from Wc (Equations 14
and 15).

The final alpha value αp for p is given by Equation 21. This
equation blends the alpha value computed using Fp and Bp
with the low frequency alpha αlp, using as blending factor the
final confidence fp. Thus, pixels with a low final confidence
will accept alpha values from higher-confidence neighbors
(modeled by αlp) to preserve local smoothness.

αp = fp
(Cp −Bp) · (Fp −Bp)

‖Fp −Bp‖
2

+(1− fp) α
l
p. (21)

Finally, the output of the proposed algorithm for the
matting parameters of pixel p ∈ Tu is given by the tuple
(Fp, Bp, αp) with an associated confidence value of fp. For
completeness, the matting parameters for pixels outside the
unknown region are also defined. Thus, for pixels q ∈ Tf
we have the tuple (Fq = Cq,Bq = Cq,αq = 1) with con-
fidence fq = 1, and for pixels w ∈ Tb we have the tuple
(Fw = Cw,Bw = Cw,αw = 0) with confidence fw = 1.

4. Results

We have implemented the technique described in the paper
using C++ and GLSL and used it to process a large num-
ber of images, and videos. Given that the search space as-
sociated with Equation 7 is both small and discrete, its min-
ima is computed by evaluating this equation for its entire
search space and by selecting the sample-pair with the small-
est value. Since these operations can be performed indepen-
dently for each pixel p ∈ Tu, we exploit the inherent paral-
lelism of current GPUs to efficiently perform these searches
in parallel. All the results reported in this paper were ob-
tained on a 2.8 GHz Quad Core PC with 8 GB of memory
and a GeForce GTX 280 with 1 GB of memory.

In order to assess the quality of our results, we used the
benchmark provided by Rhemann et al. [RRW∗09]. It eval-
uates and compares the accuracy of an image-matting tech-
nique against the results produced by the state-of-the-art.
Such a benchmark is composed of eight test images publicly
available at www.al phamatting.com, each accompanied by
three trimaps (small, large and user). These images are de-
signed to be challenging representations of natural scenes,
containing examples of highly textured backgrounds, as well
as images where background and foreground colors cannot
be easily differentiated. The ground-truth alpha mattes for
each of the test images are used to assess the quality of the
results, but are not disclosed to the public. As such, Rhe-
mann et al.’s benchmark provides an independent and reli-
able mechanism for evaluating the quality of digital image-
matting algorithms. Tables 2 (a) and (b) show results from
Rhemann et al.’s benchmark and summarize the accuracy
of our technique (Shared Matting – Real-Time) in compari-
son to others. Our approach ranks second according to the
sum of absolute differences (SAD) and mean squared error

Image # of pixels % Unknown Time (sec)

elephant 536,800 16% 0.029
donkey 455,200 17% 0.028

pineapple 481,600 20% 0.032
doll 451,200 25% 0.034

plasticbag 529,600 28% 0.040
plant 425,600 35% 0.038
troll 512,000 40% 0.045
net 496,000 51% 0.056

Table 1: Time to generate alpha mattes with our technique
for the images in the benchmark (Figure 2), using the most
conservative trimaps (i.e., large).

(MSE) metrics, when compared to previous techniques. It is,
however, up to 100 times faster, allowing, for the first time,
alpha matting in real-time applications. Such results indicate
that our technique produce high-quality alpha mattes even
for challenging images.

Since previous technique are not suitable for real-time ap-
plications, performance comparisons considering the time
required to compute the alpha matte have been overlook
in many previous publications. Our technique, on the other
hand, can compute alpha mattes for typical images in real-
time. Table 1 summarizes the times required to extract the
alpha mattes for the test dataset available from Rhemann
et al.’s benchmark [RRW∗09] using the most conservative
trimaps (i.e., large). For each image, we provide its dimen-
sions, the number of pixels in the unknown region of the
trimap, and the time required to compute the matte.

Figure 3 shows the alpha mattes generated by our tech-
nique for some images from the training dataset provided
by [RRW∗09] (using the small trimaps). For such dataset,
the ground-truth mattes are available and are shown next
to our results for comparison. Versions of these images in
their original resolutions are provided for closer inspections
as supplementary material. For this dataset (average image
size of 0.5Mpix, with 16% unknown pixels), our technique
takes on average 0.024 seconds for the matte computation.
The expansion-of-known-regions step (Section 3.1) reduced
the number of pixels in Tu by 38% on average, but 4% of the
pixels were erroneously considered to be in Tf or Tb.

In the gathering stage (Section 3.2.1), foreground and
background samples are found using a linear search (Fig-
ure 2) with a step size of 6 pixels and a maximum of 300
steps. Fine details in the trimap might be “jumped over” by
some unknown pixels; however, this is not a problem due to
the sample-pair sharing in the sample-refinement stage.

Cost of the Algorithm: Let z be the number of pixels in
Tu. Per pixel computation can be defined as follows: (i) the
“expansion of known regions” step runs in constant time
O(πk2i ); (ii) The “sample gathering” step has a worst cost
of O(z + k2g), and average cost of O(ω + k

2
g), where ω is the

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

(a)

(b)

Table 2: Results of Rhemann et al.’s benchmark. Our technique, Shared Matting (Real-Time), ranks second when compared
against previous techniques. However, it is up to 100 times faster. A version of our technique extended with an optimization step
(Shared Matting) ranks first when compared against the same set of techniques. The overal rankings were computed from the
SAD and MSE error values supplied by Rhemann et al.’s benchmark. Only the top five techniques are shown.

Figure 3: Left: Images from the training dataset provided
by [RRW∗09]. Center: Alpha mattes extracted with our
technique using the trimaps supplied in the dataset. Right:
Ground truth alpha mattes provided for comparison.

average number of pixels inspected before finding candidate
samples; (iii) the “sample refinement” step runs in constant
time O(kr); and (iv) the “local smoothing” step runs in con-
stant time O(4m). All this computation can be done inde-
pendently for each pixel p ∈ Tu. Thus, let t be the number of
pixels which are processed in parallel on a GPU. The aver-
age cost is O(z/t). Worst cost is O(z2/t).

4.1. Applications

Video Matting Our method enables the use of alpha mat-

ting in real-time applications for the first time. One possi-
ble such application is real-time matte generation for videos.
Since our approach uses a trimap as input, it needs to rely
on other techniques for providing trimaps for each frame of
the video in real-time. The supplementary video illustrates
such an application for two video sequences. In these ex-
amples, the trimaps were created by dilating the boundaries
of the binary segmented video frames. Such segmentation
was obtained using a real-time foreground-background bi-
nary segmentation technique which models the background
color distribution from color priors using the kernel den-
sity estimation method (as such, this technique is limited to
videos where the foreground and background color distribu-
tions do not overlap). Thus, given an input video sequence,
the results shown in the supplementary video were entirely
computed in real-time. This means that the whole sequence
of operations comprising binary segmentation, boundary di-
lation, and matte extraction was performed in real time. Fig-
ure 4 shows some frames from two of these video sequences
with their corresponding extracted mattes.

Interactive Alpha Matting Another benefit of the im-
proved performance of our technique is its ability to provide
real-time feedback to users during interactive alpha-matting
sessions. We demonstrate this feature using a simple trimap
creation interface. Initially, all pixels in the image are labeled
as belonging to the unknown region Tu. As one uses small
scribbles over the image, a trimap is automatically computed
and refined, also providing instant feedback on the resulting
matte extraction. The scribbles are propagated using an iter-
ative flood-filling procedure, limited by a simple edge detec-
tor. Figure 5 illustrates the concept using one of the images
of the training dataset. On the left, one sees the scribbles su-
perimposed onto the original image. The blue color is a label
for foreground pixels, while red and yellow represent back-

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

Figure 4: Some frames extracted from video sequences pro-
cessed by our technique for real-time matte extraction. The
images on top show the original frames, while the extracted
mattes are shown at the bottom.

Figure 5: Use of our technique to interactively segment im-
ages by alpha matting. As the user scribbles over the im-
age (left), a trimap is automatically updated. The resulting
trimap and alpha matte computed from the set of scribbles
on the left are shown on the center and right images.

ground and unknown pixels, respectively. The image on the
center shows the computed trimap, for which a gray shade
indicates uncertainty about whether a pixel belongs to the
background (shown in black) or to the foreground (shown in
white). The extracted alpha matte is shown on the right.

The speed of our method makes the matte creation process
easier for the user, as there is no delay between input and
matte refinement. This considerably reduces the time taken
to interactively segment images by alpha matting. The sup-
plementary video shows such an application in action. The
resulting mattes are generated considerably faster for images
with complex edges and topologies. Only in the worst case
one needs to completely trace the border of the foreground
object, which is always needed in the technique described
in [WAC07].

High-resolution Matting Due to its improved speed, our
method can generate alpha mattes for high-resolution images
much faster than existing techniques. We ran our algorithm
on the high-resolution training dataset from [RRW∗09],
where images are, on average, 6.7 Mpix with 16% unknown
pixels. For these images, matte computation takes a mean
time of 0.3 seconds, with a longest time of 1.1s and short-
est time of 0.17s. We also tested our method on extremely
high resolution images. For a 37 Mpix image, for which a
trimap was created manually containing 11% unknown pix-
els, matte computation took 3.65 seconds (with ki = 30).

4.2. Matte Optimization

Some users might want to obtain a matte with the best pos-
sible quality, even at the expense of extra computation time.
This can be achieved by refining the matte obtained in Sec-
tion 3 using an additional optimization step. This step is
analogous to the one in [RRG08], where the final matte
is obtained by minimizing a quadratic cost function in α.
This cost function is comprised of a smoothness term and a
data term. Here, we use the matting Laplacian of [LLW08]
as smoothness term, but for the data term we use the mat-
ting parameters (αp, fp) obtained in Section 3. Thus, let
α̂T = [α1, . . . αp, . . . αn] be a vector with the alpha val-
ues, obtained in Section 3, for all n pixels in the input image.
Let Γ̂ be a diagonal matrix where each diagonal element γp
is defined as γp = fp if p ∈ Tu or γp = 0 if p ∈ Tf ∪Tb. The
final alpha matte is obtained by solving for:

α = argmin αT Lα + λ (α− α̂)T D(α− α̂)
+ γ (α− α̂)T Γ̂(α− α̂)

(22)

where λ is some relatively large number when compared
to to the values in α̂ and Γ̂. γ = 10−1 is a constant which
defines the relative importance of the data and smoothness
terms, L is the matting Laplacian and D is a diagonal matrix
whose diagonal elements are one for pixels in Tf ∪ Tb and
zero for all other pixels. Equation 22 yields a sparse linear
system which we solve using the conjugate gradient method
implemented on the GPU. For an image with 512,000 pixels
and 25% unknown pixels, solving for α takes approximately
100ms. Thus, the system is no longer real-time but still inter-
active. For videos, the matting Laplacian L must be recom-
puted for every frame. However, such computation for frame
k can be offloaded to the CPU and performed in parallel with
the GPU optimization of frame (k−1).

This optimization step induces an α-propagation from
high confidence to low confidence pixels. Thus, while pixels
with a high final confidence try to adhere to the alpha values
computed in Section 3, pixels with a low final confidence
will rely more on propagation for their final alpha values.

Tables 2 (a) and (b) show Rhemann et al.’s benchmark
results for the version of our technique containing the ad-
ditional optimization step (Shared Matting). This extended
version ranks first both in the sum of absolute differences
and mean squared error metrics.

5. Limitations

The proposed technique makes the assumption that the true
foreground and background colors of unknown pixels can
be explicitly estimated by analyzing nearby known pixels.
For images where this assumption does not hold, computed
alpha values will not be accurate (this will be apparent in
the final, low, confidence values). Examples of such prob-
lematic images are those containing completely transparent
foreground objects or those where the foreground and back-
ground color distributions strongly overlap.

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

E. S. L. Gastal & M. M. Oliveira / Shared Sampling for Real-Time Alpha Matting

For videos with highly textured backgrounds or greatly
transparent foreground pixels, the produced matte might suf-
fer from temporal noise, i.e. flickering. One can consider
many ways for improving the temporal coherence of the
matte, such as temporal blending of alpha values based on
confidence estimates, or even selecting candidate samples
along the time axis, in addition to the image space.

6. Conclusions and Future Work

We have presented the first real-time matting technique for
natural images and videos. Our technique is based on the ob-
servation that pixels in a small neighborhood in image space
tend to have highly similar values for (α,F,B) triplets. As
such, independently computing such triplets for each pixel
in the unknown region in a conventional way tends to result
in a lot of redundant work. The amount of required computa-
tion can then be significantly and safely reduced by a careful
selection of pairs of candidate background and foreground
samples. The work required to perform this task can be dis-
tributed among neighbor pixels, leading to considerable sav-
ings in computation cost. Moreover, the required operations
can be performed in parallel on programmable GPUs.

We have also presented a new objective function for iden-
tifying good pairs of background and foreground samples.
Such a function takes into account spatial and photomet-
ric, as well as some probabilistic information extracted from
the image. This improved objective function allows our ap-
proach to achieve high-quality results while still operating
on a considerably small discrete search space. Our approach
can achieve speedups of up to two orders of magnitude
compared to previous approaches, while producing highly-
accurate alpha mattes. We assessed the quality of our results
by performing the independently-developed benchmark by
Rhemann et al. [RRW∗09]. In such a benchmark, our real-
time technique ranked second with respect to both the sum of
absolute differences and to the mean squared error metrics.
A version of our technique that uses an additional optimiza-
tion step to improve the matte quality ranked first according
to both metrics, while still performing at interactive rates.

We have demonstrated that our technique can provide
instant feedback to support interactive extraction of high-
quality alpha mattes. It is also fast enough to, for the first
time, support alpha-matte computation for natural videos
in real-time, given that the corresponding trimaps are pro-
vided. This opens up exciting opportunities for new real-
time applications and for improved real-time trimap gener-
ation for videos. We are currently exploring some of these
directions. Finally, the problem of handling temporal coher-
ence for real-time matte remains to be explored.

7. Acknowledgements

We would like to thank the anonymous reviewers for
their insightful comments. Manuel M. Oliveira acknowl-

edges the following CNPq-Brazil fellowships and grant:
200284/2009-6, 305613/2007-3, and 476954/2008-8. Ed-
uardo S. L. Gastal acknowledges his PIBIC/CNPq/UFRGS
fellowship.

References

[BS07] BAI X., SAPIRO G.: A geodesic framework for fast in-
teractive image and video segmentation and matting. ICCV 1
(2007), 1–8. 3

[BWSS09] BAI X., WANG J., SIMONS D., SAPIRO G.: Video
snapcut: robust video object cutout using localized classifiers.
ACM Trans. Graph. 28, 3 (2009), 1–11. 3

[CCSS01] CHUANG Y.-Y., CURLESS B., SALESIN D. H.,
SZELISKI R.: A bayesian approach to digital matting. In CVPR
(2001), vol. 2, pp. 264–271. 2

[CTP∗08] CRABB R., TRACEY C., PURANIK A., DAVIS J.,
SANTA CRUZ C., CANESTA I., SUNNYVALE C.: Real-time
foreground segmentation via range and color imaging. In CVPR
Workshop on Time-of-flight Computer Visionn (2008), pp. 1–5. 2

[GSAW05] GRADY L., SCHIWIETZ T., AHARON S., WESTER-
MANN R.: Random walks for interactive alpha-matting. In VIIP
(2005), pp. 423–429. 2

[LLW08] LEVIN A., LISCHINSKI D., WEISS Y.: A closed-form
solution to natural image matting. TPAMI 30, 2 (2008), 228–242.
1, 2, 3, 9

[Mis94] MISHIMA Y.: Soft edge chroma-key generation based
upon hexoctahedral color space, Oct. 11 1994. US Patent
5,355,174. 2

[MMP∗05] MCGUIRE M., MATUSIK W., PFISTER H., HUGHES
J. F., DURAND F.: Defocus video matting. ACM Trans. Graph.
24, 3 (2005), 567–576. 2

[MYT95] MITSUNAGA T., YOKOYAMA T., TOTSUKA T.: Au-
tokey: Human assisted key extraction. In SIGGRAPH (1995),
pp. 265–272. 4

[OW04] OMER I., WERMAN M.: Color lines: Image specific
color representation. In CVPR (2004), vol. 2, pp. 946–953. 4

[RRG08] RHEMANN C., ROTHER C., GELAUTZ M.: Improving
color modeling for alpha matting. In BMVC (2008), pp. 1155–
1164. 1, 2, 9

[RRRAS08] RHEMANN C., ROTHER C., RAV-ACHA A., SHARP
T.: High resolution matting via interactive trimap segmentation.
CVPR 1 (2008), 1–8. 3

[RRW∗09] RHEMANN C., ROTHER C., WANG J., GELAUTZ
M., KOHLI P., ROTT P.: A perceptually motivated online bench-
mark for image matting. In CVPR (2009). 2, 7, 8, 9, 10

[RT00] RUZON M., TOMASI C.: Alpha estimation in natural im-
ages. In CVPR (2000), vol. 1, pp. 18–25. 2

[SJTS04] SUN J., JIA J., TANG C.-K., SHUM H.-Y.: Poisson
matting. ACM Trans. Graph. 23, 3 (2004), 315–321. 2

[SRR09] SINGARAJU D., ROTHER C., RHEMANN C.: New ap-
pearance models for natural image matting. In CVPR (2009). 1

[WAC07] WANG J., AGRAWALA M., COHEN M. F.: Soft scis-
sors: an interactive tool for realtime high quality matting. In SIG-
GRAPH (2007), p. 9. 3, 9

[WC07a] WANG J., COHEN M. F.: Image and video matting:
a survey. Foundations and Trends in Computer Graphics and
Vision 3, 2 (2007), 97–175. 1

[WC07b] WANG J., COHEN M. F.: Optimized color sampling for
robust matting. CVPR (2007), 1–8. 1, 2, 3, 4

c© 2010 The Author(s)
Journal compilation c© 2010 The Eurographics Association and Blackwell Publishing Ltd.

Posted in Uncategorized

Leave a Reply

Your email address will not be published. Required fields are marked *