tion on the uncertainty of the prediction on label $f_i$, which is typically low when $X_i$ is labeled and is high otherwise. Active label suggestion presents results which minimize uncertainty.
Note that this Bayesian reformulation is mathematically rigorous: by the construction of $H$ [8], $C^{-1}$ is positive definite, and therefore $m$ and $C$ are a valid mean vector and covariance matrix. This shapes the posterior $p(f|y)$ as a sample from an underlying Gaussian process that factorizes into a prior on $f$ and the corresponding likelihood $p(y|f)$. This re-interpretation enables us to exploit well-established Bayesian uncertainty modeling. In particular, the variance of the predicted distribution $p(f|y)$ indicates how uncertain we are about each prediction we make on data points $X \in \mathcal{X}$.
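To make this concrete, the minimal sketch below (NumPy) computes these predictive variances under the assumption, consistent with Eq. 2 and the update rule described below, that $C^{-1} = L + \lambda H$ with $L$ a diagonal indicator of labeled points; the function and variable names are illustrative, not from our implementation.
\begin{verbatim}
import numpy as np

def predictive_variances(H, labeled_mask, lam=1.0):
    # Assumed model (see Eq. 2): C^{-1} = L + lambda * H, where L has 1 on
    # the diagonal for labeled points and 0 otherwise.
    L = np.diag(labeled_mask.astype(float))
    C = np.linalg.inv(L + lam * H)
    # The diagonal of C is the predictive variance of p(f|y) at each point:
    # low where X_i is labeled, high where the prediction is uncertain.
    return np.diag(C)
\end{verbatim}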
One simple and well-established strategy to exploit these modeled uncertainties for active label selection is to predict at each iteration $t$ and choose the point $X_i$ which has the largest uncertainty. However, Figure 8 shows that naively choosing data points with maximum uncertainty leads to poor performance, with a higher error rate than random selection: isolated outlier data points, which are not broadly informative, receive high variances and are chosen.
Instead, we select the candidate data point that minimizes the predictive variance over the entire set of data points. At each time step $t$, we choose the data point with the highest average information gain $I$, defined as:
$$I(X_i) = \sum_{j=1,\dots,u} \big[C(t-1) - C(t)_i\big]_{[j,j]}. \qquad (9)$$
The matrix $C(t)_i$ is constructed by adding 1 to the $i$-th diagonal element of $C(t-1)^{-1}$ and inverting it (i.e., the $i$-th data point is regarded as labeled; see Eq. 2).
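As a minimal sketch, this construction and the score of Eq. 9 can be computed naively as follows (NumPy; names are illustrative). Each call requires full matrix inversions, which motivates the efficient update derived below.
\begin{verbatim}
import numpy as np

def information_gain_naive(C_prev, i):
    # C(t)_i: add 1 to the i-th diagonal element of C(t-1)^{-1} and invert,
    # i.e., regard the i-th data point as labeled (see Eq. 2).
    C_inv = np.linalg.inv(C_prev)
    C_inv[i, i] += 1.0
    C_i = np.linalg.inv(C_inv)
    # Eq. 9: total reduction in predictive variance over all u points.
    return np.sum(np.diag(C_prev) - np.diag(C_i))
\end{verbatim}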
As suggested by the form of the matrix $C$, this score can be calculated without knowing the corresponding label value $Y_i$. This is due to the Gaussian noise model; in general, other noise models lead to a predictive variance ($C$) that depends on the labels.
Naïvely estimating the information gain for all data points requires quadratic computational complexity: one has to estimate the minimizer of $E(f)$ (Eq. 2), which is $O(u^3)$ for each data point.
However, in our iterative label suggestion scenario, $I$ can be computed efficiently in linear time: assuming that $C(t-1)$ is given from the previous step, calculating $\mathrm{diag}[C(t)_i]$ does not require inverting the matrix $L(t) + \lambda H$. Using the Sherman–Morrison–Woodbury formula, $C(t)_i$ can be efficiently calculated from $C(t-1)$:
$$\mathrm{diag}[C(t)_i] = \mathrm{diag}[C(t-1)] - \frac{\mathrm{squ}\big[C(t-1)_{[:,i]}\big]}{1 + \mathrm{diag}[C(t-1)]_i}, \qquad (10)$$
where $A_{[:,j]}$ denotes a vector formed from the $j$-th column of the matrix $A$, $\mathrm{diag}[A]$ constructs a vector from the diagonal components of matrix $A$, and $\mathrm{squ}[B]$ is a vector obtained by taking element-wise squares of the vector $B$. Accordingly, after explicitly calculating $C(0)$ (in practice, we limit $C(0)$ to 2,000 labels for faster interactivity), subsequent updates of $C(t)$ and $C(t)_i$ can be performed efficiently at each iteration $t$.
Amershi et al. comparison. One of the first algorithms to propose active learning in an end-user interaction application is Amershi et al.'s approach [1]. Their algorithm was originally developed for active labeling in metric learning contexts, and so it is not directly applicable to our problem. However, a re-interpretation of their label selection criterion enables transferring their algorithm to our setting. Their strategy is based on a Gaussian process (GP) model with Gaussian noise: for a given set of labeled data points $\mathcal{S}$ and unlabeled data points $\mathcal{U}$, the score $f(i)$ of the $i$-th unlabeled point is defined as:
$$f(i) = \frac{\mathrm{Var}(i\,|\,\mathcal{S})}{\mathrm{Var}(i\,|\,\mathcal{U})}, \qquad (11)$$
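For reference, a sketch of this score using a standard GP predictive variance; this is a minimal sketch in which the kernel matrix K, the noise level, and the exclusion of $i$ itself from the conditioning set are our assumptions for illustration.
\begin{verbatim}
import numpy as np

def gp_variance(K, cond, i, noise=1e-2):
    # Standard GP predictive variance of point i, conditioned on the points
    # in `cond`, under a Gaussian noise model. Excluding i itself from the
    # conditioning set is our assumption.
    cond = [j for j in cond if j != i]
    K_cc = K[np.ix_(cond, cond)] + noise * np.eye(len(cond))
    k_ic = K[i, cond]
    return K[i, i] - k_ic @ np.linalg.solve(K_cc, k_ic)

def amershi_score(K, labeled, unlabeled, i):
    # Eq. 11: ratio of the variance of point i given the labeled set S
    # to its variance given the unlabeled set U.
    return gp_variance(K, labeled, i) / gp_variance(K, unlabeled, i)
\end{verbatim}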