Supplemental Material for
Criteria Sliders: Learning Continuous
Database Criteria via Interactive Ranking
James Tompkin*
http://www.jamestompkin.com/
Brown University, USA
Kwang In Kim*
k.kim@bath.ac.uk
University of Bath, UK
Hanspeter Pfister
https://vcg.seas.harvard.edu/
Harvard University, USA
Christian Theobalt
http://gvv.mpi-inf.mpg.de/
MPI for Informatics, Germany
* Equal contribution.
We describe our interactive ranking prototype (Sec. 1), present additional experimental results (Sec. 2) and discussion (Sec. 3), and provide both an introductory explanation of the graph Laplacian as a regularizer and more details of our active label suggestion algorithm (Sec. 4).
1 Interactive Prototype
Our prototype interactive interface allows users to describe criteria by ranking examples, re-
ceiving active label suggestions, and viewing and manipulating the recovered sliders (Fig. 1).
The database is shown to the south, while the user rank is shown to the north. Initially, the
database is shown in a random order. The user can drag suggested items to the rank, or drag
any database item from the south up to the rank. Items are not ‘removed’ from the database on dragging; rather, this is the interaction metaphor for labeling an example. Every time an item is inserted into the rank, or whenever the rank is reordered, we update the criteria and order all items in the database display to the south. Efficient learning is key to this interactivity.
Criteria may be defined by ranking in 1D. These can be displayed as a linear rank, or in western page order similar to a Web image search. As our underlying representation is continuous, the criteria can also be displayed as a continuous scale where items are separated by their estimated distances along the scale. For databases which use underlying generative models as their high-dimensional representation, this allows new examples to be generated at specific points along the scale, e.g., at a specific point between two desirable examples (Fig. 2).
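As a concrete illustration, the following is a minimal sketch of this generation step. The latent codes z_a, z_b and the decode function are illustrative assumptions standing in for whatever generative model underlies the database; this is not our prototype's code.

```python
import numpy as np

def generate_between(z_a, z_b, c_a, c_b, c_target, decode):
    """Generate a new example at criteria value c_target, where c_a < c_target < c_b.

    z_a, z_b: latent codes of two database examples;
    c_a, c_b: their learned criteria values;
    decode:   the generative model's latent-to-example mapping.
    """
    t = (c_target - c_a) / (c_b - c_a)  # relative position along the scale
    z_new = (1.0 - t) * z_a + t * z_b   # linearly blend the latent codes
    return decode(z_new)                # e.g., a new human body shape
```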
Criteria may also be defined in pseudo-2D by ranking with two independent 1D rankings
(Fig. 3). The criteria output is then viewed as a 2D scatter plot, either as a matrix-style rank
ordering or as a continuous 2D space from the underlying continuous representation. For
experts, we can relax the rank interaction convenience and allow criteria to be defined directly
by placing examples on the continuous scale, e.g., Figure 3 in the main paper.
Our databases can be large with many thousands of items, and so we allow the user to reduce
the number of presented items to show the broader trend. When viewing the database organized
by rank, we allow the user to control a subsampling of the database items. When viewing the
database organized along continuous scales, we allow the user to zoom out. Users can also jump
quickly to a relevant point in the output criteria by right-clicking on the rank, and vice versa.
Figure 1: Our prototype interface to define criteria on visual databases, in this case of paintings.
The user is attempting to describe a criteria which spans ‘abstract to concrete’. Top: The user
ranking of examples. To the far right is the suggested next item to label. Bottom: The database,
ordered by the criteria, displayed in western page order. To see the broad trend across all items,
the user has decided to show only a subset of the data sampled across all items. We can see
that drawings and pictographic representations are at the top left, and more realistic or concrete
depictions are at the bottom right.
Figure 2: With an underlying high-dimensional generative model, we can create new examples by interpolating the discovered continuous criteria. For example, given the CAESAR database, we can generate a human form with a shape in between two examples on our criteria (shown: database examples ID 1088 at C = −2.12 and ID 857 at C = −1.95, with a generated model at C = −2.03).
2 Additional Evaluation
1D criteria learning—CAESAR human body geometries.
In our supplemental video, we demonstrate different ways of organizing the CAESAR human body database. Here, we show 2226 female body scans ordered by ‘height with priority, then girth’, which projects two orthogonal but related physical attributes down to a single criteria (Fig. 4).
Rank usefulness—further details
To discover the performance required for useful orderings, we asked 28 users to rate candidate orderings of 50 items against a hypothetical ideal ordering. Users assigned each candidate rank a 7-point Likert-scale score and a binary ‘useful/not’ classification. Candidate orderings were real results generated by our system, spanning the performance space between ‘random selection’ and ‘ground truth’.
Figure 3: Our interface is used to define an output pseudo-2D criteria space from two
independent 1D ranks. The horizontal criteria labels are ranked in the bar at the top left (Label
rank 1), with label suggestions to the right. The vertical criteria is on the far left (Label rank
2). In this example, the user has zoomed in to a subregion of the learned criteria space, where
representative data points are visible as images and red dots represent the remaining data points.
Figure 4: 2226 CAESAR female body scans are ordered by height with priority, then girth,
expressed with 50 labels. While some local variation remains, we can identify the general trend.
The ideal orderings were visible for comparison. First, the experiment was thoroughly explained and users completed a training example. Then, users ranked 24 candidate rankings in random order, drawn in equal portion from the three ‘Objects’ geometry databases described in the main paper.
Figure 5 compares the Kendall’s tau rank correlation coefficient of the different candidate rank orderings with the average Likert score attributed by the participants. We fit a line to compute τ thresholds for 70%, 80%, and 90% acceptance rates. For 70% of participants to be satisfied, τ = 0.61; at this level, 4.3× more pairwise orderings must agree than disagree. For 80% of participants to be satisfied, τ = 0.70, and 5.9× more pairs must agree than disagree. For 90% of participants to be satisfied, τ = 0.80, and 9× more pairs must agree than disagree.
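These agree/disagree ratios follow from the definition of Kendall’s τ. Assuming no ties, with n_c concordant (agreeing) and n_d discordant (disagreeing) pairs:

\[ \tau = \frac{n_c - n_d}{n_c + n_d} \quad\Longrightarrow\quad \frac{n_c}{n_d} = \frac{1+\tau}{1-\tau}, \qquad \text{e.g., } \tau = 0.80 \;\Rightarrow\; \frac{n_c}{n_d} = \frac{1.8}{0.2} = 9. \]

The 9× figure is exact for τ = 0.80; the 4.3× and 5.9× figures are consistent with the corresponding fitted thresholds before rounding.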
Human variation—further details
In the main paper, we organize paintings along the criteria “abstract to concrete”. This task has no well-defined answer: while there is a general trend that most people can agree on, the specific rank positions of the paintings are open to interpretation. As such, one measure of the performance our system should achieve to be useful is the level of agreement between humans.
To investigate the human variance of ranking paintings, we asked seven users to rank 201 images along the criteria “abstract to concrete”. The users were aged 20–40, 2 female and 5 male, with artistic experience varying from regular painters to casual familiarity. During this ‘ground truth’ generation task, many visual comparisons must be made across the rank.
Figure 5: Rank usefulness: Kendall’s tau rank correlation coefficient [−1, 1] plotted against participant score (1–7), with the fitted line y = 4.8x + 1.3 and the ‘useful by 70%/80%/90%’ thresholds marked. At 90% acceptance, 9× more pair orderings must agree than disagree; at 80%, 5.9× more; at 70%, 4.3× more.
Figure 6: Ranking 201 paintings on paper over a large wall (and floor).
Completing this task on a computer is currently difficult because displays are too small to show all images at once at sufficient size for comparison. Thus, we printed each image on paper and asked participants to tape them to a very large wall. The wall acted as the ‘rank space’, and participants were free to use it as they wished. Most used some combination of bubble, insertion, and merge sorts (Fig. 6).
We compute Kendall’s tau rank correlation coefficient between each pair of user rankings, producing 21 coefficients (Tab. 1). The mean coefficient is 0.619, with a standard deviation of 0.043 and a range of 0.530–0.684. There is only moderate agreement between participants. This shows the difficulty of the task and the need for an interactive ranking system which can more easily create personal criteria.
On average, participants spent 158 minutes producing their rank—ordering a complex database can take time! We initially developed a mouse-based drag-and-drop list interface for this task, with quick buttons to move items to different sections of the 201-item list as an approximate pre-sort (e.g., ‘Move to start/10%/20%. . . /end’). However, this was slower than using paper and a large physical rank space (Fig. 6).
Par. 1 2 3 4 5 6 7
1 - 0.6296 0.6067 0.5269 0.6094 0.6028 0.6032
2 0.6296 - 0.6806 0.6443 0.6839 0.6667 0.6130
3 0.6067 0.6806 - 0.5760 0.6449 0.6237 0.5941
4 0.5269 0.6443 0.5760 - 0.6444 0.6401 0.5342
5 0.6094 0.6839 0.6449 0.6444 - 0.6586 0.5661
6 0.6028 0.6667 0.6237 0.6401 0.6586 - 0.6407
7 0.6032 0.6130 0.5941 0.5342 0.5661 0.6407 -
Table 1: Kendall’s tau rank correlation coefficients computed pairwise between seven participants, each asked to rank 201 paintings by the subjective criteria ‘abstract to concrete’. There is moderate agreement across all participants, showing both the difficulty of the task and a baseline level of performance required to satisfy the group on average.
Data retrieval precision/recall
We perform an experiment to provide an impression of the performance of our approach in the data retrieval setting, rather than in our interactive ranking setting. We use three databases from computer vision: the easier ETH-80 database [12], and the harder C-Pascal [12] and CIFAR-10 [9] databases. We compare both our approach and linear SVM in one-vs.-all binary classification, then average performance across classes (Fig. 7). We compute precision and recall curves by varying the decision boundary threshold. Further, we compare each method across varying numbers of labels. To build the LG regularizer [8], we set the k nearest neighbors to 20 and the dimensionality of the underlying manifold m to 10.
For ETH-80 with 3,280 data points, we withhold 530 random points as test data (1/6). We follow Ebert et al. [5] and extract HOG descriptors for each image. For C-Pascal, we randomly withhold 700 of the 4,450 data points for testing; similarly, we extract HOG descriptors. CIFAR-10 has 60,000 data points, 10,000 of which are for testing. For this database, we normalize the input pixel values to the range −1 to 1, then classify directly without a feature transform, and so we should not expect performance to be high. Current state-of-the-art techniques jointly learn the feature representation and classifier, for instance by employing a convolutional neural network.
Across all three datasets, our approach, which makes use of the unlabeled data points, is competitive with linear SVM.
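For concreteness, a minimal sketch of this evaluation protocol follows, assuming per-class decision scores from either method; the function and variable names are ours, not from our experimental code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def mean_pr_curve(scores, labels, num_classes, grid=np.linspace(0.0, 1.0, 101)):
    """scores: (n, num_classes) decision values; labels: (n,) class ids.
    Returns precision averaged across one-vs.-all classes on a common recall grid."""
    mean_precision = np.zeros_like(grid)
    for c in range(num_classes):
        y_true = (labels == c).astype(int)                    # one-vs.-all split
        p, r, _ = precision_recall_curve(y_true, scores[:, c])
        # recall is returned in decreasing order; flip for interpolation
        mean_precision += np.interp(grid, r[::-1], p[::-1])
    return grid, mean_precision / num_classes
```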
3 Discussion
Ranking vs. metric distances.
Interactive ranking makes the intuitive assumption that people find ranking easier than specifying explicit or relative distances between many objects on scales, where the difficulty of this task is proportional to the complexity of the criteria. This ranking choice assumes that scale metric distances depend linearly on the ordering. In principle, this incurs no loss of generality: once a criteria scale is generated, we can always non-linearly re-metrize it by performing an order-preserving regression on the final result.
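For instance, a minimal sketch of such a re-metrization with isotonic (order-preserving) regression; the numeric values are illustrative only.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

f_learned = np.array([-2.1, -1.3, -0.2, 0.4, 1.8])  # recovered slider values
f_desired = np.array([-2.0, -1.9, 0.0, 0.1, 2.0])   # user-specified spacing

# Fit a monotonic map from the learned scale onto the desired spacing:
# the rank order is preserved while the metric along the slider changes.
remetrize = IsotonicRegression(out_of_bounds="clip").fit(f_learned, f_desired)
f_new = remetrize.predict(f_learned)
```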
Feature power.
We focused on designing an interactive semi-supervised learning system
suitable for high-dimensional spaces. However, ultimately, the performance of our algorithm
also depends on the data features. We use existing feature representations from corresponding
related problems (e.g., distance fields for geometric objects); however, in general, the best
features depend on the specifics of the database. Here, more sophisticated methods for feature
extraction and selection are applicable, e.g., using automatic relevance determination [16].
Figure 7: In the data retrieval setting, which compares classification performance instead of rank performance, our approach (solid lines) is competitive with linear SVM (dotted lines) across different numbers of labels (colors), so long as the threshold is set appropriately. The reader is recommended to compare like-colored lines to assess performance. All precision/recall curves are mean curves of retrieval over all classes. Top left: ETH-80 with HOG features (55–2750 labels). Top right: C-Pascal with HOG features (75–3750 labels). Bottom: CIFAR-10 (1000–50000 labels).
Non-zero-knowledge vs. exploiting existing labels.
We intentionally designed our system for the general ‘zero-knowledge’ case where there are no existing labels, e.g., using criteria definition as queries to assess what is in a Web-scraped database. However, in the wild, some labels may exist. Taking inspiration from Chaudhuri et al. [4], who extend a supervised learning framework [15], it may be possible to create an interactive criteria exploration system where crowd-sourced labels ‘guide’ our regularizer. This may pull the solution towards the ‘average’ case, which removes absolute subjectivity, but it may also be possible for this guiding to be parameterized and under user control. Further, we only address single-session labeling scenarios, but over multiple sessions it may be possible to apply structured labeling techniques to reinforce previous decisions or allow criteria evolution, as per Kulesza et al. [10].
4 Interactive ranking and active label suggestion—further details
This section contains 1) an introduction to the graph Laplacian regularizer; 2) motivation for our use of the local Gaussian (LG) regularizer [8] over the standard graph Laplacian regularizer [13, 18, 20] (par. 2, ‘semi-supervised learning problem’); 3) a statement of the rigorousness of the Bayesian reformulation in Eq. 8 (par. 2, ‘active learning formulation’), including explanations of our comparisons to Amershi et al. [1] and Zhu et al. [20]; and 4) more detailed derivations of our large-scale database efficiency extension. Parts of this section are reproduced from the main paper to ease understanding.
Semi-supervised learning problem.
For a set of data points X = {X_1, ..., X_u} ⊂ R^n plus the corresponding labels Y = {Y_1, ..., Y_l} ⊂ R for the first l points in X, where l ≪ u, our goal is to infer the labels of the remaining u − l data points in X. Our approach is based on regularized empirical risk minimization [3, 18, 20]:

\[ \operatorname*{argmin}_{f:\mathbb{R}^n \rightarrow \mathbb{R}} \; \sum_{i=1}^{l} \left(Y_i - f(X_i)\right)^2 + \lambda R_H(f), \tag{1} \]
where R_H(·) is the regularization functional which quantifies how to smooth the propagated labels within their local context, and where λ balances between the smoothness of f and the deviation from the labeled examples. Here, we use the standard squared loss function for simplicity. This can be represented in matrix form as an energy functional to be minimized:

\[ E(\mathbf{f}) = (\mathbf{f}-\mathbf{y})^\top L\, (\mathbf{f}-\mathbf{y}) + \lambda\, \mathbf{f}^\top H \mathbf{f}, \tag{2} \]

where L is a diagonal label indicator matrix and y is a vector of continuous label values: L_{[i,i]} = 1 and y_i = Y_i if the i-th data point is labeled, and L_{[i,i]} = 0 and y_i = 0 otherwise.
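Since E is quadratic in f, setting its gradient to zero, 2L(f − y) + 2λHf = 0, gives the closed-form minimizer as the solution of the sparse linear system (L + λH)f = Ly. A minimal sketch of this solve, with H standing for whichever regularization matrix is plugged in and all names ours:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def propagate_labels(H, labeled_idx, labeled_vals, u, lam=0.1):
    """Minimize Eq. 2 by solving (L + lam*H) f = L y."""
    y = np.zeros(u)
    y[labeled_idx] = labeled_vals   # y_i = Y_i for labeled points, 0 otherwise
    d = np.zeros(u)
    d[labeled_idx] = 1.0            # L[i,i] = 1 iff the i-th point is labeled
    L = sp.diags(d)
    return spsolve((L + lam * H).tocsc(), L @ y)
```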
This problem can be solved in two ways: 1) by reconstructing the underlying function f which maps from X to Y, often called inductive learning as it allows f to be applied to new data points not in X; or 2) by identifying the values of Y for the unlabeled points in X only, often called transductive learning.
Datasets as graphs and the Laplacian matrix.
One way to solve the transductive learning problem is to represent data points as a graph and use the Laplacian matrix as a regularizer. For the unfamiliar reader, the Laplacian matrix S is a representation of a simple graph G = {V, E}, where V = {v_0, ..., v_n} is the set of nodes and E ⊆ V × V is the set of edges. It can be built by subtracting the adjacency matrix A of a graph from the degree matrix D of a graph: S = D − A. For an unweighted graph, the elements of S are:

\[ S_{[i,j]} := \begin{cases} \deg(v_i), & \text{if } i = j\\ -1, & \text{if } i \neq j \text{ and } (v_i, v_j) \in E\\ 0, & \text{otherwise.} \end{cases} \tag{3} \]
In our semi-supervised scenario, instead of an unweighted graph, we declare edge weights W based on the similarity of the input data points. Thus, the graph Laplacian matrix S is one of the best regularizers H for transductive learning, because it allows label propagation to occur between data points based on the pair-wise deviations of {f_i} weighted by the similarity of the corresponding inputs [13, 18, 20]. The graph Laplacian matrix also finds applications in spectral clustering and embedding [13], ranking [19], and inducing diffusion processes on graphs [7].
Graph Laplacian as a regularizer.
Given Equation 1, we can formulate the graph Laplacian S [13] as a regularizer R_S(f):

\[ R_H(f) := \mathbf{f}^\top S\, \mathbf{f} = \sum_{i,j=1}^{u} W_{[i,j]} \left(f_i - f_j\right)^2, \tag{4} \]
where f_i = f(X_i), f := f|_X = [f_1, ..., f_u]^⊤, and W is a non-negative input similarity matrix which is typically defined based on a Gaussian:

\[ W_{[i,j]} = \exp\!\left(-\frac{\|X_i - X_j\|^2}{b}\right), \tag{5} \]

with b being the width of the Gaussian kernel, and

\[ S = D - W, \tag{6} \]

where D is a diagonal matrix of column sums of W (D_{[i,i]} = \sum_j W_{[i,j]}).
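For reference, a minimal sketch constructing S from Eqs. (5) and (6) on a k-nearest-neighbor graph, a common sparsification; the parameter values and names here are illustrative, not our exact settings.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def graph_laplacian(X, k=20, b=1.0):
    """S = D - W with W[i,j] = exp(-||X_i - X_j||^2 / b) on kNN edges."""
    dist = kneighbors_graph(X, k, mode="distance", include_self=False)
    W = dist.copy()
    W.data = np.exp(-dist.data**2 / b)  # Eq. (5) on the sparse edge set
    W = W.maximum(W.T)                  # symmetrize: undirected graph
    D = sp.diags(np.asarray(W.sum(axis=0)).ravel())
    return D - W                        # Eq. (6)
```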
One way of justifying the use of the graph Laplacian comes from its limit-case behavior as u → ∞ and b → 0: when the data X is generated from an underlying manifold M with dimension m ≪ n, i.e., the corresponding probability distribution P has support on M, the graph Laplacian converges to the Laplace–Beltrami operator ∆ on M [2, 6]. The Laplace–Beltrami operator can be used to measure the first-order variations of a continuously differentiable function f on M:

\[ \|f\|_\Delta^2 := \int_M f(X)\,[\Delta f|_X]\,dV(X) = \int_M \|\nabla f|_X\|_g^2\,dV(X), \tag{7} \]

where g is the Riemannian metric and dV is the corresponding natural volume element [11] of M. The second equality is the result of Stokes’ theorem. Accordingly, a graph Laplacian-based regularizer R_L can be regarded as an empirical estimate of the first-order variation of f on M, based on X.
With the predominant use of the graph Laplacian regularizer, semi-supervised learning has become established in classification and clustering tasks where the output is a discrete set of labels. However, estimating continuous outputs is still an area of active research due to degenerate behavior of the graph Laplacian regularizer: in high-dimensional spaces, minimizing E with H = S tends to produce an output f where most elements are close to zero [14]. This is fine for classification, as output magnitudes are irrelevant (e.g., to classify data points into positive and negative classes, 1 and 1/10 are both positive); however, close-to-zero outputs are meaningless for continuous outputs, as very little regularization occurs.
This problem has only recently been addressed with the local Gaussian (LG) regularizer [8], which explicitly addresses this degeneracy and reduces the computational complexity required to remove it. However, since the LG regularizer was developed for general batch semi-supervised learning, i.e., all labeled data points are provided before the learning process, it cannot be directly applied to interactive ranking. Hence, we adapt the LG regularizer to our setting.
Active learning formulation.
Following the analogy between regularized empirical risk minimization and maximum a posteriori (MAP) estimation [16], and adopting Bayesian optimization [17], we reformulate −1 × E (Eq. 2) as (the logarithm of) a product of the prior p(f) and a Gaussian noise model p(y|f). This leads us to assess the minimizer of E as the mean of the predictive distribution (the posterior):

\[ \log p(\mathbf{f}|\mathbf{y}) = -(\mathbf{m}-\mathbf{f})^\top C^{-1} (\mathbf{m}-\mathbf{f}) + Z, \tag{8} \]

with mean m = CLy, covariance matrix C = (L + λH)^{−1}, the normalization constant Z, and the local Gaussian regularization matrix H. This perspective informs a predictive uncertainty for each data point: the i-th diagonal component C_ii of the covariance matrix C contains information on the uncertainty of the prediction on label f_i, which is typically low when X_i is labeled and high otherwise. Active label suggestion presents results which minimize uncertainty.
Note that this Bayesian reformulation is mathematically rigorous: by the construction of H [8], C^{−1} is positive definite, and therefore m and C are a valid mean vector and covariance matrix. This shapes the posterior p(f|y) as a sample from an underlying Gaussian process that factorizes into a prior on f and the corresponding likelihood p(y|f). This re-interpretation enables us to exploit well-established Bayesian uncertainty modeling. In particular, the variance of the predictive distribution p(f|y) indicates how uncertain we are about each prediction we make on data points X ∈ X.
One simple and well-established strategy to exploit these modeled uncertainties for active label selection is to predict at each iteration t and choose the point X_i which has the largest uncertainty. However, Figure 8 shows that naïvely choosing data points with maximum uncertainty leads to poor performance, with a higher error rate than random selection, as isolated outlier data points, which are not broadly informative, receive high variances and are chosen. Instead, we choose the candidate data point that minimizes the predictive variance over the entire set of data points. At each time step t, we choose the data points with the highest average information gain I, defined as:

\[ I(X_i) = \sum_{j=1,\ldots,u} \left[C(t-1) - C(t)_i\right]_{[j,j]}. \tag{9} \]
The matrix C(t)_i is constructed by adding 1 to the i-th diagonal element of C(t−1)^{−1} and inverting it (i.e., the i-th data point is regarded as labeled; see Eq. 2).
As suggested by the form of the matrix C, this score can be calculated without knowing the corresponding label value Y_i. This is due to the Gaussian noise model; in general, different noise models lead to the predictive variance (C) depending on the labels.
Naïvely estimating the information gain for all data points requires quadratic computational complexity: one has to estimate the minimizer of E(f) (Eq. 2), which is O(u³), for each data point. However, in our iterative label suggestion scenario, I can be efficiently computed in linear time: assuming that C(t−1) is given from the previous step, calculating diag[C(t)_i] does not require inverting the matrix L(t) + λH. Using the Sherman–Morrison–Woodbury formula, C(t)_i can be efficiently calculated from C(t−1):

\[ \operatorname{diag}[C(t)_i] = \operatorname{diag}[C(t-1)] - \frac{\operatorname{squ}\!\left[C(t-1)_{[:,i]}\right]}{1+\operatorname{diag}[C(t-1)]_i}, \tag{10} \]
where A_{[:,j]} denotes a vector formed from the j-th column of the matrix A, diag[A] constructs a vector from the diagonal components of matrix A, and squ[B] is a vector obtained by taking element-wise squares of the vector B. Accordingly, after explicitly calculating C(0) (in practice, we limit C(0) to 2,000 labels for faster interactivity), subsequent updates of C(t) and C(t)_i can be performed efficiently at each iteration t.
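A minimal sketch of Eqs. (9) and (10) for a dense C(t−1); names are ours, and the sparse variant follows in the large-scale extension below.

```python
import numpy as np

def info_gain(C_prev, i):
    """I(X_i): total variance reduction if the i-th point were labeled."""
    col = C_prev[:, i]                                          # C(t-1)[:, i]
    diag_new = np.diag(C_prev) - col**2 / (1.0 + C_prev[i, i])  # Eq. (10)
    return np.sum(np.diag(C_prev) - diag_new)                   # Eq. (9)

# The next suggestion is the unlabeled i maximizing info_gain(C_prev, i).
```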
Amershi et al. comparison.
One of the first algorithms to propose active learning in end-user interaction applications is Amershi et al.’s approach [1]. Their algorithm was originally developed for active labeling in metric learning contexts, and so it is not directly applicable to our problem. However, a re-interpretation of their label selection criterion enables transferring their algorithm to our current setting. Their strategy is based on a Gaussian process (GP) model with Gaussian noise: for a given set of labeled data points S and unlabeled data points U, and for the i-th unlabeled point, the corresponding score f(i) is defined as:

\[ f(i) = \frac{\operatorname{Var}(i|S)}{\operatorname{Var}(i|U)}, \tag{11} \]
where Var(i|S) is the variance predicted for the i-th data point by the GP model trained on S, and Var(i|U) is the variance predicted by the GP model trained on U, assuming that U are all labeled. Details of GP prediction can be found in [16]. This interpretation enables us to apply the same principle to our setting: for the i-th data point, the corresponding score I′(X_i) is:

\[ I'(X_i) = \frac{\operatorname{diag}[C(t-1)]_i}{\operatorname{diag}[U(t-1)]_i}, \tag{12} \]
where U(t−1) is the variance predicted based on all unlabeled data points. Maximizing only the predictive uncertainty diag[C(t−1)]_i corresponds to active label selection based on the predictive uncertainty, which is likely to choose an outlier. The (reciprocal of the) denominator represents how well the point X_i is connected to the unlabeled data points [1]. Accordingly, normalizing the uncertainty by this value rules out the possibility of choosing an outlier. However, I′(X_i) is based on how much information the unlabeled data points contain about X_i, which does not directly represent our final goal: to choose a data point that minimizes the overall uncertainty over all data points, which is directly addressed by I.
Zhu et al. comparison.
Zhu et al. [20] proposed an active learning framework that selects labels by minimizing an estimate of the expected risk. This framework relies on evaluating the prediction accuracy of the updated system for each hypothesized output value. Therefore, the application of this approach is limited to cases in which the output space is finite, e.g., {−1, 1} in classification problems. Our approach does not have this limitation.
Large-scale extension.
The rank-one update model (Eqs. 9 and 10) relies on explicitly calculating the covariance matrix C = (L + λH)^{−1}. Even though the original regularization matrix L + λH is sparse, its inverse C is in general dense. Therefore, this strategy is not directly applicable to large-scale problems, where calculating and storing the inverse of L + λH is infeasible. For this case, we adopt a sparse eigendecomposition-based approximation of L + λH. Assume that L + λH is eigendecomposed:

\[ C^{-1} = L + \lambda H = E_F \Lambda_F E_F^\top \quad\Longrightarrow\quad C = E_F \Lambda_F^{-1} E_F^\top = \sum_{i=1}^{u} \frac{1}{\lambda_i}\, e_i e_i^\top, \]
where the u×u-sized matrix E_F stores the eigenvectors ({e_i}_{i=1}^{u}) column-wise, and Λ_F is a diagonal matrix of the corresponding eigenvalues ({λ_i}_{i=1}^{u}). We assume that the eigenvalues and the corresponding eigenvectors are arranged in order of increasing eigenvalue. Given the prescribed rank parameter r, our approximate covariance matrix C̄ is given as the best possible L₂-approximation of rank r to C:

\[ C \approx \bar{C} = \sum_{i=1}^{r} \frac{1}{\lambda_i}\, e_i e_i^\top = E \Lambda^{-1} E^\top, \tag{13} \]

where E is a u×r submatrix of E_F containing the first r eigenvectors, and Λ is the corresponding r×r submatrix of Λ_F.
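A minimal sketch of this approximation, using SciPy’s shift-invert eigensolver to obtain the r smallest eigenpairs of the sparse matrix L + λH; names are ours.

```python
import numpy as np
import scipy.sparse.linalg as spla

def lowrank_covariance(A, r=100):
    """A = L + lam*H (sparse, symmetric positive definite).
    Returns E (u x r) and lam_r (r,) with C ~= E @ np.diag(1/lam_r) @ E.T."""
    lam_r, E = spla.eigsh(A, k=r, sigma=0, which="LM")  # r smallest eigenpairs
    return E, lam_r

# Per-point predictive variances without forming the dense covariance:
# var = np.sum(E**2 / lam_r, axis=1)  # diagonal of the rank-r approximation
```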
Figure 8: Reproduced from the main paper for reading ease. Mean absolute error for estimating the ground-truth weight parameter in the CAESAR database, plotted against the number of labels (2–50), for random label selection, active label selection (predictive variance), Amershi et al., active label selection (proposed), and active label selection sparse (proposed). Error bar lengths correspond to twice the standard deviations (std.); note the smaller stds. of the proposed full and sparse methods. The error rates for the two-labeled-data-points case are the same by construction.

Now suppose that we add a label of the i-th data point at time 1. Applying the Sherman–Morrison–Woodbury formula to C̄, we can write the update equation:
\[ C(1)_i = E\Lambda^{-1}E^\top - \frac{E\Lambda^{-1}E_{[i,:]}^\top \, E_{[i,:]}\Lambda^{-1}E^\top}{1 + [H^{-1}]_{ii}} = E\!\left(\Lambda^{-1} - \frac{\Lambda^{-1}E_{[i,:]}^\top E_{[i,:]}\Lambda^{-1}}{1 + [H^{-1}]_{ii}}\right)\! E^\top. \]
The corresponding information gain is given as:

\[ \operatorname{diag}\!\left[\bar{C} - \bar{C}_i\right]_k = E_{[k,:]} \frac{\Lambda^{-1}E_{[i,:]}^\top E_{[i,:]}\Lambda^{-1}}{1 + [H^{-1}]_{ii}} E_{[k,:]}^\top = \frac{1}{1 + [H^{-1}]_{ii}}\, E_{[k,:]}\Lambda^{-1}E_{[i,:]}^\top \, E_{[i,:]}\Lambda^{-1}E_{[k,:]}^\top. \tag{14} \]
At step t, adding a label to the i(t)-th data point leads to a new covariance estimate:

\[ \left[E F_t E^\top\right]_{k,k} = E_{[k,:]}\Lambda^{-1}E_{[k,:]}^\top - \sum_{i=1,\ldots,t} E_{[k,:]}\, r_i r_i^\top\, E_{[k,:]}^\top, \quad \text{where } r_i = \frac{1}{\sqrt{1 + [C^{-1}]_{ii}}} \left(E_{[i,:]}\Lambda^{-1}\right)^\top. \tag{15} \]
Active label selection performance.
For CAESAR, we randomly selected a set A of 2000 data points and a subset B ⊂ A of 2 initial data points to be labeled (our user scenario). First, we train on B. Then, we calculate the information gain I (Eq. 9) for each data point in A\B. The best 2 data points were assigned labels and added to B. We iterated until |B| = 46. Figure 8 compares error rates of our sparse eigendecomposition-based approximation (‘Active label selection sparse (proposed)’) to the alternative algorithms (averaged over 10 trials): 1) random label selection; 2) predictive uncertainty, as discussed previously; 3) Amershi et al. [1], adapted to our semi-supervised setting; and 4) the full information-gain-based method (Eqs. 9 and 10; ‘Active label selection (proposed)’). We used only the first 100 (r = 100) eigenvectors. Predictive variance alone resulted in suggesting outliers, which led to worse results than random selection. The adapted algorithm of Amershi et al. [1] improves upon random selection, while our two information-gain-based approaches show further improvement, especially when the number of labels is low.
References
[1] S. Amershi, J. Fogarty, A. Kapoor, and D. S. Tan. Effective end-user interaction with machine learning. In Proc. AAAI, 2011.
[2] Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289–1308, 2008.
[3] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[4] S. Chaudhuri, E. Kalogerakis, S. Giguere, and T. Funkhouser. AttribIt: content creation with semantic attributes. In Proc. UIST, pages 193–202, 2013.
[5] Sandra Ebert, Diane Larlus, and Bernt Schiele. Extracting structures in image collections for object recognition. In Proc. ECCV, pages 720–733, 2010.
[6] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds - weak and strong pointwise consistency of graph Laplacians. In Proc. COLT, pages 470–485, 2005.
[7] K. I. Kim, J. Tompkin, H. Pfister, and C. Theobalt. Context-guided diffusion for label propagation on graphs. In Proc. ICCV, pages 2776–2784, 2015.
[8] K. I. Kim, J. Tompkin, H. Pfister, and C. Theobalt. Local high-order regularization on data manifolds. In Proc. IEEE CVPR, pages 5473–5481, 2015.
[9] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[10] T. Kulesza, S. Amershi, R. Caruana, D. Fisher, and D. Charles. Structured labeling for facilitating concept evolution in machine learning. In Proc. CHI, pages 3075–3084, 2014.
[11] John M. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer, New York, 1997.
[12] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In Proc. IEEE CVPR, volume 2, pages 409–415, 2003. doi: 10.1109/CVPR.2003.1211497.
[13] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[14] B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: the limit of infinite unlabelled data. In NIPS, pages 1330–1338, 2009.
[15] D. Parikh and K. Grauman. Relative attributes. In Proc. ICCV, pages 503–510, 2011.
[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[17] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.
[18] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, pages 321–328, 2004.
[19] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf. Ranking on data manifolds. In NIPS, pages 169–176, 2004.
[20] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In Proc. ICML Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 58–65, 2003.