tion on the uncertainty of the prediction on label $f_i$, which is typically low when $X_i$ is labeled and is high otherwise. Active label suggestion presents results which minimize uncertainty.
Note that this Bayesian reformulation is mathematically rigorous: by the construction of $H$ [8], $C^{-1}$ is positive definite, and therefore $m$ and $C$ are a valid mean vector and covariance matrix. This shapes the posterior $p(f|y)$ as a sample from an underlying Gaussian process that factorizes into a prior on $f$ and the corresponding likelihood $p(y|f)$. This re-interpretation enables us to exploit well-established Bayesian uncertainty modeling. In particular, the variance of the predicted distribution $p(f|y)$ indicates how uncertain we are about each prediction we make on data points $X \in \mathcal{X}$.
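To make this concrete, the minimal sketch below (NumPy) computes these predictive variances under the assumption, consistent with Eq. 2 and the update rule described below, that $C^{-1} = L + \lambda H$ with $L$ a diagonal indicator of labeled points; the function and variable names are illustrative, not from our implementation.
\begin{verbatim}
import numpy as np

def predictive_variances(H, labeled_mask, lam=1.0):
    # Assumed model (see Eq. 2): C^{-1} = L + lambda * H, where L has 1 on
    # the diagonal for labeled points and 0 otherwise.
    L = np.diag(labeled_mask.astype(float))
    C = np.linalg.inv(L + lam * H)
    # The diagonal of C is the predictive variance of p(f|y) at each point:
    # low where X_i is labeled, high where the prediction is uncertain.
    return np.diag(C)
\end{verbatim}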
One simple and well-established strategy to exploit these modeled uncertainties for active label selection is to predict at each iteration $t$ and choose the point $X_i$ which has the largest uncertainty. However, Figure 8 shows that naively choosing data points with maximum uncertainty leads to poor performance, with a higher error rate than random selection: isolated outlier data points, which are not broadly informative, receive high variances and are chosen.
Instead, we select the candidate data point that minimizes the predictive variance over the entire set of data points. At each time step $t$, we choose the data point with the highest average information gain $I$, defined as:
$$I(X_i) = \sum_{j=1,\dots,u} \big[C(t-1) - C(t)_i\big]_{[j,j]}. \qquad (9)$$
The matrix $C(t)_i$ is constructed by adding 1 to the $i$-th diagonal element of $C(t-1)^{-1}$ and inverting it (i.e., the $i$-th data point is regarded as labeled; see Eq. 2).
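As a minimal sketch, this construction and the score of Eq. 9 can be computed naively as follows (NumPy; names are illustrative). Each call requires full matrix inversions, which motivates the efficient update derived below.
\begin{verbatim}
import numpy as np

def information_gain_naive(C_prev, i):
    # C(t)_i: add 1 to the i-th diagonal element of C(t-1)^{-1} and invert,
    # i.e., regard the i-th data point as labeled (see Eq. 2).
    C_inv = np.linalg.inv(C_prev)
    C_inv[i, i] += 1.0
    C_i = np.linalg.inv(C_inv)
    # Eq. 9: total reduction in predictive variance over all u points.
    return np.sum(np.diag(C_prev) - np.diag(C_i))
\end{verbatim}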
As suggested by the form of the matrix $C$, this score can be calculated without knowing the corresponding label value $Y_i$. This is due to the Gaussian noise model; in general, other noise models lead to a predictive variance ($C$) that depends on the labels.
Naïvely estimating the information gain for all data points requires quadratic computational complexity: one has to estimate the minimizer of $E(f)$ (Eq. 2), which is $O(u^3)$ for each data point.
However, in our iterative label suggestion scenario, $I$ can be computed efficiently in linear time: assuming that $C(t-1)$ is given from the previous step, calculating $\mathrm{diag}[C(t)_i]$ does not require inverting the matrix $L(t) + \lambda H$. Using the Sherman–Morrison–Woodbury formula, $C(t)_i$ can be efficiently calculated from $C(t-1)$:
$$\mathrm{diag}[C(t)_i] = \mathrm{diag}[C(t-1)] - \frac{\mathrm{squ}\big[C(t-1)_{[:,i]}\big]}{1 + \mathrm{diag}[C(t-1)]_i}, \qquad (10)$$
where $A_{[:,j]}$ denotes a vector formed from the $j$-th column of the matrix $A$, $\mathrm{diag}[A]$ constructs a vector from the diagonal components of matrix $A$, and $\mathrm{squ}[B]$ is a vector obtained by taking element-wise squares of the vector $B$. Accordingly, after explicitly calculating $C(0)$ (in practice, we limit $C(0)$ to 2,000 labels for faster interactivity), subsequent updates of $C(t)$ and $C(t)_i$ can be performed efficiently at each iteration $t$.
Amershi et al. comparison. One of the first algorithms to propose active learning in an end-user interaction application is Amershi et al.'s approach [1]. Their algorithm was originally developed for active labeling in metric learning contexts, and so it is not directly applicable to our problem. However, a re-interpretation of their label selection criterion enables transferring their algorithm to our setting. Their strategy is based on a Gaussian process (GP) model with Gaussian noise: for a given set of labeled data points $\mathcal{S}$ and unlabeled data points $\mathcal{U}$, the score $f(i)$ of the $i$-th unlabeled point is defined as:
$$f(i) = \frac{\mathrm{Var}(i\,|\,\mathcal{S})}{\mathrm{Var}(i\,|\,\mathcal{U})}, \qquad (11)$$
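For reference, a sketch of this score using a standard GP predictive variance; this is a minimal sketch in which the kernel matrix K, the noise level, and the exclusion of $i$ itself from the conditioning set are our assumptions for illustration.
\begin{verbatim}
import numpy as np

def gp_variance(K, cond, i, noise=1e-2):
    # Standard GP predictive variance of point i, conditioned on the points
    # in `cond`, under a Gaussian noise model. Excluding i itself from the
    # conditioning set is our assumption.
    cond = [j for j in cond if j != i]
    K_cc = K[np.ix_(cond, cond)] + noise * np.eye(len(cond))
    k_ic = K[i, cond]
    return K[i, i] - k_ic @ np.linalg.solve(K_cc, k_ic)

def amershi_score(K, labeled, unlabeled, i):
    # Eq. 11: ratio of the variance of point i given the labeled set S
    # to its variance given the unlabeled set U.
    return gp_variance(K, labeled, i) / gp_variance(K, unlabeled, i)
\end{verbatim}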