https://imatge.upc.edu/web/people/xavier-giro
These slides provide an overview of our research group at UPC, which has been applying deep learning to computer vision since 2014. We are one of the pioneering research groups in Europe and, despite the youth of most of our members, we have already contributed to the community with a diverse range of publications and software at top scientific venues.
Deep and Young Vision Learning at UPC BarcelonaTech
1. Deep and Young Vision Learning at UPC BarcelonaTech
Research Group presentation for NIPS 2016
Xavier Giro-i-Nieto
2. How to find us at NIPS 2016?
Women in Machine Learning WS: Monday 5, 1:30pm to 3:30pm @ Area 5+6+7+8, level P0
● Míriam Bellver et al., “Efficient search of objects in images using deep reinforcement learning”
● Dèlia Fernàndez et al., “Is a “happy dog” image more “happy” than “dog”? - Analyzing Adjective and Noun Visual Contributions”
Deep Reinforcement Learning WS: Friday 9, 2:30pm to 3:30pm & 5:30pm to 6:30pm @ Area 1
● Míriam Bellver, Xavier Giro, Ferran Marqués and Jordi Torres, “Hierarchical Object Detection with Deep Reinforcement Learning”
Large Scale Computer Vision Systems WS: Saturday 10, 3:00pm to 4:00pm @ Room 111
● Alberto Montes, Amaia Salvador, Santiago Pascual and Xavier Giro, “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”
18. Visual Instance Search
Local hand-crafted features (e.g. SIFT) are quantized into a Bag of Visual Words: each image becomes a vector in an N-dimensional feature space, high-dimensional and highly sparse.
v1 = (v11, …, v1n)
…
vk = (vk1, …, vkn)
INVERTED FILE
word | Image IDs
1 | 1, 12, …
2 | 1, 30, 102
3 | 10, 12
4 | 2, 3
6 | 10
…
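The inverted-file lookup on this slide can be sketched in a few lines of Python; the word-to-image assignments below are a toy example, not the slide's actual data.

```python
# Minimal sketch of Bag-of-Visual-Words retrieval with an inverted file.
# Each image is a set of visual-word IDs; the inverted file maps
# visual word -> images containing it, so queries touch only the
# (sparse) words they contain.
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual-word IDs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            inverted[w].add(image_id)
    return inverted

def query(inverted, query_words):
    """Rank images by how many query words they share (simple overlap score)."""
    scores = defaultdict(int)
    for w in query_words:
        for image_id in inverted.get(w, ()):
            scores[image_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical word assignments for a toy index.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]})
results = query(index, [1, 3])  # image 12 contains both query words, so it ranks first
```

Real systems replace the overlap score with tf-idf weighting, but the sparse-index structure is the same.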
19. Image Representations
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
20. Image Representations
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
21. Image Representations
Convolutional Neural Networks: sum/max-pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional weighting for aggregated deep convolutional features. arXiv preprint arXiv:1512.04065.
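The pooled-feature representation above reduces to a few numpy operations; the tensor shape here is illustrative (e.g. a conv5-style activation map), not tied to any one network.

```python
# Sketch: turn a conv feature tensor (C x H x W) into a global image
# descriptor by sum- or max-pooling over spatial locations, then
# L2-normalizing, in the spirit of the pooled representations cited above.
import numpy as np

def global_descriptor(feature_map, pooling="sum"):
    if pooling == "sum":
        pooled = feature_map.sum(axis=(1, 2))
    else:
        pooled = feature_map.max(axis=(1, 2))
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

feats = np.random.rand(512, 14, 14)  # hypothetical conv activations
desc = global_descriptor(feats)      # one 512-D vector per image
```

Images can then be compared with a plain dot product, since the descriptors are unit-norm.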
22. Image Representations
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
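VLAD encoding of local conv features can be sketched as follows; the descriptor and codebook sizes are arbitrary choices for illustration, and the per-cluster intra-normalization used in some variants is omitted.

```python
# Sketch of VLAD: assign each local descriptor to its nearest codebook
# centroid and accumulate the residuals, giving one fixed-size vector
# per image regardless of how many local features it has.
import numpy as np

def vlad(descriptors, centroids):
    """descriptors: (n, d) local features; centroids: (k, d) visual words."""
    k, d = centroids.shape
    # nearest centroid for every descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    enc = np.zeros((k, d))
    for i, c in enumerate(assign):
        enc[c] += descriptors[i] - centroids[c]  # residual accumulation
    enc = enc.ravel()
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc

rng = np.random.default_rng(0)
code = vlad(rng.standard_normal((100, 64)), rng.standard_normal((8, 64)))  # 8*64 = 512-D
```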
32. Egocentric vision
M. Bolaños, Mestre, R., Talavera, E., Giró-i-Nieto, X., and Radeva, P., “Visual Summary of Egocentric Photostreams by Representative Keyframes”, WEsAX workshop at ICME 2015, Turin, Italy, 2015.
● Clustering based on low-level features from a ConvNet’s fully connected layer.
33. Egocentric vision
Lidon, Aniol, Marc Bolaños, Mariella Dimiccoli, Petia Radeva, Maite Garolera, and Xavier Giró-i-Nieto. "Semantic Summarization of Egocentric Photo Stream Events." arXiv preprint arXiv:1511.00438 (2015).
● Clustering based on semantic detectors and diversity.
34. Wearable Cameras for Visual Memory (objects)
Cristian Reyes, Eva Mohedano, Noel E. O’Connor, Xavier Giró-i-Nieto, Kevin McGuinness
https://imatge-upc.github.io/retrieval-2016-lostobject/
C. Reyes, Mohedano, E., McGuinness, K., O'Connor, N. E., and Giró-i-Nieto, X., “Where is my Phone? Personal Object Retrieval from Egocentric Images”, in Lifelogging Tools and Applications Workshop in ACM Multimedia, Amsterdam, The Netherlands, 2016.
37. Egocentric vision: Objects
● Lost & found objects in visual diaries.
C. Reyes, Mohedano, E., McGuinness, K., O'Connor, N. E., and Giró-i-Nieto, X., “Where is my Phone? Personal Object Retrieval from Egocentric Images”, in Lifelogging Tools and Applications Workshop in ACM Multimedia, Amsterdam, The Netherlands, 2016.
39. Visual Ranking - Descriptors
Bag of Words
Eva Mohedano, Amaia Salvador, Kevin McGuinness, Ferran Marques, Noel E. O’Connor, and Xavier Giro-i-Nieto. Bags of local convolutional features for scalable instance search. In Proceedings of the ACM International Conference on Multimedia. ACM, 2016.
44. Egocentric vision: Objects
Comparison of the 4 best configurations according to their performance in terms of MRR:

Query         | Day Images    | Threshold | Temporal ordering | MRR
Full images   | Saliency Mask | TVSS      | Diversity         | 0.283
Full images   | Saliency Mask | TVSS      | Timestamp         | 0.274
Weighted Mask | Full image    | TVSS      | Diversity         | 0.269
Weighted Mask | Weighted Mask | TVSS      | Diversity         | 0.258
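The MRR column above is the standard Mean Reciprocal Rank; a minimal sketch of the metric, with hypothetical ranks rather than the table's actual evaluation data:

```python
# Mean Reciprocal Rank: for each query, take 1/rank of the first
# relevant result, then average over queries.
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first relevant item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. three queries whose first relevant result appears at ranks 1, 2 and 4
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```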
46. Fine-Tuning a Convolutional Network for Cultural Event Recognition
AUTHOR: Andrea Calafell
ADVISORS: Xavier Giró-i-Nieto, Amaia Salvador
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., “Cultural Event Recognition with Visual ConvNets and Temporal Models”, in CVPR ChaLearn Looking at People Workshop 2015, 2015.
50. Faster R-CNN for Instance Search
Amaia Salvador, Xavier Giró-i-Nieto, Ferran Marqués, Shin’ichi Satoh
DeepVision Workshop @ CVPR 2016
51. Faster R-CNN for Instance Search
Faster R-CNN
[Diagram: conv layers up to Conv5_3 feed both a Region Proposal Network (RPN proposals) and RoI pooling, followed by FC6, FC7 and FC8 producing class probabilities.]
Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
52. Faster R-CNN for Instance Search
Faster R-CNN
[Diagram: same architecture, highlighting the Conv5_3 activations as the image representation and the RoI-pooled FC features as the region representation.]
Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
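The RoI pooling step that turns arbitrary proposals into fixed-size region representations can be sketched in numpy; this is a simplified single-image version (integer coordinates, max-pooling to a small grid), not the actual Caffe layer.

```python
# Sketch of RoI max-pooling: crop a region from a conv feature map and
# max-pool it onto a fixed grid, so every proposal yields a tensor of
# the same size regardless of the proposal's shape.
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """feature_map: (C, H, W); roi: (x0, y0, x1, y1) in feature-map cells."""
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    c, h, w = region.shape
    pooled = np.zeros((c, out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # bin edges (rows)
    xs = np.linspace(0, w, out_size + 1).astype(int)  # bin edges (cols)
    for i in range(out_size):
        for j in range(out_size):
            pooled[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return pooled

fm = np.arange(8 * 8, dtype=float).reshape(1, 8, 8)
out = roi_pool(fm, (2, 2, 6, 6))  # every RoI becomes a (1, 2, 2) tensor
```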
53. Faster R-CNN for Instance Search
Fine-tuning (FT) with query-related data: adapt the object detector to the query instances, using the query images as training data.
[Diagram: conv layers, Region Proposal Network, RoI pooling over Conv5_3, FC6-FC8 and class probabilities.]
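Once each database image has a set of region descriptors, instance ranking reduces to a best-region match against the query descriptor; a sketch under that assumption, with cosine similarity and hypothetical function names:

```python
# Rank database images by their best-matching region against a query
# instance descriptor (cosine similarity on L2-normalized vectors).
import numpy as np

def rank_images(query_desc, image_regions):
    """image_regions: dict image_id -> (n_regions, d) array of descriptors."""
    q = query_desc / np.linalg.norm(query_desc)
    scores = {}
    for image_id, regions in image_regions.items():
        r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
        scores[image_id] = float((r @ q).max())  # best region match wins
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: image "b" contains a region very similar to the query.
q = np.array([1.0, 0.0, 0.0])
db = {"a": np.array([[0.0, 1.0, 0.0]]),
      "b": np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])}
order = rank_images(q, db)  # "b" ranks first
```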
59. Visual Sentiment Analysis
Campos, V., Salvador, A., Giro-i-Nieto, X., & Jou, B. (2015, October). Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction. In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia (pp. 57-62). ACM.
60. Visual Sentiment Analysis
[Chart: the VSO, Twitter and MVSO (†) datasets compared by size and by quality of their annotations.]
(†) B. Jou*, T. Chen*, N. Pappas*, M. Redi*, M. Topkara*, and S.-F. Chang. Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology. ACM Int'l Conference on Multimedia (MM), 2015.
91. Visual Saliency Prediction
Two different approaches:
· Shallow ConvNet: trained from scratch.
· Deep ConvNet: reuses the first 3 layers from the VGG-M network.
92. Visual Saliency Prediction
Shallow ConvNet trained from scratch (JuntingNet):
- 3 convolutional layers
- 64.4 million learnable parameters
- Norm-constraint regularization for the maxout layers
- Input images are resized to 96x96
93. Visual Saliency Prediction
Deep ConvNet initialized with VGG-M (SalNet):
- 3 convolutional layers with the pre-trained weights from the VGG-CNN-M network
- 25.8 million parameters
- Can handle images of any size, since it only consists of convolutional and pooling layers
101. Our deep experience
2014: Off-the-shelf ConvNets (as feature extractors)
2015: Trained & fine-tuned ConvNets
2016: Adversarial training
102. SalGAN: Visual Saliency Prediction with Adversarial Training
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel E. O’Connor, Xavier Giró-i-Nieto, Cristian Canton
107. Activity Recognition from Videos Using Deep Learning
Alberto Montes, Amaia Salvador, Santi Pascual, Xavier Giró-i-Nieto
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016, https://imatge-upc.github.io/activitynet-2016-cvprw/
113. Activity Recognition
3D convolutions over sets of 16 frames...
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.
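The key property of a 3D convolution is that its kernel spans time as well as space; a naive single-channel numpy sketch (not the actual C3D network, whose kernels and frame counts differ):

```python
# Naive 3D convolution (valid mode) over a clip of stacked grayscale
# frames: the kernel slides over time, height and width, so the output
# mixes information across neighboring frames (motion) as well as pixels.
import numpy as np

def conv3d_valid(clip, kernel):
    """clip: (T, H, W) frames; kernel: (t, h, w)."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = (clip[i:i + t, j:j + h, k:k + w] * kernel).sum()
    return out

clip = np.random.rand(16, 8, 8)              # a 16-frame clip, as on this slide
out = conv3d_valid(clip, np.ones((3, 3, 3)))  # temporal extent shrinks by t-1
```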
124. Visual Question-Answering
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
125. Visual Question-Answering: Types
Questions are posed over real images or abstract scenes, with open-ended or multiple-choice answers:
● Real images, open-ended. Q: Does it appear to be rainy? A: no
● Abstract scenes, open-ended. Q: What is just under the tree? A: a ball
● Real images, multi-choice. Q: How many slices of pizza are there? A: 1, 2, 3, 4
● Abstract scenes, multi-choice. Q: What is for dessert? A: cake, ice cream, cheesecake, pie
136. Introduction
We present a method for performing hierarchical object detection in images, guided by a deep reinforcement learning agent.
[Animation: the agent zooms hierarchically into the image until OBJECT FOUND.]
140. Related Work: Deep Reinforcement Learning
ATARI 2600: Mnih, V. (2013). Playing Atari with deep reinforcement learning.
AlphaGo: Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
141. Related Work: Object Detection
● Region proposals / sliding window + detector: Uijlings, J. R. (2013). Selective search for object recognition.
● Sharing convolutions over locations + detector: Girshick, R. (2015). Fast R-CNN.
● Sharing convolutions over locations and also the detector: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN.
● Single-shot detectors: Redmon, J. (2015). YOLO; Liu, W. (2015). SSD.
142. Related Work: Object Detection
● Region proposals / sliding window + detector, and sharing convolutions over locations + detector: they rely on a large number of locations. (Uijlings, J. R. (2013). Selective search for object recognition; Girshick, R. (2015). Fast R-CNN)
● Sharing convolutions over locations and also the detector, and single-shot detectors: they rely on a number of reference boxes from which bounding boxes are regressed. (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN; Redmon, J. (2015). YOLO; Liu, W. (2015). SSD)
143. Reinforcement Learning Formulation
Hierarchies of regions: at every step the agent either descends into a sub-region of the current box or triggers the detection. For the first kind of hierarchy, fewer steps are required to reach a certain scale of bounding boxes, but the space of possible regions is smaller.
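One plausible way to realize such a hierarchy (an assumption for illustration, matching the 6 Q-values on slide 145: five descend actions plus a trigger) is to split the current box into four quadrants plus a centered sub-region:

```python
# Hypothetical region hierarchy: each step descends into one of five
# sub-regions of the current box (four quadrants + a centred box); a
# sixth "trigger" action would stop the search and emit the detection.
def subregions(box):
    """box: (x0, y0, x1, y1). Returns the five candidate child regions."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2          # box centre
    qw, qh = (x1 - x0) / 4, (y1 - y0) / 4          # half-size of centred child
    return [
        (x0, y0, mx, my),                          # top-left quadrant
        (mx, y0, x1, my),                          # top-right quadrant
        (x0, my, mx, y1),                          # bottom-left quadrant
        (mx, my, x1, y1),                          # bottom-right quadrant
        (mx - qw, my - qh, mx + qw, my + qh),      # centred sub-region
    ]

children = subregions((0, 0, 100, 100))  # each child covers 1/4 of the area
```

Each descent quarters the area, which is why few steps suffice to reach small scales while the set of reachable boxes stays limited.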
144. Model
We tested two different configurations of feature extraction:
● Image-Zooms model: we extract features for every region observed.
● Pool45-Crops model: we extract features once for the whole image, and ROI-pool features for each sub-region.
145. Model
Our RL agent is based on a Q-network. The input is:
● Visual description
● History vector
The output is:
● An FC layer of 6 neurons, indicating the Q-values for each action
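Given those 6 Q-values, the agent's next move is just an argmax (with some exploration during training); a sketch assuming a standard epsilon-greedy policy, which the slide does not specify:

```python
# Epsilon-greedy action selection over the Q-network's 6 outputs
# (assumed here: 5 region movements + 1 trigger action).
import numpy as np

def select_action(q_values, epsilon=0.1, rng=None):
    """q_values: length-6 array of Q-values, one per action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best Q-value

action = select_action(np.array([0.1, 0.9, 0.2, 0.0, 0.3, 0.4]), epsilon=0.0)
```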
146. Visualizations
These results were obtained with the Image-Zooms model, which yielded better results. We observe that the model approaches the object, but the final bounding box is not accurate.
147. Experiments
We compute an upper bound and a baseline experiment with the hierarchies, and observe that both are very limited in terms of recall. The Image-Zooms model achieves a better precision-recall curve.
148. Experiments
Most of our agent's searches for objects finish within just 1, 2 or 3 steps, so the agent requires very few steps to approach objects.
149. Thank you!
Happy to learn from your feedback, as well as to explore opportunities for joint research.
xavier.giro@upc.edu
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
facebook.com/DocXavi
Slides available