https://imatge.upc.edu/web/people/xavier-giro
These slides provide an overview of our research group at UPC, which has been applying deep learning to computer vision since 2014. We are one of the pioneering research groups in Europe and, despite the youth of most of our members, we have already contributed to the community with a diverse range of publications and software at top scientific venues.
Deep and Young Vision Learning at UPC BarcelonaTech
1. Deep and Young Vision Learning at UPC BarcelonaTech
Research Group presentation for NIPS 2016
Xavier Giro-i-Nieto
2. How to find us at NIPS 2016?
Women in Machine Learning WS: Monday 5, 1:30pm to 3:30pm @ Area 5+6+7+8, level P0
● Míriam Bellver et al., “Efficient search of objects in images using deep reinforcement learning”
● Dèlia Fernàndez et al., “Is a “happy dog” image more “happy” than “dog”? - Analyzing Adjective and Noun Visual Contributions”
Deep Reinforcement Learning WS: Friday 9, 2:30pm to 3:30pm & 5:30pm to 6:30pm @ Area 1
● Míriam Bellver, Xavier Giro, Ferran Marqués and Jordi Torres, “Hierarchical Object Detection with Deep Reinforcement Learning”
Large Scale Computer Vision Systems WS: Saturday 10, 3:00pm to 4:00pm @ Room 111
● Alberto Montes, Amaia Salvador, Santiago Pascual and Xavier Giro, “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”
18. Visual Instance Search
Local hand-crafted features (e.g. SIFT) are quantized into a Bag of Visual Words: each image becomes a vector in an N-dimensional feature space, high-dimensional and highly sparse.
v1 = (v11, …, v1n)
…
vk = (vk1, …, vkn)
INVERTED FILE
word | Image IDs
1 | 1, 12, …
2 | 1, 30, 102
3 | 10, 12
4 | 2, 3
6 | 10
…
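The inverted-file lookup on this slide can be sketched in a few lines of Python; the word-to-image assignments below are a toy example, not the slide's actual data.

```python
# Minimal sketch of Bag-of-Visual-Words retrieval with an inverted file.
# Each image is a set of visual-word IDs; the inverted file maps
# visual word -> images containing it, so queries touch only the
# (sparse) words they contain.
from collections import defaultdict

def build_inverted_file(image_words):
    """image_words: dict image_id -> iterable of visual-word IDs."""
    inverted = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            inverted[w].add(image_id)
    return inverted

def query(inverted, query_words):
    """Rank images by how many query words they share (simple overlap score)."""
    scores = defaultdict(int)
    for w in query_words:
        for image_id in inverted.get(w, ()):
            scores[image_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical word assignments for a toy index.
index = build_inverted_file({1: [1, 2], 12: [1, 3], 30: [2], 10: [3, 6]})
results = query(index, [1, 3])  # image 12 contains both query words, so it ranks first
```

Real systems replace the overlap score with tf-idf weighting, but the sparse-index structure is the same.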
19. Image Representations
Convolutional Neural Networks
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
20. Image Representations
Convolutional Neural Networks: FC layers as global feature representation
Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval. In ECCV 2014.
Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: an astounding baseline for recognition. In DeepVision CVPRW 2014.
21. Image Representations
Convolutional Neural Networks: sum/max-pooled conv features as global representation
Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. ICCV 2015.
Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max-pooling of CNN activations. ICLR 2016.
Kalantidis, Y., Mellina, C., & Osindero, S. (2015). Cross-dimensional weighting for aggregated deep convolutional features. arXiv preprint arXiv:1512.04065.
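The pooled-feature representation above reduces to a few numpy operations; the tensor shape here is illustrative (e.g. a conv5-style activation map), not tied to any one network.

```python
# Sketch: turn a conv feature tensor (C x H x W) into a global image
# descriptor by sum- or max-pooling over spatial locations, then
# L2-normalizing, in the spirit of the pooled representations cited above.
import numpy as np

def global_descriptor(feature_map, pooling="sum"):
    if pooling == "sum":
        pooled = feature_map.sum(axis=(1, 2))
    else:
        pooled = feature_map.max(axis=(1, 2))
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

feats = np.random.rand(512, 14, 14)  # hypothetical conv activations
desc = global_descriptor(feats)      # one 512-D vector per image
```

Images can then be compared with a plain dot product, since the descriptors are unit-norm.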
22. Image Representations
Convolutional Neural Networks: conv features encoded with VLAD as global representation
Ng, J., Yang, F., & Davis, L. (2015). Exploiting local features from deep networks for image retrieval. In DeepVision CVPRW 2015.
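VLAD encoding of local conv features can be sketched as follows; the descriptor and codebook sizes are arbitrary choices for illustration, and the per-cluster intra-normalization used in some variants is omitted.

```python
# Sketch of VLAD: assign each local descriptor to its nearest codebook
# centroid and accumulate the residuals, giving one fixed-size vector
# per image regardless of how many local features it has.
import numpy as np

def vlad(descriptors, centroids):
    """descriptors: (n, d) local features; centroids: (k, d) visual words."""
    k, d = centroids.shape
    # nearest centroid for every descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    enc = np.zeros((k, d))
    for i, c in enumerate(assign):
        enc[c] += descriptors[i] - centroids[c]  # residual accumulation
    enc = enc.ravel()
    norm = np.linalg.norm(enc)
    return enc / norm if norm > 0 else enc

rng = np.random.default_rng(0)
code = vlad(rng.standard_normal((100, 64)), rng.standard_normal((8, 64)))  # 8*64 = 512-D
```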
32. Egocentric vision
M. Bolaños, Mestre, R., Talavera, E., Giró-i-Nieto, X., and Radeva, P., “Visual Summary of Egocentric Photostreams by Representative Keyframes”, WEsAX workshop at ICME 2015, Turin, Italy, 2015.
● Clustering based on low-level features from a ConvNet’s fully connected layer.
33. Egocentric vision
Lidon, Aniol, Marc Bolaños, Mariella Dimiccoli, Petia Radeva, Maite Garolera, and Xavier Giró-i-Nieto. "Semantic Summarization of Egocentric Photo Stream Events." arXiv preprint arXiv:1511.00438 (2015).
● Clustering based on semantic detectors and diversity.
34. Wearable Cameras for Visual Memory (objects)
Cristian Reyes, Eva Mohedano, Noel E. O’Connor, Xavier Giró-i-Nieto, Kevin McGuinness
https://imatge-upc.github.io/retrieval-2016-lostobject/
C. Reyes, Mohedano, E., McGuinness, K., O'Connor, N. E., and Giró-i-Nieto, X., “Where is my Phone? Personal Object Retrieval from Egocentric Images”, in Lifelogging Tools and Applications Workshop in ACM Multimedia, Amsterdam, The Netherlands, 2016.
37. Egocentric vision: Objects
● Lost & found objects in visual diaries.
C. Reyes, Mohedano, E., McGuinness, K., O'Connor, N. E., and Giró-i-Nieto, X., “Where is my Phone? Personal Object Retrieval from Egocentric Images”, in Lifelogging Tools and Applications Workshop in ACM Multimedia, Amsterdam, The Netherlands, 2016.
39. Visual Ranking - Descriptors
Bag of Words
Eva Mohedano, Amaia Salvador, Kevin McGuinness, Ferran Marques, Noel E. O’Connor, and Xavier Giro-i-Nieto. Bags of local convolutional features for scalable instance search. In Proceedings of the ACM International Conference on Multimedia. ACM, 2016.
44. Egocentric vision: Objects
Comparison of the 4 best configurations according to their performance in terms of MRR:

Query         | Day Images    | Threshold | Temporal ordering | MRR
Full images   | Saliency Mask | TVSS      | Diversity         | 0.283
Full images   | Saliency Mask | TVSS      | Timestamp         | 0.274
Weighted Mask | Full image    | TVSS      | Diversity         | 0.269
Weighted Mask | Weighted Mask | TVSS      | Diversity         | 0.258
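The MRR column above is the standard Mean Reciprocal Rank; a minimal sketch of the metric, with hypothetical ranks rather than the table's actual evaluation data:

```python
# Mean Reciprocal Rank: for each query, take 1/rank of the first
# relevant result, then average over queries.
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first relevant item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. three queries whose first relevant result appears at ranks 1, 2 and 4
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```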
46. Fine-Tuning a Convolutional Network for Cultural Event Recognition
AUTHOR: Andrea Calafell
ADVISORS: Xavier Giró-i-Nieto, Amaia Salvador
A. Salvador, Zeppelzauer, M., Manchon-Vizuete, D., Calafell-Orós, A., and Giró-i-Nieto, X., “Cultural Event Recognition with Visual ConvNets and Temporal Models”, in CVPR ChaLearn Looking at People Workshop 2015, 2015.
50. Faster R-CNN for Instance Search
Amaia Salvador, Xavier Giró-i-Nieto, Ferran Marqués, Shin’ichi Satoh
DeepVision Workshop @ CVPR 2016
51. Faster R-CNN for Instance Search
Faster R-CNN
[Diagram: conv layers up to Conv5_3 feed both a Region Proposal Network (RPN proposals) and RoI pooling, followed by FC6, FC7 and FC8 producing class probabilities.]
Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
52. Faster R-CNN for Instance Search
Faster R-CNN
[Diagram: same architecture, highlighting the Conv5_3 activations as the image representation and the RoI-pooled FC features as the region representation.]
Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015.
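The RoI pooling step that turns arbitrary proposals into fixed-size region representations can be sketched in numpy; this is a simplified single-image version (integer coordinates, max-pooling to a small grid), not the actual Caffe layer.

```python
# Sketch of RoI max-pooling: crop a region from a conv feature map and
# max-pool it onto a fixed grid, so every proposal yields a tensor of
# the same size regardless of the proposal's shape.
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """feature_map: (C, H, W); roi: (x0, y0, x1, y1) in feature-map cells."""
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    c, h, w = region.shape
    pooled = np.zeros((c, out_size, out_size))
    ys = np.linspace(0, h, out_size + 1).astype(int)  # bin edges (rows)
    xs = np.linspace(0, w, out_size + 1).astype(int)  # bin edges (cols)
    for i in range(out_size):
        for j in range(out_size):
            pooled[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return pooled

fm = np.arange(8 * 8, dtype=float).reshape(1, 8, 8)
out = roi_pool(fm, (2, 2, 6, 6))  # every RoI becomes a (1, 2, 2) tensor
```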
53. Faster R-CNN for Instance Search
Fine-tuning (FT) with query-related data: adapt the object detector to the query instances, using the query images as training data.
[Diagram: conv layers, Region Proposal Network, RoI pooling over Conv5_3, FC6-FC8 and class probabilities.]
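Once each database image has a set of region descriptors, instance ranking reduces to a best-region match against the query descriptor; a sketch under that assumption, with cosine similarity and hypothetical function names:

```python
# Rank database images by their best-matching region against a query
# instance descriptor (cosine similarity on L2-normalized vectors).
import numpy as np

def rank_images(query_desc, image_regions):
    """image_regions: dict image_id -> (n_regions, d) array of descriptors."""
    q = query_desc / np.linalg.norm(query_desc)
    scores = {}
    for image_id, regions in image_regions.items():
        r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
        scores[image_id] = float((r @ q).max())  # best region match wins
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: image "b" contains a region very similar to the query.
q = np.array([1.0, 0.0, 0.0])
db = {"a": np.array([[0.0, 1.0, 0.0]]),
      "b": np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])}
order = rank_images(q, db)  # "b" ranks first
```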
59. Visual Sentiment Analysis
Campos, V., Salvador, A., Giro-i-Nieto, X., & Jou, B. (2015, October). Diving Deep into Sentiment: Understanding Fine-tuned CNNs for Visual Sentiment Prediction. In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia (pp. 57-62). ACM.
60. Visual Sentiment Analysis
[Chart: the VSO, Twitter and MVSO (†) datasets compared by size and by quality of their annotations.]
(†) B. Jou*, T. Chen*, N. Pappas*, M. Redi*, M. Topkara*, and S.-F. Chang. Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology. ACM Int'l Conference on Multimedia (MM), 2015.
91. Visual Saliency Prediction
Two different approaches:
· Shallow ConvNet: trained from scratch.
· Deep ConvNet: reuses the first 3 layers from the VGG-M network.
92. Visual Saliency Prediction
Shallow ConvNet trained from scratch (JuntingNet):
- 3 convolutional layers
- 64.4 million learnable parameters
- Norm-constraint regularization for the maxout layers
- Input images are resized to 96x96
93. Visual Saliency Prediction
Deep ConvNet initialized with VGG-M (SalNet):
- 3 convolutional layers with the pre-trained weights from the VGG-CNN-M network
- 25.8 million parameters
- Can handle images of any size, since it only consists of convolutional and pooling layers
101. Our deep experience
2014: Off-the-shelf ConvNets (as feature extractors)
2015: Trained & fine-tuned ConvNets
2016: Adversarial training
102. SalGAN: Visual Saliency Prediction with Adversarial Training
Junting Pan, Kevin McGuinness, Elisa Sayrol, Noel E. O’Connor, Xavier Giró-i-Nieto, Cristian Canton
107. Activity Recognition from Videos Using Deep Learning
Alberto Montes, Amaia Salvador, Santi Pascual, Xavier Giró-i-Nieto
A. Montes, Salvador, A., Pascual-deLaPuente, S., and Giró-i-Nieto, X., “Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks”, in 1st NIPS Workshop on Large Scale Computer Vision Systems 2016, https://imatge-upc.github.io/activitynet-2016-cvprw/
113. Activity Recognition
3D convolutions over sets of 16 frames...
Figure: Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." ICCV 2015.
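The key property of a 3D convolution is that its kernel spans time as well as space; a naive single-channel numpy sketch (not the actual C3D network, whose kernels and frame counts differ):

```python
# Naive 3D convolution (valid mode) over a clip of stacked grayscale
# frames: the kernel slides over time, height and width, so the output
# mixes information across neighboring frames (motion) as well as pixels.
import numpy as np

def conv3d_valid(clip, kernel):
    """clip: (T, H, W) frames; kernel: (t, h, w)."""
    T, H, W = clip.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = (clip[i:i + t, j:j + h, k:k + w] * kernel).sum()
    return out

clip = np.random.rand(16, 8, 8)              # a 16-frame clip, as on this slide
out = conv3d_valid(clip, np.ones((3, 3, 3)))  # temporal extent shrinks by t-1
```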
124. Visual Question-Answering
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
125. Visual Question-Answering: Types
Questions are posed over real images or abstract scenes, with open-ended or multiple-choice answers:
● Real images, open-ended. Q: Does it appear to be rainy? A: no
● Abstract scenes, open-ended. Q: What is just under the tree? A: a ball
● Real images, multi-choice. Q: How many slices of pizza are there? A: 1, 2, 3, 4
● Abstract scenes, multi-choice. Q: What is for dessert? A: cake, ice cream, cheesecake, pie
136. Introduction
We present a method for performing hierarchical object detection in images, guided by a deep reinforcement learning agent.
[Animation: the agent zooms hierarchically into the image until OBJECT FOUND.]
140. Related Work: Deep Reinforcement Learning
ATARI 2600: Mnih, V. (2013). Playing Atari with deep reinforcement learning.
AlphaGo: Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
141. Related Work: Object Detection
● Region proposals / sliding window + detector: Uijlings, J. R. (2013). Selective search for object recognition.
● Sharing convolutions over locations + detector: Girshick, R. (2015). Fast R-CNN.
● Sharing convolutions over locations and also the detector: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN.
● Single-shot detectors: Redmon, J. (2015). YOLO; Liu, W. (2015). SSD.
142. Related Work: Object Detection
● Region proposals / sliding window + detector, and sharing convolutions over locations + detector: they rely on a large number of locations. (Uijlings, J. R. (2013). Selective search for object recognition; Girshick, R. (2015). Fast R-CNN)
● Sharing convolutions over locations and also the detector, and single-shot detectors: they rely on a number of reference boxes from which bounding boxes are regressed. (Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN; Redmon, J. (2015). YOLO; Liu, W. (2015). SSD)
143. Reinforcement Learning Formulation
Hierarchies of regions: at every step the agent either descends into a sub-region of the current box or triggers the detection. For the first kind of hierarchy, fewer steps are required to reach a certain scale of bounding boxes, but the space of possible regions is smaller.
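One plausible way to realize such a hierarchy (an assumption for illustration, matching the 6 Q-values on slide 145: five descend actions plus a trigger) is to split the current box into four quadrants plus a centered sub-region:

```python
# Hypothetical region hierarchy: each step descends into one of five
# sub-regions of the current box (four quadrants + a centred box); a
# sixth "trigger" action would stop the search and emit the detection.
def subregions(box):
    """box: (x0, y0, x1, y1). Returns the five candidate child regions."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2          # box centre
    qw, qh = (x1 - x0) / 4, (y1 - y0) / 4          # half-size of centred child
    return [
        (x0, y0, mx, my),                          # top-left quadrant
        (mx, y0, x1, my),                          # top-right quadrant
        (x0, my, mx, y1),                          # bottom-left quadrant
        (mx, my, x1, y1),                          # bottom-right quadrant
        (mx - qw, my - qh, mx + qw, my + qh),      # centred sub-region
    ]

children = subregions((0, 0, 100, 100))  # each child covers 1/4 of the area
```

Each descent quarters the area, which is why few steps suffice to reach small scales while the set of reachable boxes stays limited.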
144. Model
We tested two different configurations of feature extraction:
● Image-Zooms model: we extract features for every region observed.
● Pool45-Crops model: we extract features once for the whole image, and ROI-pool features for each sub-region.
145. Model
Our RL agent is based on a Q-network. The input is:
● Visual description
● History vector
The output is:
● An FC layer of 6 neurons, indicating the Q-values for each action
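Given those 6 Q-values, the agent's next move is just an argmax (with some exploration during training); a sketch assuming a standard epsilon-greedy policy, which the slide does not specify:

```python
# Epsilon-greedy action selection over the Q-network's 6 outputs
# (assumed here: 5 region movements + 1 trigger action).
import numpy as np

def select_action(q_values, epsilon=0.1, rng=None):
    """q_values: length-6 array of Q-values, one per action."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best Q-value

action = select_action(np.array([0.1, 0.9, 0.2, 0.0, 0.3, 0.4]), epsilon=0.0)
```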
146. Visualizations
These results were obtained with the Image-Zooms model, which yielded better results. We observe that the model approaches the object, but the final bounding box is not accurate.
147. Experiments
We compute an upper bound and a baseline experiment with the hierarchies, and observe that both are very limited in terms of recall. The Image-Zooms model achieves a better precision-recall curve.
148. Experiments
Most of our agent's searches for objects finish within just 1, 2 or 3 steps, so the agent requires very few steps to approach objects.
149. Thank you!
Happy to learn from your feedback, as well as to explore opportunities for joint research.
xavier.giro@upc.edu
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
facebook.com/DocXavi
Slides available