This post contains my learning notes on this paper:
Rich feature hierarchies for accurate object detection and semantic segmentation
R-CNN stands for Region-based Convolutional Neural Networks
But I will only summarize it from the computer vision perspective, i.e., how to build an object detection system.
(The approach involves deep learning: bottom-up region proposals combined with convolutional neural networks.)
In the paper, a simple and scalable detection algorithm is proposed. Two key points:
1. localizing objects with a deep network;
2. training a high-capacity model with only a small quantity of annotated detection data.
The approach works well for both object detection and semantic segmentation.
Object detection with R-CNN
1. Input image: nothing special here; any image can be used as input.
2. Extract regions: selective search in "fast mode" generates around 2000 category-independent region proposals (J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013).
Extract a 4096-dimensional feature vector from each region proposal.
* Warped region: each proposal is converted to a fixed 227 × 227 pixel RGB image. The simplest way is to warp all pixels in a tight bounding box around the proposal to the required size, after dilating the box by p = 16 pixels of context. Since region proposals are generally not standard squares, this object proposal transformation is needed (see the warping sketch after this list).
The amount of context padding (p above) is defined as a border around the original object proposal (like a frame).
3. Compute CNN features: forward propagate each warped region through five convolutional layers and two fully connected layers (the architecture described by Krizhevsky et al., i.e., AlexNet).
4. Classify regions: score each extracted feature vector using the SVM trained for that class, then apply greedy non-maximum suppression per class: a region is rejected if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold (see the NMS sketch after this list).
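A minimal sketch of the warp-with-context transformation described in step 2, assuming NumPy and OpenCV are available. The paper defines the p = 16 pixels of context in the warped coordinate frame; here it is approximated by dilating the box in the original image before resizing.

```python
import cv2  # assumption: OpenCV is used for resizing

def warp_proposal(image, box, out_size=227, padding=16):
    """Crop a region proposal and warp it to a fixed-size square CNN input.

    `box` is (x1, y1, x2, y2) in integer pixel coordinates.
    Simplified sketch: the p = 16 context pixels are approximated by
    dilating the tight bounding box in the original image.
    """
    x1, y1, x2, y2 = box
    # Scale the padding from output coordinates back to image coordinates.
    pad_x = int(round(padding * (x2 - x1) / out_size))
    pad_y = int(round(padding * (y2 - y1) / out_size))
    h, w = image.shape[:2]
    x1, y1 = max(0, x1 - pad_x), max(0, y1 - pad_y)
    x2, y2 = min(w, x2 + pad_x), min(h, y2 + pad_y)
    crop = image[y1:y2, x1:x2]
    # Anisotropic ("warped") resize to the CNN input size, ignoring aspect ratio.
    return cv2.resize(crop, (out_size, out_size))
```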
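And a sketch of the greedy non-maximum suppression used in step 4. The 0.3 IoU threshold is only an illustrative default; the paper uses a learned per-class threshold.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Greedy per-class NMS: keep the highest-scoring box, drop any box
    whose IoU with an already-kept box exceeds the threshold.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns the indices of the kept boxes.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only candidates that do not overlap too much with the kept box.
        order = order[1:][iou <= iou_threshold]
    return keep
```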
Why it is efficient:
1. All CNN parameters are shared across all categories (an inherent characteristic of CNNs).
2. The feature vectors computed by the CNN are low-dimensional compared with other common approaches, so there are fewer features to handle per region.
Run-time characteristics
Compute region proposals and features (category-agnostic): 13 s/image on a GPU, or 53 s/image on a CPU.
Class-specific computation: the feature matrix is typically 2000 × 4096 and the SVM weight matrix is 4096 × N, where N is the number of classes, so all class-specific work reduces to a single matrix product.
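For illustration, here is that single matrix product with random placeholder data. The shapes come from the paper; N = 20 classes, the random features, weights, and biases are assumptions purely for the sketch.

```python
import numpy as np

# Illustrative shapes: ~2000 proposals per image, 4096-D features, N classes.
num_proposals, feat_dim, num_classes = 2000, 4096, 20

features = np.random.randn(num_proposals, feat_dim)   # placeholder CNN features
svm_weights = np.random.randn(feat_dim, num_classes)  # one linear SVM per class
svm_biases = np.random.randn(num_classes)

# All class-specific computation is one matrix product:
# (2000 x 4096) @ (4096 x N) -> (2000 x N) score matrix.
scores = features @ svm_weights + svm_biases
print(scores.shape)  # (2000, 20)
```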
Training:
1. Supervised pre-training: pre-train the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations only (i.e., no bounding-box labels), using the Caffe CNN library.
2. Domain-specific fine-tuning: continue SGD training of the CNN parameters using only warped region proposals. SGD starts at a learning rate of 0.001 (1/10th of the initial pre-training rate). Each SGD iteration uses a mini-batch of 128 windows (32 positive windows, 96 background windows); see the mini-batch sampling sketch after this list.
3. Object category classifiers: carefully choose the IoU overlap threshold that separates positives from negatives (region proposals overlapping an object vs. background). Then optimize one linear SVM per class using the standard hard negative mining method (see the hard-negative-mining sketch after this list).
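A rough sketch of the fine-tuning mini-batch sampling from point 2. The ≥ 0.5 IoU cutoff for positives follows the paper; the proposal data layout (a list of dicts with an 'iou' field) is a hypothetical choice for illustration.

```python
import random

def sample_minibatch(proposals, batch_size=128, num_pos=32):
    """Sample a fine-tuning mini-batch: 32 positive and 96 background windows.

    `proposals` is assumed to be a list of dicts, each with an 'iou' field
    holding that warped proposal's max IoU with any ground-truth box.
    """
    positives = [p for p in proposals if p["iou"] >= 0.5]   # >= 0.5 IoU -> positive
    background = [p for p in proposals if p["iou"] < 0.5]   # rest -> background
    num_bg = batch_size - num_pos
    batch = random.sample(positives, min(num_pos, len(positives))) + \
            random.sample(background, min(num_bg, len(background)))
    random.shuffle(batch)
    return batch
```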
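And a sketch of training one per-class linear SVM with hard negative mining from point 3. scikit-learn's LinearSVC stands in for the paper's SVM solver; the C value, round count, and initial negative subset size are illustrative assumptions, and duplicate hard negatives are not removed for simplicity.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: scikit-learn as the SVM solver

def train_with_hard_negative_mining(pos_feats, neg_feats, rounds=3):
    """Train one linear SVM for a single class with simple hard negative mining:
    start from a random subset of negatives, then repeatedly add negatives the
    current model still scores as positive and retrain. A minimal sketch, not
    the paper's exact procedure.
    """
    rng = np.random.default_rng(0)
    subset = rng.choice(len(neg_feats), size=min(1000, len(neg_feats)), replace=False)
    active_neg = neg_feats[subset]
    svm = LinearSVC(C=0.001)  # C value is illustrative
    for _ in range(rounds):
        X = np.vstack([pos_feats, active_neg])
        y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(active_neg))])
        svm.fit(X, y)
        # Hard negatives: negatives the current SVM still classifies as positive.
        hard = neg_feats[svm.decision_function(neg_feats) > 0]
        if len(hard) == 0:
            break
        active_neg = np.vstack([active_neg, hard])
    return svm
```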
Conclusion
Performance is achieved through:
1. applying high-capacity CNNs to bottom-up region proposals in order to localize and segment objects;
2. a paradigm for training large CNNs when labeled training data is scarce.
Fast R-CNN:
http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf
http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Girshick_Fast_R-CNN_ICCV_2015_paper.pdf
http://arxiv.org/pdf/1506.01497v3.pdf