To automate the detection of POST displays, we use a deep learning algorithm known as Faster R-CNN \cite{Ren_2017}. Our implementation is adopted from an existing GitHub repository by endernewton \cite{gupta2017}.
Our main metric for evaluating performance is mean average precision (mAP): the average precision is computed for each class we predict and then averaged over all classes. The mAP value depends on our choice of parameters, such as the number of iterations, the anchor sizes, and the classification threshold. Anchors are reference boxes of predefined sizes and shapes that are placed across an image as candidate object regions. Multiple anchors are generated for each image and ranked according to how likely they are to contain a single object. An iteration is one training update on a batch of images, so the number of iterations controls how long the model trains. The higher the number of iterations, the more opportunities the model has to reduce the error between its output and the correct classification of a sign. The classification threshold is the minimum predicted probability at which we classify a detection as a tobacco advertisement.

The initial training data comes from last year's research group. In total, there are 300 different images containing tobacco advertisements. These 300 images were augmented to 9331 images through color balance adjustments, contrast modifications, and added random pixel noise. We tried several ways of selecting the training and validation sets. For example, a random split of the 9331 images into 70\% training and 30\% validation proved to be inappropriate: after augmentation the images were too similar, so the training and validation sets overlapped considerably. This was established with cosine similarity and hashing, and it occurred because the changes were made by drawing parameters from a random distribution with X and X parameter.

Given the limited initial data, choosing a pre-trained model is crucial. It is also a typical research approach that takes advantage of previously established work. We used a pre-trained ResNet-101 \cite{He2016}, a 101-layer residual neural network.
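
For reference, the mAP metric described above can be written out explicitly. The sketch below uses one common definition, in which the average precision of a class is the area under its precision-recall curve; the exact interpolation used by the evaluation code is not stated here, so this form is an assumption. The symbols $TP$, $FP$, and $FN$ (true positives, false positives, and false negatives at a given classification threshold), $p_c(r)$ (precision of class $c$ as a function of recall $r$), and $C$ (number of classes) are introduced only for illustration.
\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}
\]
\[
\mathrm{AP}_c = \int_0^1 p_c(r)\, dr, \qquad \mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c
\]
With a single foreground class, mAP reduces to the AP of that class. Raising the classification threshold typically trades recall for precision, which is why the threshold affects the reported values.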
In addition to preparing the training and validation data, we ran three training attempts of 70K iterations each, using the anchor scale sets [4, 8, 16], [8, 16, 32], and [4, 8, 16, 32]. Due to the limited number of tobacco advertisements detected in images from Google Street View in comparison to object detection standards, the [4, 8, 16] anchor scales gave the best mean average precision and recall. To evaluate the models, we also plotted precision-recall and ROC curves.
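
To make the anchor scales concrete, the following sketch shows how a scale and an aspect ratio translate into a box size in pixels. It assumes the convention of common Faster R-CNN implementations, with a base size $b = 16$ pixels (the feature stride) and aspect ratios $r = h/w \in \{0.5, 1, 2\}$; neither value is stated in this section, so both are assumptions, and the symbols $b$, $s$, $r$, $w$, and $h$ are introduced only for illustration.
\[
w = \frac{b\, s}{\sqrt{r}}, \qquad h = b\, s\, \sqrt{r}, \qquad w\, h = (b\, s)^2
\]
Under these assumptions, a square anchor ($r = 1$) at scale $s = 4$ measures $64 \times 64$ pixels, while scales $8$, $16$, and $32$ give sides of $128$, $256$, and $512$ pixels. Adding scale $4$ therefore lets the anchors cover smaller objects.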
** Add definition of Region Proposal Network (RPN) and how it applies to our project **