Image classification in Galaxy with fruit 360 dataset
Deep Learning (Part 3) - Convolutional neural networks (CNN)
Published: Dec 1, 2021
What is a convolutional neural network (CNN)?
Convolutional Neural Network (CNN)
- Increasing popularity of social media in past decade
- Image and video processing tasks have become very important
- FNN could not scale up to image and video processing tasks
- CNN specifically tailored for image and video processing tasks
Inspiration for CNN
- In 1959 Hubel & Wiesel did an experiment to understand how visual cortex of brain processes visual info
- Recorded activity of neurons in visual cortex of a cat
- While moving a bright line in front of the cat
- Some cells fired when bright line is shown at a particular angle/location
- Called these simple cells
- Other cells fired when bright line was shown regardless of angle/location
- Seemed to detect movement
- Called these complex cells
- Complex cells seemed to receive inputs from multiple simple cells
- Have a hierarchical structure
- Hubel and Wiesel won Nobel Prize in 1981
- Inspired by complex/simple cells, Fukushima proposed Neocognitron (1980)
- Hierarchical neural network used for handwritten Japanese character recognition
- First CNN, had its own training algorithm
- In 1989, LeCun proposed CNN that was trained by backpropagation
- CNN became popular when it outperformed other models at ImageNet Challenge
- Competition in object classification/detection
- On hundreds of object categories and millions of images
- Ran annually from 2010 to 2017
- Notable CNN architectures that won ImageNet challenge
- AlexNet (2012), ZFNet (2013), GoogLeNet & VGG (2014), ResNet (2015)
Architecture of CNN
- A typical CNN has 4 types of layers
- Input layer
- Convolution layer
- Pooling layer
- Fully connected layer
- We will explain a 2D CNN here
- Same concepts apply to a 1 (or 3) dimensional CNN
Input layer
- Example input: a 28 pixel by 28 pixel grayscale image
- Unlike FNN, we do not “flatten” the input to a 1D vector
- Input is presented to network in 2D as a 28 x 28 matrix
- This makes capturing spatial relationships easier
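To see why keeping the 2D layout helps, here is a minimal NumPy sketch. The pixel values and the chosen coordinates are made up for illustration. Two vertically adjacent pixels sit next to each other in the 28 x 28 matrix, but end up 28 positions apart once the image is flattened.

```python
import numpy as np

# Made-up 28 x 28 "image" whose pixel values are just their flat positions
image = np.arange(28 * 28).reshape(28, 28)

# Flattening (what an FNN would need) turns it into a 784-value vector
flat = image.reshape(-1)

# Pixel (10, 10) and the pixel directly below it, (11, 10)
index_a = 10 * 28 + 10
index_b = 11 * 28 + 10

print(flat.shape)         # (784,)
print(index_b - index_a)  # 28: neighbours in 2D are far apart in 1D
```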
Convolution layer
- Composed of multiple filters (kernels)
- Filters for 2D image are also 2D
- Suppose we have a 3 by 3 filter (9 values in total)
- Values are randomly set to 0 or 1
- Convolution: placing 3 by 3 filter on the top left corner of image
- Multiply filter values by pixel values, add the results
- Move filter to right one pixel at a time, and repeat this process
- When at top right corner, move filter down one pixel and repeat process
- Process ends when we get to bottom right corner of image
3 by 3 Filter
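The sliding procedure above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the implementation any particular library uses: the function name `convolve2d`, the 4 by 4 image, and the filter values are all made up. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products."""
    image_rows, image_cols = image.shape
    kernel_rows, kernel_cols = kernel.shape
    out_rows = image_rows - kernel_rows + 1
    out_cols = image_cols - kernel_cols + 1
    output = np.zeros((out_rows, out_cols))
    for row in range(out_rows):        # move filter down one pixel at a time
        for col in range(out_cols):    # move filter right one pixel at a time
            patch = image[row:row + kernel_rows, col:col + kernel_cols]
            # Multiply filter values by pixel values, add the results
            output[row, col] = np.sum(patch * kernel)
    return output

image = np.arange(16).reshape(4, 4)    # made-up 4 by 4 image with values 0..15
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])         # made-up 3 by 3 filter of 0s and 1s

result = convolve2d(image, kernel)
print(result.shape)  # (2, 2)
```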
Convolution operator parameters
- Filter size
- Padding
- Stride
- Dilation
- Activation function
Filter size
- Filter size can be 5 by 5, 3 by 3, and so on
- Larger filter sizes should be avoided
- As learning algorithm needs to learn filter values (weights), and larger filters have more of them
- Odd sized filters are preferred to even sized filters
- Nice geometric property of all input pixels being around output pixel
Padding
- After applying 3 by 3 filter to 4 by 4 image, we get a 2 by 2 image
- Size of the image has gone down
- If we want to keep image size the same, we can use padding
- We pad input in every direction with 0’s before applying filter
- If padding is 1 by 1, then we add 1 zero in every direction
- If padding is 2 by 2, then we add 2 zeros in every direction, and so on
3 by 3 filter with padding of 1
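A minimal sketch of zero padding using NumPy's `np.pad`. The 4 by 4 image of ones is made up; the point is only that a padding of 1 grows the input to 6 by 6, so a 3 by 3 filter with stride 1 produces a 4 by 4 output again.

```python
import numpy as np

image = np.ones((4, 4))  # made-up 4 by 4 image

# Padding of 1: add one row/column of zeros in every direction
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (6, 6)

# A 3 by 3 filter moved one pixel at a time fits in 6 - 3 + 1 = 4 positions
out_size = padded.shape[0] - 3 + 1
print(out_size)  # 4
```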
Stride
- How many pixels we move filter to the right/down is stride
- Stride 1: move filter one pixel to the right/down
- Stride 2: move filter two pixels to the right/down
3 by 3 filter with stride of 2
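To make stride concrete, the sketch below lists the offsets at which the top left corner of the filter lands along one axis. The helper name `filter_positions` and the 7 pixel image width are assumptions chosen for illustration.

```python
def filter_positions(image_size, filter_size, stride):
    """Offsets of the filter's top-left corner along one axis of the image."""
    return list(range(0, image_size - filter_size + 1, stride))

# A 3 by 3 filter on a made-up image that is 7 pixels wide
print(filter_positions(7, 3, 1))  # [0, 1, 2, 3, 4]
print(filter_positions(7, 3, 2))  # [0, 2, 4]
```

With stride 2 the filter visits fewer positions, so the output is smaller (3 values per row instead of 5).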
Dilation
- When we apply 3 by 3 filter, output is affected by pixels in a 3 by 3 subset of image
- Dilation gives filter a larger receptive field (portion of image affecting filter’s output)
- If dilation is set to 2, instead of a contiguous 3 by 3 subset of image, every other pixel of a 5 by 5 subset of image affects output
3 by 3 filter with dilation of 2
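Filter size, padding, stride, and dilation together determine the output size along each axis. The helper below is a sketch using the standard output-size formula; the function name `conv_output_size` is made up for illustration. A dilated 3 by 3 filter with dilation 2 covers a 5 by 5 area, which is what `effective_filter` computes.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1, dilation=1):
    """Output size along one axis of a convolution."""
    # Area of the image the dilated filter spans, e.g. 3 -> 5 when dilation is 2
    effective_filter = dilation * (filter_size - 1) + 1
    return (input_size + 2 * padding - effective_filter) // stride + 1

print(conv_output_size(4, 3))              # 2: 3 by 3 filter on a 4 by 4 image
print(conv_output_size(4, 3, padding=1))   # 4: padding keeps the size
print(conv_output_size(7, 3, stride=2))    # 3
print(conv_output_size(7, 3, dilation=2))  # 3
```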
Activation function
- After filter applied to whole image, apply activation function to output to introduce non-linearity
- Preferred activation function in CNN is ReLU
- ReLU leaves outputs with positive values as is, replaces negative values with 0
ReLU activation function
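ReLU is a one-liner in NumPy. The feature map values below are made up for illustration.

```python
import numpy as np

def relu(values):
    """Keep positive values as is, replace negative values with 0."""
    return np.maximum(values, 0)

feature_map = np.array([[-3.0, 2.0],
                        [0.5, -1.0]])  # made-up filter output
activated = relu(feature_map)
print(activated.tolist())  # [[0.0, 2.0], [0.5, 0.0]]
```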
Single channel 2D convolution
Triple channel 2D convolution
Triple channel 2D convolution in 3D
Change channel size
- Output of a multi-channel 2D filter is a single channel 2D image
- Applying multiple filters results in a multi-channel 2D image
- E.g., if input image is 28 x 28 x 3 (rows x columns x channels)
- If we apply a 3 x 3 filter with 1 x 1 padding, we get a 28 x 28 x 1 image
- If we apply 15 such filters, we get a 28 x 28 x 15 image
- Number of filters allows us to increase or decrease channel size
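The shape arithmetic in this example can be checked with a small NumPy sketch. The random image and filter values are placeholders, and the explicit loops are written for clarity rather than speed. Each filter spans all 3 input channels and produces one output channel, so 15 filters give 15 channels.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))      # rows x columns x channels
filters = rng.random((15, 3, 3, 3))  # 15 filters, each 3 x 3 across 3 channels

# 1 x 1 padding on rows and columns only; the channel axis is not padded
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)))

output = np.zeros((28, 28, 15))
for f in range(15):
    for row in range(28):
        for col in range(28):
            patch = padded[row:row + 3, col:col + 3, :]
            # One multi-channel filter yields a single value per position
            output[row, col, f] = np.sum(patch * filters[f])

print(output.shape)  # (28, 28, 15)
```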
Pooling layer
- Pooling layer performs down sampling to reduce spatial dimensionality of input
- This decreases number of parameters
- Reduces learning time/computation
- Reduces likelihood of overfitting
- Most popular type is max pooling
- Usually a 2 x 2 filter with a stride of 2
- Returns maximum value as it slides over input data
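Max pooling with a 2 x 2 filter and stride of 2 can be sketched with a reshape. The function name `max_pool_2x2` and the input values are made up for illustration; any odd trailing row or column is simply dropped here.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on a single-channel 2D array."""
    rows, cols = feature_map.shape
    trimmed = feature_map[:rows - rows % 2, :cols - cols % 2]
    # Group pixels into non-overlapping 2 x 2 blocks, then take each block's maximum
    blocks = trimmed.reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 7, 2],
                        [6, 2, 3, 4]])
pooled = max_pool_2x2(feature_map)
print(pooled.tolist())  # [[4, 5], [6, 7]]
```

The 4 x 4 input shrinks to 2 x 2, so later layers have a quarter as many values to process.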
Fully connected layer
- Last layer in a CNN
- Connect all nodes from previous layer to this fully connected layer
- Which is responsible for classification of the image
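A rough sketch of what a fully connected classification layer computes. Several things here are assumptions not stated above: the 7 x 7 x 15 input shape, the 10 output classes, the random weights, and the use of softmax to turn scores into probabilities. In a real CNN the weights are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = rng.random((7, 7, 15))  # made-up output of the last pooling layer

# Every node of the previous layer feeds every output node, so flatten first
flattened = pooled.reshape(-1)   # 7 * 7 * 15 = 735 values

weights = rng.normal(scale=0.01, size=(flattened.size, 10))  # placeholder weights
biases = np.zeros(10)
scores = flattened @ weights + biases  # one score per class

# Softmax (assumed here) converts scores to probabilities that sum to 1
exp_scores = np.exp(scores - scores.max())
probabilities = exp_scores / exp_scores.sum()
predicted_label = int(np.argmax(probabilities))
```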
An example CNN
- A typical CNN has several convolution plus pooling layers
- Each responsible for feature extraction at different levels of abstraction
- E.g., filters in first layer detect horizontal, vertical, and diagonal edges
- Filters in the next layer detect shapes
- Filters in the last layer detect collection of shapes
- Filter values randomly initialized, learned by learning algorithm
- CNNs not only do classification, but can also automatically do feature extraction
- Distinguishes CNN from other classification techniques (like Support Vector Machines)
Fruit 360 dataset
- A dataset with 90,380 images of 131 fruits and vegetables
- Images are 100 by 100 pixel and are color (RGB) images (3 values per pixel)
- 67,692 images in training dataset and 22,688 images in test dataset
- This tutorial’s dataset is a subset of fruit 360 dataset
- Containing only 10 fruits/vegetables
- Selected a subset of images, so dataset size is smaller and CNN trains faster
- 5,015 images in training dataset, and 1,679 images in test dataset
Utilities for creating a subset of fruit 360 dataset
- The utilities and instructions at
- First, created feature vectors for each image
- Images are 100 by 100 pixel color (RGB) images
- Each image represented by 30,000 values (100 X 100 X 3)
- Second, we selected a subset of 10 fruit/vegetable images
- Training and test dataset sizes went from 7 GB and 2.5 GB to 500 MB and 177 MB
- Third, we created separate files for feature vectors and labels
- Finally, mapped labels for 10 selected fruits/vegetables to a 0 to 9 range
- Full dataset labels are in the 0 to 130 range
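The preprocessing steps above can be sketched in NumPy. This is not the actual utility code: the data is a made-up stand-in (6 blank images), and only 3 hypothetical classes are selected instead of 10 to keep the example short.

```python
import numpy as np

# Made-up stand-in data: 6 blank images with labels from the full 0 to 130 range
images = np.zeros((6, 100, 100, 3), dtype=np.uint8)
full_labels = np.array([5, 17, 42, 17, 99, 5])

# Feature vectors: 100 * 100 * 3 = 30,000 values per image
features = images.reshape(len(images), -1)

# Keep only images whose label is among the selected classes (hypothetical choice)
selected_labels = [5, 17, 42]
keep = np.isin(full_labels, selected_labels)
subset_features = features[keep]
subset_full_labels = full_labels[keep]

# Map the remaining labels to a contiguous range starting at 0
label_map = {old: new for new, old in enumerate(sorted(selected_labels))}
subset_labels = np.array([label_map[int(label)] for label in subset_full_labels])

print(subset_features.shape)   # (5, 30000)
print(subset_labels.tolist())  # [0, 1, 2, 1, 0]
```

Contiguous labels starting at 0 are convenient because the network's output nodes can then be indexed directly by label.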
Classification of fruit/vegetable images with CNN
- We define a CNN and train it using fruit 360 dataset training data
- Goal is to learn a model such that given image of a fruit/vegetable we predict its label (0 to 9)
- We then evaluate the trained CNN on test dataset and plot the confusion matrix
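For readers unfamiliar with confusion matrices, here is a minimal NumPy sketch of how one is tallied. The labels and predictions are made up, with 3 classes instead of 10 for brevity, and the function name `confusion_matrix` is chosen for illustration. Rows are true labels and columns are predicted labels, so correct predictions fall on the diagonal.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count how often each true label (row) was predicted as each label (column)."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true, predicted] += 1
    return matrix

true_labels = [0, 0, 1, 1, 2, 2]       # made-up test labels
predicted_labels = [0, 1, 1, 1, 2, 0]  # made-up model predictions

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
accuracy = np.trace(matrix) / matrix.sum()  # diagonal = correct predictions

print(matrix.tolist())  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```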
