Introduction
Convolutional neural networks (CNNs) are a type of artificial intelligence system that has gained a lot of popularity recently. They are especially well suited for tasks like image recognition, where we want to teach a computer to recognize the objects in a picture.
CNNs operate by dissecting an image into increasingly minute components, or “features.” The network then examines each feature and searches for patterns shared by various objects. For instance, a CNN might come to understand that some pixel patterns are frequently linked to faces, while others are linked to vehicles or trees.
The unique feature of CNNs is their capacity to discover these patterns on their own, without being explicitly programmed with which features to look for. This is what makes them so powerful: by analyzing thousands or even millions of images, a CNN can learn to recognize a wide variety of objects with remarkable accuracy.
In a previous blog post, we used a regular neural network with two dense layers to classify the MNIST images. The testing accuracy was 97.8%.
Let’s implement a convolutional neural network and see how it performs.
For a high-level understanding of CNNs, watch Josh Starmer's video.
Let’s load the data
Load the libraries:
#install.packages("keras") install the keras R package
library(keras)
#install_keras(version = "release") install the core Keras library and TensorFlow
library(reticulate)
use_condaenv("r-reticulate")
mnist<- dataset_mnist()
split the training and testing sets
train_images<- mnist$train$x
train_labels<- mnist$train$y
train_labels<- to_categorical(train_labels)
test_images<- mnist$test$x
test_labels<- mnist$test$y
test_labels<- to_categorical(test_labels)
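to_categorical() one-hot encodes the integer labels (0-9) into a matrix with 10 indicator columns. A quick check (the first training label happens to be the digit 5):
dim(train_labels)
# expected: 60000 10
train_labels[1, ]
# expected: a row of ten 0/1 values with a single 1 in the column for digit 5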
The training set is a 3D tensor:
dim(train_images)
#> [1] 60000 28 28
It is an array of 60,000 matrices of 28x28 integers. Each matrix is a grayscale image:
# get the fifth matrix
digit<- train_images[5, ,]
dim(digit)
#> [1] 28 28
plot(as.raster(digit, max = 255))
It is the matrix for an image of the digit 9.
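We can double-check with the corresponding label:
mnist$train$y[5]
# expected: 9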
reshape the data into a 3D tensor
In our regular dense neural network model, we reshaped the data into a 2D tensor of shape (samples, 28 * 28). Importantly, a convnet takes tensors of shape (image_height, image_width, image_channels), not including the batch dimension. We will therefore convert each input image to shape (28, 28, 1). The MNIST images are black and white, so there is only 1 channel. For colored images, you will have 3 channels (RGB).
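As a hedged illustration (assuming you are willing to download another built-in dataset), the CIFAR-10 images that also ship with keras are 32x32 color images, so their last dimension is 3:
# CIFAR-10: 50,000 training images of 32 x 32 pixels with 3 RGB channels
cifar <- dataset_cifar10()
dim(cifar$train$x)
# expected: 50000 32 32 3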
# a 2D tensor/matrix
digit
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 0 0 0 0 0 0 0 0 0 0 0 0 55
#> [9,] 0 0 0 0 0 0 0 0 0 0 0 87 232
#> [10,] 0 0 0 0 0 0 0 0 0 4 57 242 252
#> [11,] 0 0 0 0 0 0 0 0 0 96 252 252 183
#> [12,] 0 0 0 0 0 0 0 0 132 253 252 146 14
#> [13,] 0 0 0 0 0 0 0 126 253 247 176 9 0
#> [14,] 0 0 0 0 0 0 16 232 252 176 0 0 0
#> [15,] 0 0 0 0 0 0 22 252 252 30 22 119 197
#> [16,] 0 0 0 0 0 0 16 231 252 253 252 252 252
#> [17,] 0 0 0 0 0 0 0 55 235 253 217 138 42
#> [18,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [19,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [20,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [21,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [22,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [23,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [24,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [25,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [26,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [27,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [28,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 148 210 253 253 113 87 148 55 0 0 0 0
#> [9,] 252 253 189 210 252 252 253 168 0 0 0 0
#> [10,] 190 65 5 12 182 252 253 116 0 0 0 0
#> [11,] 14 0 0 92 252 252 225 21 0 0 0 0
#> [12,] 0 0 0 215 252 252 79 0 0 0 0 0
#> [13,] 0 8 78 245 253 129 0 0 0 0 0 0
#> [14,] 36 201 252 252 169 11 0 0 0 0 0 0
#> [15,] 241 253 252 251 77 0 0 0 0 0 0 0
#> [16,] 226 227 252 231 0 0 0 0 0 0 0 0
#> [17,] 24 192 252 143 0 0 0 0 0 0 0 0
#> [18,] 62 255 253 109 0 0 0 0 0 0 0 0
#> [19,] 71 253 252 21 0 0 0 0 0 0 0 0
#> [20,] 0 253 252 21 0 0 0 0 0 0 0 0
#> [21,] 71 253 252 21 0 0 0 0 0 0 0 0
#> [22,] 106 253 252 21 0 0 0 0 0 0 0 0
#> [23,] 45 255 253 21 0 0 0 0 0 0 0 0
#> [24,] 0 218 252 56 0 0 0 0 0 0 0 0
#> [25,] 0 96 252 189 42 0 0 0 0 0 0 0
#> [26,] 0 14 184 252 170 11 0 0 0 0 0 0
#> [27,] 0 0 14 147 252 42 0 0 0 0 0 0
#> [28,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [,26] [,27] [,28]
#> [1,] 0 0 0
#> [2,] 0 0 0
#> [3,] 0 0 0
#> [4,] 0 0 0
#> [5,] 0 0 0
#> [6,] 0 0 0
#> [7,] 0 0 0
#> [8,] 0 0 0
#> [9,] 0 0 0
#> [10,] 0 0 0
#> [11,] 0 0 0
#> [12,] 0 0 0
#> [13,] 0 0 0
#> [14,] 0 0 0
#> [15,] 0 0 0
#> [16,] 0 0 0
#> [17,] 0 0 0
#> [18,] 0 0 0
#> [19,] 0 0 0
#> [20,] 0 0 0
#> [21,] 0 0 0
#> [22,] 0 0 0
#> [23,] 0 0 0
#> [24,] 0 0 0
#> [25,] 0 0 0
#> [26,] 0 0 0
#> [27,] 0 0 0
#> [28,] 0 0 0
#reshape it to 3D
digit2<- array_reshape(digit, c(28, 28, 1))
digit2[,,1]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 0 0 0 0 0 0 0 0 0 0 0 0 55
#> [9,] 0 0 0 0 0 0 0 0 0 0 0 87 232
#> [10,] 0 0 0 0 0 0 0 0 0 4 57 242 252
#> [11,] 0 0 0 0 0 0 0 0 0 96 252 252 183
#> [12,] 0 0 0 0 0 0 0 0 132 253 252 146 14
#> [13,] 0 0 0 0 0 0 0 126 253 247 176 9 0
#> [14,] 0 0 0 0 0 0 16 232 252 176 0 0 0
#> [15,] 0 0 0 0 0 0 22 252 252 30 22 119 197
#> [16,] 0 0 0 0 0 0 16 231 252 253 252 252 252
#> [17,] 0 0 0 0 0 0 0 55 235 253 217 138 42
#> [18,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [19,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [20,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [21,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [22,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [23,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [24,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [25,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [26,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [27,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [28,] 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [3,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [5,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [6,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [7,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [8,] 148 210 253 253 113 87 148 55 0 0 0 0
#> [9,] 252 253 189 210 252 252 253 168 0 0 0 0
#> [10,] 190 65 5 12 182 252 253 116 0 0 0 0
#> [11,] 14 0 0 92 252 252 225 21 0 0 0 0
#> [12,] 0 0 0 215 252 252 79 0 0 0 0 0
#> [13,] 0 8 78 245 253 129 0 0 0 0 0 0
#> [14,] 36 201 252 252 169 11 0 0 0 0 0 0
#> [15,] 241 253 252 251 77 0 0 0 0 0 0 0
#> [16,] 226 227 252 231 0 0 0 0 0 0 0 0
#> [17,] 24 192 252 143 0 0 0 0 0 0 0 0
#> [18,] 62 255 253 109 0 0 0 0 0 0 0 0
#> [19,] 71 253 252 21 0 0 0 0 0 0 0 0
#> [20,] 0 253 252 21 0 0 0 0 0 0 0 0
#> [21,] 71 253 252 21 0 0 0 0 0 0 0 0
#> [22,] 106 253 252 21 0 0 0 0 0 0 0 0
#> [23,] 45 255 253 21 0 0 0 0 0 0 0 0
#> [24,] 0 218 252 56 0 0 0 0 0 0 0 0
#> [25,] 0 96 252 189 42 0 0 0 0 0 0 0
#> [26,] 0 14 184 252 170 11 0 0 0 0 0 0
#> [27,] 0 0 14 147 252 42 0 0 0 0 0 0
#> [28,] 0 0 0 0 0 0 0 0 0 0 0 0
#> [,26] [,27] [,28]
#> [1,] 0 0 0
#> [2,] 0 0 0
#> [3,] 0 0 0
#> [4,] 0 0 0
#> [5,] 0 0 0
#> [6,] 0 0 0
#> [7,] 0 0 0
#> [8,] 0 0 0
#> [9,] 0 0 0
#> [10,] 0 0 0
#> [11,] 0 0 0
#> [12,] 0 0 0
#> [13,] 0 0 0
#> [14,] 0 0 0
#> [15,] 0 0 0
#> [16,] 0 0 0
#> [17,] 0 0 0
#> [18,] 0 0 0
#> [19,] 0 0 0
#> [20,] 0 0 0
#> [21,] 0 0 0
#> [22,] 0 0 0
#> [23,] 0 0 0
#> [24,] 0 0 0
#> [25,] 0 0 0
#> [26,] 0 0 0
#> [27,] 0 0 0
#> [28,] 0 0 0
dim(digit2)
#> [1] 28 28 1
reshape the input tensor, including the batch dimension (60,000 images):
# for the dense network we reshaped to 2D: train_images<- array_reshape(train_images, c(60000, 28 * 28))
train_images<- array_reshape(train_images, c(60000, 28, 28, 1))
## one of the images
# train_images[1,,,]
train_images<- train_images/255
test_images<- array_reshape(test_images, c(10000, 28, 28, 1))
test_images<- test_images/255
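Dividing by 255 rescales the integer pixel values from [0, 255] to [0, 1]. A quick sanity check:
range(train_images)
# expected: 0 1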
Read my previous post on how to reshape tensors.
dim(train_images)
#> [1] 60000 28 28 1
dim(test_images)
#> [1] 10000 28 28 1
build the network
model<- keras_model_sequential() %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = "relu",
input_shape = c(28, 28, 1)) %>%
layer_max_pooling_2d(pool_size = c(2,2)) %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu") %>%
layer_max_pooling_2d(pool_size = c(2,2)) %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu")
model<- model %>%
layer_flatten() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax")
Take a look at the details of the model
model
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> conv2d_2 (Conv2D) (None, 26, 26, 32) 320
#> ________________________________________________________________________________
#> max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
#> ________________________________________________________________________________
#> conv2d_1 (Conv2D) (None, 11, 11, 64) 18496
#> ________________________________________________________________________________
#> max_pooling2d (MaxPooling2D) (None, 5, 5, 64) 0
#> ________________________________________________________________________________
#> conv2d (Conv2D) (None, 3, 3, 64) 36928
#> ________________________________________________________________________________
#> flatten (Flatten) (None, 576) 0
#> ________________________________________________________________________________
#> dense_1 (Dense) (None, 64) 36928
#> ________________________________________________________________________________
#> dense (Dense) (None, 10) 650
#> ================================================================================
#> Total params: 93,322
#> Trainable params: 93,322
#> Non-trainable params: 0
#> ________________________________________________________________________________
Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis. For an RGB image, the dimension of the depth axis is 3, with red, green and blue channels. For the black and white MNIST images, the depth is 1 (levels of gray).
The convolution extracts patches from its input feature map and applies the same transformation to all patches, producing an output feature map. This output feature map is still 3D: width, height and depth. But the depth is now an arbitrary parameter of the layer and no longer represents RGB colors: the depth channels are called filters. Filters encode specific aspects of the input data. At a higher level, a single filter could encode the concept "presence of a face in the image".
In our MNIST example, the first convolution layer takes a (28, 28, 1) input feature map and outputs a feature map of shape (26, 26, 32): it computes 32 filters over its input. Each of those 32 output channels contains a 26 x 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input.
Two key parameters we defined in our model:
Size of the patches extracted from the inputs: they are typically 3x3 or 5x5. We used 3x3. Without padding, for a 28 x 28 image, the output feature map becomes 26 x 26 (think how many 3x3 patches you can get from a 28x28 grid).
Depth of the output feature map: the number of filters computed by the convolution. We started with 32 and ended with 64.
Max pooling consists of extracting windows from the output feature maps and outputting the max value of each channel. Instead of transforming local patches via a learned linear transformation, they are transformed via a hard-coded max tensor operation. Max pooling is usually done with a 2x2 window, which halves the width and height. That's why after the 1st max pooling the 26 x 26 grid becomes 13 x 13, and after the 2nd max pooling 11 x 11 becomes 5 x 5. Because of the border effects of the unpadded convolutions, the grid also keeps shrinking: 28 x 28 to 26 x 26, 13 x 13 to 11 x 11, and 5 x 5 to 3 x 3, as traced in the sketch below.
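To make the arithmetic concrete, here is a minimal sketch in plain R (no keras needed) that tracks the feature-map size through the first model, using input - kernel + 1 for an unpadded ("valid") convolution and floor(input / 2) for 2x2 max pooling:
# size of a feature map after a "valid" (unpadded) convolution
conv_out <- function(n, kernel = 3) n - kernel + 1
# size of a feature map after 2x2 max pooling
pool_out <- function(n, pool = 2) floor(n / pool)
n <- 28
n <- conv_out(n)  # 26  (1st conv2d)
n <- pool_out(n)  # 13  (1st max pooling)
n <- conv_out(n)  # 11  (2nd conv2d)
n <- pool_out(n)  # 5   (2nd max pooling)
n <- conv_out(n)  # 3   (3rd conv2d)
n
# expected: 3, matching the (None, 3, 3, 64) shape in the model summary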
Let’s make a new model by padding:
model2<- keras_model_sequential() %>%
layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = "relu",
input_shape = c(28, 28, 1), padding = "same") %>%
layer_max_pooling_2d(pool_size = c(2,2), padding = "same") %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu", padding = "same") %>%
layer_max_pooling_2d(pool_size = c(2,2), padding = "same") %>%
layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu", padding = "same")
model2<- model2 %>%
layer_flatten() %>%
layer_dense(units = 64, activation = "relu") %>%
layer_dense(units = 10, activation = "softmax")
model2
#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> conv2d_5 (Conv2D) (None, 28, 28, 32) 320
#> ________________________________________________________________________________
#> max_pooling2d_3 (MaxPooling2D) (None, 14, 14, 32) 0
#> ________________________________________________________________________________
#> conv2d_4 (Conv2D) (None, 14, 14, 64) 18496
#> ________________________________________________________________________________
#> max_pooling2d_2 (MaxPooling2D) (None, 7, 7, 64) 0
#> ________________________________________________________________________________
#> conv2d_3 (Conv2D) (None, 7, 7, 64) 36928
#> ________________________________________________________________________________
#> flatten_1 (Flatten) (None, 3136) 0
#> ________________________________________________________________________________
#> dense_3 (Dense) (None, 64) 200768
#> ________________________________________________________________________________
#> dense_2 (Dense) (None, 10) 650
#> ================================================================================
#> Total params: 257,162
#> Trainable params: 257,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
After we use padding = "same" (the default is "valid"), our 28x28 input image stays 28x28 after the 1st conv2d. Max pooling reduces it by half to 14x14, the second conv2d keeps it at 14x14, then max pooling halves it again to 7x7, and so on, as traced below.
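The same kind of size tracking works here; with "same" padding the convolution preserves the spatial size, and the padded max pooling rounds up when the size is odd (a small sketch under those assumptions):
conv_same <- function(n) n                            # "same" padding keeps the size
pool_same <- function(n, pool = 2) ceiling(n / pool)  # padded 2x2 max pooling
n <- 28
n <- conv_same(n)  # 28
n <- pool_same(n)  # 14
n <- conv_same(n)  # 14
n <- pool_same(n)  # 7
n <- conv_same(n)  # 7
n
# expected: 7, matching the (None, 7, 7, 64) shape in the model2 summary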
Let’s compile the first model:
model %>%
compile(optimizer = "rmsprop",
loss = "categorical_crossentropy",
metrics= c("accuracy"))
Finally, train the model:
model %>%
fit(train_images, train_labels, epochs = 5, batch_size = 64)
The network will iterate over the training data in mini-batches of 64 samples, 5 times over (each full pass over all the training data is called an epoch).
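As a small aside, that works out to about 938 weight updates per epoch:
ceiling(60000 / 64)
# expected: 938 mini-batches (gradient updates) per epoch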
evaluate the model
metrics<- model %>% evaluate(test_images, test_labels)
metrics
#> loss accuracy
#> 0.02705173 0.99220002
It blows my mind that this simple CNN reached an accuracy of ~99%! The training time is a little longer than for the regular dense-layer neural network.
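As a follow-up sketch (not part of the run above), we can check which digits are still confused by predicting on the test set and tabulating a confusion matrix; which.max() - 1 converts each row of predicted probabilities back to a digit:
# predicted class probabilities: a 10000 x 10 matrix
pred_probs <- model %>% predict(test_images)
# convert each row of probabilities to the most likely digit (0-9)
pred_digits <- apply(pred_probs, 1, which.max) - 1
# confusion matrix of predicted vs. true digits
table(predicted = pred_digits, actual = mnist$test$y)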
Take home messages
CNNs reduce the number of input nodes and thus the total number of parameters (the pooling step does this).
CNNs are capable of hierarchical feature learning: CNNs use multiple layers to learn hierarchical representations of an image, starting with low-level features like edges and gradually building up to more complex features like shapes and objects. This allows them to extract more meaningful features from images, which can improve their accuracy and performance. They also tolerate small shifts in where the pixels are in the image: a CNN learns local features such as an "ear" and can find them in other places of the image.
CNNs are able to learn on their own: One of the key advantages of CNNs is that they are able to learn from data without being explicitly programmed to do so. This means that they can automatically adapt to new data and improve their performance over time, making them highly flexible and adaptable.
CNNs can be highly accurate: When trained on large datasets, CNNs can achieve very high levels of accuracy in image recognition tasks. In fact, in some cases they can even outperform humans, making them a valuable tool for tasks like medical diagnosis, quality control, and more.
We can use k-fold cross-validation and a validation set to further tune the model and improve the prediction accuracy on the testing data.
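For example, a minimal sketch of holding out a validation set during training (validation_split is a standard fit() argument; the exact curves you get will vary from run to run):
history <- model %>%
  fit(train_images, train_labels,
      epochs = 5, batch_size = 64,
      validation_split = 0.2)  # hold out the last 20% of the training data
plot(history)  # training vs. validation loss and accuracy per epoch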