Let’s play Chrome’s Dino game with gestures only!
This post follows up on my previous one, Gesture recognition using Convolutional Neural Net.
I made some modifications to the previous version of the project. Here are the changes in CNNGestureRecognizer ver 2.0:
- Increased the training image samples to 4015 images.
- Added more variety to the training samples: images taken under different lighting conditions, against different backgrounds, and with different filters.
- Added an additional classification class ‘Nothing’ to properly ignore inputs without any valid gestures.
- Trained for more cycles/epochs.
As a result, ver 2.0 predicts gestures much better, as you can see here:
What about performance?
Well, I wanted to test the responsiveness of the neural net's predictions, and games are a good benchmark. I don't have any games installed on my Mac, but Chrome's Dino game came in handy. So I bound the ‘Punch’ gesture to the Dino character's jump action. Any other gesture would work too, but Punch felt easy to perform; ‘Stop’ was another candidate.
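Here is a minimal sketch of such a binding, assuming the classifier returns one probability per gesture label. The `pick_action` helper, the `GESTURE_ACTIONS` table, and the 0.7 confidence threshold are hypothetical and not taken from the project; actually sending the key press to the browser is left out.

```python
# Hypothetical table from gesture label to game action; only PUNCH is bound.
GESTURE_ACTIONS = {"PUNCH": "jump"}


def pick_action(probabilities, threshold=0.7):
    """Return the action bound to the most likely gesture, or None.

    probabilities: dict mapping gesture label -> predicted probability.
    threshold: assumed minimum confidence before acting (illustrative value).
    """
    if not probabilities:
        return None
    # Pick the label with the highest predicted probability.
    label = max(probabilities, key=probabilities.get)
    if probabilities[label] < threshold:
        return None
    # Labels without a binding (STOP, NOTHING, ...) map to no action.
    return GESTURE_ACTIONS.get(label)


print(pick_action({"PUNCH": 0.91, "STOP": 0.05, "NOTHING": 0.04}))
print(pick_action({"PUNCH": 0.40, "STOP": 0.35, "NOTHING": 0.25}))
```

The threshold keeps a low-confidence frame from triggering a jump, which matters in a game where a stray jump can end the run.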
Well, here is how it turned out:
This project comes with a CNN model that recognizes up to 5 pre-trained gestures:
- NOTHING (i.e., when none of the above gestures is input)
It provides the following functionality:
- Prediction: The app guesses the user’s gesture against the pre-trained gestures. It can dump the prediction data to the console or directly to a JSON file, which can be used to plot a real-time prediction bar chart (you can use my other script: https://github.com/asingh33/LivePlot).
- New Training: The user can retrain the NN model, change the model architecture, or add or remove gestures. The app has built-in options for creating new image samples of user-defined gestures if required.
- Visualization: The user can see the feature maps of different NN layers for a given input gesture image. It is interesting to see how the NN works and learns things.
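As an illustration of the JSON output mentioned under Prediction, here is a small sketch of writing per-gesture probabilities to a file that a plotting script could poll. The file name, the JSON layout, and the `dump_prediction` helper are assumptions for this example; the format the app and LivePlot actually use may differ.

```python
import json
import os
import tempfile


def dump_prediction(probabilities, path):
    """Write gesture probabilities to a JSON file, replacing it atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(probabilities, handle)
    # os.replace swaps the file in one step, so a reader never sees a partial write.
    os.replace(tmp_path, path)


# Hypothetical output location and example values.
out_path = os.path.join(tempfile.gettempdir(), "gesture_prediction.json")
dump_prediction({"PUNCH": 0.91, "STOP": 0.05, "NOTHING": 0.04}, out_path)

with open(out_path) as handle:
    print(json.load(handle))
```

Writing through a temporary file is a design choice for the case where the plotter reads the file while the predictor is still writing it.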
I am using OpenCV to capture the user’s hand gestures. To simplify things, I process the captured images to highlight contours and edges by applying gray scaling, blurring, and a binary threshold.
I have provided two modes of capturing:
- Binary Mode: Here I first convert the image to grayscale, then apply a Gaussian blur followed by an adaptive threshold filter. This mode is useful when you have an empty background like a wall, whiteboard, etc.
- SkinMask Mode: In this mode, I first convert the input image to HSV and restrict the H, S, and V values to a skin-color range. I then apply erosion followed by dilation, and a Gaussian blur to smooth out the noise. The result is used as a mask on the original input to hide everything that is not skin-colored. Finally, I convert the image to grayscale. This mode is useful when there is a good amount of light and you don't have an empty background.
Binary Mode processing
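import cv2
# roi is the BGR input image (region of interest)
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (5, 5), 2)
th3 = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
# With THRESH_OTSU the threshold is computed automatically, so minValue is ignored
ret, res = cv2.threshold(th3, minValue, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)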
SkinMask Mode processing
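import cv2
# Convert the BGR region of interest to HSV
hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# Keep only pixels inside the skin color range (low_range and upper_range are HSV bounds)
mask = cv2.inRange(hsv, low_range, upper_range)
# Erode, then dilate, to remove small specks from the mask
mask = cv2.erode(mask, skinkernel, iterations=1)
mask = cv2.dilate(mask, skinkernel, iterations=1)
# Smooth the mask
mask = cv2.GaussianBlur(mask, (15, 15), 1)
# Bitwise-AND the mask with the original frame
res = cv2.bitwise_and(roi, roi, mask=mask)
# Convert color to grayscale
res = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)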
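To see why erosion followed by dilation cleans up the skin mask, here is a tiny pure-Python sketch on a binary grid. It assumes a 3x3 square kernel, which is a simplification chosen for this example; the shape and size of `skinkernel` in the project are not shown in this post, and OpenCV does the same job far more efficiently.

```python
def _neighbourhood(mask, r, c):
    """Return the in-bounds values of the 3x3 window centred on (r, c)."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[rr][cc]
        for rr in range(r - 1, r + 2)
        for cc in range(c - 1, c + 2)
        if 0 <= rr < rows and 0 <= cc < cols
    ]


def erode(mask):
    """Keep a pixel only if every pixel in its window is set."""
    return [
        [1 if all(_neighbourhood(mask, r, c)) else 0 for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


def dilate(mask):
    """Set a pixel if any pixel in its window is set."""
    return [
        [1 if any(_neighbourhood(mask, r, c)) else 0 for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


# A 4x4 "hand" blob plus a single noisy pixel in the top-left corner.
mask = [
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

opened = dilate(erode(mask))
for row in opened:
    print(row)
```

Erosion wipes out the isolated pixel and shrinks the blob; dilation then grows the blob back to its original size, while the noise stays gone.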
CNN Model used
The Convolutional Neural Net model I have used for this project is a pretty basic one:
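from keras.models import Sequential
from keras.layers import Activation, Conv2D
model = Sequential()
# Input shape is channels-first: (channels, rows, cols)
model.add(Conv2D(nb_filters, (nb_conv, nb_conv),
                 input_shape=(img_channels, img_rows, img_cols)))
convout1 = Activation('relu')
model.add(convout1)
model.add(Conv2D(nb_filters, (nb_conv, nb_conv)))
convout2 = Activation('relu')
model.add(convout2)
# The remaining layers (pooling, dropout, flatten, dense) appear in the summary below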
This model has the following 12 layers:
Layer (type)                   Output Shape            Param #
conv2d_1 (Conv2D)              (None, 32, 198, 198)    320
activation_1 (Activation)      (None, 32, 198, 198)    0
conv2d_2 (Conv2D)              (None, 32, 196, 196)    9248
activation_2 (Activation)      (None, 32, 196, 196)    0
max_pooling2d_1 (MaxPooling2   (None, 32, 98, 98)      0
dropout_1 (Dropout)            (None, 32, 98, 98)      0
flatten_1 (Flatten)            (None, 307328)          0
dense_1 (Dense)                (None, 128)             39338112
activation_3 (Activation)      (None, 128)             0
dropout_2 (Dropout)            (None, 128)             0
dense_2 (Dense)                (None, 5)               645
activation_4 (Activation)      (None, 5)               0
Total params: 39,348,325.0
Trainable params: 39,348,325.0
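The numbers in this summary can be checked by hand. The sketch below assumes a 200x200 single-channel input, 3x3 convolution kernels without padding, 32 filters, and 2x2 max pooling. These values are not stated in the post; they are inferred from the output shapes and parameter counts above (for example, 198 = 200 - 3 + 1 and 320 = 3*3*1*32 + 32).

```python
def conv_params(kernel, in_channels, filters):
    """Weights (kernel*kernel*in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


# Assumed hyperparameters, inferred from the summary rather than quoted from the code.
size, channels, kernel, filters, pool, classes = 200, 1, 3, 32, 2, 5

conv1 = conv_params(kernel, channels, filters)   # conv2d_1
size = size - kernel + 1                         # 200 -> 198 (no padding)
conv2 = conv_params(kernel, filters, filters)    # conv2d_2
size = size - kernel + 1                         # 198 -> 196
size = size // pool                              # max pooling: 196 -> 98
flat = filters * size * size                     # flatten_1
dense1 = dense_params(flat, 128)                 # dense_1
dense2 = dense_params(128, classes)              # dense_2
total = conv1 + conv2 + dense1 + dense2

print(conv1, conv2, flat, dense1, dense2, total)
# 320 9248 307328 39338112 645 39348325
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map has 307,328 values feeding 128 units.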
In version 1.0 of this project, I used only 1204 images for training. The prediction probabilities were OK but not satisfying. So in version 2.0, I increased the training set to 4015 images, i.e. 803 image samples per class. I also added the class ‘Nothing’ alongside the previous 4 gesture classes.
I have trained the model for 15 epochs.
CNNs are good at detecting edges, which is why they are useful for image classification tasks. To understand how the neural net interprets the different gesture inputs, it is possible to visualize the feature map contents of each layer.
After launching the main script, choose option 3 to visualize one or all layers for a given image.
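from keras import backend as K
layer = model.layers[layerIndex]
# Backend function from the model input (plus learning phase) to this layer's output
get_activations = K.function([model.layers[0].input, K.learning_phase()], [layer.output,])
# Learning phase 0 means test mode; K.function returns a list of outputs
activations = get_activations([input_image, 0])
output_image = activations[0]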
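Activations are floating-point values with no fixed range, so they need to be rescaled before they can be shown as a grayscale image. Here is a minimal sketch of min-max scaling for a single feature map, using plain lists for clarity. The `to_gray_levels` helper is hypothetical and may not match how the project renders its feature maps.

```python
def to_gray_levels(feature_map):
    """Min-max scale a 2D list of floats to integers in the range 0..255."""
    flat = [value for row in feature_map for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant map carries no contrast; render it as black.
        return [[0 for _ in row] for row in feature_map]
    scale = 255.0 / (high - low)
    return [[int(round((value - low) * scale)) for value in row] for row in feature_map]


# Toy 2x2 "feature map" standing in for one filter's output.
fmap = [[0.0, 1.0], [2.0, 4.0]]
print(to_gray_levels(fmap))
```

The smallest activation maps to 0 (black) and the largest to 255 (white), so each filter's map uses the full contrast range regardless of its raw magnitude.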
Layer 4 visualization for PUNCH gesture
Layer 2 visualization for STOP gesture
Cool! So you have reached the end of this post. Here is your cookie 😅
My GitHub link: https://github.com/asingh33
If you found this write-up interesting, please comment and let me know, and I will try to post more. I am planning to deepen my knowledge of machine learning with RNNs, transfer learning techniques, and my favorite, Reinforcement Learning.