Let’s play Chrome’s Dino game with gestures only!!

Abhishek Singh
5 min read · Nov 27, 2019

This is a follow-up to my previous post on gesture recognition with a Convolutional Neural Network: Gesture recognition using Convolutional Neural Net.

I made some modifications to the previous version of the project. Here are the main changes in CNNGestureRecognizer ver 2.0:

  • Increased the training set to 4015 image samples.
  • Added more variety to the training samples: images taken under different lighting conditions, backgrounds, and filters.
  • Added an additional classification class, ‘Nothing’, to properly ignore inputs without any valid gesture.
  • Trained for more cycles/epochs.

As a result, ver 2.0 has a much-improved gesture prediction ability, as you can see here:

What about performance?

Well, I wanted to test the responsiveness of the neural net’s predictions, and games are a good benchmark. I don’t have any games installed on my Mac, but the Chrome browser’s Dino game came in handy. So I bound the ‘Punch’ gesture to the jump action of the Dino character. It would basically work with any other gesture, but ‘Punch’ felt the easiest; ‘Stop’ was another candidate.
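For illustration, here is a minimal sketch of how a predicted gesture could be wired to the jump action. This is not the exact code from the repo: pyautogui is an assumed dependency for simulating key presses, and on_prediction is a hypothetical hook called with each predicted label.

import pyautogui  # assumed dependency for simulating keyboard input

def on_prediction(gesture_name):
    # Hypothetical hook: called with the label the CNN predicted for the current frame.
    # The Dino game jumps on the spacebar, so fire it when a 'PUNCH' is detected.
    if gesture_name == 'PUNCH':
        pyautogui.press('space')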

Well, here is how it turned out:

Watch the full video —

Features:

This project comes with a CNN model that can recognize 5 pre-trained gestures:

  • OK
  • PEACE
  • STOP
  • PUNCH
  • NOTHING (i.e. when none of the above gestures is shown)

And it provides the following functionality:

  • Prediction: This allows the app to guess the user’s gesture against the pre-trained gestures. The app can dump the prediction data to the console terminal or directly to a JSON file, which can be used to plot a real-time prediction bar chart (you can use my other script: https://github.com/asingh33/LivePlot ).
  • New Training: This allows the user to retrain the NN model. The user can change the model architecture or add/remove gestures. The app has built-in options to create new image samples of user-defined gestures if required.
  • Visualization: This allows the user to see the feature maps of different NN layers for a given input gesture image. It is interesting to see how the NN works and learns things.

Functionality:

Gesture Input

I am using OpenCV to capture the user’s hand gestures. To simplify things, I post-process the captured images to highlight the contours and edges, by applying a binary threshold, blurring, and grayscaling.

I have provided two modes of capturing:

  • Binary Mode: Here I first convert the image to grayscale, then apply a Gaussian blur followed by an adaptive threshold filter. This mode is useful when you have a plain background such as a wall or whiteboard.
  • SkinMask Mode: In this mode, I first convert the input image to HSV and restrict the H, S, V values to a skin-color range. I then apply erosion followed by dilation, and a Gaussian blur to smooth out the noise. This output is used as a mask on the original input to mask out everything that isn’t skin-colored. Finally, I grayscale the result. This mode is useful when there is a good amount of light and you don’t have an empty background.

Binary Mode processing

import cv2

# roi is the captured hand region; minValue is a threshold constant defined in the script
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)          # convert to grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 2)              # smooth out noise
# adaptive Gaussian threshold, inverted so the hand shows up as foreground
th3 = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 11, 2)
# Otsu thresholding to get the final binary image
ret, res = cv2.threshold(th3, minValue, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

SkinMask Mode processing

hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
# apply the skin color range (low_range/upper_range are HSV bounds defined in the script)
mask = cv2.inRange(hsv, low_range, upper_range)
# erosion followed by dilation to remove small specks of noise in the mask
mask = cv2.erode(mask, skinkernel, iterations=1)
mask = cv2.dilate(mask, skinkernel, iterations=1)
# blur the mask to smooth its edges
mask = cv2.GaussianBlur(mask, (15, 15), 1)
# bitwise-AND the mask with the original frame to keep only skin-colored pixels
res = cv2.bitwise_and(roi, roi, mask=mask)
# convert the masked color image to grayscale
res = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)

CNN Model used

The Convolutional Neural Net model I have used for this project is a pretty basic one:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation, Dropout, Flatten, Dense

# Hyperparameters used in the script (values consistent with the layer summary below);
# note the script uses Theano-style channels-first image ordering.
nb_filters = 32                                   # number of convolutional filters
nb_conv = 3                                       # convolution kernel size
nb_pool = 2                                       # max-pooling size
nb_classes = 5                                    # OK, PEACE, STOP, PUNCH, NOTHING
img_rows, img_cols, img_channels = 200, 200, 1    # grayscale 200x200 input

model = Sequential()
model.add(Conv2D(nb_filters, (nb_conv, nb_conv),
                 padding='valid',
                 input_shape=(img_channels, img_rows, img_cols)))
convout1 = Activation('relu')
model.add(convout1)
model.add(Conv2D(nb_filters, (nb_conv, nb_conv)))
convout2 = Activation('relu')
model.add(convout2)
model.add(MaxPooling2D(pool_size=(nb_pool, nb_pool)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

This model has the following 12 layers:

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 32, 198, 198) 320
_________________________________________________________________
activation_1 (Activation) (None, 32, 198, 198) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 32, 196, 196) 9248
_________________________________________________________________
activation_2 (Activation) (None, 32, 196, 196) 0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 98, 98) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 32, 98, 98) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 307328) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 39338112
_________________________________________________________________
activation_3 (Activation) (None, 128) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 5) 645
_________________________________________________________________
activation_4 (Activation) (None, 5) 0
=================================================================

Total params: 39,348,325
Trainable params: 39,348,325

Training

In version 1.0 of this project, I used only 1204 images for training. The prediction probabilities were OK but not satisfying. So in version 2.0, I increased the training set to 4015 images, i.e. 803 image samples per class, and added the additional ‘Nothing’ class alongside the previous 4 gesture classes.

I have trained the model for 15 epochs.
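For reference, here is a minimal sketch of what the compile-and-train step looks like. The optimizer, batch size, and validation split shown are illustrative assumptions, and X_train/Y_train stand for the preprocessed 200x200 grayscale images and their one-hot labels.

# Assumed compile/train step; only the 15 epochs are stated in the post,
# the optimizer, batch size, and validation split are placeholder choices.
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=32, epochs=15, verbose=1, validation_split=0.2)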

Visualization

A CNN is good at detecting edges, which is why it is useful for image classification tasks. To understand how the neural net interprets the different gesture inputs, it’s possible to visualize the feature map contents of its layers.

After launching the main script, choose option 3 to visualize individual layers, or all layers, for a given image:

from keras import backend as K

# grab the layer to visualize
layer = model.layers[layerIndex]
# build a function mapping the model input (plus learning-phase flag) to that layer's output
get_activations = K.function([model.layers[0].input, K.learning_phase()], [layer.output,])
# run it on the input image with learning phase 0 (test mode)
activations = get_activations([input_image, 0])[0]
output_image = activations
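As an illustrative follow-up (not from the repo), the feature maps held in output_image can be displayed with matplotlib; this assumes channels-first convolutional activations of shape (1, n_filters, height, width).

import matplotlib.pyplot as plt

# plot the first 16 feature maps of the chosen layer in a 4x4 grid
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(output_image[0, i, :, :], cmap='gray')
    ax.axis('off')
plt.show()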

Layer 4 visualization for PUNCH gesture

Layer 2 visualization for STOP gesture

Source code

Cool! So you have reached the end of this post. Here is your cookie 😅

My Github link — https://github.com/asingh33

If you find this write-up interesting, please comment and let me know, and I will try to write more. I am planning to go deeper into machine learning with RNNs, transfer learning, and my favorite, Reinforcement Learning.
