Gesture recognition using a Convolutional Neural Network

Abhishek Singh
2 min read · Nov 26, 2019

In case you have not noticed yet, machine learning is the new buzzword, and it is hard to ignore. You have probably heard about it being used to train autonomous, or driverless, cars. That is the most widely known application of this technology: specially modified cars (equipped with various sensors, cameras, and radar-based technologies) are trained, or learn by themselves, to drive on the road.

There are other applications of machine learning as well, such as natural language processing (e.g. virtual assistants like Siri and Alexa), image classification (e.g. in the Google search engine), and the medical industry (e.g. identifying certain diseases).

Basically, machine learning is the technology of making computing devices intelligent enough to do certain tasks on par with, or more efficiently than, human beings. This can be achieved by different methods. I think machine learning gained more fame when various neural network architectures (CNNs, RNNs) became feasible to train, thanks to a huge leap in the computational power of CPUs/GPUs and the ease of implementing neural networks in different programming languages, especially Python and Ruby.

Well, enough theory. Like many others, I am intrigued by neural-network-based machine learning and wanted to learn more about it. So I am sharing one of my recent fun projects, which I did to understand it better: hand gesture recognition using computer vision and a convolutional neural network.

For this I used the following:

- Python implementation

- OpenCV 2 for computer vision

- Keras neural network API

- Theano backend library.

One of the trained gestures: ‘Peace’

In this implementation I have trained the model to recognize 4 gestures:

  • Ok
  • Peace
  • Punch
  • Stop
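
To make the idea of "recognizing one of 4 gestures" concrete, here is a minimal sketch of how class labels and predicted probabilities are commonly handled in a classifier like this one. It assumes a softmax output layer and a fixed class order, neither of which is confirmed by this post; the raw scores are made up for illustration, and the helper names (`one_hot`, `softmax`, `predict`) are hypothetical.

```python
import math

# The four gesture classes named in the post, in an assumed fixed order.
GESTURES = ["Ok", "Peace", "Punch", "Stop"]


def one_hot(label):
    """Return a one-hot training target for a gesture name."""
    return [1.0 if g == label else 0.0 for g in GESTURES]


def softmax(scores):
    """Turn raw network scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the most probable gesture and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return GESTURES[best], probs[best]


# Hypothetical raw scores from the last layer, for illustration only.
label, prob = predict([0.2, 3.1, 0.5, 1.0])
print(label, round(prob, 2))
```

The probability returned next to the label is the kind of per-gesture confidence that a demo can display.
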
Demo of my Gesture Recognition implementation

As you will notice, it recognizes the Peace and Punch gestures with the highest accuracy, the Stop sign with medium probability, and the Ok sign with the worst accuracy. There could be multiple reasons for this:

  • Ambient light can affect the camera output.
  • A low number of sample images. I used 200 images per gesture, so only 200 x 4 = 800 images in total.
  • A short training duration. I trained for around 11–12 epochs (an epoch is the neural network term for one full pass of the training set, here the images, through the network).
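
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch of how much training that amounts to. The image counts and epoch count come from this post; the batch size of 32 is purely an assumption (it is not stated here), and the sketch ignores any train/validation split.

```python
import math

IMAGES_PER_GESTURE = 200  # from the post
NUM_GESTURES = 4          # from the post
EPOCHS = 12               # upper end of the 11-12 epochs mentioned in the post
BATCH_SIZE = 32           # assumption: not stated in the post

# Size of the whole training set.
total_images = IMAGES_PER_GESTURE * NUM_GESTURES

# One epoch = one full pass over the training set, processed in batches.
steps_per_epoch = math.ceil(total_images / BATCH_SIZE)

# Each batch triggers one weight update.
total_updates = steps_per_epoch * EPOCHS

# Total image presentations (each image is seen once per epoch).
images_seen = total_images * EPOCHS

print(total_images, steps_per_epoch, total_updates, images_seen)
```

Under these assumptions the network gets only a few hundred weight updates, which fits the suspicion that more images and longer training could help.
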

I don’t want to make this blog post long and boring, so I am thinking of writing another one with a deep dive into my implementation. Hope you will like it!