By now, you may already be familiar with VASE, the audio and video framework behind WIT. VASE handles all the complexity of receiving video from local and remote sources – from webcams and DVB dongles all the way to internet streams. With video access taken care of, one is free to focus on other interesting problems – for instance, computer vision.
We recently started using VASE in a project that identifies objects in remote video streams. One of the first tasks was face detection, for which we used the face detector from Dlib, a C++ library with machine-learning and computer-vision tools. This detector computes histograms of oriented gradients (HOG) over small patches of the image and uses a support vector machine (SVM) to decide whether each patch contains a face. It works really well: Figure 1 shows the faces detected in two frames of a video stream. The red lines mark the regions of the image that contain a face.
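To give an intuition for the HOG feature the detector builds on: each patch is summarized by how strongly its intensity gradients point in each direction. The following is a toy numpy sketch of that idea (not Dlib's actual implementation, which uses cell grids, block normalization, and an image pyramid):

```python
import numpy as np

def hog_patch(patch, n_bins=9):
    """Toy histogram of oriented gradients for one grayscale patch.

    Each pixel's gradient magnitude is accumulated into a bin for its
    (unsigned) gradient orientation, yielding an n_bins-dimensional feature.
    """
    gy, gx = np.gradient(patch.astype(float))     # vertical, horizontal gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude per pixel
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), ang.ravel()):
        hist[int(a // bin_width) % n_bins] += m
    return hist

# A patch whose intensity increases downwards has gradients pointing
# straight down (90 degrees), so one orientation bin dominates:
patch = np.tile(np.arange(8.0), (8, 1)).T
h = hog_patch(patch)
print(h.argmax())  # → 4 (the 80–100° bin)
```

In the real detector, such per-patch histograms are concatenated and fed to a linear SVM that scores whether the patch looks like a face.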
Figure 1: Example face detection from an Internet stream at two distinct moments. Red lines represent the region that contains a face, blue lines represent nose and head contours from a facial model with 68 landmarks.
After detecting the faces, we estimated their shape and precise location by fitting 68 landmarks to each face using Dlib’s face pose estimation. These landmarks are useful, for example, to identify a face by its facial features, to draw overlays (such as virtual glasses or make-up), or simply to draw the detected face’s contours (the blue polygons in Figure 1).
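The pipeline described above can be sketched with Dlib's Python bindings. The `detect_and_fit` helper, the image path, and the landmark grouping are our own illustration; the Dlib calls (`get_frontal_face_detector`, `shape_predictor`) and the published `shape_predictor_68_face_landmarks.dat` model are Dlib's. The index ranges follow the standard 68-landmark layout, which is how one picks out, say, the nose contour drawn in blue in Figure 1:

```python
# Landmark index ranges in the 68-point model (0-based, end-exclusive).
LANDMARK_GROUPS = {
    "jaw": (0, 17),
    "right_eyebrow": (17, 22),
    "left_eyebrow": (22, 27),
    "nose": (27, 36),
    "right_eye": (36, 42),
    "left_eye": (42, 48),
    "mouth": (48, 68),
}

def group_points(points, group):
    """Slice the (x, y) landmark points belonging to one facial feature."""
    lo, hi = LANDMARK_GROUPS[group]
    return points[lo:hi]

def detect_and_fit(image_path, model_path="shape_predictor_68_face_landmarks.dat"):
    """Detect faces in one frame and fit 68 landmarks to each (requires dlib)."""
    import dlib
    detector = dlib.get_frontal_face_detector()   # HOG + SVM detector
    predictor = dlib.shape_predictor(model_path)  # 68-landmark shape model
    img = dlib.load_rgb_image(image_path)
    faces = []
    for rect in detector(img, 1):  # upsample once to catch smaller faces
        shape = predictor(img, rect)
        points = [(p.x, p.y) for p in shape.parts()]
        faces.append({g: group_points(points, g) for g in LANDMARK_GROUPS})
    return faces
```

Drawing a feature's contour is then just connecting its slice of points in order, e.g. the `"nose"` group for the blue nose outline.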
Using VASE and Dlib, we were able to analyze a video stream in real time (25 frames per second) on a laptop (Core i5 2430M, see Figure 2). To achieve this, face detection ran only on every second frame, while the face shape was estimated on every frame.
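The scheduling trick described above – running the expensive detector only on every second frame and reusing its result in between, while fitting the cheap shape model on every frame – can be sketched as follows. `detect` and `fit_shape` are stand-ins for Dlib's detector and shape predictor:

```python
def process_stream(frames, detect, fit_shape, detect_every=2):
    """Alternate expensive detection with cheap per-frame shape fitting.

    detect(frame)        -> list of face regions (run every `detect_every` frames)
    fit_shape(frame, f)  -> landmark shape for face region f (run every frame)
    """
    results = []
    faces = []  # most recent detections, reused on in-between frames
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            faces = detect(frame)  # expensive: HOG + SVM scan of the frame
        shapes = [fit_shape(frame, f) for f in faces]  # cheap per-face fit
        results.append((list(faces), shapes))
    return results
```

Since faces move little between two consecutive frames at 25 fps, the slightly stale regions are still good enough as starting points for the shape fit, and the per-frame cost roughly halves.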
Figure 2: Laptop running a VASE-based program and analyzing the video stream in real-time with Dlib.
These are just baby steps into a new project, and little optimization has been done so far. Face detection could likely be sped up further by processing video frames in parallel, or by tracking a face’s movement between two detection frames. Nevertheless, it already shows how VASE can efficiently hide the complexity of video transmission in computer vision applications.