Posted on

Face recognition in video using Kinect v2 sensor

Michal Viskup

We detect and recognize the human faces in the video stream. Each face in the video is either recognized and the label is drawn next to their facial rectangle or it is labelled as unknown.

The video stream is obtained using Kinect v2 sensor. This sensor offers several data streams, we mention only the 2 relevant for our work:

  • RGB stream (resolution: 1920×1080, depth: 8bits)
  • Depth stream (resolution: 512×424, depth: 16bits)

The RGB stream is self-explanatory.  The depth stream consists of the values that denote the distance of the each pixel from the sensor.  The reliable distance lays between the 50 mms and extends to 8 meters. However, past the 4.5m mark, the reliability of the data is questionable. Kinect offers the methods that map the pixels from RGB stream to Depth stream and vice-versa.

We utilize the facial data from RGB stream for the recognition. The depth data is used to enhance the face segmentation through the nose-tip detection.

First of all, the face recognizer has to be trained. The training is done only once. The state of the trained recognizer can be persisted in xml format and reloaded in the future without the need for repeated training. OpenCV offers implementation of three face recognition methods:

  • Eigenfaces
  • Fisherfaces
  • Local Binary Pattern Histograms

We used the Eigenfaces and Fisherfaces method. The code for creation of the face recognizer follows:

void initRecognizer()
	Ptr<FaceRecognizer> fr;
	fr = createEigenFaceRecognizer();

It is simple as that. Face recognizer that uses the Fisherfaces method can be created accordingly. The Ptr interface ensures the correct memory management.

All the faces presented to such recognizer would be labelled as unknown. The recognizer is not trained yet. The training requires the two vectors:

  • The vector of facial images in the OpenCV Mat format
  • The vector of integer values containing the identifiers for the facial images

These vectors can be created manually. This however is not sufficient for processing the large training sets. We thus provide the automated way to create these vectors. Data for each subject should be placed in a separate directory. Directories containing the subject data should be places within the single directory (referred to as root directory). The algorithm is given an access to the root directory. It processes all the subject directories and creates both the vector images and the vector labels. We think that the Windows API for accessing the file system is inconvenient. On the other hand, UNIX based systems offer convenient C API through the Dirent interface. Visual Studio compiler lacks the dirent interface. We thus used an external library to gain access to this convenient interface ( Following code requires the library to run:

First we obtain the list of subject names. These stand for the directory names within the root directory. The subject names are stored in the vector of string values. It can be initialized manually or using the text file.

Then, for each subject, the path to their directory is created:

std::ostringstream fullSubjectPath;
fullSubjectPath << ROOT_DIRECTORY_PATH;
fullSubjectPath << "\\";
fullSubjectPath << subjectName;
fullSubjectPath << "\\";

We then obtain the list of file names that reside within the subject directory:

std::vector<std::string> DataProvider::getFileNamesForDirectory(const std::string subjectDirectoryPath)
	std::vector<std::string> fileNames;
	DIR *dir;
	struct dirent *ent;
	if ((dir = opendir(subjectDirectoryPath.c_str())) != NULL) {
		while ((ent = readdir(dir)) != NULL) {
			if ((strcmp(ent->d_name, ".") == 0) || (strcmp(ent->d_name, "..") == 0))
	else {
		std::cout << "Cannot open the directory: ";
		std::cout << subjectDirectoryPath;
	return fileNames;

Then, the images are loaded and stored in vector:

std::vector<std::string> subjectFileNames = getFileNamesForDirectory(fullSubjectPath.str());

std::vector<cv::Mat> subjectImages;
for (std::string fileName : subjectFileNames)
	std::ostringstream fullFileNameBuilder;
	fullFileNameBuilder << fullSubjectPath.str();
	fullFileNameBuilder << fileName;
	cv::Mat subjectImage = cv::imread(fullFileNameBuilder.str());
return subjectImages;

In the end, label vector is created:

for (int i = 0; i < subjectImages.size(); i++){

With images and labels vectors ready, the training is a one-liner:


The recognizer is trained. What we need now is a video and depth stream to recognize from.
Kinect sensor is initialized by the following code:

void initKinect()

	hr = GetDefaultKinectSensor(&kinectSensor);
	if (FAILED(hr))

	if (kinectSensor)
		// Initialize the Kinect and get the readers
		IColorFrameSource* colorFrameSource = NULL;
		IDepthFrameSource* depthFrameSource = NULL;

		hr = kinectSensor->Open();

		if (SUCCEEDED(hr))
			hr = kinectSensor->get_ColorFrameSource(&colorFrameSource);

		if (SUCCEEDED(hr))
			hr = colorFrameSource->OpenReader(&colorFrameReader);


		if (SUCCEEDED(hr))
			hr = kinectSensor->get_DepthFrameSource(&depthFrameSource);

		if (SUCCEEDED(hr))
			hr = depthFrameSource->OpenReader(&depthFrameReader);


	if (!kinectSensor || FAILED(hr))

The following function obtains the next color frame from Kinect sensor:

Mat getNextColorFrame()
	IColorFrame* nextColorFrame = NULL;
	IFrameDescription* colorFrameDescription = NULL;
	ColorImageFormat colorImageFormat = ColorImageFormat_None;

	HRESULT errorCode = colorFrameReader->AcquireLatestFrame(&nextColorFrame);
	if (!SUCCEEDED(errorCode))
		Mat empty;
		return empty;

	if (SUCCEEDED(errorCode))
		errorCode = nextColorFrame->get_FrameDescription(&colorFrameDescription);
	int matrixWidth = 0;
	if (SUCCEEDED(errorCode))
		errorCode = colorFrameDescription->get_Width(&matrixWidth);
	int matrixHeight = 0;
	if (SUCCEEDED(errorCode))
		errorCode = colorFrameDescription->get_Height(&matrixHeight);
	if (SUCCEEDED(errorCode))
		errorCode = nextColorFrame->get_RawColorImageFormat(&colorImageFormat);
	UINT bufferSize;
	BYTE *buffer = NULL;
	if (SUCCEEDED(errorCode))
		bufferSize = matrixWidth * matrixHeight * 4;
		buffer = new BYTE[bufferSize];
		errorCode = nextColorFrame->CopyConvertedFrameDataToArray(bufferSize, buffer, ColorImageFormat_Bgra);
	Mat frameKinect;
	if (SUCCEEDED(errorCode))
		frameKinect = Mat(matrixHeight, matrixWidth, CV_8UC4, buffer);
	if (colorFrameDescription)
	if (nextColorFrame)

	return frameKinect;

Analogous function obtains the next depth frame. The only change is the type and size of the buffer, as the depth frame is single channel 16 bit per pixel.
Finally, we are all set to do the recognition. The face recognition task consists of the following steps:

  1. Detect the faces in video frame
  2. Crop the faces and process them
  3. Predict the identity

For face detection, we use OpenCV CascadeClassifier. OpenCV provides the extracted features for the classifier for both the frontal and the profile faces. However, in video both the slight and major variations from these positions are present. We thus increase the tolerance for the false positives to prevent the cases when the track of the face is lost between the frames.
The classifier is simply initialized by loading the set of features using its load function.

CascadeClassifier cascadeClassifier;

The face detection is done as follows:

vector<Mat> getFaces(const Mat frame, vector<Rect_<int>> &rectangles)
	Mat grayFrame;
	cvtColor(frame, grayFrame, CV_BGR2GRAY);

	cascadeClassifier.detectMultiScale(grayFrame, rectangles, 1.1, 5);

	vector<Mat> faces;
	for (Rect_<int> face : rectangles){
		Mat detectedFace = grayFrame(face);
		Mat faceResized;
		resize(detectedFace, faceResized, Size(240, 240), 1.0, 1.0, INTER_CUBIC);
	return faces;

With faces detected, we are set to proceed to recognition. The recognition process is as follows:

Mat colorFrame = getNextColorFrame();
vector<Rect_<int>> rectangles;
vector<Mat> faces = getFaces(colorFrameResized, rectangles);
int label = -1;
label = fr->predict(face);
string box_text = format("Prediction = %d", label);
putText(originalFrame, box_text, Point(rectangles[i].tl().x, rectangles[i].tl().y), FONT_HERSHEY_PLAIN, 1.0, CV_RGB(0, 255, 0), 2.0);

Nose tip detection is done as follows:

unsigned short minReliableDistance;
unsigned short maxReliableDistance;
Mat depthFrame = getNextDepthFrame(&minReliableDistance, &maxReliableDistance);
double scale = 255.0 / (maxReliableDistance - minReliableDistance);
depthFrame.convertTo(depthFrame, CV_16UC1, scale);

// detect nose tip
// only search for the nose tip in the head area
Mat deptHeadRegion = depthFrame(rectangles[i]);
// Nose is probably the local minima in the head area
double min, max;
Point minLoc, maxLoc;
minMaxLoc(deptHeadRegion, &min, &max, &minLoc, &maxLoc);
	minLoc.x += rectangles[i].x;
	minLoc.y += rectangles[i].y;

// Draw the circle at proposed nose position.
circle(depthFrame, minLoc, 5, 255, -1);

To conclude, we provide a simple implementation that allows the detection and recognition of human faces within a video. The room for improvement is that rather than allowing more false positives in detection phase, the detected nose tip can be used for face tracking.

Posted on

Medical image segmentation

Martin Tamajka

In this project, our goal was to apply image segmentation techniques to dense volume of standard medical data.


Our method is based on oversegmentation to supervoxels (similar to superpixels, but in 3D volume). Such oversegmentation dramatically decreases processing time and has many other advantages over working directly with voxels. Oversegmentation is done using SLIC algorithm ( Implementation that we use was created by authors and does not depend on any other library. This comes at cost of necessity to transform images in OpenCV format to C++ arrays. SLIC allows to choose between supervoxel compactness (or regularity of shape) and intensity homogeneity. In our work, we decided to prefer homogeneity over regularity, because different tissues in anatomical organs have their typical intensities.

Oversegmented MRI slice – it can be seen that supervoxels adhere boundaries.

Merging supervoxels


After we oversegmented images, we created object representation of volume. Our object representation (SLIC3D) has following attributes:

vector&amp;amp;lt;Supervoxel*&amp;amp;gt;				m_supervoxels;
std::unordered_map&amp;amp;lt;int, Supervoxel*&amp;amp;gt;	m_supervoxelsMap;
int	m_height;
int	m_width;
int	m_depth;

The most important is the vector of Supervoxel pointers. Supervoxel is our basic class providing important information about contained voxels, can generate features that can be used in classification process and (very important) knows its neighbouring supervoxels. Currently, Supervoxel can generate 3 kinds of features:

float Supervoxel::AverageIntensity()
	return m_centroid-&amp;amp;gt;intensity;

float Supervoxel::AverageQuantileIntensity(float quantile)
	assert(quantile &amp;amp;gt;= 0 &amp;amp;amp;&amp;amp;amp; quantile &amp;amp;lt;= 1);

	float intens = 0;

	int terminationIndex = quantile * m_points.size();
	for (int i = 0; i &amp;amp;lt; terminationIndex; i++)
		intens += m_points[i]-&amp;amp;gt;intensity;

	return intens / terminationIndex;

float Supervoxel::MedianIntensity()
	return m_points[m_points.size() / 2]-&amp;amp;gt;intensity;

SLIC3D class (the one containing supervoxels) has method where “all magic happens” - MergeSimilarSupervoxels. As stated in its name, method merges voxels. Method performs given number of iterations. In each iteration, method takes random supervoxels, compares it with its neighbours and if a neighbour and examined supervoxels have similar average intensity, the latter one is merged to the examined one. Code can be seen right below.

bool SLIC3D::MergeSimilarSupervoxels()
	vector&amp;amp;lt;Supervoxel*&amp;amp;gt; supervoxelsToBeErased;
	for (int i = 0; i &amp;amp;lt; 10000; i++)
		cout &amp;amp;lt;&amp;amp;lt; i &amp;amp;lt;&amp;amp;lt; " " &amp;amp;lt;&amp;amp;lt; m_supervoxels.size() &amp;amp;lt;&amp;amp;lt; endl;
		std::sort(m_supervoxels.begin(), m_supervoxels.end(), helper_sortFunctionByAvgIntensity);
		Supervoxel* brightestSupervoxel = m_supervoxels[rand() % (m_supervoxels.size() - 1)];
		std::unordered_map&amp;amp;lt;int, Supervoxel*&amp;amp;gt; nb = *(brightestSupervoxel-&amp;amp;gt;GetNeighbours());

		int numberOfIterations = 0;
		vector&amp;amp;lt;int&amp;amp;gt; labelsToBeErased;
		for (auto it = nb.begin(); it != nb.end(); it++)
			if (std::min(brightestSupervoxel-&amp;amp;gt;AverageIntensity(), it-&amp;amp;gt;second-&amp;amp;gt;AverageIntensity()) / std::max(brightestSupervoxel-&amp;amp;gt;AverageIntensity(), it-&amp;amp;gt;second-&amp;amp;gt;AverageIntensity()) &amp;amp;gt; 0.95)
				helper_mergeSupervoxels(brightestSupervoxel, it-&amp;amp;gt;second);
				//cout &amp;amp;lt;&amp;amp;lt; "nope" &amp;amp;lt;&amp;amp;lt; endl;

		for (int i = 0; i &amp;amp;lt; labelsToBeErased.size(); i++)
			Supervoxel* toBeRemoved = m_supervoxelsMap[labelsToBeErased[i]];

			if (NULL == toBeRemoved)

			std::unordered_map&amp;amp;lt;int, Supervoxel*&amp;amp;gt; nbb = *(toBeRemoved-&amp;amp;gt;GetNeighbours());

			for (auto it = nbb.begin(); it != nbb.end(); it++)
				catch (Exception e)
					cout &amp;amp;lt;&amp;amp;lt; "exc: " &amp;amp;lt;&amp;amp;lt; e.msg &amp;amp;lt;&amp;amp;lt; endl;
		for (int i = 0; i &amp;amp;lt; labelsToBeErased.size(); i++)
			Supervoxel* toBeRemoved = m_supervoxelsMap[labelsToBeErased[i]];
			//delete toBeRemoved; //COMMENT


		for (auto it = m_supervoxelsMap.begin(); it != m_supervoxelsMap.end(); ++it)

	for (int i = 0; i &amp;amp;lt; supervoxelsToBeErased.size(); i++)
		;// delete supervoxelsToBeErased[i];	//COMMENT - to be considered if delete

	return true;

Results of merging in such form highly depend on chosen similarity level. In the picture below left we can see result of applying similarity level 0.95. In the image right the value of similarity level was chosen to be 0.65.


We also tried to train SVM to classify brain and non-brain structures using just these features. We got 4 successful classifications of 5. With classification we will continue later.

Posted on

Lane markers detection

Michal Polko

In this project, we detect lane markers in videos taken with dashboard camera.


  1. Convert a video frame to grayscale, boost contrast and apply dilation operator to highlight lane markers in the frame.
    Highlighted lane markers.
    cvtColor(frame, frame_bw, CV_RGB2GRAY);
    frame_bw.convertTo(frame_bw, CV_32F, 1.0 / 255.0);
    pow(frame_bw, 3.0, frame_bw);
    frame_bw *= 3.0;
    frame_bw.convertTo(frame_bw, CV_8U, 255.0);
    dilate(frame_bw, frame_bw, getStructuringElement(CV_SHAPE_RECT, Size(3, 3)));
  2. Apply the Canny edge detection to find edges.
    Application of the Canny edge detection.
    int cny_threshold = 100;
    Canny(frame_bw, frame_edges, cny_threshold, cny_threshold * 3, 3);
  3. Apply the Hough transform to find line segments.
    vector<Vec4i> hg_lines;
    HoughLinesP(frame_edges, hg_lines, 1, CV_PI / 180, 15, 15, 2);
  4. Since the Hough transform returns all line segments, not only those around lane markers, it is necessary to filter the results.
    1. We create two lines that describe boundaries of the current lane (hypothesis).
      1. We place two converging lines in the frame.
      2. Using brute-force search, we try to find position where they capture as many line segments as possible.
      3. Since road in the frame can have more than one lane, we try to find result as narrow as possible.
    2. We select line segments that are captured by the created hypothesis, mark them as lane markers and draw them.
    3. Each frame, we take the detected lane markers from the previous frame and perform linear regression to adjust the hypothesis (continuous adjustment).
    4. If we cannot find lane markers in more than 5 successive frames (due to failure of continuous adjustment, lane change, intersection, …), we create a new hypothesis.
    5. If the hypothesis is too wide (almost full width of the frame), we create a new one, because arrangement of road lanes might have changed (e.g. additional lane on freeway).
  5. To distinguish between solid and dashed lane markers, we calculate coverage of the hypothesis by line segments. If the coverage is less than 60%, it is a dashed line; if more, it is a solid line.

    Filtered result of the Hough transform + detection of solid/dashed lines.
Posted on

Split and merge

Matus Pikuliak

In our work we have implemented segmentation algorithm split-and-merge. We have designed each step of this algorithm processing original image into segmented image composed of homogeneous regions. We have used OpenCV library to build our solution.

Our method has 4 steps. We will demonstrate their effect on image depicted in Figure 1.

Original photo
  1. Pre-processing
    First we convert image from RGB color space to Lab. Then we blur the image to remove various blemishes, that could potentially halt the splitting phase

    cv::cvtColor(image, image, CV_BGR2Lab);
    cv::GaussianBlur(image, image, Size(5,5), 0, 0);
  2. Splitting
    In splitting phase we divide image to same-sized quarters. We compute the standard deviation of each of the dimensions of color space. If these values for given quarter exceed adjustable thresholds, we recursively divide this quarter as well. This go on, until the created quarters are big enough. If they one of their dimensions is smaller than what we have set as minimum length (9), dividing is stopped.
    If the values of deviations does not exceed thresholds, dividing is stopped and given quarter is completely filled with its mean color. Examples of splitting can be seen on Figure 2.

    After splitting.
    void split(roi) {
    	if (roi.standard_deviation > threshold) {
    		quarters = roi.get_quarters
    	else {
  3. Merging
    Split image is subsequently processed with merge operation. This operation merges neighboring regions with similar features. In our work we use only the dimensions of Lab color space and therefor we are merging regions with similar color. The similarity is again calculated using thresholds for individual dimensions of color space. Result of this operation can be seen on Figure 3. We can see that a lot of rectangular regions were merged into bigger super-regions. This effect can be seen on every part of the image.

    After merging.
    // runs floodFill for every unaffected ROI
    void merge() {
    	imageorg = image.clone();
    	for (std::list<list_item>::iterator it = rois.begin(); it != rois.end(); it++) {
    		Vec3b c =<Vec3b>(it->point);
    		Vec3b c2 =<Vec3b>(it->point);
    		if (c[0] == c2[0] && c[1] == c2[1] && c[2] == c2[2]) {
    			floodFill(image, it->point, Scalar(c[0], c[1], c[2]), 0,
    				Scalar(l_treshold, ab_treshold, ab_treshold),
    				Scalar(l_treshold, ab_treshold, ab_treshold));
  4. Post-processing
    As we can see there are some artifacts left, mainly on the edges of different bigger regions. These artifacts are result of splitting, which can not handle parts of images, where there are sudden changes from one dominant color to other. In order to remove these artifacts and improve overall we are using several morphological operations followed with yet another merging of color regions. Result of post-processing can be seen on Figure 4, which is at the same time final result of our method for this image.

    Final image.
    Mat kernel = getStructuringElement(MORPH_ELLIPSE,Size(Morph, Morph));
    dilate(image, image, kernel,Point(-1,-1),2);
    erode(image, image, kernel,Point(-1,-1),2);
    erode(image, image, kernel,Point(-1,-1),2);
    dilate(image, image, kernel,Point(-1,-1),2);


There are several settings that can affect the result of our method:

  • color dimensions – we can change how much is our method sensitive to changes in individual dimensions of our color space. Higher sensitivity leads to smaller regions after split phase and worse merging. Lower sensitivity can on the other hand lead to results which are too rough.
  • minimum size of region – this setting can affect resulting granularity of final image. Too small minimum regions lead to over-split image, while too large lead to inaccurate results.
  • color space – color space is another setting that can change the outcome of our method. In our experiments we find out, that the Lab color space outperforms RGB color space in generality. Lab was providing satisfactory results to almost all of our testing images with similar settings. RGB on the other hand requires more tweaking for optimal results.


We have successfully implemented split-and-merge segmentation algorithm and tested in on variety of images with different features and characteristics. We conclude that our implemented method is fast and reliable.

Posted on

Free parking spots detection

Jan Onder

The goal of this project is to determine the state of a parking lot, more precisely the number of parking spaces. Basically this project is divided to two interconnected parts. One to determine number of parking spots from image, for example from first frame of video from camera, monitoring the parking lot, and second to determine wheter, or not is there a movement on the parking lot.

The process:

  1. We get parking lines from image of parking lot and get rid of noise:
    Canny(inputImage, helpMatrix, 450, 400, 3);
    cvtColor(helpMatrix, helpMatrix2, CV_GRAY2BGR);
    vector<Vec4i> lines;
    HoughLinesP(helpMatrix, lines, 1, CV_PI / 180, 7, 10, 10);
    for (size_t i = 0; i < lines.size(); i++)
    	Vec4i l = lines[i];
    	line(helpMatrix2, Point(l[0], l[1]), Point(l[2], l[3]), Scalar(0, 0, 255), 5, CV_AA)
    Mat element2 = getStructuringElement(CV_SHAPE_RECT, Size(3, 3));
    cv::erode(helpMatrix2, helpMatrix2, element);
    cv::dilate(helpMatrix2, helpMatrix2, element2);

    Original Image (A), Canny edges with noise (B), HoughLines without noise (C)
  2. We use double dilate and substract their results to get mask of lines:
    morphologyEx(helpMatrix2, mark, CV_MOP_DILATE, element,Point(-1,-1), 3);
    morphologyEx(helpMatrix2, mark2, CV_MOP_DILATE, element, Point(-1, -1), 2);
    result = mark - mark2;

    Result of dilating and substracting
  3. We use Canny and Hough lines, this time for removing the connecting line between each parking spot:
    Canny(resu, mark, 750, 800, 3);
    cvtColor(mark, mark2, CV_GRAY2BGR);
    mark2 = Scalar::all(0);
    vector<Vec4i> lines3;
    HoughLinesP(mark, lines3, 1, CV_PI / 180, 20, 15, 10);
    for (size_t i = 0; i < lines3.size(); i++)
    	Vec4i l = lines3[i];
    	line(mark2, Point(l[0], l[1]), Point(l[2], l[3]), Scalar(0, 0, 255), 2, CV_AA);

    Result of Hough lines to remove connection between lines in mask
  4. We use this as a mask for finding contours for the Watershed algorithm and get result with detected parking spots, each colored with different color:
    vector<vector<Point> > contours;
    vector<Vec4i> hierarchy;
    findContours(markerMask, contours, hierarchy, RETR_CCOMP, CHAIN_APPROX_SIMPLE);
    int contourID = 0;
    for (; contourID >= 0; contourID = hierarchy[contourID][0], parkingSpaceCount++)
    	drawContours(markers, contours, contourID, Scalar::all(parkingSpaceCount + 1), -1, 8, hierarchy, INT_MAX);
    watershed(helpMatrix2, markers);
    Mat wshed(markers.size(), CV_8UC3);
    for (i = 0; i < markers.rows; i++)
    	for (j = 0; j < markers.cols; j++)
    		int index =<int>(i, j);
    		if (index == -1)<Vec3b>(i, j) = Vec3b(255, 255, 255);
    		else if (index <= 0 || index > parkingSpaceCount)<Vec3b>(i, j) = Vec3b(0, 0, 0);
    		else<Vec3b>(i, j) = colorTab[index - 1];

    Result of watershed algorithm with detected parking spots
  5. If our user is not satisfied with this result, he can always draw the seeds for watershed himself, or just adjust these seeds (img is the name of matrix, where user can see markers and markerMask matrix, where seeds are stored):
    Point prevPt(-1, -1);
    static void onMouse(int event, int x, int y, int flags, void*)
    	if (event == EVENT_LBUTTONDOWN) prevPt = Point(x, y);
    	else if (event == EVENT_MOUSEMOVE && (flags & EVENT_FLAG_LBUTTON))
    		Point pt(x, y);
    		if (prevPt.x < 0)
    			prevPt = pt;
    		line(markerMask, prevPt, pt, Scalar::all(255), 5, 8, 0);
    		line(img, prevPt, pt, Scalar::all(255), 5, 8, 0);
    		prevPt = pt;
    		imshow("image", img);

    User inputing seeds for watershed algorithm
  6. We have our spots stored, so we know their exact location, now its time to determine, wheter, or not check the lot again, if some vehicles are moving. For this purpose we need to detect movement on the lot with backgroundSubstraction, which can constantly learn what is static in image:
    Ptr<BackgroundSubtractor> pMOG2;
    pMOG2 = new BackgroundSubtractorMOG2(3000, 20.7,true);
  7. We will give the MOG every frame captured from video feed and see what it results:
    pMOG2->operator()(frame, matMaskMog2,0.0035);
    imshow("MOG2", matMaskMog2);

    Result of MOG substraction
  8. As we can see, there is some noise detected – this noise represents for example moving leaves on trees, so it is necessary to remove it:
    cv::morphologyEx(matMaskMog2, matMaskMog2, CV_MOP_ERODE, element);
    cv::medianBlur(matMaskMog2, matMaskMog2, 3);
    cv::morphologyEx(matMaskMog2, matMaskMog2, CV_MOP_DILATE, element2);
  9. Finally we find coordinates of moving object from MOG and draw a rectangle with random color around it (result can be seen at the top):
    scv::findContours(matMaskMog2, contours, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_NONE);
    vector<vector<Point> > contours_poly(contours.size());
    vector<Rect> boundRect(contours.size());
    for (int i = 0; i < contours.size(); i++)
    approxPolyDP(Mat(contours[i]), contours_poly[i], 3, true);
    	boundRect[i] = boundingRect(Mat(contours_poly[i]));		
    RNG rng(01);
    for (int i = 0; i< contours.size(); i++)
    Scalar color = Scalar(rng.uniform(0, 255), rng.uniform(0, 255), rng.uniform(0, 255)); 
    	rectangle(frame, boundRect[i].tl(), boundRect[i].br(), color, 2, 8, 0);


We have a functional parking spot detection, which means we can easily determine how much parking spots our parking lot have. We also have stored where are these parking spots exactly located. From the camera feed, we can detect car movement and also determine, on which coordinates the movement stopped. We did not implemented the function to connect these infomormation sources, but it can be easily added.


  • For parking spots detection we need an empty lot. Otherwise it will be nearly impossible to determine where are these spots exactly located, mainly if vehicles are not parking at their exact center.
  • For movement detection, we need static camera feed, becouse of used MOG method, which constantly learns what is background and which object are moving.
  • Parking spots detection is not perfect, it still needs some user correction to determine exact number of parking spots.
Posted on

Panorama – Image registration

Vladimir Ogurcak

The main objective of this project is to create panoramic image from sequence of two or more overlapping images using OpenCV. We assume then overlap of two adjacent images is more than 30%, vertical variation of all images is minimal and images are ordered from left to right.

The main idea of this algorithm is based on independent image stitching of two adjacent images and recurrent stitching of results until complete panorama image isn’t created. The code below shows this idea:

AutomaticPanorama(vector<Mat> results){			//results contains all input images
	vector<Mat> partialResults = vector<Mat>();
	while (results.size() != 1){
		//Stitch all adjacent images and save result as partial result
		for (int i = 0; i < results.size() - 1; i++){
			Mat panoramaResult = Panorama(results[i], results[i + 1]);
		//results = paritalResults
		vector<Mat> temp = results;
		results = partialResults;
		partialResults = temp;

Function Panorama(results[i], results[i + 1]) is custom implementation of image stitching. This function uses local detector and descriptor SIFT, Brute-force keypoint matcher and perspective transformation to realize image registration and stitching. Individual steps (parts) of function are described below. In this project we also tried to use other detectors like SURF, ORB and descriptors like SURF, ORB, BRISK and FREAK, but SIFT and SURF appears to be the best choices.

OpenCV functions

SiftFeatureDetector, SiftDescriptorExtractor, BFMatcher, drawMatches, findHomography, perspectiveTransform, warpPerspective, imshow, imwrite


Left and right image.


  1. Calculate keypoints for left and right image using SIFT feature detector:
    SiftFeatureDetector detector = SiftFeatureDetector();
    vector<KeyPoint> keypointsPrev, keypointsNext;
    detector.detect(imageNextGray, keypointsNext);
    detector.detect(imagePrevGray, keypointsPrev);

    Left and right image with keypoints.
  2. Calculate local descriptor for all keypoints using SIFT descriptor:
    SiftDescriptorExtractor extractor = SiftDescriptorExtractor();
    Mat descriptorsPrev, descriptorsNext;
    extractor.compute(imageNextGray, keypointsNext, descriptorsNext);
    extractor.compute(imagePrevGray, keypointsPrev, descriptorsPrev);
  3. For keypoint descriptors from left image find corresponding keypoint descriptors in right image using Brute-Force matcher:
    BFMatcher bfMatcher;
    bfMatcher.match(descriptorsPrev, descriptorsNext, matches);

    Pairs of key points (every fifth match)
  4. Find only good matches. Good matches are pairs of keypoints which vertical coordinate variation is less than 5%:
    vector<DMatch> goodMatches;
    int minDistance = imagePrevGray.rows / 100 * VERTICALVARIATION;
    goodMatches = FindGoodMatches(matches, keypointsPrev, keypointsNext, minDistance);

    Good matches (5% variation)
  5. Find homography matrix for perspective transformation of right image. Use only good keypoints for computing homography matrix:
    Mat homographyMatrix;
    homographyMatrix = findHomography(pointsNext, pointsPrev, CV_RANSAC);
  6. Warp right image using homography matrix from previous step:
    Mat warpImageNextGray;
    warpPerspective(imageNextGray, warpImageNextGray, homograpyMatrix, Size(imageNextGray.cols + imagePrevGray.cols, imageNextGray.rows));

    Warped right image
  7. Calculate left image and right (warped) image corners:
    vector<Point2f> cornersPrev, cornersNext;
    SetCorners(imagePrevGray, imageNextGray, &cornersPrev, &cornersNext, homograpyMatrix);

    Left (blue) and right (green) image boundaries
  8. Find overlap coordinates of left and right images:
    int overlapFromX, overlapToX;
    if (cornersNext[0].x < cornersNext[3].x){
    	overlapFromX = cornersNext[0].x;
    	overlapFromX = cornersNext[3].x;
    overlapToX = cornersPrev[1].x;
  9. Join left and right (warped) image using linear interpolation in overlapped area. Both images contributes with 100% of its pixels outside the overlapping area. Left image contribute with 100% at the beginning of overlapping area and gradually decreases its contribution to 0% at the end. Right image contribute opposite way from 0% at beginning to 100 % at the end:
    Mat result = Mat::zeros(warpImageNextGray.rows, warpImageNextGray.cols, CV_8UC3);
    DrawTransparentImage(imagePrevGray, cornersPrev, warpImageNextGray, cornersNext, &result, overlapFromX, overlapToX);
  10. Crop joined image:
    Rect rectrangle(0, 0, cornersNext[1].x, result.rows);
    result = result(rectrangle);


  • Sufficient overlap of adjavent images. We assume more than 30%.
  • Maximum number of input images is generally less or equals to 5. The deformation of perspective transformation causes failure of algorithm in larger set of input images.
  • Input images has to be sorted for example from left to right.
  • Panorama of distance objects gives better results than panorama of nearby objects.


Goldangate bridge – Panorama of 5 images
Mountains – Panorama of 5 images

Shanghai – Panorama of 5 images


Comparison of different keypoint detectors and descriptors

Detector / Descriptor Number of all matches Number of good matches Result
SIFT / SIFT 3064 976 Successful
SURF / SURF 3217 1309 Successful
ORB / ORB 3000 1113 Successful
SIFT / BRIEF 2827 790 Successful
SIFT / BRISK 1012 128 Failure
SIFT / FREAK 1154 151 Failure
Posted on

Motion analysis in CCTV records

Filip Mazan

This project deals with analysis of video captures from CCTVs to detect people’s motion and extract their trajectories in time. The output of this project is a relatively short video file containing only frames of the original where the movement was detected along with shown trajectories of people. The second output consists of cumulative image of all trajectories. This can be later used to classify trajectories as (not) suspicious.

  1. Each frame of input video is converted into grayscale and median filtered to remove noise
  2. First 30 seconds of video is used as a learning phase for MOG2 background subtractor
  3. For each next frame the MOG2 mask is calculated and morphological closing is applied to it
  4. If count of non-zero pixels is greater than a set threshold, we claim there is a movement present
    1. Good features to track are found if there is not many left
    2. Optical flow is calculated
    3. Each point which has moved is stored along with frame number
  5. If there is no movement on current frame, postprocess last movement interval (if any)
    1. All stored tracking points (x, y, frame number) from previous phase are clusterized by k-means into variable number of centroids
    2. All centroids are sorted by their frame number dimension
    3. Trajectory is drawn onto the output frame
    4. Movement sequence is written into the output video file along with continuously drawn trajectory

Following image shows the sum of all trajectories found in 2 hours long input video. This can be used to classify trajectories as (not) suspicious.


Posted on

Car detection in videos

Peter Horvath

We detect cars from videos recorded by dash cameras situated in cars. This type of camera is dynamic so we decided to train and use Haar Cascade Classifier. The classifier itself returns a lot of false positive results. So we improved classifier by removing false positive results using road detection.

Functions used: cvtColor, split, Rect, inRange, equalizeHist, detectMultiScale, rectangle, bitwise_and


1st part – training haar cascade classifier

Collect a set of positive samples and negative samples. Make a list file of both (positives.dat and negatives.dat). Then use opencv_createsamples function with parameters to make a single .vec file with all positive samples.

opencv_createsamples -info positives.dat -vec samples.vec -num 500 -w 20 -h 20

Now train a cascade classifier using HAAR features

opencv_traincascade -data classifier -featureType HAAR -vec samples.vec -bg negatives.dat -numPos 500 -numNeg 850 -numStages 15 -precalcValBufSize 1000 -precalcIdxBufSize 1000 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -mode ALL -w 20 -h 20

Output of this procedure is trained classifier – xml file.

2nd part – using classifier in C++ code to detect cars, improved by road detection

Open video file using VideoCapture. For every video frame do:

  1. Convert actual video frame to HSV color model
    cvtColor(frame, frame_hsv, CV_BGR2HSV);
  2. Make sum of H S V in captured road sample. Calculate average Hue Saturation and Value of captured road sample.
    int averageHue = sumHue / (rectangle_hsv_channels[0].rows*rectangle_hsv_channels[0].cols);
    int averageSat = sumSat / (rectangle_hsv_channels[1].rows*rectangle_hsv_channels[1].cols);
    int averageVal = sumVal / (rectangle_hsv_channels[2].rows*rectangle_hsv_channels[2].cols);
  3. Use inRange function to make a binary result – road is white colored, other is black colored
    inRange(frame_hsv, cv::Scalar(averageHue - 180, averageSat - 15, averageVal - 20), cv::Scalar(averageHue + 180, averageSat + 15, averageVal + 20), final);		


  4. Convert actual video frame to grayscale
    cvtColor(frame, image_gray, CV_BGR2GRAY);
  5. Create an instance of CascadeClassifier
    String car_cascade_file = "classifier.xml";
    CascadeClassifier car_classifier;
  6. Detect cars in grayscale video frame using classifier
    car_classifier.detectMultiScale(image_gray, cars, 1.1, 2, 0 | CV_HAAR_SCALE_IMAGE, Size(20, 20));

    Result have a lot of false positives


  7. Make a black image with white squares at locations returned by cascade classifier. Make logical and between it and image with detected road
  8. Accept only squares which have at least 20% of pixels white.


  • Cascade classifier trained only with 560 positive and 860 negative samples – detect cars only from near distance
  • Road detection fails when some object (car, road line) comes to blue rectangle (supposed to be road sample)
  • Dirt have a similar saturation as road – detected as road
Posted on

Card detection

Michael Garaj

The goal of this project is to detect card in captured image. Motivation was to make automatized recognizer of cards for poker tournaments. Application is implemented to find orthogonal edges in an image and try to find card by ratio of its edges.

Process of finding and recognizing a card in image follows these steps:

  1. Load an image from local repository.
  2. Apply blur and bilateral filter.
  3. Compute binary threshold.
  4. Extract edges from binary image by Canny algorithm.
  5. Apply Hough lines to get lines find in edge image.
  6. Search for orthogonal lines and store them in structure for future optimalization.
  7. Optimise number of detected lines in same area by choosing only the biggest ones.
  8. Find card which consist of 3 touching lines.
  9. Compute ratio of the lines and identify cards in the image.
    Following code sample shows steps of optimalization of detected corners:

    vector<MyCorner> optimalize(vector<MyCorner> corners, Mat image) {
    	vector<MyCorner> optCorners;
    	for (int i = 0; i < corners.size(); i++) {
    		corners[i].crossing = crossLines(corners[i]);
    		corners[i].single = 1;
    	int distance = 25;
    	for (int i = 0; i < corners.size() - 1; i++) {
    		MyCorner corner = corners[i];
    		float lengthI = 0, lengthJ = 0;
    		if (corner.single){
    			for (int j = i + 1; j < corners.size(); j++) {
    				if (abs(corner.crossing.x - corners[j].crossing.x) < distance && abs(corner.crossing.y - corners[j].crossing.y) < distance &&
    					(corner.single || corners[j].single)) {
    					lengthI = getLength(corner.u) + getLength(corner.v);
    					lengthJ = getLength(corners[j].u) + getLength(corners[j].v);
    					if (lengthI < lengthJ) {
    						corner = corners[j];
    					corner.single = 0;
    					corners[i].single = 0;
    					corners[j].single = 0;
    	return optCorners;
Posted on

Bag of Words algorithm

Tomas Drutarovsky

We implement well-known Bag of Words algorithm (BoW) in order to perform image classification of tiger cat images. In the work, we use a subset of publicly available ImageNet dataset and divide data on two sets – tiger cats and non-cat objects, which consist of images of 10 random chosen object types.

The main processing algorithm is performed by these steps:

  1. Choose a suitable subset of images from a large dataset
    • We use around 100 000 unique images
  2. Detect keypoints
    • We detect keypoints using SIFT or Dense keypoint extractor
    DenseFeatureDetector dense(20.0f, 3, 2, 10, 4);
    BOWKMeansTrainer bowTrainer(dictionarySize, tc, retries, flags);
    for (int i = 0; i < list.count(); i++){
    	Mat img = imread(, CV_LOAD_IMAGE_COLOR);
    	dense.detect(img, keypoints);

    Keypoints detected using SIFT detect function – more than 500 keypoints.
  3. Describe keypoints using SIFT
    • SIFT descriptor produces description for each keypoint separately
      sift.compute(img, keypoints, descriptor);
  4. Cluster descriptors using k-means
    • Around 10 million of keypoints are chosen to cluster
    • Clustering results in 1000 clusters represented by centroids (visual words)
    Mat vocabulary = bowTrainer.cluster();
  5. Calculate BoW descriptors
    • Each keypoint from an input image is then evaluated for response from 1000 visual words or represents
    • Histogram of reponse is normalized for each image
    Ptr<DescriptorMatcher> matcher(new FlannBasedMatcher);
    Ptr<FeatureDetector> detector(new SiftFeatureDetector());
    BOWImgDescriptorExtractor bowExtractor(detector, matcher);
    bowExtractor.compute(img, keypoints, descriptor);

    BoW descriptor of 200 ats visualized over 1000 clustered visual words vocabulary
  6. Train SVM using BoW descriptors
    • Calculated histograms or BoW descriptors are trained using linear SVM
    • Suitable rate between positive and negative subset needs to be chosen
  7. Test images using SVM
    • Response of test images is used to evaluate algorithm
    • Our model shows accuracy of (62% of positive set and 58% of negative set)
    • Better results are achievable using larger datasets, but both time and computational power are necessary
Posted on

Face recognition using depth data from Kinect sensor

Lukas Cader

We will segment face from color camera with use of depth data and run recognition on it using OpenCV functions: EigenFaces, FisherFaces and LBPH.

Complete process is as follows:

  1. First we need to obtain RGB and depth stream from Kinect sensor and copy it to byte array in order to be usable for OpenCV
    IColorFrame* colorFrame = nullptr;
    IDepthFrame* depthFrame = nullptr;
    ushort* _depthData = new ushort[depthWidth * depthHeight];
    byte* _colorData = new byte[colorWidth * colorHeight * BYTES_PER_PIXEL];
    colorFrame->CopyConvertedFrameDataToArray(colorWidth * colorHeight * BYTES_PER_PIXEL, _colorData, ColorImageFormat_Bgra);
    depthFrame->CopyFrameDataToArray(depthWidth * depthHeight, _depthData);
  2. Because color and depth camera have different resolutions we need to map coordinates from color image to depth image. (We will use Kinect’s Coordinate Mapper)
    m_pCoordinateMapper->MapDepthFrameToColorSpace(depthWidth * depthHeight,(UINT16*) _depthData, depthWidth * depthHeight, _colorPoints);


  3. Because we are going to segment face from depth data we need to process them as is shown in the next steps:
    1. Unmodified depth data shown in 2D
    2. Normalization of values to 0-255 range
      – Better representation

      cv::Mat img0 = cv::Mat::zeros(depthHeight, depthWidth, CV_8UC1);
      double scale = 255.0 / (maxDist - minDist);
      depthMap.convertTo(img0, CV_8UC1, scale);


    3. Removal of the nearest points and bad artifacts
      – the points for which Kinect can’t determine depth value are by default set to 0 – we will set them to 255

      if (val < MinDepth)
      {[image.step[0] * i + image.step[1] * j + 0] = 255;


    4. Next we want to segment person, we apply depth threshold to filter only nearest points and the ones within certain distance from them and apply median blur to image to remove unwanted artifacts such as isolated points and make edges of segmented person less sharp.

      if (val > (__dpMax+DepthThreshold))
      {[image.step[0] * i + image.step[1] * j + 0] = 255;
  4. Now when we have processed depth data we need to segment face. We find the highest non-white point in depth map and mark it as the top of head. Next we make square segmentation upon depth mask with dynamic size (distance from user to sensor is taken into account) from top of the head and in this segmented part we find the leftmost and rightmost point and made second segmentation. The 2 new points and point representing top of the head will now be the border points of the new segmented region. (Sometimes because of dynamic size of square we have also parts of shoulders in our first segmentation, in order to mitigate this negative effect we are looking for leftmost and rightmost point only in the upper half of the image)
    if (val == 255 || i > (highPointX + headLength) || (j < (highPointY - headLength / 2) && setFlag) || (j > (highPointY + headLength / 2) && setFlag))
    //We get here if point is not in face segmentation region
    else if (!setFlag) 
    //We get here if we find the first non-white (highest) point in image and set segmentation region
    highPointX = i;
    highPointY = j;
    headLength = 185 - 1.2*(val); //size of segmentation region
    setFlag = true;
    //We get here if point is in face segmentation region and we want to find the leftmost and the rightmost point
    if (j < __leftMost && i < (__faceX + headLength/2)) __leftMost = j;
    if (j > __rightMost && i < (__faceX + headLength/2)) __rightMost = j;
  5. When face is segmented we can use one of OpenCV functions for face recognition and show result to the user.
Posted on

Segmentation of Brain Tumors from MRI using Adaptive Thresholding and Graph Cut Algorithm

Development of methods for automatic brain tumor segmentation remains one of the most challenging tasks in processing of medical data. Exact segmentation could improve the diagnostics, as for example the time evaluation of the tumor volume. However, manual segmentation in magnetic resonance data is a time-consuming task. We present a method of automatic tumor segmentation in magnetic resonance images which consists of several steps. In the first step high intense cranium is removed. In the next step parameters of the image are derived using the method “Mixture of Gaussians”. These parameters control the morphological reconstruction (proposed by Luc Vincent 1993). The morphological reconstruction produces binary mask which is used in the last step of the segmentation: graph cut segmentation. First results of this method are presented in this paper.

Source code