MODELLING OF HUMAN VISUAL ATTENTION (Ing. Patrik Polatsek – Dissertation thesis)



Degree Course: Applied Informatics
Author: Ing. Patrik Polatsek
Supervisor: doc. Ing. Vanda Benešová, PhD.
May 2019
In recent decades, visual attention modelling has become a prominent research area. To simulate human attention, a computational model has to incorporate various attention mechanisms.
In this thesis we explored how low- and mid-level features such as color, motion, depth and shape influence visual attention in our own eye-tracking experiments. To measure these effects, we utilized various state-of-the-art as well as novel computational models which estimate saliency of a specific feature.
To better understand the process of selective attention in everyday actions, we conducted several experiments in real environments recorded from the first-person perspective. Our results showed that egocentric attention is highly individual and differs from 2D image viewing conditions, partially due to binocular cues that enhance the viewer’s perception. We therefore suggest employing specialized saliency models for egocentric vision. Finally, we found that high-level factors, such as an individual’s emotions and task-based analysis of visualizations, influence human gaze behavior too.

pdf Autoreferat

pdf DissertationThesis-Polatsek

Segmentation of anatomical organs in medical data – Master’s Thesis : Bc. Martin Tamajka

Download: Master’s Thesis – Bc. Martin Tamajka: Segmentation of anatomical organs in medical data



2016, May
Medical image segmentation is an important part of medical practice. For radiologists in particular, it simplifies everyday tasks and allows them to use their time more effectively, because in most cases they have only a limited amount of time to examine a patient’s data. Computer-aided diagnosis is also a powerful instrument for eliminating possible human error.
In this work, we propose a novel approach to human organs segmentation. We primarily concentrate on segmentation of human brain from MR volume. Our method is based on oversegmenting 3D volume to supervoxels using SLIC algorithm. Individual supervoxels are described by features based on intensity distribution of contained voxels and on position within the brain. Supervoxels are classified by neural networks which are trained to classify supervoxels to individual tissues. In order to give our method additional precision, we use information about the shape and inner structure of the organ. In general we propose a 6-step segmentation method based on classification.
We compared our results with those of state-of-the-art methods and we can conclude that the results are clearly comparable.
Apart from the global focus of this thesis, our goal is to apply engineering skills and best practices to implement proposed method and necessary tools in such a way that they can be easily extended and maintained in the future.

Automatic brain segmentation method based on supervoxels

Martin Tamajka, Wanda Benesova


In this work, we present a fully automatic brain segmentation method based on supervoxels (ABSOS). We propose novel features for classification that are based on the distance and angle, in different planes, between a supervoxel and the brain center. These novel features are combined with other prominent features. The presented method is based on machine learning and also incorporates skull stripping (cranium removal) in the preprocessing step. A neural network, a multilayer perceptron (MLP), was trained for the classification process. In this paper we also present a thorough analysis, which supports the choice of rather small supervoxels, preferring homogeneity over compactness, and the value of the intensity threshold parameter used in preprocessing for skull stripping. In order to decrease computational complexity and increase segmentation performance, we incorporate prior knowledge of typical background intensities acquired in an analysis of subjects.

Published in:

2016 International Conference on Systems, Signals and Image Processing (IWSSIP)

Date of Conference:

23-25 May 2016


Grant VEGA 1/0625/14

Scientific Grant Agency of the Ministry of Education, Science, Research and Sport of the Slovak Republic and the Slovak Academy of Sciences, Grant VEGA 1/0625/14.

Visual object class recognition in video sequences using a linkage of information derived by a semantic local segmentation and a global segmentation of visual saliency.

Visual object class recognition is one of the biggest challenges of current research in the field of computer vision. This project aims to explore new methods of recognizing classes of objects in video sequences. The research will focus on new methods of semantic segmentation at the local level and segmentation of visual saliency at the global level. An integral part of the project will be research into intelligent methods of transferring the information obtained by the local and global approaches, using the principle of cooperative agents.


Egocentric RGB-D dataset (eye-tracker + Kinect v2)

pdf: Visual attention in egocentric field-of-view using RGB-D data

[1] V. Olesova, W. Benesova, and P. Polatsek, “Visual Attention in Egocentric Field-of-view using RGB-D Data,” in Proc. SPIE 10341, Ninth International Conference on Machine Vision (ICMV 2016), 2016.

You are free to use this dataset for any purpose. If you use this dataset, please cite the paper above.

Download:    fiit-dataset (RGB-D Gaze videos 2GB)

Camera tracking

Martin Volovar

Camera tracking is used in visual effects to synchronize movement and rotation between the real and the virtual camera. This article deals with obtaining rotation and translation from two images and with reconstructing the scene.

  1. First we need to find keypoints on both images:
    SurfFeatureDetector detector(400);
    vector<KeyPoint> keypoints1, keypoints2, findKeypoints;
    detector.detect(img1, keypoints1);
    detector.detect(img2, keypoints2);
    // Compute a SURF descriptor for every detected keypoint
    SurfDescriptorExtractor extractor;
    Mat descriptors1, descriptors2;
    extractor.compute(img1, keypoints1, descriptors1);
    extractor.compute(img2, keypoints2, descriptors2);
  2. Then we need to find matches between keypoints from the first and the second image:
    cv::BFMatcher matcher(cv::NORM_L2, true);
    vector<DMatch> matches;
    matcher.match(descriptors1, descriptors2, matches);


  3. Some keypoint matches are wrong, so we filter them:
    x = ABS(x);
    y = ABS(y);
    if (x < x_threshold && y < y_threshold)
    	status[i] = 1;
    else
    	status[i] = 0;


  4. After that we can estimate the fundamental matrix (FM), which relates the two views:
    Mat FM = findFundamentalMat(keypointsPosition1, keypointsPosition2, FM_RANSAC, 1., 0.99, status);

    From it we can obtain the essential matrix using the camera internal parameters (K matrix):

    Mat E = K.t() * FM * K;
  5. Using singular value decomposition we can extract camera rotation and translation:
    SVD svd(E, SVD::MODIFY_A);
    Mat svd_u = svd.u;
    Mat svd_vt = svd.vt;
    Mat svd_w = svd.w;
    Matx33d W(0, -1, 0,
              1, 0, 0,
              0, 0, 1);
    Mat R = svd_u * Mat(W) * svd_vt;
    Mat_<double> t = svd_u.col(2);

    The rotation has two possible solutions (R = U*W*VT or R = U*WT*VT), so we check whether the camera faces the right direction:

    double *R_D = (double*) R.data;
    if (R_D[8] < 0.0)
    	R = svd_u * Mat(W.t()) * svd_vt;

    To construct rays we need inverse camera matrix (R|t):

    Mat Cam(4, 4, CV_64F, Cam_D);
    Mat Cam_i = Cam.inv();

    Each of the two lines passes through its camera center:

    Line l0, l1;
    l0.pos.x = 0.0;
    l0.pos.y = 0.0;
    l0.pos.z = 0.0;
    l1.pos.x = Cam_iD[3];
    l1.pos.y = Cam_iD[7];
    l1.pos.z = Cam_iD[11];

    Other point is calculated via projection plane.
    Then we can construct rays and find intersection from each keypoint:

    getNearestPointBetweenTwoLines(pointCloud[j], l0, l1, k);


Recovered 3D scene.
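
The article does not show the body of getNearestPointBetweenTwoLines. A minimal sketch of one common approach follows: because two rays through matched keypoints rarely intersect exactly, we take the midpoint of the shortest segment connecting them. The types Vec3 and Ray and the function name nearestPointBetweenTwoLines are assumptions made for this example, not the original implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal types; the article's own Line type is not shown in full.
struct Vec3 { double x, y, z; };
struct Ray  { Vec3 pos; Vec3 dir; };  // a point on the line and its direction

static Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 mul(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Writes the midpoint of the shortest segment between the two lines to `out`.
// Returns false when the lines are (nearly) parallel and no unique point exists.
bool nearestPointBetweenTwoLines(const Ray& l0, const Ray& l1, Vec3& out)
{
    const Vec3 w = sub(l0.pos, l1.pos);
    const double a = dot(l0.dir, l0.dir);
    const double b = dot(l0.dir, l1.dir);
    const double c = dot(l1.dir, l1.dir);
    const double d = dot(l0.dir, w);
    const double e = dot(l1.dir, w);
    const double denom = a * c - b * b;
    if (std::fabs(denom) < 1e-12)
        return false;

    const double s = (b * e - c * d) / denom;  // parameter along l0
    const double t = (a * e - b * d) / denom;  // parameter along l1
    const Vec3 p0 = add(l0.pos, mul(l0.dir, s));
    const Vec3 p1 = add(l1.pos, mul(l1.dir, t));
    out = mul(add(p0, p1), 0.5);
    return true;
}
```

For the x-axis and a vertical line through (1, 1, 0), the closest points are (1, 0, 0) and (1, 1, 0), so the function returns (1, 0.5, 0). The distance between p0 and p1 could additionally serve as a quality measure for rejecting bad matches.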

Face recognition in video using Kinect v2 sensor

Michal Viskup

We detect and recognize the human faces in the video stream. Each face in the video is either recognized and the label is drawn next to their facial rectangle or it is labelled as unknown.

The video stream is obtained using the Kinect v2 sensor. This sensor offers several data streams; we mention only the two relevant to our work:

  • RGB stream (resolution: 1920×1080, depth: 8bits)
  • Depth stream (resolution: 512×424, depth: 16bits)

The RGB stream is self-explanatory. The depth stream consists of values that denote the distance of each pixel from the sensor. The reliable range starts at 0.5 m and extends to 8 m. However, past the 4.5 m mark, the reliability of the data is questionable. Kinect offers methods that map the pixels from the RGB stream to the depth stream and vice versa.

We utilize the facial data from RGB stream for the recognition. The depth data is used to enhance the face segmentation through the nose-tip detection.

First of all, the face recognizer has to be trained. The training is done only once. The state of the trained recognizer can be persisted in xml format and reloaded in the future without the need for repeated training. OpenCV offers implementation of three face recognition methods:

  • Eigenfaces
  • Fisherfaces
  • Local Binary Pattern Histograms

We used the Eigenfaces and Fisherfaces methods. The code for creating the face recognizer follows:

void initRecognizer()
{
	Ptr<FaceRecognizer> fr;
	fr = createEigenFaceRecognizer();
}

It is as simple as that. A face recognizer that uses the Fisherfaces method can be created analogously. The Ptr interface ensures correct memory management.

All faces presented to such a recognizer would be labelled as unknown, because the recognizer is not trained yet. The training requires two vectors:

  • The vector of facial images in the OpenCV Mat format
  • The vector of integer values containing the identifiers for the facial images

These vectors can be created manually. This, however, is not practical for large training sets. We thus provide an automated way to create these vectors. Data for each subject should be placed in a separate directory. The directories containing the subject data should be placed within a single directory (referred to as the root directory). The algorithm is given access to the root directory. It processes all the subject directories and creates both the vector images and the vector labels. We find the Windows API for accessing the file system inconvenient. UNIX-based systems, on the other hand, offer a convenient C API through the dirent interface. The Visual Studio compiler lacks the dirent interface. We thus used an external library to gain access to this interface. The following code requires that library to run.

First we obtain the list of subject names. These stand for the directory names within the root directory. The subject names are stored in the vector of string values. It can be initialized manually or using the text file.

Then, for each subject, the path to their directory is created:

std::ostringstream fullSubjectPath;
fullSubjectPath << ROOT_DIRECTORY_PATH;
fullSubjectPath << "\\";
fullSubjectPath << subjectName;
fullSubjectPath << "\\";

We then obtain the list of file names that reside within the subject directory:

std::vector<std::string> DataProvider::getFileNamesForDirectory(const std::string subjectDirectoryPath)
{
	std::vector<std::string> fileNames;
	DIR *dir;
	struct dirent *ent;
	if ((dir = opendir(subjectDirectoryPath.c_str())) != NULL) {
		while ((ent = readdir(dir)) != NULL) {
			// Skip the "." and ".." entries
			if ((strcmp(ent->d_name, ".") == 0) || (strcmp(ent->d_name, "..") == 0))
				continue;
			fileNames.push_back(ent->d_name);
		}
		closedir(dir);
	}
	else {
		std::cout << "Cannot open the directory: ";
		std::cout << subjectDirectoryPath;
	}
	return fileNames;
}

Then, the images are loaded and stored in vector:

std::vector<std::string> subjectFileNames = getFileNamesForDirectory(fullSubjectPath.str());

std::vector<cv::Mat> subjectImages;
for (std::string fileName : subjectFileNames)
{
	// Build the full path of the image and load it
	std::ostringstream fullFileNameBuilder;
	fullFileNameBuilder << fullSubjectPath.str();
	fullFileNameBuilder << fileName;
	cv::Mat subjectImage = cv::imread(fullFileNameBuilder.str());
	subjectImages.push_back(subjectImage);
}
return subjectImages;

In the end, label vector is created:

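for (int i = 0; i < subjectImages.size(); i++){
	// One label per image; subjectLabel is an assumed name for the integer identifier of the current subject
	labels.push_back(subjectLabel);
}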
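
To illustrate how the image and label vectors line up, here is a minimal sketch that pairs every image path with the integer identifier of its subject. It works on file lists that are already in memory, so it does not depend on dirent or OpenCV. The names TrainingList and buildTrainingList are hypothetical, and using the subject's index as its label is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Parallel vectors: labels[i] is the subject identifier of imagePaths[i].
struct TrainingList {
    std::vector<std::string> imagePaths;
    std::vector<int> labels;
};

// filesPerSubject[k] holds the file names found in the directory of subjectNames[k].
TrainingList buildTrainingList(const std::string& rootDirectory,
                               const std::vector<std::string>& subjectNames,
                               const std::vector<std::vector<std::string>>& filesPerSubject)
{
    assert(subjectNames.size() == filesPerSubject.size());
    TrainingList list;
    for (std::size_t subject = 0; subject < subjectNames.size(); ++subject) {
        for (const std::string& fileName : filesPerSubject[subject]) {
            // Same path layout as above: root\subject\file
            list.imagePaths.push_back(rootDirectory + "\\" + subjectNames[subject] + "\\" + fileName);
            // Assumption: the index of the subject serves as its label
            list.labels.push_back(static_cast<int>(subject));
        }
    }
    return list;
}
```

Loading each path with cv::imread would then yield the image vector, with the labels vector already in matching order.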

With the images and labels vectors ready, the training is a one-liner:

fr->train(images, labels);


The recognizer is now trained. What we need next is a video and a depth stream to recognize from.
The Kinect sensor is initialized by the following code:

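void initKinect()
{
	HRESULT hr;

	hr = GetDefaultKinectSensor(&kinectSensor);
	if (FAILED(hr))
	{
		return;
	}

	if (kinectSensor)
	{
		// Initialize the Kinect and get the readers
		IColorFrameSource* colorFrameSource = NULL;
		IDepthFrameSource* depthFrameSource = NULL;

		hr = kinectSensor->Open();

		if (SUCCEEDED(hr))
			hr = kinectSensor->get_ColorFrameSource(&colorFrameSource);

		if (SUCCEEDED(hr))
			hr = colorFrameSource->OpenReader(&colorFrameReader);

		if (SUCCEEDED(hr))
			hr = kinectSensor->get_DepthFrameSource(&depthFrameSource);

		if (SUCCEEDED(hr))
			hr = depthFrameSource->OpenReader(&depthFrameReader);

		// The frame sources are no longer needed once the readers are open
		if (colorFrameSource)
			colorFrameSource->Release();
		if (depthFrameSource)
			depthFrameSource->Release();
	}

	if (!kinectSensor || FAILED(hr))
	{
		// Report the failure (the message text is an assumption)
		std::cout << "Kinect sensor is not ready" << std::endl;
	}
}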

The following function obtains the next color frame from the Kinect sensor:

Mat getNextColorFrame()
{
	IColorFrame* nextColorFrame = NULL;
	IFrameDescription* colorFrameDescription = NULL;
	ColorImageFormat colorImageFormat = ColorImageFormat_None;

	HRESULT errorCode = colorFrameReader->AcquireLatestFrame(&nextColorFrame);
	if (!SUCCEEDED(errorCode))
	{
		// No new frame is available
		Mat empty;
		return empty;
	}

	if (SUCCEEDED(errorCode))
		errorCode = nextColorFrame->get_FrameDescription(&colorFrameDescription);
	int matrixWidth = 0;
	if (SUCCEEDED(errorCode))
		errorCode = colorFrameDescription->get_Width(&matrixWidth);
	int matrixHeight = 0;
	if (SUCCEEDED(errorCode))
		errorCode = colorFrameDescription->get_Height(&matrixHeight);
	if (SUCCEEDED(errorCode))
		errorCode = nextColorFrame->get_RawColorImageFormat(&colorImageFormat);
	UINT bufferSize = 0;
	BYTE *buffer = NULL;
	if (SUCCEEDED(errorCode))
	{
		// 4 bytes per pixel (BGRA)
		bufferSize = matrixWidth * matrixHeight * 4;
		buffer = new BYTE[bufferSize];
		errorCode = nextColorFrame->CopyConvertedFrameDataToArray(bufferSize, buffer, ColorImageFormat_Bgra);
	}
	Mat frameKinect;
	if (SUCCEEDED(errorCode))
	{
		// Mat does not take ownership of the buffer, so copy the pixels before freeing it
		frameKinect = Mat(matrixHeight, matrixWidth, CV_8UC4, buffer).clone();
	}
	delete[] buffer;
	if (colorFrameDescription)
		colorFrameDescription->Release();
	if (nextColorFrame)
		nextColorFrame->Release();

	return frameKinect;
}

An analogous function obtains the next depth frame. The only change is the type and size of the buffer, as the depth frame has a single channel with 16 bits per pixel.
Finally, we are all set to do the recognition. The face recognition task consists of the following steps:

  1. Detect the faces in video frame
  2. Crop the faces and process them
  3. Predict the identity

For face detection, we use OpenCV CascadeClassifier. OpenCV provides the extracted features for the classifier for both the frontal and the profile faces. However, in video both the slight and major variations from these positions are present. We thus increase the tolerance for the false positives to prevent the cases when the track of the face is lost between the frames.
The classifier is simply initialized by loading the set of features using its load function.

CascadeClassifier cascadeClassifier;

The face detection is done as follows:

vector<Mat> getFaces(const Mat frame, vector<Rect_<int>> &rectangles)
{
	Mat grayFrame;
	cvtColor(frame, grayFrame, CV_BGR2GRAY);

	cascadeClassifier.detectMultiScale(grayFrame, rectangles, 1.1, 5);

	vector<Mat> faces;
	for (Rect_<int> face : rectangles){
		// Crop the detected face and resize it to the size expected by the recognizer
		Mat detectedFace = grayFrame(face);
		Mat faceResized;
		resize(detectedFace, faceResized, Size(240, 240), 1.0, 1.0, INTER_CUBIC);
		faces.push_back(faceResized);
	}
	return faces;
}

With faces detected, we are set to proceed to recognition. The recognition process is as follows:

Mat colorFrame = getNextColorFrame();
vector<Rect_<int>> rectangles;
vector<Mat> faces = getFaces(colorFrameResized, rectangles);
// Predict the identity of every detected face and draw it next to the face rectangle
for (int i = 0; i < faces.size(); i++)
{
	int label = -1;
	label = fr->predict(faces[i]);
	string box_text = format("Prediction = %d", label);
	putText(originalFrame, box_text, Point(rectangles[i].tl().x, rectangles[i].tl().y), FONT_HERSHEY_PLAIN, 1.0, CV_RGB(0, 255, 0), 2.0);
}
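
The code above draws the raw numeric label. A minimal sketch of turning the label into the text shown next to the face, including the "unknown" case, could look as follows. The helper labelToText is hypothetical; it assumes that labels are indices into the vector of subject names and that a negative label (such as -1) means no subject was recognized.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Maps a predicted label to the text drawn next to the face rectangle.
// Assumption: labels are indices into subjectNames; anything else means "unknown".
std::string labelToText(int label, const std::vector<std::string>& subjectNames)
{
    if (label < 0 || label >= static_cast<int>(subjectNames.size()))
        return "unknown";
    return subjectNames[label];
}
```

The returned string could be passed to putText in place of the formatted number.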

Nose tip detection is done as follows:

unsigned short minReliableDistance;
unsigned short maxReliableDistance;
Mat depthFrame = getNextDepthFrame(&minReliableDistance, &maxReliableDistance);
double scale = 255.0 / (maxReliableDistance - minReliableDistance);
depthFrame.convertTo(depthFrame, CV_16UC1, scale);

// detect nose tip
// only search for the nose tip in the head area
Mat deptHeadRegion = depthFrame(rectangles[i]);
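// The nose is probably the local minimum (the point closest to the sensor) in the head area
double min, max;
Point minLoc, maxLoc;
minMaxLoc(deptHeadRegion, &min, &max, &minLoc, &maxLoc);
// Convert the location from head-region coordinates to frame coordinates
minLoc.x += rectangles[i].x;
minLoc.y += rectangles[i].y;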

// Draw the circle at proposed nose position.
circle(depthFrame, minLoc, 5, 255, -1);
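
One caveat of a plain minimum search is that pixels without a depth measurement may dominate the result. The following sketch searches a rectangle of a raw 16-bit depth image for the closest valid pixel and skips invalid ones. It assumes that a depth value of 0 marks a pixel without a measurement; the names PixelPos and nearestValidPixel are hypothetical, and the code works on a plain row-major buffer rather than on a cv::Mat.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct PixelPos { int x; int y; bool found; };

// depth: row-major 16-bit depth image of size width x height.
// (rx, ry, rw, rh): the head rectangle in image coordinates.
// Assumption: a depth of 0 means "no measurement" and is skipped.
PixelPos nearestValidPixel(const std::vector<std::uint16_t>& depth, int width, int height,
                           int rx, int ry, int rw, int rh)
{
    assert(static_cast<int>(depth.size()) == width * height);
    PixelPos best{0, 0, false};
    std::uint16_t bestDepth = 0;
    // Clip the rectangle to the image so that no access is out of bounds
    for (int y = std::max(ry, 0); y < std::min(ry + rh, height); ++y) {
        for (int x = std::max(rx, 0); x < std::min(rx + rw, width); ++x) {
            const std::uint16_t d = depth[y * width + x];
            if (d == 0)
                continue;
            if (!best.found || d < bestDepth) {
                best = {x, y, true};
                bestDepth = d;
            }
        }
    }
    return best;
}
```

The returned position is already expressed in frame coordinates, so no offset has to be added afterwards.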

To conclude, we provide a simple implementation that allows the detection and recognition of human faces within a video. The room for improvement is that rather than allowing more false positives in detection phase, the detected nose tip can be used for face tracking.

Medical image segmentation

Martin Tamajka

In this project, our goal was to apply image segmentation techniques to dense volume of standard medical data.


Our method is based on oversegmentation to supervoxels (similar to superpixels, but in a 3D volume). Such oversegmentation dramatically decreases processing time and has many other advantages over working directly with voxels. Oversegmentation is done using the SLIC algorithm. The implementation that we use was created by its authors and does not depend on any other library. This comes at the cost of having to transform images from the OpenCV format to C++ arrays. SLIC allows us to trade off supervoxel compactness (or regularity of shape) against intensity homogeneity. In our work, we decided to prefer homogeneity over regularity, because different tissues in anatomical organs have their typical intensities.

Oversegmented MRI slice: it can be seen that supervoxels adhere to boundaries.

Merging supervoxels


After we oversegmented the images, we created an object representation of the volume. Our object representation (SLIC3D) has the following attributes:

vector<Supervoxel*>				m_supervoxels;
std::unordered_map<int, Supervoxel*>	m_supervoxelsMap;
int	m_height;
int	m_width;
int	m_depth;

The most important attribute is the vector of Supervoxel pointers. Supervoxel is our basic class: it provides important information about the contained voxels, can generate features that can be used in the classification process and (very importantly) knows its neighbouring supervoxels. Currently, Supervoxel can generate 3 kinds of features:

float Supervoxel::AverageIntensity()
{
	return m_centroid->intensity;
}

float Supervoxel::AverageQuantileIntensity(float quantile)
{
	assert(quantile >= 0 && quantile <= 1);

	float intens = 0;

	int terminationIndex = quantile * m_points.size();
	// Avoid division by zero when the quantile selects no points
	if (terminationIndex == 0)
		return 0;
	for (int i = 0; i < terminationIndex; i++)
		intens += m_points[i]->intensity;

	return intens / terminationIndex;
}

float Supervoxel::MedianIntensity()
{
	return m_points[m_points.size() / 2]->intensity;
}
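
The quantile and median features only make sense if the points are ordered by intensity, which the snippets above do not show. The following self-contained sketch therefore sorts a copy of the intensities first. That ordering is an assumption about the original class, and the free functions averageQuantileIntensity and medianIntensity are hypothetical stand-ins for the member functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mean of the darkest `quantile` share of the intensities (0 if that share is empty).
float averageQuantileIntensity(std::vector<float> intensities, float quantile)
{
    assert(quantile >= 0.0f && quantile <= 1.0f);
    std::sort(intensities.begin(), intensities.end());
    const std::size_t count = static_cast<std::size_t>(quantile * intensities.size());
    if (count == 0)
        return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += intensities[i];
    return sum / count;
}

// Element in the middle of the sorted intensities (the upper median for even sizes).
float medianIntensity(std::vector<float> intensities)
{
    assert(!intensities.empty());
    std::sort(intensities.begin(), intensities.end());
    return intensities[intensities.size() / 2];
}
```

For the intensities 40, 10, 30, 20 and a quantile of 0.5, the two darkest values 10 and 20 are averaged to 15.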

The SLIC3D class (the one containing the supervoxels) has a method where “all the magic happens”: MergeSimilarSupervoxels. As its name states, the method merges supervoxels. It performs a given number of iterations. In each iteration, it takes a random supervoxel and compares it with its neighbours; if a neighbour and the examined supervoxel have a similar average intensity, the neighbour is merged into the examined one. The code can be seen right below.

bool SLIC3D::MergeSimilarSupervoxels()
	vector<Supervoxel*> supervoxelsToBeErased;
	for (int i = 0; i < 10000; i++)
		cout << i << " " << m_supervoxels.size() << endl;
		std::sort(m_supervoxels.begin(), m_supervoxels.end(), helper_sortFunctionByAvgIntensity);
		Supervoxel* brightestSupervoxel = m_supervoxels[rand() % (m_supervoxels.size() - 1)];
		std::unordered_map<int, Supervoxel*> nb = *(brightestSupervoxel->GetNeighbours());

		int numberOfIterations = 0;
		vector<int> labelsToBeErased;
		for (auto it = nb.begin(); it != nb.end(); it++)
			if (std::min(brightestSupervoxel->AverageIntensity(), it->second->AverageIntensity()) / std::max(brightestSupervoxel->AverageIntensity(), it->second->AverageIntensity()) > 0.95)
				helper_mergeSupervoxels(brightestSupervoxel, it->second);
				//cout << "nope" << endl;

		for (int i = 0; i < labelsToBeErased.size(); i++)
			Supervoxel* toBeRemoved = m_supervoxelsMap[labelsToBeErased[i]];

			if (NULL == toBeRemoved)

			std::unordered_map<int, Supervoxel*> nbb = *(toBeRemoved->GetNeighbours());

			for (auto it = nbb.begin(); it != nbb.end(); it++)
				catch (Exception e)
					cout << "exc: " << e.msg << endl;
		for (int i = 0; i < labelsToBeErased.size(); i++)
			Supervoxel* toBeRemoved = m_supervoxelsMap[labelsToBeErased[i]];
			//delete toBeRemoved; //COMMENT


		for (auto it = m_supervoxelsMap.begin(); it != m_supervoxelsMap.end(); ++it)

	for (int i = 0; i < supervoxelsToBeErased.size(); i++)
		;// delete supervoxelsToBeErased[i];	//COMMENT - to be considered if delete

	return true;

The results of merging in this form depend heavily on the chosen similarity level. In the picture below, the left image shows the result for similarity level 0.95 and the right image the result for similarity level 0.65.
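
The similarity test used for merging can be isolated into a small function, which makes the effect of the similarity level easier to see. This is a minimal sketch of the ratio criterion from the code above (the smaller average intensity divided by the larger one). The function name similarIntensity is hypothetical, and treating two zero intensities as similar is an assumption added to avoid a division by zero.

```cpp
#include <algorithm>
#include <cassert>

// True when the ratio of the smaller to the larger intensity exceeds the similarity level.
bool similarIntensity(float a, float b, float similarityLevel)
{
    assert(a >= 0.0f && b >= 0.0f);
    const float larger = std::max(a, b);
    if (larger == 0.0f)
        return true;  // assumption: two completely dark supervoxels count as similar
    return std::min(a, b) / larger > similarityLevel;
}
```

With a level of 0.95, average intensities 100 and 96 are merged but 100 and 70 are not; with a level of 0.65, the pair 100 and 70 is merged as well, which is why the lower level produces much larger regions.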


We also tried to train an SVM to classify brain and non-brain structures using just these features. It classified 4 out of 5 cases correctly. We will continue with classification later.

Lane markers detection

Michal Polko

In this project, we detect lane markers in videos taken with a dashboard camera.


  1. Convert a video frame to grayscale, boost contrast and apply dilation operator to highlight lane markers in the frame.
    Highlighted lane markers.
    cvtColor(frame, frame_bw, CV_RGB2GRAY);
    frame_bw.convertTo(frame_bw, CV_32F, 1.0 / 255.0);
    pow(frame_bw, 3.0, frame_bw);
    frame_bw *= 3.0;
    frame_bw.convertTo(frame_bw, CV_8U, 255.0);
    dilate(frame_bw, frame_bw, getStructuringElement(CV_SHAPE_RECT, Size(3, 3)));
  2. Apply the Canny edge detection to find edges.
    Application of the Canny edge detection.
    int cny_threshold = 100;
    Canny(frame_bw, frame_edges, cny_threshold, cny_threshold * 3, 3);
  3. Apply the Hough transform to find line segments.
    vector<Vec4i> hg_lines;
    HoughLinesP(frame_edges, hg_lines, 1, CV_PI / 180, 15, 15, 2);
  4. Since the Hough transform returns all line segments, not only those around lane markers, it is necessary to filter the results.
    1. We create two lines that describe boundaries of the current lane (hypothesis).
      1. We place two converging lines in the frame.
      2. Using brute-force search, we try to find position where they capture as many line segments as possible.
      3. Since road in the frame can have more than one lane, we try to find result as narrow as possible.
    2. We select line segments that are captured by the created hypothesis, mark them as lane markers and draw them.
    3. Each frame, we take the detected lane markers from the previous frame and perform linear regression to adjust the hypothesis (continuous adjustment).
    4. If we cannot find lane markers in more than 5 successive frames (due to failure of continuous adjustment, lane change, intersection, …), we create a new hypothesis.
    5. If the hypothesis is too wide (almost full width of the frame), we create a new one, because arrangement of road lanes might have changed (e.g. additional lane on freeway).
  5. To distinguish between solid and dashed lane markers, we calculate coverage of the hypothesis by line segments. If the coverage is less than 60%, it is a dashed line; if more, it is a solid line.

    Filtered result of the Hough transform + detection of solid/dashed lines.
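
The coverage test from the last step can be sketched as follows. We assume that every detected line segment has already been projected onto the hypothesis line, so that it is described by a start and an end position along that line. Overlapping segments are counted only once. The names coverage, classifyMarker and MarkerType are hypothetical; the 60% threshold is the one stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

enum class MarkerType { Dashed, Solid };

// intervals: (start, end) positions of the segments along the hypothesis line.
// length: total length of the hypothesis line, in the same units.
// Returns the covered share of the line in [0, 1]; overlaps are counted once.
double coverage(std::vector<std::pair<double, double>> intervals, double length)
{
    assert(length > 0.0);
    std::sort(intervals.begin(), intervals.end());
    double covered = 0.0;
    double reach = 0.0;  // everything before this position is already counted
    for (const auto& interval : intervals) {
        const double start = std::max(interval.first, reach);
        const double end = std::min(interval.second, length);
        if (end > start) {
            covered += end - start;
            reach = end;
        }
    }
    return covered / length;
}

// Less than 60% coverage means a dashed marker, otherwise a solid one.
MarkerType classifyMarker(const std::vector<std::pair<double, double>>& intervals, double length)
{
    return coverage(intervals, length) < 0.6 ? MarkerType::Dashed : MarkerType::Solid;
}
```

For a hypothesis of length 100 with segments (0, 10), (5, 20) and (40, 50), the covered length is 30, so the marker is classified as dashed.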