Data mining 2 (2020)

In this second data mining project we worked on a dataset of measurements taken inside a room, such as humidity, temperature and light, with the aim of predicting whether the room is occupied.

Classifiers and imbalanced datasets

We applied classification algorithms already covered in Data mining 1, such as KNN and Decision Tree, along with new ones such as Logistic Regression and Naive Bayes.

We ran these algorithms both on a balanced dataset and on a version whose target classes were imbalanced, to compare the results obtained. We then applied resampling techniques to the imbalanced dataset in order to balance it again.
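As an illustration of resampling, here is a minimal sketch of random oversampling, one common way to rebalance classes. The report does not say here which resampling techniques were used, so the choice of technique, the function name `random_oversample` and the toy readings are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Made-up (temperature, humidity) readings; 1 = occupied, 0 = empty.
X = [(21.0, 27.1), (21.2, 27.3), (20.9, 26.8), (21.1, 27.0), (23.4, 31.2)]
y = [0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
class_counts = Counter(y_bal)
print(class_counts)
```

Oversampling should be applied to the training split only, otherwise duplicated rows can leak into the test set and inflate the scores.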

We also applied PCA, a dimensionality reduction technique, then ran the algorithms mentioned above on the reduced dataset and compared the results with the base case.

Advanced classifiers

In this second phase we applied more advanced classifiers, such as SVM (Support Vector Machine) and Multilayer Perceptron, as well as deep neural networks (feedforward, CNN, recurrent).

We also used ensemble classifiers, such as Random Forest, AdaBoost and Bagging, and evaluated the performance improvement over simpler classifiers.

Time series analysis, forecasting and classification

In this third phase, we extracted time series from the dataset (the attributes mentioned at the beginning were measured over time). We then ran algorithms for Motif discovery (finding repeated patterns within a time series) and Shapelet discovery (finding the subsequences most representative of the class to which a time series belongs).
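To make the idea of a motif concrete, the sketch below finds the two most similar non-overlapping subsequences of a series by brute force. It is only an illustration: it uses plain Euclidean distance without normalization, the function name `find_motif` and the sample series are invented for the example, and real motif discovery algorithms are far more efficient.

```python
import math


def find_motif(series, m):
    """Return (distance, i, j), where i and j are the start indices of the
    two closest non-overlapping subsequences of length m.
    Brute force: compares every pair of windows."""
    best = (math.inf, -1, -1)
    n_windows = len(series) - m + 1
    for i in range(n_windows):
        # Starting j at i + m keeps the two windows disjoint.
        for j in range(i + m, n_windows):
            d = math.dist(series[i:i + m], series[j:j + m])
            if d < best[0]:
                best = (d, i, j)
    return best


# Invented series in which the pattern 0, 1, 3, 1, 0 occurs twice.
series = [0, 1, 3, 1, 0, 5, 6, 0, 1, 3, 1, 0, 4]
distance, first, second = find_motif(series, 5)
print(distance, first, second)  # 0.0 0 7
```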

We also applied clustering algorithms for time series, as well as classification algorithms for univariate and multivariate time series (in the multivariate case several attributes are considered at each instant, so each value is n-dimensional).

Sequential Pattern Mining

Starting from the time series extracted above, we applied discretization algorithms such as SAX, so that the time series could be treated as sequences of symbols and mined for significant repeated patterns.
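The following sketch shows how a SAX-style discretization turns a numeric series into a short word: the series is z-normalized, averaged over equal-length segments (PAA), and each segment mean is mapped to a letter using breakpoints that split the standard normal distribution into equally probable regions. The function name `sax`, the four-letter alphabet and the sample values are assumptions for this example, and the sketch requires the series length to be a multiple of the number of segments.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of n_segments letters."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    if len(series) % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    z = [(v - mu) / sigma for v in series]

    # Piecewise Aggregate Approximation: one mean per segment.
    seg = len(z) // n_segments
    paa = [mean(z[k * seg:(k + 1) * seg]) for k in range(n_segments)]

    # Breakpoints cut the standard normal into equally probable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


word = sax([2, 2, 8, 8, 4, 4, 6, 6], n_segments=4)
print(word)  # adbc
```

Once every series is a word over a small alphabet, sequential pattern mining algorithms can search for subsequences that recur across many words.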

Outlier detection

In this fifth part we applied outlier detection algorithms, such as LOF (Local Outlier Factor) and ABOD (Angle-Based Outlier Detection).
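As a rough illustration of how LOF scores points, here is a small pure-Python sketch. It compares the local density of each point with that of its k nearest neighbours: scores close to 1 indicate an inlier, while much larger scores indicate an outlier. The function name `lof_scores` and the toy points are invented for the example, and the sketch makes simplifying assumptions: exactly k neighbours per point (ties broken arbitrarily), no duplicate points, and more than k points in total.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (simplified sketch)."""
    n = len(points)

    # neighbours[i]: the k nearest other points as (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])

    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    def local_reachability_density(i):
        # Reachability distance from i to j is max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF: average neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


# Four points on a unit square plus one point far away from them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```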

Explainability of the models

In this last task, we used algorithms that help address the so-called black box problem, that is, models that are effective but difficult to interpret. Neural networks are the prime example, but a random forest (compared with a single decision tree) and an SVM also fall into this category.

We therefore applied an Inspection Model Explainer to previously used models, such as SVM and Multilayer Perceptron.
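One simple model inspection technique is permutation feature importance, sketched below: a feature matters if shuffling its values makes the model's accuracy drop. This is not necessarily the explainer used in the project. The technique is chosen here only as an illustration, and the `black_box` function, its 300 lux threshold and the sample readings are all hypothetical stand-ins for a trained model and real data.

```python
import random


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    importances = []
    for f in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[f] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only feature f replaced.
            shuffled = [
                row[:f] + (v,) + row[f + 1:] for row, v in zip(X, column)
            ]
            drops.append(base - accuracy(shuffled))
        importances.append(sum(drops) / n_repeats)
    return importances


def black_box(row):
    """Hypothetical model: predicts 'occupied' from light, ignores temperature."""
    light, temperature = row
    return int(light > 300)


# Made-up (light, temperature) readings; 1 = occupied, 0 = empty.
X = [(450.0, 21.0), (20.0, 22.5), (500.0, 20.5),
     (0.0, 23.0), (430.0, 22.0), (10.0, 21.5)]
y = [1, 0, 1, 0, 1, 0]

light_importance, temperature_importance = permutation_importance(black_box, X, y)
print(light_importance, temperature_importance)
```

Because the technique only needs a prediction function, it works in the same way for an SVM, a Multilayer Perceptron or any other opaque classifier.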

Download of the document

For more detail on the algorithms used and the tests performed, you can download the full report below (available only in Italian).

Download


Lorenzo Mannocci