In this second data mining project, we worked on a dataset of measurements taken inside a room (attributes such as humidity, temperature, and light), with the aim of **predicting whether the room is occupied**.

## Classifiers and imbalanced datasets

We applied classification algorithms already covered in Data Mining 1, such as **KNN and Decision Tree**, as well as new ones such as **Logistic Regression and Naive Bayes**.

These algorithms were applied both to a **balanced dataset** and to an **imbalanced dataset** (with respect to the classes to be predicted), to compare the results obtained. **Resampling techniques** were then applied to the imbalanced dataset to make it balanced again.
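The source does not say which resampling techniques were used, so the following is only a minimal sketch of one of the simplest, random oversampling, written with the standard library. The function name `random_oversample` and the toy readings are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the majority size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Candidate rows to duplicate for this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: (temperature, humidity) readings, 1 = occupied.
rows = [(21.0, 27.1), (21.3, 27.4), (20.9, 26.8), (21.1, 27.0), (23.2, 31.5)]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.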

We also applied **PCA**, a **dimensionality reduction** technique, then reran the algorithms mentioned above on the reduced dataset and compared the results with the base case.

## Advanced classifiers

In this second phase, we applied more advanced classifiers such as **SVM (Support Vector Machine)** and **Multilayer Perceptron**, as well as **Deep Neural Networks (feedforward, CNN, recurrent)**.

We also used ensemble classifiers, such as **Random Forest, AdaBoost, and Bagging**, and evaluated the performance improvement over simpler classifiers.
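To illustrate the idea behind bagging, here is a minimal standard-library sketch: several weak learners (one-split decision stumps) are each trained on a bootstrap sample, and the ensemble predicts by majority vote. The helper names, the toy readings, and the choice of stumps as base learners are assumptions made for illustration, not the project's actual setup.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-split stump: the (feature, threshold, label_above) with most training hits."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):  # thresholds taken from observed values
            for above in (0, 1):
                hits = sum((above if row[f] > t else 1 - above) == label
                           for row, label in zip(X, y))
                if best is None or hits > best[0]:
                    best = (hits, f, t, above)
    _, f, t, above = best
    return lambda row: above if row[f] > t else 1 - above

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train stumps on bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

    def predict(row):
        votes = Counter(stump(row) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict

# Hypothetical toy data: (light, temperature) readings, 1 = occupied.
X = [(450.0, 21.5), (20.0, 20.9), (520.0, 22.1), (0.0, 21.0),
     (480.0, 20.5), (10.0, 22.3), (430.0, 21.8), (15.0, 20.7)]
y = [1, 0, 1, 0, 1, 0, 1, 0]

predict = fit_bagging(X, y)
print(predict((500.0, 21.0)), predict((0.0, 21.0)))
```

Random Forest follows the same bootstrap-and-vote scheme with full decision trees and random feature subsets, while AdaBoost instead trains learners sequentially and reweights misclassified rows.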

## Time series analysis, forecasting and classification

In this third phase, we **extracted time series** from the dataset (the attributes mentioned at the beginning were measured over time). Then, we ran algorithms for **Motif discovery** (repeated patterns within a time series) and **Shapelet discovery** (subsequences that are most representative of the class to which the time series belongs).
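As a rough illustration of motif discovery, the sketch below does a brute-force search for the closest pair of non-overlapping, z-normalized subsequences of a fixed length. This is not necessarily the algorithm used in the project (efficient methods exist for long series); the function names and the toy series are hypothetical.

```python
import math

def znorm(window):
    """Z-normalize a window so that matches depend on shape, not on offset or scale."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]

def find_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping windows of length m."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, -1, -1)
    for i in range(len(windows)):
        # Starting j at i + m skips overlapping windows (trivial matches).
        for j in range(i + m, len(windows)):
            d = math.dist(windows[i], windows[j])
            if d < best[0]:
                best = (d, i, j)
    return best

# Hypothetical series with the pattern [0, 2, 5, 2, 0] planted at indices 2 and 12.
series = [1, 3, 0, 2, 5, 2, 0, 4, 1, 3, 6, 1, 0, 2, 5, 2, 0, 3, 1, 4]
distance, first, second = find_motif(series, m=5)
print(distance, first, second)
```

The search is quadratic in the number of windows, which is acceptable for a short example but too slow for long recordings.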

We also applied **clustering algorithms for time series** and **classification** algorithms for **univariate and multivariate time series** (in the multivariate case, several attributes are considered at each instant, so each value is n-dimensional).

## Sequential Pattern Mining

Starting from the time series extracted above, we applied **discretization algorithms such as SAX**, in order to treat the time series as sequences and run algorithms for the **discovery of significant repeated patterns**.
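A minimal sketch of the SAX idea follows: z-normalize the series, average it over equal-length segments (PAA), then map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The function name `sax`, the four-letter alphabet, and the toy temperature values are assumptions; the sketch also assumes a non-constant series whose length is divisible by the number of segments.

```python
from bisect import bisect
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of n_segments letters."""
    mu, sigma = mean(series), pstdev(series)      # assumes sigma > 0
    z = [(v - mu) / sigma for v in series]
    seg_len = len(z) // n_segments                # assumes exact divisibility
    # Piecewise Aggregate Approximation: one mean per segment.
    paa = [mean(z[k * seg_len:(k + 1) * seg_len]) for k in range(n_segments)]
    # Breakpoints that cut the standard normal into equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect(breakpoints, v)] for v in paa)

# Hypothetical temperature readings: low, medium-high, high, medium-low.
temperature = [20.0, 20.1, 20.0, 20.2, 21.5, 21.7, 21.6, 21.8,
               23.0, 23.1, 22.9, 23.2, 21.0, 20.9, 21.1, 21.0]
word = sax(temperature, n_segments=4)
print(word)  # acdb
```

Once each series is a short string, sequential pattern mining algorithms can look for symbol subsequences that recur across series.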

## Outlier detection

In this fifth part, we applied outlier detection algorithms such as **LOF (Local Outlier Factor)** and **ABOD (Angle-Based Outlier Detection)**.
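To show what LOF measures, here is a simplified standard-library sketch: each point's local density is compared with the density of its k nearest neighbours, so a score near 1 means "as dense as its neighbours" and a score well above 1 suggests an outlier. Unlike the full definition, this version keeps exactly k neighbours when distances tie and assumes no duplicate points; the function name and the toy points are hypothetical.

```python
import math

def lof_scores(points, k=2):
    """Return a simplified Local Outlier Factor for each point."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neigh = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neigh.append(dists[:k])
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbour
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean neighbour density divided by the point's own density.
    return [sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

# Hypothetical toy data: a tight cluster plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

In this toy set the four clustered points score about 1, while the isolated point scores far higher.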

## Explainability of the models

In this last task, we used algorithms that help address the so-called **black box problem**: models that are effective but difficult to interpret. Neural networks are the prime example, but the same applies to SVMs, or to a random forest compared with a single decision tree.

We therefore applied an **Inspection Model Explainer** to previously used models such as SVM and Multilayer Perceptron.
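The source does not specify which explainer was used. As an illustrative stand-in, the sketch below implements permutation feature importance, a common model-agnostic inspection technique: each feature column is shuffled in turn, and the resulting drop in accuracy indicates how much the model relies on it. The `black_box` function is a hypothetical placeholder for a trained SVM or Multilayer Perceptron, and the data is invented.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + (v,) + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical black box standing in for a trained model; it only uses light.
def black_box(row):
    light, temperature = row
    return int(light > 300)

# Hypothetical toy data: (light, temperature) readings, 1 = occupied.
X = [(450.0, 21.5), (20.0, 20.9), (520.0, 22.1), (0.0, 21.0), (480.0, 20.5), (10.0, 22.3)]
y = [1, 0, 1, 0, 1, 0]

importances = permutation_importance(black_box, X, y)
print(importances)
```

Because the technique only calls the model for predictions, it works the same way regardless of what the model is internally.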

## Download of the document

For a better understanding of the algorithms used and the tests performed, you can download the relevant document below (available only in Italian).