Syllabus
1. Introduction to Data Science: Concept, history and process (CRISP-DM) of Data Science, goal of data science and its applications. Attributes,
datasets, Big Data, Machine Learning tasks.
2. Data exploration, preparation and similarity measures: Data preparation, exploratory analysis, data visualization, summary statistics,
sampling, attribute aggregation, transformation, and discretization. Minkowski distance, Mahalanobis distance, Cosine similarity, SMC, Jaccard
index, Hamming distance, DTW.
3. kNN and Decision Tree: Method of nearest neighbors and its accelerations (k-d tree), Bayes classifier, Decision Tree, Hunt's algorithm, split
purity, impurity metrics, validation.
4. Overfitting, validation: Generalization, training, test, and validation sets. Cross-validation, underfitting and overfitting, Occam’s razor, confusion
matrix, performance indicators, ROC, AUC.
5. Naive Bayes: Naive Bayes classifier, maximum a posteriori and maximum likelihood estimation, estimation with normal distribution, Laplace and
m-estimation.
6. Linear regression: Parametric and nonparametric regression, kNN and Decision Tree for regression tasks, MSE, decomposition of MSE and
variance, bias–variance tradeoff, optimal solution of regression, linear regression, gradient descent, stochastic gradient descent, learning rate,
regularization, polynomial regression, interpreting linear regression models.
7. Logistic regression and SVM: Classification by regression, sigmoid function, logistic regression, linear separability, non-linear decision
boundary, logit model, maximal margin, support vectors and SVM.
8. Neural networks: Biological motivation, activation function, perceptron and its relation to other algorithms, representing Boolean functions
with neural networks, deep learning, forward propagation, backpropagation.
9. Ensemble learning: Ensemble methods, bagging, metamodels, boosting and AdaBoost, gradient boosting, Random Forest, semi-supervised
learning, classification of imbalanced data, SMOTE.
10. Cluster analysis: Concept, types, clustering algorithms, k-means algorithm, hierarchical clustering, distance of clusters, single-linkage and
complete-linkage clustering, DBSCAN algorithm, core, border, and noise points, validation of clustering (distance matrix, SSE, silhouette).
11. Recommendation systems: Content-based recommender, collaborative filtering, user-based and k-nearest-neighbor recommender, latent factor
recommender system, matrix factorization.