Introduction to Data Science: Concept and history of Data Science, the CRISP-DM process, goals and applications. Attributes, datasets, data quality issues, Big Data, basic machine learning tasks.
Data Exploration, Preparation, and Similarity Measures: Data preparation, exploratory data analysis and visualization, summary statistics, sampling, attribute aggregation, transformation, discretization. Minkowski distance and its special cases, Mahalanobis distance, cosine similarity, simple matching coefficient (SMC), Jaccard index, Hamming distance, dynamic time warping (DTW).
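A minimal sketch of a few of these measures using only NumPy; the vectors below are made-up examples, not course data:

```python
import numpy as np

x = np.array([1.0, 0.0, 3.0, 2.0])
y = np.array([2.0, 1.0, 1.0, 2.0])

def minkowski(a, b, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean distance."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def cosine_similarity(a, b):
    """Cosine of the angle between the two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def jaccard(a, b):
    """Jaccard index for binary vectors: |intersection| / |union| of the 1-positions."""
    a, b = a.astype(bool), b.astype(bool)
    return (a & b).sum() / (a | b).sum()

print(minkowski(x, y, 1), minkowski(x, y, 2))
print(cosine_similarity(x, y))
print(jaccard(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))
```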
kNN and Decision Trees: Nearest neighbor methods and their accelerations (k-d tree, condensed nearest neighbor), Bayes classifier, decision tree, Hunt's algorithm, split quality, impurity measures, evaluation.
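One way to try both classifiers side by side is scikit-learn (used here purely as an illustration, on the built-in Iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# kNN: the class of a point is the majority vote of its k nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Decision tree: greedy splits chosen by an impurity measure (Gini index by default)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_train, y_train)

print("kNN accuracy: ", knn.score(X_test, y_test))
print("tree accuracy:", tree.score(X_test, y_test))
```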
Overfitting and Model Evaluation: Generalization capability; training, test, and validation sets. Cross-validation, underfitting and overfitting, Occam’s razor, decision tree pruning, confusion matrix, performance metrics, ROC curve, AUC.
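A short evaluation sketch with scikit-learn, assuming a binary task (the built-in breast cancer dataset is used only as a stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# 5-fold cross-validation estimates generalization without touching a held-out test set
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Train/test split for the confusion matrix and AUC
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```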
Naive Bayes and Bayesian Networks: Bayes classifier principle, maximum a posteriori and maximum likelihood estimation, estimation with normal distribution, Laplace and m-estimation, evaluation of Naive Bayes, Bayesian networks, conditional independence.
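A tiny Naive Bayes sketch in scikit-learn; the numbers are invented and only meant to show the two typical estimation settings:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Continuous attributes: class-conditional densities modelled as normal distributions
X = np.array([[180, 80], [175, 75], [160, 55], [165, 60]], dtype=float)
y = np.array([0, 0, 1, 1])
print(GaussianNB().fit(X, y).predict([[170, 70]]))

# Discrete counts: alpha=1.0 applies Laplace smoothing, so an unseen attribute value
# does not force a zero class-conditional probability
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 2, 2], [1, 3, 1]])
print(MultinomialNB(alpha=1.0).fit(X_counts, y).predict([[0, 0, 3]]))
```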
Linear Regression: Parametric and nonparametric regression, kNN and decision tree for regression tasks, MSE, decomposition of MSE into bias and variance, bias–variance trade-off, optimal regression solution, linear regression, gradient descent, stochastic gradient descent, learning rate, regularization, polynomial regression.
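A minimal batch gradient descent sketch for linear regression; the synthetic data (y = 2x + 1 plus noise) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=200)

Xb = np.hstack([np.ones((len(X), 1)), X])    # add a bias column
w = np.zeros(2)
lr = 0.1                                     # learning rate

for _ in range(500):
    grad = 2 / len(y) * Xb.T @ (Xb @ w - y)  # gradient of the MSE
    w -= lr * grad

print("learned weights:", w)                 # close to [1, 2]
```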
Logistic Regression and SVM: Classification by regression, sigmoid function, objective function of logistic regression, linear separability, nonlinear decision boundaries, logit model, maximum margin principle, support vectors and SVM, optimization task, kernel trick, handling multiclass classification.
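To contrast a linear decision boundary with the kernel trick, one possible scikit-learn sketch on a non-linearly-separable toy dataset:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_tr, y_tr)        # linear decision boundary
svm_rbf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)   # kernel trick -> nonlinear boundary

print("logistic regression:", logreg.score(X_te, y_te))
print("RBF-kernel SVM:     ", svm_rbf.score(X_te, y_te))
print("support vectors per class:", svm_rbf.n_support_)
```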
Neural Networks: Biological motivation, activation functions, perceptron and its relationship to other algorithms, learning logical functions, multilayer neural networks, forward propagation, backpropagation.
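A small sketch of the perceptron update rule learning a logical function (plain NumPy, made-up hyperparameters):

```python
import numpy as np

# Perceptron learning logical AND. XOR would fail here because it is not linearly
# separable, which is the classic motivation for multilayer networks and backpropagation.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1

for _ in range(20):                          # a few epochs suffice for a separable problem
    for xi, ti in zip(X, y):
        pred = int(w @ xi + b > 0)           # step activation
        w += lr * (ti - pred) * xi           # perceptron update rule
        b += lr * (ti - pred)

print([(x.tolist(), int(w @ x + b > 0)) for x in X])
```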
Ensemble Learning for Classification: Ensemble methods, bagging, metamodels, boosting and AdaBoost, gradient boosting, random forest, semi-supervised learning, classification of imbalanced data, SMOTE.
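One way to compare the main ensemble families is scikit-learn; this is only a sketch on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Bagging: trees trained on bootstrap samples, predictions combined by voting
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Random forest: bagging plus random feature subsets at each split
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # AdaBoost: weak learners trained sequentially, reweighting misclassified points
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```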
Clustering: Concept and types of clustering, clustering algorithms, Stirling numbers of the second kind, Kleinberg’s impossibility theorem, k-means algorithm, bisecting k-means, hierarchical clustering, cluster distance measures, single-link and complete-link methods, DBSCAN algorithm, core, border, and noise points, k-medoids, fuzzy c-means, Gaussian mixture models, EM algorithm, cluster validation (distance matrices, SSE, silhouette).
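A brief clustering sketch with scikit-learn on synthetic blobs (the parameters are illustrative, not prescribed):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means SSE:", kmeans.inertia_)                    # within-cluster sum of squared errors
print("silhouette: ", silhouette_score(X, kmeans.labels_))

# DBSCAN labels low-density points as noise (label -1) instead of forcing them into a cluster
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN labels found:", set(dbscan.labels_))

# Hierarchical (agglomerative) clustering with complete linkage
agg = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
```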
Recommender Systems: Content-based approach, collaborative filtering, nearest-neighbor methods, latent factor models, matrix factorization.
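A minimal matrix factorization sketch with stochastic gradient descent; the rating matrix and hyperparameters below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# User x item rating matrix; 0 marks a missing rating
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
k, lr, reg = 2, 0.01, 0.02                   # latent factors, learning rate, regularization
P = 0.1 * rng.normal(size=(R.shape[0], k))   # user factors
Q = 0.1 * rng.normal(size=(R.shape[1], k))   # item factors

for _ in range(2000):
    for u, i in zip(*np.nonzero(R)):         # iterate over observed ratings only
        err = R[u, i] - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(np.round(P @ Q.T, 2))                  # predicted ratings, including the missing cells
```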
Dimensionality Reduction: Advantages and disadvantages of high dimensionality, curse of dimensionality, high-dimensional paradoxes, dimensionality reduction methods, feature selection techniques, principal component analysis, independent component analysis.
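A short PCA sketch (scikit-learn plus the equivalent eigen-decomposition view), again on a built-in dataset chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)    # PCA is sensitive to attribute scales

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)

# The same principal directions come from the eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
print("eigenvalues (ascending):", eigvals)
```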
Association Rule Learning: Market basket transactions, support, confidence, rule mining, brute-force and two-step approaches, Apriori principle and algorithm, maximal and closed itemsets, rule generation (brute-force and Apriori-based), lift measure.
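The rule measures themselves fit in a few lines of plain Python; the transactions below are a made-up market basket example:

```python
# Made-up market basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule {diapers} -> {beer}: confidence = supp(X u Y) / supp(X), lift = confidence / supp(Y)
X, Y = {"diapers"}, {"beer"}
conf = support(X | Y) / support(X)
lift = conf / support(Y)
print(f"support={support(X | Y):.2f}  confidence={conf:.2f}  lift={lift:.2f}")
```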
Outlier and Anomaly Detection: Causes, goals, and aspects of anomaly detection, applications, detection methods and anomaly types, supervised and statistical approaches, clustering-based and nearest-neighbor-based detection, density-based methods, local outlier factor (LOF), isolation forest, convex hull, half-space depth, detection after dimensionality reduction, contextual and collective anomalies, online detection methods.
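A sketch of two of the listed detectors in scikit-learn, on synthetic data with a few injected anomalies (all values invented):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # normal points
               [[6, 6], [-7, 5], [8, -6]]])         # injected anomalies

# LOF compares the local density of each point with that of its neighbours
lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)   # -1 marks outliers

# Isolation forest isolates anomalies with few random splits
iso_labels = IsolationForest(random_state=0).fit_predict(X)

print("LOF outliers:    ", np.where(lof_labels == -1)[0])
print("iForest outliers:", np.where(iso_labels == -1)[0])
```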
Networks and PageRank: Graph representations, properties of real-world networks, scale-free property, Erdős–Rényi model, preferential attachment model, node centrality measures, PageRank idea, random surfer model, Markov chains, stationary distribution, power method, PageRank walk, importance of teleportation, PageRank manipulability, personalized PageRank.
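A compact power-method sketch of PageRank with teleportation; the four-page link matrix is a made-up example:

```python
import numpy as np

# Column-stochastic link matrix of a tiny 4-page web:
# M[i, j] = probability that a surfer on page j follows a link to page i
M = np.array([[0,   0,   1, 0.5],
              [1/3, 0,   0, 0  ],
              [1/3, 0.5, 0, 0.5],
              [1/3, 0.5, 0, 0  ]])
n = M.shape[0]
d = 0.85                                    # damping factor: teleport with probability 1 - d

r = np.full(n, 1 / n)                       # start from the uniform distribution
for _ in range(100):                        # power method: iterate toward the stationary distribution
    r = d * M @ r + (1 - d) / n

print("PageRank:", np.round(r, 3), "sum =", r.sum())
```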
Technologies
Python: IPython, Anaconda, Jupyter. Libraries: pandas, scikit-learn, NumPy, SciPy, Matplotlib, Keras, TensorFlow. Topics: array handling, web scraping, data acquisition (API), data import (CSV, JSON, XML/HTML), classification and regression tasks, gradient and ensemble methods, character recognition with neural networks, PCA, face recognition.
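A minimal pandas sketch of the data import and aggregation workflow; the file name and column names here are hypothetical placeholders:

```python
import pandas as pd

# Import from CSV; JSON and HTML sources work analogously via pd.read_json / pd.read_html
df = pd.read_csv("measurements.csv")

print(df.head())                               # first rows
print(df.describe())                           # summary statistics of numeric columns
print(df.groupby("category")["value"].mean())  # aggregation by a (hypothetical) column

df.to_json("measurements.json", orient="records")  # re-export in another format
```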
R: RStudio. Packages: ggplot2, class, dplyr. Topics: vectors, matrices, lists, data frames, importing data from files and the web, visualization, machine learning (classification, regression), aggregation, clustering, box plots.
Tableau: Measures and dimensions, data transformation and aggregation, creating calculated fields and parameters, bar and scatter plots, axis transformations, filtering data, editing tooltips, reference lines, histograms, Python/R integration, clustering.