
ML vs Statistics vs Data Science
Data Preprocessing: Encoding, Missing Values, Outliers
Python and R for ML: NumPy, Pandas, ggplot2
Spark Basics: RDDs, DataFrames, SparkR, MLlib
Lab: S&P 500 stock data analysis
Simple and Multiple Linear Regression
Ridge, Lasso, ElasticNet, Cross Validation
Gradient Boosting for Regression
Lab: Power demand prediction, Housing price regression
Decision Trees, Random Forests
Logistic Regression, Support Vector Machines
Evaluation: Confusion Matrix, ROC-AUC
Lab: Customer segmentation, Credit risk analysis, UCI wine dataset
Clustering: K-Means, Hierarchical
Feature Engineering and PCA
Text Analytics: TF-IDF, POS, Lemmatization, Sentiment Analysis
Lab: Movie genre clustering, IMDb comment classification
Spark MLlib & ML Pipelines
Saving and Serving Models
Optional: PredictionIO for streaming models
Lab: Stack Overflow dataset processing and community detection