Section outline

    • ML vs Statistics vs Data Science

    • Data Preprocessing: Encoding, Missing Values, Outliers

    • Python and R for ML: NumPy, Pandas, ggplot2

    • Spark Basics: RDDs, DataFrames, SparkR, MLlib

    • Lab: S&P 500 stock data analysis

    • Simple and Multiple Linear Regression

    • Ridge, Lasso, ElasticNet, Cross Validation

    • Gradient Boosting for Regression

    • Lab: Power demand prediction, Housing price regression

    • Decision Trees, Random Forests

    • Logistic Regression, Support Vector Machines

    • Evaluation: Confusion Matrix, ROC-AUC

    • Lab: Customer segmentation, Credit risk analysis, UCI wine dataset

    • Clustering: K-Means, Hierarchical

    • Feature Engineering and PCA

    • Text Analytics: TF-IDF, POS Tagging, Lemmatization, Sentiment Analysis

    • Lab: Movie genre clustering, IMDB comment classification

    • Spark MLlib & ML Pipelines

    • Saving and Serving Models

    • Optional: PredictionIO for streaming models

    • Lab: Stack Overflow dataset processing and community detection