EGIT 532: Data Science and Big Data Analytics (Semester 2/2018)
by ITM-Mahidol University
   
Description
This course provides practical coverage of essential data mining topics, including:
  • Data ecosystem and introduction to data science
  • Data science team building
  • Data and data exploration
  • Data ingestion, querying, and cleansing
  • Data mining techniques: classification, regression, clustering, association, and text mining
  • Algorithm optimization
  • Big data architecture and Spark platforms
  • Ethical issues in data science, data governance, and GDPR

Students will work with the following software:

RapidMiner Studio 9.1 [download]

   
Textbook

Provost F, Fawcett T. Data science for business: what you need to know about data mining and data-analytic thinking. Sebastopol: O'Reilly; 2013.

Leskovec J, Rajaraman A, Ullman JD. Mining of massive datasets. Delhi: Cambridge University Press; 2016.

Tan P-N, Steinbach M, Karpatne A, Kumar V. Introduction to data mining. New York, NY: Pearson Education, Inc.; 2019.

Ladley J. Data governance: how to design, deploy, and sustain an effective data governance program. Amsterdam: M. Kaufmann; 2012.

 

Course Date

Sunday, 9:00am-12:00pm, Room: R403

   
Instructors
Asst. Prof. Sotarat Thammaboosadee, Ph.D.
[Certified RapidMiner Analyst and CIMP-Data Governance]
Office: R303
Line: zotarutto
e-mail: sotarat.tha@mahidol.ac.th
Office Hours: by appointment
 
Repositories: Shared drive

Tentative Course Schedule
Week 1: January 13, 2019
Lecture: Data ecosystem and introduction to data science
Topics:
  • Data era
  • Data ecosystem
  • Data lifecycle
  • Data jobs
  • Data science definition and opportunities
  • Data science skillset
  • Data science process and use cases
  • Data science team
  • Data science project management
  • Course introduction
Materials: Lecture01-Intro

Week 2: January 20, 2019
Lecture: Data, data exploration, and visual analytics
Topics:
  • Structured, semi-structured, and unstructured data
  • Data types
  • Basic statistical measurements
  • Data quality
  • Basic data visualization
  • Introduction to RapidMiner Studio 9.1
Materials: Lecture02.1-Data, Lecture02.2-DataViz

Week 3: January 27, 2019
Lecture: Basic data ingestion and querying
Topics:
  • Quick introduction to relational databases
  • Data integration
  • Data filtering
  • Aggregation
  • Pivoting
Materials: Lecture and labs
Assignments: HW1

Week 4: February 3, 2019
Lecture: Data cleansing
Topics:
  • Data transformation
  • Attribute binning
  • Normalization
  • Missing-value handling
  • Basic sampling techniques
Materials: Lecture and labs

Week 5: February 10, 2019 (part1 / part2)
Lecture: Basic classification techniques and performance evaluation
Topics:
  • Supervised learning concepts
  • K-nearest neighbours
  • Naïve Bayes
  • Decision tree
  • Confusion matrix
  • Train-test validation
  • Cross-validation
  • Cost validation
Materials: Lecture and labs
Assignments: HW2

Week 6: February 17, 2019 (part1 / part2)
Lecture: Regression analysis and time-series forecasting
Topics:
  • Linear regression
  • Polynomial regression
  • Logistic regression
  • Error measurements
  • Time-series data transformation
  • Time-series forecasting techniques
Materials: Lecture and labs

Week 7: February 24, 2019 (part1 / part2)
Lecture: Advanced data preprocessing
Topics:
  • Feature weighting and selection
  • Feature subset selection
  • Principal component analysis
  • Singular value decomposition
  • Missing-value imputation
  • Imbalanced data handling
  • SMOTE technique
Materials: Lecture and labs

Week 8: March 3, 2019 (9am-12pm) (part1 / part2)
Lecture: Advanced predictive modeling and outlier analysis
Topics:
  • Neural networks
  • Ensemble methods
  • Random forests
  • Gradient boosted trees
  • Distance-based, density-based, and class-based outlier analysis
Materials: Lecture and labs

Week 9: March 3, 2019 (2pm-5pm) (part1 / part2)
Lecture: Advanced control flow and hyperparameter tuning
Topics:
  • Advanced data science control flow
  • Conditions and looping
  • Dynamic process
  • Hyperparameter tuning

Week 10: March 10, 2019 (part1 / part2)
Lecture: Clustering analysis and semi-supervised learning
Topics:
  • Centroid-based clustering
  • Density-based method
  • Hierarchical method
  • Semi-supervised learning
Materials: Lecture and labs
Assignments: HW3

Week 11: March 17, 2019 (2pm-5pm)
Lecture: Association rules discovery and recommendation systems
Topics:
  • Apriori
  • FP-Growth
  • Sequential pattern mining
  • Recommendation system
Materials: Lecture and labs
Assignments: *Term paper assignment; **Project requirements announcement

Week 12: March 31, 2019
Lecture: Text mining
Topics:
  • Text processing methods
  • Basic natural language processing
  • Text mining applications
  • Thai text mining
Materials: Lecture and labs
Assignments: HW4

Week 13: April 21, 2019
Lecture: Big data: ecosystem and architecture
Topics:
  • Big data motivation
  • Big data architecture
  • Big data technology stacks
  • Data lake
Materials: Lecture

Week 14: April 21, 2019
Lecture: Big data analytics with Spark and MapReduce
Topics:
  • Hadoop and MapReduce architecture
  • Spark architecture
  • Distributed analytics frameworks
Materials: Lecture and labs

Week 15: April 28, 2019
Lecture: Ethical issues in data science, data governance, and GDPR
Topics:
  • Data governance and data stewards
  • GDPR
  • Data analytics frameworks
  • Data monetization
  • Data science governance
Materials: Lecture

Week 16: May 5, 2019
Lecture: Paper Presentation** (assigned in week 8)


Remark:
1. The individual project report and presentation clip must be submitted before 12 May 2019. The project requirements and specifications will be announced in week 8.


Resources:

RapidMiner: http://docs.rapidminer.com/

Grading

Grades will be based on points earned from assignments, attendance, individual project, and research presentation as follows:

Grading Procedures:

Assignments: Assignments will be graded on a +/ok/- scale, where + indicates excellent, ok indicates satisfactory, and - indicates needs improvement. Assignments will be given in selected classes. Your solutions are how you demonstrate your mastery of the course material to the instructor. They must be clearly different from those turned in by other students and must represent your own individual effort.

Assignment Submission: Each assignment is due one week after the date it is assigned and must be submitted by email to zotarat@gmail.com. The email subject must follow the format "HW1-61xxxxx". The assignment MUST be submitted as a PDF file. Late submissions will not be accepted.

Individual Project: Students must propose a complete data science process for a data set of their choice, covering the domain problem, the nature of the data with examples, the selected data preprocessing methods, the selected data mining algorithms with comparisons, model evaluation, and possible real-world deployment. All work must be documented in a report and submitted as a PDF file. In addition, students must present their work in a recorded video clip, which may be uploaded to YouTube or shared through any shared drive service.


Attendance Policy:

Lectures: Because this course combines theory and practice, it includes several hands-on workshops that students must complete on their own; these count toward the attendance score. There is no minimum attendance requirement, but attendance will be reflected in that score.