Course title Mining massive datasets
Assessment method Hours/semester 60 Lect. Exercises Lab. Project
ECTS 4 Year Hours/week 2 2
Students should be familiar with Java, the Eclipse IDE, Linux commands, and bash scripts.
Course description
The goal of this course is to survey recent developments in the field of processing massive datasets. We will learn techniques and tools from the Hadoop ecosystem.
Course objectives
The objective of this course is to familiarize students with the problems of processing large datasets and with state-of-the-art solutions to these problems. Practical skills include:
  • Writing MapReduce jobs
  • Running data-mining libraries that work on top of Hadoop HDFS
Homework 20%, Presentation 30%, Project 50%.
Reference Texts and Software

Books and papers:

  1. Hadoop in Practice, Alex Holmes
  2. Hadoop in Action, Chuck Lam
  3. Mahout in Action, Sean Owen
  4. Hadoop: The Definitive Guide, Tom White
Lecture Schedule
1. The challenges of processing massive datasets
2. Installing Hadoop in Virtual Machines with VirtualBox
3. Debugging Hadoop MapReduce examples in Eclipse IDE
4. HDFS, serialization, and file formats
5. Joining collections of documents with MapReduce
6. Sorting and sampling key-value datasets
7. Case study: clustering Reuters articles with Mahout
8. Finding the shortest path between two nodes
9. Calculating Friend-of-Friends in a social network
10. Calculating PageRank over a web graph
11. Data analytics with Hive
12. Resilient Distributed Datasets and Spark
13. Debugging and diagnosing performance problems
14. Collaborative filtering, in-memory implementations
15. Collaborative filtering, distributed implementations