Algorithms for Massive Datasets

A.Y. 2024/2025
Max ECTS: 6
Overall hours: 48
SSD: INF/01
Language: English
Learning objectives
The course aims to describe the big data processing framework, in terms of both methodologies and technologies.
Expected learning outcomes
Students:
- will be able to use technologies for the distributed storage of datasets;
- will know the map-reduce distributed processing framework and its leading extensions;
- will know the principal algorithms used to deal with classical big data problems, and will be able to implement them using a distributed processing framework;
- will be able to choose appropriate methods for solving big data problems.
Single course

This course can be attended as a single course.

Course syllabus and organization

Single session

Responsible
Lesson period
Second semester
Course syllabus
The course will consider the main techniques for processing data at massive scale and their implementation on distributed computational frameworks. Lectures will review the principal application contexts characterized by amounts of data that cannot be handled using standard computing facilities and procedures, and will analyze each context in terms of algorithms tailored to it. In addition, some general big data processing techniques, such as those falling under the umbrella of machine learning, will be considered.

More precisely, the following topics will be covered.
- Technical and mathematical preliminaries.
- Basics of MapReduce, Hadoop, and Spark.
- Analysis of MapReduce algorithms.
- NoSQL databases: MongoDB.
- Link analysis.
- Regression.
- Logistic regression.
- Stream analysis.
- Deep learning.
- Clustering.
- Finding similar items.
- Market-basket analysis.
- Gradient boosting.
- Recommender systems.
- Dimensionality reduction.
- Embeddings.
Prerequisites for admission
The course requires knowledge of the main topics of bachelor-level computer programming, linear algebra, calculus, probability, and statistics.
Teaching methods
Frontal classes
Teaching Resources
Textbook:
- Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press (ISBN:9781107015357).

Suggested readings:
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark. Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN:978-1-449-35862-4)
- Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark. Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN:978-1-491-91276-8)

Lecture notes, supplementary material, and sample code:
- https://labonline.ctu.unimi.it/
- https://malchiodi.di.unimi.it/teaching/AMD/
Assessment methods and Criteria
The exam consists of a project and an oral test, both related to the topics covered in the course. The project, described in a report, requires processing one or more datasets through the critical application of the techniques described in class. The evaluation of the project, expressed as a pass/fail mark and sent to students via mail, considers the level of mastery of the topics and the clarity of the report. The oral test, which students may take only after a positive evaluation of the project, consists of a discussion of some topics covered in the course and in-depth questions about the presented project. The evaluation of the oral test, expressed on a scale from 0 to 30, takes into account the level of mastery of the topics, clarity, and language skills.
INF/01 - INFORMATICS - University credits: 6
Lessons: 48 hours
Professor: Malchiodi Dario
Professor(s)
Reception:
By appointment
Room 5015 of the Computer Science Department