Algorithms for Massive Datasets | Università degli Studi di Milano Statale

A.Y. 2024/2025

Max ECTS

6

Overall hours

48

SSD

INF/01

Language

English

Included in the following degree programmes

Computer Science (Class LM-18) - Enrolled from the 2014/2015 until the 2024/2025 academic year

Learning objectives

The course aims at describing the big data processing framework, both in terms of methodologies and technologies.

Expected learning outcomes

Students:
- will be able to use technologies for the distributed storage of datasets;
- will know the map-reduce distributed processing framework and its leading extensions;
- will know the principal algorithms for classical big data problems and will be able to implement them using a distributed processing framework;
- will be able to choose appropriate methods for solving big data problems.

Lesson period: Second semester

Assessment methods: Exam
Assessment result: mark recorded on a 30-point scale

Single course

This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.

Course syllabus and organization

Single session

Responsible

Malchiodi Dario

Course syllabus

The course covers the main techniques for processing data at massive scale and their implementation on distributed computational frameworks. Lectures will review the principal application contexts in which the amount of data cannot be handled with standard computing facilities and procedures, and will analyze each context in terms of algorithms tailored to it. Alongside these, some general big data processing techniques, such as those falling under the umbrella of machine learning, will be considered.

The following topics will be covered:
- Technical and mathematical preliminaries.
- Basics of MapReduce, Hadoop, and Spark.
- Analysis of MapReduce algorithms.
- NoSQL databases: MongoDB.
- Link analysis.
- Regression.
- Logistic regression.
- Stream analysis.
- Deep learning.
- Clustering.
- Finding similar items.
- Market-basket analysis.
- Gradient boosting.
- Recommender systems.
- Dimensionality reduction.
- Embeddings.

Prerequisites for admission

The course requires knowledge of the main topics of bachelor-level computer programming, linear algebra, calculus, probability, and statistics.

Teaching methods

Lectures

Teaching Resources

Textbook:
- Anand Rajaraman and Jeff Ullman, Mining of Massive Datasets, Cambridge University Press (ISBN: 9781107015357).

Suggested readings:
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN: 978-1-449-35862-4).
- Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN: 978-1-491-91276-8).

Lecture notes, supplementary material, and sample code:
- https://labonline.ctu.unimi.it/
- https://malchiodi.di.unimi.it/teaching/AMD/

Assessment methods and Criteria

The exam consists of a project and an oral test, both related to the topics covered in the course. The project, described in a report, requires processing one or more datasets through the critical application of the techniques presented in class. The project is graded pass/fail, and the result is sent to students by email; the evaluation considers the level of mastery of the topics and the clarity of the report. The oral test, which can be taken only after the project has passed, consists of a discussion of topics covered in the course and in-depth questions about the submitted project. The oral test is graded on a scale from 0 to 30 and takes into account the level of mastery of the topics, clarity, and language skills.

Course structure

INF/01 - INFORMATICS - University credits: 6

Lessons: 48 hours

Professor: Malchiodi Dario

Educational website(s)

Algorithms for massive datasets (A.Y. 2024/25)

Professor(s)

Malchiodi Dario

Reception:

By appointment

Room 5015 of the Computer Science Department