Architectures for Big Data
A.Y. 2022/2023
Learning objectives
The course describes the big data processing framework in terms of both methodologies and technologies. Part of the lessons focuses on Apache Spark and distributed design patterns.
Expected learning outcomes
Students will learn:
How to distribute computation over clusters using the MapReduce model
How to write Apache Spark code
How Hadoop works and why it works that way
What a software architecture is
How to design batch architectures to manage data workflows
Several design patterns that could be used in a distributed environment
The limits of traditional SQL with big data
Lesson period: First semester
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Lesson period
First semester
Course syllabus
High Level Information
1. Topics
a. Enterprise Architectures
b. Design Patterns
c. Hadoop
d. Distributed Algorithms
e. Big Data and SQL
f. Document-Based Databases
g. Containers
2. Technologies
a. Python
b. Apache Spark - Resilient Distributed Dataset
c. ELK Stack: Elasticsearch, Logstash, Kibana
d. Docker
3. External Workshops
a. Workshop 1: IoT for Connected Vehicle (Marelli - Riccardo Tomasi, PhD) - TBC
b. Workshop 2: Services and Microservices (artea.com - TBD) - TBC
c. Workshop 3: TBD (Unimi - Prof. Dario Malchiodi) - TBC
d. Workshop 4: TBD (Google - TBD) - TBC
Prerequisites for admission
The course requires knowledge of the main topics of bachelor-level computer programming, calculus, probability, and statistics. Knowledge of Python is useful, although an introductory lecture on programming in Python is part of the lecture calendar.
Teaching methods
.
Teaching Resources
Lectures are based on the notes and sample code published in the calendar of lectures.
The following material is also recommended.
· To practice with Spark: H. Karau, A. Konwinski, P. Wendell, M. Zaharia, Learning Spark: Lightning-Fast Big Data Analysis, O'Reilly, 2015 (ISBN: 978-1-449-35862-4).
· For a deeper study of Spark: S. Ryza, U. Laserson, S. Owen, J. Wills, Advanced Analytics with Spark: Patterns for Learning from Data at Scale, O'Reilly, 2015 (ISBN: 978-1-491-91276-8).
· About distributed file systems and the MapReduce paradigm: Yahoo! Hadoop Tutorial (besides Chapter 2 in RU).
· For a deeper study of the practical parts: the edX program Data Science and Engineering with Spark.
Assessment methods and Criteria
The exam consists of an oral test based on the discussion of some of the topics covered in the course. The evaluation, expressed on a scale from 0 to 30, takes into account the level of mastery of the topics, clarity, and language skills.
INF/01 - INFORMATICS - University credits: 6
Lessons: 48 hours
Professor:
Condorelli Andrea