Coding for Data Science and Data Management
A.Y. 2024/2025
Learning objectives
The course provides technical skills in coding and scripting for data analysis and in managing the persistent storage of the sources and results involved in an analysis. On the coding side, the Python programming language and the R framework are presented, covering the essential data structures and control structures of both. On the data management side, the course presents the core notions of relational databases, such as keys, integrity, and primary/foreign key constraints, together with the SQL language for data definition, manipulation, and querying. Recent NoSQL solutions are also discussed, with a special focus on the document-oriented system MongoDB.
Expected learning outcomes
Upon completion of the course, students will be able to:
- manage data using R and RStudio;
- solve coding challenges using R libraries and functions;
- perform statistical inference and produce graphics using R;
- use the apply family of functions in R;
- understand the Python data model and the flow control statements;
- use the built-in Python data structures;
- perform basic linear algebra operations using NumPy;
- perform basic data set manipulations using Pandas;
- perform simple machine learning experiments using Scikit-learn;
- understand and apply the core notions of data modeling in relational databases;
- use the SQL language for creating and querying relational database structures;
- understand and apply the principles of data organization in NoSQL systems;
- use MongoDB for data retrieval and aggregation in a document-oriented NoSQL system.
Lesson period: First trimester
Assessment methods: Exam
Assessment result: grade recorded in thirtieths
Single course
This course can be attended as a single course.
Course syllabus and organization
Single session
Prerequisites for admission
No prerequisites.
Assessment methods and Criteria
The assessment consists of a written exam covering the syllabus of both units, coding for data science and data management. The written exam is organized into:
- exercises on R scripts;
- quizzes on Python code fragments;
- open-ended questions on topics from data analysis;
- quizzes on i) relational data modeling and ii) NoSQL systems;
- exercises on the SQL language and find/aggregation queries in MongoDB.
For the data management unit, only the written-exam modality is possible.
For the coding unit on R and Python programming, students may replace the written exam with a project. When the project modality is selected, the contents and deadlines must be agreed with and approved by the teachers of the coding unit. In the project modality, two distinct projects are expected: one for R and one for Python. Students may also choose a mixed modality, where R is taken as a project and Python as a written exam, or vice versa.
The result of the whole course is expressed in thirtieths, and it is the average of the two units (coding and data management units).
Coding for Data Science and Data Management-Module Coding for Data Science
Course syllabus
The syllabus of the unit is about the following topics.
R framework:
- Introduction to the R framework and RStudio;
- Data structures: vectors, arrays, data.frames, lists, environments;
- Building functions;
- Generation of random numbers with R;
- Making plots with base R graphics and the ggplot2 package;
- Data Manipulation with R;
- Basics of parallel computing with R;
- Introduction to the Rcpp package;
- Building an interactive interface with shiny;
- Building R packages.
Python language:
- Principles of programming and introduction to the language;
- Data structures and data types;
- Object-oriented programming and flow control;
- Efficient handling of numeric data (NumPy);
- Advanced data structures;
- Pandas and Matplotlib;
- Introduction to data analysis;
- Performance metrics, classification and clustering algorithms;
- Scikit-learn.
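As an illustration of the "Data structures and data types" and flow-control topics listed above, the following is a minimal sketch that uses only Python built-ins and the standard library. The records and all variable names are hypothetical and invented for this example; they are not taken from the course material.

```python
from collections import Counter, defaultdict

# Hypothetical (student, unit, score) records, invented for illustration.
records = [
    ("ada", "coding", 28),
    ("ada", "data_management", 30),
    ("bob", "coding", 24),
    ("bob", "data_management", 26),
    ("eve", "coding", 30),
]

# Dictionary of lists: group the scores by student with a for loop.
scores_by_student = defaultdict(list)
for student, unit, score in records:
    scores_by_student[student].append(score)

# Dictionary comprehension with a condition: average only for students
# who have a score in both units.
averages = {
    student: sum(scores) / len(scores)
    for student, scores in scores_by_student.items()
    if len(scores) == 2
}

# Set comprehension: the distinct units that appear in the records.
units = {unit for _, unit, _ in records}

# Counter: number of records per unit.
records_per_unit = Counter(unit for _, unit, _ in records)

print(averages)
print(sorted(units))
print(records_per_unit)
```

The same grouping could be written with a plain dict and an explicit membership check; `defaultdict` is used here only to keep the loop short.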
Teaching methods
Lectures are based on frontal teaching with the support of slides and handouts that are progressively published on the reference course website (Ariel platform). Throughout the lectures, classroom exercises are proposed and discussed to analyze the pros and cons of different possible solutions.
Lecture attendance is not mandatory, but it is strongly recommended.
Teaching Resources
Python:
- Wes McKinney, Python for Data Analysis, 2nd Edition, O'Reilly Media, 2017.
Online resources and handouts provided throughout the lectures.
Coding for Data Science and Data Management-Module Data Management
Course syllabus
The syllabus of the unit is about the following topics.
- Introduction to relational databases;
- Database and database systems (DBMS);
- The relational model;
- Data definition languages for databases;
- Data manipulation languages for databases;
- Queries with the SQL language;
- Simple queries and group queries with aggregate operators;
- Queries with set operators;
- Nested queries.
- Introduction to NoSQL databases;
- Data models for NoSQL;
- Types of NoSQL;
- Comparison against the relational model;
- The "document-oriented" data model;
- The MongoDB system;
- Collections in MongoDB;
- Collection queries in MongoDB;
- Aggregation pipeline in MongoDB.
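The relational topics listed above (data definition, data manipulation, group queries with aggregate operators, and nested queries) can be sketched with Python's standard `sqlite3` module and an in-memory database. The `department` and `employee` tables, their columns, and the sample rows are assumptions made for this example and do not come from the course material. SQLite is used only because it ships with Python; the course may rely on a different DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Data definition (CREATE TABLE) and data manipulation (INSERT).
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL NOT NULL,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id)
);
INSERT INTO department VALUES (1, 'Statistics'), (2, 'Informatics');
INSERT INTO employee VALUES
    (1, 'Anna', 3000, 1),
    (2, 'Luca', 2000, 1),
    (3, 'Marta', 4000, 2);
""")

# Group query with an aggregate operator: average salary per department.
avg_by_dept = conn.execute("""
    SELECT d.name, AVG(e.salary)
    FROM employee AS e JOIN department AS d ON e.dept_id = d.dept_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

# Nested query: employees earning more than the overall average salary.
above_avg = conn.execute("""
    SELECT name FROM employee
    WHERE salary > (SELECT AVG(salary) FROM employee)
""").fetchall()

# Referential integrity: a row pointing to a missing department is rejected.
try:
    conn.execute("INSERT INTO employee VALUES (4, 'Paolo', 2500, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(avg_by_dept, above_avg, fk_rejected)
```

The last step shows the foreign key constraint at work: the insert fails because no department with `dept_id` 99 exists.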
Teaching methods
Lectures are based on frontal teaching with the support of slides and handouts that are progressively published on the reference course website (Ariel platform). Throughout the lectures, classroom exercises are proposed and discussed to analyze the pros and cons of different possible solutions.
Lecture attendance is not mandatory, but it is strongly recommended.
Teaching Resources
P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone, Database Systems: Concepts, Languages and Architectures, McGraw-Hill, available online at http://dbbook.dia.uniroma3.it/. Chapters: 1, 2, 4.
Online resources and handouts provided throughout the lectures.
Coding for Data Science and Data Management-Module Coding for Data Science
SECS-S/01 - STATISTICS - University credits: 6
Lessons: 40 hours
Professors:
Cappozzo Andrea, Montanelli Stefano
Coding for Data Science and Data Management-Module Data Management
INF/01 - INFORMATICS - University credits: 6
Lessons: 40 hours
Professor:
Montanelli Stefano
Professor(s)
Reception:
By appointment via email
Room 7015, Dipartimento di Informatica "Giovanni degli Antoni", Via Celoria 18 - 20133 Milano