Coding for data science and data management
Academic year 2019/2020
Learning objectives
The course aims to provide technical skills in coding and scripting for data analysis, and in managing the persistent storage of the sources and results involved in the analysis. On the one hand, the Python programming language and the R framework are presented, with the goal of covering the essential notions of data structures and control structures in both. On the other hand, the course presents the core notions of relational databases, such as keys, integrity, and primary/foreign key constraints, as well as the SQL language for data definition, manipulation, and querying. Recent NoSQL solutions are also discussed, with a special focus on MongoDB, a document-oriented system.
Expected learning outcomes
Upon completion of the course, students will be able to:
- manage data using R and RStudio;
- solve coding challenges using R libraries and functions;
- perform statistical inference and produce graphics using R;
- write code that uses the apply family of functions in R;
- understand the Python data model and the flow control statements;
- use the built-in Python data structures;
- perform basic linear algebra operations using Numpy;
- perform basic data set manipulations using Pandas;
- perform simple machine learning experiments using Scikit-learn;
- understand and apply the core notions of data modeling in relational databases;
- use the SQL language for creating and querying relational database structures;
- understand and apply the principles of data organization in NoSQL systems;
- use MongoDB for data retrieval and aggregation in a document-oriented NoSQL system.
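As a small illustration of the Python-related outcomes above (built-in data structures and flow control statements), the following sketch groups values by category using only lists, dictionaries, and sets. The sample data and the names `records` and `summarize` are hypothetical and are not taken from the course material.

```python
# Hypothetical sample data: (category, value) pairs invented for illustration.
records = [("a", 3), ("b", 5), ("a", 7), ("c", 1), ("b", 2)]


def summarize(pairs):
    """Group values by category and return {category: (count, total)}."""
    groups = {}  # dict mapping each category to the list of its values
    for category, value in pairs:
        groups.setdefault(category, []).append(value)

    summary = {}
    for category, values in groups.items():
        summary[category] = (len(values), sum(values))
    return summary


summary = summarize(records)

# A set comprehension removes duplicate categories; sorted() returns a list.
categories = sorted({category for category, _ in records})

for category in categories:
    count, total = summary[category]
    print(category, count, total)
```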
Period: Second trimester
Assessment method: Exam
Assessment result: grade recorded in thirtieths
Single course
This course cannot be taken as a single course. The available courses are listed in the single courses catalogue.
Syllabus and course organization
Single edition
Course coordinator
Period
Second trimester
Prerequisites
No prerequisites are required.
Assessment methods and evaluation criteria
The assessment consists of a written exam covering the syllabus of both the Coding for Data Science and the Data Management units. The written exam comprises:
- exercises on R scripts;
- quizzes on Python code fragments;
- open-ended questions on topics from data analysis;
- quizzes on i) relational data modeling and ii) NoSQL systems;
- exercises on the SQL language and find/aggregation queries in MongoDB.
The final result is expressed in thirtieths.
Module Coding for Data Science
Syllabus
The unit covers the following topics.
Python language:
- Introduction to the language;
- Data structures;
- Control flow;
- File I/O;
- Numpy;
- Pandas and Matplotlib;
- Scikit-learn.
Data management:
- Introduction to relational databases;
- Database and database systems (DBMS);
- The relational model;
- Data definition languages for databases;
- Data manipulation languages for databases;
- Queries with the SQL language;
- Simple queries and group queries with aggregate operators;
- Queries with set operators;
- Nested queries.
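The data management topics listed above (data definition, data manipulation, group queries with aggregate operators, and nested queries) can be sketched with Python's standard `sqlite3` module. This is only a minimal illustration: the `department` and `employee` tables and their rows are hypothetical, and SQLite is used here for convenience, not because the course prescribes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign key constraints only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Data definition (CREATE TABLE) and data manipulation (INSERT).
# Tables and rows are hypothetical examples.
conn.executescript("""
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    salary        REAL NOT NULL,
    department_id INTEGER NOT NULL REFERENCES department(id)
);
INSERT INTO department (id, name) VALUES (1, 'Research'), (2, 'Sales');
INSERT INTO employee (id, name, salary, department_id) VALUES
    (1, 'Ada', 3000, 1),
    (2, 'Bob', 2000, 1),
    (3, 'Eve', 2500, 2);
""")

# Group query with aggregate operators: head count and average salary per department.
avg_by_dept = conn.execute("""
    SELECT d.name, COUNT(*) AS n, AVG(e.salary) AS avg_salary
    FROM employee AS e JOIN department AS d ON e.department_id = d.id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

# Nested query: employees earning more than the overall average salary.
above_avg = conn.execute("""
    SELECT name FROM employee
    WHERE salary > (SELECT AVG(salary) FROM employee)
    ORDER BY name
""").fetchall()

# Referential integrity: department 99 does not exist, so the insert is rejected.
try:
    conn.execute(
        "INSERT INTO employee (id, name, salary, department_id) "
        "VALUES (4, 'Zed', 1000, 99)"
    )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(avg_by_dept)
print(above_avg)
print(fk_rejected)
```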
Teaching methods
Lectures are delivered as traditional classroom teaching, supported by slides and handouts that are published progressively on the course website (Ariel platform). Throughout the lectures, classroom exercises are proposed and discussed to analyze the pros and cons of the different possible solutions.
Lecture attendance is not mandatory, but it is strongly recommended.
Reference material
Python: online resources and handouts provided by the lecturers throughout the lectures.
Data management: choose one of the following books:
- P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone, Database Systems - Concepts, Languages and Architectures, McGraw-Hill, available online at http://dbbook.dia.uniroma3.it/.
- R. Elmasri, S.B. Navathe, Fundamentals of Database Systems, 7th edition, Pearson, 2015.
Module Data Management
Syllabus
The unit covers the following topics.
R framework:
- Introduction to the R framework and RStudio;
- Basic data types, data structures and operations;
- Control structures;
- Custom functions;
- Time series;
- Data acquisition;
- ggplot2 and plotly: advanced data visualization;
- R Markdown: rendering reports directly from R scripts;
- Rcpp: speeding up the code;
- Shiny: building interactive web apps in R;
- Building R packages.
Data management:
- Introduction to NoSQL databases;
- Data models for NoSQL;
- Types of NoSQL;
- Comparison against the relational model;
- The "document-oriented" data model;
- The MongoDB system;
- Collections in MongoDB;
- Collection queries in MongoDB;
- Aggregation pipeline in MongoDB.
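To give an idea of what an aggregation pipeline in MongoDB looks like, the sketch below writes one as plain Python data, in the form accepted by the `aggregate` method of a pymongo collection. The field names `status`, `customer`, and `amount` are hypothetical, and the helper `run_pipeline` is an illustrative assumption: it needs a collection object from a running MongoDB instance, which is not created here.

```python
# Each stage of the pipeline is one dictionary; stages are applied in order.
# Field names ("status", "customer", "amount") are hypothetical.
pipeline = [
    # Keep only the documents whose status is "shipped".
    {"$match": {"status": "shipped"}},
    # Group the remaining documents by customer and sum their amounts.
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
    # Sort the groups by total, largest first.
    {"$sort": {"total": -1}},
]


def run_pipeline(collection):
    """Run the pipeline on a collection object supplied by the caller.

    `collection` is assumed to expose an `aggregate` method, as a
    pymongo collection does; no database connection is opened here.
    """
    return list(collection.aggregate(pipeline))


# The stage operators, in the order in which they are applied.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

With pymongo installed and a server available, `run_pipeline` would be called on a collection obtained from a `MongoClient`; that step is left out because it depends on the local setup.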
Teaching methods
Lectures are delivered as traditional classroom teaching, supported by slides and handouts that are published progressively on the course website (Ariel platform). Throughout the lectures, classroom exercises are proposed and discussed to analyze the pros and cons of the different possible solutions.
Lecture attendance is not mandatory, but it is strongly recommended.
Reference material
Online resources and handouts provided by the lecturers throughout the lectures.
Modules or teaching units
Module Coding for Data Science
INF/01 - INFORMATICS - University credits (CFU): 6
Lectures: 40 hours
Shifts:
Module Data Management
SECS-S/01 - STATISTICS - University credits (CFU): 6
Lectures: 40 hours
Lecturers:
Guidotti Emanuele, Montanelli Stefano
Shifts:
-
Lecturers:
Guidotti Emanuele, Montanelli Stefano
Lecturer(s)
Office hours:
By appointment, to be arranged by email
Room 7015, Dipartimento di Informatica "Giovanni degli Antoni", Via Celoria 18 - 20133 Milano