Biostatistics
A.Y. 2024/2025
Learning objectives
Modern high-throughput assays generate large amounts of data that must be handled and processed appropriately to extract meaningful biological knowledge and generate testable hypotheses. Proficiency in data handling and processing, and the ability to uncover and highlight complex relationships in biological data using appropriate tools and methods, are crucial skills for the modern molecular biologist. Applying methods for the analysis, interpretation and integration of such complex, large-scale ("big") biological data, and verifying their final results, requires a good background in statistics and bioinformatics.
The aims of this course are (i) in the Biostatistics segment, to familiarise students with statistical theory and terminology, so that they understand the power and pitfalls of statistical analysis, with special emphasis on planning experiments for the analysis of large-scale biological data; and (ii) in the molecular segment, to provide a primer on methods for the analysis of gene expression (RNA-Seq) data and the interpretation of the final results. Both segments will be carried out within the R programming language and software environment, an effective tool for large-scale data analysis.
Expected learning outcomes
After following this course, the students are expected to:
1. Know the syntax of the R programming language and how to import data into the R environment.
2. Correctly analyse experimental data in the field of Life Sciences.
3. Interpret experimental data.
4. Perform basic statistical tests.
5. Correctly analyse, interpret and visualize the results of differential gene expression analyses based on RNA sequencing data.
Lesson period: Second semester
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course can be attended as a single course.
Course syllabus and organization
Single session
Course syllabus
The first six lectures of the course will introduce students to the R/RStudio programming environment. These programming skills will be reinforced throughout the course.
This will include:
Introduction to the R environment for the analysis of biological data - 1.5 CFU, BIO/11 (12 hrs)
- Setting up R projects
- Basic data structures: vectors, matrices, data frames
- Importing data into R
- Installation and management of software packages
- Data wrangling using the dplyr package (tidyverse)
- Data visualisation using the ggplot2 package (tidyverse)
- Data simulation using stochastic models
- Introduction to R Markdown
The following 12 classes will introduce students to the principles of statistical inference, statistical modelling and parameter estimation. These ideas will be reinforced using examples from published data. Using R/RStudio, the emphasis will be on creating transparent and reproducible analysis workflows.
Basics of statistical analysis - 1 CFU, BIO/18 (8 hrs)
- Data visualisation and pattern recognition
- Principles of statistical inference
- p-values: measures of evidence against the "null hypothesis"
- Statistical models of biological experiments, ANOVA
- Assessing model assumptions using residual plots
- First analysis workflow using R/RStudio
Exploring mean and variance structure in statistical models - 1 CFU, BIO/18 (8 hrs)
- Factorial designs, ANOVA with multiple factors
- Linear models with covariates
- Linear mixed models
- Complete analysis workflow using R/RStudio
Experimental design, Generalised Linear Models and topics relevant to high-dimensional data - 1 CFU, BIO/18 (8 hrs)
- Principles of experimental design: randomisation, replication, blocking
- Generalised linear models: negative binomial model
- Principal Component Analysis
- Multiple hypothesis testing, p-value adjustments, False Discovery Rate (FDR)
The final part of the course will be an introduction to the analysis of Next Generation Sequencing (NGS) data using R, with insights on the theoretical and practical principles underlying state-of-the-art methods for processing RNA-Seq assays to assess differential gene expression. In particular:
Differential gene expression analysis in R - 1 CFU, BIO/11 (8 hrs)
- Quality metrics and quality control
- Differential gene expression analyses in R
- Multiple testing correction and FDR (false discovery rate)
Visualization and interpretation of the results - 0.5 CFU, BIO/11 (4 hrs)
- Visualization of the data: heatmaps, scatterplots, boxplots
Classes will consist of intuitive descriptions of programming principles, bioinformatics methods and their underlying statistics, combined with practicals in which students apply the newly introduced concepts to data analysis use cases.
Prof. Neeman will introduce statistical modelling, principles of statistical inference and learning from data using R/RStudio (3 CFUs). Prof. Chiara will lead the R language training component and, following the statistical modelling component, will introduce methods for analysing and interpreting complex, large-scale ("big") biological data.
Prerequisites for admission
Basic knowledge of molecular biology topics:
- Structure and biochemical properties of nucleic acids;
- Nucleic acid sequencing methods;
- Mechanisms of gene expression regulation;
- Structure of the eukaryotic gene.
Basic IT knowledge:
- File and folder management.
Teaching methods
Teaching method: Lectures accompanied by practical exercises with real data. Instructors will assign exercises at the end of most lessons to help reinforce concepts between classes. Attendance is highly recommended.
Teaching Resources
W. N. Venables, D. M. Smith and the R Core Team. An introduction to R.
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
Wickham, H., & Grolemund, G. (2017). R for Data Science. O'Reilly Media.
https://r4ds.hadley.nz
Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential expression analysis of digital gene expression data. https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Law CW, Alhamdoosh M, Su S, Dong X, Tian L, Smyth GK, Ritchie ME. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. F1000Res. 2016 Jun 17;5:ISCB Comm J-1408. doi: 10.12688/f1000research.9005.3. PMID: 27441086; PMCID: PMC4937821.
https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
Glimma: https://bioconductor.org/packages/release/bioc/html/Glimma.html
Copies of the slides shown in class, as well as additional materials and datasets, will be made available through the course website on the ARIEL platform of the University of Milan. This material is intended to support the lectures, and studying it is not a full substitute for regular class attendance. The material is made available only to registered students of the Degree Course in Molecular Biology of the Cell and should not be distributed to others without the express consent of the teachers.
Assessment methods and Criteria
Students will be required to complete a transparent data analysis workflow using R Markdown, analysing biological data from real experiments. They will produce a report describing their results and submit it to the teachers at least 48 hours before the selected exam session. Projects will be undertaken in small groups (2-3 students per group).
The grade will result from the joint evaluation of each candidate by the two teachers, weighted as follows:
Project and oral presentation - 100%
BIO/11 - MOLECULAR BIOLOGY - University credits: 3
BIO/18 - GENETICS - University credits: 3
Lessons: 48 hours
Professors:
Chiara Matteo, Neeman Teresa M