Biostatistics

A.Y. 2024/2025
Max ECTS: 6
Overall hours: 48
SSD: BIO/11 BIO/18
Language: English
Learning objectives
Modern high-throughput assays generate large amounts of data that must be handled and processed appropriately to extract meaningful biological knowledge and generate testable hypotheses. Proficiency in data handling and processing, and the ability to unravel and highlight complex relationships in biological data using adequate tools and methods, are crucial skills for the modern molecular biologist. Applying methods for the analysis, interpretation and integration of such complex large-scale ("big") biological data, and verifying the final results, requires a good background in statistics and bioinformatics.
The aims of this course are (i) in the Biostatistics segment, to familiarize students with statistical theory and terminology, so that they understand the power and pitfalls of statistical analysis, with special emphasis on the planning of experiments for the analysis of large-scale biological data; and (ii) in the molecular segment, to provide a primer on methods for the analysis of gene expression (RNA-Seq) data and the interpretation of the final results. Both segments will be carried out within the R programming language and software environment, seen as an effective tool for large data analysis.
Expected learning outcomes
After following this course, the students are expected to:

1. Know the syntax of the R programming language and how to import data into the R environment.
2. Correctly analyse experimental data in the field of Life Sciences.
3. Interpret experimental data.
4. Perform basic statistical tests.
5. Correctly analyse, interpret and visualize the results of differential gene expression analyses based on RNA sequencing data.
Single course

This course can be attended as a single course.

Course syllabus and organization

Single session

Lesson period: Second semester
Course syllabus
First, lectures will introduce students to the principles, concepts and statistical methods commonly used for the analysis and interpretation of large-scale biological data.
This will include:
Basics of statistical analysis - 1 credit (8 hours)

Why use Statistics. Populations and samples. Basics of probability. Random variables.
Frequency distributions; normal and Poisson distributions. The idea behind a statistical test: power and protection of a test, Type I and Type II errors. False Discovery Rate (FDR).

The most common statistical tests - 1 credit (8 hours)

Quantitative and qualitative variables - which test?
Some uses of the z variable.
The General Linear Model (GLM)
Some uses of Student's t.

Other statistical tests - 1 credit (8 hours)

The model of Analysis of Variance (ANOVA).
Linear regression models; parameter estimation in linear, multiple and curvilinear regression.
Basics of multivariate analysis, Principal Component Analysis.
Use of statistical software. Examples in R.


This first part of the course will be followed by an introduction to the analysis of Next Generation Sequencing (NGS) data using R, with insights into the theoretical and practical principles underlying state-of-the-art methods for processing RNA-Seq assays to assess differential gene expression. In particular:
Introduction to the R environment for the analysis of biological data - 1 credit (8 hours)

How to import data into R
Basic data structures, data.frames, vectors, matrices
Installation and management of software packages
Introduction to the R graphical environment
Use of statistical software. Examples in R.

Differential gene expression analysis in R - 1 credit (8 hours)

Quality metrics and quality control
Differential gene expression analyses in R
Multiple testing correction and FDR (false discovery rate)

Visualization and interpretation of the results - 1 credit (8 hours)

Functional enrichment analysis of gene lists
Visualization of the data: heatmaps, scatterplots, boxplots
R Markdown for the generation of analysis reports

Classes will consist of intuitive descriptions of programming principles, bioinformatics methods, and their underlying statistics, combined with practicals. Students will apply the newly introduced concepts to data analysis use cases.
Prerequisites for admission
Knowledge of basic molecular biology topics, with particular reference to transcription, gene expression regulation, and nucleic acid sequencing is highly recommended for attending the course.
Teaching methods
Teaching mode: classroom lectures supported by practicals on real or realistic datasets. Teachers will assign exercises at the end of most lessons to help students consolidate concepts between classes. Attendance is highly recommended.
Teaching Resources
W. N. Venables, D. M. Smith and the R Core Team. An introduction to R.
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf

Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential expression analysis of digital gene expression data.
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Copies of the slides projected during the classes, as well as additional materials and datasets will be made available through the course website on the ARIEL platform of the University of Milano. This material is intended as a support for lectures, and its study cannot be considered as a full alternative to constant attendance of classes. The material is made available only to registered students of the Degree Course in Molecular Biology of the Cell and should not be distributed to others without express consent of the teachers.
Assessment methods and Criteria
Notions and skills acquired in this course will be evaluated through a written exam. To qualify for an exam session, students will be required to complete a small project consisting of the analysis of gene expression data from real experiments. Projects will be undertaken in small groups (2-3 students per group). The students will produce a report describing their results and submit it to the teachers at least 48 hours before the selected exam session. During the exam, for the statistics part, the students will be given a numerical exercise aimed at verifying their knowledge of the logic and methodological tools required for a correct evaluation of experimental data. The allotted time for the exam is 1 hour, and the exam is passed with a mark of 18/30 or higher. During the exam, the use of personal computers and/or pocket calculators and the consultation of one's own notes are allowed.

The grade will result from the joint evaluation of each candidate by the two teachers weighted as follows:
Project - 50%
Written exam - 50%
BIO/11 - MOLECULAR BIOLOGY - University credits: 3
BIO/18 - GENETICS - University credits: 3
Lessons: 48 hours
Professor: Chiara Matteo