Methods in Bioinformatics
A.Y. 2024/2025
Learning objectives
High-throughput experimental assays generate large amounts of data that must be handled and processed appropriately in order to extract meaningful biological knowledge. Bioinformatics provides methods and tools to perform complex analyses of large-scale ("big") biological data, prompting novel testable hypotheses and allowing their verification. Proficiency in data handling and processing, and the ability to uncover and highlight complex relationships in biological data using adequate tools and methods, are crucial skills for the modern biotechnology researcher.
The aims of this course are (i) to introduce the basic principles of procedural and object-oriented programming, (ii) to present the R programming language and software environment as an effective instrument for the analysis of large-scale biological data, and (iii) to provide a primer on methods for the analysis of gene expression (RNA-Seq) data and their statistical foundations.
The course ideally complements those dealing with genomics and bioinformatics.
Expected learning outcomes
After completing this course, students are expected to:
(1) Understand the basic principles of programming and be able to map those concepts to the specific features of the R programming language.
(2) Know the syntax of the R programming language and its basic data types, data structures, and functions.
(3) Become proficient in splitting simple data analysis procedures into elementary logical steps and translating them into R functions and scripts.
(4) Know how to import data into the R environment.
(5) Be able to represent data and their relationships using basic R plotting functions.
(6) Know how to manage R software packages and libraries.
(7) Produce effective reports of an analysis workflow by integrating text, R code, and plots.
(8) Perform and interpret preliminary RNA-Seq data analysis: normalization, Principal Component Analysis (PCA), and quality control.
(9) Know how to execute differential expression analysis.
(10) Be able to perform post-processing and functional enrichment analysis of differentially expressed genes.
Lesson period: First semester
Assessment methods: Exam
Assessment result: grade recorded on a scale of thirty
Single course
This course can be attended as a single course.
Course syllabus and organization
Single session
Course syllabus
The first part of the course introduces students to programming principles for data analysis, using the R programming language as the practice ground for their understanding and application. In particular, students will familiarise themselves with the following concepts:
- Data types and variables.
- Basic data structures: vectors, factors, matrices, arrays, lists.
- Essential standard functions of R.
- Control of the execution flow: blocks, conditional statements, loops.
- Environments, custom functions, and scripts.
- I/O operations: data import and export.
- Graphical representation of biological data: scatterplots, bar plots, histograms, heatmaps, boxplots, and Venn diagrams.
- Software packages, libraries, and repositories.
This first part of the course will be followed by an introduction to the analysis of Next Generation Sequencing (NGS) data using R, with insights into the theoretical and practical principles underlying state-of-the-art methods for processing RNA-Seq assays to assess differential gene expression. In particular:
- Basics of NGS data analysis.
- Primer on dimensionality reduction techniques and descriptive statistics.
- Normalization, PCA and quality control of RNA-Seq data.
- Introduction to statistical tests for the comparison of gene expression levels.
- Differential gene expression analysis.
- Post-processing and functional enrichment analyses.
Classes will consist of intuitive descriptions of programming principles, bioinformatic methods, and their underlying statistics, combined with practicals in which students apply the newly introduced concepts to data analysis use cases. Prof. Zambelli will cover the first part of the course (3 CFUs), introducing R programming; the second part (3 CFUs), delivered by Prof. Chiara, will follow on directly and focus on NGS data analysis.
Prerequisites for admission
Knowledge of basic molecular biology topics, with particular reference to transcription, gene expression regulation, and nucleic acid sequencing, is highly recommended for attending the course.
Teaching methods
Teaching mode: classroom lectures supported by practicals on real or realistic datasets. Teachers will assign exercises at the end of most lessons to help students consolidate concepts between classes. Attendance is highly recommended.
Teaching Resources
Venables W N, Smith D M and the R Core Team. An Introduction to R.
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
Chen Y, McCarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential expression analysis of digital gene expression data.
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Copies of the slides projected during classes, as well as additional materials and datasets, will be made available through the course website on the myARIEL platform of the University of Milano (https://myariel.unimi.it/course/view.php?id=1214). This material is intended to support the lectures, and studying it is not a full substitute for regular class attendance. The material is made available only to registered students of the Degree Course in Molecular Biotechnology and Bioinformatics and should not be distributed to others without the express consent of the teachers.
Assessment methods and Criteria
Notions and skills acquired in this course will be evaluated through an oral exam. To qualify for an exam session, students must complete a small project consisting of the analysis of gene expression data from real experiments, and submit a report describing their results to the teachers. The report must be submitted at least 48 hours before the selected exam session. Projects will be undertaken in small groups (1-3 students per group).
The exam will consist of a brief individual discussion (approx. 15 minutes) of the project report and of the theoretical topics covered in class. The grade will result from the joint evaluation of each candidate by the two teachers, weighted as follows:
- Knowledge of the R programming language: 25%
- Theoretical principles of gene expression analysis: 25%
- Project report and its discussion: 50%
INF/01 - INFORMATICS - University credits: 6
Lectures: 48 hours
Professors:
Chiara Matteo, Zambelli Federico
Professor(s)
Reception:
Friday 15.00-16.00 by appointment
Beacon Lab, 2nd floor, B Tower, Dept. of Biosciences / MS Teams