Scientific Programming
A.Y. 2023/2024
Learning objectives
The objective of the course is to make students proficient in writing programs and scripts in the programming languages most widely used in modern genomic research: R and Python.
Expected learning outcomes
At the end of this class, students are expected to be able to design and write advanced programs in the Python and R programming languages, applying them to case studies derived from the analysis of genomic data.
Lesson period: Second semester
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Lesson period
Second semester
Course syllabus
Seminar lectures and practical exercises on the following topics:
Part A: R programming
1. Course introduction: Motivation, course information, introduction.
2. Introduction to R, CRAN and Bioconductor: review of the basic syntax and execution flow (blocks, conditional statements, loops); basic data structures (vectors, factors, matrices, data frames, lists), functions and scripts, data import/export.
3. Data processing in R: advanced use of data structures, vectorized operations and efficient coding in R (e.g., apply versus for-loops; differences in syntax and performance).
4. Important data types and packages for bioinformatics in R: GRanges for genomic locations, DNAString and RNAString, SummarizedExperiment, annotation packages (e.g., GenomicFeatures).
5. Class systems in R: S3, S4 and Reference classes
6. Data visualization in R: simple plots, boxplots, heatmaps and more; basic introduction to the powerful and flexible ggplot2 framework, its syntax and use.
7. Creating R/Bioconductor packages: basic package structure; requirements; building and verifying packages; Bioconductor submission process.
8. Unit testing in R: the testthat framework for unit testing in R.
9. Specific use cases in R: e.g., differential expression analysis (DESeq2), pathway analysis (GSEA).
Part B: Python programming
1. Python recap: main Python concepts; control flow statements, variables, data structures, classes, exception handling, file management.
2. Pandas and NumPy libraries: efficient matrix operations with NumPy; concept of Pandas DataFrame and some hints on the internals; overview of the main functionalities provided by a DataFrame (import from and export to files, relational operators, data retrieval, data manipulation).
3. Visualization libraries: overview of the two main libraries for data visualization in Python, matplotlib and Seaborn; simple plots: curves, scatter plots; more sophisticated plots: heatmaps, clustermaps, distribution plots; good practices for producing plots: correct use of axis scales, legends, titles, etc.
4. Concurrent programming: the need for parallelism; theoretical benefits of parallelism; the Python library "multiprocessing" to spawn and join processes; multithreading with the Python library "threading" and its limitations.
5. Network programming: theory and implementation of client-server architectures; implementation of a RESTful web service using the Python module "flask".
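As a rough illustration of the kind of exercise topic 4 points to, the following minimal sketch distributes a per-sequence computation over worker processes with the standard-library "multiprocessing" module. The function names (gc_content, gc_content_parallel) and the toy sequences are hypothetical and not taken from the course material; Pool is used here for brevity and takes care of spawning and joining the worker processes.

```python
from multiprocessing import Pool


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_content_parallel(sequences, processes=2):
    """Compute the GC content of each sequence using a pool of worker processes."""
    # Pool.map distributes the inputs across workers and preserves input order.
    with Pool(processes=processes) as pool:
        return pool.map(gc_content, sequences)


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    toy_sequences = ["ATGC", "GGCC", "ATAT", ""]
    print(gc_content_parallel(toy_sequences))  # prints [0.5, 1.0, 0.0, 0.0]
```

The worker function is defined at module level so that it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps the pool from being re-created when the module is imported by a child process.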
Generic topics (independent of the programming language used):
1. Version control with Git/GitHub
2. Best practices for computational biology
Prerequisites for admission
Basic knowledge of programming, preferably in Python and/or R. Basic knowledge of molecular biology.
Teaching methods
Class lectures and practical sessions in a computer lab or using the students' own laptops.
Teaching Resources
Recommended textbooks (not required to pass the exam):
· W. McKinney. Python for data analysis. 2013. O'Reilly. http://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
· Z.A. Shaw. Learn Python the hard way: A very simple introduction to the terrifyingly beautiful world of computers and code. (3rd Edition) 2014. (Zed Shaw's Hard Way Series).
· O. Jones et al. Introduction to Scientific Programming and Simulation Using R. (2nd Edition) 2014. Chapman and Hall/CRC. ISBN 978-1-466-56999-7
· C. Ortutay, Z. Ortutay. Molecular Data Analysis Using R. 2017. Wiley Blackwell. ISBN: 978-1-119-16504-0
Assessment methods and Criteria
The assessment will be based on:
1. a written exam to be taken in the exam sessions defined by the school and covering all aspects in the syllabus. The exam will assign up to 60 points (+ 6 possible bonus points).
2. two project assignments, one to be implemented in R and one to be implemented in Python.
The final grade will be a weighted average based on the grade of the exam (70%) and of each of the project assignments (15% each).
The following table provides a detailed overview of the elements that will be considered.
Written test (70%):
Practical programming exercises in R and Python (Dublin descriptor: DD1, DD2, DD3):
- Writing and interpreting command statements
- Working with basic data structures as well as advanced data structures and packages/software libraries for genomics
- Practical notions of concurrent programming (Python)
- Practical notions of network programming (Python)
- Practical notions of R packages (including unit testing)
- Practical notions of R classes
Descriptive exercises focusing on conceptual aspects (Dublin descriptor: DD1, DD2, DD3, DD4, DD5):
- Basic concepts of R: e.g., attaching vs. loading packages
- Theoretical notions/concepts of concurrent programming
- Theoretical notions/concepts of network programming
- Theoretical notions/concepts of R packages (including unit testing)
- Theoretical notions/concepts of R classes
Software projects (2 x 15%), (Dublin descriptor: DD1, DD2, DD3, DD4, DD5):
From a list of approx. 10 selected projects (the list may change from year to year to adapt to new interesting topics), each student will have to select one to be implemented in Python and a second to be implemented in R.
Example assignment for R:
- Develop a Bioconductor-compliant R package that provides functions for
Example assignment for Python:
- Develop a stand-alone Python script that
Evaluation criteria:
- To what degree are the requirements satisfied?
- Does the software/package offer reasonable and user-friendly functionality? (E.g., use of correct, interoperable data structures?)
- Was the workload reasonable? (E.g., documentation, unit tests?)
- Is the software/package or program well structured?
- What's the code quality? (E.g., efficient data processing?)
- Are there any errors when running the software and/or compiling the package?
IMPORTANT NOTE: plagiarism detection software will be applied!
ING-INF/05 - INFORMATION PROCESSING SYSTEMS - University credits: 6
Practicals: 24 hours
Lectures: 36 hours
Professor:
Piro Rosario Michael