Scientific Programming

A.Y. 2024/2025
Max ECTS: 6
Overall hours: 60
SSD: ING-INF/05
Language: English
Learning objectives
Programming skills are essential for bioinformatics and computational genomics, including the implementation of scripts for automation of recurrent data processing procedures and development of novel tools and reusable software packages.
The objective of the course is to make students proficient in writing software by adopting computational approaches, concepts and methods. Because Python and R are currently the most widely used programming languages in data science, and in genomic data analysis in particular, the course will present the scientific programming features of these two languages. Students will have the opportunity to familiarize themselves not only with the syntax and execution flow of the languages, but also with some of the software packages and libraries commonly used in bioinformatics. The creation of one's own software packages and libraries will also be discussed.
Expected learning outcomes
Specifically, the goal of the R part of the course is to teach the correct and efficient use of this software environment for statistical computing and flexible visualization of data. The R part will illustrate the enormous performance loss that can result from inefficient coding and introduce both basic and advanced data structures and processing methods commonly used in bioinformatics, focusing on Bioconductor packages. Furthermore, R will be explored as a powerful tool for data visualization; unit testing and version control will also be discussed.
The objective of the Python part of the course is to equip students with skills in a) coherent and efficient dataset manipulation by means of the Pandas and NumPy libraries; b) concepts of concurrent programming for efficient execution; c) data visualization, by means of classical curve and scatter plots as well as more sophisticated heatmaps, clustermaps, histograms and boxplots; d) network programming, in particular how to access and deploy REST services.
Exercises will help students gain hands-on experience with the theoretical concepts discussed in the lectures and with the handling of biological or biomedical data.
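As a rough illustration of objective d) above, the following minimal sketch shows what deploying a small REST-style service can look like. It is not course material: to stay dependency-free it uses Python's standard-library `http.server` instead of a web framework, and the `/revcomp/<sequence>` resource, the function names and the port are assumptions made up for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_response(path):
    """Map a request path to (HTTP status code, JSON-serializable body).

    Only the hypothetical resource /revcomp/<sequence> is supported.
    """
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "revcomp" and parts[1]:
        seq = parts[1]
        if set(seq.upper()) <= set("ACGT"):
            return 200, {"sequence": seq,
                         "reverse_complement": reverse_complement(seq)}
        return 400, {"error": "sequence may contain only A, C, G, T"}
    return 404, {"error": "unknown resource"}


class Handler(BaseHTTPRequestHandler):
    """Answer GET requests with a JSON document."""

    def do_GET(self):
        status, body = build_response(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(port=8000):
    """Serve on localhost until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()


# The routing logic can be checked without starting a server:
print(build_response("/revcomp/ATGC"))
```

Calling `run()` starts the server; a client could then request `http://127.0.0.1:8000/revcomp/ATGC`, for example with `urllib.request.urlopen`. Keeping the routing in a plain function (`build_response`) separate from the HTTP handler is one possible design choice that makes the logic easy to unit test.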
Single course

This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.

Course syllabus and organization

Single session

Lesson period
Second semester
Course syllabus
Seminar lectures and practical exercises on the following topics:

Part A: R programming
1. Course introduction: Motivation, course information, introduction.
2. Introduction to R, CRAN and Bioconductor: review of the basic syntax and execution flow (blocks, conditional statements, loops); basic data structures (vectors, factors, matrices, data frames, lists); functions and scripts; data import/export.
3. Data processing in R: advanced use of data structures, vectorized operations and efficient coding in R (e.g., apply versus for-loops; differences in syntax and performance).
4. Important data types and packages for bioinformatics in R: GRanges for genomic locations, DNAString and RNAString, SummarizedExperiment, annotation packages (e.g., GenomicFeatures).
5. Class systems in R: S3, S4 and Reference classes
6. Data visualization in R: simple plots, boxplots, heatmaps and more; basic introduction to the powerful and flexible ggplot2 framework, its syntax and use.
7. Creating R/Bioconductor packages: basic package structure; requirements; building and verifying packages; Bioconductor submission process.
8. Unit testing in R: the testthat framework for unit testing in R.
9. Specific use cases in R: e.g., differential expression analysis (DESeq2), pathway analysis (GSEA).

Part B: Python programming
1. Python recap: main Python concepts, i.e. control flow statements, variables, data structures, classes, exception handling, file management.
2. Pandas and NumPy libraries: efficient matrix operations with NumPy; concept of Pandas DataFrame and some hints on the internals; overview of the main functionalities provided by a DataFrame (import from and export to files, relational operators, data retrieval, data manipulation).
3. Visualization libraries: overview of the two main libraries for data visualization in Python, matplotlib and Seaborn; basic plots: curves, scatter plots; sophisticated plots: heatmaps, clustermaps, plots of distributions; good practices for producing plots: correct use of axis scales, legends, titles, etc.
4. Concurrent programming: the need for parallelism; theoretical benefits of parallelism; the Python library "multiprocessing" to spawn and join processes; the Python library "threading" (multithreading) and its limitations.
5. Network programming: theory and implementation of client-server architectures; implementation of a RESTful web service using the Python module "flask".
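
To give a flavour of topic 4, here is a minimal sketch of spawning and joining processes with the standard-library "multiprocessing" module. It is only an illustration, not the course's own material: the GC-content task, the toy sequences and all function names are assumptions chosen for this example.

```python
import multiprocessing as mp


def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def worker(name, seq, results):
    """Run in a child process: compute GC content and report it back."""
    results.put((name, gc_content(seq)))


def main():
    # Hypothetical toy sequences, made up for illustration.
    sequences = {"seq1": "ATGCGC", "seq2": "ATATAT", "seq3": "GGCC"}
    results = mp.Queue()

    # Spawn one process per sequence.
    processes = [
        mp.Process(target=worker, args=(name, seq, results))
        for name, seq in sequences.items()
    ]
    for p in processes:
        p.start()

    # Collect one result per process before joining, so that no child
    # blocks while its queued data is still unread. The timeout makes a
    # failed child show up as an error instead of a hang.
    collected = dict(results.get(timeout=30) for _ in processes)

    # Join: wait for every child process to terminate.
    for p in processes:
        p.join()

    for name in sequences:
        print(name, round(collected[name], 2))


if __name__ == "__main__":
    main()
```

The `if __name__ == "__main__":` guard matters here: on platforms where child processes are started by re-importing the main module, it prevents each child from spawning further processes. For many small tasks, a process pool is usually a better fit than one process per task.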

Generic topics (independent of the programming language used):
1. Version control with Git/GitHub
2. Best practices for computational biology
Prerequisites for admission
Basic knowledge of programming, preferably in Python and/or R. Basic knowledge of molecular biology.
Teaching methods
Class lectures and practical sessions in a computer lab or on the students' own laptop computers.
Teaching Resources
Recommended textbooks (not required to pass the exam):
· W. McKinney. Python for data analysis. 2013. O'Reilly. http://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
· Z.A. Shaw. Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code. (3rd Edition) 2014. (Zed Shaw's Hard Way Series).
· O. Jones et al. Introduction to Scientific Programming and Simulation Using R. (2nd Edition) 2014. Chapman and Hall/CRC. ISBN 978-1-466-56999-7
· C. Ortutay, Z. Ortutay. Molecular Data Analysis Using R. 2017. Wiley Blackwell. ISBN: 978-1-119-16504-0
Assessment methods and Criteria
The assessment will be based on

1. a written exam to be taken in the exam sessions defined by the school and covering all aspects in the syllabus. The exam will assign up to 60 points (+ 6 possible bonus points).
2. two project assignments, one to be implemented in R and one to be implemented in Python.

The final grade will be a weighted average based on the grade of the exam (70%) and of each of the project assignments (15% each).
The following overview details the elements that will be considered.

Written test (70%):
Practical programming exercises in R and Python (Dublin descriptor: DD1, DD2, DD3):
- Writing and interpreting command statements
- Working with basic data structures as well as advanced data structures and packages/software libraries for genomics
- Practical notions of concurrent programming (Python)
- Practical notions of network programming (Python)
- Practical notions of R packages (including unit testing)
- Practical notions of R classes
Descriptive exercises focusing on conceptual aspects (Dublin descriptor: DD1, DD2, DD3, DD4, DD5):
- Basic concepts of R: e.g., attaching vs. loading packages
- Theoretical notions/concepts of concurrent programming
- Theoretical notions/concepts of network programming
- Theoretical notions/concepts of R packages (including unit testing)
- Theoretical notions/concepts of R classes

Software projects (2 x 15%) (Dublin descriptor: DD1, DD2, DD3, DD4, DD5):
From a list of approx. 10 selected projects (the list may change from year to year to adapt to new interesting topics), each student will have to select one to be implemented in Python and a second to be implemented in R.
Example assignment for R:
- Develop a Bioconductor-compliant R package that provides functions for
Example assignment for Python:
- Develop a stand-alone Python script that
Evaluation criteria:
- To what degree are the requirements satisfied?
- Does the software/package offer reasonable and user-friendly functionality? (E.g., use of correct, interoperable data structures?)
- Was the workload reasonable? (E.g., documentation, unit tests?)
- Is the software/package or program well structured?
- What's the code quality? (E.g., efficient data processing?)
- Are there any errors when running the software and/or compiling the package?
IMPORTANT NOTE: plagiarism detection software will be applied!
ING-INF/05 - INFORMATION PROCESSING SYSTEMS - University credits: 6
Practicals: 24 hours
Lectures: 36 hours
Professor: Piro Rosario Michael