Knowledge Extraction and Information Retrieval | Università degli Studi di Milano Statale

A.Y. 2022/2023

Max ECTS: 6

Overall hours: 40

SSD: INF/01

Language: English

Included in the following degree programmes

Data Science and Economics (Class LM-91) - Enrolled until the 2021/2022 academic year

Learning objectives

The course provides a general introduction to information retrieval research, covering both the state of the art and the main research trends in the field. In particular, it addresses document retrieval, document classification, topic discovery and language modeling. Besides an up-to-date review of the literature, the course focuses on the evaluation of information retrieval systems, the use of machine learning techniques on textual data collections, and latent and probabilistic semantic indexing. Finally, the course also introduces the use of NoSQL databases for implementing information retrieval systems.

Expected learning outcomes

By the end of the course, students will be able to: 1) know and understand the main topics, research issues and future trends in the field of information retrieval; 2) apply natural language processing, indexing, clustering and classification techniques to a corpus of texts for a specific information need; 3) judge the quality of different design and implementation choices; 4) design, implement, and evaluate a project focused on document search or document classification; 5) understand the notion of a language model and detect language specificities and topics in a corpus of text documents; 6) use the Python stack of libraries and tools required to develop a text analysis project.

Lesson period: Second trimester

Assessment methods: Exam
Assessment result: mark recorded on a 30-point scale

Single course

This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.

Course syllabus and organization

Single session

Responsible: Ferrara Alfio

Course syllabus

The course provides an in-depth introduction to the main research topics in Deep Learning applied to Natural Language Processing, in particular for Information Retrieval. In addition to the lectures, the course includes a final project through which students acquire the skills needed to design, implement and understand the main neural network models for natural language, using Python and PyTorch.

INTRODUCTION TO INFORMATION RETRIEVAL
Introduction to the course, exam format and schedule. Introduction to Information Retrieval. Preliminary tokenization and frequency-based indexing.
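
As an illustration of what tokenization and frequency-based indexing involve, the following is a minimal sketch and not course material. It assumes a simple lowercase alphanumeric tokenizer; the function names `tokenize` and `build_index` and the toy corpus are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index mapping each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


# Hypothetical toy corpus, for illustration only.
docs = [
    "Information retrieval finds relevant documents.",
    "Documents are indexed by term frequency.",
    "Retrieval models rank documents.",
]
index = build_index(docs)
print(index["documents"])  # postings list for the term "documents"
print(index["retrieval"])  # postings list for the term "retrieval"
```

Each postings list records in which documents a term occurs and how often, which is the information that frequency-based scoring builds on.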

VECTOR SPACE MODEL, EVALUATION AND RELEVANCE FEEDBACK
Scoring, term weighting and the vector space model. Advanced text pre-processing and weighting. Evaluation in information retrieval. Relevance feedback and query expansion. Improving performance through relevance feedback.
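
Term weighting and vector space scoring can be sketched as follows. This is a hedged example, not the weighting scheme used in the lectures: it assumes raw term frequency multiplied by a natural-log inverse document frequency, and ranks documents by cosine similarity. The helper names and the pre-tokenized toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return sparse TF-IDF vectors (dicts) and the IDF table for a corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pre-tokenized corpus and query, for illustration only.
docs = [
    ["neural", "retrieval", "models"],
    ["term", "weighting", "models"],
    ["neural", "language", "models"],
]
vectors, idf = tf_idf_vectors(docs)
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(["neural", "retrieval"]).items()}

# Rank document ids by decreasing similarity to the query.
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
print(ranking)
```

Note that a term occurring in every document, such as "models" here, receives an IDF of zero and therefore does not influence the ranking.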

UNSUPERVISED AND SUPERVISED TEXT CLASSIFICATION AND TOPIC MODELING
Matrix decompositions and Topic Models. Implementing and using LDA. Zero-shot learning of topics. Supervised text classification. Use case on text classification. Introduction to deep learning classification models.

INTRODUCTION TO DEEP LEARNING FOR NLP
Backpropagation and Feedforward Neural Networks. Word and document vectors. Word2Vec.

LANGUAGE MODELS
Recurrent Neural Networks and Language Models. Vanishing Gradients, Seq2Seq learning. LSTM and GRU models. Encoder-Decoder architectures for sequence learning. Autoencoders.

ATTENTION AND TRANSFORMERS
Attention mechanism in the encoder-decoder architecture. Some transformer models: BERT and GPT-2.
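
The syllabus does not state which attention variant is covered. As an assumption, the sketch below shows scaled dot-product attention, the form used in Transformer models, for a single query in plain Python. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights over the keys and the weighted
    sum of the value vectors (the context vector).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context


# Hypothetical toy vectors: the query matches the first key more closely.
weights, context = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
print(weights)
print(context)
```

In an encoder-decoder setting the query would typically come from the decoder state and the keys and values from the encoder states, so the context vector summarizes the input positions most relevant to the current output step.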

APPLICATIONS
Natural Language Generation. Integrating knowledge bases in language models.

Prerequisites for admission

Intermediate knowledge of Python. Basic knowledge of derivatives and understanding of matrix/vector notation and operations. Basics of probability and Gaussian distributions.

Teaching methods

The course is given in the form of lectures with extensive use of examples and support materials such as Python notebooks. Slides and handouts are employed throughout the lectures and they are progressively published on the reference course website on the Ariel platform (https://aferrarair.ariel.ctu.unimi.it).
Lecture attendance is not mandatory, but it is strongly recommended.

Teaching Resources

- Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. (http://nlp.stanford.edu/IR-book/)
- Notes, notebooks and materials provided by the lecturer and published on the website of the course (https://aferrarair.ariel.ctu.unimi.it)

Assessment methods and Criteria

Development of a project. The project topic must be agreed with the lecturer in advance. The project should demonstrate understanding of the lecture topics and the ability to propose and motivate innovative solutions to specific research problems.
The project will be evaluated through a discussion with the lecturer about the project outcomes and the related topics of the course. The evaluation will take into account both the project and the interview.
Registration through the SIFA service is mandatory for taking the exam. After registering for an exam on SIFA, students must contact the lecturer to schedule the discussion.

Course structure

INF/01 - INFORMATICS - University credits: 6

Lessons: 40 hours

Professor: Ferrara Alfio

Educational website(s)

Information retrieval

Professor(s)

Ferrara Alfio

Reception:

By appointment, arranged by first contacting the professor by email. Meetings are held online.

Online. In-person meetings take place at the Department of Computer Science, via Celoria 18, Milano, Room 7012 (7th floor).