Text Mining and Sentiment Analysis
A.Y. 2022/2023
Learning objectives
Understand the state of the art on text mining and sentiment analysis. Design and develop methods for text classification and topic modeling. Design and develop methods for sentiment classification and polarity detection. Understand the differences between sentiment analysis and emotion detection. Design and develop methods for emotion detection in text.
Expected learning outcomes
At the end of the course, students will be able to address a specific problem in the area of text mining and sentiment analysis. In particular, they will know the main notions needed to understand text processing, the foundations of natural language processing, text classification, and topic modeling. Students will also work on sentiment analysis in the context of opinion mining, using both rule-based models and machine learning models for text.
Lesson period: Second trimester
Assessment methods: Exam
Assessment result: grade recorded on a 30-point scale
Single course
This course cannot be attended as a single course. Please check our list of single courses to find the ones available for enrolment.
Course syllabus and organization
Single session
Course syllabus
The course provides a complete overview of the state of the art and of research perspectives in the field of text mining and sentiment analysis, with an introduction to some relevant related problems such as emotion detection and opinion mining.
Introduction (0:30 hours)
Course introduction, logistics, course requirements, and Python installation.
Natural language processing (3:30 hours)
Basic techniques in natural language processing: tokenization (bag-of-words and n-gram models), stopwords and punctuation, stemming and lemmatization, part-of-speech tagging, chunking, regular expressions, and named entity recognition. Public NLP toolkits such as NLTK and spaCy will be introduced to gain hands-on experience in Python.
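As an illustration of the first steps listed above, the following is a minimal sketch of tokenization, stopword removal, bag-of-words counting, and n-gram extraction. It uses only the Python standard library rather than NLTK or spaCy, so that it runs without extra installation. The regular expression, the tiny stopword set, and the sample sentence are assumptions made for the example, not course material.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "is", "and", "of"}

tokens = tokenize("The cat sat on the mat, and the cat slept.")
content = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
bag_of_words = Counter(content)                      # term -> count
bigrams = ngrams(content, 2)                         # 2-gram model

print(bag_of_words.most_common(3))
print(bigrams)
```

A toolkit tokenizer would handle contractions, abbreviations, and Unicode more carefully than this regular expression does.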
Document representation (2 hours)
The Vector Space Model and tf-idf weighting: representing unstructured text documents with appropriate format and structure to support later automated text mining algorithms. PCA as dimensionality reduction technique.
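The tf-idf weighting mentioned above can be sketched in a few lines. This version assumes one common variant (raw term count multiplied by log(N / df)) and pre-tokenized toy documents. Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values will differ from the ones computed here.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for the example.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie", "weak", "plot"],
    ["good", "acting"],
]
vectors = tfidf(docs)
print(vectors[0])
print(cosine(vectors[0], vectors[2]))
```

In the Vector Space Model each document becomes one such sparse vector, and cosine similarity is a typical way to compare two of them.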
Text clustering (3 hours)
Clustering algorithms, i.e., connectivity-based clustering (a.k.a., hierarchical clustering) and centroid-based clustering (e.g., k-means clustering). Evaluation of text clustering: purity and Rand index.
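The two evaluation measures named above can be computed directly from a cluster assignment and a set of gold labels. The sketch below assumes the standard definitions: purity is the fraction of items that belong to the majority class of their cluster, and the Rand index is the fraction of item pairs on which the clustering and the gold labels agree. The toy assignment is invented for the example.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority / len(labels)

def rand_index(clusters, labels):
    """Fraction of item pairs on which clustering and gold labels agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Six documents, two clusters, two gold topics (toy data).
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics", "sport"]

print(purity(clusters, labels))
print(rand_index(clusters, labels))
```

Purity can be pushed to 1.0 trivially by putting every document in its own cluster, which is one reason it is usually reported together with a pair-based measure such as the Rand index.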
Text categorization (5 hours)
Feature selection and text categorization algorithms: Naive Bayes, k Nearest Neighbor (kNN), Logistic Regression, Support Vector Machines and Decision Trees. Evaluation of text classification: precision and recall, confusion matrix, F-score.
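The evaluation measures listed above all derive from the confusion matrix. The following sketch computes them for a binary task. The function name, the label encoding (1 = positive class), and the toy predictions are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return confusion counts, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Toy gold labels and classifier output.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For multi-class text categorization the same counts are typically computed per class and then averaged (macro or micro averaging).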
Topic modeling (4 hours)
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. Two basic topic models will be covered: Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA).
Document summarization (2 hours)
Document summarization is the process of reducing a text document to a summary that retains the most important points of the original document. Extraction-based summarization methods will be covered.
Introduction to sentiment analysis and emotion detection (1 hour)
Definition of the sentiment analysis problem. Differences between sentiment analysis and emotion detection.
Lexicon-based approaches to sentiment analysis (4 hours)
Survey of the main approaches that exploit dictionaries, ontologies, and specialized corpora for detecting the sentiment polarity in texts.
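A dictionary-based polarity detector can be sketched as follows. The lexicon, its scores, the negation list, and the simple "flip the next sentiment word" rule are all hypothetical choices made for this example. Real sentiment lexicons are much larger and typically come with more refined rules for negation and intensifiers.

```python
import re

# Hypothetical toy lexicon: word -> polarity score.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical negation cues.
NEGATIONS = {"not", "no", "never"}

def polarity(text):
    """Sum lexicon scores, flipping the sign of the first sentiment word
    that follows a negation cue."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(tok, 0.0)
        if value:
            score += -value if negate else value
            negate = False  # the negation is consumed by this sentiment word
    return score

def label(score):
    """Map a numeric score to a polarity class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["A great movie", "Not good",
                 "The plot is not bad, but the acting is terrible"]:
    print(sentence, "->", label(polarity(sentence)))
```

The approach needs no training data, but its coverage is limited to the words in the lexicon, which is the main trade-off against the machine learning approaches.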
Machine learning approaches to sentiment analysis (4 hours)
Sentiment and polarity detection as a classification problem. Overview and comparison of the main unsupervised and supervised models on a case study.
Overview of neural network architectures for sentiment analysis (2 hours)
Design and implementation of a case study based on a neural network for sentiment detection and polarity evaluation.
Affect and emotion detection (1 hour)
Survey and definition of affect and emotion detection in texts. Discussion of the differences among the tasks of detecting sentiment, feelings, emotions, and opinions.
The language of emotions (4 hours)
Methods and techniques for modeling the language of emotions using neural networks and statistical language models. Application to a case study.
Multimodal approaches to emotion detection (1 hour)
Survey on the exploitation of multimodal data (e.g., face and body language in video and audio recordings) in combination with text to detect the language of emotions.
Hands-on session: a real case study from design to implementation (2 hours)
Students will be provided with a real case study on sentiment analysis and emotion detection. During the lesson, the case study will be worked through in order to design and implement a solution.
Recap and conclusion (1 hour)
Recap on the main course topics. Open discussion of the project work chosen by the students as their exam assignment.
Prerequisites for admission
Basic knowledge of machine learning, statistical learning, deep learning, and artificial intelligence.
Teaching methods
The course is given in the form of lectures with extensive use of examples and support materials such as Python notebooks. Slides and handouts are employed throughout the lectures and they are progressively published on the reference course website on the Ariel platform (https://aferraratmsa.ariel.ctu.unimi.it).
Lecture attendance is not mandatory, but it is strongly recommended.
Teaching Resources
Materials provided by the lecturer on the course website https://aferraratmsa.ariel.ctu.unimi.it
TOOLS
Reference programming language: Python.
Main modules:
NLTK
scikit-learn
spaCy
Gensim
PyTorch
REFERENCES
NLTK Book: https://www.nltk.org/book/
Aggarwal, C. C., & Zhai, C. (Eds.). (2012). Mining text data. Springer Science & Business Media.
Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1-167.
Munezero, M. D., Montero, C. S., Sutinen, E., & Pajunen, J. (2014). Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text. IEEE Transactions on Affective Computing, 5(2), 101-111.
Calvo, R. A., & D'Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18-37.
Assessment methods and Criteria
Students will be required to prepare and discuss a short paper and a project on the course topics. The topic of the short paper and the project will be defined with the lecturer.
INF/01 - INFORMATICS - University credits: 3
SECS-S/01 - STATISTICS - University credits: 3
Lessons: 40 hours
Professor:
Ferrara Alfio
Professor(s)
Reception:
By appointment. Meetings are held online and are arranged by first contacting the professor by email.
Online. In-person meetings, if any, are held at the Department of Computer Science, via Celoria 18, Milano, Room 7012 (7th floor).