Laboratory work - Spring 2022
The main goal of the laboratory work is to present the most important aspects of data science in practice and to teach you how to use the key tools of an NLP engineer. We especially emphasize self-paced work and raising standards related to development, replicability, reporting, research, visualization, etc. Our goal is not to provide exact instructions or to "make robots" out of the participants of this course. Participants will need to navigate the data themselves, identify promising leads, and extract as much information as possible from the data to present to others (colleagues, instructors, companies, or their superiors).
Important links
Lab sessions course repository (continuously updated, use weekly plan links for latest materials)
Books and other materials
Speech and language processing (online draft)
Python 3 Text Processing with NLTK 3 Cookbook
Introduction to Data Science Handbook
Previous years NLP course materials
Lab sessions recordings 2020/2021 (this year's sessions are regularly published below)
NLP course 2020 project reports
NLP course 2021 project reports
NLP course 2022 projects
Groups project selection (public id)
Peer review
Projects you need to review (check your email for repository assignment)
Weekly plan
This plan is regularly updated.
Lab sessions are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will show some DEMOs based on which you will work on your projects. Based on your proposals / new ideas we can adapt the weekly plan and prepare additional materials.
The first lab session (or the first one with in-person participants) will be recorded via Zoom. Only the parts with hands-on tutorials will be available online. You can also attend that session via zitnik.si/zoom. Please note that the focus will be on in-class participants, so use the live sessions only if necessary. Recordings will be available each week, for your reference only.
Week | Description | Materials and links |
14.2. - 18.2. | INFO: No one was present on Tuesday, so lab sessions will start in the second week ;). | |
21.2. - 25.2. | Lab work introduction Projects overview + data presentation Group work and projects application procedure | Python basics (non-CS background students) NLP Course Spring 2022 Sign Up Form (open Friday 8am, close Monday 6pm) - CLOSED |
28.2. - 4.3. | Basic text processing Text clustering INFO: From this week on, the Thursday session is moved to Wednesday, 5pm. | |
7.3. - 11.3. | Text classification Slovene text processing First joint meeting and detailed problem presentation - March 9, 2022, 5pm at UL FRI, P22 (Project 2 and 4). | |
14.3. - 18.3. | Traditional sequence tagging (HMM, MEMM, CRF, ...) Language models, knowledge bases First submission (Friday, 23:59) | Traditional language modelling, knowledge bases |
21.3. - 25.3. | First submission defense (in person) Joint meeting with prof. Glenn Smith (University of South Florida) regarding ideas for Project 1. Please join in via Zoom link, Wednesday, 6:00pm. Session recording (prof. Glenn Smith talk) | |
28.3. - 1.4. | Neural networks introduction (TensorFlow, Keras) Word embeddings & visualizations (offensive language) RNNs vs. GRUs vs. LSTMs + examples | Introduction to neural networks |
4.4. - 8.4. | Tensorflow examples Multiple simple NN architectures example Google Colab SLING tutorial (setup, Singularity, SLURM) Project 2 and 4 (collaboration group) sync meeting - April 6, 2022, 5pm at UL FRI, PR05 (UL FRI and UL FF students). | |
11.4. - 15.4. | Transformers, BERT (tagging, classification) BERT (custom task) | BERT (classification - Tensorflow) BERT (tagging & examples - PyTorch) |
18.4. - 22.4. (Mon. holiday) | Graph neural networks for text processing (Timotej Knez) | |
25.4. - 29.4. (Wed. holiday) | Consultations Second submission (Friday, 23:59) | |
2.5. - 6.5. (Mon. holiday) | Second submission defense (in person) | |
9.5. - 13.5. | No lab sessions on Wednesday and Thursday (due to Konferenca DSI 2022) (if you present a student project @DSI, you get +10 points for the lab part, link) Consultations on Tuesday only! | |
16.5. - 20.5. | No lab sessions this week (due to RCIS Conference) | |
23.5. - 27.5. | No lab sessions on Tuesday and Wednesday (due to NexusLinguarum COST Action) Final submission deadline (Wednesday, 6:00) IMPORTANT: Set your repositories' visibility to public before the deadline or shortly after! Peer review submission deadline (Friday, 23:59) Peer review link (each group got an email with repositories to review)! | |
30.5. - 2.6. | Final project presentations and best group award announcement - Tuesday, May 31 in P22: Project 1: 1pm Project 3: 3pm Project 4 & 5: 4pm |
Course obligations
Please regularly check the Weekly plan and course announcements for possible changes. You are expected to attend all sessions, and attendance at the defense sessions is mandatory. On assignment defense dates, at least one member of a group must be present; otherwise, all members need to provide a doctor's note. At the final assignment defense, all members must be present and must understand all parts of the submitted solution.
All the work must be submitted using your Github project repository. Submission deadlines are indicated in the table above. Submission defenses will be held during the lab sessions.
Students must work in groups of three members! At most one group of two members is allowed per project type. The distribution of work between members should be evident from the commits in the repository.
Obligation | Description | Final grade relevance (%) |
Submission 1 | Project selection & simple corpus analysis - Group (three members) selection - Report containing Introduction, existing solutions/related work and initial ideas - Well organized repository | 10 |
Submission 2 | Initial implementation / baseline with results - Updated Submission 1 parts - Implemented at least one solution with analysis - Future directions and ideas - Well organized repository | 20 |
Submission 3 | Final solution and report - Final report incl. analyses and discussions - Fully reproducible repository | 60 |
Peer review | Evaluate your peer group's work - Each group will check the final submissions of two other peer groups having the same topic (except Task 5 groups) | 10
Total: 100% |
Grading criteria
All the graded work is group work. All the work is graded following the scoring schema below. All the course obligations must be graded positive (i.e., 6 or more) to pass.
Use the PUBLIC GROUP ID for public communication regarding your group. The GROUP ID is your group's internal id, under which marks will be publicly available.
Scoring
Scoring is done relative to the achievements of all the participants in the course. The instructions define the criteria the participants need to address; exactly fulfilling the instructions will result in a score of 8. All other scores are relative to the quality of the submitted work. The role of the instructors is to find an appropriate clustering of all the works into 6 clusters. To better illustrate the scoring, the schema (from best to worst) is as follows:
- Repository is clear and runnable. The report is well organized, results are discussed, and visualizations add value to the text and are well prepared. Beyond the minimum criteria, the group tried multiple novel ideas of their own to approach the problem.
- Same as above, but it is visible that the group had novel ideas yet ran out of time or did not finish (polish) everything. The submission has multiple minor flaws.
- The group implemented everything suggested by the minimum criteria but did not investigate further (did not find much related work, did not apply multiple other techniques, ...).
- The group implemented everything suggested by the minimum criteria but did not discuss the results well and performed only simple analyses. The report is also not well organized and lacks data for reproducibility.
- The group tried to implement the minimum criteria (or only part of them), but their work has many minor flaws or a few major ones. The report also reflects their motivation.
- The group did not address one or more points of the minimum criteria, and the report contains major flaws. It can be seen that the group did not invest enough time into the work.
Best group awards
This year's course is part of University of Ljubljana's Digitalna.si pilot implementations of study programmes. We have therefore been awarded 1500 EUR gross, which prof. Robnik-Šikonja and assist. prof. Žitnik decided to divide among the best performing groups. Each best performing group will therefore receive an award of 1500 EUR gross divided by the number of best groups.
Final project preparation guidelines and peer review instructions
Some major remarks regarding the final submission that you should keep in mind are the following:
Peer review instructions:
Final project presentation instructions
Each group will have a maximum of 3 minutes (STRICT) to present their project. I will put your report on the projector and you will present alongside it. I propose that you focus on a specific interesting part (e.g., a table, graph, figure, ...). The most important aspect to present is:
What is the "take-away message" of your work? This should be concrete and concise, so that anyone (even a complete layperson) can understand it.
See timetable above for timeslots of your presentation. If you cannot attend, please write to me to get an alternative timeslot.
Specific projects information
Project 1: Literacy situation models knowledge base creation
Building a knowledge base based on situation models from selected English/Slovene short stories. Knowledge base can focus on a subset of the following inference types: Referential, Case structure role assignment, Causal antecedent, Superordinate goal, Thematic, Character emotional reaction, Causal consequence, Instantiation of noun category, Instrument, Subordinate goal, State, Emotion of reader, Author's intent.
Description: Extraction of concepts/entities/relationships/topics/time-aware data from novels. Creation of a book knowledge base and methods for inference and visualization (e.g. main characters, good/bad characters, "dramski trikotnik" (story grammar), time and space analysis, ...)
Task: (1) selection and retrieval of additional books (EN/SL) - 7 English provided, (2) review and detailed proposal of analysis to be done, (3) analysis with evaluations, explanations from narrative texts/stories, for example: causal chain summaries.
Provided English books (link):
- The Ransom of Red Chief (O. Henry)
- Hills Like White Elephants
- Leiningen Versus the Ants
- The Lady, or the Tiger?
- The Most Dangerous Game
- The Tell-Tale Heart
- The Gift of the Magi (O. Henry)
References:
- Caselli, Vossen (2017): The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction (github)
- Nahatame (2020): Revisiting Second Language Readers’ Memory for Narrative Texts: The Role of Causal and Semantic Text Relations
- van den Broek (1985): Causal Thinking and the Representation of Narrative Events
- Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli, Allen (2016): A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
- Dasgupta, Saha, Dey, Naskar (2018): Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks
- McNamara (2011): Coh-Metrix: An Automated Tool for Theoretical and Applied Natural Language Processing
- Zwaan, Magliano, Graesser (1995): Dimensions of Situation Model Construction in Narrative Comprehension
- Zwaan (1999): Situation Models: The Mental Leap Into Imagined Worlds
- Antonio Lieto, Gian Luca Pozzato, Stefano Zoia, Viviana Patti, Rossana Damiano (2021): A Commonsense Reasoning Framework for Explanatory Emotion Attribution, Generation and Re-classification, https://arxiv.org/abs/2101.04017
- Filip Ilievski, Pedro Szekely, Bin Zhang (2020): CSKG: The CommonSense Knowledge Graph, https://arxiv.org/abs/2012.11490
- Prof. Glenn Smith materials + additional video clarification: LINK
Proposed methodology:
1. Datasets collection & description (short stories, longer novels, # of stories for manual analysis, ...).
2. Definition of clear aspects of your work, for example:
- identification of entities, relationships,
- causal or time-based analysis of events,
- sentiment analysis of characters related to events, ...
3. Identification of additional datasets and tools (papers with runnable repositories) that you can use.
4. Analysis setup and running, preparing results, for example:
- general statistics of uncovered data from books
- manual check of selection of stories
- changing parameters, using different setups/models to provide comparison tables
- provision of one or more interesting visualizations of results (if you focus more on this, you might perform less analysis types)
5. Writing a final report + additional tests if needed (reserve one week for doing this).
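As a toy illustration of the "general statistics" idea in step 4, the sketch below counts sentence-level co-occurrences of character names as a rough proxy for character relationships. The story snippet and name list are hypothetical (loosely inspired by The Most Dangerous Game); a real pipeline would use NER and coreference resolution instead of a fixed name list.

```python
import re
from collections import Counter
from itertools import combinations

def character_cooccurrence(text, names):
    """Count how often pairs of known character names appear in the
    same sentence -- a rough proxy for character relationships."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pairs = Counter()
    for sent in sentences:
        present = sorted({name for name in names if name in sent})
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical snippet, not actual text from the story
story = ("Rainsford met Zaroff at dinner. Zaroff smiled at Rainsford. "
         "Ivan stood by the door. Rainsford watched Ivan and Zaroff leave.")
counts = character_cooccurrence(story, ["Rainsford", "Zaroff", "Ivan"])
print(counts.most_common())  # ("Rainsford", "Zaroff") co-occur most often
```

Such co-occurrence counts can directly feed a character-network visualization (nodes = characters, edge weights = counts).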
Project 2: Automatic semantic relation extraction (joint projects with UL FF)
Using TermFrame knowledge base and word embeddings to extract hypernym/hyponym pairs from specialized or general corpora. The goal is to build and visualize a semantic network. Extract non-hierarchical semantic relations using semantically annotated sentences as input.
The UL FF students will provide testing corpora for other domains; perform additional semantic annotation (if needed), prepare training data, manually evaluate results, visualise results using NetViz.
Description: Numerous approaches have been proposed to mine hierarchical relations from text, but extracting non-hierarchical relations is more demanding, also because these are often domain-specific. The student group may test different methods to extract relation candidates, but the ultimate goal is that the method performs well on domains other than the training domain (Karstology).
Training data:
- TermFrame knowledge base with annotated definitions. These sentences contain DEFINIENDUM - GENUS pairs (e.g. Karst - landscape, limestone - rock, stalagmite - speleothem); the hierarchical relation of hyperonymy/hyponymy can be modelled via word embeddings.
- For non-hierarchical relation extraction, the annotated sentences contain annotations of different relations (HAS_FORM, HAS_SIZE, HAS_COMPOSITION, HAS_CAUSE, HAS_LOCATION etc.). In addition, each relation instance contains an annotated relation marker (e.g. 'found_on' as a marker for LOCATION in 'found on soluble terrain').
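Before any embedding-based modelling, DEFINIENDUM-GENUS candidates can be mined from definition sentences with simple lexical (Hearst-style) patterns. The sketch below is illustrative only: the two patterns and example sentences are my own simplification, and the actual TermFrame annotations are far richer than this.

```python
import re

# Two illustrative Hearst-style patterns; the index pair says which
# regex group is the hyponym and which the hypernym.
PATTERNS = [
    (re.compile(r"(\w+) is an? (\w+)"), (1, 2)),    # "karst is a landscape"
    (re.compile(r"(\w+)s such as (\w+)"), (2, 1)),  # "speleothems such as stalagmites"
]

def extract_pairs(sentence):
    """Return (hyponym, hypernym) candidate pairs from one definition sentence."""
    pairs = []
    for pattern, (hypo, hyper) in PATTERNS:
        for match in pattern.finditer(sentence.lower()):
            pairs.append((match.group(hypo), match.group(hyper)))
    return pairs

print(extract_pairs("Karst is a landscape shaped by soluble rocks."))
```

Candidate pairs produced this way can then be filtered or ranked with word embeddings, as suggested above for the hyperonymy/hyponymy relation.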
First joint meeting and detailed problem presentation - March 9, 2022, 5pm at FRI, P22. Agreed collaboration with UL FF:
- Each group has a member from UL FF and actively collaborates with them. The final work will be a joint work of all four members of a group.
- Expectations (also subject to each group agreements):
- UL FF students:
- April 30: Selection of new not-annotated data to be annotated by UL FRI.
- May 15: Preparation of a test corpus (new annotated data).
- UL FRI students:
- May 15: Provide predictions for the not-annotated data, provided by UL FF.
Data and additional details:
- https://drive.google.com/drive/folders/17LF5gKGX-bgBL9NAa037pum4-i7gpqhm
References:
- Mikolov, Chen, Corrado, Dean (2013): Efficient Estimation of Word Representations in Vector Space
- NLP Progress (relationship extraction)
- Linguistic Data Consortium: ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, Version 6.2 2008.04.12, http://projects.ldc.upenn.edu/ace/ (Accessed: 2 November 2020)
- Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position- aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.
- Yao, Yuan, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. "DocRED: A large-scale document-level relation extraction dataset." arXiv preprint arXiv:1906.06127(2019).
- Liu, K. A survey on neural relation extraction. Sci. China Technol. Sci. 63, 1971–1989 (2020). https://doi.org/10.1007/s11431-020-1673-6
- Li, X., Yin, F., Sun, Z., Li, X., Yuan, A., Chai, D., ... & Li, J. (2019). Entity-relation extraction as multi-turn question answering. In Proceedings of ACL 2019.
- Christensen, J., Soderland, S., & Etzioni, O. (2011). An analysis of open information extraction based on semantic role labeling. In Proceedings of the sixth international conference on Knowledge capture (pp. 113-120).
- Guo, J., Che, W., Wang, H., Liu, T., & Xu, J. (2016). A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (pp. 1264-1274).
- Shi, P., & Lin, J. (2019). Simple bert models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.
- Mandya, A., Bollegala, D., Coenen, F., & Atkinson, K. (2017). Frame-based semantic patterns for relation extraction. In International Conference of the Pacific Association for Computational Linguistics (pp. 51-62).
- https://paperswithcode.com/task/relation-extraction
- https://link.springer.com/article/10.1007/s11431-020-1673-6
Proposed methodology:
1. Datasets collection & description (other relationship/ontology building datasets).
- finding similar/same semantic relations to Karstology in other datasets
2. Definition of clear evaluation setups for your work, for example:
- training/fine-tuning on other datasets, testing on karstology
- using an English model only
- using a multilingual model for zero-shot transfer learning (e.g., English -> Slovene)
- definition of manual rules/feature extractors (together with UL FF) for traditional methods usage
- evaluation of a model fine-tuned on Karstology on a new corpus (prepared by UL FF)
3. Identification of tools (papers with runnable repositories) that you can use for your evaluation setups.
4. Analysis setup and running, preparing results, for example:
- reporting on scores for selected evaluation setups (choose correct measures)
- qualitative analysis of results (UL FF)
- changing parameters, using different setups/models to provide comparison tables
5. Writing a final report + additional tests if needed (reserve one week for doing this).
Project 3: Cross-lingual sense disambiguation
Students will jointly work on dataset preparation and then use multilingual models for English and Slovene corpora. An English variant of the task already exists - the word-in-context (WiC) SuperGLUE task. The results will be compared between a Slovene-only and a multilingual model.
The UL FRI students are expected to semi-automatically prepare the corpus and perform analyses on it. A group can select random words, check their collocations, and try to detect different contexts automatically. After that, the examples will need to be manually checked and corrected. The corpus will also need to be published to the Clarin.si repository.
References:
- Wang et al. (2018): SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Web page, see Word in Context task)
- A Resource for Evaluating Graded Word Similarity in Context: CoSimLex
- SimLex-999 Slovenian translation SimLex-999-sl 1.0
- Interesting lists of words related to semantic shifts between 1997-2017 (data exported by Matej Martinc, March 2022):
- The list contains 50 words that the system detected as the most changed between the periods 1997 and 2017 with respect to Gigafida. The other 50 words (at the bottom of the list) were marked by the system as unchanged (the JSD distance between their usage distributions is close to 0, i.e., less than 0.1); these were selected randomly, but in such a way that their frequency distribution in both periods matches that of the changed words, so they cannot be detected by frequency analysis alone.
- Besides the words, the list also includes the detected semantic shift (column JSD K5 1997-2018) and frequency information.
- In addition to the list, the file contains, for each word, the sentences in which it appears. Each sentence carries information about the time period, source, and publication type it comes from, the cluster the system assigned it to, and its lemmatized form.
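The word lists above rank words by the Jensen-Shannon divergence (JSD) between their usage-cluster distributions in the two periods (the "JSD K5" column suggests 5 usage clusters). A minimal sketch of this measure, with hypothetical cluster distributions for one word:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2), skipping zero-probability bins."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two usage-cluster distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical usage-cluster distributions of one word (5 clusters)
usage_1997 = [0.7, 0.2, 0.1, 0.0, 0.0]
usage_2017 = [0.1, 0.1, 0.2, 0.4, 0.2]
print(round(jsd(usage_1997, usage_2017), 3))  # well above the 0.1 "unchanged" cutoff
```

A word whose usage distribution barely changes yields a JSD near 0, matching the "less than 0.1" criterion for unchanged words described above.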
Proposed methodology:
1. Collection & description of corpora that you can use for sentence/collocation searching.
- ccGigafida, KAS, other useful corpora on clarin.si
2. Definition of clear methodology for dataset preparation:
- selection of candidate words (random might not be okay, start from Slovene Homonyms dictionary, SloWNet or check references above)
- how will you detect and select sentences for clustering?
- which methods will be compared to obtain the best possible data (embeddings such as Word2Vec, FastText, ELMo, TF-IDF)
- how much data will be manually checked
3. Identification of tools (papers with runnable repositories) that you can use for solving the Slovene WiC task.
4. Analysis setup and running, preparing results, for example:
- reporting on scores for different evaluation setups (choose correct measures)
- changing parameters, using different setups/models to provide comparison tables
5. Writing a final report + additional tests if needed (reserve one week for doing this).
- as your task is twofold (data preparation + analysis), you can also focus more on one step (but then I need to see that you did, e.g., more comparisons for dataset creation)
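One of the simplest baselines listed in step 2 is comparing bag-of-words context vectors of the ambiguous word with cosine similarity (a deliberate simplification of TF-IDF, no external libraries). The sentences below are hypothetical English stand-ins; "list" itself is a genuine Slovene homonym ("leaf" vs. "sheet/list").

```python
import math
import re
from collections import Counter

def context_vector(sentence, target):
    """Bag-of-words vector of the words surrounding `target` in a sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    return Counter(token for token in tokens if token != target)

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical example sentences for the ambiguous word "list"
sentences = [
    "the green list fell from the tree",
    "a yellow list fell from the old tree",
    "sign the list of participants here",
]
vectors = [context_vector(s, "list") for s in sentences]
print(cosine(vectors[0], vectors[1]))  # same sense: higher similarity
print(cosine(vectors[0], vectors[2]))  # different sense: lower similarity
```

The same comparison scheme works with any of the embeddings mentioned above (Word2Vec, FastText, ELMo) by swapping the context-vector construction.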
Project 4: Cross-lingual question answering (joint projects with UL FF - one group only)
Students will jointly work on dataset preparation and then use multilingual models for English and Slovene corpora. They will check the performance of transfer learning for question answering from English to Slovene. The results will be compared with the Slovene-only model, trained on translated data.
The UL FRI students will use the EC eTranslation service (https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation) to translate corpora (SQuAD 2.0, RiddleSense), and the UL FF students will then manually check and correct the corpus. The corpus will need to be carefully prepared, as it will be merged together from all the groups and used for the analysis. The corpus will also need to be published to the Clarin.si repository.
The UL FRI students are expected to set up the translation platform (e.g., Memsource) and help the UL FF students with it. Then they will build models (transfer learning, multilingual models, Slovene-only models) and perform analyses and evaluation. Both groups of students are supposed to work together on the discussion part.
First joint meeting and detailed problem presentation - March 9, 2022, 5pm at FRI, P22. Agreed collaboration with UL FF:
- A group having a UL FF member:
- The final work will be a joint work of all four members of a group.
- It is expected that UL FF member will manually annotate/update automatic translations.
- Groups not having a UL FF member:
- Think of using multi-lingual models, automatic translation techniques or other ideas that might help you automatically answer questions.
References:
- Devlin, Chang, Lee, Toutanova (2018): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Rajpurkar et al. (2018): The Stanford Question Answering Dataset - SQuAD2.0 (link)
- Shared Task on Cross-lingual Open-Retrieval QA (2022)
- https://paperswithcode.com/task/question-answering
eTranslation Service:
- Documentation for the webservice: LINK
- Different domains (language styles and subjects) that are available can also be found through a web service. The easiest way to see them is from the drop-down on the web page: https://webgate.ec.europa.eu/etranslation.
- Full technical documentation on the webservice: LINK
- To register your application, you need to provide a couple of sentences that describe your plans, and you will get the credentials (Application name and Password)
Proposed methodology:
1. Datasets collection & description (question answering datasets).
- choose at least one dataset you will automatically translate
- check that the data is aligned to the task definition
2. Definition of clear methodology for automatic data preparation:
- after translation, check that the dataset is sound
- if there are glitches, define manual rules/heuristics to update it accordingly
- (group with UL FF) send the data for manual translation
3. Identification of tools (papers with runnable repositories) that you can use for your evaluation setups.
4. Analysis setup and running, preparing results, for example:
- reporting on scores for selected evaluation setups (choose correct measures)
- (group with UL FF) perform automatic question answering on manually checked data and report on automatically calculated scores (also compare to the same set of automatically annotated data)
- (group with UL FF) send results of automatic question answering on manually checked data to UL FF colleague for qualitative analysis
- changing parameters, using different setups/models to provide comparison tables
5. Writing a final report + additional tests if needed (reserve one week for doing this).
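For the "choose correct measures" point in step 4, extractive QA is usually scored with SQuAD-style exact match and token-level F1. The sketch below is a simplified reimplementation (the official SQuAD evaluation script additionally handles multiple reference answers and unanswerable questions); the example answers are hypothetical.

```python
import re
from collections import Counter

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop articles and punctuation."""
    text = re.sub(r"\b(a|an|the)\b", " ", text.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Ljubljanica river", "ljubljanica river"))  # 1.0
print(round(token_f1("river Ljubljanica in Slovenia", "Ljubljanica river"), 2))  # 0.67
```

Reporting both measures on the same test split makes the English-to-Slovene transfer comparison in step 4 directly interpretable.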
Project 5: Custom topic
For those who already have some experience or would like to do more research on their own. For the proposed topic you need to provide:
- Explanation of the suggested topic (tasks such as Twitter sentiment analysis are not allowed!)
- Clear idea presentation, motivation
- Data collection, description, organization
- Methodology and expected outcomes
Some ideas:
- National NLP Clinical Challenges
- FRI DataScience competition - Automated customer email classification (DS students)