Laboratory work - Spring 2022

The main goal of the laboratory work is to present the most important aspects of data science in practice and to teach you how to use the key tools of an NLP engineer. We especially emphasize self-paced work and raising standards related to development, replicability, reporting, research, visualization, etc. Our goal is not to provide exact instructions or to "make robots" out of the participants of this course. Participants will need to navigate the data themselves, identify promising leads, and extract as much information as possible from the data to present to others (colleagues, instructors, companies, or their superiors).

Important links

Lab sessions course repository (continuously updated, use weekly plan links for latest materials)

Books and other materials

        Speech and language processing (online draft)

        Python 3 Text Processing with NLTK 3 Cookbook

Introduction to Data Science Handbook

Previous years NLP course materials

        Lab sessions recordings 2020/2021 (this year's sessions are regularly published below)

        NLP course 2020 project reports

        NLP course 2021 project reports

NLP report LaTeX template

NLP course 2022 projects

        Groups project selection (public id)

        Marks 2022 (PUBLIC SCORES)

        Marks 2022 (PUBLIC RESPONSES)

Peer review

        Projects you need to review (check your email for repository assignment)

        Peer review submission form


Weekly plan

This plan is regularly updated.


Lab sessions are meant for discussing materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions; such contributions will also be taken into account. During the lab sessions we will show some demos, on which you will base your project work. Based on your proposals and new ideas, we can adapt the weekly plan and prepare additional materials.

The first lab session (or the first one with in-person participants) will be recorded via Zoom; only the parts with hands-on tutorials will be available online. You can also attend that session via zitnik.si/zoom. Please note that the focus will be on in-class participants, so use the live sessions only if necessary. Recordings will be made available each week, for your reference only.

Each entry below lists the week, the session description, and the materials and links.

14.2. - 18.2.

INFO: No one was present on Tuesday, so lab sessions will start in the second week ;).

21.2. - 25.2.

Lab work introduction

Projects overview + data presentation

Group work and projects application procedure

Introduction slides

Python basics (non-CS background students)

Session recording

NLP Course Spring 2022 Sign Up Form (opened Friday 8am, closed Monday 6pm) - CLOSED

28.2. - 4.3.

Basic text processing

Text clustering

INFO: From this week on, the Thursday session is moved to Wednesday, 5pm.

Basic text processing

Text clustering

Session recording

7.3. - 11.3.

Text classification

Slovene text processing

First joint meeting and detailed problem presentation - March 9, 2022, 5pm at UL FRI, P22 (Projects 2 and 4).

Text classification

Slovene text processing

Session recording

14.3. - 18.3.

Traditional sequence tagging (HMM, MEMM, CRF, ...)

Language models, knowledge bases

First submission (Friday, 23:59)

Traditional language modelling, knowledge bases

Traditional sequence tagging

Session recording

21.3. - 25.3.

First submission defense (in person)

Joint meeting with prof. Glenn Smith (University of South Florida) regarding ideas for Project 1. Please join via the Zoom link, Wednesday, 6:00pm.

Session recording (prof. Glenn Smith talk)

28.3. - 1.4.

Neural networks introduction (TensorFlow, Keras)

Word embeddings & visualizations (offensive language)

RNNs vs. GRUs vs. LSTMs + examples

Introduction to neural networks

Session recording

4.4. - 8.4.

TensorFlow examples

Multiple simple NN architectures example

Google Colab

SLING tutorial (setup, Singularity, SLURM)

Projects 2 and 4 (collaboration group) sync meeting - April 6, 2022, 5pm at UL FRI, PR05 (UL FRI and UL FF students).

TensorFlow versions example

Simple NNs comparison

Google Colab showcase

SLING (SLURM and HPC usage)

Session recording

11.4. - 15.4.

Transformers, BERT (tagging, classification)

BERT (custom task)

BERT (classification - Tensorflow)

BERT (tagging & examples - PyTorch)

Session recording

18.4. - 22.4.

(Mon. holiday)

Graph neural networks for text processing (Timotej Knez)

Graph neural networks for NLP

Session recording

25.4. - 29.4.

(Wed. holiday)

Consultations

Second submission (Friday, 23:59)

2.5. - 6.5.

(Mon. holiday)

Second submission defense (in person)

9.5. - 13.5.

No lab sessions on Wednesday and Thursday

(due to the DSI 2022 conference)

(if you present a student project @DSI, you get +10 points for the lab part, link)

Consultations on Tuesday only!

16.5. - 20.5.

No lab sessions this week

(due to RCIS Conference)

23.5. - 27.5.

No lab sessions on Tuesday and Wednesday

(due to NexusLinguarum COST Action)

Final submission deadline (Wednesday, 6:00)

      IMPORTANT: Set the visibility of your repositories to public before the deadline or shortly after!

Peer review submission deadline (Friday, 23:59)

      Peer review link (each group got an email with repositories to review)!


Consultations (please write to arrange for Thursday or Friday).

30.5. - 2.6.

Final project presentations and best group award announcement - Tuesday, May 31 in P22:

    Project 1: 1pm

    Project 2: 2pm

    Project 3: 3pm

    Project 4 & 5: 4pm

Course obligations

Please regularly check the Weekly plan and course announcements for possible changes. You are expected to attend the lab sessions, and you must attend the defense sessions. On assignment defense dates, at least one member of each group must be present; otherwise, all members need to provide a doctor's justification. At the last assignment defense, all members must be present and need to understand all parts of the submitted solution.

All the work must be submitted using your GitHub project repository. Submission deadlines are indicated in the table above. Submission defenses will be held during the lab sessions.

Students must work in groups of three members! There can be at most one group of two members per project type. The distribution of work among members should be evident from the commits within the repository.

Each obligation below lists a description and its relevance for the final grade.

Submission 1

Project selection & simple corpus analysis

  - Group (three members) selection

  - Report containing the introduction, existing solutions/related work, and initial ideas

  - Well organized repository

Final grade relevance: 10%

Submission 2

Initial implementation / baseline with results

  - Updated Submission 1 parts

  - Implemented at least one solution with analysis

  - Future directions and ideas

  - Well organized repository

Final grade relevance: 20%

Submission 3

Final solution and report

  - Final report incl. analyses and discussions

  - Fully reproducible repository

Final grade relevance: 60%

Peer review

Evaluate your peer groups' work

  - Each group will check the final submissions of three other peer groups having the same topic (except Project 5 groups)

Final grade relevance: 10%

Total: 100%

Grading criteria

All the graded work is group work. All the work is graded following the scoring schema below. All course obligations must be graded positively (i.e., 6 or more) to pass.

Use your PUBLIC GROUP ID for public communication regarding your group. The GROUP ID is the internal id of your group, under which marks will be publicly available.

Scoring

Scoring is done relative to the achievements of all the participants in the course. The instructions define the criteria the participants need to address, and exactly fulfilling the instructions results in a score of 8. All the other scores are relative to the quality of the submitted work. The role of the instructors is to find an appropriate clustering of all the works into 6 clusters (scores 5-10). To better illustrate the scoring, the schema is as follows:

Score 10: The repository is clear and runnable. The report is well organized, results are discussed, and visualizations add value to the text and are well prepared. Apart from the minimum criteria, the group tried multiple novel ideas of their own to approach the problem.

Score 9: Same as above - it is visible that the group had novel ideas but ran out of time or did not finish (polish) everything. The submission has multiple minor flaws.

Score 8: The group implemented everything suggested by the minimum criteria but did not investigate further (did not find much related work, did not apply multiple other techniques, ...).

Score 7: The group implemented everything suggested by the minimum criteria but did not discuss the results well, performed simple analyses only, etc. The report is also not well organized and lacks data for reproducibility.

Score 6: The group tried to implement the minimum criteria (or only a part of them), but their work has many minor flaws or a few major ones. The report also reflects their motivation.

Score 5: The group did not address one or more points of the minimum criteria and the report contains major flaws. It can be seen that the group did not invest enough time into the work.

Best group awards

This year's course is part of the University of Ljubljana's Digitalna.si pilot implementations of study programmes. We have therefore been awarded 1500 EUR gross, which prof. Robnik-Šikonja and assist. prof. Žitnik decided to divide among the best performing groups. Each best performing group will therefore receive an award of 1500 EUR gross divided by the number of best groups.

Final project preparation guidelines and peer review instructions

Regarding the final submission, keep in mind the following major remarks:

Peer review instructions:

  1. Please find the projects you need to review (see link above).
  2. Each group needs to review 3 projects of the same topic they have chosen.
  3. Submit your peer review scores in the Google Form (see link above).
  4. You will also receive a score for your grading, depending on how much (allowing for some margin) your grading differs from the assistant's grading.
  5. Follow the scoring criteria as presented above and include feedback along with your mark.

Final project presentation instructions

Each group will have max. 3 minutes (STRICT) to present their project. I will put your report on the projector and you will present alongside it. I propose that you focus on a specific interesting part of your work (e.g., a table, graph, figure, ...). The most important aspect to present is:

        What is the "take-away message" of your work? This should be concrete and concise, so that anyone (even a complete layperson) can understand it.

See the timetable above for the timeslot of your presentation. If you cannot attend, please write to me to arrange an alternative timeslot.

Specific projects information

Project 1: Literacy situation models knowledge base creation

Building a knowledge base based on situation models from selected English/Slovene short stories. The knowledge base can focus on a subset of the following inference types: Referential, Case structure role assignment, Causal antecedent, Superordinate goal, Thematic, Character emotional reaction, Causal consequence, Instantiation of noun category, Instrument, Subordinate goal, State, Emotion of reader, Author's intent.

Description: Extraction of concepts/entities/relationships/topics/time-aware data from novels. Creation of a book knowledge base and methods for inference and visualization (e.g., main characters, good/bad characters, the "dramski trikotnik" (dramatic triangle / story grammar), time and space analysis, ...)

Task: (1) selection and retrieval of additional books (EN/SL) - 7 English provided, (2) review and a detailed proposal of the analysis to be done, (3) analysis with evaluations and explanations from narrative texts/stories, for example causal chain summaries.

Provided English books (link):

        - The Ransom of Red Chief (O. Henry)

        - Hills Like White Elephants

        - Leiningen Versus the Ants

        - The Lady or the Tiger

        - The Most Dangerous Game

        - The Tell-Tale Heart

        - The Gift of the Magi (O. Henry)

References:

        - Caselli, Vossen (2017): The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction (github)

        - Nahatame (2020): Revisiting Second Language Readers’ Memory for Narrative Texts: The Role of Causal and Semantic Text Relations

        - van den Broek (1985): Causal Thinking and the Representation of Narrative Events

        - Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli, Allen (2016): A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

        - Dasgupta, Saha, Dey, Naskar (2018): Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks

        - McNamara (2011): Coh-Metrix: An Automated Tool for Theoretical and Applied Natural Language Processing

        - Zwaan, Magliano, Graesser (1995): Dimensions of Situation Model Construction in Narrative Comprehension

        - Zwaan (1999): Situation Models: The Mental Leap Into Imagined Worlds

        - Antonio Lieto, Gian Luca Pozzato, Stefano Zoia, Viviana Patti, Rossana Damiano (2021): A Commonsense Reasoning Framework for Explanatory Emotion Attribution, Generation and Re-classification, https://arxiv.org/abs/2101.04017 

        - Filip Ilievski, Pedro Szekely, Bin Zhang (2020): CSKG: The CommonSense Knowledge Graph, https://arxiv.org/abs/2012.11490 

        - Prof. Glenn Smith materials + additional video clarification: LINK

Proposed methodology:

        1. Datasets collection & description (short stories, longer novels, # of stories for manual analysis, ...).

        2. Definition of clear aspects of your work, for example:

                - identification of entities and relationships (see the sketch after this list),

                - causal or time-based analysis of events,

                - sentiment analysis of characters related to events, ...

        3. Identification of additional datasets and tools (papers with runnable repositories) that you can use.

        4. Analysis setup and running, preparing results, for example:

                - general statistics of the data uncovered from the books

                - manual check of selection of stories

                - changing parameters, using different setups/models to provide comparison tables

                - provision of one or more interesting visualizations of results (if you focus more on this, you might perform less analysis types)

        5. Writing a final report + additional tests if needed (reserve one week for doing this).
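For step 2 above (identification of entities), a minimal sketch to get you started: it counts PERSON mentions in one story with spaCy. This assumes spaCy and its small English model are installed; the file name is a hypothetical local copy of one of the provided stories.

        # Rough character extraction via NER (assumes: pip install spacy and
        # python -m spacy download en_core_web_sm; the file name is hypothetical).
        import spacy
        from collections import Counter

        nlp = spacy.load("en_core_web_sm")

        def character_mentions(text: str) -> Counter:
            # Count PERSON entities as a rough proxy for story characters.
            doc = nlp(text)
            return Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")

        with open("the_most_dangerous_game.txt", encoding="utf-8") as f:
            print(character_mentions(f.read()).most_common(10))

Mention counts are only a proxy: you will need coreference resolution, or at least name-variant merging, to map different mentions of the same character together.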

Project 2: Automatic semantic relation extraction (joint projects with UL FF)

Using TermFrame knowledge base and word embeddings to extract hypernym/hyponym pairs from specialized or general corpora. The goal is to build and visualize a semantic network. Extract non-hierarchical semantic relations using semantically annotated sentences as input.

The UL FF students will provide testing corpora for other domains, perform additional semantic annotation (if needed), prepare training data, manually evaluate the results, and visualise them using NetViz.

Description: Numerous approaches have been proposed to mine hierarchical relations from text, but extracting non-hierarchical relations is more demanding, also because these are often domain-specific. The student group may test different methods to extract relation candidates, but the ultimate goal is that the method performs on domains other than the training domain (Karstology).

Training data:

- TermFrame knowledge base with annotated definitions. These sentences contain DEFINIENDUM - GENUS pairs (e.g., karst - landscape, limestone - rock, stalagmite - speleothem); the hierarchical relation of hyperonymy/hyponymy can be modelled via word embeddings (see the sketch after this list).

- For non-hierarchical relation extraction, the annotated sentences contain annotations of different relations (HAS_FORM, HAS_SIZE, HAS_COMPOSITION, HAS_CAUSE, HAS_LOCATION etc.). In addition, each relation instance contains an annotated relation marker (e.g. 'found_on' as a marker for LOCATION in 'found on soluble terrain').
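A minimal sketch of the embedding idea above, assuming gensim and one of its bundled pretrained English vector sets (any vectors can be substituted); the candidate pairs are the examples from the description:

        # Score DEFINIENDUM - GENUS candidates by embedding similarity
        # (assumes: pip install gensim; the vectors download on first use).
        import gensim.downloader as api

        vectors = api.load("glove-wiki-gigaword-100")

        candidates = [("karst", "landscape"), ("limestone", "rock"), ("stalagmite", "speleothem")]
        for hyponym, hypernym in candidates:
            if hyponym in vectors and hypernym in vectors:
                print(hyponym, "->", hypernym, round(float(vectors.similarity(hyponym, hypernym)), 3))

Note that cosine similarity is symmetric, so on its own it cannot tell the hyponym from the hypernym; a classifier or projection trained on the annotated pairs is needed to capture the direction of the relation.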

First joint meeting and detailed problem presentation - March 9, 2022, 5pm at FRI, P22. Agreed collaboration with UL FF:

        - Each group has a member from UL FF and actively collaborates with them. The final work will be a joint work of all four members of the group.

        - Expectations (also subject to each group's agreements):

                - UL FF students:

                        - April 30: Selection of new, unannotated data to be annotated by UL FRI.

                        - May 15: Preparation of a test corpus (new annotated data).

                - UL FRI students:

                        - May 15: Provide predictions for the unannotated data provided by UL FF.

Data and additional details:

        - https://drive.google.com/drive/folders/17LF5gKGX-bgBL9NAa037pum4-i7gpqhm

References:

        - Mikolov, Chen, Corrado, Dean (2013): Efficient Estimation of Word Representations in Vector Space

        - NLP Progress (relationship extraction)

        - Linguistic Data Consortium: ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, Version 6.2 2008.04.12, http://projects.ldc.upenn.edu/ace/ (Accessed: 2 November 2020)

- Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

- Yao, Yuan, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. "DocRED: A large-scale document-level relation extraction dataset." arXiv preprint arXiv:1906.06127 (2019).

- Liu, K. A survey on neural relation extraction. Sci. China Technol. Sci. 63, 1971–1989 (2020). https://doi.org/10.1007/s11431-020-1673-6

- Li, X., Yin, F., Sun, Z., Li, X., Yuan, A., Chai, D., ... & Li, J. (2019). Entity-relation extraction as multi-turn question answering. In Proceedings of ACL 2019.

- Christensen, J., Soderland, S., & Etzioni, O. (2011). An analysis of open information extraction based on semantic role labeling. In Proceedings of the sixth international conference on Knowledge capture (pp. 113-120).

- Guo, J., Che, W., Wang, H., Liu, T., & Xu, J. (2016). A unified architecture for semantic role labeling and relation classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (pp. 1264-1274).

- Shi, P., & Lin, J. (2019). Simple bert models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.

- Mandya, A., Bollegala, D., Coenen, F., & Atkinson, K. (2017). Frame-based semantic patterns for relation extraction. In International Conference of the Pacific Association for Computational Linguistics (pp. 51-62).

- https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets#traditional-information-extraction

- https://paperswithcode.com/task/relation-extraction 

- https://link.springer.com/article/10.1007/s11431-020-1673-6 

Proposed methodology:

        1. Datasets collection & description (other relationship/ontology building datasets).

                - finding similar/same semantic relations to Karstology in other datasets

        2. Definition of clear evaluation setups for your work, for example:

                - training/fine-tuning on other datasets, testing on Karstology

                - using an English model only

                - using a multilingual model for zero-shot transfer learning (e.g., English -> Slovene; see the sketch after this list)

                - definition of manual rules/feature extractors (together with UL FF) for traditional methods usage

                - evaluation of a model fine-tuned on Karstology on a new corpus (prepared by UL FF)

        3. Identification of tools (papers with runnable repositories) that you can use for your evaluation setups.

        4. Analysis setup and running, preparing results, for example:

                - reporting on scores for selected evaluation setups (choose correct measures)

                - qualitative analysis of results (UL FF)

                - changing parameters, using different setups/models to provide comparison tables

        5. Writing a final report + additional tests if needed (reserve one week for doing this).
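For the zero-shot transfer setup in step 2, a minimal skeleton with the Hugging Face transformers library; the label count and the Slovene test sentence are illustrative, and the classification head produces meaningful outputs only after fine-tuning on the English relation data:

        # Multilingual zero-shot relation classification skeleton
        # (assumes: pip install torch transformers).
        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        name = "bert-base-multilingual-cased"
        tokenizer = AutoTokenizer.from_pretrained(name)
        # num_labels=5 is an illustrative count of relation types (HAS_FORM, ...).
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=5)

        # ... fine-tune the model on English relation examples here ...

        sentence = "Kapnik nastane na topnem apnenčastem terenu."  # Slovene test input
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            print(model(**inputs).logits.softmax(-1))  # relation-type probabilities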

Project 3: Cross-lingual sense disambiguation

Students will jointly work on dataset preparation and then use multilingual models for English and Slovene corpora. There exists an English variant of the task: the Word-in-Context (WiC) SuperGLUE task. The results will be compared with a Slovene-only model and a multilingual model.

The UL FRI students are expected to semi-automatically prepare the corpus and perform analyses on it. A group can select random words, check their collocations, and try to detect different contexts automatically. After that, the examples will need to be manually checked and corrected. The corpus will also need to be published to the Clarin.si repository.

References:

        - Wang et al. (2019): SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (web page, see the Word-in-Context task)

        - A Resource for Evaluating Graded Word Similarity in Context: CoSimLex 

        - SimLex-999 Slovenian translation SimLex-999-sl 1.0

        - Interesting list of words related to semantic shifts between 1997 and 2017 (data exported by Matej Martinc, March 2022):

- The list contains 50 words that the system detected as the most changed between the 1997 and 2017 periods with respect to Gigafida. The other 50 words (at the bottom of the list) were marked by the system as unchanged (the JSD distance between their usage distributions is close to 0, i.e., less than 0.1); these were chosen randomly, but so that their frequency distribution in both periods matches that of the changed words, meaning they cannot be detected by frequency analysis alone (see the short JSD sketch after this list).

- Besides the words, the list also contains information about the detected semantic shift (column JSD K5 1997-2018) and about frequency.

- Besides the list, the file also contains, for each word, the sentences in which it appears. Each sentence carries information about the time period, source, and publication type it comes from, the cluster the system assigned it to, and the lemmatized form.
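To make the JSD criterion above concrete, a minimal sketch assuming SciPy; the two usage distributions (shares of a word's occurrences per usage cluster) are invented:

        # Jensen-Shannon distance between a word's usage distributions
        # in two periods (assumes: pip install scipy; numbers are made up).
        from scipy.spatial.distance import jensenshannon

        usage_1997 = [0.70, 0.20, 0.10]  # share of occurrences per usage cluster
        usage_2017 = [0.15, 0.25, 0.60]
        print(jensenshannon(usage_1997, usage_2017))  # close to 0 => unchanged word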

Proposed methodology:

        1. Collection & description of corpora that you can use for sentence/collocation search.

                - ccGigafida, KAS, other useful corpora on clarin.si

        2. Definition of clear methodology for dataset preparation:

                - selection of candidate words (random selection might not be okay; start from the Slovene homonyms dictionary, SloWNet, or check the references above)

                - how will you detect and select sentences for clustering (see the clustering sketch after this list)?

                - which methods will be compared for getting as good data as possible (embeddings such as Word2Vec, FastText, ELMo, or TF-IDF vectors)

                - how much data will be manually checked

        3. Identification of tools (papers with runnable repositories) that you can use for solving the Slovene WiC task.

        4. Analysis setup and running, preparing results, for example:

                - reporting on scores for different evaluation setups (choose correct measures)

                - changing parameters, using different setups/models to provide comparison tables

        5. Writing a final report + additional tests if needed (reserve one week for doing this).

                - as your task is twofold (data preparation + analysis), you can also focus more on one step (but then I need to see that you did, e.g., more comparisons for dataset creation)
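For the sentence-clustering question in step 2, a minimal sketch assuming scikit-learn; the sentences are invented hits for the Slovene homonym "klop" (bench vs. tick), and TF-IDF is only the simplest of the representations listed above:

        # Group usages of a candidate word into sense clusters
        # (assumes: pip install scikit-learn; the sentences are invented).
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        sentences = [
            "Sedel je na leseno klop pred hišo.",
            "Na sprehodu se mu je na kožo prisesal klop.",
            "Stara klop v parku je bila sveže prebarvana.",
            "Klopi lahko prenašajo boreliozo.",
        ]
        X = TfidfVectorizer().fit_transform(sentences)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        for label, sentence in zip(labels, sentences):
            print(label, sentence)

On a handful of sentences TF-IDF clusters are unreliable; contextual embeddings of the target word (e.g., ELMo or BERT-style vectors) should separate the senses much better.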

Project 4: Cross-lingual question answering (joint projects with UL FF - one group only)

Students will jointly work on dataset preparation and then use multilingual models for English and Slovene corpora. They will check the performance of transfer learning for question answering from English to Slovene. The results will be compared with a Slovene-only model trained on translated data.

The UL FRI students will use the European Commission's eTranslation service (https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eTranslation) to translate the corpora (SQuAD 2.0, RiddleSense); UL FF students will then manually check and correct the corpus. The corpus will need to be carefully prepared, as it will be merged together from all the groups and used for the analysis. The corpus will also need to be published to the Clarin.si repository.

The UL FRI students are expected to set up the translation platform (e.g., Memsource) and help the UL FF students use it. They will then build models (transfer learning, multilingual models, Slovene-only models) and perform the analyses and evaluation. UL FRI and UL FF students are supposed to work together on the discussion part.

First joint meeting and detailed problem presentation - March 9, 2022, 5pm at FRI, P22. Agreed collaboration with UL FF:

        - A group having a UL FF member:

                - The final work will be a joint work of all four members of a group.

                - It is expected that the UL FF member will manually annotate/update the automatic translations.

        - Groups not having a UL FF member:

                - Think of using multilingual models, automatic translation techniques, or other ideas that might help you automatically answer questions.

References:

        - Devlin, Chang, Lee, Toutanova (2018): BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

        - Rajpurkar et al. (2018): The Stanford Question Answering Dataset - SQuAD2.0 (link)

        - Shared Task on Cross-lingual Open-Retrieval QA (2022)

        - https://paperswithcode.com/task/question-answering 

eTranslation Service:

- Documentation for the webservice: LINK 

- Different domains (language styles and subjects) that are available can also be found through the web service. The easiest way to see them is from the drop-down on the web page: https://webgate.ec.europa.eu/etranslation.

- Full technical documentation on the webservice: LINK 

- To register your application, you need to provide a couple of sentences that describe your plans, and you will get the credentials (Application name and Password)

Proposed methodology:

        1. Datasets collection & description (question answering datasets).

                - choose at least one dataset you will automatically translate

                - check that the data is aligned to the task definition

        2. Definition of clear methodology for automatic data preparation:

                - after translation, check that the dataset is sound

                - if there are glitches, define manual rules/heuristics to update it accordingly (see the realignment sketch after this list)

                - (group with UL FF) send the data for manual translation

        3. Identification of tools (papers with runnable repositories) that you can use for your evaluation setups.

        4. Analysis setup and running, preparing results, for example:

                - reporting on scores for selected evaluation setups (choose correct measures)

                - (group with UL FF) perform automatic question answering on manually checked data and report on automatically calculated scores (also compare to the same set of automatically annotated data)

                - (group with UL FF) send results of automatic question answering on manually checked data to UL FF colleague for qualitative analysis

                - changing parameters, using different setups/models to provide comparison tables

        5. Writing a final report + additional tests if needed (reserve one week for doing this).
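For the glitch-fixing heuristics in step 2, note that machine translation invalidates the SQuAD answer_start character offsets; below is a minimal re-anchoring sketch (the function and the example strings are illustrative):

        # Re-anchor a translated answer span in the translated context
        # (SQuAD-style data; the example strings are invented).
        def realign(context: str, answer: str):
            # Return the new answer_start offset, or None if the span was
            # lost in translation and needs a manual fix.
            start = context.lower().find(answer.lower())
            return start if start != -1 else None

        context_sl = "Normanski osvajalci so v 10. in 11. stoletju prišli iz Normandije."
        print(realign(context_sl, "iz Normandije"))  # prints the new answer_start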

Project 5: Custom topic

For those who already have some experience or would like to do more research on their own. For the proposed topic you need to provide:

        - Explanation of the suggestion (tasks such as Twitter sentiment analysis are not allowed!)

        - Clear idea presentation, motivation

        - Data collection, description, organization

        - Methodology and expected outcomes

Some ideas:

        - National NLP Clinical Challenges

        - SemEval 2022 shared tasks

        - COLING 2022 shared tasks

        - MedVidQA 2022

        - FRI DataScience competition - Automated customer email classification (DS students)