Laboratory work - Spring 2023

The main goal of the laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools of an NLP engineer. We especially emphasize self-paced work and raising standards related to development, replicability, reporting, research, visualization, etc. Our goal is not to provide exact instructions or to "make robots" out of the participants of this course. Participants will need to navigate the data themselves, identify promising leads, and extract as much information as possible to present to others (colleagues, instructors, companies, or their superiors).

Important links

Lab sessions course repository (continuously updated, use weekly plan links for latest materials)

Books and other materials

        Speech and language processing (online draft)

        Python 3 Text Processing with NLTK 3 Cookbook

Introduction to Data Science Handbook

Review of NLP toolkits and proposal for a new one

Previous years NLP course materials

        NLP course 2020 project reports

        NLP course 2021 project reports

        NLP course 2022 project reports

NLP report LaTeX template

NLP course 2023 projects

        Projects data to share

                * Please create a separate folder for each dataset. Also include a README with a basic description (data source, specifics of data retrieval, basic data statistics, ...)

        Marks

Peer review

        Projects you need to review (check your email for repository assignment)

        Peer review submission form

        

Weekly plan

This plan is regularly updated.

 

Lab sessions are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will show some DEMOs based on which you will work on your projects. Based on your proposals / new ideas we can adapt the weekly plan and prepare additional materials.

All the lab session tutorials will be regularly updated in the Github repository. During the lab sessions we will briefly present each week's topic and then mostly discuss your project ideas and work. You are expected to check/run notebooks before the lab sessions and then ask questions/discuss during the lab sessions. In the repository's README you can also find the recordings of each topic.

Week

Description

Materials and links

13.2. - 17.2.

INFO: No one was present on Tuesday, so lab sessions will start in the second week ;).

20.2. - 24.2.

Lab work introduction

Projects overview

Group work and projects application procedure

Introduction slides

Python basics (non-CS background students)

Github classroom assignment (final deadline set to May 26, 23:59).

27.2. - 3.3.

Basic text processing

Text clustering

Basic text processing

Text clustering

6.3. - 10.3.

Text classification

Slovene text processing

Text classification

Slovene text processing

13.3. - 17.3.

Traditional sequence tagging (HMM, MEMM, CRF, ...)

Language models, knowledge bases

First submission (Friday, 23:59)

Traditional language modelling, knowledge bases

Traditional sequence tagging

20.3. - 24.3.

First submission defense (in person)

27.3. - 31.3.

Neural networks introduction (TensorFlow, Keras)

Word embeddings & visualizations (offensive language)

RNNs vs. GRUs vs. LSTMs + examples

Introduction to neural networks

3.4. - 7.4.

Tensorflow examples

Multiple simple NN architectures example

Google Colab

SLING tutorial (setup, Singularity, SLURM)

Tensorflow versions example

Simple NNs comparison

Google Colab showcase

SLING (SLURM and HPC usage)

10.4. - 14.4.

(Mon. holiday)

Transformers, BERT (tagging, classification)

BERT (custom task)

BERT (classification - Tensorflow)

BERT (tagging & examples - PyTorch)

17.4. - 21.4.

Graph neural networks for text processing

Graph neural networks for NLP

24.4. - 28.4.

(Thu. holiday)

Generative and conversational AI

Second submission (Friday, 23:59)

Generative AI

1.5. - 5.5.

(Mon., Tue., holiday)

No lab sessions on Monday and Tuesday

Second submission defense on Wednesday (remotely via zitnik.si/zoom)

8.5. - 12.5.

No lab sessions on Wednesday (only the ones held by Slavko!)

(due to Konferenca DSI 2023, if you present a student project @DSI, you get +10 points for the lab part,
link to instructions)

Second submission defense on Monday and Tuesday (in person)

15.5. - 19.5.

Consultations/Project work/Discussions

(Please attend the lab sessions and discuss your work and ideas!)

22.5. - 26.5.

Consultations/Project work/Discussions

(Please attend the lab sessions and discuss your work and ideas!)

Final submission deadline (Friday, 23:59)

      IMPORTANT: Set your repositories' visibility to public before the deadline or shortly after!

Peer review submission deadline (Monday, 29.5.2023, 23:59)

      Peer review link (each group will get an email with repositories to review)!

29.5. - 31.5.

(Thu., Fri., out of range)

No organized lab sessions this week

Final project presentations - Tuesday, May 30 in P04:

    Project 1: 1pm

    Project 2: 2pm

    Project 3: 3pm

    Projects 4 & 5: 4pm

-> Final project presentations are cancelled due to the strike! Ales and Slavko will grade your work and send you results. Please contact us or stop by after Tuesday if you need more clarifications.

Course obligations

Please regularly check the Weekly plan and course announcements for possible changes. You are expected to attend the lab sessions, and attendance at the defense sessions is mandatory. On assignment defense dates, at least one member of a group must be present; otherwise, all members must provide a doctor's justification. For the last assignment, all members must be present and must understand all parts of the submitted solution.

All the work must be submitted using your Github project repository. Submission deadlines are indicated in the table above. Submission defenses will be held during the lab sessions.

Students must work in groups of three members! At most one group of two members is allowed per project type. The distribution of work between members should be evident from the commits in the repository.

Obligation

Description

Final grade relevance (%)

Submission 1

Project selection & simple corpus analysis

  - Group (three members) selection

  - Report containing Introduction, existing solutions/related work and initial ideas

  - Well organized repository

10

Submission 2

Initial implementation / baseline with results

  - Updated Submission 1 parts

  - Implemented at least one solution with analysis

  - Future directions and ideas

  - Well organized repository

20

Submission 3

Final solution and report

  - Final report incl. analyses and discussions

  - Fully reproducible repository

60

Peer review

Evaluate your peer group's work

  - Each group will check final submissions of two other peer groups having the same topic (except Task 5 groups)

10

Total: 100%

Grading criteria

All the graded work is group work, and all work is graded following the scoring schema below. All course obligations must be graded positively (i.e., 6 or more) to pass.

Use your PUBLIC GROUP ID for public communication regarding your group. The GROUP ID is the internal id of your group, under which marks will be publicly available.

Scoring

Scoring is done relative to the achievements of all participants in the course. The instructions define the criteria participants need to address; exactly fulfilling the instructions results in a score of 8. All other scores are relative to the quality of the submitted work. The instructors' role is to find an appropriate clustering of all submissions into 6 clusters. To illustrate the scoring, the schema is as follows:

Score 10: The repository is clear and runnable. The report is well organized, results are discussed, and visualizations add value to the text and are well prepared. Beyond the minimum criteria, the group tried multiple novel ideas to approach the problem.

Score 9: Same as above, but it is visible that the group had novel ideas and ran out of time or did not finish (polish) everything. The submission has multiple minor flaws.

Score 8: The group implemented everything suggested by the minimum criteria but did not investigate further (found little related work, did not apply multiple other techniques, ...).

Score 7: The group implemented everything suggested by the minimum criteria but did not discuss results well, performed only simple analyses, etc. The report is also not well organized and lacks data for reproducibility.

Score 6: The group tried to implement the minimum criteria (or only part of them), but their work has many minor flaws or a few major ones. The report also reflects their motivation.

Score 5: The group did not address one or more points of the minimum criteria, and the report contains major flaws. It can be seen that the group did not invest enough time into the work.

Final project preparation guidelines and peer review instructions

Regarding the final submission, keep the following major remarks in mind:

Peer review instructions:

  1. Please find the projects you need to review (see link above).
  2. Each group needs to review projects of the same topic they have chosen.
  3. Submit your peer review scores in the Google Form (see link above).
  4. You will also be scored on your grading, depending on how much it differs (within some margin) from the assistant's grading.
  5. Follow the scoring criteria presented above and include written feedback along with your mark.

Final project presentation instructions

Each group will have a maximum of 3 minutes (STRICT) to present their project. I will put your report on the projector and you will present alongside it. I suggest focusing on one specific interesting part (e.g., a table, graph, or figure). The most important aspect to present is:

        What is the "take-away message" of your work? This should be concrete and concise, so that anyone, even a complete layperson, can understand it.

See timetable above for timeslots of your presentation. If you cannot attend, please write to me to get an alternative timeslot.

Specific projects information

Project 1: Literacy situation models knowledge base creation

Building a knowledge base based on situation models from selected English/Slovene short stories. The knowledge base can focus on a subset of the following inference types: Referential, Case structure role assignment, Causal antecedent, Superordinate goal, Thematic, Character emotional reaction, Causal consequence, Instantiation of noun category, Instrument, Subordinate goal, State, Emotion of reader, Author's intent.

You can check NLP course 2022 project reports and build on top of that.

Description: Extraction of concepts/entities/relationships/topics/time-aware data from novels. Creation of a book knowledge base and methods for inference and visualization (e.g. main characters, good/bad characters, "dramski trikotnik" (story grammar), time and space analysis, ...)

Task: (1) selection and retrieval of additional books (EN/SL) - 7 English provided, (2) review and detailed proposal of analysis to be done, (3) analysis with evaluations, explanations from narrative texts/stories, for example: causal chain summaries.

Provided English books (link):

        - The Ransom of Red Chief

        - Hills Like White Elephants

        - Leiningen Versus the Ants

        - The Lady or the Tiger

        - The Most Dangerous Game

        - The Tell-Tale Heart

        - The Gift of the Magi

References:

        - Caselli, Vossen (2017): The Event StoryLine Corpus: A New Benchmark for Causal and Temporal Relation Extraction (github)

        - Nahatame (2020): Revisiting Second Language Readers’ Memory for Narrative Texts: The Role of Causal and Semantic Text Relations

        - van den Broek (1985): Causal Thinking and the Representation of Narrative Events

        - Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli, Allen (2016): ​​A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

        - Dasgupta, Saha, Dey, Naskar (2018): Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks

        - McNamara (2011): Coh-Metrix: An Automated Tool for Theoretical and Applied Natural Language Processing

        - Zwaan, Magliano, Graesser (1995): Dimensions of Situation Model Construction in Narrative Comprehension

        - Zwaan (1999): Situation Models: The Mental Leap Into Imagined Worlds

        - Antonio Lieto, Gian Luca Pozzato, Stefano Zoia, Viviana Patti, Rossana Damiano (2021): A Commonsense Reasoning Framework for Explanatory Emotion Attribution, Generation and Re-classification, https://arxiv.org/abs/2101.04017 

        - Filip Ilievski, Pedro Szekely, Bin Zhang (2020): CSKG: The CommonSense Knowledge Graph, https://arxiv.org/abs/2012.11490 

- Prof. Glenn Smith materials + additional video clarification: LINK

- Interesting blog post: Creating a Knowledge Graph From Video Transcripts With ChatGPT

Proposed methodology:

        1. Datasets collection & description (short stories, longer novels, # of stories for manual analysis, ...).

        2. Definition of clear aspects of your work, for example:

                - identification of entities, relationships,

                - causal or time-based analysis of events,

                - sentiment analysis of characters related to events,

                - see NLP course 2022 project reports and build on top of that, ...

        3. Identification of additional datasets and tools (papers with runnable repositories) that you can use.

        4. Analysis setup and running, preparing results, for example:

                - general statistics of uncovered data from books

                - manual check of a selection of stories

                - changing parameters, using different setups/models to provide comparison tables

                - provision of one or more interesting visualizations of results (if you focus more on this, you might perform less analysis types)

        5. Writing a final report + additional tests if needed (reserve one week for doing this).
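As a minimal, self-contained illustration of step 2 (identification of entities and relationships), the sketch below counts per-sentence character co-occurrences, a common first step toward a character network. The character names and story snippet are invented for the example; a real pipeline would obtain characters via NER and coreference resolution rather than a hand-made list.

```python
import re
from collections import Counter
from itertools import combinations

def character_cooccurrence(text, characters):
    """Count how often pairs of known characters appear in the same sentence.

    `characters` is a hand-made list of names; a real pipeline would
    obtain them from an NER / coreference step instead.
    """
    # Naive sentence split on ., ! and ? (good enough for a sketch).
    sentences = re.split(r"[.!?]+", text)
    counts = Counter()
    for sentence in sentences:
        present = sorted({c for c in characters if c in sentence})
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Invented snippet loosely in the spirit of a kidnapping short story.
story = ("Bill and Sam kidnapped Johnny. Johnny chased Bill around the cave. "
         "Sam wrote the ransom letter. Bill begged Sam to return Johnny.")
counts = character_cooccurrence(story, ["Bill", "Sam", "Johnny"])
```

The resulting counts can be fed directly into a graph library, with edge weights proportional to co-occurrence frequency, as one of the proposed visualizations.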

Project 2: (Semi-)automatic integration, translation and entity linking of knowledge bases

There exist many knowledge bases, created manually or automatically. Creating or curating a manual knowledge base is time consuming. WordNet is an English lexical database and one of the most widely used language resources. It is organized into synsets and the relationships between them. After the English version, similar databases were created for 80 languages, including sloWNet for Slovene. WordNet data is also part of BabelNet, which interconnects concepts/named entities (22 million) in 520 languages.

Current sloWNet issues:

        - sloWNet (43,460 concepts) contains only a part of the English WordNet (117,000 concepts)

        - The data is not manually curated, and many errors exist due to the automatic creation of the corpus. Only 33,541 out of 71,794 words have been manually checked. An example of an unchecked synset is "eng-30-00839194-v":

"It contains the following English words and phrases: bamboozle, snow, hoodwink, pull the wool over someone's eyes, lead by the nose, play false. The concept denotes an act where someone wants to deceive another or gain benefit by concealing bad intentions through pretence (English definition: conceal one's true motives from, especially by elaborately feigning good intentions so as to gain an end). The only Slovene literal offered by the automatic procedure is the completely wrong "snežiti" ("to snow"), probably via a translation of the English "snow". One would expect a synonym set with elements such as: preslepiti, naplahtati, pretentati, okrog prinesti, potegniti za nos. Perhaps also: opehariti, ogoljufati, nafarbati, which can be obtained from the thesaurus database: https://viri.cjvt.si/sopomenke/slv/state?mw=preslepiti&mid=44537&source=synonyms_page."

        - For Slovene it is important that synsets are interconnected using Slovene semantic data.

Proposed methodology:

        1. Review existing resources (Digitalna slovarska baza, WordNet, sloWNet, Sloleks, WikiData, BabelNet).

        2. Think of a methodology/algorithms/procedures to create a Slovene semantic knowledge base (as accurate as possible, maybe annotated using probability on edges).

        3. Collect and/or translate (use an existing NMT - e.g., eTranslation, https://github.com/clarinsi/Slovene_NMT) selected resources. You can also use larger text-only corpora and investigate collocations, ... (Gigafida, kolokacije - https://viri.cjvt.si/kolokacije/eng/).

        4. Integrate, clean dataset.

        5. (Optional) Review and try to entity link your data (Digitalna slovarska baza, WikiData).

        6. Repeat the previous steps and improve the data. Document justifications/measurable facts about the improvements in data quality.

        7. Write a final report.
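The "probability on edges" idea from step 2 can be prototyped by accumulating a confidence score for each (synset, literal) edge across resources and keeping only well-supported edges, so a literal proposed independently by two resources outranks a one-off automatic error. The resource names and scores below are illustrative only; the synset id and literals are taken from the example discussed above.

```python
from collections import defaultdict

def merge_candidates(sources, min_score=1.0):
    """Merge literal candidates from several resources into one knowledge base.

    `sources` maps a resource name to {synset_id: {literal: confidence}}.
    Confidences for the same (synset, literal) pair are summed across
    resources, and only edges reaching `min_score` are kept.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for resource in sources.values():
        for synset_id, literals in resource.items():
            for literal, confidence in literals.items():
                merged[synset_id][literal] += confidence
    # Keep only edges whose accumulated score reaches the threshold.
    return {
        synset_id: {lit: s for lit, s in lits.items() if s >= min_score}
        for synset_id, lits in merged.items()
    }

sources = {
    "auto-translation": {"eng-30-00839194-v": {"snežiti": 0.4, "preslepiti": 0.6}},
    "thesaurus":        {"eng-30-00839194-v": {"preslepiti": 0.7, "pretentati": 0.5}},
}
kb = merge_candidates(sources)
```

Here the erroneous "snežiti" is filtered out because only one resource proposes it, while "preslepiti" survives with an accumulated score on its edge.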

References:

        Project ideas description (private use only!).

        sloWNet presentation

Project 3: Paraphrasing sentences

Sentence paraphrasing involves the ability to generate alternative phrasings of a sentence while still conveying the same meaning. Students will create a dataset consisting of a range of sentence paraphrases and train generative models capable of producing rephrased sentences.

Proposed methodology:
        
        1. Find an appropriate dataset on clarin.si (ccGigafida, ccKres, …) and create a data set using appropriate methods.
                a. Methods: back-translation, automatic translation of non-Slovene datasets, …
                b. Choose an appropriate tool for translation (https://github.com/clarinsi/Slovene_NMT, …)
        2. Evaluate a small part of the created data set using manual techniques. Design a simple metric that shows how good your dataset is (e.g., similarity, clarity, fluency).
        3. Train generative models. You can use text-to-text models (T5) or autoregressive models (GPT).
        4. Evaluate the performance of your models.
                a. Automatically (choose an appropriate metric described in the references)
                b. Manually (use the metric you designed in the 2nd step)
                c. You can also compare your results with large LMs (such as GPT-3 models)
        5. Write a final report.
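The back-translation method from step 1a can be sketched as a small pipeline that is agnostic to the actual translator. The `translate` callable below is a placeholder; in practice you would wire it to a real NMT system such as the Slovene_NMT service linked above. The toy lookup table exists only to demonstrate the pipeline shape.

```python
def back_translate(sentence, translate, pivot="en", source="sl"):
    """Paraphrase by translating to a pivot language and back.

    `translate(text, src, tgt)` is a stand-in for a real NMT call;
    only the pipeline shape is shown here.
    """
    pivot_text = translate(sentence, source, pivot)
    paraphrase = translate(pivot_text, pivot, source)
    # Keep only paraphrases that actually differ from the input.
    return paraphrase if paraphrase != sentence else None

# Toy translator for demonstration: a lookup table standing in for NMT.
toy_table = {
    ("danes je lep dan", "sl", "en"): "today is a nice day",
    ("today is a nice day", "en", "sl"): "danes je krasen dan",
}
def toy_translate(text, src, tgt):
    return toy_table[(text, src, tgt)]

paraphrase = back_translate("danes je lep dan", toy_translate)
```

Returning `None` for unchanged round-trips is one simple way to filter trivial outputs before the manual evaluation in step 2.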

References:
        - J. Zhou and S. Bhat, "Paraphrase Generation: A Survey of the State of the Art," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 5075–5086. doi: 10.18653/v1/2021.emnlp-main.414.
        - C. Federmann, O. Elachqar, and C. Quirk, "Multilingual Whispers: Generating Paraphrases with Translation," in Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, Nov. 2019, pp. 17–26. doi: 10.18653/v1/D19-5503.
        - L. Shen, L. Liu, H. Jiang, and S. Shi, "On the Evaluation Metrics for Paraphrase Generation." arXiv, Oct. 08, 2022. [Online]. Available: http://arxiv.org/abs/2202.08479.

Project 4: Building a conversational AI (advanced)

Large language models are the key ingredient of advanced natural language processing techniques. Recently, a number of very large language models were introduced, such as LaMDA, BLOOM, GPT(-3), Galactica, ... Training such models is infeasible without powerful GPU infrastructure and large amounts of corpora. On top of these models, general text-to-text models are often trained, as opposed to training a specific model for each NLP task, such as text classification or question answering.

Task: Get to know LLMs and try to understand, at a high level, how they are created. Try to prepare a large amount of conversational data in Slovene (this is the focus of this task!) that is correctly organized and of sufficient quality to be used for fine-tuning a multilingual LLM that supports Slovene. Demonstrate your work by adapting a model to create a chatbot for Slovene (perhaps compare it with ChatGPT, or with translating Slovene utterances into English and back).

Proposed methodology:
        1. Review usable LLMs, select one that you might use (e.g., within SLING infrastructure, VEGA, Nvidia A100 GPUs).
        2. (main goal of the project) Prepare a plan for data gathering, identify sources. Write crawlers, ... organize data in a way that is useful for
"fine-tuning" the model. Check papers (e.g., BLOOM's) to get to know, what aspects are important when preparing data.
        3. Use the data to adapt an existing model using your data.
        4. Write a final report.

References:
        - Long Ouyang, Jeff Wu, Xu Jiang, et al., Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155.

        - Teven Le Scao, Angela Fan, Christopher Akiki, et al., BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, https://arxiv.org/abs/2211.05100, model: https://huggingface.co/bigscience/bloom

Project 5: Word sense disambiguation (Digital linguistics students only!)

The task of word sense disambiguation involves determining the correct meaning of a word. Students will create a data set aimed at training word sense disambiguation models. The creation of this data set will involve converting existing data into a new format and generating additional examples using both automatic and manual methods.

You can check NLP course 2022 project reports and build on top of that.

Proposed methodology:
        1. Create a list of candidate words. You can use the Elexis-WSD or sloWNet datasets for a list of highly polysemous words.
        2. Develop a method for automatic extraction of sentence pair candidates that contain a word from your list.
                a. Choose an existing large corpus in Slovene (ccKres is suggested, but there are many others available on the clarin.si repository)
                b. Choose a technique to numerically represent candidates (this can be done at the word or sentence level)
                c. Group sentences using e.g., clustering methods to narrow down the choices and speed up the process. Automatically assign truth values.
        3. Manually verify the selected sentence pairs and correct truth values as necessary. Each team member should contribute at least 200 examples.
        4. Transform the existing Elexis-WSD dataset into the WiC format (check the format of the dataset in the references). Create as many positive and negative examples as possible. Join both datasets.
        5. Write a final report.
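Step 4's transformation into the WiC format essentially pairs up sense-annotated usages of the same word: matching sense ids yield a positive example, differing ids a negative one. A sketch with invented sense ids and sentences, using the genuinely polysemous Slovene word "list" (leaf / sheet of paper):

```python
from itertools import combinations

def to_wic_pairs(examples):
    """Build WiC-style sentence pairs from sense-annotated usages.

    `examples` is a list of (target_word, sense_id, sentence) triples;
    each pair of usages of the same word yields one labelled example.
    """
    pairs = []
    for (w1, s1, sent1), (w2, s2, sent2) in combinations(examples, 2):
        if w1 != w2:
            continue  # WiC pairs always share the target word
        pairs.append({"word": w1, "sentence1": sent1,
                      "sentence2": sent2, "label": s1 == s2})
    return pairs

examples = [
    ("list", "list-1", "Z drevesa je padel zadnji list."),       # leaf
    ("list", "list-2", "Na mizi je ležal prazen list papirja."),  # sheet of paper
    ("list", "list-1", "Jesenski list je obležal na tleh."),      # leaf
]
pairs = to_wic_pairs(examples)
```

Pairing every combination maximizes the number of examples per annotated word, as the methodology suggests, though you may want to cap pairs per word to keep the label distribution balanced.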

References:
        - M. T. Pilehvar and J. Camacho-Collados, "WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations." arXiv, Apr. 27, 2019. [Online]. Available: http://arxiv.org/abs/1808.09121
        - A. Wang et al., "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." arXiv, Feb. 12, 2020. doi: 10.48550/arXiv.1905.00537.