Laboratory work - Spring 2024

The main goal of the laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for an NLP engineer. We especially emphasize self-paced work and raising standards related to development, replicability, reporting, research, visualization, etc. Our goal is not to provide exact instructions or to "make robots" out of the participants of this course. Participants will need to navigate the data themselves, identify promising leads, and extract as much information as possible from the data to present to others (colleagues, instructors, companies, or their superiors).

Important links

Lab sessions course repository (continuously updated, use weekly plan links for latest materials)

Books and other materials

        Speech and language processing (online draft)

        Python 3 Text Processing with NLTK 3 Cookbook

Introduction to Data Science Handbook 

Razvoj slovenščine v digitalnem okolju (Development of Slovene in the Digital Environment, February 2023)

Previous years' NLP course materials

        NLP course 2021 project reports

        NLP course 2022 project reports

        NLP course 2023 project reports

NLP course 2024 projects

        Marks

Peer review

        Peer review submission form (TBA)

        

Weekly plan

This plan is regularly updated.

 

Lab sessions are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will show some DEMOs based on which you will work on your projects. Based on your proposals / new ideas we can adapt the weekly plan and prepare additional materials.

All the lab session tutorials will be regularly updated in the Github repository. During the lab sessions we will briefly present each week's topic and then mostly discuss your project ideas and work. You are expected to check/run notebooks before the lab sessions and then ask questions/discuss during the lab sessions. In the repository's README you can also find the recordings of each topic.

Week / Description / Materials and links

19.2. - 23.2.

/

26.2. - 1.3.

Lab work introduction

Projects overview

Group work and projects application procedure

Basic text processing

Slovene text processing

Course overview and introduction

4.3. - 8.3.

Text clustering

Text classification

Traditional sequence tagging (HMM, MEMM, CRF, ...)

Language models, knowledge bases

Project sign-up form (deadline Friday midnight).

GitHub Classroom assignment (deadline Friday midnight; only one group member creates a team; exactly three members per group!).

11.3. - 15.3.

Neural networks introduction (TensorFlow, Keras)

Word embeddings & visualizations (offensive language)

RNNs vs. GRUs vs. LSTMs + examples

Multiple simple NN architectures example (Google Colab)

18.3. - 22.3.

Introduction to PyTorch

PyTorch Lightning

SLING tutorial (setup, Singularity, SLURM)

First submission (Friday, 23:59)

* There will be no lab session on Wednesday at 10am. Please attend another group's session if possible.

25.3. - 29.3.

First submission defense (in person)

1.4. - 5.4.

(Mon. holiday)

Transformers, BERT (custom task),

BERT (tagging, classification),  

KeyBERT (keyword extraction), TopicBERT (topic modeling)

8.4. - 12.4.

Generative and conversational AI

15.4. - 19.4.

Prompting and Efficient Fine-Tuning of a Large Language Model

Retrieval Augmented Generation (RAG)

22.4. - 26.4.

Graph neural networks for text processing

No lab session on Thu, 5pm!

29.4. - 3.5.

(Wed., Thu., Fri. holiday)

No lab sessions

Second submission (Friday, 23:59)

6.5. - 10.5.

Second submission defense (in person)

Boshko's lab sessions will be held online: https://zoom.us/j/3145561389

13.5. - 17.5.

Consultations/Project work/Discussions

(Please attend the lab sessions and discuss your work and ideas!)

No lab session on Wed., 10am!

(DSI 2024 conference: if you attend the Hackathon, you get +10 points for the lab part,
https://hackathon.si)

20.5. - 24.5.

Project work/Online discussions

(LREC-COLING 2024)

No lab sessions on Tue 3pm, Wed 4pm, Thu 5pm!

Final submission deadline (Friday, 23:59)

      IMPORTANT: Set your repository's visibility to public before the deadline or shortly after!

Peer review submission deadline (Monday, 27.5.2024, 23:59)

      Peer review link (each group will get an email with repositories to review)!

27.5. - 31.5.

No organized lab sessions this week

Final project presentations - Wednesday, May 29 in P21:

    Projects 1 & 2: 2pm

    Projects 3, 4 & 5: 3pm

    Projects 6 & 7: 4pm

    Project 9: during your lab session

Course obligations

Please regularly check the Weekly plan and course announcements for possible changes. You are expected to attend the lab sessions, and you must attend the defense sessions. On assignment defense dates, at least one member of a group must be present; otherwise, all members need to provide a doctor's justification. For the last assignment, all members must be present and need to understand all parts of the submitted solution.

All the work must be submitted using your Github project repository. Submission deadlines are indicated in the table above. Submission defenses will be held during the lab sessions.

Students must work in groups of three members! There can be only one group of two members per project type. The distribution of work between members should be evident from the commits within the repository.

Obligation / Description / Final grade relevance (%)

Submission 1: Project selection & simple corpus analysis (10%)

  - Group (three members) selection

  - Report containing an introduction, existing solutions/related work, and initial ideas

  - Well-organized repository

Submission 2: Initial implementation / baseline with results (20%)

  - Updated Submission 1 parts

  - At least one implemented solution with analysis

  - Future directions and ideas

  - Well-organized repository

Submission 3: Final solution and report (60%)

  - Final report incl. analyses and discussions

  - Fully reproducible repository

Peer review: Evaluate your peer groups' work (10%)

  - Each group will check the final submissions of two other peer groups with the same topic

Total: 100%

Grading criteria

All the graded work is group work. All the work is graded following the scoring schema below. All the course obligations must be graded positive (i.e., 6 or more) to pass.

Use your PUBLIC GROUP ID for public communication regarding your group. The GROUP ID is your group's internal id, under which marks will be publicly available.

Scoring

Scoring is done relative to the achievements of all the participants in the course. The instructions define the criteria the participants need to address, and exactly fulfilling the instructions will result in a score of 8. All other scores will be relative to the quality of the submitted work. The role of the instructors is to find an appropriate clustering of all the works into 6 clusters. To better illustrate the scoring, the schema is as follows:

10: The repository is clear and runnable. The report is well organized, results are discussed, and visualizations add value to the text and are well prepared. Beyond the minimum criteria, the group tried multiple novel ideas of their own to approach the problem.

9: Same as above, but it is visible that the group had novel ideas and ran out of time or did not finish (polish) everything. The submission has multiple minor flaws.

8: The group implemented everything suggested by the minimum criteria but did not investigate further (did not find much related work, did not apply multiple other techniques, ...).

7: The group implemented everything suggested by the minimum criteria but did not discuss the results well, performed only simple analyses, etc. The report is also not well organized and lacks data for reproducibility.

6: The group tried to implement the minimum criteria (or only part of them), but the work has many minor flaws or a few major ones. The report also reflects their motivation.

5: The group did not address one or more points of the minimum criteria and the report contains major flaws. It can be seen that the group did not invest enough time into the work.

Final project preparation guidelines and peer review instructions

Keep in mind the following major remarks regarding the final submission:

Peer review instructions:

  1. Please find the projects you need to review (see link above).
  2. Each group needs to review projects of the same topic they have chosen.
  3. Submit your peer review scores in the Google Form (see link above).
  4. You will also get a score for your grading, depending on how much (allowing for some margin) your grading differs from the assistant's grading.
  5. Follow the scoring criteria as presented above and include feedback along with your mark.

Final project presentation instructions

Each group will have max. 3 minutes (STRICT) to present their project. I will put your report on the projector and you will present along with it. I propose that you focus on a specific interesting part (e.g., a table, graph, figure, ...). The most important aspect to present is:

        What is the "take-away message" of your work? This should be concrete and concise, so that anyone can understand (also a completely lay person).

See timetable above for time slots of your presentation. If you cannot attend, please write to me to get an alternative time slot.

Specific projects information

Project 1: LLM Prompt Strategies for Commonsense-Reasoning Tasks (Aleš): This project aims to explore and compare various prompt strategies to enhance commonsense reasoning in large language models (LLMs). Students will investigate methods such as Chain of Thought (CoT), in-context learning, plan-and-solve techniques, etc., to improve the model's performance on tasks requiring commonsense knowledge. The project will involve designing experiments to evaluate the effectiveness of each strategy, analyzing the models' reasoning processes, and understanding how different prompting techniques influence the outcomes.

Proposed methodology:

  1. Literature review on current prompt strategies and their applications in commonsense reasoning.
  2. Selection of a commonsense reasoning dataset (e.g., Winograd Schema Challenge).
  3. Design and implementation of experiments to compare the effectiveness of various prompt strategies.
  4. Detailed analysis of model responses to identify strengths and weaknesses of each strategy (usage of an HPC is obligatory!).
  5. Final report summarizing findings, with recommendations for best practices in prompting for commonsense reasoning tasks.
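
To make the comparison concrete, below is a minimal sketch of contrasting a direct prompt with a chain-of-thought prompt on a single Winograd-style example. The checkpoint and prompt wording are placeholder assumptions, not a prescribed setup; substitute the LLM, dataset, and strategies you actually select.

    # Minimal sketch: direct vs. chain-of-thought prompting on one Winograd-style item.
    # The model checkpoint and prompt phrasing are illustrative assumptions only.
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-large")

    example = ("The trophy does not fit into the brown suitcase because it is too large. "
               "What is too large?")

    prompts = {
        "direct": example + " Answer with a single word.",
        "chain_of_thought": example + " Let's think step by step, then give a one-word answer.",
    }

    for name, prompt in prompts.items():
        answer = generator(prompt, max_new_tokens=64)[0]["generated_text"]
        print(f"{name}: {answer}")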

References:

Project 2: Parameter-Efficient Fine-Tuning of Language Models (Aleš): This project focuses on investigating parameter-efficient techniques for fine-tuning large language models, such as Low-Rank Adaptation (LoRA), soft prompts, etc. Students will compare different approaches across various NLP tasks to assess the efficiency and effectiveness of each fine-tuning strategy. The evaluation will consider model performance, computational efficiency, and adaptability to different tasks.

Proposed methodology:

  1. Reviewing parameter-efficient fine-tuning techniques and selecting appropriate methods for experimentation.
  2. Designing experiments to compare learning across multiple NLP tasks. Selecting at least 5 different datasets that cover various natural language understanding skills (commonsense reasoning, coreference resolution, text summarization, etc.) and supervised learning settings (classification & generation).
  3. Evaluating the models based on appropriate performance metrics, computational resources required, and ease of adaptation to different tasks. It is obligatory to submit your results publicly to SloBENCH.
  4. Writing a comprehensive report that discusses the experimental setup, findings, and recommendations for efficient fine-tuning of language models.
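
As an illustration of the kind of setup the project expects, here is a minimal sketch of attaching a LoRA adapter to a classification model with the Hugging Face peft library. The base checkpoint, number of labels, and hyperparameters are assumptions to be replaced by your own choices; the same pattern carries over to generation tasks by changing the task type and base model.

    # Sketch: wrapping a sequence-classification model with a LoRA adapter (peft).
    # Base model, label count, and LoRA hyperparameters are placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    base = "xlm-roberta-base"                      # assumption: any suitable encoder
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

    lora_cfg = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()             # only a small fraction of weights is trainable

    # The wrapped model can be passed to a standard transformers Trainer for fine-tuning.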

References:

Project 3: Cross-Lingual Question Generation (Boshko): This project aims to extend the Doc2Query approach, which utilises a T5 model fine-tuned on the MSMARCO dataset for generating queries from documents, to the domain of question generation in multiple languages. The students will assess the quality of questions generated by the model and its effectiveness across different languages, thereby understanding the challenges and opportunities of applying such models in a cross-linguistic context. The students will then fine-tune the given system on Slovenian datasets and evaluate the outputs.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive review of existing literature on question generation models, focusing on Doc2Query and its applications, as well as cross-lingual NLP techniques.
  2. Dataset Selection and Preparation: Select a relevant Slovenian question answering dataset (e.g., SQuAD) or construct one from a sample of Slovenian news articles.
  3. Model Fine-Tuning: Fine-tune the T5 model on the selected datasets, adapting the Doc2Query approach for question generation tasks. It is obligatory to use an HPC.
  4. Quality Assessment: Design a framework for evaluating the quality of generated questions, considering factors such as relevance, coherence, and linguistic correctness, for both the pre-trained and the fine-tuned model. It is obligatory to manually check and update 300 examples (QA pairs); try to have up to 100 of the same examples evaluated by all members of the group.
  5. Final report summarising the results, highlighting advancements and limitations.
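
A minimal sketch of the Doc2Query-style generation step is given below. The checkpoint name is an assumed publicly available MS MARCO doc2query model; for the project you would later swap in your own model fine-tuned on Slovene data.

    # Sketch: generating candidate questions from a passage with a Doc2Query-style T5 model.
    # The checkpoint name is an assumption; replace it with the model you use/fine-tune.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "doc2query/msmarco-t5-base-v1"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    passage = ("Ljubljana is the capital and largest city of Slovenia, known for its "
               "university and its green riverside centre.")

    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=10,
                             num_return_sequences=3)
    for ids in outputs:
        print(tokenizer.decode(ids, skip_special_tokens=True))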

References:

Project 4: Slovenian Instruction-based Corpus Generation (Slavko): Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of fields, including specialized domains such as programming and creative writing. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public. Recently, a number of very large language models were introduced, such as LaMDA, BLOOM, GPT(-3), Galactica, Mixtral, OPT, ... It is infeasible to train such models without a powerful GPU infrastructure and large amounts of corpora. Based on these models, general text-to-text models are often trained, instead of training a specific model for each NLP task, such as text classification or question answering. Your task is to get to know LLMs and to understand their creation at a high level. Try to prepare large amounts of conversational data in Slovene (this is the focus of this task!) that is correctly organized and of good enough quality to be used for fine-tuning a multilingual LLM (one that supports Slovene). Demonstrate your work by fine-tuning such a model into a conversational agent for Slovene.

Proposed methodology:

  1. Review usable LLMs, select one that you might use (e.g., within SLING infrastructure, VEGA, Nvidia A100 GPUs).
  2. (main goal of the project) Review dataset construction and the categorization of instructions for selected instruction-based LLMs. Prepare a plan for data gathering and identify sources (e.g., med-over.net, the slo-tech forum, ...). Write crawlers, organize the data in a way that is useful for "fine-tuning" the model, etc. Check papers (e.g., BLOOM's, LLaMA 2's) to learn what aspects are important when preparing data.
  3. Use the data to adapt an existing model using your data (optional).
  4. Report on all your findings in the final report.
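
For step 2, a common way to organize gathered conversational/instruction data is one JSON object per line with instruction, input, and output fields; the sketch below shows this layout. The example record, field names, and file name are assumptions, not a required schema, but keeping provenance per record makes later filtering and licensing checks easier.

    # Sketch: storing Slovene instruction data as JSON Lines (one example per line).
    # The record, field names, and file name are illustrative placeholders.
    import json

    records = [
        {
            "instruction": "Povzemi naslednje besedilo v enem stavku.",
            "input": "Ljubljana je glavno mesto Slovenije in središče njenega kulturnega življenja ...",
            "output": "Ljubljana je glavno mesto in kulturno središče Slovenije.",
            "source": "example",   # keep provenance for filtering and licensing checks
        },
    ]

    with open("slovene_instructions.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")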

References:

Project 5: Unsupervised Domain adaptation for Sentence Classification (Boshko): This project seeks to improve document representation in specialized domains by adapting sentence-transformer models, which, while effective, are not inherently tuned to specific fields. The focus will be on investigating two advanced adaptation techniques: TSDAE (Transformer-based Denoising AutoEncoder) and GPL (generative pseudo labeling). These methods aim to refine the representation space, making it more sensitive and accurate within a given domain. The students will evaluate the effect of the adaptation on the classification result.

Proposed methodology:

  1. Literature review on sentence transformers, TSDAE and GPL to understand their application in information retrieval.
  2. Selection of a (Slovenian) classification dataset for domain adaptation experiments (SentiNews, https://www.clarin.si/repository/xmlui/handle/11356/1110).
  3. Design and implementation of experiments to assess the impact of domain adaptation techniques on classification performance. It is obligatory to use an HPC.
  4. Detailed analysis of classification results to determine the effectiveness of TSDAE, GPL, and ranking functions.
  5. Final report summarizing the findings, with recommendations on the feasibility of domain adaptation in information retrieval systems for classification.
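
A minimal sketch of the TSDAE adaptation step with the sentence-transformers library is shown below, following its documented TSDAE recipe; the base checkpoint, the two placeholder sentences, and the training settings are assumptions to replace with your in-domain SentiNews data.

    # Sketch: unsupervised TSDAE adaptation with sentence-transformers.
    # Base encoder, example sentences, and hyperparameters are placeholders.
    from sentence_transformers import SentenceTransformer, models, datasets, losses
    from torch.utils.data import DataLoader

    base = "bert-base-multilingual-cased"          # assumption: any BERT-style encoder
    word_embedding = models.Transformer(base)
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # In practice: thousands of unlabelled in-domain sentences (e.g., from SentiNews).
    sentences = ["Prvi primer stavka iz ciljne domene.", "Drugi primer stavka iz ciljne domene."]

    train_data = datasets.DenoisingAutoEncoderDataset(sentences)   # adds deletion noise
    loader = DataLoader(train_data, batch_size=8, shuffle=True)
    loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=base,
                                           tie_encoder_decoder=True)

    model.fit(train_objectives=[(loader, loss)], epochs=1, weight_decay=0,
              scheduler="constantlr", optimizer_params={"lr": 3e-5}, show_progress_bar=True)
    model.save("tsdae-adapted-encoder")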

References:

 

Project 6: Qualitative Research on Discussions - text categorization (Slavko): Qualitative discourse analysis is an important way in which social scientists research human interaction. Large language models (LLMs) offer potential for tasks like qualitative discourse analysis, which demands a high level of inter-rater reliability among human "coders" (i.e., qualitative research categorizers). This is an exceedingly labor-intensive task, requiring human coders to fully understand the discussion context, consider each participant's perspective, and comprehend each sentence's associations with the previous discussion, as well as shared general knowledge. In this task, you will create a model to categorize postings in online discussions, such as the provided corpus: an online discussion about the story "The Lady, or the Tiger?". We provide a coded dataset with high inter-rater reliability and a codebook including definitions of each category with examples. Your task is to build and train a highly reliable language model for this coding task that generalizes to other online discussions.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive literature review on discourse analysis or dialogic analysis, focusing on the coding criteria and applied approaches related to NLP.
  2. Data Exploration: Explore and understand the provided coded discourse dataset (FINAL DATA (unfiltered as in real-life scenarios)).
  3. Fine-tuning Models: Building and fine-tuning LLMs on the provided dataset to predict the discussion category of a posting, considering the discussion context, associations with the previous sentences and the involved participants. It is obligatory to use an HPC.
  4. Performance Evaluation: Exploring the metrics used to evaluate the performance of your built models in discourse analysis. Iteratively compare your model's performance with that of human coders (categorizers) and revise it based on the results. You have the option to implement your own evaluation approaches, or to compare your model's performance with that of other alternative models working on the dataset. Your model will also be tested, for generalizability, on another coded online discussion dataset with a different codebook.
  5. Generate explanations of the categories with an LLM: use a separately fine-tuned LLM to generate the explanations and qualitatively assess them.
  6. Final Report: Delivering a comprehensive report on your findings, emphasizing the effectiveness, innovation, and limitations of your proposed models.
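
One plausible way to frame step 3 is sequence-pair classification: the preceding discussion context and the current posting are encoded together and mapped to a codebook category. The sketch below illustrates this; the model name, the example categories, and the example postings are assumptions, since the real labels come from the provided codebook.

    # Sketch: framing discourse coding as context-aware sequence-pair classification.
    # Model name, categories, and example postings are illustrative placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    categories = ["agreement", "disagreement", "elaboration", "other"]   # placeholder codebook
    label2id = {c: i for i, c in enumerate(categories)}
    id2label = {i: c for i, c in enumerate(categories)}

    model_name = "xlm-roberta-base"                # assumption: any suitable encoder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(categories), id2label=id2label, label2id=label2id)

    context = "Previous posting: I think the princess pointed to the tiger out of jealousy."
    posting = "But would she not rather see him alive, even with another woman?"

    # Encode context and posting as a sentence pair so the model can use discussion history.
    inputs = tokenizer(context, posting, truncation=True, return_tensors="pt")
    predicted = model(**inputs).logits.argmax(-1).item()
    print(id2label[predicted])                     # untrained output; fine-tune with a Trainer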

References:

Project 7: Conversations with Characters in Stories for Literacy - Quick, Customized Persona Bots from Novels (Slavko): There is a world-wide literacy crisis (Murray, 2021; OECD, 2015, 2019). Young people hate reading and rarely read recreationally. They fail at high-level literacy skills, e.g., evaluating texts for validity and integrating across texts to create personal knowledge. Yet, literacy is vital for educational and professional success, life happiness and societal health. One way to motivate young people to read is through conversational interaction with digital personifications of characters (pedagogical agents or PersonaBots) from novels. LLMs provide possible solutions. Khanmigo offers personaBot ChatGPT text conversations with Jay Gatsby (of the classic novel The Great Gatsby) and with Obama. However, its offerings are limited: Khanmigo provides no information on development time for personaBots, nor does it offer customized personaBots from user-suggested novels. Quick, customized personaBots for conversations with characters from teacher-suggested novels would be enormously educational. To ensure a personaBot is fully contextualized in the specific story and at the same time stays within the constraints of token limitations, our suggestion is to consider current retrieval and indexing techniques (i.e., Retrieval Augmented Generation) or to implement more efficient vector search or similarity computation approaches.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive literature review (and services such as character.ai) on personaBots and pedagogical agents in language arts. Look at theoretical systems that inform PersonaBot design, personality theory, situation models, pedagogical agents and more.
  2. Explore provided sample stories with example scripts for conversations with characters, e.g., simple xml hard-coded scripts. Also, you will be provided with suggested novels in the public domain as test examples (No (useful) data available, just description and example from IMapBook).
  3. Based on formative evaluation, fine-tune models for creating custom PersonaBot conversations with characters from suggested stories and novels.
  4. Performance Evaluation: Test your persona bots on newly suggested stories/novels with sample users (high school or university students) through teleconferences and analyze the transcripts of the conversations. Explore metrics to evaluate the performance of your personaBots and describe what the evaluation would look like. You have the option to implement your own evaluation approaches, or to compare your model's performance with that of other alternative models working on the dataset.
  5. Final Report: Delivering a comprehensive report on your findings, emphasizing the effectiveness, innovation, and limitations of your proposed models for creating customized personaBots based on characters in novels.
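
To keep a personaBot grounded in its novel within token limits, a retrieval step like the one sketched below can select the passages most relevant to the reader's message and prepend them to the character prompt. The encoder checkpoint, the paraphrased Gatsby passages, and the prompt template are illustrative assumptions.

    # Sketch: retrieval-augmented prompting for a persona bot.
    # Encoder, passages, and prompt template are illustrative placeholders.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumption: any sentence encoder

    passages = [
        "Gatsby stretched out his arms toward the green light across the bay.",
        "He had waited five years and bought a mansion close to Daisy's home.",
    ]
    passage_emb = embedder.encode(passages, convert_to_tensor=True)

    user_msg = "Why do you keep staring at that light across the water?"
    query_emb = embedder.encode(user_msg, convert_to_tensor=True)

    hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
    retrieved = "\n".join(passages[hit["corpus_id"]] for hit in hits)

    prompt = ("You are Jay Gatsby. Stay in character and answer using only what the "
              "following passages support:\n" + retrieved +
              "\n\nReader: " + user_msg + "\nGatsby:")
    print(prompt)   # this prompt is then sent to the LLM of your choice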

References:

Project 8: Automatic identification of multiword expressions and definition generation (Slavko): Understanding the relations between the meanings of words is an important part of comprehending natural language. A lot of work has focused either on analysing lexical semantic relations in word embeddings or on probing pretrained language models (LLMs), with some exceptions. Given the rarity of highly multilingual benchmarks, it is unclear to what extent LLMs capture relational knowledge and are able to transfer it across languages. We proposed MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS, covering 15 languages including low-resource languages such as Bambara, Lithuanian, and Albanian. We tested LLMs' ability to capture analogies across languages and to predict translation targets (first reference). There are considerable differences across relation types and languages. Analyze the Slovenian inter-annotator agreement and generate definitions. Propose corpus improvements.

Proposed methodology:

  1. Literature review: Review the initial BATS task corpus, MultiLexBATS paper and corpus.
  2. Select an open-source LLM and define useful LLM prompts to generate English and Slovenian definitions (for both annotators). See example for the Bridge corpus.
  3. Perform automatic English-Slovene translation along with definition generation. Semantically analyze the three Slovenian results for each keyword. It is obligatory to use an HPC.
  4. Propose dataset adaptations - define error types, dataset cleaning and updates. Show that by following MultiLexBATS you can improve results (alternatively, you can also replace BLOOM with another LLM - e.g., Mixtral).
  5. Final report summarizing the findings and critical evaluation of results.
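
For the inter-annotator agreement analysis, a standard choice is Cohen's kappa over the two annotators' labels; a minimal sketch with scikit-learn is below. The relation labels are placeholders for whatever annotations you extract from MultiLexBATS.

    # Sketch: inter-annotator agreement between two annotators with Cohen's kappa.
    # The label lists are placeholders for the actual MultiLexBATS annotations.
    from sklearn.metrics import cohen_kappa_score

    annotator_1 = ["hypernym", "antonym", "meronym", "antonym"]
    annotator_2 = ["hypernym", "antonym", "antonym", "antonym"]

    kappa = cohen_kappa_score(annotator_1, annotator_2)
    print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, ~0 = chance level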

References:

Project 9: Natural language inference dataset (DigiLing only, Aleš): The goal of this project is to create an NLI dataset by creating text passages that challenge the model's understanding of entailment, neutrality, and contradiction between pairs of longer texts. Students will use LLMs to generate two-paragraph texts using diverse prompts, analyse how accurately the model follows the instructions, and correct (if needed) the generated two-paragraph texts. They will also train a small model and (optionally) apply explanation methods to understand the model's predictions.

Proposed methodology:

  1. Studying the construction of the SI-NLI dataset and other reference literature. Finding a solution to extend the dataset to longer contexts.
  2. Designing creative prompts to generate text passages using LLMs (two paragraphs of approximately 5 sentences each, with a clear relation between them: entailment, contradiction, or neutral). Each member of a team should produce 50 samples.
  3. Manual validation of the generated texts based on their logical relationships (entailment, neutrality, contradiction) and correction of mistakes.
  4. Combining samples of all team members into one large dataframe.
  5. Training a small model and using it to determine if the created dataset is challenging enough.
  6. Compilation of a report detailing the generation process and evaluation process.
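
For step 4, the sketch below shows one way to merge each member's generated samples into a single dataframe with a shared schema and run basic sanity checks before training the small model. Column names, example rows, and the output file name are assumptions to agree on within the group.

    # Sketch: merging members' generated NLI samples into one dataset with sanity checks.
    # Column names, example rows, and file names are placeholders.
    import pandas as pd

    columns = ["premise", "hypothesis", "label"]   # label: entailment / neutral / contradiction

    member_1 = pd.DataFrame([{
        "premise": "Two paragraphs describing a rainy hiking trip in the Julian Alps ...",
        "hypothesis": "Two paragraphs implying the hikers never left home ...",
        "label": "contradiction",
    }])
    member_2 = pd.DataFrame(columns=columns)       # placeholders for the other members' samples
    member_3 = pd.DataFrame(columns=columns)

    dataset = pd.concat([member_1, member_2, member_3], ignore_index=True).drop_duplicates()
    assert set(dataset["label"].dropna()) <= {"entailment", "neutral", "contradiction"}
    dataset.to_csv("long_context_nli.csv", index=False)
    print(dataset["label"].value_counts())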

References: