Laboratory work - Spring 2024

The main goal of the laboratory work is to present the most important aspects of data science in practice and to teach you how to use key tools for an NLP engineer. We especially emphasize self-paced work and raising standards related to development, replicability, reporting, research, visualization, etc. Our goal is not to provide exact instructions or to "make robots" out of the participants of this course. Participants will need to navigate the data themselves, identify promising leads, and extract as much information as possible from the data to present to others (colleagues, instructors, companies, or their superiors).

Important links

Lab sessions course repository (continuously updated, use weekly plan links for latest materials)

Books and other materials

        Speech and language processing (online draft)

        Python 3 Text Processing with NLTK 3 Cookbook

Introduction to Data Science Handbook 

Razvoj slovenščine v digitalnem okolju (Development of Slovene in the Digital Environment, February 2023)

Previous years' NLP course materials

        NLP course 2021 project reports

        NLP course 2022 project reports

        NLP course 2023 project reports

NLP course 2024 projects

        Marks

Peer review

        Peer review submission form (TBA)

        

Weekly plan

This plan is regularly updated.

 

Lab sessions are meant to discuss materials and your project ideas. Those proficient in some of the topics covered during the course are expected to help other students during the lab work or in online discussions. Such contributions will also be taken into account. During the lab sessions we will show some DEMOs based on which you will work on your projects. Based on your proposals / new ideas we can adapt the weekly plan and prepare additional materials.

All the lab session tutorials will be regularly updated in the Github repository. During the lab sessions we will briefly present each week's topic and then mostly discuss your project ideas and work. You are expected to check/run notebooks before the lab sessions and then ask questions/discuss during the lab sessions. In the repository's README you can also find the recordings of each topic.

Week / Description / Materials and links

19.2. - 23.2.

/

26.2. - 1.3.

Lab work introduction

Projects overview

Group work and projects application procedure

Basic text processing

Slovene text processing

Course overview and introduction

4.3. - 8.3.

Text clustering

Text classification

Traditional sequence tagging (HMM, MEMM, CRF, ...)

Language models, knowledge bases

Project sign-up form (deadline Friday midnight).

GitHub Classroom assignment (deadline Friday midnight; only one group member creates a team; exactly three members per group!).

11.3. - 15.3.

Neural networks introduction (TensorFlow, Keras)

Word embeddings & visualizations (offensive language)

RNNs vs. GRUs vs. LSTMs + examples

Multiple simple NN architectures example (Google Colab)

18.3. - 22.3.

Introduction to PyTorch

PyTorch Lightning

SLING tutorial (setup, Singularity, SLURM)

First submission (Friday, 23:59)

* There will be no lab session on Wednesday at 10am. Please attend another group's session if possible.

25.3. - 29.3.

First submission defense (in person)

1.4. - 5.4.

(Mon. holiday)

Transformers, BERT (custom task),

BERT (tagging, classification),  

KeyBERT (keyword extraction), TopicBERT (topic modeling)

8.4. - 12.4.

Generative and conversational AI

15.4. - 19.4.

Prompting and Efficient Fine-Tuning of a Large Language Model

Retrieval Augmented Generation (RAG)

22.4. - 26.4.

Graph neural networks for text processing

No lab session on Thu, 5pm!

29.4. - 3.5.

(Wed., Thu., Fri. holiday)

No lab sessions

Second submission (Friday, 23:59)

6.5. - 10.5.

Second submission defense (in person)

Boshko's lab sessions will be held online: https://zoom.us/j/3145561389

13.5. - 17.5.

Consultations/Project work/Discussions

(Please attend the lab sessions and discuss your work and ideas!)

No lab session on Wed., 10am!

(DSI 2024 conference: if you attend the Hackathon, you get +10 points for the lab part,
https://hackathon.si)

20.5. - 24.5.

Project work/Online discussions

(LREC-COLING 2024)

No lab sessions on Tue 3pm, Wed 4pm, Thu 5pm!

Final submission deadline (Friday, 23:59)

      IMPORTANT: Set your repository's visibility to public before the deadline or shortly after!

Peer review submission deadline (Monday, 27.5.2024, 23:59)

      Peer review link (each group will get an email with repositories to review)!

27.5. - 31.5.

No organized lab sessions this week

Final project presentations - Wednesday, May 29 in P21:

    Projects 1 & 2: 2pm

    Projects 3, 4 & 5: 3pm

    Projects 6 & 7: 4pm

    Project 9: during your lab session

Course obligations

Please regularly check the Weekly plan and course announcements for possible changes. You are expected to attend the lab sessions, and you must attend the defense sessions. On assignment defense dates, at least one member of a group must be present; otherwise, all members need to provide a doctor's justification. For the last assignment, all members must be present and need to understand all parts of the submitted solution.

All the work must be submitted using your Github project repository. Submission deadlines are indicated in the table above. Submission defenses will be held during the lab sessions.

Students must work in groups of three members! There can be only one group of two members per project type. The distribution of work between members should be evident from the commits within the repository.

Obligation / Description / Final grade relevance (%)

Submission 1: Project selection & simple corpus analysis (10%)

  - Group (three members) selection

  - Report containing an introduction, existing solutions/related work, and initial ideas

  - Well-organized repository

Submission 2: Initial implementation / baseline with results (20%)

  - Updated Submission 1 parts

  - At least one implemented solution with analysis

  - Future directions and ideas

  - Well-organized repository

Submission 3: Final solution and report (60%)

  - Final report incl. analyses and discussions

  - Fully reproducible repository

Peer review: Evaluate your peer groups' work (10%)

  - Each group will check the final submissions of two other peer groups with the same topic

Total: 100%

Grading criteria

All the graded work is group work. All the work is graded following the scoring schema below. All the course obligations must be graded positive (i.e., 6 or more) to pass.

Use your PUBLIC GROUP ID for public communication regarding your group. The GROUP ID is your group's internal id, under which marks will be publicly available.

Scoring

Scoring is done relative to the achievements of all the participants in the course. The instructions define the criteria the participants need to address, and exactly fulfilling the instructions will result in a score of 8. All other scores will be relative to the quality of the submitted work. The role of the instructors is to find an appropriate clustering of all the works into 6 clusters. To better illustrate the scoring, the schema is as follows:

10: The repository is clear and runnable. The report is well organized, results are discussed, and visualizations add value to the text and are well prepared. Beyond the minimum criteria, the group tried multiple novel ideas of their own to approach the problem.

9: Same as above, but it is visible that the group had novel ideas and ran out of time or did not finish (polish) everything. The submission has multiple minor flaws.

8: The group implemented everything suggested by the minimum criteria but did not investigate further (did not find much related work, did not apply multiple other techniques, ...).

7: The group implemented everything suggested by the minimum criteria but did not discuss the results well, performed only simple analyses, etc. The report is also not well organized and lacks data for reproducibility.

6: The group tried to implement the minimum criteria (or only part of them), but the work has many minor flaws or a few major ones. The report also reflects their motivation.

5: The group did not address one or more points of the minimum criteria and the report contains major flaws. It can be seen that the group did not invest enough time into the work.

Final project preparation guidelines and peer review instructions

Keep in mind the following major remarks regarding the final submission:

Peer review instructions:

  1. Please find the projects you need to review (see link above).
  2. Each group needs to review projects of the same topic they have chosen.
  3. Submit your peer review scores in the Google Form (see link above).
  4. You will also get a score for your grading, depending on how much (allowing for some margin) your grading differs from the assistant's grading.
  5. Follow the scoring criteria as presented above and include feedback along with your mark.

Final project presentation instructions

Each group will have max. 3 minutes (STRICT) to present their project. I will put your report on the projector and you will present along with it. I propose that you focus on a specific interesting part (e.g., a table, graph, figure, ...). The most important aspect to present is:

        What is the "take-away message" of your work? This should be concrete and concise, so that anyone can understand (also a completely lay person).

See timetable above for time slots of your presentation. If you cannot attend, please write to me to get an alternative time slot.

Specific projects information

Project 1: LLM Prompt Strategies for Commonsense-Reasoning Tasks (Aleš): This project aims to explore and compare various prompt strategies to enhance commonsense reasoning in large language models (LLMs). Students will investigate methods such as Chain of Thought (CoT), in-context learning, plan-and-solve techniques, etc., to improve the model's performance on tasks requiring commonsense knowledge. The project will involve designing experiments to evaluate the effectiveness of each strategy, analyzing the models' reasoning processes, and understanding how different prompting techniques influence the outcomes.

Proposed methodology:

  1. Literature review on current prompt strategies and their applications in commonsense reasoning.
  2. Selection of a commonsense reasoning dataset (e.g., Winograd Schema Challenge).
  3. Design and implementation of experiments to compare the effectiveness of various prompt strategies.
  4. Detailed analysis of model responses to identify strengths and weaknesses of each strategy (usage of an HPC is obligatory!).
  5. Final report summarizing findings, with recommendations for best practices in prompting for commonsense reasoning tasks.
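
To make the comparison concrete, below is a minimal sketch of contrasting a direct prompt with a chain-of-thought prompt on a single Winograd-style example. The checkpoint and prompt wording are placeholder assumptions, not a prescribed setup; substitute the LLM, dataset, and strategies you actually select.

    # Minimal sketch: direct vs. chain-of-thought prompting on one Winograd-style item.
    # The model checkpoint and prompt phrasing are illustrative assumptions only.
    from transformers import pipeline

    generator = pipeline("text2text-generation", model="google/flan-t5-large")

    example = ("The trophy does not fit into the brown suitcase because it is too large. "
               "What is too large?")

    prompts = {
        "direct": example + " Answer with a single word.",
        "chain_of_thought": example + " Let's think step by step, then give a one-word answer.",
    }

    for name, prompt in prompts.items():
        answer = generator(prompt, max_new_tokens=64)[0]["generated_text"]
        print(f"{name}: {answer}")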

References:

Project 2: Parameter-Efficient Fine-Tuning of Language Models (Aleš): This project focuses on investigating parameter-efficient techniques for fine-tuning large language models, such as Low-Rank Adaptation (LoRA), soft prompts, etc. Students will compare different approaches across various NLP tasks to assess the efficiency and effectiveness of each fine-tuning strategy. The evaluation will consider model performance, computational efficiency, and adaptability to different tasks.

Proposed methodology:

  1. Reviewing parameter-efficient fine-tuning techniques and selecting appropriate methods for experimentation.
  2. Designing experiments to compare learning across multiple NLP tasks. Selecting at least 5 different datasets that cover various natural language understanding skills (commonsense reasoning, coreference resolution, text summarization, etc.) and supervised learning settings (classification & generation).
  3. Evaluating the models based on appropriate performance metrics, computational resources required, and ease of adaptation to different tasks. It is obligatory to submit your results publicly to SloBENCH.
  4. Writing a comprehensive report that discusses the experimental setup, findings, and recommendations for efficient fine-tuning of language models.
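
As an illustration of the kind of setup the project expects, here is a minimal sketch of attaching a LoRA adapter to a classification model with the Hugging Face peft library. The base checkpoint, number of labels, and hyperparameters are assumptions to be replaced by your own choices; the same pattern carries over to generation tasks by changing the task type and base model.

    # Sketch: wrapping a sequence-classification model with a LoRA adapter (peft).
    # Base model, label count, and LoRA hyperparameters are placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, TaskType, get_peft_model

    base = "xlm-roberta-base"                      # assumption: any suitable encoder
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

    lora_cfg = LoraConfig(task_type=TaskType.SEQ_CLS, r=8, lora_alpha=16, lora_dropout=0.1)
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()             # only a small fraction of weights is trainable

    # The wrapped model can be passed to a standard transformers Trainer for fine-tuning.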

References:

Project 3: Cross-Lingual Question Generation (Boshko): This project aims to extend the Doc2Query approach, which utilises a T5 model fine-tuned on the MSMARCO dataset for generating queries from documents, to the domain of question generation in multiple languages. The students will assess the quality of questions generated by the model and its effectiveness across different languages, thereby understanding the challenges and opportunities of applying such models in a cross-linguistic context. The students will then fine-tune the given system on Slovenian datasets and evaluate the outputs.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive review of existing literature on question generation models, focusing on Doc2Query and its applications, as well as cross-lingual NLP techniques.
  2. Dataset Selection and Preparation: Select a relevant Slovenian question answering dataset (e.g., SQuAD) or construct one from a sample of Slovenian news articles.
  3. Model Fine-Tuning: Fine-tune the T5 model on the selected datasets, adapting the Doc2Query approach for question generation tasks. It is obligatory to use an HPC.
  4. Quality Assessment: Design a framework for evaluating the quality of generated questions, considering factors such as relevance, coherence, and linguistic correctness, for both the pre-trained and the fine-tuned model. It is obligatory to manually check and update 300 examples (QA pairs); try to have up to 100 of the same examples evaluated by all members of the group.
  5. Final report summarising the results, highlighting advancements and limitations.
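
A minimal sketch of the Doc2Query-style generation step is given below. The checkpoint name is an assumed publicly available MS MARCO doc2query model; for the project you would later swap in your own model fine-tuned on Slovene data.

    # Sketch: generating candidate questions from a passage with a Doc2Query-style T5 model.
    # The checkpoint name is an assumption; replace it with the model you use/fine-tune.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    model_name = "doc2query/msmarco-t5-base-v1"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    passage = ("Ljubljana is the capital and largest city of Slovenia, known for its "
               "university and its green riverside centre.")

    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=10,
                             num_return_sequences=3)
    for ids in outputs:
        print(tokenizer.decode(ids, skip_special_tokens=True))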

References:

Project 4: Slovenian Instruction-based Corpus Generation (Slavko): Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of fields, including specialized domains such as programming and creative writing. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public. Recently, a number of very large language models were introduced, such as LaMDA, BLOOM, GPT(-3), Galactica, Mixtral, OPT, ... It is infeasible to train such models without a powerful GPU infrastructure and large amounts of corpora. Based on these models, general text-to-text models are often trained, instead of training a specific model for each NLP task, such as text classification or question answering. Your task is to get to know LLMs and to understand their creation at a high level. Try to prepare large amounts of conversational data in Slovene (this is the focus of this task!) that is correctly organized and of good enough quality to be used for fine-tuning a multilingual LLM (one that supports Slovene). Demonstrate your work by fine-tuning such a model into a conversational agent for Slovene.

Proposed methodology:

  1. Review usable LLMs, select one that you might use (e.g., within SLING infrastructure, VEGA, Nvidia A100 GPUs).
  2. (main goal of the project) Review dataset construction and the categorization of instructions for selected instruction-based LLMs. Prepare a plan for data gathering and identify sources (e.g., med-over.net, the slo-tech forum, ...). Write crawlers, organize the data in a way that is useful for "fine-tuning" the model, etc. Check papers (e.g., BLOOM's, LLaMA 2's) to learn what aspects are important when preparing data.
  3. Use the data to adapt an existing model using your data (optional).
  4. Report on all your findings in the final report.
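
For step 2, a common way to organize gathered conversational/instruction data is one JSON object per line with instruction, input, and output fields; the sketch below shows this layout. The example record, field names, and file name are assumptions, not a required schema, but keeping provenance per record makes later filtering and licensing checks easier.

    # Sketch: storing Slovene instruction data as JSON Lines (one example per line).
    # The record, field names, and file name are illustrative placeholders.
    import json

    records = [
        {
            "instruction": "Povzemi naslednje besedilo v enem stavku.",
            "input": "Ljubljana je glavno mesto Slovenije in središče njenega kulturnega življenja ...",
            "output": "Ljubljana je glavno mesto in kulturno središče Slovenije.",
            "source": "example",   # keep provenance for filtering and licensing checks
        },
    ]

    with open("slovene_instructions.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")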

References:

Project 5: Unsupervised Domain adaptation for Sentence Classification (Boshko): This project seeks to improve document representation in specialized domains by adapting sentence-transformer models, which, while effective, are not inherently tuned to specific fields. The focus will be on investigating two advanced adaptation techniques: TSDAE (Transformer-based Denoising AutoEncoder) and GPL (generative pseudo labeling). These methods aim to refine the representation space, making it more sensitive and accurate within a given domain. The students will evaluate the effect of the adaptation on the classification result.

Proposed methodology:

  1. Literature review on sentence transformers, TSDAE and GPL to understand their application in information retrieval.
  2. Selection of a (Slovenian) classification dataset for domain adaptation experiments (SentiNews, https://www.clarin.si/repository/xmlui/handle/11356/1110).
  3. Design and implementation of experiments to assess the impact of domain adaptation techniques on classification performance. It is obligatory to use an HPC.
  4. Detailed analysis of classification results to determine the effectiveness of TSDAE, GPL, and ranking functions.
  5. Final report summarizing the findings, with recommendations on the feasibility of domain adaptation in information retrieval systems for classification.
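
A minimal sketch of the TSDAE adaptation step with the sentence-transformers library is shown below, following its documented TSDAE recipe; the base checkpoint, the two placeholder sentences, and the training settings are assumptions to replace with your in-domain SentiNews data.

    # Sketch: unsupervised TSDAE adaptation with sentence-transformers.
    # Base encoder, example sentences, and hyperparameters are placeholders.
    from sentence_transformers import SentenceTransformer, models, datasets, losses
    from torch.utils.data import DataLoader

    base = "bert-base-multilingual-cased"          # assumption: any BERT-style encoder
    word_embedding = models.Transformer(base)
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # In practice: thousands of unlabelled in-domain sentences (e.g., from SentiNews).
    sentences = ["Prvi primer stavka iz ciljne domene.", "Drugi primer stavka iz ciljne domene."]

    train_data = datasets.DenoisingAutoEncoderDataset(sentences)   # adds deletion noise
    loader = DataLoader(train_data, batch_size=8, shuffle=True)
    loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=base,
                                           tie_encoder_decoder=True)

    model.fit(train_objectives=[(loader, loss)], epochs=1, weight_decay=0,
              scheduler="constantlr", optimizer_params={"lr": 3e-5}, show_progress_bar=True)
    model.save("tsdae-adapted-encoder")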

References:

 

Project 6: Qualitative Research on Discussions - text categorization (Slavko): Qualitative discourse analysis is an important way in which social scientists research human interaction. Large language models (LLMs) offer potential for tasks like qualitative discourse analysis, which demands a high level of inter-rater reliability among human "coders" (i.e., qualitative research categorizers). This is an exceedingly labor-intensive task, requiring human coders to fully understand the discussion context, consider each participant's perspective, and comprehend each sentence's associations with the previous discussion, as well as shared general knowledge. In this task, you will create a model to categorize postings in online discussions, such as the provided corpus: an online discussion about the story "The Lady, or the Tiger?". We provide a coded dataset with high inter-rater reliability and a codebook including definitions of each category with examples. Your task is to build and train a highly reliable language model for this coding task that generalizes to other online discussions.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive literature review on discourse analysis or dialogic analysis, focusing on the coding criteria and applied approaches related to NLP.
  2. Data Exploration: Explore and understand the provided coded discourse dataset (FINAL DATA (unfiltered as in real-life scenarios)).
  3. Fine-tuning Models: Building and fine-tuning LLMs on the provided dataset to predict the discussion category of a posting, considering the discussion context, associations with the previous sentences and the involved participants. It is obligatory to use an HPC.
  4. Performance Evaluation: Exploring the metrics used to evaluate the performance of your built models in discourse analysis. Iteratively compare your model's performance with that of human coders (categorizers) and revise it based on the results. You have the option to implement your own evaluation approaches, or to compare your model's performance with that of other alternative models working on the dataset. Your model will also be tested, for generalizability, on another coded online discussion dataset with a different codebook.
  5. Generate explanations of the categories with an LLM: use a separately fine-tuned LLM to generate the explanations and qualitatively assess them.
  6. Final Report: Delivering a comprehensive report on your findings, emphasizing the effectiveness, innovation, and limitations of your proposed models.
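
One plausible way to frame step 3 is sequence-pair classification: the preceding discussion context and the current posting are encoded together and mapped to a codebook category. The sketch below illustrates this; the model name, the example categories, and the example postings are assumptions, since the real labels come from the provided codebook.

    # Sketch: framing discourse coding as context-aware sequence-pair classification.
    # Model name, categories, and example postings are illustrative placeholders.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    categories = ["agreement", "disagreement", "elaboration", "other"]   # placeholder codebook
    label2id = {c: i for i, c in enumerate(categories)}
    id2label = {i: c for i, c in enumerate(categories)}

    model_name = "xlm-roberta-base"                # assumption: any suitable encoder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(categories), id2label=id2label, label2id=label2id)

    context = "Previous posting: I think the princess pointed to the tiger out of jealousy."
    posting = "But would she not rather see him alive, even with another woman?"

    # Encode context and posting as a sentence pair so the model can use discussion history.
    inputs = tokenizer(context, posting, truncation=True, return_tensors="pt")
    predicted = model(**inputs).logits.argmax(-1).item()
    print(id2label[predicted])                     # untrained output; fine-tune with a Trainer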

References:

Project 7: Conversations with Characters in Stories for Literacy - Quick, Customized Persona Bots from Novels (Slavko): There is a world-wide literacy crisis (Murray, 2021; OECD, 2015, 2019). Young people hate reading and rarely read recreationally. They fail at high-level literacy skills, e.g., evaluating texts for validity and integrating across texts to create personal knowledge. Yet, literacy is vital for educational and professional success, life happiness and societal health. One way to motivate young people to read is through conversational interaction with digital personifications of characters (pedagogical agents or PersonaBots) from novels. LLMs provide possible solutions. Khanmigo offers personaBot ChatGPT text conversations with Jay Gatsby (of the classic novel The Great Gatsby) and with Obama. However, its offerings are limited: Khanmigo provides no information on development time for personaBots, nor does it offer customized personaBots from user-suggested novels. Quick, customized personaBots for conversations with characters from teacher-suggested novels would be enormously educational. To ensure a personaBot is fully contextualized in the specific story and at the same time stays within the constraints of token limitations, our suggestion is to consider current retrieval and indexing techniques (i.e., Retrieval Augmented Generation) or to implement more efficient vector search or similarity computation approaches.

Proposed methodology:

  1. Literature Review: Conduct a comprehensive literature review (and services such as character.ai) on personaBots and pedagogical agents in language arts. Look at theoretical systems that inform PersonaBot design, personality theory, situation models, pedagogical agents and more.
  2. Explore provided sample stories with example scripts for conversations with characters, e.g., simple xml hard-coded scripts. Also, you will be provided with suggested novels in the public domain as test examples (No (useful) data available, just description and example from IMapBook).
  3. Based on formative evaluation, fine-tune models for creating custom PersonaBot conversations with characters from suggested stories and novels.
  4. Performance Evaluation: Test your persona bots on newly suggested stories/novels with sample users (high school or university students) through teleconferences and analyze the transcripts of the conversations. Explore metrics to evaluate the performance of your personaBots and describe what the evaluation would look like. You have the option to implement your own evaluation approaches, or to compare your model's performance with that of other alternative models working on the dataset.
  5. Final Report: Delivering a comprehensive report on your findings, emphasizing the effectiveness, innovation, and limitations of your proposed models for creating customized personaBots based on characters in novels.
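
To keep a personaBot grounded in its novel within token limits, a retrieval step like the one sketched below can select the passages most relevant to the reader's message and prepend them to the character prompt. The encoder checkpoint, the paraphrased Gatsby passages, and the prompt template are illustrative assumptions.

    # Sketch: retrieval-augmented prompting for a persona bot.
    # Encoder, passages, and prompt template are illustrative placeholders.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumption: any sentence encoder

    passages = [
        "Gatsby stretched out his arms toward the green light across the bay.",
        "He had waited five years and bought a mansion close to Daisy's home.",
    ]
    passage_emb = embedder.encode(passages, convert_to_tensor=True)

    user_msg = "Why do you keep staring at that light across the water?"
    query_emb = embedder.encode(user_msg, convert_to_tensor=True)

    hits = util.semantic_search(query_emb, passage_emb, top_k=2)[0]
    retrieved = "\n".join(passages[hit["corpus_id"]] for hit in hits)

    prompt = ("You are Jay Gatsby. Stay in character and answer using only what the "
              "following passages support:\n" + retrieved +
              "\n\nReader: " + user_msg + "\nGatsby:")
    print(prompt)   # this prompt is then sent to the LLM of your choice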

References:

Project 8: Automatic identification of multiword expressions and definition generation (Slavko): Understanding the relations between the meanings of words is an important part of comprehending natural language. A lot of work has focused either on analysing lexical semantic relations in word embeddings or on probing pretrained language models (LLMs), with some exceptions. Given the rarity of highly multilingual benchmarks, it is unclear to what extent LLMs capture relational knowledge and are able to transfer it across languages. We proposed MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS, covering 15 languages including low-resource languages such as Bambara, Lithuanian, and Albanian. We tested LLMs' ability to capture analogies across languages and to predict translation targets (first reference). There are considerable differences across relation types and languages. Analyze the Slovenian inter-annotator agreement and generate definitions. Propose corpus improvements.

Proposed methodology:

  1. Literature review: Review the initial BATS task corpus, MultiLexBATS paper and corpus.
  2. Select an open-source LLM and define useful LLM prompts to generate English and Slovenian definitions (for both annotators). See example for the Bridge corpus.
  3. Perform automatic English-Slovene translation along with definition generation. Semantically analyze the three Slovenian results for each keyword. It is obligatory to use an HPC.
  4. Propose dataset adaptations - define error types, dataset cleaning and updates. Show that by following MultiLexBATS you can improve results (alternatively, you can also replace BLOOM with another LLM - e.g., Mixtral).
  5. Final report summarizing the findings and critical evaluation of results.
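
For the inter-annotator agreement analysis, a standard choice is Cohen's kappa over the two annotators' labels; a minimal sketch with scikit-learn is below. The relation labels are placeholders for whatever annotations you extract from MultiLexBATS.

    # Sketch: inter-annotator agreement between two annotators with Cohen's kappa.
    # The label lists are placeholders for the actual MultiLexBATS annotations.
    from sklearn.metrics import cohen_kappa_score

    annotator_1 = ["hypernym", "antonym", "meronym", "antonym"]
    annotator_2 = ["hypernym", "antonym", "antonym", "antonym"]

    kappa = cohen_kappa_score(annotator_1, annotator_2)
    print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, ~0 = chance level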

References:

Project 9: Natural language inference dataset (DigiLing only, Aleš): The goal of this project is to create an NLI dataset by creating text passages that challenge the model's understanding of entailment, neutrality, and contradiction between pairs of longer texts. Students will use LLMs to generate two-paragraph texts using diverse prompts, analyse how accurately the model follows the instructions, and correct (if needed) the generated two-paragraph texts. They will also train a small model and (optionally) apply explanation methods to understand the model's predictions.

Proposed methodology:

  1. Studying the construction of the SI-NLI dataset and other reference literature. Finding a solution to extend the dataset to longer contexts.
  2. Designing creative prompts to generate text passages using LLMs (two paragraphs of approximately 5 sentences each, with a clear relation between them: entailment, contradiction, or neutral). Each member of a team should produce 50 samples.
  3. Manual validation of the generated texts based on their logical relationships (entailment, neutrality, contradiction) and correction of mistakes.
  4. Combining samples of all team members into one large dataframe.
  5. Training a small model and using it to determine if the created dataset is challenging enough.
  6. Compilation of a report detailing the generation process and evaluation process.
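
For step 4, the sketch below shows one way to merge each member's generated samples into a single dataframe with a shared schema and run basic sanity checks before training the small model. Column names, example rows, and the output file name are assumptions to agree on within the group.

    # Sketch: merging members' generated NLI samples into one dataset with sanity checks.
    # Column names, example rows, and file names are placeholders.
    import pandas as pd

    columns = ["premise", "hypothesis", "label"]   # label: entailment / neutral / contradiction

    member_1 = pd.DataFrame([{
        "premise": "Two paragraphs describing a rainy hiking trip in the Julian Alps ...",
        "hypothesis": "Two paragraphs implying the hikers never left home ...",
        "label": "contradiction",
    }])
    member_2 = pd.DataFrame(columns=columns)       # placeholders for the other members' samples
    member_3 = pd.DataFrame(columns=columns)

    dataset = pd.concat([member_1, member_2, member_3], ignore_index=True).drop_duplicates()
    assert set(dataset["label"].dropna()) <= {"entailment", "neutral", "contradiction"}
    dataset.to_csv("long_context_nli.csv", index=False)
    print(dataset["label"].value_counts())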

References: