Fine-Tuning the T5 Transformer for Educational Question Generation
Abstract
This paper presents a detailed study on fine-tuning Google's T5 model to improve its ability to generate high-quality, relevant questions. Leveraging the diverse and widely used SQuAD dataset, we trained the model specifically for question generation. Our results show significant improvements across multiple performance metrics, including BLEU, ROUGE, and Word Mover's Distance scores. Generated questions were also evaluated qualitatively through an interactive interface. We discuss the implications of these findings for model-based question generation, highlighting potential applications in education. Despite these advances, we acknowledge certain limitations in generalization. Future work is suggested to further refine the model, explore alternative architectures, and integrate the enhanced model into broader AI systems.
Introduction
The project aims to address the labor-intensive task of generating high-quality educational questions by leveraging AI, specifically the T5 transformer model. The goal is to create a model that generates accurate and contextually relevant questions, utilizing the SQuAD dataset. This approach is expected to improve accessibility and effectiveness in education, particularly in the context of online learning platforms.
Method
- Pretrained Model Selection: Google's T5 model was chosen for its benchmark performance on question generation tasks.
- Data Preprocessing: The SQuAD dataset was split into training and validation sets, and a custom dataset class was written to load examples into the model (see the first sketch after this list).
- Loss Logging: A custom LossLoggerCallback was used to monitor both training and validation losses (also sketched below).
- Model Fine-Tuning: The model was fine-tuned with the Hugging Face Trainer class, using cross-entropy as the loss function and the AdamW optimizer (see the final sketch below).
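The paper does not reproduce the dataset class itself, so the following is a minimal sketch of what it might look like, assuming SQuAD-format records and an answer-aware input format; the `generate question:` prefix and the length limits are illustrative assumptions, not the study's exact configuration.

```python
import torch
from torch.utils.data import Dataset


class QGDataset(Dataset):
    """Wraps SQuAD-style records for seq2seq question generation."""

    def __init__(self, records, tokenizer, max_source_len=512, max_target_len=64):
        self.records = records            # list of dicts with context/question/answers keys
        self.tokenizer = tokenizer
        self.max_source_len = max_source_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        # Illustrative input format: the model sees the target answer plus its context.
        source = f"generate question: answer: {rec['answers']['text'][0]} context: {rec['context']}"
        target = rec["question"]

        source_enc = self.tokenizer(
            source, max_length=self.max_source_len,
            padding="max_length", truncation=True, return_tensors="pt",
        )
        target_enc = self.tokenizer(
            target, max_length=self.max_target_len,
            padding="max_length", truncation=True, return_tensors="pt",
        )

        labels = target_enc["input_ids"].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss

        return {
            "input_ids": source_enc["input_ids"].squeeze(0),
            "attention_mask": source_enc["attention_mask"].squeeze(0),
            "labels": labels,
        }
```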
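The exact LossLoggerCallback is likewise not shown in the paper; a minimal version can subclass transformers.TrainerCallback and record whatever loss values the Trainer emits in its log events.

```python
from transformers import TrainerCallback


class LossLoggerCallback(TrainerCallback):
    """Collects training and evaluation losses as the Trainer logs them."""

    def __init__(self):
        self.train_losses = []
        self.eval_losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return
        if "loss" in logs:        # periodic training loss
            self.train_losses.append((state.global_step, logs["loss"]))
        if "eval_loss" in logs:   # validation loss after each evaluation pass
            self.eval_losses.append((state.global_step, logs["eval_loss"]))
```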
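Fine-tuning can then be wired together roughly as follows; the hyperparameter values are placeholders, not the ones used in the study. Note that T5ForConditionalGeneration computes cross-entropy internally when `labels` are supplied, and the Trainer uses AdamW by default.

```python
from transformers import (
    T5ForConditionalGeneration, T5TokenizerFast,
    Trainer, TrainingArguments,
)

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# train_records / val_records: SQuAD-format splits (placeholders for the actual data loading).
train_ds = QGDataset(train_records, tokenizer)
val_ds = QGDataset(val_records, tokenizer)

args = TrainingArguments(
    output_dir="t5-qg",
    num_train_epochs=3,               # placeholder values
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    evaluation_strategy="epoch",
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[LossLoggerCallback()],
)
trainer.train()
```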
Results
The fine-tuned T5 model demonstrated significant improvements in various performance metrics. Below is a table summarizing the key results:
| Metric | Fine-Tuned T5 | Flan-T5 (baseline) |
|---|---|---|
| BLEU Score | 35.12 | 32.45 |
| ROUGE-L Score | 58.76 | 55.23 |
| WMD (lower is better) | 0.423 | 0.445 |
- BLEU Score: The fine-tuned T5 model achieved a BLEU score of 35.12, indicating greater n-gram overlap between generated and reference questions than the baseline Flan-T5 model.
- ROUGE-L Score: The model's ROUGE-L score of 58.76 reflects better coverage of the longest common subsequence between generated questions and references.
- Word Mover's Distance (WMD): The WMD of 0.423 indicates that the fine-tuned model's questions are semantically closer to the reference questions than those generated by Flan-T5 (a sketch of how these metrics can be computed follows this list).
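For reference, these metrics can be reproduced along the following lines, assuming the Hugging Face `evaluate` package for BLEU/ROUGE and gensim word vectors for WMD; the study's exact scoring setup may differ, and the embedding path is a placeholder.

```python
import evaluate
from gensim.models import KeyedVectors

predictions = ["what year was the university founded?"]
references = ["in what year was the university founded?"]

# Corpus-level BLEU (sacrebleu expects a list of references per prediction).
bleu = evaluate.load("sacrebleu")
print(bleu.compute(predictions=predictions, references=[[r] for r in references])["score"])

# ROUGE-L F-measure.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references)["rougeL"])

# Word Mover's Distance over pretrained word vectors (file path is a placeholder).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(wv.wmdistance(predictions[0].split(), references[0].split()))
```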
Experimental Setup
- Data: The SQuAD dataset, consisting of over 100,000 questions, was used for training and evaluation. The dataset was analyzed for context, question, and answer length distributions.
- Evaluation Method: Quantitative metrics (BLEU, ROUGE, WMD) and qualitative assessments were used to evaluate the model’s performance.
- Local Deployment: The model was deployed locally with Streamlit to enable qualitative evaluation through an interactive interface (a minimal sketch follows this list).
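A minimal version of such an interface, assuming the fine-tuned checkpoint is saved locally under `t5-qg` (a placeholder path) and the same input format as in the training sketch above:

```python
import streamlit as st
from transformers import T5ForConditionalGeneration, T5TokenizerFast

MODEL_DIR = "t5-qg"  # placeholder path to the fine-tuned checkpoint


@st.cache_resource
def load_model():
    tokenizer = T5TokenizerFast.from_pretrained(MODEL_DIR)
    model = T5ForConditionalGeneration.from_pretrained(MODEL_DIR)
    return tokenizer, model


st.title("Educational Question Generator")
context = st.text_area("Context paragraph")
answer = st.text_input("Answer the question should target")

if st.button("Generate question") and context and answer:
    tokenizer, model = load_model()
    source = f"generate question: answer: {answer} context: {context}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=64, num_beams=4)
    st.write(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```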
Analysis
The model generally performed well, producing grammatically correct, relevant, and answerable questions. However, occasional errors were noted, such as generated questions containing non-existent words. The model's ability to generalize to non-Wikipedia contexts was also tested and showed promising results.
Conclusion
The fine-tuned T5 model successfully generated high-quality educational questions, outperforming the Flan-T5 model in quantitative evaluations. However, issues like overfitting need further investigation. Future research could explore expanding the model’s capabilities to include question answering and improving the explainability of transformer models.
Limitations and Further Research
- Limitations: Evidence of overfitting was observed, which needs further analysis.
- Further Research: Expanding the model to generate reference answers and enhancing its explainability could improve its utility in educational settings.
Authors
- Christian Hobelsberger
- Jules Roboz
- Levente Kis