Devpost: QA Calibration
Research Focus: In our Grounded QA task, we assess the reliability of QA models by evaluating their calibration: how well a model's confidence in its predictions aligns with the accuracy of those predictions. To frame this concept, we borrow the idea of a "buzz" from Trivia Quiz competitions, where a buzz occurs when a player is confident enough to give an answer before the question is fully revealed. Analogously, our evaluation measures whether the model's prediction probability reflects its actual prediction accuracy.
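As background on what this alignment means in practice, the sketch below shows a generic calibration check (expected calibration error over confidence bins). It is illustrative only and is not the metric used in this task; the bin count, function name, and toy data are assumptions.

```python
# Illustrative sketch only: a standard expected-calibration-error (ECE) check,
# shown as background for "confidence should match correctness".
# NOT the project's official metric (see Average Expected Buzz below).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Mean |accuracy - confidence| per confidence bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: three predictions with confidences and 0/1 correctness labels.
print(expected_calibration_error([0.9, 0.6, 0.3], [1, 1, 0]))
```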
Summary: Our research project is centered on evaluating question-answering (QA) systems, with a particular focus on their calibration. Calibration, in this context, refers to how closely a model’s confidence in its predictions matches the actual correctness of those predictions. This is crucial for ensuring that the model’s confidence reflects its reliability in real-world tasks. To measure calibration, we draw on the concept of a "buzz" from Trivia Quiz competitions, where participants buzz in with an answer as soon as they feel confident enough, often before hearing the full question. Similarly, we assess whether a QA model’s confidence aligns with its likelihood of making a correct prediction as the question is incrementally revealed.
A key feature of our approach is that questions are presented in stages, with the model producing a guess and a confidence score at each step. This allows us to track how the model's confidence evolves as it receives more information. Our evaluation focuses on three main objectives: 1) determining at which point in the incremental reveal the model becomes confident enough to produce a correct answer, 2) assessing whether the model's confidence scores accurately reflect the correctness of its guesses, and 3) comparing the alignment between confidence and correctness in models versus human participants.
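To make the staged-reveal setup concrete, here is a minimal sketch of collecting a (guess, confidence) pair at each reveal step and locating the first step where a threshold-based buzz would fire. The `qa_model` callable, word-level reveal steps, and the 0.8 threshold are assumptions for the example, not part of the actual evaluation harness.

```python
# Minimal sketch of the staged-reveal protocol (illustrative assumptions only).
from typing import Callable, List, Optional, Tuple

def reveal_trajectory(
    question: str,
    qa_model: Callable[[str], Tuple[str, float]],
) -> List[Tuple[str, str, float]]:
    """Return (revealed_prefix, guess, confidence) after each revealed word."""
    words = question.split()
    steps = []
    for i in range(1, len(words) + 1):
        prefix = " ".join(words[:i])
        guess, confidence = qa_model(prefix)
        steps.append((prefix, guess, confidence))
    return steps

def first_buzz_step(steps, threshold: float = 0.8) -> Optional[int]:
    """Index of the earliest step whose confidence reaches the buzz threshold."""
    for idx, (_, _, confidence) in enumerate(steps):
        if confidence >= threshold:
            return idx
    return None  # the model never becomes confident enough to buzz

# Toy usage with a dummy model whose confidence grows with prefix length.
toy_model = lambda prefix: ("Paris", min(1.0, len(prefix.split()) / 10))
steps = reveal_trajectory(
    "This capital city on the Seine hosts the Louvre museum", toy_model
)
print(first_buzz_step(steps))
```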
To quantify these dynamics, we use a novel metric called Average Expected Buzz, which measures the confidence level at which the model is expected to buzz in with a correct prediction. This provides a summary evaluation of the system's calibration over the course of the reveal.
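The official definition of Average Expected Buzz lives in the task's evaluation code; the sketch below is only one plausible reading of the description above, in which each step's confidence is treated as the probability of buzzing at that step and a buzz earns credit only if that step's guess is correct. The function names and toy trajectories are assumptions.

```python
# Hedged sketch of an "expected buzz"-style aggregate; NOT the official metric.
from typing import List, Tuple

def expected_buzz(steps: List[Tuple[bool, float]]) -> float:
    """steps: (is_correct, confidence) per reveal step, in order."""
    score = 0.0
    still_waiting = 1.0  # probability that no buzz has happened yet
    for is_correct, confidence in steps:
        buzz_here = still_waiting * confidence  # chance of buzzing at this step
        score += buzz_here * (1.0 if is_correct else 0.0)
        still_waiting *= (1.0 - confidence)
    return score

def average_expected_buzz(questions: List[List[Tuple[bool, float]]]) -> float:
    """Mean expected buzz over a set of question trajectories."""
    return sum(expected_buzz(q) for q in questions) / len(questions)

# Toy usage: two questions with (correct?, confidence) trajectories.
print(average_expected_buzz([
    [(False, 0.2), (True, 0.7), (True, 0.95)],
    [(False, 0.1), (False, 0.4), (True, 0.9)],
]))
```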
After submission, we plan to test these models on adversarial questions crafted by human experts, specifically designed to remain challenging even at the final stage (or "run") of the question, in a Trivia human-computer tournament. This will allow us to evaluate whether the submitted QA systems can consistently outperform human experts under our calibration metric.
The overarching goal of this project is to enhance the reliability of QA systems by improving the alignment between their confidence estimates and actual performance, making them more trustworthy for real-world applications that depend on accurate, well-calibrated decision-making under uncertainty.
Goal: The broader goal of this project is to improve the reliability and trustworthiness of QA models by ensuring that their confidence estimates are better aligned with their actual performance, ultimately enhancing their applicability in real-world tasks where decision-making based on uncertainty is crucial.
Deliverables: Submission to HuggingFace leaderboard
Researchers
Yoo Yeon Sung | yysung53@umd.edu | Graduate Student
Yu Hou | houyu@umd.edu | Graduate Student
Jordan Boyd-Graber | ying@umd.edu | Faculty | CS Department