
Modeling Missing Response Data in Item Response Theory: Addressing Missing Not at Random Mechanism with Monotone Missing Characteristics

Abstract

Item nonresponse frequently occurs in educational and psychological assessments and, if not appropriately handled, can undermine the reliability of the results. This study introduces a missing data model based on the missing not at random (MNAR) mechanism, incorporating the monotonic missingness assumption to capture individual-level missingness patterns and behavioral dynamics. Specifically, the cumulative count of missing-response indicators captures the tendency of the current item to be missing given the previous missing responses, which reduces the number of nuisance parameters needed to model the missing data mechanism. Two Bayesian model evaluation criteria were developed to distinguish between missing at random (MAR) and MNAR mechanisms by imposing specific parameter constraints. Additionally, the study introduces a highly efficient Bayesian slice sampling algorithm to estimate the model parameters. Four simulation studies were conducted to demonstrate the performance of the proposed model. An analysis of the PISA 2015 science data further illustrates the application of the proposed approach.
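As a rough, hypothetical illustration of the mechanism the abstract describes, the sketch below lets the log-odds that a person omits the current item increase with the cumulative count of that person's earlier omissions. The logistic form, the item parameter `beta`, and the carry-over parameter `delta` are assumptions for illustration, not the authors' specification; a latent person propensity, which a full MNAR model would typically include, is omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_monotone_missingness(n_persons, n_items, beta, delta):
    """Simulate missing-response indicators d[i, j] where the log-odds of a
    missing response on item j increase with the cumulative number of the
    person's earlier missing responses (illustrative MNAR-style mechanism)."""
    d = np.zeros((n_persons, n_items), dtype=int)
    for j in range(n_items):
        cum_missing = d[:, :j].sum(axis=1)        # previous omissions per person
        logit = -beta[j] + delta * cum_missing    # item effect + carry-over effect
        p_miss = 1.0 / (1.0 + np.exp(-logit))
        d[:, j] = rng.binomial(1, p_miss)
    return d

d = simulate_monotone_missingness(n_persons=500, n_items=20,
                                  beta=np.full(20, 2.0), delta=0.8)
print("overall missing rate:", d.mean())
```

Setting `delta` to zero in this toy version removes the dependence on earlier omissions, loosely mirroring how the paper frames the MAR-versus-MNAR comparison as a constraint on specific parameters.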

The Vulnerability of AI‐Based Scoring Systems to Gaming Strategies: A Case Study

Abstract

Recent developments in the use of large language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies, including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies, but that several vulnerabilities exist.
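To make these strategies concrete, here is a minimal sketch of how gamed responses might be constructed when probing an automated scorer. The example item, the helper names, and the commented-out `score_response` call are hypothetical; ACTA itself is not callable here, so this only illustrates the response-construction side of such a study.

```python
def stem_word_response(item_stem: str, n_words: int = 8) -> str:
    """Gaming strategy 1: answer with words lifted directly from the item stem."""
    words = [w.strip(".,?;:") for w in item_stem.split()]
    return " ".join(words[:n_words])

def shotgun_response(candidate_answers: list[str]) -> str:
    """Gaming strategy 2: concatenate several plausible answers and hope that
    one of them triggers a correct classification."""
    return "; ".join(candidate_answers)

item_stem = "Explain why the boiling point of water decreases at high altitude."
responses = [
    stem_word_response(item_stem),
    shotgun_response(["lower air pressure", "less oxygen",
                      "water molecules escape faster"]),
]

# A study of this kind would submit each constructed response to the scoring
# system and tally how often it is classified as correct, e.g.:
# false_positive_rate = sum(score_response(r) for r in responses) / len(responses)
for r in responses:
    print(r)
```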

Using Item Parameter Predictions for Reducing Calibration Sample Requirements—A Case Study Based on a High‐Stakes Admission Test

Abstract

In item difficulty modeling (IDM), item parameters are predicted from the items' linguistic features, aiming to ultimately render item calibration redundant. Current IDM applications, however, commonly do not yield the required prediction accuracy. To immediately exploit even somewhat inaccurate IDM predictions, we blend IDM with established Bayesian estimation techniques. We propose a two-step approach where (1) IDM predictions are obtained and (2) employed to construct informative prior distributions. We evaluate the approach in a case study on small-sample calibration of the 3PL in a high-stakes test. First, concerning implementation, we find computationally efficient penalized maximum likelihood estimation to be comparable to the best-performing MCMC-based approach. Second, we investigate sample size reductions achievable with state-of-the-art IDM predictions, finding negligible gains compared to merely considering the historical distribution of parameters. Third, we evaluate the prediction accuracy required for a targeted sample size reduction by gradually increasing simulated IDM prediction accuracies. We find that required accuracies can counterbalance each other, allowing calibration sample size to be reduced when either high-quality item difficulty predictions or good predictions of item discriminations and pseudo-guessing parameters are available. We argue that these evaluations provide new, portable IDM benchmarks quantifying performance in terms of achievable sample size reductions.
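The two-step idea can be sketched as follows: an IDM model supplies a predicted difficulty for an item, and that prediction together with its prediction error becomes the mean and standard deviation of a normal prior, which enters estimation as a penalty on the log-likelihood (penalized ML, i.e., MAP estimation). The single-item setup with known abilities, the Rasch-type response function, and the particular numbers below are illustrative assumptions, not the paper's implementation of the 3PL calibration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

# Simulated calibration sample: known abilities, one item with true difficulty 0.7.
theta = rng.normal(size=150)
b_true = 0.7
y = rng.binomial(1, 1 / (1 + np.exp(-(theta - b_true))))

# Step 1 (assumed): an IDM model predicts this item's difficulty from its
# linguistic features, with a known prediction RMSE.
b_pred, pred_rmse = 0.55, 0.40          # hypothetical IDM output

# Step 2: use the prediction as an informative normal prior, i.e., a penalty
# added to the log-likelihood.
def neg_log_posterior(b):
    p = 1 / (1 + np.exp(-(theta - b)))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * ((b - b_pred) / pred_rmse) ** 2
    return -(log_lik + log_prior)

b_map = minimize_scalar(neg_log_posterior, bounds=(-4, 4), method="bounded").x
print(f"penalized estimate: {b_map:.2f} (true {b_true}, IDM prior mean {b_pred})")
```

The tighter the prior (smaller prediction RMSE), the more the IDM prediction substitutes for response data, which is the sense in which better predictions permit smaller calibration samples.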

Using Multilabel Neural Network to Score High‐Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment

Abstract

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Major Preference Assessment (Short CMPA) to classify whether each of the 50 college majors would fall in an examinee's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.
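A minimal multilabel network of the kind described might look like the PyTorch sketch below: one sigmoid output per major, trained with binary cross-entropy, with the decision threshold on the predicted probabilities shifted to favor recall, precision, or overall accuracy. The architecture, synthetic data, and threshold values are assumptions for illustration; the paper's actual way of tuning the network toward different metrics may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

N_ITEMS, N_MAJORS = 60, 50                       # e.g., Short CMPA items -> 50 labels
X = torch.randn(1000, N_ITEMS)                   # synthetic item responses
Y = (torch.rand(1000, N_MAJORS) < 0.06).float()  # synthetic "top-three major" labels

model = nn.Sequential(
    nn.Linear(N_ITEMS, 64), nn.ReLU(),
    nn.Linear(64, N_MAJORS),                     # one logit per major (multilabel)
)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                             # brief training loop
    optim.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optim.step()

probs = torch.sigmoid(model(X))
# Different "use foci": a low threshold trades precision for recall, a high one
# does the opposite; accuracy is typically maximized somewhere in between.
for thr in (0.2, 0.5, 0.8):
    pred = (probs > thr).float()
    tp = (pred * Y).sum()
    recall = (tp / Y.sum()).item()
    precision = (tp / pred.sum().clamp(min=1)).item()
    print(f"threshold {thr}: recall={recall:.2f}, precision={precision:.2f}")
```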

IRT Observed‐Score Equating for Rater‐Mediated Assessments Using a Hierarchical Rater Model

Abstract

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.
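For readers less familiar with the observed-score side of the method, the sketch below implements the standard Lord–Wingersky recursion and an equipercentile mapping between two dichotomously scored forms. It deliberately omits the hierarchical rater layer and polytomous scoring, so it illustrates only the generic observed-score equating machinery under assumed Rasch-type probabilities, not the proposed HRM-based procedure.

```python
import numpy as np

def lord_wingersky(p):
    """Conditional observed-score distribution for one examinee: p[j] is the
    probability of a correct response on item j given that examinee's theta."""
    dist = np.array([1.0])
    for pj in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - pj)      # item answered incorrectly
        new[1:] += dist * pj             # item answered correctly
        dist = new
    return dist

def observed_score_dist(b_items, thetas):
    """Marginal observed-score distribution over a sample of abilities."""
    dists = [lord_wingersky(1 / (1 + np.exp(-(t - b_items)))) for t in thetas]
    return np.mean(dists, axis=0)

rng = np.random.default_rng(2)
thetas = rng.normal(size=2000)
form_x = rng.normal(0.0, 1.0, 30)        # item difficulties, Form X
form_y = rng.normal(0.3, 1.0, 30)        # Form Y is slightly harder

Fx = np.cumsum(observed_score_dist(form_x, thetas))
Fy = np.cumsum(observed_score_dist(form_y, thetas))

# Equipercentile mapping: a Form X score maps to the Form Y score with the
# same cumulative probability.
equated = np.interp(Fx, Fy, np.arange(len(Fy)))
print(np.round(equated[:10], 2))
```

In the rater-mediated setting the abstract addresses, the item-level probabilities would additionally be filtered through a rater model (the HRM), which is exactly the layer this simplified sketch leaves out.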

A Note on the Use of Categorical Subscores

Abstract

Although there is an extensive body of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.
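The two checks in (a) and (b) can be mocked up quickly: simulate true subscores, add error to obtain observed subscores on two parallel forms, cut everything into categories, and compute classification agreement. The cut points, the strength of the true-observed correlation, and the use of raw proportion agreement below are illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 5000, 0.7                          # examinees; correlation of observed with true

true = rng.normal(size=n)
obs_a = r * true + np.sqrt(1 - r**2) * rng.normal(size=n)   # form A subscore
obs_b = r * true + np.sqrt(1 - r**2) * rng.normal(size=n)   # parallel form B subscore

def categorize(x, cuts=(-0.5, 0.5)):
    """Cut a continuous subscore into ordered categories (e.g., low/medium/high)."""
    return np.digitize(x, cuts)

true_cat, cat_a, cat_b = categorize(true), categorize(obs_a), categorize(obs_b)
print("true vs. observed agreement:", np.mean(true_cat == cat_a))
print("parallel-forms agreement:  ", np.mean(cat_a == cat_b))
```

Even with a respectable true-observed correlation, a sizable share of examinees land in the wrong category or in different categories across forms, which is the kind of inaccuracy and inconsistency the note documents.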
