Wiley: Journal of Educational Measurement: Table of Contents

Using Multilabel Neural Network to Score High‐Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment

Abstract

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated by scoring the short version of the College Major Preference Assessment (Short CMPA) to predict whether each of the 50 college majors would appear in one's top three, as determined by the Long CMPA. The results reveal that MNN substantially outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.
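
The core idea can be illustrated with a minimal sketch, not the authors' implementation: a multilabel network outputs a membership probability for every major, and different "use foci" correspond to different decision thresholds applied to the same probabilities. The item counts, the simulated responses, and the 0.5/0.2/0.8 cut-offs below are illustrative assumptions.

```python
# Minimal multilabel scoring sketch (illustrative data, not Short CMPA responses).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_persons, n_items, n_majors = 2000, 60, 50

X = rng.integers(0, 5, size=(n_persons, n_items)).astype(float)   # simulated item responses
W = rng.normal(size=(n_items, n_majors))                          # hypothetical "true" scoring rule
scores = X @ W
Y = np.zeros((n_persons, n_majors), dtype=int)                    # top-3 majors per person = labels
Y[np.arange(n_persons)[:, None], np.argsort(-scores, axis=1)[:, :3]] = 1

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X[:1500], Y[:1500])                                       # multilabel fit: Y is an indicator matrix
proba = np.asarray(net.predict_proba(X[1500:]))                   # per-major membership probabilities
Yt = Y[1500:].astype(bool)

# Different use foci = different thresholds on the same probabilities.
for focus, cut in {"balanced": 0.5, "screen-in (recall)": 0.2, "screen-out (precision)": 0.8}.items():
    pred = proba >= cut
    tp = (pred & Yt).sum(); fp = (pred & ~Yt).sum(); fn = (~pred & Yt).sum()
    print(focus,
          "recall=%.2f" % (tp / max(tp + fn, 1)),
          "precision=%.2f" % (tp / max(tp + fp, 1)))
```

Lowering the threshold trades precision for recall, which is the mechanism by which one trained network can serve several reporting purposes.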

IRT Observed‐Score Equating for Rater‐Mediated Assessments Using a Hierarchical Rater Model

Abstract

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.
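
For readers unfamiliar with the general machinery, the sketch below shows generic IRT observed-score equating for dichotomous 2PL items: the Lord-Wingersky recursion builds each form's model-implied observed-score distribution, and scores are then linked equipercentile-style. The hierarchical rater model, the generalized partial credit model, and the rater-error conditions studied in the article are not implemented here; the item parameters, quadrature, and simplified (non-continuized) linking are illustrative assumptions.

```python
# Generic IRT observed-score equating sketch (2PL items, illustrative parameters).
import numpy as np

def p_correct(a, b, theta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def observed_score_dist(a, b, thetas, weights):
    """Marginal observed-score distribution via the Lord-Wingersky recursion."""
    dist = np.zeros(len(a) + 1)
    for theta, w in zip(thetas, weights):
        f = np.array([1.0])                      # P(score | theta), built item by item
        for p in p_correct(a, b, theta):
            f = np.concatenate([f * (1 - p), [0.0]]) + np.concatenate([[0.0], f * p])
        dist += w * f
    return dist / dist.sum()

def equipercentile(dist_x, dist_y):
    """Map each form-X score to the form-Y score with the same percentile rank."""
    cx, cy = np.cumsum(dist_x), np.cumsum(dist_y)
    return np.interp(cx, cy, np.arange(len(dist_y)))

rng = np.random.default_rng(1)
thetas = np.linspace(-4, 4, 41)
weights = np.exp(-0.5 * thetas**2); weights /= weights.sum()      # normal quadrature

a_x, b_x = rng.uniform(0.8, 1.6, 20), rng.normal(0.0, 1.0, 20)    # form X items
a_y, b_y = rng.uniform(0.8, 1.6, 20), rng.normal(0.2, 1.0, 20)    # form Y items (slightly harder)

dist_x = observed_score_dist(a_x, b_x, thetas, weights)
dist_y = observed_score_dist(a_y, b_y, thetas, weights)
print(np.round(equipercentile(dist_x, dist_y), 2))                # form-X -> form-Y equivalents
```

In the article's setting, the observed-score distributions would instead be derived from a hierarchical rater model so that rater bias and variability are absorbed before the linking step.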

A Note on the Use of Categorical Subscores

Abstract

Although extensive research exists on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.
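
The two agreement checks described in the abstract can be illustrated with a small simulation sketch, assuming a single subscale, a unit-variance true subscore, a reliability of .60, and a two-category cut at zero; none of these values come from the article.

```python
# Classification accuracy and consistency sketch for categorical subscores (illustrative settings).
import numpy as np

rng = np.random.default_rng(7)
n_examinees, reliability, cut = 5000, 0.60, 0.0          # short subscales -> modest reliability

true_sub = rng.normal(size=n_examinees)
noise_sd = np.sqrt((1 - reliability) / reliability)       # error variance implied by the reliability
form_a = true_sub + rng.normal(scale=noise_sd, size=n_examinees)   # parallel form A
form_b = true_sub + rng.normal(scale=noise_sd, size=n_examinees)   # parallel form B

classify = lambda x: (x >= cut).astype(int)               # two categories: "needs work" / "on track"

accuracy    = np.mean(classify(form_a) == classify(true_sub))   # (a) true vs. observed classification
consistency = np.mean(classify(form_a) == classify(form_b))     # (b) agreement across parallel forms
print(f"classification accuracy:    {accuracy:.2f}")
print(f"classification consistency: {consistency:.2f}")
```

With unreliable subscores, both rates fall well short of 1, which is the kind of evidence the note uses to question remediation decisions based on categorical subscores.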
