Shortening Psychological Scales: Semantic Similarity Matters

Educational and Psychological Measurement, Ahead of Print.
In this study, we proposed a novel scale abbreviation method based on sentence embeddings and compared it to two established automatic scale abbreviation techniques. Scale abbreviation methods typically rely on administering the full scale to a large representative sample, which is often impractical. Our approach leverages the semantic similarity among the items to select abbreviated versions of scales without requiring response data, offering a practical alternative for scale development. We found that the sentence embedding method performs comparably to the data-driven scale abbreviation approaches in terms of model fit, measurement accuracy, and ability estimates. In addition, our results reveal a moderate negative correlation between item discrimination parameters and semantic similarity indices, suggesting that semantically unique items may have higher discrimination power. This supports the notion that semantic features can be predictive of psychometric properties. However, this relationship was not observed for reverse-scored items, which may require further investigation. Overall, our findings suggest that the sentence embedding approach offers a promising solution for scale abbreviation, particularly in situations where large sample sizes are unavailable, and may eventually serve as an alternative to traditional data-driven methods.
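
As an illustration of the core idea, here is a minimal Python sketch of semantic-similarity-based item selection, assuming the sentence-transformers package. The selection rule shown (keep the k items with the lowest mean similarity to the remaining items) and the example items are illustrative assumptions, not necessarily the authors' exact criterion.

```python
# Minimal sketch: select semantically distinct items without response data.
import numpy as np
from sentence_transformers import SentenceTransformer

items = [
    "I often feel anxious in social situations.",
    "Meeting new people makes me nervous.",
    "I enjoy spending time alone.",
    "I worry about what others think of me.",
]
k = 2  # length of the abbreviated scale

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(items, normalize_embeddings=True)   # unit-length embeddings
sim = emb @ emb.T                                      # cosine similarity matrix

# Mean similarity of each item to all other items (exclude the diagonal of 1s).
mean_sim = (sim.sum(axis=1) - 1.0) / (len(items) - 1)

# Retain the semantically most distinct items.
keep = np.argsort(mean_sim)[:k]
print("Retained items:", [items[i] for i in sorted(keep)])
```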

Overestimation of Internal Consistency by Coefficient Omega in Data Giving Rise to a Centroid-Like Factor Solution

Educational and Psychological Measurement, Ahead of Print.
Coefficient Omega measuring internal consistency is investigated for its deviations from expected outcomes when applied to correlational patterns that produce variable-focused factor solutions in confirmatory factor analysis. In these solutions, the factor loadings on the factor of the one-factor measurement model closely correspond to the correlations of one manifest variable with the other manifest variables, as is the case in centroid solutions. It is demonstrated that in such a situation, a heterogeneous correlational pattern leads to an Omega estimate larger than those for similarly heterogeneous and uniform patterns. A simulation study reveals that these deviations are restricted to datasets with small numbers of manifest variables and that the degree of heterogeneity determines the degree of deviation. We propose a method for identifying variable-focused factor solutions and suggest how to deal with the resulting deviations.
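
For reference, coefficient Omega for a one-factor model can be computed from the standardized loadings as in the short Python sketch below; the loading pattern is hypothetical and only shows where a heterogeneous pattern enters the formula.

```python
# Coefficient omega from standardized loadings of a one-factor model.
import numpy as np

def coefficient_omega(loadings):
    """Omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of uniquenesses]."""
    loadings = np.asarray(loadings, dtype=float)
    uniquenesses = 1.0 - loadings**2          # assumes standardized variables
    common = loadings.sum() ** 2
    return common / (common + uniquenesses.sum())

print(coefficient_omega([0.7, 0.6, 0.5, 0.4]))  # about 0.64 for this small, heterogeneous set
```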

Obtaining a Bayesian Estimate of Coefficient Alpha Using a Posterior Normal Distribution

Educational and Psychological Measurement, Ahead of Print.
A new alternative for obtaining a Bayesian estimate of coefficient alpha through a posterior normal distribution is proposed and assessed in a simulation study using percentile, normal-theory-based, and highest probability density credible intervals. The results indicate that the proposed Bayesian method for estimating coefficient alpha has acceptable coverage probability performance across the majority of the investigated simulation conditions.
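
The three credible interval types can be illustrated with the Python sketch below, which assumes draws from some normal posterior approximation for alpha; the construction of that posterior is the paper's contribution and is not reproduced here, and the posterior mean and SD used are hypothetical.

```python
# Illustrative comparison of percentile, normal-theory, and HPD intervals
# from hypothetical posterior draws of coefficient alpha.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(loc=0.85, scale=0.02, size=20_000)  # hypothetical posterior draws

# Percentile interval: empirical 2.5th and 97.5th quantiles.
perc = np.quantile(draws, [0.025, 0.975])

# Normal-theory interval: posterior mean +/- 1.96 * posterior SD.
m, s = draws.mean(), draws.std(ddof=1)
norm = (m - 1.96 * s, m + 1.96 * s)

# Highest posterior density interval: shortest window covering 95% of draws.
sorted_draws = np.sort(draws)
n_keep = int(np.ceil(0.95 * len(sorted_draws)))
widths = sorted_draws[n_keep - 1:] - sorted_draws[:len(sorted_draws) - n_keep + 1]
lo = np.argmin(widths)
hpd = (sorted_draws[lo], sorted_draws[lo + n_keep - 1])

print(perc, norm, hpd)
```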

Examining the Instructional Sensitivity of Constructed-Response Achievement Test Item Scores

Educational and Psychological Measurement, Ahead of Print.
Inferences about student learning from large-scale achievement test scores are fundamental in education. For achievement test scores to provide useful information about student learning progress, differences in the content of instruction (i.e., the implemented curriculum) should affect test-takers’ item responses. Existing research has begun to identify patterns in the content of instructionally sensitive multiple-choice achievement test items. To inform future test design decisions, this study identified instructionally (in)sensitive constructed-response achievement items, then characterized features of those items and their corresponding scoring rubrics. First, we used simulation to evaluate an item step difficulty difference index for constructed-response test items, derived from the generalized partial credit model. The statistical performance of the index was adequate, so we then applied it to data from 32 constructed-response eighth-grade science test items. We found that the instructional sensitivity (IS) index values varied appreciably across the category boundaries within an item as well as across items. Content analysis by master science teachers allowed us to identify general features of item score categories that show high, or negligible, IS.
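
The abstract does not spell out the index's formula; the Python sketch below only illustrates plausible ingredients, namely GPCM category probabilities and per-boundary differences in step difficulties between instructed and non-instructed groups, with hypothetical parameter values.

```python
# GPCM category probabilities and a hypothetical step-level difficulty difference.
import numpy as np

def gpcm_probs(theta, a, b_steps):
    """GPCM category probabilities for one item (b_steps: step difficulties)."""
    cum = np.concatenate([[0.0], np.cumsum(a * (theta - b_steps))])
    expc = np.exp(cum - cum.max())
    return expc / expc.sum()

# Hypothetical step difficulties for the same item in two instruction groups.
b_instructed     = np.array([-0.8, 0.1, 0.9])
b_not_instructed = np.array([-0.3, 0.7, 1.6])

step_difference = b_not_instructed - b_instructed   # larger gap = more sensitive boundary
print(step_difference)
print(gpcm_probs(theta=0.5, a=1.2, b_steps=b_instructed))
```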

The Impact of Attentiveness Interventions on Survey Data

Educational and Psychological Measurement, Ahead of Print.
Social and behavioral science researchers who use survey data are vigilant about data quality, with an increasing emphasis on avoiding common method variance (CMV) and insufficient effort responding (IER). Each of these errors can inflate or deflate substantive relationships, and there are both a priori and post hoc means to address them. Yet, little research has investigated how both IER and CMV are affected by the different procedural or statistical techniques used to address them. More specifically, if interventions to reduce IER are used, does this affect CMV in the data? In an experiment conducted both in and out of the laboratory, we investigate the impact of attentiveness interventions, such as a Factual Manipulation Check (FMC), on both IER and CMV in same-source survey data. In addition to typical IER measures, we also track whether respondents play the instructional video and record their mouse movements. The results show that while the interventions have some impact on the level of participant attentiveness, they do not appear to lead to differing levels of CMV.

“What If Applicants Fake Their Responses?”: Modeling Faking and Response Styles in High-Stakes Assessments Using the Multidimensional Nominal Response Model

Educational and Psychological Measurement, Ahead of Print.
Self-report personality tests used in high-stakes assessments hold the risk that test-takers engage in faking. In this article, we demonstrate an extension of the multidimensional nominal response model (MNRM) to account for the response bias of faking. The MNRM is a flexible item response theory (IRT) model that allows modeling response biases whose effect patterns vary between items. In a simulation, we found good parameter recovery of the model accounting for faking under different conditions as well as good performance of model selection criteria. Also, we modeled responses from N = 3,046 job applicants taking a personality test under real high-stakes conditions. We thereby specified item-specific effect patterns of faking by setting scoring weights to appropriate values that we collected in a pilot study. Results indicated that modeling faking significantly increased model fit over and above response styles and improved divergent validity, while the faking dimension exhibited relations to several covariates. Additionally, applying the model to a sample of job incumbents taking the test under low-stakes conditions, we found evidence that the model can effectively capture faking and adjust estimates of substantive trait scores for the assumed influence of faking. We end the article with a discussion of implications for psychological measurement in high-stakes assessment contexts.
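
A minimal Python sketch of MNRM category probabilities with an added faking dimension is given below; the parameterization is simplified to one substantive trait plus one faking dimension, and the scoring weights are hypothetical rather than the pilot-study values.

```python
# MNRM-style category probabilities with item-specific scoring weights.
import numpy as np

def mnrm_probs(theta, scoring_weights, intercepts):
    """theta: (n_dims,), scoring_weights: (n_cats, n_dims), intercepts: (n_cats,)."""
    z = scoring_weights @ theta + intercepts        # log-odds per category
    z -= z.max()                                    # numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Five response categories; column 0 = substantive trait, column 1 = faking.
weights = np.array([[-2., 0.],
                    [-1., 0.],
                    [ 0., 1.],
                    [ 1., 2.],
                    [ 2., 3.]])
intercepts = np.zeros(5)

print(mnrm_probs(np.array([0.5, 0.0]), weights, intercepts))  # honest respondent
print(mnrm_probs(np.array([0.5, 1.5]), weights, intercepts))  # same trait, faking up
```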

A Comparison of the Next Eigenvalue Sufficiency Test to Other Stopping Rules for the Number of Factors in Factor Analysis

Educational and Psychological Measurement, Ahead of Print.
A plethora of techniques exist to determine the number of factors to retain in exploratory factor analysis. A recent and promising technique, the Next Eigenvalue Sufficiency Test (NEST), has not yet been systematically compared with well-established stopping rules. The present study proposes a simulation with synthetic factor structures to compare NEST, parallel analysis, the sequential χ² test, the Hull method, and the empirical Kaiser criterion. The structures were based on 24 variables containing one to eight factors, loadings ranged from .40 to .80, inter-factor correlations ranged from .00 to .30, and three sample sizes were used. In total, 360 scenarios were replicated 1,000 times. Performance was evaluated in terms of accuracy (correct identification of dimensionality) and bias (tendency to over- or underestimate dimensionality). Overall, NEST showed the best performance, especially in hard conditions where it had to detect small but meaningful factors. It had a tendency to underextract, but to a lesser extent than the other methods. The second-best method was parallel analysis, which was more liberal in harder cases. The three other stopping rules showed pitfalls: the sequential χ² test and the Hull method even in some easy conditions, and the empirical Kaiser criterion in hard conditions.
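
For orientation, the sketch below implements one of the compared stopping rules, parallel analysis, in Python; NEST itself involves resampling eigenvalues under a k-factor null model and is not reproduced here.

```python
# Parallel analysis: retain factors whose sample eigenvalues exceed the 95th
# percentile of eigenvalues obtained from random data of the same size.
import numpy as np

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eigs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    sim_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        sim = rng.standard_normal((n, p))
        sim_eigs[s] = np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False))[::-1]
    thresholds = np.percentile(sim_eigs, percentile, axis=0)
    # Count leading eigenvalues above threshold, stopping at the first failure.
    below = np.nonzero(obs_eigs <= thresholds)[0]
    return int(below[0]) if below.size else p

# Demo: data with one common factor among six indicators.
rng = np.random.default_rng(1)
f = rng.standard_normal((500, 1))
demo = 0.7 * f + 0.7 * rng.standard_normal((500, 6))
print(parallel_analysis(demo))   # expected: 1
```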

An Omega-Hierarchical Extension Index for Second-Order Constructs With Hierarchical Measuring Instruments

Educational and Psychological Measurement, Ahead of Print.
An index extending the widely used omega-hierarchical coefficient is discussed, which can be used for evaluating the influence of a second-order factor on the interrelationships among the components of a hierarchical measuring instrument. The index represents a useful and informative complement to the traditional omega-hierarchical measure of explained overall scale score variance by that underlying construct. A point and interval estimation procedure is outlined for the described index, which is based on model reparameterization and is developed within the latent variable modeling framework. The method is readily applicable with popular software and is illustrated with examples.
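
The traditional omega-hierarchical coefficient that the proposed index complements can be sketched as follows; in a second-order model, each item's general-factor loading equals its first-order loading times its factor's second-order loading (a Schmid-Leiman decomposition). The loading values below are hypothetical, and the new extension index itself is not reproduced here.

```python
# Omega-hierarchical from a second-order model with two subscales of three items.
import numpy as np

first_order  = np.array([0.70, 0.60, 0.65, 0.70, 0.55, 0.60])  # item -> first-order factor
second_order = np.array([0.80, 0.80, 0.80, 0.70, 0.70, 0.70])  # that factor -> general factor

general = first_order * second_order                      # general-factor loadings
group   = first_order * np.sqrt(1 - second_order**2)      # residualized group-factor loadings
unique  = 1 - first_order**2                              # uniquenesses (standardized items)

total_variance = (general.sum()**2
                  + group[:3].sum()**2 + group[3:].sum()**2
                  + unique.sum())
omega_h = general.sum()**2 / total_variance
print(round(omega_h, 3))   # roughly 0.55 for these hypothetical values
```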

The Impact of Missing Data on Parameter Estimation: Three Examples in Computerized Adaptive Testing

Educational and Psychological Measurement, Ahead of Print.
In computerized adaptive testing (CAT), examinees see items targeted to their ability level. Postoperational data have a high degree of missing information relative to designs where everyone answers all questions. Item responses are observed over a restricted range of abilities, reducing item-total score correlations. However, if the adaptive item selection depends only on observed responses, the data are missing at random (MAR). We simulated data from three different testing designs (common items, randomly selected items, and CAT) and found that it was possible to re-estimate both person and item parameters from postoperational CAT data. In a multidimensional CAT, we show that it is necessary to include all responses from the testing phase to avoid violating missing data assumptions. We also observed that some CAT designs produced “reversals,” where item discriminations became negative, causing dramatic under- and over-estimation of abilities. Our results apply to situations where researchers work with data drawn from adaptive testing or from instructional tools with adaptive delivery. To avoid bias, researchers must make sure they use all the data necessary to meet the MAR assumption.
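
The MAR argument can be made concrete with a toy Rasch CAT in Python: the next item depends only on responses already observed, never on the unobserved responses themselves. The item bank and the provisional ability update below are simplified placeholders.

```python
# Toy fixed-length Rasch CAT illustrating selection based only on observed data.
import numpy as np

rng = np.random.default_rng(7)
b = np.linspace(-2, 2, 30)            # item difficulties
theta_true = 0.8
theta_hat, administered, responses = 0.0, [], []

for _ in range(10):                   # fixed-length CAT of 10 items
    remaining = [j for j in range(len(b)) if j not in administered]
    j = min(remaining, key=lambda k: abs(b[k] - theta_hat))   # max information under Rasch
    p = 1.0 / (1.0 + np.exp(-(theta_true - b[j])))
    x = int(rng.random() < p)
    administered.append(j)
    responses.append(x)
    # Crude provisional update; an operational CAT would use ML/EAP estimation.
    theta_hat += (x - 0.5) * 0.8

print(administered, responses, round(theta_hat, 2))
```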

Factor Retention in Exploratory Multidimensional Item Response Theory

Educational and Psychological Measurement, Ahead of Print.
Multidimensional Item Response Theory (MIRT) is applied routinely in developing educational and psychological assessment tools, for instance, for exploring multidimensional structures of items using exploratory MIRT. A critical decision in exploratory MIRT analyses is the number of factors to retain. Unfortunately, the comparative properties of statistical methods and innovative Machine Learning (ML) methods for factor retention in exploratory MIRT analyses are still not clear. This study aims to fill this gap by comparing a selection of statistical and ML methods, including the Kaiser Criterion (KC), Empirical Kaiser Criterion (EKC), Parallel Analysis (PA), scree plot (OC and AF), Very Simple Structure (VSS; C1 and C2), Minimum Average Partial (MAP), Exploratory Graph Analysis (EGA), Random Forest (RF), Histogram-based Gradient Boosted Decision Trees (HistGBDT), eXtreme Gradient Boosting (XGBoost), and Artificial Neural Network (ANN). The comparison was performed using 720,000 dichotomous response data sets simulated from MIRT models, covering various between-item and within-item structures and reflecting characteristics of large-scale assessments. The results show that MAP, RF, HistGBDT, XGBoost, and ANN substantially outperform the other methods. Among them, HistGBDT generally performs best. Furthermore, including the statistical methods' results as training features improves the ML methods' performance. The methods' correct-factoring proportions decrease with an increase in missingness or a decrease in sample size. KC, PA, EKC, and scree plot (OC) tend to over-factor, while EGA, scree plot (AF), and VSS (C1) tend to under-factor. We recommend that practitioners use both MAP and HistGBDT to determine the number of factors when applying exploratory MIRT.
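
A minimal sketch of the machine-learning route is shown below using scikit-learn's histogram-based gradient boosting classifier, with sorted eigenvalues of simulated dichotomous data as features; the simulation design and feature set are deliberately simplified assumptions, not the study's design.

```python
# Train a histogram-based gradient boosting classifier to predict the number of
# factors from eigenvalue features of simulated dichotomous response data.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)

def simulate_features(n_factors, n_items=24, n_persons=500):
    loadings = np.zeros((n_items, n_factors))
    for j in range(n_items):
        loadings[j, j % n_factors] = rng.uniform(0.4, 0.8)
    factors = rng.standard_normal((n_persons, n_factors))
    errors = rng.standard_normal((n_persons, n_items)) * np.sqrt(1 - (loadings**2).sum(1))
    continuous = factors @ loadings.T + errors
    data = (continuous > 0).astype(int)                     # dichotomize
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

labels = rng.integers(1, 9, size=400)                       # true number of factors (1-8)
X = np.array([simulate_features(k) for k in labels])

clf = HistGradientBoostingClassifier().fit(X[:300], labels[:300])
print((clf.predict(X[300:]) == labels[300:]).mean())        # holdout accuracy
```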

Item Classification by Difficulty Using Functional Principal Component Clustering and Neural Networks

Educational and Psychological Measurement, Ahead of Print.
Maintaining consistent item difficulty across test forms is crucial for accurately and fairly classifying examinees into pass or fail categories. This article presents a practical procedure for classifying items based on difficulty levels using functional data analysis (FDA). Methodologically, we clustered item characteristic curves (ICCs) into difficulty groups by analyzing their functional principal components (FPCs) and then employed a neural network to predict difficulty levels for ICCs. Given the degree of similarity between many ICCs, categorizing items by difficulty can be challenging. The strength of this method lies in its ability to provide an empirical and consistent process for item classification, as opposed to relying solely on visual inspection. The findings reveal that most discrepancies between visual classification and FDA results differed by only one adjacent difficulty level. Approximately 67% of these discrepancies involved items in the medium to hard range being categorized into higher difficulty levels by FDA, while the remaining third involved very easy to easy items being classified into lower levels. The neural network, trained on these data, achieved an accuracy of 79.6%, with misclassifications also differing by only one adjacent difficulty level compared to FDA clustering. The method offers an efficient and practical procedure for classifying test items, which is especially beneficial in testing programs where smaller volumes of examinees are tested at various times throughout the year.
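
The clustering step can be approximated with the Python sketch below: evaluate each item characteristic curve on a theta grid, take leading principal component scores as a discrete stand-in for functional principal components, and cluster them into difficulty groups. Item parameters and the number of clusters are hypothetical.

```python
# Cluster 2PL item characteristic curves by their leading principal component scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
theta = np.linspace(-4, 4, 81)
a = rng.uniform(0.8, 2.0, size=40)            # discriminations
b = rng.uniform(-2.5, 2.5, size=40)           # difficulties

icc = 1.0 / (1.0 + np.exp(-a[:, None] * (theta[None, :] - b[:, None])))  # 40 curves

scores = PCA(n_components=3).fit_transform(icc)     # FPC-like scores
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

for g in range(4):
    print(f"cluster {g}: mean difficulty b = {b[groups == g].mean():.2f}")
```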

Examination of ChatGPT’s Performance as a Data Analysis Tool

Educational and Psychological Measurement, Ahead of Print.
This study examines the performance of ChatGPT, developed by OpenAI and widely used as an AI-based conversational tool, as a data analysis tool through exploratory factor analysis (EFA). To this end, simulated data were generated under various data conditions, including normal distribution, response category, sample size, test length, factor loading, and measurement models. The generated data were analyzed using ChatGPT-4o twice with a 1-week interval under the same prompt, and the results were compared with those obtained using R code. In data analysis, the Kaiser–Meyer–Olkin (KMO) value, total variance explained, and the number of factors estimated using the empirical Kaiser criterion, Hull method, and Kaiser–Guttman criterion, as well as factor loadings, were calculated. The findings obtained from ChatGPT at two different times were found to be consistent with those obtained using R. Overall, ChatGPT demonstrated good performance for steps that require only computational decisions without involving researcher judgment or theoretical evaluation (such as KMO, total variance explained, and factor loadings). However, for multidimensional structures, although the estimated number of factors was consistent across analyses, biases were observed, suggesting that researchers should exercise caution in such decisions.
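
The benchmark analyses run outside ChatGPT can be sketched in Python assuming the factor_analyzer package (the study itself used R); the data-generation settings below are placeholders.

```python
# EFA benchmark sketch: KMO, factor loadings, and total variance explained.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer, calculate_kmo

rng = np.random.default_rng(42)
n, p, k = 500, 12, 2
loadings = np.zeros((p, k))
loadings[:6, 0], loadings[6:, 1] = 0.7, 0.6
data = pd.DataFrame(rng.standard_normal((n, k)) @ loadings.T
                    + rng.standard_normal((n, p)) * 0.7)

kmo_per_item, kmo_overall = calculate_kmo(data)
fa = FactorAnalyzer(n_factors=k, rotation="oblimin").fit(data)
_, _, cumulative = fa.get_factor_variance()

print("KMO:", round(kmo_overall, 3))
print(np.round(fa.loadings_, 2))
print("total variance explained:", round(cumulative[-1], 3))
```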