Comparative Judgement for evaluating young learners’ EFL writing performances: Reliability and teacher perceptions of holistic and dimension-based judgements

Language Testing, Ahead of Print.
Comparative Judgement (CJ) is an evaluation method, typically conducted online, whereby a rank order is constructed, and scores calculated, from judges’ pairwise comparisons of performances. CJ has been researched in various educational contexts, though only rarely in English as a Foreign Language (EFL) writing settings, and is generally agreed to be a reliable method of evaluating performances. This study extends the CJ research base to young learner EFL writing contexts and innovates CJ procedures with a novel dimension-based approach. Twenty-seven Austrian EFL educators evaluated 300 young learners’ EFL scripts (addressing two task types) from a national examination, using three scoring methods: standard CJ (holistic), CJ by dimensions (our new criteria-based method), and the exam’s conventional analytic rating. It was found that both holistic CJ and our dimension-based CJ were reliable methods of evaluating young learners’ EFL scripts. Experienced EFL teachers who had also used marking schemes proved to be reliable CJ judges. Moreover, despite the preference of some for the more familiar analytic rating method, teachers displayed higher reliability and shorter decision-making times when using CJ. Benefits of dimension-based CJ for reliable and economical scoring of large-scale young learner EFL writing scripts, and the potential for positive washback, are discussed.
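
The scoring step in CJ, turning pairwise wins into a rank order and scores, is typically handled with a Bradley–Terry or Rasch-type model. The abstract does not specify the authors’ estimation procedure, so the following Python sketch uses hypothetical comparison data to illustrate the general idea.

```python
# Minimal Bradley-Terry sketch for Comparative Judgement (CJ).
# Hypothetical data: each tuple (i, j) means script i was preferred over script j
# in one pairwise comparison. Not the authors' actual estimation pipeline.
import numpy as np
from scipy.optimize import minimize

comparisons = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 3), (1, 3), (1, 0), (2, 1), (3, 2)]
n_scripts = 4

def neg_log_likelihood(theta):
    # P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j))
    ll = 0.0
    for winner, loser in comparisons:
        ll += theta[winner] - np.logaddexp(theta[winner], theta[loser])
    return -ll

def objective(theta):
    # Small ridge penalty: anchors the scale (scores are identified only up to
    # a constant) and keeps estimates finite with sparse comparison data.
    return neg_log_likelihood(theta) + 0.01 * np.sum(theta ** 2)

result = minimize(objective, np.zeros(n_scripts), method="BFGS")
scores = result.x
print("Estimated CJ scores:", np.round(scores, 2))
print("Rank order (best first):", np.argsort(-scores))
```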

A systematic review of differential item functioning in second language assessment

Language Testing, Ahead of Print.
The growing diversity among test takers in second or foreign language (L2) assessments brings fairness front and center. This systematic review aimed to examine how fairness in L2 assessments was evaluated through differential item functioning (DIF) analysis. A total of 83 articles from 27 journals were included in the systematic review. The findings suggested that classical DIF techniques were dominant, particularly Rasch-based methods, the Mantel–Haenszel procedure, item response theory (IRT) approaches, logistic regression, and SIBTEST, but emerging methods such as DIF analysis based on cognitive diagnostic models were also identified. Most DIF studies examined manifest grouping variables such as gender and language background and were based on assessments of receptive language skills such as reading and listening comprehension. DIF analyses were mostly conducted in an exploratory fashion, and causes of DIF were often justified on speculative rather than empirical grounds. In addition, the quality of DIF analyses was undermined by suboptimal reporting practices. Our results suggest the need to improve current DIF practices, to consider alternative DIF detection methods aligning with emerging views of measurement bias, and to adequately account for the heterogeneity of L2 test takers. The findings have implications for test design and use, fairness, and validity in L2 assessments.
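
Among the classical techniques the review identifies, the Mantel–Haenszel procedure is one of the most common. As an illustration only, the sketch below computes the MH common odds ratio and the ETS delta-scale index (MH D-DIF) for one dichotomous item from hypothetical stratified counts; it is not drawn from any reviewed study.

```python
# Minimal Mantel-Haenszel DIF sketch for a single dichotomous item.
# Each ability stratum: [[ref_correct, ref_incorrect], [focal_correct, focal_incorrect]]
import numpy as np

strata = [
    np.array([[40, 20], [30, 30]]),
    np.array([[55, 15], [45, 25]]),
    np.array([[70, 10], [60, 18]]),
]

num = den = 0.0
for tab in strata:
    a, b = tab[0]          # reference group: correct, incorrect
    c, d = tab[1]          # focal group: correct, incorrect
    n = tab.sum()
    num += a * d / n
    den += b * c / n

alpha_mh = num / den                    # MH common odds ratio
mh_d_dif = -2.35 * np.log(alpha_mh)     # ETS delta scale; negative values favour the reference group
print(f"MH odds ratio = {alpha_mh:.2f}, MH D-DIF = {mh_d_dif:.2f}")
```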

An automatized semantic analysis of two large-scale listening tests: A corpus-based study

Language Testing, Ahead of Print.
This study examined the semantic features of the simulated mini-lectures in the listening sections of the International English Language Testing System (IELTS) and the Test of English as a Foreign Language (TOEFL) based on automatized semantic analysis to explore the content validity of the two tests. Two study corpora were utilized, the IELTS corpus with 56 mini-lectures (38,944 words) and the TOEFL corpus with 285 mini-lectures (207,296 words). The reference corpus comprised 59 lectures from the Michigan Corpus of Academic Spoken English (MICASE), totaling 571,354 words. The corpora were submitted to automatized semantic tagging using Wmatrix5. Three comparisons were conducted: IELTS versus TOEFL, IELTS versus MICASE lectures, and TOEFL versus MICASE lectures. The results suggest that the IELTS and TOEFL mini-lectures shared 78% and 64%, respectively, of the semantic features found in MICASE, supporting their relative content validity. Nevertheless, specific semantic categories, such as politics, war, and intimate and sexual relationships, were notably absent from the test corpora, even though they appeared in the academic lecture corpus. In addition, causal connectors were frequently used in both tests, and the IELTS mini-lectures covered fewer academic discourse fields than the TOEFL mini-lectures. Implications for content validity are discussed.
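
Corpus comparisons of this kind are typically summarised with the log-likelihood (G2) keyness statistic that Wmatrix reports for each semantic tag. The sketch below implements that statistic for a single category; the corpus sizes come from the abstract, but the category frequencies are hypothetical.

```python
# Log-likelihood (G2) keyness sketch for one semantic category across two corpora.
# Counts are hypothetical; the actual study used Wmatrix5's built-in comparison.
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 for a category observed freq_a times in corpus A (size_a tokens)
    and freq_b times in corpus B (size_b tokens)."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    g2 = 0.0
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2

# e.g. a causal-connector-like category in the TOEFL corpus vs. the MICASE reference corpus
print(round(log_likelihood(350, 207_296, 620, 571_354), 2))
```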

Review of the Canadian English Language Proficiency Index Program (CELPIP)

Language Testing, Volume 42, Issue 1, Page 100-113, January 2025.
The Canadian English Language Proficiency Index Program (CELPIP) is a computer-delivered test for English language proficiency, primarily used for Canadian immigration purposes. This review begins by contextualizing the test’s use as an immigration gatekeeping instrument, followed by an overview of its underlying construct and the four test components: listening, reading, writing, and speaking. We then appraise the test in terms of its accessibility, reliability, validity, authenticity, and impact. While we appreciate the “Canadian-ness” of the test, the user-friendly computer-based test delivery, and the accessible approach to sharing scoring criteria, we also identify several shortcomings regarding transparency in scoring, attention to interactional competence, and attention to research on test impact. We close with a brief commentary on the use of such tests for selecting and controlling immigrants.

Open Access in language testing and assessment: The case of two flagship journals

Language Testing, Volume 41, Issue 4, Page 703-728, October 2024.
This study is a systematic examination of the open access status of research in two flagship language testing and assessment journals: Language Testing and Language Assessment Quarterly. Coding and analysing 898 articles, we investigated (a) the prevalence of open access in four aspects (open manuscripts, open materials, open data, and open code), and (b) the relationship between open access and various characteristics of research, tests, and researchers. Our study revealed a positive trend in the adoption of open access over time, with open manuscripts and materials showing notable increases. Open code and data have remained scarce, though with a recent uptick from a low base. Notably, logistic regression results suggest inequitable participation in open access as authors from the Global South were less likely to have open manuscripts. Recognising the potential role of flagship journals as trend and standard setters, we call on the field to (a) shift towards more equitable open access models, (b) balance intellectual property concerns with validation needs, (c) recognise open code and open data with protected access via dedicated badges, and (d) adopt Research Transparency Statements, a new reporting structure inclusive of methodological and epistemological differences in open research practices.
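
The inequity finding rests on logistic regression of open-access status on researcher characteristics. A minimal sketch of that kind of model, with a small hypothetical data frame standing in for the 898 coded articles, might look as follows.

```python
# Minimal logistic-regression sketch relating open-manuscript status to author region.
# The data frame below is hypothetical; the study coded 898 real articles and
# modelled additional article and researcher characteristics.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "open_manuscript": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    "global_south":    [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})

model = smf.logit("open_manuscript ~ global_south", data=df).fit(disp=False)
print(model.params)   # a negative global_south coefficient means lower odds of an open manuscript
```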

Evaluating the impact of nonverbal behavior on language ability ratings

Language Testing, Volume 41, Issue 4, Page 729-758, October 2024.
Nonverbal behavior can impact language proficiency scores in speaking tests, but there is little empirical information about the size or consistency of its effects or about whether language proficiency may be a moderating variable. In this study, 100 novice raters watched and scored 30 recordings of test takers taking an international, high-stakes proficiency test. The speech samples were each 2 minutes long and ranged across proficiency levels. The raters scored each sample on fluency, vocabulary, grammar, and comprehensibility using 7-point semantic differential scales. Nonverbal behavior was extracted using automated machine-learning software (iMotions), and the data were analyzed with ordinal mixed-effects regression. Results showed that attentional variance predicted fluency, vocabulary, and grammar scores, but only when accounting for proficiency. Higher standard deviations of attention corresponded with lower scores for the lower-proficiency group, but not the mid/higher-proficiency group. Comprehensibility scores were predicted by mean valence only when proficiency was included as an interaction term. Higher mean valence, or positive emotional behavior, corresponded with higher scores in the lower-proficiency group, but not the mid/higher-proficiency group. Effect sizes for these predictors were quite small, with little variance explained. These results have implications for construct representation and test fairness.
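
The analysis used ordinal mixed-effects regression on 7-point ratings. Python lacks a mature ordinal mixed-effects implementation, so the sketch below fits a simplified fixed-effects ordinal logit (statsmodels OrderedModel) with an attention-by-proficiency interaction on simulated data; it omits the rater and sample random effects the study would have included, and all values are hypothetical.

```python
# Simplified ordinal-regression sketch (fixed effects only) relating a 7-point
# fluency score to attention variability and proficiency group.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 200
attention_sd = rng.normal(0, 1, n)                 # standardised SD of attention per sample
low_prof = rng.integers(0, 2, n)                   # 1 = lower-proficiency speech sample
latent = -0.6 * attention_sd * low_prof + rng.logistic(0, 1, n)
score = pd.cut(latent, bins=7, labels=False) + 1   # 7-point rating

df = pd.DataFrame({"score": score,
                   "attention_sd": attention_sd,
                   "low_prof": low_prof,
                   "interaction": attention_sd * low_prof})

model = OrderedModel(df["score"],
                     df[["attention_sd", "low_prof", "interaction"]],
                     distr="logit")
res = model.fit(method="bfgs", disp=False)
# A negative interaction coefficient: higher attention SD goes with lower scores,
# but only for the lower-proficiency samples.
print(res.params.round(2))
```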

A Context-Aligned Two Thousand Test: Toward estimating high-frequency French vocabulary knowledge for beginner-to-low intermediate proficiency adolescent learners in England

Language Testing, Volume 41, Issue 4, Page 759-791, October 2024.
Vocabulary knowledge strongly predicts second language reading, listening, writing, and speaking. Yet, few tests have been developed to assess vocabulary knowledge in French. The primary aim of this pilot study was to design and initially validate the Context-Aligned Two Thousand Test (CA-TTT), following open research practices. The CA-TTT is a test of written form–meaning recognition of high-frequency vocabulary aimed at beginner-to-low intermediate learners of French at the end of their fifth year of secondary education. Using an argument-based validation framework, we drew on classical test theory and Rasch modeling, together with correlations with another vocabulary size test and proficiency measures, to assess the CA-TTT’s internal and external validity. Overall, the CA-TTT showed high internal and external validity. Our study highlighted the decisive role of the curriculum in determining vocabulary knowledge in instructed, low-exposure contexts. We discuss how this might contribute to under- or over-estimations of vocabulary size, depending on the relations between the test and curriculum content. Further research using the tool is openly invited, particularly with lower proficiency learners in this context. Following further validation, the test could serve as a tool for assessing high-frequency vocabulary knowledge at beginner-to-low intermediate levels, with due attention paid to alignment with curriculum content.
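
The validation drew on classical test theory alongside Rasch modeling. As an illustration of the CTT side only, the sketch below computes item facility, corrected item-total discrimination, and Cronbach’s alpha for a hypothetical dichotomously scored response matrix; it is not the CA-TTT analysis itself.

```python
# Minimal classical-test-theory sketch: item facility, corrected item-total
# correlation, and Cronbach's alpha for a dichotomously scored vocabulary test.
# The response matrix is simulated, not CA-TTT data.
import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((120, 30)) < rng.uniform(0.3, 0.9, 30)).astype(int)  # 120 learners x 30 items

facility = responses.mean(axis=0)                   # proportion correct per item
totals = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, i], totals - responses[:, i])[0, 1]   # corrected point-biserial
    for i in range(responses.shape[1])
])

k = responses.shape[1]
alpha = k / (k - 1) * (1 - responses.var(axis=0, ddof=1).sum() / totals.var(ddof=1))

print(facility.round(2), discrimination.round(2), round(alpha, 2), sep="\n")
```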

Authenticity of academic lecture passages in high-stakes tests: A temporal fluency perspective

Language Testing, Volume 41, Issue 4, Page 792-816, October 2024.
Corpus-based studies have offered test developers evidence for the domain definition inference. Yet, corpus-based studies on temporal fluency measures (e.g., speech rate) have been limited, especially in the context of academic lecture settings. This has made it difficult for test developers to sample representative fluency features to create authentic listening passages. To address this issue, the Fluency Corpus of Academic English Lectures (FCAEL) was created to offer insight into the thresholds for temporal fluency features in academic lecture settings. The current study compared the corpus data to the academic lecture passages in the Test of English as a Foreign Language Internet-based test (TOEFL iBT) and International English Language Testing System (IELTS) to examine the domain definition inference of these tests. In total, 14 temporal fluency measures were examined. A bootstrapped one-way multivariate analysis of variance (MANOVA), followed by a series of bootstrapped analyses of variance (ANOVAs), independent t-tests, and Tukey tests, showed some support for the tests, although many limitations were also found. The study suggests the 25th–75th percentile range of FCAEL as tentative thresholds for each temporal fluency feature. The proposal may be useful for test developers to create and revise test materials. Coding schemes, analysis codes, and raw corpus data are available on the project’s Open Science Framework page, exemplifying how Open Science can provide benefits beyond the academic community.
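
The proposed 25th–75th percentile thresholds can be computed directly from corpus-level measurements. The sketch below uses hypothetical speech-rate values (not FCAEL data) and adds a simple bootstrap to gauge how stable such a band is.

```python
# Sketch of 25th-75th percentile thresholds and a bootstrap check for one
# temporal fluency measure (e.g., speech rate in syllables per second).
import numpy as np

rng = np.random.default_rng(42)
speech_rate = rng.normal(3.8, 0.5, 300)        # hypothetical lecture-level speech rates

lower, upper = np.percentile(speech_rate, [25, 75])
print(f"Tentative threshold band: {lower:.2f}-{upper:.2f} syll/sec")

# Bootstrap the band to see how stable the thresholds are.
boots = np.array([
    np.percentile(rng.choice(speech_rate, size=speech_rate.size, replace=True), [25, 75])
    for _ in range(2000)
])
print("Bootstrap 95% CI for 25th percentile:", np.percentile(boots[:, 0], [2.5, 97.5]).round(2))
print("Bootstrap 95% CI for 75th percentile:", np.percentile(boots[:, 1], [2.5, 97.5]).round(2))
```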

Developing internet-based Tests of Aptitude for Language Learning (TALL): An open research endeavour

Language Testing, Volume 41, Issue 4, Page 817-827, October 2024.
Tests of Aptitude for Language Learning (TALL) is an openly accessible internet-based battery to measure the multifaceted construct of foreign language aptitude, using language domain–specific instruments and L1-sensitive instructions and stimuli. This brief report introduces the components of this theory-informed battery and methodological considerations for developing it into an open research instrument. It also presents the preliminary results from the initial validation of TALL carried out on data collected from Chinese L1 participants (n = 165) from a university setting who took two rounds of tests (with counterbalanced test items) with a minimum 30-day interval. The results of data analyses at subtest, item, and battery levels suggest that, in general, TALL has satisfactory reliability and can be used to measure aptitude conceptualized in the theoretical frameworks on which it has been developed. This report also highlights the value of TALL as a convenient data collection tool openly accessible to any researcher for free, its potential for facilitating an open data pool for high-quality syntheses of aptitude-related research findings, and its implications for Open Research practices in testing language-related constructs.
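
The abstract mentions two counterbalanced test rounds separated by at least 30 days; one simple way to summarise stability across rounds is a test-retest correlation. The sketch below uses simulated scores for 165 participants (matching the reported n) and is purely illustrative, not TALL data.

```python
# Test-retest reliability sketch: Pearson correlation between two rounds of a
# hypothetical aptitude subtest taken by the same participants.
import numpy as np

rng = np.random.default_rng(7)
true_ability = rng.normal(50, 10, 165)
round1 = true_ability + rng.normal(0, 5, 165)
round2 = true_ability + rng.normal(0, 5, 165)

r = np.corrcoef(round1, round2)[0, 1]
print(f"Test-retest reliability (Pearson r) = {r:.2f}")
```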

What is the best predictor of word difficulty? A case of data mining using random forest

Language Testing, Volume 41, Issue 4, Page 828-844, October 2024.
Word frequency has a long history of being considered the most important predictor of word difficulty and has served as a guideline for several aspects of second language vocabulary teaching, learning, and assessment. However, recent empirical research has challenged the supremacy of frequency as a predictor of word difficulty. Accordingly, applied linguists have questioned the use of frequency as the principal criterion in the development of wordlists and vocabulary tests. Despite being informative, previous studies on the topic have been limited in the way the researchers measured word difficulty and the statistical techniques they employed for exploratory data analysis. In the current study, meaning recall was used as a measure of word difficulty, and random forest was employed to examine the importance of various lexical sophistication metrics in predicting word difficulty. The results showed that frequency was not the most important predictor of word difficulty. Owing to the study’s limited scope, the findings are generalizable only to Vietnamese learners of English.
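
Random forest paired with a variable-importance measure is a standard way to rank predictors of an outcome such as meaning-recall difficulty. The sketch below is a generic illustration with hypothetical lexical features and simulated data; it does not reproduce the study’s feature set or results.

```python
# Random-forest sketch for ranking lexical predictors of word difficulty.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n_words = 500
X = np.column_stack([
    rng.normal(size=n_words),   # log frequency
    rng.normal(size=n_words),   # concreteness
    rng.normal(size=n_words),   # cognateness with L1
    rng.normal(size=n_words),   # word length
])
# Simulated difficulty in which cognateness matters more than frequency.
y = 0.2 * X[:, 0] + 0.6 * X[:, 2] + 0.3 * X[:, 3] + rng.normal(0, 0.5, n_words)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=0)
for name, score in zip(["frequency", "concreteness", "cognateness", "length"],
                       imp.importances_mean):
    print(f"{name:12s} {score:.3f}")
```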

Sharing, collaborating, and building trust: How Open Science advances language testing

Language Testing, Volume 41, Issue 4, Page 845-859, October 2024.
The Open Science movement is taking hold around the world, and language testers are taking part. In this Viewpoint, I discuss how sharing, collaborating, and building trust, guided by Open Science principles, benefit the language testing field. To help more language testers join in, I present a standard definition of Open Science and describe four ways language testing researchers can immediately partake. Overall, I share my views on how Open Science is an accelerating process that improves language testing as a scientific and humanistic field.

An industry perspective on Open Science: A response to Winke

Language Testing, Volume 41, Issue 4, Page 865-871, October 2024.
Open science practices are now at the forefront of discussions in the applied linguistics research community. Proponents of open science argue for its potential to enhance research quality and accessibility while promoting a collaborative and equitable environment. Winke advocates integrating open science into language assessment research to enhance research quality, accessibility, and collaboration. This response introduces two additional perspectives to support open science practices. The first is a framework that identifies five schools of thought on open science, emphasizing the need to understand the various goals of open science and the scientific methods and tools used to pursue them. Second, I highlight two additional characteristics of open science: the need for community and the costs of open science. These additional perspectives underscore the significance of making research processes transparent and inclusive, extending beyond traditional academic boundaries to engage the public and industry stakeholders. By integrating these considerations, this response aims to offer a nuanced view of the challenges and opportunities that open science presents in the field of language assessment, suggesting ideas for how researchers outside and inside the language assessment industry can work toward improving open science practices in language assessment research.

Open Science for language assessment research and practice in China: A response to Winke

Language Testing, Volume 41, Issue 4, Page 877-881, October 2024.
Winke delineates objectives and suggests a series of steps for the implementation of Open Science (OS) in language assessment. While we recognize the relevance and potential success of these concrete measures for OS in language assessment, the distinctive challenges confronting OS in China may prevent researchers and practitioners from fully capitalizing on them. In response to Winke, we first reflect on the significant challenges encountered by Chinese language assessment researchers and practitioners. These challenges include the absence of an ethos of openness in language assessment research and practice, difficulties in publishing in scholarly journals, including open-access (OA) journals, and the sensitive nature of the data involved in large-scale, high-stakes language testing. To address these issues and forge ahead, we propose a community-oriented, grassroots-driven approach to cultivating an OS culture and fortifying collaboration among stakeholders. We hope that, through concerted efforts, we can promote OS and, more importantly, enhance the quality of language assessment research and practice in China.