Unsupervised [randomly responding] survey bot detection: In search of high classification accuracy.
Psychological Methods. Advance online publication, March 10, 2025. https://doi.org/10.1037/met0000746
While online survey data collection has become popular in the social sciences, there is a risk of data contamination by computer-generated random responses (i.e., bots). Bot prevalence poses a significant threat to data quality. If deterrence efforts fail or were not set up in advance, researchers can still attempt to detect bots already present in the data. In this research, we study a recently developed algorithm to detect survey bots. The algorithm requires neither a measurement model nor a sample of known humans and bots; thus, it is model-agnostic and unsupervised. It involves a permutation test under the assumption that Likert-type items are exchangeable for bots, but not for humans. While the algorithm maintains a desired sensitivity for detecting bots (e.g., 95%), its classification accuracy may depend on other inventory-specific or demographic factors. Generating hypothetical human responses from a well-known item response theory model, we use simulations to understand how classification accuracy is affected by item properties, the number of items, the number of latent factors, and factor correlations. In an additional study, we simulate bots to contaminate real human data from 35 publicly available data sets to understand the algorithm’s classification accuracy under a variety of real measurement instruments. Through this work, we identify conditions under which classification accuracy is around 95% or above, but also conditions under which it is quite low. In brief, performance is better with more items, more response categories per item, and greater variety in the difficulties or means of the survey items.
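To make the exchangeability idea concrete, the following is a minimal sketch of a per-respondent permutation test of the kind the abstract describes. The test statistic used here (the correlation between a respondent's responses and the sample item means) is an illustrative assumption, not necessarily the statistic used in the paper; the function name and defaults are likewise hypothetical. Under H0 (the respondent is a bot), responses are exchangeable across items, so permuting them yields an exact null distribution, and classifying a respondent as human only when p < α retains true bots with probability about 1 − α (e.g., 95% sensitivity at α = .05).

```python
import numpy as np

def bot_permutation_pvalue(responses, item_means, n_perm=2000, rng=None):
    """Permutation p-value for H0: the respondent is a bot.

    Under H0, the respondent's Likert-type responses are exchangeable
    across items, so permuting them gives the null distribution of any
    item-order-sensitive statistic. The statistic below (correlation
    with the sample's item means) is an illustrative choice; real
    humans should track item means, bots should not.
    """
    rng = np.random.default_rng(rng)
    responses = np.asarray(responses, dtype=float)

    def stat(r):
        return float(np.corrcoef(r, item_means)[0, 1])

    observed = stat(responses)
    null = np.array([stat(rng.permutation(responses)) for _ in range(n_perm)])
    # One-sided p-value with add-one smoothing so it is never exactly 0.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Toy usage: uniform random responses mimic a bot on 30 five-point items.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(200, 30))
item_means = X.mean(axis=0)          # simplification: not leave-one-out
p = bot_permutation_pvalue(X[0], item_means, rng=1)
label = "human" if p < 0.05 else "bot"
```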
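The abstract refers only to "a well-known item response theory model" for generating hypothetical human responses. The graded response model (GRM) is a standard choice for ordered Likert-type items, so the sketch below assumes it; the parameter values are illustrative and the paper's actual simulation design may differ.

```python
import numpy as np

def simulate_grm(theta, a, b, rng=None):
    """Simulate Likert-type responses from a graded response model.

    theta : (n_persons,) latent trait scores
    a     : (n_items,) discrimination parameters
    b     : (n_items, n_cats - 1) ordered category thresholds
    Returns responses coded 0 .. n_cats - 1, shape (n_persons, n_items).
    """
    rng = np.random.default_rng(rng)
    # Boundary curves P(X >= k | theta) under the logistic link.
    z = a[None, :, None] * (theta[:, None, None] - b[None, :, :])
    p_ge = 1.0 / (1.0 + np.exp(-z))
    # Category probabilities are differences of adjacent boundary curves.
    pad1 = np.ones(p_ge.shape[:2] + (1,))
    pad0 = np.zeros_like(pad1)
    cum = np.concatenate([pad1, p_ge, pad0], axis=2)
    probs = cum[..., :-1] - cum[..., 1:]
    # Inverse-CDF draw of one category per person-item cell.
    u = rng.random(probs.shape[:2] + (1,))
    cats = (u > np.cumsum(probs, axis=2)).sum(axis=2)
    return np.minimum(cats, probs.shape[2] - 1)

# Example: 500 simulated humans answering 20 five-point items.
rng = np.random.default_rng(42)
theta = rng.normal(size=500)
a = rng.uniform(1.0, 2.5, size=20)
b = np.sort(rng.normal(0.0, 1.0, size=(20, 4)), axis=1)  # 4 ordered thresholds
human_data = simulate_grm(theta, a, b, rng=rng)
```

Varying the spread of the thresholds in b corresponds to the abstract's "variety in the difficulties or means of the survey items," one of the conditions under which the algorithm's accuracy improves.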