Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks

J He; Z Leng; D McKay; D Spina; JR Trippas

Conference Proceedings

SIGIR AP 2025 Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region | ACM | Published : 2025

DOI: 10.1145/3767695.3769508

Open access

Abstract

Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions, by both humans and LLMs, can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's alpha, paired bootstr..
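
The abstract names Krippendorff's alpha as one ingredient of the evaluation. As a rough illustration only, and not the authors' implementation, the sketch below computes the standard nominal-data form of alpha from a coincidence matrix. The function name `krippendorff_alpha_nominal`, the list-of-lists input layout, and the toy labels are all assumptions made for this example; the truncated remainder of the method is not reconstructed here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is a list of the labels that
    different annotators gave it, with None marking a missing label.
    (Hypothetical input layout chosen for this sketch.)
    """
    # Coincidence matrix: every ordered pair of labels from two different
    # annotators on the same item contributes 1 / (m - 1), where m is the
    # number of labels that item received.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c and the total number of pairable labels n.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (both scaled by n, which cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("only one label value in use; alpha is undefined")

    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy data: three annotators (human or LLM) labeling four items.
    toy = [
        ["relevant", "relevant", "relevant"],
        ["relevant", "not_relevant", None],
        ["not_relevant", "not_relevant", "not_relevant"],
        ["not_relevant", "relevant", "not_relevant"],
    ]
    print(krippendorff_alpha_nominal(toy))
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement. In a setting like the one the abstract describes, one could compare alpha for a human-only pool against alpha for a pool in which some annotators are replaced by an LLM, although how the authors actually do this is not recoverable from the truncated abstract.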

University of Melbourne Researchers

Dana McKay Author

Related Projects (1)

ARC Centre of Excellence in Automated Decision Making and Society (CE200100005)

Citation metrics

Scopus: 1

Dimensions: 3

Keywords

4608 Human-Centred Computing

46 Information and Computing Sciences

Clinical Research