
The Journal of Open Humanities Data (JOHD) published two papers by Maximilian Hindermann, Sorin Marti, Lea Katharina Kasper and Arno Bosse on the RISE Humanities Data Benchmark. Which large language models perform best on humanities research tasks, and how can we systematically compare their capabilities?
The data paper “The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks” presents a framework for assessing the performance of large language models on humanities-related tasks. The benchmark suite (available on GitHub) includes text and image datasets, prompts, ground truths, and evaluation scripts, and addresses tasks essential to digital humanities work, including document analysis, transcription, and metadata extraction from historical materials.
The discussion paper “From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark” traces how the suite emerged from RISE's consulting practice and reflects on the methodological challenges of applying benchmarking to humanities contexts. It argues that ground truth in humanities benchmarking is not a matter of objective correctness but of explicit, scholar-defined interpretive choices, and that benchmarking should therefore be understood as an epistemic practice rather than a neutral measurement.
Both papers contribute to the JOHD special collection “Benchmarking in Digital Humanities”, which aims to establish benchmarking as common practice in the humanities. The framework promotes evidence-based decisions on which models to use for specific tasks and provides quantifiable comparisons between different LLMs via an interactive dashboard.
Researchers interested in using the benchmark with their own materials are welcome to get in touch. In their roles at RISE, Maximilian Hindermann, Sorin Marti and Arno Bosse advise researchers on the use of computational methods and large language models for humanities research projects and are happy to discuss how the framework can be applied to new data and research contexts.
Citations:
Hindermann, M., Marti, S., Kasper, L. K., & Bosse, A. (2026). The RISE Humanities Data Benchmark: A Framework for Evaluating Large Language Models for Humanities Tasks. Journal of Open Humanities Data, 12(1), 24. https://doi.org/10.5334/johd.481
Hindermann, M., Kasper, L. K., Marti, S., & Bosse, A. (2026). From Experiments to Epistemic Practice: The RISE Humanities Data Benchmark. Journal of Open Humanities Data, 12(1), 38. https://doi.org/10.5334/johd.470