Practical and ethical dataset collection remains a challenge that blocks many empirical methods in natural language processing, leaving researchers without benchmarks or data on which to test hypotheses. We address part of this problem by presenting a pipeline that reduces the research burden of producing image and text datasets where none exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text-to-image transformers; and (3) low-cost human validation. We also present the creation of three relevant datasets and conduct a user study demonstrating that this approach can help researchers obtain previously challenging datasets.
@inproceedings{abramsetal24lrec,
title={Automating Dataset Production Using Generative Text and Image Models},
author={Christopher Thierauf and Mitchell Abrams and Matthias Scheutz},
year={2024},
booktitle={Proceedings of LREC},
url={https://hrilab.tufts.edu/publications/abramsetal24lrec.pdf}
}