Practical and ethical dataset collection remains a challenge that blocks many empirical methods in natural language processing, leaving researchers without benchmarks or data on which to test hypotheses. We address part of this problem by presenting a pipeline that reduces the research burden of producing image and text datasets when such datasets may not yet exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text-to-image transformers; and (3) performing low-cost human validation. We also present the creation of three relevant datasets and conduct a user study demonstrating that this approach can aid researchers in obtaining previously challenging datasets.
@inproceedings{abramsetal24lrec,
  title={Automating Dataset Production Using Generative Text and Image Models},
  author={Christopher Thierauf and Mitchell Abrams and Matthias Scheutz},
  year={2024},
  booktitle={Proceedings of LREC},
  url={https://hrilab.tufts.edu/publications/abramsetal24lrec.pdf}
}