Automating Dataset Production Using Generative Text and Image Models

2024

Conference: Proceedings of LREC

M. Abrams and C. Thierauf and M. Scheutz

Practical and ethical dataset collection remains a challenge that blocks many empirical methods in natural language processing, resulting in a lack of benchmarks or data on which to test hypotheses. We address part of this problem by presenting a pipeline that reduces the research burden of producing image and text datasets when suitable datasets may not exist. Our approach, with accompanying software tools, involves (1) generating text with LLMs; (2) creating accompanying image vignettes with text-to-image transformers; and (3) low-cost human validation. We also describe the creation of three relevant datasets and report a user study demonstrating that this approach can help researchers obtain datasets that were previously difficult to collect.

@inproceedings{abramsetal24lrec,
  title={Automating Dataset Production Using Generative Text and Image Models},
  author={M. Abrams and C. Thierauf and M. Scheutz},
  year={2024},
  booktitle={Proceedings of LREC},
  url={https://hrilab.tufts.edu/publications/abramsetal24lrec.pdf}
}