U.S. Department of Energy
Office of Scientific and Technical Information

Improving Text Classification with Large Language Model-Based Data Augmentation

Journal Article · Electronics

Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) either by rewriting the existing dataset with ChatGPT or by generating entirely new data from scratch. However, without a direct comparison of their effectiveness, it is unclear which method is better. This study investigates the application of both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently improved the model’s classification results on both datasets. 2. Generating new data generally outperforms rewriting existing data, though crafting the prompts carefully is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The size of the augmentation data affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples. 4. Combining rewritten samples with newly generated samples can potentially further improve the model’s performance.
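
The two augmentation strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the prompt wording, the function names (`rewrite_prompt`, `generate_prompt`, `augment`), and the `llm` callable are all hypothetical, and any real ChatGPT client would have to be wrapped by the reader into a function that takes a prompt string and returns the reply text.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def rewrite_prompt(text: str, label: str) -> str:
    """Hypothetical prompt asking the LLM to paraphrase one existing sample."""
    return (
        "Rewrite the following text so that it keeps the same meaning and "
        f"still belongs to the category '{label}':\n{text}"
    )


def generate_prompt(label: str, n: int) -> str:
    """Hypothetical prompt asking the LLM for n brand-new samples of a class."""
    return (
        f"Write {n} new, distinct short texts that belong to the category "
        f"'{label}'. Put each text on its own line."
    )


def augment(
    train: List[Example],
    llm: Callable[[str], str],
    mode: str = "generate",
    n_new: int = 10,  # illustrative default, not a setting taken from the paper
) -> List[Example]:
    """Return the original samples plus LLM-produced ones.

    mode = "rewrite"  : one paraphrase per existing sample
    mode = "generate" : up to n_new new samples per label
    mode = "both"     : the two strategies combined
    """
    if mode not in ("rewrite", "generate", "both"):
        raise ValueError(f"unknown mode: {mode}")

    augmented = list(train)

    if mode in ("rewrite", "both"):
        for text, label in train:
            reply = llm(rewrite_prompt(text, label)).strip()
            augmented.append((reply, label))

    if mode in ("generate", "both"):
        for label in sorted({lbl for _, lbl in train}):
            reply = llm(generate_prompt(label, n_new))
            # Assumes the reply holds one sample per non-empty line.
            lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
            augmented.extend((ln, label) for ln in lines[:n_new])

    return augmented
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular client library and lets the augmentation logic be exercised with a stub. The augmented list can then be used to train whatever classifier is being evaluated.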

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
2394731
Journal Information:
Journal Name: Electronics; Journal Volume: 13; Journal Issue: 13; ISSN 2079-9292
Publisher:
MDPI
Country of Publication:
United States
Language:
English
