Towards Optimal NLP Solutions: Analyzing GPT and LLaMA-2 Models Across Model Scale, Dataset Size, and Task Diversity

Ankit Kumar; Richa Sharma; Punam Bedi

doi:10.48084/etasr.7200

Authors

Ankit Kumar, Department of Computer Science, University of Delhi, India, https://orcid.org/0000-0003-1808-214X
Richa Sharma, Department of Computer Science, University of Delhi, India, https://orcid.org/0000-0002-4472-1681
Punam Bedi, Department of Computer Science, University of Delhi, India, https://orcid.org/0000-0002-6007-7961

Volume: 14 | Issue: 3 | Pages: 14219-14224 | June 2024 | https://doi.org/10.48084/etasr.7200

Received: 16 March 2024 | Revised: 30 March 2024 | Accepted: 2 April 2024 | Online: 25 April 2024

Corresponding author: Richa Sharma

Abstract

This study compares fine-tuned GPT models (GPT-2, GPT-3, GPT-3.5) and LLaMA-2 models (LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B) on text classification across dataset sizes, model scales, and tasks. Since its inception in 2018, the GPT series has been pivotal in advancing NLP, with each iteration introducing substantial enhancements. Despite this progress, detailed analyses of the series remain scarce, especially comparisons against competitive open-source models such as the LLaMA-2 series in text classification. This study addresses that gap by fine-tuning these models on varied datasets for three tasks: hate speech and offensive language detection, fake news classification, and sentiment analysis. The learning efficacy and efficiency of the GPT and LLaMA-2 models were evaluated to provide a guide for choosing a model for NLP tasks based on architectural benefits and on how efficiently the model adapts with limited data and resources. Even with datasets as small as 1,000 rows per class, the F1 scores of the GPT-3.5 and LLaMA-2 models exceeded 0.9, and they reached 0.99 with the complete datasets. In addition, the LLaMA-2 13B and 70B models outperformed GPT-3, demonstrating superior efficiency and effectiveness in text classification. Both the GPT and LLaMA-2 series performed well on all three tasks, underscoring their ability to handle diverse tasks. Considering model size, performance, and the resources required for fine-tuning, this study identifies LLaMA-2 13B as the optimal model for NLP tasks.
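The small-data setting described above (a fixed number of rows per class, scored with F1) can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `sample_per_class` and `macro_f1` are hypothetical, the abstract does not state which F1 averaging was used (macro averaging is assumed here), and the fine-tuning step itself is omitted.

```python
import random
from collections import defaultdict


def sample_per_class(rows, n_per_class, seed=0):
    """Return at most n_per_class (text, label) rows for each label.

    Hypothetical helper: mirrors the "1,000 rows per class" setting
    from the abstract, not the authors' actual sampling code.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        # Take the whole class when it has fewer rows than requested.
        subset.extend(rng.sample(items, min(n_per_class, len(items))))
    return subset


def f1_for_label(y_true, y_pred, label):
    """F1 score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)
```

In use, one would call `sample_per_class(rows, 1000)` to build the reduced training set, fine-tune a model on it, and pass the gold and predicted labels of a held-out split to `macro_f1`.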

Keywords:

natural language processing, large language models, GPT series, LLaMA-2 series, fine-tuning
