Improving classification performance of extreme gradient boosting on small-sized dataset to classify Turkish and Italian wines along with elemental profiling by inductively coupled plasma-mass spectrometry


Alp H., ALP O.

SPECTROSCOPY LETTERS, vol.55, no.1, pp.1-12, 2022 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 55 Issue: 1
  • Publication Date: 2022
  • Doi Number: 10.1080/00387010.2021.2008977
  • Journal Name: SPECTROSCOPY LETTERS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Analytical Abstracts, Applied Science & Technology Source, Biotechnology Research Abstracts, Chemical Abstracts Core, Chimica, Communication Abstracts, INSPEC, Metadex, Civil Engineering Abstracts
  • Page Numbers: pp.1-12
  • Keywords: Classification, extreme gradient boosting, ICP-MS, wine, LEAD-ISOTOPE RATIOS, GEOGRAPHICAL ORIGIN, MULTIELEMENT ANALYSIS, TRACE-ELEMENTS, WHITE WINES, DIFFERENTIATION, FINGERPRINTS, ACCURACY, SOIL
  • TED University Affiliated: Yes

Abstract

In this study, the classification performance of the extreme gradient boosting algorithm on a small dataset of wine samples was improved by training it on a synthetic dataset generated with kernel density estimation. The concentrations of 29 elements in wine samples produced in Turkey (domestic) and Italy (imported) were determined by inductively coupled plasma-mass spectrometry, and the results were used to build the dataset. Classification of the wine samples was first assessed with extreme gradient boosting, which is known to overfit small datasets and gave poor classification performance. To improve the classification performance, a synthetic dataset was created and the algorithm was trained on the synthetic dataset instead of the original dataset. With the proposed method, the accuracy of the model improved from 76.7% to 81.7%. The precision values for Turkish and Italian wines increased from 78.4% to 84.1% and from 70.9% to 79.4%, respectively. The variable importance determined by the extreme gradient boosting algorithm showed that beryllium and cesium were significantly more important than the other elements, followed by tin, phosphorus, cobalt, lead, calcium, copper, zinc, and aluminum; together these make up the top 10 elements for classifying Turkish and Italian wine samples.
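
The synthetic-data step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the kernel, the bandwidth rule, or whether densities were fitted per class, so the choices below (a Gaussian product kernel, Silverman's rule of thumb, one density per class) are assumptions, and the function names and toy concentrations are hypothetical. Sampling from a Gaussian kernel density estimate amounts to picking a measured sample at random and adding Gaussian noise scaled by the bandwidth.

```python
import random
import statistics


def silverman_bandwidths(rows):
    """Per-feature bandwidths from Silverman's rule of thumb (assumed choice)."""
    n = len(rows)
    d = len(rows[0])
    factor = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * n ** (-1.0 / (d + 4))
    # zip(*rows) iterates over feature columns
    return [factor * statistics.stdev(col) for col in zip(*rows)]


def sample_kde(rows, n_samples, rng):
    """Draw samples from a Gaussian product-kernel KDE fitted to `rows`."""
    bandwidths = silverman_bandwidths(rows)
    samples = []
    for _ in range(n_samples):
        center = rng.choice(rows)  # pick one measured sample uniformly
        samples.append([rng.gauss(x, h) for x, h in zip(center, bandwidths)])
    return samples


def synthesize(data_by_class, n_per_class, seed=0):
    """Build a synthetic labelled dataset, one KDE per class (assumption)."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, rows in data_by_class.items():
        features.extend(sample_kde(rows, n_per_class, rng))
        labels.extend([label] * n_per_class)
    return features, labels


# Made-up two-element concentrations, for illustration only (not from the study).
toy = {
    "Turkish": [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5]],
    "Italian": [[2.0, 20.0], [2.1, 19.0], [1.9, 21.0], [2.2, 20.5]],
}
X_syn, y_syn = synthesize(toy, n_per_class=100)
```

The resulting `X_syn` and `y_syn` could then be used to train a gradient boosting classifier (for example `XGBClassifier` from the `xgboost` package) in place of the original measurements. Gaussian noise can produce negative values, so a real pipeline working with concentrations might clip the samples or work on a log scale.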