Automatic Labeling of Twitter Data for Developing COVID-19 Sentiment Dataset


Azharul Hasan K., Shovon S. D., Joy N. H., Islam M. S.

5th International Conference on Electrical Information and Communication Technology, EICT 2021, Khulna, Bangladesh, 17 - 19 December 2021

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.1109/eict54103.2021.9733548
  • City of Publication: Khulna
  • Country of Publication: Bangladesh
  • Keywords: Automatic Labelling, COVID-19, Logistic Regression, LSTM, Multinomial Naive Bayes, Sentiment Analysis, Textblob, TF-IDF Vectorizer, Twitter
  • Affiliated with TED University: No

Abstract

COVID-19 has been spreading across the world and has become a pandemic since January 2020. With new cases rising daily along with mass deaths, nations and societies are becoming fearful of it. People all over the world are expressing their thoughts and views about this pandemic on many social media platforms. Nowadays, social media is one of the most common ways to express an idea or a verdict on something. With the improvement of modern computing technology, machines are increasingly being used to interpret what people express on social media such as Twitter, Facebook, and Instagram. These thoughts or views can be categorized and analysed based on sentiment. In this paper, we have analysed the sentiment that people express on social media by using tweets gathered from Twitter. We have categorized the sentiment of the tweets into five classes, namely 'Strongly Negative', 'Negative', 'Neutral', 'Positive' and 'Strongly Positive'. Initially, we use the TextBlob library for Python for classification. This classification does not show good results and needs substantial changes, as there are many new terms related to COVID-19 that affect the sentiment of tweets. We have labelled the tweets automatically by creating regular expression rules with our new corrected word library, which is created by analysing the tweets manually. We have trained models on the updated labelled dataset with Long Short-Term Memory (LSTM), Logistic Regression (LR), and Multinomial Naive Bayes (MNB) and analysed their performance. We found that our data labelling shows better performance compared to the standard dataset.
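The labelling step described in the abstract might look roughly like the following minimal sketch. It is not the authors' implementation: the five-class thresholds, the entries in `CORRECTED_LIBRARY`, and the function names `polarity_to_label` and `relabel` are all hypothetical, chosen only to illustrate how regular expression rules could correct a baseline polarity score. The baseline score is passed in as a plain number; in practice it could come from TextBlob's `TextBlob(text).sentiment.polarity`, which returns a value in [-1, 1].

```python
import re

LABELS = ("Strongly Negative", "Negative", "Neutral", "Positive", "Strongly Positive")


def polarity_to_label(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to one of five classes.

    The cut-offs (+/-0.5 and 0) are assumptions for illustration;
    the abstract does not state the thresholds that were used.
    """
    if polarity <= -0.5:
        return LABELS[0]
    if polarity < 0:
        return LABELS[1]
    if polarity == 0:
        return LABELS[2]
    if polarity < 0.5:
        return LABELS[3]
    return LABELS[4]


# Hypothetical "corrected word library": each regex rule carries a polarity
# adjustment for a COVID-19 term that a general-purpose lexicon may misjudge.
CORRECTED_LIBRARY = [
    (re.compile(r"\btest(?:ed|s)?\s+positive\b", re.IGNORECASE), -0.6),
    (re.compile(r"\blockdown\b", re.IGNORECASE), -0.3),
    (re.compile(r"\brecover(?:ed|y|ing)?\b", re.IGNORECASE), 0.4),
]


def relabel(text: str, base_polarity: float) -> str:
    """Apply every matching rule to the baseline polarity, then bucket it."""
    adjusted = base_polarity
    for pattern, delta in CORRECTED_LIBRARY:
        if pattern.search(text):
            adjusted += delta
    # Keep the adjusted score inside the valid polarity range.
    adjusted = max(-1.0, min(1.0, adjusted))
    return polarity_to_label(adjusted)


if __name__ == "__main__":
    # A general-purpose lexicon may score "positive" as good news; the rule corrects it.
    print(relabel("My father tested positive today", 0.2))
```

In this sketch, the resulting labels would then serve as training targets for the classifiers named in the abstract (LSTM, LR, MNB). Additive adjustments are one simple design; a rule could equally override the score outright.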