Intelligent Systems Conference, IntelliSys 2023, Amsterdam, Netherlands, 7-8 September 2023, vol. 825, pp. 51-67
In information retrieval, words that contribute little to the semantic information within a text are called stopwords. Identifying stopwords benefits many retrieval tasks, because removing such words from a text collection increases the regularization within the dataset and reduces its volume and computational complexity. Traditional techniques for identifying stopwords either construct stopword lists manually by analyzing terms individually or employ sorting-based techniques that use global term characteristics as a proxy. Because the definition of what qualifies as a stopword is context-dependent and imprecise, transferring these traditional methodologies to different languages or domains leaves room for more generalizable and interpretable results. To address this concern, we propose a feature-based supervised machine learning technique for the automatic detection of stopwords. We tested the validity of the proposed technique with extensive experiments and compared the results with a general English stopword list. Furthermore, we evaluated the proposed technique on both formal written text and social media text. The results show that the proposed technique leads to promising results and can also accommodate changes in dialect.
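To make the idea of feature-based supervised stopword detection concrete, the following is a minimal sketch, not the method evaluated in the paper. The abstract does not name the features or the classifier, so everything here is an assumption chosen for illustration: the three per-term features (document-frequency ratio, relative term frequency, scaled word length), the plain logistic regression trained by stochastic gradient descent, the toy corpus, and the hand-assigned labels. The function names `term_features`, `stop_probability`, and `train_logistic` are hypothetical.

```python
import math
from collections import Counter


def term_features(docs):
    """Map each term to [document-frequency ratio, relative term frequency, length / 10].

    These three features are illustrative assumptions, not the paper's feature set.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)        # count every occurrence
        doc_freq.update(set(doc))    # count each term once per document
    total_tokens = sum(term_freq.values())
    return {
        term: [doc_freq[term] / len(docs),
               term_freq[term] / total_tokens,
               len(term) / 10.0]
        for term in term_freq
    }


def stop_probability(weights, bias, x):
    """Logistic model: estimated probability that a term is a stopword."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = stop_probability(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy corpus and hand-made labels (assumptions for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a bird sang on the tree".split(),
    "the fish swam in a pond".split(),
]
labels = {"the": 1, "on": 1, "a": 1,
          "cat": 0, "dog": 0, "bird": 0, "mat": 0,
          "tree": 0, "fish": 0, "pond": 0}

feats = term_features(docs)
X = [feats[t] for t in labels]
y = [labels[t] for t in labels]
weights, bias = train_logistic(X, y)

# Rank every term in the corpus, including unlabeled ones, by stopword probability.
ranking = sorted(feats, key=lambda t: stop_probability(weights, bias, feats[t]),
                 reverse=True)
print(ranking)
```

Unlike a fixed stopword list, a classifier of this kind scores terms from their observed statistics, so it can be retrained on a different domain or dialect given a small labeled sample. A realistic setup would need a far larger corpus, richer features, and a held-out evaluation set.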