Best Preprocessing Techniques for Sentiment Analysis

arXiv cs.CL Papers

Summary

This paper systematically investigates the optimal order of preprocessing techniques for sentiment analysis on Twitter data, finding that tokenisation is most impactful and spelling correction least, with the best order being tokenisation, cleaning, stemming, then stopword removal.

arXiv:2606.24055v1 Announce Type: new

# Best Preprocessing Techniques for Sentiment Analysis
Source: [https://arxiv.org/html/2606.24055](https://arxiv.org/html/2606.24055)
Melissa Humphries†, Jonathan Tuke⋆, Lewis Mitchell∗

⋄†⋆∗ The School of Mathematical Sciences, Adelaide University, South Australia 5005, Australia

⋄ saranzaya.magsarjav@adelaide.edu.au, † melissa.humphries@adelaide.edu.au, ⋆ simon.tuke@adelaide.edu.au, ∗ lewis.mitchell@adelaide.edu.au

###### Abstract

Sentiment analysis in Twitter datasets is important because it enables monitoring public opinion on products and analysis of political and social movements\. One critical step is preprocessing: the automated processing of text for machine learning algorithms\. Preprocessing plays a critical role in reducing noise and improving efficiency\. However, little research has systematically examined the order in which preprocessing techniques are implemented\. We find that, when accounting for order, spelling correction is the least impactful preprocessing technique, whereas tokenisation is the most impactful\. Stemming and stop\-word removal are interchangeable, and it is better to remove stop words without removing negation\. The best order for applying the preprocessing techniques was tokenisation, text cleaning, stemming, and then stopword removal\. Our results provide a systematic approach for practitioners to deploy preprocessing to improve model output without the costly preprocessing exploratory phase\.

## 1 Introduction

Since the seminal work by Pang and Lee ([2008](https://arxiv.org/html/2606.24055#bib.bib1)), Opinion Mining and Sentiment Analysis have grown exponentially. This growth is largely due to the increased availability of social media, which is specifically designed for sharing opinions, views, and experiences. As a result, social media has become a natural fit for Sentiment Analysis.

Language is context\-based, making it extremely difficult to analyse\. Sentiment in particular can be very subtle\. Things like irony, sarcasm and negation, where a single word can completely change the sentiment polarity, all complicate analysis\. These challenges are further compounded when using social media, given the informal nature of online content\. Social media posts frequently include URLs, hashtags, multimedia, emoticons, misspellings, and slang, which increase the dimensionality of the problem and make classification more complex\. Therefore, before sentiment analysis and classification, the data must undergo carefully considered preprocessing steps to reduce unnecessary noise\.

Preprocessing steps are used to improve the performance of classifiers, whether in Sentiment Analysis or otherwise, by converting text into a more manageable and analysable form\. Some basic techniques include stemming \(reducing words to their root forms\), converting to lowercase, and removing stop words \(e\.g\., pronouns and articles\)\.

In addition to these relatively well-established techniques, newer preprocessing methods for online content include emoji conversion Dandannavar et al. ([2020](https://arxiv.org/html/2606.24055#bib.bib2)), slang conversion Singh and Kumari ([2016](https://arxiv.org/html/2606.24055#bib.bib3)), spelling correction, and many others. However, few papers consider how preprocessing techniques are applied or in what order. No systematic analysis has tested different orders of preprocessing techniques. This paper aims to fill this gap by systematically testing the ordering of preprocessing techniques and identifying which should be implemented. This, in turn, allows us to provide practitioners with recommendations on implementation orders. We show that tokenisation is the most impactful preprocessing technique and spelling correction is the least impactful. The order that yields the best output is tokenisation, cleaning, stemming, and then stopword removal. Although there has been an increase in the use of Neural Network models, such as BERT, in sentiment analysis, this paper will primarily focus on word-based sentiment analysis. The simplicity of the models will help focus on changes to preprocessing techniques rather than on differences between models.

The remainder of this work is organised as follows. Section [2](https://arxiv.org/html/2606.24055#S2) presents a review of the current literature, and Sections [4](https://arxiv.org/html/2606.24055#S4) onwards detail the methods, results, and discussion. Finally, the conclusion and future work are in Section [8](https://arxiv.org/html/2606.24055#S8).

![Refer to caption](https://arxiv.org/html/2606.24055v1/flowchart.png)

Figure 1: A flowchart process for sentiment analysis classification.

## 2 Related Work

The current literature shows that preprocessing significantly affects sentiment classification\. The best methods, however, seem highly dependent on both the algorithm used and the context of the investigation\. To our knowledge, no systematic investigation of the ordering of the techniques has been conducted\.

Angiani et al. ([2016](https://arxiv.org/html/2606.24055#bib.bib4)) used Multinomial Naive Bayes on SemEval 2015 (on Semantic Evaluation, [2015](https://arxiv.org/html/2606.24055#bib.bib21)) and 2016 (on Semantic Evaluation, [2016](https://arxiv.org/html/2606.24055#bib.bib22)) data and showed the individual effects of several preprocessing techniques, including converting all negations to ‘not’, emoji to simple descriptions, spelling correction, slang conversion, stemming, and stopword removal. All preprocessing techniques applied improved the output except for spelling correction and slang conversion. When combined, basic cleaning methods significantly improved classification compared to no cleaning; therefore, they recommended applying these techniques before any other preprocessing.

Expanding on this, Alam and Yao ([2019](https://arxiv.org/html/2606.24055#bib.bib23)) also used Naive Bayes to assess the effects of different preprocessing steps on the output. They compared this to the effects of preprocessing on Support Vector Machine and Maximum Entropy Modelling. The base comparison used emoticon removal. The authors also applied bi-grams. The most significant improvement was observed with Naive Bayes. However, Alam and Yao ([2019](https://arxiv.org/html/2606.24055#bib.bib23)) showed this was not true for all algorithms, with maximum entropy showing no improvement in accuracy.

Jianqiang ([2015](https://arxiv.org/html/2606.24055#bib.bib24)) showed the effects of URL, stopword, repeated letters, negation, acronym and number removal on sentiment classification performance using two feature models and four classifiers (Logistic regression, Naive Bayes, Support Vector Machines, Random Forests) on five Twitter datasets. When focusing on testing individual preprocessing techniques, there was no consensus on which were best, as accuracy changes varied across datasets and classifiers. The literature presents a wide variety of outcomes; however, only a few examine the simultaneous application of multiple preprocessing techniques, and no papers systematically analyse their ordering. This paper aims to fill this gap in the literature.

## 3 Framework and Algorithms

### 3.1 Framework

A standard supervised sentiment classification process is shown in Figure [1](https://arxiv.org/html/2606.24055#S1.F1). After the dataset is collected, it will undergo preprocessing. This step cleans the data and reduces it to the most informative set. This step can also be part of the feature selection process. The selected features and different feature combinations will affect the classifier’s performance. From this set, the model is trained, typically using a machine-learning classification algorithm. An overview of the algorithms used is provided in Section [3.2](https://arxiv.org/html/2606.24055#S3.SS2). The classifier then assigns labels, and the predictions are evaluated.

### 3.2 Prediction Algorithms

The main methods used in this paper for sentiment classification are Naive Bayes, Support Vector Machine, Clustering, and Decision Trees. We use these standard techniques for simplicity. Many more methods can be used to classify and evaluate sentiment analysis problems. We do not provide an extensive review of prediction algorithms here, but direct the reader to the following resources. Giachanou and Crestani ([2016](https://arxiv.org/html/2606.24055#bib.bib26)) provides a good overview of sentiment analysis and opinion mining and their application to Twitter. It presents many challenges in sentiment analysis and in using Twitter for it, as well as features, applications, and open problems in the field. Yue et al. ([2019](https://arxiv.org/html/2606.24055#bib.bib25)) is a more in-depth review of the different types of sentiment analysis and opinion mining and their backgrounds.

### 3.3 Preprocessing Techniques

Data preprocessing has four main steps: cleaning, integration, transformation, and reduction\. When applied specifically to text preprocessing, the major components are cleaning, transformation and reduction\. These components help reduce dataset volume and noise by normalising, aggregating, or integrating data in various ways\. The specific preprocessing techniques that we focused on were:

- Spelling correction: The process of correcting misspelled words.
- Stemming: Reducing words to their root form.
- Tokenisation: Segmenting text or strings of characters into paragraphs, sentences, words or characters for easier analysis.
- Stop word removal: Removing common words that do not add to sentiment analysis, such as pronouns and articles.

The final preprocessing technique considered was cleaning\. The cleaning processes that are considered in this paper are:

- emoji conversion: converting emoji and emoticons to worded descriptions,
- converting to lowercase,
- de-contraction: using a list of known contractions and converting them to their expanded form,
- symbol removal: removal of special characters and URLs, and
- punctuation removal.

To compare how each preprocessing technique affects outcomes, the option of not implementing any cleaning was also considered. The order in which the cleaning processes were applied was kept fixed to reduce computational cost, since the preprocessing techniques are the main focus.
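The string-based cleaning steps listed above can be sketched as small composable functions. This is a minimal illustration, not the authors' implementation: the contraction map is a hypothetical three-entry stand-in for a full list, only the URL part of symbol removal is shown, and emoji conversion is left out because it relies on an external package.

```python
import re
import string

# Hypothetical, deliberately tiny contraction map (a real list is far longer).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not"}
URL_PATTERN = re.compile(r"https?://\S+")

def to_lowercase(text):
    return text.lower()

def decontract(text):
    # Expand every known contraction found in the text.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def remove_urls(text):
    return URL_PATTERN.sub("", text)

def remove_punctuation(text):
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def clean(text, steps):
    # Apply the selected steps in a fixed order, then tidy the whitespace.
    for step in steps:
        text = step(text)
    return " ".join(text.split())

steps = [to_lowercase, decontract, remove_urls, remove_punctuation]
print(clean("I can't believe it! https://t.co/abc", steps))  # i cannot believe it
```

Passing a shorter `steps` list corresponds to switching individual cleaning processes off, and an empty list corresponds to the no-cleaning option.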

## 4 Data Set

Table 1: Final number of positive and negative tweets in each dataset.

Three datasets were used: US Airline for Everyone library. ([2015](https://arxiv.org/html/2606.24055#bib.bib20)), GOP Debate for Everyone library. ([2016](https://arxiv.org/html/2606.24055#bib.bib19)), and the SMILE project Wang et al. ([2016](https://arxiv.org/html/2606.24055#bib.bib18)). All datasets consist of Twitter posts, focusing the analysis on short-form text. To standardise the datasets, neutral tweets were removed, happy emotions were converted to a single positive value, and negative emotions were converted to a single negative value. This preprocessing ensured consistency across datasets before sampling and analysis.

After this standardisation process, stratified sampling was applied. Theoretically, there are approximately 1.5 million possible combinations to run; therefore, stratified sampling was used to improve time efficiency. To balance accuracy and efficiency, different sample sizes were tested across datasets for randomly selected combinations of preprocessing techniques. The data were sampled at different proportions multiple times to obtain a confidence interval for the F1 score. This process showed the appropriate sample size was about 35% of the original dataset. Larger sample sizes did not show a significant improvement in the classification accuracy. The final number of tweets for each dataset is shown in Table [1](https://arxiv.org/html/2606.24055#S4.T1).
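As a rough sketch of the sampling step, the function below draws the same fraction from each sentiment class so that the class balance of the original data is preserved. It assumes each row is a `(text, label)` pair; the function name, the seed, and the toy data are illustrative and not taken from the paper.

```python
import random
from collections import defaultdict

def stratified_sample(rows, fraction, seed=0):
    """Sample `fraction` of the rows from each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    sample = []
    for group in by_label.values():
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample

# Toy data: 200 positive and 100 negative tweets.
rows = [("tweet", "positive")] * 200 + [("tweet", "negative")] * 100
subset = stratified_sample(rows, 0.35)
print(len(subset))  # 105
```

Repeating the draw with different seeds and fractions gives the spread of F1 scores from which a confidence interval can be estimated.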

## 5 Method

Table 2: The orders in which the preprocessing techniques were implemented. The orders are referenced by the number and shorthand in the first column. In total, 15 different orders were considered.

| Order | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| 0: cl-to-sp-st-se | clean | tokeniser | spell | stopword | stem |
| 1: cl-to-sp-se-st | clean | tokeniser | spell | stem | stopword |
| 2: cl-to-st-sp-se | clean | tokeniser | stopword | spell | stem |
| 3: to-cl-sp-st-se | tokeniser | clean | spell | stopword | stem |
| 4: to-cl-sp-se-st | tokeniser | clean | spell | stem | stopword |
| 5: to-cl-st-sp-se | tokeniser | clean | stopword | spell | stem |
| 6: to-sp-cl-st-se | tokeniser | spell | clean | stopword | stem |
| 7: to-sp-cl-se-st | tokeniser | spell | clean | stem | stopword |
| 8: to-sp-st-cl-se | tokeniser | spell | stopword | clean | stem |
| 9: to-sp-st-se-cl | tokeniser | spell | stopword | stem | clean |
| 10: to-sp-se-cl-st | tokeniser | spell | stem | clean | stopword |
| 11: to-sp-se-st-cl | tokeniser | spell | stem | stopword | clean |
| 12: to-st-cl-sp-se | tokeniser | stopword | clean | spell | stem |
| 13: to-st-sp-cl-se | tokeniser | stopword | spell | clean | stem |
| 14: to-st-sp-se-cl | tokeniser | stopword | spell | stem | clean |

Different packages were also used to see which preprocessing package performs best. The packages used for each preprocessing technique are listed below:

- Spelling correction: spellchecker Barrus ([2021](https://arxiv.org/html/2606.24055#bib.bib13)), textblob Loria ([2020](https://arxiv.org/html/2606.24055#bib.bib8)), autocorrect Sondej ([2021](https://arxiv.org/html/2606.24055#bib.bib17)),
- Stemming: SnowballStemmer Bird et al. ([2009](https://arxiv.org/html/2606.24055#bib.bib6)), WordNetLemmatizer Loria ([2020](https://arxiv.org/html/2606.24055#bib.bib8)), spaCy Honnibal et al. ([2020](https://arxiv.org/html/2606.24055#bib.bib7)), textblob Loria ([2020](https://arxiv.org/html/2606.24055#bib.bib8)),
- Tokenisation: TweetTokenizer Bird et al. ([2009](https://arxiv.org/html/2606.24055#bib.bib6)), spaCy Honnibal et al. ([2020](https://arxiv.org/html/2606.24055#bib.bib7)), transformers AutoTokenizer Face ([2021](https://arxiv.org/html/2606.24055#bib.bib16)), whitespace,
- Stop word removal: nltk stopword Bird et al. ([2009](https://arxiv.org/html/2606.24055#bib.bib6)), applied so that the negation words no and not are not removed from the text, and
- Cleaning: emoji conversion (Kim and Wurster, [2021](https://arxiv.org/html/2606.24055#bib.bib10)), converting to lowercase, de-contraction, symbol removal, punctuation removal Bird et al. ([2009](https://arxiv.org/html/2606.24055#bib.bib6)).

With five preprocessing techniques, there are $5! = 120$ possible orders of implementation. Running every combination in every order is computationally intensive, so the search space was reduced by accounting for order constraints and illogical combinations. Stop word removal, spelling correction, and stemming take tokenised text as input; thus, they must be implemented after tokenisation. Cleaning (text transformation and reduction) can be applied before or after tokenisation, because all cleaning processes rely on string identification and do not need tokenised text. A further constraint was to run spelling correction before stemming, since the stemmers do not work on incorrect words. These constraints reduced the number of possible orders from $5!$ to 15; the possible orders are listed in Table [2](https://arxiv.org/html/2606.24055#S5.T2).
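The reduction to 15 orders can be checked by enumerating all permutations and keeping only those that satisfy the two constraints described above. The sketch below reuses the shorthand from Table 2 (`cl`, `to`, `sp`, `st`, `se`); the function name is illustrative.

```python
from itertools import permutations

# Shorthand: clean, tokenise, spelling correction, stop-word removal, stemming.
STEPS = ("cl", "to", "sp", "st", "se")

def is_valid(order):
    pos = {step: i for i, step in enumerate(order)}
    # Spelling correction, stop-word removal and stemming need tokenised input.
    after_tokeniser = all(pos["to"] < pos[s] for s in ("sp", "st", "se"))
    # Spelling correction must run before stemming.
    spell_before_stem = pos["sp"] < pos["se"]
    return after_tokeniser and spell_before_stem

valid_orders = ["-".join(p) for p in permutations(STEPS) if is_valid(p)]
print(len(valid_orders))  # 15
```

Cleaning is the only step left unconstrained, which is why it appears in every position across the listed orders.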

There are $2^{5}$ different combinations for the cleaning process, since each of the five cleaning steps can be applied or skipped, and four spelling correction options, including no spelling correction. Similarly, for stemming, five options were considered, including no stemming. There were four tokenisation methods, including whitespace, and the stop words could be removed or not removed. Together, these give $2^{5} \times 4 \times 5 \times 4 \times 2 = 5120$ possible combinations of the different techniques. After preprocessing, the final text is run through four different models for sentiment analysis: Naive Bayes (NB), K-means (KM), Decision Trees (DT), and Support Vector Machines (SVM). To mitigate overfitting, five-fold cross-validation was used, and the average F1 scores were recorded, where the F1 score is defined as:
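The count of 5120 can be reproduced by building the configuration grid explicitly. In this sketch the option labels are shorthand invented for the packages listed in this section; only the number of options per technique comes from the text.

```python
from itertools import product

# Five cleaning steps, each either applied (True) or skipped (False).
cleaning_flags = list(product([False, True], repeat=5))
# Option labels below are illustrative shorthand, "none" meaning not applied.
spelling = ["none", "spellchecker", "textblob", "autocorrect"]
stemming = ["none", "snowball", "wordnet", "spacy", "textblob"]
tokenisers = ["tweet", "spacy", "autotokenizer", "whitespace"]
stop_word_removal = [False, True]

configurations = list(
    product(cleaning_flags, spelling, stemming, tokenisers, stop_word_removal)
)
print(len(configurations))  # 5120
```

Each configuration would then be paired with each admissible order, which is what makes the full search expensive.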

$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP},$$

where TP is the count of true positives, FN is the count of false negatives, and FP is the count of false positives. Analysis of variance (ANOVA) was used to determine the impact of each process, as it provides information on which techniques had the greatest effect on the outputs.
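The F1 definition translates directly into code. The helper below is a small sketch with illustrative names; returning 0.0 when the denominator is zero is a convention assumed here, not something stated in the paper.

```python
def f1_score(tp, fn, fp):
    """F1 from counts of true positives, false negatives and false positives."""
    denominator = 2 * tp + fn + fp
    if denominator == 0:
        return 0.0  # assumed convention when there are no positives at all
    return 2 * tp / denominator

def mean_f1(fold_counts):
    """Average F1 over cross-validation folds given (tp, fn, fp) per fold."""
    scores = [f1_score(tp, fn, fp) for tp, fn, fp in fold_counts]
    return sum(scores) / len(scores)

print(f1_score(tp=8, fn=2, fp=2))  # 0.8
```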

## 6 Results

Table 3: Average F1-scores for each order for different datasets and for each model. The top 3 highest F1 scores are bolded for each model and dataset.

Table [3](https://arxiv.org/html/2606.24055#S6.T3) shows the average F1 scores of each order for each dataset and classifier. The Support Vector Machine consistently performed best, achieving an F1 score above 90%. Decision Tree and Naive Bayes performed similarly. Naive Bayes performed better on the Airline dataset, and Decision Trees performed better on the SMILE dataset and yielded very similar results on the Debate dataset. K-means came last, as it did not achieve F1 scores greater than 60%.

The best ordering was determined by first considering each classifier individually, then collating similarities among the top-performing orders. Firstly, SVM performed best with the following order implementations: 3:to-cl-sp-st-se, 4:to-cl-sp-se-st, and 5:to-cl-st-sp-se. The common factors across these orders are tokenisation, then cleaning, then spelling correction, with no order preference between stemming and stop-word removal. For Naive Bayes, the orders that gave the highest scores in two of the datasets were the same as for SVM: 3:to-cl-sp-st-se, 4:to-cl-sp-se-st, and 5:to-cl-st-sp-se. The other top-performing orders were 0:cl-to-sp-st-se, 1:cl-to-sp-se-st, and 2:cl-to-st-sp-se. For KM and DT, the cleaning process was much more dependent on the dataset used. KM and DT are both highly sensitive to initial conditions. K-means is highly sensitive to the initial centroids. Because the starting point varies for each dataset, the initial centroids differ across cleaning processes. Similarly, Decision Trees are not robust; thus, slight changes in the dataset can alter the branches and, consequently, the final prediction.

The order that resulted in the highest F1 score across the different models and datasets was: tokenising, cleaning, spelling correction, stop word removal, and then stemming. However, when accounting for all variations among the top-performing orders, tokenisation and cleaning were interchangeable, cleaning came before spelling correction, and stemming and stop-word removal were the last steps.

Noting that the cleaning process is more consistent when using SVM, the analysis will focus on SVM across the three datasets\.

![Refer to caption](https://arxiv.org/html/2606.24055v1/SVM_prep-edit.png)

Figure 2: The average F1-scores with standard error bars for different preprocessing techniques. The results are from SVM on the three datasets. The rows are the different techniques, and the columns are the different datasets.

Table 4: ANOVA outputs for the different datasets using SVM. A higher F-statistic indicates greater variation in the results; therefore, it has a greater impact on the output. It can be observed that tokenisation is the most impactful, and spelling correction is the least impactful.

![Refer to caption](https://arxiv.org/html/2606.24055v1/preprocessing_SVM_clean-edit.png)

Figure 3: The average F1-score of the cleaning process on different datasets using SVM. The error bar shows the spread of the dataset. If the line between two points is positive, the cleaning process should be applied; if it is negative, it should not be applied.

Table [4](https://arxiv.org/html/2606.24055#S6.T4) shows the ANOVA output and the F-statistic for the different preprocessing techniques for each dataset using SVM. The larger the F-statistic, the more it explains the variation in the data, and the more impactful the preprocessing is on the final output. From the ANOVA outputs, it can be seen that spelling correction has the lowest F-statistic across all datasets and models, with values ranging from 18 to 1000. This implies that spelling correction is the least impactful preprocessing technique. This was also true when accounting for the order of application. The top performing spelling correction methods, from Figure [2](https://arxiv.org/html/2606.24055#S6.F2), were spellchecker or no applications of spelling correction, then autocorrect and lastly textblob.
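To make the role of the F-statistic concrete, the sketch below computes a one-way ANOVA F-statistic for a single technique by grouping F1 scores by the option used. The paper does not state its exact ANOVA design, so the one-way form is an assumption, and the scores shown are made-up placeholders, not results from the study.

```python
def one_way_f_statistic(groups):
    """Ratio of between-group to within-group variance for lists of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Placeholder F1 scores grouped by a hypothetical tokeniser choice.
scores_by_tokeniser = [
    [0.91, 0.92, 0.90],  # option A
    [0.88, 0.87, 0.89],  # option B
    [0.80, 0.82, 0.81],  # option C
]
print(one_way_f_statistic(scores_by_tokeniser))
```

A technique whose options produce well-separated group means relative to the scatter within each group yields a large F-statistic, which is the sense in which tokenisation is described as more impactful than spelling correction.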

The most impactful preprocessing technique is tokenisation, which had F-statistics at least an order of magnitude higher than those for spelling correction. The most frequently top-performing methods in Figure [2](https://arxiv.org/html/2606.24055#S6.F2) were BERT and spaCy tokenisers, followed by nltk_tweet, then whitespace.

The next most impactful techniques are stemming and stop word removal. These two preprocessing techniques were interchangeable in most cases. The best-performing stemmers are the Snowball stemmer and the spaCy stemmer, as seen in Figure [2](https://arxiv.org/html/2606.24055#S6.F2). The Snowball stemmer follows simple rules as a decision process, while the spaCy stemmer takes context into account to determine a word’s root. From Figure [2](https://arxiv.org/html/2606.24055#S6.F2), it is better to remove stop words.

Figure [3](https://arxiv.org/html/2606.24055#S6.F3) shows the results of each cleaning process. The only cleaning process that consistently showed a major improvement was text lowering. The other consistent result was that symbols should not be removed; however, the difference was smaller. The final consistent cleaning process was the de-contraction of text.

## 7 Discussion

Spelling correction was the least useful technique\. This could be because these methods do not account for context\. As a result, correcting spelling may introduce as much noise as it removes\. The recommendation is not to apply spelling correction due to its low overall impact and poor performance\.

The preprocessing technique that made the most difference in the output was tokenisation\. This is intuitive, as it creates a basis for how each following technique will be applied\. Spelling correction, stemming, and stop word removal methods will not work correctly if punctuation or emojis remain in the tokenised text\. Both best\-performing tokenisers, BERT and spaCy, carefully consider how to tokenise text\. This preserves important context necessary for sentiment analysis\. Therefore, it is recommended to use tokenisation methods that can handle the differences between punctuation and emojis\.
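The difference a tokeniser makes to punctuation and emoticons can be seen in a small comparison. The regular expression below is a hypothetical stand-in for a tweet-aware tokeniser and is far simpler than the BERT, spaCy, or NLTK tokenisers evaluated in the paper; the emoticon pattern covers only a few common shapes.

```python
import re

# Assumed emoticon shape: eyes, optional nose, mouth, e.g. :) ;-( =D
EMOTICON = r"[:;=][\-o]?[\)\(DPp]"
# Try emoticons first, then words (with an optional apostrophe part),
# then any single remaining non-space symbol.
TOKEN_PATTERN = re.compile(rf"{EMOTICON}|\w+(?:'\w+)?|[^\w\s]")

def whitespace_tokenise(text):
    return text.split()

def regex_tokenise(text):
    return TOKEN_PATTERN.findall(text)

tweet = "Great flight :) thanks!"
print(whitespace_tokenise(tweet))  # ['Great', 'flight', ':)', 'thanks!']
print(regex_tokenise(tweet))       # ['Great', 'flight', ':)', 'thanks', '!']
```

With whitespace splitting, `thanks!` stays a single token and would not match `thanks` in a stop-word list, a spelling dictionary, or a stemmer's vocabulary, whereas the second tokeniser separates the punctuation while keeping the emoticon intact.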

Stop words should be removed, as they do not carry much meaning. However, in sentiment analysis, negating words such as not and no should be kept, because removing a negating word flips the sentiment of the text.
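A minimal sketch of stop-word removal that retains negation is shown below. The stop-word set is a small illustrative subset rather than the NLTK list used in the paper, and the `keep_negation` flag is a hypothetical parameter added for comparison.

```python
# Illustrative subset of English stop words (the paper uses the NLTK list).
STOP_WORDS = {"i", "me", "the", "a", "an", "is", "this", "it", "no", "not", "and"}
NEGATIONS = {"no", "not"}

def remove_stop_words(tokens, keep_negation=True):
    # When keeping negation, take "no" and "not" out of the blocked set.
    blocked = STOP_WORDS - NEGATIONS if keep_negation else STOP_WORDS
    return [t for t in tokens if t.lower() not in blocked]

tokens = ["this", "is", "not", "good"]
print(remove_stop_words(tokens))                       # ['not', 'good']
print(remove_stop_words(tokens, keep_negation=False))  # ['good']
```

The second call shows the failure mode: once `not` is dropped, a negative tweet is reduced to a single positive word.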

The results showed no preference for whether stop-word removal or stemming was applied first. This could be because they do not depend on each other. Stemming a stop word will return a stop word; therefore, no real impact is noted. The recommendation here is to perform stop-word removal first, followed by stemming, as stemming a stop word is redundant. Removing stop words first will save time during stemming.

During the cleaning process, the only step with a major effect was text lowering. Lowering the text reduces the number of variations of the same word, resulting in more consistent data for modelling. Another effective, though less impactful, choice was not removing symbols. More often than not, emojis convey the sentiment of the text; therefore, removing symbols may be detrimental to sentiment analysis, as emojis can be made from symbols. The final effective cleaning process was de-contraction. As with symbol removal, the change in accuracy was relatively small; however, applying de-contraction helps reduce data variation. This will only work if the stop word removal process retains negating words. The other two cleaning processes, emoji conversion and punctuation removal, were much more dataset-dependent, and there was no clear consensus on whether to use them.

We can now propose a recommendation for applying the different preprocessing techniques in the correct order\. The first preprocessing technique is to use a context\-aware tokeniser, such as BERT or spaCy\. When tokenisation is done well, the following preprocessing steps become more consistent and perform well\. The next technique to apply is cleaning the tokenised text\. The main cleaning steps to implement are text lowering and decontraction\. Both of these help reduce data variation\. However, whether to use other cleaning processes depends heavily on the dataset\. After cleaning the tokenised text, it should be run through a stemmer\. The suggested stemmers are the Snowball and spaCy stemmers\. Then, stop\-word removal should be applied; however, words such as not and no should be retained in the dataset, as they alter the text’s sentiment\.
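The recommended order can be sketched as a pipeline of token-level steps. Every component here is a simplified stand-in chosen so the example runs without external packages: the regular-expression tokeniser replaces BERT or spaCy, the suffix stripper replaces the Snowball or spaCy stemmer, and the contraction map and stop-word set are tiny hypothetical lists.

```python
import re

NEGATIONS = {"no", "not"}
# Illustrative stop-word subset with the negation words taken out.
STOP_WORDS = {"the", "a", "is", "was", "it", "no", "not"} - NEGATIONS
# Hypothetical contraction map applied to individual tokens.
CONTRACTIONS = {"wasn't": ["was", "not"], "isn't": ["is", "not"]}

def tokenise(text):
    # Words (with optional apostrophe part) or single symbols.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def clean(tokens):
    # Text lowering followed by de-contraction.
    out = []
    for tok in tokens:
        tok = tok.lower()
        out.extend(CONTRACTIONS.get(tok, [tok]))
    return out

def stem(tokens):
    # Crude suffix stripping as a stand-in for a real stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(text):
    # Order: tokenise, clean, stem, then stop-word removal.
    tokens = tokenise(text)
    for step in (clean, stem, remove_stop_words):
        tokens = step(tokens)
    return tokens

print(preprocess("The flight wasn't delayed"))  # ['flight', 'not', 'delay']
```

The example also shows why de-contraction depends on retaining negation: `wasn't` expands to `was not`, and only the retained `not` preserves the meaning after stop-word removal.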

## 8 Conclusion

Sentiment analysis has been extensively applied to social media data, and preprocessing techniques are used to improve F1 scores. However, there is no consensus on the order or which techniques should be implemented. In this paper, the physical and algorithmic constraints of preprocessing techniques were considered, thus limiting the possible logical order. The reduced number of preprocessing orders was tested to determine the optimal implementation order using three classifiers across three datasets.

Average F1-scores for each order showed that the best-performing order was tokenisation, cleaning, spelling correction, then stop word removal, and finally stemming. ANOVA analysis also showed that the most impactful preprocessing technique was tokenisation, with BERT and spaCy being the better choices. Spelling correction was the least effective, so it is not recommended. For the different cleaning processes, the largest change was observed with text lowering, and to a lesser extent with word de-contraction. However, the rest of the cleaning process was highly dependent on the classifier and dataset being used.

Finally, since the start of this analysis, models regarding language and text have improved significantly due to developments in Large Language Models\. These models have changed the scope of how natural language processing is approached\. Therefore, for future work, it would be interesting to examine how preprocessing affects outcomes in these models and whether preprocessing of text is necessary for language\-related tasks\.

## References

- S. Alam and N. Yao (2019) The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Computational and Mathematical Organization Theory 25(3), pp. 319–335.
- G. Angiani, L. Ferrari, T. Fontanini, P. Fornacciari, E. Iotti, F. Magliani, and S. Manicardi (2016) A comparison between preprocessing techniques for sentiment analysis in Twitter. In CEUR Workshop Proceedings. External Links: ISSN 16130073.
- T. Barrus (2021) Pyspellchecker package. Note: Release 0.6.2. Available at [https://pypi.org/project/pyspellchecker/](https://pypi.org/project/pyspellchecker/).
- S. Bird, E. Klein, and E. Loper (2009) Natural language processing with python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- P. S. Dandannavar, S. R. Mangalwede, and S. B. Deshpande (2020) Emoticons and Their Effects on Sentiment Analysis of Twitter Data. In EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing, A. Haldorai, A. Ramu, S. Mohanram, and C. C. Onn (Eds.), EAI/Springer Innovations in Communication and Computing, pp. 191–201. External Links: [Document](https://dx.doi.org/10.1007/978-3-030-19562-5%5F19), ISBN 978-3-030-19562-5.
- H. Face (2021) Tokenizers. Note: Available at [https://huggingface.co/docs/tokenizers/](https://huggingface.co/docs/tokenizers/).
- C. D. for Everyone library. (2015) Twitter us airline sentiment. Note: Available at [https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment).
- C. D. for Everyone library. (2016) First gop debate twitter sentiment. Note: Available at [https://www.kaggle.com/datasets/crowdflower/first-gop-debate-twitter-sentiment](https://www.kaggle.com/datasets/crowdflower/first-gop-debate-twitter-sentiment).
- A. Giachanou and F. Crestani (2016) Like it or not: A survey of Twitter sentiment analysis methods. Vol. 49. External Links: [Document](https://dx.doi.org/10.1145/2938640), ISSN 15577341.
- M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd (2020) spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. External Links: [Document](https://dx.doi.org/10.5281/zenodo.1212303), [Link](https://doi.org/10.5281/zenodo.1212303).
- Z. Jianqiang (2015) Pre-processing boosting Twitter sentiment analysis?. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pp. 748–753.
- T. Kim and K. Wurster (2021) Emoji package. Note: Available at [https://pypi.org/project/emoji/](https://pypi.org/project/emoji/).
- S. Loria (2020) TextBlob documentation. Release 0.16.02.
- I. W. on Semantic Evaluation (2015) SemEval-2015: semantic evaluation exercises. Note: Available at [https://alt.qcri.org/semeval2015/index.php?id=tasks](https://alt.qcri.org/semeval2015/index.php?id=tasks).
- I. W. on Semantic Evaluation (2016) SemEval-2016: semantic evaluation exercises. Note: Available at [https://alt.qcri.org/semeval2016/index.php?id=tasks](https://alt.qcri.org/semeval2016/index.php?id=tasks).
- B. Pang and L. Lee (2008) Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval 2(1–2), pp. 1–135.
- T. Singh and M. Kumari (2016) Role of Text Pre-processing in Twitter Sentiment Analysis. 89, pp. 549–554. External Links: ISSN 1877-0509, [Document](https://dx.doi.org/10.1016/j.procs.2016.06.095), [Link](https://www.sciencedirect.com/science/article/pii/S1877050916311607).
- F. Sondej (2021) Autocorrect package. Note: Available at [https://github.com/fsondej/autocorrect](https://github.com/fsondej/autocorrect).
- B. Wang, A. Tsakalidis, M. Liakata, A. Zubiaga, R. Procter, and E. Jensen (2016) SMILE twitter emotion dataset. Note: Available at [https://figshare.com/articles/dataset/smile_annotations_final_csv/3187909](https://figshare.com/articles/dataset/smile_annotations_final_csv/3187909). External Links: [Document](https://dx.doi.org/10.6084/m9.figshare.3187909.v2).
- L. Yue, W. Chen, X. Li, W. Zuo, and M. Yin (2019) A survey of sentiment analysis in social media. Knowledge and Information Systems 60(2), pp. 617–663.
