This project analyzes public feedback on Google's self-driving cars by performing sentiment analysis on tweets using natural language processing in Python.
Python
NumPy
Pandas
NLTK (Natural Language ToolKit)
Scikit-learn
Data Cleaning and Exploratory Data Analysis
Tokenization
Classification
Model Evaluation
Please find below the process used to analyze the dataset. All graphs were created using Matplotlib and Seaborn.
This stage consists of cleaning and preprocessing the data in order to prepare it for modeling.
We use a barplot of sentiment scores to better understand their distribution. From the graph below,
we see the maximum concentration of data in the sentiment 3 section, which may allow the model to predict sentiment 3 more accurately than the other classes.


Packages Used:- Pandas, NumPy, Matplotlib
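The distribution plot described above can be sketched as follows. This is a minimal illustration: the inline data is a hypothetical stand-in for the tweet dataset, and the column name "sentiment" is an assumption. Seaborn's barplot would produce the same chart; plain Matplotlib is used here to keep the sketch self-contained.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tweet dataset; the real file and column names may differ.
df = pd.DataFrame({"sentiment": [3, 3, 3, 2, 4, 3, 1, 5, 3, 2]})

# Count how many tweets fall under each sentiment score
counts = df["sentiment"].value_counts().sort_index()
print(counts)

plt.bar(counts.index, counts.values)
plt.xlabel("Sentiment score")
plt.ylabel("Number of tweets")
plt.title("Distribution of sentiment scores")
plt.savefig("sentiment_distribution.png")
```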
Tokenization is the process of breaking text down into smaller chunks for easier processing. We also perform lemmatization in order to reduce every word to its base (root) form. We tokenize using NLTK's tokenize function and create a custom part-of-speech tagger to tag values as adjective, verb, noun, or adverb, and use these tags during lemmatization.
To perform lemmatization, we use NLTK's lemmatizer on the tokenized dataset. We remove stop words and frequently occurring words that do
not add any value to sentiment, create a list of all lemmatized words, and build two arrays: one for the lemmatized words and one for their part-of-speech tags.
We then create a dataframe of lemmas and part-of-speech tags. Since there is no POS tag for punctuation, we add one called PU.


Packages Used:- nltk and other packages mentioned above
In this stage we perform supervised classification on the Twitter dataset in order to predict sentiments accurately. The first step is to detokenize the Twitter data. We then split the dataset into training and testing sets. Next, we vectorize the text reviews into a numerical format, since classification cannot be performed on raw text. Finally, we classify using a Multinomial Naive Bayes classifier, which produces an accuracy of 61.26%.
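The split-vectorize-classify steps can be sketched as below. The tiny corpus and its sentiment labels are hypothetical stand-ins; the real project vectorizes the detokenized tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in corpus with invented sentiment scores
texts = [
    "love the self driving car", "great smooth ride",
    "terrible scary experience", "worst drive ever",
    "excited about autonomous cars", "skeptical about safety",
    "amazing technology", "awful and unsafe",
]
labels = [5, 4, 1, 1, 5, 2, 5, 1]

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Vectorize text into bag-of-words counts
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit the Multinomial Naive Bayes classifier and evaluate
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
preds = clf.predict(X_test_vec)
print("accuracy:", accuracy_score(y_test, preds))
```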
Now that the model is ready, we find the top 10 words for each sentiment class in order to judge whether the model gives meaningful results. For this we
use the classifier's feature_log_prob_ attribute in scikit-learn to estimate feature importance.

Packages Used:- sklearn and other packages mentioned above
As seen above, the accuracy of our Naive Bayes classifier is 61.26%. We then test its predictions on random test statements as shown below.

From the above scores we see that negative words like "skeptical" are given a low sentiment score, while positive words like "excited" receive a high sentiment score.
We would, however, expect a word like "worst" to have a very low sentiment, which is not the case. This is probably due to the model's low accuracy
or the large number of records under score 3, as shown in the EDA above.
We use a confusion matrix to understand the performance of the model for each score.
The confusion matrix below compares the actual and predicted values.
As shown above, the strongest agreement between actual and predicted values is observed for sentiment score 2, meaning the accuracy for tweets with a sentiment score of 2 is the highest. This is a little
unexpected, since as per our analysis the distribution of sentiment score 3 was the highest.
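Building the confusion matrix and reading off per-class accuracy can be sketched as follows. The actual and predicted score arrays here are invented for illustration; the real values come from the test split.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted sentiment scores (1-5)
y_true = [3, 3, 2, 2, 2, 4, 5, 1, 3, 2]
y_pred = [3, 2, 2, 2, 2, 4, 3, 1, 3, 2]

labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class accuracy (recall): correct predictions on the diagonal / row totals
per_class = cm.diagonal() / cm.sum(axis=1)
for score, acc in zip(labels, per_class):
    print(f"sentiment {score}: {acc:.2f}")
```

A Seaborn heatmap of `cm` gives the visual form of the matrix shown in the write-up.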

A major issue with our model is its fairly low accuracy of around 61%. We can improve the model by changing the way the bag of words is created. Currently it is built with the CountVectorizer method, which counts the occurrences of words in the text, so a word that appears frequently becomes more important for classification. To overcome this, we can use TF-IDF, which is the product of term frequency and inverse document frequency: if a term appears n times in a document with w words, its term frequency is n/w. We can also try different models, such as a Linear Support Vector Classifier.
On implementing TfidfVectorizer followed by a Linear Support Vector Classifier, we observed the accuracy increase to 62%.
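The improved TF-IDF pipeline can be sketched as below, again on an invented toy corpus with binary labels rather than the project's five sentiment scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; labels 1 = positive, 0 = negative
texts = ["love this car", "hate this car", "great ride", "terrible ride",
         "amazing experience", "awful experience"]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF weighting replaces raw counts, then a linear SVM does the classification
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
preds = model.predict(texts)
print("training accuracy:", accuracy_score(labels, preds))
```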