Russia-Ukraine war: an NLP analysis of tweets
It is sad how events have unfolded in the Russia-Ukraine conflict; an active war in this age and society is absolutely scary to think about. The heart of the conflict comes down to the future of Ukraine as an independent nation and its potential NATO membership, which would cost Russia its influence in Europe.
In the analysis below we look at the text data in the tweets variable, applying natural language processing to produce compound sentiment scores, topics, word clouds, and personnel and equipment loss visuals.
The dataset contains tweets monitoring the ongoing Ukraine-Russia conflict; we have isolated the tweets from 28 February 2022.
Below are all the columns present in the dataset:
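The column list itself was shown as output; one way to reproduce it, assuming the data is distributed as a CSV file (the filename below is a placeholder):

```python
import pandas as pd

# Load the tweet dataset (filename is a placeholder)
df = pd.read_csv("ukraine_russia_tweets_2022-02-28.csv")

# List every column present in the dataset
print(df.columns.tolist())
```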
Removing redundant variables: Since we are primarily interested in the tweet text, variables that carry no useful signal, such as links, userid, tweetid, etc., only add noise and reduce model performance. Below are all the columns we remove:
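A minimal sketch of the drop, with hypothetical column names based on the ones mentioned above (the actual names may differ in the dataset):

```python
# Columns that carry no signal for the text analysis (names are assumptions)
redundant_cols = ["tweetid", "userid", "url", "screen_name", "following", "followers"]

# Drop only the columns that actually exist, to avoid KeyErrors
df = df.drop(columns=[c for c in redundant_cols if c in df.columns])
```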
Removing null values: Null values skew the data and reduce model performance. Since part of the analysis looks at tweets from the United States, we drop all rows that are missing a location.
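Assuming the location column is named location, the drop looks like this:

```python
# Drop rows with no location, since the analysis focuses on US tweets
df = df.dropna(subset=["location"])
```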
Data filters to reduce high variance: To limit variance we apply filters to individual columns and cut down the number of rows. To keep the text analysis to English we filter the language column to ‘en’. Filtering the location column is trickier because the data is not clean, so we hard-code a str.contains pattern to capture the largest number of tweets associated with the United States.
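A sketch of both filters; the language column name and the location pattern are assumptions:

```python
# Keep only English-language tweets
df = df[df["language"] == "en"]

# The location field is free text, so match common US spellings and states
us_pattern = "United States|USA|New York|California|Washington|Texas"
df = df[df["location"].str.contains(us_pattern, case=False, na=False)]
```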
Text Data Cleaning:
Text data is complex and comes in natural language (English, German, French, etc.). By nature it is unstructured, so we need to clean and transform it before Python can interpret it.
Below are the steps we take to clean text data:
Tokenization: Tokenization is the process of splitting a statement into its individual words; we import nltk’s word_tokenize function to create tokens for the text of every tweet in the dataset.
Removing stop words: Stop words add no value to the analysis and are merely fillers in natural language, for example ‘the’, ‘a’, ‘an’, ‘is’, etc. To remove them we either import a stop-word dictionary or build one manually. I have imported the stop-word list from the spaCy text-modelling library and written a function that tokenizes the text and drops every token that appears in that list.
Lemmatization: It’s the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.
Below is the code we run to apply tokenization, stop-word removal, and lemmatization to the tweet text in the dataset:
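The original snippet was a screenshot; here is a minimal reconstruction of the three steps, assuming the tweet text lives in a text column (nltk’s WordNet lemmatizer stands in for whichever lemmatizer the original used):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

nltk.download("punkt")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Tokenize the tweet into individual lowercase words
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not in spaCy's stop-word list
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    # Reduce each token to its lemma (dictionary form)
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

df["clean_text"] = df["text"].apply(clean_text)
```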
Below is a before-and-after of applying the text data cleaning steps:
Word clouds for the cleaned corpus and hashtags:
A word cloud is a visual representation of text data, used to visualize free-form text. Word clouds usually contain single words, with the importance of each word indicated by its font size.
Below is the code to generate a word cloud:
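A sketch using the wordcloud library, assuming the cleaned tweets are in the clean_text column created above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all cleaned tweets into a single corpus string
corpus = " ".join(df["clean_text"])

# Generate and display the word cloud
wc = WordCloud(width=1200, height=600, background_color="white").generate(corpus)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```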
Word Cloud for cleaned tweet text:
Word Cloud for hashtags in cleaned tweet text:
Code to extract hashtags from cleaned text:
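A regex-based sketch; since cleaning may strip the ‘#’ character, it is applied to the raw text column here:

```python
import re

# Pull every #hashtag out of the raw tweet text
df["hashtags"] = df["text"].apply(lambda t: re.findall(r"#(\w+)", t.lower()))

# Flatten into one string so it can feed the word cloud code above
hashtag_corpus = " ".join(tag for tags in df["hashtags"] for tag in tags)
```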
Sentiment Analysis with VADER:
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It relies on a sentiment lexicon: a list of lexical features (e.g., words) labeled according to their semantic orientation as either positive or negative. VADER tells us not only whether a sentiment is positive or negative but also how positive or negative it is, returning a positive score, a negative score, a neutral score, and a compound score, which normalizes the summed word valences into a single value between -1 (most negative) and +1 (most positive).
Below is a code snippet to apply VADER sentiment analysis:
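A minimal sketch using nltk’s VADER implementation, applied to the raw text column since VADER uses capitalization and punctuation cues; the ±0.05 thresholds are the ones conventionally recommended for the compound score:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Compound score: a normalized aggregate in [-1, 1]
df["compound"] = df["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# Label each tweet using the conventional compound thresholds
def label(score):
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

df["sentiment"] = df["compound"].apply(label)
```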
Below is a pie chart of the sentiment types for the tweets:
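Assuming the sentiment labels from the snippet above, the chart can be reproduced with pandas’ built-in plotting:

```python
import matplotlib.pyplot as plt

# Share of positive / negative / neutral tweets
df["sentiment"].value_counts().plot.pie(autopct="%1.1f%%", figsize=(6, 6))
plt.ylabel("")
plt.show()
```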
Topic Detection with LDA:
Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm that represents each document as a mixture of topics, where each topic generates words with certain probabilities.
Cons of LDA:
- The user must define the number of topics to be created.
- The user must interpret what the topics are, based on the probabilities of the words assigned to them.
Below is the code we run in Python to assign a topic to each tweet:
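The original code was a screenshot; here is a sketch with scikit-learn, assuming the cleaned tweets are in clean_text and an arbitrary choice of 5 topics:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words matrix over the cleaned tweets
vectorizer = CountVectorizer(max_df=0.95, min_df=5)
dtm = vectorizer.fit_transform(df["clean_text"])

# Fit LDA; the number of topics (5) is an arbitrary assumption
lda = LatentDirichletAllocation(n_components=5, random_state=42)
doc_topics = lda.fit_transform(dtm)

# Assign each tweet its most probable topic
df["topic"] = doc_topics.argmax(axis=1)

# Print the top 15 words for each topic
words = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [words[j] for j in weights.argsort()[-15:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```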
Top 15 words associated with each topic:
countplot() of the topics assigned to the tweets:
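Assuming seaborn and the topic column created above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Number of tweets assigned to each LDA topic
sns.countplot(x="topic", data=df)
plt.title("Tweets per LDA topic")
plt.show()
```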