Topic detection and sentiment analysis for Netflix descriptions

Omkar Gawade
Feb 28, 2022

Netflix is one of the most popular media and video streaming platforms. As of mid-2021 it had over 8,000 movies and TV shows available on the platform and over 200M subscribers globally.

In the analysis below we will look at the text data in the movie/series description variable and apply natural language processing to derive a compound sentiment score and topics.

Dataset:

This tabular dataset consists of listings of all the movies and TV shows available on Netflix. Below are the columns present in the dataset.
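A minimal loading step with pandas might look like this (the netflix_titles.csv file name is an assumption; point it at your local copy of the dataset):

```python
import pandas as pd

# Load the Netflix listings dataset (file name is an assumption;
# adjust the path to your local copy).
df = pd.read_csv("netflix_titles.csv")

# List the columns present in the dataset and its size.
print(df.columns.tolist())
print(df.shape)
```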

Data cleaning:

Removing null values: Null values reduce model performance and skew the data. The director, cast and country columns carry the most null values, so we drop the redundant director and cast columns, and in the next step we use the dropna() function to remove all remaining rows with null values.
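A sketch of that cleanup, continuing with the df DataFrame from the loading step:

```python
# Count missing values per column; director, cast and country
# carry the most nulls in this dataset.
print(df.isna().sum().sort_values(ascending=False))

# Drop the sparsely populated director and cast columns, then
# remove any remaining rows that still contain null values.
df = df.drop(columns=["director", "cast"])
df = df.dropna()
```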

Removing redundant variables: Variables that contain data unimportant to this analysis reduce model performance. The columns ‘show_id’, ‘type’, ‘country’, ‘rating’, ‘date_added’, ‘duration’ and ‘listed_in’ were removed.
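Dropping those columns is a single call on the same DataFrame:

```python
# Drop the columns that are not needed for the text analysis.
df = df.drop(columns=["show_id", "type", "country", "rating",
                      "date_added", "duration", "listed_in"])
```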

Data filters to reduce high variance: To reduce high variance we apply filters to columns and cut down the number of rows. From the figure below we can see that the data is skewed towards the right, so to normalize it we filter to release_year > 1980.

Distribution plot of release_year
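A sketch of the plot and the filter; the seaborn histplot call is an assumption about how the distribution figure was produced:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of release_year to see the skew.
sns.histplot(df["release_year"])
plt.show()

# Keep only titles released after 1980.
df = df[df["release_year"] > 1980]
```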

Text Data Cleaning:

Text data is complex and comes in natural language (English, German, French, etc.). It is unstructured by nature, so we need to clean and transform it before Python can interpret it.

Below are the steps we take to clean text data:

Tokenization: Tokenization is the process of splitting a statement into its individual words. We import nltk’s word_tokenize function to create tokens for every movie/series description in the dataset.

Removing stop words: Stop words are words which do not add any value to the analysis and are merely fillers in natural language, for example ‘the’, ‘a’, ‘an’, ‘is’. To remove stop words we need to import a stop word list or create one manually. I have imported the stop word list from the spaCy text modelling library and written a function that tokenizes each description and removes the tokens that appear in spaCy’s stop word list.

Lemmatization: It’s the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word’s lemma, or dictionary form.

Below is the code we run to apply tokenization, stop word removal and lemmatization to the description text in the dataset.
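A sketch along those lines, assuming nltk’s word_tokenize for tokenization, spaCy’s STOP_WORDS set for stop word removal, and nltk’s WordNetLemmatizer for lemmatization (the lemmatizer choice and the clean_description column name are assumptions):

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

nltk.download("punkt")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Tokenize the description into individual lowercase tokens.
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not in spaCy's stop word list.
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    # Reduce each remaining token to its lemma (dictionary form).
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

# Apply the cleaning pipeline to every description.
df["clean_description"] = df["description"].apply(clean_text)
```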

Below is the text before and after applying the cleaning steps:

Sentiment Analysis with VADER:

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER combines a sentiment lexicon, a list of lexical features (e.g., words) labelled according to their semantic orientation as either positive or negative, with a set of grammatical rules. VADER does not just tell us whether a text is positive or negative; it also tells us how positive or negative it is by returning a positive score, a negative score, a neutral score and a compound score, which is a normalized combination of the word-level valence scores ranging from -1 (most negative) to +1 (most positive).

Below is a code snippet to apply VADER sentiment analysis:
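A sketch using nltk’s bundled VADER implementation; the ±0.05 thresholds for bucketing the compound score follow VADER’s commonly suggested convention, and the compound and Sentiment_type column names mirror the ones mentioned below:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()

# polarity_scores returns neg, neu, pos and compound values;
# we keep the compound score for each description.
df["compound"] = df["description"].apply(
    lambda text: analyzer.polarity_scores(text)["compound"]
)

# Bucket the compound score into a sentiment label, using the
# commonly suggested +-0.05 thresholds.
def sentiment_type(score):
    if score >= 0.05:
        return "Positive"
    if score <= -0.05:
        return "Negative"
    return "Neutral"

df["Sentiment_type"] = df["compound"].apply(sentiment_type)

# Preview a few rows with the new columns.
print(df[["description", "compound", "Sentiment_type"]].head(10))
```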

Below is a snippet of 10 rows with the compound sentiment score and Sentiment_type:

Topic Detection with LDA:

Latent Dirichlet Allocation (LDA) is an unsupervised learning algorithm that represents each document as a mixture of topics, where each topic generates words with certain probabilities.

Cons of LDA:

  1. The user must define the number of topics to be created.
  2. The user must interpret what each topic represents based on the probabilities of the words assigned to it.

For Python to work with the text data we need to vectorize it using CountVectorizer; after a fit_transform we get a large matrix of word counts (a document-term matrix).
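A sketch of the vectorization step, assuming the cleaned text lives in the clean_description column; the max_df and min_df settings are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Turn the cleaned descriptions into a document-term matrix of
# word counts; max_df/min_df values are illustrative.
cv = CountVectorizer(max_df=0.95, min_df=2)
dtm = cv.fit_transform(df["clean_description"])
print(dtm.shape)  # (number of descriptions, vocabulary size)
```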

We import LDA from the sklearn library, create an instance and fit it to the document-term matrix produced by CountVectorizer:
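For example (the choice of 7 topics and the random_state value are illustrative, not the article’s exact settings):

```python
from sklearn.decomposition import LatentDirichletAllocation

# The number of topics has to be chosen up front; 7 here is an
# illustrative guess.
lda = LatentDirichletAllocation(n_components=7, random_state=42)
lda.fit(dtm)
```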

Printing the top 15 words associated with each topic:
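One way to do that with the fitted lda model and the cv vectorizer from the previous steps:

```python
# For each topic, print the 15 words with the highest weights.
feature_names = cv.get_feature_names_out()
for index, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[-15:]]
    print(f"Topic #{index}: {', '.join(top_words)}")
```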

The next step is manual and subjective: the user interprets the highest-probability words associated with each topic listed above and assigns each topic a category name.

We transform the dtm matrix using the LDA model we created, which assigns each description a probability for every topic. We then take the top topic for each description using the argmax() function.
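A sketch of that assignment step; the topic column name and the commented-out label mapping are placeholders:

```python
# Each row of topic_results holds the probability of every topic
# for the corresponding description.
topic_results = lda.transform(dtm)

# Assign each description its highest-probability topic index.
df["topic"] = topic_results.argmax(axis=1)

# Map topic indices to the manually chosen category names; the
# labels below are placeholders, not actual results.
# topic_labels = {0: "crime", 1: "family", 2: "romance", ...}
# df["topic_name"] = df["topic"].map(topic_labels)
```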

Output:
