
Dataquest Guided Project - Analyzing Movie Reviews

In this project, we will analyze several movie review websites using “fandango_score_comparison.csv”. We will use descriptive statistics to compare Fandango with other review websites. We’ll also use linear regression to predict Fandango review scores from other sites’ review scores.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

movies = pd.read_csv('fandango_score_comparison.csv')
movies.head()
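| | FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | IMDB_norm | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avengers: Age of Ultron (2015) | 74 | 86 | 66 | 7.1 | 7.8 | 5.0 | 4.5 | 3.70 | 4.3 | ... | 3.90 | 3.5 | 4.5 | 3.5 | 3.5 | 4.0 | 1330 | 271107 | 14846 | 0.5 |
| 1 | Cinderella (2015) | 85 | 80 | 67 | 7.5 | 7.1 | 5.0 | 4.5 | 4.25 | 4.0 | ... | 3.55 | 4.5 | 4.0 | 3.5 | 4.0 | 3.5 | 249 | 65709 | 12640 | 0.5 |
| 2 | Ant-Man (2015) | 80 | 90 | 64 | 8.1 | 7.8 | 5.0 | 4.5 | 4.00 | 4.5 | ... | 3.90 | 4.0 | 4.5 | 3.0 | 4.0 | 4.0 | 627 | 103660 | 12055 | 0.5 |
| 3 | Do You Believe? (2015) | 18 | 84 | 22 | 4.7 | 5.4 | 5.0 | 4.5 | 0.90 | 4.2 | ... | 2.70 | 1.0 | 4.0 | 1.0 | 2.5 | 2.5 | 31 | 3136 | 1793 | 0.5 |
| 4 | Hot Tub Time Machine 2 (2015) | 14 | 28 | 29 | 3.4 | 5.1 | 3.5 | 3.0 | 0.70 | 1.4 | ... | 2.55 | 0.5 | 1.5 | 1.5 | 1.5 | 2.5 | 88 | 19560 | 1021 | 0.5 |

5 rows × 22 columns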

First, we’ll use a histogram to see the distribution of ratings for “Fandango_Stars” and “Metacritic_norm_round”.

mc = movies['Metacritic_norm_round']
fd = movies['Fandango_Stars']

plt.hist(mc, 5)
plt.show()

plt.hist(fd, 5)
plt.show()

[Figure: histogram of Metacritic_norm_round]

[Figure: histogram of Fandango_Stars]
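One caveat with `plt.hist(mc, 5)` and `plt.hist(fd, 5)`: passing an integer asks matplotlib for that many equal-width bins spanning each series' own minimum and maximum, so the two histograms do not share bin edges. As a rough sketch of a like-for-like comparison, the snippet below counts ratings per half-star value on a shared 0 to 5 scale. It assumes every rating is a multiple of 0.5; the helper name `half_star_counts` and both sample lists are made up for illustration and are not taken from the CSV.

```python
from collections import Counter

def half_star_counts(ratings, lo=0.0, hi=5.0, step=0.5):
    """Count how many ratings fall on each half-star value from lo to hi."""
    counts = Counter(ratings)
    n_steps = int(round((hi - lo) / step))
    # Multiples of 0.5 are exact in floating point, so they are safe dict keys
    return {lo + i * step: counts.get(lo + i * step, 0) for i in range(n_steps + 1)}

# Hypothetical ratings, not values from fandango_score_comparison.csv
sample_fd = [3.5, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0]
sample_mc = [1.5, 2.0, 3.0, 3.0, 3.5, 4.0, 4.5]

fd_counts = half_star_counts(sample_fd)
mc_counts = half_star_counts(sample_mc)

# Print both sites side by side on the same scale
for value in fd_counts:
    print(value, fd_counts[value], mc_counts[value])
```

With the real data, something like `movies['Fandango_Stars'].value_counts().sort_index()` should give comparable counts directly in pandas.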

It looks like Fandango has higher overall ratings than Metacritic, but histograms alone aren’t enough to show that. We can calculate the mean, median, and standard deviation for the two websites using pandas Series methods.

mean_fd = fd.mean()
mean_mc = mc.mean()
median_fd = fd.median()
median_mc = mc.median()
std_fd = fd.std()
std_mc = mc.std()

print("means", mean_fd, mean_mc)
print("medians", median_fd, median_mc)
print("std_devs", std_fd, std_mc)

means 4.08904109589 2.97260273973
medians 4.0 3.0
std_devs 0.540385977979 0.990960561374

A couple of things to note here:

  • Fandango does not disclose how it calculates its ratings, whereas Metacritic takes a weighted average of all the published critic scores.

  • The mean and the median for Fandango are both much higher than Metacritic’s (means of about 4.09 versus 2.97, medians of 4.0 versus 3.0). I’d imagine their scores are influenced by studios and inflated to get people on the website to watch the movies, though this data alone can’t confirm that.

  • The standard deviation for Fandango is also lower (about 0.54 versus 0.99) because most of its ratings are clustered at the high end of the scale.

  • Metacritic, on the other hand, has a median of 3.0 and a mean of about 2.97. A mean this close to the median is what you would expect from a roughly symmetric distribution.
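A small detail behind the standard deviations printed above: pandas' `Series.std()` divides by n - 1 by default (the sample standard deviation), while `numpy.std` divides by n unless `ddof=1` is passed. The sketch below uses Python's built-in `statistics` module to show the two conventions side by side. The `ratings` list is made up for illustration and is not taken from the CSV.

```python
import statistics

# Hypothetical half-star ratings, not values from the dataset
ratings = [3.0, 3.5, 4.0, 4.0, 4.5, 5.0]

mean_rating = statistics.mean(ratings)
median_rating = statistics.median(ratings)

# n - 1 in the denominator, the same convention as pandas Series.std()
sample_std = statistics.stdev(ratings)

# n in the denominator, the same convention as numpy.std() with default ddof=0
population_std = statistics.pstdev(ratings)

print(mean_rating, median_rating)
print(sample_std, population_std)
```

For a few hundred rows the two values are close, but it is worth knowing which one is being reported when comparing numbers across libraries.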

Let’s make a scatter plot of the Fandango and Metacritic scores to see whether they appear to be correlated.

plt.scatter(fd, mc)
plt.show()

[Figure: scatter plot of Fandango_Stars against Metacritic_norm_round]

# Absolute gap between the Fandango and Metacritic ratings for each film
movies['fm_diff'] = np.absolute(fd - mc)

# Show the five films with the largest gap
movies.sort_values(by='fm_diff', ascending=False).head(5)
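| | FILM | RottenTomatoes | RottenTomatoes_User | Metacritic | Metacritic_User | IMDB | Fandango_Stars | Fandango_Ratingvalue | RT_norm | RT_user_norm | ... | RT_norm_round | RT_user_norm_round | Metacritic_norm_round | Metacritic_user_norm_round | IMDB_norm_round | Metacritic_user_vote_count | IMDB_user_vote_count | Fandango_votes | Fandango_Difference | fm_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | Do You Believe? (2015) | 18 | 84 | 22 | 4.7 | 5.4 | 5.0 | 4.5 | 0.90 | 4.20 | ... | 1.0 | 4.0 | 1.0 | 2.5 | 2.5 | 31 | 3136 | 1793 | 0.5 | 4.0 |
| 85 | Little Boy (2015) | 20 | 81 | 30 | 5.9 | 7.4 | 4.5 | 4.3 | 1.00 | 4.05 | ... | 1.0 | 4.0 | 1.5 | 3.0 | 3.5 | 38 | 5927 | 811 | 0.2 | 3.0 |
| 47 | Annie (2014) | 27 | 61 | 33 | 4.8 | 5.2 | 4.5 | 4.2 | 1.35 | 3.05 | ... | 1.5 | 3.0 | 1.5 | 2.5 | 2.5 | 108 | 19222 | 6835 | 0.3 | 3.0 |
| 19 | Pixels (2015) | 17 | 54 | 27 | 5.3 | 5.6 | 4.5 | 4.1 | 0.85 | 2.70 | ... | 1.0 | 2.5 | 1.5 | 2.5 | 3.0 | 246 | 19521 | 3886 | 0.4 | 3.0 |
| 134 | The Longest Ride (2015) | 31 | 73 | 33 | 4.8 | 7.2 | 4.5 | 4.5 | 1.55 | 3.65 | ... | 1.5 | 3.5 | 1.5 | 2.5 | 3.5 | 49 | 25214 | 2603 | 0.0 | 3.0 |

5 rows × 23 columns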
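The same calculation can be sketched without pandas. The snippet below uses the Fandango_Stars and Metacritic_norm_round values of six films that appear in the two tables in this post. The list name `sample` and the helper `largest_gaps` are illustrative and not part of the original notebook.

```python
# (film, Fandango_Stars, Metacritic_norm_round) copied from the tables in this post
sample = [
    ("Avengers: Age of Ultron (2015)", 5.0, 3.5),
    ("Cinderella (2015)", 5.0, 3.5),
    ("Ant-Man (2015)", 5.0, 3.0),
    ("Do You Believe? (2015)", 5.0, 1.0),
    ("Hot Tub Time Machine 2 (2015)", 3.5, 1.5),
    ("Little Boy (2015)", 4.5, 1.5),
]

def largest_gaps(rows, k):
    """Return the k (film, gap) pairs with the largest absolute rating gap."""
    with_gap = [(film, abs(fd_score - mc_score)) for film, fd_score, mc_score in rows]
    return sorted(with_gap, key=lambda pair: pair[1], reverse=True)[:k]

top = largest_gaps(sample, 2)
for film, gap in top:
    print(film, gap)
```

Taking the absolute value means a film rated much lower on Fandango than on Metacritic would rank just as high as one rated much higher, although in this sample Fandango is always the higher of the two.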

It looks like the difference can get as high as 4.0, with several films at 3.0. Next, we should calculate the correlation between the two websites’ scores. We can do this with the pearsonr() function from scipy.stats.

import scipy.stats as sci

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = sci.pearsonr(mc, fd)
print(r)
print(p_value)

0.178449190739
0.0311615162285

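If both movie review sites used similar methods to rate their movies, we would expect a strong correlation. Here r is only about 0.18, a weak positive correlation, which suggests that the two websites rate movies very differently. The second value, about 0.031, is the p-value for the test that the true correlation is zero.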
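To make it clearer what the correlation coefficient measures, here is a minimal sketch that computes Pearson's r directly from its definition: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the sample lists are hypothetical and only serve to illustrate the formula; they are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator terms)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
perfectly_related = [2.0, 4.0, 6.0, 8.0, 10.0]
unrelated = [4.0, 4.5, 3.5, 4.5, 4.0]

r_perfect = pearson_r(xs, perfectly_related)
r_unrelated = pearson_r(xs, unrelated)
print(r_perfect, r_unrelated)
```

Values near 1 or -1 mean the points lie close to a straight line, while values near 0, like the 0.18 found here, mean a straight line describes the relationship poorly.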

With a correlation this low, a linear regression won’t predict very accurately. However, let’s fit one anyway for the sake of practice.
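One way to put a number on "not very accurate": for a straight-line fit with a single predictor, the share of variance explained equals the square of Pearson's r. The short calculation below applies that to the r value printed earlier in this post.

```python
# Pearson r between the two sets of scores, as printed earlier in this post
r = 0.178449190739

# For a one-predictor linear fit, R-squared is simply r squared
r_squared = r ** 2
print(round(r_squared, 3))
```

In other words, a line fitted on Metacritic scores would account for only about 3% of the variation in Fandango scores.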

m, b, r, p, stderr = sci.linregress(mc, fd)

# Evaluate the fitted line y = m*x + b at x = 3 (a Metacritic score of 3)
pred_3 = m*3 + b
pred_3

4.0917071528212041

pred_1 = m*1 + b
print(pred_1)
pred_5 = m*5 + b
print(pred_5)

3.89708499687
4.28632930877

By fitting a linear regression, we can predict the Fandango score from the Metacritic score. Here the predictions only move from about 3.90 to 4.29 across the whole Metacritic range. However, it is important to keep in mind that when the correlation is low, the model’s predictions will not be very accurate.
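For reference, the slope and intercept of an ordinary least-squares line can also be computed by hand, which shows what a fit like this is doing. The sketch below is not the code used in this project; the helper names `fit_line` and `predict` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: summed cross-deviations divided by summed squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b

def predict(m, b, x):
    """Evaluate the fitted line at x."""
    return m * x + b

# Hypothetical (x, y) scores for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.5, 4.0, 4.0, 4.5, 4.5]

m_hat, b_hat = fit_line(xs, ys)
print(m_hat, b_hat, predict(m_hat, b_hat, 3.0))
```

Because the line passes through the means, a prediction at the average x value always returns the average y value, which is why the prediction at a Metacritic score of 3 in this post sits so close to the Fandango mean.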

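# Endpoints of the fitted line: Metacritic score (x), predicted Fandango score (y)
x_pred = [1.0, 5.0]
y_pred = [pred_1, pred_5]

# Metacritic goes on the x-axis so the points line up with the regression line
plt.scatter(mc, fd)
plt.plot(x_pred, y_pred)
plt.xlabel('Metacritic_norm_round')
plt.ylabel('Fandango_Stars')
plt.show()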

[Figure: scatter plot of the two scores with the fitted regression line]


Learning Summary

Concepts explored: pandas, descriptive statistics, numpy, matplotlib, scipy, correlations

Functions and methods used: .sort_values(), sci.linregress(), sci.pearsonr(), .hist(), .absolute(), .mean(), .median(), .std()

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.