Post

Dataquest Guided Project - Star Wars Survey

In this project we will be looking at a star wars survey, ‘star_wars.csv’. This project will focus on data cleaning so we have a data set ready for analysis. Let’s begin by reading the csv file and the first couple rows.

1
2
3
4
5
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(3)
RespondentIDHave you seen any of the 6 films in the Star Wars franchise?Do you consider yourself to be a fan of the Star Wars film franchise?Which of the following Star Wars films have you seen? Please select all that apply.Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film....Unnamed: 28Which character shot first?Are you familiar with the Expanded Universe?Do you consider yourself to be a fan of the Expanded Universe?ξDo you consider yourself to be a fan of the Star Trek franchise?GenderAgeHousehold IncomeEducationLocation (Census Region)
0NaNResponseResponseStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithStar Wars: Episode IV A New HopeStar Wars: Episode V The Empire Strikes BackStar Wars: Episode VI Return of the JediStar Wars: Episode I The Phantom Menace...YodaResponseResponseResponseResponseResponseResponseResponseResponseResponse
13.292880e+09YesYesStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithStar Wars: Episode IV A New HopeStar Wars: Episode V The Empire Strikes BackStar Wars: Episode VI Return of the Jedi3...Very favorablyI don't understand this questionYesNoNoMale18-29NaNHigh school degreeSouth Atlantic
23.292880e+09NoNaNNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNYesMale18-29$0 - $24,999Bachelor degreeWest South Central

3 rows × 38 columns

1
2
print(star_wars.shape)
star_wars.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
(1187, 38)





Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'Which of the following Star Wars films have you seen? Please select all that apply.',
       'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

It looks like most of these columns are unnamed, but first we’ll remove any row without a RespondentID.

1
star_wars = star_wars[star_wars['RespondentID'].notnull()]

Next we want to convert the “Yes” and “No” strings into booleans. We can use the .map() method along with a dictionary to replace the “Yes” string into True and the “No” string into False.

1
2
3
4
5
6
7
8
9
10
11
yes_no = {"Yes": True, "No": False, np.nan:False}

for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?"
    ]:
    star_wars[col] = star_wars[col].map(yes_no)

star_wars.head()

RespondentIDHave you seen any of the 6 films in the Star Wars franchise?Do you consider yourself to be a fan of the Star Wars film franchise?Which of the following Star Wars films have you seen? Please select all that apply.Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film....Unnamed: 28Which character shot first?Are you familiar with the Expanded Universe?Do you consider yourself to be a fan of the Expanded Universe?ξDo you consider yourself to be a fan of the Star Trek franchise?GenderAgeHousehold IncomeEducationLocation (Census Region)
13.292880e+09TrueTrueStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithStar Wars: Episode IV A New HopeStar Wars: Episode V The Empire Strikes BackStar Wars: Episode VI Return of the Jedi3...Very favorablyI don't understand this questionYesNoNoMale18-29NaNHigh school degreeSouth Atlantic
23.292880e+09FalseFalseNaNNaNNaNNaNNaNNaNNaN...NaNNaNNaNNaNYesMale18-29$0 - $24,999Bachelor degreeWest South Central
33.292765e+09TrueFalseStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithNaNNaNNaN1...Unfamiliar (N/A)I don't understand this questionNoNaNNoMale18-29$0 - $24,999High school degreeWest North Central
43.292763e+09TrueTrueStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithStar Wars: Episode IV A New HopeStar Wars: Episode V The Empire Strikes BackStar Wars: Episode VI Return of the Jedi5...Very favorablyI don't understand this questionNoNaNYesMale18-29$100,000 - $149,999Some college or Associate degreeWest North Central
53.292731e+09TrueTrueStar Wars: Episode I The Phantom MenaceStar Wars: Episode II Attack of the ClonesStar Wars: Episode III Revenge of the SithStar Wars: Episode IV A New HopeStar Wars: Episode V The Empire Strikes BackStar Wars: Episode VI Return of the Jedi5...Somewhat favorablyGreedoYesNoNoMale18-29$100,000 - $149,999Some college or Associate degreeWest North Central

5 rows × 38 columns

Columns 4 to 9 have string values with the movie the respondant saw. Similar to how we cleaned columns 2-3, we want to convert these into booleans with the .map() method. In addition, we want to change the column names to reference the true or false question.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
true_false = {
    "Star Wars: Episode I  The Phantom Menace": True,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    np.nan: False,
}
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(true_false)
    
star_wars.head()

RespondentIDHave you seen any of the 6 films in the Star Wars franchise?Do you consider yourself to be a fan of the Star Wars film franchise?Which of the following Star Wars films have you seen? Please select all that apply.Unnamed: 4Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film....Unnamed: 28Which character shot first?Are you familiar with the Expanded Universe?Do you consider yourself to be a fan of the Expanded Universe?ξDo you consider yourself to be a fan of the Star Trek franchise?GenderAgeHousehold IncomeEducationLocation (Census Region)
13.292880e+09TrueTrueTrueTrueTrueTrueTrueTrue3...Very favorablyI don't understand this questionYesNoNoMale18-29NaNHigh school degreeSouth Atlantic
23.292880e+09FalseFalseFalseFalseFalseFalseFalseFalseNaN...NaNNaNNaNNaNYesMale18-29$0 - $24,999Bachelor degreeWest South Central
33.292765e+09TrueFalseTrueTrueTrueFalseFalseFalse1...Unfamiliar (N/A)I don't understand this questionNoNaNNoMale18-29$0 - $24,999High school degreeWest North Central
43.292763e+09TrueTrueTrueTrueTrueTrueTrueTrue5...Very favorablyI don't understand this questionNoNaNYesMale18-29$100,000 - $149,999Some college or Associate degreeWest North Central
53.292731e+09TrueTrueTrueTrueTrueTrueTrueTrue5...Somewhat favorablyGreedoYesNoNoMale18-29$100,000 - $149,999Some college or Associate degreeWest North Central

5 rows × 38 columns

1
2
3
4
5
6
7
8
9
10
11
#Change the column names with the .rename() method
star_wars = star_wars.rename(columns={
    'Which of the following Star Wars films have you seen? Please select all that apply.': "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6",
    })

star_wars.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
       'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

We’ve successfully cleaned up columns 1-9. Let’s check the data types of the rest of the dataframe.

1
star_wars.dtypes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
RespondentID                                                                                                                                     float64
Have you seen any of the 6 films in the Star Wars franchise?                                                                                        bool
Do you consider yourself to be a fan of the Star Wars film franchise?                                                                               bool
seen_1                                                                                                                                              bool
seen_2                                                                                                                                              bool
seen_3                                                                                                                                              bool
seen_4                                                                                                                                              bool
seen_5                                                                                                                                              bool
seen_6                                                                                                                                              bool
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.     object
Unnamed: 10                                                                                                                                       object
Unnamed: 11                                                                                                                                       object
Unnamed: 12                                                                                                                                       object
Unnamed: 13                                                                                                                                       object
Unnamed: 14                                                                                                                                       object
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.                                    object
Unnamed: 16                                                                                                                                       object
Unnamed: 17                                                                                                                                       object
Unnamed: 18                                                                                                                                       object
Unnamed: 19                                                                                                                                       object
Unnamed: 20                                                                                                                                       object
Unnamed: 21                                                                                                                                       object
Unnamed: 22                                                                                                                                       object
Unnamed: 23                                                                                                                                       object
Unnamed: 24                                                                                                                                       object
Unnamed: 25                                                                                                                                       object
Unnamed: 26                                                                                                                                       object
Unnamed: 27                                                                                                                                       object
Unnamed: 28                                                                                                                                       object
Which character shot first?                                                                                                                       object
Are you familiar with the Expanded Universe?                                                                                                      object
Do you consider yourself to be a fan of the Expanded Universe?ξ                                                                                object
Do you consider yourself to be a fan of the Star Trek franchise?                                                                                  object
Gender                                                                                                                                            object
Age                                                                                                                                               object
Household Income                                                                                                                                  object
Education                                                                                                                                         object
Location (Census Region)                                                                                                                          object
dtype: object

Columns 10 to 16 are movie ranking values. Again, we can change the column names using the .rename() method. In addition, columns 10-16 are current listed as ‘Object’ we want to convert the values in these columns into float.

1
2
3
4
5
6
7
8
9
10
11
12
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)

star_wars = star_wars.rename(columns={
    'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': "ranking_1",
    "Unnamed: 10": "ranking_2",
    "Unnamed: 11": "ranking_3",
    "Unnamed: 12": "ranking_4",
    "Unnamed: 13": "ranking_5",
    "Unnamed: 14": "ranking_6",
    })

star_wars.columns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Index(['RespondentID',
       'Have you seen any of the 6 films in the Star Wars franchise?',
       'Do you consider yourself to be a fan of the Star Wars film franchise?',
       'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
       'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
       'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Which character shot first?',
       'Are you familiar with the Expanded Universe?',
       'Do you consider yourself to be a fan of the Expanded Universe?ξ',
       'Do you consider yourself to be a fan of the Star Trek franchise?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')
1
2
3
4
5
means = star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
plt.bar(range(1,7), means)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
1
Text(0,0.5,'Average Ranking')

png

Column 10 contains the following string: ‘Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.’

So columns with lower ranking values are considered better by the survey respondants. From the chart, it looks like the older movies(#4-6) have higher rankngs than the newer star war movies(#1-3).

1
2
3
4
sums = star_wars[star_wars.columns[3:9]].sum()
plt.bar(range(1,7), sums)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
1
Text(0,0.5,'Total Respondants')

png

Same thing here, more respondants saw the original movies(#4-6), and they were ranked higher. Keep in a mind a lower value for average ranking means the respondant liked the movie more.

Let’s do the same analysis again, but seperate the plots by gender.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
star_wars_males = males = star_wars[star_wars["Gender"] == "Male"]
star_wars_females = females = star_wars[star_wars["Gender"] == "Female"]

means_males = star_wars_males[star_wars_males.columns[9:15]].mean()
plt.bar(range(1,7), means_males)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()


means_females = star_wars_females[star_wars_females.columns[9:15]].mean()
plt.bar(range(1,7), means_females)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()

png

png

1
2
3
4
5
6
7
8
9
10
11
sums_males = star_wars_males[star_wars_males.columns[3:9]].sum()
plt.bar(range(1, 7), sums_males)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
plt.show()

sums_females = star_wars_females[star_wars_females.columns[3:9]].sum()
plt.bar(range(1, 7), sums_females)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
plt.show()

png

png

More males saw the prequel movies (#1-3) and they gave higher ratings than females. Both groups gave high ratings for original movies (#4-6).


Learning Summary

Python concepts explored: pandas, matplotlib.pyplot, data cleaning, string manipulation, bar plots

Python functions and methods used: .read_csv(), .columns, notnull, map(), .dtypes, .rename, astype(), .mean(), .sum(), .xlabel(), .ylabel()

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.