In this project we will be looking at a star wars survey, ‘star_wars.csv’. This project will focus on data cleaning so we have a data set ready for analysis. Let’s begin by reading the csv file and the first couple rows.
1
2
3
4
5
| import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
star_wars = pd.read_csv('star_wars.csv', encoding='ISO-8859-1')
star_wars.head(3)
|
| RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
|---|
| 0 | NaN | Response | Response | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | Star Wars: Episode I The Phantom Menace | ... | Yoda | Response | Response | Response | Response | Response | Response | Response | Response | Response |
|---|
| 1 | 3.292880e+09 | Yes | Yes | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
|---|
| 2 | 3.292880e+09 | No | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
|---|
3 rows × 38 columns
1
2
| print(star_wars.shape)
star_wars.columns
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
| (1187, 38)
Index(['RespondentID',
'Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?',
'Which of the following Star Wars films have you seen? Please select all that apply.',
'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
'Unnamed: 14',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
'Unnamed: 28', 'Which character shot first?',
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?Âæ',
'Do you consider yourself to be a fan of the Star Trek franchise?',
'Gender', 'Age', 'Household Income', 'Education',
'Location (Census Region)'],
dtype='object')
|
It looks like most of these columns are unnamed, but first we’ll remove any row without a RespondentID.
1
| star_wars = star_wars[star_wars['RespondentID'].notnull()]
|
Next we want to convert the “Yes” and “No” strings into booleans. We can use the .map() method along with a dictionary to replace the “Yes” string into True and the “No” string into False.
1
2
3
4
5
6
7
8
9
10
11
|
yes_no = {"Yes": True, "No": False, np.nan:False}
for col in [
"Have you seen any of the 6 films in the Star Wars franchise?",
"Do you consider yourself to be a fan of the Star Wars film franchise?"
]:
star_wars[col] = star_wars[col].map(yes_no)
star_wars.head()
|
| RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
|---|
| 1 | 3.292880e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
|---|
| 2 | 3.292880e+09 | False | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
|---|
| 3 | 3.292765e+09 | True | False | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | NaN | NaN | NaN | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
|---|
| 4 | 3.292763e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
|---|
| 5 | 3.292731e+09 | True | True | Star Wars: Episode I The Phantom Menace | Star Wars: Episode II Attack of the Clones | Star Wars: Episode III Revenge of the Sith | Star Wars: Episode IV A New Hope | Star Wars: Episode V The Empire Strikes Back | Star Wars: Episode VI Return of the Jedi | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
|---|
5 rows × 38 columns
Columns 4 to 9 have string values with the movie the respondant saw. Similar to how we cleaned columns 2-3, we want to convert these into booleans with the .map() method. In addition, we want to change the column names to reference the true or false question.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
| true_false = {
"Star Wars: Episode I The Phantom Menace": True,
"Star Wars: Episode II Attack of the Clones": True,
"Star Wars: Episode III Revenge of the Sith": True,
"Star Wars: Episode IV A New Hope": True,
"Star Wars: Episode V The Empire Strikes Back": True,
"Star Wars: Episode VI Return of the Jedi": True,
np.nan: False,
}
for col in star_wars.columns[3:9]:
star_wars[col] = star_wars[col].map(true_false)
star_wars.head()
|
| RespondentID | Have you seen any of the 6 films in the Star Wars franchise? | Do you consider yourself to be a fan of the Star Wars film franchise? | Which of the following Star Wars films have you seen? Please select all that apply. | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. | ... | Unnamed: 28 | Which character shot first? | Are you familiar with the Expanded Universe? | Do you consider yourself to be a fan of the Expanded Universe?Âæ | Do you consider yourself to be a fan of the Star Trek franchise? | Gender | Age | Household Income | Education | Location (Census Region) |
|---|
| 1 | 3.292880e+09 | True | True | True | True | True | True | True | True | 3 | ... | Very favorably | I don't understand this question | Yes | No | No | Male | 18-29 | NaN | High school degree | South Atlantic |
|---|
| 2 | 3.292880e+09 | False | False | False | False | False | False | False | False | NaN | ... | NaN | NaN | NaN | NaN | Yes | Male | 18-29 | $0 - $24,999 | Bachelor degree | West South Central |
|---|
| 3 | 3.292765e+09 | True | False | True | True | True | False | False | False | 1 | ... | Unfamiliar (N/A) | I don't understand this question | No | NaN | No | Male | 18-29 | $0 - $24,999 | High school degree | West North Central |
|---|
| 4 | 3.292763e+09 | True | True | True | True | True | True | True | True | 5 | ... | Very favorably | I don't understand this question | No | NaN | Yes | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
|---|
| 5 | 3.292731e+09 | True | True | True | True | True | True | True | True | 5 | ... | Somewhat favorably | Greedo | Yes | No | No | Male | 18-29 | $100,000 - $149,999 | Some college or Associate degree | West North Central |
|---|
5 rows × 38 columns
1
2
3
4
5
6
7
8
9
10
11
| #Change the column names with the .rename() method
star_wars = star_wars.rename(columns={
'Which of the following Star Wars films have you seen? Please select all that apply.': "seen_1",
"Unnamed: 4": "seen_2",
"Unnamed: 5": "seen_3",
"Unnamed: 6": "seen_4",
"Unnamed: 7": "seen_5",
"Unnamed: 8": "seen_6",
})
star_wars.columns
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
| Index(['RespondentID',
'Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?',
'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.',
'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
'Unnamed: 14',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
'Unnamed: 28', 'Which character shot first?',
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?Âæ',
'Do you consider yourself to be a fan of the Star Trek franchise?',
'Gender', 'Age', 'Household Income', 'Education',
'Location (Census Region)'],
dtype='object')
|
We’ve successfully cleaned up columns 1-9. Let’s check the data types of the rest of the dataframe.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
| RespondentID float64
Have you seen any of the 6 films in the Star Wars franchise? bool
Do you consider yourself to be a fan of the Star Wars film franchise? bool
seen_1 bool
seen_2 bool
seen_3 bool
seen_4 bool
seen_5 bool
seen_6 bool
Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. object
Unnamed: 10 object
Unnamed: 11 object
Unnamed: 12 object
Unnamed: 13 object
Unnamed: 14 object
Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. object
Unnamed: 16 object
Unnamed: 17 object
Unnamed: 18 object
Unnamed: 19 object
Unnamed: 20 object
Unnamed: 21 object
Unnamed: 22 object
Unnamed: 23 object
Unnamed: 24 object
Unnamed: 25 object
Unnamed: 26 object
Unnamed: 27 object
Unnamed: 28 object
Which character shot first? object
Are you familiar with the Expanded Universe? object
Do you consider yourself to be a fan of the Expanded Universe?Âæ object
Do you consider yourself to be a fan of the Star Trek franchise? object
Gender object
Age object
Household Income object
Education object
Location (Census Region) object
dtype: object
|
Columns 10 to 16 are movie ranking values. Again, we can change the column names using the .rename() method. In addition, columns 10-16 are current listed as ‘Object’ we want to convert the values in these columns into float.
1
2
3
4
5
6
7
8
9
10
11
12
| star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)
star_wars = star_wars.rename(columns={
'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.': "ranking_1",
"Unnamed: 10": "ranking_2",
"Unnamed: 11": "ranking_3",
"Unnamed: 12": "ranking_4",
"Unnamed: 13": "ranking_5",
"Unnamed: 14": "ranking_6",
})
star_wars.columns
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
| Index(['RespondentID',
'Have you seen any of the 6 films in the Star Wars franchise?',
'Do you consider yourself to be a fan of the Star Wars film franchise?',
'seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6', 'ranking_1',
'ranking_2', 'ranking_3', 'ranking_4', 'ranking_5', 'ranking_6',
'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.',
'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
'Unnamed: 28', 'Which character shot first?',
'Are you familiar with the Expanded Universe?',
'Do you consider yourself to be a fan of the Expanded Universe?Âæ',
'Do you consider yourself to be a fan of the Star Trek franchise?',
'Gender', 'Age', 'Household Income', 'Education',
'Location (Census Region)'],
dtype='object')
|
1
2
3
4
5
| means = star_wars[star_wars.columns[9:15]].mean()
%matplotlib inline
plt.bar(range(1,7), means)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
|
1
| Text(0,0.5,'Average Ranking')
|
Column 10 contains the following string: ‘Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.’
So columns with lower ranking values are considered better by the survey respondants. From the chart, it looks like the older movies(#4-6) have higher rankngs than the newer star war movies(#1-3).
1
2
3
4
| sums = star_wars[star_wars.columns[3:9]].sum()
plt.bar(range(1,7), sums)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
|
1
| Text(0,0.5,'Total Respondants')
|
Same thing here, more respondants saw the original movies(#4-6), and they were ranked higher. Keep in a mind a lower value for average ranking means the respondant liked the movie more.
Let’s do the same analysis again, but seperate the plots by gender.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| star_wars_males = males = star_wars[star_wars["Gender"] == "Male"]
star_wars_females = females = star_wars[star_wars["Gender"] == "Female"]
means_males = star_wars_males[star_wars_males.columns[9:15]].mean()
plt.bar(range(1,7), means_males)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
means_females = star_wars_females[star_wars_females.columns[9:15]].mean()
plt.bar(range(1,7), means_females)
plt.xlabel("Movie #")
plt.ylabel('Average Ranking')
plt.show()
|
1
2
3
4
5
6
7
8
9
10
11
| sums_males = star_wars_males[star_wars_males.columns[3:9]].sum()
plt.bar(range(1, 7), sums_males)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
plt.show()
sums_females = star_wars_females[star_wars_females.columns[3:9]].sum()
plt.bar(range(1, 7), sums_females)
plt.xlabel("Movie #")
plt.ylabel('Total Respondants')
plt.show()
|
More males saw the prequel movies (#1-3) and they gave higher ratings than females. Both groups gave high ratings for original movies (#4-6).
Learning Summary
Python concepts explored: pandas, matplotlib.pyplot, data cleaning, string manipulation, bar plots
Python functions and methods used: .read_csv(), .columns, notnull, map(), .dtypes, .rename, astype(), .mean(), .sum(), .xlabel(), .ylabel()
The files used for this project can be found in my GitHub repository.