Dataquest Guided Project - Preparing Data For SQLite
In this project, we will prepare data in preparation for SQL. We’ll have to first clean the data.
1
2
3
4
5
6
import pandas as pd
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')
#Turns off warnings for potentially confusing assignments
pd.options.mode.chained_assignment = None # default='warn'
df.head()
| Year | Category | Nominee | Additional Info | Won? | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 (83rd) | Actor -- Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2010 (83rd) | Actor -- Leading Role | Jeff Bridges | True Grit {'Rooster Cogburn'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2010 (83rd) | Actor -- Leading Role | Jesse Eisenberg | The Social Network {'Mark Zuckerberg'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2010 (83rd) | Actor -- Leading Role | Colin Firth | The King's Speech {'King George VI'} | YES | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2010 (83rd) | Actor -- Leading Role | James Franco | 127 Hours {'Aron Ralston'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
This is not considered tidy data. Here are the issues:
We have string values in the year column, we should change these into integers.
Additional Info has the movie name and the role name, this can be split into two columns.
The “Won?” column change should be changed to 0’s and 1’s in.
There are a lot of NaN values under the unnammed columns, we should consider dropping them.
First let’s take a look at a unnammed columns, we can use .value_counts() to see if there are any significant values in these columns.
1
2
3
4
5
6
print(df['Unnamed: 10'].value_counts())
print(df['Unnamed: 9'].value_counts())
print(df['Unnamed: 8'].value_counts())
print(df['Unnamed: 7'].value_counts())
print(df['Unnamed: 6'].value_counts())
print(df['Unnamed: 5'].value_counts())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
* 1
Name: Unnamed: 10, dtype: int64
* 1
Name: Unnamed: 9, dtype: int64
* 1
understanding comedy genius - Mack Sennett."" 1
Name: Unnamed: 8, dtype: int64
* 1
while requiring no dangerous solvents. [Systems]" 1
kindly 1
Name: Unnamed: 7, dtype: int64
* 9
direct radiator bass style cinema loudspeaker systems. [Sound]" 1
flexibility and water resistance 1
sympathetic 1
Name: Unnamed: 6, dtype: int64
* 7
discoverer of stars 1
D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity 1
error-prone measurements on sets. [Digital Imaging Technology]" 1
resilience 1
Name: Unnamed: 5, dtype: int64
It doesn’t look like there any significant information under these columns. We can probably drop all the unnamed columns, but first let’s convert the year column to integers.
1
2
3
df['Year'] = df['Year'].str[0:4]
df['Year'] = df['Year'].astype(int)
df.head()
| Year | Category | Nominee | Additional Info | Won? | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Actor -- Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2010 | Actor -- Leading Role | Jeff Bridges | True Grit {'Rooster Cogburn'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2010 | Actor -- Leading Role | Jesse Eisenberg | The Social Network {'Mark Zuckerberg'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2010 | Actor -- Leading Role | Colin Firth | The King's Speech {'King George VI'} | YES | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2010 | Actor -- Leading Role | James Franco | 127 Hours {'Aron Ralston'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
1
df.describe(include = 'all')
| Year | Category | Nominee | Additional Info | Won? | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10137.000000 | 10137 | 10137 | 9011 | 10137 | 11 | 12 | 3 | 2 | 1 | 1 |
| unique | NaN | 40 | 6001 | 6424 | 16 | 5 | 4 | 3 | 2 | 1 | 1 |
| top | NaN | Writing | Meryl Streep | Metro-Goldwyn-Mayer | NO | * | * | * | * | * | * |
| freq | NaN | 888 | 16 | 60 | 7168 | 7 | 9 | 1 | 1 | 1 | 1 |
| mean | 1970.330768 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| std | 23.332917 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | 1927.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | 1950.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | 1970.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | 1991.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | 2010.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We are only interested in data after year 2000, and actors in award winnng categories. We can use boolean filtering and the .isin() method to filter the data out and then create a new dataframe named “nominations”.
1
2
3
4
5
6
7
8
9
10
11
12
later_than_2000 = df[df['Year'] > 2000]
award_categories = [
"Actor -- Leading Role",
"Actor -- Supporting Role",
"Actress -- Leading Role",
"Actress -- Supporting Role"
]
nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
nominations.head()
| Year | Category | Nominee | Additional Info | Won? | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Actor -- Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2010 | Actor -- Leading Role | Jeff Bridges | True Grit {'Rooster Cogburn'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2010 | Actor -- Leading Role | Jesse Eisenberg | The Social Network {'Mark Zuckerberg'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2010 | Actor -- Leading Role | Colin Firth | The King's Speech {'King George VI'} | YES | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2010 | Actor -- Leading Role | James Franco | 127 Hours {'Aron Ralston'} | NO | NaN | NaN | NaN | NaN | NaN | NaN |
We can also use use the .map() method to convert all the “No” values to 0 and all the “Yes” values to 1.
1
2
3
4
5
6
7
8
yes_no = {
"NO":0,
"YES":1
}
nominations['Won'] = nominations['Won?'].map(yes_no)
nominations.head()
| Year | Category | Nominee | Additional Info | Won? | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Won | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Actor -- Leading Role | Javier Bardem | Biutiful {'Uxbal'} | NO | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 1 | 2010 | Actor -- Leading Role | Jeff Bridges | True Grit {'Rooster Cogburn'} | NO | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2 | 2010 | Actor -- Leading Role | Jesse Eisenberg | The Social Network {'Mark Zuckerberg'} | NO | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 3 | 2010 | Actor -- Leading Role | Colin Firth | The King's Speech {'King George VI'} | YES | NaN | NaN | NaN | NaN | NaN | NaN | 1 |
| 4 | 2010 | Actor -- Leading Role | James Franco | 127 Hours {'Aron Ralston'} | NO | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
Next, we’ll drop the unnamed columns
1
2
3
4
5
6
7
8
9
10
columns_drop = [
"Won?",
"Unnamed: 5",
"Unnamed: 6",
"Unnamed: 7",
"Unnamed: 8",
"Unnamed: 9",
"Unnamed: 10"
]
final_nominations = nominations.drop(columns_drop, axis = 1)
1
final_nominations.head()
| Year | Category | Nominee | Additional Info | Won | |
|---|---|---|---|---|---|
| 0 | 2010 | Actor -- Leading Role | Javier Bardem | Biutiful {'Uxbal'} | 0 |
| 1 | 2010 | Actor -- Leading Role | Jeff Bridges | True Grit {'Rooster Cogburn'} | 0 |
| 2 | 2010 | Actor -- Leading Role | Jesse Eisenberg | The Social Network {'Mark Zuckerberg'} | 0 |
| 3 | 2010 | Actor -- Leading Role | Colin Firth | The King's Speech {'King George VI'} | 1 |
| 4 | 2010 | Actor -- Leading Role | James Franco | 127 Hours {'Aron Ralston'} | 0 |
The last thing we’ll have to do is separate the “Additional Info” column into two new columns “Movie” and “Character”. We’ll have to manipulate the strings and the split the two values in this one column in two new strings.
First we’ll use .str.rstrip() to remove the end quotation mark and the bracket from the column series, then we’ll split the the string again into a series of lists.
1
2
additional_info_one = final_nominations['Additional Info'].str.rstrip(to_strip = "'}")
additional_info_one.head()
1
2
3
4
5
6
0 Biutiful {'Uxbal
1 True Grit {'Rooster Cogburn
2 The Social Network {'Mark Zuckerberg
3 The King's Speech {'King George VI
4 127 Hours {'Aron Ralston
Name: Additional Info, dtype: object
1
2
3
#Split into a series of lists
additional_info_two = additional_info_one.str.split(' {\'')
additional_info_two.head()
1
2
3
4
5
6
0 [Biutiful, Uxbal]
1 [True Grit, Rooster Cogburn]
2 [The Social Network, Mark Zuckerberg]
3 [The King's Speech, King George VI]
4 [127 Hours, Aron Ralston]
Name: Additional Info, dtype: object
1
2
3
4
5
6
#Set a series with the first element to movie names
movie_names = additional_info_two.str[0]
#Set a series with the second element to characters
characters = additional_info_two.str[1]
1
2
3
4
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop(["Additional Info"], axis=1)
final_nominations.head()
| Year | Category | Nominee | Won | Movie | Character | |
|---|---|---|---|---|---|---|
| 0 | 2010 | Actor -- Leading Role | Javier Bardem | 0 | Biutiful | Uxbal |
| 1 | 2010 | Actor -- Leading Role | Jeff Bridges | 0 | True Grit | Rooster Cogburn |
| 2 | 2010 | Actor -- Leading Role | Jesse Eisenberg | 0 | The Social Network | Mark Zuckerberg |
| 3 | 2010 | Actor -- Leading Role | Colin Firth | 1 | The King's Speech | King George VI |
| 4 | 2010 | Actor -- Leading Role | James Franco | 0 | 127 Hours | Aron Ralston |
Now that we are done cleaning up the data, we can do some simply analysis using sqlite3
1
2
3
4
5
6
7
8
import sqlite3
conn = sqlite3.connect("nominations.db")
cursor = conn.cursor()
#Creates the table "nominations"
final_nominations.to_sql("nominations", conn, index=False)
1
2
3
4
5
q1 = '''
Pragma table_info(nominations)
'''
result = cursor.execute(q1).fetchall()
result
1
2
3
4
5
6
[(0, 'Year', 'INTEGER', 0, None, 0),
(1, 'Category', 'TEXT', 0, None, 0),
(2, 'Nominee', 'TEXT', 0, None, 0),
(3, 'Won', 'INTEGER', 0, None, 0),
(4, 'Movie', 'TEXT', 0, None, 0),
(5, 'Character', 'TEXT', 0, None, 0)]
1
2
3
4
5
6
q2 = '''
SELECT * FROM nominations LIMIT 10
'''
result = cursor.execute(q2).fetchall()
result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'),
(2010,
'Actor -- Leading Role',
'Jeff Bridges',
0,
'True Grit',
'Rooster Cogburn'),
(2010,
'Actor -- Leading Role',
'Jesse Eisenberg',
0,
'The Social Network',
'Mark Zuckerberg'),
(2010,
'Actor -- Leading Role',
'Colin Firth',
1,
"The King's Speech",
'King George VI'),
(2010,
'Actor -- Leading Role',
'James Franco',
0,
'127 Hours',
'Aron Ralston'),
(2010,
'Actor -- Supporting Role',
'Christian Bale',
1,
'The Fighter',
'Dicky Eklund'),
(2010,
'Actor -- Supporting Role',
'John Hawkes',
0,
"Winter's Bone",
'Teardrop'),
(2010,
'Actor -- Supporting Role',
'Jeremy Renner',
0,
'The Town',
'James Coughlin'),
(2010,
'Actor -- Supporting Role',
'Mark Ruffalo',
0,
'The Kids Are All Right',
'Paul'),
(2010,
'Actor -- Supporting Role',
'Geoffrey Rush',
0,
"The King's Speech",
'Lionel Logue')]
1
conn.close()
Learning Summary
Python/SQL concepts explored: python+sqlite3, pandas, data cleaning, columns manipulation
Python functions and methods used: .str.rstrip(), .str.split(), .connect(), .cursor(), .drop(), .str[], .map(), .value_counts()
SQL statements used: SELECT, FROM, PRAGMA
The files used for this project can be found in my GitHub repository.