Post

Dataquest Guided Project - Preparing Data For SQLite

In this project, we will prepare data in preparation for SQL. We’ll have to first clean the data.

1
2
3
4
5
6
import pandas as pd
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')

#Turns off warnings for potentially confusing assignments
pd.options.mode.chained_assignment = None  # default='warn'
df.head()
YearCategoryNomineeAdditional InfoWon?Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
02010 (83rd)Actor -- Leading RoleJavier BardemBiutiful {'Uxbal'}NONaNNaNNaNNaNNaNNaN
12010 (83rd)Actor -- Leading RoleJeff BridgesTrue Grit {'Rooster Cogburn'}NONaNNaNNaNNaNNaNNaN
22010 (83rd)Actor -- Leading RoleJesse EisenbergThe Social Network {'Mark Zuckerberg'}NONaNNaNNaNNaNNaNNaN
32010 (83rd)Actor -- Leading RoleColin FirthThe King's Speech {'King George VI'}YESNaNNaNNaNNaNNaNNaN
42010 (83rd)Actor -- Leading RoleJames Franco127 Hours {'Aron Ralston'}NONaNNaNNaNNaNNaNNaN

This is not considered tidy data. Here are the issues:

  1. We have string values in the year column, we should change these into integers.

  2. Additional Info has the movie name and the role name, this can be split into two columns.

  3. The “Won?” column change should be changed to 0’s and 1’s in.

  4. There are a lot of NaN values under the unnammed columns, we should consider dropping them.

First let’s take a look at a unnammed columns, we can use .value_counts() to see if there are any significant values in these columns.

1
2
3
4
5
6
print(df['Unnamed: 10'].value_counts())
print(df['Unnamed: 9'].value_counts())
print(df['Unnamed: 8'].value_counts())
print(df['Unnamed: 7'].value_counts())
print(df['Unnamed: 6'].value_counts())
print(df['Unnamed: 5'].value_counts())
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
*    1
Name: Unnamed: 10, dtype: int64
*    1
Name: Unnamed: 9, dtype: int64
*                                                 1
 understanding comedy genius - Mack Sennett.""    1
Name: Unnamed: 8, dtype: int64
*                                                     1
 while requiring no dangerous solvents. [Systems]"    1
 kindly                                               1
Name: Unnamed: 7, dtype: int64
*                                                                   9
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
 flexibility and water resistance                                   1
 sympathetic                                                        1
Name: Unnamed: 6, dtype: int64
*                                                                                                               7
 discoverer of stars                                                                                            1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 resilience                                                                                                     1
Name: Unnamed: 5, dtype: int64

It doesn’t look like there any significant information under these columns. We can probably drop all the unnamed columns, but first let’s convert the year column to integers.

1
2
3
df['Year'] = df['Year'].str[0:4]
df['Year'] = df['Year'].astype(int)
df.head()
YearCategoryNomineeAdditional InfoWon?Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
02010Actor -- Leading RoleJavier BardemBiutiful {'Uxbal'}NONaNNaNNaNNaNNaNNaN
12010Actor -- Leading RoleJeff BridgesTrue Grit {'Rooster Cogburn'}NONaNNaNNaNNaNNaNNaN
22010Actor -- Leading RoleJesse EisenbergThe Social Network {'Mark Zuckerberg'}NONaNNaNNaNNaNNaNNaN
32010Actor -- Leading RoleColin FirthThe King's Speech {'King George VI'}YESNaNNaNNaNNaNNaNNaN
42010Actor -- Leading RoleJames Franco127 Hours {'Aron Ralston'}NONaNNaNNaNNaNNaNNaN
1
df.describe(include = 'all')
YearCategoryNomineeAdditional InfoWon?Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
count10137.000000101371013790111013711123211
uniqueNaN406001642416543211
topNaNWritingMeryl StreepMetro-Goldwyn-MayerNO******
freqNaN88816607168791111
mean1970.330768NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
std23.332917NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
min1927.000000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
25%1950.000000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
50%1970.000000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
75%1991.000000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
max2010.000000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

We are only interested in data after year 2000, and actors in award winnng categories. We can use boolean filtering and the .isin() method to filter the data out and then create a new dataframe named “nominations”.

1
2
3
4
5
6
7
8
9
10
11
12
later_than_2000 = df[df['Year'] > 2000]

award_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role"
]


nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
nominations.head()
YearCategoryNomineeAdditional InfoWon?Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10
02010Actor -- Leading RoleJavier BardemBiutiful {'Uxbal'}NONaNNaNNaNNaNNaNNaN
12010Actor -- Leading RoleJeff BridgesTrue Grit {'Rooster Cogburn'}NONaNNaNNaNNaNNaNNaN
22010Actor -- Leading RoleJesse EisenbergThe Social Network {'Mark Zuckerberg'}NONaNNaNNaNNaNNaNNaN
32010Actor -- Leading RoleColin FirthThe King's Speech {'King George VI'}YESNaNNaNNaNNaNNaNNaN
42010Actor -- Leading RoleJames Franco127 Hours {'Aron Ralston'}NONaNNaNNaNNaNNaNNaN

We can also use use the .map() method to convert all the “No” values to 0 and all the “Yes” values to 1.

1
2
3
4
5
6
7
8
yes_no = {
    "NO":0,
    "YES":1
}

nominations['Won'] = nominations['Won?'].map(yes_no)
nominations.head()

YearCategoryNomineeAdditional InfoWon?Unnamed: 5Unnamed: 6Unnamed: 7Unnamed: 8Unnamed: 9Unnamed: 10Won
02010Actor -- Leading RoleJavier BardemBiutiful {'Uxbal'}NONaNNaNNaNNaNNaNNaN0
12010Actor -- Leading RoleJeff BridgesTrue Grit {'Rooster Cogburn'}NONaNNaNNaNNaNNaNNaN0
22010Actor -- Leading RoleJesse EisenbergThe Social Network {'Mark Zuckerberg'}NONaNNaNNaNNaNNaNNaN0
32010Actor -- Leading RoleColin FirthThe King's Speech {'King George VI'}YESNaNNaNNaNNaNNaNNaN1
42010Actor -- Leading RoleJames Franco127 Hours {'Aron Ralston'}NONaNNaNNaNNaNNaNNaN0

Next, we’ll drop the unnamed columns

1
2
3
4
5
6
7
8
9
10
columns_drop = [
    "Won?",
    "Unnamed: 5",
    "Unnamed: 6",
    "Unnamed: 7",
    "Unnamed: 8",
    "Unnamed: 9",
    "Unnamed: 10"
]
final_nominations = nominations.drop(columns_drop, axis = 1)
1
final_nominations.head()
YearCategoryNomineeAdditional InfoWon
02010Actor -- Leading RoleJavier BardemBiutiful {'Uxbal'}0
12010Actor -- Leading RoleJeff BridgesTrue Grit {'Rooster Cogburn'}0
22010Actor -- Leading RoleJesse EisenbergThe Social Network {'Mark Zuckerberg'}0
32010Actor -- Leading RoleColin FirthThe King's Speech {'King George VI'}1
42010Actor -- Leading RoleJames Franco127 Hours {'Aron Ralston'}0

The last thing we’ll have to do is separate the “Additional Info” column into two new columns “Movie” and “Character”. We’ll have to manipulate the strings and the split the two values in this one column in two new strings.

First we’ll use .str.rstrip() to remove the end quotation mark and the bracket from the column series, then we’ll split the the string again into a series of lists.

1
2
additional_info_one = final_nominations['Additional Info'].str.rstrip(to_strip = "'}")
additional_info_one.head()
1
2
3
4
5
6
0                        Biutiful {'Uxbal
1             True Grit {'Rooster Cogburn
2    The Social Network {'Mark Zuckerberg
3      The King's Speech {'King George VI
4                127 Hours {'Aron Ralston
Name: Additional Info, dtype: object
1
2
3
#Split into a series of lists
additional_info_two = additional_info_one.str.split(' {\'')
additional_info_two.head()
1
2
3
4
5
6
0                        [Biutiful, Uxbal]
1             [True Grit, Rooster Cogburn]
2    [The Social Network, Mark Zuckerberg]
3      [The King's Speech, King George VI]
4                [127 Hours, Aron Ralston]
Name: Additional Info, dtype: object
1
2
3
4
5
6
#Set a series with the first element to movie names
movie_names = additional_info_two.str[0]


#Set a series with the second element to characters
characters = additional_info_two.str[1]
1
2
3
4
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop(["Additional Info"], axis=1)
final_nominations.head()
YearCategoryNomineeWonMovieCharacter
02010Actor -- Leading RoleJavier Bardem0BiutifulUxbal
12010Actor -- Leading RoleJeff Bridges0True GritRooster Cogburn
22010Actor -- Leading RoleJesse Eisenberg0The Social NetworkMark Zuckerberg
32010Actor -- Leading RoleColin Firth1The King's SpeechKing George VI
42010Actor -- Leading RoleJames Franco0127 HoursAron Ralston

Now that we are done cleaning up the data, we can do some simply analysis using sqlite3

1
2
3
4
5
6
7
8
import sqlite3

conn = sqlite3.connect("nominations.db")
cursor = conn.cursor()

#Creates the table "nominations"
final_nominations.to_sql("nominations", conn, index=False)

1
2
3
4
5
q1 = '''
Pragma table_info(nominations)
'''
result = cursor.execute(q1).fetchall()
result
1
2
3
4
5
6
[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]
1
2
3
4
5
6
q2 = '''
SELECT * FROM nominations LIMIT 10
'''

result = cursor.execute(q2).fetchall()
result
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0,
  'True Grit',
  'Rooster Cogburn'),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0,
  'The Social Network',
  'Mark Zuckerberg'),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1,
  "The King's Speech",
  'King George VI'),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0,
  '127 Hours',
  'Aron Ralston'),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1,
  'The Fighter',
  'Dicky Eklund'),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0,
  "Winter's Bone",
  'Teardrop'),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0,
  'The Town',
  'James Coughlin'),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0,
  'The Kids Are All Right',
  'Paul'),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0,
  "The King's Speech",
  'Lionel Logue')]
1
conn.close()

Learning Summary

Python/SQL concepts explored: python+sqlite3, pandas, data cleaning, columns manipulation

Python functions and methods used: .str.rstrip(), .str.split(), .connect(), .cursor(), .drop(), .str[], .map(), .value_counts()

SQL statements used: SELECT, FROM, PRAGMA

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.