Dataquest Guided Project - Preparing Data For SQLite

Posted Feb 7, 2018

By Seng Kin Chu

10 min read

In this project, we will prepare data in preparation for SQL. We’ll have to first clean the data.

  
import pandas as pd
df = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')

#Turns off warnings for potentially confusing assignments
pd.options.mode.chained_assignment = None  # default='warn'
df.head()

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	2010 (83rd)	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
1	2010 (83rd)	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
2	2010 (83rd)	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
3	2010 (83rd)	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	YES	NaN	NaN	NaN	NaN	NaN	NaN
4	2010 (83rd)	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	NO	NaN	NaN	NaN	NaN	NaN	NaN

This is not considered tidy data. Here are the issues:

We have string values in the year column, we should change these into integers.
Additional Info has the movie name and the role name, this can be split into two columns.
The “Won?” column change should be changed to 0’s and 1’s in.
There are a lot of NaN values under the unnammed columns, we should consider dropping them.

First let’s take a look at a unnammed columns, we can use .value_counts() to see if there are any significant values in these columns.

  
print(df['Unnamed: 10'].value_counts())
print(df['Unnamed: 9'].value_counts())
print(df['Unnamed: 8'].value_counts())
print(df['Unnamed: 7'].value_counts())
print(df['Unnamed: 6'].value_counts())
print(df['Unnamed: 5'].value_counts())

*    1
Name: Unnamed: 10, dtype: int64
*    1
Name: Unnamed: 9, dtype: int64
*                                                 1
 understanding comedy genius - Mack Sennett.""    1
Name: Unnamed: 8, dtype: int64
*                                                     1
 while requiring no dangerous solvents. [Systems]"    1
 kindly                                               1
Name: Unnamed: 7, dtype: int64
*                                                                   9
 direct radiator bass style cinema loudspeaker systems. [Sound]"    1
 flexibility and water resistance                                   1
 sympathetic                                                        1
Name: Unnamed: 6, dtype: int64
*                                                                                                               7
 discoverer of stars                                                                                            1
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    1
 error-prone measurements on sets. [Digital Imaging Technology]"                                                1
 resilience                                                                                                     1
Name: Unnamed: 5, dtype: int64

It doesn’t look like there any significant information under these columns. We can probably drop all the unnamed columns, but first let’s convert the year column to integers.

  
df['Year'] = df['Year'].str[0:4]
df['Year'] = df['Year'].astype(int)
df.head()

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	2010	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
1	2010	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
2	2010	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
3	2010	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	YES	NaN	NaN	NaN	NaN	NaN	NaN
4	2010	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	NO	NaN	NaN	NaN	NaN	NaN	NaN

  
df.describe(include = 'all')

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
count	10137.000000	10137	10137	9011	10137	11	12	3	2	1	1
unique	NaN	40	6001	6424	16	5	4	3	2	1	1
top	NaN	Writing	Meryl Streep	Metro-Goldwyn-Mayer	NO	*	*	*	*	*	*
freq	NaN	888	16	60	7168	7	9	1	1	1	1
mean	1970.330768	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	23.332917	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	1927.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	1950.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	1970.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	1991.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	2010.000000	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

We are only interested in data after year 2000, and actors in award winnng categories. We can use boolean filtering and the .isin() method to filter the data out and then create a new dataframe named “nominations”.

  
later_than_2000 = df[df['Year'] > 2000]

award_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role"
]


nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]
nominations.head()

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	2010	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
1	2010	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
2	2010	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN	NaN	NaN	NaN	NaN
3	2010	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	YES	NaN	NaN	NaN	NaN	NaN	NaN
4	2010	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	NO	NaN	NaN	NaN	NaN	NaN	NaN

We can also use use the .map() method to convert all the “No” values to 0 and all the “Yes” values to 1.

  
yes_no = {
    "NO":0,
    "YES":1
}

nominations['Won'] = nominations['Won?'].map(yes_no)
nominations.head()

	Year	Category	Nominee	Additional Info	Won?	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10	Won
0	2010	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	NO	NaN	NaN	NaN	NaN	NaN	NaN	0
1	2010	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	NO	NaN	NaN	NaN	NaN	NaN	NaN	0
2	2010	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	NO	NaN	NaN	NaN	NaN	NaN	NaN	0
3	2010	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	YES	NaN	NaN	NaN	NaN	NaN	NaN	1
4	2010	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	NO	NaN	NaN	NaN	NaN	NaN	NaN	0

Next, we’ll drop the unnamed columns

  
columns_drop = [
    "Won?",
    "Unnamed: 5",
    "Unnamed: 6",
    "Unnamed: 7",
    "Unnamed: 8",
    "Unnamed: 9",
    "Unnamed: 10"
]
final_nominations = nominations.drop(columns_drop, axis = 1)

  
final_nominations.head()

	Year	Category	Nominee	Additional Info	Won
0	2010	Actor -- Leading Role	Javier Bardem	Biutiful {'Uxbal'}	0
1	2010	Actor -- Leading Role	Jeff Bridges	True Grit {'Rooster Cogburn'}	0
2	2010	Actor -- Leading Role	Jesse Eisenberg	The Social Network {'Mark Zuckerberg'}	0
3	2010	Actor -- Leading Role	Colin Firth	The King's Speech {'King George VI'}	1
4	2010	Actor -- Leading Role	James Franco	127 Hours {'Aron Ralston'}	0

The last thing we’ll have to do is separate the “Additional Info” column into two new columns “Movie” and “Character”. We’ll have to manipulate the strings and the split the two values in this one column in two new strings.

First we’ll use .str.rstrip() to remove the end quotation mark and the bracket from the column series, then we’ll split the the string again into a series of lists.

  
additional_info_one = final_nominations['Additional Info'].str.rstrip(to_strip = "'}")
additional_info_one.head()

                      Biutiful {'Uxbal
           True Grit {'Rooster Cogburn
  The Social Network {'Mark Zuckerberg
    The King's Speech {'King George VI
              127 Hours {'Aron Ralston
Name: Additional Info, dtype: object

  
#Split into a series of lists
additional_info_two = additional_info_one.str.split(' {\'')
additional_info_two.head()

                      [Biutiful, Uxbal]
           [True Grit, Rooster Cogburn]
  [The Social Network, Mark Zuckerberg]
    [The King's Speech, King George VI]
              [127 Hours, Aron Ralston]
Name: Additional Info, dtype: object

  
#Set a series with the first element to movie names
movie_names = additional_info_two.str[0]

#Set a series with the second element to characters
characters = additional_info_two.str[1]

  
final_nominations["Movie"] = movie_names
final_nominations["Character"] = characters
final_nominations = final_nominations.drop(["Additional Info"], axis=1)
final_nominations.head()

	Year	Category	Nominee	Won	Movie	Character
0	2010	Actor -- Leading Role	Javier Bardem	0	Biutiful	Uxbal
1	2010	Actor -- Leading Role	Jeff Bridges	0	True Grit	Rooster Cogburn
2	2010	Actor -- Leading Role	Jesse Eisenberg	0	The Social Network	Mark Zuckerberg
3	2010	Actor -- Leading Role	Colin Firth	1	The King's Speech	King George VI
4	2010	Actor -- Leading Role	James Franco	0	127 Hours	Aron Ralston

Now that we are done cleaning up the data, we can do some simply analysis using sqlite3

  
import sqlite3

conn = sqlite3.connect("nominations.db")
cursor = conn.cursor()

#Creates the table "nominations"
final_nominations.to_sql("nominations", conn, index=False)

  
q1 = '''
Pragma table_info(nominations)
'''
result = cursor.execute(q1).fetchall()
result

[(0, 'Year', 'INTEGER', 0, None, 0),
 (1, 'Category', 'TEXT', 0, None, 0),
 (2, 'Nominee', 'TEXT', 0, None, 0),
 (3, 'Won', 'INTEGER', 0, None, 0),
 (4, 'Movie', 'TEXT', 0, None, 0),
 (5, 'Character', 'TEXT', 0, None, 0)]

  
q2 = '''
SELECT * FROM nominations LIMIT 10
'''

result = cursor.execute(q2).fetchall()
result

[(2010, 'Actor -- Leading Role', 'Javier Bardem', 0, 'Biutiful', 'Uxbal'),
 (2010,
  'Actor -- Leading Role',
  'Jeff Bridges',
  0,
  'True Grit',
  'Rooster Cogburn'),
 (2010,
  'Actor -- Leading Role',
  'Jesse Eisenberg',
  0,
  'The Social Network',
  'Mark Zuckerberg'),
 (2010,
  'Actor -- Leading Role',
  'Colin Firth',
  1,
  "The King's Speech",
  'King George VI'),
 (2010,
  'Actor -- Leading Role',
  'James Franco',
  0,
  '127 Hours',
  'Aron Ralston'),
 (2010,
  'Actor -- Supporting Role',
  'Christian Bale',
  1,
  'The Fighter',
  'Dicky Eklund'),
 (2010,
  'Actor -- Supporting Role',
  'John Hawkes',
  0,
  "Winter's Bone",
  'Teardrop'),
 (2010,
  'Actor -- Supporting Role',
  'Jeremy Renner',
  0,
  'The Town',
  'James Coughlin'),
 (2010,
  'Actor -- Supporting Role',
  'Mark Ruffalo',
  0,
  'The Kids Are All Right',
  'Paul'),
 (2010,
  'Actor -- Supporting Role',
  'Geoffrey Rush',
  0,
  "The King's Speech",
  'Lionel Logue')]

  
conn.close()

Learning Summary

Python/SQL concepts explored: python+sqlite3, pandas, data cleaning, columns manipulation

Python functions and methods used: .str.rstrip(), .str.split(), .connect(), .cursor(), .drop(), .str[], .map(), .value_counts()

SQL statements used: SELECT, FROM, PRAGMA

The files used for this project can be found in my GitHub repository.

DataQuest

dataquest

This post is licensed under CC BY 4.0 by the author.

Learning Summary

Trending Tags