Post

Dataquest Guided Project - Analyzing CIA Factbook Data Using SQLite And Python

In this project we’ll be working with SQL in combination with Python. Specifically we’ll use sqlite3. We will analyze the database file “factbook.db” which is the CIA World Factbook. We will write queries to look at the data and see if we can draw any interesting insights.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#import sql3, pandas and connect to the databse.
import sqlite3
import pandas as pd
conn = sqlite3.connect("factbook.db")

#activates the cursor
cursor = conn.cursor()

#the SQL query to look at the tables in the databse
q1 = "SELECT * FROM sqlite_master WHERE type='table';"

#execute the query and read it in pandas, this returns a table in pandas form
database_info = pd.read_sql_query(q1, conn)
database_info
typenametbl_namerootpagesql
0tablefactsfacts2CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...
1tablesqlite_sequencesqlite_sequence3CREATE TABLE sqlite_sequence(name,seq)

Let’s begin exploring the data, we can use pd.read_sql_query to see what the first table looks like

1
2
3
4
q2 = "SELECT * FROM facts"

data = pd.read_sql_query(q2, conn)
data.head()
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_ratecreated_atupdated_at
01afAfghanistan652230.0652230.00.032564342.02.3238.5713.891.512015-11-01 13:19:49.4617342015-11-01 13:19:49.461734
12alAlbania28748.027398.01350.03029278.00.3012.926.583.302015-11-01 13:19:54.4310822015-11-01 13:19:54.431082
23agAlgeria2381741.02381741.00.039542166.01.8423.674.310.922015-11-01 13:19:59.9612862015-11-01 13:19:59.961286
34anAndorra468.0468.00.085580.00.128.136.960.002015-11-01 13:20:03.6599452015-11-01 13:20:03.659945
45aoAngola1246700.01246700.00.019625353.02.7838.7811.490.462015-11-01 13:20:08.6250722015-11-01 13:20:08.625072

Let’s see what the maximum and the minimum population is and then we’ll identify the country name. If they are outliers, we should probably remove it from the table.

1
2
3
q3 = "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth) FROM facts"
data = pd.read_sql_query(q3, conn)
data.head()
MIN(population)MAX(population)MIN(population_growth)MAX(population_growth)
0072564900110.04.02
1
2
3
4
5
6
q4 = '''
SELECT * FROM facts 
WHERE population == (SELECT MIN(population) from facts);
'''
data = pd.read_sql_query(q4, conn)
data.head()
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_ratecreated_atupdated_at
0250ayAntarcticaNone280000None0NoneNoneNoneNone2015-11-01 13:38:44.8857462015-11-01 13:38:44.885746
1
2
3
4
5
6
q5 = '''
SELECT * FROM facts 
WHERE population == (SELECT MAX(population) from facts);
'''
data = pd.read_sql_query(q5, conn)
data.head()
idcodenameareaarea_landarea_waterpopulationpopulation_growthbirth_ratedeath_ratemigration_ratecreated_atupdated_at
0261xxWorldNoneNoneNone72564900111.0818.67.8None2015-11-01 13:39:09.9107212015-11-01 13:39:09.910721

It doesn’t make much sense to include Antarctica and the entire world as a part of our data analysis, we should definitely exlude this from our analysis.

We can write a SQL query along with subqueries to exlude the min and max population from the data.

1
2
3
4
5
6
7
8
9
10
11
12
13
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

q6 = '''
SELECT population, population_growth, birth_rate, death_rate
FROM facts
WHERE population != (SELECT MIN(population) from facts)
AND population != (SELECT MAX(population) from facts)
'''

data = pd.read_sql_query(q6, conn)
data.head()
populationpopulation_growthbirth_ratedeath_rate
0325643422.3238.5713.89
130292780.3012.926.58
2395421661.8423.674.31
3855800.128.136.96
4196253532.7838.7811.49

Suppose we are the CIA and we are interested in the future prospects of the countries arround the world. We can plot histograms of the birth rate, death rate, and population growth of the countries.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

data["birth_rate"].hist(ax=ax1)
ax1.set_xlabel("birth_rate")
data["death_rate"].hist(ax=ax2)
ax2.set_xlabel("death_rate")
data["population_growth"].hist(ax=ax3)
ax3.set_xlabel("population_growth")
data["population"].hist(ax=ax4)
ax4.set_xlabel("population")

plt.show()

png

The birth_rate and population growth plot both show a right-skewed distribution, This makes sense as birth rate and population growth are directly related. The death_rate plot shows a normal distribution, almost a double peaked distribution. The population plot is a bit hard to read due to outliers.

Next we are interested to see what city has the highest population density

1
2
3
4
5
6
7
8
9
10
11
q7 = '''
SELECT name, CAST(population as float)/CAST(area as float) "density"
FROM facts
WHERE population != (SELECT MIN(population) from facts)
AND population != (SELECT MAX(population) from facts)
ORDER BY density DESC
'''

data = pd.read_sql_query(q7, conn)
data.head()
namedensity
0Macau21168.964286
1Monaco15267.500000
2Singapore8141.279770
3Hong Kong6445.041516
4Gaza Strip5191.819444

Looks like Macau has the highest population density in the world, not too surprising because Macau is a tourist heavy town with tons of casinos.

1
2
3
4
5
6
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(1,1,1)

data['density'].hist()

plt.show()

png

Again there are several outliers making the data hard to read, let’s limit the histogram and increase the number of bins.

1
2
3
4
5
6
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)

data['density'].hist(bins=500)
ax.set_xlim(0, 2000)
plt.show()

png

This table includes cities along with countries. The cities will obviously have way higher density than the countries. So plotting them both together in one histogram doesn’t make much sense

This explains why the population histogram we did earlier showed a similar trend.


Learning Summary

Python/SQL concepts explored: python+sqlite3, pandas, SQL queries, SQL subqueries, matplotlib.plyplot, seaborn, histograms

Python functions and methods used: .cursor(), .read_sql_query(), .set_xlabel(), .set_xlim(), .add_subplot(), .figure()

SQL statements used: SELECT, WHERE, FROM, MIN(), MAX(), ORDER BY, AND

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.