In this project, we’ll look at 20,000 rows of the Jeopardy dataset in “jeopardy.csv”. We want to see whether there are patterns in the questions asked that could give us a bit of an edge to win.
First, we’ll have to tidy up the data.
| import pandas as pd
import matplotlib.pyplot as plt
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
|
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
| print(jeopardy.columns)
|
| Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
' Question', ' Answer'],
dtype='object')
|
It looks like every column name except the first starts with a space. We can fix this easily by assigning a clean list of names to the .columns attribute.
| jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
'Question', 'Answer']
jeopardy.columns
|
| Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
'Answer'],
dtype='object')
|
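Retyping the names works for seven columns, but it is easy to mistype one. As a minimal sketch, assuming the only problem is stray whitespace, the names can also be cleaned in place with the string accessor on the column index. The toy DataFrame `df` below is a stand-in for the real data, not part of the original project.

```python
import pandas as pd

# Toy frame with the same leading-space problem as the raw CSV headers
# (hypothetical stand-in for the jeopardy DataFrame).
df = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Value'])

# .str.strip() removes leading and trailing whitespace from every column name.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This keeps the original names and order, so it still works if columns are added to the file later.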
Next, let’s make all the strings in the “Question” and “Answer” columns lowercase. We can write a function and then use the .apply() method.
We also want to remove all punctuation; the goal is to reduce the “Question” and “Answer” columns to just words.
| import re

def lowercase_no_punct(string):
    # Lowercase first, then drop every character that is not a letter, digit, or whitespace.
    lower = string.lower()
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', lower)
    return punremoved
|
| jeopardy['clean_question'] = jeopardy['Question'].apply(lowercase_no_punct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(lowercase_no_punct)
|
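The same cleaning can be done without a helper function by chaining pandas string methods. This is only a sketch on a toy Series (`questions` is a made-up stand-in for the “Question” column); it applies the same lowercase-then-strip-punctuation rule as the function above.

```python
import pandas as pd

# Two sample strings in the style of the dataset (stand-ins for jeopardy['Question']).
questions = pd.Series(["No. 2: 1912 Olympian; football star", "McDonald's"])

clean = (
    questions
    .str.lower()                                   # lowercase every string
    .str.replace(r'[^a-z0-9\s]', '', regex=True)   # keep letters, digits, whitespace
)

print(clean.tolist())
```

Because the text is lowercased first, the character class only needs `a-z` rather than `A-Za-z`.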
The “Value” column is usually a dollar sign followed by a number, but it is currently stored as a string. We should remove the dollar sign and convert the value to an integer.
| def punremovandtoint(string):
    # Strip the dollar sign and other punctuation; values that are not numeric fall back to 0.
    punremoved = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer
|
| jeopardy['clean_values'] = jeopardy['Value'].apply(punremovandtoint)
|
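An alternative sketch uses `pd.to_numeric` with `errors='coerce'` instead of try/except. The sample values are assumptions for illustration, including the non-numeric placeholder `'None'`; anything that cannot be parsed becomes 0, as in the function above.

```python
import pandas as pd

# Hypothetical sample of the "Value" column, including one non-numeric entry.
values = pd.Series(['$200', '$1,000', 'None'])

# Keep digits only, then parse; unparseable entries become NaN and are replaced by 0.
digits_only = values.str.replace(r'[^0-9]', '', regex=True)
clean_values = pd.to_numeric(digits_only, errors='coerce').fillna(0).astype(int)

print(clean_values.tolist())
```

Whether 0 is the right stand-in for a missing value depends on the analysis; it counts such rows as low value in the steps that follow.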
Next, we’ll convert the values in the “Air Date” column to datetime objects.
| jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
|
Let’s see what our table currently looks like:
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_values |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
Now that the data is cleaned, we can start analyzing it.
Suppose we are interested in how many words in the answer also occur in the question. We’ll create a function and use the .apply() method to create a new column. This column will hold the ratio of answer words found in the question to the total number of answer words.
| def cleaner(series):
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    # "the" is too common to be a meaningful match, so drop it from the answer.
    if "the" in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
|
| jeopardy['answer_in_question'] = jeopardy.apply(cleaner, axis=1)
jeopardy['answer_in_question'].mean()
|
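To make the ratio concrete, here is a small sketch of the same logic applied to plain strings rather than DataFrame rows. The function name `answer_word_ratio` and the sample strings are made up for illustration.

```python
def answer_word_ratio(clean_question, clean_answer):
    """Share of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# No answer word appears in the question.
print(answer_word_ratio('the city of yuma in this state has a record', 'arizona'))
# One of the two answer words ("john") appears in the question.
print(answer_word_ratio('john hancock signed first', 'john adams'))
# "the" is dropped, and the remaining word appears in the question.
print(answer_word_ratio('he wrote the raven', 'the raven'))
```

The column mean is the average of these per-row ratios, so a low mean says that most answers share no words with their questions.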
On average, only about 6% of the words in an answer also appear in its question, so hunting for the answer inside the question is not a reliable strategy.
Next, we’ll look at the words used in the questions column. We can write a loop to see how often they repeat.
| question_overlap = []
# A Python set is an unordered collection of unique items.
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    match_count = 0
    # Keep only words of six or more characters to skip common short words.
    newlist = []
    for word in split_question:
        if len(word) >= 6:
            newlist.append(word)
    # Count the words that already appeared in an earlier question.
    for word in newlist:
        if word in terms_used:
            match_count += 1
    for word in newlist:
        terms_used.add(word)
    if len(newlist) > 0:
        match_count = match_count/len(newlist)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
|
| jeopardy['question_overlap'].mean()
|
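The overlap calculation is easier to follow on a few toy questions. The sketch below repeats the same steps on a plain list of strings; the function name `overlap_ratios` and the sample questions are hypothetical.

```python
def overlap_ratios(questions, min_length=6):
    """For each question, the share of its long words already seen in earlier questions."""
    terms_used = set()
    ratios = []
    for question in questions:
        long_words = [w for w in question.split(' ') if len(w) >= min_length]
        seen_before = sum(1 for w in long_words if w in terms_used)
        terms_used.update(long_words)
        ratios.append(seen_before / len(long_words) if long_words else 0)
    return ratios

toy_questions = [
    'this president signed the treaty',   # nothing seen yet
    'the treaty ended this conflict',     # "treaty" seen, "conflict" new
    'a short one',                        # no words of six or more characters
]
ratios = overlap_ratios(toy_questions)
print(ratios)
```

Note that the result depends on row order: the first question always scores 0 because nothing has been seen yet.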
On average, about 69% of the longer words (six or more characters) in a question have already appeared in an earlier question. However, the same words can be combined into different phrases with very different meanings, so this large overlap is not very informative on its own.
Let’s take a look at the number of questions worth more than $800. Maybe it is a good idea to study only high-value questions.
| def highvalue(row):
    # 1 for questions worth more than $800, 0 otherwise.
    value = 0
    if row['clean_values'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(highvalue, axis=1)
|
| high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
|
| print(high_value_count)
low_value_count
|
It doesn’t look like there are that many high-value questions in the dataset.
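The flag and the two counts can also be computed without a row-wise function. This is a sketch on made-up values (the toy `df` is not the real data), using a boolean comparison and `value_counts()`.

```python
import pandas as pd

# Hypothetical cleaned values; note that exactly 800 is NOT "high" under the > 800 rule.
df = pd.DataFrame({'clean_values': [200, 800, 1000, 2000, 0]})

# The comparison yields True/False per row; astype(int) turns it into 1/0.
df['high_value'] = (df['clean_values'] > 800).astype(int)

counts = df['high_value'].value_counts()
high_value_count = int(counts.get(1, 0))
low_value_count = int(counts.get(0, 0))

print(high_value_count, low_value_count)
```

Dividing `high_value_count` by the number of rows gives the share of high-value questions, which is easier to judge than the raw count.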
We can create a function that takes a word and returns the number of high-value and low-value questions in which it appears. Maybe this will help us decide what to study.
| def highlowcounts(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        # Match whole words only, not substrings.
        if word in row['clean_question'].split(' '):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
|
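Looping over every row with .iterrows() for each word gets slow as the list of terms grows. One possible alternative, sketched here on a toy frame, builds a boolean mask once per word. The name `high_low_counts` and its extra `frame` parameter are assumptions for illustration, not part of the original code.

```python
import pandas as pd

# Toy stand-in for the jeopardy DataFrame.
df = pd.DataFrame({
    'clean_question': ['this treaty ended the war', 'name this treaty', 'a war of words'],
    'high_value': [1, 0, 0],
})

def high_low_counts(word, frame):
    """Count high- and low-value questions that contain `word` as a whole word."""
    contains_word = frame['clean_question'].str.split(' ').apply(lambda words: word in words)
    high_count = int((contains_word & (frame['high_value'] == 1)).sum())
    low_count = int((contains_word & (frame['high_value'] == 0)).sum())
    return high_count, low_count

print(high_low_counts('treaty', df))
print(high_low_counts('word', df))   # "word" is only a substring of "words", so no match
```

Splitting before the membership test keeps the whole-word behavior of the original function.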
| observed_expected = []
comparison_terms = list(terms_used)[:5]
comparison_terms
|
| ['emigrated', 'ruffles', 'waterworld', 'mussorgsky', 'appendages']
|
| for term in comparison_terms:
    observed_expected.append(highlowcounts(term))
observed_expected
|
| [(1, 0), (0, 2), (1, 0), (1, 1), (1, 2)]
|
We can use the chi-squared test to see whether, for each term in “comparison_terms”, the split between high-value and low-value questions differs significantly from what we would expect by chance.
| from scipy.stats import chisquare
import numpy as np

chi_squared = []
for lists in observed_expected:
    total = sum(lists)
    # Share of all questions that contain this term.
    total_prop = total / jeopardy.shape[0]
    # Counts we would expect if the term were spread evenly across high- and low-value questions.
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared
|
| [Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932)]
|
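None of the p-values is less than 0.05, so none of these results is statistically significant. The counts are also very small (each term appears in at most three questions), and the chi-squared test is unreliable at such low frequencies.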
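To show what the test computes, here is a hand-rolled sketch for the two-category case. The counts are hypothetical and not taken from the dataset, and `chi_squared_1dof` is an illustrative helper, not a SciPy function. With two categories there is one degree of freedom, and the p-value can then be written with the complementary error function.

```python
import math

def chi_squared_1dof(observed, expected):
    """Return (statistic, p_value) for two categories, i.e. one degree of freedom."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts for illustration only.
total_rows = 1000
high_rows, low_rows = 300, 700
observed = (5, 1)   # term seen in 5 high-value questions and 1 low-value question

# Expected counts if the term were spread in proportion to high/low questions overall.
share = sum(observed) / total_rows
expected = (share * high_rows, share * low_rows)   # roughly (1.8, 4.2)

statistic, p_value = chi_squared_1dof(observed, expected)
print(statistic, p_value)
```

Even in this toy case the expected counts are below 5, which is the usual rule of thumb for trusting the test, so terms that occur more often would make better candidates.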
Learning Summary
Python concepts explored: pandas, matplotlib, data cleaning, string manipulation, chi squared test, regex, try/except
Python functions and methods used: .columns, .lower(), .sub(), .apply(), sum(), .array(), .split(), .shape, .mean(), .iterrows(), .remove(), .add(), .append()
The files used for this project can be found in my GitHub repository.