Post

Dataquest Guided Project - Winning Jeopardy

In this project, we’ll look at 20,000 rows of the jeopardy dataset in “jeopardy.csv”. We want to see if there are patterns in the questions asked so we can get a little bit of an edge to win.

First, we’ll have to tidy up the data.

1
2
3
4
5
import pandas as pd
import matplotlib.pyplot as plt

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)
Show NumberAir DateRoundCategoryValueQuestionAnswer
046802004-12-31Jeopardy!HISTORY$200For the last 8 years of his life, Galileo was ...Copernicus
146802004-12-31Jeopardy!ESPN's TOP 10 ALL-TIME ATHLETES$200No. 2: 1912 Olympian; football star at Carlisl...Jim Thorpe
246802004-12-31Jeopardy!EVERYBODY TALKS ABOUT IT...$200The city of Yuma in this state has a record av...Arizona
346802004-12-31Jeopardy!THE COMPANY LINE$200In 1963, live on "The Art Linkletter Show", th...McDonald's
446802004-12-31Jeopardy!EPITAPHS & TRIBUTES$200Signer of the Dec. of Indep., framer of the Co...John Adams
1
print(jeopardy.columns)
1
2
3
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Looks like there is a space after each column name, we can fix this pretty easily with the .columns() method.

1
2
3
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
jeopardy.columns
1
2
3
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Next, let’s make all the strings in the question and answer columns lower case. We can do write a function and then use the .apply() method.

We also want to remove all the punctuations, the goal is to have the “Question” and “Answer” columns down to just words.

1
2
3
4
5
import re
def lowercase_no_punct(string):
    lower = string.lower()
    punremoved = re.sub('[^A-Za-z0-9\s]','', lower)
    return punremoved
1
2
jeopardy['clean_question'] = jeopardy['Question'].apply(lowercase_no_punct)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(lowercase_no_punct)

The “Value” column is usually a dollar sign followed by a number. However, this is currently in a string format. We should conver tthis to an integer and remove the dollar sign.

1
2
3
4
5
6
7
def punremovandtoint(string):
    punremoved = re.sub('[^A-Za-z0-9\s]','', string)
    try:
        integer = int(punremoved)
    except Exception:
        integer = 0
    return integer
1
jeopardy['clean_values'] = jeopardy['Value'].apply(punremovandtoint)

We’ll have to convert the values in the “Air Date” column into a datetime object

1
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

Let’s see what our table currently looks like

1
jeopardy.head()
Show NumberAir DateRoundCategoryValueQuestionAnswerclean_questionclean_answerclean_values
046802004-12-31Jeopardy!HISTORY$200For the last 8 years of his life, Galileo was ...Copernicusfor the last 8 years of his life galileo was u...copernicus200
146802004-12-31Jeopardy!ESPN's TOP 10 ALL-TIME ATHLETES$200No. 2: 1912 Olympian; football star at Carlisl...Jim Thorpeno 2 1912 olympian football star at carlisle i...jim thorpe200
246802004-12-31Jeopardy!EVERYBODY TALKS ABOUT IT...$200The city of Yuma in this state has a record av...Arizonathe city of yuma in this state has a record av...arizona200
346802004-12-31Jeopardy!THE COMPANY LINE$200In 1963, live on "The Art Linkletter Show", th...McDonald'sin 1963 live on the art linkletter show this c...mcdonalds200
446802004-12-31Jeopardy!EPITAPHS & TRIBUTES$200Signer of the Dec. of Indep., framer of the Co...John Adamssigner of the dec of indep framer of the const...john adams200

Now that the data is cleaned, we can start analyzing it.

Suppose we are interested in the number of words in the answer that occurs in the question. We’ll create a function and use the .apply() method to create a new column. This column will have ratio of matching question words to total answer words.

1
2
3
4
5
6
7
8
9
10
11
12
def cleaner(series):
    split_answer = series['clean_answer'].split(' ')
    split_question = series['clean_question'].split(' ')
    match_count = 0
    if "the" in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count +=1
    return match_count/len(split_answer)
1
2
jeopardy['answer_in_question'] = jeopardy.apply(cleaner, axis=1)
jeopardy['answer_in_question'].mean()
1
0.060493257069335872

It looks like the answer only appears in the question 6% of the time, so this is not a super reliable strategy.

Next, we’ll look at words used in the questions column. We can write a function to see how often they repeat

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
question_overlap = []
#a python set is an unordered list of items
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")     
    match_count = 0
    newlist = []
    for word in split_question:
        if len(word) >= 6:
            newlist.append(word)
    for word in newlist:
        if word in terms_used:
            match_count += 1
    for word in newlist:
        terms_used.add(word)
    if len(newlist) > 0:
        match_count = match_count/len(newlist)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
1
jeopardy['question_overlap'].mean()
1
0.69087373156719623

There is a 69% overlap of words between new questions and old ones. However words can be put together as different phases with a big difference in meaning. So this huge overlap is not super significant.

Let’s take a look at the number of questions that are > 800 dollars. Maybe it is a good idea to only study high value questions.

1
2
3
4
5
6
7
def highvalue(row):
    value = 0
    if row['clean_values'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(highvalue, axis =1)
1
2
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
1
2
print(high_value_count)
low_value_count
1
2
3
4
5
6
7
5734





14265

It doesnt look like there are that many high value questions in the dataset.

We can create a function that takes in a word, then return the # of high/low values questions this word showed up in. Maybe this will help us study.

1
2
3
4
5
6
7
8
9
10
def highlowcounts(word):
    low_count = 0
    high_count = 0 
    for idx, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1   
    return high_count, low_count
1
2
3
observed_expected = []
comparison_terms = list(terms_used)[:5]
comparison_terms
1
['emigrated', 'ruffles', 'waterworld', 'mussorgsky', 'appendages']
1
2
3
4
for term in comparison_terms:
    observed_expected.append(highlowcounts(term))

observed_expected
1
[(1, 0), (0, 2), (1, 0), (1, 1), (1, 2)]

We can use the chi squared test to see if the values of the terms in “comparsion_terms” are statiscally significant.

1
2
3
4
5
6
7
8
9
10
11
chi_squared =[]
from scipy.stats import chisquare
import numpy as np
for lists in observed_expected:
    total = sum(lists)
    total_prop = total/jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
1
chi_squared
1
2
3
4
5
[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932)]

None of the p values are less than 0.05 so this is not statiscally significant.


Learning Summary

Python concepts explored: pandas, matplotlib, data cleaning, string manipulation, chi squared test, regex, try/except

Python functions and methods used: .columns, .lower(), .sub(), .apply(), sum(), .array(), .split(), .shape, .mean(), .iterrows(), .remove(), .add(), .append()

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.