
Dataquest Guided Project - Visualizing Earnings Based On College Majors

In this project we will look at the earnings of recent college graduates by major, using the data in ‘recent-grads.csv’. We’ll visualize the data using histograms, bar charts, and scatter plots and see if we can draw any interesting insights from it. However, the main purpose of this project is to practice some of the data visualization tools.

import pandas as pd
import matplotlib.pyplot as plt

# Jupyter magic so the plots are displayed inline
%matplotlib inline
recent_grads = pd.read_csv('recent-grads.csv')
recent_grads.iloc[0]
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
recent_grads.head(1)
   Rank  Major_code                  Major   Total     Men  Women  Major_category  ShareWomen  Sample_size  Employed  ...  Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  Low_wage_jobs
0     1        2419  PETROLEUM ENGINEERING  2339.0  2057.0  282.0     Engineering    0.120564           36      1976  ...        270                  1207          37           0.018381  110000  95000  125000          1534               364            193

1 rows × 21 columns

recent_grads.tail(1)
     Rank  Major_code            Major   Total    Men  Women  Major_category  ShareWomen  Sample_size  Employed  ...  Part_time  Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th  P75th  College_jobs  Non_college_jobs  Low_wage_jobs
172   173        3501  LIBRARY SCIENCE  1098.0  134.0  964.0       Education     0.87796            2       742  ...        237                   410          87           0.104946   22000  20000  22000           288               338            192

1 rows × 21 columns

recent_grads.describe()
             Rank   Major_code          Total            Men          Women
count  173.000000   173.000000     172.000000     172.000000     172.000000
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740
min      1.000000  1100.000000     124.000000     119.000000       0.000000
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000

       ShareWomen  Sample_size       Employed      Full_time      Part_time
count  172.000000   173.000000     173.000000     173.000000     173.000000
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473
min      0.000000     2.000000       0.000000     111.000000       0.000000
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000

       Full_time_year_round    Unemployed  Unemployment_rate         Median         P25th
count            173.000000    173.000000         173.000000     173.000000    173.000000
mean           19694.427746   2416.329480           0.068191   40151.445087  29501.445087
std            33160.941514   4112.803148           0.030331   11470.181802   9166.005235
min              111.000000      0.000000           0.000000   22000.000000  18500.000000
25%             2453.000000    304.000000           0.050306   33000.000000  24000.000000
50%             7413.000000    893.000000           0.067961   36000.000000  27000.000000
75%            16891.000000   2393.000000           0.087557   45000.000000  33000.000000
max           199897.000000  28169.000000           0.177226  110000.000000  95000.000000

               P75th   College_jobs  Non_college_jobs  Low_wage_jobs
count     173.000000     173.000000        173.000000     173.000000
mean    51494.219653   12322.635838      13284.497110    3859.017341
std     14906.279740   21299.868863      23789.655363    6944.998579
min     22000.000000       0.000000          0.000000       0.000000
25%     42000.000000    1675.000000       1591.000000     340.000000
50%     47000.000000    4390.000000       4595.000000    1231.000000
75%     60000.000000   14444.000000      11783.000000    3466.000000
max    125000.000000  151643.000000     148395.000000   48207.000000

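First, let’s clean up the data by dropping the rows that contain NaN values. The count row in the summary above shows 172 for Total, Men, Women, and ShareWomen but 173 for every other column, so at least one row has missing data.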

recent_grads = recent_grads.dropna()
recent_grads
RankMajor_codeMajorTotalMenWomenMajor_categoryShareWomenSample_sizeEmployed...Part_timeFull_time_year_roundUnemployedUnemployment_rateMedianP25thP75thCollege_jobsNon_college_jobsLow_wage_jobs
012419PETROLEUM ENGINEERING2339.02057.0282.0Engineering0.120564361976...2701207370.018381110000950001250001534364193
122416MINING AND MINERAL ENGINEERING756.0679.077.0Engineering0.1018527640...170388850.11724175000550009000035025750
232415METALLURGICAL ENGINEERING856.0725.0131.0Engineering0.1530373648...133340160.02409673000500001050004561760
342417NAVAL ARCHITECTURE AND MARINE ENGINEERING1258.01123.0135.0Engineering0.10731316758...150692400.0501257000043000800005291020
452405CHEMICAL ENGINEERING32260.021239.011021.0Engineering0.34163128925694...51801669716720.061098650005000075000183144440972
562418NUCLEAR ENGINEERING2573.02200.0373.0Engineering0.144967171857...26414494000.17722665000500001020001142657244
676202ACTUARIAL SCIENCE3777.02110.01667.0Business0.441356512912...29624823080.0956526200053000720001768314259
785001ASTRONOMY AND ASTROPHYSICS1792.0832.0960.0Physical Sciences0.535714101526...553827330.0211676200031500109000972500220
892414MECHANICAL ENGINEERING91227.080320.010907.0Engineering0.119559102976442...131015463946500.05734260000480007000052844163843253
9102408ELECTRICAL ENGINEERING81527.065511.016016.0Engineering0.19645063161928...126954141338950.05917460000450007200045829108743170
10112407COMPUTER ENGINEERING41542.033258.08284.0Engineering0.19941339932506...51462362122750.065409600004500075000236945721980
11122401AEROSPACE ENGINEERING15058.012953.02105.0Engineering0.13979314711391...272487907940.06516260000420007000081842425372
12132404BIOMEDICAL ENGINEERING14955.08407.06548.0Engineering0.4378477910047...2694598610190.09208460000360007000064392471789
13145008MATERIALS SCIENCE4279.02949.01330.0Engineering0.310820223307...8781967780.023043600003900065000262639181
14152409ENGINEERING MECHANICS PHYSICS AND SCIENCE4321.03526.0795.0Engineering0.183985303608...8112004230.0063345800025000740002439947263
15162402BIOLOGICAL ENGINEERING8925.06062.02863.0Engineering0.320784556170...198334135890.08714357100400007600036031595524
16172412INDUSTRIAL AND MANUFACTURING ENGINEERING18968.012453.06515.0Engineering0.34347318315604...2243113266990.04287657000379006700083063235640
17182400GENERAL ENGINEERING61152.045683.015469.0Engineering0.25296042544931...71993354028590.05982456000360006900026898117343192
18192403ARCHITECTURAL ENGINEERING2825.01835.0990.0Engineering0.350442262575...34318481700.0619315400038000650001665649137
19203201COURT REPORTING1148.0877.0271.0Law & Public Policy0.23606314930...223808110.011690540005000054000402528144
20212102COMPUTER SCIENCE128319.099743.028576.0Computers & Mathematics0.2226951196102087...187267093268840.06317353000390007000068622256675144
22232502ELECTRICAL ENGINEERING TECHNOLOGY11565.08181.03384.0Engineering0.292607978587...187356818240.08755752000350006000051262686696
23242413MATERIALS ENGINEERING AND MATERIALS SCIENCE2993.02020.0973.0Engineering0.325092222449...10401151700.027789520003500062000191130570
24256212MANAGEMENT INFORMATION SYSTEMS AND STATISTICS18713.013496.05217.0Business0.27879027816413...24201301710150.05824051000380006000063425741708
25262406CIVIL ENGINEERING53153.041081.012072.0Engineering0.22711856543041...100802919632700.0706105000040000600002852693562899
26275601CONSTRUCTION SERVICES18498.016820.01678.0Industrial Arts & Consumer Services0.09071329516318...17511231310420.06002350000360006000032755351703
27286204OPERATIONS LOGISTICS AND E-COMMERCE11732.07921.03811.0Business0.32483815610027...118377245040.04785950000400006000014663629285
28292499MISCELLANEOUS ENGINEERING9133.07398.01735.0Engineering0.1899701187428...166254765970.07439350000390006500034452426365
29305402PUBLIC POLICY5978.02639.03339.0Law & Public Policy0.558548554547...130627766700.12842650000350007000015501871340
30312410ENVIRONMENTAL ENGINEERING4047.02662.01385.0Engineering0.342229262983...93019513080.0935895000042000560002028830260
..................................................................
1431441105PLANT SCIENCE AND AGRONOMY7416.04897.02519.0Agriculture & Natural Resources0.3396711106594...124645223140.045455320002290040000208935451231
1441452308SCIENCE AND COMPUTER TEACHER EDUCATION6483.02049.04434.0Education0.683943595362...122732472660.04726432000280003900042141106591
1451465200PSYCHOLOGY393735.086648.0307087.0Psychology & Social Work0.7799332584307933...115172174438281690.08381131500240004100012514814186048207
1461476002MUSIC60633.029909.030724.0Arts0.50672141947662...249432142539180.07596031000223004200013752287869286
1471482306PHYSICAL AND HEALTH EDUCATION TEACHING28213.015670.012543.0Education0.44458225923794...72301365119200.0746673100024000400001277793282042
1481496006ART HISTORY AND CRITICISM21030.03240.017790.0Humanities & Liberal Arts0.84593420417579...6140996511280.060298310002300040000513997383426
1491506000FINE ARTS74440.024786.049654.0Arts0.66703462359679...236563187754860.084186305002100041000207923272511880
1501512901FAMILY AND CONSUMER SCIENCES58001.05166.052835.0Industrial Arts & Consumer Services0.91093351846624...158722690633550.06712830000229004000020985201335248
1511525404SOCIAL WORK53552.05137.048415.0Psychology & Social Work0.90407537445038...134812758833290.06882830000250003500027449144164344
1521531103ANIMAL SCIENCES21573.05347.016226.0Agriculture & Natural Resources0.75214425517112...5353108249170.050862300002200040000544395712125
1531546003VISUAL AND PERFORMING ARTS16250.04133.012117.0Arts0.74566213212870...6253632214650.102197300002200040000384976352840
1541552312TEACHER EDUCATION: MULTIPLE LEVELS14443.02734.011709.0Education0.81070414213076...221484574960.036546300002400037000107661949722
1551565299MISCELLANEOUS PSYCHOLOGY9628.01936.07692.0Psychology & Social Work0.798920607653...322138384190.051908300002080040000296039481650
1561575403HUMAN SERVICES AND COMMUNITY ORGANIZATION9374.0885.08489.0Psychology & Social Work0.905590898294...240550613260.03781930000240003500028784595724
1571583402HUMANITIES6652.02013.04639.0Humanities & Liberal Arts0.697384495052...222526613720.068584300002000049000116833541141
1581594901THEOLOGY AND RELIGIOUS VOCATIONS30207.018616.011591.0Humanities & Liberal Arts0.38371931024202...87671394416170.0626282900022000380009927120373304
1591606007STUDIO ARTS16977.04754.012223.0Arts0.71997418213908...5673741313680.089552290001920038300394887073586
1601612201COSMETOLOGY SERVICES AND CULINARY ARTS10510.04364.06146.0Industrial Arts & Consumer Services0.5847761178650...206459495100.05567729000200003600056373843163
1611621199MISCELLANEOUS AGRICULTURE1488.0404.01084.0Agriculture & Natural Resources0.728495241290...335936820.05976729000230004210048362631
1621635502ANTHROPOLOGY AND ARCHEOLOGY38844.011376.027468.0Humanities & Liberal Arts0.70713624729633...145151323233950.1027922800020000380009805166936866
1631646102COMMUNICATION DISORDERS SCIENCES AND SERVICES38279.01225.037054.0Health0.9679989529763...138621446014870.0475842800020000400001995794045125
1641652307EARLY CHILDHOOD EDUCATION37589.01167.036422.0Education0.96895434232551...70012074813600.0401052800021000350002351577052868
1651662603OTHER FOREIGN LANGUAGES11204.03472.07732.0Humanities & Liberal Arts0.690111567052...368532148460.107116275002290038000232637031115
1661676001DRAMA AND THEATER ARTS43249.014440.028809.0Arts0.66611935736165...159941689130400.07754127000192003500069942531311068
1671683302COMPOSITION AND RHETORIC18953.07022.011931.0Humanities & Liberal Arts0.62950515115053...6612783213400.081742270002000035000485581003466
1681693609ZOOLOGY8409.03050.05359.0Biology & Life Science0.637293476259...219036023040.04632026000200003900027712947743
1691705201EDUCATIONAL PSYCHOLOGY2854.0522.02332.0Psychology & Social Work0.81709972125...57212111480.065112250002400034000148861582
1701715202CLINICAL PSYCHOLOGY2838.0568.02270.0Psychology & Social Work0.799859132101...64812933680.149048250002500040000986870622
1711725203COUNSELING PSYCHOLOGY4626.0931.03695.0Psychology & Social Work0.798746213777...96527382140.05362123400192002600024031245308
1721733501LIBRARY SCIENCE1098.0134.0964.0Education0.8779602742...237410870.104946220002000022000288338192

172 rows × 21 columns
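
Before dropping rows, it can help to check how many values are missing and in which columns. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from recent-grads.csv) to show `isna().sum()` and the row count before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one row has a missing value, mirroring the situation above.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that `dropna()` keeps the original index labels, which is why the label 21 is absent from the output above.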

Let’s begin exploring the data using scatter plots and see if we can draw any interesting correlations.

recent_grads.plot(x='Sample_size', y='Median', kind='scatter')
recent_grads.plot(x='Sample_size', y='Unemployment_rate', kind='scatter')
recent_grads.plot(x='Full_time', y='Median', kind='scatter')
recent_grads.plot(x='ShareWomen', y='Unemployment_rate', kind='scatter')
recent_grads.plot(x='Men', y='Median', kind='scatter')
recent_grads.plot(x='Women', y='Median', kind='scatter')

[Figures: six scatter plots, one per call above: Sample_size vs. Median, Sample_size vs. Unemployment_rate, Full_time vs. Median, ShareWomen vs. Unemployment_rate, Men vs. Median, Women vs. Median]

From the ‘Unemployment_rate’ vs. ‘ShareWomen’ plot, there appears to be no clear correlation between a major’s unemployment rate and its share of women.

These scatter plots don’t reveal much else, so let’s explore the data a bit further using histograms instead.

In each histogram, the y-axis shows the frequency (the number of majors that fall in each bin) and the x-axis shows the values of the column specified in the code.
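
To make the two axes concrete, here is a minimal sketch that computes bin counts directly with `numpy.histogram`. The salary values are made up for illustration; `Series.hist()` does the binning and the drawing in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for a column such as 'Median'.
values = pd.Series([22000, 30000, 31000, 36000, 45000, 60000, 110000])

# Split the range [min, max] into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=4)

# Each bin's edges belong on the x-axis; its count is the bar height on the y-axis.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>9.0f} to {right:>9.0f}: {n} value(s)")
```

With `bins=25`, as used in this post, the same range is simply cut into 25 narrower intervals.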

recent_grads['Median'].hist(bins=25)
[Figure: histogram of Median, 25 bins]

recent_grads['Employed'].hist(bins=25)
[Figure: histogram of Employed, 25 bins]

recent_grads['Full_time'].hist(bins=25)
[Figure: histogram of Full_time, 25 bins]

recent_grads['ShareWomen'].hist(bins=25)
[Figure: histogram of ShareWomen, 25 bins]

recent_grads['Unemployment_rate'].hist(bins=25)
[Figure: histogram of Unemployment_rate, 25 bins]

recent_grads['Men'].hist(bins=25)
[Figure: histogram of Men, 25 bins]

recent_grads['Women'].hist(bins=25)
[Figure: histogram of Women, 25 bins]

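Again, there is not much to see in terms of correlation, since a histogram shows the distribution of a single column rather than the relationship between two. We do see that unemployment rates vary from major to major: if the unemployment rate were unrelated to major, we would expect the values to cluster tightly around a single rate rather than spread across a range.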

Next we’ll use scatter matrix from pandas to see if we can draw more insight. A scatter matrix can plot many different variables together and allow us to quickly see if there are correlations between those variables.
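
As a rough sketch of what `scatter_matrix` produces, the example below runs it on random placeholder data (the column names `a`, `b`, `c` are made up). It returns a grid of Axes with one row and one column per variable; by default the diagonal cells show each variable's histogram and the off-diagonal cells show the pairwise scatter plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical random data; the column names are placeholders.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

# scatter_matrix returns a 2-D array of Axes, one per pair of columns.
axes = scatter_matrix(toy, figsize=(6, 6))
print(axes.shape)
```

In a notebook that already uses `%matplotlib inline`, the `matplotlib.use("Agg")` line is not needed.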

from pandas.plotting import scatter_matrix

scatter_matrix(recent_grads[['Sample_size', 'Median']], figsize=(10,10))
[Figure: 2×2 scatter matrix of Sample_size and Median]

scatter_matrix(recent_grads[['Men', 'ShareWomen', 'Median']], figsize=(10,10))
[Figure: 3×3 scatter matrix of Men, ShareWomen, and Median]

We are not seeing much correlation in these plots, although there is a weak negative correlation between ‘ShareWomen’ and ‘Median’: majors with fewer women tend to have higher earnings. This could be because high-paying majors such as engineering tend to have fewer women.
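
A correlation read off a plot can also be put into a number with `Series.corr()`, which computes the Pearson coefficient by default. The sketch below runs it on eight (ShareWomen, Median) pairs copied from the rows shown earlier in this post. Because this subset is hand-picked from the top and bottom of the ranking, its coefficient is not representative of the full data set; on the full DataFrame the equivalent call would be `recent_grads['ShareWomen'].corr(recent_grads['Median'])`.

```python
import pandas as pd

# A few (ShareWomen, Median) pairs copied from the DataFrame output above.
# This is an illustrative subset, not the full data set.
sample = pd.DataFrame(
    {
        "ShareWomen": [0.120564, 0.101852, 0.441356, 0.535714,
                       0.222695, 0.779933, 0.968954, 0.877960],
        "Median": [110000, 75000, 62000, 62000,
                   53000, 31500, 28000, 22000],
    },
    index=[
        "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGINEERING",
        "ACTUARIAL SCIENCE", "ASTRONOMY AND ASTROPHYSICS",
        "COMPUTER SCIENCE", "PSYCHOLOGY",
        "EARLY CHILDHOOD EDUCATION", "LIBRARY SCIENCE",
    ],
)

# Pearson correlation coefficient: -1 is a perfect negative linear relationship,
# 0 is no linear relationship, and 1 is a perfect positive one.
r = sample["ShareWomen"].corr(sample["Median"])
print(round(r, 2))
```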

The first ten rows in the data are mostly engineering majors, and the last ten rows are non-engineering majors. We can generate bar charts of ‘ShareWomen’ for each ‘Major’ in those two groups to see if our hypothesis is correct.
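
When selecting the last rows by position, the number of rows returned depends on where the slice starts relative to the length of the DataFrame. The sketch below uses a placeholder frame with 172 rows, the row count left after `dropna()`, to compare a hard-coded start position with a negative slice and `tail()`.

```python
import pandas as pd

# Hypothetical frame with 172 rows, matching the row count after dropna() above.
toy = pd.DataFrame({"Rank": range(1, 173)})

# A hard-coded start position depends on the total row count:
# positions 163 through 171 are only nine rows.
print(len(toy[163:]))

# A negative slice and tail() return the last ten rows regardless of length.
print(len(toy[-10:]))
print(len(toy.tail(10)))
```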

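recent_grads[:10].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')
# [-10:] selects the last ten rows; [163:] would return only nine of the 172 rows
recent_grads[-10:].plot(kind='bar', x='Major', y='ShareWomen', colormap='winter')

[Figures: two bar charts of ShareWomen by Major, one for the first ten rows and one for the last ten rows]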

Let’s plot ‘Median’ income for the same majors to see if the engineering majors earn more.

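recent_grads[:10].plot(kind='bar', x='Major', y='Median', colormap='winter')
# [-10:] selects the last ten rows; [163:] would return only nine of the 172 rows
recent_grads[-10:].plot(kind='bar', x='Major', y='Median', colormap='winter')

[Figures: two bar charts of Median by Major, one for the first ten rows and one for the last ten rows]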

Our hypothesis appears to be correct, at least for the majors we selected: majors with fewer women, such as engineering, tend to have higher median salaries.
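
One possible way to take this check beyond the first and last ten rows is to group by ‘Major_category’ and compare averages. The sketch below does this for eight rows copied from the DataFrame output earlier in the post (four Engineering majors and four Psychology & Social Work majors). It is an illustrative subset only; on the full data the same `groupby` call would be applied to `recent_grads`.

```python
import pandas as pd

# Eight rows copied from the DataFrame output above: Petroleum, Mining, Chemical
# and Nuclear Engineering, then Psychology, Social Work, Clinical Psychology and
# Counseling Psychology. This is a subset chosen for illustration.
sample = pd.DataFrame(
    {
        "Major_category": ["Engineering"] * 4 + ["Psychology & Social Work"] * 4,
        "ShareWomen": [0.120564, 0.101852, 0.341631, 0.144967,
                       0.779933, 0.904075, 0.799859, 0.798746],
        "Median": [110000, 75000, 65000, 65000,
                   31500, 30000, 25000, 23400],
    }
)

# Average share of women and average median salary per category.
summary = sample.groupby("Major_category")[["ShareWomen", "Median"]].mean()
print(summary)
```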


Learning Summary

Python concepts explored: pandas, matplotlib, histograms, bar charts, scatterplots, scatter matrices

Python functions and methods used: .plot(), scatter_matrix(), hist(), iloc[], .head(), .tail(), .describe()

The files used for this project can be found in my GitHub repository.

This post is licensed under CC BY 4.0 by the author.