Comparison of Academia and Industry for Graduate Students

Roughly how well does pay for professors reflect market demand for PhDs? More vaguely: is it worth staying in academia? The second question is much harder to answer, and is a good topic for another blog post! The first we hope to partially answer below. We are going to look at the average pay of a full professor (courtesy of the College and University Professional Association for Human Resources) versus the median pay for (fully-employed) graduate students fresh out of grad school (courtesy of the American Community Survey 2010-2012 Public Use Microdata Series via fivethirtyeight). Note that we are comparing means with medians, so we have to be a bit wary of any conclusions drawn. It is likely a fair assumption, though, that the professor salaries are roughly normally distributed (the data cover only fully tenured professors), in which case their medians and means ought to be similar.

Now, the relationship between these two datasets may be nonlinear (wages in different sectors of the economy probably differ in terms of scaling, entry-level pay, etc.), so we will ask two specific questions:

  1. Is the order in which universities rank professor salaries similar to the order in which industry ranks recent PhD salaries?
  2. Are the distributions of average salaries similar in industry versus academia?
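
One common way to make the first question precise is Spearman's rank correlation, which compares only the two orderings and so is unaffected by any monotone nonlinearity between the pay scales. The sketch below is an illustration rather than the method necessarily used later in this post. The helper names `ranks` and `spearman` and the toy salary figures are made up, and the shortcut formula assumes there are no tied values.

```python
def ranks(values):
    """Return the rank (1 = smallest) of each value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the tie-free shortcut formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Made-up salaries (in thousands) for four hypothetical disciplines.
professor = [100, 120, 90, 110]
industry = [70, 95, 60, 65]

rho = spearman(professor, industry)
print(round(rho, 3))  # 0.8
```

A value near 1 would mean the two sectors order the disciplines almost identically, while a value near 0 would mean the orderings are unrelated.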

As usual, we will use the pandas package for Python:

In [1]:
import pandas as pd

We will also need a package to read the messier PDF from CUPA-HR. We will use tabula (installed as the tabula-py package):

In [2]:
from tabula import read_pdf

Now, due to some white space errors, we will need to deal with page 3 of this pdf separately. First we import the other pages (1, 2 and 4) into a dataframe:

In [3]:
tables = read_pdf('https://www.cupahr.org/wp-content/uploads/2017/07/FHE-2016-2-Digit-Average-Salaries-Tenured-and-Tenure-Track.pdf', pages='1,2,4')
In [4]:
tables.head(10)
Out[4]:
Unnamed: 0 Unweighted Average Salary
0 Discipline and Rank All Public Private
1 [01.] AGRICULTURE, AGRICULTURE OPERATIONS, AND... NaN
2 Professor 102,328 102,691 96,185
3 Associate Professor 79,433 79,822 73,870
4 Assistant Professor 70,273 71,230 59,673
5 New Assistant Professor 72,135 72,581 *
6 Instructor * * *
7 [03.] NATURAL RESOURCES AND CONSERVATION NaN
8 Professor 100,200 100,512 98,627
9 Associate Professor 77,234 76,847 78,718
In [5]:
tables_bad = read_pdf('https://www.cupahr.org/wp-content/uploads/2017/07/FHE-2016-2-Digit-Average-Salaries-Tenured-and-Tenure-Track.pdf', pages='3')
In [6]:
tables_bad.head(10)
Out[6]:
Unnamed: 0 Unnamed: 1 Unweighted Average Salary
0 NaN Discipline and Rank All Public Private
1 [30.] MULTI/INTERDISCIPLINARY STUDIES NaN NaN
2 Professor NaN 105,855 111,614 96,907
3 Associate Professor NaN 79,387 81,362 77,121
4 Assistant Professor NaN 65,466 65,423 65,514
5 New Assistant Professor NaN 61,493 61,661 61,289
6 Instructor NaN 49,855 * *
7 [31.] PARKS, RECREATION, LEISURE AND FITNESS S... NaN NaN
8 Professor NaN 89,281 91,035 85,096
9 Associate Professor NaN 70,977 71,834 69,411

Tabula has accidentally created an unwanted extra column on page 3. We will deal with this later. The second column of the first table has the salaries we want (those of full professors), as well as salaries for more junior faculty. We are going to throw the junior-faculty salaries away, as they are not quite what we are looking for (they vary too much based on the teaching needs of institutions).

In [7]:
list(tables.iloc[5].values)[1]
Out[7]:
'72,135 72,581 *'

Since these entries are all strings with white spaces, we need a function to clean the data by:

  1. Extracting the relevant part of the string (in the above case, the first number, 72,135).
  2. Turning the string into an integer.

We will also drop the numbering in front of the discipline name. The numbers look like two-digit CIP codes, which would explain the gaps in the sequence; either way, we do not need them.

In [8]:
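def main_sal(s):
    # First attempt: join the digits on either side of the first comma,
    # e.g. '72,135 72,581 *' -> 72135.
    i=s.find(',')
    w=s[0:i]+s[i+1:i+4]
    return int(w)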
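
A quick check of this first attempt (an addition, not part of the original notebook) on three strings of the kind found in the table. The function is repeated so the snippet runs by itself.

```python
def main_sal(s):
    i = s.find(',')
    w = s[0:i] + s[i+1:i+4]
    return int(w)


print(main_sal('72,135 72,581 *'))         # 72135
print(main_sal('102,328 102,691 96,185'))  # 102328

# Rows where the survey reports only '*' contain no comma and no digits,
# so int() raises a ValueError.
try:
    main_sal('* * *')
except ValueError as err:
    print('cannot parse:', err)
```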

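This simple version breaks on entries such as '* * *' and on the discipline names, so we will use regular expressions instead. First, import the regular expression package to search through the strings: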

In [9]:
import re

Next, we will write two functions, one to clean the strings, the second to clean the columns by applying the first function to each entry.

In [10]:
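def clean(s):
    # Non-strings (e.g. NaN) and unrecognised strings return None.
    if isinstance(s, str):
        discipline_name=re.search(r'\]',s)
        salary=re.match(r'\d',s)
        if discipline_name:
            # Drop the '[NN.] ' prefix and keep the discipline name.
            i=discipline_name.start()
            return s[i+2:]
        elif salary:
            # Keep the first number (the 'All' column) and strip its comma.
            i=re.search(r'\s',s).start()
            j=re.search(',',s).start()
            return int(s[:j]+s[j+1:i])
    return None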
In [11]:
def clean_list(L):
    # Clean every entry once, keeping only those that clean() recognises.
    cleaned=[clean(l) for l in L]
    return [c for c in cleaned if c is not None]

Let's see if this works:

In [12]:
clean_list(['72,135 72,581 *','[30.] MULTI/INTERDISCIPLINARY STUDIES','*'])
Out[12]:
[72135, 'MULTI/INTERDISCIPLINARY STUDIES']

OK, that seems to do it. First, we'll flatten the rows of the dataframe into a single list (skipping the first entry, a NaN from the heading row; the heading strings themselves will be dropped by the cleaning step):

In [13]:
t=tables_bad.values
M=[]
for x in t:
    M+=(list(x))
print(M[1:])
['Discipline and Rank', 'All Public Private', '[30.] MULTI/INTERDISCIPLINARY STUDIES', nan, nan, 'Professor', nan, '105,855 111,614 96,907', 'Associate Professor', nan, '79,387 81,362 77,121', 'Assistant Professor', nan, '65,466 65,423 65,514', 'New Assistant Professor', nan, '61,493 61,661 61,289', 'Instructor', nan, '49,855 * *', '[31.] PARKS, RECREATION, LEISURE AND FITNESS STUDIES', nan, nan, 'Professor', nan, '89,281 91,035 85,096', 'Associate Professor', nan, '70,977 71,834 69,411', 'Assistant Professor', nan, '60,302 60,984 59,006', 'New Assistant Professor', nan, '61,159 62,255 57,752', 'Instructor', nan, '51,841 53,285 49,273', '[38.] PHILOSOPHY AND RELIGIOUS STUDIES', nan, nan, 'Professor', nan, '92,741 93,625 92,205', 'Associate Professor', nan, '70,937 70,635 71,136', 'Assistant Professor', nan, '59,808 59,786 59,824', 'New Assistant Professor', nan, '60,738 59,813 61,537', 'Instructor', nan, '53,166 49,580 55,318', '[39.] THEOLOGY AND RELIGIOUS VOCATIONS', nan, nan, 'Professor', nan, '79,838 * 79,838', 'Associate Professor', nan, '65,783 * 65,757', 'Assistant Professor', nan, '56,590 * 56,590', 'New Assistant Professor', nan, '57,794 * 57,794', 'Instructor', nan, '* * *', '[40.] PHYSICAL SCIENCES', nan, nan, 'Professor', nan, '97,733 99,180 95,655', 'Associate Professor', nan, '74,574 75,632 73,135', 'Assistant Professor', nan, '64,685 66,099 62,630', 'New Assistant Professor', nan, '66,108 67,593 63,500', 'Instructor', nan, '54,080 53,524 55,807', '[42.] PSYCHOLOGY', nan, nan, 'Professor', nan, '94,218 96,201 92,174', 'Associate Professor', nan, '71,872 72,262 71,493', 'Assistant Professor', nan, '61,542 61,832 61,219', 'New Assistant Professor', nan, '61,965 62,470 61,044', 'Instructor', nan, '52,924 52,688 53,367', '[43.] 
HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE', nan, nan, 'Professor', nan, '91,192 93,325 86,426', 'Associate Professor', nan, '72,036 72,125 71,851', 'Assistant Professor', nan, '59,818 60,029 59,378', 'New Assistant Professor', nan, '59,742 59,363 60,841', 'Instructor', nan, '54,395 49,679 60,998', '[44.] PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', nan, nan, 'Professor', nan, '99,243 100,438 96,289', 'Associate Professor', nan, '75,713 76,216 74,644', 'Assistant Professor', nan, '64,266 65,199 62,166', 'New Assistant Professor', nan, '64,477 66,061 59,921', 'Instructor', nan, '51,079 48,772 *', '[45.] SOCIAL SCIENCES', nan, nan, 'Professor', nan, '99,219 98,367 100,395', 'Associate Professor', nan, '76,367 75,408 77,702', 'Assistant Professor', nan, '65,446 64,806 66,416', 'New Assistant Professor', nan, '67,376 66,686 68,825', 'Instructor', nan, '51,788 48,925 59,077']

Let us clean the list.

In [14]:
clean_list(M[1:])
Out[14]:
['MULTI/INTERDISCIPLINARY STUDIES',
 105855,
 79387,
 65466,
 61493,
 49855,
 'PARKS, RECREATION, LEISURE AND FITNESS STUDIES',
 89281,
 70977,
 60302,
 61159,
 51841,
 'PHILOSOPHY AND RELIGIOUS STUDIES',
 92741,
 70937,
 59808,
 60738,
 53166,
 'THEOLOGY AND RELIGIOUS VOCATIONS',
 79838,
 65783,
 56590,
 57794,
 'PHYSICAL SCIENCES',
 97733,
 74574,
 64685,
 66108,
 54080,
 'PSYCHOLOGY',
 94218,
 71872,
 61542,
 61965,
 52924,
 'HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
 91192,
 72036,
 59818,
 59742,
 54395,
 'PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS',
 99243,
 75713,
 64266,
 64477,
 51079,
 'SOCIAL SCIENCES',
 99219,
 76367,
 65446,
 67376,
 51788]

Now we only want the first number after each discipline name (the full professor average), so we will make a smaller list with just the discipline names and their corresponding salaries for full professors only:

In [15]:
L=clean_list(M[1:])
L_fixed=[[L[i],L[i+1]] for i in range(len(L)) if type(L[i])==str]
In [16]:
L_fixed
Out[16]:
[['MULTI/INTERDISCIPLINARY STUDIES', 105855],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 89281],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 92741],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 79838],
 ['PHYSICAL SCIENCES', 97733],
 ['PSYCHOLOGY', 94218],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  91192],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 99243],
 ['SOCIAL SCIENCES', 99219]]

This looks good. Let's put it in a new dataframe:

In [17]:
tables_good=pd.DataFrame(L_fixed)
In [18]:
tables_good.columns=['Discipline','Average Professor Salary']
In [19]:
tables_good
Out[19]:
Discipline Average Professor Salary
0 MULTI/INTERDISCIPLINARY STUDIES 105855
1 PARKS, RECREATION, LEISURE AND FITNESS STUDIES 89281
2 PHILOSOPHY AND RELIGIOUS STUDIES 92741
3 THEOLOGY AND RELIGIOUS VOCATIONS 79838
4 PHYSICAL SCIENCES 97733
5 PSYCHOLOGY 94218
6 HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTI... 91192
7 PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFE... 99243
8 SOCIAL SCIENCES 99219

Now we'll clean the rest of the data. First, let's print a list of the rows:

In [20]:
t2=tables.values
M2=[]
for x in t2:
    M2+=list(x)
print(M2[1:])
['All Public Private', '[01.] AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', nan, 'Professor', '102,328 102,691 96,185', 'Associate Professor', '79,433 79,822 73,870', 'Assistant Professor', '70,273 71,230 59,673', 'New Assistant Professor', '72,135 72,581 *', 'Instructor', '* * *', '[03.] NATURAL RESOURCES AND CONSERVATION', nan, 'Professor', '100,200 100,512 98,627', 'Associate Professor', '77,234 76,847 78,718', 'Assistant Professor', '66,397 66,806 64,390', 'New Assistant Professor', '70,753 70,924 *', 'Instructor', '* * *', '[04.] ARCHITECTURE AND RELATED SERVICES', nan, 'Professor', '108,653 110,056 102,031', 'Associate Professor', '83,470 82,532 87,671', 'Assistant Professor', '67,821 67,626 68,668', 'New Assistant Professor', '68,848 68,532 70,680', 'Instructor', '* * *', '[05.] AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', nan, 'Professor', '107,572 107,836 106,779', 'Associate Professor', '79,123 77,193 83,554', 'Assistant Professor', '65,560 64,404 68,755', 'New Assistant Professor', '66,895 65,799 69,574', 'Instructor', '* * *', '[09.] COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', nan, 'Professor', '92,241 93,285 90,946', 'Associate Professor', '70,967 70,994 70,935', 'Assistant Professor', '59,924 59,090 60,993', 'New Assistant Professor', '59,842 58,781 61,910', 'Instructor', '49,087 48,679 50,164', '[10.] COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES', nan, 'Professor', '94,089 95,487 91,572', 'Associate Professor', '74,291 74,081 74,652', 'Assistant Professor', '61,691 61,313 62,182', 'New Assistant Professor', '56,864 60,665 *', 'Instructor', '* * *', '[11.] COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', nan, 'Professor', '113,646 115,757 109,876', 'Associate Professor', '92,906 94,362 90,456', 'Assistant Professor', '81,810 82,719 79,753', 'New Assistant Professor', '84,821 85,094 83,749', 'Instructor', '62,225 64,150 58,616', '[13.] 
EDUCATION', nan, 'Professor', '92,764 92,870 92,517', 'Associate Professor', '71,722 71,465 72,263', 'Assistant Professor', '61,253 61,358 61,013', 'New Assistant Professor', '60,526 60,592 60,306', 'Instructor', '53,502 54,191 51,746', '[14.] ENGINEERING', nan, 'Professor', '129,012 129,700 127,138', 'Associate Professor', '97,023 96,720 97,873', 'Assistant Professor', '84,197 83,825 85,194', 'New Assistant Professor', '83,419 83,798 82,324', 'Instructor', '66,499 64,668', nan, 'Unweighted Average Salary', 'Discipline and Rank', 'All Public Private', '[15.] ENGINEERING TECHNOLOGIES AND ENGINEERING RELATED FIELDS', nan, 'Professor', '97,103 96,276 105,237', 'Associate Professor', '78,984 78,061 91,235', 'Assistant Professor', '69,892 68,873 79,279', 'New Assistant Professor', '69,665 68,735 *', 'Instructor', '52,747 52,747 *', '[16.] FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', nan, 'Professor', '94,698 92,710 96,910', 'Associate Professor', '71,466 69,276 73,867', 'Assistant Professor', '59,838 58,937 60,861', 'New Assistant Professor', '59,402 59,113 59,979', 'Instructor', '52,427 48,592 56,902', '[19.] FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', nan, 'Professor', '99,572 99,014 102,487', 'Associate Professor', '75,529 75,032 78,491', 'Assistant Professor', '64,453 64,435 64,562', 'New Assistant Professor', '64,777 64,769 *', 'Instructor', '48,556 * *', '[22.] LEGAL PROFESSIONS AND STUDIES', nan, 'Professor', '145,732 141,284 150,114', 'Associate Professor', '109,109 104,828 113,742', 'Assistant Professor', '95,606 89,861 104,044', 'New Assistant Professor', '90,429 86,424 97,106', 'Instructor', '83,255 * 85,645', '[23.] ENGLISH LANGUAGE AND LITERATURE/LETTERS', nan, 'Professor', '87,735 86,961 88,545', 'Associate Professor', '68,015 67,092 69,008', 'Assistant Professor', '58,242 57,576 58,976', 'New Assistant Professor', '57,592 57,113 58,359', 'Instructor', '46,893 45,712 49,648', '[24.] 
LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', nan, 'Professor', '91,954 92,577 90,991', 'Associate Professor', '71,290 72,267 69,826', 'Assistant Professor', '59,798 60,360 59,001', 'New Assistant Professor', '59,427 60,348 57,278', 'Instructor', '* *', '[25.] LIBRARY SCIENCE', nan, 'Professor', '88,531 87,805 91,074', 'Associate Professor', '72,513 71,422 76,375', 'Assistant Professor', '61,037 61,978 58,147', 'New Assistant Professor', '63,432 62,104 *', 'Instructor', '* * *', '[26.] BIOLOGICAL AND BIOMEDICAL SCIENCES', nan, 'Professor', '103,879 109,228 94,492', 'Associate Professor', '76,932 79,559 72,928', 'Assistant Professor', '66,524 69,120 62,571', 'New Assistant Professor', '64,922 66,375 61,797', 'Instructor', '53,923 52,750 55,169', '[27.] MATHEMATICS AND STATISTICS', nan, 'Professor', '94,710 95,593 93,558', 'Associate Professor', '72,777 73,392 72,021', 'Assistant Professor', '64,411 65,690 62,681', 'New Assistant Professor', '66,062 67,981 62,690', 'Instructor', '49,571 48,789 51,032', nan, 'Unweighted Average Salary', 'Discipline and Rank', 'All Public Private', '[49.] TRANSPORTATION AND MATERIAL SERVICES', nan, 'Professor', '92,888 91,288 *', 'Associate Professor', '79,678 80,977 *', 'Assistant Professor', '66,272 67,541 *', 'New Assistant Professor', '* * *', 'Instructor', '* * *', '[50.] VISUAL AND PERFORMING ARTS', nan, 'Professor', '87,065 85,962 88,422', 'Associate Professor', '68,320 67,057 69,889', 'Assistant Professor', '57,800 56,579 59,424', 'New Assistant Professor', '57,367 56,074 60,560', 'Instructor', '49,569 49,272 49,975', '[51.] HEALTH PROFESSIONS AND RELATED PROGRAMS', nan, 'Professor', '108,064 110,346 102,914', 'Associate Professor', '83,630 84,221 82,476', 'Assistant Professor', '70,512 70,688 70,181', 'New Assistant Professor', '71,605 71,910 70,981', 'Instructor', '62,751 64,080 60,775', '[52.] 
BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES', nan, 'Professor', '129,904 131,720 127,104', 'Associate Professor', '110,031 113,193 105,215', 'Assistant Professor', '105,958 109,128 101,076', 'New Assistant Professor', '113,924 115,807 109,612', 'Instructor', '72,307 76,446 62,787', '[54.] HISTORY GENERAL', nan, 'Professor', '89,536 89,663 89,416', 'Associate Professor', '68,593 67,677 69,542', 'Assistant Professor', '59,141 58,425 59,977', 'New Assistant Professor', '58,412 57,348 60,625', 'Instructor', '45,449 43,704 48,066']

The rest is the same as for the previous part of the data.

In [21]:
L2=clean_list(M2[1:])
L2_fixed=[[L2[i],L2[i+1]] for i in range(len(L2)) if type(L2[i])==str]
In [22]:
L2_fixed
Out[22]:
[['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 102328],
 ['NATURAL RESOURCES AND CONSERVATION', 100200],
 ['ARCHITECTURE AND RELATED SERVICES', 108653],
 ['AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', 107572],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 92241],
 ['COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES', 94089],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 113646],
 ['EDUCATION', 92764],
 ['ENGINEERING', 129012],
 ['ENGINEERING TECHNOLOGIES AND ENGINEERING RELATED FIELDS', 97103],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 94698],
 ['FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', 99572],
 ['LEGAL PROFESSIONS AND STUDIES', 145732],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 87735],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 91954],
 ['LIBRARY SCIENCE', 88531],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 103879],
 ['MATHEMATICS AND STATISTICS', 94710],
 ['TRANSPORTATION AND MATERIAL SERVICES', 92888],
 ['VISUAL AND PERFORMING ARTS', 87065],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 108064],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES', 129904],
 ['HISTORY GENERAL', 89536]]

Finally we join them together and get a dataframe:

In [23]:
Lf_fixed=L_fixed+L2_fixed
Lf_fixed.sort()
df_prof=pd.DataFrame(Lf_fixed)
# A flat list of names; a nested list can create MultiIndex columns in recent pandas.
df_prof.columns=['discipline','average_professor_salary']
In [24]:
df_prof.head()
Out[24]:
discipline average_professor_salary
0 AGRICULTURE, AGRICULTURE OPERATIONS, AND RELAT... 102328
1 ARCHITECTURE AND RELATED SERVICES 108653
2 AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES 107572
3 BIOLOGICAL AND BIOMEDICAL SCIENCES 103879
4 BUSINESS, MANAGEMENT, MARKETING, AND RELATED S... 129904
In [25]:
df_prof.describe()
Out[25]:
average_professor_salary
count 32.000000
mean 100037.375000
std 13729.242248
min 79838.000000
25% 92169.250000
50% 95906.500000
75% 104373.000000
max 145732.000000
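
The summary above gives a small, concrete view of the mean-versus-median caveat raised at the start. As a sanity check (an addition, not part of the original notebook), the sketch below recomputes both statistics with the standard library from the 32 salaries listed in the outputs above. Note that this describes the spread across disciplines, not the spread of individual salaries within a discipline, so it does not test the normality assumption directly.

```python
from statistics import mean, median

# The 32 full-professor averages listed in the outputs above.
salaries = [
    105855, 89281, 92741, 79838, 97733, 94218, 91192, 99243, 99219,
    102328, 100200, 108653, 107572, 92241, 94089, 113646, 92764, 129012,
    97103, 94698, 99572, 145732, 87735, 91954, 88531, 103879, 94710,
    92888, 87065, 108064, 129904, 89536,
]

print(len(salaries))     # 32
print(mean(salaries))    # 100037.375
print(median(salaries))  # 95906.5
```

The mean sits roughly 4,000 above the median, pulled up by a few high-paying disciplines such as law, business and engineering.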

We now have half of our dataframe. The other half comes from the student median wage data:

In [26]:
df_student=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv')
In [27]:
df_student.columns
Out[27]:
Index(['Major_code', 'Major', 'Major_category', 'Grad_total',
       'Grad_sample_size', 'Grad_employed', 'Grad_full_time_year_round',
       'Grad_unemployed', 'Grad_unemployment_rate', 'Grad_median', 'Grad_P25',
       'Grad_P75', 'Nongrad_total', 'Nongrad_employed',
       'Nongrad_full_time_year_round', 'Nongrad_unemployed',
       'Nongrad_unemployment_rate', 'Nongrad_median', 'Nongrad_P25',
       'Nongrad_P75', 'Grad_share', 'Grad_premium'],
      dtype='object')

We only need one of these columns, 'Grad_median'. The top of the table looks like this:

In [28]:
df_student.head()
Out[28]:
Major_code Major Major_category Grad_total Grad_sample_size Grad_employed Grad_full_time_year_round Grad_unemployed Grad_unemployment_rate Grad_median ... Nongrad_total Nongrad_employed Nongrad_full_time_year_round Nongrad_unemployed Nongrad_unemployment_rate Nongrad_median Nongrad_P25 Nongrad_P75 Grad_share Grad_premium
0 5601 CONSTRUCTION SERVICES Industrial Arts & Consumer Services 9173 200 7098 6511 681 0.087543 75000.0 ... 86062 73607 62435 3928 0.050661 65000.0 47000 98000.0 0.096320 0.153846
1 6004 COMMERCIAL ART AND GRAPHIC DESIGN Arts 53864 882 40492 29553 2482 0.057756 60000.0 ... 461977 347166 250596 25484 0.068386 48000.0 34000 71000.0 0.104420 0.250000
2 6211 HOSPITALITY MANAGEMENT Business 24417 437 18368 14784 1465 0.073867 65000.0 ... 179335 145597 113579 7409 0.048423 50000.0 35000 75000.0 0.119837 0.300000
3 2201 COSMETOLOGY SERVICES AND CULINARY ARTS Industrial Arts & Consumer Services 5411 72 3590 2701 316 0.080901 47000.0 ... 37575 29738 23249 1661 0.052900 41600.0 29000 60000.0 0.125878 0.129808
4 2001 COMMUNICATION TECHNOLOGIES Computers & Mathematics 9109 171 7512 5622 466 0.058411 57000.0 ... 53819 43163 34231 3389 0.072800 52000.0 36000 78000.0 0.144753 0.096154

5 rows × 22 columns

First, let's get a list of all the majors:

In [29]:
L=sorted(list(df_student['Major'].unique()))

Then, let's make a smaller dataframe with only the information we will be using. We keep track of sample size because we will have to combine some student majors: the student data is split into many more categories than the professor data.

In [30]:
df_student_sub=df_student[['Major','Grad_median','Grad_sample_size']]
In [31]:
df_student_sub.shape
Out[31]:
(173, 3)

So we have 173 student majors, versus the 32 disciplines in the professor salary dataframe. In order to combine them we will merge student majors under professor disciplines in an essentially ad hoc manner, throwing out a couple but keeping most in. It may have been more accurate (and it certainly would have been quicker!) to throw away all but those that share essentially the same name, but that would be ignoring a lot of the data. We will then calculate a 'weighted median', which will be the mean of all the student medians falling under the same professor discipline, weighted by sample size. This statistic is prima facie a bit of a mutant, but, assuming the student salaries are normally distributed (so that each median is close to the corresponding mean), it ought to approximate the mean graduate student starting salary for someone from that discipline.
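
As a minimal sketch of the 'weighted median' for a single group: the two rows below are taken from the student data, but putting them in the same group is only an assumption for illustration, since the actual grouping is defined by the author's dictionary. The function name `weighted_median` is likewise made up here.

```python
# (major, Grad_median, Grad_sample_size) rows from the student data.
rows = [
    ('CRIMINAL JUSTICE AND FIRE PROTECTION', 68000.0, 3794),
    ('CRIMINOLOGY', 65000.0, 381),
]


def weighted_median(rows):
    """Mean of the group's medians, weighted by each major's sample size."""
    total_weight = sum(n for _, _, n in rows)
    return sum(m * n for _, m, n in rows) / total_weight


print(round(weighted_median(rows), 2))  # about 67726.23
```

The result lands much closer to the larger major's median, which is the point of weighting by sample size.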

In [32]:
L
Out[32]:
['ACCOUNTING',
 'ACTUARIAL SCIENCE',
 'ADVERTISING AND PUBLIC RELATIONS',
 'AEROSPACE ENGINEERING',
 'AGRICULTURAL ECONOMICS',
 'AGRICULTURE PRODUCTION AND MANAGEMENT',
 'ANIMAL SCIENCES',
 'ANTHROPOLOGY AND ARCHEOLOGY',
 'APPLIED MATHEMATICS',
 'ARCHITECTURAL ENGINEERING',
 'ARCHITECTURE',
 'AREA ETHNIC AND CIVILIZATION STUDIES',
 'ART AND MUSIC EDUCATION',
 'ART HISTORY AND CRITICISM',
 'ASTRONOMY AND ASTROPHYSICS',
 'ATMOSPHERIC SCIENCES AND METEOROLOGY',
 'BIOCHEMICAL SCIENCES',
 'BIOLOGICAL ENGINEERING',
 'BIOLOGY',
 'BIOMEDICAL ENGINEERING',
 'BOTANY',
 'BUSINESS ECONOMICS',
 'BUSINESS MANAGEMENT AND ADMINISTRATION',
 'CHEMICAL ENGINEERING',
 'CHEMISTRY',
 'CIVIL ENGINEERING',
 'CLINICAL PSYCHOLOGY',
 'COGNITIVE SCIENCE AND BIOPSYCHOLOGY',
 'COMMERCIAL ART AND GRAPHIC DESIGN',
 'COMMUNICATION DISORDERS SCIENCES AND SERVICES',
 'COMMUNICATION TECHNOLOGIES',
 'COMMUNICATIONS',
 'COMMUNITY AND PUBLIC HEALTH',
 'COMPOSITION AND RHETORIC',
 'COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY',
 'COMPUTER AND INFORMATION SYSTEMS',
 'COMPUTER ENGINEERING',
 'COMPUTER NETWORKING AND TELECOMMUNICATIONS',
 'COMPUTER PROGRAMMING AND DATA PROCESSING',
 'COMPUTER SCIENCE',
 'CONSTRUCTION SERVICES',
 'COSMETOLOGY SERVICES AND CULINARY ARTS',
 'COUNSELING PSYCHOLOGY',
 'COURT REPORTING',
 'CRIMINAL JUSTICE AND FIRE PROTECTION',
 'CRIMINOLOGY',
 'DRAMA AND THEATER ARTS',
 'EARLY CHILDHOOD EDUCATION',
 'ECOLOGY',
 'ECONOMICS',
 'EDUCATIONAL ADMINISTRATION AND SUPERVISION',
 'EDUCATIONAL PSYCHOLOGY',
 'ELECTRICAL ENGINEERING',
 'ELECTRICAL ENGINEERING TECHNOLOGY',
 'ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES AND PRODUCTION',
 'ELEMENTARY EDUCATION',
 'ENGINEERING AND INDUSTRIAL MANAGEMENT',
 'ENGINEERING MECHANICS PHYSICS AND SCIENCE',
 'ENGINEERING TECHNOLOGIES',
 'ENGLISH LANGUAGE AND LITERATURE',
 'ENVIRONMENTAL ENGINEERING',
 'ENVIRONMENTAL SCIENCE',
 'FAMILY AND CONSUMER SCIENCES',
 'FILM VIDEO AND PHOTOGRAPHIC ARTS',
 'FINANCE',
 'FINE ARTS',
 'FOOD SCIENCE',
 'FORESTRY',
 'FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
 'GENERAL AGRICULTURE',
 'GENERAL BUSINESS',
 'GENERAL EDUCATION',
 'GENERAL ENGINEERING',
 'GENERAL MEDICAL AND HEALTH SERVICES',
 'GENERAL SOCIAL SCIENCES',
 'GENETICS',
 'GEOGRAPHY',
 'GEOLOGICAL AND GEOPHYSICAL ENGINEERING',
 'GEOLOGY AND EARTH SCIENCE',
 'GEOSCIENCES',
 'HEALTH AND MEDICAL ADMINISTRATIVE SERVICES',
 'HEALTH AND MEDICAL PREPARATORY PROGRAMS',
 'HISTORY',
 'HOSPITALITY MANAGEMENT',
 'HUMAN RESOURCES AND PERSONNEL MANAGEMENT',
 'HUMAN SERVICES AND COMMUNITY ORGANIZATION',
 'HUMANITIES',
 'INDUSTRIAL AND MANUFACTURING ENGINEERING',
 'INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY',
 'INDUSTRIAL PRODUCTION TECHNOLOGIES',
 'INFORMATION SCIENCES',
 'INTERCULTURAL AND INTERNATIONAL STUDIES',
 'INTERDISCIPLINARY SOCIAL SCIENCES',
 'INTERNATIONAL BUSINESS',
 'INTERNATIONAL RELATIONS',
 'JOURNALISM',
 'LANGUAGE AND DRAMA EDUCATION',
 'LIBERAL ARTS',
 'LIBRARY SCIENCE',
 'LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE',
 'MANAGEMENT INFORMATION SYSTEMS AND STATISTICS',
 'MARKETING AND MARKETING RESEARCH',
 'MASS MEDIA',
 'MATERIALS ENGINEERING AND MATERIALS SCIENCE',
 'MATERIALS SCIENCE',
 'MATHEMATICS',
 'MATHEMATICS AND COMPUTER SCIENCE',
 'MATHEMATICS TEACHER EDUCATION',
 'MECHANICAL ENGINEERING',
 'MECHANICAL ENGINEERING RELATED TECHNOLOGIES',
 'MEDICAL ASSISTING SERVICES',
 'MEDICAL TECHNOLOGIES TECHNICIANS',
 'METALLURGICAL ENGINEERING',
 'MICROBIOLOGY',
 'MILITARY TECHNOLOGIES',
 'MINING AND MINERAL ENGINEERING',
 'MISCELLANEOUS AGRICULTURE',
 'MISCELLANEOUS BIOLOGY',
 'MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION',
 'MISCELLANEOUS EDUCATION',
 'MISCELLANEOUS ENGINEERING',
 'MISCELLANEOUS ENGINEERING TECHNOLOGIES',
 'MISCELLANEOUS FINE ARTS',
 'MISCELLANEOUS HEALTH MEDICAL PROFESSIONS',
 'MISCELLANEOUS PSYCHOLOGY',
 'MISCELLANEOUS SOCIAL SCIENCES',
 'MOLECULAR BIOLOGY',
 'MULTI-DISCIPLINARY OR GENERAL SCIENCE',
 'MULTI/INTERDISCIPLINARY STUDIES',
 'MUSIC',
 'NATURAL RESOURCES MANAGEMENT',
 'NAVAL ARCHITECTURE AND MARINE ENGINEERING',
 'NEUROSCIENCE',
 'NUCLEAR ENGINEERING',
 'NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES',
 'NURSING',
 'NUTRITION SCIENCES',
 'OCEANOGRAPHY',
 'OPERATIONS LOGISTICS AND E-COMMERCE',
 'OTHER FOREIGN LANGUAGES',
 'PETROLEUM ENGINEERING',
 'PHARMACOLOGY',
 'PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION',
 'PHILOSOPHY AND RELIGIOUS STUDIES',
 'PHYSICAL AND HEALTH EDUCATION TEACHING',
 'PHYSICAL FITNESS PARKS RECREATION AND LEISURE',
 'PHYSICAL SCIENCES',
 'PHYSICS',
 'PHYSIOLOGY',
 'PLANT SCIENCE AND AGRONOMY',
 'POLITICAL SCIENCE AND GOVERNMENT',
 'PRE-LAW AND LEGAL STUDIES',
 'PSYCHOLOGY',
 'PUBLIC ADMINISTRATION',
 'PUBLIC POLICY',
 'SCHOOL STUDENT COUNSELING',
 'SCIENCE AND COMPUTER TEACHER EDUCATION',
 'SECONDARY TEACHER EDUCATION',
 'SOCIAL PSYCHOLOGY',
 'SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION',
 'SOCIAL WORK',
 'SOCIOLOGY',
 'SOIL SCIENCE',
 'SPECIAL NEEDS EDUCATION',
 'STATISTICS AND DECISION SCIENCE',
 'STUDIO ARTS',
 'TEACHER EDUCATION: MULTIPLE LEVELS',
 'THEOLOGY AND RELIGIOUS VOCATIONS',
 'TRANSPORTATION SCIENCES AND TECHNOLOGIES',
 'TREATMENT THERAPY PROFESSIONS',
 'UNITED STATES HISTORY',
 'VISUAL AND PERFORMING ARTS',
 'ZOOLOGY']

The corresponding list of professor disciplines is:

In [33]:
df_prof['discipline']
Out[33]:
discipline
0 AGRICULTURE, AGRICULTURE OPERATIONS, AND RELAT...
1 ARCHITECTURE AND RELATED SERVICES
2 AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES
3 BIOLOGICAL AND BIOMEDICAL SCIENCES
4 BUSINESS, MANAGEMENT, MARKETING, AND RELATED S...
5 COMMUNICATION, JOURNALISM AND RELATED PROGRAMS
6 COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SU...
7 COMPUTER AND INFORMATION SCIENCES AND SUPPORT ...
8 EDUCATION
9 ENGINEERING
10 ENGINEERING TECHNOLOGIES AND ENGINEERING RELAT...
11 ENGLISH LANGUAGE AND LITERATURE/LETTERS
12 FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES
13 FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS
14 HEALTH PROFESSIONS AND RELATED PROGRAMS
15 HISTORY GENERAL
16 HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTI...
17 LEGAL PROFESSIONS AND STUDIES
18 LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND...
19 LIBRARY SCIENCE
20 MATHEMATICS AND STATISTICS
21 MULTI/INTERDISCIPLINARY STUDIES
22 NATURAL RESOURCES AND CONSERVATION
23 PARKS, RECREATION, LEISURE AND FITNESS STUDIES
24 PHILOSOPHY AND RELIGIOUS STUDIES
25 PHYSICAL SCIENCES
26 PSYCHOLOGY
27 PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFE...
28 SOCIAL SCIENCES
29 THEOLOGY AND RELIGIOUS VOCATIONS
30 TRANSPORTATION AND MATERIAL SERVICES
31 VISUAL AND PERFORMING ARTS

We now set up a dictionary that sends each student major to our chosen over-arching professor discipline. This was saved in a separate file due to its unsightly length, and can be found on GitHub here. It will allow us to merge the two datasets.

In [34]:
from academia_dict import d
In [35]:
Disciplines=set(d.values())
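
To show the shape of such a mapping, here is a hypothetical four-entry excerpt. The real dictionary `d` is in the separate file and its entries may differ; the name `d_example` and the particular pairings are guesses made for illustration only.

```python
from collections import defaultdict

# Hypothetical excerpt: student major -> professor discipline.
d_example = {
    'MATHEMATICS': 'MATHEMATICS AND STATISTICS',
    'APPLIED MATHEMATICS': 'MATHEMATICS AND STATISTICS',
    'PHYSICS': 'PHYSICAL SCIENCES',
    'CHEMISTRY': 'PHYSICAL SCIENCES',
}

majors = ['MATHEMATICS', 'APPLIED MATHEMATICS', 'PHYSICS', 'CHEMISTRY',
          'COURT REPORTING']

grouped = defaultdict(list)
unmapped = []
for major in majors:
    if major in d_example:
        grouped[d_example[major]].append(major)
    else:
        unmapped.append(major)  # majors with no entry in the mapping

print(dict(grouped))
print(unmapped)
print(set(d_example.values()))
```

Collecting the unmapped majors separately makes it easy to see which ones are being thrown out.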

Now we convert the student dataframe into a list:

In [36]:
student_list=df_student_sub.values.tolist()
In [37]:
student_list
Out[37]:
[['CONSTRUCTION SERVICES', 75000.0, 200],
 ['COMMERCIAL ART AND GRAPHIC DESIGN', 60000.0, 882],
 ['HOSPITALITY MANAGEMENT', 65000.0, 437],
 ['COSMETOLOGY SERVICES AND CULINARY ARTS', 47000.0, 72],
 ['COMMUNICATION TECHNOLOGIES', 57000.0, 171],
 ['COURT REPORTING', 75000.0, 22],
 ['MARKETING AND MARKETING RESEARCH', 80000.0, 3738],
 ['AGRICULTURE PRODUCTION AND MANAGEMENT', 67000.0, 386],
 ['COMPUTER PROGRAMMING AND DATA PROCESSING', 85000.0, 98],
 ['ADVERTISING AND PUBLIC RELATIONS', 60000.0, 688],
 ['FILM VIDEO AND PHOTOGRAPHIC ARTS', 57000.0, 370],
 ['ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES AND PRODUCTION',
  62000.0,
  45],
 ['MECHANICAL ENGINEERING RELATED TECHNOLOGIES', 78000.0, 111],
 ['MASS MEDIA', 57000.0, 828],
 ['TRANSPORTATION SCIENCES AND TECHNOLOGIES', 90000.0, 538],
 ['COMPUTER NETWORKING AND TELECOMMUNICATIONS', 80000.0, 218],
 ['MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION', 75000.0, 408],
 ['MISCELLANEOUS ENGINEERING TECHNOLOGIES', 80000.0, 315],
 ['INDUSTRIAL PRODUCTION TECHNOLOGIES', 84500.0, 408],
 ['MISCELLANEOUS FINE ARTS', 55000.0, 27],
 ['CRIMINAL JUSTICE AND FIRE PROTECTION', 68000.0, 3794],
 ['BUSINESS MANAGEMENT AND ADMINISTRATION', 77000.0, 16129],
 ['CRIMINOLOGY', 65000.0, 381],
 ['MANAGEMENT INFORMATION SYSTEMS AND STATISTICS', 89000.0, 963],
 ['COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY', 81000.0, 194],
 ['OPERATIONS LOGISTICS AND E-COMMERCE', 94000.0, 335],
 ['GENERAL BUSINESS', 85000.0, 10399],
 ['MEDICAL TECHNOLOGIES TECHNICIANS', 76000.0, 942],
 ['COMPUTER AND INFORMATION SYSTEMS', 80000.0, 1425],
 ['COMMUNICATIONS', 65000.0, 4879],
 ['ACTUARIAL SCIENCE', 110000.0, 56],
 ['ELECTRICAL ENGINEERING TECHNOLOGY', 85000.0, 521],
 ['JOURNALISM', 70000.0, 2244],
 ['MEDICAL ASSISTING SERVICES', 80000.0, 326],
 ['ENGINEERING TECHNOLOGIES', 74000.0, 219],
 ['ACCOUNTING', 88000.0, 11774],
 ['FINE ARTS', 58000.0, 2528],
 ['NURSING', 84000.0, 10432],
 ['INFORMATION SCIENCES', 84000.0, 551],
 ['ARCHITECTURAL ENGINEERING', 78000.0, 143],
 ['MULTI/INTERDISCIPLINARY STUDIES', 55000.0, 318],
 ['NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES', 80000.0, 65],
 ['GENERAL AGRICULTURE', 68000.0, 764],
 ['FORESTRY', 78000.0, 487],
 ['LIBERAL ARTS', 70000.0, 3420],
 ['HUMAN SERVICES AND COMMUNITY ORGANIZATION', 50100.0, 555],
 ['VISUAL AND PERFORMING ARTS', 53000.0, 275],
 ['NATURAL RESOURCES MANAGEMENT', 70000.0, 659],
 ['STUDIO ARTS', 50750.0, 500],
 ['FAMILY AND CONSUMER SCIENCES', 58000.0, 2294],
 ['PHYSICAL FITNESS PARKS RECREATION AND LEISURE', 60000.0, 2423],
 ['FINANCE', 95000.0, 6319],
 ['PETROLEUM ENGINEERING', 124000.0, 164],
 ['PLANT SCIENCE AND AGRONOMY', 67000.0, 624],
 ['HUMAN RESOURCES AND PERSONNEL MANAGEMENT', 70000.0, 1316],
 ['INTERNATIONAL BUSINESS', 72000.0, 604],
 ['COMPOSITION AND RHETORIC', 58000.0, 332],
 ['DRAMA AND THEATER ARTS', 58600.0, 1069],
 ['BUSINESS ECONOMICS', 94000.0, 642],
 ['ENGINEERING AND INDUSTRIAL MANAGEMENT', 107000.0, 340],
 ['COMPUTER SCIENCE', 95000.0, 6674],
 ['HEALTH AND MEDICAL ADMINISTRATIVE SERVICES', 79000.0, 898],
 ['AGRICULTURAL ECONOMICS', 80000.0, 305],
 ['ENVIRONMENTAL SCIENCE', 68000.0, 925],
 ['GEOGRAPHY', 73000.0, 1008],
 ['MISCELLANEOUS ENGINEERING', 90000.0, 497],
 ['ECOLOGY', 62000.0, 465],
 ['INTERDISCIPLINARY SOCIAL SCIENCES', 66000.0, 541],
 ['ARCHITECTURE', 72000.0, 2760],
 ['SOIL SCIENCE', 65000.0, 61],
 ['PRE-LAW AND LEGAL STUDIES', 76000.0, 544],
 ['GENERAL ENGINEERING', 100000.0, 4345],
 ['MULTI-DISCIPLINARY OR GENERAL SCIENCE', 86000.0, 3494],
 ['CIVIL ENGINEERING', 98000.0, 4057],
 ['COMPUTER ENGINEERING', 97000.0, 1806],
 ['MINING AND MINERAL ENGINEERING', 100000.0, 126],
 ['EARLY CHILDHOOD EDUCATION', 50000.0, 1396],
 ['SOCIOLOGY', 64000.0, 6155],
 ['GENERAL SOCIAL SCIENCES', 69000.0, 1069],
 ['ANIMAL SCIENCES', 70300.0, 1335],
 ['TREATMENT THERAPY PROFESSIONS', 70000.0, 2607],
 ['MISCELLANEOUS AGRICULTURE', 54000.0, 98],
 ['MECHANICAL ENGINEERING', 100000.0, 7285],
 ['HUMANITIES', 65000.0, 450],
 ['FOOD SCIENCE', 72000.0, 266],
 ['INDUSTRIAL AND MANUFACTURING ENGINEERING', 98000.0, 1758],
 ['GEOLOGICAL AND GEOPHYSICAL ENGINEERING', 105000.0, 66],
 ['SOCIAL PSYCHOLOGY', 71000.0, 119],
 ['NAVAL ARCHITECTURE AND MARINE ENGINEERING', 102000.0, 197],
 ['MATHEMATICS AND COMPUTER SCIENCE', 98000.0, 103],
 ['ART HISTORY AND CRITICISM', 65000.0, 892],
 ['MISCELLANEOUS HEALTH MEDICAL PROFESSIONS', 60000.0, 848],
 ['GENERAL MEDICAL AND HEALTH SERVICES', 70000.0, 1172],
 ['INTERCULTURAL AND INTERNATIONAL STUDIES', 70000.0, 600],
 ['NUTRITION SCIENCES', 65000.0, 641],
 ['ECONOMICS', 100000.0, 9822],
 ['PHYSICAL AND HEALTH EDUCATION TEACHING', 65000.0, 3061],
 ['COMMUNITY AND PUBLIC HEALTH', 68500.0, 638],
 ['ELECTRICAL ENGINEERING', 106000.0, 10070],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 48000.0, 3112],
 ['OCEANOGRAPHY', 90000.0, 197],
 ['MISCELLANEOUS EDUCATION', 61000.0, 2091],
 ['BIOLOGICAL ENGINEERING', 80000.0, 433],
 ['PUBLIC ADMINISTRATION', 75000.0, 750],
 ['ELEMENTARY EDUCATION', 55000.0, 15410],
 ['INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY', 75000.0, 221],
 ['MILITARY TECHNOLOGIES', 74000.0, 29],
 ['GENERAL EDUCATION', 58000.0, 13846],
 ['MUSIC', 60000.0, 2759],
 ['ART AND MUSIC EDUCATION', 59000.0, 2679],
 ['LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE', 65000.0, 853],
 ['MATERIALS ENGINEERING AND MATERIALS SCIENCE', 92000.0, 335],
 ['ANTHROPOLOGY AND ARCHEOLOGY', 65000.0, 1971],
 ['SOCIAL WORK', 53000.0, 4537],
 ['ENGLISH LANGUAGE AND LITERATURE', 67000.0, 13688],
 ['TEACHER EDUCATION: MULTIPLE LEVELS', 55000.0, 1122],
 ['GEOLOGY AND EARTH SCIENCE', 84000.0, 1814],
 ['PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION', 111000.0, 2872],
 ['OTHER FOREIGN LANGUAGES', 68000.0, 784],
 ['PSYCHOLOGY', 64000.0, 20547],
 ['AREA ETHNIC AND CIVILIZATION STUDIES', 75000.0, 1573],
 ['PHYSICAL SCIENCES', 80000.0, 141],
 ['ATMOSPHERIC SCIENCES AND METEOROLOGY', 82000.0, 254],
 ['CHEMICAL ENGINEERING', 102000.0, 3306],
 ['AEROSPACE ENGINEERING', 107000.0, 1240],
 ['HISTORY', 80000.0, 11486],
 ['MISCELLANEOUS SOCIAL SCIENCES', 73000.0, 306],
 ['APPLIED MATHEMATICS', 100000.0, 386],
 ['STATISTICS AND DECISION SCIENCE', 92000.0, 429],
 ['FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES',
  67000.0,
  3187],
 ['SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION', 60000.0, 1646],
 ['MATHEMATICS', 89000.0, 6906],
 ['POLITICAL SCIENCE AND GOVERNMENT', 92000.0, 14467],
 ['INTERNATIONAL RELATIONS', 86000.0, 1348],
 ['ENVIRONMENTAL ENGINEERING', 81000.0, 252],
 ['MISCELLANEOUS BIOLOGY', 65000.0, 553],
 ['MISCELLANEOUS PSYCHOLOGY', 68000.0, 590],
 ['METALLURGICAL ENGINEERING', 100000.0, 251],
 ['SECONDARY TEACHER EDUCATION', 61000.0, 3094],
 ['GEOSCIENCES', 90000.0, 204],
 ['UNITED STATES HISTORY', 82000.0, 311],
 ['ENGINEERING MECHANICS PHYSICS AND SCIENCE', 100000.0, 447],
 ['COGNITIVE SCIENCE AND BIOPSYCHOLOGY', 95000.0, 118],
 ['LANGUAGE AND DRAMA EDUCATION', 58000.0, 2757],
 ['NUCLEAR ENGINEERING', 110000.0, 243],
 ['PUBLIC POLICY', 89000.0, 338],
 ['MATHEMATICS TEACHER EDUCATION', 60000.0, 1194],
 ['SCIENCE AND COMPUTER TEACHER EDUCATION', 62000.0, 993],
 ['MICROBIOLOGY', 85000.0, 1631],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 65000.0, 4566],
 ['SPECIAL NEEDS EDUCATION', 58000.0, 3240],
 ['BOTANY', 70000.0, 323],
 ['BIOLOGY', 95000.0, 21994],
 ['ASTRONOMY AND ASTROPHYSICS', 96000.0, 136],
 ['CHEMISTRY', 100000.0, 8694],
 ['PHYSIOLOGY', 90000.0, 1155],
 ['BIOMEDICAL ENGINEERING', 90000.0, 475],
 ['LIBRARY SCIENCE', 52000.0, 314],
 ['MOLECULAR BIOLOGY', 85000.0, 875],
 ['PHARMACOLOGY', 105000.0, 168],
 ['ZOOLOGY', 110000.0, 1978],
 ['PHYSICS', 100000.0, 4361],
 ['NEUROSCIENCE', 58000.0, 286],
 ['EDUCATIONAL PSYCHOLOGY', 61000.0, 396],
 ['BIOCHEMICAL SCIENCES', 96000.0, 2765],
 ['GENETICS', 78000.0, 261],
 ['MATERIALS SCIENCE', 95000.0, 299],
 ['COMMUNICATION DISORDERS SCIENCES AND SERVICES', 65000.0, 2947],
 ['COUNSELING PSYCHOLOGY', 50000.0, 724],
 ['CLINICAL PSYCHOLOGY', 70000.0, 355],
 ['HEALTH AND MEDICAL PREPARATORY PROGRAMS', 135000.0, 1766],
 ['SCHOOL STUDENT COUNSELING', 56000.0, 260],
 ['EDUCATIONAL ADMINISTRATION AND SUPERVISION', 65000.0, 841]]

The function we use to clean it up uses the dictionary from earlier to change each student discipline name to the corresponding professor discipline name. Any student discipline that has no entry in the dictionary is dropped:

In [38]:
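def student_cleaner(L):
    # Map each student discipline to its professor discipline via the dictionary d;
    # rows whose discipline has no entry in d are skipped.
    result=[]
    for l in L:
        try:
            r=[d[l[0]],l[1],l[2]]
            result.append(r)
        except KeyError:
            pass
    return result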
In [39]:
cleaned_student_list=student_cleaner(student_list)
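
Since the cleaner drops unmapped disciplines silently, it can be worth checking which names were lost. The following is only a sketch: `split_mapped`, `toy_map` and `toy_rows` are hypothetical names, and the one-entry mapping stands in for the real dictionary `d`, which is not reproduced here.

```python
def split_mapped(rows, mapping):
    """Split rows of [name, median, sample_size] into mapped rows and unmapped names."""
    mapped, unmapped = [], []
    for name, median, n in rows:
        if name in mapping:
            mapped.append([mapping[name], median, n])
        else:
            unmapped.append(name)
    return mapped, unmapped

# Toy stand-ins for the real dictionary and student list (assumptions for illustration).
toy_map = {"BOTANY": "NATURAL RESOURCES AND CONSERVATION"}
toy_rows = [["BOTANY", 70000.0, 323], ["UNLISTED FIELD", 1.0, 1]]

mapped, unmapped = split_mapped(toy_rows, toy_map)
print(mapped)
print(unmapped)
```

Printing the unmapped names makes it easy to see whether a discipline was left out on purpose or because of a spelling mismatch between the two datasets.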

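Next, we sort the list alphabetically by discipline name, so that rows belonging to the same professor discipline sit next to each other: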

In [40]:
cleaned_student_list.sort(key= lambda x:x[0])

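Finally, we work out the overall average for each professor discipline: the mean of the student medians, weighted by sample size. (A weighted mean of medians is only an approximation to the true overall median.)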

In [41]:
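final_student_list=[]
n=len(cleaned_student_list)
# x is the running row for the current discipline: [name, sum of median*size, sum of sizes].
# Note that x refers to a row of cleaned_student_list, so that row is modified in place.
x=cleaned_student_list[0]
x[1]=x[2]*x[1]
for i in range(1,n):
    y=cleaned_student_list[i]
    if x[0]==y[0]:
        # same discipline: accumulate sample size and weighted total
        x[2]+=y[2]
        x[1]+=y[1]*y[2]
    else:
        # new discipline: store the weighted average of the previous one and start again
        z=[x[0],x[1]/x[2]]
        final_student_list.append(z)
        x=y
        x[1]=x[1]*x[2]
# store the last discipline
z=[x[0],x[1]/x[2]]
final_student_list.append(z)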
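
For comparison, the same sample-size-weighted average can be sketched without relying on sort order and without modifying the input rows. This is an alternative illustration, not the cell used above; `weighted_average_by_group` and `toy_rows` are hypothetical names and the rows are toy values.

```python
from collections import defaultdict

def weighted_average_by_group(rows):
    """rows are [name, median, sample_size]; returns [[name, weighted_average], ...]."""
    totals = defaultdict(float)   # sum of median * sample_size per name
    counts = defaultdict(int)     # sum of sample_size per name
    for name, median, n in rows:
        totals[name] += median * n
        counts[name] += n
    return [[name, totals[name] / counts[name]] for name in sorted(totals)]

# Toy rows for illustration only.
toy_rows = [["A", 75000.0, 22], ["A", 76000.0, 544], ["B", 52000.0, 314]]
averages = weighted_average_by_group(toy_rows)
print(averages)
```

Because the totals live in separate dictionaries, the input list keeps its original medians after the call.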

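Note that the loop above modifies cleaned_student_list in place: the first row of each discipline now holds the running total (median times sample size, summed over the group) and the combined sample size, rather than a single median. The cleaned list now looks like this: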

In [42]:
cleaned_student_list
Out[42]:
[['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES',
  247129500.0,
  3573],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 68000.0, 764],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 67000.0, 624],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 80000.0, 305],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 65000.0, 61],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 70300.0, 1335],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 54000.0, 98],
 ['ARCHITECTURE AND RELATED SERVICES', 198720000.0, 2760],
 ['AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', 117975000.0, 1573],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 3421360000.0, 37189],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 80000.0, 326],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 79000.0, 898],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 60000.0, 848],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 70000.0, 1172],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 65000.0, 641],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 80000.0, 433],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 65000.0, 553],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 85000.0, 1631],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 95000.0, 21994],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 90000.0, 1155],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 90000.0, 475],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 85000.0, 875],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 105000.0, 168],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 58000.0, 286],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 96000.0, 2765],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 78000.0, 261],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 135000.0, 1766],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  5209431000.0,
  60170],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  75000.0,
  408],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  77000.0,
  16129],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  94000.0,
  335],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  85000.0,
  10399],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  88000.0,
  11774],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  95000.0,
  6319],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  72000.0,
  604],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  94000.0,
  642],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  100000.0,
  9822],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 562691000.0, 8639],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 57000.0, 828],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 65000.0, 4879],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 70000.0, 2244],
 ['COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES',
  9747000.0,
  171],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES',
  1096687000.0,
  11929],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 80000.0, 218],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 89000.0, 963],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 81000.0, 194],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 80000.0, 1425],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 84000.0, 551],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 95000.0, 6674],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 97000.0, 1806],
 ['EDUCATION', 2970551000.0, 51347],
 ['EDUCATION', 65000.0, 3061],
 ['EDUCATION', 61000.0, 2091],
 ['EDUCATION', 55000.0, 15410],
 ['EDUCATION', 58000.0, 13846],
 ['EDUCATION', 55000.0, 1122],
 ['EDUCATION', 60000.0, 1646],
 ['EDUCATION', 61000.0, 3094],
 ['EDUCATION', 58000.0, 2757],
 ['EDUCATION', 60000.0, 1194],
 ['EDUCATION', 62000.0, 993],
 ['EDUCATION', 58000.0, 3240],
 ['EDUCATION', 61000.0, 396],
 ['EDUCATION', 56000.0, 260],
 ['EDUCATION', 65000.0, 841],
 ['ENGINEERING', 3398176000.0, 33799],
 ['ENGINEERING', 78000.0, 111],
 ['ENGINEERING', 80000.0, 315],
 ['ENGINEERING', 84500.0, 408],
 ['ENGINEERING', 85000.0, 521],
 ['ENGINEERING', 74000.0, 219],
 ['ENGINEERING', 78000.0, 143],
 ['ENGINEERING', 80000.0, 65],
 ['ENGINEERING', 124000.0, 164],
 ['ENGINEERING', 107000.0, 340],
 ['ENGINEERING', 90000.0, 497],
 ['ENGINEERING', 100000.0, 4345],
 ['ENGINEERING', 98000.0, 4057],
 ['ENGINEERING', 100000.0, 126],
 ['ENGINEERING', 100000.0, 7285],
 ['ENGINEERING', 98000.0, 1758],
 ['ENGINEERING', 105000.0, 66],
 ['ENGINEERING', 102000.0, 197],
 ['ENGINEERING', 106000.0, 10070],
 ['ENGINEERING', 92000.0, 335],
 ['ENGINEERING', 107000.0, 1240],
 ['ENGINEERING', 81000.0, 252],
 ['ENGINEERING', 100000.0, 251],
 ['ENGINEERING', 100000.0, 447],
 ['ENGINEERING', 110000.0, 243],
 ['ENGINEERING', 95000.0, 299],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 936352000.0, 14020],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 67000.0, 13688],
 ['FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', 133052000.0, 2294],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 480214000.0, 6772],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 65000.0, 853],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 68000.0, 784],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 67000.0, 3187],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 86000.0, 1348],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 1386635000.0, 16251],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 111000.0, 2872],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 65000.0, 2947],
 ['HISTORY GENERAL', 944382000.0, 11797],
 ['HISTORY GENERAL', 82000.0, 311],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  284903000.0,
  4204],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  65000.0,
  381],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  74000.0,
  29],
 ['LEGAL PROFESSIONS AND STUDIES', 42994000.0, 566],
 ['LEGAL PROFESSIONS AND STUDIES', 76000.0, 544],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES',
  1933818000.0,
  23755],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 58000.0, 2528],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 70000.0, 3420],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 65000.0, 450],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 65000.0, 892],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 65000.0, 1971],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 92000.0, 14467],
 ['LIBRARY SCIENCE', 16328000.0, 314],
 ['MATHEMATICS AND STATISTICS', 708956000.0, 7880],
 ['MATHEMATICS AND STATISTICS', 98000.0, 103],
 ['MATHEMATICS AND STATISTICS', 100000.0, 386],
 ['MATHEMATICS AND STATISTICS', 92000.0, 429],
 ['MATHEMATICS AND STATISTICS', 89000.0, 6906],
 ['MULTI/INTERDISCIPLINARY STUDIES', 300484000.0, 3494],
 ['NATURAL RESOURCES AND CONSERVATION', 416036000.0, 4837],
 ['NATURAL RESOURCES AND CONSERVATION', 70000.0, 659],
 ['NATURAL RESOURCES AND CONSERVATION', 68000.0, 925],
 ['NATURAL RESOURCES AND CONSERVATION', 62000.0, 465],
 ['NATURAL RESOURCES AND CONSERVATION', 70000.0, 323],
 ['NATURAL RESOURCES AND CONSERVATION', 110000.0, 1978],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 293710500.0, 4731],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 50100.0, 555],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 60000.0, 2423],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 70000.0, 1316],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 296790000.0, 4566],
 ['PHYSICAL SCIENCES', 1949926000.0, 20115],
 ['PHYSICAL SCIENCES', 90000.0, 197],
 ['PHYSICAL SCIENCES', 84000.0, 1814],
 ['PHYSICAL SCIENCES', 80000.0, 141],
 ['PHYSICAL SCIENCES', 82000.0, 254],
 ['PHYSICAL SCIENCES', 102000.0, 3306],
 ['PHYSICAL SCIENCES', 90000.0, 204],
 ['PHYSICAL SCIENCES', 96000.0, 136],
 ['PHYSICAL SCIENCES', 100000.0, 8694],
 ['PHYSICAL SCIENCES', 100000.0, 4361],
 ['PSYCHOLOGY', 1634902000.0, 25281],
 ['PSYCHOLOGY', 71000.0, 119],
 ['PSYCHOLOGY', 75000.0, 221],
 ['PSYCHOLOGY', 64000.0, 20547],
 ['PSYCHOLOGY', 68000.0, 590],
 ['PSYCHOLOGY', 95000.0, 118],
 ['PSYCHOLOGY', 50000.0, 724],
 ['PSYCHOLOGY', 70000.0, 355],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 370496000.0, 6263],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 75000.0, 750],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 53000.0, 4537],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 89000.0, 338],
 ['SOCIAL SCIENCES', 525725000.0, 8071],
 ['SOCIAL SCIENCES', 64000.0, 6155],
 ['SOCIAL SCIENCES', 69000.0, 1069],
 ['SOCIAL SCIENCES', 73000.0, 306],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 149376000.0, 3112],
 ['TRANSPORTATION AND MATERIAL SERVICES', 63420000.0, 738],
 ['TRANSPORTATION AND MATERIAL SERVICES', 90000.0, 538],
 ['VISUAL AND PERFORMING ARTS', 500204400.0, 8534],
 ['VISUAL AND PERFORMING ARTS', 57000.0, 370],
 ['VISUAL AND PERFORMING ARTS', 53000.0, 275],
 ['VISUAL AND PERFORMING ARTS', 50750.0, 500],
 ['VISUAL AND PERFORMING ARTS', 58600.0, 1069],
 ['VISUAL AND PERFORMING ARTS', 60000.0, 2759],
 ['VISUAL AND PERFORMING ARTS', 59000.0, 2679]]
In [43]:
final_student_list.sort(key=lambda x:x[0])

And the final list looks like this (no duplicates)

In [44]:
final_student_list
Out[44]:
[['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES',
  69165.82703610412],
 ['ARCHITECTURE AND RELATED SERVICES', 72000.0],
 ['AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', 75000.0],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 91999.24708919304],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  86578.54412497922],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 65133.81178377127],
 ['COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES', 57000.0],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 91934.52929834856],
 ['EDUCATION', 57852.47434124681],
 ['ENGINEERING', 100540.72605698393],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 66786.87589158345],
 ['FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', 58000.0],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 70911.69521559362],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 85326.133776383],
 ['HISTORY GENERAL', 80052.72526913622],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  67769.50523311133],
 ['LEGAL PROFESSIONS AND STUDIES', 75961.13074204947],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES',
  81406.777520522],
 ['LIBRARY SCIENCE', 52000.0],
 ['MATHEMATICS AND STATISTICS', 89969.03553299492],
 ['MULTI/INTERDISCIPLINARY STUDIES', 86000.0],
 ['NATURAL RESOURCES AND CONSERVATION', 86011.16394459376],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 62082.117945466074],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 65000.0],
 ['PHYSICAL SCIENCES', 96938.90131742481],
 ['PSYCHOLOGY', 64669.19821209604],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 59156.31486508063],
 ['SOCIAL SCIENCES', 65137.52942634122],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 48000.0],
 ['TRANSPORTATION AND MATERIAL SERVICES', 85934.9593495935],
 ['VISUAL AND PERFORMING ARTS', 58613.12397468948]]

Now we return to the list of professor salaries and add the student salaries to it. First, we use a dictionary to filter out any professor disciplines that have no corresponding student disciplines:

In [45]:
rev_d={x:True for x in Disciplines}
prof_list=[]
for l in Lf_fixed:
    # keep a professor row only if its discipline appears in rev_d
    if l[0] in rev_d:
        prof_list.append(l)
prof_list.sort(key = lambda x:x[0])

Then we add the student salary onto the professor salary list:

In [46]:
n=len(prof_list)
comparison_list=[]
for i in range(n):
    if final_student_list[i][0]==prof_list[i][0]:
        comparison_list.append([prof_list[i][0],round(final_student_list[i][1]),prof_list[i][1]])
    else:
        print('Error here:',i)
        break
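
The cell above pairs the two lists by position, which works only when both are sorted identically and contain exactly the same disciplines. A minimal sketch of a join by discipline name is given here as an alternative; `join_by_discipline` is a hypothetical helper, and the "HYPOTHETICAL FIELD" row is made up to show an unmatched entry (the other figures are taken from the lists in this post).

```python
def join_by_discipline(student_rows, prof_rows):
    """Pair [name, grad_salary] rows with [name, prof_salary] rows on the name."""
    student = {name: salary for name, salary in student_rows}
    joined = []
    for row in prof_rows:
        name, prof_salary = row[0], row[1]
        if name in student:   # skip professor rows with no student counterpart
            joined.append([name, round(student[name]), prof_salary])
    return sorted(joined, key=lambda r: r[0])

students = [["LIBRARY SCIENCE", 52000.0], ["EDUCATION", 57852.47434124681]]
profs = [["EDUCATION", 92764], ["HYPOTHETICAL FIELD", 1], ["LIBRARY SCIENCE", 88531]]
joined = join_by_discipline(students, profs)
print(joined)
```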

And here is the full list:

In [47]:
comparison_list
Out[47]:
[['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', 69166, 102328],
 ['ARCHITECTURE AND RELATED SERVICES', 72000, 108653],
 ['AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', 75000, 107572],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 91999, 103879],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES',
  86579,
  129904],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 65134, 92241],
 ['COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES',
  57000,
  94089],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 91935, 113646],
 ['EDUCATION', 57852, 92764],
 ['ENGINEERING', 100541, 129012],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 66787, 87735],
 ['FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', 58000, 99572],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 70912, 94698],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', 85326, 108064],
 ['HISTORY GENERAL', 80053, 89536],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  67770,
  91192],
 ['LEGAL PROFESSIONS AND STUDIES', 75961, 145732],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 81407, 91954],
 ['LIBRARY SCIENCE', 52000, 88531],
 ['MATHEMATICS AND STATISTICS', 89969, 94710],
 ['MULTI/INTERDISCIPLINARY STUDIES', 86000, 105855],
 ['NATURAL RESOURCES AND CONSERVATION', 86011, 100200],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 62082, 89281],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 65000, 92741],
 ['PHYSICAL SCIENCES', 96939, 97733],
 ['PSYCHOLOGY', 64669, 94218],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', 59156, 99243],
 ['SOCIAL SCIENCES', 65138, 99219],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 48000, 79838],
 ['TRANSPORTATION AND MATERIAL SERVICES', 85935, 92888],
 ['VISUAL AND PERFORMING ARTS', 58613, 87065]]

Finally, we turn it into a dataframe:

In [48]:
df_comparison=pd.DataFrame(comparison_list)
In [49]:
df_comparison.columns=[['discipline','weighted_median_grad_salary','mean_prof_salary']]
In [50]:
df_comparison.describe()
Out[50]:
       weighted_median_grad_salary  mean_prof_salary
count                    31.000000         31.000000
mean                  73320.451613     100132.032258
std                   14081.149252      13945.568659
min                   48000.000000      79838.000000
25%                   63375.500000      92097.500000
50%                   70912.000000      94710.000000
75%                   85967.500000     104867.000000
max                  100541.000000     145732.000000
In [51]:
import statsmodels.api as sm

First, let's look at a linear regression:

In [52]:
X=df_comparison['weighted_median_grad_salary']
X1=sm.add_constant(X)
Y=df_comparison['mean_prof_salary']
model = sm.OLS(Y,X1).fit()
model.summary()
Out[52]:
OLS Regression Results
Dep. Variable:      ('mean_prof_salary',)    R-squared:             0.267
Model:              OLS                      Adj. R-squared:        0.242
Method:             Least Squares            F-statistic:           10.58
Date:               Sat, 22 Jun 2019         Prob (F-statistic):    0.00290
Time:               18:03:10                 Log-Likelihood:        -334.49
No. Observations:   31                       AIC:                   673.0
Df Residuals:       29                       BIC:                   675.8
Df Model:           1
Covariance Type:    nonrobust

                                  coef        std err    t        P>|t|    [0.025      0.975]
const                             6.259e+04   1.17e+04   5.329    0.000    3.86e+04    8.66e+04
('weighted_median_grad_salary',)  0.5120      0.157      3.252    0.003    0.190       0.834

Omnibus:          23.726    Durbin-Watson:       2.375
Prob(Omnibus):    0.000     Jarque-Bera (JB):    41.321
Skew:             1.768     Prob(JB):            1.06e-09
Kurtosis:         7.415     Cond. No.            4.02e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.02e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
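
To make the two coefficients in the summary concrete: with a single predictor, the least-squares slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means. The sketch below illustrates that closed form on toy numbers; `ols_line` is a hypothetical helper and is not how statsmodels computes its fit internally.

```python
def ols_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x   # line passes through (mean_x, mean_y)
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x.
intercept, slope = ols_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)

# Plugging the rounded coefficients reported above into the line at the
# mean graduate salary from describe():
print(62590 + 0.5120 * 73320.451613)
```

The last line comes out at roughly 100,130, close to the mean professor salary of about 100,132 reported earlier; the small gap comes from using the rounded coefficients.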
In [53]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

Here is a plot of that regression:

In [54]:
X=np.array(df_comparison['weighted_median_grad_salary'])
Y=np.array(df_comparison['mean_prof_salary'])
pay_model=linear_model.LinearRegression()
pay_model.fit(X,Y)
prediction=pay_model.predict(np.sort(X, axis=0))

plt.scatter(X, Y)
plt.plot(np.sort(X, axis=0),prediction)
plt.show()

The slope is less than 1 (about 0.51), so higher graduate salaries are not matched by commensurate increases in professor pay. What we really want to know, though, is whether academia and industry rank the disciplines in the same order. So we calculate Spearman's rank correlation coefficient. The advantage here is that it only looks for monotonic relationships between the salaries, so it ignores a lot of the nonlinearity that may come from considering people at different points in different careers.
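
As a rough illustration of why the rank coefficient ignores nonlinearity: Spearman's coefficient is the ordinary Pearson correlation applied to the ranks of the data. The sketch below is written in plain Python with hypothetical helper names (`average_ranks`, `pearson`, `spearman`) and toy data; it is not the pandas implementation.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# A monotonic but strongly nonlinear toy relationship.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

On this toy data the rank correlation is perfect even though the linear correlation is not, because cubing preserves the ordering.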

In [55]:
df_comparison.corr(method='spearman')
Out[55]:
                             weighted_median_grad_salary  mean_prof_salary
weighted_median_grad_salary                     1.000000          0.603226
mean_prof_salary                                0.603226          1.000000

Squaring this rank correlation gives an $R^2$-style coefficient (the share of variance in one ranking explained by the other) of

In [56]:
0.603226**2
Out[56]:
0.36388160707600004

which is rather low, if we were expecting the academy to respond to the same demands as the broader job market. We conclude that academic pay ranking is not explained well by graduate job market pay ranking.

Now, how robust is this number? Can we conclude anything from this? Well, it is worth comparing pay in academia with pay in industry, as chances are that one is more lucrative than the other.

We will now perform a Wilcoxon signed-rank test on the predicted mean professor salary versus the observed mean professor salary. The null hypothesis is that the paired differences are distributed symmetrically about zero, i.e. that neither column tends to be larger than the other. The test outputs a $p$-value: the probability, assuming the null hypothesis holds, of seeing a test statistic at least as extreme as the one observed.
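
To show what the test statistic measures, here is a minimal sketch of the signed-rank sums in plain Python. The helper `signed_rank_sums` and the toy samples are hypothetical; the sketch drops zero differences and averages tied ranks, and it does not compute a $p$-value.

```python
def signed_rank_sums(a, b):
    """Return (W_plus, W_minus): rank sums of the positive and negative paired differences."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied absolute differences
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Toy paired samples for illustration.
toy_a = [10, 12, 9, 15]
toy_b = [8, 13, 5, 9]
w_plus, w_minus = signed_rank_sums(toy_a, toy_b)
print(w_plus, w_minus, min(w_plus, w_minus))
```

For a two-sided test, the smaller of the two sums is the usual statistic. If the two samples behave alike, the two sums should be similar; a very lopsided split is evidence against the null hypothesis.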

In [57]:
from scipy import stats

A bit more data cleaning (we are adding the predicted Professor Salary into our dataframe):

In [58]:
L=sorted(comparison_list, key = lambda x:x[1])
predict_prof_list=[[L[i][0],round(float(prediction[i]))] for i in range(len(L))]
predict_prof_list.sort(key= lambda x:x[0])
In [59]:
df_comparison['predicted_prof_salary']=list(map(lambda x:x[1],predict_prof_list))
In [60]:
df_comparison.describe()
Out[60]:
       weighted_median_grad_salary  mean_prof_salary  predicted_prof_salary
count                    31.000000         31.000000              31.000000
mean                  73320.451613     100132.032258          100132.000000
std                   14081.149252      13945.568659            7208.978106
min                   48000.000000      79838.000000           87169.000000
25%                   63375.500000      92097.500000           95040.500000
50%                   70912.000000      94710.000000           98899.000000
75%                   85967.500000     104867.000000          106606.500000
max                  100541.000000     145732.000000          114068.000000

It looks like it worked. Now we apply the Wilcoxon test to the rightmost two columns. First, we turn them into lists so scipy can do its thing:

In [61]:
L_obs=list(map(float,df_comparison['mean_prof_salary'].values))
L_pred=list(map(lambda x:x[1],predict_prof_list))
In [62]:
stats.wilcoxon(L_pred,L_obs)
Out[62]:
WilcoxonResult(statistic=211.0, pvalue=0.46840775868536677)

This is not very helpful. With a $p$-value of about $0.47$, we cannot reject the hypothesis that the differences between predicted and observed professor salaries are symmetric about zero, so the test gives no evidence either way on whether industry and academic salaries are distributed differently. The test would probably be more conclusive if we examined each subject individually, since some of the differences in ranking are large, while others are small. The ones with large differences are possibly where there is a significant difference in distribution.

Now, which ones are those? Well, we can look at the list of differences, sort them by size and find out.
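
The idea is to rank the disciplines once by graduate salary and once by professor salary, then subtract the two positions. A minimal sketch of that calculation follows, using a hypothetical helper `rank_differences` that sorts each column only once and assumes discipline names are unique. The four rows are taken from the comparison list, so the ranks within this small subset differ from those in the full list.

```python
def rank_differences(rows):
    """rows are [name, grad_salary, prof_salary]; returns [name, grad_rank - prof_rank]."""
    by_grad = sorted(rows, key=lambda r: r[1])
    by_prof = sorted(rows, key=lambda r: r[2])
    grad_rank = {r[0]: i for i, r in enumerate(by_grad)}
    prof_rank = {r[0]: i for i, r in enumerate(by_prof)}
    diffs = [[r[0], grad_rank[r[0]] - prof_rank[r[0]]] for r in rows]
    return sorted(diffs, key=lambda d: d[1], reverse=True)

subset = [
    ["HISTORY GENERAL", 80053, 89536],
    ["LIBRARY SCIENCE", 52000, 88531],
    ["LEGAL PROFESSIONS AND STUDIES", 75961, 145732],
    ["EDUCATION", 57852, 92764],
]
subset_diffs = rank_differences(subset)
print(subset_diffs)
```

A positive difference means the discipline ranks higher on graduate pay than on professor pay; a negative one means the reverse.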

In [63]:
diff_list=[[c[0],sorted(comparison_list,key=lambda x:x[1]).index(c)-sorted(comparison_list,key=lambda x:x[2]).index(c)] for c in comparison_list]
In [64]:
diff_list.sort(key=lambda x:x[1],reverse=True)
In [65]:
diff_list
Out[65]:
[['HISTORY GENERAL', 14],
 ['LIBERAL ARTS AND SCIENCES, GENERAL STUDIES AND HUMANITIES', 13],
 ['PHYSICAL SCIENCES', 13],
 ['MATHEMATICS AND STATISTICS', 11],
 ['TRANSPORTATION AND MATERIAL SERVICES', 11],
 ['ENGLISH LANGUAGE AND LITERATURE/LETTERS', 10],
 ['HOMELAND SECURITY, LAW ENFORCEMENT, FIREFIGHTING AND RELATED PROTECTIVE SERVICE',
  7],
 ['BIOLOGICAL AND BIOMEDICAL SCIENCES', 6],
 ['NATURAL RESOURCES AND CONSERVATION', 4],
 ['VISUAL AND PERFORMING ARTS', 4],
 ['PARKS, RECREATION, LEISURE AND FITNESS STUDIES', 3],
 ['COMMUNICATION, JOURNALISM AND RELATED PROGRAMS', 2],
 ['ENGINEERING', 2],
 ['FOREIGN LANGUAGES, LITERATURES, AND LINGUISTICS', 1],
 ['COMPUTER AND INFORMATION SCIENCES AND SUPPORT SERVICES', 0],
 ['MULTI/INTERDISCIPLINARY STUDIES', 0],
 ['PHILOSOPHY AND RELIGIOUS STUDIES', 0],
 ['THEOLOGY AND RELIGIOUS VOCATIONS', 0],
 ['LIBRARY SCIENCE', -2],
 ['BUSINESS, MANAGEMENT, MARKETING, AND RELATED SUPPORT SERVICES', -4],
 ['HEALTH PROFESSIONS AND RELATED PROGRAMS', -4],
 ['PSYCHOLOGY', -5],
 ['SOCIAL SCIENCES', -6],
 ['AGRICULTURE, AGRICULTURE OPERATIONS, AND RELATED SCIENCES', -7],
 ['AREA, ETHNIC, CULTURAL, GENDER AND GROUP STUDIES', -7],
 ['EDUCATION', -7],
 ['ARCHITECTURE AND RELATED SERVICES', -10],
 ['COMMUNICATIONS TECHNOLOGIES/TECHNICIANS AND SUPPORT SERVICES', -10],
 ['LEGAL PROFESSIONS AND STUDIES', -12],
 ['PUBLIC ADMINISTRATION AND SOCIAL SERVICE PROFESSIONS', -12],
 ['FAMILY AND CONSUMER SCIENCES/HUMAN SCIENCES', -15]]

So our data suggest that historians, humanities graduates, physicists and mathematicians are underpaid as professors relative to the rest of the job market, whereas the (seemingly more vocational) degree holders near the bottom of the list are not.
