Comparison of Academia and Industry for Graduate Students
Roughly how well does pay for professors reflect market demand for PhDs? More vaguely: is it worth staying in academia? The second question is much harder to answer - and is a good topic for another blog post! The first we hope to partially answer below. We are going to look at the average pay of a full professor (courtesy of the College and University Professional Association for Human Resources) versus the median pay for (fully-employed) graduate students fresh out of grad school (courtesy of the American Community Survey 2010-2012 Public Use Microdata Series via fivethirtyeight). Note that we are comparing means with medians, so we have to be a bit wary of any conclusions drawn, although it is probably fair to assume that the professor salaries are roughly normally distributed (the figures only cover fully tenured professors), in which case their medians and means ought to be close.
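As a quick sanity check on that assumption, here is a minimal sketch (with made-up, salary-like numbers) showing that for a roughly normal sample the mean and median nearly coincide:

import numpy as np

# Purely illustrative: a normal, salary-like sample; its mean and median should agree closely.
rng = np.random.default_rng(0)
sample = rng.normal(loc=100000, scale=15000, size=10000)
print(round(sample.mean()), round(np.median(sample)))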
Now, the relationship between these two datasets may be nonlinear (wages in different sectors of the economy probably differ in terms of scaling, entry-level pay, etc.), so we will ask two specific questions:
- Is the order in which universities rank professor salaries the same as the order in which industry ranks recent PhD salaries?
- Are the distributions of average salaries similar in industry versus academia?
As usual, we will use the pandas package for python:
import pandas as pd
We will also need a package to read the messier PDF from CUPA-HR. We will use tabula:
from tabula import read_pdf
Now, due to some white space errors, we will need to deal with page 3 of this pdf separately. First we import the other pages (1, 2 and 4) into a dataframe:
tables = read_pdf('https://www.cupahr.org/wp-content/uploads/2017/07/FHE-2016-2-Digit-Average-Salaries-Tenured-and-Tenure-Track.pdf', pages='1,2,4')
tables.head(10)
tables_bad = read_pdf('https://www.cupahr.org/wp-content/uploads/2017/07/FHE-2016-2-Digit-Average-Salaries-Tenured-and-Tenure-Track.pdf', pages='3')
tables_bad.head(10)
It has accidentally created an unwanted extra column. We will deal with this later. The second column of the first table has the salaries we want, as well as salaries for more junior faculty. We are going to throw the junior-faculty salaries away as they are not quite what we are looking for (they vary too much based on the teaching needs of institutions).
list(tables.iloc[5].values)[1]
Since these entries are all strings with white spaces, we need a function to clean the data by:
- Extracting the relevant part of the string (in the above case, the first 5 digit number).
- Turning the string into an integer.
We will also drop the numbering in front of the discipline name (it is possibly arbitrary anyway, as the original pdf seems to be missing entries).
def main_sal(s):
    # Take a string like '72,135 72,581 *', keep the first (full professor) salary
    # and strip out the comma so it can be parsed as an integer.
    i = s.find(',')
    w = s[0:i] + s[i+1:i+4]
    return int(w)
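For example, main_sal('72,135 72,581 *') should return 72135.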
First, import the regular expression package to search through the strings:
import re
Next, we will write two functions: one to clean the individual strings, and a second to clean a whole list of entries by applying the first function to each one.
def clean(s):
    if type(s) == str:
        discipline_name = re.search(r'\]', s)
        salary = re.match(r'\d', s)
        if discipline_name:
            # Discipline entry: drop the '[30.] '-style numbering and keep the name
            i = discipline_name.start()
            return s[i+2:]
        elif salary:
            # Salary entry: keep only the first number and strip its comma
            i = re.search(r'\s', s).start()
            j = re.search(',', s).start()
            return int(s[:j] + s[j+1:i])
        else:
            pass
def clean_list(L):
    L1 = []
    for l in L:
        if clean(l) is not None:
            L1.append(clean(l))
    return L1
Let's see if this works:
clean_list(['72,135 72,581 *','[30.] MULTI/INTERDISCIPLINARY STUDIES','*'])
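This should return [72135, 'MULTI/INTERDISCIPLINARY STUDIES']: the first salary as an integer and the discipline name with its numbering stripped.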
OK, that seems to do it. First, we'll get a flat list of all the entries of the dataframe (skipping the first entry, which is just a heading):
t = tables_bad.values
M = []
for x in t:
    M += list(x)
print(M[1:])
Let us clean the list.
clean_list(M[1:])
Now, we only want the first number for each discipline, so we will make a smaller list with just the discipline names and their corresponding full-professor salaries:
L=clean_list(M[1:])
L_fixed=[[L[i],L[i+1]] for i in range(len(L)) if type(L[i])==str]
L_fixed
This looks good. Let's put it in a new dataframe:
tables_good=pd.DataFrame(L_fixed)
tables_good.columns=['Discipline','Average Professor Salary']
tables_good
Now we'll clean the rest of the data. First, let's print a list of the rows:
t2 = tables.values
M2 = []
for x in t2:
    M2 += list(x)
print(M2[1:])
The rest is the same as for the previous part of the data.
L2=clean_list(M2[1:])
L2_fixed=[[L2[i],L2[i+1]] for i in range(len(L2)) if type(L2[i])==str]
L2_fixed
Finally we join them together and get a dataframe:
Lf_fixed=L_fixed+L2_fixed
Lf_fixed.sort()
df_prof=pd.DataFrame(Lf_fixed)
df_prof.columns=['discipline','average_professor_salary']
df_prof.head()
df_prof.describe()
We now have half of our dataframe. The other half comes from the student median wage data:
df_student=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/grad-students.csv')
df_student.columns
We only need one of these columns, 'Grad_median'. The top of the table looks like this:
df_student.head()
First, let's get a list of all the majors:
L=sorted(list(df_student['Major'].unique()))
Then, let's make a smaller dataframe with the only information we will be using. We are keeping track of sample size because we will have to combine some student subjects, since they were classified into a larger set of disciplines.
df_student_sub=df_student[['Major','Grad_median','Grad_sample_size']]
df_student_sub.shape
So we have 173 disciplines, versus the 32 for the professor salary dataframe. In order to combine them we will merge student disciplines under professor disciplines in an essentially ad hoc manner, throwing out a couple but keeping most. It may have been more accurate (and it certainly would have been quicker!) to throw away all but those that share essentially the same name, but that would be ignoring a lot of the data. We will then calculate a 'weighted median', which will be the mean of all the student medians falling under the same professor discipline, weighted by sample size. This statistic is prima facie a bit of a mutant, but, assuming the student salaries are roughly normally distributed (so that each median approximates a mean), it ought to give the mean graduate starting salary for that discipline.
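Written out, for each professor discipline $D$ the 'weighted median' is $\bar{m}_D = \sum_{i \in D} n_i m_i \, / \sum_{i \in D} n_i$, where $m_i$ is the graduate median salary and $n_i$ the graduate sample size of each student major $i$ assigned to $D$.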
L
The corresponding list of professor disciplines is
df_prof['discipline']
We now set up a dictionary that sends the student discipline to our chosen over-arching professor discipline. This was saved in a separate file due to its unsightly length, and can be found on github here. It will allow us to merge the two datasets.
from academia_dict import d
Disciplines=set(d.values())
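For orientation, d simply maps each fivethirtyeight major name onto one of the (cleaned) CUPA-HR discipline names. The entries below are purely made-up placeholders to show the shape; the real mapping lives in academia_dict.py:

d_example = {'SOME 538 MAJOR': 'SOME CUPA-HR DISCIPLINE',   # hypothetical key/value
             'ANOTHER MAJOR': 'ANOTHER DISCIPLINE'}          # hypothetical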
Now we convert the student dataframe into a list
student_list=list(map(list,list(df_student_sub.values)))
student_list
The function we use to clean it up applies the dictionary from earlier to change each student discipline name to the corresponding professor discipline name:
def student_cleaner(L):
    result = []
    for l in L:
        try:
            # Replace the student major with its professor discipline
            r = [d[l[0]], l[1], l[2]]
            result.append(r)
        except KeyError:
            # Majors with no corresponding professor discipline get dropped
            pass
    return result
cleaned_student_list=student_cleaner(student_list)
Next, we sort the list alphabetically
cleaned_student_list.sort(key= lambda x:x[0])
Finally, we work out the overall averages using the sample sizes and medians:
final_student_list = []
n = len(cleaned_student_list)
x = cleaned_student_list[0]
x[1] = x[2] * x[1]  # running weighted sum of medians for the current discipline
for i in range(1, n):
    y = cleaned_student_list[i]
    if x[0] == y[0]:
        # Same discipline: accumulate the sample size and the weighted sum
        x[2] += y[2]
        x[1] += y[1] * y[2]
    else:
        # New discipline: record the weighted average and start a new accumulator
        z = [x[0], x[1] / x[2]]
        final_student_list.append(z)
        x = y
        x[1] = x[1] * x[2]
# Record the final discipline
z = [x[0], x[1] / x[2]]
final_student_list.append(z)
The cleaned list now looks like this
cleaned_student_list
final_student_list.sort(key=lambda x:x[0])
And the final list looks like this (no duplicates)
final_student_list
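As an aside, the same weighted averages could be computed more compactly with a pandas groupby. A minimal sketch, assuming the mapping d imported above (majors missing from d map to NaN and are dropped, mirroring student_cleaner):

tmp = df_student_sub.assign(prof_discipline=df_student_sub['Major'].map(d)).dropna(subset=['prof_discipline'])
weighted = tmp.groupby('prof_discipline').apply(lambda g: (g['Grad_median'] * g['Grad_sample_size']).sum() / g['Grad_sample_size'].sum())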
Now, we return to the list for professor salary. We are going to add on the student salary. First, we'll set up a dictionary that gets rid of any professor disciplines that have no corresponding student disciplines:
rev_d = {x: True for x in Disciplines}
prof_list = []
for l in Lf_fixed:
    try:
        # Keep only professor disciplines that appear as values of the mapping d
        rev_d[l[0]]
        prof_list.append(l)
    except KeyError:
        pass
prof_list.sort(key=lambda x: x[0])
Then we add the student salary onto the professor salary list:
n = len(prof_list)
comparison_list = []
for i in range(n):
    if final_student_list[i][0] == prof_list[i][0]:
        comparison_list.append([prof_list[i][0], round(final_student_list[i][1]), prof_list[i][1]])
    else:
        # The two sorted lists should line up discipline by discipline; flag any mismatch
        print('Error here:', i)
        break
And here is the full list:
comparison_list
Finally, we turn it into a dataframe:
df_comparison=pd.DataFrame(comparison_list)
df_comparison.columns=['discipline','weighted_median_grad_salary','mean_prof_salary']
df_comparison.describe()
import statsmodels.api as sm
First, let's look at a linear regression:
X=df_comparison['weighted_median_grad_salary']
X1=sm.add_constant(X)
Y=df_comparison['mean_prof_salary']
model = sm.OLS(Y,X1).fit()
model.summary()
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
Here is a plot of that regression:
X = np.array(df_comparison['weighted_median_grad_salary']).reshape(-1, 1)  # sklearn expects a 2-D feature array
Y = np.array(df_comparison['mean_prof_salary'])
pay_model = linear_model.LinearRegression()
pay_model.fit(X, Y)
prediction = pay_model.predict(np.sort(X, axis=0))
plt.scatter(X, Y)
plt.plot(np.sort(X, axis=0), prediction)
plt.show()
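If you want the numbers rather than the picture, the fitted slope and intercept can be read off the model directly:

print(pay_model.coef_[0], pay_model.intercept_)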
The slope is less than 1, so increases in graduate salaries do not drive commensurate increases in professor pay. However, what we really want to know is whether academia and industry rank the disciplines in the same order, so we calculate Spearman's rank coefficient. The advantage here is that it only looks for a monotonic relationship between the salaries, so it ignores a lot of the nonlinearity that may come from comparing people at different points in different careers.
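For reference, when there are no tied ranks Spearman's coefficient is just the Pearson correlation of the ranks, and reduces to $\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$, where $d_i$ is the difference between a discipline's two ranks and $n$ is the number of disciplines.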
df_comparison.corr(method='spearman')
Squaring this gives an $R^2$-like measure of
0.603226**2
which, at about $0.36$, is rather low if we were expecting the academy to respond to the same demands as the broader job market. We conclude that the academic pay ranking is not well explained by the graduate job market pay ranking.
Now, how robust is this number? Can we conclude anything from it? Well, it is worth comparing pay in academia directly against pay in industry, since chances are one is more lucrative than the other.
We will now perform a Wilcoxon signed-rank test on the predicted mean professor salary versus the observed mean professor salary. This tests whether the two sets of values could come from the same distribution, i.e. whether their paired differences are symmetric about zero. It outputs a $p$-value: the probability of seeing differences at least this extreme if that null hypothesis were true.
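Briefly: the test ranks the absolute paired differences $|d_i|$, sums the ranks of the positive and negative differences separately, and scipy reports the smaller of the two sums together with the $p$-value.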
from scipy import stats
A bit more data cleaning (we are adding the predicted Professor Salary into our dataframe):
L=sorted(comparison_list, key = lambda x:x[1])
predict_prof_list=[[L[i][0],round(float(prediction[i]))] for i in range(len(L))]
predict_prof_list.sort(key=lambda x:x[0])
df_comparison['predicted_prof_salary']=list(map(lambda x:x[1],predict_prof_list))
df_comparison.describe()
It looks like it worked. Now we apply the Wilcoxon test to the rightmost two columns. First, we turn them into lists so scipy can do its thing:
L_obs=list(map(float,df_comparison['mean_prof_salary'].values))
L_pred=list(map(lambda x:x[1],predict_prof_list))
stats.wilcoxon(L_pred,L_obs)
This is not very helpful. A $p$-value of about $0.47$ means we cannot reject the hypothesis that the predicted (industry-scaled) and observed professor salaries follow the same distribution; it certainly does not prove that they do. The test would probably be more conclusive if we examined each subject individually, since some of the differences in ranking are large while others are small, and the ones with large differences are where the distributions most plausibly differ.
Now, which ones are those? Well, we can look at the list of differences, sort them by size and find out.
diff_list=[[c[0],sorted(comparison_list,key=lambda x:x[1]).index(c)-sorted(comparison_list,key=lambda x:x[2]).index(c)] for c in comparison_list]
diff_list.sort(key=lambda x:x[1],reverse=True)
diff_list
So our data suggest that historians, humanities graduates, physicists and mathematicians are underpaid as professors relative to the rest of the job market, whereas the (seemingly more vocational) degree holders near the bottom of the list are not.