Cousins and Alcohol¶
There is a certain stereotype that says that both alcohol consumption and cousin marriage are bad things, and that they go hand in hand. Regardless of the first part of the stereotype, we can use data to address the second part. Is this just anecdotal or is there a trend in global data that backs this up? To investigate this, we run a simple linear regression between two data sets: the amount and type of alcohol consumed annually per capita by country; and the rate of cousin marriage by country.
The usual caveats apply: this is just using country-wide data and the sources may not have collected it with the same precision in each country. Nevertheless, it is a first step.
First, we'll import the data analysis package pandas for Python. We could just write up the functions themselves (in fact this is done at the end) as they are not going to be very complex, but why re-invent the wheel?
import pandas as pd
Next, we import the two data sets. The first, on cousin marriage, comes from Prof Alan Bittles of Edith Cowan University via fivethirtyeight.com, and can be imported like so:
cousin_df=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/cousin-marriage/cousin-marriage-data.csv')
Note that this is not a list, but a data frame:
type(cousin_df)
We can peek at the top lines of the data frame with the head() command:
cousin_df.head()
Wow, Burkina Faso is way out ahead. Next, we import the other data set on cousin marriage. This comes courtesy of the World Health Organization via fivethirtyeight.com.
alco_df=pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv')
alco_df.head()
Now, a quick glance at both sets shows that there are some naming inconsistencies, e.g. one having the USA under 'United States' and the other under 'USA'. This would be a problem with larger data sets, but we can just do some ad hoc editing here. We will replace 'The Netherlands' by 'Netherlands', 'USA' by 'United States' and 'Great Britian' by 'United Kingdom'. To do this, use the 'replace' function on the dataframe, with a dictionary telling the 'to_replace' command what to do. The second command, inplace='True', tells it to replace the old data frame with the new, correct one.
alco_df['country'].replace(to_replace={'USA':'United States'},inplace=True)
For the other dataframe we must replace two entries:
cousin_df['Country'].replace(to_replace={'The Netherlands':'Netherlands','Great Britain':'United Kingdom'},inplace=True)
Next thing to do is merge both dataframes along the country column. First thing we do is again make a simple edit so that the 'country' in alco_df matches 'Country' (note the capital C!) in cousin_df.
alco_df.rename(columns={'country':'Country'},inplace=True)
Now we are ready to merge the data frames. We use the .join() command.
df=pd.merge(alco_df,cousin_df,on='Country')
Note that this has also discarded any entries with incomplete data, i.e. countries appearing in one data frame but not the other. Next, we import the package statsmodels.api to do the computation of what we are interested in, namely the regression coefficient $\beta_1$ and $R^2.$
import statsmodels.api as sm
We'll also import a graphing package matplotlib.pyplot:
import matplotlib.pyplot as plt
X=df['total_litres_of_pure_alcohol']
X1=sm.add_constant(X)
Y=df['Percent']
Note that we had to use the function add_constant on X so that the regression line would not be forced to pass through the origin. Next, we specify the kind of model we want (Ordinary Least Squares) and specify the predcitor variable (alcohol consumed) and dependent variable (Percent of Cousin Marriage).
model = sm.OLS(Y,X1).fit()
Finally, we output the stats we are interested in:
model.summary()
With $\beta_0$ and $\beta_1$ we can plot the regression (note that we do not include the constant $\beta_0$ in the data, so we use X rather than X1):
plt.plot(X,Y,'.')
f = lambda x : 29.6483 - 2.8982*x
plt.plot(X,f(X), c="red", )
plt.legend()
plt.xlabel('Litres of Alcohol Consumed', fontsize=12)
plt.ylabel('Cousin Marriage Percentage', fontsize=12)
plt.show()
Somewhat surprisingly, there is a negative correlation between the amount of alcohol consumed and the prevalence of cousin marriage. In fact, the r-squared statistic implies that almost 40% of the cousin marriage trend can be explained (i.e. correlates with) by lack of alcohol. There could be many intrepretations for this, as well as the possibility that the data itself is not that robust. As we shall see below, it is beer that is best at explaining this effect, although all alcoholic drinks mentioned in the data set explain some of the trend.
We can also ask whether there is significant prediction in the opposite direction: does cousin marriage predict alcohol consumption?
P=df['Percent']
P1=sm.add_constant(P)
percent_alcohol=sm.OLS(X,P1).fit()
percent_alcohol.summary()
plt.plot(P,X,'.')
f = lambda x : 6.8279 -0.1364*x
plt.plot(P,f(P), c="red")
plt.legend()
plt.xlabel('Cousin Marriage Percentage', fontsize=12)
plt.ylabel('Litres of Alcohol Consumed', fontsize=12)
plt.show()
Again there is a negative correlation (of course), with the same $R^2$ coefficient.
We now look at the different types of alcohol and see if one is a better predictor of cousin marriage than the others. We start with a linear regression on beer servings as a predictor of cousin marriage:
B=df['beer_servings']
B1=sm.add_constant(B)
beer_model=sm.OLS(Y,B1).fit()
beer_model.summary()
plt.plot(B,Y,'.')
f = lambda x : 29.4190 -0.1215*x
plt.plot(B,f(B), c="red", )
plt.legend()
plt.xlabel('Beer Servings Consumed', fontsize=12)
plt.ylabel('Cousin Marriage Percentage', fontsize=12)
plt.show()
This has a higher $R^2$, meaning that annual beer servings per capita explain about 52% of the drop off in cousin marriage as annual beer servings per capita increase. So beer annual servings per capita better explain the lack of cousin marriage than annual litres of alcohol consumed per capita, according to this data set.
While the $\beta_1$ slope coefficient of the regression line is smaller than the slope coefficient for the total amount of alcohol consumed, this is mainly due to the change in units, since one serving of beer only contains about 0.018 litres of pure alcohol (if the beer has 5% ABV). Larger numbers for the predictor would therefore mean a smaller slope if the dependent variable's units are held fixed. Comapring the graphs, the slopes look very similar because they have been resized to fit all data points, compensating for the change of units.
Conversely, we get a similar negative correlation for cousin marriage as a predictor of beer consumption:
percent_beer=sm.OLS(B,P1).fit()
percent_beer.summary()
plt.plot(P,B,'.')
f = lambda x : 178.2149 -4.3109*x
plt.plot(P,f(P), c="red")
plt.legend()
plt.xlabel('Cousin Marriage Percentage', fontsize=12)
plt.ylabel('Beer Servings Consumed', fontsize=12)
plt.show()
Next up, to what extent do annual spirit servings per capita predict the prevalence of cousin marraige?
S=df['spirit_servings']
S1=sm.add_constant(S)
spirit_model=sm.OLS(Y,S1).fit()
spirit_model.summary()
plt.plot(S,Y,'.')
f = lambda x : 27.6759 -0.1660*x
plt.plot(S,f(S), c="red", )
plt.legend()
plt.xlabel('Spirit Servings Consumed', fontsize=12)
plt.ylabel('Cousin Marriage Percentage', fontsize=12)
plt.show()
Here, the $R^2$ is smaller than that of the overall alcohol consumed, meaning that it is not as good a predictor as overall alcohol, and certainly much less than beer. Nevertheless, it does explain about 31% of the decline in the prevalence of cousin marriage as annual servings of spirits per capita increases.
percent_spirits=sm.OLS(S,P1).fit()
percent_spirits.summary()
plt.plot(P,S,'.')
f = lambda x : 98.9043 -1.8639*x
plt.plot(P,f(P), c="red")
plt.legend()
plt.xlabel('Cousin Marriage Percentage', fontsize=12)
plt.ylabel('Spirit Servings Consumed', fontsize=12)
plt.show()
Finally, annual wine servings per capita as a predictor of cousin marriage:
W=df['wine_servings']
W1=sm.add_constant(W)
wine_model=sm.OLS(Y,W1).fit()
wine_model.summary()
plt.plot(W,Y,'.')
f = lambda x : 22.4169 -0.0981*x
plt.plot(W,f(W), c="red", )
plt.legend()
plt.xlabel('Wine Servings Consumed', fontsize=12)
plt.ylabel('Cousin Marriage Percentage', fontsize=12)
plt.show()
This has the worst $R^2$ of all, but at nearly 28% is still not terrible. Wine is the worst at predicting this trend.
percent_wine=sm.OLS(W,P1).fit()
percent_wine.summary()
plt.plot(P,W,'.')
f = lambda x : 108.5659 -2.8368*x
plt.plot(P,f(P), c="red")
plt.legend()
plt.xlabel('Cousin Marriage Percentage', fontsize=12)
plt.ylabel('Wine Servings Consumed', fontsize=12)
plt.show()
For those interested, the regression in the table above can be calculated 'by hand' below.¶
First, this function gives the regression coefficients, i.e. the $y$-intercept $\beta_0$ and the line's slope $\beta_1$:
def lin_reg(L):
x_sq=[L[i][0]**2 for i in range(0,len(L))]
xy=[L[i][0]*L[i][1] for i in range(0,len(L))]
x=[L[i][0] for i in range(0,len(L))]
y=[L[i][1] for i in range(0,len(L))]
n=len(L)
X=sum(x)/n
Y=sum(y)/n
XY=sum(xy)/n
X2=sum(x_sq)/n
b0=(X*XY-X2*Y)/(X**2-X2)
b1=(X*Y-XY)/(X**2-X2)
return b0,b1
Next we have the Mean Squared Error, sample variance, $R^2$ coefficient of determination and Pearson correlation coefficient:
def MSE(L):
L=[(L[i][1]-lin_reg(L)[0]-lin_reg(L)[1]*L[i][0])**2 for i in range(0,len(L))]
return sum(L)/(len(L)-2)
def sample_variance(L):
n=len(L)
A=[L[i][1] for i in range(0,n)]
av=sum(A)/n
L=[(L[i][1]-av)**2 for i in range(0,n)]
return (sum(L)/(n-1))**0.5
def r_squared(L):
n=len(L)
b0=lin_reg(L)[0]
b1=lin_reg(L)[1]
Y_hat=[b0+b1*L[i][1] for i in range(0,n)]
A=[L[i][1] for i in range(0,n)]
avg=sum(A)/n
Sq=[(a-avg)**2 for a in A]
SSTO=sum(Sq)
return 1-MSE(L)*(n-2)/SSTO
def pearson(L):
if lin_reg(L)[1]>0:
return r_squared(L)**0.5
else:
return -r_squared(L)**0.5
To run all these, we need to convert our dataframe into a list (of lists), which only contains the predictors and dependent variables:
D=df.values.tolist()
L=[[d[-2],d[-1]] for d in D]
BL=[[d[1],d[-1]] for d in D]
SL=[[d[2],d[-1]] for d in D]
WL=[[d[3],d[-1]] for d in D]
First, we get the linear regression coefficients:
print(lin_reg(L))
print(lin_reg(BL))
print(lin_reg(SL))
print(lin_reg(WL))
Next, the mean squared error and sample variance (these are not in the table above):
print(MSE(L))
print(MSE(BL))
print(MSE(SL))
print(MSE(WL))
print(sample_variance(L))
Note that this calculates the variance in the cousin marriage data only, and be the same for all the alcohol data sets.
Finally, the $R^2$ and Pearson coefficient:
print(r_squared(L),pearson(L))
print(r_squared(BL),pearson(BL))
print(r_squared(SL),pearson(SL))
print(r_squared(WL),pearson(WL))