Global Wage Prediction Model
This page is still in progress!
Introduction
As someone who has considered pursuing a higher level of education, I often wonder how much this will enhance my work earnings, not only in the United States, but across the world. For this project I am exploring the strength of the linear relationship between wages and age, sex, country, and level of education. I created a Multiple Regression Model to predict how my income will change if I move to another continent, and how my income in those different continents shifts when my education level changes from a bachelor’s to a masters or doctorate. The goal is to be able to recognize how different countries value education by calculating those differences. The predictors are age, sex, country, and level of education, while wage is the response.
The null and alternate hypothesis for the F-statisic are as follows:
Null Hypothesis: There is no statistical association between the predictors and the response
Alternative Hypothesis: There is some statistical association between at least one of the predictors and the response
Data Preparation
The Organisation for Economic Co-operation and Development(OECD) is the data source I retrieved my information from. Countries that are part of the organization have governments committed to democracy and the international market economy. The original data came from Education at a Glance, a source that tracks data on education around the world.
One important variable to understand about my data set is how the predictor, income, is presented. Since this data encompasses all OECD countries, wage levels are calculated by dividing low and high pay. Low pay is the share of workers that earn less than two thirds of median wage while high pay is the share of workers that earn more than one and a half of median wage.
All of my predictor variables are categorical. The sex is a qualitative, nominal variable assigned as male and female. Age is ordinal and is in groups set at 25-34, and 55-64. Education level is also an ordinal and categorical with variables, upper secondary education(equivalent to high school degree in the U.S.), bachelor’s or equivalent education, and master’s, doctoral or equivalent education. One important change I made to the data set was adding a column of continents. Since understanding the data strictly by country there are too many variables, and adding continents made it easier to understand big picture differences. Categorizing by continent also gave me the opportunity to visualize the data neatly.
Exploratory Analysis
Figure 1
* NA is Oceania and islands
As expected, the higher your level of education is, the higher your earnings are on average. None of the predictors taken from this source are dependent on one another, so there is no double counting within the model. No predictors are correlated with one another, but some are individually correlated with the response.
Model Development
My null hypothesis under the F-statistic is that there is no statistical association between age, level of education, continent and wage. The alternate hypothesis under the F-statistic is there is at least one variable out of age, level of education, and continent with statistical association to wage. My null hypothesis and alternate hypothesis is the same idea as the F-statistic, but instead compares each individual variable rather than at least one.
I split my data where 70% is in the training set, while the other 30% is in the test set. For my model I have the goal of prediction, so the hierarchical rule does not apply, and since we are using a multiple regression model, we do not need to worry about any violation of normality. Since variables are independent, lack co linearity, and do not have any influential outliers, all assumptions have been validated and the final regression model is suitable for prediction.
Using backward step-wise selection I removed variables that did not have as high of p-values than others. This resulted in me taking out levels of education that are equivalent to Middle and Elementary grades in the U.S. One result I was surprised to see was that sex negatively affected the accuracy of the model, so I chose to remove it as well.
The total number of variables that resulted in the lowest Adjusted R-Squared was the model with 7 predictor variables. You can see in the Adjusted R-Squared plot and the Residual sum of squares plot that 7 variables is when the model’s accuracy stops growing.
Figure 2, 3
In Figure 2 and 3 the model is taking the 7 best predictors and showing their influence on the accuracy of the model using Adjusted R-Squared and the Residual Sum of Squares.
Figure 4
Ultimately the predictors that gave me the most accurate model were age, level of education, and location. You can see the predictor’s significance visually in Figure 4 and quantitatively in the model summary in Figure 5.
Call:
lm(formula = Value ~ Age + ISC11A.1 + Continent, data = educvarstrain)
Residuals:
Min 1Q Median 3Q Max
-109.60 -19.44 -2.74 15.54 325.88
Coefficients:
Estimate Std. Error t value
(Intercept) 149.729 3.382 44.266
Age25-64 years 17.761 1.855 9.574
Age55-64 years 27.392 1.891 14.486
ISC11A.1Bachelor’s or equivalent education -19.023 1.963 -9.692
ISC11A.1Master’s, Doctoral or equivalent education 26.366 1.986 13.276
ISC11A.1Upper Secondary Edu -81.724 2.763 -29.577
ContinentEurope -6.211 3.120 -1.991
ContinentNorth America 29.220 3.665 7.973
ContinentSouth America 114.201 4.470 25.546
Pr(>|t|)
(Intercept) < 2e-16 ***
Age25-64 years < 2e-16 ***
Age55-64 years < 2e-16 ***
ISC11A.1Bachelor’s or equivalent education < 2e-16 ***
ISC11A.1Master’s, Doctoral or equivalent education < 2e-16 ***
ISC11A.1Upper Secondary Edu < 2e-16 ***
ContinentEurope 0.0466 *
ContinentNorth America 2.48e-15 ***
ContinentSouth America < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 35.61 on 2156 degrees of freedom
(1085 observations deleted due to missingness)
Multiple R-squared: 0.5664, Adjusted R-squared: 0.5648
F-statistic: 352 on 8 and 2156 DF, p-value: < 2.2e-16
Figure 5
Model Assessment
The result from the summary of my model in Figure 5 includes a F-statistic p-value of \< 2.2e-16 which immediately indicates that there is some statistical association between at least one of the predictors and the response. I reject the null hypothesis in favor of the alternate. The same applies for each individual predictor variable. After performing backward-stepwise selection, each variable has some statistical association with wage. The adjusted R-squared is lower than ideal at 0.5855 meaning that 59% of variation of age, education, and continent, can be explained by the linear association with wage. This result is to be expected since we are predicting for 5 continents. When we eventually focus on one specific area, the result is more accurate since there is less variability in the data.
The residual standard error is 38.73 on 1500 degrees of freedom, meaning we can expect our prediction to be off by an average 38.73 percentage points. This error is not bad, in the context of the problem, since we are evaluating values ranging from 88% to 650%.
To identify outliers in the model, we refer to the Scale-Location plot in Figure 6.
Figure 6
The outliers in this model are both from Chile. These two points have master’s and doctoral as their level of education with age 55-64 years, where the only difference is sex. The first is data point 11861, which is the farthest outlier, is female. Slightly below is data point 13809 which is the same in all variables except the sex is male. The earnings were 644% and 586%. When viewing the highest wages of the data set as a whole, these points are some of the highest. In fact, all the highest wages are in Chile. Upon further research I can now see that Chile has an extremely high poverty rate, and their income distribution is large, so it would make sense that a master’s or doctorate is valued the most in terms of earnings in Chile. Since these outliers do not lie within Cook’s Distance they are not highly influential points, and I see no reason to remove them.
The calculated root mean squared error(RMSE) is 47%. This was found by using the test set to calculate the difference between the predicted income vs actual income, then squaring the difference, taking the mean, and then finally taking the square root.
Since I want to predict what my own potential salary could be if I went from a bachelor’s to a master’s, or doctorate, I created a separate model for North America, South America, Europe, and Asia. This will allow me to see which continents value that extra level of education the most.
North America
Call:
lm(formula = Value ~ ISC11A.1 + Age + SEX, data = na_datatrain)
Residuals:
Min 1Q Median 3Q Max
-113.052 -20.040 -4.606 16.807 176.363
Coefficients:
Estimate Std. Error t value
(Intercept) 163.629 5.734 28.536
ISC11A.1Bachelor’s or equivalent education -12.246 5.774 -2.121
ISC11A.1Master’s, Doctoral or equivalent education 56.115 6.120 9.169
ISC11A.1Upper Secondary Edu -77.700 9.231 -8.417
Age25-64 years 29.204 5.594 5.221
Age55-64 years 22.856 5.966 3.831
SEXM 4.760 4.673 1.019
Pr(>|t|)
(Intercept) < 2e-16 ***
ISC11A.1Bachelor’s or equivalent education 0.034772 *
ISC11A.1Master’s, Doctoral or equivalent education < 2e-16 ***
ISC11A.1Upper Secondary Edu 1.77e-15 ***
Age25-64 years 3.40e-07 ***
Age55-64 years 0.000156 ***
SEXM 0.309213
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 40.16 on 291 degrees of freedom
(29 observations deleted due to missingness)
Multiple R-squared: 0.4918, Adjusted R-squared: 0.4813
F-statistic: 46.93 on 6 and 291 DF, p-value: < 2.2e-16
Figure 7
* see Data Preparation to interpret income percent
Figure 7 has an Adjusted R-squared 48%, meaning 48% of variation in sex, age, and education can be explained by its linear association with earnings in North America. This score is the lowest of the continents because of the level of variation in the data, which you can see when predicting how large the earnings interval is.
I am 95% confident that if I, a female, worked in North America and am between 25-34 years old with a bachelor’s or equivalent, my income would be between 57% and 249%. This is a huge range of values that your income could be! The 57% is the lowest wage percentage you will see, meaning if you get a Bachelors in North America you have a fairly high chance of getting an income on the lower side. This makes me believe that the countries do not value the Bachelor’s degree as much as other countries in the world. However if everything were the same, except my level of education was instead a Master’s or Doctoral equivalent, my income would be between 123% and 315%.
Achieving a high level of education such as a masters or doctoral definitely gives someone better odds of earning a higher wage. The difference of expected wage between the bachelors and masters or doctorate, is a 66% increase in earnings.
South America
Call:
lm(formula = Value ~ ISC11A.1 + SEX + Age, data = sa_datatrain)
Residuals:
Min 1Q Median 3Q Max
-72.122 -13.874 -2.105 11.361 128.005
Coefficients:
Estimate Std. Error t value
(Intercept) 244.633 8.995 27.198
ISC11A.1Bachelor’s or equivalent education -21.064 10.162 -2.073
ISC11A.1Master’s, Doctoral or equivalent education 180.444 10.128 17.816
ISC11A.1Upper Secondary Edu -186.322 9.020 -20.657
SEXM -6.545 6.667 -0.982
Age25-64 years 49.217 8.144 6.043
Age55-64 years 90.486 8.166 11.081
Pr(>|t|)
(Intercept) < 2e-16 ***
ISC11A.1Bachelor’s or equivalent education 0.0404 *
ISC11A.1Master’s, Doctoral or equivalent education < 2e-16 ***
ISC11A.1Upper Secondary Edu < 2e-16 ***
SEXM 0.3283
Age25-64 years 1.88e-08 ***
Age55-64 years < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 36.71 on 116 degrees of freedom
(154 observations deleted due to missingness)
Multiple R-squared: 0.9372, Adjusted R-squared: 0.934
F-statistic: 288.7 on 6 and 116 DF, p-value: < 2.2e-16
Figure 8
The adjusted R-squared of Figure 8 is 94%, the highest linear association of all the models!
I am 95% confident that if I worked in South America and am between 25-34 years old with a bachelor’s or equivalent, my income would be between 156% and 319%. This is a major increase in earnings compared to North America. If the predictors remained constant, except my level of education was instead a Master’s or Doctoral equivalent, my income would be between 358% and 522%.
A higher level of education such as a masters or doctoral practically guarantees a higher wage. This makes intuitive sense because South America, a less developed country than the rest in comparison, has a high demand for people with higher levels of education. Having a bachelors versus masters or doctorate results in an average of 202% earnings increase.
Europe
Call:
lm(formula = Value ~ ISC11A.1 + SEX + Age, data = eur_datatrain)
Residuals:
Min 1Q Median 3Q Max
-51.844 -14.697 -2.457 11.027 117.898
Coefficients:
Estimate Std. Error t value
(Intercept) 146.4406 1.3052 112.201
ISC11A.1Bachelor’s or equivalent education -18.5142 1.3769 -13.447
ISC11A.1Master’s, Doctoral or equivalent education 12.8424 1.3442 9.554
ISC11A.1Upper Secondary Edu -55.5191 1.9745 -28.118
SEXM 0.5341 1.0677 0.500
Age25-64 years 13.8024 1.3007 10.612
Age55-64 years 25.0174 1.2987 19.263
Pr(>|t|)
(Intercept) <2e-16 ***
ISC11A.1Bachelor’s or equivalent education <2e-16 ***
ISC11A.1Master’s, Doctoral or equivalent education <2e-16 ***
ISC11A.1Upper Secondary Edu <2e-16 ***
SEXM 0.617
Age25-64 years <2e-16 ***
Age55-64 years <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.43 on 1606 degrees of freedom
(276 observations deleted due to missingness)
Multiple R-squared: 0.5221, Adjusted R-squared: 0.5203
F-statistic: 292.4 on 6 and 1606 DF, p-value: < 2.2e-16
Figure 9
Figure 9 has an adjusted R-squared of 57%, a lower association than we would prefer but still relevant.
I am 95% confident that if I worked in Europe and am between 25-34 years old with a bachelor’s or equivalent, my income would be between 86% and 172%. This interval does not guarantee as good of a wage with a bachelors than South America does. However if everything were the same except my level of education was instead a master’s or doctoral, my income would be between 118% and 204%.
Obtaining a high level of education such as a master’s or doctoral definitely gives someone better odds of earning a higher wage, but does not guarantee it. The difference of expected wage between the bachelors and masters or doctorate, is a 32% increase in earnings.
Asia
Call:
lm(formula = Value ~ ISC11A.1 + SEX + Age, data = asia_datatrain)
Residuals:
Min 1Q Median 3Q Max
-30.529 -13.385 -4.582 10.728 41.269
Coefficients:
Estimate Std. Error t value
(Intercept) 142.634 3.640 39.188
ISC11A.1Bachelor’s or equivalent education -11.406 3.537 -3.225
ISC11A.1Master’s, Doctoral or equivalent education 38.537 3.540 10.885
SEXM -8.515 2.866 -2.971
Age25-64 years 24.776 3.572 6.937
Age55-64 years 34.128 3.437 9.931
Pr(>|t|)
(Intercept) < 2e-16 ***
ISC11A.1Bachelor’s or equivalent education 0.00155 **
ISC11A.1Master’s, Doctoral or equivalent education < 2e-16 ***
SEXM 0.00346 **
Age25-64 years 1.19e-10 ***
Age55-64 years < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.65 on 147 degrees of freedom
(73 observations deleted due to missingness)
Multiple R-squared: 0.7129, Adjusted R-squared: 0.7031
F-statistic: 73 on 5 and 147 DF, p-value: < 2.2e-16
Figure 10
The adjusted R-Squared in Figure 10 is 76%, meaning there is a strong linear association between the predictors and the response.
I am 95% confident that if I worked in Asia and was between 25-34 years old with a bachelor’s or equivalent, my income would be between 93% and 163%. If everything were the same, except my level of education was a master’s or doctoral equivalent, my income would be between 146% and 216%.
Achieving a high level of education such as a masters or doctoral gives someone much better odds of earning a higher wage. The difference of expected wage between the bachelor’s and master’s or doctorate, is a 53% increase in earnings.
Conclusion
Predicting income on a national level is something my model offers over perhaps a more standard model focused in one area. I wanted to see the big picture on a global scale, and this model encompasses countries whose population totals to about 1.38 billion people. One additional bonus that I did not expect to come out of my model was the fluctuation of the p-value of sex in Figures 7-11. Europe, North America, and South America had higher p-values meaning sex plays less of a role in wages. However Asia had a p-value below .05 meaning, being female vs male plays a major role in a person’s earnings. These models expose a lot about country’s development as well as factors they value in the workforce. These would be ultimately benefit any person who wants to travel and/ or is thinking about pursuing a higher level of education.
Citations
“Chile Overview.” World Bank, <https://www.worldbank.org/en/country/chile/overview>.
“Education and Earnings .” OECD, <https://stats.oecd.org/Index.aspx?DataSetCode=EAG_EARNINGS>.
Appendix
Here is all the code used to obtain the results above.
library(tidyverse)
library(mosaic)
library(knitr)
library(tinytex)
library(leaps)
library(knitr)
::opts_chunk$set(echo = TRUE)
knitr
::opts_chunk$set(message = FALSE)
knitr
::opts_chunk$set(warning = FALSE)
knitr
::opts_chunk$set(out.width = "50%")
knitr
<- read.csv("education_earnings.csv")%>%
educvars
subset(select = -ISC11A)%>%
subset(select = -AGE)%>%
subset(select = -COUNTRY)%>%
subset(select = -INDICATOR)%>%
subset(select = -MEASURE)%>%
subset(select = -Measure)%>%
subset(select = -YEAR)%>%
subset(select = -Year)%>%
subset(select = -Unit.Code)%>%
subset(select = -Unit)%>%
subset(select = -PowerCode.Code)%>%
subset(select = -PowerCode)%>%
subset(select = -Flag.Codes)%>%
subset(select = -Flags)%>%
subset(select = -Reference.Period.Code)%>%
subset(select = -Gender)%>%
subset(select = -Indicator)%>%
subset(select = -EARN_CATEGORY)%>%
subset(select = -EARN_CATEGORY.1)%>%
subset(SEX!="T")%>%
subset(ISC11A.1 != "Below upper secondary education")%>%
subset(ISC11A.1 != "Tertiary education")%>%
subset(ISC11A.1 != "Short-cycle tertiary education")%>%
subset(ISC11A.1 != "Bachelor's, Master's, Doctoral or equivalent education")%>%
subset(ISC11A.1 != "Post-secondary non-tertiary education")
$ISC11A.1[educvars$ISC11A.1 == "Upper secondary, post-secondary non-tertiary education and short-cycle tertiary education"]<- "Upper Secondary Edu"
educvars
$ISC11A.1[educvars$ISC11A.1 == "Master's, Doctoral or equivalent education"]<- "Master's,Doctoral/equivalent"
educvars
<- c("United States", "Canada", "Costa Rica", "Mexico")
north_america
<-c("Chile", "Colombia", "Dominican Republic", "Ecuador", "El Salvador","Panama","Paraguay", "Brazil")
south_america
<- c("Austria", "Belgium", "Czech Republic", "Denmark","Estonia","Finland", "France", "Germany", "Greece","Hungary", "Iceland","Ireland","Italy","Latvia","Lithuania","Luxembourg","Netherlands","Norway", "Poland","Portugal", "Slovakia","Slovenia","Spain","Sweden","Switzerland", "Turkey", "United Kingdom", "Luxembourg")
europe
<- c("Israel", "Japan","South Korea", "Singapore","Thailand", "Malaysia", "Cambodia", "Korea", "Türkiye")
asia
attach(educvars)
$Continent[Country %in% north_america]<-"North America"
educvars
$Continent[Country %in% europe]<-"Europe"
educvars
$Continent[Country %in% south_america]<-"South America"
educvars
$Continent[Country %in% asia]<-"Asia"
educvars
detach(educvars)
#Seperating Each Data Set by Continent
<-educvars%>%
na_data
subset(Continent == "North America")
<-educvars%>%
sa_data
subset(Continent == "South America")
<-educvars%>%
eur_data
subset(Continent == "Europe")
<-educvars%>%
asia_data
subset(Continent == "Asia")
set.seed(101)
<- sample.int(n=nrow(educvars),size=floor(0.7*nrow(educvars)),replace=FALSE)
sample
<- educvars[sample,]
educvarstrain
<- educvars[-sample,]
educvarstest
summary(educvars)
ggplot(data=educvars,aes(x=Continent,y=Value)) +
ylab("Earnings")+
geom_jitter(width=0,height=0.1,alpha=0.3) +
geom_boxplot(alpha=0)+
geom_point(alpha=.3, aes(color=ISC11A.1))+
theme_bw()
<-regsubsets(Value~SEX+Age+ISC11A.1+Continent, data=educvars, nvmax=10, method="backward")
choosing
<-summary(choosing)
choosing.summary
plot(choosing.summary$rsq, xlab = "Number of Variables", ylab="Adjusted R-Squared", type="b")
plot(choosing.summary$rss, xlab = "Number of Variables", ylab = "Residual Sum of Squares(RSS)", type="b")
#which.max(choosing.summary$adjr2)
plot(choosing, scale = "r2")
<-lm(Value~Age+ISC11A.1+Continent, data=educvarstrain)
model6
summary(model6)
plot(model6, which=3)
<- educvarstest %>%
test_predict1
mutate(prob=predict(model6,newdata=educvarstest,type='response'))
%>%
test_predict1
summarize(accuracy=sqrt(mean((prob-Value)^2, na.rm=TRUE)))
#using na_data
set.seed(101)
<- sample.int(n=nrow(na_data),size=floor(0.7*nrow(na_data)),replace=FALSE)
sample
<- na_data[sample,]
na_datatrain
<- na_data[-sample,]
na_datatest
<-lm(Value~ISC11A.1+Age+SEX, data=na_datatrain)
model7
summary(model7)
#plot(model7)
#predict(model7, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1= "Bachelor’s or equivalent education"), interval="prediction", level=.95)
#predict(model7, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1= "Master's,Doctoral/equivalent"), interval="prediction", level=.95)
#using sa_data
set.seed(101)
<- sample.int(n=nrow(sa_data),size=floor(0.7*nrow(sa_data)),replace=FALSE)
sample
<- sa_data[sample,]
sa_datatrain
<- sa_data[-sample,]
sa_datatest
<-lm(Value~ISC11A.1+SEX+Age, data=sa_datatrain)
model8
summary(model8)
#predict(model8, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Bachelor's or equivalent education"), interval="prediction", level=.95)
#predict(model8, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Master's,Doctoral/equivalent"), interval="prediction", level=.95)
set.seed(101)
<- sample.int(n=nrow(eur_data),size=floor(0.7*nrow(eur_data)),replace=FALSE)
sample
<- eur_data[sample,]
eur_datatrain
<- eur_data[-sample,]
eur_datatest
<-lm(Value~ISC11A.1+SEX+Age, data=eur_datatrain)
model9
summary(model9)
#predict(model9, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Bachelor's or equivalent education"), interval="prediction", level=.95)
#predict(model9, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Master's,Doctoral/equivalent"), interval="prediction", level=.95)
set.seed(101)
<- sample.int(n=nrow(asia_data),size=floor(0.7*nrow(asia_data)),replace=FALSE)
sample
<- asia_data[sample,]
asia_datatrain
<- asia_data[-sample,]
asia_datatest
<-lm(Value~ISC11A.1+SEX+Age, data=asia_datatrain)
model10
summary(model10)
#predict(model10, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Bachelor's or equivalent education"), interval="prediction", level=.95)
#predict(model10, newdata=data.frame(SEX="F", Age = "25-34 years", ISC11A.1="Master's,Doctoral/equivalent"), interval="prediction", level=.95)