Course: DS 4002 - Data Science Project Course
Authors: Navya Annapareddy, Tony Albini, Kunaal Sarnaik, & Jaya Vellayan
Professor: Brian Wright
Date: January 8th, 2021
The following section serves as a general introduction to the topic discussed throughout this research analysis: mental health in the workplace. See the sub-sections embedded below for further information: Motivation, Background, General Research Question, Relevant Research, & Initial Hypotheses.
According to the National Institute of Mental Health, approximately one in five adults in the United States lives with a mental illness; this striking proportion corresponded to nearly 51.1 million citizens in 2019.$^{1}$ Although the proportion of adults who experience a mental illness on the global scale is smaller (1 in 17, according to the World Health Organization in 2017), it still roughly corresponds to an astounding half a million people.$^{2}$ Despite these staggering statistics on mental illness incidence and prevalence, previous studies have shown that only a small percentage of adults who suffer from a mental illness actually receive treatment for it.$^{3}$ For example, the National Institute of Mental Health found that a mere 43.8% of United States adults who were diagnosed with a mental illness actually received treatment for it.$^{3}$
Given that incidences of mental illness have not only skyrocketed since the turn of the century, but are also expected to continue increasing at an alarming rate, strategies for addressing, coping with, and treating mental illnesses at the individual level must be developed urgently.$^{4}$ However, when it comes to addressing mental illnesses, one devastating obstacle looms large both individually and societally: the stigma against them. According to the American Psychiatric Association (APA), three types of stigma govern the mental illness conversation$^{5}$:
1. Public Stigma: negative or discriminatory attitudes that others have about mental illness.
2. Self-stigma: negative attitudes, including internalized shame, that people with mental illness have about their own condition.
3. Institutional Stigma: more systemic, involving policies of government and private organizations that intentionally or unintentionally limit opportunities for people with mental illness.
A unique location where all three of these stigmas are practiced extensively is the workplace, where the APA has found that more than one in three workers are concerned about being fired if they seek mental health care.$^{6}$ The combination of these challenges makes the already difficult act of talking about mental illness much more arduous for workers, who comprise a large proportion of mental illness cases on the global scale.$^{7,8}$ Moreover, expressing one's experience with a mental illness is largely considered to be among the first steps towards treatment, and failing to do so may only exacerbate the symptoms of a mental illness for the individual.$^{7}$ Furthermore, in the workplace, poor mental health and stress can be detrimental to an employee's job performance, work engagement, communication with coworkers, physical capability, and ability to carry out day-to-day functions.$^{9}$
In a study conducted by the Centers for Disease Control and Prevention (CDC), only 57% of employees who reported suffering from moderate depression, and a mere 40% of those who reported suffering from severe depression, received treatment to control their symptoms.$^{9}$ These statistics, combined with the fact that mental health is only becoming harder to manage during the current COVID-19 pandemic, suggest that finding strategies to increase the likelihood that employees express their experience with mental illness to others and receive treatment (a mere 55% did so in 2019, according to a 2021 study) is an urgent priority not only for the nation, but also for the world.$^{10}$
The following research project deploys data science and machine learning principles to investigate which factors present in the workplace are the strongest determinants of whether or not an employee confides in their direct supervisor(s) about a mental health concern.
The Mayo Clinic defines mental illnesses, also known as mental health disorders, as conditions that affect one's mood, thinking, and behavior.$^{11}$ Examples of these psychological afflictions include depression, anxiety, schizophrenia, bipolar disorder, eating disorders, and addictive behaviors, yet the term mental illness is often used to refer to any among a wide range of mental health conditions. Common symptoms of a mental illness vary and depend on the disorder; examples include dysphoric mood, avolition, withdrawal from activities, suicidal thinking, and substance abuse.
Although the outcomes of suffering from a mental illness can be just as debilitating as those of a physical illness such as cancer, the stigmatization of those who suffer from them forms a pervasive barrier that prevents many from getting better.$^{12}$ Since contact with persons with mental illnesses is currently the most well-supported way to reduce stigma, identifying methods to increase such contact in the workplace, an environment that is nearly universal, is an urgent priority. However, for such contact to take place, modern workplace culture must be analyzed.
According to a 2019 survey conducted by Mind Share Partners' Mental Health at Work, 37% of workers reported that their work environment contributed to their mental health symptoms.$^{13}$ Due to a combination of cultural stigma, a lack of understanding of symptoms, increased pressure, and a lack of resources, the current state of the workplace can not only cause mental illnesses, but also further exacerbate existing ones.$^{14}$ Therefore, to understand why insufficient contact with those suffering from mental health disorders occurs in today's workplace, an in-depth analysis of the factors contributing to an employee's willingness to discuss mental health problems with their direct supervisor may be enlightening. After all, the workplace, whether online or in-person, occupies an increasing share of a person's time, and finding ways to decrease mental health stigma can yield invaluable returns, not only in addressing the illness itself, but also in helping the sufferer access resources that will alleviate its symptoms.
An in-depth analysis utilizing data science and machine learning principles will be conducted in an attempt to address the following research question:
What are the strongest predictors dictating whether or not an employee will discuss mental health concerns with their direct supervisor(s)?
Using this question as a guide, hypotheses will be developed and an extensive, comprehensive analysis leveraging state-of-the-art machine learning techniques will be conducted. A dataset containing responses to a survey administered to employees in the technology industry, one that is largely predicted to sustain the most growth in the near future, will be analyzed (see the Dataset Overview & Cleaning section for more information).$^{15}$ The data will be cleaned, exploratory data analysis will be conducted, and machine learning techniques will then be applied in order to determine the strongest predictors of the discrete response variable.
Several studies related to predicting outcomes and behavior related to mental health in the workplace have been conducted previously. For example, a team from the Keio University School of Medicine in Japan, Sado et al., analyzed predictors of repeated sick leave in the workplace as a result of mental disorders.$^{16}$ The Japanese researchers utilized Return To Work (RTW) and repeated sick leave rates among 194 subjects employed at a manufacturing company. The independent variables of interest included sex, age at the time of employment, job tenure, diagnosis, number of previous sick leave days, duration of most recent sick leave, and employee rank. After conducting a univariate analysis using the log-rank test and a multivariate analysis using the Cox proportional hazards model, Sado et al. found that employees' ages and number of previous sick-leave episodes were significant predictors of repeated sick leave due to a mental illness.
Moreover, the Centers for Disease Control and Prevention (CDC) annually conducts a "Mental Health in the Workplace" study, which is included as a sub-section of its Workplace Health Guide.$^{9}$ This study tracks mental illness solutions, frameworks to raise mental health awareness, and useful strategies to tackle the problem. Historically, the CDC has pointed to three key indicators and risk factors of mental illness: stigma, lack of health care, and lack of social connections. Furthermore, the CDC has repeatedly suggested that employers offer mental health benefits (e.g., health insurance with no or low out-of-pocket costs, free clinical screenings, and informational brochures), be flexible with sick leave specifically with respect to mental illnesses, and provide managers with training to help recognize the signs and symptoms of common mental health disorders.
According to studies consolidated by Harvard Health, the most debilitating characteristic of mental health disorders is that most go unrecognized and untreated due to the stigma attached to them.$^{17}$ Due to such stigma, employees in the workplace feel extremely reluctant to seek treatment out of fear that they might jeopardize their jobs, further limiting their already constrained ability to communicate their feelings to co-workers. Similarly, the transition towards remote working, whether due to the rise of the technology industry or the coercion of the COVID-19 pandemic, may exacerbate feelings of isolation and loneliness, according to the Center for Workplace Mental Health.$^{18}$ Moreover, remote transitions can have a debilitating effect on the number of people in a person's social circle, which can further limit their ability to reach out for help.
In summary, the combination of previous research related to the General Research Question points to several common themes that may affect the likelihood that an employee reaches out to their direct supervisor to discuss a mental health issue: the community, the stigmatization, and the resources present at the workplace itself.
Null Hypothesis: Using Random Forest and permutation importance from Python's scikit-learn library, the 4 target predictors account for 50% or less of total feature importance (which is normalized to 1) when predicting whether employees are willing to discuss mental health issues with supervisors.
Alternative Hypothesis: Using Random Forest and permutation importance from Python's scikit-learn library, the 4 target predictors account for more than 50% of total feature importance (which is normalized to 1) when predicting whether employees are willing to discuss mental health issues with supervisors. A minimal sketch of this computation is shown after the target predictors below.
Target Predictors:
NOTE: The target predictors are categorical variables found in the dataset as explained below (See Dataset Overview & Cleaning section for more information).
- benefits: Does your employer provide mental health benefits?
Justification: In the Centers for Disease Control and Prevention (CDC) Workplace Health Guide, there are several suggestions related to mental health-specific employment benefits.$^{9}$ If benefits are available and provided free of charge as part of an employee's occupational agreement, the employee should theoretically be more likely to reach out to their supervisor in an attempt to capitalize on them if they are suffering from a mental health disorder. Furthermore, the study conducted by the Japanese researchers Sado et al. found that previous sick leave episodes related to a mental disorder were a statistically significant predictor of future sick leave episodes related to the mental illness.$^{16}$ On the other hand, if employees are not offered occupational benefits specifically related to maintaining their mental health, it is reasonable to assume that they would be less likely to discuss a mental health issue with their direct supervisor(s).
- no_employees: How many employees does your company/organization have?
Justification: Several sources outlined in the previous section pointed to social circles as a key indicator not only of one's mental wellness, but also of the likelihood that one's mental health disorder improves. For example, the analysis conducted by the Center for Workplace Mental Health reported that staying connected with co-workers and friends is paramount to staying mentally well during remote work.$^{18}$ Furthermore, the CDC Workplace Health Guide reports that the workplace is an optimal setting to create a culture of health because communication structures are already in place and social support networks are available.$^{9}$ Thus, the more co-workers an employee has, the more likely they are to have others to confide in, be encouraged by, and feel supported by. Such an atmosphere should theoretically make them more likely to discuss any mental health issues with a direct supervisor, while a small number of co-workers should logically have the opposite effect.
- obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
Justification: This predictor is likely a measure of the amount of mental health stigmatization present in a person's place of work. According to Harvard Health, the stigma attached to having a psychiatric disorder is such that employees may be reluctant to seek treatment out of fear that they might jeopardize their jobs.$^{17}$ Thus, if an employee has heard of or observed negative consequences for a coworker with mental health issues in their workplace, there likely exists strong mental health stigmatization in that workplace environment. By extension, the employee should theoretically be less likely to reach out to their direct supervisor regarding a mental health issue. On the other hand, if no such consequence has been observed or heard of, the employee likely works at a place that does not stigmatize mental health as much, which may result in a stronger likelihood that the person reaches out to their direct supervisor(s).
- remote_work: Do you work remotely (outside of an office) at least 50% of the time?
Justification: Working remotely may substantially limit the flexibility an employee possesses when it comes to their social circle. A significant caveat to the workplace being considered an emotional and social support system, as described by the CDC, is remote work.$^{9}$ Due to the lack of interpersonal interaction, as well as the difficulty of reading emotions over a laptop screen, an employee may feel much more alone if they work remotely as compared to working in person. As such, they should theoretically be less likely to discuss a mental health issue with their supervisor, and vice-versa if they are working in person.
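The following is a minimal sketch of how this hypothesis could be tested with scikit-learn, assuming the label-encoded DataFrame mc_df that is constructed later in the Dataset Overview & Cleaning section (the column names follow the survey, and the normalization step is only approximate because permutation importances can be slightly negative for uninformative features):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
target_predictors = ["benefits", "no_employees", "obs_consequence", "remote_work"]
X = mc_df.drop("supervisor", axis=1)   # all survey features except the response
y = mc_df["supervisor"]                # willingness to talk to a supervisor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=508)
rf = RandomForestClassifier(n_estimators=100, random_state=508).fit(X_train, y_train)
# Permutation importance on the held-out split, normalized so the importances sum to 1
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=508)
importances = pd.Series(perm.importances_mean, index=X.columns)
importances = importances / importances.sum()
share = importances[target_predictors].sum()
print(f"Share of total importance held by the 4 target predictors: {share:.2%}")
# The null hypothesis is rejected if this share exceeds 50%.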
The following section describes the dataset being used to address the general research question listed above, as well as details how the dataset was pre-processed for analysis.
The dataset being utilized to answer the general research question and test the hypotheses listed above is from a 2014 survey conducted by the Open Sourcing Mental Illness organization.$^{19}$ This organization aims to change the way mental health is discussed in the tech community, addressing stigmatization and providing necessary resources to those suffering from mental health disorders. In the survey, several responses were gathered, in addition to the demographics of participants, in an effort to measure attitudes towards mental health and the frequency of mental health disorders in the tech workplace. The categories of the dataset, with the question that corresponds to each, are listed below:
Dataset Link$^{20}$: https://www.kaggle.com/osmi/mental-health-in-tech-survey
As a refresher, supervisor is the response variable in the study, and benefits, obs_consequence, no_employees, and remote_work are the target predictors that are being tested in the initial hypotheses listed above.
The following sub-section includes a step-by-step overview of how the dataset was loaded into the Google Colab environment, cleaned, and prepared for exploratory data analysis.
For the analysis to be conducted in Python, several packages, libraries, and modules must be utilized. They are listed below:
import pandas as pd
import numpy as np
from os import listdir, walk
from os.path import isfile, join
import itertools
import sys,requests
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.offline import iplot, init_notebook_mode
import matplotlib.pyplot as plt
import seaborn as sns
#USE IF CONVERTING TO HTML
import plotly.offline as py
py.init_notebook_mode(connected=False)
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance
from sklearn.metrics import classification_report, confusion_matrix
import sklearn.metrics as skmet
pd.set_option('mode.chained_assignment', None)
The following sub-section showcases how the Python system modules were utilized to mount Google Drive and point Colab's directory to the one which contains the dataset being used to address the general research question.
The code chunks below mount Google Drive and open the directory in which the dataset (.csv) is stored.
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')
!ls "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data"
folder_path = "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data"
file_path = "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data/survey.csv"
def getAllFilesInDirectory(directoryPath: str):
    return [(directoryPath + "/" + f) for f in listdir(directoryPath) if isfile(join(directoryPath, f))]
getAllFilesInDirectory(folder_path)
The following code chunk uses pandas to read in the dataset (.csv) and preview it. Here, the Timestamp, state, and comments categories are also dropped; the reasoning is explained below.
df = pd.read_csv(file_path)
df = df.drop(columns=['Timestamp', 'state', 'comments'])
df.head(5)
Several pre-processing steps were applied to the dataset previewed above.
First, as mentioned previously, the Timestamp, state, and comments categories were dropped. This was done because we assumed that the time of survey response did not have a major effect on the response variable, state was only populated when the country was the United States (and was empty otherwise), and the comments category was filled with free-text entries that would not have added usable information to the subsequent machine learning model.
The following two code chunks describe in detail how the Gender, wellness_program, anonymity, seek_help, leave, mental_vs_physical, benefits, care_options, work_interfere, self_employed, Country, and Age columns were cleaned in the dataset.
df.Gender.value_counts()
The output above shows the various instances of Gender present in the original dataset. As shown in the below code chunk, they were further categorized into 'male', 'female', and 'other' instances.
df["Gender"] = df["Gender"].str.lower()
df.Gender[df.Gender == 'm'] = "male"
df.Gender[df.Gender == 'f'] = "female"
df.Gender[(df.Gender != 'female') & (df.Gender != 'male')] = "other"
df.wellness_program[df.wellness_program == 'Don\'t know'] = "No"
df.anonymity[df.anonymity == 'Don\'t know'] = "No"
df.seek_help[df.seek_help == 'Don\'t know'] = "No"
df.leave[df.leave == 'Don\'t know'] = "Somewhat difficult"
df.mental_vs_physical[df.mental_vs_physical == 'Don\'t know'] = "No"
df.benefits[df.benefits == 'Don\'t know'] = "No"
df.care_options[df.care_options == 'Not sure'] = "No"
df.work_interfere = df.work_interfere.fillna("No Response")
df.self_employed = df.self_employed.fillna("No")
The above code chunks also showcase how non-standard responses (e.g., variant spellings of gender) were mapped back to standardized answers. Furthermore, some of the questions in the survey contained "Don't know" or "Not sure" as possible responses. These were re-leveled to "No", the reasoning being that if employees do not know about a certain mental health-related characteristic of their workplace, their workplace likely either does not have it or does not promote it enough to its employees. As a result, it was assumed that both responses should map to the "No" answer.
The following code chunk categorizes the countries present in the dataset into 'North America', 'Europe', and 'Other'. This was done because of the need to encode the categories later on. Since 48 countries are present in this dataset, one-hot encoding would add a separate indicator column for each country, and label encoding would assign them arbitrary labels from 0-47; neither would be good for the performance of our multiclass classifier.
df["country_bucket"] = df["Country"]
df.country_bucket[df.country_bucket == 'United States'] = "North America"
df.country_bucket[df.country_bucket == 'Canada'] = "North America"
df.country_bucket[df.country_bucket == 'United Kingdom'] = "Europe"
df.country_bucket[df.country_bucket == 'Germany'] = "Europe"
df.country_bucket[df.country_bucket == 'Ireland'] = "Europe"
df.country_bucket[df.country_bucket == 'France'] = "Europe"
df.country_bucket[df.country_bucket == 'Netherlands'] = "Europe"
df.country_bucket[df.country_bucket == 'Bulgaria'] = "Europe"
df.country_bucket[df.country_bucket == 'Switzerland'] = "Europe"
df.country_bucket[df.country_bucket == 'Italy'] = "Europe"
df.country_bucket[df.country_bucket == 'Poland'] = "Europe"
df.country_bucket[df.country_bucket == 'Sweden'] = "Europe"
df.country_bucket[df.country_bucket == 'Belgium'] = "Europe"
df.country_bucket[df.country_bucket == 'Austria'] = "Europe"
df.country_bucket[df.country_bucket == 'Finland'] = "Europe"
df.country_bucket[df.country_bucket == 'Russia'] = "Europe"
df.country_bucket[df.country_bucket == 'Greece'] = "Europe"
df.country_bucket[df.country_bucket == 'Croatia'] = "Europe"
df.country_bucket[df.country_bucket == 'Spain'] = "Europe"
df.country_bucket[df.country_bucket == 'Czech Republic'] = "Europe"
df.country_bucket[df.country_bucket == 'Romania'] = "Europe"
df.country_bucket[df.country_bucket == 'Moldova'] = "Europe"
df.country_bucket[df.country_bucket == 'Denmark'] = "Europe"
df.country_bucket[df.country_bucket == 'Norway'] = "Europe"
df.country_bucket[df.country_bucket == 'Georgia'] = "Europe"
df.country_bucket[df.country_bucket == 'Slovenia'] = "Europe"
df.country_bucket[df.country_bucket == 'Hungary'] = "Europe"
df.country_bucket[df.country_bucket == 'Portugal'] = "Europe"
df.country_bucket[df.country_bucket == 'Bosnia and Herzegovina'] = "Europe"
df.country_bucket[df.country_bucket == 'Latvia'] = "Europe"
df.country_bucket[df.country_bucket == 'India'] = "Other"
df.country_bucket[df.country_bucket == 'South Africa'] = "Other"
df.country_bucket[df.country_bucket == 'Australia'] = "Other"
df.country_bucket[df.country_bucket == 'New Zealand'] = "Other"
df.country_bucket[df.country_bucket == 'Philippines'] = "Other"
df.country_bucket[df.country_bucket == 'Costa Rica '] = "Other"
df.country_bucket[df.country_bucket == 'Australia'] = "Other"
df.country_bucket[df.country_bucket == 'Zimbabwe'] = "Other"
df.country_bucket[df.country_bucket == 'Colombia'] = "Other"
df.country_bucket[df.country_bucket == 'Singapore'] = "Other"
df.country_bucket[df.country_bucket == 'Israel'] = "Other"
df.country_bucket[df.country_bucket == 'Brazil'] = "Other"
df.country_bucket[df.country_bucket == 'China'] = "Other"
df.country_bucket[df.country_bucket == 'Japan'] = "Other"
df.country_bucket[df.country_bucket == 'Nigeria'] = "Other"
df.country_bucket[df.country_bucket == 'Thailand'] = "Other"
df.country_bucket[df.country_bucket == 'Bahamas, The'] = "Other"
df.country_bucket[df.country_bucket == 'Costa Rica'] = "Other"
df.country_bucket[df.country_bucket == 'Uruguay'] = "Other"
df.country_bucket[df.country_bucket == 'Mexico'] = "Other"
df["country_bucket"].value_counts()
df.isna().sum()
The above output serves as a check to ensure that there are no NaN instances present in our modified dataset.
# Map each age to a decade bucket; any age outside 0-99 falls through to 0 and is flagged as an outlier
d = {range(0, 18): "<18", range(18, 30): "20s", range(30, 40): "30s", range(40, 50): "40s", range(50, 60): "50s", range(60, 100): "60s+"}
df['age_bucket'] = df['Age'].apply(lambda x: next((v for k, v in d.items() if x in k), 0))
df[df['age_bucket'] == 0].Age.value_counts()
The code chunk and output above depict how the Age category was converted into a discrete variable by bucketing it into age intervals (e.g., less than 18, 20s, 30s, etc.). Furthermore, as depicted by the output above, five outliers were present in the Age category. These were imputed with the mean age present in the dataset, which was found to be approximately 32, as shown below.
# Replace outlier ages with NaN, impute them with the mean age, and place those rows in the "30s" bucket
df.Age[df.age_bucket == 0] = float('NaN')
df.Age[df.age_bucket == 0] = df.Age.mean()
df.age_bucket[df.age_bucket == 0] = "30s"
df[df['age_bucket'] == 0]
df["age_bucket"] = pd.Categorical(df["age_bucket"], ["<18", "20s", "30s", "40s", "50s", "60s+"])
df['no_employees'] = pd.Categorical(df['no_employees'], ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"])
df['leave'] = pd.Categorical(df['leave'], ["Very difficult", "Somewhat difficult", "Somewhat easy", "Very easy"])
The above code chunk sorts the discrete variables age_bucket, leave, and no_employees in a logical order such that figures that are generated in subsequent analysis are more intuitive.
enc = LabelEncoder()
mc_df = df.copy()
for category in ["Gender", "Country", "self_employed", "family_history", "treatment",
                 "work_interfere", "no_employees", "remote_work", "tech_company",
                 "benefits", "care_options", "wellness_program", "seek_help", "anonymity",
                 "leave", "mental_health_consequence", "phys_health_consequence", "coworkers",
                 "supervisor", "mental_health_interview", "phys_health_interview", "mental_vs_physical",
                 "obs_consequence", "age_bucket", "country_bucket"]:
    mc_df[category] = enc.fit_transform(df[category])
mc_df.head()
The code chunk above uses the label encoding method to convert the categorical instances into numeric ones. Although encoding is necessary to make the categories usable by the mathematical machinery of machine learning, there is a stark difference between label encoding, which was performed above, and one-hot encoding, which is performed later in this notebook.
In label encoding, each instance of a category is assigned a unique integer based on alphabetical ordering.$^{21}$ Although this is relatively easy to perform and computationally inexpensive later in the machine learning pipeline, label encoding inherently adds a rank to the categorical instances based on alphabetical order. For example, as mentioned previously in the notebook, the 48 countries in this dataset would ultimately be numbered from 0-47, and this rank would then be accounted for in the machine learning model. However, the rank might have nothing to do with how much the instance contributes to the response variable, so label encoding, as we will see later, is not an ideal fit for this dataset.
On the other hand, one-hot encoding simply creates additional features based on the number of unique values in a categorical feature; each unique instance in the category is added as its own binary feature. Although this avoids the ranking problem associated with label encoding, it adds considerable complexity to our dataset, especially for high-cardinality categorical features such as Country. As such, we bucketed such categories (e.g., Country) to avoid unnecessary complexity and decrease the computational expense downstream.
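As a quick illustration of the difference, the hypothetical toy column below (not part of the survey data) is encoded both ways:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
toy = pd.DataFrame({"country_bucket": ["Europe", "North America", "Other", "Europe"]})
# Label encoding: a single integer column with an implied alphabetical rank
# (Europe = 0, North America = 1, Other = 2)
print(LabelEncoder().fit_transform(toy["country_bucket"]))
# One-hot encoding: one 0/1 indicator column per unique value, with no implied ordering
print(pd.get_dummies(toy["country_bucket"]))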
The following section outlines how exploratory data analysis (EDA) was initially conducted with respect to the response variable and independent variables present within the dataset. The exploratory data analysis was organized by several key questions that were generated as more specific versions of our general research question. The questions were also created in an attempt to test and possibly alter our original hypotheses. The Key EDA Questions, in addition to the General Research Question, can be found listed below:
General Research Question:
Key EDA Questions:
The first thing we want to look at among our dataset's features is how well-distributed the response variable is. Determining the base rate not only serves as a good measure of model robustness, but also indicates whether the method we ultimately use to approach the general research question is the most appropriate. The following code chunks determine the base rate of our response variable.
As a refresher, the response variable is named supervisor, and it is a categorical measure of how willing employees are to discuss mental health issues with their direct supervisor(s). The possible survey responses, or response variable classes, are as follows: 'Yes', 'Some of them', and 'No'.
comp_employees = df.groupby(["supervisor"]).size().reset_index(name='Count')
fig = px.bar(comp_employees,
x="supervisor",
y="Count",
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Willingness to Reach Out to Direct Supervisor(s)'
)
fig.show()
As observed qualitatively in the bar chart above, the response variable is relatively well-distributed among the possible classes. There are 393 instances of 'No', 350 instances of 'Some of them', and 516 instances of 'Yes'.
The following code chunk will determine the base rates characterizing the response variable.
num_no = (df.supervisor == "No").sum()
num_some = (df.supervisor == "Some of them").sum()
num_yes = (df.supervisor == "Yes").sum()
fig = go.Figure(data=[go.Pie(labels=["No", "Some of them", "Yes"],
values=[num_no, num_some, num_yes],
pull=[0, 0, 0.1],
title="Base Rates of Response Variable (supervisor)",
)])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
As seen quantitatively above, the base rates of the response variable are as follows: 41.0% for the 'Yes' class, 27.8% for the 'Some of them' class, and 31.2% for the 'No' class. These can serve as indicators of what randomly guessing the supervisor class of any given employee would look like. As such, they will be utilized later as the primary test of model robustness.
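As a small sketch using the cleaned df from Section II, the majority-class base rate that any useful classifier should beat can be computed directly:
base_rates = df["supervisor"].value_counts(normalize=True)
print(base_rates)         # ~41% 'Yes', ~31% 'No', ~28% 'Some of them'
print(base_rates.max())   # the accuracy of always guessing the majority class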
Next, let us see both how the demographic features present in our dataset (age, gender, and country) are distributed and the effect they might have on the response variable.
num_less18 = (df.age_bucket == "<18").sum()
num_20s = (df.age_bucket == "20s").sum()
num_30s = (df.age_bucket == "30s").sum()
num_40s = (df.age_bucket == "40s").sum()
num_50s = (df.age_bucket == "50s").sum()
num_60plus = (df.age_bucket == "60s+").sum()
fig = go.Figure(data=[go.Pie(labels=["<18", "20s", "30s", "40s", "50s", "60s+"],
values=[num_less18, num_20s, num_30s, num_40s, num_50s, num_60plus],
pull=[0, 0, 0.1, 0, 0, 0],
title="Age Distribution",
)])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
At first glance, the pie chart above appears heavily skewed toward employees in their 20s and 30s. As such, it might be better to leave this variable continuous, as originally present in the dataset. We will later see the effect on the model of leaving age continuous as opposed to bucketing it into categories.
comp_age = df.groupby(["supervisor","age_bucket"]).size().reset_index(name='Count')
fig = px.bar(comp_age,
x="supervisor",
y="Count",
color="age_bucket",
barmode='group',
color_discrete_sequence=px.colors.sequential.Plasma_r,
title='Distribution of Likelihood to Reach Out to Supervisor by Age'
)
fig.show()
As qualitatively observed in the bar chart above, age does not generally look like it has an effect on the response variable. However, upon a second look, we can see that the counts of the 60s+ and <18 groups are minute relative to the other age categories. This backs up what was previously stated, suggesting that age might be better left as a continuous variable rather than transformed into a discrete one. Nevertheless, the qualitative observation suggests that age does not play a major role in determining whether or not an employee discusses a mental health concern with their direct supervisor(s). As such, we will leave it out of our hypothesized target predictor set.
num_female = (df.Gender == "female").sum()
num_male = (df.Gender == "male").sum()
num_other = (df.Gender == "other").sum()
fig = go.Figure(data=[go.Pie(labels=["Female", "Male", "Other"],
values=[num_female, num_male, num_other],
pull=[0, 0.1, 0],
title="Gender Distribution",
)])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
As observed in the qualitative pie chart above, the proportion of employees identifying as male in our dataset largely exceeds both the proportion of employees identifying as female and the proportion of employees identifying as other. Since we are concerning ourselves with the tech workplace, this is as expected. However, it may skew our data in terms of its effect on the response variable. Thus, let us execute the code chunk below to gain a preliminary view of the effect of gender categories on our response variable.
comp_gb = df.groupby(["supervisor","Gender"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="Gender",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Supervisor by Gender'
)
fig.show()
As seen in the grouped bar plot above, gender does not seem to have much of an impact on whether an employee reaches out to their direct supervisor(s) with regard to a potential mental health concern. As such, we will leave this category in our dataset for now, but not add it to our 4 initial target predictors.
comp_country = df.groupby(["country_bucket"]).size().reset_index(name='Count')
fig = go.Figure(data=[go.Pie(labels=comp_country["country_bucket"],
values=comp_country["Count"],
pull=[0.1, 0],
title="Country Distribution",
)])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
marker=dict(line=dict(color='#000000', width=2)))
fig.show()
As shown qualitatively in the pie chart above, North American countries dominate the dataset (65.4% of responses), followed by European countries (28.8% of responses). The 'Other' bucket pales in comparison at 5.88%. Nevertheless, as with the gender category, it may be too early to exclude this variable, so let us qualitatively see what its effect on the response variable is.
comp_gb = df.groupby(["supervisor","country_bucket"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="country_bucket",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Supervisor by Country'
)
fig.show()
Surprisingly, country does not appear to play much of a role in determining whether or not an employee will reach out to their direct supervisor(s) to discuss a potential mental health concern, despite the differing laws, healthcare systems, and benefits provided by different countries with respect to mental health. Nevertheless, we will keep the country_bucket category in our dataset for now, but we will not add it to our hypothesized 4 initial target predictors.
With demographic indicators out of the way, let us move on to analyzing our 4 initial target predictors. As a refresher, they are no_employees, remote_work, benefits, and obs_consequence.
As previously mentioned, the no_employees feature tracks how many employees are present at the subject's workplace.
comp_employees = df.groupby(["no_employees"]).size().reset_index(name='Count')
fig = px.bar(comp_employees,
x="no_employees",
y="Count",
color_discrete_sequence=px.colors.qualitative.D3,
title='Number Employees Distribution'
)
fig.show()
As seen qualitatively in the bar plot above, the distribution of the number of employees is bimodal. It will be interesting to observe the effect of such a distribution on the response variable, as there may be merit to having both a small and a large company when it comes to mental health discussions with a supervisor.
comp_gb = df.groupby(["supervisor","no_employees"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="no_employees",
barmode='group',
color_discrete_sequence=px.colors.sequential.Plasma_r,
title='Distribution of Likelihood to Reach Out to Supervisor by no_employees'
)
fig.show()
It seems as if the bimodal distribution does have a distinct effect on the response variable, at least qualitatively. This can be most distinctly observed in the 6-25 group, whose count changes noticeably from the 'No' class to the 'Yes' class of our response variable. Due to this seemingly positive relationship, we will keep no_employees in our target predictor set.
As a refresher, the obs_consequence feature tracks whether or not an employee has heard of or observed a negative consequence occurring to a coworker and related to their mental health.
comp_gb = df.groupby(["supervisor","obs_consequence"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="obs_consequence",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Likelihood to Reach Out to Supervisor based on Observed Negative Consequences for Coworkers with Mental Health Conditions'
)
fig.show()
Based on the distribution above, it can be qualitatively observed that obs_consequence seems to have a role in determining the outcome of the response variable. When the instance of this category is No, it seems as if employees are more likely to discuss their mental health issues with their direct supervisor(s). Thus, we will keep the category in our hypothesized target predictor set for now.
As a reminder, the remote_work feature tracks whether or not more than 50% of an employee's work takes place remotely as compared to in-person.
comp_gb = df.groupby(["supervisor","remote_work"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="remote_work",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Supervisor Based on Remote Work Availability'
)
fig.show()
Although the distribution shown above is not as distinct with respect to a relationship between remote_work and classes in the response variable, we will keep remote_work in our target predictor set. We decided on this for two reasons: 1) working remotely is becoming an increasingly available option, especially in the tech industry, and 2) this feature is extremely relevant right now due to COVID-19 forcing many people to work remotely. Thus, we want to see how this will contribute to discussing a mental health issue with a direct supervisor.
As previously mentioned in this notebook, the benefits feature tracks whether or not workplace benefits specifically related to mental health are present in an employee's workplace.
comp_gb = df.groupby(["supervisor","benefits"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
x="supervisor",
y="Count",
color="benefits",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Supervisor Based on if Employers Provide Mental Health Benefits'
)
fig.show()
Similar to the remote_work feature, there does not seem to be much of a qualitative association between the benefits feature and our response variable. Despite this, our literature review repeatedly cited benefits as an indicator of beneficial mental health outcomes. As such, we will leave this in our target predictor set.
It was suggested by peers to also include leave and seek_help in our target predictor set. In this sub-section, we will analyze how these features contribute to the outcome of our response variable.
As previously mentioned, the leave feature tracks how easy it is for an employee to take medical leave specifically due to a mental health condition.
comp_leave = df.groupby(["supervisor","leave"]).size().reset_index(name='Count')
fig = px.bar(comp_leave,
x="supervisor",
y="Count",
color="leave",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Direct Supervisor by how easy it is to take Medical Leave'
)
fig.show()
Unlike the previous two categories, the leave feature seems to have a distinct association with the response variable. This is most easily observed with the 'very easy' category, which essentially more than doubled in frequency from the 'Some of them' class to the 'Yes' class. As such, we will definitely add the leave feature to our hypothesized set of predictors.
As mentioned previously, the seek_help feature tracks whether an employee's employer provides resources to learn more about mental health issues and how to seek help.
comp_seekhelp = df.groupby(["supervisor","seek_help"]).size().reset_index(name='Count')
fig = px.bar(comp_seekhelp,
x="supervisor",
y="Count",
color="seek_help",
barmode='group',
color_discrete_sequence=px.colors.qualitative.D3,
title='Distribution of Likelihood to Reach Out to Direct Supervisor by whether Employer Provides Resources to Raise Mental Health Awareness'
)
fig.show()
As qualitatively observed in the plot above, there does not seem to be much of an association between the seek_help feature and our response variable. As such, this does not warrant an addition into our hypothesized set of features.
Now that we have analyzed the qualitative effects of our hypothesized set of target predictors on our response variable, even adding one (leave) along the way, let us look at more quantitative metrics to determine relationships between the predictive features and the response variable, as well as among the predictive features themselves.
The following code chunk and output will display a correlation matrix between the features of our dataset.
corr = mc_df.iloc[:].corr()
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
plt.show()
As observed qualitatively above, although some features are highly correlated with each other, not many appear to be highly correlated with our response variable (supervisor). At first glance, only coworkers and mental_vs_physical seem to be notably correlated with the response variable, so let us double-check this qualitative observation with quantitative values.
corr["supervisor"]
After isolating the supervisor column of the matrix and examining the quantitative values, our suspicions are more or less confirmed. The coworkers and mental_vs_physical features are the most correlated with our response variable (supervisor), with coefficients of 0.5743 and 0.3117, respectively. Given this, we may consider removing them so that our model does not overfit to the training set.
res = chi2(mc_df.iloc[:], mc_df['supervisor'])
features = pd.DataFrame({
'features': mc_df.columns[:],
'chi2': res[0],
'p-value': res[1]
})
features
Finally, the results of running a chi-square analysis on each feature with respect to the response variable are depicted in the above output. The chi-square test investigates whether there is a statistically significant difference between the observed relationship and the expected relationship.$^{22}$ In other words, because the variables are categorical, the statistic is used to determine whether a statistically significant relationship exists between any given feature and the response variable.
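For reference, the chi-square statistic compares the observed and expected counts in each cell of the feature-by-response contingency table: $\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ and $E_i$ are the observed and expected frequencies; larger values (and smaller p-values) indicate a stronger departure from independence.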
According to the results, the following features are statistically significant at the 0.001 level: seek_help, anonymity, leave, mental_health_consequence, coworkers, and mental_vs_physical. This is too many features to remove outright, but we can reasonably decide to remove coworkers now, due to both its extremely small p-value (2.1754 × 10$^{-36}$) and its high correlation with the response variable (0.5743).
After answering the key questions associated with our general research question, we can now reformulate our original hypotheses. The biggest changes are the addition of the leave feature to the hypothesized set of target predictors, which now contains 5 features, and the rewording of the hypothesis to ask whether these are the 5 strongest predictors, instead of summing their feature importances and checking whether the total exceeds an arbitrary value of 50%.
Null Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will not be the 5 most important features when predicting whether employees are willing to discuss mental health issues with supervisors.
Alternative Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will be the 5 most important features when predicting whether employees are willing to discuss mental health issues with supervisors.
Target Predictors:
The following section outlines the initial model plan for the RandomForestClassifier and showcases the model's performance on the cleaned dataset. The model's performance is then compared depending on whether label encoding or one-hot encoding was utilized to encode the dataset's categorical features. Next, feature engineering is performed in an attempt to improve overall model performance. Finally, the model is tuned through a multi-stage grid search, with cross-validation performed within each stage. At the end of this section, the best possible model with respect to accuracy in classifying the supervisor response variable as 'Yes' will be determined.
This finalized model will subsequently be utilized in Section V: Finalized Model & Results to determine its performance with respect to the other classes of the supervisor response variable ('Some of them' and 'No').
In order to approach our general question and test our hypothesis, we will be utilizing the scikit-learn library, specifically its Random Forest Classifier, to model our dataset and determine which features in our dataset hold the most weight when predicting whether or not an employee is willing to discuss a mental health concern with their direct supervisor(s).
As mentioned previously, the scikit-learn library in Python is a state-of-the-art machine learning library consisting of many different classifiers and estimators. One of these is the Random Forest Classifier, which relies on the concept of classification trees. Essentially, a random forest algorithm grows many classification trees fitted to a specific dataset.$^{23}$ Each node in each tree is split based on a measure of impurity, most commonly either the Gini index or information entropy, as we will see later. This yields a fast variable importance measure at every node of the tree, allowing the algorithm to be efficient and computationally inexpensive. When creating a Random Forest Classifier, there are several arguments that can be passed in. As obtained from the scikit-learn Python documentation, listed below are several important ones that will be utilized later in this notebook for tuning and feature engineering$^{24}$:
n_estimators --> Refers to the number of trees in the random forest.
criterion --> As stated previously, refers to the function used to measure the quality/impurity of a split. The most common functions are the Gini index and the information gain (entropy); see the formulas after this list.
max_depth --> Refers to the maximum depth of the tree.
min_samples_split --> Refers to the minimum number of samples required to split an internal node.
min_samples_leaf --> Refers to the minimum number of samples required to be at a leaf node.
max_features --> Refers to the number of features to consider when looking for the best node split.
bootstrap --> Determines whether or not the whole dataset will be utilized to build each tree in the random forest (if True, each tree is instead grown on a bootstrap sample).
max_samples --> If bootstrap is True, refers to the proportion of the dataset that will be utilized to build each tree in the random forest.
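For reference, both split criteria are computed from the class proportions $p_k$ at a node: the Gini index is $G = 1 - \sum_{k} p_k^{2}$, while the entropy used by the information gain criterion is $H = -\sum_{k} p_k \log_{2} p_k$; in both cases, a value of 0 corresponds to a perfectly pure node.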
Although there are tradeoffs to extremely high and extremely low values of each hyperparameter, as we will see below, Random Forest Classifiers generally perform well and tend not to overfit badly. As such, the algorithm does not require very thorough hyperparameter tuning to produce an accurate result, as performance depends most heavily on the min_samples_leaf and n_estimators hyperparameters. Furthermore, the random forest algorithm makes it very easy to measure the relative importance of each feature on the prediction, which is what our hypothesis is about. Moreover, the scikit-learn Python library allows us to directly access normalized feature importance as an attribute of the fitted model. Ultimately, this is calculated by the Mean Decrease in the Gini Index, which renders it an impurity-based measure of feature importance. In the end, this will be our primary endpoint for testing the hypothesis and answering the general research question listed above.
Furthermore, bootstrapping will be leveraged to improve predictive accuracy and prevent overfitting even further, which reduces the variance of the overall classifier. All in all, the Random Forest Classifier will serve as an efficient measure of the feature importance of the 5 hypothesized predictors in our target set: obs_consequence, benefits, remote_work, leave, and no_employees.
However, in order to tune the hyperparameters listed above, and thereby optimize our model, a parameter search will need to be conducted. Our plan is to use GridSearchCV(), also provided by the scikit-learn Python library, on a distinct set of parameters for the purposes of tuning tree size, tree number, node size, and the number of estimators in our random forest model. Ultimately, we will run the GridSearchCV algorithm in stages, tuning the most important hyperparameters first before becoming more and more detailed in our approach. As such, we will start by tuning min_samples_leaf and n_estimators, then move on to max_features, and finally tune max_depth and max_samples.
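The following is a minimal sketch of the first stage of this plan, using the GridSearchCV and RandomForestClassifier imports from the top of the notebook and assuming the X_train/y_train split created below; the parameter grid shown is an illustrative placeholder, not the final set of values:
# Stage 1: tune the most influential hyperparameters first
stage1_grid = {"n_estimators": [100, 300, 500], "min_samples_leaf": [1, 2, 5]}
stage1_search = GridSearchCV(RandomForestClassifier(random_state=508),
                             stage1_grid, cv=5, scoring="accuracy")
stage1_search.fit(X_train, y_train)
print(stage1_search.best_params_)
# Later stages would fix these values and sweep max_features, then max_depth and max_samples.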
The dataset will be divided into training and testing portions via an 80:20 split in order to train and evaluate the performance of our model. To test our hypothesis, we will call feature importance and determine whether or not the 5 predictors present in our hypothesis are actually the 5 most important features in our Random Forest Classifier. As stated previously, feature importance is calculated by mean decrease in Gini Index.
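As a minimal sketch of that hypothesis test on the label-encoded frame mc_df (where the column names match the predictor names; with one-hot encoding each predictor would instead be split across several dummy columns), using the imports from the top of the notebook:
hypothesized = {"obs_consequence", "benefits", "remote_work", "leave", "no_employees"}
X_le = mc_df.drop("supervisor", axis=1)
y_le = mc_df["supervisor"]
X_tr, X_te, y_tr, y_te = train_test_split(X_le, y_le, test_size=0.2, random_state=508)
rf_le = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=508).fit(X_tr, y_tr)
# Impurity-based importances (mean decrease in Gini), normalized to sum to 1 by scikit-learn
importances = pd.Series(rf_le.feature_importances_, index=X_le.columns)
top5 = set(importances.nlargest(5).index)
print("Top 5 most important features:", top5)
print("All 5 hypothesized predictors in the top 5:", hypothesized <= top5)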
As previously mentioned, we must encode our categorical variables and convert them to numeric identifiers so that the random forest can actually learn from the training dataset. We have already label encoded the dataset, but now will also one-hot encode it. The distinction between the two was explained previously in Section II: Data Overview & Cleaning. The following code chunk performs the one-hot encoding aspect through utilization of the get_dummies function in the pandas library.
encoded_df = pd.get_dummies(df)
unfe_df = pd.read_csv(file_path)
unfe_df = unfe_df.drop(columns=['Timestamp', 'comments'])
unfe_df = pd.get_dummies(unfe_df)
# One-hot encoded frame with Country removed
cdrop = df.drop('Country', axis=1)
cdrop = pd.get_dummies(cdrop)
# One-hot encoded frame with both Country and age_bucket removed
agecdrop = df.drop(['Country', 'age_bucket'], axis=1)
agecdrop = pd.get_dummies(agecdrop)
The following code chunks will train a Random Forest Classifier on the one-hot encoded dataset as an initial indicator of model performance. Depending on the result, we will perform feature engineering as needed.
However, we first need to pre-process the dataset and drop some categories based on the results of our exploratory data analysis and on intuition. The following list summarizes the features that were dropped and the justification for dropping them.
All of the following models were trained on this reduced dataset. However, feature engineering was also performed for some of the models below, as explained later. Nevertheless, with all preprocessing finished, let us see the results of our initial model run.
X_full = unfe_df.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y_full = unfe_df.loc[: ,'supervisor_Yes']
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X_full,
    y_full,
    test_size = 0.2,
    random_state = 508)
#Instantiate and fit Entropy on Full Dataset
RF_entropy = RandomForestClassifier(n_estimators = 100,
criterion = 'entropy',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train_full, y_train_full)
rfc_cv_score = cross_val_score(RF_entropy, X_full, y_full, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_entropy.predict(X_test_full)
print(classification_report(y_test_full, rfc_predict))
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train_full, y_train_full).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test_full, y_test_full).round(7))
As observed from the code chunks above, the random forest classifier trained on the non-feature-engineered dataset exhibits strong overfitting: the training score is a perfect 1.0 while the test performance is far weaker, demonstrating high variance. This is likely because the full, non-engineered one-hot feature space was used and the trees were left unconstrained (max_depth = None, min_samples_leaf = 1). Thus, we will start by altering the feature set used in this initial run.
In this run, we will again utilize a one-hot encoded dataset, this time built from the cleaned and bucketed dataframe, in an attempt to prevent the overfitting of the previous section. As before, we will use 100 trees in the random forest and set the bootstrap argument to True; this second setting allows each tree to be grown from a random subset of the dataset instead of from its entirety.
We will also perform the analysis without the age_bucket and country_bucket features, which should reduce the feature space complexity for age, since it will stay continuous. For country, however, the feature space will likely remain complex. Nevertheless, let us see the initial results of this first feature engineering change.
X = encoded_df.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'age_bucket_<18', 'age_bucket_20s', 'age_bucket_30s', 'age_bucket_40s', 'age_bucket_50s', 'age_bucket_60s+', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes', 'country_bucket_Europe', 'country_bucket_North America', 'country_bucket_Europe'], axis=1)
y = encoded_df.loc[: ,'supervisor_Yes']
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
criterion = 'entropy',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))
rfc_cv_score = cross_val_score(RF_entropy, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_entropy.predict(X_test)
print(classification_report(y_test, rfc_predict))
Right off the bat, we can see that the alterations to the Random Forest Classifier's hyperparameters and the changes to our dataset resulted in less overfitting. This was expected, as bucketing age into a categorical feature likely contributed substantial variance to the previous model; keeping age continuous avoids that. However, let us see the difference between using the Gini index and using information gain (entropy) as the splitting criterion in the Random Forest Classifier model.
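To make the distinction concrete, the short snippet below (an illustrative aside, not part of the original pipeline) computes both impurity measures for a hypothetical node containing 40 'Yes' and 60 'No' responses; scikit-learn performs the analogous calculation internally at every candidate split when criterion='gini' or criterion='entropy' is specified.
def gini_impurity(class_counts):
    # Gini impurity: 1 minus the sum of squared class proportions at the node
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    return 1.0 - np.sum(p ** 2)

def entropy_impurity(class_counts):
    # Shannon entropy: -sum(p * log2(p)); empty classes contribute nothing
    p = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print('Gini impurity:', round(gini_impurity([40, 60]), 3))   # 0.48
print('Entropy:', round(entropy_impurity([40, 60]), 3))      # 0.971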
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
criterion = 'gini',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))
rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_gini.predict(X_test)
print(classification_report(y_test, rfc_predict))
All metrics seem to improve when changing from the information gain (entropy) criterion to the Gini index criterion for node splitting. However, the model still appears to be overfitting, given the training score of 1.0.
Now, let us plot feature importance, which is the metric of interest.
def plot_feature_importances(model, train = X_train):
    # Keep each importance aligned with its column label (sorting only the values would mismatch the bars)
    fi = model.feature_importances_
    fig = px.bar(x=fi, y=train.columns, orientation='h', title="Feature Importances of Model Features")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.show()
plot_feature_importances(RF_gini_fit, train = X_train)
Based on the initial plot of feature importance, the model appears to heavily favor mental_health_consequence, work_interfere, leave, remote_work, and gender as its most important features.
The following is a repetition of the previous section, but using the label-encoded dataframe instead of the one-hot encoded one.
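For context, the mc_df frame referenced below was label encoded earlier in the notebook; a minimal sketch of how such a frame could be constructed from the cleaned survey dataframe df is shown here (the variable name mc_df_sketch and the use of LabelEncoder are assumptions for illustration only).
from sklearn.preprocessing import LabelEncoder

# Hypothetical reconstruction: map each categorical (object-typed) column to integer codes,
# leaving numeric columns such as Age untouched
mc_df_sketch = df.copy()
for col in mc_df_sketch.select_dtypes(include='object').columns:
    mc_df_sketch[col] = LabelEncoder().fit_transform(mc_df_sketch[col].astype(str))
Unlike one-hot encoding, label encoding keeps one column per survey question, which is why the drop list below uses plain column names such as 'supervisor' rather than dummy columns.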
X = mc_df.drop(['supervisor', 'coworkers', 'mental_health_interview', 'phys_health_interview', 'age_bucket'], axis=1)
y = mc_df.loc[: ,'supervisor']
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
criterion = 'entropy',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))
rfc_predict = RF_entropy.predict(X_test)
print(classification_report(y_test, rfc_predict))
Based on the model run above, with information gain (entropy) as the criterion, the model achieved admirable results, although they were still overall lower than those of the one-hot encoded model without feature engineering. Now, let us look at the feature importance.
plot_feature_importances(RF_entropy_fit, train = X_train)
Interestingly, Age was a strong predictor of the response variable. no_employees was also included in the top 5, in addition to the mental_health_consequence feature, which also appeared in the previous section's run. Similarly, work_interfere was among the top 5 most important features. Unexpectedly, Country also seemed to play a large role in this model's predictions of the response variable.
The following section trains the model on a feature-engineered dataset with age and country buckets.
X = cdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'Age', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = cdrop.loc[: ,'supervisor_Yes']
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
def plot_feature_importances(model, train = X_train):
    # Keep each importance aligned with its column label (sorting only the values would mismatch the bars)
    fi = model.feature_importances_
    fig = px.bar(x=fi, y=train.columns, orientation='h', title="Feature Importances of Model Features")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.show()
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
criterion = 'gini',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))
rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_gini.predict(X_test)
print(classification_report(y_test, rfc_predict))
Based on the Gini index criterion, we can see a slight decrease in the overfitting of our model. However, the metrics decreased across the board relative to the previous one-hot encoded model run without any feature engineering. This was likely due to age_bucket introducing a categorical variable with a complex feature space into the dataset. Since age is highly variable as a continuous measurement, binning ages into buckets reduces the model's ability to absorb its inherent variance.
Now, let us see how the metrics change when using information gain (entropy) as the criterion for node splitting.
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
criterion = 'entropy',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))
rfc_cv_score = cross_val_score(RF_entropy, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_entropy.predict(X_test)
print(classification_report(y_test, rfc_predict))
The results are relatively similar to those obtained using the Gini index, although a slight increase can be observed across most metrics.
plot_feature_importances(RF_entropy_fit, train = X_train)
Based on the feature importance plot above, it seems as if mental_health_consequence is still in our top 5, as are no_employees, leave, and age_bucket.
The following sub-section repeats the analysis on the label encoded dataframe instead of the one-hot encoded dataframe.
X = mc_df.drop(['supervisor', 'coworkers', 'Age', 'mental_health_interview', 'phys_health_interview', 'Country'], axis=1)
y = mc_df.loc[: ,'supervisor']
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
criterion = 'entropy',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))
rfc_predict = RF_entropy.predict(X_test)
print(classification_report(y_test, rfc_predict))
At first glance, this model's performance is not markedly different from the non-feature-engineered results; however, all metrics were observed to drop across the board.
Now, let us see if the Gini Index Criterion results in any differences in performance.
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
criterion = 'gini',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))
rfc_predict = RF_gini.predict(X_test)
print(classification_report(y_test, rfc_predict))
The results obtained using the Gini index criterion were comparable to those of the information gain (entropy) criterion.
plot_feature_importances(RF_entropy_fit, train = X_train)
According to the plot, mental_health_consequence, no_employees, work_interfere, age_bucket, and leave were the five most important predictors. This is promising, as two of our target predictors are present in the top five. The concerning part, however, is the low performance metrics obtained by the model.
Since One-Hot Encoding generally performed the best thus far, let us now try running a Random Forest Classifier with the same hyperparameters mentioned above on the OHE dataset without the age bucket but with the country bucket.
X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = agecdrop.loc[: ,'supervisor_Yes']
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
criterion = 'gini',
max_depth = None,
min_samples_leaf = 1,
bootstrap = True,
warm_start = False,
random_state = 508)
#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)
#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))
rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())
rfc_predict = RF_gini.predict(X_test)
print(classification_report(y_test, rfc_predict))
As depicted in the output above, this model was determined to be the most generalizable with respect to testing score, and will thus be utilized for hyperparameter tuning and subsequent multiclass classification.
ax = plt.gca()
roc_disp = skmet.plot_roc_curve(RF_gini_fit, X_test, y_test, ax = ax, alpha=.8)
ax.plot([0,1], [0,1], linestyle = '--', lw=2, color= 'black', label = 'Chance', alpha = .8)
ax.legend(loc = 'lower right')
plt.show()
The ROC curve above showcases an Area Under the Curve (AUC) value of 0.79, meaning the model has a 79% chance of ranking a randomly chosen positive instance (e.g., 'Yes' in the supervisor category) higher than a randomly chosen negative instance. This is a good sign that the model possesses solid quality and robustness.
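As an illustrative aside (not part of the original analysis), the rank interpretation of AUC can be checked directly from the model's predicted probabilities; the snippet assumes the RF_gini_fit model and the X_test/y_test split defined just above are still in scope.
from sklearn.metrics import roc_auc_score

# Probability assigned to the positive ('Yes') class for each test observation
y_prob = RF_gini_fit.predict_proba(X_test)[:, 1]
print('AUC:', round(roc_auc_score(y_test, y_prob), 3))

# AUC equals the probability that a randomly chosen positive outranks a randomly chosen negative (ties count as 0.5)
pos = y_prob[y_test.values == 1]
neg = y_prob[y_test.values == 0]
rank_prob = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print('Pairwise ranking probability:', round(rank_prob, 3))  # should match the AUC printed above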
Now that we have picked the model of interest and conducted feature engineering, let's hypertune the parameters of the model using GridSearchCV(). As a refresher, the model of interest utilizes the 'gini' criterion with a True bootstrap argument. The engineered dataset does not contain the age_bucket feature but still contains the country bucket.
First, let's tune the n_estimators and min_samples_leaf hyperparameters. As previously stated, these are generally considered to be the most important hyperparameters of the Random Forest Classifier. The n_estimators hyperparameter is the number of trees in the random forest, while min_samples_leaf is the minimum number of samples required at a leaf node of each tree.
agecdrop = df.drop('Country', 1)
agecdrop = agecdrop.drop('age_bucket', 1)
agecdrop = pd.get_dummies(agecdrop)
X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = agecdrop.loc[: ,'supervisor_Yes']
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size = 0.2,
random_state = 508)
param_grid = {
'n_estimators': [1, 20, 50, 75, 100, 200, 300],
'min_samples_leaf': [1, 5, 10, 50, 100],
}
rfc=RandomForestClassifier(random_state=508, bootstrap=True)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)
print("Best Parameter Set: ", CV_rfc.best_params_)
print()
best_acc = CV_rfc.best_score_
print("Best Accuracy: ", best_acc)
print()
best_model = CV_rfc.best_estimator_
Based on the results of the first grid search, the optimal n_estimators hyperparameter was found to be 200, and the optimal min_samples_leaf hyperparameter was found to be 10. The base cross-validation accuracy after this grid-search was found to be 76.37%.
Next, let us tune the max_features hyperparameter. As a reminder, max_features is the maximum number of features considered when looking for the best node split.
param_grid = {
'min_samples_leaf': [10],
'n_estimators': [200],
'max_features':[5, 10, 15, 20],
}
rfc=RandomForestClassifier(random_state=508, bootstrap=True)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)
print("Best Parameter Set: ", CV_rfc.best_params_)
print()
print("Best Accuracy: ", CV_rfc.best_score_)
print()
if best_acc < CV_rfc.best_score_:
print("Better Model?: ", True)
best_acc = CV_rfc.best_score_
best_model = CV_rfc.best_estimator_
else:
print("Better Model?: ", False)
As can be seen in the output above, tuning the max_features hyperparameter did not result in a better cross-validation accuracy score. Thus, we will leave it at its default value in order to avoid adding unnecessary constraints to the model.
As a refresher, max_depth refers to the maximum depth of each decision tree in the random forest, and max_samples refers to the proportion of the original dataset sampled to build each tree when bootstrapping.
param_grid = {
'min_samples_leaf': [10],
'n_estimators': [200],
'max_depth': [5, 10, 15, 20, 50, 100],
'max_samples': [0.2, 0.4, 0.6, 0.8]
}
rfc=RandomForestClassifier(random_state=508, bootstrap=True)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)
print("Best Parameter Set: ", CV_rfc.best_params_)
print()
print("Best Score: ", CV_rfc.best_score_)
print()
if best_acc < CV_rfc.best_score_:
print("Better Model?: ", True)
best_acc = CV_rfc.best_score_
best_model = CV_rfc.best_estimator_
else:
print("Better Model?: ", False)
Tuning the max_depth and max_samples hyperparameters did not improve the Cross-Validation Score. Thus, we will not include them in the final analysis.
This section will generate summary statistics of the tuned, feature engineered model on the 'Yes' class.
best_parameters = best_model.get_params()
print("{:<10}\t\t\t{:<10}".format("HYPERPARAMETER", "VALUE"))
for key, value in best_parameters.items():
if value == None:
val = 'None'
else:
val = value
if key == "min_impurity_decrease" or key == "min_impurity_split" or key=="min_samples_leaf" or key == "min_samples_split":
print("{:<10}\t\t{:<10}".format(key, val))
elif key == "min_weight_fraction_leaf":
print("{:<10}\t{:<10}".format(key, val))
else:
print("{:<10}\t\t\t{:<10}".format(key, val))
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
The following section showcases the performance of the finalized model obtained from Section IV: Initial Model & Feature Engineering on each class of the supervisor response variable from the dataset. As previously mentioned, the classes of the supervisor response variable are as follows: 'Yes', 'Some of them', and 'No'. When looking at this section, keep in mind that feature engineering on the Random Forest Classifier, in addition to hyperparameter tuning, was only performed by training with respect to the 'Yes' label. Furthermore, the base rates of the 'Yes', 'Some of them', and 'No' classes, as found previously in Section III: Exploratory Data Analysis are 41%, 27.8%, and 31.2%, respectively.
From there, this section will go into analysis of the results to address the finalized hypotheses, which are listed again below for clarity:
Null Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will not be the most important when predicting whether employees are willing to discuss mental health issues with supervisors.
Alternative Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will be the most important when predicting whether employees are willing to discuss mental health issues with supervisors.
Target Predictors: obs_consequence, benefits, no_employees, leave, and remote_work.
agecdrop = df.drop('Country', 1)
agecdrop = agecdrop.drop('age_bucket', 1)
agecdrop = pd.get_dummies(agecdrop)
X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y_yes = agecdrop.loc[: ,'supervisor_Yes']
y_some = agecdrop.loc[:, 'supervisor_Some of them']
y_no = agecdrop.loc[:, 'supervisor_No']
RF = RandomForestClassifier(n_estimators = 200,
criterion = 'gini',
#max_depth = 10,
min_samples_leaf = 10,
#max_samples = 0.6,
bootstrap = True,
warm_start = False,
random_state = 508)
print("Summary of the Best Model")
print()
model_parameters = RF.get_params()
print("{:<10}\t\t\t{:<10}".format("HYPERPARAMETER", "VALUE"))
for key, value in model_parameters.items():
if value == None:
val = 'None'
else:
val = value
if key == "min_impurity_decrease" or key == "min_impurity_split" or key=="min_samples_leaf" or key == "min_samples_split":
print("{:<10}\t\t{:<10}".format(key, val))
elif key == "min_weight_fraction_leaf":
print("{:<10}\t{:<10}".format(key, val))
else:
print("{:<10}\t\t\t{:<10}".format(key, val))
X_train_yes, X_test_yes, y_train_yes, y_test_yes = train_test_split(
X,
y_yes,
test_size = 0.2,
random_state = 508)
RF_fit_yes = RF.fit(X_train_yes, y_train_yes)
print('Training Score', RF_fit_yes.score(X_train_yes, y_train_yes).round(7))
print('Testing Score:', RF_fit_yes.score(X_test_yes, y_test_yes).round(7))
print()
yes_predict = RF_fit_yes.predict(X_test_yes)
print(classification_report(y_test_yes, yes_predict))
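# Collapse the one-hot encoded dummy columns back to their original survey questions
# (split on the last underscore) so their importances can be summed per question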
ftr_imp = RF_fit_yes.feature_importances_
ftrs = X_train_yes.columns
ftrs_gen = ftrs.str.rsplit('_', 1).str.get(0)
ftrs_df = pd.DataFrame(
{'Original_Column': ftrs_gen,
#'Bucketed Column': ftrs,
'Feature_Importance': ftr_imp
})
grouped = ftrs_df.groupby(['Original_Column']).sum().reset_index().head(10)
#ftrs_df['Original Column'].value_counts()
grouped.sort_values(by='Feature_Importance', ascending=False).head(5)
grouped.sort_values(by='Feature_Importance', ascending=False).tail(5)
ftrs_df
X_train_some, X_test_some, y_train_some, y_test_some = train_test_split(
X,
y_some,
test_size = 0.2,
random_state = 508)
RF_fit_some = RF.fit(X_train_some, y_train_some)
print('Training Score', RF_fit_some.score(X_train_some, y_train_some).round(7))
print('Testing Score:', RF_fit_some.score(X_test_some, y_test_some).round(7))
print()
some_predict = RF_fit_some.predict(X_test_some)
print(classification_report(y_test_some, some_predict))
X_train_no, X_test_no, y_train_no, y_test_no = train_test_split(
X,
y_no,
test_size = 0.2,
random_state = 508)
RF_fit_no = RF.fit(X_train_no, y_train_no)
print('Training Score', RF_fit_no.score(X_train_no, y_train_no).round(7))
print('Testing Score:', RF_fit_no.score(X_test_no, y_test_no).round(7))
print()
no_predict = RF_fit_no.predict(X_test_no)
print(classification_report(y_test_no, no_predict))
ax = plt.gca()
ax.plot([0,1], [0,1], linestyle = '--', lw=2, color= 'black', label = 'Chance', alpha = .8)
ax.legend(loc = 'lower right')
roc_disp_yes = skmet.plot_roc_curve(RF_fit_yes, X_test_yes, y_test_yes, ax = ax, alpha=.8, name = 'RFC for "Yes" Response')
roc_disp_some = skmet.plot_roc_curve(RF_fit_some, X_test_some, y_test_some, ax = ax, alpha=.8, name = 'RFC for "Some of them" Response')
roc_disp_no = skmet.plot_roc_curve(RF_fit_no, X_test_no, y_test_no, ax = ax, alpha=.8, name = 'RFC for "No" Response')
plt.title("ROC curves of Model on the Different Supervisor Labels")
plt.show()
To verify that all models were correctly trained on the same dataset, we will plot the feature importance, as determined by the mean decrease in Gini index of the Random Forest Classifier fit, for each class's model.
def plot_feature_importances(model, train = X_train_yes):
    # Plot each feature's importance against its column label, with bars ordered by importance
    fi = model.feature_importances_
    fig = px.bar(x=fi, y=train.columns, orientation='h', title="Feature Importances of Model Features")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.update_yaxes(categoryorder="total ascending")
    fig.show()
plot_feature_importances(RF_fit_yes, train = X_train_yes)
plot_feature_importances(RF_fit_some, train = X_train_some)
plot_feature_importances(RF_fit_no, train = X_train_no)
yes_matrix = confusion_matrix(y_test_yes, yes_predict, labels=[1, 0])  # positive class first, to match the ["Yes", "Some/No"] tick labels
yes_labels = ["Yes", "Some/No"]
sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(yes_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20},
linecolor="black", linewidth=0.1, xticklabels=yes_labels, yticklabels=yes_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})
plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')
plt.ylabel('Actual', fontsize=18)
plt.xlabel('Predicted', fontsize=18)
ax.set_title("Confusion Matrix for 'Yes' Class", fontsize=24)
fig.tight_layout()
plt.show()
some_matrix = confusion_matrix(y_test_some, some_predict, labels=[1, 0])  # positive ('Some of them') class first
some_labels = ["Some", "Yes/No"]
sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(some_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20},
linecolor="black", linewidth=0.1, xticklabels=some_labels, yticklabels=some_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})
plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')
plt.ylabel('Actual', fontsize=18)
plt.xlabel('Predicted', fontsize=18)
ax.set_title("Confusion Matrix for 'Some of them' Class", fontsize=24)
fig.tight_layout()
plt.show()
no_matrix = confusion_matrix(y_test_no, no_predict, labels=[1, 0])  # positive ('No') class first
no_labels = ["No", "Yes/Some"]
sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(no_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20},
linecolor="black", linewidth=0.1, xticklabels=no_labels, yticklabels=no_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})
plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')
plt.ylabel('Actual', fontsize=18)
plt.xlabel('Predicted', fontsize=18)
ax.set_title("Confusion Matrix for 'No' Class", fontsize=24)
fig.tight_layout()
plt.show()
def getSummaryStatistics(confusion_matrix):
    # With labels=[1, 0] passed to confusion_matrix above, the positive class occupies row/column 0:
    # [0][0] = TP, [0][1] = FN, [1][0] = FP, [1][1] = TN
    TP = float(confusion_matrix[0][0])
    TN = float(confusion_matrix[1][1])
    FN = float(confusion_matrix[0][1])
    FP = float(confusion_matrix[1][0])
acc = round(((TP+TN) / (TP+TN+FP+FN)), 2)
pre = round(((TP) / (TP+FP)), 2)
rec = round(((TP) / (TP+FN)), 2)
if TN+FP == 0:
spe="---"
else:
spe = round(1 - ((FP) / (TN+FP)), 2)
f1 = round((((2)*(rec)*(pre)) / (rec+pre)), 2)
return [acc, pre, rec, spe, f1]
yes_summary = getSummaryStatistics(yes_matrix)
some_summary = getSummaryStatistics(some_matrix)
no_summary = getSummaryStatistics(no_matrix)
cellColor='lightskyblue'
headerColor='dodgerblue'
theTable = plt.table(
cellText=[
yes_summary,
some_summary,
no_summary
],
cellColours=[
[cellColor, cellColor, cellColor, cellColor, cellColor],
[cellColor, cellColor, cellColor, cellColor, cellColor],
[cellColor, cellColor, cellColor, cellColor, cellColor]
],
cellLoc='center',
rowLabels=['Yes', 'Some', 'No'],
rowColours=[headerColor, headerColor, headerColor],
rowLoc='center',
colLabels=['ACCURACY', 'PRECISION', 'RECALL', 'SPECIFICITY', 'F1-Score'],
colColours=[headerColor, headerColor, headerColor, headerColor, headerColor],
colLoc='center',
loc='center'
)
theTable.auto_set_font_size(False)
theTable.set_fontsize(16)
theTable.scale(2, 2)
ax=plt.gca()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.box(on=None)
plt.show()
Based on mean decrease in the Gini index, we cannot reject the null hypothesis. Of the 5 categories we predicted to be the most important in terms of feature importance to our model, only leave was present. The actual 5 most important features were mental_health_consequence, mental_vs_physical, leave, age, and anonymity. benefits was in fact found to be the least important feature in predicting the response variable, and no_employees, obs_consequence, and remote_work were nowhere near the top five most important features of our model. As a result, we fail to reject the null hypothesis.
However, the top five features still make intuitive sense. First, mental_vs_physical measures how much an employee thinks that their employer takes mental health as seriously as physical health. Intuitively, it is clear why this would be one of the top predictors of feeling comfortable talking to a supervisor about mental health issues: if an employee feels that their employer takes mental health as seriously as physical health, there is less stigma in that workplace, and they are more likely to reach out to their supervisor with a potential mental health concern. Furthermore, mental_health_consequence measures an employee's perception that discussing a mental health issue with their employer would result in a negative consequence. This also makes intuitive sense, as it inherently tracks the substantial stigma associated with having a mental illness today; it is not surprising that this would be a top predictor of reaching out to a supervisor as well. Ultimately, this reasoning was why we chose obs_consequence as a feature of interest, but mental_health_consequence likely performed better because it measures what an employee thinks will happen to themselves if they reach out, not what they observed happening to their coworkers (i.e., what obs_consequence reflects).
The next strongest predictor was anonymity, which tracked whether employees perceived their anonymity to be protected if they or a coworker took advantage of mental health or substance abuse treatment. Similar to the previous two explanations, if their identity is protected, an employee is more likely to be comfortable reaching out to others, including a supervisor, with mental health concerns. age was an unexpected member of the top 5, but there are two possible explanations: 1) the age distribution was unimodal and heavily skewed toward employees in their 20s and 30s, and 2) stigma may be more debilitating to older members of the workforce, who are accustomed to mental health not being discussed at all. Finally, we have the leave feature, which is the only one we predicted to make the top 5 most important features list.
In terms of model performance, the best model performed admirably compared to base rate percentages. The model obtained a 'Yes' accuracy of 78.00%, a 'Some of them' accuracy of 68.00%, and a 'No' accuracy of 79.00%. In contrast, the base rates of the three classes of the response variable were 41.00%, 27.80%, and 31.20%, respectively. As such, the model we utilized served as a solid predictor of whether an employee was willing to reach out to their direct supervisor(s) regarding a mental health issue.
Primarily, the model suffered from a lack of generalizability stemming from several possible sources. First, there were very low sample sizes from countries in the "Other" bucket, or in other words, countries that were not the United States of America, the United Kingdom, or Canada. Similarly, the healthcare structures of the employees' countries of residence could not be identified; this limitation is further exacerbated by healthcare in the United States being highly variable across the population. Furthermore, the dataset was heavily imbalanced with respect to gender, likely because it was collected primarily from technology-industry companies, which are male-dominated. This is another inherent limitation, as other industries were not intended to be surveyed, and responses from them were only included as a byproduct of unintended survey distribution. Finally, the lack of generalizability primarily stems from the small size of the dataset (1,259 survey responses across several different countries), which is far from representative of the entire global tech workforce. Thus, the strongest predictors are unlikely to remain constant if the analysis were conducted with a more representative dataset.
Additionally, the inherently categorical nature of the survey itself limited the robustness of the model. Since continuous variables have greater statistical power, less variance for a given sample size, and higher sensitivity than categorical variables, the fact that most features in the dataset were categorical limits the model accuracy achievable in this analysis. Binning the country feature made the model more interpretable, but the binning was not especially inclusive (e.g., only 'North America', 'Europe', and 'Other' buckets). Finally, the model suffered from considerable class imbalance in the dataset, as it tended to perform better for the majority group with respect to respondent country, as well as for the 'Yes' class of the response variable. This was likely because the feature engineering and hyperparameter tuning of the model were performed only through analysis of model performance on the 'Yes' class of the response variable.
One future avenue of research is exploring options to address the class imbalance of the response variable. One way to achieve this is to downsample 'Yes' instances (41% base rate) to match 'Some of them' or 'No' instances more closely (27.8% and 31.2%, respectively). Additionally, one could use the class_weight parameter of the Random Forest Classifier to account for the imbalance. Yet another approach is to relevel the response variable by grouping 'Some of them' instances under either 'Yes' or 'No', or to try both variations and observe the differences between the strategies.
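As an illustration of the first two suggestions, the sketch below (not part of the original analysis; it assumes the agecdrop and X objects defined earlier are still in scope, and the names rf_weighted, X_bal, and y_bal are hypothetical) shows how class weighting and simple downsampling could be wired up.
# Option 1: let the classifier reweight classes inversely to their frequencies
rf_weighted = RandomForestClassifier(n_estimators=200, min_samples_leaf=10,
                                     class_weight='balanced', bootstrap=True,
                                     random_state=508)

# Option 2: downsample the 'Yes' class (41% base rate) toward the size of the 'No' class (31.2%)
yes_rows = agecdrop[agecdrop['supervisor_Yes'] == 1]
other_rows = agecdrop[agecdrop['supervisor_Yes'] == 0]
n_target = int((agecdrop['supervisor_No'] == 1).sum())
balanced = pd.concat([yes_rows.sample(n=n_target, random_state=508), other_rows])
X_bal = balanced[X.columns]
y_bal = balanced['supervisor_Yes']
# Either rf_weighted or (X_bal, y_bal) could then be run through the same split/fit/evaluate steps used above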
Another avenue of future research includes exploring specific health policies in the workplace and their effect on the likelihood of whether an employee discusses a mental health concern with their direct supervisor(s). Some variables of interest here may include maternity/paternity leave, subsidized child care, and even counseling. Similarly, specific types of mental health and healthcare coverage should be analyzed, in addition to a whole multitude of specific features if the dataset was expanded. Moreover, healthcare systems should specifically be explored to see their effect on the response variable; these include privatized insurance, universal healthcare, and the availability of a public option.
Next, a time series analysis can be conducted, as the Open Sourcing Mental Illness organization has conducted this survey multiple times since 2014. In this time series analysis, one can observe how predictors of the response variable change over time with respect to feature importance. More specifically, an analysis with respect to COVID-19 can be conducted, and the transition to forced remote work due to the pandemic can be analyzed with respect to coping with a mental illness. Finally, feature engineering and hypertuning should be performed with respect to 'Some' and 'No' classes in addition to the 'Yes' class, and different models utilizing multi-class classification should be probed to test the same hypothesis.
A 2014 dataset consisting of employee responses to a survey conducted by the Open Sourcing Mental Illness organization was cleaned and pre-processed; the survey intended to measure attitudes towards mental health and frequency of mental health disorders in the tech workplace. After learning about the features in the dataset and conducting an extensive literature review, a null hypothesis was created: Using Random Forest and feature importance from scikit-learn (i.e., mean decrease in Gini Index), 5 target predictors, in no particular order, will not be the most important when predicting whether employees are willing to discuss mental health issues with supervisors. The 5 target predictors of interest were determined to be obs_consequence, benefits, no_employees, leave, and remote_work. After conducting exploratory data analysis, in which feature distributions with respect to the response variable (supervisor) were analyzed and highly correlated features were removed, feature engineering was performed. Feature engineering not only allowed for the elucidation of the tradeoff between categorical and continuous features in the dataset, but also led the Random Forest Classifier model to be hypertuned. The best model was then tested on each class of the response variable ('Yes', 'Some of them', and 'No'). In the end, only one of the five features predicted to be most important with respect to Mean Decrease in Gini Index was in the actual top five most important features: leave. Thus, although the model resulted in substantially higher class prediction accuracies during testing relative to the base rates, the null hypothesis was not able to be rejected.