Mental Health in the Workplace - Week One Project Code

Course: DS 4002 - Data Science Project Course

Authors: Navya Annapareddy, Tony Albini, Kunaal Sarnaik, & Jaya Vellayan

Professor: Brian Wright

Date: January 8th, 2021

TABLE OF CONTENTS


I. Introduction

The following section serves as a general introduction to the topic that will be discussed throughout this research analysis — mental health in the workplace. See the sub-sections embedded below for futher information: Motivation, Background, General Research Question, Relevant Research, & Initial Hypotheses.

a) Motivation

According to the National Institute of Mental Health, approximately one in five adults live with a mental illness in the United States; this drastic proportion corresponded to a quantity of nearly 51.1 million citizens in 2019.$^{1}$ Although the proportion of adults who experience a mental illness on the global scale is smaller (1 in 17, according to the World Health Organization in 2017), it still roughly equals an astounding magnitude of approximately half a million people.$^{2}$ Despite these monstrous statistics related to mental illness incidence and prevalence, previous studies have shown that only a small percentage of adults who suffer from a mental illness actually receive treatment for it.$^{3}$ For example, the National Institute of Mental Health found that a mere 43.8% of United States adults who were diagnosed with a mental illness actually treatment for it.$^{3}$

Given that incidences of mental illness not only have skyrocketed since the turn of the century, but also are expected to continue increasing at an alarming rate, strategies for addressing, coping with, and treating mental illnesses on the individual level must be developed urgently.$^{4}$ However, when it comes to addressing mental illnesses, one devastating obstacle looms large both individually and societally — the stigma against them. According to the American Psychiatric Association (APA), three types of stigmas govern the mental illness conversation$^{5}$:

1. Public Stigma: negative or discriminatory attitudes that others have about mental illness.

2. Self-stigma: negative attitudes, including internalized shame, that people with mental illness have about their own condition.

3. Institutional Stigma: more systemic, involving policies of government and private organizations that intentionally or unintentionally limit opportunities for people with mental illness.

A unique location where all three of these stigmas are practiced extensively is the workplace, where the APA has found that more than one in three workers are concerned about being fired if they sought mental health care.$^{6}$ The combination of these challenges makes an already difficult act of talking about mental illness much more arduous for workers, who comprise a large proportion of mental illness cases on the global scale.$^{7,8}$ Moreover, such expression regarding one's experience with a mental illness is largely considered to be among first steps towards treatment, and the lack of doing so may only exacerbate the symptoms of a mental illness on the individual.$^{7}$ Furthermore, in the workplace, poor mental health and stress can be detrimental for an employee's job performance, work engagement, communication with coworkers, physical capability, and ability to carry out day-to-day functions.$^{9}$

In a study conducted by the Center for Disease Control, only 57% of employees who reported suffering from moderate depression, in addition to a mere 40% of those who reported suffering from severe depression, received treatment to control their symptoms.$^{9}$ These appalling statistics, combined with the trend that mental health is only becoming harder to manage due to the current COVID-19 pandemic, suggest that finding strategies to increase the likelihood of employees' expressing their experience with mental illnesses to others in order to receive treatment (a mere 55% in 2019, according to a 2021 study) is an urgent priority for not only the nation, but also the world.$^{10}$

The following research project deploys data science and machine learning principles to investigate which factors present in the workplace are the strongest determinants of whether or not an employee confides to their direct supervisor(s).

b) Background

The Mayo Clinic defines mental illnesses, which are also known as mental health disorders, as those that affect one's mood, thinking, and behavior.$^{11}$ Examples of these types of psychological afflictions include depression, anxiety, schizophrenia, bipolar, eating disorders, and addictive behaviors, yet the term mental illness is often used to refer to any among a wide range of mental health conditions. Common symptoms of a mental illness can vary and depend on the disorder; examples, however, include dysphoric mood, avolition, withdrawal from activities, suicidal thinking, and substance abuse.

Although the outcomes of one suffering from a mental illness can be just as debilitating as those of one suffering from any other physical illness such as cancer, stigmatization of those who suffer from them forms a pervasive barrier that prevents many of those suffering from such illnesses to get better.$^{12}$ Since contact with persons with mental illnesses is currently the most well-supported way to reduce stigma, identifying methods to increase such contact in the workplace — an environment that is almost universal to all who share this planet — is an urgent priority. However, for such contact to take place, modern workplace culture must be analyzed.

According to a 2019 survey conducted by Mind Share Partners' Mental Health at Work, 37% of workers reported that their work environment contributed to their mental health symptoms.$^{13}$ Due to a combination of cultural stigma, a lack of understanding of symptoms, increased pressure, and a lack of resources, the current state of the workplace can not only cause mental illnesses, but also further exacerbate existing ones.$^{14}$ Therefore, to gain an understanding of why an insufficient amount of contact with those suffering from mental health disorders occurs in today's workplace, an in-depth analysis of the factors contributing to an employee's willingness to discuss mental health problems with their direct supervisor may be enlightening. After all, the workplace, whether it be online or in-person, is comprising an increasing amount of any person's time, and finding ways to decrease mental health stigma can lead to invaluable returns with respect to not only addressing the illness, but also helping the sufferer of the illness access resources that will alleviate its symptoms.

c) General Research Question

An in-depth analysis utilizing data science and machine learning principles will be conducted in an attempt to address the following research question:

What are the strongest predictors dictating whether or not an employee will discuss mental health concerns with their direct supervisor(s)?

Using this question as a guide, hypotheses will be developed and an extensive and comprehensive analysis that leverages state-of-the-art machine learning techniques will be conducted. A dataset containing responses to a survey administered to employees in the technological industry — one that is largely predicted to sustain the most growth in the near future — will be analyzed (See Dataset Overview & Cleaning section for more information).$^{15}$ The data will be cleaned, exploratory data analysis will be conducted, and then machine learning techniques will be applied in order to determine the strongest predictors of the discrete response variable.

d) Relevant Research

Several studies related to predicting outcomes and behavior related to mental health in the workplace have been previously conducted. For example, a team from the Keio University School of Medicine in Japan, Sado et al., analyzed predictors of repeated sick leave in the workplace as a result of mental disorders.$^{16}$ The Japanese researchers utilized Return To Work (RTW) and repeated sick leave rtes among 194 subjects employed at a manufacturing company. The independent variables of interest included sex, age at the time of employment, job tenure, diagnosis, number of previous sick leave days, duration of most recent sick leave, and employee rank. After conducting a univariate analysis using log-rank test and multivariate analysis using the Cox proportional hazard model, Sado et al. found that the employees' ages and number of previous sick-leave episodes were a significant predictor of repeated sick leave due to a mental ilness.

Moreover, the Center for Disease Control (CDC) annually conducts a "Mental Health in the Workplace" study, which is included as a sub-section in their Workplace Health Guide.$^{9}$ This study tracks mental illness solutions, frameworks to raise mental health awareness, and useful strategies to tackle the problem. Historically, the CDC has pointed to three key indicators and risk factors of mental illness: stigma, lack of health care, and lack of social connections. Furthermore, they have repeatedly suggested that employers should offer several mental health benefits (e.g., health insurance with no or low out-of-pocket costs, free clinical screenings, and informational brochures), be flexible with sick leave specifically with respect to mental illnesses, and provide managers with training to help recognize signs and symptoms of common mental health disorders.

According to studies consolidated by Harvard Health, the most debilitating characteristic of mental health disorders is that most go unrecognized and untreated due to the stigma attached to them.$^{17}$ Due to such stigma, employees in the workplace feel extremely reluctant to seek treatment out of feat that they might jeapardize their jobs, further exacerbating their already limited ability to communicate their feelings to co-workers. Similarly, the transition towards remote working, whether it be due to the rise of the technological industry or the coercion of the COVID-19 pandemic, may exacerbate feelings of isolation and loneliness, according to the Center for Workplace Mental Health.$^{18}$ Moreover, the remote transitions can have a debilitating affect on the amount of people in a person's social circle, which can also further exacerbate their ability to reach out for help.

In summary, the combination of previous research related to the General Research Question points to several common themes that may affect the likelihood of an employee to reach out to their direct supervisor to discuss a mental health issue: the community, stigmatization, and resources present at the workplace itself.

e) Initial Hypotheses

Null Hypothesis: Using Random Forest and permutation importance from Python's scikit-learn library, the 4 target predictors are equal to or less than 50% of total feature importance (which is normalized to 1) when predicting whether employees are willing to discuss mental health issues with supervisors.

Alternative Hypothesis: Using Random Forest and permutation importance from Python's scikit-learn library, the 4 target predictors are greater than 50% of total feature importance (which is normalized to 1) when predicting whether employees are willing to discuss mental health issues with supervisors.

Target Predictors:

NOTE: The target predictors are categorical variables found in the dataset as explained below (See Dataset Overview & Cleaning section for more information).

  1. benefits: Does your employer provide mental health benefits?

Justification: In the Center for Disease Control (CDC) Workplace Guide, there are several suggestions related to mental health-specific employment benefits.$^{9}$ If benefits are available and provided free-of-charge as a part of an employee's occupational agreement, the employee should theoretically be more likely to reach out to their supervisor in an attempt to possibly capitalize on them if they are suffering from a mental health disorder. Furthermore, the study conducted by Japanese researchers Sado et al. found that repeated sick leave episodes related to a mental disorder were a statistically significant predictor of future sick leave episodes related to the mental illness.$^{16}$ On the other end, if employees are not offered as many occupational benefits specifically related to maintaining their mental health, it is reasonable to assume that they would be less likely to discuss a mental health issue with their direct supervisor(s).

  1. no_employees: How many employees does your company/organization have?

Justification: Several sources outlined in the previous section pointed to social circles as a key indicator not only of one's mental health wellness, but also of the likelihood that one's mental health disorder improves. For example, the analysis conducted by the Center for Workplace Mental Health reported that staying connected with co-workers and friends is paramount to staying mentally well during remote work.$^{18}$ Furthermore, the CDC Workplace Guide reports that the workplace is an optimal setting to create a culture of health because communication structures are already in place and social support networks are available.$^{9}$ Thus, the more co-workers an employee has working with them, the more likely they are to have others to confide in, be encouraged by, and make them feel supported. The result of such an atmosphere should theoretically lead to them being more likely to discuss any mental health issues with a direct supervisor. Furthermore, the lack of employees at a person's place of work should logically have the opposite effect.

  1. obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

Justification: This predictor is likely a measure of the amount of mental health stigmatization present in a person's place of work. According to Harvard Health, the stigma attached to having a psychiatric disorder is such that employees may be reluctant to seek treatment out of fear that they might jeopardize their jobs.$^{17}$ Thus, if an employee has heard of or observed negative consequences for a coworker with mental health issues in their workplace, there likely exists a strong amount of mental health stigmatization present in the workplace environment. By extension, the employee should theoretically be less likely to reach out to their direct supervisor regarding a mental health issue. On the other hand, if no such consequence has been observed or known, the employee likely works at a place that does not stigmatize health as much, which may result in a stronger likelihood that the person reaches out to their direct supervisor(s).

  1. remote_work: Do you work remotely (outside of an office) at least 50% of the time?

Justification: Working remotely may substantially limit the amount of flexibility an employee possess when it comes to their social circle. A significant caveat to the workplace being considered an emotional and social support system, as defined by the Center for Disease Control, is remote work.$^{9}$ Due to the lack of interpersonal interaction, as well as the lack of emotional readability over a laptop screen, an employee may feel much more alone if they work remotely as compared to working in-person. As such, they should theoretically be less likely to discuss a mental health issue with their supervisor, and vice-versa if they are working in-person.


II. Dataset Overview & Cleaning

The following section describes the dataset being used to address the general research question listed above, as well as details how the dataset was pre-processed for analysis.

a) Dataset Overview

The dataset being utilized to answer the general question and test the hypotheses listed above is from a 2014 survey conducted by the Open Sourcing Mental Illness organization.$^{19}$ This organization aims to change the way mental health is discussed in the tech community, addressing stigmatization and providing necessary resources to those suffering from mental health disorders. In the survey, which was conducted in 2014, several responses are gathered in addition to the demographics of participants in a result to measure attitudes towards mental health and frequency of mental health disorders in the tech workplace. The categories of the dataset, with the question that corresponds to each, are listed below:

Dataset Link$^{20}$: https://www.kaggle.com/osmi/mental-health-in-tech-survey

  • Timestamp
  • Age
  • Gender
  • Country
  • State (if United States)
  • self_employed: Are you self-employed?
  • family_history: Do you have a family history of mental illness?
  • treatment: Have you sought treatment for a mental health condition?
  • work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
  • no_employees: How many employees does your company or organization have?
  • remote_work: Do you work remotely (outside of an office) at least 50% of the time?
  • tech_company: Is your employer primarily a tech company/organization?
  • benefits: Does your employer provide mental health benefits?
  • care_options: Do you know the options for mental health care your employer provides?
  • wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
  • seek_help: Does employer provide resources to learn more about mental health issues and how to seek help?
  • anonymity: Is anonymity protected if employee takes advantage of mental health or substance abuse treatment?
  • leave: How easy is it for you to take medical leave for a mental health condition?
  • mentalhealthconsequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
  • physhealthconsequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
  • coworkers: Would you be willing to discuss a mental health issue with your coworkers?
  • supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
  • mentalhealthinterview: Would you bring up a mental health issue with a potential employer in an interview?
  • physhealthinterview: Would you bring up a physical health issue with a potential employer in an interview?
  • mentalvsphysical: Do you feel that your employer takes mental health as seriously as physical health?
  • obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
  • comments: Any additional notes or comments

As a refresher, supervisor is the response variable in the study, and benefits, obs_consequence, no_employees, and remote_work are the target predictors that are being tested in the initial hypotheses listed above.

b) Dataset Cleaning & Pre-Processing

The following sub-section includes a step-by-step overview of how the dataset was loaded into the Google Colab environment, cleaned, and prepared for exploratory data analysis.

i) Importing Python Libraries

For the analysis to be conducted in Python, several packages, libraries, and modules must be utilized. They are listed below:

  • pandas - Standard data analysis library
  • NumPy - Computational Library for Fast Computation and Array Manipulation
  • os - Modele that provides functions for interacting with the operating system
  • itertools - Functions creating iterators for efficient looping through files, data, and programs
  • sys - Module for system-specific parameters and functions
  • requests - Allows HTTP requests to be made in Python
  • plotly - Data Visualization Library
  • matplotlib - Standard Python Plotting
  • seaborn - Statistical Data Visualization Library
  • scikit-learn - Machine Learning Library (RandomForestClassifier, train_test_split, Encoding, etc.)
In [2]:
import pandas as pd
import numpy as np
from os import listdir, walk
from os.path import isfile, join
import itertools
import sys,requests
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.offline import iplot, init_notebook_mode
import matplotlib.pyplot as plt
import seaborn as sns

#USE IF CONVERTING TO HTML
import plotly.offline as py
py.init_notebook_mode(connected=False)

from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
import sklearn.metrics as skmet
pd.set_option('mode.chained_assignment', None) 

ii) Mounting Drive & Importing Dataset

The following sub-section showcases how the Python system modules were utilized to mount Google Drive and point Colab's directory to the one which contains the dataset being used to address the general research question.

The code chunks below open the directory file the dataset (.csv) is stored in.

In [3]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [4]:
!ls "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data"
survey.csv
In [5]:
folder_path = "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data"
file_path = "/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data/survey.csv"
In [6]:
def getAllFilesInDirectory(directoryPath: str):
    return [(directoryPath + "/" + f) for f in listdir(directoryPath) if isfile(join(directoryPath, f))]
getAllFilesInDirectory(folder_path)
Out[6]:
['/content/drive/MyDrive/DS_4002_JTerm2021/Week One/Code/data/survey.csv']

iii) Looking at the Raw Dataset

The following code chunk uses pandas to read in the dataset (.csv) and preview it. Here, the Timestamp, state, and comments categories are also dropped, the reasoning of which is explained below.

In [7]:
df = pd.read_csv(file_path)
df = df.drop('Timestamp', 1)
df = df.drop('state', 1)
df = df.drop('comments', 1)
df.head(5)
Out[7]:
Age Gender Country self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence
0 37 Female United States NaN No Yes Often 6-25 No Yes Yes Not sure No Yes Yes Somewhat easy No No Some of them Yes No Maybe Yes No
1 44 M United States NaN No No Rarely More than 1000 No No Don't know No Don't know Don't know Don't know Don't know Maybe No No No No No Don't know No
2 32 Male Canada NaN No No Rarely 6-25 No Yes No No No No Don't know Somewhat difficult No No Yes Yes Yes Yes No No
3 31 Male United Kingdom NaN Yes Yes Often 26-100 No Yes No Yes No No No Somewhat difficult Yes Yes Some of them No Maybe Maybe No Yes
4 31 Male United States NaN No No Never 100-500 Yes Yes Yes No Don't know Don't know Don't know Don't know No No Some of them Yes Yes Yes Don't know No

iv) Preprocessing

Several pre-processing steps were conducted to the dataset previewed above.

First, as mentioned previously, the timestamp, state, and comments categories were dropped. This was done because we assumed that the time of survey response did not constitute a major effect on the response variable, state only had an instance if the country was the United States (otherwise empty instance), and the comments category was filled with nonsensical instances which would not have added information to the subsequent machine learning model.

The following two code chunks extensive describe how the gender, wellness_program, anonymity, seek_help, leave, mental_vs_physical, benefits, care_options, work_interfere, self_employed, country, and age columns were cleaned int the dataset.

In [8]:
df.Gender.value_counts()
Out[8]:
Male                                              615
male                                              206
Female                                            121
M                                                 116
female                                             62
F                                                  38
m                                                  34
f                                                  15
Make                                                4
Woman                                               3
Male                                                3
Female (trans)                                      2
Cis Male                                            2
Man                                                 2
Female                                              2
non-binary                                          1
ostensibly male, unsure what that really means      1
maile                                               1
msle                                                1
Trans woman                                         1
Agender                                             1
woman                                               1
Nah                                                 1
A little about you                                  1
Genderqueer                                         1
fluid                                               1
Neuter                                              1
Guy (-ish) ^_^                                      1
Trans-female                                        1
Cis Female                                          1
something kinda male?                               1
male leaning androgynous                            1
Malr                                                1
p                                                   1
Female (cis)                                        1
Femake                                              1
Mail                                                1
Enby                                                1
cis-female/femme                                    1
Androgyne                                           1
femail                                              1
Mal                                                 1
queer                                               1
Cis Man                                             1
Male-ish                                            1
All                                                 1
queer/she/they                                      1
cis male                                            1
Male (CIS)                                          1
Name: Gender, dtype: int64

The output above shows the various instances of Gender present in the original dataset. As shown in the below code chunk, they were further categorized into 'male', 'female', and 'other' instances.

In [9]:
df["Gender"] = df["Gender"].str.lower()
df.Gender[df.Gender == 'm'] = "male"
df.Gender[df.Gender == 'f'] = "female"
df.Gender[(df.Gender != 'female') & (df.Gender != 'male')] = "other"
df.wellness_program[df.wellness_program == 'Don\'t know'] = "No"
df.anonymity[df.anonymity == 'Don\'t know'] = "No"
df.seek_help[df.seek_help == 'Don\'t know'] = "No"
df.leave[df.leave == 'Don\'t know'] = "Somewhat difficult"
df.mental_vs_physical[df.mental_vs_physical == 'Don\'t know'] = "No"
df.benefits[df.benefits == 'Don\'t know'] = "No"
df.care_options[df.care_options == 'Not sure'] = "No"
df.work_interfere = df.work_interfere.fillna("No Response")
df.self_employed = df.self_employed.fillna("No")

The above code chunk also showcases how mispelled answers were reverted back to standardized answers of "Yes" or "No". Furthermore, some of the questions in the survey contained "Don't know" or "Not Sure" as possible responses. These were releveled to "No", with the reason being that if employees do not know about a certain characteristic dealing with mental health in their workplace, their workplace likely either does not have it or does not promote it enough to their employees. As a result, it was assumed that both instances should map to the "No" answer.

The following code chunk categorizes the countries present in the dataset into 'North America', 'Europe', and 'Other'. This was done due to the need to encode the categories later on. Since over 48 countries are present in this dataset, one hot encoding would add 47 categories to the dataset, and label encoding would label them from 0-47. Ultimately, this would not be good for the performance of our multiclass classifier.

In [10]:
df["country_bucket"] = df["Country"]
df.country_bucket[df.country_bucket == 'United States'] = "North America"
df.country_bucket[df.country_bucket == 'Canada'] = "North America"
df.country_bucket[df.country_bucket == 'United Kingdom'] = "Europe"
df.country_bucket[df.country_bucket == 'Germany'] = "Europe"
df.country_bucket[df.country_bucket == 'Ireland'] = "Europe"
df.country_bucket[df.country_bucket == 'France'] = "Europe"
df.country_bucket[df.country_bucket == 'Netherlands'] = "Europe"
df.country_bucket[df.country_bucket == 'Bulgaria'] = "Europe"
df.country_bucket[df.country_bucket == 'Switzerland'] = "Europe"
df.country_bucket[df.country_bucket == 'Italy'] = "Europe"
df.country_bucket[df.country_bucket == 'Poland'] = "Europe"
df.country_bucket[df.country_bucket == 'Sweden'] = "Europe"
df.country_bucket[df.country_bucket == 'Belgium'] = "Europe"
df.country_bucket[df.country_bucket == 'Austria'] = "Europe"
df.country_bucket[df.country_bucket == 'Finland'] = "Europe"
df.country_bucket[df.country_bucket == 'Russia'] = "Europe"
df.country_bucket[df.country_bucket == 'Greece'] = "Europe"
df.country_bucket[df.country_bucket == 'Croatia'] = "Europe"
df.country_bucket[df.country_bucket == 'Spain'] = "Europe"
df.country_bucket[df.country_bucket == 'Czech Republic'] = "Europe"
df.country_bucket[df.country_bucket == 'Romania'] = "Europe"
df.country_bucket[df.country_bucket == 'Moldova'] = "Europe"
df.country_bucket[df.country_bucket == 'Denmark'] = "Europe"
df.country_bucket[df.country_bucket == 'Norway'] = "Europe"
df.country_bucket[df.country_bucket == 'Georgia'] = "Europe"
df.country_bucket[df.country_bucket == 'Slovenia'] = "Europe"
df.country_bucket[df.country_bucket == 'Hungary'] = "Europe"
df.country_bucket[df.country_bucket == 'Portugal'] = "Europe"
df.country_bucket[df.country_bucket == 'Bosnia and Herzegovina'] = "Europe"
df.country_bucket[df.country_bucket == 'Latvia'] = "Europe"
df.country_bucket[df.country_bucket == 'India'] = "Other"
df.country_bucket[df.country_bucket == 'South Africa'] = "Other"
df.country_bucket[df.country_bucket == 'Australia'] = "Other"
df.country_bucket[df.country_bucket == 'New Zealand'] = "Other"
df.country_bucket[df.country_bucket == 'Philippines'] = "Other"
df.country_bucket[df.country_bucket == 'Costa Rica '] = "Other"
df.country_bucket[df.country_bucket == 'Australia'] = "Other"
df.country_bucket[df.country_bucket == 'Zimbabwe'] = "Other"
df.country_bucket[df.country_bucket == 'Colombia'] = "Other"
df.country_bucket[df.country_bucket == 'Singapore'] = "Other"
df.country_bucket[df.country_bucket == 'Israel'] = "Other"
df.country_bucket[df.country_bucket == 'Brazil'] = "Other"
df.country_bucket[df.country_bucket == 'China'] = "Other"
df.country_bucket[df.country_bucket == 'Japan'] = "Other"
df.country_bucket[df.country_bucket == 'Nigeria'] = "Other"
df.country_bucket[df.country_bucket == 'Thailand'] = "Other"
df.country_bucket[df.country_bucket == 'Bahamas, The'] = "Other"
df.country_bucket[df.country_bucket == 'Costa Rica'] = "Other"
df.country_bucket[df.country_bucket == 'Uruguay'] = "Other"
df.country_bucket[df.country_bucket == 'Mexico'] = "Other"
df["country_bucket"].value_counts()
Out[10]:
North America    823
Europe           362
Other             74
Name: country_bucket, dtype: int64
In [11]:
df.isna().sum()
Out[11]:
Age                          0
Gender                       0
Country                      0
self_employed                0
family_history               0
treatment                    0
work_interfere               0
no_employees                 0
remote_work                  0
tech_company                 0
benefits                     0
care_options                 0
wellness_program             0
seek_help                    0
anonymity                    0
leave                        0
mental_health_consequence    0
phys_health_consequence      0
coworkers                    0
supervisor                   0
mental_health_interview      0
phys_health_interview        0
mental_vs_physical           0
obs_consequence              0
country_bucket               0
dtype: int64

The above output serves as a check to ensure that there are no NaN instances present in our modified dataset.

In [12]:
d = {range(0, 18): "<18", range(18, 30): "20s", range(30, 40): "30s", range(40, 50): "40s", range(50, 60): "50s", range(60, 100): "60s+"}
df['age_bucket'] = df['Age'].apply(lambda x: next((v for k, v in d.items() if x in k), 0))
df[df['age_bucket'] == 0].Age.value_counts()
Out[12]:
-1726           1
-29             1
 99999999999    1
 329            1
-1              1
Name: Age, dtype: int64

The code chunk and output below depict how the age category was changed to become a discrete variable. The category was essentially bucketed into age intervals (e.g., less than 18, 20s, 30s, etc.). Furthermore, as depicted by the output above, five outliers were present in the Age category. These were imputed with the mean age present in the dataset, which was found to be 32 as shown below.

In [13]:
df.Age[df.age_bucket == 0] = float('NaN')
df.Age[df.age_bucket == 0] = df.Age.mean()
df.age_bucket[df.Age == df.Age.mean()] = "30s"
df[df['age_bucket'] == 0]
Out[13]:
Age Gender Country self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence country_bucket age_bucket
In [14]:
df["age_bucket"] = pd.Categorical(df["age_bucket"], ["<18", "20s", "30s", "40s", "50s", "60s+"])  
df['no_employees'] = pd.Categorical(df['no_employees'], ["1-5", "6-25", "26-100", "100-500", "500-1000", "More than 1000"])
df['leave'] = pd.Categorical(df['leave'], ["Very difficult", "Somewhat difficult", "Somewhat easy", "Very easy"])

The above code chunk sorts the discrete variables age_bucket, leave, and no_employees in a logical order such that figures that are generated in subsequent analysis are more intuitive.

In [15]:
enc = LabelEncoder()

mc_df = df.copy()

for category in ["Gender", "Country", "self_employed", "family_history", "treatment", 
                 "work_interfere", "no_employees", "remote_work", "tech_company", 
                 "benefits", "care_options", "wellness_program", "seek_help", "anonymity", 
                 "leave", "mental_health_consequence", "phys_health_consequence", "coworkers", 
                 "supervisor", "mental_health_interview", "phys_health_interview", "mental_vs_physical",
                 "obs_consequence", "age_bucket", "country_bucket"]:
                 mc_df[category] = enc.fit_transform(df[category])
                 
mc_df.head()
Out[15]:
Age Gender Country self_employed family_history treatment work_interfere no_employees remote_work tech_company benefits care_options wellness_program seek_help anonymity leave mental_health_consequence phys_health_consequence coworkers supervisor mental_health_interview phys_health_interview mental_vs_physical obs_consequence country_bucket age_bucket
0 37.0 0 45 0 0 1 2 4 0 1 1 0 0 1 1 1 1 1 1 2 1 0 1 0 1 1
1 44.0 1 45 0 0 0 3 5 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 2
2 32.0 1 7 0 0 0 3 4 0 1 0 0 0 0 0 0 1 1 2 2 2 2 0 0 1 1
3 31.0 1 44 0 1 1 2 2 0 1 0 1 0 0 0 0 2 2 1 0 0 0 0 1 0 1
4 31.0 1 45 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 2 2 2 0 0 1 1

The code chunk above uses the label encoding method to convert the categorical instances into numeric ones. Although encoding is necessary in order to make the categories readable by the mathematical equations necessary for machine learning, there is a stark difference between label encoding, which was performed above, and one-hot encoding, which is performed later in this notebook.

In label encoding, each instance of a category is assigned a unique integer based on alphabetical ordering.$^{21}$ Although this is relatively easy to perform and computationally inexpensive later in the machine learning nutshell, laabel encoding inherently adds a rank to the categorical instances based on alphabetical order. For example, as mentioned previously in the notebook, the 48 countries in this dataset would ultimately be numbered from 0-47, and this rank would then be accounted for in the machine learning model. However, the rank might have nothing to do with how much the instance contributes to the response variable, so label encoding, as we will see later, would not be a good fit for this dataset.

On the other hand, One-Hot Encoding simply creates additional features based on the number of unique values in the categorical features; each unique instance in the category will be added as a feature. Although this avoids the ranking problem associated with label encoding, it does add a lot of complexity to our dataset, especially with large categorical instance maps such as country. As such, we bucketed such categories (e.g., country) to avoid unecessary complexity and decrease the computational expense found downstream.


III. Exploratory Data Analysis

The following section outlines how exploratory data analysis (EDA) was initially conducted with respect to the response variable and independent variables present within the dataset. The exploratory data analysis was organized by several key questions that were generated as more specific versions of our general research question. The questions were also created in an attempt to test and possibly alter our original hypotheses. The Key EDA Questions, in addition to the General Research Question, can be found listed below:

General Research Question:

  • What are the strongest predictors dictating whether or not an employee will discuss mental health concerns with their direct supervisor(s)?

Key EDA Questions:

  1. Key Question #1: Generally, how willing are employees to reach out to their direct supervisor(s) about a mental health issue?
  2. Key Question #2: Do demographic indicators such as age, gender, and country play a role in how willing employees are to reach out to their direct supervisor(s)?
  3. Key Question #3: How much of a role do our 4 initial target predictors play a role in the willingness of employees to reach out to their direct supervisor(s)?
    • no_employees, obs_consequence, remote_work, benefits
  4. Key Question #4: How much of a role do two target predictors suggested by our peers play a role in the willingness of employees to reach out to their direct supervisor(s)?
    • leave, seek_help
  5. Key Question #5: What is the multicollinearity between factors in our dataset? Which factors are highly correlated with our target label (supervisor)?

a) Key Question #1: Generally, how willing are employees to reach out to their direct supervisor(s) about a mental health issue?

The first thing we want to look at with our dataset's features is how well-distributed the response variable is. Determining the base rate not only serves as a good measure of model robustness, but also if the method we ultimately use to approach the general research question is the most appropriate. The following code chunks determine the base rate of our response variable.

As a refresher, the response variable is coined supervisor, and it is a categorical measure of how willing employees are to discuss mental health issues with their direct supervisor(s). The possible survey responses, or response variable classes, are as follows: 'yes', 'some of them', and 'no'.

In [16]:
comp_employees = df.groupby(["supervisor"]).size().reset_index(name='Count')
fig = px.bar(comp_employees,
             x="supervisor",
             y="Count",
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Willingness to Reach Out to Direct Supervisor(s)'
             )
fig.show()

As observed qualitatively in the histogram above, the response variable is relatively well-distributed between the possible classes. There are 393 instances of 'No', 350 instances of 'Some of them', and 516 instances of 'Yes'.

The following code chunk will determine the base rates characterizing the response variable.

In [17]:
num_no = (df.supervisor == "No").sum()
num_some = (df.supervisor == "Some of them").sum()
num_yes = (df.supervisor == "Yes").sum()

fig = go.Figure(data=[go.Pie(labels=["No", "Some of them", "Yes"],
                             values=[num_no, num_some, num_yes],
                             pull=[0, 0, 0.1],
                             title="Base Rates of Response Variable (supervisor)",
                             )])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=2)))
fig.show()

As seen quantitatively above, the base rates of the response variable are as follows: 41.00 for the 'Yes' class, 27.80% for the 'Some of them' class, and 31.2% for the 'No' class. These can serve as indicators of what random guessing the supervisor class of any given employee would look like. As such, they will be utilized later as the primary test of model robustness.

b) Key Question #2: Do demographic indicators such as age, gender, and country play a role in how willing employees are to reach out to their direct supervisor(s)?

Next, let use see both how well the demographic features present in our dataset (age, gender, and country) are distributed and the effect they might have on the response variable.

i) AGE

In [18]:
num_less18 = (df.age_bucket == "<18").sum()
num_20s = (df.age_bucket == "20s").sum()
num_30s = (df.age_bucket == "30s").sum()
num_40s = (df.age_bucket == "40s").sum()
num_50s = (df.age_bucket == "50s").sum()
num_60plus = (df.age_bucket == "60s+").sum()


fig = go.Figure(data=[go.Pie(labels=["<18", "20s", "30s", "40s", "50s", "60s+"],
                             values=[num_less18, num_20s, num_30s, num_40s, num_50s, num_60plus],
                             pull=[0, 0, 0.1, 0, 0, 0],
                             title="Age Distribution",
                             )])

fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=2)))

fig.show()

At first glance, the pie chart above seems heavily distributed in favor of employees in their 20s and 30s. As such, it might be a better idea to leave this variable continuous as originally present in the dataset. We will see the effect of leaving age continuous as opposed to bucketing it into categories on the model later.

In [19]:
comp_age = df.groupby(["supervisor","age_bucket"]).size().reset_index(name='Count')
fig = px.bar(comp_age,
             x="supervisor",
             y="Count",
             color="age_bucket",
             barmode='group',
             color_discrete_sequence=px.colors.sequential.Plasma_r, 
             title='Distribution of Likelihood to Reach Out to Supervisor by Age'
             )
fig.show()

As qualitatively observed by the histogram above, it can be seen that age does not generally look like it has an effect on the response variable. However, upon second look, we can see that instances of 60s+ and <18 are relatively minute compared to the other age categories. This backs up what was previously stated, suggesting that age might be better left off as a continous variable rather than transforming it into a discrete one. Nevertheless, the qualitative observation suggests that age does not play a major role in determining whether or not an employee discusses a mental health concern with their direct supervisor(s). As such, we will leave it out of our hypothesized target predictor set.

ii) GENDER

In [20]:
num_female = (df.Gender == "female").sum()
num_male = (df.Gender == "male").sum()
num_other = (df.Gender == "other").sum()

fig = go.Figure(data=[go.Pie(labels=["Female", "Male", "Other"],
                             values=[num_female, num_male, num_other],
                             pull=[0, 0.1, 0],
                             title="Gender Distribution",
                             )])

fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=2)))

fig.show()

As observed in the qualitative pie chart above, the proportion of employees identifying as male in our dataset largely exceeds both the proportion of employees identifying as female and the proportion of employees identifying as other. Since we are concerning ourselves with the tech workplace, this is as expected. However, it may skew our data in terms of its effect on the response variable. Thus, let us execute the code chunk below to gain a preliminary view of the effect of gender categories on our response variable.

In [21]:
comp_gb = df.groupby(["supervisor","Gender"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="Gender",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Supervisor by Gender'
             )
fig.show()

As seen in the qualitative, stacked bar plot above, gender does not seem to have much of an impact on whether an employee reaches out to their direct supervisor(s) with regard to a potential mental health concern. As such, we will leave this category in our dataset for now, but not add it to our 4 initial target predictors.

iii) Country

In [22]:
comp_country = df.groupby(["country_bucket"]).size().reset_index(name='Count')

fig = go.Figure(data=[go.Pie(labels=comp_country["country_bucket"],
                             values=comp_country["Count"],
                             pull=[0.1, 0],
                             title="Country Distribution",
                             )])

fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(line=dict(color='#000000', width=2)))
fig.show()

As shown qualitatively in the pie chart above, North American Countries dominate the dataset (65.4% of responses), closely followed by European countries (28.8% of responses). However, the 'Other' bucket pales in comparison, at 5.88%. Nevertheless, not unlike the gender category, it may be too early to exclude this variable, so let us see qualitatively see what its effect is on the response variable.

In [23]:
comp_gb = df.groupby(["supervisor","country_bucket"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="country_bucket",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Supervisor by Country'
             )
fig.show()

Surprisingly, country does not play much of a role in determining whether or not an employee will reach out to their direct supervisor(s) to discuss a potential mental health concern. Given the various laws, healthcare systems, and benefits provided by different countries with respect to mental health, this is surprising. Nevertheless, we will keep the country_bucket category in our dataset for now, but we will not add it to our hypothesized 4 initial target predictors.

c) Key Question #3: How much of a role do our 4 initial target predictors play a role in the willingness of employees to reach out to their direct supervisor(s)?

With demographic indicators out of the way, let us move on to analyzing our 4 initial target predictors. As a refresher, they are no_employees, remote_work, benefits, and obs_consequence.

i) no_employees

As previously mentioned, the no_employees feature tracks how many employees are present at the subject's workplace.

In [24]:
comp_employees = df.groupby(["no_employees"]).size().reset_index(name='Count')
fig = px.bar(comp_employees,
             x="no_employees",
             y="Count",
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Number Employees Distribution'
             )
fig.show()

As seen qualitatively in the individual bar plot above, the distribution of the number of employees follows that of a bimodal one. It will be interesting to observe the effect of such a distribution on the response variable, as there may be merit to having both a small and large company when it comes to mental health discussions with a supervisor.

In [25]:
comp_gb = df.groupby(["supervisor","no_employees"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="no_employees",
             barmode='group',
             color_discrete_sequence=px.colors.sequential.Plasma_r, 
             title='Distribution of Likelihood to Reach Out to Supervisor by no_employees'
             )
fig.show()

It seems as if the biomodal distribution does have a distinct effect on the response variable, at least qualitatively. This can be most distinctly observed with the 6-25 group from the 'No' class to the 'Yes' class of our response variable. Due to this seemingly positive relationship, we will keep no_employees in our target predictor set.

ii) obs_consequence

As a refresher, the obs_consequence feature tracks whether or not an employee has heard of or observed a negative consequence occurring to a coworker and related to their mental health.

In [26]:
comp_gb = df.groupby(["supervisor","obs_consequence"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="obs_consequence",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Likelihood to Reach Out to Supervisor based on Observed Negative Consequences for Coworkers with Mental Health Conditions'
             )
fig.show()

Based on the distribution above, it can be qualitatively observed that obs_consequence seems to have a role in determining the outcome of the response variable. When the instance of this category is No, it seems as if employees are more likely to discuss their mental health issues with their direct supervisor(s). Thus, we will keep the category in our hypothesized target predictor set for now.

iii) remote_work

As a reminder, the remote_work feature tracks whether or not more than 50% of an employee's work takes place remotely as compared to in-person.

In [27]:
comp_gb = df.groupby(["supervisor","remote_work"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="remote_work",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Supervisor Based on Remote Work Availability'
             )
fig.show()

Although the distribution shown above is not as distinct with respect to a relationship between remote_work and classes in the response variable, we will keep remote_work in our target predictor set. We decided on this for two reasons: 1) working remotely is becoming an increasingly available option, especially in the tech industry, and 2) this feature is extremely relevant right now due to COVID-19 forcing many people to work remotely. Thus, we want to see how this will contribute to discussing a mental health issue with a direct supervisor.

iv) benefits

As previously mentioned in this notebook, the benefits feature tracks whether or not workplace benefits specifically related to mental health are present in an employee's workplace.

In [28]:
comp_gb = df.groupby(["supervisor","benefits"]).size().reset_index(name='Count')
fig = px.bar(comp_gb,
             x="supervisor",
             y="Count",
             color="benefits",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Supervisor Based on if Employers Provide Mental Health Benefits'
             )
fig.show()

Similar to the remote_work feature, there does not seem to be much of a qualitative association between the benefits feature and our response variable. Despite this, our literature review repeatedly cited benefits as an indicator of beneficial mental health outcomes. As such, we will leave this in our target predictor set.

d) Key Question #4: How much of a role do two target predictors suggested by our peers play a role in the willingness of employees to reach out to their direct supervisor(s)?

It was suggested by peers to also include leave and seek_help in our target predictor set. In this sub-section, we will analyze how these features contribute to the outcome of our response variable.

i) leave

As previously mentioned, the leave feature tracks how easy it is for an employee to take medical leave specifically due to a mental health condition.

In [29]:
comp_leave = df.groupby(["supervisor","leave"]).size().reset_index(name='Count')
fig = px.bar(comp_leave,
             x="supervisor",
             y="Count",
             color="leave",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Direct Supervisor by how easy it is to take Medical Leave'
             )
fig.show()

Unlike the previous two categories, the leave feature seems to have a distinct association with the response variable. This is most easily observed with the 'very easy' category, which essentially more than doubled in frequency from the 'Some of them' class to the 'Yes' class. As such, we will definitely add the leave feature to our hypothesized set of predictors.

ii) seek_help

As mentioned previously, the seek_help feature tracks whether an employee's employer provides resources to learn more about mental health issues and how to seek help.

In [30]:
comp_seekhelp = df.groupby(["supervisor","seek_help"]).size().reset_index(name='Count')
fig = px.bar(comp_seekhelp,
             x="supervisor",
             y="Count",
             color="seek_help",
             barmode='group',
             color_discrete_sequence=px.colors.qualitative.D3, 
             title='Distribution of Likelihood to Reach Out to Direct Supervisor whether Employer Provides Resources to raise Mental Health Awareness'
             )
fig.show()

As qualitatively observed in the plot above, there does not seem to be much of an association between the seek_help feature and our response variable. As such, this does not warrant an addition into our hypothesized set of features.

e) Key Question #5: What is the multicollinearity between factors in our dataset? Which factors are highly correlated with our target label (supervisor)

Now that we have analyzed qualitative effects of our hypothesized set of target predictors on our response variables, even adding one in while doing so (leave), let us look at more quantitative metrics to determine relationships between predictive features and the response variable as well as between predictive features themselves.

The following code chunk and output will display a correlation matrix between the features of our dataset.

In [31]:
corr = mc_df.iloc[:].corr()
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)
plt.show()

As observed qualitatively above, although some features are highly correlated, there does not seem to be too many features highly correlated with our response variable (supervisor). At first glance, only coworkers and mental_vs_physical seem to be highly correlated with the response variable, but let us double check this qualitative observation with quantitative features.

In [32]:
corr["supervisor"]
Out[32]:
Age                          0.003533
Gender                       0.068139
Country                     -0.001308
self_employed                0.037432
family_history               0.003729
treatment                   -0.036199
work_interfere              -0.092664
no_employees                -0.052650
remote_work                  0.025220
tech_company                 0.049543
benefits                     0.039570
care_options                 0.070158
wellness_program             0.103986
seek_help                    0.119302
anonymity                    0.179767
leave                        0.208364
mental_health_consequence   -0.153149
phys_health_consequence      0.103836
coworkers                    0.574310
supervisor                   1.000000
mental_health_interview     -0.189549
phys_health_interview        0.082835
mental_vs_physical           0.311737
obs_consequence             -0.090507
country_bucket               0.005444
age_bucket                   0.016276
Name: supervisor, dtype: float64

After excluding all other columns in the matrix, and generating quantitative statistics, our suspicious are more or less confirmed. The coworkers and mental_vs_physical features seem to be highly correlated with our response variable (supervisor), with metrics of 0.5743 and 0.3117, respectively. Upon quantitatively finding this, we may consider taking them out so that our model does not overfit to the training set.

In [33]:
res = chi2(mc_df.iloc[:], mc_df['supervisor'])
features = pd.DataFrame({
    'features': mc_df.columns[:],
    'chi2': res[0],
    'p-value': res[1]
})

features
Out[33]:
features chi2 p-value
0 Age 0.595044 7.426564e-01
1 Gender 1.763668 4.140229e-01
2 Country 2.950676 2.287014e-01
3 self_employed 1.601137 4.490735e-01
4 family_history 0.015672 9.921946e-01
5 treatment 0.852475 6.529614e-01
6 work_interfere 11.234793 3.634090e-03
7 no_employees 3.796551 1.498267e-01
8 remote_work 2.190844 3.343985e-01
9 tech_company 0.576132 7.497121e-01
10 benefits 1.925620 3.818185e-01
11 care_options 4.552720 1.026572e-01
12 wellness_program 11.221737 3.657891e-03
13 seek_help 14.367648 7.587607e-04
14 anonymity 29.881474 3.245789e-07
15 leave 81.879679 1.659792e-18
16 mental_health_consequence 42.810801 5.055367e-10
17 phys_health_consequence 7.622049 2.212550e-02
18 coworkers 164.231691 2.175412e-36
19 supervisor 817.150507 3.614282e-178
20 mental_health_interview 9.530343 8.521428e-03
21 phys_health_interview 6.428747 4.018050e-02
22 mental_vs_physical 91.026257 1.713560e-20
23 obs_consequence 12.438695 1.990544e-03
24 country_bucket 0.415159 8.125486e-01
25 age_bucket 0.330236 8.477937e-01

Finally, the results of running a chi-square analysis on each feature with respect to the response variable are depicted in the above output. The chi-square test investigates whether or not a statistical significance of the observed relationship with respect to the expected relationship exists.$^{22}$ In other words, the statistic is used to determine whether or not a statistically significant relationship between any given feature and the response variable exists due to the categorical nature of the variables.

According to the results, the following features are statistically significant at the 0.001 level: seek_help, anonymity, leave, mental_health_consequence, coworkers, and mental_vs_physical. This is simply too many to take out, but we can sufficiently decide to take coworkers out by now due to not only the more than significant p-value (2.1754*10$^{36}$) and high correlation found with the response variable (0.5743).

f) Finalized Hypothesis

After answering the key questions associated with our general research question, we can now reformat our original hypotheses. The biggest changes here are the addition of the leave feature to the hypothesized set of target predictors, which now possess 5 features, and the rewording of the hypothesis to capture the 5 strongest predictors instead of adding the predictors by feature importance and determining whether or not they are greater than an arbitrary value of 50%.

Null Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will not be the most important when predicting whether employees are willing to discuss mental health issues with supervisors.

Alternative Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular no order, will be the most important when predicting whether employees are willing to discuss mental health issues with supervisors

Target Predictors:

  • leave: How easy is it for you to take medical leave for a mental health condition?
  • benefits: Does your employer provide mental health benefits?
  • no_employees: How many employees does your company/organization have?
  • obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace
  • remote_work: Do you work remotely (outside of an office) at least 50% of the time?

IV. Initial Model & Feature Engineering

The following section outlines the initial model plan for the RandomForestClassifier, and showcases the results of the model's performance on the cleaned dataset. The model's performance is then analyzed depending on whether label encoding or one-hot encoding was utilized to encode the dataset's categorical features. Next, feature engineering is performed in an attempt to improve overall model performance. Finally, the model is tuned through performance of a multi-stage GridSearch containing multiple cross validation procedures within each overall iteration. At the end of this section, the best possible model with respect to accuracy in classifying the supervisor response variable as 'Yes' will be determined.

This finalized model will be subsequently utilized in Section V: Finalized Model & Results to determine its performance with respect to the other classes of the supervisor response variable (e.g., 'Some of them' and 'no').

a) Initial Model Plan

In order to approach our general question and test our hypothesis, we will be utilizing the scikit-learn library, specifically its Random Forest Classifier, to model our dataset and determine which features in our dataset hold the most weight when predicting whether or not an employee is willing to discuss a mental health concern with their direct supervisor(s).

As mentioned previously, the scikit-learn library in Python is a state-of-the-art machine learning library consisting of many different classifiers and machines. One of these classifiers is the Random Forest Classifier, which relies on the concept of classification trees. Essentially, a random forest algorithm grows many classification trees fitted to any specific dataset.$^{23}$ The node in each tree is split based on a measure of impurity, most commonly either Gini or Information Entropy, as we will see later. This gives a fast variable importance measure at every node of the tree, allowing the algorithm to be efficient and computationally inexpensive. When creating a Random Forest Classifier, there are several arguments necessary to be passed in. As obtained from the scikit-learn Python documentation, listed below are several important ones that will be utilized later in this notebook for tuning and feature engineering$^{24}$:

n_estimators --> Refers to the number of trees in the random forest.

criterion --> As stated previously, refers to the function calculated to measure the quality/impurity of a split. The most common functions are the Gini Index and the information gain.

max_depth --> Refers to the maximum depth of the tree.

min_samples_split --> Refers to the minimum number of samples required to split an internal node.

min_samples_leaf --> Refers to the minimum number of samples required to be at a leaf node.

max_features --> Refers to the number of features to consider when looking for the best node split.

boostrap --> Determines whether or not the whole dataset will be utilized to build each tree in the random forest.

max_samples --> If bootstrap is True, refers to the proportion of the dataset that will be utilized to build each tree in the random forest.

Although there are tradeoffs to extremely high and extremely low values of each hyperparameter, as we will see below, Random Forest Classifiers generally perform well and tend not to overfit. As such, the algorithm does not require very thorough hyper-parameter tuning to produce an accurate result, as the performance is most largely dependent on the min_samples_leaf and n_estimators hyperparameters. Furthermore, using the random forest algorithm makes it very easy to measure the relative importance of each feature on the prediction, which is what our hypothesis is about. Morevoer, the scikit-learn Python library even allows us to directly access normalized feature importance as an attribute of the fitted model. Ultimately, this is calculated by the Mean Decrease in the Gini Index, which essentially renders it an impurity-based measure of feature importance. In the end, this will be our primary endpoint for testing the hypothesis and answering the general research question listed above.

Furthermore, Bootstrapping will be leveraged to improve predictive accuracy and prevent overfitting even further, which will allow for a reduced variance of the overall classifier. All in all, Random Forest Classifiers will serve as an efficient measure of feature importance of the 5 hypothesized predictors in our target set: obs_consequence, benefits, remote_work, leave, and no_employees.

However, in order to tune the hyperparameters listed above, thereby optimizing our model, a parameter search algorithm will likely need to be conducted. Our plan is to use GridSearchCV(), also provided by the scikit-learn Python library, on a distinct set of parameters for the purposes of tuning parameters related to tree size, tree number, node size, and number estimators in our random forest model. Ultimately, we will utilize the GridSearchCV algorithm in steps, tuning the most important hyperparameters first before becoming more and more detailed in our approach. As such, we will start by tuning min_samples_leaf and n_estimators, then move on to max_features, and finally move on to max_depth and max_sample.

The dataset will be divided into training and testing portions via an 80:20 split in order to train and evaluate the performance of our model. To test our hypothesis, we will call feature importance and determine whether or not the 5 predictors present in our hypothesis are actually the 5 most important features in our Random Forest Classifier. As stated previously, feature importance is calculated by mean decrease in Gini Index.

b) Encoding Variables

As previously mentioned, we must encode our categorical variables and convert them to numeric identifiers so that the random forest can actually learn from the training dataset. We have already label encoded the dataset, but now will also one-hot encode it. The distinction between the two was explained previously in Section II: Data Overview & Cleaning. The following code chunk performs the one-hot encoding aspect through utilization of the get_dummies function in the pandas library.

In [34]:
encoded_df = pd.get_dummies(df)

unfe_df = pd.read_csv(file_path)
unfe_df = unfe_df.drop('Timestamp', 1)
unfe_df = unfe_df.drop('comments', 1)
unfe_df = pd.get_dummies(unfe_df)

cdrop = df.drop('Country', 1)
cdrop = pd.get_dummies(cdrop)

cdrop = df.drop('Country', 1)
cdrop = pd.get_dummies(cdrop)

agecdrop = df.drop('Country', 1)
agecdrop = df.drop('age_bucket', 1)
agecdrop = pd.get_dummies(agecdrop)

c) Initial OHE Model

The following code chunks will train a Random Forest Classifier on the one-hot encoded dataset as an initial indicator of model performance. Depending on the result, we will perform feature engineering as needed.

However, we first need to pre-process the dataset and drop some categories due to results of our exploratory data analysis and intuition. The following list summarizes the features that were dropped and the justification as to why they were dropped.

  • timestamp - We did not think the timestamp that the survey was submitted was relevant to determining our question.
  • state - We dropped state because we are looking at global data, and have country already.
  • coworkers - We dropped coworkers because it has the exact same question as supervisor, our label, but for coworkers.
    • We thought that including this category would introduce overfitting to our model (0.57).
  • mental_health_interview - We dropped mental_health_interview because we thought it was too similar to our label.
    • We thought including this category would introduce overfitting.
  • physical_health_interview - We dropped physical_health_interview because we thought it was too similar to our label
    • We thought including this category would introduce overfitting
  • comments - We dropped comments because the it contains many null values, and the string responses are all very different, so it would be very hard to format this data to actually benefit our model.

All of the following models were trained on this initially dropped dataset. However, feature engineering was also performed on some of the models below, as explained later. Nevertheless, with all preprocessing finished, let us see the results of our initial model run.

In [35]:
X_full = unfe_df.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y_full = unfe_df.loc[: ,'supervisor_Yes']
In [36]:
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
            X_full,
            X_full,
            test_size = 0.2,
            random_state = 508)
In [ ]:
#Instantiate and fit Entropy on Full Dataset
RF_entropy = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'entropy',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train_full, y_train_full)
In [ ]:
rfc_cv_score = cross_val_score(RF_entropy, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_entropy.predict(X_test_full) 
print(classification_report(y_test_full, rfc_predict))
In [ ]:
#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train_full, y_train_full).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test_full, y_test_full).round(7))

As observed quantitatively from the code chunks above, the random forest classifier trained on the non-feature engineered dataset exhibits strong overfitting. The training score was found to be 1.0 and the testing score was found to be 0.0, demonstrating high variance due to the overfit. This is likely due to the hyperparameters utilized above, as Bootstrapping was not performed, and all hyperparameters were left as default. Thus, we will start with altering these hyperparameters from the initial run.

d) OHE Non feature engineered model w/o age and country buckets

In this run, we will utilize the one-hot encoded dataset, yet make the hyperparameters of the Random Forest Classifier to different values to prevent the overfitting of the previous section. At the start, we will use 100 random forest trees and set the bootstrap argument to true. As a reminder, this second alteration allows the random forest to be grown out of subsets of the dataset instead of its entirety.

However, we will also perform the analysis without the age_bucket and country_bucket features, which might reduce the feature space complexity for age since it will stay continous. However, for country, the feature space will likely still be complex. Nevertheless, let us see the initial results based on this initil feature engineering change and hyperparameter additions.

In [40]:
X = encoded_df.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'age_bucket_<18', 'age_bucket_20s', 'age_bucket_30s', 'age_bucket_40s', 'age_bucket_50s', 'age_bucket_60s+', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes', 'country_bucket_Europe', 'country_bucket_North America', 'country_bucket_Europe'], axis=1)
y = encoded_df.loc[: ,'supervisor_Yes']
In [41]:
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)
In [42]:
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'entropy',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))

rfc_cv_score = cross_val_score(RF_entropy, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_entropy.predict(X_test) 
print(classification_report(y_test, rfc_predict))
Training Score 1.0
Testing Score: 0.7579365
0.7902831247537131
              precision    recall  f1-score   support

           0       0.78      0.81      0.79       144
           1       0.73      0.69      0.71       108

    accuracy                           0.76       252
   macro avg       0.75      0.75      0.75       252
weighted avg       0.76      0.76      0.76       252

Right off the bat, we can see that the alterations to our hyperparameters of the Random Forest Classifier and the changes in our dataset resulted in less of an overfit. This was as expected, as the age_bucket likely contributed substantial variance to the model since it was changed from continous to categorical. However, let us see the difference between using the Gini Index and using the Information Gain Index in the Random Forest Classifier Model.

In [43]:
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'gini',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)

#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))

rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_gini.predict(X_test) 
print(classification_report(y_test, rfc_predict))
Training Score 1.0
Testing Score: 0.7698413
0.7904685160273395
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       144
           1       0.74      0.71      0.73       108

    accuracy                           0.77       252
   macro avg       0.77      0.76      0.76       252
weighted avg       0.77      0.77      0.77       252

All metrics seem to improve when changing from Information Gain Index to the Gini Index as the criterion for the node split. However, it still seems to be overfitting based on the 1.0 training score.

Now, let us plot feature importance, which is the metric of interest.

In [44]:
def plot_feature_importances(model, train = X_train):
    n_features = train.shape[1]
    fi = np.sort(model.feature_importances_) #[::-1]
    fig = px.bar(X_train, fi, y=X_train.columns, orientation='h', title="Feature Importances of Model Features")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.show()
In [45]:
plot_feature_importances(RF_gini_fit, train = X_train)

Based on the initial plot of feature importance, it seems as if the model is heavily favoring mental_health_consequence, work_interfere, leave, remote_work, and gender as important categories.

e) LE Non feature engineered model w/o age and country buckets

The following is a repetition of the last section but with the label encoded dataframe instead of the OHE encoded one.

In [46]:
X = mc_df.drop(['supervisor', 'coworkers', 'mental_health_interview', 'phys_health_interview', 'age_bucket'], axis=1)
y = mc_df.loc[: ,'supervisor']

#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)
In [47]:
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'entropy',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))

rfc_predict = RF_entropy.predict(X_test) 
print(classification_report(y_test, rfc_predict)) 
Training Score 1.0
Testing Score: 0.6031746
              precision    recall  f1-score   support

           0       0.59      0.66      0.62        64
           1       0.53      0.33      0.40        80
           2       0.64      0.78      0.70       108

    accuracy                           0.60       252
   macro avg       0.59      0.59      0.58       252
weighted avg       0.59      0.60      0.59       252

Based on the information gained from the model run above, with information gain entropy as the criterion, the model achieved admirable results, although they were still overal lower than the one-hot encoded model without feature engineering. Now, let us look at the feature importance.

In [48]:
plot_feature_importances(RF_entropy_fit, train = X_train)

Interestingly, Age was a strong predictor of the response variable. no_employees* was also included in the top 5, in addition to mental_health_consequence feature, which was also included in the last section's run. Similarly, work_interfere was also included in the top 5 most important features. Unexpectedly, Country** seemed to have a big role in this model's predictions of the response variable as well.

f) OHE Feature engineered model /w age and country buckets

The following secton trains the model on a feature engineered dataset with age and country buckets.

In [49]:
X = cdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'Age', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = cdrop.loc[: ,'supervisor_Yes']
In [50]:
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)
In [51]:
def plot_feature_importances(model, train = X_train):
    n_features = train.shape[1]
    fi = np.sort(model.feature_importances_) #[::-1]
    fig = px.bar(X_train, fi, y=X_train.columns, orientation='h', title="Feature Importances of Model Features")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.show()
In [52]:
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'gini',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)

#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))

rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_gini.predict(X_test) 
print(classification_report(y_test, rfc_predict))
Training Score 0.9940417
Testing Score: 0.7420635
0.7855883440001087
              precision    recall  f1-score   support

           0       0.76      0.80      0.78       144
           1       0.71      0.67      0.69       108

    accuracy                           0.74       252
   macro avg       0.74      0.73      0.73       252
weighted avg       0.74      0.74      0.74       252

Based on the Gini Index Criterion, we can see a decrease in the overfitting related to our model, although slightly. However, the metrics decreased across the board from our previous One-Hot Encoded model which was run without any feature engineering. This was likely due to the age_bucket introducing a categorical variable with complex feature space into the dataset. Since age is an extremely variable category continously, parking ages into buckets reduces the ability of the model to be resistant to inherent variance.

Now, let us see how the metrics change through using information gain entropy as the criterion of node splitting.

In [53]:
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'entropy',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))

rfc_cv_score = cross_val_score(RF_entropy, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_entropy.predict(X_test) 
print(classification_report(y_test, rfc_predict))
Training Score 0.9940417
Testing Score: 0.7619048
0.7869484733602381
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       144
           1       0.74      0.69      0.71       108

    accuracy                           0.76       252
   macro avg       0.76      0.75      0.75       252
weighted avg       0.76      0.76      0.76       252

The results are relatively similar to those obtained using the Gini Information Index, although a slight increase can be observed across generally all metrics.

In [54]:
plot_feature_importances(RF_entropy_fit, train = X_train)

Based on the feature importance plot above, it seems as if mental_health_consequence is still in our top 5, as are no_employees, leave, and age_bucket.

g) LE Feature engineered model /w age and country buckets

The following sub-section repeats the analysis on the label encoded dataframe instead of the one-hot encoded dataframe.

In [55]:
X = mc_df.drop(['supervisor', 'coworkers', 'Age', 'mental_health_interview', 'phys_health_interview', 'Country'], axis=1)
y = mc_df.loc[: ,'supervisor']

#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)
In [56]:
#Instantiate and fit Entropy
RF_entropy = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'entropy',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)
RF_entropy_fit = RF_entropy.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))

rfc_predict = RF_entropy.predict(X_test) 
print(classification_report(y_test, rfc_predict)) 
Training Score 0.9920556
Testing Score: 0.5714286
              precision    recall  f1-score   support

           0       0.53      0.62      0.58        64
           1       0.43      0.23      0.30        80
           2       0.64      0.80      0.71       108

    accuracy                           0.57       252
   macro avg       0.53      0.55      0.53       252
weighted avg       0.54      0.57      0.54       252

Upon initial glance, this model's performance is not too distinct relative to the non-feature engineered results. However, all metrics were observed to drop across the board.

Now, let us see if the Gini Index Criterion results in any differences in performance.

In [57]:
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'gini',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)

#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_entropy_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_entropy_fit.score(X_test, y_test).round(7))

rfc_predict = RF_entropy.predict(X_test) 
print(classification_report(y_test, rfc_predict)) 
Training Score 0.9920556
Testing Score: 0.5714286
              precision    recall  f1-score   support

           0       0.53      0.62      0.58        64
           1       0.43      0.23      0.30        80
           2       0.64      0.80      0.71       108

    accuracy                           0.57       252
   macro avg       0.53      0.55      0.53       252
weighted avg       0.54      0.57      0.54       252

The results of using Gini Index Criterion were identical to those of the Information Gain Entropy Criterion.

In [58]:
plot_feature_importances(RF_entropy_fit, train = X_train)

According to the plot, mental_health_consequence, no_employees, work_interfere, age_bucket, and leave were the 5 most important predictors. This is promising, as two of our target predictors are present in the top five. However, the concerning part is the low metrics obtained in the model performance.

h) OHE Feature engineered model w/o age bucket /w country bucket

Since One-Hot Encoding generally performed the best thus far, let us now try running a Random Forest Classifier with the same hyperparameters mentioned above on the OHE dataset without the age bucket but with the country bucket.

In [59]:
X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = agecdrop.loc[: ,'supervisor_Yes']
In [60]:
#Train, test, split
X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)
In [61]:
#Instantiate and fit Gini
RF_gini = RandomForestClassifier(n_estimators = 100,
                                     criterion = 'gini',
                                     max_depth = None,
                                     min_samples_leaf = 1,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)

#Fitting model
RF_gini_fit = RF_gini.fit(X_train, y_train)

#Printing model scores
print('Training Score', RF_gini_fit.score(X_train, y_train).round(7))
print('Testing Score:', RF_gini_fit.score(X_test, y_test).round(7))

rfc_cv_score = cross_val_score(RF_gini, X, y, cv=10, scoring='roc_auc')
print(rfc_cv_score.mean())

rfc_predict = RF_gini.predict(X_test) 
print(classification_report(y_test, rfc_predict))
Training Score 1.0
Testing Score: 0.7539683
0.7853761668908729
              precision    recall  f1-score   support

           0       0.78      0.80      0.79       144
           1       0.72      0.69      0.71       108

    accuracy                           0.75       252
   macro avg       0.75      0.75      0.75       252
weighted avg       0.75      0.75      0.75       252

As depicted in the output above, this model was determined to be the most generalizable with respect to testing score, and will thus be utilized for hyperparameter tuning and subsequent multiclass classification.

In [62]:
ax = plt.gca()
roc_disp = skmet.plot_roc_curve(RF_gini_fit, X_test, y_test, ax = ax, alpha=.8)
ax.plot([0,1], [0,1], linestyle = '--', lw=2, color= 'black', label = 'Chance', alpha = .8)
ax.legend(loc = 'lower right')
plt.show()

The ROC Curve above showcases an Area-Under-Curve (AUC) value of 0.79. As such, the model exhibits a 79% chance that a randomly chosen positive instance (e.g., 'Yes' in the supervisor category) will be ranked higher than a randomly chosen negative instance. As such, it is a good sign that the model possess high quality and robustness.

i) Optimized DE Feature engineered model w/o age buckets /w country bucket

Now that we have picked the model of interest and conducted feature engineering, let's hypertune the parameters of the model using GridSearchCV(). As a refresher, the model of interest utilizes the 'gini' criterion with a True bootstrap argument. The engineered dataset does not contain the age_bucket feature but still contains the country bucket.

i) Hypertuning n_estimators & min_samples_leaf

First, let's tune the n_estimators and min_samples_leaf hyperparameters. As previously stated, these are generally considered to be the most important hyperparameters of the Random Forest Classifier. The n_estimators hyperparameter reflects the amount of trees utilized in the random forest, while the min_samples_leaf reflects the minimum number of samples required to be at a leaf node in the model.

In [66]:
agecdrop = df.drop('Country', 1)
agecdrop = agecdrop.drop('age_bucket', 1)
agecdrop = pd.get_dummies(agecdrop)

X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y = agecdrop.loc[: ,'supervisor_Yes']


X_train, X_test, y_train, y_test = train_test_split(
            X,
            y,
            test_size = 0.2,
            random_state = 508)

param_grid = {
    'n_estimators': [1, 20, 50, 75, 100, 200, 300],
    'min_samples_leaf': [1, 5, 10, 50, 100], 
}

rfc=RandomForestClassifier(random_state=508, bootstrap=True)

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)

print("Best Parameter Set: ", CV_rfc.best_params_)
print()

best_acc = CV_rfc.best_score_
print("Best Accuracy: ", best_acc)
print()

best_model = CV_rfc.best_estimator_
Best Parameter Set:  {'min_samples_leaf': 10, 'n_estimators': 200}

Best Accuracy:  0.7636668144426382

Based on the results of the first grid search, the optimal n_estimators hyperparameter was found to be 200, and the optimal min_samples_leaf hyperparameter was found to be 10. The base cross-validation accuracy after this grid-search was found to be 76.37%.

ii) Hypertuning max_features

As a reminder, the max_features hyperparameter measures the maximum amount of features considered when looking for the best node split.

In [67]:
param_grid = {
    'min_samples_leaf': [10], 
    'n_estimators': [200], 
    'max_features':[5, 10, 15, 20],
}

rfc=RandomForestClassifier(random_state=508, bootstrap=True)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)

print("Best Parameter Set: ", CV_rfc.best_params_)
print()

print("Best Accuracy: ", CV_rfc.best_score_)
print()

if best_acc < CV_rfc.best_score_:
  print("Better Model?: ", True)
  best_acc = CV_rfc.best_score_
  best_model = CV_rfc.best_estimator_
else:
  print("Better Model?: ", False)
Best Parameter Set:  {'max_features': 5, 'min_samples_leaf': 10, 'n_estimators': 200}

Best Accuracy:  0.7577163686517905

Better Model?:  False

As can be seen in the output above, tuning the max_features hyperparameter did not result in a better Cross-Validation Accuracy score. Thus, we will leave this as its default value (None) in order to reduce model unecessary model constraints.

iii) Hypertuning max_depth & max_samples

As a refresher, max_depth refers to the maximum depth of the random forest's decision trees, and max_samples refers to the proportion of the original dataset sampled to make the decision tree when bootstrapping.

In [68]:
param_grid = {
    'min_samples_leaf': [10], 
    'n_estimators': [200], 
    'max_depth': [5, 10, 15, 20, 50, 100],
    'max_samples': [0.2, 0.4, 0.6, 0.8]
}

rfc=RandomForestClassifier(random_state=508, bootstrap=True)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='accuracy', cv=5)
CV_rfc.fit(X_train, y_train)

print("Best Parameter Set: ", CV_rfc.best_params_)
print()

print("Best Score: ", CV_rfc.best_score_)
print()

if best_acc < CV_rfc.best_score_:
  print("Better Model?: ", True)
  best_acc = CV_rfc.best_score_
  best_model = CV_rfc.best_estimator_
else:
  print("Better Model?: ", False)
Best Parameter Set:  {'max_depth': 15, 'max_samples': 0.8, 'min_samples_leaf': 10, 'n_estimators': 200}

Best Score:  0.7636668144426382

Better Model?:  False

Tuning the max_depth and max_samples hyperparameters did not improve the Cross-Validation Score. Thus, we will not include them in the final analysis.

iv) Summary of Best RandomForestClassifier & Testing Performance

This section will generate summary statistics of the tuned, feature engineered model on the 'Yes' class.

In [69]:
best_parameters = best_model.get_params()
print("{:<10}\t\t\t{:<10}".format("HYPERPARAMETER", "VALUE"))

for key, value in best_parameters.items():
  if value == None:
    val = 'None'
  else:
    val = value

  if key == "min_impurity_decrease" or key == "min_impurity_split" or key=="min_samples_leaf" or key == "min_samples_split": 
    print("{:<10}\t\t{:<10}".format(key, val))
  elif key == "min_weight_fraction_leaf":
    print("{:<10}\t{:<10}".format(key, val))
  else:
    print("{:<10}\t\t\t{:<10}".format(key, val))
HYPERPARAMETER			VALUE     
bootstrap 			1         
ccp_alpha 			0.0       
class_weight			None      
criterion 			gini      
max_depth 			None      
max_features			auto      
max_leaf_nodes			None      
max_samples			None      
min_impurity_decrease		0.0       
min_impurity_split		None      
min_samples_leaf		10        
min_samples_split		2         
min_weight_fraction_leaf	0.0       
n_estimators			200       
n_jobs    			None      
oob_score 			0         
random_state			508       
verbose   			0         
warm_start			0         
In [70]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.80      0.82      0.81       144
           1       0.75      0.72      0.74       108

    accuracy                           0.78       252
   macro avg       0.77      0.77      0.77       252
weighted avg       0.78      0.78      0.78       252


V. Finalized Model & Results

The following section showcases the performance of the finalized model obtained from Section IV: Initial Model & Feature Engineering on each class of the supervisor response variable from the dataset. As previously mentioned, the classes of the supervisor response variable are as follows: 'Yes', 'Some of them', and 'No'. When looking at this section, keep in mind that feature engineering on the Random Forest Classifier, in addition to hyperparameter tuning, was only performed by training with respect to the 'Yes' label. Furthermore, the base rates of the 'Yes', 'Some of them', and 'No' classes, as found previously in Section III: Exploratory Data Analysis are 41%, 27.8%, and 31.2%, respectively.

From there, this section will go into analysis of the results to address the finalized hypotheses, which are listed again below for clarity:

Null Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular order, will not be the most important when predicting whether employees are willing to discuss mental health issues with supervisors.

Alternative Hypothesis: Using Random Forest and feature importance from sklearn (Mean Decrease in Gini Index), the 5 target predictors, in no particular no order, will be the most important when predicting whether employees are willing to discuss mental health issues with supervisors

Target Predictors:

  • leave: How easy is it for you to take medical leave for a mental health condition?
  • benefits: Does your employer provide mental health benefits?
  • no_employees: How many employees does your company/organization have?
  • obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace
  • remote_work: Do you work remotely (outside of an office) at least 50% of the time?

a) Model Summary and Performance on each Class

i) Random Forest Classifier - Finalized Model with Optimal Features

In [71]:
agecdrop = df.drop('Country', 1)
agecdrop = agecdrop.drop('age_bucket', 1)
agecdrop = pd.get_dummies(agecdrop)

X = agecdrop.drop(['supervisor_Yes', 'supervisor_No', 'supervisor_Some of them', 'coworkers_Yes', 'coworkers_Some of them', 'coworkers_No', 'mental_health_interview_Maybe', 'mental_health_interview_No', 'mental_health_interview_Yes', 'phys_health_interview_Maybe', 'phys_health_interview_No', 'phys_health_interview_Yes'], axis=1)
y_yes = agecdrop.loc[: ,'supervisor_Yes']
y_some = agecdrop.loc[:, 'supervisor_Some of them']
y_no = agecdrop.loc[:, 'supervisor_No']

RF = RandomForestClassifier(n_estimators = 200,
                                     criterion = 'gini',
                                     #max_depth = 10,
                                     min_samples_leaf = 10,
                                     #max_samples = 0.6,
                                     bootstrap = True,
                                     warm_start = False,
                                     random_state = 508)

print("Summary of the Best Model")
print()

model_parameters = RF.get_params()
print("{:<10}\t\t\t{:<10}".format("HYPERPARAMETER", "VALUE"))

for key, value in model_parameters.items():
  if value == None:
    val = 'None'
  else:
    val = value

  if key == "min_impurity_decrease" or key == "min_impurity_split" or key=="min_samples_leaf" or key == "min_samples_split": 
    print("{:<10}\t\t{:<10}".format(key, val))
  elif key == "min_weight_fraction_leaf":
    print("{:<10}\t{:<10}".format(key, val))
  else:
    print("{:<10}\t\t\t{:<10}".format(key, val))
Summary of the Best Model

HYPERPARAMETER			VALUE     
bootstrap 			1         
ccp_alpha 			0.0       
class_weight			None      
criterion 			gini      
max_depth 			None      
max_features			auto      
max_leaf_nodes			None      
max_samples			None      
min_impurity_decrease		0.0       
min_impurity_split		None      
min_samples_leaf		10        
min_samples_split		2         
min_weight_fraction_leaf	0.0       
n_estimators			200       
n_jobs    			None      
oob_score 			0         
random_state			508       
verbose   			0         
warm_start			0         

ii) Training & Testing Model on 'Yes' Label of Supervisor

In [72]:
X_train_yes, X_test_yes, y_train_yes, y_test_yes = train_test_split(
            X,
            y_yes,
            test_size = 0.2,
            random_state = 508)

RF_fit_yes = RF.fit(X_train_yes, y_train_yes)

print('Training Score', RF_fit_yes.score(X_train_yes, y_train_yes).round(7))
print('Testing Score:', RF_fit_yes.score(X_test_yes, y_test_yes).round(7))
print()

yes_predict = RF_fit_yes.predict(X_test_yes) 
print(classification_report(y_test_yes, yes_predict)) 
Training Score 0.7795432
Testing Score: 0.7777778

              precision    recall  f1-score   support

           0       0.80      0.82      0.81       144
           1       0.75      0.72      0.74       108

    accuracy                           0.78       252
   macro avg       0.77      0.77      0.77       252
weighted avg       0.78      0.78      0.78       252

In [73]:
ftr_imp = RF_fit_yes.feature_importances_
ftrs = X_train_yes.columns
ftrs_gen = ftrs.str.rsplit('_', 1).str.get(0)
In [74]:
ftrs_df = pd.DataFrame(
    {'Original_Column': ftrs_gen,
     #'Bucketed Column': ftrs,
     'Feature_Importance': ftr_imp
    })
In [75]:
grouped = ftrs_df.groupby(['Original_Column']).sum().reset_index().head(10)
#ftrs_df['Original Column'].value_counts()
grouped.sort_values(by='Feature_Importance', ascending=False).head(5)
Out[75]:
Original_Column Feature_Importance
8 mental_health_consequence 0.444921
9 mental_vs_physical 0.129960
0 Age 0.047588
7 leave 0.043240
2 anonymity 0.030094
In [76]:
grouped.sort_values(by='Feature_Importance', ascending=False).tail(5)
Out[76]:
Original_Column Feature_Importance
1 Gender 0.017018
5 country_bucket 0.014821
6 family_history 0.014580
4 care_options 0.011414
3 benefits 0.011035
In [77]:
ftrs_df
Out[77]:
Original_Column Feature_Importance
0 Age 0.047588
1 Gender 0.006901
2 Gender 0.010059
3 Gender 0.000058
4 self_employed 0.003878
5 self_employed 0.002629
6 family_history 0.007195
7 family_history 0.007385
8 treatment 0.006085
9 treatment 0.009589
10 work_interfere 0.004771
11 work_interfere 0.007941
12 work_interfere 0.003528
13 work_interfere 0.002733
14 work_interfere 0.013994
15 no_employees 0.002788
16 no_employees 0.011632
17 no_employees 0.008057
18 no_employees 0.003641
19 no_employees 0.000725
20 no_employees 0.008726
21 remote_work 0.006021
22 remote_work 0.005657
23 tech_company 0.003874
24 tech_company 0.004672
25 benefits 0.005475
26 benefits 0.005560
27 care_options 0.005198
28 care_options 0.006216
29 wellness_program 0.004048
30 wellness_program 0.005536
31 seek_help 0.005092
32 seek_help 0.004361
33 anonymity 0.020166
34 anonymity 0.009928
35 leave 0.003117
36 leave 0.016974
37 leave 0.007024
38 leave 0.016125
39 mental_health_consequence 0.066655
40 mental_health_consequence 0.267035
41 mental_health_consequence 0.111230
42 phys_health_consequence 0.030946
43 phys_health_consequence 0.059542
44 phys_health_consequence 0.010204
45 mental_vs_physical 0.055514
46 mental_vs_physical 0.074446
47 obs_consequence 0.002252
48 obs_consequence 0.002409
49 country_bucket 0.004934
50 country_bucket 0.008325
51 country_bucket 0.001562

iii) Training & Testing Model on 'Some of them' Label of Supervisor

In [78]:
X_train_some, X_test_some, y_train_some, y_test_some = train_test_split(
            X,
            y_some,
            test_size = 0.2,
            random_state = 508)

RF_fit_some = RF.fit(X_train_some, y_train_some)

print('Training Score', RF_fit_some.score(X_train_some, y_train_some).round(7))
print('Testing Score:', RF_fit_some.score(X_test_some, y_test_some).round(7))
print()

some_predict = RF_fit_some.predict(X_test_some) 
print(classification_report(y_test_some, some_predict)) 
Training Score 0.7437934
Testing Score: 0.6825397

              precision    recall  f1-score   support

           0       0.68      1.00      0.81       172
           1       0.00      0.00      0.00        80

    accuracy                           0.68       252
   macro avg       0.34      0.50      0.41       252
weighted avg       0.47      0.68      0.55       252

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

iv) Training & Testing Model on 'No' Label of Supervisor

In [79]:
X_train_no, X_test_no, y_train_no, y_test_no = train_test_split(
            X,
            y_no,
            test_size = 0.2,
            random_state = 508)

RF_fit_no = RF.fit(X_train_no, y_train_no)

print('Training Score', RF_fit_no.score(X_train_no, y_train_no).round(7))
print('Testing Score:', RF_fit_no.score(X_test_no, y_test_no).round(7))
print()

no_predict = RF_fit_no.predict(X_test_no) 
print(classification_report(y_test_no, no_predict)) 
Training Score 0.7934459
Testing Score: 0.7936508

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       188
           1       0.64      0.42      0.51        64

    accuracy                           0.79       252
   macro avg       0.73      0.67      0.69       252
weighted avg       0.78      0.79      0.78       252

v) ROC Curves

In [80]:
ax = plt.gca()
ax.plot([0,1], [0,1], linestyle = '--', lw=2, color= 'black', label = 'Chance', alpha = .8)
ax.legend(loc = 'lower right')
roc_disp_yes = skmet.plot_roc_curve(RF_fit_yes, X_test_yes, y_test_yes, ax = ax, alpha=.8, name = 'RFC for "Yes" Response')
roc_disp_some = skmet.plot_roc_curve(RF_fit_some, X_test_some, y_test_some, ax = ax, alpha=.8, name = 'RFC for "Some of them" Response')
roc_disp_no = skmet.plot_roc_curve(RF_fit_no, X_test_no, y_test_no, ax = ax, alpha=.8, name = 'RFC for "No" Response')
plt.title("ROC curves of Model on the Different Supervisor Labels")
plt.show()

b) Feature Importance of Each Class's Model Fit

In order to ensure that all model's were correctly trained on the same dataset, we will plot the feature importance, as determined by the mean decrease in Gini Index of the Random Forest Classifier fit, for each class's model train.

In [81]:
def plot_feature_importances(model, train = X_train_yes):
    n_features = train.shape[1]
    fi = model.feature_importances_ #[::-1]
    fig = px.bar(X_train_yes, fi, y=X_train_yes.columns, orientation='h', title="Feature Importances of Model Features").update_yaxes(categoryorder="total ascending")
    fig.update_layout(xaxis=dict(title="Feature Importance"), yaxis=dict(title="Feature"))
    fig.update_layout(xaxis={'categoryorder':'total descending'})
    fig.show()
In [82]:
plot_feature_importances(RF_fit_yes, train = X_train_yes)
In [83]:
plot_feature_importances(RF_fit_some, train = X_train_some)
In [84]:
plot_feature_importances(RF_fit_no, train = X_train_no)

c) Confusion Matrix Analysis

i) Confusion Matrix for 'Yes' Class

In [85]:
yes_matrix = confusion_matrix(y_test_yes, yes_predict)
yes_labels = ["Yes", "Some/No"]

sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(yes_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20}, 
    linecolor="black", linewidth=0.1, xticklabels=yes_labels, yticklabels=yes_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})

plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')

plt.ylabel('Predicted', fontsize=18)
plt.xlabel('Actual', fontsize=18)

ax.set_title("Confusion Matrix for 'Yes' Class", fontsize=24)
fig.tight_layout()
plt.show()

ii) Confusion Matrix for 'Some of them' Class

In [86]:
some_matrix = confusion_matrix(y_test_some, some_predict)
some_labels = ["Some", "Yes/No"]

sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(some_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20}, 
    linecolor="black", linewidth=0.1, xticklabels=some_labels, yticklabels=some_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})

plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')

plt.ylabel('Predicted', fontsize=18)
plt.xlabel('Actual', fontsize=18)

ax.set_title("Confusion Matrix for 'Some of them' Class", fontsize=24)
fig.tight_layout()
plt.show()

iii) Confusion Matrix for 'No' Class

In [87]:
no_matrix = confusion_matrix(y_test_no, no_predict)
no_labels = ["No", "Yes/Some"]

sns.set()
fig, ax = plt.subplots(figsize=(8,6))
ax = sns.heatmap(no_matrix, annot=True, fmt='d', square=True, ax=ax, annot_kws={"fontsize":20}, 
    linecolor="black", linewidth=0.1, xticklabels=no_labels, yticklabels=no_labels, cmap="PuBuGn", cbar_kws={'label':'Count'})

plt.setp(ax.get_xticklabels(), fontsize=16, va='center', ha='center')
plt.setp(ax.get_yticklabels(), fontsize=16, va='center', ha='center')

plt.ylabel('Predicted', fontsize=18)
plt.xlabel('Actual', fontsize=18)

ax.set_title("Confusion Matrix for 'No' Class", fontsize=24)
fig.tight_layout()
plt.show()

d) Summary Statistics

In [88]:
def getSummaryStatistics(confusion_matrix):
  TP = float(confusion_matrix[0][0])
  TN = float(confusion_matrix[1][1])
  FP = float(confusion_matrix[0][1])
  FN = float(confusion_matrix[1][0])

  acc = round(((TP+TN) / (TP+TN+FP+FN)), 2)
  pre = round(((TP) / (TP+FP)), 2)
  rec = round(((TP) / (TP+FN)), 2)
  if TN+FP == 0:
    spe="---"
  else: 
    spe = round(1 - ((FP) / (TN+FP)), 2)
  f1 = round((((2)*(rec)*(pre)) / (rec+pre)), 2)

  return [acc, pre, rec, spe, f1]

yes_summary = getSummaryStatistics(yes_matrix)
some_summary = getSummaryStatistics(some_matrix)
no_summary = getSummaryStatistics(no_matrix)
In [89]:
cellColor='lightskyblue'
headerColor='dodgerblue'  
theTable = plt.table(
  cellText=[
      yes_summary,
      some_summary,
      no_summary
  ],
  cellColours=[
      [cellColor, cellColor, cellColor, cellColor, cellColor], 
      [cellColor, cellColor, cellColor, cellColor, cellColor], 
      [cellColor, cellColor, cellColor, cellColor, cellColor]
  ],
  cellLoc='center',
  rowLabels=['Yes', 'Some', 'No'],
  rowColours=[headerColor, headerColor, headerColor], 
  rowLoc='center',
  colLabels=['ACCURACY', 'PRECISION', 'RECALL', 'SPECIFICITY', 'F1-Score'],
  colColours=[headerColor, headerColor, headerColor, headerColor, headerColor],
  colLoc='center',
  loc='center'
)
theTable.auto_set_font_size(False)
theTable.set_fontsize(16)
theTable.scale(2, 2)
ax=plt.gca()
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.box(on=None)
plt.show()

VI. Discussion & Conclusion

a) Discussion

i) Interpreting the Results

Based on mean decrease in Gini Index Criterion, we cannot reject the null hypothesis. Of the 5 categories we predicted to be the most important in terms of feature importance to our model, only leave was present. The actual 5 most important features were mental_health_consequence, mental_vs_physical, leave, age, and anonymity. benefits was actually even found to be the least important feature in predicting the response variable. no_employees, obs_consequencce, and remote_work were not found to be anywhere near the top five most important features of our model. As a result, we can sufficiently reject the null hypothesis.

However, the top five features still make intuitive sense. First, mental_vs_physical measures how much an employee thinks that their employer takes mental health as seriously as physical. Intuitively it is clear to see why this would be the one of the top predictors of feeling comfortable with talking to your supervisor about mental health issues. For instance, if one felt that their supervisor takes mental health as serious as physical health, there is less of a stigma associated in the workplace, and one is likely to reach out to them if they had a potential mental health concern. Furthermore, mental_health_consequence measures the perception that an employee has regarding a negative consequence resulting from discussing a mental health issue with their employer. This also makes intuitive sense, as it inherently tracks the substantial stigma associated with having a mental illness in the world today; it is not surprising that this would be a top predictor of reaching out to a supervisor as well. Ultimately, this reasoning was why we chose obs_consequence as a feature of interest, but mental_health_consequence likely performed better because it is measuring what an employee thinks will happen to themselves if they reach out, not what they observed to happen to their coworkers (i.e., what obs_consequence reflects).

The next strongest predictor was anonymity, which tracked how employees perceived anonymity to be protected if they or their coworker takes advantage of mental health or substance abuse treatment. Similar to the prevoius two explanations, if their identity is protected, an employee will more likely to be comfortable reaching out to others, including a supervisor, with mental health concerns; it makes sense. age was an unexpected member of the top 5, but there are two possible explanations for this: 1) There was a large unimodal distribution of age highly in favor of employees in their 20s and 30s, and 2) stigma may be more debilitating to older members of the workforce who are accustomed to mental health not being discussed at all. Finally, we have the leave feature, which is the only one we predicted to make the top 5 most important features list.

In terms of model performance, the best model performed admirably compared to base rate percentages. The model obtained a 'Yes' accuracy of 78.00%, a 'Some of them' accuracy of 68.00%, and a 'No' accuracy of 79.00%. In contrast, the base rates of the three classes of the response variable were 41.00%, 27.80%, and 31.20%, respectively. As such, the model we utilized served as a solid predictor of whether an employee was willing to reach out to their direct supervisor(s) regarding a mental health issue.

ii) Errors & Limitations

Primarily, the model suffered from a lack of generalizability due to many possible sources. First, there were very low sample sizes from countries in the "Other" bucket, or in other words, countries that were not the United States of America, the United Kingdom, or Canada. Similarly, healthcare structures of the employees' residence were not able to be identified; this limitation is further exacerbated by the healthcare of the United States being highly variable across the population. Furthermore, the dataset was heavily imbalanced with respect to gender; this was likely due to the dataset being recorded primarily from technological industry companies, which are male-dominated. However, this is another inherent limitation, as other industries were not intended to be surveyed, and employee responses were only included as a byproduct of unintended survey distribution. Finally, the lack of generalizability primarily stems from the small size of the dataset (1,259 survey responses across several different countries); this is far from representative of the entire tech workforce population globally. Thus, the strongest predictors are unlikely to be constant if conducting analysis with a more representative dataset.

Additionally, the inherent categorical bias of the survey itself limited the robustness of the model. Since continous variables have greater statistical power, less variance without regard to sample size, and higher sensitivity than categorical variables, the fact that most features in the database were categorical in nature limits the model accuracy found in the analysis. Binning the country feature made the model more interpretable but it was not at all inclusive (e.g., only 'North American', 'Europe', and 'Other' instances). Finally, the model suffered from great class imbalance in the dataset, as the model tended to perofrm better for the majority with respect to respondent country as well as the 'Yes' class for the response variable. This was likely due to the feature engineering and hypertuning of the model only being performed through analysis of model performance on the 'Yes' class of the response variable.

iii) Future Avenues of Research

One future avenue of research includes exploring options to improve the class imbalance of the response variable. One way to achieve this is to downsample 'Yes' instances (41% base rate) to match more closely to 'Some' or 'No' instances (27% and 31%, respectively). Additionally, one can utilize the class_weight feature in the Random Forest Classifier model to account for the imbalance. Yet another way to achieve this is to relevel the response variable to make 'some' instances grouped under either 'yes' or 'no', or try both variations to observe differences between the strategies.

Another avenue of future research includes exploring specific health policies in the workplace and their effect on the likelihood of whether an employee discusses a mental health concern with their direct supervisor(s). Some variables of interest here may include maternity/paternity leave, subsidized child care, and even counseling. Similarly, specific types of mental health and healthcare coverage should be analyzed, in addition to a whole multitude of specific features if the dataset was expanded. Moreover, healthcare systems should specifically be explored to see their effect on the response variable; these include privatized insurance, universal healthcare, and the availability of a public option.

Next, a time series analysis can be conducted, as the Open Sourcing Mental Illness organization has conducted this survey multiple times since 2014. In this time series analysis, one can observe how predictors of the response variable change over time with respect to feature importance. More specifically, an analysis with respect to COVID-19 can be conducted, and the transition to forced remote work due to the pandemic can be analyzed with respect to coping with a mental illness. Finally, feature engineering and hypertuning should be performed with respect to 'Some' and 'No' classes in addition to the 'Yes' class, and different models utilizing multi-class classification should be probed to test the same hypothesis.

b) Conclusion

A 2014 dataset consisting of employee responses to a survey conducted by the Open Sourcing Mental Illness organization was cleaned and pre-processed; the survey intended to measure attitudes towards mental health and frequency of mental health disorders in the tech workplace. After learning about the features in the dataset and conducting an extensive literature review, a null hypothesis was created: Using Random Forest and feature importance from scikit-learn (i.e., mean decrease in Gini Index), 5 target predictors, in no particular order, will not be the most important when predicting whether employees are willing to discuss mental health issues with supervisors. The 5 target predictors of interest were determined to be obs_consequence, benefits, no_employees, leave, and remote_work. After conducting exploratory data analysis, in which feature distributions with respect to the response variable (supervisor) were analyzed and highly correlated features were removed, feature engineering was performed. Feature engineering not only allowed for the elucidation of the tradeoff between categorical and continuous features in the dataset, but also led the Random Forest Classifier model to be hypertuned. The best model was then tested on each class of the response variable ('Yes', 'Some of them', and 'No'). In the end, only one of the five features predicted to be most important with respect to Mean Decrease in Gini Index was in the actual top five most important features: leave. Thus, although the model resulted in substantially higher class prediction accuracies during testing relative to the base rates, the null hypothesis was not able to be rejected.


VII. References

  1. NIMH » Mental Illness. Accessed January 6, 2021. https://www.nimh.nih.gov/health/statistics/mental-illness.shtml
  2. Depression. Accessed January 6, 2021. https://www.who.int/en/news-room/fact-sheets/detail/depression
  3. Mental Health By the Numbers | NAMI: National Alliance on Mental Illness. Accessed January 6, 2021. https://www.nami.org/mhstats
  4. Fuhrer R, Keyes KM. Population Mental Health in the 21st Century: Time to Act. Am J Public Health. 2019;109(Suppl 3):S152-S153. doi:10.2105/AJPH.2019.305200
  5. Pescosolido BA. The Public Stigma of Mental Illness: What Do We Think; What Do We Know; What Can We Prove? J Health Soc Behav. 2013;54(1):1-21. doi:10.1177/0022146512471197
  6. Oexle N, Müller M, Kawohl W, et al. Self-stigma as a barrier to recovery: a longitudinal study. Eur Arch Psychiatry Clin Neurosci. 2018;268(2):209-212. doi:10.1007/s00406-017-0773-2
  7. Mental health: Why is it hard to talk about problems? - CBBC Newsround.https://www.bbc.co.uk/newsround/46425575. Accessed January 6, 2021.
  8. Disability in the Workplace: A Unique and Variable Identity - Alecia M. Santuzzi, Pamela R. Waltz, 2016. Accessed January 6, 2021. https://journals.sagepub.com/doi/full/10.1177/0149206315626269
  9. Mental Health in the Workplace. Published April 26, 2019. Accessed January 6, 2021. https://www.cdc.gov/workplacehealthpromotion/tools-resources/workplace-health/mental-health/index.html
  10. COVID-19’s Impact on Mental Health and Workplace Well-being. NIHCM. Accessed January 6, 2021. https://nihcm.org/publications/covid-19s-impact-on-mental-health-and-workplace-well-being
  11. Mental illness - Symptoms and causes. Mayo Clinic. Accessed January 6, 2021. https://www.mayoclinic.org/diseases-conditions/mental-illness/symptoms-causes/syc-20374968
  12. Parcesepe AM, Cabassa LJ. Public Stigma of Mental Illness in the United States: A Systematic Literature Review. Adm Policy Ment Health. 2013;40(5). doi:10.1007/s10488-012-0430-z
  13. Agovino T, Agovino T. Mental Illness and the Workplace. SHRM. Published August 3, 2019. Accessed January 6, 2021. https://www.shrm.org/hr-today/news/all-things-work/pages/mental-illness-and-the-workplace.aspx
  14. Four Ways Culture Impacts Mental Health. Mental Health First Aid. Published July 11, 2019. Accessed January 6, 2021. https://www.mentalhealthfirstaid.org/2019/07/four-ways-culture-impacts-mental-health/
  15. Rathore S. Tech Industry Sees Major Growth in New Companies During 2019. Small Business Trends. Published March 5, 2020. Accessed January 6, 2021. https://smallbiztrends.com/2020/03/2019-industry-growth-statistics.html
  16. Sado M, Shirahase J, Yoshimura K, et al. Predictors of repeated sick leave in the workplace because of mental disorders. Neuropsychiatr Dis Treat. 2014;10:193-200. doi:10.2147/NDT.S55490
  17. Publishing HH. Mental health problems in the workplace. Harvard Health. Accessed January 6, 2021. https://www.health.harvard.edu/newsletter_article/mental-health-problems-in-the-workplace
  18. Workplace Mental Health - Working Remotely During COVID-19. Accessed January 6, 2021. http://www.workplacementalhealth.org/Employer-Resources/Working-Remotely-During-COVID-19
  19. OSMI Home :: Open Sourcing Mental Illness - Changing how we talk about mental health in the tech community - Stronger Than Fear. Accessed January 7, 2021. https://osmihelp.org/
  20. Mental Health in Tech Survey. Accessed January 6, 2021. https://kaggle.com/osmi/mental-health-in-tech-survey
  21. Categorical Encoding | One Hot Encoding vs Label Encoding. Accessed January 7, 2021. https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
  22. Chi Square Test. Accessed January 7, 2021. https://www.statisticssolutions.com/chi-square-test/
  23. Random forests - classification description. Accessed January 7, 2021. https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  24. Random Forest Classifier. Accessed January 7, 2021. file:///C:/Users/Kunaal/Zotero/storage/GRIIG3PQ/sklearn.ensemble.RandomForestClassifier.html