William Goble

Predicting Heart Attacks Using Machine Learning

During graduate school, one of my courses tackled the classic machine learning problem of predicting whether a patient will have a heart attack. The assignment was to split the given dataset into train and test sets and then design a formula that predicts whether or not a person will have a heart attack. The catch our professor added was that we were not allowed to use preexisting models. In effect, we had to build something like a Random Forest Classifier by hand, without actually using one. After writing the algorithm, we had to measure its accuracy on the test set. My algorithm first divided the data by the two gender options and then considered factors such as chest pain, cholesterol level, and the number of defects or vessels each patient had. Structurally, it was a detailed if/else block. It reached an accuracy of 76.19% on the test set.
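To make the hand-written approach concrete, here is a minimal sketch of what such an if/else classifier and its accuracy check could look like. The field names (`sex`, `chest_pain`, `cholesterol`, `vessels`), the thresholds, and the four sample patients are hypothetical placeholders, not the rules or data from my assignment.

```python
# Hypothetical rule-based classifier: every field name and threshold below is
# an illustrative assumption, not the rule set from the original assignment.

def predict_heart_attack(patient):
    """Return 1 if the rules flag the patient as at risk, else 0."""
    if patient["sex"] == "male":
        if patient["chest_pain"] and patient["cholesterol"] > 240:
            return 1
        elif patient["vessels"] >= 2:
            return 1
        else:
            return 0
    else:
        if patient["chest_pain"] and patient["vessels"] >= 1:
            return 1
        elif patient["cholesterol"] > 280:
            return 1
        else:
            return 0


def accuracy(patients, labels):
    """Fraction of patients whose predicted label matches the true label."""
    correct = sum(predict_heart_attack(p) == y for p, y in zip(patients, labels))
    return correct / len(labels)


# Tiny made-up test set, only to show how the accuracy is computed.
test_patients = [
    {"sex": "male", "chest_pain": True, "cholesterol": 260, "vessels": 0},
    {"sex": "male", "chest_pain": False, "cholesterol": 200, "vessels": 0},
    {"sex": "female", "chest_pain": True, "cholesterol": 210, "vessels": 1},
    {"sex": "female", "chest_pain": False, "cholesterol": 250, "vessels": 0},
]
test_labels = [1, 0, 1, 1]

print(accuracy(test_patients, test_labels))
```

The rules split on gender first and then branch on the other factors, which mirrors the structure described above. The accuracy is simply the share of test patients the rules label correctly.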
After graduating, I decided to revisit this assignment and see if I could get a better score using the Scikit-Learn RandomForestClassifier. This post walks through my workflow with this dataset and classifier. First, we load the data and examine its contents using methods like head(), info(), and describe(). Doing this gave me an idea of the type of data I was working with and allowed me to identify the target column I wanted to predict. Next, we create our train and test sets and our model.
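As a rough illustration of that inspection step, the sketch below builds a small stand-in DataFrame and runs the same three methods on it. Only the `Result` column name comes from my dataset; the other column names and every value here are made up for the example.

```python
import pandas as pd

# Made-up stand-in for the real dataset; only the "Result" column name
# comes from the post, the other columns and all values are assumptions.
health = pd.DataFrame(
    {
        "Age": [63, 45, 58, 39, 51],
        "Cholesterol": [233, 250, 204, 199, 276],
        "ChestPain": [1, 0, 1, 0, 1],
        "Result": [1, 0, 1, 0, 1],
    }
)

print(health.head())      # first rows, to see what the columns look like
health.info()             # dtypes and non-null counts (prints directly)
print(health.describe())  # summary statistics for the numeric columns

# Class balance of the target column we want to predict
result_counts = health["Result"].value_counts()
print(result_counts)
```

Checking the class balance of the target column is an extra step I find useful here, since a heavily skewed target would make a plain accuracy score harder to interpret.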

									
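## Creating train and test datasets.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# `health` is the DataFrame loaded earlier; "Result" is the target column.
X = health.drop("Result", axis=1)
y = health["Result"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Model creation
model = RandomForestClassifier(n_estimators=100)
# We can view the model's parameters (the defaults plus anything set above) with the following:
model.get_params()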

After separating the train and test datasets, we can train our model with the fit method and then predict outcomes with model.predict(X_test). With those steps done, we can measure the accuracy of our results in the following way.

								
from sklearn.metrics import accuracy_score, classification_report

model.fit(X_train, y_train)            # Train the model on the train dataset
y_predictions = model.predict(X_test)  # Predict outcomes for the test dataset

model.score(X_train, y_train)  # Score on the train dataset; typically at or near 1.0
model.score(X_test, y_test)    # Score on the test dataset; in this case it was 0.878
print(accuracy_score(y_test, y_predictions))  # Same value as model.score(X_test, y_test)

# The following includes the precision, recall, and F1-score, which is calculated from precision and recall
print(classification_report(y_test, y_predictions))
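
To show how the numbers in the classification report relate to each other, here is a small worked example that computes precision, recall, and F1-score by hand for the positive class. The labels are made up for illustration and are not predictions from my model.

```python
# Made-up labels, just to show how the report's numbers are derived.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a heart attack predictor, recall on the positive class is worth watching alongside accuracy, because a false negative means a patient at risk was missed.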