Eyes and Naive Bayes

Dinesh Rai MD
4 min read · May 4, 2022

During medical school, I conducted some research involving the cornea and its viability as donation tissue. When a cornea is harvested from a donor, an eye bank analyzes and prepares it for further use as a research specimen or a transplant candidate. The dataset I was looking at came from a large eye bank, and it contained a great amount of detail for each cornea and its donor. We looked at three major diseases affecting the eye: diabetes, glaucoma, and cataracts (patients who had cataract surgery). Our measure of corneal quality was the endothelial cell count (ECC): endothelial cells drain water from the cornea and supply it with nutrients, functions that are vital to eye health.

Planning and Preparing the Data

During school, I found myself interested in machine learning and neural networks. I delved into various courses and found several practice problems online. Finally, I felt confident enough to apply my knowledge to the eye bank dataset. Here was my plan of attack:

  • Figure out which features in the dataset determine ECC
  • Clean the data for a machine learning algorithm
  • Create a model that can classify the quality of the cornea based on the features

Instead of considering only the three diseases above, I decided to analyze the entire column that contained the medical history. Each record contained the patient’s “recent medical history”, which was the cause of death, and “past medical history”, which contained the rest of the patient’s medical conditions and surgeries. My goal was to figure out which terms in this column correlated with “adequate” (>2000) and “inadequate” (<2000) ECC. The data required a lot of cleaning and standardizing; for example, many diseases were listed both by their full names and by their abbreviations (hypertension and HTN). I wrote a script to standardize some of the more common diseases.
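As a rough illustration, here is a minimal sketch of that kind of standardization. The column name and the abbreviation map are placeholders, not the eye bank’s actual schema, and the real script covered many more terms.

import re
import pandas as pd

# Hypothetical abbreviation map; expand common shorthand into full disease names.
ABBREVIATIONS = {
    r"\bHTN\b": "hypertension",
    r"\bDM\b": "diabetes",
    r"\bCAD\b": "coronary artery disease",
}

def standardize_history(text):
    """Expand common abbreviations and lowercase the free-text history."""
    if pd.isna(text):
        return ""
    for pattern, replacement in ABBREVIATIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text.lower()

# df is the eye bank DataFrame; "medical_history" is a placeholder column name.
df["medical_history"] = df["medical_history"].apply(standardize_history)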

Training vs Test Set

I first wanted to separate the data into training and test sets. Ultimately, the model was to be used on real data, and if a model is tested on the same data it was trained on, it can give an overly optimistic picture of how well it works. The model needed to generalize to data it had not seen before, so 20% of the data was reserved as a test set.
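A minimal sketch of that split with scikit-learn, continuing from the DataFrame above (the "ecc" column name and the 2000-cell cutoff for the label are how I framed it, but the names here are placeholders):

from sklearn.model_selection import train_test_split

# Label each cornea as adequate (1) or inadequate (0) based on its ECC.
df["adequate"] = (df["ecc"] > 2000).astype(int)

# Hold out 20% of the donors as a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    df["medical_history"], df["adequate"], test_size=0.2, random_state=42
)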

CountVectorizer

The medical history columns had to be vectorized before they could be used to train machine learning models. I used scikit-learn’s CountVectorizer to turn the text data into a special type of array. The first step in this process is assigning an integer index to every word, or “token”, that appears in the text column. Then, we count the number of times each token occurs in each donor’s medical history. Each row of counts represents one original text input (a “feature vector”, in more technical terms). For example, let’s consider the dataset below of two sentences.

I am in the United States

The United States borders Canada

The vectorized form of this dataset is below:
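A minimal sketch with CountVectorizer’s default settings, which lowercase the text and drop single-character tokens such as “I” (the exact method for listing feature names depends on the scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I am in the United States",
    "The United States borders Canada",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['am' 'borders' 'canada' 'in' 'states' 'the' 'united']
print(X.toarray())
# [[1 0 0 1 1 1 1]
#  [0 1 1 0 1 1 1]]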

Each word that appears in the dataset is thrown into a “bag of words.” Each unique word becomes a feature, i.e., a column in the output array. The numbers in the array represent the number of times each word appeared in the original text input. I then checked the number of features the eye bank medical history data would have.
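Something along these lines, fitting the vectorizer on the training histories from the split above:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

print("vectorized feature count: ", len(vectorizer.get_feature_names_out()))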

vectorized feature count:  13773

The feature count of our array is 13773. In other words, we have 13773 unique words in our dataset. If a donor’s medical history was “diabetes, acne, and hypertension”, the corresponding row would have three 1’s and 13770 0’s. That’s a lot of memory used for a relatively small amount of information. Instead, we use a sparse matrix, which stores only the nonzero entries of each vectorized instance. If diabetes were the 300th word, acne the 345th, and hypertension the 12456th, the sparse matrix would just save those positions (and their counts) in the row. We’ve now saved a huge amount of memory and, ultimately, compute time. The data is ready to be used to train models.
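CountVectorizer already returns its output in this sparse form. A quick look at what it stores for a single donor, continuing from the variables above:

# X_train_counts is a SciPy CSR sparse matrix: only nonzero entries are kept.
first_row = X_train_counts[0]
print(first_row.indices)  # column positions of the words present in this history
print(first_row.data)     # how many times each of those words appeared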

Multinomial Naive Bayes

I used the multinomial naive Bayes algorithm for the cornea classification. A naive Bayes model assumes that all of its features are conditionally independent of each other given the class. Let’s imagine that a person could have two common diseases, chronic obstructive pulmonary disease (COPD) and gastroesophageal reflux disease (GERD), and that this person is a smoker. COPD and GERD are conditionally independent given smoking status if, once we know the person is a smoker, learning that he has COPD would not change the likelihood of his having GERD, and vice versa. In other words, the chance of a smoker having GERD is the same regardless of whether he has COPD, and the chance of his having COPD is the same regardless of whether he has GERD. This may not be strictly true in reality, but the naive Bayes model assumes it is and still performs pretty well.
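Written out, for a class y (adequate or inadequate ECC) and word counts x1, …, xn, this assumption is what lets the model factor the likelihood into a simple product:

P(y | x1, …, xn) ∝ P(y) × P(x1 | y) × P(x2 | y) × … × P(xn | y)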

Now for the grand finale! Let’s train the model, test it, and calculate its accuracy.
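A minimal sketch of that step with scikit-learn’s MultinomialNB, reusing the fitted vectorizer and the train/test variables from above (names are placeholders):

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Train on the vectorized training histories.
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

# Vectorize the held-out test histories with the same fitted vectorizer, then score.
X_test_counts = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_counts)

print("Naive Bayes accuracy_score:")
print(accuracy_score(y_test, y_pred))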

Naive Bayes accuracy_score:
0.935534222483

I am pretty satisfied with this score, considering we used the default parameters for our algorithms. There’s plenty of room for improvement; if we incorporated medications and adjusted the n-gram range of our CountVectorizer, we might improve the accuracy further.
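For example, a bigram-aware vectorizer would treat two-word conditions like “heart failure” as single features. A quick, untuned sketch of that idea using a pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Count unigrams and bigrams so multi-word conditions become their own features.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))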

This was a fun experiment in machine learning, and I’m glad it was able to help me perform worthwhile research. It was one of my first forays into this exciting field.
