1. Introduction
Diabetes is a chronic medical condition that affects
millions of people worldwide. Its prevalence is steadily increasing, making it
a significant public health concern. Early detection and timely intervention
are crucial for effectively managing this condition and reducing the risk of
complications. In recent years, advancements in machine learning and data
analytics have provided new avenues for improving healthcare outcomes,
including the development of intelligent diagnostic systems.
Features
| Feature | Description |
| --- | --- |
| Gender | The biological sex of the individual, which can affect susceptibility to diabetes. |
| Age | Diabetes is more commonly diagnosed in older adults. Age ranges from 0 to 80 in our dataset. |
| Hypertension | A medical condition in which the blood pressure in the arteries is persistently elevated. |
| Heart Disease | Another medical condition associated with an increased risk of developing diabetes. |
| Smoking History | The patient's smoking history, which is also considered a risk factor for diabetes. |
| BMI (Body Mass Index) | A measure of body fat based on weight and height. Patients with a higher BMI are at greater risk of diabetes. |
| HbA1c Level | HbA1c (Hemoglobin A1c) level is a measure of a person's average blood sugar level over the past 2-3 months. |
| Blood Glucose Level | The amount of glucose in the bloodstream at a given time. |
| Diabetes | The target variable: whether the patient has diabetes (0 is negative, 1 is positive). |
2. Data Preparation
The dataset used for diabetes prediction consists of a
diverse range of medical and demographic information obtained from patients,
accompanied by their corresponding diabetes status (positive or negative). It
encompasses several essential features, including age, gender, body mass index
(BMI), hypertension, heart disease, smoking history, HbA1c level, and blood
glucose level. By leveraging this dataset, we can construct powerful machine
learning models that aim to accurately predict the presence of diabetes based
on an individual's medical history and demographic details.
At first glance, the dataset has no null values. We will still check for placeholder entries such as 'No Info'.
| figure:1 |
Here we see summary statistics for the
numerical features. These can give us ideas for the steps that follow.
| figure: 2 |
Our numerical variables are:
- Age
- BMI
- HbA1c Level
- Blood Glucose Level
Our categorical variables are:
- Gender
- Hypertension
- Heart Disease
- Smoking History
- Diabetes
Outliers are entries that lie far above or far below the bulk of the data's distribution. We are going to detect them with a
function and then remove them.
| figure:3 |
| figure:4 |
| figure:5 |
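The post does not say which rule its outlier function uses, so the following is only a minimal sketch that assumes the common interquartile range (IQR) rule. The helper names `iqr_bounds` and `drop_outliers`, the factor `k=1.5`, and the sample values are all illustrative assumptions. The column names follow the Kaggle dataset.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences for a numeric series using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(df, columns, k=1.5):
    """Keep only rows whose values fall inside the fences of every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower, upper = iqr_bounds(df[col], k)
        mask &= df[col].between(lower, upper)
    return df[mask]


# Tiny made-up sample (not real patient data); the BMI of 95.0 is the outlier.
sample = pd.DataFrame({
    "bmi": [22.0, 24.5, 27.3, 25.1, 26.0, 23.8, 95.0],
    "blood_glucose_level": [90, 100, 126, 140, 130, 85, 155],
})
cleaned = drop_outliers(sample, ["bmi", "blood_glucose_level"])
print(len(sample), "->", len(cleaned))
```

A larger `k` keeps more rows and a smaller `k` removes more, so the threshold is a judgment call rather than a fixed property of the data.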
Missing Values
We never want missing values in our data. We handle them either by dropping the affected rows or by filling them in. As mentioned before, our data has no NaN values. However, smoking history has a category called 'No Info'. I can't fill it with a mean or any other number, since it is a categorical feature. I can't reliably guess it either, because the other features collected in our data do not seem to predict it.
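As a rough illustration of the two checks described here, the sketch below counts true NaN values and then counts the 'No Info' placeholder separately. The small DataFrame is made up for the example, and the column names are assumed to match the Kaggle dataset.

```python
import pandas as pd

# Made-up sample; the real dataset would be loaded from the Kaggle CSV instead.
sample = pd.DataFrame({
    "age": [54.0, 33.0, 61.0, 45.0],
    "smoking_history": ["never", "No Info", "former", "No Info"],
})

# True NaN values per column (all zero in this sample).
nan_counts = sample.isnull().sum()

# Placeholder entries that are not NaN but still carry no information.
no_info_count = (sample["smoking_history"] == "No Info").sum()
no_info_share = no_info_count / len(sample)

print(nan_counts)
print(no_info_count, no_info_share)
```

Because `isnull()` only sees real NaN values, a string placeholder like 'No Info' has to be counted explicitly, as in the second check.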
3. Data Pre-Processing
In this section, we compare each of the features introduced above against the
diabetes outcome, and then convert the categorical variables to numerical ones.
| figure: 6 |
According to this
table, male patients have diabetes slightly more often than female
patients, but the difference is very small. This suggests that gender has
little association with diabetes in this data.
| figure: 7 |
According to this
table, hypertension has an effect on diabetes. The effect is small, but it
is there.
| figure: 8 |
The effect of
heart disease on diabetes is almost the same as that of hypertension.
| figure: 9 |
At first, we thought smoking would affect diabetes, but our data shows little effect: every smoking history category has almost the same diabetes rate. Also, as you can see, smoking history has a category called 'No Info'. We note this for later.
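Comparisons like the ones in these tables can be produced with a group-by. The sketch below is one possible way to do it, not necessarily how the tables above were built. The helper name `diabetes_rate_by` and the sample rows are illustrative assumptions.

```python
import pandas as pd

# Made-up sample with the dataset's assumed column names.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "hypertension": [0, 1, 0, 1, 0, 0],
    "diabetes": [0, 1, 0, 1, 1, 0],
})


def diabetes_rate_by(df, feature):
    """Share of diabetes-positive patients within each category of `feature`."""
    return df.groupby(feature)["diabetes"].mean()


rate_by_gender = diabetes_rate_by(sample, "gender")
rate_by_hypertension = diabetes_rate_by(sample, "hypertension")
print(rate_by_gender)
print(rate_by_hypertension)
```

Since the target is coded as 0 and 1, the mean of `diabetes` within a group is simply the fraction of positive patients in that group.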
We have to convert categorical features into numerical features. Most machine learning algorithms, including the scikit-learn models used here, operate only on numbers, so data has to be turned into numbers before we can apply machine learning to it.
There are several ways of doing this: get dummies (one-hot encoding), a label encoder, or a list comprehension. I am going to use LabelEncoder from scikit-learn.
| figure: 10 |
- Smoking History:
  - No Info (35813 records) is labeled as 0.
  - never (35071 records) is labeled as 4.
  - former (9344 records) is labeled as 3.
  - current (9275 records) is labeled as 1.
  - not current (6443 records) is labeled as 5.
  - ever (4002 records) is labeled as 2.
| figure: 11 |
- Gender:
  - Female is labeled as 0.
  - Male is labeled as 1.
  - Other is labeled as 2.
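The label assignments listed above are consistent with how scikit-learn's `LabelEncoder` works: it sorts the distinct values and numbers them from 0, with uppercase letters sorting before lowercase ones. The sketch below reproduces that on a made-up six-row sample. The `mappings` dictionary is an illustrative helper, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample containing every category named in the post.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Other", "Female"],
    "smoking_history": ["No Info", "never", "former", "current", "not current", "ever"],
})

mappings = {}
for col in ["gender", "smoking_history"]:
    encoder = LabelEncoder()
    sample[col] = encoder.fit_transform(sample[col])
    # classes_ is sorted, so the position of each label is its integer code.
    mappings[col] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings["gender"])
print(mappings["smoking_history"])
```

Note that these codes imply an order (for example, never = 4 is "greater than" current = 1) that has no real meaning for smoking history. One-hot encoding with `pd.get_dummies` avoids that at the cost of more columns.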
Our data
preprocessing is done here, since this dataset needs no further preprocessing.
We can now start applying our ML models to this dataset.
4. Model design
| figure: 12 |
- We are going to apply a few different machine learning models from the scikit-learn library. We'll use the following:
  - Logistic Regression
  - Random Forest Classifier
  - KNeighbors Classifier
  - Decision Tree Classifier
  - Support Vector Machines (SVC)
- We are also going to tune hyperparameters to get the best accuracy scores.
- Before we start, we normalize our data, since several of these models perform better on scaled inputs. Normalization rescales the numerical features to a comparable range so that features with large values (such as blood glucose level) do not dominate features with small values (such as HbA1c level).
| figure: 13 |
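The post does not name the scaler it uses, so the sketch below assumes min-max scaling to the range 0 to 1 with scikit-learn's `MinMaxScaler`. The random data, the 80/20 split, and the `random_state` values are likewise assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data with very different ranges per column (not real patients).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(0, 80, size=200),    # age-like values
    rng.uniform(10, 60, size=200),   # BMI-like values
    rng.uniform(80, 300, size=200),  # blood-glucose-like values
])
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the same min/max on test rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Fitting the scaler on the training split only keeps information about the test rows from leaking into training. `StandardScaler` would be an equally reasonable choice here.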
7.1 Logistic Regression
| figure: 14 |
| figure:15 |
We have acquired an accuracy score of 96%, which is great. Let's try other models to see if we can get a better score.
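For readers who want to reproduce this step, here is a minimal sketch of fitting and scoring a logistic regression. It runs on synthetic data from `make_classification`, so its score will not match the 96% reported above. The parameter choices (`max_iter=1000`, the split size, the seeds) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed diabetes features and labels.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(f"accuracy: {accuracy:.3f}")
print(matrix)
```

The confusion matrix is printed alongside accuracy because it shows how the errors split between false positives and false negatives.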
7.2 Decision Tree Classifier
| figure: 16 |
7.3. K Neighbors Classification
| figure: 17 |
The score here is better than the Decision Tree's, which is good. We can improve the accuracy further by tuning the parameters.
| figure: 18 |
As seen in the plot, we can acquire an accuracy score of 96.2% by picking the K value as 11.
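A K sweep like the one plotted can be sketched as a simple loop. This is an illustration on synthetic data, so the best K it finds will generally differ from 11. The range of 1 to 20 and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled diabetes features and labels.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one model per K and record its accuracy on the held-out split.
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Picking K by test-set accuracy tends to give an optimistic estimate. Cross-validation on the training data, for example with scikit-learn's `GridSearchCV`, is a common alternative.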
| figure: 19 |
7.4. Random Forest Classification
| figure: 20 |
Here comes our best score: 97%. This is the best accuracy we've acquired so far.
7.5. Support Vector Machines
| figure: 21 |
- We have done a full examination of a dataset related to diabetes.
- We have described the features of the dataset.
- We have done basic data analysis.
- We have detected the outliers with a function that we coded ourselves and removed them to get a better machine learning score.
- We have visualized our data for better inspection and decision.
- We have encoded the categorical features as numbers so that the models can process them, and applied this encoding to our dataset.
- We have applied 5 different machine learning algorithms to our data and found out the best one for our dataset.
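The comparison summarized in the last point can be sketched as one loop over the five model types. The code below uses synthetic data and default hyperparameters (apart from the assumed `max_iter` and `random_state` values), so its ranking is not expected to match the results reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed diabetes features and labels.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "K Neighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=1),
    "SVC": SVC(),
}

# Train every model on the same split and record its test accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

best_model = max(results, key=results.get)
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Using one shared train/test split for all models keeps the comparison fair, since every score is measured on the same held-out rows.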
11. Discussion
The development of a Diabetes Identifier using a binary classification system presents an exciting opportunity to leverage the power of machine learning in improving healthcare outcomes. By automating the diagnosis process, we can enhance accuracy, efficiency, and accessibility, ultimately contributing to the early identification and effective management of diabetes.
Machine learning is reshaping the medical landscape, bringing about significant advancements in diagnosis, treatment, predictive analytics, and remote patient monitoring. By leveraging machine learning algorithms in conjunction with healthcare data, the healthcare industry stands to benefit from improved patient outcomes, reduced costs, and accelerated medical research. However, it is crucial to address ethical concerns and data privacy issues to fully unlock the potential of machine learning in medicine. Collaboration between healthcare professionals and data scientists, coupled with ongoing research, holds the key to a future where healthcare is precise, accessible, and highly effective for everyone.
12. References
[1] https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/code
Group Members
N.M.C.Samarasekara - ASP/18/19/151 - 4472
D.S.Wanigathunga - ASP/18/19/155 - 4474
