
Building a Diabetes Identifier Using a Binary Classification System


1. Introduction

Diabetes is a chronic medical condition that affects millions of people worldwide. Its prevalence is steadily increasing, making it a significant public health concern. Early detection and timely intervention are crucial for effectively managing this condition and reducing the risk of complications. In recent years, advancements in machine learning and data analytics have provided new avenues for improving healthcare outcomes, including the development of intelligent diagnostic systems.

Features

Gender

Gender refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes.

Age

Age is an important factor because diabetes is more commonly diagnosed in older adults. Age ranges from 0 to 80 in our dataset.

Hypertension

Hypertension is a medical condition in which the blood pressure in the arteries is persistently elevated.

Heart Disease

Heart disease is another medical condition that is associated with an increased risk of developing diabetes.

Smoking History

This feature records the patient's smoking history, which is also considered a risk factor for diabetes.

BMI (Body Mass Index)

BMI (Body Mass Index) is a measure of body fat based on weight and height. A higher BMI is associated with a higher risk of diabetes.

HbA1c Level

HbA1c (Hemoglobin A1c) level is a measure of a person's average blood sugar level over the past 2-3 months.

Blood Glucose Level

Blood glucose level refers to the amount of glucose in the bloodstream at a given time.

Diabetes

This is the target variable: whether the patient has diabetes. 0 is negative and 1 is positive.

2. Data Preparation

The dataset used for diabetes prediction consists of a diverse range of medical and demographic information obtained from patients, accompanied by their corresponding diabetes status (positive or negative). It encompasses several essential features, including age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. By leveraging this dataset, we can construct powerful machine learning models that aim to accurately predict the presence of diabetes based on an individual's medical history and demographic details.

At first look, we can see that we have no null data. We will still check for placeholder values such as 'No Info'.
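
The two checks described here can be sketched as follows. This is a minimal illustration on a tiny made-up sample, not the real dataset; the column names (`age`, `smoking_history`, `diabetes`) and the exact placeholder spelling 'No Info' are assumptions.

```python
import pandas as pd

# Tiny hypothetical sample standing in for the real dataset (values are made up).
sample = pd.DataFrame({
    "age": [54.0, 36.0, 80.0, 28.0],
    "smoking_history": ["never", "No Info", "current", "No Info"],
    "diabetes": [0, 0, 1, 0],
})

# Count true missing values (NaN) per column.
null_counts = sample.isnull().sum()

# Count placeholder entries that isnull() cannot see.
no_info_count = int((sample["smoking_history"] == "No Info").sum())

print(null_counts)
print("Rows with 'No Info':", no_info_count)
```

A placeholder string such as 'No Info' is an ordinary value to pandas, which is why it needs its own check.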

figure:1

Here we see the description of numerical features. These might give us ideas about our future work.

figure: 2

Our numerical variables are:

  • Age
  • BMI
  • HbA1c Level
  • Blood Glucose Level

Our categorical variables are:

  • Gender
  • Hypertension
  • Heart Disease
  • Smoking History
  • Diabetes

Outliers are entries that lie far above or far below the bulk of the data. We are going to detect them with a function and then remove them.

figure:3

figure:4

figure:5
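
One common way to write such an outlier check is the interquartile range (IQR) rule. The sketch below assumes that rule and a multiplier of 1.5; the function shown in the figures may use a different criterion. The function name `iqr_outlier_mask` and the sample BMI values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)

# Made-up BMI values with one obviously extreme entry.
bmi = pd.Series([22.1, 24.5, 27.3, 25.0, 23.8, 26.4, 95.7], name="bmi")

mask = iqr_outlier_mask(bmi)
cleaned = bmi[~mask]  # keep only the rows inside the fences

print("Outliers found:", int(mask.sum()))
print(cleaned.tolist())
```

Returning a mask rather than dropping rows inside the function keeps the decision to delete separate from the detection step.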

Missing Values 

We never want missing values in our data. We either drop them or fill them in. As mentioned before, our data has no NaN values, but smoking history has a category called 'No Info'. Because it is a categorical feature, we cannot fill it with a mean or any other numeric statistic. We also cannot reliably guess it, because none of the other features collected in our data appears to predict it.

3. Data Pre-Processing

In this section, we convert categorical variables to numerical ones and compare the features described above against the diabetes label.

figure: 6

According to this table, male patients have diabetes slightly more often than female patients, but the difference is small. In this data, gender appears to have little association with diabetes.
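
A comparison table of this kind can be produced with a group-by on the label. The sketch below uses made-up rows, so its numbers say nothing about the real data; the column names `gender` and `diabetes` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "diabetes": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Because the label is 0/1, its mean per group is the share of positive cases.
rate_by_gender = sample.groupby("gender")["diabetes"].mean()

print(rate_by_gender)
```

The same pattern applies to hypertension, heart disease, and smoking history by changing the grouping column.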

figure: 7

According to this table, hypertension has an effect on diabetes, although a small one.

figure: 8

The effect of Heart Disease on Diabetes is almost the same as Hypertension.

figure: 9

At first, we expected smoking to affect diabetes, but our data suggests little effect, since every smoking-history category shows almost the same diabetes rate. Also, as you can see, smoking history has a category called 'No Info'. We note this for later.

We have to convert categorical features into numerical features, because the machine learning models used here operate on numbers only. Once the data is numeric, we can apply machine learning to it.

There are several ways of doing this, such as pandas get_dummies, a label encoder, or a list comprehension. I am going to use LabelEncoder from scikit-learn.

figure: 10

  • Smoking History:
    • No Info (35813 entries) is labeled as 0.
    • never (35071 entries) is labeled as 4.
    • former (9344 entries) is labeled as 3.
    • current (9275 entries) is labeled as 1.
    • not current (6443 entries) is labeled as 5.
    • ever (4002 entries) is labeled as 2.

figure: 11

  • Gender:
    • Female is labeled as 0.
    • Male is labeled as 1.
    • Other is labeled as 2.
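
The labels listed above follow from how scikit-learn's LabelEncoder works: it sorts the distinct values and numbers them in that order, with uppercase letters sorting before lowercase ones. A minimal sketch on the six smoking-history categories, assuming these exact spellings:

```python
from sklearn.preprocessing import LabelEncoder

# The six smoking-history categories, in the order they are listed above.
categories = ["No Info", "never", "former", "current", "not current", "ever"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# Pair each category with the integer code it received.
mapping = dict(zip(categories, codes.tolist()))

print(mapping)
print(list(encoder.classes_))  # sorted order that determines the codes
```

Note that these integers carry no real ordering: 'never' being 4 and 'current' being 1 is only a side effect of sorting. Tree-based models tolerate this well, while distance-based and linear models may read it as a ranking, which is where one-hot encoding can be a safer choice.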

Our data preprocessing is done here since no more preprocessing is needed on this dataset. We can now start applying our ML models to this dataset.

4. Model design

figure: 12

  • We are going to apply a few different machine learning models from the scikit-learn library. We'll use the following:
    • Logistic Regression
    • Random Forest Classifier
    • KNeighbors Classifier
    • Decision Tree Classifier
    • Support Vector Machines (SVC)

5. Normalization, Training Set, and Test Set

  • We will also tune hyperparameters to get the best accuracy scores.
  • Before we start, we normalize our data, since many models perform better on scaled inputs. Normalization rescales the numerical features to a common range so that no feature dominates simply because of its units.

figure: 13
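
A minimal sketch of splitting and normalizing is shown below. It assumes min-max scaling and an 80/20 split, neither of which is confirmed by the text; the random feature matrix is a stand-in for the real columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data: four numeric features on very different scales.
rng = np.random.default_rng(0)
X = rng.uniform(low=[0, 10, 3.5, 80], high=[80, 60, 9.0, 300], size=(100, 4))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.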

6. Overview of the implementation platform

Anaconda Navigator is a graphical user interface (GUI) included in the Anaconda distribution, a popular Python and R programming language environment for data science and scientific computing. Anaconda Navigator provides a convenient way to manage and launch applications, environments, and packages within the Anaconda ecosystem.

Anaconda Navigator eliminates the need for complex setup and configuration by providing a user-friendly interface that allows users to access various tools and resources seamlessly. It simplifies the management of environments, which are isolated spaces where specific Python versions, libraries, and dependencies can be installed and used. This is particularly useful when working on projects with different requirements or when collaborating with others who might have different software configurations.

Anaconda Navigator includes a package manager that enables users to search, install, and update packages from a vast collection of pre-built libraries, making it effortless to add functionality to your projects. This extensive library collection encompasses popular data analysis and machine learning packages such as NumPy, Pandas, TensorFlow, and scikit-learn, among others.

One of the significant advantages of Anaconda Navigator is its integration with Jupyter Notebook. This open-source web application allows the creation and sharing of documents containing live code, visualizations, and explanatory text. Jupyter Notebook is a powerful tool for interactive data analysis and exploratory coding, and its integration with Anaconda Navigator enhances the overall development experience.

Additionally, the Anaconda distribution offers a range of other tools and utilities, including Anaconda Prompt, a command-line interface for running Python commands and managing environments. It also provides access to the Anaconda Cloud, a cloud-based platform for sharing and discovering packages, notebooks, and other resources.

In summary, Anaconda Navigator provides a user-friendly interface for managing environments, installing packages, and launching tools within the Anaconda ecosystem. It integrates seamlessly with Jupyter Notebook and offers a comprehensive collection of pre-built libraries for data analysis and machine learning. With its intuitive interface and powerful features, Anaconda Navigator is a valuable tool for data scientists, researchers, and developers working on Python-based projects.

7. Training 

7.1 Logistic Regression

figure: 14

figure:15

We have acquired an accuracy score of 96%, which is a strong start. Let's try other models to see if we can acquire a better score.
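
For readers who want the general shape of this step, here is a minimal sketch of fitting and scoring a logistic regression. It runs on synthetic data from `make_classification`, so its accuracy is unrelated to the 96% reported here; `max_iter=1000` is an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the diabetes dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows, then score on the held-out rows.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Test accuracy: {accuracy:.3f}")
```

The other classifiers in this section follow the same fit, predict, and score pattern with a different estimator class.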

7.2 Decision Tree Classifier

figure: 16

We have acquired an accuracy of 95.3%.

7.3 K-Nearest Neighbors Classification

figure: 17

We get a better score than the Decision Tree here, which is good, and we can improve the accuracy further by tuning the parameters.

figure: 18

As seen in the plot, we can acquire an accuracy score of 96.2% by picking the K value as 11.

figure: 19
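
The search over K can be sketched as a simple loop that records the test accuracy for each candidate value. The data here is synthetic and the range of 1 to 20 is an assumption, so the best K it prints need not be 11.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the diabetes dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Try each K in an assumed range and record the held-out accuracy.
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

# Pick the K with the highest recorded accuracy.
best_k = max(scores, key=scores.get)

print("Best K:", best_k, "accuracy:", round(scores[best_k], 3))
```

Choosing K by test-set accuracy tends to give an optimistic score; a separate validation split or cross-validation is a more cautious way to pick it.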

7.4 Random Forest Classification

figure: 20

Here comes our best score: 97%. This is the best accuracy we have acquired so far.

7.5 Support Vector Machines

figure: 21

  • We have done a full examination of a dataset related to diabetes.
  • We have described the features of the dataset.
  • We have done basic data analysis.
  • We have detected the outliers with a function that we've coded ourselves and got rid of these outlier values for a better machine learning score.
  • We have visualized our data for better inspection and decision.
  • We have encoded the categorical features as numbers so that the models can use them, and applied this encoding to our dataset.
  • We have applied 5 different machine learning algorithms to our data and found out the best one for our dataset.

11. Discussion

The development of a Diabetes Identifier using a binary classification system presents an exciting opportunity to leverage the power of machine learning in improving healthcare outcomes. By automating the diagnosis process, we can enhance accuracy, efficiency, and accessibility, ultimately contributing to the early identification and effective management of diabetes.

Machine learning is reshaping the medical landscape, bringing about significant advancements in diagnosis, treatment, predictive analytics, and remote patient monitoring. By leveraging machine learning algorithms in conjunction with healthcare data, the healthcare industry stands to benefit from improved patient outcomes, reduced costs, and accelerated medical research. However, it is crucial to address ethical concerns and data privacy issues to fully unlock the potential of machine learning in medicine. Collaboration between healthcare professionals and data scientists, coupled with ongoing research, holds the key to a future where healthcare is precise, accessible, and highly effective for everyone.

12. References 

[1] https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/code

Group Members

N.M.C.Samarasekara - ASP/18/19/151 - 4472 

D.S.Wanigathunga - ASP/18/19/155 - 4474



