Diabetes is one of the deadliest diseases in the world. It is not only a disease in itself but also a cause of other conditions such as heart attack, blindness, and kidney disease. In the usual diagnostic process, patients must visit a diagnostic center, consult their doctor, and wait a day or more for their reports. Moreover, they have to pay every time they want a diagnosis report.

With the rise of machine learning approaches, this problem can be addressed. We have developed a system using data mining that predicts whether a patient has diabetes. Furthermore, predicting the disease early allows patients to be treated before their condition becomes critical. Data mining can extract hidden knowledge from large amounts of diabetes-related data, and it therefore plays a significant role in diabetes research, now more than ever. The aim of this research is to develop a system that can predict the diabetic risk level of a patient with higher accuracy. The system is based on three classification methods: Decision Tree, Naïve Bayes, and Support Vector Machine.

Currently, the models give accuracies of 84.6667%, 76.6667%, and 77.3333% for Decision Tree, Naïve Bayes, and SMO Support Vector Machine, respectively. These results have been verified using Receiver Operating Characteristic curves. The developed ensemble method uses the votes given by the individual algorithms to produce the final result. This voting mechanism reduces algorithm-dependent misclassifications and helps to obtain a more accurate prediction of the disease. We used the Weka data mining toolkit for data preprocessing and experimental analysis. The results show a significant improvement in the accuracy of the ensemble method compared to other existing methods.

**Methodology**

These algorithms do not work alone: we have developed an ensemble method that uses the votes given by the individual algorithms to produce the final result. The system accepts a final result only when at least two of the three models give the same prediction; that is, it returns the majority decision. This voting mechanism reduces algorithm-dependent misclassifications and helps to obtain a more accurate prediction of the disease.
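The voting step can be illustrated with a minimal sketch. The function name `majority_vote` and the example labels are assumptions made for illustration; this is not the system's actual code.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by a strict majority of classifiers.

    `predictions` holds one label per classifier, e.g. the outputs of the
    Decision Tree, Naive Bayes, and SVM models for a single patient.
    (Hypothetical helper, not taken from the original system.)
    """
    label, count = Counter(predictions).most_common(1)[0]
    # With three binary classifiers, at least two always agree,
    # so this guard only matters for other configurations.
    if count * 2 <= len(predictions):
        raise ValueError("no majority among classifiers")
    return label

votes = ["tested_positive", "tested_negative", "tested_positive"]
print(majority_vote(votes))  # tested_positive
```

Using an odd number of binary classifiers guarantees that a majority always exists, which is one plausible reason for combining exactly three models.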

**A. Decision Tree J48 Algorithm**

A decision tree is a tree structure in the form of a flowchart. It can be used for classification and prediction, with a representation based on nodes and branches. The root and internal nodes are test cases, and the leaf nodes are class variables. In order to classify a new item, the algorithm first builds a decision tree from the attribute values of the available training data set. Each node of the tree is chosen by selecting the attribute with the highest information gain. If an attribute gives an unambiguous end result, that branch is terminated and the target value is assigned to it. The following diagram shows a sample decision tree.
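As a rough illustration of the information gain criterion, the sketch below computes the gain of a nominal attribute on a toy sample. The function names and data are illustrative assumptions. J48 is Weka's implementation of C4.5, which further normalizes this quantity into a gain ratio; only the plain information gain described in the text is shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in entropy from splitting `labels` by a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data (hypothetical): one attribute separates the classes perfectly,
# the other carries no information about the class.
labels = ["pos", "pos", "neg", "neg"]
print(information_gain(["high", "high", "low", "low"], labels))  # 1.0
print(information_gain(["a", "b", "a", "b"], labels))            # 0.0
```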

A 12-fold cross-validation technique was used to build the model. It works as follows:

- Break the data into 12 sets of size n/12.
- Train on 11 sets and test on the remaining 1.
- Repeat 12 times and take the mean accuracy.

In 12-fold cross-validation, the original sample is randomly partitioned into 12 equal-sized subsamples. A single subsample is then retained as validation data for testing the model, and the remaining 12 − 1 = 11 subsamples are used as training data.
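The partitioning and averaging can be sketched in a few lines. The helper names, the fixed random seed, and the `evaluate` callback (standing in for training and scoring a model) are assumptions for illustration, not part of the original system.

```python
import random

def k_fold_indices(n, k=12, seed=0):
    """Randomly partition the indices 0..n-1 into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list keeps fold sizes within one of each other.
    return [indices[i::k] for i in range(k)]

def cross_validate(n, evaluate, k=12, seed=0):
    """Mean score over k rounds; each fold is the test set exactly once.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains
    a model on the training indices and returns its accuracy on the test ones.
    """
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

folds = k_fold_indices(600)
print(len(folds), len(folds[0]))  # 12 50
```

With 600 instances, each of the 12 folds holds 50 instances, so every round trains on 550 and tests on 50.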

**B. Naïve Bayes Algorithm**

This algorithm is based on Bayes' rule of conditional probability. It uses all the attributes contained in the data and analyzes them individually, as though they were equally important and independent of each other. The build process for Naïve Bayes is parallelized. Because it avoids complex iterative parameter estimation, it can be applied to large datasets in real time. We used a 70:30 percentage split to build the model with this algorithm: 70 percent of the data set was used to train the model and the other 30 percent was used to test it.
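To make the independence assumption and the 70:30 split concrete, here is a minimal sketch of a Gaussian Naïve Bayes classifier for numeric attributes. It assumes each attribute is normally distributed within a class; the function names, the small variance floor, and the toy data are illustrative choices, not details from the original system.

```python
import math
import random
from collections import defaultdict

def percentage_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them, e.g. 70% training and 30% testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-attribute (mean, variance) for each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # The 1e-9 floor is an assumption that avoids division by zero.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(model, features):
    """Pick the class with the highest log posterior.

    Attributes are treated as independent, so their log likelihoods add up.
    """
    def log_posterior(prior, stats):
        score = math.log(prior)
        for value, (mean, var) in zip(features, stats):
            score += (-0.5 * math.log(2 * math.pi * var)
                      - (value - mean) ** 2 / (2 * var))
        return score
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data (hypothetical values, not from the diabetes dataset).
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]]
y = ["neg", "neg", "pos", "pos"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [3.1, 4.0]))  # pos
```

Training needs only one pass over the data to collect means and variances, which is why no iterative parameter estimation is required.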

**C. SMO (Sequential Minimal Optimization)**

This algorithm is commonly used to solve the quadratic programming problems that arise during the training of Support Vector Machines (SVMs). SMO uses heuristics to partition the training problem into smaller problems that can be solved analytically. The implementation replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default, which helps to speed up the training process. Here, too, a 70:30 percentage split was used to train and test the model.
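The two preprocessing steps mentioned here, replacing missing values and normalizing attributes, can be sketched as follows. The sketch assumes missing numeric values are replaced by the column mean and that normalization rescales each attribute to the range [0, 1]; the function names and sample values are hypothetical.

```python
def replace_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def min_max_normalize(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical attribute column with one missing reading.
glucose = [100, None, 140, 180]
filled = replace_missing_with_mean(glucose)
print(min_max_normalize(filled))  # [0.0, 0.5, 0.5, 1.0]
```

Putting all attributes on the same scale keeps attributes with large numeric ranges from dominating the SVM's distance computations.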

**D. Dataset Used**

The data were obtained from the Pima Indians Diabetes Database and the National Institute of Diabetes and Digestive and Kidney Diseases.

**E. Procedure**

- Load the previous datasets into the system.
- Pre-process the data using the integrated WEKA tool.
- Perform the following operations on the dataset:

a. Replace missing values.

b. Normalize values.

- The user enters data into the system in order to determine whether he or she has the disease.
- Build three models using the J48 Decision Tree, Naïve Bayes, and SMO Support Vector Machine algorithms and train them on the data set.
- Test the dataset using the three models.
- Get the evaluation results.
- Collect the predicted votes from all classifiers and give the diagnostic result.

**Results and Discussion**

| Evaluation results | J48 Decision Tree | Naïve Bayes | SMO Support Vector Machine |
|---|---|---|---|
| Predicted results | tested_positive | tested_positive | tested_positive |
| Correctly classified instances | 508 (84.6667%) | 460 (76.6667%) | 464 (77.3333%) |
| Incorrectly classified instances | 92 (15.3333%) | 140 (23.3333%) | 136 (22.6667%) |
| Kappa statistic | 0.6343 | 0.4718 | 0.4593 |
| Mean absolute error | 0.2225 | 0.2824 | 0.2267 |
| Root mean squared error | 0.3335 | 0.4156 | 0.4761 |
| Total number of instances | 600 | 600 | 600 |
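For readers unfamiliar with these measures, the sketch below shows how accuracy and the kappa statistic are derived from a confusion matrix. The source does not report its confusion matrices, so the matrix used here is hypothetical and the function names are illustrative; only the 508 out of 600 figure comes from the results above.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def cohen_kappa(matrix):
    """Agreement between predicted and actual classes, corrected for chance."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    # Chance agreement: product of row and column totals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Reported J48 accuracy: 508 correctly classified out of 600 instances.
print(f"{508 / 600:.4%}")  # 84.6667%

# Hypothetical 2x2 confusion matrix (rows: actual, columns: predicted).
example = [[40, 10],
           [5, 45]]
print(round(accuracy(example), 4))     # 0.85
print(round(cohen_kappa(example), 4))  # 0.7
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by chance, which explains why the kappa values in the results sit well below the corresponding accuracies.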

**Conclusion**

Considering these results, we can infer that every model has an accuracy of more than 70%, which is high. Likewise, the voting process across all the algorithms helps to ensure that the final diagnosis is highly accurate. In addition, we plan to gather more data from various districts across the country and develop a more accurate and more general predictive model.
