The data comes from https://www.kaggle.com/uciml/pima-indians-diabetes-database. The aim of the task was to use machine learning to predict accurately whether each patient in the dataset has diabetes.
The overall goal for this Kaggle dataset was to improve on previous attempts by using more advanced machine learning algorithms. To that end, the project used the following algorithms:
- K-Nearest Neighbours (KNN)
- Decision Tree
- Random Forest
- Support Vector Machines (SVM)
- Artificial Neural Network (ANN)
- XGBoost
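As a rough illustration of how such a comparison can be set up, here is a minimal sketch using scikit-learn. It is not the code from the report: the data is a synthetic stand-in generated with `make_classification` (768 rows and 8 features, matching the shape of the Pima dataset), the hyperparameters are arbitrary, and XGBoost is left out so that the example only depends on scikit-learn. The `compare_models` helper is a hypothetical name introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# ASSUMPTION: synthetic stand-in data, not the real Pima dataset.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=0
)

# ASSUMPTION: illustrative hyperparameters, not the ones used in the report.
# Distance- and gradient-based models are wrapped in a scaling pipeline.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(models, X, y)
for name, accuracy in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {accuracy:.3f}")
```

Cross-validation is used here instead of a single train/test split because, on a dataset this small, a single split can make the ranking of models depend heavily on which rows land in the test set.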
The project also applied several data analysis steps: outlier detection and normalisation, checking for redundant features, feature importance, and feature selection. These helped identify the important attributes and whether further cleaning was needed.
The full report can be found on my GitHub: https://github.com/TingHanGan/pima-indians-diabetes-prediction.
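To make the outlier detection and normalising steps concrete, here is a small sketch using only the Python standard library. The specific techniques are assumptions, since they are not named above: outliers are flagged with the common 1.5 × IQR rule, and values are rescaled with min-max normalisation. The function names and the `glucose` values are made up for illustration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them count as outliers."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it falls outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v < low or v > high for v in values]


def min_max_normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# ASSUMPTION: made-up readings, not taken from the dataset.
glucose = [85, 89, 90, 95, 100, 105, 110, 115, 120, 300]

flags = flag_outliers(glucose)
kept = [v for v, is_outlier in zip(glucose, flags) if not is_outlier]
scaled = min_max_normalise(kept)

print("outliers:", [v for v, is_outlier in zip(glucose, flags) if is_outlier])
print("scaled:", [round(v, 2) for v in scaled])
```

Removing outliers before min-max scaling matters because a single extreme value would otherwise compress all the remaining values into a narrow band near zero.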
Of the six models, Random Forest and XGBoost performed best, each with around 80% accuracy, while the ANN performed worst at ~70%. One likely reason is the size of the dataset: an ANN generally needs a large amount of training data to produce an accurate model, and this dataset may be too small for it to learn from effectively.