Disease Prediction System
A machine learning-powered health prediction system that uses three classification algorithms (Random Forest, Logistic Regression, Decision Tree) to predict diseases from patient symptoms, achieving over 90% accuracy on test data.
Overview
This project applies supervised machine learning to the healthcare domain: it predicts potential diseases from patient-reported symptoms and medical history. Three classification algorithms were trained, evaluated, and compared to select the best-performing model.
The system includes a complete ML pipeline: data preprocessing, feature engineering, model training, hyperparameter tuning, evaluation, and a user-facing prediction interface built with Streamlit.
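The preprocessing stage of such a pipeline can be sketched as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the column names (`age`, `fever`), the helper names `fit_preprocessor` and `transform`, and the choice of mean imputation are all assumptions made for the example.

```python
from statistics import mean


def fit_preprocessor(rows, numeric_cols, categorical_cols):
    """Learn imputation values and category vocabularies from training rows."""
    means = {}
    for col in numeric_cols:
        observed = [r[col] for r in rows if r.get(col) is not None]
        means[col] = float(mean(observed))
    categories = {}
    for col in categorical_cols:
        categories[col] = sorted({r[col] for r in rows if r.get(col) is not None})
    return {"means": means, "categories": categories}


def transform(row, prep):
    """Turn one record into a flat numeric feature vector."""
    vector = []
    # Numeric columns: fill missing values with the training mean.
    for col, fill in prep["means"].items():
        value = row.get(col)
        vector.append(fill if value is None else float(value))
    # Categorical columns: one-hot encode; a missing value yields all zeros.
    for col, vocab in prep["categories"].items():
        value = row.get(col)
        vector.extend(1.0 if value == cat else 0.0 for cat in vocab)
    return vector


# Hypothetical records; the column names are illustrative only.
train_rows = [
    {"age": 30, "fever": "high"},
    {"age": None, "fever": "none"},
    {"age": 50, "fever": None},
]
prep = fit_preprocessor(train_rows, ["age"], ["fever"])
features = [transform(r, prep) for r in train_rows]
```

Fitting the imputation values and category lists on the training rows only, then reusing them at prediction time, keeps the interface and the training code consistent with each other.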
Business Impact & Key Results
- High Accuracy: Achieved over 90% test accuracy across all 3 trained models (Random Forest, Logistic Regression, Decision Tree).
- Data Pipeline Efficiency: Built a robust data preprocessing pipeline that handles missing values and encodes categorical features automatically.
- Real-world Applicability: Designed the model to aid preliminary screenings, prioritizing recall to minimize false negatives for critical conditions.
Problem Statement
Early disease detection can significantly improve patient outcomes, but access to quick preliminary assessments is limited. Many patients delay seeking medical attention because initial symptom evaluation requires a doctor visit.
This system provides a preliminary screening tool that analyzes symptoms using trained ML models, helping users understand potential conditions and encouraging timely medical consultation.
Architecture
[ Streamlit Frontend ]
├─ Symptom Input Form
└─ Results Dashboard
│ (Feature Vectors)
▼
[ ML Prediction Engine ]
├─ Random Forest
├─ Logistic Regression
└─ Decision Tree Classifier
▲ (Trained Models)
│
[ Data Preprocessing Pipeline ]
├─ Missing Value Imputation
├─ Feature Encoding (One-Hot)
└─ SMOTE (Class Balancing)
▲
[ Dataset (10,000+ Records) ]
Challenges & Solutions
- Imbalanced dataset: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance disease class distributions and prevent model bias toward common conditions.
- Feature selection: Used correlation analysis and feature importance ranking to identify the most predictive symptom features, reducing noise and improving accuracy.
- Model comparison: Trained 3 different algorithms and compared precision, recall, and F1-scores to select the best model for each disease category.
- Data quality: Built a preprocessing pipeline to handle missing values, encode categorical symptoms, and normalize numerical features consistently.
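To make the SMOTE step above concrete, here is a simplified sketch of the core idea: each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbors. It is a standard-library illustration only; the function name `smote_like`, the toy data, and the parameter defaults are assumptions, and a real project would more likely rely on an established implementation.

```python
import random
from math import dist


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical 2-feature minority class: 4 samples against a majority of 10.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_size = 10
new_points = smote_like(minority, majority_size - len(minority))
balanced_minority = minority + new_points
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points never leak into the data used for evaluation.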
Key Learnings
ML Pipeline Design
Built an end-to-end machine learning pipeline, from raw data ingestion to model deployment, and gained an understanding of why each stage matters.
Model Evaluation
Learned to go beyond accuracy by evaluating models with precision, recall, F1-score, and confusion matrices for healthcare-critical predictions.
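As a small illustration of why these metrics add information beyond accuracy, the sketch below derives per-class precision, recall, and F1 from confusion-matrix counts. The labels (`flu`, `cold`, `healthy`) and the helper name `per_class_report` are made up for the example and do not come from the project's dataset.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from paired label lists."""
    confusion = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report


# Hypothetical ground truth and predictions for six patients.
y_true = ["flu", "flu", "flu", "cold", "cold", "healthy"]
y_pred = ["flu", "cold", "flu", "cold", "cold", "flu"]
report = per_class_report(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy data the overall accuracy is 4 out of 6, yet recall for `flu` is only 2 out of 3: one true case was missed, which a single accuracy figure would hide.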
Domain-Specific ML
Learned that healthcare ML requires higher confidence thresholds and more careful handling of false negatives than general classification tasks.
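One possible way to handle false negatives deliberately is to choose the decision threshold from validation data instead of using a fixed 0.5 cutoff. The sketch below picks the highest threshold that still reaches a target recall for a single condition. The scores, labels, and function names are hypothetical, and the source does not state that the project tunes thresholds this way.

```python
def recall_at(threshold, scores, labels):
    """Recall for the positive class when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return tp / (tp + fn) if tp + fn else 0.0


def pick_threshold(scores, labels, target_recall):
    """Highest candidate threshold whose recall still meets the target."""
    for threshold in sorted(set(scores), reverse=True):
        if recall_at(threshold, scores, labels) >= target_recall:
            return threshold
    # Fallback when no candidate meets the target (e.g. no positive examples).
    return min(scores)


# Hypothetical validation scores (predicted probability of the condition)
# and true labels (1 = condition present).
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

default_recall = recall_at(0.5, scores, labels)
threshold = pick_threshold(scores, labels, target_recall=1.0)
tuned_recall = recall_at(threshold, scores, labels)
```

With these toy numbers a 0.5 cutoff misses one of the three true cases, while the selected threshold catches all of them at the cost of one extra false positive. For a screening tool that trade-off is often acceptable, since flagged users are directed to a medical consultation anyway.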