Disease Prediction System

A machine learning-powered health prediction system that uses multiple classification algorithms to predict diseases from patient symptoms, achieving over 90% accuracy on test datasets.

Status: Completed

Category: Machine Learning

Accuracy: 90%+

Python · scikit-learn · Pandas · NumPy · Streamlit · Matplotlib

View on GitHub →

📋 Overview

This project applies supervised machine learning to the healthcare domain, predicting potential diseases from patient-reported symptoms and medical history. Three classification algorithms were trained, evaluated, and compared to select the best-performing model.

The system includes a complete ML pipeline: data preprocessing, feature engineering, model training, hyperparameter tuning, evaluation, and a user-facing prediction interface built with Streamlit.
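As a rough illustration of how the preprocessing and training stages might be chained, here is a minimal sketch using scikit-learn's Pipeline and ColumnTransformer. The column names (fever, cough, age), the imputation strategies, and the toy rows are assumptions made for this example; the project's actual schema and settings are not shown on this page.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real dataset schema is not shown here.
CATEGORICAL = ["fever", "cough"]
NUMERIC = ["age"]


def build_pipeline() -> Pipeline:
    """Impute, encode, and scale the inputs, then fit a classifier."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("cat", categorical, CATEGORICAL),
        ("num", numeric, NUMERIC),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


# Toy rows with missing values, for illustration only (not project data).
toy = pd.DataFrame({
    "fever": ["yes", "no", np.nan, "yes", "no", "yes"],
    "cough": ["no", "no", "yes", "yes", np.nan, "yes"],
    "age": [34, 51, np.nan, 27, 63, 45],
})
labels = ["flu", "healthy", "cold", "flu", "healthy", "flu"]

pipeline = build_pipeline().fit(toy, labels)
predictions = pipeline.predict(toy)
```

Keeping preprocessing inside the same Pipeline object as the model means the exact same imputation and encoding are applied at prediction time as during training.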

📈 Business Impact & Key Results

High Accuracy: Achieved over 90% test accuracy across all 3 trained models (Random Forest, Logistic Regression, Decision Tree).
Data Pipeline Efficiency: Built a robust data preprocessing pipeline that handles missing values and encodes categorical features automatically.
Real-world Applicability: Designed the model to aid preliminary screenings, prioritizing recall to minimize false negatives for critical conditions.
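To make the model comparison concrete, the sketch below trains the three named algorithms on synthetic data and reports macro-averaged precision, recall, and F1. The data comes from scikit-learn's make_classification as a stand-in, and the hyperparameters are arbitrary, so the scores it produces say nothing about the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the project's real dataset is not reproduced here.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters are illustrative assumptions, not the project's tuned values.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}


def evaluate(candidates):
    """Fit each model and collect macro-averaged metrics on the held-out split."""
    results = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "precision": precision_score(y_test, pred, average="macro", zero_division=0),
            "recall": recall_score(y_test, pred, average="macro", zero_division=0),
            "f1": f1_score(y_test, pred, average="macro", zero_division=0),
        }
    return results


results = evaluate(models)
# Rank by recall, since the screening use case prioritizes avoiding false negatives.
best_by_recall = max(results, key=lambda name: results[name]["recall"])
```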

🎯 Problem Statement

Early disease detection can significantly improve patient outcomes, but access to quick preliminary assessments is limited. Many patients delay seeking medical attention because initial symptom evaluation requires a doctor visit.

This system provides a preliminary screening tool that analyzes symptoms using trained ML models, helping users understand potential conditions and encouraging timely medical consultation.

🏗️ Architecture

[ Streamlit Frontend ]
   ├─ Symptom Input Form
   └─ Results Dashboard
       │ (Feature Vectors)
       ▼
[ ML Prediction Engine ]
   ├─ Random Forest
   ├─ Logistic Regression
   └─ Decision Tree Classifier
       ▲ (Trained Models)
       │
[ Data Preprocessing Pipeline ]
   ├─ Missing Value Imputation
   ├─ Feature Encoding (One-Hot)
   └─ SMOTE (Class Balancing)
       ▲
[ Dataset (10,000+ Records) ]
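
The "Feature Vectors" hand-off between the input form and the prediction engine could look something like the sketch below, which turns a list of selected symptoms into a fixed-order binary row. The symptom vocabulary and the function name symptoms_to_vector are hypothetical; the real feature set is not listed on this page.

```python
import numpy as np

# Hypothetical symptom vocabulary; the real feature list is an assumption here.
SYMPTOMS = ["fever", "cough", "fatigue", "headache", "nausea"]


def symptoms_to_vector(selected):
    """Encode selected symptoms as a single-row binary feature matrix."""
    chosen = set(selected)
    unknown = chosen - set(SYMPTOMS)
    if unknown:
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    # One column per known symptom, in a fixed order the model was trained on.
    return np.array([[1 if name in chosen else 0 for name in SYMPTOMS]])


vector = symptoms_to_vector(["cough", "fever"])
# vector has shape (1, 5) and can be passed to a fitted model's predict().
```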

⚡ Challenges & Solutions

Imbalanced dataset: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance disease class distributions and prevent model bias toward common conditions.
Feature selection: Used correlation analysis and feature importance ranking to identify the most predictive symptom features, reducing noise and improving accuracy.
Model comparison: Trained 3 different algorithms and compared precision, recall, and F1-scores to select the best model for each disease category.
Data quality: Built a preprocessing pipeline to handle missing values, encode categorical symptoms, and normalize numerical features consistently.
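
To show the idea behind SMOTE, here is a simplified NumPy sketch that creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is not the implementation used in the project (a library such as imbalanced-learn would normally be used); the function name smote_like_oversample and the toy data are assumptions for illustration.

```python
import numpy as np


def smote_like_oversample(X, y, minority_label, n_new, k=3, seed=0):
    """Append n_new synthetic minority rows built by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    minority = X[y == minority_label]
    if len(minority) < 2:
        raise ValueError("Need at least two minority samples to interpolate.")
    k = min(k, len(minority) - 1)
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = minority[rng.integers(len(minority))]
        distances = np.linalg.norm(minority - base, axis=1)
        # Skip position 0 of the sort: that is the base point itself (distance 0).
        neighbours = np.argsort(distances)[1:k + 1]
        neighbour = minority[rng.choice(neighbours)]
        # Pick a random point on the segment between base and neighbour.
        synthetic[i] = base + rng.random() * (neighbour - base)
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority_label)])
    return X_out, y_out


# Toy imbalanced data: 8 majority rows (label 0) and 3 minority rows (label 1).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0, 1, (8, 2)), data_rng.normal(3, 1, (3, 2))])
y = np.array([0] * 8 + [1] * 3)

X_bal, y_bal = smote_like_oversample(X, y, minority_label=1, n_new=5)
```

Oversampling like this is usually applied to the training split only, so that synthetic rows never leak into the test set.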

💡 Key Learnings

ML Pipeline Design

Built an end-to-end machine learning pipeline from raw data ingestion to model deployment, understanding each stage's importance.

Model Evaluation

Learned to go beyond accuracy by evaluating models with precision, recall, F1-score, and confusion matrices for healthcare-critical predictions.
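
A small worked example of these metrics, using scikit-learn on hand-made binary labels (1 = condition present). The labels are invented for illustration and are not project results.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented labels for illustration: 1 = condition present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.75
```

Here the single false negative (a missed case) is the error that matters most for screening, which is why recall is tracked separately from overall accuracy.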

Domain-Specific ML

Learned that healthcare ML requires higher confidence thresholds and more careful handling of false negatives than general classification tasks.
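
The effect of the decision threshold on false negatives can be seen in a toy example. The sketch below scores the same invented probabilities at two cut-offs; the helper name recall_and_precision, the scores, and the thresholds are all assumptions, and which cut-off suits a real screening tool is a project decision not stated here.

```python
import numpy as np


def recall_and_precision(y_true, scores, threshold):
    """Flag cases whose score meets the threshold, then compute recall and precision."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(scores) >= threshold
    tp = int(np.sum(predicted & (y_true == 1)))
    fn = int(np.sum(~predicted & (y_true == 1)))
    fp = int(np.sum(predicted & (y_true == 0)))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Invented predicted probabilities for four positive and four negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.35, 0.55, 0.25, 0.2, 0.1]

default_cutoff = recall_and_precision(y_true, scores, threshold=0.5)
screening_cutoff = recall_and_precision(y_true, scores, threshold=0.3)
# default_cutoff   -> recall 0.5, precision 2/3 (two positive cases missed)
# screening_cutoff -> recall 1.0, precision 0.8 (no positive cases missed)
```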
