Disease Prediction System

A machine-learning health prediction system that uses three classification algorithms to predict diseases from patient symptoms, with each model reaching over 90% accuracy on held-out test data.

Status: Completed
Category: Machine Learning
Accuracy: 90%+
Tech Stack: Python, scikit-learn, Pandas, NumPy, Streamlit, Matplotlib

📋 Overview

This project applies supervised machine learning to the healthcare domain, predicting potential diseases from patient-reported symptoms and medical history. Three classification algorithms (Random Forest, Logistic Regression, and Decision Tree) were trained, evaluated, and compared to select the best-performing model.

The system includes a complete ML pipeline: data preprocessing, feature engineering, model training, hyperparameter tuning, evaluation, and a user-facing prediction interface built with Streamlit.
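As a rough illustration of how such a pipeline might be wired together with scikit-learn, the sketch below chains imputation, one-hot encoding, and scaling in front of each of the three classifiers and compares them on a held-out split. The toy data, the column names (age, fever, cough, disease), and the hyperparameters are assumptions made for this example only; they are not taken from the project's dataset or code, and class balancing and hyperparameter tuning are left out.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column names and values are illustrative only.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "fever": rng.choice(["none", "mild", "high"], n),
    "cough": rng.choice(["no", "yes"], n),
})
# The label depends on the features so the models have something to learn.
df["disease"] = np.where(
    (df["fever"] == "high") & (df["cough"] == "yes"), "flu", "other"
)
# Blank out some values so the imputers have work to do.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "fever"] = np.nan

numeric = ["age"]
categorical = ["fever", "cough"]

# Preprocessing: median/mode imputation, scaling, one-hot encoding.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["disease"],
    test_size=0.25, stratify=df["disease"], random_state=0,
)

# Fit one full pipeline per model and record accuracy and macro recall.
scores = {}
for name, model in models.items():
    clf = Pipeline([("preprocess", clone(preprocess)), ("model", model)])
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    scores[name] = {
        "accuracy": clf.score(X_test, y_test),
        "macro_recall": recall_score(y_test, pred, average="macro"),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Keeping the preprocessing inside the pipeline means the imputers and encoder are fitted on the training split only, so nothing from the test split leaks into training.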

📈 Business Impact & Key Results

  • High Accuracy: Each of the three trained models (Random Forest, Logistic Regression, Decision Tree) achieved over 90% test accuracy.
  • Data Pipeline Efficiency: Built a robust data preprocessing pipeline that handles missing values and encodes categorical features automatically.
  • Real-world Applicability: Designed the model to aid preliminary screenings, prioritizing recall to minimize false negatives for critical conditions.

🎯 Problem Statement

Early disease detection can significantly improve patient outcomes, but access to quick preliminary assessments is limited. Many patients delay seeking medical attention because initial symptom evaluation requires a doctor visit.

This system provides a preliminary screening tool that analyzes symptoms using trained ML models, helping users understand potential conditions and encouraging timely medical consultation.

🏗️ Architecture

[ Streamlit Frontend ]
  ├─ Symptom Input Form
  └─ Results Dashboard
          │ (Feature Vectors)
          ▼
[ ML Prediction Engine ]
  ├─ Random Forest
  ├─ Logistic Regression
  └─ Decision Tree Classifier
          ▲
          │ (Trained Models)
[ Data Preprocessing Pipeline ]
  ├─ Missing Value Imputation
  ├─ Feature Encoding (One-Hot)
  └─ SMOTE (Class Balancing)
          ▲
          │
[ Dataset (10,000+ Records) ]
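The hand-off from the symptom form to the prediction engine is labelled "Feature Vectors" in the diagram. One plausible way to build such a vector is sketched below. The symptom vocabulary and the function name symptoms_to_vector are hypothetical, since the project's actual feature list is not given here.

```python
# Hypothetical symptom vocabulary; the real feature list is not given in the source.
SYMPTOMS = ["fever", "cough", "fatigue", "headache", "nausea"]


def symptoms_to_vector(selected, vocabulary=SYMPTOMS):
    """Return a 0/1 list aligned with `vocabulary`.

    Input is normalised (trimmed, lower-cased); unknown symptoms raise
    ValueError so a typo never silently becomes an all-zero feature.
    """
    normalised = {s.strip().lower() for s in selected}
    unknown = normalised - set(vocabulary)
    if unknown:
        raise ValueError(f"Unknown symptoms: {sorted(unknown)}")
    return [1 if s in normalised else 0 for s in vocabulary]


vector = symptoms_to_vector(["Cough", "fever "])
print(vector)  # [1, 1, 0, 0, 0]
```

The order of the vocabulary has to match the column order used at training time, which is why the vector is built by iterating over the vocabulary rather than over the user's selection.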

Challenges & Solutions

  • Imbalanced dataset: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance disease class distributions and prevent model bias toward common conditions.
  • Feature selection: Used correlation analysis and feature importance ranking to identify the most predictive symptom features, reducing noise and improving accuracy.
  • Model comparison: Trained 3 different algorithms and compared precision, recall, and F1-scores to select the best model for each disease category.
  • Data quality: Built a preprocessing pipeline to handle missing values, encode categorical symptoms, and normalize numerical features consistently.
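To make the SMOTE step more concrete, here is a simplified, dependency-free sketch of its core idea: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration only, not the project's code; a real pipeline would more likely use a library implementation such as SMOTE from the imbalanced-learn package. The function name smote_sketch and the toy points are made up for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            [b + gap * (nb - b) for b, nb in zip(base, neighbour)]
        )
    return synthetic


# Toy minority-class samples (two numeric features each).
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print([round(v, 3) for v in point])
```

Oversampling of this kind is normally applied to the training split only, so that the test set still reflects the original class distribution.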

💡 Key Learnings

ML Pipeline Design

Built an end-to-end machine learning pipeline, from raw data ingestion to model deployment, and came to understand the importance of each stage.

Model Evaluation

Learned to go beyond accuracy, evaluating models with precision, recall, F1-score, and confusion matrices for healthcare-critical predictions.
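The following sketch shows how these per-class metrics are derived from true and predicted labels, using only the standard library. The labels are invented for illustration. In practice scikit-learn's classification_report and confusion_matrix provide the same numbers.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Compute precision, recall, and F1 for every label."""
    # Each (true, predicted) pair is one cell of the confusion matrix.
    cells = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = cells[(label, label)]
        fp = sum(c for (t, p), c in cells.items() if p == label and t != label)
        fn = sum(c for (t, p), c in cells.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels: 4 of 6 predictions are correct (accuracy about 0.67).
y_true = ["flu", "flu", "flu", "cold", "cold", "healthy"]
y_pred = ["flu", "cold", "flu", "cold", "flu", "healthy"]
report = per_class_report(y_true, y_pred)
for label, metrics in report.items():
    print(label, {k: round(v, 2) for k, v in metrics.items()})
```

Here the overall accuracy hides the fact that one in three flu cases is missed (flu recall is 2/3), which is exactly the kind of error a screening tool needs to surface.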

Domain-Specific ML

Learned that healthcare ML requires higher confidence thresholds and more careful handling of false negatives than general classification tasks.
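One common way to act on this, shown here as a hedged sketch rather than the project's actual approach, is to move the decision cut-off applied to a model's predicted probabilities. Requiring more confidence before ruling a condition out amounts to flagging it at a lower probability, which raises recall at the cost of more false positives. The probabilities and the 0.3 cut-off below are invented for illustration.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Flag a case as positive (1) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def recall(y_true, y_pred):
    """Share of actual positives that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


# Made-up ground truth and predicted probabilities for the positive class.
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.45, 0.35, 0.4, 0.2, 0.1]

default_preds = predict_with_threshold(probs, 0.5)
screening_preds = predict_with_threshold(probs, 0.3)

default_recall = recall(y_true, default_preds)      # 1 of 3 positives caught
screening_recall = recall(y_true, screening_preds)  # 3 of 3 positives caught
false_alarms = sum(
    1 for t, p in zip(y_true, screening_preds) if t == 0 and p == 1
)
print(default_recall, screening_recall, false_alarms)
```

With the lower cut-off every true case is caught, but one healthy case is now flagged as well. Where to set the cut-off depends on how costly a missed case is relative to an unnecessary follow-up.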
