Medical Report Summarization with LLM & OCR

Use Case

Healthcare professionals often deal with long and complex medical reports, making it time-consuming to extract key information. This project aims to automate the summarization of medical reports, helping doctors and medical staff quickly access essential details while preserving critical information.

Objective

  • Develop an automated summarization system for medical reports.
  • Use OCR (Optical Character Recognition) to extract text from scanned documents.
  • Leverage Large Language Models (LLMs) to generate concise summaries.
  • Improve efficiency in medical data processing while maintaining accuracy.

Steps Taken

1. Data Used

  • Dataset: Medical Transcriptions Dataset from Kaggle.
  • Data Type: Unstructured text data from medical reports.
  • Preprocessing:
    • Cleaned text by removing noise (e.g., special characters, irrelevant metadata).
    • Standardized medical terminology for better summarization accuracy.

2. Technology Stack

  • Python – Core programming language.
  • OCR (Tesseract/Pytesseract or EasyOCR) – Extracts text from scanned reports.
  • NLP (Natural Language Processing) – Processes and cleans medical text.
  • Transformers (Hugging Face’s Pre-trained LLMs) – Generates accurate summaries.

3. Implementation Steps

  • Data Extraction: Used OCR to convert scanned reports into machine-readable text.
  • Preprocessing: Tokenized, cleaned, and standardized medical text.
  • Summarization Model: Fine-tuned a transformer-based LLM (e.g., BART, T5, or GPT) for medical text summarization.
  • Evaluation: Compared model outputs with human-generated summaries using ROUGE scores.

4. Results & Impact

  • Reduced reading time for medical professionals by 42% on average.
  • Achieved high accuracy in summarization while retaining essential medical information.
  • Demonstrated the potential of AI-powered automation in healthcare.

5. Challenges & Learnings

  • Handling medical jargon and abbreviations required fine-tuning the LLM on domain-specific data.
  • OCR accuracy varied based on scan quality, requiring preprocessing improvements.