Medical Report Summarization – Aissam Bakhtaoui

Medical Report Summarization with LLM & OCR

Use Case

Healthcare professionals often deal with long and complex medical reports, making it time-consuming to extract key information. This project aims to automate the summarization of medical reports, helping doctors and medical staff quickly access essential details while preserving critical information.

Objective

Develop an automated summarization system for medical reports.
Use OCR (Optical Character Recognition) to extract text from scanned documents.
Leverage Large Language Models (LLMs) to generate concise summaries.
Improve efficiency in medical data processing while maintaining accuracy.

Steps Taken

1. Data Used

Dataset: Medical Transcriptions Dataset from Kaggle.
Data Type: Unstructured text data from medical reports.
Preprocessing:
- Cleaned text by removing noise (e.g., special characters, irrelevant metadata).
- Standardized medical terminology for better summarization accuracy.

2. Technology Stack

Python – Core programming language.
OCR (Tesseract/Pytesseract or EasyOCR) – Extracts text from scanned reports.
NLP (Natural Language Processing) – Processes and cleans medical text.
Transformers (Hugging Face’s Pre-trained LLMs) – Generates accurate summaries.

3. Implementation Steps

Data Extraction: Used OCR to convert scanned reports into machine-readable text.
Preprocessing: Tokenized, cleaned, and standardized medical text.
Summarization Model: Fine-tuned a transformer-based LLM (e.g., BART, T5, or GPT) for medical text summarization.
Evaluation: Compared model outputs with human-generated summaries using ROUGE scores.

4. Results & Impact

Reduced reading time for medical professionals by 42% on average.
Achieved high accuracy in summarization while retaining essential medical information.
Demonstrated the potential of AI-powered automation in healthcare.

5. Challenges & Learnings

Handling medical jargon and abbreviations required fine-tuning the LLM on domain-specific data.
OCR accuracy varied based on scan quality, requiring preprocessing improvements.