
Overview
DeepScan is a comprehensive pipeline designed to detect deepfake content across images, videos, and audio. The project integrates a web application, a browser extension, and a Flask backend to provide end-to-end functionality, from media upload and preprocessing to feature extraction, model inference, and result presentation. At its core, DeepScan leverages multiple machine learning models (Random Forest, SVM, XGBoost) and ensemble methods (soft voting, hard voting, stacking) to achieve high detection accuracy.
Key Objectives
- Cross-Media Deepfake Detection – Offer a unified pipeline that handles images, videos, and audio seamlessly, rather than focusing on a single modality.
- High Accuracy via Ensemble Learning – Leverage ensemble methods (soft voting, hard voting, stacking) to combine the strengths of multiple algorithms (Random Forest, SVM, XGBoost) for improved performance.
- Real-Time & Client-Server Integration – Enable users to detect deepfakes in real time through a browser extension (client-side) and a Web App (server-side). The browser extension can process media directly on the page, while the Web App handles heavier preprocessing and feature extraction via a Flask API.
- Scalability & Extensibility – Architect the system so new detection models, feature extractors, or additional media types (e.g., text-based detection for deepfake GPT outputs) can be integrated without major refactoring.
- Ease of Deployment & Use – Provide a polished frontend (Next.js + Tailwind CSS), a lightweight Flask backend, and deployment scripts so that teammates, partners, or end users can quickly replicate and extend DeepScan.
Technologies
Frontend
Browser Extension
Client-Side ML & Preprocessing
Backend
Machine Learning & Feature Extraction
Face Detection & Embeddings (Images/Videos)
- MTCNN – Locates faces in each image or video frame.
- FaceNet – Generates 512-dimensional embeddings for each detected face.
- Feature Vector Assembly – Aggregation of FaceNet embeddings across all frames (for videos) via statistics (mean, variance) or max-pooling to form a fixed-length feature vector.
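The assembly step above can be sketched as follows, assuming the per-frame FaceNet embeddings are already available as a NumPy array; the function name is illustrative, not taken from the project's code:

```python
import numpy as np

def assemble_feature_vector(embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-frame embeddings of shape (n_frames, 512) into a
    fixed-length vector via per-dimension mean, variance, and max-pooling."""
    mean = embeddings.mean(axis=0)       # (512,)
    var = embeddings.var(axis=0)         # (512,)
    max_pool = embeddings.max(axis=0)    # (512,)
    return np.concatenate([mean, var, max_pool])  # (1536,)

# Example: a 30-frame video clip, one 512-d FaceNet embedding per frame.
frames = np.random.rand(30, 512)
features = assemble_feature_vector(frames)
print(features.shape)  # (1536,)
```

The key property is that the output length is independent of the number of frames, so clips of any duration map to the same classifier input size.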
Audio Features (Audio)
- MFCC (Mel-Frequency Cepstral Coefficients) – Typically the first 13–40 coefficients per audio frame, aggregated (e.g., using mean and delta) to produce a fixed-length feature set.
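The pooling described above can be sketched like this, assuming the MFCC matrix has already been computed (for example with `librosa.feature.mfcc`, which returns shape `(n_mfcc, n_frames)`); only the mean-plus-delta aggregation is shown, and the function name is illustrative:

```python
import numpy as np

def pool_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Pool an (n_mfcc, n_frames) coefficient matrix into a fixed-length
    vector: per-coefficient mean plus the mean of first-order deltas."""
    mean = mfcc.mean(axis=1)                   # (n_mfcc,)
    delta = np.diff(mfcc, axis=1)              # frame-to-frame change
    delta_mean = delta.mean(axis=1)            # (n_mfcc,)
    return np.concatenate([mean, delta_mean])  # (2 * n_mfcc,)

# Example: 13 coefficients over 100 audio frames -> 26-d feature vector.
mfcc = np.random.rand(13, 100)
print(pool_mfcc(mfcc).shape)  # (26,)
```

As with the video features, this yields a fixed-length vector regardless of clip duration, which is what the downstream classifiers require.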
Phishing Website Detection
- URL-based & HTML-based Features – Tokenization of domain names, analysis of form actions, suspicious keywords.
- Models – Random Forest trained on a labeled phishing vs. legitimate dataset.
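The URL-based features above can be sketched as simple tokenization and keyword checks; the keyword list and feature names here are illustrative assumptions, not the project's actual feature set:

```python
import re
from urllib.parse import urlparse

# Illustrative keyword list; a real detector would use a learned/curated set.
SUSPICIOUS = {"login", "verify", "secure", "account", "update"}

def url_features(url: str) -> dict:
    """Extract a few URL-based features for a phishing classifier."""
    parsed = urlparse(url)
    domain_tokens = re.split(r"[.\-_]", parsed.netloc.lower())
    path_tokens = parsed.path.lower().split("/")
    return {
        "num_domain_tokens": len(domain_tokens),
        "num_dots": parsed.netloc.count("."),
        "has_suspicious_kw": any(t in SUSPICIOUS for t in domain_tokens + path_tokens),
        "uses_https": parsed.scheme == "https",
        "url_length": len(url),
    }

print(url_features("http://secure-login.example.com/verify"))
```

Feature dictionaries like this can be vectorized (e.g. with scikit-learn's `DictVectorizer`) and fed to the Random Forest mentioned above.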

Conclusion
DeepScan exemplifies a modular, multi-modal approach to deepfake detection, favoring traditional ML models over heavyweight deep-learning-only solutions. By combining MTCNN + FaceNet embeddings (for images/videos) with MFCC features (for audio) and using ensemble techniques (RF, SVM, XGBoost, stacking), the system achieves up to 95.5% accuracy on image/video tasks and 94% on audio tasks.
Key takeaways:
- Ensemble Learning provides substantial gains with minimal overhead (soft voting alone took performance from ~91% to ~94%).
- Client-Side Preprocessing (Pyodide) alleviates server bottlenecks, reducing end-to-end latency.
- Browser Extension + Web App Integration broadens usability, letting both tech-savvy users and laypeople access DeepScan seamlessly.
- Scalable Deployment via Render.com, Hugging Face Spaces, and Supabase ensures reliability and low latency for global users.