
Overview
DeepScan is a comprehensive pipeline designed to detect deepfake content across images, videos, and audio. The project integrates a web application, a browser extension, and a Flask backend to provide end-to-end functionality, from media upload and preprocessing to feature extraction, model inference, and result presentation. At its core, DeepScan leverages multiple machine learning models (Random Forest, SVM, XGBoost) and ensemble methods (soft voting, hard voting, stacking) to achieve high detection accuracy.
Key Objectives
- Cross-Media Deepfake Detection – Offer a unified pipeline that handles images, videos, and audio seamlessly, rather than focusing on a single modality.
- High Accuracy via Ensemble Learning – Leverage ensemble methods (soft voting, hard voting, stacking) to combine the strengths of multiple algorithms (Random Forest, SVM, XGBoost) for improved performance.
- Real-Time & Client-Server Integration – Enable users to detect deepfakes in real time through a browser extension (client-side) and a Web App (server-side). The browser extension can process media directly on the page, while the Web App handles heavier preprocessing and feature extraction via a Flask API.
- Scalability & Extensibility – Architect the system so new detection models, feature extractors, or additional media types (e.g., text-based detection for deepfake GPT outputs) can be integrated without major refactoring.
- Ease of Deployment & Use – Provide a polished frontend (Next.js + Tailwind CSS), a lightweight Flask backend, and deployment scripts so that teammates, partners, or end users can quickly replicate and extend DeepScan.
Technologies
Frontend
Browser Extension
Client-Side ML & Preprocessing
Backend
Machine Learning & Feature Extraction
Face Detection & Embeddings (Images/Videos)
- MTCNN – Locates faces in each image or video frame.
- FaceNet – Generates 512-dimensional embeddings for each detected face.
- Feature Vector Assembly – Aggregation of FaceNet embeddings across all frames (for videos) via statistics (mean, variance) or max-pooling to form a fixed-length feature vector.
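The assembly step above can be sketched as follows, assuming the per-frame FaceNet embeddings are already available as a NumPy array; the function name is illustrative, not taken from the project's code:

```python
import numpy as np

def assemble_feature_vector(embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-frame embeddings of shape (n_frames, 512) into a
    fixed-length vector via per-dimension mean, variance, and max-pooling."""
    mean = embeddings.mean(axis=0)       # (512,)
    var = embeddings.var(axis=0)         # (512,)
    max_pool = embeddings.max(axis=0)    # (512,)
    return np.concatenate([mean, var, max_pool])  # (1536,)

# Example: a 30-frame video clip, one 512-d FaceNet embedding per frame.
frames = np.random.rand(30, 512)
features = assemble_feature_vector(frames)
print(features.shape)  # (1536,)
```

The key property is that the output length is independent of the number of frames, so clips of any duration map to the same classifier input size.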
Audio Features (Audio)
- MFCC (Mel-Frequency Cepstral Coefficients) – Typically the first 13–40 coefficients per audio frame, aggregated (e.g., using mean and delta) to produce a fixed-length feature set.
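The pooling described above can be sketched like this, assuming the MFCC matrix has already been computed (for example with `librosa.feature.mfcc`, which returns shape `(n_mfcc, n_frames)`); only the mean-plus-delta aggregation is shown, and the function name is illustrative:

```python
import numpy as np

def pool_mfcc(mfcc: np.ndarray) -> np.ndarray:
    """Pool an (n_mfcc, n_frames) coefficient matrix into a fixed-length
    vector: per-coefficient mean plus the mean of first-order deltas."""
    mean = mfcc.mean(axis=1)                   # (n_mfcc,)
    delta = np.diff(mfcc, axis=1)              # frame-to-frame change
    delta_mean = delta.mean(axis=1)            # (n_mfcc,)
    return np.concatenate([mean, delta_mean])  # (2 * n_mfcc,)

# Example: 13 coefficients over 100 audio frames -> 26-d feature vector.
mfcc = np.random.rand(13, 100)
print(pool_mfcc(mfcc).shape)  # (26,)
```

As with the video features, this yields a fixed-length vector regardless of clip duration, which is what the downstream classifiers require.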
Phishing Website Detection
- URL-based & HTML-based Features – Tokenization of domain names, analysis of form actions, suspicious keywords.
- Models – Random Forest trained on a labeled phishing vs. legitimate dataset.
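The URL-based features above can be sketched as simple tokenization and keyword checks; the keyword list and feature names here are illustrative assumptions, not the project's actual feature set:

```python
import re
from urllib.parse import urlparse

# Illustrative keyword list; a real detector would use a learned/curated set.
SUSPICIOUS = {"login", "verify", "secure", "account", "update"}

def url_features(url: str) -> dict:
    """Extract a few URL-based features for a phishing classifier."""
    parsed = urlparse(url)
    domain_tokens = re.split(r"[.\-_]", parsed.netloc.lower())
    path_tokens = parsed.path.lower().split("/")
    return {
        "num_domain_tokens": len(domain_tokens),
        "num_dots": parsed.netloc.count("."),
        "has_suspicious_kw": any(t in SUSPICIOUS for t in domain_tokens + path_tokens),
        "uses_https": parsed.scheme == "https",
        "url_length": len(url),
    }

print(url_features("http://secure-login.example.com/verify"))
```

Feature dictionaries like this can be vectorized (e.g. with scikit-learn's `DictVectorizer`) and fed to the Random Forest mentioned above.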

Conclusion
DeepScan exemplifies a modular, multi-modal approach to deepfake detection, favoring traditional ML models over heavyweight deep-learning-only solutions. By combining MTCNN + FaceNet embeddings (for images/videos) with MFCC features (for audio) and using ensemble techniques (RF, SVM, XGBoost, stacking), the system achieves up to 95.5% accuracy on image/video tasks and 94% on audio tasks.
Key takeaways:
- Ensemble Learning provides substantial gains with minimal overhead (soft voting alone took performance from ~91% to ~94%).
- Client-Side Preprocessing (Pyodide) alleviates server bottlenecks, reducing end-to-end latency.
- Browser Extension + Web App Integration broadens usability, letting both tech-savvy users and laypeople access DeepScan seamlessly.
- Scalable Deployment via Render.com, Hugging Face Spaces, and Supabase ensures reliability and low latency for global users.