Technical walkthrough

How DefectScope works

You upload a photo. Within a second you get a verdict — good or defective — backed by a probability score, a visual attention map, and a second independent check from an autoencoder. Here's exactly how that happens, step by step.

Step by step

From pixel to prediction

Every image you upload follows this exact path before a label is produced.

  1. Upload: your JPG or PNG is sent as a multipart upload to /predict. The file size limit is 10 MB, large enough for any real photo.
  2. Resize & normalise: the image is resized to 256 × 256 px using OpenCV, then normalised with the ImageNet mean and std. This matches exactly how the training images were prepared.
  3. DenseNet-121: the frozen backbone scans every spatial region and produces a 1024-dimensional fingerprint, a rich description of what's in the image.
  4. Classifier head: a small 2-layer neural network maps that 1024-number fingerprint into two scores, one for good and one for defective.
  5. Grad-CAM + AE check: a heatmap is generated showing where the model looked. A separate autoencoder also checks for anomalies independently.
  6. Verdict: if the defect probability is ≥ 33.6%, the result is Defective; otherwise it is Good. If the CNN and AE disagree, the case is flagged for review.
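
The arithmetic behind steps 2, 4 and 6 above can be sketched in plain Python. This is a minimal illustration, not the service's actual code: the function names are hypothetical, the mean and std values are the standard ImageNet statistics for RGB pixels scaled to 0..1, and a real pipeline would apply them to whole image tensors (OpenCV loads images in BGR order, so a conversion to RGB would normally come first).

```python
import math

# Standard ImageNet channel statistics (RGB order, pixel values scaled to 0..1).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

THRESHOLD = 0.336  # detection threshold quoted in step 6


def normalise_pixel(rgb):
    """Step 2: scale an 8-bit RGB pixel to 0..1, then standardise each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def softmax2(score_good, score_defective):
    """Step 4: turn the two head scores into probabilities that sum to 1."""
    top = max(score_good, score_defective)  # subtract the max for numerical stability
    e_good = math.exp(score_good - top)
    e_defective = math.exp(score_defective - top)
    total = e_good + e_defective
    return e_good / total, e_defective / total


def verdict(p_defective, threshold=THRESHOLD):
    """Step 6: compare the defect probability with the threshold."""
    return "Defective" if p_defective >= threshold else "Good"


p_good, p_defective = softmax2(1.2, 0.9)
print(round(p_defective, 3), verdict(p_defective))
```

Note that with a threshold below 50%, an image can be labelled Defective even when the good score is the larger of the two, as in the example call.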
The model

DenseNet-121 with transfer learning

We only had ~300 training images. Training a full neural network from scratch on that would be like trying to teach a child to recognise dogs after showing them three photos. Transfer learning solves this — we start with a backbone already trained on 1.2 million images and just teach the final layers the bottle-specific task.

DenseNet-121 backbone (frozen: parameters not updated during training)
  • Input image: 3 × 256 × 256
  • Conv + BN + Pool: 64 channels, 64 × 64
  • Dense Blocks 1–4: increasingly rich feature maps
  • Global average pool: 1024-dim vector
Pre-trained on ImageNet (1.2M images, 1000 classes). All ~7 million parameters are frozen; we never touch these weights.
↓ 1024 features passed to classifier
Custom classifier head (trained: only these ~260K parameters update)
  • Linear: 1024 → 256
  • ReLU: zeros out negative activations
  • Dropout (0.4): randomly drops 40% of activations during training to prevent overfitting
  • Linear: 256 → 2 scores
  • Softmax: P(good) + P(defective) = 100%
Trained with Adam (lr = 1e-4), cosine learning rate decay, and class-weighted cross-entropy to handle the imbalance between good and defective samples.
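
Two quick arithmetic checks on the head described above, as a rough sketch. The parameter count assumes both linear layers carry bias terms (the usual default). The class-weight formula is an assumption: the text only says the loss is class-weighted, so inverse-frequency weighting is shown as one common choice, using the approximate image counts from the dataset section.

```python
def linear_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


# ReLU, Dropout and Softmax add no trainable parameters.
head_params = linear_params(1024, 256) + linear_params(256, 2)
print(head_params)  # 262914, i.e. roughly 260K


def inverse_frequency_weights(counts):
    """ASSUMED weighting scheme: total / (number of classes * class count)."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}


# Approximate training counts quoted in the dataset section.
weights = inverse_frequency_weights({"good": 209, "defective": 84})
print({name: round(w, 3) for name, w in weights.items()})
```

Under this scheme the rarer defective class gets the larger weight, so each defective example contributes more to the loss than each good one.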

Why DenseNet-121 specifically?

Dense connections = better gradient flow

In a standard CNN, gradients have to flow backwards through every layer sequentially, getting weaker as they go. Within each dense block, DenseNet connects every layer directly to all earlier layers, so gradients reach the early layers through short paths with far less attenuation. On small datasets, this matters a lot.

Parameter-efficient for small datasets

MVTec bottle gives us ~300 training images. DenseNet's architecture reuses feature maps across layers, so it extracts more information with fewer parameters than a ResNet or VGG of similar depth — reducing the overfitting risk.
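
Feature reuse shows up directly in the channel arithmetic. The sketch below uses the standard DenseNet-121 configuration (64 stem channels, growth rate 32, dense blocks of 6, 12, 24 and 16 layers, transition layers that halve the channel count) to show how concatenation builds up to the 1024-channel output. It only counts channels; it is not a model definition.

```python
STEM_CHANNELS = 64              # channels after the initial conv + pool
GROWTH_RATE = 32                # new channels contributed by each dense layer
BLOCK_LAYERS = (6, 12, 24, 16)  # layers in dense blocks 1-4


def channels_after_each_block():
    """Return the channel count at the end of each dense block."""
    channels = STEM_CHANNELS
    result = []
    for index, n_layers in enumerate(BLOCK_LAYERS):
        # Each layer's output is concatenated onto everything before it.
        channels += n_layers * GROWTH_RATE
        result.append(channels)
        # A transition layer halves the channels between blocks (not after the last).
        if index < len(BLOCK_LAYERS) - 1:
            channels //= 2
    return result


print(channels_after_each_block())  # [256, 512, 1024, 1024]
```

Each layer only has to produce 32 new channels because it can read all earlier feature maps in its block, which is where the parameter savings come from.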

Proven track record on MVTec

DenseNet-based classifiers appear consistently in the MVTec benchmark literature. It's not a novel choice — it's a reliable one. We achieved AUC-ROC 0.984, which is very strong for a fine-tuned classifier on this dataset.

Explainability

Grad-CAM — seeing what the model sees

A prediction of "defective" is not very useful if you can't see where the defect was found. Grad-CAM generates a spatial heatmap that highlights the regions of the image that most influenced the model's verdict.

How it works

After the CNN produces its class scores, Grad-CAM runs a backward pass and asks: "Which spatial locations in the final feature map pushed the defective score up the most?"

  1. Forward pass — the image goes through the entire network, producing a defect probability.
  2. Backward pass — gradients of the defect score are computed with respect to the last convolutional feature maps (features.norm5, which is 8 × 8 spatial resolution for our 256 px input).
  3. Weighted activation map — each feature map is weighted by how much its gradient contributes. The weighted sum is passed through ReLU (keeping only positive contributions) and resized back to 256 × 256.
  4. Colour overlay — the heatmap is colourised (red = high attention, blue = low attention) and blended onto the original image at 45% opacity.
We always visualise the defect class (class 1) — even when the verdict is "Good." That means the heatmap always shows where the model checked for defects. On a good image, you'll typically see low, diffuse activation. On a defective image, attention concentrates on the damaged region.
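
The core of steps 2 and 3 fits in a few lines. The sketch below works on tiny nested lists instead of real 1024 × 8 × 8 tensors, and the function name is hypothetical. Scaling the map to 0..1 before colourising is an assumption; the text does not say how the heatmap is scaled.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    activations: K feature maps, each an H x W nested list.
    gradients:   d(defect score) / d(activations), same shape.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = spatial average of that channel's gradient.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]

    # ReLU keeps only locations that push the defect score up.
    cam = [[max(0.0, value) for value in row] for row in cam]

    # ASSUMPTION: scale to 0..1 so the map can be colourised.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


acts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

In the toy call, the second channel has a negative average gradient, so the location it activates is removed by the ReLU and only the top-left cell lights up.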

Reading the heatmap

Colour scale, from low attention (blue) to high attention (red):
Red / orange — the model found strong evidence of a defect here
Yellow / green — moderate attention, supporting signal
Blue / purple — this region did not influence the prediction much

The heatmap is generated at 8 × 8 feature resolution and upsampled — so you get a coarse spatial explanation, not pixel-level precision. It tells you which region triggered the alert, not which exact pixel is the defect.

Dual-model verification

Autoencoder — a second opinion

The CNN classifier is powerful, but it can be overconfident. A separate autoencoder acts as an independent sanity check using a completely different approach.

The idea

The autoencoder was trained exclusively on good bottles. Its job is simple: compress the image down into a compact code, then reconstruct it back to full size. Because it only ever saw good images during training, it gets very good at reconstructing those — and struggles with defective ones.

When a defective image is passed through, the reconstruction will be off in the damaged region. We measure this as the mean squared error (MSE) between the original and the reconstruction — the anomaly score. High MSE = the AE couldn't reconstruct it cleanly = likely anomaly.

Input image (256 × 256) → Encoder (compress to latent code) → Decoder (reconstruct image) → MSE vs original (anomaly score)
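
The anomaly score itself is a plain mean squared error. A minimal sketch, assuming both images have been flattened to equal-length lists of pixel values (the function name is illustrative, and the reconstruction would come from the autoencoder, which is not shown):

```python
def anomaly_score(original, reconstruction):
    """Mean squared error between an image and its reconstruction."""
    if len(original) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    squared_errors = ((a - b) ** 2 for a, b in zip(original, reconstruction))
    return sum(squared_errors) / len(original)


# Toy 4-pixel example: one pixel is reconstructed badly.
print(anomaly_score([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # 0.25
```

Because the error is averaged over all pixels, a small damaged region raises the score less than a large one.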

When the two models disagree

If the CNN says Defective but the AE anomaly score is low (it reconstructed the image well), or vice versa, the result is flagged with an amber "needs manual review" alert. This disagreement pattern often catches edge cases — images that are genuinely ambiguous, or where one model is operating near its reliability boundary.

✓ ✓ Both agree: high confidence. Standard result, no review flag.
✓ ✗ CNN says defective, AE says normal: possible false alarm. Flagged for review.
✗ ✓ CNN says good, AE spots anomaly: possible missed defect. Flagged for review.
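
The agreement check above can be sketched as a small function. The CNN threshold of 0.336 comes from the text; the autoencoder cut-off is not stated anywhere, so `ae_threshold=0.01` is a placeholder assumption, as are the function and key names.

```python
def combine(p_defective, ae_score, cnn_threshold=0.336, ae_threshold=0.01):
    """Return the CNN verdict plus a review flag when the two models disagree.

    ae_threshold is a HYPOTHETICAL cut-off on the reconstruction MSE.
    """
    cnn_says_defective = p_defective >= cnn_threshold
    ae_says_anomalous = ae_score >= ae_threshold
    return {
        "verdict": "Defective" if cnn_says_defective else "Good",
        "needs_review": cnn_says_defective != ae_says_anomalous,
    }


print(combine(0.78, 0.002))  # CNN defective, AE normal: flagged as a possible false alarm
print(combine(0.10, 0.050))  # CNN good, AE anomalous: flagged as a possible missed defect
```

The verdict always follows the CNN here; the autoencoder only decides whether the amber review flag is raised.
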
Dataset

MVTec Anomaly Detection dataset

DefectScope is trained and evaluated exclusively on the bottle category of the MVTec AD dataset (Bergmann et al., 2019). MVTec AD has 15 categories (10 objects and 5 textures), including bottle, cable, carpet, leather, metal nut, and screw; we trained only on bottle. The images are glass/plastic bottles photographed from directly overhead (top-down, bird's-eye view) on a plain background with consistent industrial lighting.

  • ~209 good training images
  • ~84 defective training images
  • 83 test images
  • 3 defect types
Why is this dataset so small? Industrial defect datasets are inherently small — you need real defective products off a production line, each manually photographed and annotated. MVTec AD is the standard academic benchmark for exactly this reason: it represents the realistic data scarcity that real companies face. A production deployment would collect months of images specific to their product. We use transfer learning from ImageNet (1.2M images) specifically to compensate — the backbone already learned rich visual features, so we only need a few hundred images to fine-tune for bottle defects.

Defect categories in training data

  • Broken large (broken_large): a large crack or break through the bottle surface. This is the most visually distinctive defect type; the model detects these with the highest confidence, usually 85%+.
  • Broken small (broken_small): a small chip or hairline crack. These are harder to detect, and confidence scores tend to cluster closer to the 33.6% threshold. This is where human review adds the most value.
  • Contamination (contamination): foreign material inside or on the bottle. It shows up as a texture or colour anomaly that the model learned to distinguish from clean bottle surfaces.

What a "good" image looks like

[Four sample images of good bottles, each labelled Good]

What you're looking at: the circular shape is the bottle opening as seen from directly above. Camera pointing straight down, single bottle centred in frame, plain white background, consistent industrial lighting. The model learned what "normal" looks like exclusively from images like these — any other angle, background, or object type produces meaningless results.

Evaluation

The numbers — what they actually mean

The model outputs a probability, not a hard yes/no. Understanding the numbers makes the results actionable.

Defect probability

The raw probability that the image contains a defect, ranging 0–100%. The result bar fills to this value. Once it passes the detection threshold, it turns red and the verdict flips to Defective.

18% → Good (below the 33.6% threshold)
78% → Defective (above the 33.6% threshold)

Why threshold = 33.6%?

The default approach — label as defective if the defect score exceeds 50% — is naive. We ran a search across every possible threshold on the test set and measured the F1-score at each point. The sweet spot was 33.6%.

Default 50% (argmax): F1 = 0.50
Tuned 33.6%: F1 = 0.96

In industrial QA, a missed defect (false negative) is almost always more costly than a false alarm. The lower threshold reflects that: we'd rather flag a borderline case for human review than silently pass a broken bottle.
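
A threshold search of the kind described above can be sketched as follows. The probabilities and labels are made-up toy values, not the project's test results, and the helper names are hypothetical. The same functions work on any labelled split.

```python
def f1_at(threshold, probs, labels):
    """F1-score when every image with prob >= threshold is called defective."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        predicted_defective = prob >= threshold
        if predicted_defective and label == 1:
            tp += 1
        elif predicted_defective and label == 0:
            fp += 1
        elif not predicted_defective and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(probs, labels):
    """Try each observed probability as a threshold; keep the best F1.

    Ties go to the lowest threshold, which favours catching defects.
    """
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))


# Toy data: 1 = defective, 0 = good.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(best_threshold(probs, labels))          # 0.35
print(round(f1_at(0.50, probs, labels), 3))   # 0.667
print(round(f1_at(0.35, probs, labels), 3))   # 0.889
```

In the toy data, two defective images score below 50%, so the default cut-off misses them; lowering the threshold recovers both at the cost of one false alarm.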

AUC-ROC = 0.984

The Area Under the ROC Curve measures how reliably the model ranks defective images above good ones — across every possible threshold at once. A score of 0.5 means random guessing. A score of 1.0 is perfect separation. We're at 0.984.

Scale: random guessing (0.5) → DefectScope (0.984) → perfect (1.0)

What this means in practice: if you randomly pick one good image and one defective image, the model will rank the defective one higher 98.4% of the time.
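
That pairwise reading can be computed directly: compare every defective image with every good one and count how often the defective one scores higher. A minimal sketch on made-up toy scores (not the project's data), with ties counted as half a win:

```python
def auc_pairwise(probs, labels):
    """AUC-ROC as the fraction of (defective, good) pairs ranked correctly."""
    defective = [p for p, y in zip(probs, labels) if y == 1]
    good = [p for p, y in zip(probs, labels) if y == 0]
    if not defective or not good:
        raise ValueError("need at least one image of each class")
    wins = 0.0
    for d in defective:
        for g in good:
            if d > g:
                wins += 1.0
            elif d == g:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(defective) * len(good))


# Toy data: 1 = defective, 0 = good.
probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(auc_pairwise(probs, labels))  # 0.9375
```

Only one of the 16 pairs is ranked the wrong way round (the good image at 0.40 against the defective one at 0.35), which gives 15/16.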

Domain scope

What to upload — and what not to

The single most common reason for wrong predictions is uploading an out-of-domain image. The model was trained exclusively on overhead bottle photos — it has no idea what to do with anything else, but it won't tell you that. It will still produce a label. It just won't mean anything.

Upload this
  • Glass or plastic bottle, photographed from directly above
  • Camera pointing straight down — top-down, bird's-eye view
  • Single bottle, centred, filling most of the frame
  • Plain white or grey background — no clutter, no other objects
  • Even, consistent lighting — no harsh shadows or reflections
  • Known defect types: large cracks, small chips, contamination
  • JPG or PNG, at least 128 × 128 px
Do not upload this
  • Photos of people, animals, food, or scenes
  • Screenshots, documents, or app UI
  • Multi-object or cluttered backgrounds
  • Non-bottle products — cables, metal parts, etc.
  • Very blurry, dark, or overexposed images
  • Side-angle or angled shots — the model expects top-down only
  • Memes, posters, or any non-product image
Why does this matter? The CNN learned what "good" and "defective" look like only from MVTec bottle images. When you give it a completely different type of image, the 1024-dimensional feature vector the backbone extracts is meaningless for this classification task. The model will still output a number, but it's essentially noise. This is called out-of-distribution (OOD) inference, and it's a limitation shared by standard supervised classifiers, which have no built-in way to reject unfamiliar inputs.
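
The mechanical parts of the checklist above (file type, size, resolution) can be checked before uploading. This is a hypothetical client-side helper, not part of DefectScope; it assumes 10 MB means 10 × 1024 × 1024 bytes, and it cannot tell whether the image actually shows a top-down bottle.

```python
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024          # ASSUMPTION: 10 MB counted in binary units
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}
MIN_SIDE = 128                        # minimum width and height in pixels


def upload_problems(filename, size_bytes, width, height):
    """Return a list of reasons the file would not meet the upload checklist."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("file must be JPG or PNG")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 10 MB")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append("image is smaller than 128 x 128 px")
    return problems


print(upload_problems("bottle_top.png", 2_000_000, 900, 900))  # []
print(upload_problems("scan.gif", 11 * 1024 * 1024, 64, 64))
```

An empty list only means the file is acceptable in form; whether the content is in-domain still has to be judged by eye.
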
Honest limitations

What this model cannot do

No model is perfect. Here's exactly where DefectScope falls short — so you can use it appropriately.

Single category only

Trained exclusively on MVTec "bottle." Upload a cable, fabric, or metal part and the result is undefined — the model will produce a confident label, but it has no basis for it. There are 14 other MVTec categories it knows nothing about.

No out-of-domain detection

The model cannot say "I don't recognise this image." It will confidently classify a photo of the sky, a selfie, or a dog. The confidence score gives a hint — values below 60% usually signal an unfamiliar input — but it's not a reliable OOD detector.

Not a final QA verdict

AUC-ROC 0.984 is strong, but not perfect. Edge cases exist — especially small chips near the detection threshold. Treat this as a fast screening step, not a final pass/fail gate. Manual review remains important for borderline cases.

Small training set

MVTec bottle has ~300 training images. The model performs well within this domain, but may not generalise to unseen lighting conditions, different bottle shapes, or transparent bottles. More production data would meaningfully improve robustness.

See it in action

The sample images on the Analyze page are from the exact test set the model was evaluated on — they're the best way to see the heatmap and dual-model output for real.

Go to Analyze →