How DefectScope works
You upload a photo. Within a second you get a verdict — good or defective — backed by a probability score, a visual attention map, and a second independent check from an autoencoder. Here's exactly how that happens, step by step.
From pixel to prediction
Every image you upload follows this exact path before a label is produced.
/predict. File size limit is 10 MB, large enough for any real photo.
DenseNet-121 with transfer learning
We only had ~300 training images. Training a full neural network from scratch on that would be like trying to teach a child to recognise dogs after showing them three photos. Transfer learning solves this — we start with a backbone already trained on 1.2 million images and just teach the final layers the bottle-specific task.
Why DenseNet-121 specifically?
In a plain sequential CNN, gradients flow backwards through every layer in turn and tend to weaken as they go. Within each dense block, DenseNet connects every layer directly to all preceding layers, so gradients reach the early layers through short paths with far less attenuation. On small datasets, this matters a lot.
MVTec bottle gives us ~300 training images. DenseNet's architecture reuses feature maps across layers, so it extracts more information with fewer parameters than a ResNet or VGG of similar depth — reducing the overfitting risk.
DenseNet-based classifiers appear consistently in the MVTec benchmark literature. It's not a novel choice — it's a reliable one. We achieved AUC-ROC 0.984, which is very strong for a fine-tuned classifier on this dataset.
Grad-CAM — seeing what the model sees
A prediction of "defective" is not very useful if you can't see where the defect was found. Grad-CAM generates a spatial heatmap that highlights the regions of the image that most influenced the model's verdict.
How it works
After the CNN produces its class scores, Grad-CAM runs a backward pass and asks: "Which spatial locations in the final feature map pushed the defective score up the most?"
- Forward pass: the image goes through the entire network, producing a defect probability.
- Backward pass: gradients of the defect score are computed with respect to the last convolutional feature maps (features.norm5, which is 8 × 8 spatial resolution for our 256 px input).
- Weighted activation map: each feature map is weighted by how much its gradient contributes. The weighted sum is passed through ReLU (keeping only positive contributions) and resized back to 256 × 256.
- Colour overlay: the heatmap is colourised (red = high attention, blue = low attention) and blended onto the original image at 45% opacity.
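The weighting and blending steps can be sketched in plain Python. This is an illustration only: tiny 2 × 2 nested lists stand in for the real 8 × 8 feature maps, the function names `grad_cam` and `blend` are hypothetical, and the resize and colour-mapping steps are left out.

```python
def grad_cam(activations, gradients):
    """Return a Grad-CAM map scaled to [0, 1].

    activations, gradients: lists of K feature maps, each an H x W nested list.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # One weight per channel: the spatial average of that channel's gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]

    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[y][x] for w, act in zip(weights, activations)))
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Scale so the strongest location is 1.0 (skip if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


def blend(original, heat, opacity=0.45):
    """Blend one heatmap channel value onto one image channel value."""
    return opacity * heat + (1.0 - opacity) * original


# Toy input: channel 0 pushes the defect score up, channel 1 pushes it down.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[0.4, 0.4], [0.4, 0.4]],
    [[-0.2, -0.2], [-0.2, -0.2]],
]
heatmap = grad_cam(activations, gradients)
```

In the toy input, only the top-left location ends up highlighted: the bottom-right activation belongs to a channel with a negative weight, so ReLU removes it.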
Reading the heatmap
The heatmap is generated at 8 × 8 feature resolution and upsampled — so you get a coarse spatial explanation, not pixel-level precision. It tells you which region triggered the alert, not which exact pixel is the defect.
Autoencoder — a second opinion
The CNN classifier is powerful, but it can be overconfident. A separate autoencoder acts as an independent sanity check using a completely different approach.
The idea
The autoencoder was trained exclusively on good bottles. Its job is simple: compress the image down into a compact code, then reconstruct it back to full size. Because it only ever saw good images during training, it gets very good at reconstructing those — and struggles with defective ones.
When a defective image is passed through, the reconstruction will be off in the damaged region. We measure this as the mean squared error (MSE) between the original and the reconstruction — the anomaly score. High MSE = the AE couldn't reconstruct it cleanly = likely anomaly.
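As a rough illustration of the anomaly score, here is the MSE computed over two flattened images. The function name `anomaly_score` and the four-pixel toy images are assumptions made for the example; the real inputs would be full-resolution images and the autoencoder's output.

```python
def anomaly_score(original, reconstruction):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(original) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    squared_errors = [(o - r) ** 2 for o, r in zip(original, reconstruction)]
    return sum(squared_errors) / len(squared_errors)


# A good image is reconstructed faithfully.
good = [0.5, 0.5, 0.5, 0.5]
good_reconstruction = [0.5, 0.5, 0.5, 0.5]

# A defective image has one dark pixel that the autoencoder "repairs".
defective = [0.5, 0.5, 0.0, 0.5]
defective_reconstruction = [0.5, 0.5, 0.5, 0.5]

good_score = anomaly_score(good, good_reconstruction)
defect_score = anomaly_score(defective, defective_reconstruction)
```

The defective image scores higher because the reconstruction fills in the damaged pixel with what a good bottle would look like there.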
When the two models disagree
If the CNN says Defective but the AE anomaly score is low (it reconstructed the image well), or vice versa, the result is flagged with an amber "needs manual review" alert. This disagreement pattern often catches edge cases — images that are genuinely ambiguous, or where one model is operating near its reliability boundary.
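The disagreement rule might look something like the sketch below. The 0.336 CNN threshold comes from this page; the autoencoder threshold `AE_THRESHOLD`, the function name `combine`, the use of `>=` at the boundary, and the choice to report the CNN's label as the headline verdict are all assumptions for illustration.

```python
CNN_THRESHOLD = 0.336   # detection threshold quoted on this page
AE_THRESHOLD = 0.01     # hypothetical MSE cut-off, not taken from the source


def combine(defect_probability, ae_score):
    """Merge the two model outputs into a verdict plus a review flag."""
    cnn_defective = defect_probability >= CNN_THRESHOLD
    ae_anomalous = ae_score >= AE_THRESHOLD
    return {
        # Assumption: the CNN decides the headline label.
        "verdict": "Defective" if cnn_defective else "Good",
        # Flag for a human whenever the two models disagree.
        "needs_manual_review": cnn_defective != ae_anomalous,
    }


agree = combine(defect_probability=0.90, ae_score=0.05)
disagree = combine(defect_probability=0.90, ae_score=0.001)
```

Here `agree` is a plain Defective result, while `disagree` is Defective with the manual-review flag set, because the autoencoder reconstructed the image well.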
MVTec Anomaly Detection dataset
DefectScope is trained and evaluated exclusively on the bottle category of the MVTec AD dataset (Bergmann et al., 2019). MVTec AD has 15 categories (10 objects and 5 textures), including bottle, cable, carpet, leather, metal nut, and screw; we only trained on bottle. The images are glass/plastic bottles photographed from directly overhead (top-down, bird's-eye view) on a plain background with consistent industrial lighting.
Defect categories in training data
What a "good" image looks like
[Four sample images, each labelled "Good"]
What you're looking at: the circular shape is the bottle opening as seen from directly above. Camera pointing straight down, single bottle centred in frame, plain white background, consistent industrial lighting. The model learned what "normal" looks like exclusively from images like these — any other angle, background, or object type produces meaningless results.
The numbers — what they actually mean
The model outputs a probability, not a hard yes/no. Understanding the numbers makes the results actionable.
Defect probability
The raw probability that the image contains a defect, ranging 0–100%. The result bar fills to this value. Once it passes the detection threshold, it turns red and the verdict flips to Defective.
Why threshold = 33.6%?
The default approach — label as defective if the defect score exceeds 50% — is naive. We ran a search across every possible threshold on the test set and measured the F1-score at each point. The sweet spot was 33.6%.
A threshold below 50% also suits industrial QA, where a missed defect (false negative) is almost always more costly than a false alarm: we'd rather flag a borderline case for human review than silently pass a broken bottle.
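A threshold search of this kind can be sketched as follows. The scores and labels are made-up toy data, and the names `f1_at` and `best_threshold` are hypothetical. Trying each distinct score as a candidate is one simple way to cover every threshold that changes the predictions.

```python
def f1_at(threshold, scores, labels):
    """F1-score when every score >= threshold is labelled defective (1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return the candidate threshold with the highest F1 (lowest on ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Toy data: 1 = defective, 0 = good.
scores = [0.05, 0.10, 0.30, 0.40, 0.45, 0.70, 0.90]
labels = [0, 0, 0, 1, 1, 1, 1]

chosen = best_threshold(scores, labels)
f1_at_half = f1_at(0.5, scores, labels)
f1_at_chosen = f1_at(chosen, scores, labels)
```

On this toy data the search picks 0.40, because a 0.5 cut-off would miss the two defective images scored 0.40 and 0.45.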
AUC-ROC = 0.984
The Area Under the ROC Curve measures how reliably the model ranks defective images above good ones — across every possible threshold at once. A score of 0.5 means random guessing. A score of 1.0 is perfect separation. We're at 0.984.
What this means in practice: if you randomly pick one good image and one defective image, the model will rank the defective one higher 98.4% of the time.
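That pairwise reading can be computed directly, as in this small sketch on made-up scores. The function name `auc_by_pairs` is hypothetical, and counting ties as half a win is the usual convention rather than something stated on this page.

```python
def auc_by_pairs(scores, labels):
    """AUC-ROC as the fraction of (defective, good) pairs ranked correctly."""
    defective = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]

    wins = 0.0
    for d in defective:
        for g in good:
            if d > g:
                wins += 1.0      # defective image ranked above the good one
            elif d == g:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(defective) * len(good))


# Toy data: three good images (0) and three defective images (1).
scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
labels = [0, 0, 0, 1, 1, 1]

auc = auc_by_pairs(scores, labels)
```

Eight of the nine pairs are ordered correctly here; the only miss is the good image scored 0.6 against the defective image scored 0.4.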
What to upload — and what not to
The single most common reason for wrong predictions is uploading an out-of-domain image. The model was trained exclusively on overhead bottle photos — it has no idea what to do with anything else, but it won't tell you that. It will still produce a label. It just won't mean anything.
Upload
- Glass or plastic bottle, photographed from directly above
- Camera pointing straight down (top-down, bird's-eye view)
- Single bottle, centred, filling most of the frame
- Plain white or grey background: no clutter, no other objects
- Even, consistent lighting: no harsh shadows or reflections
- Known defect types: large cracks, small chips, contamination
- JPG or PNG, at least 128 × 128 px
Don't upload
- Photos of people, animals, food, or scenes
- Screenshots, documents, or app UI
- Multi-object or cluttered backgrounds
- Non-bottle products such as cables or metal parts
- Very blurry, dark, or overexposed images
- Side-angle or angled shots: the model expects top-down only
- Memes, posters, or any non-product image
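The mechanical requirements (file type, size, minimum resolution) can be checked before an image ever reaches the model. The sketch below is one way to do that; the function name `upload_problems` is hypothetical, and reading "10 MB" as 10 × 1024 × 1024 bytes is an assumption. It cannot tell whether the photo actually shows a bottle from above.

```python
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024          # assumption: "10 MB" means 10 MiB
MIN_SIDE = 128                        # minimum width and height in pixels
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def upload_problems(filename, size_bytes, width, height):
    """Return a list of human-readable problems; empty means the file passes."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("file must be JPG or PNG")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds the 10 MB limit")
    if min(width, height) < MIN_SIDE:
        problems.append("image must be at least 128 x 128 px")
    return problems


ok = upload_problems("bottle.png", 2_000_000, 900, 900)
bad = upload_problems("scan.gif", 11 * 1024 * 1024, 64, 64)
```

A check like this only filters out files the pipeline could never handle well; an out-of-domain photo in a valid format still passes.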
What this model cannot do
No model is perfect. Here's exactly where DefectScope falls short — so you can use it appropriately.
Single category only
Trained exclusively on MVTec "bottle." Upload a cable, fabric, or metal part and the result is undefined — the model will produce a confident label, but it has no basis for it. There are 14 other MVTec categories it knows nothing about.
No out-of-domain detection
The model cannot say "I don't recognise this image." It will confidently classify a photo of the sky, a selfie, or a dog. The confidence score gives a hint — values below 60% usually signal an unfamiliar input — but it's not a reliable OOD detector.
Not a final QA verdict
AUC-ROC 0.984 is strong, but not perfect. Edge cases exist — especially small chips near the detection threshold. Treat this as a fast screening step, not a final pass/fail gate. Manual review remains important for borderline cases.
Small training set
MVTec bottle has ~300 training images. The model performs well within this domain, but may not generalise to unseen lighting conditions, different bottle shapes, or transparent bottles. More production data would meaningfully improve robustness.
See it in action
The sample images on the Analyze page are from the exact test set the model was evaluated on — they're the best way to see the heatmap and dual-model output for real.