Image Classification System Development
Image classification is the assignment of one or more classes to each image. Practical applications include sorting medical images by pathology, tagging e-commerce catalogs by category, filtering user-generated content, and classifying production defects. On standard benchmarks the task has been solved at above-human accuracy since 2015, but proper adaptation to a specific domain still requires a methodical approach.
Architecture Selection
For most tasks, the optimal choice is EfficientNet-B4 or ConvNeXt-Tiny: a good balance of accuracy and inference speed.
| Architecture | Top-1 ImageNet | Parameters | Latency (T4 GPU) |
|---|---|---|---|
| EfficientNet-B0 | 77.1% | 5.3M | 3.5 ms |
| EfficientNet-B4 | 82.9% | 19M | 9.2 ms |
| ConvNeXt-Tiny | 82.1% | 28M | 7.8 ms |
| ViT-B/16 | 81.8% | 86M | 12.1 ms |
| EfficientNet-B7 | 84.4% | 66M | 28 ms |
For edge devices (Raspberry Pi, Jetson Nano): MobileNetV3, EfficientNet-Lite, YOLO11-cls.
Transfer Learning and Fine-tuning
Training from scratch requires millions of examples. Fine-tuning a pretrained model yields good results with hundreds of images per class.
```python
import timm
import torch.nn as nn


def build_classifier(num_classes: int,
                     pretrained_model: str = 'efficientnet_b4'):
    model = timm.create_model(
        pretrained_model,
        pretrained=True,
        num_classes=0,  # remove original classifier head
    )
    embedding_dim = model.num_features  # 1792 for B4

    # Freeze backbone in early epochs
    for param in model.parameters():
        param.requires_grad = False

    # Custom classification head
    classifier = nn.Sequential(
        nn.Linear(embedding_dim, 512),
        nn.GELU(),
        nn.Dropout(0.3),
        nn.Linear(512, num_classes),
    )
    model.classifier = classifier
    return model
```
Training strategy: 5 epochs with frozen backbone → unfreeze last 2 blocks → 10 epochs with 10x lower LR → full unfreeze → another 10 epochs with cosine schedule.
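The staged schedule above can be sketched as follows. This is a minimal sketch: the helper names are my own, and `model.blocks` assumes timm's EfficientNet implementation (other architectures expose different attribute names).

```python
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle requires_grad for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable


def make_optimizer(model: nn.Module, lr: float) -> torch.optim.Optimizer:
    """Fresh optimizer over the currently trainable parameters only."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)


# Stage 1: 5 epochs, backbone frozen (done in build_classifier), e.g. lr=1e-3
# Stage 2: unfreeze the last two blocks, 10 epochs at a 10x lower LR:
#   for block in model.blocks[-2:]:
#       set_trainable(block, True)
#   optimizer = make_optimizer(model, lr=1e-4)
# Stage 3: full unfreeze, 10 epochs with a cosine schedule:
#   set_trainable(model, True)
#   optimizer = make_optimizer(model, lr=1e-4)
#   scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
```

Rebuilding the optimizer at each stage keeps frozen parameters out of the optimizer state entirely, which is simpler than filtering gradients by hand.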
Dealing with Class Imbalance
Real datasets are rarely balanced. Strategies:
- Weighted random sampler: sampling frequency is inversely proportional to class size
- Focal Loss: FL(p) = -(1-p)^γ · log(p), focuses training on hard examples (γ=2 is standard)
- Oversampling rare classes: albumentations augmentation only for underrepresented classes
- Class-weighted cross-entropy: weights proportional to 1 / class_frequency
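Minimal sketches of the first two strategies (function names are my own; the sampler weights are meant for `torch.utils.data.WeightedRandomSampler`):

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1 - pt) ** gamma * log_pt).mean()


def sample_weights(labels: list) -> torch.Tensor:
    """Per-example weights inversely proportional to class frequency."""
    labels = torch.tensor(labels)
    counts = torch.bincount(labels).float()
    return 1.0 / counts[labels]
```

With γ=0 the focal loss reduces to plain cross-entropy; raising γ progressively down-weights examples the model already classifies confidently.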
Multi-class vs Multi-label Classification
Multi-class (one class per image): softmax + cross-entropy. Example: animal type.
Multi-label (multiple classes simultaneously): sigmoid + binary cross-entropy. Example: photo tags (nature + mountains + sunset). The activation threshold is tuned separately for each class by maximizing F1.
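Per-class threshold tuning can be sketched as a simple grid search on a validation set (a hypothetical helper, not a library function):

```python
import numpy as np
from sklearn.metrics import f1_score


def best_thresholds(probs: np.ndarray, labels: np.ndarray,
                    grid: np.ndarray = np.arange(0.05, 0.95, 0.05)) -> np.ndarray:
    """For each class, pick the sigmoid threshold maximizing F1.
    probs, labels: arrays of shape (n_samples, n_classes)."""
    n_classes = probs.shape[1]
    thresholds = np.empty(n_classes)
    for c in range(n_classes):
        scores = [f1_score(labels[:, c], probs[:, c] >= t) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds
```

At inference, a class is assigned whenever its sigmoid output exceeds its own tuned threshold rather than a global 0.5.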
Evaluation Metrics
- Top-1 / Top-5 Accuracy — for balanced datasets
- Macro-averaged F1 — for imbalanced data
- Cohen's Kappa — for medical tasks
- AUC-ROC per class — for multi-label classification
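A toy example (illustrative predictions only) of why macro F1 is preferred over accuracy on imbalanced data:

```python
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

y_true = np.array([0, 0, 0, 0, 1, 2])  # class 0 dominates
y_pred = np.array([0, 0, 0, 0, 2, 2])  # class 1 is never predicted

accuracy = (y_true == y_pred).mean()                 # looks healthy
macro_f1 = f1_score(y_true, y_pred, average="macro",
                    zero_division=0)                 # exposes the missed class
kappa = cohen_kappa_score(y_true, y_pred)            # chance-corrected agreement

print(accuracy, macro_f1, kappa)
```

Accuracy stays high because the majority class carries it, while macro F1 averages per-class scores and drops sharply for the class that is never predicted.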
Performance and Deployment
For API service: ONNX export + ONNX Runtime → latency 5–15ms on CPU (batch=1). For GPU: TorchServe with dynamic batching. For mobile: Core ML (iOS), TFLite (Android).
| Task Complexity | Timeline |
|---|---|
| 2–10 classes, 1000+ photos/class | 1–2 weeks |
| 50+ classes or complex domain | 3–5 weeks |
| Hierarchical classification, edge deployment | 5–8 weeks |







