# CIFAR-10: reproducing He et al. (2016)

`examples/cifar/train_cifar_resnet56.py` replicates the CIFAR-10 result from *Deep Residual Learning for Image Recognition* (He et al., 2016): **6.97 % test error (~93 % top-1 accuracy)** with ResNet-56, matching the paper's SGD recipe exactly. It is the primary example of writing a **custom** {class}`~mentor.trainers.MentorTrainer` that deviates from the built-in Adam + StepLR defaults.

```bash
# fresh run
python examples/cifar/train_cifar_resnet56.py

# resume, show progress bars
python examples/cifar/train_cifar_resnet56.py \
    -resume_path ./tmp/resnet56.pt -epochs 200 -verbose true
```

## Performance

Measured on an RTX 3090 (batch size 128, single GPU, < 1 GB GPU memory):

| Metric | Value |
|---|---|
| Throughput | ~43 iterations / sec |
| Total runtime | ~30 min (78 K iterations) |
| Peak GPU memory | < 1 GB |
| Best validation accuracy | ~93.02 % |

The validation-loss curve below shows the characteristic three-step staircase produced by the iteration-based LR schedule:

```{figure} ../_static/cifar_56_loss.png
:alt: Validation loss over 200 epochs, with three sharp drops at the LR milestones
:align: center

Validation loss over 200 epochs. The dotted vertical line marks epoch 0 (baseline before training); the three drops correspond to LR reductions at 32 K, 48 K, and 64 K iterations (~epochs 82, 123, 164).
```

Reproduce the plot from a finished checkpoint:

```bash
mtr_plot_file_hist -paths ./tmp/resnet56.pt -verbose \
    -values validate/loss -output /tmp/cifar_56_loss.png
```

## Key design decisions

**SGD instead of Adam**
: The built-in {class}`~mentor.trainers.Classifier` and {class}`~mentor.trainers.Regressor` trainers use Adam. `CifarSGDResnetClassifier` overrides `create_train_objects` to create an SGD optimiser with momentum 0.9 and weight decay 1e-4, the settings from the paper.
  Assigning `self.trainer = CifarSGDResnetClassifier()` in the model's `__init__` is sufficient; {class}`~mentor.Mentee` delegates `create_train_objects`, `training_step`, and `validation_step` to the trainer automatically.

**Iteration-based LR schedule**
: The paper's milestones (32 K / 48 K / 64 K iterations) do not align with epoch boundaries for all batch sizes. `IterationMultiStepLR` reads {attr}`~mentor.Mentee.total_train_iterations`, a cumulative batch counter maintained and checkpointed by {class}`~mentor.Mentee`, instead of carrying its own state. `state_dict()` therefore returns `{}`, and `load_state_dict()` simply re-derives the correct LR from the restored counter. The schedule survives resume unchanged, even across machines or batch-size changes.

**First metric key is the principal metric**
: `default_training_step` returns `{"acc": acc, "loss": loss.item()}` with `acc` first. {meth}`~mentor.Mentee.validate_epoch` always *maximises* the first key when selecting the best checkpoint, so a higher-is-better metric must come first. Putting `loss` first would cause the untrained model (highest loss) to be permanently recorded as "best".

## Source

```{literalinclude} ../../examples/cifar/train_cifar_resnet56.py
:language: python
:linenos:
```
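
As a companion to the listing, the stateless schedule described under *Key design decisions* can be sketched in plain Python. This is a minimal illustration, not the code in `train_cifar_resnet56.py`: `lr_at_iteration` and `StatelessIterationSchedule` are hypothetical names, the initial LR of 0.1 and the decay factor of 0.1 are assumptions, and the epoch arithmetic assumes the final partial batch of each epoch is kept.

```python
import math
from bisect import bisect_right

# Milestones come from the text above; BASE_LR and GAMMA are assumed values.
MILESTONES = (32_000, 48_000, 64_000)  # cumulative iterations at which the LR drops
BASE_LR = 0.1                          # assumption: initial learning rate
GAMMA = 0.1                            # assumption: multiplicative decay per milestone


def lr_at_iteration(iteration, base_lr=BASE_LR, milestones=MILESTONES, gamma=GAMMA):
    """Return the LR implied by a cumulative batch count (pure function)."""
    drops = bisect_right(milestones, iteration)  # number of milestones already reached
    return base_lr * gamma ** drops


class StatelessIterationSchedule:
    """Hypothetical stand-in for a schedule that keeps no state of its own.

    `get_iteration` is any callable returning the checkpointed batch counter.
    """

    def __init__(self, get_iteration):
        self.get_iteration = get_iteration

    def current_lr(self):
        return lr_at_iteration(self.get_iteration())

    def state_dict(self):
        return {}  # nothing to save: the counter lives elsewhere

    def load_state_dict(self, state):
        pass  # nothing to restore: the LR is re-derived on the next query


# Epoch arithmetic behind the "~epochs 82, 123, 164" caption:
# 50,000 CIFAR-10 training images at batch size 128.
iters_per_epoch = math.ceil(50_000 / 128)
milestone_epochs = [round(m / iters_per_epoch) for m in MILESTONES]
total_iterations = 200 * iters_per_epoch

# Simulated resume: only the counter is restored, and the LR follows from it.
counter = {"n": 50_000}
schedule = StatelessIterationSchedule(lambda: counter["n"])
schedule.load_state_dict({})
resumed_lr = schedule.current_lr()
```

Because the learning rate is a pure function of the counter, two runs that restore the same counter value agree on the LR regardless of when or where the checkpoint was written.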