NN Training
===========

All QUiNN architectures are trained through the generic ``nnfit`` function, which provides a configurable training loop with support for multiple loss functions, optimizers, learning-rate schedules, mini-batching, and automatic early stopping. The standard entry point is ``MLPBase.fit(xtrn, ytrn, **kwargs)``, which delegates to ``nnfit`` internally.

Training Objective
------------------

At each gradient step the optimizer reduces a loss function :math:`\mathcal{L}_{\mathcal{B}}(w)` evaluated on the current mini-batch :math:`\mathcal{B} \subseteq \{1,\ldots,N\}`. For plain gradient descent the update is

.. math::

   w^{(t+1)} = w^{(t)} - \eta_t\,\nabla_w \mathcal{L}_{\mathcal{B}}(w^{(t)}),

where :math:`\eta_t` is the learning rate at step :math:`t`. Adam uses the same mini-batch gradient but rescales each parameter's step with its moment estimates.

Loss Functions
--------------

``nnfit`` selects the loss through either the ``loss_fn`` string or a user-supplied callable ``loss_xy``.

Mean Squared Error (``loss_fn='mse'``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default loss is the mean squared error over the mini-batch:

.. math::

   \mathcal{L}_{\text{MSE}}(w) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \bigl\|y_i - M(x_i;\,w)\bigr\|^2.

Negative Log-Posterior (``loss_fn='logpost'``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When a Bayesian prior is used (e.g. for ``NN_MCMC`` or ``NN_RMS``), the loss is the negative log-posterior combining a Gaussian likelihood and a Gaussian prior centred at an anchor :math:`w_0`:

.. math::

   \mathcal{L}_{\text{logpost}}(w) =
   \frac{1}{2\sigma^2}\sum_{i \in \mathcal{B}}\bigl\|y_i - M(x_i;\,w)\bigr\|^2
   + \frac{|\mathcal{B}|}{2}\log(2\pi\sigma^2)
   + \frac{|\mathcal{B}|}{N}\!\left(
   \frac{1}{2\sigma_{\text{prior}}^2}\|w - w_0\|^2
   + \frac{K}{2}\log(2\pi\sigma_{\text{prior}}^2)
   \right),

where :math:`\sigma` is the data-noise standard deviation, :math:`\sigma_{\text{prior}}` the prior standard deviation, and :math:`K` the parameter count. This loss requires the ``datanoise`` and ``priorparams`` arguments.
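The negative log-posterior formula above can be checked term by term with a few lines of NumPy. This is only a sketch of the formula, not the QUiNN implementation: the function name ``neg_log_posterior`` and its argument names are hypothetical, the model outputs are passed in directly rather than computed by a network, and scalar outputs are assumed. It also shows a consequence of the :math:`|\mathcal{B}|/N` factor: summed over the disjoint mini-batches of one epoch, the loss equals the full-batch value, so the prior is counted once per epoch.

```python
import numpy as np

def neg_log_posterior(pred, y, w, w0, sigma, sigma_prior, n_total):
    """Mini-batch negative log-posterior (hypothetical helper, scalar outputs assumed).

    pred, y : arrays of shape (|B|,), model outputs M(x_i; w) and targets
    w, w0   : flattened parameter vector and prior anchor, shape (K,)
    n_total : full training-set size N
    """
    nb = y.shape[0]  # |B|
    k = w.size       # K
    # Gaussian likelihood: squared misfit plus normalisation constant
    data_term = np.sum((y - pred) ** 2) / (2.0 * sigma**2)
    data_norm = 0.5 * nb * np.log(2.0 * np.pi * sigma**2)
    # Gaussian prior centred at the anchor w0
    prior_term = np.sum((w - w0) ** 2) / (2.0 * sigma_prior**2)
    prior_norm = 0.5 * k * np.log(2.0 * np.pi * sigma_prior**2)
    # The prior is down-weighted by |B|/N
    return data_term + data_norm + (nb / n_total) * (prior_term + prior_norm)

# Illustrative random inputs; pred stands in for the model outputs.
rng = np.random.default_rng(0)
y = rng.normal(size=5)
pred = rng.normal(size=5)
w, w0 = rng.normal(size=3), np.zeros(3)

full = neg_log_posterior(pred, y, w, w0, sigma=0.1, sigma_prior=1.0, n_total=5)
batches = (slice(0, 2), slice(2, 4), slice(4, 5))
parts = sum(neg_log_posterior(pred[s], y[s], w, w0, 0.1, 1.0, 5) for s in batches)
assert np.isclose(full, parts)  # mini-batch losses add up to the full-batch loss
```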
Log-Loss (``loss_fn='logloss'``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fits in the log-transformed space with a user-specified shift :math:`y_{\text{shift}}`:

.. math::

   \mathcal{L}_{\text{log}}(w) = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \bigl[\log(M(x_i;\,w) - y_{\text{shift}}) - \log(y_i - y_{\text{shift}})\bigr]^2.

This loss requires the ``lossparams`` argument. Both logarithms need positive arguments, so :math:`y_{\text{shift}}` must lie below every target :math:`y_i` and every model output.

Custom Loss (``loss_xy``)
^^^^^^^^^^^^^^^^^^^^^^^^^

Any callable with signature ``loss_xy(x_batch, y_batch) -> scalar tensor`` can be passed directly. When ``loss_xy`` is provided, the ``loss_fn`` string is ignored. This mechanism is used internally by ``NN_VI`` (Bayes by Backprop) and can be used for any problem-specific objective.

Optimizers
----------

``nnfit`` supports two first-order optimizers:

.. list-table::
   :header-rows: 1
   :widths: 15 55

   * - String
     - Algorithm
   * - ``'adam'``
     - Adam (adaptive moment estimation), the default. Updates each parameter with bias-corrected first and second moment estimates.
   * - ``'sgd'``
     - Stochastic Gradient Descent with optional momentum (via PyTorch defaults).

Both accept an optional weight-decay coefficient ``wd`` (:math:`\lambda` below), which adds an L2 penalty to the loss:

.. math::

   \mathcal{L}_{\text{reg}}(w) = \mathcal{L}(w) + \frac{\lambda}{2}\|w\|^2.

Learning Rate Schedules
-----------------------

Three scheduling modes are available (mutually exclusive):

1. **Constant**: when neither ``lmbd`` nor ``scheduler_lr`` is set, the learning rate stays at ``lrate`` throughout training.

2. **Lambda schedule**: a user-defined function ``lmbd(epoch)`` that returns a multiplicative factor. The effective rate at epoch :math:`t` is

   .. math::

      \eta_t = \texttt{lrate} \times \texttt{lmbd}(t).

3. **ReduceLROnPlateau**: set ``scheduler_lr='ReduceLROnPlateau'``. The scheduler monitors the validation loss and multiplies the learning rate by ``factor`` whenever the loss plateaus for ``cooldown`` epochs:

   .. math::

      \eta \leftarrow \texttt{factor} \times \eta \quad\text{if validation loss stagnates for } \texttt{cooldown} \text{ epochs}.

Mini-Batch Training
-------------------

When ``batch_size`` is specified and smaller than :math:`N`, each epoch is split into :math:`\lceil N / B \rceil` sub-epochs, where :math:`B` is the batch size. At the start of every epoch the training data is randomly permuted, and each sub-epoch draws a contiguous slice of at most :math:`B` samples:

.. math::

   \mathcal{B}_k = \bigl\{\pi(kB+1),\;\pi(kB+2),\;\ldots,\;\pi\!\bigl(\min((k+1)B,\,N)\bigr)\bigr\},
   \qquad k = 0,\ldots,\lceil N/B \rceil - 1,

where :math:`\pi` is the random permutation. Only the last slice can be shorter than :math:`B`. When ``batch_size`` is ``None`` or is at least :math:`N`, full-batch training is used.

Early Stopping
--------------

At every gradient step the validation loss is evaluated (without gradients). If it improves on the current best, a deep copy of the model is checkpointed:

.. math::

   w^* = \arg\min_{w^{(t)}} \mathcal{L}_{\text{val}}(w^{(t)}).

The returned model is always the best snapshot, not the final-epoch model. When no separate validation set is provided (``val=None``), the training set is used for both training and validation.

Arguments
---------

.. list-table::
   :header-rows: 1
   :widths: 18 12 50

   * - Argument
     - Default
     - Description
   * - ``nnmodel``
     -
     - The ``torch.nn.Module`` to train.
   * - ``xtrn``
     -
     - Training inputs, numpy array of shape :math:`(N,\,d)`.
   * - ``ytrn``
     -
     - Training targets, numpy array of shape :math:`(N,\,o)`.
   * - ``val``
     - ``None``
     - Validation data as an ``(x, y)`` tuple. If ``None``, the training set doubles as validation.
   * - ``loss_fn``
     - ``'mse'``
     - Loss identifier: ``'mse'``, ``'logpost'``, or ``'logloss'``. Ignored when ``loss_xy`` is provided.
   * - ``loss_xy``
     - ``None``
     - Custom loss callable ``loss_xy(x, y) -> scalar``. Overrides ``loss_fn`` when provided.
   * - ``datanoise``
     - ``None``
     - Data noise :math:`\sigma` for ``'logpost'`` loss.
   * - ``wd``
     - ``0.0``
     - Weight decay (L2 regularisation) coefficient :math:`\lambda`.
   * - ``priorparams``
     - ``None``
     - Dictionary with keys ``'sigma'`` (:math:`\sigma_{\text{prior}}`) and ``'anchor'`` (:math:`w_0`) for the Gaussian prior.
   * - ``lossparams``
     - ``None``
     - Parameters for custom losses (e.g. ``[y_shift]`` for ``'logloss'``).
   * - ``optimizer``
     - ``'adam'``
     - Optimizer string: ``'adam'`` or ``'sgd'``.
   * - ``lrate``
     - ``0.1``
     - Base learning rate :math:`\eta`.
   * - ``lmbd``
     - ``None``
     - Lambda schedule ``lmbd(epoch) -> float``. Effective rate is ``lrate * lmbd(epoch)``.
   * - ``scheduler_lr``
     - ``None``
     - Adaptive scheduler. Currently only ``'ReduceLROnPlateau'`` is supported. Cannot be combined with ``lmbd``.
   * - ``nepochs``
     - ``5000``
     - Total number of training epochs.
   * - ``batch_size``
     - ``None``
     - Mini-batch size :math:`B`. ``None`` means full-batch.
   * - ``gradcheck``
     - ``False``
     - If ``True``, verify auto-diff gradients numerically (slow, experimental).
   * - ``cooldown``
     - ``100``
     - Cooldown epochs for ``ReduceLROnPlateau``.
   * - ``factor``
     - ``0.95``
     - Multiplicative factor for ``ReduceLROnPlateau``.
   * - ``freq_out``
     - ``100``
     - Screen-output frequency (in epochs).
   * - ``freq_plot``
     - ``1000``
     - Loss-history plot frequency (in epochs).
   * - ``lhist_suffix``
     - ``''``
     - Filename suffix for the saved loss-history figures.

Return Value
------------

``nnfit`` returns a dictionary with the following keys:

.. list-table::
   :header-rows: 1
   :widths: 22 55

   * - Key
     - Content
   * - ``'best_nnmodel'``
     - Deep copy of the model at the best validation loss.
   * - ``'best_loss'``
     - Best validation loss value.
   * - ``'best_epoch'``
     - Epoch index at which the best loss occurred.
   * - ``'best_fepoch'``
     - Fractional epoch (accounts for sub-epochs in mini-batch training).
   * - ``'history'``
     - List of ``[fepoch, batch_loss, train_loss, val_loss]`` recorded at every gradient step.
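As a rough illustration of how the pieces described on this page fit together, the sketch below runs a toy loop with per-epoch permutation, contiguous mini-batch slices, an optional lambda schedule, best-snapshot checkpointing, and a result dictionary with the same keys as the table above. It is not the ``nnfit`` source: ``toy_fit`` is a hypothetical name, a linear model with a NumPy weight vector stands in for the ``torch.nn.Module``, plain gradient descent on the MSE stands in for the optimizer, and recording ``batch_loss`` before the step is an assumption.

```python
import math
import numpy as np

def toy_fit(w, xtrn, ytrn, val=None, lrate=0.1, lmbd=None,
            nepochs=50, batch_size=None, seed=0):
    """Toy stand-in for the training loop (hypothetical; linear model y = x @ w)."""
    rng = np.random.default_rng(seed)
    ntrn = xtrn.shape[0]
    # With val=None the training set doubles as validation set.
    xval, yval = (xtrn, ytrn) if val is None else val
    if batch_size is None or batch_size >= ntrn:
        batch_size = ntrn                      # full-batch training
    nsub = math.ceil(ntrn / batch_size)        # sub-epochs per epoch

    def mse(x, y, w_):
        return float(np.mean((y - x @ w_) ** 2))

    out = {'best_nnmodel': w.copy(), 'best_loss': math.inf,
           'best_epoch': 0, 'best_fepoch': 0.0, 'history': []}
    for epoch in range(nepochs):
        # Lambda schedule: effective rate is lrate * lmbd(epoch)
        eta = lrate if lmbd is None else lrate * lmbd(epoch)
        perm = rng.permutation(ntrn)           # fresh permutation every epoch
        for k in range(nsub):
            idx = perm[k * batch_size:(k + 1) * batch_size]  # contiguous slice
            xb, yb = xtrn[idx], ytrn[idx]
            batch_loss = mse(xb, yb, w)        # assumption: recorded before the step
            grad = -2.0 * xb.T @ (yb - xb @ w) / len(idx)    # gradient of the MSE
            w = w - eta * grad
            fepoch = epoch + (k + 1) / nsub    # fractional epoch
            val_loss = mse(xval, yval, w)
            out['history'].append([fepoch, batch_loss, mse(xtrn, ytrn, w), val_loss])
            if val_loss < out['best_loss']:    # checkpoint the best snapshot
                out.update(best_nnmodel=w.copy(), best_loss=val_loss,
                           best_epoch=epoch, best_fepoch=fepoch)
    return out

# Noise-free toy data y = 2 x; 10 samples with batch_size=4 gives 3 sub-epochs.
x = np.linspace(-1.0, 1.0, 10).reshape(-1, 1)
y = 2.0 * x[:, 0]
out = toy_fit(np.zeros(1), x, y, lrate=0.1, nepochs=100, batch_size=4)
print(out['best_fepoch'], out['best_loss'], out['best_nnmodel'])
```

Because the snapshot is taken whenever the validation loss improves, ``out['best_loss']`` always equals the smallest ``val_loss`` entry in ``out['history']``, whichever step produced it.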