Mini-Torch Framework Specification
This document outlines the architecture for a “Mini-Torch” framework, designed for computer science students building neural networks from scratch. It mirrors the PyTorch API using only numpy, matplotlib, select elements of scipy, and standard Python, emphasizing manual gradient calculations and batch-first row-vector notation.
To improve encapsulation and modularity, the framework incorporates the following major architectural elements: an Optimizer base class (parameter updating), a Loss base class (error and initial gradient calculation), a Dataset/DataLoader pipeline (base classes to structure and iterate through data), a Module base class (layers, forward and backward passes), and a Sequential container (a Module subclass that manages the chaining of multiple layers).
Core Philosophy
To bridge the gap between foundational mathematics and the modern architecture of Generative AI (LLMs, GANs), this course utilizes a “Mini-Torch” Framework.
- Foundations: We implement algorithms “from scratch” to understand the mechanics.
- Modern Structure: We use Row-Vector (Batch-First) notation. This deviates from the Neural Network Design textbooks (which use column vectors) but aligns with modern GenAI papers and libraries (PyTorch, TensorFlow, JAX).
- No Black Boxes: We do not use autograd engines (like `torch.autograd`). We calculate gradients manually.
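For instance, under the batch-first row-vector convention each sample is one row of the input, so an entire minibatch flows through a linear layer in a single matrix product. The shapes below are arbitrary illustrative choices:

```python
import numpy as np

# Batch-first (row-vector) convention: each row of x is one sample.
x = np.random.randn(32, 4)   # batch of 32 samples, 4 features each
W = np.random.randn(4, 3)    # weights: (in_features, out_features)
b = np.zeros(3)              # bias broadcasts across the batch dimension
y = x @ W + b                # forward pass: (32, 4) @ (4, 3) -> (32, 3)
print(y.shape)               # (32, 3)
```

Note that the column-vector convention of the Neural Network Design textbook would instead compute `W x + b` with `W` of shape `(out_features, in_features)`.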
Allowable Libraries
- Required: `numpy` (standard array operations), `matplotlib` (visualization).
- Allowed for Efficiency: `scipy.special` (specifically `expit` for Sigmoid, and `softmax`), `numexpr` (specifically `evaluate` to apply optimizations for CPU execution).
- Prohibited: `torch`, `tensorflow`, `keras`, `sklearn.neural_network`.
Object-Oriented Principles and Design Patterns
The revised architecture relies heavily on established software design patterns to ensure the framework is modular, scalable, and easy to maintain.
- Single Responsibility Principle (SRP): Forward propagation and gradient management concerns are separated. `Module` strictly handles the model’s state and mathematical operations, while the `Optimizer` class takes responsibility for parameter updates. Data fetching and batching are delegated to the `Dataset` and `DataLoader` classes.
- Composite Pattern (Structural): The Composite pattern allows a tree structure of simple and composite objects to be treated uniformly. The `Sequential` class implements this by inheriting from the base `Module` while simultaneously acting as a container for a list of other `Module` instances (like `Linear` or `ReLU`). Calling `forward()` on a `Sequential` object automatically propagates the input through all contained modules.
- Strategy Pattern (Behavioral): The Strategy pattern encapsulates interchangeable algorithms inside a class.
  - The `Optimizer` base class serves as a strategy for weight updates. Students can implement and swap `SGD` for `AdamW`, for example, without altering the model architecture.
  - The `Loss` base class serves as a strategy for error calculation, allowing students to implement and seamlessly switch between `MSELoss` (for regression) and `CrossEntropyLoss` (for classification), for example.
- Iterator Pattern (Behavioral): The Iterator pattern provides sequential access to the elements of a collection. The `DataLoader` acts as an iterable wrapper around a `Dataset` object, managing complex data flows, minibatches, and shuffling without exposing the underlying dataset structure.
Using these patterns makes it possible to implement simplified versions of many of these classes first, later adding more elaborate versions without needing to change other classes.
UML Framework Architecture
```mermaid
classDiagram
    class Module {
        <<abstract>>
        +forward(x)
        +backward(grad_output)
        +parameters() list
        +grads() list
    }
    class Sequential {
        -modules : list~Module~
        +forward(x)
        +backward(grad_output)
        +parameters() list
    }
    class Linear {
        -W : array
        -b : array
        +forward(x)
        +backward(grad_output)
    }
    class Activation {
        <<abstract>>
        +forward(x)
        +backward(grad_output)
    }
    Module <|-- Sequential
    Module <|-- Linear
    Module <|-- Activation
    Sequential o-- Module : contains

    class Optimizer {
        <<abstract>>
        #params : list
        #lr : float
        +zero_grad()
        +step()
    }
    class SGD {
        +step()
    }
    class AdamW {
        +step()
    }
    Optimizer <|-- SGD
    Optimizer <|-- AdamW
    Optimizer --> Module : modifies parameters

    class Loss {
        <<abstract>>
        +forward(predictions, targets)
        +backward()
    }
    class MSELoss {
        +forward(predictions, targets)
        +backward()
    }
    class CrossEntropyLoss {
        +forward(predictions, targets)
        +backward()
    }
    Loss <|-- MSELoss
    Loss <|-- CrossEntropyLoss

    class Dataset {
        <<abstract>>
        +__len__()
        +__getitem__(idx)
    }
    class DataLoader {
        -dataset : Dataset
        -batch_size : int
        -shuffle : bool
        +__iter__()
    }
    DataLoader o-- Dataset : iterates over
```
Core Component Specifications
The Neural Network Hierarchy (Module, Sequential, and Activation)
The Module base class is the foundational building block of the neural network. Every layer must implement an `__init__()`, a `forward(x)`, and a `backward(grad_output)` method:

- `__init__(self, ...)`: Strictly required. It must call `super().__init__()` and is used to define and register the internal layers and parameters the module will use. The arguments depend on whether the subclass implements a single layer or is a container.
- `forward(self, x)`: Specifies the forward-pass computation, i.e., how the input `x` flows through the layer(s) defined in `__init__()`.
- `backward(self, grad_output)`: Specifies the backward pass: the computation of parameter and input gradients from the output gradients.
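The interface above might be sketched as follows. Defaulting `parameters()` and `grads()` to empty lists is one possible design choice (convenient for parameter-free layers), not the course’s mandated implementation:

```python
class Module:
    """Abstract base class for all layers (a minimal sketch)."""

    def forward(self, x):
        raise NotImplementedError

    def backward(self, grad_output):
        raise NotImplementedError

    def parameters(self):
        # Parameter-free modules can rely on this default.
        return []

    def grads(self):
        return []

    def __call__(self, x):
        # Convenience: lets students write model(x) like in PyTorch.
        return self.forward(x)
```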
Sequential Container
- Subclasses `Module` and takes a list of `Module`s during initialization. Its `forward` method loops through the list, passing the output of one layer as the input to the next. Its `backward` method loops through the list in reverse, applying the chain rule.
- Parameter Management: The `Sequential` module recursively collects and returns the output of `parameters()` and `grads()` from all of its child modules, enabling the optimizer to update the entire network at once.
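A minimal sketch of the container; the two-method `Module` stand-in here exists only to make the snippet self-contained, with the real base class specified above:

```python
class Module:                      # stand-in for the base class described earlier
    def parameters(self): return []
    def grads(self): return []

class Sequential(Module):
    """Composite: a Module that chains a list of child Modules."""

    def __init__(self, modules):
        super().__init__()
        self.modules = modules

    def forward(self, x):
        for m in self.modules:              # output of one layer feeds the next
            x = m.forward(x)
        return x

    def backward(self, grad_output):
        for m in reversed(self.modules):    # chain rule, applied in reverse order
            grad_output = m.backward(grad_output)
        return grad_output

    def parameters(self):
        # Recursively gather every child's parameters for the optimizer.
        return [p for m in self.modules for p in m.parameters()]

    def grads(self):
        return [g for m in self.modules for g in m.grads()]
```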
Activation Layer
The Activation abstract class specializes the Module interface to act as a blueprint for parameter-free, non-linear transformations, such as ReLU, Sigmoid, or GELU. Because an activation layer’s sole purpose is to apply a mathematical function element-wise to the outputs of a preceding linear layer, its implementation of the Module interface is specialized in the following ways:
- No Trainable Parameters: Unlike a layer of “neurons”, an activation function does not contain any learnable weights or biases. Consequently, an `Activation` subclass specializes the interface by having its `parameters()` and `grads()` methods simply return empty lists `[]` (or by relying on a base `Module` implementation that defaults to returning empty lists).
- Simplified `__init__()`: The constructor does not need to initialize weight and bias matrices, nor placeholders for parameter gradients (`self.dW`, `self.db`). It only needs to set up a state variable (e.g., `self.x = None`) to cache the input data for the backward pass.
- Element-wise `forward(x)`: Applies the layer’s specific non-linear equation across the input array. For example, a ReLU subclass thresholds all negative inputs to 0, while a GELU subclass applies a smooth non-linear approximation. It caches the input `x` and returns the transformed array.
- Input-Only `backward(grad_output)`: Because there are no internal weights to optimize, the backward pass skips parameter gradient calculations entirely. Instead, it computes the local derivative of the activation function evaluated at the cached input `x`, multiplies it element-wise by the incoming `grad_output` (applying the chain rule), and returns the resulting `grad_input` to be passed to the preceding layer.
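A `ReLU` subclass following this specialization could look like the sketch below (method bodies are illustrative, not the reference implementation):

```python
import numpy as np

class ReLU:
    """Parameter-free activation: a minimal sketch of the Activation interface."""

    def __init__(self):
        self.x = None                      # cache for the backward pass

    def forward(self, x):
        self.x = x
        return np.maximum(0.0, x)          # element-wise threshold at zero

    def backward(self, grad_output):
        # Local derivative of ReLU is 1 where x > 0, else 0; apply chain rule.
        return grad_output * (self.x > 0)

    def parameters(self):
        return []                          # no trainable parameters

    def grads(self):
        return []
```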
The Optimization Engine (Optimizer)
The Optimizer base class handles mathematical optimization of model parameters.
- Initialization: Takes an iterable of references to the model’s learnable parameters and a learning rate (e.g., `lr=0.01`).
- `zero_grad()`: Iterates through the stored parameters and clears their gradients (sets them to zero) before each training step to prevent unintended gradient accumulation.
- `step()`: Iterates through the parameters and their calculated gradients, applying the specific update algorithm (such as Stochastic Gradient Descent).
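A sketch of the plain-SGD strategy. It assumes the optimizer receives the parameter and gradient arrays as parallel lists of numpy arrays and mutates them in place (the spec fixes only the `zero_grad()`/`step()` interface, so this calling convention is an assumption):

```python
class SGD:
    """Minimal sketch of the Optimizer strategy: plain gradient descent."""

    def __init__(self, params, grads, lr=0.01):
        self.params = params    # references to the model's parameter arrays
        self.grads = grads      # matching gradient arrays, in the same order
        self.lr = lr

    def zero_grad(self):
        for g in self.grads:
            g[...] = 0.0        # clear in place so layer references stay valid

    def step(self):
        for p, g in zip(self.params, self.grads):
            p -= self.lr * g    # in-place update keeps references valid
```

Updating in place (rather than rebinding `p = p - lr * g`) matters: the layers hold references to the same arrays, so rebinding would silently disconnect the optimizer from the model.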
The Error Calculation (Loss)
Loss functions quantify the difference between model predictions and target values and initiate the backpropagation process.
- `forward(predictions, targets)`: Returns a scalar measure of the error (e.g., Mean Squared Error or Cross-Entropy).
- `backward()`: Computes the initial loss gradient with respect to the network’s predictions. This output is then passed directly into the final `Module`’s `backward(grad_output)` method.
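A minimal `MSELoss` sketch consistent with this interface; dividing by the total number of elements in the `2/N` gradient is one common convention for the mean-squared-error derivative:

```python
import numpy as np

class MSELoss:
    """Minimal sketch of the Loss strategy: mean squared error."""

    def forward(self, predictions, targets):
        self.diff = predictions - targets    # cached for backward()
        return np.mean(self.diff ** 2)       # scalar error

    def backward(self):
        # d(mean((p - t)^2))/dp = 2 * (p - t) / N, with N total elements.
        return 2.0 * self.diff / self.diff.size
```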
Data Management (Dataset and DataLoader)
These classes separate data handling logic from the main training loop.
- `Dataset` Base Class: An abstract class requiring students to implement two Python “magic” methods: `__len__(self)` to return the total number of samples, and `__getitem__(self, index)` to retrieve a single `(x, y)` data sample and label at a specific index.
- `DataLoader`: Wraps the `Dataset` and acts as a Python generator/iterator. It groups individual samples into `numpy` arrays (minibatches) and handles dataset shuffling at the start of each epoch.
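A toy sketch of the pair. `ArrayDataset` is a hypothetical subclass used only for illustration, assuming samples live in numpy arrays:

```python
import numpy as np

class ArrayDataset:
    """Hypothetical Dataset subclass backed by two numpy arrays."""

    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class DataLoader:
    """Minimal sketch: yields shuffled minibatches as numpy arrays."""

    def __init__(self, dataset, batch_size=8, shuffle=True):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        idx = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(idx)    # reshuffle at the start of each epoch
        for start in range(0, len(idx), self.batch_size):
            batch = [self.dataset[i] for i in idx[start:start + self.batch_size]]
            xs, ys = zip(*batch)      # regroup samples into x- and y-batches
            yield np.stack(xs), np.stack(ys)
```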
The Standard Training Loop
With the revised architecture, students will implement a clean training loop that perfectly maps to the standard PyTorch workflow:
1. Iterate over epochs.
2. Iterate over batches yielded by the `DataLoader`.
3. Forward Pass: Pass the batch through the `Sequential` model to generate predictions.
4. Loss Calculation: Pass predictions and targets to the `Loss` object’s `forward` method.
5. Zero Gradients: Call `optimizer.zero_grad()`.
6. Backward Pass: Call `loss.backward()` to get the initial gradient, then pass it to `model.backward(grad_output)` to calculate all internal gradients.
7. Parameter Update: Call `optimizer.step()` to adjust the weights.
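The steps above can be exercised end to end. The `Linear` and `MSELoss` classes and the hand-inlined optimizer steps below are simplified stand-ins for the framework classes, fitting the toy relation y = 3x with full-batch gradient descent:

```python
import numpy as np

class Linear:
    """Simplified stand-in for the framework's Linear layer (batch-first)."""

    def __init__(self, n_in, n_out):
        rng = np.random.default_rng(0)
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)
        self.dW = np.zeros_like(self.W)
        self.db = np.zeros_like(self.b)

    def forward(self, x):
        self.x = x                          # cache input for backward
        return x @ self.W + self.b

    def backward(self, grad_output):
        self.dW[...] = self.x.T @ grad_output
        self.db[...] = grad_output.sum(axis=0)
        return grad_output @ self.W.T       # gradient w.r.t. the input

class MSELoss:
    def forward(self, p, t):
        self.diff = p - t
        return np.mean(self.diff ** 2)

    def backward(self):
        return 2.0 * self.diff / self.diff.size

# Toy data: learn y = 3x from 32 points on [-1, 1].
X = np.linspace(-1, 1, 32).reshape(-1, 1)
Y = 3.0 * X

model, loss_fn, lr = Linear(1, 1), MSELoss(), 0.1
for epoch in range(200):                    # 1. iterate over epochs
    preds = model.forward(X)                # 3. forward pass (one full batch here)
    err = loss_fn.forward(preds, Y)         # 4. loss calculation
    model.dW[...] = 0.0                     # 5. zero gradients (optimizer.zero_grad())
    model.db[...] = 0.0
    model.backward(loss_fn.backward())      # 6. backward pass
    model.W -= lr * model.dW                # 7. update (optimizer.step())
    model.b -= lr * model.db

# After training, model.W should approach 3.0 and model.b should approach 0.0.
```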
Important References and Further Reading
The following resources are highly recommended for students to deepen their understanding of the design patterns and architectural concepts used in this framework.
- Introduction to PyTorch (Raschka): A comprehensive primer on PyTorch tensors, autograd, and the standard training loop.
- PyTorch Data Structures (Official): Official guide on `torch.utils.data.Dataset` and `DataLoader`.
- PyTorch Neural Network Architecture: Deep dive into `nn.Module` and the `Sequential` container.
- Software Design Patterns Overview: A summary of Creational, Structural, and Behavioral design patterns (including Composite, Strategy, and Iterator).
- Neural Network Design: Deep Learning: Foundational text covering multilayer network training, gradients, and optimization (Chapters 2 & 3).