
Detailed Specification for DataLoader Methods

In the mini-torch framework, the DataLoader class is responsible for orchestrating how data is fed into the neural network during training or evaluation. While the Dataset class acts as a simple storage or retrieval mechanism for individual data samples, the DataLoader wraps this dataset and handles the complex logic of grouping individual samples into minibatches, shuffling the data to ensure robust training dynamics, and iterating over the dataset one batch at a time.
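To make the Dataset/DataLoader division of labor concrete, a minimal in-memory dataset might look like the sketch below. The class name NumpyDataset and its fields are illustrative, not part of the mini-torch framework; the only thing the DataLoader relies on is that the object supports indexing (`__getitem__`) and length queries (`__len__`).

```python
import numpy as np

class NumpyDataset:
    """A minimal in-memory dataset: parallel arrays of features and labels."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __getitem__(self, idx):
        # Return one (x, y) sample pair by index.
        return self.features[idx], self.labels[idx]

    def __len__(self):
        # Number of individual samples, not batches.
        return len(self.features)

ds = NumpyDataset(np.arange(12).reshape(6, 2), np.arange(6))
print(len(ds))  # 6
```

Because the DataLoader only uses this narrow interface, any storage strategy (in-memory arrays, lazy loading from disk, etc.) can sit behind it without changes to the batching logic.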

From an object-oriented design perspective, the DataLoader illustrates several key principles: it encapsulates the batching hyperparameters and logic in a single object, it composes with any Dataset that supports indexing and length queries, and it plugs into Python's iteration protocol by implementing __iter__ and __len__.

__init__()

The constructor initializes the data loader by storing a reference to the dataset and setting the hyperparameters that dictate how batches are formed.

Method Signature

def __init__(self, dataset, batch_size=1, shuffle=False, drop_last=False):

Specification

- Store the dataset reference and the batch_size, shuffle, and drop_last hyperparameters as instance attributes.
- No data is copied or batched at construction time; the DataLoader only records how batches should be formed when iteration begins.

Example Implementation

def __init__(self, dataset, batch_size=1, shuffle=False, drop_last=False):
    self.dataset = dataset
    self.batch_size = batch_size
    self.shuffle = shuffle
    self.drop_last = drop_last

__iter__()

This method makes the DataLoader iterable. Implemented as a generator function, it yields minibatches of data one at a time to the training or evaluation loop.

Method Signature

def __iter__(self):

Specification

- Build the sequence of sample indices, shuffling it if shuffle is True.
- Walk over the indices in steps of batch_size, fetching the samples for each slice from the dataset.
- If drop_last is True, discard a final batch that is smaller than batch_size.
- Yield each batch as a pair of batch-first NumPy arrays (inputs, targets).

Example Implementation

def __iter__(self):
    # Assumes numpy has been imported as np at module level.
    indices = np.arange(len(self.dataset))
    if self.shuffle:
        # Reorder the indices in place so each epoch sees a new ordering.
        np.random.shuffle(indices)

    for start_idx in range(0, len(indices), self.batch_size):
        batch_indices = indices[start_idx : start_idx + self.batch_size]

        # Handle the drop_last condition: skip a short final batch
        if self.drop_last and len(batch_indices) < self.batch_size:
            break

        batch_x, batch_y = [], []
        for idx in batch_indices:
            x, y = self.dataset[idx]
            batch_x.append(x)
            batch_y.append(y)

        # Stack individual samples into batch-first NumPy arrays
        yield np.vstack(batch_x), np.vstack(batch_y)
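To see the constructor and iterator working together, here is a self-contained sketch. The toy list of (feature, label) tuples stands in for a real Dataset object and is purely illustrative; note that the final batch has only one sample because 5 is not divisible by 2.

```python
import numpy as np

class DataLoader:
    def __init__(self, dataset, batch_size=1, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __iter__(self):
        indices = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(indices)
        for start_idx in range(0, len(indices), self.batch_size):
            batch_indices = indices[start_idx : start_idx + self.batch_size]
            if self.drop_last and len(batch_indices) < self.batch_size:
                break
            batch_x, batch_y = [], []
            for idx in batch_indices:
                x, y = self.dataset[idx]
                batch_x.append(x)
                batch_y.append(y)
            yield np.vstack(batch_x), np.vstack(batch_y)

# A toy dataset: 5 samples of 3 features each, with scalar labels.
data = [(np.ones(3) * i, np.array([i])) for i in range(5)]
loader = DataLoader(data, batch_size=2)
shapes = [bx.shape for bx, _ in loader]
print(shapes)  # [(2, 3), (2, 3), (1, 3)]
```

With drop_last=True the trailing 1-sample batch would be discarded, so the loop above would yield only the two full batches.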

__len__()

While not strictly required for iteration itself, implementing the length operator is highly recommended so the training loop can easily report progress (e.g., printing epoch status or driving progress bars).

Method Signature

def __len__(self):

Specification

- Return the number of batches the loader will yield per epoch, not the number of samples.
- If drop_last is True, this is the floor of len(dataset) / batch_size; otherwise it is the ceiling, since the final partial batch is still yielded.

Example Implementation

def __len__(self):
    if self.drop_last:
        return len(self.dataset) // self.batch_size
    else:
        return int(np.ceil(len(self.dataset) / self.batch_size))
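The floor-versus-ceiling distinction can be sanity-checked with a standalone helper; num_batches is a hypothetical name used here for illustration, not part of the framework.

```python
import numpy as np

def num_batches(n_samples, batch_size, drop_last):
    # Mirrors the __len__ logic: floor division when dropping the
    # remainder, ceiling division when the partial batch is kept.
    if drop_last:
        return n_samples // batch_size
    return int(np.ceil(n_samples / batch_size))

print(num_batches(10, 3, drop_last=True))   # 3 (the 1-sample remainder is dropped)
print(num_batches(10, 3, drop_last=False))  # 4 (the remainder counts as a batch)
```

When the dataset size divides evenly by the batch size, both settings agree, so drop_last only matters for the final partial batch.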