
Detailed Specification for Dataset Methods

In the mini-torch framework, the Dataset class serves as an abstract base class that defines how individual data records and their corresponding labels are loaded and accessed. While the DataLoader class is responsible for the complex orchestration of batching, shuffling, and multi-process loading, the Dataset class is strictly focused on data representation and providing access to single data samples.

From an object-oriented design perspective, the Dataset implements several key principles:

A custom dataset is created by subclassing the base Dataset class. The subclass must implement three methods: the __init__ constructor, the __len__ method, and the __getitem__ method.
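As a rough illustration of this contract, the sketch below defines a stand-in base class and one toy subclass. The Dataset class shown here is a hypothetical placeholder written for this example, not the mini-torch source, and SquaresDataset is an invented name; the actual base class may enforce the contract differently.

```python
class Dataset:
    """Hypothetical stand-in for the mini-torch base class (not the real source)."""

    def __len__(self):
        raise NotImplementedError("subclasses must implement __len__")

    def __getitem__(self, index):
        raise NotImplementedError("subclasses must implement __getitem__")


class SquaresDataset(Dataset):
    """Toy subclass (illustrative name): sample i is the pair (i, i * i)."""

    def __init__(self, n):
        # Store whatever state the other two methods need later.
        self.n = n

    def __len__(self):
        # Total number of samples in the dataset.
        return self.n

    def __getitem__(self, index):
        # Return exactly one (record, label) pair.
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index, index * index


squares = SquaresDataset(4)
```

With this in place, len(squares) and squares[3] both work, while calling len() on the bare stand-in base class raises NotImplementedError.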

__init__()

The constructor initializes the dataset object by setting up the necessary attributes, file paths, or data structures that will be accessed later during data retrieval.

Method Signature

def __init__(self, features, labels, *args, **kwargs):

Specification

Example Implementation (for a simple in-memory array dataset)

def __init__(self, X, y):
    self.features = X
    self.labels = y

__len__()

This method allows the dataset object to respond to Python’s built-in len() function by returning the total number of samples in the dataset as an integer.

Method Signature

def __len__(self):

Specification

Example Implementation

def __len__(self):
    # shape is a tuple; its first element is the number of samples
    return self.labels.shape[0]
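Python's len() only accepts an integer from __len__, whereas an array's shape attribute is a tuple, so the sample count is the tuple's first element. The sketch below shows the difference. ShapeLabels, TupleLenDataset, and IntLenDataset are hypothetical names introduced only for this illustration, and ShapeLabels merely imitates an array by exposing a tuple-valued shape.

```python
class ShapeLabels:
    """Hypothetical stand-in for an array: exposes a tuple-valued .shape."""

    def __init__(self, values):
        self.values = list(values)
        self.shape = (len(self.values),)


class TupleLenDataset:
    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        # Returns a tuple, which len() rejects with a TypeError.
        return self.labels.shape


class IntLenDataset:
    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        # Returns the size of the first dimension as a plain integer.
        return self.labels.shape[0]


labels = ShapeLabels([0, 1, 1, 0])

try:
    len(TupleLenDataset(labels))
    tuple_len_error = None
except TypeError as exc:
    tuple_len_error = exc

n_samples = len(IntLenDataset(labels))
```

For plain Python lists, which have no shape attribute, returning len(self.labels) is the equivalent choice.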

__getitem__()

This method is the core of the Dataset class. It allows the dataset object to be indexed like a list or array (e.g., dataset[index]), returning a single training example together with its label.

Method Signature

def __getitem__(self, index):

Specification

Example Implementation

def __getitem__(self, index):
    # Retrieve exactly one data record and the corresponding label
    one_x = self.features[index]
    one_y = self.labels[index]
    
    return one_x, one_y
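Putting the three methods together, a minimal in-memory dataset might look like the sketch below. It uses plain Python lists so that it runs without any dependencies. The class name ArrayDataset, the length check in the constructor, and the iter_batches helper are assumptions made for this example; iter_batches is a simplified stand-in for illustration only and does not reflect how the mini-torch DataLoader is implemented.

```python
class ArrayDataset:
    """Illustrative in-memory dataset backed by two equally long sequences."""

    def __init__(self, X, y):
        # Assumed sanity check: every record needs exactly one label.
        if len(X) != len(y):
            raise ValueError("features and labels must have the same length")
        self.features = X
        self.labels = y

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Retrieve exactly one data record and the corresponding label.
        return self.features[index], self.labels[index]


def iter_batches(dataset, batch_size):
    """Hypothetical helper: yield consecutive lists of samples."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


dataset = ArrayDataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [0, 1, 0])
batches = list(iter_batches(dataset, batch_size=2))
```

The helper relies only on len(dataset) and dataset[i], which is why any loader can work with a custom dataset that implements __len__ and __getitem__ correctly.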