
    Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

    May 30, 2025

    The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

    In this tutorial, I’ll show you how to build both single-layer and multi-layer perceptrons (MLPs) using three approaches:

    • Custom classifier

    • Scikit-learn’s MLPClassifier

    • Keras Sequential classifier using SGD and Adam optimizers.

    This will help you learn about their various use cases and how they work.

    Table of Contents

    • What is a Perceptron?

    • How to Build a Single-Layered Classifier

    • What is a Multi-Layer Perceptron?

    • How to Build Multi-Layered Perceptrons

    • Understanding Optimizers

    • How to Build an MLP Classifier with SGD Optimizer

    • How to Build an MLP Classifier with Adam Optimizer

    • Final Results: Generalization

    • Conclusion

    Prerequisites

    • Mathematics (Calculus, Linear Algebra, Statistics)

    • Coding in Python

    • Basic understanding of Machine Learning concepts

    What is a Perceptron?

    A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

    A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

    But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

    The perceptron consists of four main parts:

    • Input layer: Takes the initial numerical values into the system for further processing.

    • Weights and bias: Scale each input value by its weight and shift the sum by a bias term.

    • Activation function: Determines whether the neuron should fire based on the threshold value.

    • Output layer: Produces classification result.

    Image: Organization of a perceptron. Source: Rosenblatt 1958

    It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

    So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.
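
    To make this concrete, here’s a minimal sketch (the input values, weights, and bias are made up for illustration, not taken from the tutorial’s dataset) of a single perceptron computing a weighted sum and passing it through a sigmoid activation:

    import numpy as np

    # illustrative inputs, weights, and bias (hypothetical values)
    x = np.array([0.5, 1.2, -0.3])   # input features
    w = np.array([0.4, -0.2, 0.7])   # one weight per input
    b = 0.1                          # bias term

    z = np.dot(w, x) + b             # weighted sum plus bias
    output = 1 / (1 + np.exp(-z))    # sigmoid activation -> a value in (0, 1)

    print(f"z = {z:.3f}, sigmoid(z) = {output:.3f}")  # interpretable as a probability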

    Applications of Perceptrons

    Perceptrons are applied to tasks such as:

    • Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.

    • Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

    How the Activation Function Works

    For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

    $$\phi(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{if } z < \theta \end{cases}$$

    where:

    • ϕ(z): the output of the activation function.

    • z: the weighted sum of the inputs plus the bias:

    $$z = \sum_{i=1}^m w_i x_i + b$$

    (x_i: input values, w_i: the weight associated with each input, b: the bias term)

    θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

    In that case, the formula becomes:

    $$\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

    Image: Step Function (Author)

    When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

    This occurs when the weighted sum (plus the bias) meets or exceeds the threshold, leading the perceptron to predict that the input belongs to this class.

    While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

    In modern implementations, we can use other activation functions like the sigmoid function:

    $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

    Unlike the step function, the sigmoid outputs a continuous value between zero and one depending on the weighted sum (z); this value can then be thresholded (for example, at 0.5) to produce the final class label.

    How the Loss Function Works

    The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model’s predictions and the actual target values.

    Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model’s parameters in a way that minimizes this error and improves performance.

    In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

    $$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

    (h(x): the model’s prediction, y: the true label, conventionally encoded as −1 or +1)
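
    As a quick illustration (with made-up labels and scores, and assuming the usual hinge-loss convention of labels in {−1, +1}), the loss is zero for confidently correct predictions and grows linearly otherwise:

    import numpy as np

    def hinge_loss(y_true, y_score):
        """Hinge loss for labels in {-1, +1} and raw prediction scores h(x)."""
        return np.maximum(0, 1 - y_true * y_score)

    y_true = np.array([1, -1, 1])         # true labels
    y_score = np.array([2.0, 0.5, -0.3])  # raw model outputs h(x)

    print(hinge_loss(y_true, y_score))    # [0.  1.5 1.3]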

    How to Build a Single-Layered Classifier

    Now, let’s build a simple single-layer perceptron for binary classification.

    1. Custom Classifier

    Initialize the classifier

    We’ll first initialize the classifier with its weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>
    

    Define the activation function

    Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the threshold is set to zero.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
         <span class="hljs-keyword">return</span> np.where(x > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
    

    Train the model

    Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

    This process is controlled by a specified number of training epochs defined by n_iterations.

    In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
    
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>
    
        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                <span class="hljs-comment"># compute weighted sum (z)</span>
                z = np.dot(X[i], self.weights) + self.bias
    
                <span class="hljs-comment"># apply the activation function</span>
                y_pred = self._step_function(z)
    
                <span class="hljs-comment"># update weights and bias</span>
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
    

    How the weights work in the iteration loop

    The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

    Its iterative update in the for loop aims to reduce classification errors such that:

    $$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j & \text{(a) } y_i - \hat y_i = 0 \\ w_j + \eta x_{ij} & \text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} & \text{(c) } y_i - \hat y_i = -1 \end{cases} \end{align*}$$

    (w_j: the j-th weight, η: the learning rate, (y_i − ŷ_i): the error)

    This means that:

    1. When the prediction is correct, the error is zero, so the weight is unchanged.

    2. When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight is adjusted in the same direction as the input to increase the weighted sum.

    3. When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight is adjusted in the opposite direction to pull the weighted sum lower.

    How the bias terms work in the iteration loop

    The bias determines the decision boundary’s intercept (position from the origin).

    Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:

    $$\begin{align*} b &:= b + \Delta b \\ &:= b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b & \text{(a) } y_i - \hat y_i = 0 \\ b + \eta & \text{(b) } y_i - \hat y_i = 1 \\ b - \eta & \text{(c) } y_i - \hat y_i = -1 \end{cases} \end{align*}$$

    This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.
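
    Putting the weight and bias rules together, here’s a minimal sketch of a single update step on one hypothetical misclassified sample, mirroring the inner loop of fit():

    import numpy as np

    eta = 0.1                   # learning rate
    w = np.array([0.0, 0.0])    # current weights
    b = 0.0                     # current bias

    x_i = np.array([2.0, 1.0])  # one training sample (illustrative values)
    y_i = 1                     # true label
    y_hat = 0                   # step-function prediction (too low, case (b))

    error = y_i - y_hat         # = 1
    w = w + eta * error * x_i   # -> [0.2, 0.1]: weights pushed toward the sample
    b = b + eta * error         # -> 0.1: boundary shifted to raise the weighted sum
    print(w, b)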

    Make a prediction

    Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
          linear_output = np.dot(X, self.weights) + self.bias
          predictions = self._step_function(linear_output)
          <span class="hljs-keyword">return</span> predictions
    

    The entire classifier looks like this:

    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    
    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
            self.learning_rate = learning_rate
            self.n_iterations = n_iterations
            self.weights = <span class="hljs-literal">None</span>
            self.bias = <span class="hljs-literal">None</span>
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
            <span class="hljs-keyword">return</span> np.where(x > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            self.weights = np.zeros(n_features)
            self.bias = <span class="hljs-number">0</span>
    
            <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                    linear_output = np.dot(X[i], self.weights) + self.bias
                    y_pred = self._step_function(linear_output)
                    self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                    self.bias += self.learning_rate * (y[i] - y_pred)
            <span class="hljs-keyword">return</span> self
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
            linear_output = np.dot(X, self.weights) + self.bias
            y_pred = self._step_function(linear_output)
            <span class="hljs-keyword">return</span> y_pred
    

    Simulate with synthetic datasets

    First, we generate a synthetic, linearly separable dataset using make_blobs, train the classifier we created, and then plot its decision boundary.

    <span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
    <span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    
    <span class="hljs-comment"># create a mock dataset</span>
    X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)
    
    <span class="hljs-comment"># split</span>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># train the model</span>
    perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)
    
    <span class="hljs-comment"># make a prediction</span>
    y_pred_train = perceptron.predict(X_train)
    y_pred_test = perceptron.predict(X_test)
    
    <span class="hljs-comment"># evaluate the results</span>
    acc_train = np.mean(y_pred_train == y_train)
    acc_test = np.mean(y_pred_test == y_test)
    print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
    

    Results

    The classifier generated a clear, highly accurate linear decision boundary.

    • Accuracy (Train): 0.981

    • Accuracy (Test): 0.975

    Decision boundary of single-layer perceptron (Custom classifier)

    2. Leverage Scikit-learn’s MLPClassifier

    For convenience, we’ll use scikit-learn’s built-in MLPClassifier to build a similar, yet more robust, classifier:

    from sklearn.neural_network import MLPClassifier

    model = MLPClassifier(
        hidden_layer_sizes=(),    # intentionally left empty to create a single-layer perceptron
        activation='logistic',    # sigmoid activation function
        solver='sgd',             # SGD optimizer
        max_iter=1000,
        random_state=42,
        learning_rate='constant',
        learning_rate_init=0.1
    ).fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    acc_train = np.mean(y_pred_train == y_train)
    acc_test = np.mean(y_pred_test == y_test)
    print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3}\nAccuracy (Test): {acc_test:.3}")
    

    Results

    The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

    • Accuracy (Train): 0.985

    • Accuracy (Test): 0.995

    Decision boundary of single-layer perceptron (MLPClassifier)

    Limitations of Single-Layer Perceptrons

    Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

    Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

    Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

    This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (partial derivatives) of the cost function.

    In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.
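
    For intuition, here’s a small sketch of these differentiable building blocks; the derivatives are what gradient-based optimizers actually consume (the sample inputs are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1 - s)            # smooth and defined everywhere

    def relu(z):
        return np.maximum(0, z)

    def relu_derivative(z):
        return (z > 0).astype(float)  # conventionally 0 at exactly z = 0

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid_derivative(z))      # [0.105 0.25  0.105]
    print(relu_derivative(z))         # [0. 0. 1.]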

    Other challenges of a single-layer perceptron include:

    • Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.

    • Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.

    • Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.

    So, in the next section, you’ll learn how multi-layer perceptrons overcome these disadvantages.

    What is a Multi-Layer Perceptron?

    An MLP is a feedforward artificial neural network that consists of at least three layers of nodes:

    • an input layer,

    • one or more hidden layers, and

    • an output layer.

    Except for the input nodes, each node is a neuron that uses a nonlinear activation function.​

    MLPs are applied to both classification and regression problems:

    • Classification tasks: MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.​

    • Regression analysis: They are also applied in regression problems where the relationship between input and output is complex.​

    How to Build Multi-Layered Perceptrons

    Let’s handle a binary classification task using a standard MLP architecture.

    Outline of the Project

    Objective

    • Detect fraudulent transactions

    Evaluation Metrics

    • Considering the cost of misclassification, we’ll prioritize improving the Recall and Precision scores

    • Then check overall classification performance with the Accuracy score ((TP + TN) / (TP + TN + FP + FN)); a short sketch computing these metrics follows the list below

    Cost of Misclassification (from high to low):

    • False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)

    • False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)

    • True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.

    • True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
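
    Here’s the promised sketch of how these metrics fall out of a confusion matrix, using scikit-learn on a handful of made-up labels (1 = fraud, 0 = legitimate):

    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
    print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN)
    print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)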

    Planning an MLP Architecture

    In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

    Their outputs are then passed to the second hidden layer (another 30 ReLU neurons), and finally to a single sigmoid output neuron that produces the predicted probability.

    During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

    Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

    Especially in deeper networks, ReLU is advantageous in preventing the vanishing gradient problem, where gradients become extremely small as they are backpropagated from the output layer.
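
    A rough back-of-the-envelope sketch of why this matters: the sigmoid derivative never exceeds 0.25, so backpropagating through many sigmoid layers multiplies small factors together, while ReLU passes a gradient of 1 for active units (the layer count below is arbitrary):

    n_layers = 10
    max_sigmoid_grad = 0.25  # the sigmoid derivative never exceeds 0.25
    relu_grad = 1.0          # ReLU derivative for active (positive) units

    print("sigmoid chain:", max_sigmoid_grad ** n_layers)  # ~9.5e-07, the gradient nearly vanishes
    print("relu chain   :", relu_grad ** n_layers)         # 1.0, the gradient is preserved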

    Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU

    Learn More: A Comprehensive Guide on Neural Network in Deep Learning

    Preprocessing the Datasets

    First, we consolidate three datasets (transaction, customer, and credit card) into a single DataFrame, sanitizing numerical and categorical data independently:

    <span class="hljs-keyword">import</span> json
    <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    <span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
    <span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
    <span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
    <span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
    <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
    
    <span class="hljs-comment"># download the raw data to local</span>
    <span class="hljs-keyword">import</span> kagglehub
    path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
    dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>
    
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
        <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
        <span class="hljs-keyword">if</span> isinstance(amount_str, str):
            <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
        <span class="hljs-keyword">return</span> amount_str
    
    <span class="hljs-comment"># load transaction data</span>
    trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)
    
    <span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
    trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
    trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
    trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)
    
    <span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
        fraud_labels_json = json.load(fp=fp)
    
    fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
    fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
    fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
    merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
    merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
    merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})
    
    <span class="hljs-comment"># load card data</span>
    card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
    card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
    card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)
    
    <span class="hljs-comment"># merge transaction and card data</span>
    merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
    merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)
    
    <span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
    categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
    df = merged_df.copy()
    df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
    df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
    print(<span class="hljs-string">'nDataFrame: n'</span>, df.head(n=<span class="hljs-number">3</span>))
    

    DataFrame:

    Base DataFrame

    Our DataFrame shows an extremely skewed data distribution with:

    • Fraud samples: 1,191

    • Non-fraud samples: 11,477,397

    For classification tasks, it’s crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

    For our data, we’ll:

    1. split the 1,191 fraud samples into training, validation, and test sets,

    2. add an equal number of randomly chosen non-fraud samples from the DataFrame, and

    3. adjust split balances later if generalization challenges arise.

    <span class="hljs-comment"># define the desired size of the fraud samples for the validation and test sets</span>
    val_size_per_class = <span class="hljs-number">200</span>
    test_size_per_class = <span class="hljs-number">200</span>
    
    <span class="hljs-comment"># create test sets</span>
    X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
    X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># combine to form the balanced test set</span>
    X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
    X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    <span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
    df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
    df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)
    
    
    <span class="hljs-comment"># create validation sets</span>
    X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
    X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># combine to form the balanced validation set</span>
    X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
    X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    <span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
    df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
    df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)
    
    
    <span class="hljs-comment"># create training sets</span>
    min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))
    
    X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
    X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
    
    X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
    X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    
    print(<span class="hljs-string">"n--- Final Dataset Shapes and Distributions ---"</span>)
    print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    

    After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions:

    X, y datasets shape

    Considering the high-dimensional feature space with 19 input features, we’ll apply SMOTE to resample the training data (SMOTE should not be applied to the validation or test sets, to avoid data leakage):

    <span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
    <span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
    
    train_target = <span class="hljs-number">2000</span>
    
    smote_train = SMOTE(
      sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
      random_state=<span class="hljs-number">12</span>
    )
    X_train, y_train = smote_train.fit_resample(X_train, y_train)
    
    print(<span class="hljs-string">f"nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
    print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
    print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
    

    We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:

    Training sample shape after SMOTE

    Lastly, we’ll apply column transformers to numerical and categorical features separately.

    Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

    <span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
    <span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
    <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
    
    categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
    categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])
    
    numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
    numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])
    
    preprocessor = ColumnTransformer(
        transformers=[
            (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
            (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
        ]
    )
    
    X_train_processed = preprocessor.fit_transform(X_train)
    X_val_processed = preprocessor.transform(X_val)
    X_test_processed = preprocessor.transform(X_test)
    

    Understanding Optimizers

    In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

    Various optimization algorithms, known as optimizers, employ distinct strategies to converge efficiently toward optimal parameters and improve predictions.

    In this article, we’ll use the SGD Optimizer and Adam Optimizer.

    1. How an SGD (Stochastic Gradient Descent) Optimizer Works

    SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) on a small mini-batch of examples at each update step:

    $$\begin{align*} w_j &:= w_j - \eta \frac{\partial J}{\partial w_j} \\ b &:= b - \eta \frac{\partial J}{\partial b} \end{align*}$$

    (w: weight, b: bias, J: cost function, η: learning rate)

    In binary classification, the cost function (J) is the binary cross-entropy defined with a sigmoid function (σ(z)), where z is the weighted sum of the inputs plus the bias term:

    $$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y)\log(1-\hat y)] \\ \hat y &= \sigma(z) = \frac{1}{1+e^{-z}} \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
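
    Here’s a minimal NumPy sketch of one SGD update on a mini-batch for this logistic-regression-style output; the data is randomly generated for illustration, and the gradient of the cross-entropy with respect to z conveniently simplifies to (ŷ − y):

    import numpy as np

    rng = np.random.default_rng(0)
    X_batch = rng.normal(size=(32, 3))          # mini-batch of 32 samples with 3 features
    y_batch = rng.integers(0, 2, size=(32, 1))  # binary labels

    w = np.zeros((3, 1))
    b = 0.0
    eta = 0.01                                  # learning rate

    z = X_batch @ w + b
    y_hat = 1 / (1 + np.exp(-z))                # sigmoid prediction

    # gradients of the binary cross-entropy averaged over the mini-batch
    dw = X_batch.T @ (y_hat - y_batch) / len(X_batch)
    db = np.mean(y_hat - y_batch)

    w -= eta * dw                               # SGD update of the weights
    b -= eta * db                               # SGD update of the bias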

    2. How Adam (Adaptive Moment Estimation) Optimizer Works

    Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

    Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

    $$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

    where:

    • α: The learning rate (default is 0.001)

    • ϵ: A small positive constant used to avoid division by zero

    • m^: First moment (mean) estimate with a bias correction, leveraging Momentum:

    $$\begin{align*} \hat m_t &= \frac{m_t}{1 - \beta_1^t} \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{\frac{\partial L}{\partial w_t}}_{\text{gradient}} \end{align*}$$

    (β1: decay rate, typically set to β1 = 0.9)

    • v^: Second moment (variance) estimate with a bias correction, leveraging RMSprop:

    $$\begin{align*} \hat v_t &= \frac{v_t}{1 - \beta_2^t} \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) \left(\frac{\partial L}{\partial w_t}\right)^2 \end{align*}$$

    (β2: decay rate, typically set to β2 = 0.999)

    Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to prevent them from being biased toward zero.
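
    Below is a compact sketch of a single Adam update for one parameter vector, following the formulas above with the default hyperparameters; the gradient is a stand-in placeholder rather than one computed from a real network:

    import numpy as np

    alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

    w = np.zeros(3)        # parameters
    m = np.zeros_like(w)   # first-moment estimate
    v = np.zeros_like(w)   # second-moment estimate

    grad = np.array([0.2, -0.5, 0.1])  # placeholder gradient dL/dw at step t
    t = 1

    m = beta1 * m + (1 - beta1) * grad            # Momentum-style running mean
    v = beta2 * v + (1 - beta2) * grad**2         # RMSprop-style running average of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)

    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    print(w)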

    Learn More: A Comprehensive Guide on Neural Network in Deep Learning

    How to Build an MLP Classifier with SGD Optimizer

    Custom Classifier

    This process involves a forward pass and backpropagation, during which SGD computes optimal weights and biases using gradients:

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
        <span class="hljs-comment"># SGD starts with randomly selected mini-batch for the epoch</span>
        X_batch = X_shuffled[i : i + self.batch_size]
        y_batch = y_shuffled[i : i + self.batch_size]
    
        <span class="hljs-comment"># A. forward pass</span>
        activations, zs = self._forward_pass(X_batch)
        y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>
    
        <span class="hljs-comment"># B. backpropagation</span>
        <span class="hljs-comment"># 1) calculating gradients for the output layer)</span>
        delta = y_pred - y_batch
        dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
        <span class="hljs-comment"># 2) update output layer parameters</span>
        self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
        self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db
    
        <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
        <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
            delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
            dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
            db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
            self.weights[l] -= self.learning_rate * dW
            self.biases[l] -= self.learning_rate * db
    

    During the forward pass, the network calculates the weighted sum of inputs and bias (z) for each hidden layer, applies the ReLU activation to those values, and then computes the predicted output (y_pred) at the output layer using a sigmoid function.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []
    
        <span class="hljs-comment"># forward through hidden layers</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
            activations.append(a)
    
        <span class="hljs-comment"># forward through output layer</span>
        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
    
        <span class="hljs-comment"># computes the final output using sigmoid function</span>
        y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
        activations.append(y_pred)
        <span class="hljs-keyword">return</span> activations, zs
    

    So the final classifier looks like this:

    <span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
    
    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
            self.hidden_layer_sizes = hidden_layer_sizes
            self.learning_rate = learning_rate
            self.n_epochs = n_epochs
            self.batch_size = batch_size
            self.weights = []
            self.biases = []
            self.weights_history = []
            self.biases_history = []
            self.loss_history = []
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            s = self._sigmoid(x)
            <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> (x > <span class="hljs-number">0</span>).astype(float)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
            layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
            self.weights = []
            self.biases = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
                fan_in = layer_sizes[i]
                fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
                limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
                self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
                self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
            activations = [X]
            zs = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
                z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
                zs.append(z)
                a = self._relu(z)
                activations.append(a)
    
            z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
            zs.append(z_output)
            y_pred = self._sigmoid(z_output)
            activations.append(y_pred)
    
            <span class="hljs-keyword">return</span> activations, zs
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
            y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
            loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
            <span class="hljs-keyword">return</span> loss
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
            X = np.asarray(X)
            self._initialize_parameters(n_features)
            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
            activations, _ = self._forward_pass(X)
            initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(initial_loss)
    
            <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
                <span class="hljs-comment"># shuffle datasets</span>
                permutation = np.random.permutation(n_samples)
                X_shuffled = X[permutation]
                y_shuffled = y[permutation]
    
                <span class="hljs-comment"># mini-batch loop</span>
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                    X_batch = X_shuffled[i : i + self.batch_size]
                    y_batch = y_shuffled[i : i + self.batch_size]
    
                    activations, zs = self._forward_pass(X_batch)
                    y_pred = activations[<span class="hljs-number">-1</span>]
    
                    delta = y_pred - y_batch
                    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db
    
                    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                        self.weights[l] -= self.learning_rate * dW
                        self.biases[l] -= self.learning_rate * db
    
                self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
                self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
    
                activations, _ = self._forward_pass(X)
                epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
                self.loss_history.append(epoch_loss)
    
                <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                    print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
            <span class="hljs-keyword">return</span> self
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
            activations, _ = self._forward_pass(X)
            <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
            probabilities = self.predict_proba(X)
            <span class="hljs-keyword">return</span> (probabilities >= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
    

    Training / Prediction

    Train the model and make a prediction using training and validation datasets:

    <span class="hljs-comment"># 1. define the model</span>
    mlp_sgd = MLP_SGD(
      hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
      learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
      n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
      batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
    )
    
    <span class="hljs-comment"># 2. train the model</span>
    mlp_sgd.fit(X_train_processed, y_train)
    
    <span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
    y_pred_train = mlp_sgd.predict(X_train_processed)
    y_pred_val = mlp_sgd.predict(X_val_processed)
    
    <span class="hljs-comment"># 4. compute evaluation matrics</span>
    conf_matrix = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    recall = recall_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    f1 = f1_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    
    
    print(<span class="hljs-string">f"nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
    print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
    

    Results

    • Recall: 0.7930 (training) to 0.6650 (validation)

    • Precision: 0.7790 (training) to 0.6786 (validation)

    The model effectively learned and generalized the patterns, achieving a training Recall of 79.3% (it caught roughly 80% of fraudulent transactions), with a drop of about 13 points on the validation set.

    Loss history:

    Loss by epoch, weight history, bias history (Source: Kuriko Iwai)

    We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

    Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)

    Leverage Scikit-learn’s MLPClassifier

    We can use MLPClassifier to define a similar model, incorporating:

    • Early stopping using internal validation to prevent overfitting and

    • L2 regularization with a small tolerance.

    <span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier
    
    <span class="hljs-comment"># define a model</span>
    model_sklearn_mlp_sgd = MLPClassifier(
        hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
        activation=<span class="hljs-string">'relu'</span>,
        solver=<span class="hljs-string">'sgd'</span>,
        learning_rate_init=<span class="hljs-number">0.001</span>,
        learning_rate=<span class="hljs-string">'constant'</span>,
        momentum=<span class="hljs-number">0.9</span>,
        nesterovs_momentum=<span class="hljs-literal">True</span>,
        alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># l2 regulation strength</span>
        max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
        batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
        random_state=<span class="hljs-number">42</span>,
        early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
        n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
        validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
        tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
        verbose=<span class="hljs-literal">False</span>,
    )
    
    <span class="hljs-comment"># training</span>
    model_sklearn_mlp_sgd.fit(X_train_processed, y_train)
    
    <span class="hljs-comment"># make a prediction</span>
    y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
    y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
    

    Results

    • Recall: 0.7830 – 0.6200 (from training to validation)

    • Precision: 0.8208 – 0.6703 (from training to validation)

    The model showed strong performance during training, achieving a Recall of 78.30%. Its performance declined on the validation set.

    This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.
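
    One quick way to check this is to plot the training loss and the internal validation score that MLPClassifier records during fitting (loss_curve_ and, since early_stopping=True, validation_scores_):

    import matplotlib.pyplot as plt
    
    # cross-entropy loss on the training data at each iteration
    plt.plot(model_sklearn_mlp_sgd.loss_curve_, label='training loss')
    # accuracy on the internal validation split (recorded because early_stopping=True)
    plt.plot(model_sklearn_mlp_sgd.validation_scores_, label='internal validation score')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()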

    Leverage Keras Sequential Classifier

    For the Keras Sequential classifier, we can further enhance the model by:

    • Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,

    • Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,

    • Including Precision and Recall in the model’s compilation metrics to optimize for classification performance,

    • Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and

    • Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.

    <span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
    <span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
    <span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
    <span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
    <span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
    <span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
    <span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight
    
    
    <span class="hljs-comment"># calculates an initial bias for the output layer </span>
    initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
    
    
    <span class="hljs-comment"># defines the model</span>
    model_keras_sgd = Sequential([
        Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
              bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
    ])
    
    
    
    <span class="hljs-comment"># compiles the model with the SGD optimizer</span>
    opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
    model_keras_sgd.compile(
        optimizer=opt, 
        loss=<span class="hljs-string">'binary_crossentropy'</span>,
        metrics=[
            <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
            tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
            tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
            tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
        ]
    )
    
    
    <span class="hljs-comment"># defines early stopping to prevent overfitting</span>
    early_stopping_callback = EarlyStopping(
        monitor='val_recall',  # monitor recall on the validation set
        mode='max',         # maximize recall
        patience=50,        # stop after 50 epochs without improvement in validation recall
        min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
        verbose=<span class="hljs-number">0</span>
    )
    
    
    <span class="hljs-comment"># compute the class weight</span>
    class_weights = class_weight.compute_class_weight(
        class_weight=<span class="hljs-string">'balanced'</span>,
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights_dict = dict(zip(np.unique(y_train), class_weights))
    
    
    <span class="hljs-comment"># train the model</span>
    history = model_keras_sgd.fit(
        X_train_processed, y_train,
        epochs=<span class="hljs-number">1000</span>,
        batch_size=<span class="hljs-number">32</span>,
        validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
        callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
        class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
        verbose=<span class="hljs-number">0</span>
    )
    
    <span class="hljs-comment"># evaluate</span>
    loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Train) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Validation) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    <span class="hljs-comment"># display model summary</span>
    model_keras_sgd.summary()
    

    Results

    • Recall: 0.7125 — 0.7250 (from training to validation)

    • Precision: 0.7607 — 0.7545 (from training to validation)

    Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

    It suggests that the regularization techniques are likely effective in preventing significant overfitting.
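
    To confirm this, we can use the history object returned by fit above, which records every compiled metric per epoch. A minimal sketch comparing training and validation recall:

    import matplotlib.pyplot as plt
    
    # recall on the training set vs. the external validation set, per epoch
    plt.plot(history.history['recall'], label='train recall')
    plt.plot(history.history['val_recall'], label='validation recall')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()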

    Image: Summary of the Keras Sequential Model with SGD Optimizer

    How to Build an MLP Classifier with Adam Optimizer

    Custom Classifier

    As with SGD, parameter updates happen inside the mini-batch loop; with Adam, each weight and bias update also maintains first- and second-moment estimates of the gradients:

    <span class="hljs-comment"># apply Adam updates for output layer parameters</span>
    <span class="hljs-comment"># 1) weights (w)</span>
    self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
    self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
    m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
    v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
    <span class="hljs-comment"># 2) bias (b)</span>
    self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
    self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
    m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
    v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
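
    For reference, these lines implement the standard Adam update: exponential moving averages of the gradient (m) and the squared gradient (v) are maintained, bias-corrected by the time step t, and used to scale each parameter step:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    \theta \leftarrow \theta - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}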
    

    Following the same forward- and backward-pass logic as the MLP_SGD classifier, we construct the final classifier, initializing it with beta1, beta2, and epsilon for Adam’s moment estimates:

    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                     beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
            self.hidden_layer_sizes = hidden_layer_sizes
            self.learning_rate = learning_rate
            self.n_epochs = n_epochs
            self.batch_size = batch_size
            self.beta1 = beta1
            self.beta2 = beta2
            self.epsilon = epsilon
    
            self.weights = [] 
            self.biases = []
    
            <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
            self.m_weights = []
            self.v_weights = []
            self.m_biases = []
            self.v_biases = []
    
            self.weights_history = []
            self.biases_history = []
            self.loss_history = []
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            s = self._sigmoid(x)
            <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> (x > <span class="hljs-number">0</span>).astype(float)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
            layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
    
            self.weights = []
            self.biases = []
            self.m_weights = []
            self.v_weights = []
            self.m_biases = []
            self.v_biases = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
                fan_in = layer_sizes[i]
                fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
                limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
    
                self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
                self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
                self.m_weights.append(np.zeros((fan_in, fan_out)))
                self.v_weights.append(np.zeros((fan_in, fan_out)))
                self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
                self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
            activations = [X]
            zs = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
                z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
                zs.append(z)
                a = self._relu(z)
                activations.append(a)
    
            z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
            zs.append(z_output)
            y_pred = self._sigmoid(z_output)
            activations.append(y_pred)
    
            <span class="hljs-keyword">return</span> activations, zs
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
            y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
            loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
            <span class="hljs-keyword">return</span> loss
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
            X = np.asarray(X)
    
            self._initialize_parameters(n_features)
            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
            activations, _ = self._forward_pass(X)
            initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(initial_loss)
    
            <span class="hljs-comment"># global time step for Adam bias correction</span>
            t = <span class="hljs-number">0</span>
    
            <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
                permutation = np.random.permutation(n_samples)
                X_shuffled = X[permutation]
                y_shuffled = y[permutation]
    
                <span class="hljs-comment"># Mini-batch loop</span>
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                    X_batch = X_shuffled[i : i + self.batch_size]
                    y_batch = y_shuffled[i : i + self.batch_size]
    
                    t += <span class="hljs-number">1</span>
    
                    <span class="hljs-comment"># 1. forward pass</span>
                    activations, zs = self._forward_pass(X_batch)
                    y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>
    
                    <span class="hljs-comment"># 2. backpropagation</span>
                    delta = y_pred - y_batch
                    grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                    grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                    self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                    self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
    
    
                    <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                        grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                        grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                        <span class="hljs-comment"># apply Adam updates to weights</span>
                        self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                        self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                        m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                        v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                        self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
                        <span class="hljs-comment"># apply Adam updates to bias</span>
                        self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                        self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                        m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                        v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                        self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
    
    
                self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
                self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
    
                activations, _ = self._forward_pass(X)
                epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
                self.loss_history.append(epoch_loss)
    
                <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                    print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
            <span class="hljs-keyword">return</span> self
    
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
            activations, _ = self._forward_pass(X)
            <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
            probabilities = self.predict_proba(X)
            <span class="hljs-keyword">return</span> (probabilities >= threshold).astype(int).flatten()
    

    Training / Prediction

    Train the model and make a prediction using training and validation datasets:

    mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
    mlp_adam.fit(X_train_processed, y_train)
    
    y_pred_train = mlp_adam.predict(X_train_processed)
    y_pred_val = mlp_adam.predict(X_val_processed)
    
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_val = accuracy_score(y_val, y_pred_val)
    
    print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
    print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
    

    Results

    • Recall: 0.9870–0.6150 (from training to validation)

    • Precision: 0.9811–0.6474 (from training to validation)

    While the Adam optimizer fit the training data far better than SGD, the model exhibited significant overfitting, with Recall and Precision each dropping by more than 30 points from training to validation.

    Loss History

    Image: Left: loss by epoch, middle: weights history by epoch, right: bias history by epoch (Source: Kuriko Iwai)

    We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

    Image: Decision Boundary of MLP Classifier with Adam Optimizer (Source: Kuriko Iwai)

    Leverage Scikit-learn’s MLPClassifier

    We switch the solver from 'sgd' to 'adam' (dropping the momentum settings, which only apply to SGD) and keep the rest of the configuration essentially the same:

    model_sklearn_mlp_adam = MLPClassifier(
        hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
        activation=<span class="hljs-string">'relu'</span>,
        solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
        learning_rate_init=<span class="hljs-number">0.001</span>,
        learning_rate=<span class="hljs-string">'constant'</span>,
        alpha=<span class="hljs-number">0.0001</span>,
        max_iter=<span class="hljs-number">3000</span>,
        batch_size=<span class="hljs-number">16</span>,
        random_state=<span class="hljs-number">42</span>,
        early_stopping=<span class="hljs-literal">True</span>,
        n_iter_no_change=<span class="hljs-number">50</span>,
        validation_fraction=<span class="hljs-number">0.1</span>,
        tol=<span class="hljs-number">1e-4</span>,
        verbose=<span class="hljs-literal">False</span>,
    )
    
    model_sklearn_mlp_adam.fit(X_train_processed, y_train)
    
    y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
    y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
    

    Results

    • Recall: 0.8975–0.6400 (from training to validation)

    • Precision: 0.8864 – 0.6305 (from training to validation)

    Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

    Leverage Keras Sequential Classifier

    Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

    <span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
    <span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
    <span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
    <span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
    <span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
    <span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
    <span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight
    
    
    initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
    model_keras_adam = Sequential([
        Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
        Dense(30, activation='relu'),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
              bias_initializer=tf.keras.initializers.Constant(initial_bias))
    ])
    
    
    optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
    model_keras_adam.compile(
        optimizer=optimizer_keras, 
        loss=<span class="hljs-string">'binary_crossentropy'</span>, 
        metrics=[
            <span class="hljs-string">'accuracy'</span>,
            tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
            tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
            tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
        ]
    )
    
    early_stopping_callback = EarlyStopping(
        monitor=<span class="hljs-string">'val_recall'</span>,
        mode=<span class="hljs-string">'max'</span>,
        patience=<span class="hljs-number">50</span>,
        min_delta=<span class="hljs-number">1e-4</span>,
        verbose=<span class="hljs-number">0</span>
    )
    
    class_weights = class_weight.compute_class_weight(
        class_weight=<span class="hljs-string">'balanced'</span>,
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights_dict = dict(zip(np.unique(y_train), class_weights))
    
    model_keras_adam.fit(
        X_train_processed, y_train,
        epochs=<span class="hljs-number">1000</span>,
        batch_size=<span class="hljs-number">32</span>,
        validation_data=(X_val_processed, y_val),
        callbacks=[early_stopping_callback],
        class_weight=class_weights_dict,
        verbose=<span class="hljs-number">0</span>
    )
    
    
    loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Train) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    
    loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Validation) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    
    model_keras_adam.summary()
    

    Results

    • Recall: 0.7995–0.7500 (from training to validation)

    • Precision: 0.8409–0.8065 (from training to validation)

    The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

    This indicates good generalization, with only minor performance degradation on unseen data.

    Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)

    Final Results: Generalization

    Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

    <span class="hljs-comment"># Custom classifiers</span>
    y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
    y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)
    
    # MLPClassifier
    y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
    y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)
    
    <span class="hljs-comment"># Keras Sequential</span>
    _, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
    _, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)
    

    Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.

    Image: Precision-Recall curves for the six classifiers (custom, MLPClassifier, and Keras Sequential, each with SGD and Adam optimizers) (Source: Kuriko Iwai)
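
    As a rough sketch (not necessarily the exact code behind the figure above), the curves and AUPRC values can be computed from each model’s predicted probabilities. This assumes the variable names used throughout the tutorial and that the custom classifiers expose predict_proba as the MLP_Adam class does:

    from sklearn.metrics import precision_recall_curve, average_precision_score
    import matplotlib.pyplot as plt
    
    # positive-class probabilities on the test set for each model
    test_probas = {
        'Custom SGD': mlp_sgd.predict_proba(X_test_processed).ravel(),
        'Custom Adam': mlp_adam.predict_proba(X_test_processed).ravel(),
        'Sklearn SGD': model_sklearn_mlp_sgd.predict_proba(X_test_processed)[:, 1],
        'Sklearn Adam': model_sklearn_mlp_adam.predict_proba(X_test_processed)[:, 1],
        'Keras SGD': model_keras_sgd.predict(X_test_processed, verbose=0).ravel(),
        'Keras Adam': model_keras_adam.predict(X_test_processed, verbose=0).ravel(),
    }
    
    for name, proba in test_probas.items():
        precision, recall, _ = precision_recall_curve(y_test, proba)
        auprc = average_precision_score(y_test, proba)  # average precision, a PR-curve summary
        plt.plot(recall, precision, label=f'{name} (AUPRC={auprc:.2f})')
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend()
    plt.show()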

    Conclusion

    In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

    Our findings underscore that effective machine learning hinges on three critical factors:

    1. robust data preprocessing (tailored to objectives and data distribution),

    2. judicious model selection, and

    3. strategic framework or library choices.

    Choosing the right framework

    Generally speaking, choose MLPClassifier when:

    • You’re primarily working with tabular data,

    • You want to prioritize simplicity, quick iteration, and seamless integration,

    • You have simple, shallow architectures, and

    • You have a moderate dataset size (manageable on a CPU).

    Choose Keras Sequential when:

    • You’re dealing with image, text, audio, or other sequential data,

    • You’re building deep learning models such as CNNs, RNNs, LSTMs,

    • You need fine-grained control over the model architecture, training process, or custom components,

    • You need to leverage GPU acceleration,

    • You’re planning for production deployment, and

    • You want to experiment with more advanced deep learning techniques.

    Limitation of MLPs

    While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.

    Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

    You can find more info about me on my Portfolio / LinkedIn / Github.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
