
    Learn to Build a Multilayer Perceptron with Real-Life Examples and Python Code

    May 30, 2025

    The perceptron is a fundamental concept in deep learning, with many algorithms stemming from its original design.

    In this tutorial, I’ll show you how to build both single-layer and multi-layer perceptrons (MLPs) using three approaches:

    • Custom classifier

    • Scikit-learn’s MLPClassifier

    • Keras Sequential classifier using SGD and Adam optimizers.

    This will help you learn about their various use cases and how they work.

    Table of Contents

    • What is a Perceptron?

    • How to Build a Single-Layered Classifier

    • What is a Multi-Layer Perceptron?

    • How to Build Multi-Layered Perceptrons

    • Understanding Optimizers

    • How to Build an MLP Classifier with SGD Optimizer

    • How to Build an MLP Classifier with Adam Optimizer

    • Final Results: Generalization

    • Conclusion

    Prerequisites

    • Mathematics (Calculus, Linear Algebra, Statistics)

    • Coding in Python

    • Basic understanding of Machine Learning concepts

    What is a Perceptron?

    A perceptron is one of the simplest types of artificial neurons used in Machine Learning. It’s a building block of artificial neural networks that learns from labeled data to perform classification and pattern recognition tasks, typically on linearly separable data.

    A single-layer perceptron consists of a single layer of artificial neurons, called perceptrons.

    But when you connect many perceptrons together in layers, you have a multi-layer perceptron (MLP). This lets the network learn more complex patterns by combining simple decisions from each perceptron. And this makes MLPs powerful tools for tasks like image recognition and natural language processing.

    The perceptron consists of four main parts:

    • Input layer: Takes the initial numerical values into the system for further processing.

    • Weights and bias: Scale each input value by its weight and shift the sum by a bias term.

    • Activation function: Determines whether the neuron should fire based on the threshold value.

    • Output layer: Produces classification result.

    Image: Organization of a perceptron. Source: Rosenblatt 1958

    It performs a weighted sum of inputs, adds a bias, and passes the result through an activation function – just like logistic regression. It’s sort of like a little decision-maker that says “yes” or “no” based on the information it gets.

    So for instance, when we use a sigmoid activation, its output is a probability between 0 and 1, mimicking the behavior of logistic regression.
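
    To make this concrete, here’s a minimal sketch (the input values, weights, and bias are made up for illustration, not taken from the tutorial’s dataset) of a single perceptron computing a weighted sum and passing it through a sigmoid activation:

    import numpy as np

    # illustrative inputs, weights, and bias (hypothetical values)
    x = np.array([0.5, 1.2, -0.3])   # input features
    w = np.array([0.4, -0.2, 0.7])   # one weight per input
    b = 0.1                          # bias term

    z = np.dot(w, x) + b             # weighted sum plus bias
    output = 1 / (1 + np.exp(-z))    # sigmoid activation -> a value in (0, 1)

    print(f"z = {z:.3f}, sigmoid(z) = {output:.3f}")  # interpretable as a probability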

    Applications of Perceptrons

    Perceptrons are applied to tasks such as:

    • Image classification: Perceptrons classify images containing specific objects. They achieve this by performing binary classification tasks.

    • Linear regression: Perceptrons can predict continuous outputs based on input features. This makes them useful for solving linear regression problems.

    How the Activation Function Works

    For a single perceptron used for binary classification, the most common activation function is the step function (also known as the threshold function):

    $$\phi(z) = \begin{cases} 1 & \text{if } z \geq \theta \\ 0 & \text{if } z < \theta \end{cases}$$

    where:

    • ϕ(z): the output of the activation function.

    • z: the weighted sum of the inputs plus the bias:

    $$z = \sum_{i=1}^m w_i x_i + b$$

    (x_i: input values, w_i: the weight associated with each input, b: the bias term)

    θ is the threshold. Often, the threshold θ is set to zero, and the bias (b) effectively controls the activation threshold.

    In that case, the formula becomes:

    $$\phi(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

    Image: Step Function (Author)

    When the step function ϕ(z) outputs one, it signifies that the input belongs to the class labeled one.

    This occurs when the weighted sum (plus the bias) meets or exceeds the threshold, leading the perceptron to predict that the input belongs to this class.

    While the step function is conceptually the original activation for a perceptron, its discontinuity at zero causes computational challenges.

    In modern implementations, we can use other activation functions like the sigmoid function:

    $$\sigma(z) = \frac{1}{1 + e^{-z}}$$

    Unlike the step function, the sigmoid outputs a continuous value between zero and one depending on the weighted sum (z); this value can then be thresholded (for example, at 0.5) to produce the final class label.

    How the Loss Function Works

    The loss function is a crucial concept in machine learning that quantifies the error or discrepancy between the model’s predictions and the actual target values.

    Its purpose is to penalize the model for making incorrect or inaccurate predictions, which guides the learning algorithm (for example, gradient descent) to adjust the model’s parameters in a way that minimizes this error and improves performance.

    In a binary classification task, the model may adopt the hinge loss function to penalize misclassifications by incurring an additional cost for incorrect predictions:

    $$L(y, h(x)) = \max(0, 1 - y \cdot h(x))$$

    (h(x): the model’s prediction, y: the true label, conventionally encoded as −1 or +1)
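
    As a quick illustration (with made-up labels and scores, and assuming the usual hinge-loss convention of labels in {−1, +1}), the loss is zero for confidently correct predictions and grows linearly otherwise:

    import numpy as np

    def hinge_loss(y_true, y_score):
        """Hinge loss for labels in {-1, +1} and raw prediction scores h(x)."""
        return np.maximum(0, 1 - y_true * y_score)

    y_true = np.array([1, -1, 1])         # true labels
    y_score = np.array([2.0, 0.5, -0.3])  # raw model outputs h(x)

    print(hinge_loss(y_true, y_score))    # [0.  1.5 1.3]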

    How to Build a Single-Layered Classifier

    Now, let’s build a simple single-layer perceptron for binary classification.

    1. Custom Classifier

    Initialize the classifier

    We’ll first initialize the classifier with its weights, bias, number of epochs (n_iterations), and learning rate (learning_rate).

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = <span class="hljs-literal">None</span>
        self.bias = <span class="hljs-literal">None</span>
    

    Define the activation function

    Use a step function that returns zero if input (x) ≤ 0, else 1. By default, the threshold is set to zero.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
         <span class="hljs-keyword">return</span> np.where(x > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
    

    Train the model

    Now it’s time to start training. The learning process involves iteratively updating the perceptron’s internal parameters: weights and bias.

    This process is controlled by a specified number of training epochs defined by n_iterations.

    In each epoch, the model processes the entire input dataset (X) and adjusts its weights and bias based on the difference between its predictions and the true labels (y), guided by a predefined learning_rate.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
        n_samples, n_features = X.shape
    
        self.weights = np.zeros(n_features)
        self.bias = <span class="hljs-number">0</span>
    
        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                <span class="hljs-comment"># compute weighted sum (z)</span>
                z = np.dot(X[i], self.weights) + self.bias
    
                <span class="hljs-comment"># apply the activation function</span>
                y_pred = self._step_function(z)
    
                <span class="hljs-comment"># update weights and bias</span>
                self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                self.bias += self.learning_rate * (y[i] - y_pred)
    

    How the weights work in the iteration loop

    The weights in a perceptron define the orientation (slope) of the decision boundary that separates the classes.

    Its iterative update in the for loop aims to reduce classification errors such that:

    $$\begin{align*} w_j &:= w_j + \Delta w_j \\ &:= w_j + \eta (y_i - \hat y_i)x_{ij} \\ &= \begin{cases} w_j & \text{(a) } y_i - \hat y_i = 0 \\ w_j + \eta x_{ij} & \text{(b) } y_i - \hat y_i = 1 \\ w_j - \eta x_{ij} & \text{(c) } y_i - \hat y_i = -1 \end{cases} \end{align*}$$

    (w_j: the j-th weight, η: the learning rate, (y_i − ŷ_i): the error)

    This means that:

    1. When the prediction is correct, the error is zero, so the weight is unchanged.

    2. When the prediction is too low (y_i = 1 and ŷ_i = 0), the weight is adjusted in the same direction as the input to increase the weighted sum.

    3. When the prediction is too high (y_i = 0 and ŷ_i = 1), the weight is adjusted in the opposite direction to pull the weighted sum lower.

    How the bias terms work in the iteration loop

    The bias determines the decision boundary’s intercept (position from the origin).

    Similar to weights, we adjust the bias terms in each epoch to position the decision boundary:

    $$\begin{align*} b &:= b + \Delta b \\ &:= b + \eta (y_i - \hat y_i) \\ &= \begin{cases} b & \text{(a) } y_i - \hat y_i = 0 \\ b + \eta & \text{(b) } y_i - \hat y_i = 1 \\ b - \eta & \text{(c) } y_i - \hat y_i = -1 \end{cases} \end{align*}$$

    This repeated adjustment aims to optimize the model’s ability to correctly classify the training data.
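
    Putting the weight and bias rules together, here’s a minimal sketch of a single update step on one hypothetical misclassified sample, mirroring the inner loop of fit():

    import numpy as np

    eta = 0.1                   # learning rate
    w = np.array([0.0, 0.0])    # current weights
    b = 0.0                     # current bias

    x_i = np.array([2.0, 1.0])  # one training sample (illustrative values)
    y_i = 1                     # true label
    y_hat = 0                   # step-function prediction (too low, case (b))

    error = y_i - y_hat         # = 1
    w = w + eta * error * x_i   # -> [0.2, 0.1]: weights pushed toward the sample
    b = b + eta * error         # -> 0.1: boundary shifted to raise the weighted sum
    print(w, b)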

    Make a prediction

    Lastly, we add a function to generate an outcome value (zero or one) for new, unseen data (X):

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
          linear_output = np.dot(X, self.weights) + self.bias
          predictions = self._step_function(linear_output)
          <span class="hljs-keyword">return</span> predictions
    

    The entire classifier looks like this:

    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    
    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Perceptron</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, learning_rate=<span class="hljs-number">0.01</span>, n_iterations=<span class="hljs-number">1000</span></span>):</span>
            self.learning_rate = learning_rate
            self.n_iterations = n_iterations
            self.weights = <span class="hljs-literal">None</span>
            self.bias = <span class="hljs-literal">None</span>
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_step_function</span>(<span class="hljs-params">self, x, threshold: int = <span class="hljs-number">0</span></span>):</span>
            <span class="hljs-keyword">return</span> np.where(x > threshold, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            self.weights = np.zeros(n_features)
            self.bias = <span class="hljs-number">0</span>
    
            <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(self.n_iterations):
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
                    linear_output = np.dot(X[i], self.weights) + self.bias
                    y_pred = self._step_function(linear_output)
                    self.weights += self.learning_rate * (y[i] - y_pred) * X[i]
                    self.bias += self.learning_rate * (y[i] - y_pred)
            <span class="hljs-keyword">return</span> self
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X</span>):</span>
            linear_output = np.dot(X, self.weights) + self.bias
            y_pred = self._step_function(linear_output)
            <span class="hljs-keyword">return</span> y_pred
    

    Simulate with synthetic datasets

    First, we generate a synthetic, linearly separable dataset using make_blobs, train the classifier we created, and then plot its decision boundary.

    <span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> make_blobs
    <span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    
    <span class="hljs-comment"># create a mock dataset</span>
    X, y = make_blobs(n_features=<span class="hljs-number">2</span>, centers=<span class="hljs-number">2</span>, n_samples=<span class="hljs-number">1000</span>, random_state=<span class="hljs-number">12</span>)
    
    <span class="hljs-comment"># split</span>
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># train the model</span>
    perceptron = Perceptron(learning_rate=<span class="hljs-number">0.1</span>, n_iterations=<span class="hljs-number">1000</span>).fit(X_train, y_train)
    
    <span class="hljs-comment"># make a prediction</span>
    y_pred_train = perceptron.predict(X_train)
    y_pred_test = perceptron.predict(X_test)
    
    <span class="hljs-comment"># evaluate the results</span>
    acc_train = np.mean(y_pred_train == y_train)
    acc_test = np.mean(y_pred_test == y_test)
    print(<span class="hljs-string">f"Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>}</span> nAccuracy (Test): <span class="hljs-subst">{acc_test:<span class="hljs-number">.3</span>}</span>"</span>)
    

    Results

    The classifier generated a clear, highly accurate linear decision boundary.

    • Accuracy (Train): 0.981

    • Accuracy (Test): 0.975

    Decision boundary of single-layer perceptron (Custom classifier)

    2. Leverage Scikit-learn’s MLPClassifier

    For convenience, we’ll use scikit-learn’s built-in MLPClassifier to build a similar, yet more robust, classifier:

    from sklearn.neural_network import MLPClassifier

    model = MLPClassifier(
        hidden_layer_sizes=(),    # intentionally left empty to create a single-layer perceptron
        activation='logistic',    # sigmoid activation function
        solver='sgd',             # SGD optimizer
        max_iter=1000,
        random_state=42,
        learning_rate='constant',
        learning_rate_init=0.1
    ).fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    acc_train = np.mean(y_pred_train == y_train)
    acc_test = np.mean(y_pred_test == y_test)
    print(f"MLPClassifier\nAccuracy (Train): {acc_train:.3}\nAccuracy (Test): {acc_test:.3}")
    

    Results

    The MLPClassifier generated a clear linear decision boundary with slightly better accuracy scores.

    • Accuracy (Train): 0.985

    • Accuracy (Test): 0.995

    Decision boundary of single-layer perceptron (MLPClassifier)

    Limitations of Single-Layer Perceptrons

    Now, let’s talk about the key differences between the MLPClassifier and our custom single-layer perceptron.

    Unlike more general neural networks, single-layer perceptrons use a step function as their activation.

    Due to its discontinuity at x=0, the step function is not differentiable over its entire domain (−∞ to ∞).

    This fundamental property precludes the use of gradient-based optimization algorithms such as SGD or Adam, as these methods depend on computing gradients (partial derivatives) of the cost function.

    In contrast, most neural networks employ differentiable activation functions (for example, sigmoid, ReLU) and loss functions (for example, MSE, Cross-Entropy) for effective optimization.
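
    For intuition, here’s a small sketch of these differentiable building blocks; the derivatives are what gradient-based optimizers actually consume (the sample inputs are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_derivative(z):
        s = sigmoid(z)
        return s * (1 - s)            # smooth and defined everywhere

    def relu(z):
        return np.maximum(0, z)

    def relu_derivative(z):
        return (z > 0).astype(float)  # conventionally 0 at exactly z = 0

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid_derivative(z))      # [0.105 0.25  0.105]
    print(relu_derivative(z))         # [0. 0. 1.]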

    Other challenges of a single-layer perceptron include:

    • Limited to linear separability: Because they can only learn linear decision boundaries, they are unable to handle complex, non-linearly separable data.

    • Lack of depth: Being single-layered, they cannot learn complex hierarchical representations.

    • Limited optimizer options: As mentioned, their non-differentiable activation function precludes the use of major gradient-based optimizers.

    So, in the next section, you’ll learn how multi-layer perceptrons overcome these disadvantages.

    What is a Multi-Layer Perceptron?

    An MLP is a feedforward artificial neural network that consists of at least three layers of nodes:

    • an input layer,

    • one or more hidden layers, and

    • an output layer.

    Except for the input nodes, each node is a neuron that uses a nonlinear activation function.​

    MLPs are applied to both classification and regression problems:

    • Classification tasks: MLPs are widely used for classification problems, such as handwriting recognition and speech recognition.​

    • Regression analysis: They are also applied in regression problems where the relationship between input and output is complex.​

    How to Build Multi-Layered Perceptrons

    Let’s handle a binary classification task using a standard MLP architecture.

    Outline of the Project

    Objective

    • Detect fraudulent transactions

    Evaluation Metrics

    • Considering the cost of misclassification, we’ll prioritize improving the Recall and Precision scores

    • Then check overall classification performance with the Accuracy score ((TP + TN) / (TP + TN + FP + FN)); a short sketch computing these metrics follows the list below

    Cost of Misclassification (from high to low):

    • False Negative (FN): The model incorrectly identifies a fraudulent transaction as legitimate (Missing actual fraud)

    • False Positive (FP): The model incorrectly identifies a legitimate transaction as fraudulent (Blocking legitimate customers.)

    • True Positive (TP): The model correctly identifies a fraudulent transaction as fraud.

    • True Negative (TN): The model correctly identifies a non-fraudulent transaction as non-fraud.
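
    Here’s the promised sketch of how these metrics fall out of a confusion matrix, using scikit-learn on a handful of made-up labels (1 = fraud, 0 = legitimate):

    from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
    print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / (TP + TN + FP + FN)
    print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
    print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)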

    Planning an MLP Architecture

    In the network, 19 input features feed into the first hidden layer’s 30 neurons, which use a ReLU activation function.

    Their outputs are then passed to the second hidden layer (another 30 ReLU neurons), and finally to a single sigmoid output neuron that produces the predicted probability.

    During the optimization process, we’ll let the optimizer (SGD and Adam) perform forward and backward passes to adjust parameters.

    Image: Standard MLP Architecture for Binary Classification Tasks (Created by Kuriko Iwai using image source)

    Especially in deeper networks, ReLU is advantageous in preventing the vanishing gradient problem, where gradients become extremely small as they are backpropagated from the output layer.
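
    A rough back-of-the-envelope sketch of why this matters: the sigmoid derivative never exceeds 0.25, so backpropagating through many sigmoid layers multiplies small factors together, while ReLU passes a gradient of 1 for active units (the layer count below is arbitrary):

    n_layers = 10
    max_sigmoid_grad = 0.25  # the sigmoid derivative never exceeds 0.25
    relu_grad = 1.0          # ReLU derivative for active (positive) units

    print("sigmoid chain:", max_sigmoid_grad ** n_layers)  # ~9.5e-07, the gradient nearly vanishes
    print("relu chain   :", relu_grad ** n_layers)         # 1.0, the gradient is preserved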

    Comparison of major activation functions: From left to right: Sigmoid, Tanh, ReLU

    Learn More: A Comprehensive Guide on Neural Network in Deep Learning

    Preprocessing the Datasets

    First, we consolidate three datasets (transaction, customer, and credit card) into a single DataFrame, sanitizing numerical and categorical data independently:

    <span class="hljs-keyword">import</span> json
    <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
    <span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
    <span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
    <span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler, OneHotEncoder
    <span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
    <span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
    <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
    
    <span class="hljs-comment"># download the raw data to local</span>
    <span class="hljs-keyword">import</span> kagglehub
    path = kagglehub.dataset_download(<span class="hljs-string">"computingvictor/transactions-fraud-datasets"</span>)
    dir = <span class="hljs-string">f'<span class="hljs-subst">{path}</span>/gd_card_flaud_demo'</span>
    
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">sanitize_df</span>(<span class="hljs-params">amount_str</span>):</span>
        <span class="hljs-string">"""Removes '$' and converts the string to a float."""</span>
        <span class="hljs-keyword">if</span> isinstance(amount_str, str):
            <span class="hljs-keyword">return</span> float(amount_str.replace(<span class="hljs-string">'$'</span>, <span class="hljs-string">''</span>))
        <span class="hljs-keyword">return</span> amount_str
    
    <span class="hljs-comment"># load transaction data</span>
    trx_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/transactions_data.csv'</span>)
    
    <span class="hljs-comment"># sanitize the dataset (drop unnecessary columns and error transactions, convert string to int/float dtype)</span>
    trx_df = trx_df[trx_df[<span class="hljs-string">'errors'</span>].isna()]
    trx_df = trx_df.drop(columns=[<span class="hljs-string">'merchant_city'</span>,<span class="hljs-string">'merchant_state'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'mcc'</span>, <span class="hljs-string">'errors'</span>], axis=<span class="hljs-string">'columns'</span>)
    trx_df[<span class="hljs-string">'amount'</span>] = trx_df[<span class="hljs-string">'amount'</span>].apply(sanitize_df)
    
    <span class="hljs-comment"># merge the dataframe with fraud transaction flag.</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/train_fraud_labels.json'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> fp:
        fraud_labels_json = json.load(fp=fp)
    
    fraud_labels_dict = fraud_labels_json.get(<span class="hljs-string">'target'</span>, {})
    fraud_labels_series = pd.Series(fraud_labels_dict, name=<span class="hljs-string">'is_fraud'</span>)
    fraud_labels_series.index = fraud_labels_series.index.astype(int) <span class="hljs-comment"># convert the datatype from string to integer</span>
    merged_df = pd.merge(trx_df, fraud_labels_series, left_on=<span class="hljs-string">'id'</span>, right_index=<span class="hljs-literal">True</span>, how=<span class="hljs-string">'left'</span>)
    merged_df.fillna({<span class="hljs-string">'is_fraud'</span>: <span class="hljs-string">'No'</span>}, inplace=<span class="hljs-literal">True</span>)
    merged_df[<span class="hljs-string">'is_fraud'</span>] = merged_df[<span class="hljs-string">'is_fraud'</span>].map({<span class="hljs-string">'Yes'</span>: <span class="hljs-number">1</span>, <span class="hljs-string">'No'</span>: <span class="hljs-number">0</span>})
    
    <span class="hljs-comment"># load card data</span>
    card_df = pd.read_csv(<span class="hljs-string">f'<span class="hljs-subst">{dir}</span>/cards_data.csv'</span>)
    card_df = card_df.drop(columns=[<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'acct_open_date'</span>, <span class="hljs-string">'card_number'</span>, <span class="hljs-string">'expires'</span>, <span class="hljs-string">'cvv'</span>], axis=<span class="hljs-string">'columns'</span>)
    card_df[<span class="hljs-string">'credit_limit'</span>] = card_df[<span class="hljs-string">'credit_limit'</span>].apply(sanitize_df)
    
    <span class="hljs-comment"># merge transaction and card data</span>
    merged_df = pd.merge(left=merged_df, right=card_df, left_on=<span class="hljs-string">'card_id'</span>, right_on=<span class="hljs-string">'id'</span>, how=<span class="hljs-string">'inner'</span>)
    merged_df = merged_df.drop(columns=[<span class="hljs-string">'id_y'</span>, <span class="hljs-string">'card_id'</span>], axis=<span class="hljs-string">'columns'</span>)
    
    <span class="hljs-comment"># converts categorical variables into a new binary column (0 or 1)</span>
    categorical_cols = merged_df.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns
    df = merged_df.copy()
    df = pd.get_dummies(df, columns=categorical_cols, dummy_na=<span class="hljs-literal">False</span>, dtype=float) 
    df = df.dropna().drop([<span class="hljs-string">'client_id'</span>, <span class="hljs-string">'id_x'</span>], axis=<span class="hljs-number">1</span>)
    print(<span class="hljs-string">'nDataFrame: n'</span>, df.head(n=<span class="hljs-number">3</span>))
    

    DataFrame:

    Base DataFrame

    Our DataFrame shows an extremely skewed data distribution with:

    • Fraud samples: 1,191

    • Non-fraud samples: 11,477,397

    For classification tasks, it’s crucial to be aware of sample size imbalances and employ appropriate strategies to mitigate their negative impact on classification model performance, especially regarding the minority class.

    For our data, we’ll:

    1. split the 1,191 fraud samples into training, validation, and test sets,

    2. add an equal number of randomly chosen non-fraud samples from the DataFrame, and

    3. adjust split balances later if generalization challenges arise.

    <span class="hljs-comment"># define the desired size of the fraud samples for the validation and test sets</span>
    val_size_per_class = <span class="hljs-number">200</span>
    test_size_per_class = <span class="hljs-number">200</span>
    
    <span class="hljs-comment"># create test sets</span>
    X_test_fraud = df_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
    X_test_non_fraud = df_non_fraud.sample(n=test_size_per_class, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># combine to form the balanced test set</span>
    X_test = pd.concat([X_test_fraud, X_test_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_test = X_test[<span class="hljs-string">'is_fraud'</span>]
    X_test = X_test.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    <span class="hljs-comment"># remove sampled rows from the original dataframes to avoid data leakage</span>
    df_fraud_remaining = df_fraud.drop(X_test_fraud.index)
    df_non_fraud_remaining = df_non_fraud.drop(X_test_non_fraud.index)
    
    
    <span class="hljs-comment"># create validation sets</span>
    X_val_fraud = df_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
    X_val_non_fraud = df_non_fraud_remaining.sample(n=val_size_per_class, random_state=<span class="hljs-number">42</span>)
    
    <span class="hljs-comment"># combine to form the balanced validation set</span>
    X_val = pd.concat([X_val_fraud, X_val_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_val = X_val[<span class="hljs-string">'is_fraud'</span>]
    X_val = X_val.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    <span class="hljs-comment"># remove sampled rows from the remaining dataframes</span>
    df_fraud_train = df_fraud_remaining.drop(X_val_fraud.index)
    df_non_fraud_train = df_non_fraud_remaining.drop(X_val_non_fraud.index)
    
    
    <span class="hljs-comment"># create training sets</span>
    min_train_samples_per_class = min(len(df_fraud_train), len(df_non_fraud_train))
    
    X_train_fraud = df_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
    X_train_non_fraud = df_non_fraud_train.sample(n=min_train_samples_per_class, random_state=<span class="hljs-number">42</span>)
    
    X_train = pd.concat([X_train_fraud, X_train_non_fraud]).sample(frac=<span class="hljs-number">1</span>, random_state=<span class="hljs-number">42</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    y_train = X_train[<span class="hljs-string">'is_fraud'</span>]
    X_train = X_train.drop(<span class="hljs-string">'is_fraud'</span>, axis=<span class="hljs-number">1</span>)
    
    
    print(<span class="hljs-string">"n--- Final Dataset Shapes and Distributions ---"</span>)
    print(<span class="hljs-string">f"X_train shape: <span class="hljs-subst">{X_train.shape}</span>, y_train distribution: <span class="hljs-subst">{np.unique(y_train, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    print(<span class="hljs-string">f"X_val shape: <span class="hljs-subst">{X_val.shape}</span>, y_val distribution: <span class="hljs-subst">{np.unique(y_val, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    print(<span class="hljs-string">f"X_test shape: <span class="hljs-subst">{X_test.shape}</span>, y_test distribution: <span class="hljs-subst">{np.unique(y_test, return_counts=<span class="hljs-literal">True</span>)}</span>"</span>)
    

    After the operation, we secured 1,582 training, 400 validation, and 400 test samples, each dataset maintaining a 50:50 split between fraud and non-fraud transactions:

    X, y datasets shape

    Considering the high-dimensional feature space with 19 input features, we’ll apply SMOTE to resample the training data (SMOTE should not be applied to the validation or test sets, to avoid data leakage):

    <span class="hljs-keyword">from</span> imblearn.over_sampling <span class="hljs-keyword">import</span> SMOTE
    <span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
    
    train_target = <span class="hljs-number">2000</span>
    
    smote_train = SMOTE(
      sampling_strategy={<span class="hljs-number">0</span>: train_target, <span class="hljs-number">1</span>: train_target},  <span class="hljs-comment"># increase sample size to 2,000</span>
      random_state=<span class="hljs-number">12</span>
    )
    X_train, y_train = smote_train.fit_resample(X_train, y_train)
    
    print(<span class="hljs-string">f"nAfter SMOTE with custom sampling_strategy (target train: <span class="hljs-subst">{train_target}</span>):"</span>)
    print(<span class="hljs-string">f"X_train_oversampled shape: <span class="hljs-subst">{X_train.shape}</span>"</span>)
    print(<span class="hljs-string">f"y_train_oversampled distribution: <span class="hljs-subst">{Counter(y_train)}</span>"</span>)
    

    We’ve secured 4,000 training samples, maintaining a 50:50 split between fraud and non-fraud transactions:

    Training sample shape after SMOTE

    Lastly, we’ll apply column transformers to numerical and categorical features separately.

    Column transformers are advantageous in handling datasets with multiple data types, as they can apply different transformations to different subsets of columns while preventing data leakage.

    <span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
    <span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
    <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
    
    categorical_features = X_train.select_dtypes(include=[<span class="hljs-string">'object'</span>]).columns.tolist()
    categorical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'most_frequent'</span>)),(<span class="hljs-string">'onehot'</span>, OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>))])
    
    numerical_features = X_train.select_dtypes(include=[<span class="hljs-string">'int64'</span>, <span class="hljs-string">'float64'</span>]).columns.tolist()
    numerical_transformer = Pipeline(steps=[(<span class="hljs-string">'imputer'</span>, SimpleImputer(strategy=<span class="hljs-string">'mean'</span>)), (<span class="hljs-string">'scaler'</span>, StandardScaler())])
    
    preprocessor = ColumnTransformer(
        transformers=[
            (<span class="hljs-string">'num'</span>, numerical_transformer, numerical_features),
            (<span class="hljs-string">'cat'</span>, categorical_transformer, categorical_features)
        ]
    )
    
    X_train_processed = preprocessor.fit_transform(X_train)
    X_val_processed = preprocessor.transform(X_val)
    X_test_processed = preprocessor.transform(X_test)
    

    Understanding Optimizers

    In deep learning, an optimizer is a crucial element that fine-tunes a neural network’s parameters during training. Its primary role is to minimize the model’s loss function, enhancing performance.

    Various optimization algorithms, known as optimizers, employ distinct strategies to converge efficiently toward optimal parameters and improve predictions.

    In this article, we’ll use the SGD Optimizer and Adam Optimizer.

    1. How an SGD (Stochastic Gradient Descent) Optimizer Works

    SGD is a widely used optimization algorithm that computes the gradient (the partial derivatives of the cost function) on a small mini-batch of examples at each update step:

    $$\begin{align*} w_j &:= w_j - \eta \frac{\partial J}{\partial w_j} \\ b &:= b - \eta \frac{\partial J}{\partial b} \end{align*}$$

    (w: weight, b: bias, J: cost function, η: learning rate)

    In binary classification, the cost function (J) is the binary cross-entropy defined with a sigmoid function (σ(z)), where z is the weighted sum of the inputs plus the bias term:

    $$\begin{align*} J(y, \hat y) &= -[y \log(\hat y) + (1-y)\log(1-\hat y)] \\ \hat y &= \sigma(z) = \frac{1}{1+e^{-z}} \\ z &= \sum_{i=1}^m w_i x_i + b \end{align*}$$
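
    Here’s a minimal NumPy sketch of one SGD update on a mini-batch for this logistic-regression-style output; the data is randomly generated for illustration, and the gradient of the cross-entropy with respect to z conveniently simplifies to (ŷ − y):

    import numpy as np

    rng = np.random.default_rng(0)
    X_batch = rng.normal(size=(32, 3))          # mini-batch of 32 samples with 3 features
    y_batch = rng.integers(0, 2, size=(32, 1))  # binary labels

    w = np.zeros((3, 1))
    b = 0.0
    eta = 0.01                                  # learning rate

    z = X_batch @ w + b
    y_hat = 1 / (1 + np.exp(-z))                # sigmoid prediction

    # gradients of the binary cross-entropy averaged over the mini-batch
    dw = X_batch.T @ (y_hat - y_batch) / len(X_batch)
    db = np.mean(y_hat - y_batch)

    w -= eta * dw                               # SGD update of the weights
    b -= eta * db                               # SGD update of the bias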

    2. How Adam (Adaptive Moment Estimation) Optimizer Works

    Adam is an optimization algorithm that computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.

    Adam optimizer combines the advantages of RMSprop (using squared gradients to scale the learning rate) and Momentum (using past gradients to accelerate convergence):

    $$w_{j,t+1} = w_{j,t} - \alpha \cdot \frac{\hat{m}_{t,w_j}}{\sqrt{\hat{v}_{t,w_j}} + \epsilon}$$

    where:

    • α: The learning rate (default is 0.001)

    • ϵ: A small positive constant used to avoid division by zero

    • m^: First moment (mean) estimate with a bias correction, leveraging Momentum:

    $$\begin{align*} \hat m_t &= \frac{m_t}{1 - \beta_1^t} \\ m_t &= \beta_1 m_{t-1} + (1-\beta_1) \underbrace{\frac{\partial L}{\partial w_t}}_{\text{gradient}} \end{align*}$$

    (β1: decay rate, typically set to β1 = 0.9)

    • v^: Second moment (variance) estimate with a bias correction, leveraging RMSprop:

    $$\begin{align*} \hat v_t &= \frac{v_t}{1 - \beta_2^t} \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2) \left(\frac{\partial L}{\partial w_t}\right)^2 \end{align*}$$

    (β2: decay rate, typically set to β2 = 0.999)

    Since both m and v are initialized at zero, Adam computes the bias-corrected estimates to prevent them from being biased toward zero.
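
    Below is a compact sketch of a single Adam update for one parameter vector, following the formulas above with the default hyperparameters; the gradient is a stand-in placeholder rather than one computed from a real network:

    import numpy as np

    alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

    w = np.zeros(3)        # parameters
    m = np.zeros_like(w)   # first-moment estimate
    v = np.zeros_like(w)   # second-moment estimate

    grad = np.array([0.2, -0.5, 0.1])  # placeholder gradient dL/dw at step t
    t = 1

    m = beta1 * m + (1 - beta1) * grad            # Momentum-style running mean
    v = beta2 * v + (1 - beta2) * grad**2         # RMSprop-style running average of squared gradients
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)

    w -= alpha * m_hat / (np.sqrt(v_hat) + eps)   # Adam update
    print(w)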

    Learn More: A Comprehensive Guide on Neural Network in Deep Learning

    How to Build an MLP Classifier with SGD Optimizer

    Custom Classifier

    This process involves a forward pass and backpropagation, during which SGD computes optimal weights and biases using gradients:

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
        <span class="hljs-comment"># SGD starts with randomly selected mini-batch for the epoch</span>
        X_batch = X_shuffled[i : i + self.batch_size]
        y_batch = y_shuffled[i : i + self.batch_size]
    
        <span class="hljs-comment"># A. forward pass</span>
        activations, zs = self._forward_pass(X_batch)
        y_pred = activations[<span class="hljs-number">-1</span>]  <span class="hljs-comment"># final output of the network</span>
    
        <span class="hljs-comment"># B. backpropagation</span>
        <span class="hljs-comment"># 1) calculating gradients for the output layer)</span>
        delta = y_pred - y_batch
        dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
        <span class="hljs-comment"># 2) update output layer parameters</span>
        self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
        self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db
    
        <span class="hljs-comment"># 3) iterate backward from last hidden layer to the input layer</span>
        <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
            delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
            dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
            db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
            self.weights[l] -= self.learning_rate * dW
            self.biases[l] -= self.learning_rate * db
    

    During the forward pass, the network calculates the weighted sum of inputs and bias (z) for each hidden layer, applies the ReLU activation to those values, and then computes the predicted output (y_pred) at the output layer using a sigmoid function.

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
        activations = [X]
        zs = []
    
        <span class="hljs-comment"># forward through hidden layers</span>
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
            z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
            zs.append(z)
            a = self._relu(z) <span class="hljs-comment"># using ReLU for hidden layers</span>
            activations.append(a)
    
        <span class="hljs-comment"># forward through output layer</span>
        z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
        zs.append(z_output)
    
        <span class="hljs-comment"># computes the final output using sigmoid function</span>
        y_pred = <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
        activations.append(y_pred)
        <span class="hljs-keyword">return</span> activations, zs
    

    So the final classifier looks like this:

    <span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
    
    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_SGD</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.01</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span></span>):</span>
            self.hidden_layer_sizes = hidden_layer_sizes
            self.learning_rate = learning_rate
            self.n_epochs = n_epochs
            self.batch_size = batch_size
            self.weights = []
            self.biases = []
            self.weights_history = []
            self.biases_history = []
            self.loss_history = []
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            s = self._sigmoid(x)
            <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> (x > <span class="hljs-number">0</span>).astype(float)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
            layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
            self.weights = []
            self.biases = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
                fan_in = layer_sizes[i]
                fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
                limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
                self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
                self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
            activations = [X]
            zs = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
                z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
                zs.append(z)
                a = self._relu(z)
                activations.append(a)
    
            z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
            zs.append(z_output)
            y_pred = self._sigmoid(z_output)
            activations.append(y_pred)
    
            <span class="hljs-keyword">return</span> activations, zs
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
            y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
            loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
            <span class="hljs-keyword">return</span> loss
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
            X = np.asarray(X)
            self._initialize_parameters(n_features)
            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
            activations, _ = self._forward_pass(X)
            initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(initial_loss)
    
            <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
                <span class="hljs-comment"># shuffle datasets</span>
                permutation = np.random.permutation(n_samples)
                X_shuffled = X[permutation]
                y_shuffled = y[permutation]
    
                <span class="hljs-comment"># mini-batch loop</span>
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                    X_batch = X_shuffled[i : i + self.batch_size]
                    y_batch = y_shuffled[i : i + self.batch_size]
    
                    activations, zs = self._forward_pass(X_batch)
                    y_pred = activations[<span class="hljs-number">-1</span>]
    
                    delta = y_pred - y_batch
                    dW = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                    db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
                    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * dW
                    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * db
    
                    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                        dW = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                        db = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                        self.weights[l] -= self.learning_rate * dW
                        self.biases[l] -= self.learning_rate * db
    
                self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
                self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
    
                activations, _ = self._forward_pass(X)
                epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
                self.loss_history.append(epoch_loss)
    
                <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                    print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
            <span class="hljs-keyword">return</span> self
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
            activations, _ = self._forward_pass(X)
            <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
            probabilities = self.predict_proba(X)
            <span class="hljs-keyword">return</span> (probabilities >= threshold).astype(int).flatten() <span class="hljs-comment"># for 1D output</span>
    

    Training / Prediction

    Train the model and make a prediction using training and validation datasets:

    <span class="hljs-comment"># 1. define the model</span>
    mlp_sgd = MLP_SGD(
      hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>, ), <span class="hljs-comment"># 2 hidden layers with 30 neurons each</span>
      learning_rate=<span class="hljs-number">0.001</span>,           <span class="hljs-comment"># a step size</span>
      n_epochs=<span class="hljs-number">1000</span>,                 <span class="hljs-comment"># number of epochs</span>
      batch_size=<span class="hljs-number">32</span>                  <span class="hljs-comment"># mini-batch size</span>
    )
    
    <span class="hljs-comment"># 2. train the model</span>
    mlp_sgd.fit(X_train_processed, y_train)
    
    <span class="hljs-comment"># 3. make a prediction with training and validation datasets</span>
    y_pred_train = mlp_sgd.predict(X_train_processed)
    y_pred_val = mlp_sgd.predict(X_val_processed)
    
    <span class="hljs-comment"># 4. compute evaluation matrics</span>
    conf_matrix = confusion_matrix(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    recall = recall_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    f1 = f1_score(y_true, y_pred, pos_label=<span class="hljs-number">1</span>)
    
    
    print(<span class="hljs-string">f"nMLP (Custom SGD) Accuracy (Train): <span class="hljs-subst">{acc_train:<span class="hljs-number">.3</span>f}</span>"</span>)
    print(<span class="hljs-string">f"MLP (Custom SGD) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
    

    Results

    • Recall: 0.7930 (training) to 0.6650 (validation)

    • Precision: 0.7790 (training) to 0.6786 (validation)

    The model effectively learned and generalized the patterns, achieving a training Recall of 79.3% (it caught roughly 80% of fraudulent transactions), with a drop of about 13 points on the validation set.

    Loss history:

    Loss by epoch, weight history, bias history (Source: Kuriko Iwai)

    We visualized the decision boundary using the first two principal components (PCA) as the x and y axes. Note that the boundary is non-linear.

    Image: Decision Boundary of MLP Classifier with SGD optimizer (Source: Kuriko Iwai)

    Leverage Scikit-learn’s MLPClassifier

    We can use MLPClassifier to define a similar model, incorporating:

    • Early stopping using internal validation to prevent overfitting and

    • L2 regularization with a small tolerance.

    <span class="hljs-keyword">from</span> sklearn.neural_network <span class="hljs-keyword">import</span> MLPClassifier
    
    <span class="hljs-comment"># define a model</span>
    model_sklearn_mlp_sgd = MLPClassifier(
        hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
        activation=<span class="hljs-string">'relu'</span>,
        solver=<span class="hljs-string">'sgd'</span>,
        learning_rate_init=<span class="hljs-number">0.001</span>,
        learning_rate=<span class="hljs-string">'constant'</span>,
        momentum=<span class="hljs-number">0.9</span>,
        nesterovs_momentum=<span class="hljs-literal">True</span>,
        alpha=<span class="hljs-number">0.00001</span>,           <span class="hljs-comment"># l2 regulation strength</span>
        max_iter=<span class="hljs-number">3000</span>,           <span class="hljs-comment"># max epochs (keep it high)</span>
        batch_size=<span class="hljs-number">16</span>,           <span class="hljs-comment"># mini-batch size</span>
        random_state=<span class="hljs-number">42</span>,
        early_stopping=<span class="hljs-literal">True</span>,     <span class="hljs-comment"># apply early stopping</span>
        n_iter_no_change=<span class="hljs-number">50</span>,     <span class="hljs-comment"># stop the iteration if internal validation score doesn't improve for 50 epochs</span>
        validation_fraction=<span class="hljs-number">0.1</span>, <span class="hljs-comment"># proportion of training data for internal validation (default is 0.1)</span>
        tol=<span class="hljs-number">1e-4</span>,                <span class="hljs-comment"># tolerance for optimization</span>
        verbose=<span class="hljs-literal">False</span>,
    )
    
    <span class="hljs-comment"># training</span>
    model_sklearn_mlp_sgd.fit(X_train_processed, y_train)
    
    <span class="hljs-comment"># make a prediction</span>
    y_pred_train_sklearn = model_sklearn_mlp_sgd.predict(X_train_processed)
    y_pred_val_sklearn = model_sklearn_mlp_sgd.predict(X_val_processed)
    

    Results

    • Recall: 0.7830 – 0.6200 (from training to validation)

    • Precision: 0.8208 – 0.6703 (from training to validation)

    The model showed strong performance during training, achieving a Recall of 78.30%. Its performance declined on the validation set.

    This suggests that while the model learned effectively from the training data, it may be overfitting and not generalizing as well to unseen data.
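
    One quick way to check this is to plot the training loss and the internal validation score that MLPClassifier records during fitting (loss_curve_ and, since early_stopping=True, validation_scores_):

    import matplotlib.pyplot as plt
    
    # cross-entropy loss on the training data at each iteration
    plt.plot(model_sklearn_mlp_sgd.loss_curve_, label='training loss')
    # accuracy on the internal validation split (recorded because early_stopping=True)
    plt.plot(model_sklearn_mlp_sgd.validation_scores_, label='internal validation score')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()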

    Leverage Keras Sequential Classifier

    For the Keras Sequential classifier, we can further enhance the model by:

    • Initializing the output layer’s bias with the log-odds of positive class occurrences in the training data (y_train) to address dataset imbalance and promote faster convergence,

    • Integrating 10% dropout between hidden layers to prevent overfitting by randomly deactivating neurons during training,

    • Including Precision and Recall in the model’s compilation metrics to optimize for classification performance,

    • Applying class weights to penalize misclassifications of the minority class more heavily, improving the model’s ability to learn rare patterns, and

    • Utilizing a separate validation dataset for monitoring performance during training to help detect overfitting and guide hyperparameter tuning.

    <span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
    <span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
    <span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
    <span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
    <span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> SGD
    <span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
    <span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight
    
    
    <span class="hljs-comment"># calculates an initial bias for the output layer </span>
    initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
    
    
    <span class="hljs-comment"># defines the model</span>
    model_keras_sgd = Sequential([
        Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>), <span class="hljs-comment"># 10% of the neurons in that layer randomly dropped out</span>
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, <span class="hljs-comment"># binary classification</span>
              bias_initializer=tf.keras.initializers.Constant(initial_bias)) <span class="hljs-comment"># to address the imbalanced datasets</span>
    ])
    
    
    
    <span class="hljs-comment"># compiles the model with the SGD optimizer</span>
    opt = SGD(learning_rate=<span class="hljs-number">0.001</span>)
    model_keras_sgd.compile(
        optimizer=opt, 
        loss=<span class="hljs-string">'binary_crossentropy'</span>,
        metrics=[
            <span class="hljs-string">'accuracy'</span>, <span class="hljs-comment"># add several metrics to return</span>
            tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
            tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
            tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
        ]
    )
    
    
    <span class="hljs-comment"># defines early stopping to prevent overfitting</span>
    early_stopping_callback = EarlyStopping(
        monitor='val_recall',  # monitor recall on the validation set
        mode='max',         # maximize recall
        patience=50,        # stop after 50 epochs without improvement in validation recall
        min_delta=<span class="hljs-number">1e-4</span>,     <span class="hljs-comment"># minimum change to be considered an improvement (tol)</span>
        verbose=<span class="hljs-number">0</span>
    )
    
    
    <span class="hljs-comment"># compute the class weight</span>
    class_weights = class_weight.compute_class_weight(
        class_weight=<span class="hljs-string">'balanced'</span>,
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights_dict = dict(zip(np.unique(y_train), class_weights))
    
    
    <span class="hljs-comment"># train the model</span>
    history = model_keras_sgd.fit(
        X_train_processed, y_train,
        epochs=<span class="hljs-number">1000</span>,
        batch_size=<span class="hljs-number">32</span>,
        validation_data=(X_val_processed, y_val), <span class="hljs-comment"># use our external val set</span>
        callbacks=[early_stopping_callback], <span class="hljs-comment"># early stopping to prevent overfitting</span>
        class_weight=class_weights_dict, # penalize misclassification of the minority class more heavily
        verbose=<span class="hljs-number">0</span>
    )
    
    <span class="hljs-comment"># evaluate</span>
    loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_sgd.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Train) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_sgd.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Validation) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    <span class="hljs-comment"># display model summary</span>
    model_keras_sgd.summary()
    

    Results

    • Recall: 0.7125 — 0.7250 (from training to validation)

    • Precision: 0.7607 — 0.7545 (from training to validation)

    Given that the gaps between training and validation are relatively small, the model is generalizing reasonably well.

    It suggests that the regularization techniques are likely effective in preventing significant overfitting.
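
    To confirm this, we can use the history object returned by fit above, which records every compiled metric per epoch. A minimal sketch comparing training and validation recall:

    import matplotlib.pyplot as plt
    
    # recall on the training set vs. the external validation set, per epoch
    plt.plot(history.history['recall'], label='train recall')
    plt.plot(history.history['val_recall'], label='validation recall')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()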

    Image: Summary of the Keras Sequential Model with SGD Optimizer

    How to Build an MLP Classifier with Adam Optimizer

    Custom Classifier

    As with SGD, parameter updates happen inside the mini-batch loop; with Adam, each weight and bias update also maintains first- and second-moment estimates of the gradients:

    <span class="hljs-comment"># apply Adam updates for output layer parameters</span>
    <span class="hljs-comment"># 1) weights (w)</span>
    self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
    self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
    m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
    v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
    <span class="hljs-comment"># 2) bias (b)</span>
    self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
    self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
    m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
    v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
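
    For reference, these lines implement the standard Adam update: exponential moving averages of the gradient (m) and the squared gradient (v) are maintained, bias-corrected by the time step t, and used to scale each parameter step:

    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    \theta \leftarrow \theta - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}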
    

    Following the same forward- and backward-pass logic as the MLP_SGD classifier, we construct the final classifier, initializing it with beta1, beta2, and epsilon for Adam’s moment estimates:

    <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MLP_Adam</span>:</span>
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, hidden_layer_sizes=(<span class="hljs-params"><span class="hljs-number">10</span>,</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">1000</span>, batch_size=<span class="hljs-number">32</span>,
                     beta1=<span class="hljs-number">0.9</span>, beta2=<span class="hljs-number">0.999</span>, epsilon=<span class="hljs-number">1e-8</span></span>):</span>
            self.hidden_layer_sizes = hidden_layer_sizes
            self.learning_rate = learning_rate
            self.n_epochs = n_epochs
            self.batch_size = batch_size
            self.beta1 = beta1
            self.beta2 = beta2
            self.epsilon = epsilon
    
            self.weights = [] 
            self.biases = []
    
            <span class="hljs-comment"># Adam optimizer internal states for each parameter (weights and biases)</span>
            self.m_weights = []
            self.v_weights = []
            self.m_biases = []
            self.v_biases = []
    
            self.weights_history = []
            self.biases_history = []
            self.loss_history = []
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> <span class="hljs-number">1</span> / (<span class="hljs-number">1</span> + np.exp(-np.clip(x, <span class="hljs-number">-500</span>, <span class="hljs-number">500</span>)))
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_sigmoid_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            s = self._sigmoid(x)
            <span class="hljs-keyword">return</span> s * (<span class="hljs-number">1</span> - s)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> np.maximum(<span class="hljs-number">0</span>, x)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_relu_derivative</span>(<span class="hljs-params">self, x</span>):</span>
            <span class="hljs-keyword">return</span> (x > <span class="hljs-number">0</span>).astype(float)
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_initialize_parameters</span>(<span class="hljs-params">self, n_features</span>):</span>
            layer_sizes = [n_features] + list(self.hidden_layer_sizes) + [<span class="hljs-number">1</span>]
    
            self.weights = []
            self.biases = []
            self.m_weights = []
            self.v_weights = []
            self.m_biases = []
            self.v_biases = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(layer_sizes) - <span class="hljs-number">1</span>):
                fan_in = layer_sizes[i]
                fan_out = layer_sizes[i+<span class="hljs-number">1</span>]
                limit = np.sqrt(<span class="hljs-number">6</span> / (fan_in + fan_out))
    
                self.weights.append(np.random.uniform(-limit, limit, (fan_in, fan_out)))
                self.biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
                self.m_weights.append(np.zeros((fan_in, fan_out)))
                self.v_weights.append(np.zeros((fan_in, fan_out)))
                self.m_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
                self.v_biases.append(np.zeros((<span class="hljs-number">1</span>, fan_out)))
    
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_forward_pass</span>(<span class="hljs-params">self, X</span>):</span>
            activations = [X]
            zs = []
    
            <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">1</span>):
                z = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[i]) + self.biases[i]
                zs.append(z)
                a = self._relu(z)
                activations.append(a)
    
            z_output = np.dot(activations[<span class="hljs-number">-1</span>], self.weights[<span class="hljs-number">-1</span>]) + self.biases[<span class="hljs-number">-1</span>]
            zs.append(z_output)
            y_pred = self._sigmoid(z_output)
            activations.append(y_pred)
    
            <span class="hljs-keyword">return</span> activations, zs
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_compute_loss</span>(<span class="hljs-params">self, y_true, y_pred</span>):</span>
            y_pred = np.clip(y_pred, <span class="hljs-number">1e-10</span>, <span class="hljs-number">1</span> - <span class="hljs-number">1e-10</span>)
            loss = -np.mean(y_true * np.log(y_pred) + (<span class="hljs-number">1</span> - y_true) * np.log(<span class="hljs-number">1</span> - y_pred))
            <span class="hljs-keyword">return</span> loss
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit</span>(<span class="hljs-params">self, X, y</span>):</span>
            n_samples, n_features = X.shape
            y = np.asarray(y).reshape(<span class="hljs-number">-1</span>, <span class="hljs-number">1</span>)
            X = np.asarray(X)
    
            self._initialize_parameters(n_features)
            self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
            self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
            activations, _ = self._forward_pass(X)
            initial_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
            self.loss_history.append(initial_loss)
    
            <span class="hljs-comment"># global time step for Adam bias correction</span>
            t = <span class="hljs-number">0</span>
    
            <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(self.n_epochs):
                permutation = np.random.permutation(n_samples)
                X_shuffled = X[permutation]
                y_shuffled = y[permutation]
    
                <span class="hljs-comment"># Mini-batch loop</span>
                <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">0</span>, n_samples, self.batch_size):
                    X_batch = X_shuffled[i : i + self.batch_size]
                    y_batch = y_shuffled[i : i + self.batch_size]
    
                    t += <span class="hljs-number">1</span>
    
                    <span class="hljs-comment"># 1. forward pass</span>
                    activations, zs = self._forward_pass(X_batch)
                    y_pred = activations[<span class="hljs-number">-1</span>] <span class="hljs-comment"># Output of the network</span>
    
                    <span class="hljs-comment"># 2. backpropagation</span>
                    delta = y_pred - y_batch
                    grad_w_output = np.dot(activations[<span class="hljs-number">-2</span>].T, delta) / X_batch.shape[<span class="hljs-number">0</span>] <span class="hljs-comment"># Average over batch</span>
                    grad_b_output = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                    <span class="hljs-comment"># apply Adam updates to weights</span>
                    self.m_weights[<span class="hljs-number">-1</span>] = self.beta1 * self.m_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_output
                    self.v_weights[<span class="hljs-number">-1</span>] = self.beta2 * self.v_weights[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_output ** <span class="hljs-number">2</span>)
                    m_w_hat = self.m_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_w_hat = self.v_weights[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.weights[<span class="hljs-number">-1</span>] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
                    <span class="hljs-comment"># apply Adam updates to bias</span>
                    self.m_biases[<span class="hljs-number">-1</span>] = self.beta1 * self.m_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_output
                    self.v_biases[<span class="hljs-number">-1</span>] = self.beta2 * self.v_biases[<span class="hljs-number">-1</span>] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_output ** <span class="hljs-number">2</span>)
                    m_b_hat = self.m_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta1**t)
                    v_b_hat = self.v_biases[<span class="hljs-number">-1</span>] / (<span class="hljs-number">1</span> - self.beta2**t)
                    self.biases[<span class="hljs-number">-1</span>] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
    
    
                    <span class="hljs-comment"># Propagate gradients backward through hidden layers</span>
                    <span class="hljs-keyword">for</span> l <span class="hljs-keyword">in</span> range(len(self.weights) - <span class="hljs-number">2</span>, <span class="hljs-number">-1</span>, <span class="hljs-number">-1</span>):
                        delta = np.dot(delta, self.weights[l+<span class="hljs-number">1</span>].T) * self._relu_derivative(zs[l]) <span class="hljs-comment"># d_activation(z)</span>
                        grad_w_hidden = np.dot(activations[l].T, delta) / X_batch.shape[<span class="hljs-number">0</span>]
                        grad_b_hidden = np.sum(delta, axis=<span class="hljs-number">0</span>) / X_batch.shape[<span class="hljs-number">0</span>]
    
                        <span class="hljs-comment"># apply Adam updates to weights</span>
                        self.m_weights[l] = self.beta1 * self.m_weights[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_w_hidden
                        self.v_weights[l] = self.beta2 * self.v_weights[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_w_hidden ** <span class="hljs-number">2</span>)
                        m_w_hat = self.m_weights[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                        v_w_hat = self.v_weights[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                        self.weights[l] -= self.learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
    
                        <span class="hljs-comment"># apply Adam updates to bias</span>
                        self.m_biases[l] = self.beta1 * self.m_biases[l] + (<span class="hljs-number">1</span> - self.beta1) * grad_b_hidden
                        self.v_biases[l] = self.beta2 * self.v_biases[l] + (<span class="hljs-number">1</span> - self.beta2) * (grad_b_hidden ** <span class="hljs-number">2</span>)
                        m_b_hat = self.m_biases[l] / (<span class="hljs-number">1</span> - self.beta1**t)
                        v_b_hat = self.v_biases[l] / (<span class="hljs-number">1</span> - self.beta2**t)
                        self.biases[l] -= self.learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
    
    
                self.weights_history.append([w.copy() <span class="hljs-keyword">for</span> w <span class="hljs-keyword">in</span> self.weights])
                self.biases_history.append([b.copy() <span class="hljs-keyword">for</span> b <span class="hljs-keyword">in</span> self.biases])
    
                activations, _ = self._forward_pass(X)
                epoch_loss = self._compute_loss(y, activations[<span class="hljs-number">-1</span>])
                self.loss_history.append(epoch_loss)
    
                <span class="hljs-keyword">if</span> (epoch + <span class="hljs-number">1</span>) % <span class="hljs-number">100</span> == <span class="hljs-number">0</span>:
                    print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch+<span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{self.n_epochs}</span>, Loss: <span class="hljs-subst">{epoch_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
            <span class="hljs-keyword">return</span> self
    
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict_proba</span>(<span class="hljs-params">self, X</span>):</span>
            activations, _ = self._forward_pass(X)
            <span class="hljs-keyword">return</span> activations[<span class="hljs-number">-1</span>]
    
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">predict</span>(<span class="hljs-params">self, X, threshold=<span class="hljs-number">0.5</span></span>):</span>
            probabilities = self.predict_proba(X)
            <span class="hljs-keyword">return</span> (probabilities >= threshold).astype(int).flatten()
    

    Training / Prediction

    Train the model and make a prediction using training and validation datasets:

    mlp_adam = MLP_Adam(hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">10</span>), learning_rate=<span class="hljs-number">0.001</span>, n_epochs=<span class="hljs-number">500</span>, batch_size=<span class="hljs-number">32</span>)
    mlp_adam.fit(X_train_processed, y_train)
    
    y_pred_train = mlp_adam.predict(X_train_processed)
    y_pred_val = mlp_adam.predict(X_val_processed)
    
    acc_train = accuracy_score(y_train, y_pred_train)
    acc_val = accuracy_score(y_val, y_pred_val)
    
    print(f"\nMLP (Custom Adam) Accuracy (Train): {acc_train:.3f}")
    print(<span class="hljs-string">f"MLP (Custom Adam) Accuracy (Validation): <span class="hljs-subst">{acc_val:<span class="hljs-number">.3</span>f}</span>"</span>)
    

    Results

    • Recall: 0.9870–0.6150 (from training to validation)

    • Precision: 0.9811–0.6474 (from training to validation)

    While the Adam optimizer fit the training data far better than SGD, the model exhibited significant overfitting, with Recall and Precision each dropping by more than 30 points from training to validation.

    Loss History

    Image: Left: loss by epoch, middle: weights history by epoch, right: bias history by epoch (Source: Kuriko Iwai)

    We visualized the decision boundary using the first two principal components (PCA) as the x and y axes.

    Image: Decision Boundary of MLP Classifier with Adam Optimizer (Source: Kuriko Iwai)

    Leverage Scikit-learn’s MLPClassifier

    We switch the solver from 'sgd' to 'adam' (dropping the momentum settings, which only apply to SGD) and keep the rest of the configuration essentially the same:

    model_sklearn_mlp_adam = MLPClassifier(
        hidden_layer_sizes=(<span class="hljs-number">30</span>, <span class="hljs-number">30</span>),
        activation=<span class="hljs-string">'relu'</span>,
        solver=<span class="hljs-string">'adam'</span>,             <span class="hljs-comment"># update the optimizer from SGD to Adam</span>
        learning_rate_init=<span class="hljs-number">0.001</span>,
        learning_rate=<span class="hljs-string">'constant'</span>,
        alpha=<span class="hljs-number">0.0001</span>,
        max_iter=<span class="hljs-number">3000</span>,
        batch_size=<span class="hljs-number">16</span>,
        random_state=<span class="hljs-number">42</span>,
        early_stopping=<span class="hljs-literal">True</span>,
        n_iter_no_change=<span class="hljs-number">50</span>,
        validation_fraction=<span class="hljs-number">0.1</span>,
        tol=<span class="hljs-number">1e-4</span>,
        verbose=<span class="hljs-literal">False</span>,
    )
    
    model_sklearn_mlp_adam.fit(X_train_processed, y_train)
    
    y_pred_train_sklearn = model_sklearn_mlp_adam.predict(X_train_processed)
    y_pred_val_sklearn = model_sklearn_mlp_adam.predict(X_val_processed)
    

    Results

    • Recall: 0.8975–0.6400 (from training to validation)

    • Precision: 0.8864 – 0.6305 (from training to validation)

    Despite a performance improvement compared to the SGD optimizer, the significant drop in both Recall (from 0.8975 to 0.6400) and Precision (from 0.8864 to 0.6305) from training to validation data indicates that the model is still overfitting.

    Leverage Keras Sequential Classifier

    Similar to MLPClassifier, we’ve switched the optimizer from SGD to Adam with all the other conditions remaining the same:

    <span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
    <span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
    <span class="hljs-keyword">from</span> keras.models <span class="hljs-keyword">import</span> Sequential
    <span class="hljs-keyword">from</span> keras.layers <span class="hljs-keyword">import</span> Dense, Dropout, Input
    <span class="hljs-keyword">from</span> keras.optimizers <span class="hljs-keyword">import</span> Adam
    <span class="hljs-keyword">from</span> keras.callbacks <span class="hljs-keyword">import</span> EarlyStopping
    <span class="hljs-keyword">from</span> sklearn.utils <span class="hljs-keyword">import</span> class_weight
    
    
    initial_bias = np.log([np.sum(y_train == <span class="hljs-number">1</span>) / np.sum(y_train == <span class="hljs-number">0</span>)])
    model_keras_adam = Sequential([
        Input(shape=(X_train_processed.shape[<span class="hljs-number">1</span>],)), 
        Dense(30, activation='relu'),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">30</span>, activation=<span class="hljs-string">'relu'</span>),
        Dropout(<span class="hljs-number">0.1</span>),
        Dense(<span class="hljs-number">1</span>, activation=<span class="hljs-string">'sigmoid'</span>, 
              bias_initializer=tf.keras.initializers.Constant(initial_bias))
    ])
    
    
    optimizer_keras = Adam(learning_rate=<span class="hljs-number">0.001</span>)
    model_keras_adam.compile(
        optimizer=optimizer_keras, 
        loss=<span class="hljs-string">'binary_crossentropy'</span>, 
        metrics=[
            <span class="hljs-string">'accuracy'</span>,
            tf.keras.metrics.Precision(name=<span class="hljs-string">'precision'</span>),
            tf.keras.metrics.Recall(name=<span class="hljs-string">'recall'</span>),
            tf.keras.metrics.AUC(name=<span class="hljs-string">'auc'</span>) 
        ]
    )
    
    early_stopping_callback = EarlyStopping(
        monitor=<span class="hljs-string">'val_recall'</span>,
        mode=<span class="hljs-string">'max'</span>,
        patience=<span class="hljs-number">50</span>,
        min_delta=<span class="hljs-number">1e-4</span>,
        verbose=<span class="hljs-number">0</span>
    )
    
    class_weights = class_weight.compute_class_weight(
        class_weight=<span class="hljs-string">'balanced'</span>,
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights_dict = dict(zip(np.unique(y_train), class_weights))
    
    model_keras_adam.fit(
        X_train_processed, y_train,
        epochs=<span class="hljs-number">1000</span>,
        batch_size=<span class="hljs-number">32</span>,
        validation_data=(X_val_processed, y_val),
        callbacks=[early_stopping_callback],
        class_weight=class_weights_dict,
        verbose=<span class="hljs-number">0</span>
    )
    
    
    loss_train, accuracy_train, precision_train, recall_train, auc_train = model_keras_adam.evaluate(X_train_processed, y_train, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Train) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_train:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    
    loss_val, accuracy_val, precision_val, recall_val, auc_val = model_keras_adam.evaluate(X_val_processed, y_val, verbose=<span class="hljs-number">0</span>)
    print(f"\n--- Keras Model Accuracy (Validation) ---")
    print(<span class="hljs-string">f"Loss: <span class="hljs-subst">{loss_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Precision: <span class="hljs-subst">{precision_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Recall: <span class="hljs-subst">{recall_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"AUC: <span class="hljs-subst">{auc_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    
    
    model_keras_adam.summary()
    

    Results

    • Recall: 0.7995–0.7500 (from training to validation)

    • Precision: 0.8409–0.8065 (from training to validation)

    The model exhibits good performance, with Recall slightly decreasing from 0.7995 (training) to 0.7500 (validation), and Precision similarly dropping from 0.8409 (training) to 0.8065 (validation).

    This indicates good generalization, with only minor performance degradation on unseen data.

    Image: Keras Sequential Model with Adam Optimizer (Source: Kuriko Iwai)

    Final Results: Generalization

    Finally, we’ll evaluate the model’s ultimate performance on the test dataset, which has remained completely separate from all prior training and validation processes.

    <span class="hljs-comment"># Custom classifiers</span>
    y_pred_test_custom_sgd = mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
    y_pred_test_custom_adam = mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)
    
    # MLPClassifier
    y_pred_test_sk_sgd = model_sklearn_mlp_sgd.fit(X_train_processed, y_train).predict(X_test_processed)
    y_pred_test_sk_adam = model_sklearn_mlp_adam.fit(X_train_processed, y_train).predict(X_test_processed)
    
    <span class="hljs-comment"># Keras Sequential</span>
    _, accuracy_test_sgd, precision_test_sgd, recall_test_sgd, auc_test_sgd = model_keras_sgd.evaluate(X_test_processed, y_test, verbose=0)
    _, accuracy_test_adam, precision_test_adam, recall_test_adam, auc_test_adam = model_keras_adam.evaluate(X_test_processed, y_test, verbose=0)
    

    Overall, the Keras Sequential model, optimized with SGD, achieved the best performance with an AUPRC (Area Under Precision-Recall Curve) of 0.72.

    Image: Precision-Recall curves for the six classifiers (custom, MLPClassifier, and Keras Sequential, each with SGD and Adam optimizers) (Source: Kuriko Iwai)
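
    As a rough sketch (not necessarily the exact code behind the figure above), the curves and AUPRC values can be computed from each model’s predicted probabilities. This assumes the variable names used throughout the tutorial and that the custom classifiers expose predict_proba as the MLP_Adam class does:

    from sklearn.metrics import precision_recall_curve, average_precision_score
    import matplotlib.pyplot as plt
    
    # positive-class probabilities on the test set for each model
    test_probas = {
        'Custom SGD': mlp_sgd.predict_proba(X_test_processed).ravel(),
        'Custom Adam': mlp_adam.predict_proba(X_test_processed).ravel(),
        'Sklearn SGD': model_sklearn_mlp_sgd.predict_proba(X_test_processed)[:, 1],
        'Sklearn Adam': model_sklearn_mlp_adam.predict_proba(X_test_processed)[:, 1],
        'Keras SGD': model_keras_sgd.predict(X_test_processed, verbose=0).ravel(),
        'Keras Adam': model_keras_adam.predict(X_test_processed, verbose=0).ravel(),
    }
    
    for name, proba in test_probas.items():
        precision, recall, _ = precision_recall_curve(y_test, proba)
        auprc = average_precision_score(y_test, proba)  # average precision, a PR-curve summary
        plt.plot(recall, precision, label=f'{name} (AUPRC={auprc:.2f})')
    
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend()
    plt.show()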

    Conclusion

    In this exploration, we experimented with custom classifiers, Scikit-learn models, and Keras deep learning architectures.

    Our findings underscore that effective machine learning hinges on three critical factors:

    1. robust data preprocessing (tailored to objectives and data distribution),

    2. judicious model selection, and

    3. strategic framework or library choices.

    Choosing the right framework

    Generally speaking, choose MLPClassifier when:

    • You’re primarily working with tabular data,

    • You want to prioritize simplicity, quick iteration, and seamless integration,

    • You have simple, shallow architectures, and

    • You have a moderate dataset size (manageable on a CPU).

    Choose Keras Sequential when:

    • You’re dealing with image, text, audio, or other sequential data,

    • You’re building deep learning models such as CNNs, RNNs, LSTMs,

    • You need fine-grained control over the model architecture, training process, or custom components,

    • You need to leverage GPU acceleration,

    • You’re planning for production deployment, and

    • You want to experiment with more advanced deep learning techniques.

    Limitation of MLPs

    While Multilayer Perceptrons (MLPs) proved valuable, their computational cost and susceptibility to overfitting emerged as key challenges.

    Looking ahead, we’ll delve into how Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) offer powerful solutions to these inherent MLP limitations.

    You can find more info about me on my Portfolio / LinkedIn / Github.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
