🚀 Implementation of ADAM
In earlier posts, we implemented a momentum-based optimizer, which offered several advantages:
Momentum helps avoid getting stuck in local minima, much like in physics 🏃‍♂️.
It smooths updates by acting as a moving average of the gradient, especially useful when dealing with steep loss surfaces 🌄.
[Figure: loss surface for the linear model f(x) = kx + b]
However, this approach had a significant drawback—it required a carefully selected learning rate ⚖️. Since the learning rate remained constant and didn’t adjust dynamically, large values could cause the updates to go in the wrong direction, leading to exploding gradients ⚡ and increasing loss 📉. This is less of an issue with single-variable functions like f(x), so let's consider the following function:
$$f(x_1, x_2) = 0.3 \cdot x_1 + 0.1 \cdot x_2 + 10$$
where:
$$w_1 = 0.3, \, w_2 = 0.1, \, b = 10$$
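Before training, we need data sampled from this function. Here is a minimal sketch of how such a dataset might be generated (the sample count, input range, and noise level are illustrative assumptions, not the exact values from the original experiment):

import numpy as np

rng = np.random.default_rng(42)

# Sample inputs and targets for f(x1, x2) = 0.3*x1 + 0.1*x2 + 10
n_samples = 200                               # assumed sample count
X = rng.uniform(0, 100, size=(n_samples, 2))  # assumed input range
noise = rng.normal(0, 2, size=n_samples)      # assumed noise level
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] + 10 + noise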
We will use the same momentum-based optimizer, but without any learning rate adaptation 🔄.
Code:
class MomentumOptimizer:
    def __init__(self, param, lr=0.0000001, beta=0.8):
        self.param = param
        self.lr = lr              # static learning rate: it never adapts
        self.beta = beta          # decay factor for the moving average
        self.momentum = 0
        self.prev_momentum = 0

    def step(self, gradient, epoch):  # epoch is unused here; Adam will need it
        self.prev_momentum = self.momentum
        self.momentum = self.beta * self.prev_momentum + self.lr * gradient
        self.param -= self.momentum

    def get_param(self):
        return self.param
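For context, a training loop driving this optimizer might look like the following sketch. It reuses the X and y arrays from the data sketch above, and the mean-squared-error gradient formulas are assumptions based on the setup, not code from the original post; the learning rates match the well-tuned values used in the next experiment:

# One optimizer per parameter, so each can get its own learning rate
w1_opt = MomentumOptimizer(param=0.0, lr=0.000004)
w2_opt = MomentumOptimizer(param=0.0, lr=0.000004)
b_opt = MomentumOptimizer(param=0.0, lr=0.01)

for epoch in range(500):
    w1, w2, b = w1_opt.get_param(), w2_opt.get_param(), b_opt.get_param()
    error = (w1 * X[:, 0] + w2 * X[:, 1] + b) - y  # predictions minus targets
    # Assumed MSE gradients with respect to each parameter
    w1_opt.step(2 * np.mean(error * X[:, 0]), epoch)
    w2_opt.step(2 * np.mean(error * X[:, 1]), epoch)
    b_opt.step(2 * np.mean(error), epoch)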
Now let’s look at the result of training with a good learning rate (0.000004 for the weights and 0.01 for the bias):
As seen, the model converges well, thanks to the carefully chosen learning rate 👍.
However, increasing the learning rate just a bit too much can lead to instability 😱. Let’s set the learning rate to 0.000008 for the weights and 0.02 for the bias:
The predictions are now too large to be visible, as the gradients have exploded 🔥.
Here’s what the training process looks like with these values:
| Epoch | Loss | w1 Gradient | w2 Gradient |
|-------|------|-------------|-------------|
| 0 | 1663.901 | 2361174.8 | 201861.549 |
| 1 | 9269.569 | -13415972.577 | -1097061.735 |
| 2 | 44622.031 | 64360033.279 | 5304690.402 |
| 3 | 212856.013 | -307215418.666 | -25282080.172 |
| 4 | 1015987.765 | 1466190419.961 | 120695444.971 |
| 5 | 4848691.714 | -6997400267.384 | -575986518.531 |
| 6 | 23140483.845 | 33395098059.392 | 2748925665.978 |
| ... | ... | ... | ... |
At epoch 439, the values are so large they appear as:
| Epoch | Loss | w1 Gradient | w2 Gradient |
|-------|------|-------------|-------------|
| 439 | 1.825613645291461e+301 | -2.63462883619106e+304 | -2.1686975236400557e+303 |
| 440 | 8.712743251677126e+301 | 1.257376999394519e+305 | 1.0350111461292857e+304 |
| 441 | 4.1581577331783074e+302 | -inf | -4.939591902211514e+304 |
| 442 | inf | inf | inf |
| 443 | nan | nan | nan |
| 444 | nan | nan | nan |
As you can see, the loss and gradients grow exponentially, and the gradients flip between negative and positive as momentum repeatedly overshoots the minimum ⚡. A single poorly chosen learning rate is enough to send the entire optimization haywire 🚨.
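To see this oscillation mechanism in isolation, here is a toy 1-D sketch, an illustrative assumption rather than the original experiment: minimizing f(w) = w² with momentum and a deliberately oversized learning rate:

# Toy divergence demo: minimize f(w) = w**2 with an oversized step size
w, momentum, beta, lr = 1.0, 0.0, 0.8, 2.0  # lr is far too large on purpose

for step in range(5):
    grad = 2 * w                             # derivative of w**2
    momentum = beta * momentum + lr * grad
    w -= momentum
    print(f"step {step}: w = {w:+.3f}")
# w flips sign and grows each step: -3.0, +5.8, -10.36, +18.15, -31.65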
Managing Learning Rate Dynamically 🔄
Now, let’s tackle this issue. First, we need to identify the problem: a static learning rate 🚧. We don't want to spend time tweaking the learning rate through trial and error 🔬. We want it to adapt dynamically 🌱. But to adapt, it needs something to base its adjustments on, for example, the gradient 📉.
Here’s when things get interesting 🔍:
Let’s remember that we already have something that changes based on the gradient: momentum. Momentum is essentially a moving average, and we use it to update the parameters (weights/coefficients) directly ⚙️. But what if we updated the learning rate in a similar way? 🤔 What if we tracked the gradient just like momentum does? Let’s give it a try! 💡
We could directly use our moving average (momentum) and come up with an update formula, but there’s a catch 🎣. Momentum cares about the direction of gradients, so it can be either positive or negative ➕➖. However, the learning rate must never be negative ❌. If we don’t manage this, positive and negative gradients might cancel each other out, like this: -5 + 5 + 4 - 4 = 0 🌀.
So, here’s the plan 📝: we’ll accumulate squared gradients, then take the square root when updating the parameters to return the values to their original scale 🔄. Squaring makes every contribution positive, so negative gradients can no longer cancel positive ones.
To accumulate the squared gradients, we’ll maintain a second moving average ✨.
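A quick numeric check makes the point, using the gradients from the cancellation example above:

grads = [-5, 5, 4, -4]

raw_sum = sum(grads)                      # -5 + 5 + 4 - 4 = 0: signal lost
squared_sum = sum(g ** 2 for g in grads)  # 25 + 25 + 16 + 16 = 82: sign-proof
magnitude = squared_sum ** 0.5            # square root restores the scale

print(raw_sum, squared_sum, round(magnitude, 2))  # 0 82 9.06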
Here are the formulas:
Momentum (M):
$$M(t) = M(t-1) \times \beta_1 + (1 - \beta_1) \times g(t)$$
Squared Gradient Moving Average (V):
$$V(t) = V(t-1) \times \beta_2 + (1 - \beta_2) \times g(t)^2$$
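In code, both averages update in lock-step from the same gradient. A minimal sketch with an arbitrary gradient sequence and the typical β values discussed next:

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0

for g in [2.0, -3.0, 1.5]:                # arbitrary gradient sequence
    m = beta1 * m + (1 - beta1) * g       # first moment: keeps direction
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: magnitude only
    print(f"g = {g:+.1f} -> M = {m:+.4f}, V = {v:.6f}")
# V stays positive no matter how the gradient's sign flips.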
Why are β1 and β2 different? 🤔
Good question! 😄 Why not use the same β for both moving averages?
β1 (usually 0.9) keeps 90% of the accumulated average and mixes in 10% of the new gradient. It decays faster ⏳, tracking recent gradients more closely.
β2 (usually 0.999) decays much slower ⏱️, holding a much longer history of past gradients. That makes it better at absorbing variations (gradient spikes ⚡), smoothing the values for stable adaptation 🌿, as the quick check below shows.
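The weight of a gradient seen n steps ago shrinks as βⁿ, so 50 steps later it is almost gone from M but still present in V (these numbers follow directly from the formula):

n = 50
print(0.9 ** n)    # ≈ 0.005: a 50-step-old gradient barely affects M
print(0.999 ** n)  # ≈ 0.951: the same gradient still carries weight in V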
Implementation of Adam 🚀
import numpy as np  # needed for np.sqrt below

class AdamOptimizer:
    def __init__(self, param, beta_m=0.9, beta_v=0.999):
        self.param = param
        self.beta_m = beta_m      # β1: decay for the momentum average
        self.beta_v = beta_v      # β2: decay for the squared-gradient average
        self.momentum = 0
        self.prev_momentum = 0
        self.lr = 0.05            # base learning rate, scaled adaptively below
        self.v = 0
        self.prev_v = 0

    def step(self, gradient, epoch):
        self.prev_momentum = self.momentum
        self.prev_v = self.v
        # Moving averages of the gradient and of the squared gradient
        self.momentum = (self.beta_m * self.prev_momentum) + (1 - self.beta_m) * gradient
        self.v = (self.beta_v * self.prev_v) + (1 - self.beta_v) * (gradient ** 2)
        # Bias correction 🧠
        # IMPORTANT: new variables preserve the originals for the next iteration
        momentum_hat = self.momentum / (1 - self.beta_m ** (epoch + 1))
        v_hat = self.v / (1 - self.beta_v ** (epoch + 1))
        # The effective learning rate shrinks when recent gradients are large
        learning_rate = self.lr / (np.sqrt(v_hat) + 1e-8)  # 1e-8 prevents division by zero ⚠️
        self.param -= learning_rate * momentum_hat  # Update the parameter 🔧

    def get_param(self):
        return self.param
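Usage mirrors the momentum version, except that epoch now matters because of bias correction. A sketch reusing the X, y data and the assumed MSE gradients from the earlier loop:

# One Adam instance per parameter; no per-parameter learning-rate tuning needed
w1_opt, w2_opt, b_opt = AdamOptimizer(0.0), AdamOptimizer(0.0), AdamOptimizer(0.0)

for epoch in range(500):
    w1, w2, b = w1_opt.get_param(), w2_opt.get_param(), b_opt.get_param()
    error = (w1 * X[:, 0] + w2 * X[:, 1] + b) - y
    w1_opt.step(2 * np.mean(error * X[:, 0]), epoch)
    w2_opt.step(2 * np.mean(error * X[:, 1]), epoch)
    b_opt.step(2 * np.mean(error), epoch)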
Bias Correction and Parameter Updates 🛠️
As you can see, we’ve introduced bias correction to deal with a common issue in optimization algorithms. When the moving averages (momentum and squared gradients) are initialized to 0, they start growing slowly, which causes parameter updates to be underpowered.
The bias correction compensates for this by adjusting the averages, particularly in the early epochs, where the estimates would otherwise be biased toward zero. As the number of epochs increases, the correction gets less significant, and the optimizer behaves more normally 🌱.
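A concrete example of the correction at work on the very first step, with an assumed gradient of 4.0 (the numbers follow directly from the update formulas):

beta1, g = 0.9, 4.0

m = beta1 * 0 + (1 - beta1) * g  # first update from a zero start: 0.4
m_hat = m / (1 - beta1 ** 1)     # corrected: 0.4 / 0.1 = 4.0, the raw gradient
print(m, m_hat)                  # ≈ 0.4, 4.0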
Key Idea: Momentum grows with the gradient, but the effective learning rate shrinks as the squared-gradient average grows. This dynamic adaptation lets Adam optimize efficiently across a wide variety of scenarios.
The formula contains a small term, 1e-8, which prevents division by zero during the update process ❌➗.
Training Result with Adam 🎯
Loss: 2.13 (acceptable given the noise in the data) 🎵
Loss: 2.5 (also fine given the noise in the data) 🎵
What's Next? 🚀
Now that we've covered momentum and dynamic learning rates, we’ll combine everything and begin building neural networks 🤖. We’ll create a dynamic model constructor that allows for easy replacement of optimizers, loss functions, layers, etc., providing flexibility to experiment and optimize different components.
Stay tuned — neural networks are coming soon! 🌟