🚀 Implementation of ADAM
In earlier posts, we implemented a momentum-based optimizer, which offered several advantages:
Momentum helps avoid getting stuck in local minima, much like in physics 🏃‍♂️.
It smooths updates by acting as a moving average of the gradient, especially useful when dealing with steep loss surfaces 🌄.
[Figure: loss surface for the linear model f(x) = kx + b]
However, this approach had a significant drawback—it required a carefully selected learning rate ⚖️. Since the learning rate remained constant and didn’t adjust dynamically, large values could cause the updates to go in the wrong direction, leading to exploding gradients ⚡ and increasing loss 📉. This is less of an issue with single-variable functions like f(x), so let's consider the following function:
$$f(x_1, x_2) = 0.3 \cdot x_1 + 0.1 \cdot x_2 + 10$$
where:
$$w_1 = 0.3, \, w_2 = 0.1, \, b = 10$$
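Before training, we need data sampled from this function. Here is a minimal sketch of how such a dataset might be generated (the sample count, input range, and noise level are illustrative assumptions, not the exact values from the original experiment):

import numpy as np

rng = np.random.default_rng(42)

# Sample inputs and targets for f(x1, x2) = 0.3*x1 + 0.1*x2 + 10
n_samples = 200                               # assumed sample count
X = rng.uniform(0, 100, size=(n_samples, 2))  # assumed input range
noise = rng.normal(0, 2, size=n_samples)      # assumed noise level
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] + 10 + noise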
We will use the same momentum-based optimizer, but without any learning rate adaptation 🔄.
Code:
class MomentumOptimizer:
    def __init__(self, param, lr=0.0000001, beta=0.8):
        self.param = param
        self.lr = lr              # static learning rate: it never adapts
        self.beta = beta          # decay factor for the moving average
        self.momentum = 0
        self.prev_momentum = 0

    def step(self, gradient, epoch):  # epoch is unused here; Adam will need it
        self.prev_momentum = self.momentum
        self.momentum = self.beta * self.prev_momentum + self.lr * gradient
        self.param -= self.momentum

    def get_param(self):
        return self.param
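For context, a training loop driving this optimizer might look like the following sketch. It reuses the X and y arrays from the data sketch above, and the mean-squared-error gradient formulas are assumptions based on the setup, not code from the original post; the learning rates match the well-tuned values used in the next experiment:

# One optimizer per parameter, so each can get its own learning rate
w1_opt = MomentumOptimizer(param=0.0, lr=0.000004)
w2_opt = MomentumOptimizer(param=0.0, lr=0.000004)
b_opt = MomentumOptimizer(param=0.0, lr=0.01)

for epoch in range(500):
    w1, w2, b = w1_opt.get_param(), w2_opt.get_param(), b_opt.get_param()
    error = (w1 * X[:, 0] + w2 * X[:, 1] + b) - y  # predictions minus targets
    # Assumed MSE gradients with respect to each parameter
    w1_opt.step(2 * np.mean(error * X[:, 0]), epoch)
    w2_opt.step(2 * np.mean(error * X[:, 1]), epoch)
    b_opt.step(2 * np.mean(error), epoch)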
Now let’s look at the result of training with a good learning rate (0.000004 for the weights and 0.01 for the bias):
As seen, the model converges well, thanks to the carefully chosen learning rate 👍.
However, increasing the learning rate just a bit too much can lead to instability 😱. Let’s set the learning rate to 0.000008 for the weights and 0.02 for the bias:
The predictions are now too large to be visible, as the gradients have exploded 🔥.
Here’s what the training process looks like with these values:
| Epoch | Loss | w1 Gradient | w2 Gradient |
|-------|------|-------------|-------------|
| 0 | 1663.901 | 2361174.8 | 201861.549 |
| 1 | 9269.569 | -13415972.577 | -1097061.735 |
| 2 | 44622.031 | 64360033.279 | 5304690.402 |
| 3 | 212856.013 | -307215418.666 | -25282080.172 |
| 4 | 1015987.765 | 1466190419.961 | 120695444.971 |
| 5 | 4848691.714 | -6997400267.384 | -575986518.531 |
| 6 | 23140483.845 | 33395098059.392 | 2748925665.978 |
| ... | ... | ... | ... |
At epoch 439, the values are so large they appear as:
| Epoch | Loss | w1 Gradient | w2 Gradient |
|-------|------|-------------|-------------|
| 439 | 1.825613645291461e+301 | -2.63462883619106e+304 | -2.1686975236400557e+303 |
| 440 | 8.712743251677126e+301 | 1.257376999394519e+305 | 1.0350111461292857e+304 |
| 441 | 4.1581577331783074e+302 | -inf | -4.939591902211514e+304 |
| 442 | inf | inf | inf |
| 443 | nan | nan | nan |
| 444 | nan | nan | nan |
As you can see, the loss and gradients grow exponentially, and the gradients flip between negative and positive as momentum repeatedly overshoots the minimum ⚡. A single poorly chosen learning rate is enough to send the entire optimization haywire 🚨.
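To see this oscillation mechanism in isolation, here is a toy 1-D sketch, an illustrative assumption rather than the original experiment: minimizing f(w) = w² with momentum and a deliberately oversized learning rate:

# Toy divergence demo: minimize f(w) = w**2 with an oversized step size
w, momentum, beta, lr = 1.0, 0.0, 0.8, 2.0  # lr is far too large on purpose

for step in range(5):
    grad = 2 * w                             # derivative of w**2
    momentum = beta * momentum + lr * grad
    w -= momentum
    print(f"step {step}: w = {w:+.3f}")
# w flips sign and grows each step: -3.0, +5.8, -10.36, +18.15, -31.65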
Managing Learning Rate Dynamically 🔄
Now, let’s tackle this issue. First, we need to identify the problem: a static learning rate 🚧. We don't want to spend time tweaking the learning rate through trial and error 🔬. We want it to adapt dynamically 🌱. But to adapt, it needs something to base its adjustments on, for example, the gradient 📉.
Here’s when things get interesting 🔍:
Let’s remember that we already have something that changes based on the gradient: momentum. Momentum is essentially a moving average, and we use it to update the parameters (weights/coefficients) directly ⚙️. But what if we updated the learning rate in a similar way? 🤔 What if we tracked the gradient just like momentum does? Let’s give it a try! 💡
We could directly use our moving average (momentum) and come up with an update formula, but there’s a catch 🎣. Momentum cares about the direction of gradients, so it can be either positive or negative ➕➖. However, the learning rate must never be negative ❌. If we don’t manage this, positive and negative gradients might cancel each other out, like this: -5 + 5 + 4 - 4 = 0 🌀.
So, here’s the plan 📝: we’ll accumulate squared gradients, then take the square root when updating the parameters to return the values to their original scale 🔄. Squaring makes every contribution positive, so negative gradients can no longer cancel positive ones.
To accumulate the squared gradients, we’ll maintain a second moving average ✨.
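A quick numeric check makes the point, using the gradients from the cancellation example above:

grads = [-5, 5, 4, -4]

raw_sum = sum(grads)                      # -5 + 5 + 4 - 4 = 0: signal lost
squared_sum = sum(g ** 2 for g in grads)  # 25 + 25 + 16 + 16 = 82: sign-proof
magnitude = squared_sum ** 0.5            # square root restores the scale

print(raw_sum, squared_sum, round(magnitude, 2))  # 0 82 9.06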
Here are the formulas:
Momentum (M):
$$M(t) = M(t-1) \times \beta_1 + (1 - \beta_1) \times g(t)$$
Squared Gradient Moving Average (V):
$$V(t) = V(t-1) \times \beta_2 + (1 - \beta_2) \times g(t)^2$$
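In code, both averages update in lock-step from the same gradient. A minimal sketch with an arbitrary gradient sequence and the typical β values discussed next:

beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0

for g in [2.0, -3.0, 1.5]:                # arbitrary gradient sequence
    m = beta1 * m + (1 - beta1) * g       # first moment: keeps direction
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: magnitude only
    print(f"g = {g:+.1f} -> M = {m:+.4f}, V = {v:.6f}")
# V stays positive no matter how the gradient's sign flips.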
Why are β1 and β2 different? 🤔
Good question! 😄 Why not use the same β for both moving averages?
β1 (usually 0.9) keeps 90% of the accumulated average and mixes in 10% of the new gradient. It decays faster ⏳, tracking recent gradients more closely.
β2 (usually 0.999) decays much slower ⏱️, holding a much longer history of past gradients. That makes it better at absorbing variations (gradient spikes ⚡), smoothing the values for stable adaptation 🌿, as the quick check below shows.
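The weight of a gradient seen n steps ago shrinks as βⁿ, so 50 steps later it is almost gone from M but still present in V (these numbers follow directly from the formula):

n = 50
print(0.9 ** n)    # ≈ 0.005: a 50-step-old gradient barely affects M
print(0.999 ** n)  # ≈ 0.951: the same gradient still carries weight in V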
Implementation of Adam 🚀
import numpy as np  # needed for np.sqrt below

class AdamOptimizer:
    def __init__(self, param, beta_m=0.9, beta_v=0.999):
        self.param = param
        self.beta_m = beta_m      # β1: decay for the momentum average
        self.beta_v = beta_v      # β2: decay for the squared-gradient average
        self.momentum = 0
        self.prev_momentum = 0
        self.lr = 0.05            # base learning rate, scaled adaptively below
        self.v = 0
        self.prev_v = 0

    def step(self, gradient, epoch):
        self.prev_momentum = self.momentum
        self.prev_v = self.v
        # Moving averages of the gradient and of the squared gradient
        self.momentum = (self.beta_m * self.prev_momentum) + (1 - self.beta_m) * gradient
        self.v = (self.beta_v * self.prev_v) + (1 - self.beta_v) * (gradient ** 2)
        # Bias correction 🧠
        # IMPORTANT: new variables preserve the originals for the next iteration
        momentum_hat = self.momentum / (1 - self.beta_m ** (epoch + 1))
        v_hat = self.v / (1 - self.beta_v ** (epoch + 1))
        # The effective learning rate shrinks when recent gradients are large
        learning_rate = self.lr / (np.sqrt(v_hat) + 1e-8)  # 1e-8 prevents division by zero ⚠️
        self.param -= learning_rate * momentum_hat  # Update the parameter 🔧

    def get_param(self):
        return self.param
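Usage mirrors the momentum version, except that epoch now matters because of bias correction. A sketch reusing the X, y data and the assumed MSE gradients from the earlier loop:

# One Adam instance per parameter; no per-parameter learning-rate tuning needed
w1_opt, w2_opt, b_opt = AdamOptimizer(0.0), AdamOptimizer(0.0), AdamOptimizer(0.0)

for epoch in range(500):
    w1, w2, b = w1_opt.get_param(), w2_opt.get_param(), b_opt.get_param()
    error = (w1 * X[:, 0] + w2 * X[:, 1] + b) - y
    w1_opt.step(2 * np.mean(error * X[:, 0]), epoch)
    w2_opt.step(2 * np.mean(error * X[:, 1]), epoch)
    b_opt.step(2 * np.mean(error), epoch)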
Bias Correction and Parameter Updates 🛠️
As you can see, we’ve introduced bias correction to deal with a common issue in optimization algorithms. When the moving averages (momentum and squared gradients) are initialized to 0, they start growing slowly, which causes parameter updates to be underpowered.
The bias correction compensates for this by adjusting the averages, particularly in the early epochs, where the estimates would otherwise be biased toward zero. As the number of epochs increases, the correction gets less significant, and the optimizer behaves more normally 🌱.
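A concrete example of the correction at work on the very first step, with an assumed gradient of 4.0 (the numbers follow directly from the update formulas):

beta1, g = 0.9, 4.0

m = beta1 * 0 + (1 - beta1) * g  # first update from a zero start: 0.4
m_hat = m / (1 - beta1 ** 1)     # corrected: 0.4 / 0.1 = 4.0, the raw gradient
print(m, m_hat)                  # ≈ 0.4, 4.0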
Key Idea: Momentum grows with the gradient, but the effective learning rate shrinks as the squared-gradient average grows. This dynamic adaptation lets Adam optimize efficiently across a wide variety of scenarios.
The formula contains a small term, 1e-8, which prevents division by zero during the update process ❌➗.
Training Result with Adam 🎯
Loss: 2.13 (acceptable given the noise in the data) 🎵
Loss: 2.5 (also fine given the noise in the data) 🎵
What's Next? 🚀
Now that we've covered momentum and dynamic learning rates, we’ll combine everything and begin building neural networks 🤖. We’ll create a dynamic model constructor that allows for easy replacement of optimizers, loss functions, layers, etc., providing flexibility to experiment and optimize different components.
Stay tuned — neural networks are coming soon! 🌟