2. Gradient Descent: The First Step in Optimization



🚀 Learning Rate Scheduling: Fixing the Convergence Issue

Context 📚

In my first post, I implemented simple Gradient Descent, and while it worked, we faced a common issue: the parameters converged inconsistently. The intercept (b) converged much more slowly than the slope (k).

The reason? In the model y = kx + b, the gradient of the loss with respect to k is scaled by the input x, while the gradient with respect to b is not. So k receives larger gradients, and as the residuals shrink over training, each update becomes smaller and smaller, slowing things down, especially for b.
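
To make that concrete, here is a minimal sketch of the two gradients for an MSE loss, assuming the same y = kx + b setup as the first post (the function name is mine, not from the notebook):

import numpy as np

def gradients(x, y, k, b):
    # MSE loss: L = mean((y - (k*x + b))**2)
    residual = y - (k * x + b)
    grad_k = -2 * np.mean(x * residual)  # scaled by the input x
    grad_b = -2 * np.mean(residual)      # no x scaling
    return grad_k, grad_b

Whenever |x| tends to be larger than 1, grad_k dominates grad_b, which is why b lagged behind.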

To solve this, I thought: why not scale the gradient of b? This idea led to using a learning rate, a widely used concept in optimization. The idea is simple: if the raw gradient steps are too small, we multiply the gradient by a scaling factor (the learning rate) so that updates remain significant and training doesn't take forever.
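
The update rule itself is a single line; here is a minimal sketch (the default lr value is just illustrative):

def gd_step(param, grad, lr=0.0001):
    # Move against the gradient, with the step size scaled by lr
    return param - lr * grad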

Main 🔧

We introduced a learning rate to scale the gradient, but initially it was constant. A constant learning rate rescales every update by the same factor, so it can't adapt as training progresses.

Then, I thought:

  • If we visualize the loss function as a parabola, we start with high loss and a large gradient.

  • So, wouldn’t it make sense to start with big learning rates and decrease them over time as we approach the minimum?

This seemed intuitive: big steps when far from the minimum, smaller steps for precision when close. However, in practice, it performed even worse than a constant learning rate! 😲
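
A common time-based decay looks something like this (the exact formula isn't critical here; treat this one as illustrative):

def decaying_lr(initial_lr, decay_rate, epoch):
    # Classic time-based decay: lr shrinks as epochs accumulate
    return initial_lr / (1 + decay_rate * epoch)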

The reason?

  • As training progresses, gradients naturally become smaller.

  • If we also decrease the learning rate, the updates shrink even more, making training painfully slow.

So, I flipped my approach: instead of decreasing the learning rate, I started with a small initial value and increased it over time. And guess what? It worked better! 🎉
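
In schedule form, the flip is just the inverse idea (this is the same formula used inside the LRScheduler class below):

def growing_lr(initial_lr, decay_rate, epoch):
    # The learning rate grows linearly with the epoch number
    return initial_lr * (1 + epoch * decay_rate)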

Challenges ⚠️

This method helped because:

  • Early in training, gradients are large, so even a small learning rate produces meaningful updates.

  • Later, gradients are small, and the now-larger learning rate keeps the updates from becoming insignificant.

However, a new issue emerged:

  • Since the learning rate only depends on epochs, it keeps growing indefinitely.

  • This makes the method unreliable over long training periods.

If training runs for too long, the learning rate can become extremely large, causing unstable updates and uncontrollable gradients. Instead of converging, the model may start jumping wildly, preventing it from learning anything useful.
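
With the default values from the scheduler below (initial_lr = 0.0001, decay_rate = 0.01), the unbounded growth is easy to see:

initial_lr, decay_rate = 0.0001, 0.01
for epoch in (0, 450, 10_000):
    print(epoch, initial_lr * (1 + epoch * decay_rate))
# 0     -> 0.0001   (1x)
# 450   -> 0.00055  (5.5x)
# 10000 -> 0.0101   (101x)

Nothing in the schedule ever caps it.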

In other words, we fixed one problem but created another. 🚧

Results 📉

Here's what we observed:

  • The model was trained for 450 epochs.

  • Loss on training data: 0.8

  • Loss on test data: 0.68

  • This matches expectations: since we introduced noise (mean = 0, std = 1), the irreducible loss floor is around the noise variance, which is also 1 here.

As you can see, the intercept b converges much faster with the scheduler, settling by around the 100th epoch:

Without the scheduler, the convergence of b is noticeably slower, taking until around the 280th epoch:


Below is the learning rate scheduler I implemented:

class LRScheduler:
    """Applies gradient updates with a linearly growing learning rate."""

    def __init__(self, param, initial_lr=0.0001, decay_rate=0.01):
        self.param = param
        self.initial_lr = initial_lr
        self.decay_rate = decay_rate  # how fast the learning rate grows per epoch
        self.lr = self.initial_lr

    def step(self, gradient, epoch):
        # Grow the learning rate linearly with the epoch number to
        # compensate for gradients that shrink as training progresses.
        self.lr = self.initial_lr * (1 + epoch * self.decay_rate)
        self.param -= self.lr * gradient

    def get_param(self):
        return self.param
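
For context, here is a minimal sketch of how the scheduler plugs into a training loop. The synthetic data and loop details are illustrative, not the exact notebook code:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1, 200)  # noise: mean 0, std 1

k_sched = LRScheduler(param=0.0)
b_sched = LRScheduler(param=0.0)

for epoch in range(450):
    k, b = k_sched.get_param(), b_sched.get_param()
    residual = y - (k * x + b)
    k_sched.step(gradient=-2 * np.mean(x * residual), epoch=epoch)
    b_sched.step(gradient=-2 * np.mean(residual), epoch=epoch)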

🔗 Here’s the link to my Jupyter Notebook on GitHub: GitHub Link

ℹ️ This type of mechanism is known as a Learning Rate Scheduler: it adjusts the learning rate over the course of training to improve convergence and stability.

🚧 This approach helped, but it’s still not reliable for long-term use. The next step? Refining it into a more optimized method.

What’s Next? 🔄

What we did here, adjusting the step size dynamically during training, is the core idea behind what's called an optimizer. But this version isn't perfect yet. Stay tuned, because we're going to make it even better! 🚀

More updates coming soon! 🔥