There is a lot of maths behind the Lipschitz inequality. I don't think I can manage it without some heavy study, so I'll try to look at it from another angle.


The Lipschitz constant is the maximum norm of the derivative. If we clip the weights, the partial derivative for each layer is bounded by the weights, even though stacking layers may make the overall gradient grow or vanish quickly. Either way, we can find an upper bound for the Lipschitz constant α.
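As a rough sketch of that idea (the function name and constants here are illustrative, not from the WGAN code): clipping every weight entry into [-c, c] gives a crude upper bound on a linear layer's Lipschitz constant, since the spectral norm is at most the Frobenius norm, which after clipping is at most sqrt(number of entries) times c.

```python
import numpy as np

def clip_weights(W, c=0.01):
    """WGAN-style weight clipping: force every entry into [-c, c]."""
    return np.clip(W, -c, c)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # a toy layer with some large entries
Wc = clip_weights(W, c=0.01)

# After clipping, the layer's Lipschitz constant (its spectral norm)
# is bounded: spectral norm <= Frobenius norm <= sqrt(4 * 3) * c
spec = np.linalg.norm(Wc, 2)
bound = np.sqrt(Wc.size) * 0.01
print(spec <= bound)
```

Stacking layers multiplies these per-layer bounds, which is why a deep network's overall bound can still grow or shrink fast.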

Clipping is a simple and easy way to enforce the Lipschitz constraint, but even the authors admit it is hard to tune. Still, all this research converges on one interesting concept: penalizing the gradient, which makes sense given the original motivation of having a smoother gradient everywhere.
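A minimal sketch of the gradient-penalty idea (the toy critic and the function name are my own illustration, assuming the WGAN-GP form of the penalty): instead of clipping weights, add a term λ(‖∇f(x)‖ − 1)² to the loss so the critic's gradient norm is pushed toward 1 everywhere.

```python
import numpy as np

def gradient_penalty(grad_norm, lam=10.0):
    # WGAN-GP-style term: penalize deviation of the gradient norm from 1
    return lam * (grad_norm - 1.0) ** 2

# Toy linear critic f(x) = a . x, whose gradient is a everywhere,
# so the gradient norm is just ||a||.
a = np.array([3.0, 4.0])
grad_norm = np.linalg.norm(a)          # 5.0
penalty = gradient_penalty(grad_norm)  # 10 * (5 - 1)^2 = 160.0
print(penalty)
```

In a real setup the gradient is taken with respect to interpolated samples via autodiff; the toy analytic gradient here just shows what the penalty rewards.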

Sorry for the delay; I just saw your comment.
