There is a lot of maths behind the Lipschitz inequality. I don't think I can manage it without some heavy study, so I try to look at it from another angle.
The Lipschitz constant is the maximum magnitude of the derivative. If we clip the weights, the partial derivative through each layer is bounded in proportion to the clipped weights, even though adding layers may make the overall gradient grow or vanish fast. Either way, we can find an upper bound for α.
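To make that concrete, here is a minimal sketch, not anyone's official implementation. It assumes α stands for the Lipschitz constant, a small fully connected ReLU network, and the infinity norm for measuring distances. The layer sizes, the clip value `c`, and all function names are made up for illustration. With every weight clipped to [-c, c], a layer with n_in inputs can stretch distances by at most c * n_in, so the product over layers bounds the whole network.

```python
import random

def clip_weights(weights, c):
    """Clamp every entry of every layer matrix to [-c, c]."""
    return [[[max(-c, min(c, w)) for w in row] for row in layer]
            for layer in weights]

def inf_norm(matrix):
    """Induced infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(w) for w in row) for row in matrix)

def lipschitz_upper_bound(weights):
    """Product of per-layer norms. Valid because ReLU is 1-Lipschitz."""
    bound = 1.0
    for layer in weights:
        bound *= inf_norm(layer)
    return bound

def forward(weights, x):
    """Plain MLP forward pass: ReLU on hidden layers, linear output."""
    for i, layer in enumerate(weights):
        x = [sum(w * xi for w, xi in zip(row, x)) for row in layer]
        if i < len(weights) - 1:
            x = [max(0.0, v) for v in x]
    return x

random.seed(0)
sizes = [4, 8, 8, 1]  # hypothetical layer widths
# Each layer is a list of rows; one row per output unit, n_in entries per row.
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes, sizes[1:])]

c = 0.1  # hypothetical clipping value
clipped = clip_weights(weights, c)

# Bound computed from the actual clipped weights.
bound = lipschitz_upper_bound(clipped)

# Worst case that clipping alone guarantees: product of c * n_in per layer.
worst_case = 1.0
for n_in in sizes[:-1]:
    worst_case *= c * n_in

print("bound from clipped weights:", bound)
print("worst case from clipping alone:", worst_case)
```

The worst case is a product of one factor per layer, which is why it shrinks or blows up quickly with depth depending on whether c * n_in is below or above 1. That seems to match why the clip value is so touchy to tune.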
Clipping is a simple and easy way to enforce the Lipschitz constraint, but even the author thinks it is hard to tune. All of this research falls under one interesting concept, penalizing the gradient, which makes sense given the original goal of having a smoother gradient everywhere.
Sorry for the delay. I just saw your comment.