Dec 18, 2020
We define the function mapping x to y, so it is our choice to make it work with tape.gradient. For example, the elements of y may represent two different loss functions that we want to sum. In other models, such as RNNs, we want to sum the gradients across time steps since the weights in the RNN cells are shared. So when y is non-scalar, tape.gradient summing over the elements of y just provides a convenient way to do this. Could tape.gradient theoretically be defined with multiplication instead of summation? Yes, but summation covers more useful cases, and that is the choice TensorFlow made. The sketch below illustrates this behavior.
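Here is a minimal sketch (with example values I picked for illustration) showing that for a non-scalar target y, tape.gradient returns the gradient of the sum of y's elements, i.e. the same result as differentiating tf.reduce_sum(y) explicitly:

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape(persistent=True) as tape:
    y = x * x                    # non-scalar target: y = [x0^2, x1^2]
    y_sum = tf.reduce_sum(y)     # scalar target: x0^2 + x1^2

# Gradient of the non-scalar y: TF differentiates the sum of its elements.
grad_vector = tape.gradient(y, x)

# Summing explicitly first gives the same result.
grad_sum = tape.gradient(y_sum, x)

print(grad_vector.numpy())  # [2. 4.]  (i.e. [2*x0, 2*x1])
print(grad_sum.numpy())     # [2. 4.]

del tape  # free resources held by the persistent tape
```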
