You are correct in observing that the scaling you apply to log(prob) is exactly what the "InfogainLoss" layer does (you can read more about it here and here).
As for the derivative (back-prop): the loss computed by this layer is
L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )
If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), recall that the derivative of log(x) is 1/x. This is why you see that
dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
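To make the two formulas above concrete, here is a minimal NumPy sketch (not Caffe's actual code). It assumes `infogain_mat` is stored as a 2-D array `H`, so that `infogain_mat[label * dim + j]` corresponds to `H[label, j]`; the function names and the 3-class numbers are made up for illustration. The analytic gradient is compared against central finite differences.

```python
import numpy as np

def infogain_loss(prob, H, label):
    """Forward pass: L = -sum_j H[label, j] * log(prob[j])."""
    return -np.sum(H[label] * np.log(prob))

def infogain_grad_wrt_prob(prob, H, label):
    """Backward pass w.r.t. the layer input: dL/dprob[j] = -H[label, j] / prob[j]."""
    return -H[label] / prob

# Hypothetical 3-class example (values chosen only for illustration).
prob = np.array([0.7, 0.2, 0.1])
H = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
label = 1

analytic = infogain_grad_wrt_prob(prob, H, label)

# Central finite differences, perturbing one prob[j] at a time.
eps = 1e-6
numeric = np.zeros_like(prob)
for j in range(prob.size):
    step = np.zeros_like(prob)
    step[j] = eps
    numeric[j] = (infogain_loss(prob + step, H, label)
                  - infogain_loss(prob - step, H, label)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, and the entry for the least likely class is the largest in magnitude because of the 1/prob(j) factor.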
Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier's outputs, and a log loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
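As a rough illustration of why the combined layer has no 1/prob term, here is a NumPy sketch (again, not Caffe's code) of softmax followed by log loss, differentiated with respect to the raw scores. The function name `softmax_with_loss` and the sample scores are assumptions made for this example. The gradient reduces to prob minus a one-hot vector, so every entry is bounded by 1 in magnitude.

```python
import numpy as np

def softmax_with_loss(x, label):
    """Return (loss, dL/dx) for softmax followed by log loss, computed jointly."""
    shifted = x - np.max(x)  # shifting leaves softmax unchanged and avoids overflow
    log_prob = shifted - np.log(np.sum(np.exp(shifted)))  # log-softmax
    loss = -log_prob[label]
    grad = np.exp(log_prob)  # softmax probabilities
    grad[label] -= 1.0       # dL/dx = prob - one_hot(label)
    return loss, grad

x = np.array([2.0, -1.0, 0.5])  # hypothetical raw classifier scores
loss, grad = softmax_with_loss(x, label=0)
print(loss, grad)
```

Note that the division by prob cancels analytically against the softmax Jacobian, which is where the extra robustness comes from.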
Working a little with the "InfogainLoss" layer, I noticed that prob(j) can sometimes take a very small value, which leads to an unstable estimation of the gradients.
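The instability can be reproduced with a small NumPy experiment. This is a sketch under assumptions: single precision (as is typical for network training), a hypothetical 2-class infogain matrix `H`, and illustrative function names. `grad_chained` mimics an infogain loss stacked on a separate softmax layer, while `grad_fused` differentiates the same loss directly with respect to the raw scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def grad_chained(x, H, label):
    """Infogain loss on top of a separate softmax layer: two chained backward passes."""
    p = softmax(x)
    with np.errstate(divide="ignore", invalid="ignore"):
        dL_dp = -H[label] / p                   # infogain backward: blows up as p -> 0
        return p * (dL_dp - np.sum(dL_dp * p))  # softmax backward

def grad_fused(x, H, label):
    """Same gradient taken directly w.r.t. x: no division by p is needed."""
    p = softmax(x)
    return p * np.sum(H[label]) - H[label]

# Hypothetical values: the probability of class 1 underflows to 0 in float32.
H = np.array([[1.0, 0.5], [0.5, 1.0]], dtype=np.float32)
x = np.array([0.0, -120.0], dtype=np.float32)
label = 1

print(softmax(x))
print(grad_chained(x, H, label))
print(grad_fused(x, H, label))
```

With these inputs the chained version produces non-finite values, because it divides by a probability that has underflowed, while the fused version stays finite. For moderate scores the two functions return the same gradient.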
Here's a detailed computation of the forward and backward passes of the "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
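A sketch of that computation, assuming the standard softmax definition and writing \ell for the ground-truth label, H for the infogain matrix, and \delta_{i\ell} for the Kronecker delta (this notation is chosen here for illustration):

```latex
% Softmax probabilities computed from the raw predictions x
p_j = \frac{e^{x_j}}{\sum_k e^{x_k}}

% SoftmaxWithLoss: forward and backward w.r.t. x
L_{\text{softmax}} = -\log p_\ell = -x_\ell + \log \sum_k e^{x_k}
\frac{\partial L_{\text{softmax}}}{\partial x_i} = p_i - \delta_{i\ell}

% InfogainLoss on top of softmax: forward and backward w.r.t. x
L_{\text{infogain}} = -\sum_j H_{\ell j} \log p_j
                    = -\sum_j H_{\ell j}\, x_j + \Big(\sum_j H_{\ell j}\Big) \log \sum_k e^{x_k}
\frac{\partial L_{\text{infogain}}}{\partial x_i} = p_i \sum_j H_{\ell j} - H_{\ell i}
```

Neither gradient contains a division by p_j, and setting H to the identity matrix reduces the infogain expressions to the "SoftmaxWithLoss" ones.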
PS: note that if you are going to use infogain loss for weighting, you should feed H (the infogain_mat) with label similarities, rather than distances.
Update:
I recently implemented this robust gradient computation and created this pull request. This PR was merged into the master branch in April 2017.