I am not sure which equations you are referring to. In the context of the natural gradient, the KL-divergence is used to measure the difference between two distributions, and which one plays the role of p and which plays q is sometimes up to the researcher. That choice makes a significant difference in GANs and variational inference, but it seems less important for the natural gradient. In a GAN, the two distributions are so different that extra care is needed for training to make progress. For policy gradient, the choice seems less critical than in other areas because the two distributions being compared are much closer to each other, so the difference between the KL-divergence and the reverse KL-divergence is not that important. The equations used here follow the convention of the research paper they are taken from.
That being said, I sometimes write things in the exact opposite way of what my head thinks. So if in doubt, you may want to double-check with the research paper.
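
To make the point about the direction of the KL-divergence concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: it uses 1-D Gaussians instead of actual policies, and the helper names `kl_gauss` and `kl_gap` and all the numbers are made up for this example. It compares KL(p || q) with KL(q || p) for a pair of nearly identical distributions and for a pair that are far apart. For nearby distributions both directions share the same second-order approximation (the quadratic form given by the Fisher information matrix), which is why the choice matters little there.

```python
import math


def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) for two 1-D Gaussians p and q."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


def kl_gap(mu_p, sigma_p, mu_q, sigma_q):
    """Return (forward KL, reverse KL, relative gap between the two)."""
    forward = kl_gauss(mu_p, sigma_p, mu_q, sigma_q)  # KL(p || q)
    reverse = kl_gauss(mu_q, sigma_q, mu_p, sigma_p)  # KL(q || p)
    rel_gap = abs(forward - reverse) / max(forward, reverse)
    return forward, reverse, rel_gap


# Hypothetical "small update" case: q is a slight perturbation of p.
near = kl_gap(0.0, 1.0, 0.01, 1.01)

# Hypothetical "very different distributions" case.
far = kl_gap(0.0, 1.0, 3.0, 2.0)

print("near: forward=%.6f reverse=%.6f relative gap=%.3f" % near)
print("far:  forward=%.6f reverse=%.6f relative gap=%.3f" % far)
```

In the nearby case the two directions agree to within a few percent, while in the far-apart case they differ by a large factor. This is only a toy comparison and not the setup of any particular paper.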