We want some stochastic behavior in the action. A deterministic policy hurts exploration badly, and the curvature of your computed rewards can be very steep, which is bad for training. Not much explanation so far, but take my word for it for a second. Therefore, one common approach is to sample the action from the policy distribution π. But a few years back, D. Silver proposed DDPG, which is based on a deterministic policy. He overcomes the exploration issue by adding noise to the action. In short, both are possible, but if the policy is deterministic, you need some exploration mechanism, like adding noise to the action or to the parameters.
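A minimal sketch of that second idea (DDPG-style exploration): the actor itself is deterministic, and we add Gaussian noise to its output, then clip back into the valid action range. The `deterministic_policy` function here is a hypothetical stand-in for a trained actor network, and the noise scale is an assumption.

```python
import numpy as np

def deterministic_policy(state):
    # Placeholder for a trained actor network; outputs an action in [-1, 1].
    return np.tanh(state.sum())

def explore(state, noise_std=0.1, low=-1.0, high=1.0, rng=None):
    # DDPG-style exploration: deterministic action + Gaussian noise, clipped
    # so the noisy action stays inside the environment's action bounds.
    rng = rng or np.random.default_rng()
    action = deterministic_policy(state)
    noisy = action + rng.normal(0.0, noise_std)
    return float(np.clip(noisy, low, high))

state = np.array([0.2, -0.1, 0.4])
print(explore(state))  # a slightly perturbed version of the deterministic action
```

The original DDPG paper used temporally correlated (Ornstein-Uhlenbeck) noise, but plain Gaussian noise as above is commonly used and works comparably in practice.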
For your Q2, it is explained under "Continuous control with Gaussian policies": the network predicts a continuous value (the mean of a Gaussian) for the action, and the actual action is sampled from that distribution.
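A rough sketch of such a Gaussian policy head, under the assumption that a linear layer stands in for the network and the log standard deviation is a fixed parameter (both are simplifications for illustration):

```python
import numpy as np

def gaussian_policy(state, weights, log_std=-0.5, rng=None):
    # The "network" (here a single linear layer) predicts the mean of a
    # Gaussian over continuous actions; the action is sampled, not taken
    # deterministically.
    rng = rng or np.random.default_rng()
    mean = weights @ state
    std = np.exp(log_std)
    action = rng.normal(mean, std)
    # Log-probability of the sampled action, needed for policy-gradient updates.
    log_prob = -0.5 * (((action - mean) / std) ** 2
                       + 2.0 * log_std + np.log(2.0 * np.pi))
    return action, log_prob

state = np.array([0.5, -0.2])
weights = np.array([[0.3, 0.1]])  # hypothetical learned parameters
action, log_prob = gaussian_policy(state, weights)
```

Because the sampling is stochastic, exploration comes for free here; in practice the log-std is often also predicted by the network or learned as a separate parameter.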