I am answering my own question in the hope that it will be helpful for other people.
SAC works like a soft switch, more like a mixing coefficient, which tells what information to take from both the atrous convolution (having different atrous rate) and mix them up. As "S" is dependent on 1x1 convolution has a trainable parameter, this helps the network to learn the optimal mixing coefficient. This is how our algorithm is looking twice at the image with the different receptive fields (different atrous rate) to capture important semantic level information which is important for object detection and semantic/ instance segmentation.
These images helped me a lot to unfold the information.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…