For (1), logically the maximum parallelism would be limited by the least-parallel part of the pipeline, which is the 10 LFBs; this is probably strictly true for demand-load parallelism when prefetching is disabled or can't help. In practice, things are more complicated once your load is at least partly helped by prefetching, since then the wider queues between L2 and RAM come into play, which can make the observed parallelism greater than 10. The most practical approach is probably direct measurement: given the measured latency to RAM and the observed throughput, you can calculate an effective parallelism for any particular load.
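That calculation is just Little's law (concurrency = latency × throughput), counted in cache lines instead of bytes. As a toy example with made-up but plausible numbers (~100 ns latency to RAM, 6.4 GB/s observed throughput, 64-byte lines):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers, not measurements: plug in your own. */
    double latency_ns   = 100.0; /* measured load-to-use latency to RAM */
    double bytes_per_ns = 6.4;   /* observed throughput, i.e. 6.4 GB/s  */
    double line_bytes   = 64.0;  /* one cache line per outstanding miss */

    /* Little's law: concurrency = latency * throughput, expressed here
     * as the number of cache lines that must be in flight on average. */
    double mlp = latency_ns * bytes_per_ns / line_bytes;
    printf("effective parallelism: %.1f lines in flight\n", mlp);
    return 0; /* prints 10.0 for these inputs */
}
```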
For (2), my understanding is that it is the other way around: all demand misses in the L1 first allocate into an LFB (unless, of course, they hit an existing LFB) and may later involve the "superqueue" (or whatever it is called these days) if they also miss further out in the cache hierarchy. The diagram you included seems to confirm that: the only path out of the L1 is through the LFB queue.
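Putting both points together: since every demand miss flows through the LFBs, their occupancy should cap demand-load parallelism, and you can see the cap directly. Below is a minimal sketch of the measurement idea, assuming Linux/POSIX `clock_gettime`, 64-byte lines, and sizes I picked for illustration (12 chains of 64 MiB each, ~768 MiB total). Each chain is a randomly permuted pointer chase (so prefetchers can't help); running n independent chains keeps roughly n misses in flight, and the time per miss should fall roughly as 1/n until the LFBs saturate, around 10 on the parts discussed here.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE   64          /* assumed cache-line size in bytes           */
#define NLINES (1u << 20)  /* 64 MiB per chain: sized to exceed the LLC  */
#define CHAINS 12          /* a few more chains than the 10 LFBs         */
#define STEPS  100000

/* Build one chain of line-sized nodes in randomly permuted order, so
 * every dereference is a demand miss the prefetchers can't predict. */
static void **make_chain(void) {
    char *buf = malloc((size_t)NLINES * LINE);
    size_t *perm = malloc(NLINES * sizeof *perm);
    if (!buf || !perm) { perror("malloc"); exit(1); }
    for (size_t i = 0; i < NLINES; i++) perm[i] = i;
    for (size_t i = NLINES - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < NLINES; i++)        /* link the full cycle */
        *(void **)(buf + perm[i] * LINE) = buf + perm[(i + 1) % NLINES] * LINE;
    free(perm);
    return (void **)buf;
}

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    void **p[CHAINS];
    for (int c = 0; c < CHAINS; c++) p[c] = make_chain();

    for (int n = 1; n <= CHAINS; n++) {
        double t0 = now_ns();
        for (long s = 0; s < STEPS; s++)
            for (int c = 0; c < n; c++)  /* n independent misses in flight */
                p[c] = (void **)*p[c];
        printf("%2d chains: %6.1f ns/miss\n", n,
               (now_ns() - t0) / ((double)STEPS * n));
    }

    uintptr_t sink = 0;  /* consume the pointers so the loops aren't elided */
    for (int c = 0; c < CHAINS; c++) sink ^= (uintptr_t)p[c];
    return (int)(sink & 1);
}
```

Compile with `-O2`; the ns/miss figure at n = 1 approximates the raw RAM latency, and the point where the curve stops improving is the effective parallelism limit.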