There are many possible designs for a TLB that supports multiple page sizes and the trade-offs are significant. However, I'll only briefly discuss those designs used in commercial processors (see this and this for more).
One immediate issue is how to determine the page size before accessing a set-associative TLB. A given virtual address to be mapped to a physical address has to be partitioned as follows:
-----------------------------------------
|      page number      |  page offset  |
-----------------------------------------
|   tag    |   index    |  page offset  |
-----------------------------------------
The index is used to determine which set of the TLB to look up and the tag is used to determine whether there is a matching entry in that set. But given only a virtual address, the page size cannot be known without accessing the page table entry. And if the page size is not known, the size of the page offset cannot be determined, which means that the locations of the bits that constitute the index and the tag are not known either.
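To make the dependence concrete, here is a small C sketch that splits the same virtual address under two different page-size assumptions. The 4 KB and 2 MB page sizes, the 16-set geometry, and the example address are all made-up parameters, not those of any particular processor:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical set-associative TLB geometry: 16 sets (assumed for illustration). */
#define TLB_SETS     16
#define TLB_SET_BITS 4

/* Split a virtual address into (tag, index, offset) for a given page size. */
static void split_va(uint64_t va, unsigned page_offset_bits,
                     uint64_t *tag, uint64_t *index, uint64_t *offset)
{
    *offset = va & ((1ULL << page_offset_bits) - 1);
    *index  = (va >> page_offset_bits) & (TLB_SETS - 1);
    *tag    = va >> (page_offset_bits + TLB_SET_BITS);
}

int main(void)
{
    uint64_t va = 0x00007f3bd45a1c80ULL;
    uint64_t tag, index, offset;

    split_va(va, 12, &tag, &index, &offset);    /* assuming a 4 KB page */
    printf("4 KB: tag=%#llx index=%llu\n",
           (unsigned long long)tag, (unsigned long long)index);

    split_va(va, 21, &tag, &index, &offset);    /* assuming a 2 MB page */
    printf("2 MB: tag=%#llx index=%llu\n",
           (unsigned long long)tag, (unsigned long long)index);
    return 0;
}

The two calls yield different index and tag values for the same address, which is exactly why the hardware cannot pick the right set without already knowing the page size.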
Most commercial processors use one of two designs (or both) to deal with this issue. The first is to use a parallel TLB structure where each TLB is designated for page entries of a particular size only (this is not precise; see below). All of the TLBs are looked up in parallel. Typically there is either a single hit or a miss in all of them, but there are also situations where multiple hits can occur; in such cases the processor may choose one of the cached entries.
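Here is a rough software model of this first design. The per-size TLBs, the 16-set direct-mapped organization, and all the names are illustrative assumptions; real hardware probes all of the structures simultaneously rather than in a loop:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS 16   /* assumed number of sets per TLB */

/* One TLB dedicated to a single page size (direct-mapped to keep the sketch short). */
struct tlb {
    unsigned offset_bits;   /* log2 of the page size this TLB serves */
    struct { bool valid; uint64_t tag; uint64_t pfn; } set[SETS];
};

/* Probe one TLB; returns true on a hit and fills *pfn. */
static bool tlb_probe(const struct tlb *t, uint64_t va, uint64_t *pfn)
{
    uint64_t index = (va >> t->offset_bits) % SETS;
    uint64_t tag   = va >> t->offset_bits;    /* full page number used as the tag */
    if (t->set[index].valid && t->set[index].tag == tag) {
        *pfn = t->set[index].pfn;
        return true;
    }
    return false;
}

/* The parallel structure: probe every per-size TLB and take a hit from any of them. */
static bool parallel_lookup(const struct tlb *tlbs, int ntlbs,
                            uint64_t va, uint64_t *pfn)
{
    for (int i = 0; i < ntlbs; i++)
        if (tlb_probe(&tlbs[i], va, pfn))
            return true;          /* a hit in one of the TLBs */
    return false;                 /* a miss in all of them: walk the page table */
}

int main(void)
{
    /* Two TLBs: one for 4 KB pages and one for 2 MB pages (assumed sizes). */
    struct tlb tlbs[2] = { { .offset_bits = 12 }, { .offset_bits = 21 } };
    uint64_t va = 0x00007f3bd45a1c80ULL, pfn = 0;

    /* Pretend a page walk already cached the 2 MB mapping of this address. */
    uint64_t idx = (va >> 21) % SETS;
    tlbs[1].set[idx].valid = true;
    tlbs[1].set[idx].tag   = va >> 21;
    tlbs[1].set[idx].pfn   = 0x777;

    printf("hit=%d pfn=%#llx\n", parallel_lookup(tlbs, 2, va, &pfn),
           (unsigned long long)pfn);
    return 0;
}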
The second is to use a fully-associative TLB, which is designed as follows. Let POmin denote the size of the page offset for the smallest page size supported by the architecture and let VA denote the size of a virtual address. In a fully-associative cache, an address is partitioned into a page offset and a tag; there is no index. Let Tmin denote VA - POmin. The TLB is designed so that each entry holds a tag of size Tmin irrespective of the page size of the page table entry cached in that entry.
The Tmin most significant bits of the virtual address are supplied to the comparator at each entry of the fully-associative TLB to compare the tags (if the entry is valid). The comparison is performed as follows.
|                M                 |
|       1 ... 1       |  0 ... 0   |            |   the mask of the cached entry
-------------------------------------------------
|        T(x)         |    M(x)    |            |   some bits of the offset need to be masked out
-------------------------------------------------
|        T(x)         |          PO(x)          |   partitioning according to the actual page size
-------------------------------------------------
|              T(min)              |  PO(min)   |   partitioning before tag comparison
-------------------------------------------------
Each entry in the TLB contains a field called the tag mask. Let Tmax denote the size of the tag of the largest page size supported by the architecture. Then the size of the tag mask, M, is Tmin - Tmax. When a page table entry gets cached in the TLB, the mask is set in such a way that when it's bitwise-AND'ed with the corresponding least significant bits of a given tag (of size Tmin), any remaining bits that belong to the page offset field become all zeros. In addition, the tag stored in the entry is appended with a sufficient number of zeros so that its size is Tmin. So some bits of the mask would be zeros while others would be ones, as shown in the figure above.
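Here is a minimal software model of that masked comparison, under assumptions of my own: a 48-bit virtual address, a 4 KB minimum page size (so Tmin = 36), and a mask that I model as covering all Tmin bits rather than only the M ambiguous ones (which is equivalent, since the upper bits of such a mask are always ones):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VA_BITS 48                    /* assumed virtual address width */
#define PO_MIN  12                    /* 4 KB minimum page size */
#define T_MIN   (VA_BITS - PO_MIN)    /* 36-bit tag field per entry */

struct fa_tlb_entry {
    bool     valid;
    uint64_t tag;        /* T(x) bits, padded on the right with zeros to T_MIN bits */
    uint64_t tag_mask;   /* ones over the T(x) bits, zeros over the masked-out offset bits */
    uint64_t pfn;
};

/* Cache a translation for a page of 2^page_offset_bits bytes. */
static void fa_tlb_fill(struct fa_tlb_entry *e, uint64_t va,
                        unsigned page_offset_bits, uint64_t pfn)
{
    unsigned pad       = page_offset_bits - PO_MIN;    /* number of appended zeros */
    uint64_t tmin_bits = (va >> PO_MIN) & ((1ULL << T_MIN) - 1);

    e->tag_mask = ~((1ULL << pad) - 1) & ((1ULL << T_MIN) - 1);
    e->tag      = tmin_bits & e->tag_mask;              /* T(x) followed by zeros */
    e->pfn      = pfn;
    e->valid    = true;
}

/* The per-entry comparator: mask the incoming T_MIN bits, then compare with the stored tag. */
static bool fa_tlb_match(const struct fa_tlb_entry *e, uint64_t va)
{
    uint64_t tmin_bits = (va >> PO_MIN) & ((1ULL << T_MIN) - 1);
    return e->valid && ((tmin_bits & e->tag_mask) == e->tag);
}

int main(void)
{
    struct fa_tlb_entry e = { 0 };
    uint64_t va = 0x00007f3bd45a1c80ULL;

    fa_tlb_fill(&e, va, 21, 0x12345);   /* cache the address as a 2 MB mapping */

    /* Any address within the same 2 MB page matches the entry. */
    printf("%d %d\n", fa_tlb_match(&e, va), fa_tlb_match(&e, va + 0x1000));
    return 0;
}

For a 4 KB entry the mask is all ones and the full Tmin-bit tag is compared; for the 2 MB entry above, the 9 least significant mask bits are zeros, so the bits that actually belong to the page offset are ignored in the comparison.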
Now I'll discuss a couple of examples. For simplicity, I'll assume there is no hyperthreading (possible design options include sharing, static partitioning, and dynamic partitioning). Intel Skylake uses the parallel TLB design for both the L1 D/I TLB and the L2 TLB. In Intel Haswell, 1 GB pages are not supported by the L2 TLB. Note that 4 MB pages use two TLB entries (with replicated tags). I think that the 4 MB page table entries can only be cached in the 2 MB page entry TLB. The AMD 10h and 12h processors use a fully-associative L1 DTLB, a parallel L2 DTLB, a fully-associative parallel L1 ITLB, and an L2 ITLB that supports only 4 KB pages. The Sparc T4 processor uses a fully-associative L1 ITLB and a fully-associative L1 DTLB; there is no L2 TLB in Sparc T4.