==================================================
PCI Express I/O Virtualization Resource on PowerNV
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how generic PCI code handles these
requirements. The first two sections describe the concepts of Partitionable
Endpoints and the implementation on P8 (IODA2). The next two sections discuss
considerations on enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA, they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return a value of all 1s. MSIs are also blocked. There's a bit more state
that captures things like the details of the error that caused the freeze
etc., but that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.
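
As a rough illustration of the per-PHB PE state table introduced in section 1,
the following Python sketch models 256 PEs, each with independent MMIO and DMA
freeze bits. It is not taken from any kernel or firmware code: the names
``PEState``, ``PHB`` and the methods are hypothetical, and tying the MMIO bit
to loads and stores issued by the CPU is an assumption made for the example.

.. code-block:: python

    from dataclasses import dataclass

    NUM_PES = 256             # P8 (IODA2): up to 256 PEs per PHB
    ALL_ONES_32 = 0xFFFFFFFF  # what a 32-bit load from a frozen PE returns


    @dataclass
    class PEState:
        """Hypothetical model of one entry in the PE state table."""
        mmio_frozen: bool = False
        dma_frozen: bool = False


    class PHB:
        """Each PHB is modelled with its own, separate set of PEs."""

        def __init__(self):
            self.pes = [PEState() for _ in range(NUM_PES)]

        def freeze(self, pe):
            # Both bits are set together when the PE is frozen ...
            self.pes[pe].mmio_frozen = True
            self.pes[pe].dma_frozen = True

        def clear_mmio_freeze(self, pe):
            # ... but each can be cleared independently.
            self.pes[pe].mmio_frozen = False

        def clear_dma_freeze(self, pe):
            self.pes[pe].dma_frozen = False

        def mmio_load32(self, pe, device_value):
            # A load from a frozen PE returns all 1s instead of device data.
            return ALL_ONES_32 if self.pes[pe].mmio_frozen else device_value

        def mmio_store_reaches_device(self, pe):
            # A store to a frozen PE is dropped.
            return not self.pes[pe].mmio_frozen

Freezing one PE in this model leaves every other PE on the same PHB untouched,
which is the isolation property the PE concept is meant to provide.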

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be naturally aligned
    and a power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
        reserved for MSIs but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there. The M32 logic
        ignores that, however, and will forward in that space if we try.)

      * Is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#.
        That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV). We basically use the trick of forcing the bridge MMIO windows
      onto a segment alignment/granularity so that the space behind a bridge
      can be assigned to a PE.

      Ideally we would like to be able to have individual functions in PEs
      but that would mean using a completely different address allocation
      scheme where individual function BARs can be "grouped" to fit in one or
      more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

      We have code (fairly new compared to the M32 stuff) that exploits that
      for large BARs in 64-bit space:

      We configure an M64 window to cover the entire region of address space
      that has been assigned by FW for the PHB (about 64GB, ignore the space
      for the M32, it comes out of a different "reserve"). We configure it
      as segmented.

      Then we do the same thing as with M32, using the bridge alignment
      trick, to match to those giant segments.

      Since we cannot remap, we have two additional constraints:

      - We do the PE# allocation *after* the 64-bit space has been assigned
        because the addresses we use directly determine the PE#. We then
        update the M32 PE# for the devices that use both 32-bit and 64-bit
        spaces or assign the remaining PE# to 32-bit only devices.

      - We cannot "group" segments in HW, so if a device ends up using more
        than one segment, we end up with more than one PE#.
        There is a HW
        mechanism to make the freeze state cascade to "companion" PEs but
        that only works for PCIe error messages (typically used so that if
        you freeze a switch, it freezes all its children). So we do it in
        SW: when any of the PEs freezes, we freeze the other ones for that
        "domain". We lose a bit of effectiveness of EEH in that case, but
        that's the best we found. We thus introduce the concept of
        "master PE" which is the one used for DMA, MSIs, etc., and "secondary
        PEs" that are used for the remaining M64 segments.

      We would like to investigate using additional M64 windows in "single
      PE" mode to overlay over specific BARs to work around some of that, for
      example for devices with very large BARs, e.g., GPUs. It would make
      sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual. For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments. The finest granularity possible is a 256MB
      window with 1MB segments.
      VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window. Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size. If
      they are different sizes, the entire window has to be small enough that
      the segment size matches the smallest VF BAR, which means larger VF
      BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
      to a single PE, so it could only isolate one VF.

    - Single segmented M64 window: A segmented M64 window could be used just
      like the M32 window, but the segments can't be individually mapped to
      PEs (the segment number is the PE#), so there isn't as much
      flexibility. A VF with multiple BARs would have to be in a "domain" of
      multiple PEs, which is not as well isolated as a single PE.

    - Multiple segmented M64 windows: As usual, each window is split into 256
      equally-sized segments, and the segment number is the PE#. But if we
      use several M64 windows, they can be set to different base addresses
      and different segment sizes.
      If we have VFs that each have a 1MB BAR
      and a 32MB BAR, we could use one M64 window to assign 1MB segments and
      another M64 window to assign 32MB segments.

    Finally, the plan is to use M64 windows for SR-IOV, which is described
    in more detail in the rest of this document. For a given VF BAR, we need
    to effectively reserve the entire 256 segments (256 * VF BAR size) and
    position the VF BAR to start at the beginning of a free range of
    segments/PEs inside that M64 window.

    The goal is of course to be able to give a separate PE for each VF.

    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s. Each M64 window defines one MMIO range and this range is
    divided into 256 segments, with each segment corresponding to one PE.

    We decided to leverage this M64 window to map VFs to individual PEs,
    since SR-IOV VF BARs are all the same size.

    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR directly
    to one M64 window, some part of the M64 window will map to another
    device's MMIO range.
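
    The size mismatch can be made concrete with a little arithmetic. The
    Python sketch below is only an illustration: the function name
    ``direct_map_layout`` is hypothetical, and it assumes the segment size
    equals the size of a single VF BAR.

    .. code-block:: python

        MB = 1 << 20
        M64_SEGMENTS = 256  # a segmented M64 window always has 256 segments


        def direct_map_layout(total_vfs, vf_bar_size):
            """Sizes involved when one VF BAR is mapped to one M64 window.

            Assumes the segment size equals the size of a single VF BAR.
            """
            if not 0 < total_vfs <= M64_SEGMENTS:
                raise ValueError("total_vfs must be in 1..256")
            vf_space = total_vfs * vf_bar_size    # VF(n) BAR space
            window = M64_SEGMENTS * vf_bar_size   # the whole M64 window
            # Segments of the window that have no VF behind them.
            uncovered = range(total_vfs, M64_SEGMENTS)
            return vf_space, window, uncovered

    For eight VFs with 1MB BARs, this gives an 8MB VF(n) BAR space inside a
    256MB window, leaving segments 8 to 255 without a VF behind them.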

    IODA supports 256 PEs, so segmented windows contain 256 segments. If
    total_VFs is less than 256, we have the situation in Figure 1.0, where
    segments [total_VFs, 255] of the M64 window may map to some MMIO range on
    other devices::

        0      1                     total_VFs - 1
        +------+------+-     -+------+------+
        |      |      |  ...  |      |      |
        +------+------+-     -+------+------+

                           VF(n) BAR space

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

    Our current solution is to allocate 256 segments even if the VF(n) BAR
    space doesn't need that much, as shown in Figure 1.1::

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices. Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they only
    respond to segments [0, total_VFs - 1]. There's nothing in hardware that
    responds to segments [total_VFs, 255].

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.
If the address is in the M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, in a segmented M64 window, the only way to control the PE# for a
VF is to change the base of the VF(n) BAR space in the VF BAR. If the PCI
core allocates the exact amount of space required for the VF(n) BAR space,
the VF BAR value is fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.
This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
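
The relationship between the VF BAR base and the resulting PE#s can be
sketched in a few lines of Python. This is an illustration under stated
assumptions, not the kernel's implementation: the function names are
hypothetical, the window is assumed to be a segmented M64 window whose 256
segments were all reserved for this device, and in the first two helpers the
segment size is assumed to equal the size of one VF BAR.

.. code-block:: python

    MB = 1 << 20
    M64_SEGMENTS = 256


    def vf_bar_base_for_pe(window_base, segment_size, first_pe, num_vfs):
        """Return the VF BAR base that places VF0 in PE ``first_pe``.

        Assumes one segment per VF BAR and a fully reserved window.
        """
        if first_pe < 0 or first_pe + num_vfs > M64_SEGMENTS:
            raise ValueError("VF(n) BAR space would leave the reserved window")
        # In a segmented M64 window the segment number *is* the PE#.
        return window_base + first_pe * segment_size


    def pe_of_vf(first_pe, n):
        """If VF0 is in PE(x), then VF(n) is in PE(x + n)."""
        return first_pe + n


    def segments_consumed(num_vfs, vf_bar_size, segment_size):
        """Segments used by the VF(n) BAR space for a given segment size."""
        segs_per_vf = -(-vf_bar_size // segment_size)  # ceiling division
        return num_vfs * segs_per_vf

With a window based at 4GB, 1MB segments and eight VFs, choosing PE 16 for VF0
moves the VF BAR base 16MB into the window and puts VF7 in PE 23. If the VF
BARs were 4MB with the same 1MB segments, the eight VFs would consume 32
segments instead of 8.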