==================================================
PCI Express I/O Virtualization Resource on Powernv
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles these
requirements. The first two sections describe the concept of Partitionable
Endpoints and their implementation on P8 (IODA2). The next two sections
discuss considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return an all-1's value. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical here.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.
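
As a rough illustration of the PE state table described in Section 1, the
Python sketch below models one PHB with 256 PEs, each carrying the two
"frozen" bits. The class and method names are hypothetical and exist only
for this example; they do not correspond to any kernel or firmware
interface, and only the MMIO side of the freeze is modeled.

```python
ALL_ONES_64 = (1 << 64) - 1
NUM_PES = 256  # P8 (IODA2) supports up to 256 PEs per PHB


class PEState:
    """Hypothetical model of one entry in the per-PHB PE state table."""

    def __init__(self):
        self.mmio_frozen = False
        self.dma_frozen = False

    def freeze(self):
        # The two bits are set together ...
        self.mmio_frozen = True
        self.dma_frozen = True

    def clear_mmio(self):
        # ... but can be cleared independently.
        self.mmio_frozen = False

    def clear_dma(self):
        self.dma_frozen = False


class PHBModel:
    """Hypothetical model: each PHB has its own, separate set of PEs."""

    def __init__(self):
        self.pes = [PEState() for _ in range(NUM_PES)]
        self.mmio = {}  # address -> value, stands in for device registers

    def mmio_store(self, pe, addr, value):
        if self.pes[pe].mmio_frozen:
            return  # stores to a frozen PE are dropped
        self.mmio[addr] = value

    def mmio_load(self, pe, addr):
        if self.pes[pe].mmio_frozen:
            return ALL_ONES_64  # loads from a frozen PE return all 1's
        return self.mmio.get(addr, 0)
```

In this model, a load from a frozen PE returns all 1's until the MMIO bit is
cleared, while the DMA bit stays set until it is cleared separately.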

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be naturally aligned and
    a power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
        reserved for MSIs but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there. The M32 logic
        ignores that, however, and will forward in that space if we try.)

      * Is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#. That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV). We basically use the trick of forcing the bridge MMIO windows
      onto a segment alignment/granularity so that the space behind a bridge
      can be assigned to a PE.

      Ideally we would like to be able to have individual functions in PEs
      but that would mean using a completely different address allocation
      scheme where individual function BARs can be "grouped" to fit in one or
      more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

      We have code (fairly new compared to the M32 stuff) that exploits that
      for large BARs in 64-bit space:

      We configure an M64 window to cover the entire region of address space
      that has been assigned by FW for the PHB (about 64GB, ignore the space
      for the M32, it comes out of a different "reserve"). We configure it
      as segmented.

      Then we do the same thing as with M32, using the bridge alignment
      trick, to match to those giant segments.

      Since we cannot remap, we have two additional constraints:

      - We do the PE# allocation *after* the 64-bit space has been assigned
        because the addresses we use directly determine the PE#. We then
        update the M32 PE# for the devices that use both 32-bit and 64-bit
        spaces or assign the remaining PE# to 32-bit only devices.

      - We cannot "group" segments in HW, so if a device ends up using more
        than one segment, we end up with more than one PE#. There is a HW
        mechanism to make the freeze state cascade to "companion" PEs but
        that only works for PCIe error messages (typically used so that if
        you freeze a switch, it freezes all its children). So we do it in
        SW. We lose a bit of effectiveness of EEH in that case, but that's
        the best we found. So when any of the PEs freezes, we freeze the
        other ones for that "domain". We thus introduce the concept of
        "master PE" which is the one used for DMA, MSIs, etc., and "secondary
        PEs" that are used for the remaining M64 segments.

      We would like to investigate using additional M64 windows in "single
      PE" mode to overlay over specific BARs to work around some of that, for
      example for devices with very large BARs, e.g., GPUs. It would make
      sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual. For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments. The finest granularity possible is a 256MB
      window with 1MB segments. VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window. Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size. If
      they are different sizes, the entire window has to be small enough that
      the segment size matches the smallest VF BAR, which means larger VF
      BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
      to a single PE, so it could only isolate one VF.

    - Single segmented M64 window: A segmented M64 window could be used just
      like the M32 window, but the segments can't be individually mapped to
      PEs (the segment number is the PE#), so there isn't as much
      flexibility. A VF with multiple BARs would have to be in a "domain" of
      multiple PEs, which is not as well isolated as a single PE.

    - Multiple segmented M64 windows: As usual, each window is split into 256
      equally-sized segments, and the segment number is the PE#. But if we
      use several M64 windows, they can be set to different base addresses
      and different segment sizes. If we have VFs that each have a 1MB BAR
      and a 32MB BAR, we could use one M64 window to assign 1MB segments and
      another M64 window to assign 32MB segments.

    The plan is to use M64 windows for SR-IOV, as described in more detail
    below. For a given VF BAR, we need to effectively reserve the entire 256
    segments (256 * VF BAR size) and position the VF BAR to start at the
    beginning of a free range of segments/PEs inside that M64 window.

    The goal is of course to be able to give a separate PE to each VF.

    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s. Each M64 window defines one MMIO range and this range is
    divided into 256 segments, with each segment corresponding to one PE.

    We decided to leverage the M64 windows to map VFs to individual PEs,
    since the BARs of all VFs of a given PF are the same size.

    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR directly
    to one M64 window, some part of the M64 window will map to another
    device's MMIO range.
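
The size mismatch described above can be made concrete with a small Python
sketch. It assumes the segment size equals the size of one VF BAR; the
function name and the returned dictionary are hypothetical and are not part
of any kernel interface.

```python
M64_SEGMENTS = 256          # a segmented M64 window always has 256 segments
M64_MIN_WINDOW = 256 << 20  # an M64 window must be at least 256MB


def m64_vf_bar_plan(vf_bar_size, total_vfs):
    """Compare the VF(n) BAR space with the M64 window that would cover it.

    Assumption for this sketch: one segment per VF BAR.
    """
    window_size = M64_SEGMENTS * vf_bar_size
    if window_size < M64_MIN_WINDOW:
        raise ValueError("window would be smaller than the 256MB minimum")
    return {
        # Address space the VFs actually decode.
        "vf_bar_space": total_vfs * vf_bar_size,
        # Address space the segmented window covers.
        "window_size": window_size,
        # Segments of the window with no VF behind them.
        "unbacked_segments": range(total_vfs, M64_SEGMENTS),
    }
```

For eight VFs with a 1MB VF BAR, the VF(n) BAR space is 8MB while the window
covers 256MB, leaving segments 8 through 255 without a VF behind them.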

    IODA supports 256 PEs, so segmented windows contain 256 segments, so if
    total_VFs is less than 256, we have the situation in Figure 1.0, where
    segments [total_VFs, 255] of the M64 window may map to some MMIO range on
    other devices::

        0      1                     total_VFs - 1
        +------+------+-     -+------+------+
        |      |      |  ...  |      |      |
        +------+------+-     -+------+------+

                           VF(n) BAR space

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

    Our current solution is to allocate 256 segments even if the VF(n) BAR
    space doesn't need that much, as shown in Figure 1.1::

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

        0      1                     total_VFs - 1                255
        +------+------+-     -+------+------+-      -+------+------+
        |      |      |  ...  |      |      |   ...  |      |      |
        +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices. Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they only
    respond to segments [0, total_VFs - 1]. There's nothing in hardware that
    responds to segments [total_VFs, 255].

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.
If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
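
The relationship between the VF BAR base and the resulting PE#s can be
sketched in Python as follows. This is an illustration under stated
assumptions, not kernel code: the function names are hypothetical, the
window is assumed to be a segmented M64 window of 256 segments starting at
``window_base``, and ``first_pe`` is the PE# chosen for VF0.

```python
SEGMENTS_PER_WINDOW = 256  # a segmented M64 window has 256 segments


def segments_per_vf(vf_bar_size, segment_size):
    """Number of segments needed to cover one VF BAR (rounded up)."""
    return -(-vf_bar_size // segment_size)


def vf_bar_base_for_pe(window_base, segment_size, vf_bar_size,
                       num_vfs, first_pe):
    """Return the VF BAR base that places VF0 in segment/PE `first_pe`."""
    needed = num_vfs * segments_per_vf(vf_bar_size, segment_size)
    if first_pe < 0 or first_pe + needed > SEGMENTS_PER_WINDOW:
        raise ValueError("VF(n) BAR space would leave the reserved window")
    base = window_base + first_pe * segment_size
    # SR-IOV requires the base to be aligned to the size of one VF BAR.
    if base % vf_bar_size:
        raise ValueError("base is not aligned to the VF BAR size")
    return base


def pes_of_vf(first_pe, n, vf_bar_size, segment_size):
    """Return the list of PE#s that VF(n) falls into."""
    k = segments_per_vf(vf_bar_size, segment_size)
    start = first_pe + n * k
    return list(range(start, start + k))
```

When the segment size equals the VF BAR size, ``pes_of_vf`` returns a single
PE# equal to ``first_pe + n``. With smaller segments, each VF spans several
consecutive PE#s and fewer starting positions remain available.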