==================================================
PCI Express I/O Virtualization Resource on PowerNV
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the requirement from hardware for PCI MMIO resource
sizing and assignment on PowerKVM and how generic PCI code handles this
requirement. The first two sections describe the concepts of Partitionable
Endpoints and the implementation on P8 (IODA2). The next two sections talk
about considerations on enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA, they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's.  MSIs are also blocked.  There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.
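
The freeze behavior can be illustrated with a small software model.  This is
only a sketch: ``struct pe_state`` and the helper names are hypothetical and
do not correspond to the real hardware table layout or to any kernel API.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of one entry in the PE state table. */
    struct pe_state {
        bool mmio_frozen;
        bool dma_frozen;
    };

    /* An error sets both frozen bits together. */
    static inline void pe_freeze(struct pe_state *pe)
    {
        assert(pe);
        pe->mmio_frozen = true;
        pe->dma_frozen = true;
    }

    /* Recovery may clear each bit independently. */
    static inline void pe_thaw_mmio(struct pe_state *pe) { pe->mmio_frozen = false; }
    static inline void pe_thaw_dma(struct pe_state *pe) { pe->dma_frozen = false; }

    /* A load from a frozen PE returns all 1's instead of device data. */
    static inline uint32_t pe_mmio_load(const struct pe_state *pe,
                                        uint32_t device_value)
    {
        return pe->mmio_frozen ? UINT32_MAX : device_value;
    }

    /* A store to a frozen PE is dropped; returns true if it was delivered. */
    static inline bool pe_mmio_store(const struct pe_state *pe,
                                     uint32_t *reg, uint32_t val)
    {
        if (pe->mmio_frozen)
            return false;
        *reg = val;
        return true;
    }

A driver that reads all 1's from a register cannot tell from the value alone
whether the device really returned it, which is why the frozen state is kept
in a separate table.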

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered.  There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.
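
    The inbound lookup can be sketched as follows.  The table name ``rtt``,
    its flat in-memory layout and the helper names are assumptions made for
    illustration; only the RID encoding (8-bit bus, 5-bit device, 3-bit
    function) is standard PCI.

    .. code-block:: c

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define RTT_ENTRIES 65536   /* one entry per 16-bit RID */

        /* Hypothetical RTT: indexed by RID, yields a PE number (0..255). */
        static uint8_t rtt[RTT_ENTRIES];

        /* Build a RID from bus, device and function numbers. */
        static inline uint16_t pcie_rid(uint8_t bus, uint8_t dev, uint8_t fn)
        {
            return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
        }

        static inline uint8_t rtt_lookup(uint16_t rid)
        {
            return rtt[rid];
        }

        /*
         * An MSI is authorized only if the PE# stored with the interrupt
         * matches the PE# that the RTT gives for the sending RID.
         */
        static inline bool msi_authorized(uint16_t rid, uint8_t irq_pe)
        {
            return rtt_lookup(rid) == irq_pe;
        }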

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be naturally aligned
    and a power of two in size.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to generate
        32-bit PCIe accesses.  We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
        reserved for MSIs but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there, the M32 logic
        ignores that however and will forward in that space if we try).

      * It is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity.  For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.
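
    The M32 segment arithmetic can be sketched as below.  ``struct
    m32_window`` and both helpers are hypothetical names used only to show
    the calculation; they are not the hardware register layout.

    .. code-block:: c

        #include <assert.h>
        #include <stdint.h>

        #define M32_SEGMENTS 256

        /* Hypothetical model of the M32 window and its segment table. */
        struct m32_window {
            uint64_t base;      /* start of the window */
            uint64_t size;      /* power of two, at most 4GB */
            uint8_t seg_to_pe[M32_SEGMENTS];
        };

        static inline uint64_t m32_segment_size(const struct m32_window *w)
        {
            return w->size / M32_SEGMENTS;
        }

        /* Return the PE# owning @addr; @addr must lie inside the window. */
        static inline uint8_t m32_addr_to_pe(const struct m32_window *w,
                                             uint64_t addr)
        {
            assert(addr >= w->base && addr - w->base < w->size);
            return w->seg_to_pe[(addr - w->base) / m32_segment_size(w)];
        }

    With a 2GB window starting at 0x8000_0000, ``m32_segment_size()`` yields
    8MB, so an address 8MB above the base falls in segment 1 and is owned by
    whatever PE# the table holds for that segment.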

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.
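
    The difference between segmented and non-segmented M64 windows can be
    sketched as follows.  ``struct m64_window`` and its fields are
    hypothetical, and the overlap ordering between windows is deliberately
    not modeled.

    .. code-block:: c

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define M64_SEGMENTS 256

        /* Hypothetical model of one M64 window. */
        struct m64_window {
            uint64_t base;      /* naturally aligned to size */
            uint64_t size;      /* power of two, at least 256MB */
            bool segmented;
            uint8_t pe;         /* used only when !segmented */
        };

        static inline bool m64_contains(const struct m64_window *w,
                                        uint64_t addr)
        {
            return addr >= w->base && addr - w->base < w->size;
        }

        static inline uint8_t m64_addr_to_pe(const struct m64_window *w,
                                             uint64_t addr)
        {
            assert(m64_contains(w, addr));
            if (!w->segmented)
                return w->pe;   /* one PE# for the whole window */
            /* No lookup table: the segment number is the PE#. */
            return (uint8_t)((addr - w->base) / (w->size / M64_SEGMENTS));
        }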

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB, ignore the space
    for the M32, it comes out of a different "reserve").  We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE# to 32-bit only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of
      "master PE" which is the one used for DMA, MSIs, etc., and "secondary
      PEs" that are used for the remaining M64 segments.
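
    The software cascade can be sketched as below.  The arrays and the
    function name are hypothetical bookkeeping invented for this example;
    the actual EEH code is organized differently.

    .. code-block:: c

        #include <assert.h>
        #include <stdbool.h>

        #define NUM_PES 256

        /* Hypothetical per-PHB bookkeeping. */
        static bool pe_frozen[NUM_PES];
        static int pe_master[NUM_PES];  /* master PE# of each PE's domain */

        /* Freeze @pe and every other PE that shares its master PE. */
        static inline void pe_domain_freeze(int pe)
        {
            int master;

            assert(pe >= 0 && pe < NUM_PES);
            master = pe_master[pe];
            for (int i = 0; i < NUM_PES; i++)
                if (pe_master[i] == master)
                    pe_frozen[i] = true;
        }

    A master PE is recorded as its own master here, so freezing either the
    master or any secondary PE freezes the whole domain.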

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs.  It would make
    sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs.  Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.
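
    The layout of the VF(n) BAR space described above reduces to simple
    arithmetic.  The helper names below are hypothetical and exist only to
    show the calculation.

    .. code-block:: c

        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Address of the BAR belonging to VF @n (VFs are numbered from 0). */
        static inline uint64_t vf_bar_addr(uint64_t vf_bar_base,
                                           uint64_t vf_bar_size,
                                           unsigned int n)
        {
            return vf_bar_base + (uint64_t)n * vf_bar_size;
        }

        /* Total size of the VF(n) BAR space for @num_vfs enabled VFs. */
        static inline uint64_t vf_bar_space_size(uint64_t vf_bar_size,
                                                 unsigned int num_vfs)
        {
            return vf_bar_size * num_vfs;
        }

        /* The base only needs the alignment of one VF BAR (a power of two). */
        static inline bool vf_bar_base_aligned(uint64_t vf_bar_base,
                                               uint64_t vf_bar_size)
        {
            assert(vf_bar_size != 0);
            return (vf_bar_base & (vf_bar_size - 1)) == 0;
        }

    For the example in the text (eight VFs, 1MB VF BAR0), the space is 8MB,
    yet a base that is only 1MB aligned is still acceptable.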

  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.  If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 windows: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#.  But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes.  If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments.

  Finally, the plan is to use M64 windows for SR-IOV, as described in more
  detail in the rest of this document.  For a given VF BAR, we need to
  effectively reserve the entire 256 segments (256 * VF BAR size) and
  position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  range to PE#.  Each M64 window defines one MMIO range and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage this M64 window to map VFs to individual PEs, since
  SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments.  If
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
  other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
  responds to segments [total_VFs, 255].
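
  The cost of this approach can be estimated with a few lines of arithmetic.
  The helper names are hypothetical; the sketch assumes the segment size
  equals the VF BAR size, as in Figure 1.1.

  .. code-block:: c

      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      #define M64_SEGMENTS 256

      /* Space reserved so that one VF BAR fills a whole M64 window. */
      static inline uint64_t iov_reserved_size(uint64_t vf_bar_size)
      {
          return vf_bar_size * M64_SEGMENTS;
      }

      /* Part of the reservation that no VF responds to. */
      static inline uint64_t iov_extra_size(uint64_t vf_bar_size,
                                            unsigned int total_vfs)
      {
          assert(total_vfs <= M64_SEGMENTS);
          return vf_bar_size * (M64_SEGMENTS - total_vfs);
      }

      /* Only segments [0, total_vfs - 1] are backed by a VF. */
      static inline bool iov_segment_backed(unsigned int segment,
                                            unsigned int total_vfs)
      {
          return segment < total_vfs;
      }

  For eight VFs with a 1MB VF BAR, 256MB is reserved and 248MB of it is
  extra space that exists only to keep other devices out of the window.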

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window.  But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
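
Shifting the VF BAR to select a PE# can be sketched as follows.  The helper
names are hypothetical, and the sketch assumes a segmented M64 window whose
segment size equals the VF BAR size and whose 256 segments are all reserved
for this device.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define M64_SEGMENTS 256

    /* VF BAR value that places VF0 in PE @pe_x of the window at @win_base. */
    static inline uint64_t vf_bar_base_for_pe(uint64_t win_base,
                                              uint64_t vf_bar_size,
                                              unsigned int pe_x)
    {
        assert(pe_x < M64_SEGMENTS);
        return win_base + (uint64_t)pe_x * vf_bar_size;
    }

    /* PE# of VF @n when VF0 sits in PE @pe_x. */
    static inline unsigned int vf_pe(unsigned int pe_x, unsigned int n)
    {
        return pe_x + n;
    }

    /* The shifted VF(n) BAR space must stay inside the reserved segments. */
    static inline bool vf_shift_fits(unsigned int pe_x, unsigned int num_vfs)
    {
        return pe_x + num_vfs <= M64_SEGMENTS;
    }

Because the new base is a multiple of the VF BAR size away from the window
base, it keeps the alignment that the SR-IOV spec requires.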

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR.  That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
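
The segment count in that case can be worked out as below.  The helper names
are hypothetical; the first function computes the factor n in
(numVFs * n).

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Segments needed to cover one VF BAR; both sizes are powers of two. */
    static inline unsigned int segs_per_vf_bar(uint64_t vf_bar_size,
                                               uint64_t seg_size)
    {
        assert(seg_size != 0 && vf_bar_size % seg_size == 0);
        return (unsigned int)(vf_bar_size / seg_size);
    }

    /* Segments (and therefore PE#s) consumed by the whole VF(n) BAR space. */
    static inline unsigned int segs_for_all_vfs(uint64_t vf_bar_size,
                                                uint64_t seg_size,
                                                unsigned int num_vfs)
    {
        return num_vfs * segs_per_vf_bar(vf_bar_size, seg_size);
    }

For example, eight VFs with 4MB VF BARs in a window with 1MB segments
consume 32 segments, and each VF spans four PEs.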