===================================================
PCI Express I/O Virtualization Resource on PowerNV
===================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them.  The first two sections describe the concepts of Partitionable
Endpoints and their implementation on P8 (IODA2).  The next two sections
discuss considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA, they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return an all-ones value.  MSIs are also blocked.  There's a bit more state
that captures things like the details of the error that caused the freeze
etc., but that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered.  There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.
    First what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus and must be naturally aligned and
    a power of two in size.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to generate
        32-bit PCIe accesses.  We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
        reserved for MSIs but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there, the M32 logic
        ignores that however and will forward in that space if we try).

      * It is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity.  For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB, ignore the space
    for the M32, it comes out of a different "reserve").  We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE# to 32-bit only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of
      "master PE" which is the one used for DMA, MSIs, etc., and "secondary
      PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs.  It would make
    sense, but we haven't done it yet.
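
The difference between the M32 lookup table and the segmented M64 direct
mapping can be illustrated with a small model.  This is only a sketch: the
``struct mmio_window`` type and all function names are hypothetical and do
not correspond to kernel or firmware interfaces; only the 256-segment split
and the "segment number is the PE#" rule come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 256u

/* Hypothetical model of one outbound window; not a kernel structure. */
struct mmio_window {
	uint64_t base;	/* CPU address where the window starts */
	uint64_t size;	/* power of two, naturally aligned */
};

/* Every segmented window is split into 256 segments of equal size. */
static uint64_t segment_size(const struct mmio_window *w)
{
	assert(w->size >= NUM_SEGMENTS && (w->size & (w->size - 1)) == 0);
	return w->size / NUM_SEGMENTS;
}

/* Index of the segment containing addr, or -1 if addr is outside. */
static int segment_index(const struct mmio_window *w, uint64_t addr)
{
	if (addr < w->base || addr - w->base >= w->size)
		return -1;
	return (int)((addr - w->base) / segment_size(w));
}

/* M32: a per-segment table supplies the PE#, so any mapping is possible. */
static int m32_pe(const struct mmio_window *w,
		  const uint8_t table[NUM_SEGMENTS], uint64_t addr)
{
	int seg = segment_index(w, addr);

	return seg < 0 ? -1 : table[seg];
}

/* Segmented M64: there is no table, the segment number is the PE#. */
static int m64_segmented_pe(const struct mmio_window *w, uint64_t addr)
{
	return segment_index(w, addr);
}
```

For a 2GB M32 window, ``segment_size()`` yields the 8MB granularity quoted
above, and the PE# of an address depends only on the table contents.  For a
segmented M64 window the PE# can only be changed by moving the address.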

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs.  Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.
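
The VF(n) BAR layout described above is plain arithmetic on the VF BAR
value.  The following sketch shows it with hypothetical helper names (they
are not functions from the PCI core); VF numbering is assumed to start at 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address of the BAR belonging to VF number vf (0-based), given the value
 * programmed into the VF BAR of the PF's SR-IOV Capability.
 */
static uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
			    unsigned int vf)
{
	/* The base only has to be aligned to a single VF's BAR size. */
	assert(vf_bar_size != 0 && vf_bar_base % vf_bar_size == 0);
	return vf_bar_base + (uint64_t)vf * vf_bar_size;
}

/* Total size of the VF(n) BAR space for num_vfs enabled VFs. */
static uint64_t vf_bar_space_size(uint64_t vf_bar_size, unsigned int num_vfs)
{
	return vf_bar_size * num_vfs;
}
```

With eight VFs and a 1MB VF BAR0, ``vf_bar_space_size()`` gives the 8MB
region from the example, yet a base that is only 1MB-aligned is accepted.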

  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.  If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 windows: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#.  But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes.  If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments.
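
To see why several segmented M64 windows keep a VF in a single PE, consider
the sketch below.  It assumes each VF BAR is placed at the start of its own
window and that the segment size equals the VF BAR size; the
``struct seg_window`` type and the helpers are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 256u

/* Hypothetical description of one segmented M64 window. */
struct seg_window {
	uint64_t base;		/* start of the window */
	uint64_t seg_size;	/* window size / 256 */
};

/* Address of VF vf's BAR when the VF BAR base is the window base. */
static uint64_t vf_bar_in_window(const struct seg_window *w, unsigned int vf)
{
	assert(vf < NUM_SEGMENTS);
	return w->base + (uint64_t)vf * w->seg_size;
}

/* PE# of an address in a segmented window: the segment number. */
static unsigned int window_pe(const struct seg_window *w, uint64_t addr)
{
	assert(addr >= w->base);
	assert((addr - w->base) / w->seg_size < NUM_SEGMENTS);
	return (unsigned int)((addr - w->base) / w->seg_size);
}
```

With one window using 1MB segments and another using 32MB segments, both
BARs of VF(n) fall into segment n of their respective window, so they share
the same PE# even though the BAR sizes differ.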

  Finally, the plan is to use M64 windows for SR-IOV, which is described
  in more detail below and in the next section.  For a given VF BAR, we
  need to effectively reserve the entire 256 segments (256 * VF BAR size)
  and position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  range to PE#.  Each M64 window defines one MMIO range and this range is
  divided into 256 segments, with each segment corresponding to one PE.

  We decided to use these M64 windows to map VFs to individual PEs, since
  SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments.  If
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range on
  other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
  responds to segments [total_VFs, 255].
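
The size of the reservation in Figure 1.1 follows directly from the VF BAR
size.  A minimal sketch, with hypothetical function names and assuming the
segment size equals the VF BAR size:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 256u

/* Space reserved in software so the whole M64 window is owned by one PF. */
static uint64_t reserved_size(uint64_t vf_bar_size)
{
	return vf_bar_size * NUM_SEGMENTS;
}

/* Part of the reservation, segments [total_vfs, 255], with no VF behind it. */
static uint64_t extra_size(uint64_t vf_bar_size, unsigned int total_vfs)
{
	assert(total_vfs <= NUM_SEGMENTS);
	return vf_bar_size * (NUM_SEGMENTS - total_vfs);
}
```

For a 1MB VF BAR and total_VFs = 8, this reserves 256MB, of which 248MB is
padding that nothing in hardware responds to.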

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window.  But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR.  That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
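
The relationship between the VF BAR base and the PE# of VF0 can be sketched
as follows.  The helpers are hypothetical and assume the PCI core reserved
all 256 segments starting at ``reserved_base``; the range check is a simple
fit test for this sketch and does not account for PE numbers the platform
may keep for other purposes.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 256u

/*
 * Compute the VF BAR base that places VF0 in PE first_pe, assuming the
 * segment size equals the VF BAR size.  Returns 0 on success and stores
 * the base in *new_base, or -1 if the VFs would leave the reservation.
 */
static int vf_bar_base_for_pe(uint64_t reserved_base, uint64_t vf_bar_size,
			      unsigned int num_vfs, unsigned int first_pe,
			      uint64_t *new_base)
{
	assert(vf_bar_size != 0 && reserved_base % vf_bar_size == 0);

	/* VF(n) lands in PE(first_pe + n); the last VF must still fit. */
	if (num_vfs == 0 || first_pe + num_vfs > NUM_SEGMENTS)
		return -1;

	*new_base = reserved_base + (uint64_t)first_pe * vf_bar_size;
	return 0;
}

/* Segments consumed when each VF BAR spans several smaller segments. */
static unsigned int segments_used(uint64_t vf_bar_size, uint64_t seg_size,
				  unsigned int num_vfs)
{
	assert(seg_size != 0 && vf_bar_size % seg_size == 0);
	return (unsigned int)(vf_bar_size / seg_size) * num_vfs;
}
```

Shifting the base by one VF BAR size moves every VF to the next PE#, which
is why the extra reserved space is what makes the PE# adjustable.  When a
4MB VF BAR is covered by 1MB segments, eight VFs consume 32 segments rather
than eight, leaving fewer positions to choose from.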