#
b04149c2 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Move single TVE handling to NPU PE
Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only one which points to one of two tables of the corresponding PCI PE.
powerpc/powernv/npu: Move single TVE handling to NPU PE
Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only one which points to one of two tables of the corresponding PCI PE.
So whenever a new DMA window is programmed to PEs, the NPU PE needs to release old table in order to use the new one.
Commit d41ce7b1bcc3e ("powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is enabled") did just that but in pci-ioda.c while it actually belongs to npu-dma.c.
This moves the single TVE handling to npu-dma.c. This does not implement restoring though as it is highly unlikely that we can set the table to PCI PE and cannot to NPU PE and if that fails, we could only set 32bit table to NPU PE and this configuration is not really supported or wanted.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
847e6563 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Reference iommu_table while it is linked to a group
The iommu_table pointer stored in iommu_table_group may get stale by accident, this adds referencing and removes a redundant comm
powerpc/powernv: Reference iommu_table while it is linked to a group
The iommu_table pointer stored in iommu_table_group may get stale by accident, this adds referencing and removes a redundant comment about this.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
5eada8a3 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/iommu_api: Move IOMMU groups setup to a single place
Registering new IOMMU groups and adding devices to them are separated in code and the latter is dug in the DMA setup code which it does n
powerpc/iommu_api: Move IOMMU groups setup to a single place
Registering new IOMMU groups and adding devices to them are separated in code and the latter is dug in the DMA setup code which it does not really belong to.
This moved IOMMU groups setup to a separate helper which registers a group and adds devices as before. This does not make a difference as IOMMU groups are not used anyway; the only dependency here is that iommu_add_device() requires a valid pointer to an iommu_table (set by set_iommu_table_base()).
To keep the old behaviour, this does not add new IOMMU groups for PEs with no DMA weight and also skips NVLink bridges which do not have pci_controller_ops::setup_bridge (the normal way of adding PEs).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
c4e9d3c1 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/pseries: Rework device adding to IOMMU groups
The powernv platform registers IOMMU groups and adds devices to them from the pci_controller_ops::setup_bridge() hook except one case wh
powerpc/powernv/pseries: Rework device adding to IOMMU groups
The powernv platform registers IOMMU groups and adds devices to them from the pci_controller_ops::setup_bridge() hook except one case when virtual functions (SRIOV VFs) are added from a bus notifier.
The pseries platform registers IOMMU groups from the pci_controller_ops::dma_bus_setup() hook and adds devices from the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier used for powernv does not add devices for pseries though as __of_scan_bus() adds devices first, then it does the bus/dev DMA setup.
Both platforms use iommu_add_device() which takes a device and expects it to have a valid IOMMU table struct with an iommu_table_group pointer which in turn points the iommu_group struct (which represents an IOMMU group). Although the helper seems easy to use, it relies on some pre-existing device configuration and associated data structures which it does not really need.
This simplifies iommu_add_device() to take the table_group pointer directly. Pseries already has a table_group pointer handy and the bus notified is not used anyway. For powernv, this copies the existing bus notifier, makes it work for powernv only which means an easy way of getting to the table_group pointer. This was tested on VFs but should also support physical PCI hotplug.
Since iommu_add_device() receives the table_group pointer directly, pseries does not do TCE cache invalidation (the hypervisor does) nor allow multiple groups per a VFIO container (in other words sharing an IOMMU table between partitionable endpoints), this removes iommu_table_group_link from pseries.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
0e759bd7 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Move OPAL calls away from context manipulation
When introduced, the NPU context init/destroy helpers called OPAL which enabled/disabled PID (a userspace memory context ID) filte
powerpc/powernv/npu: Move OPAL calls away from context manipulation
When introduced, the NPU context init/destroy helpers called OPAL which enabled/disabled PID (a userspace memory context ID) filtering in an NPU per a GPU; this was a requirement for P9 DD1.0. However newer chip revision added a PID wildcard support so there is no more need to call OPAL every time a new context is initialized. Also, since the PID wildcard support was added, skiboot does not clear wildcard entries in the NPU so these remain in the hardware till the system reboot.
This moves LPID and wildcard programming to the PE setup code which executes once during the booting process so NPU2 context init/destroy won't need to do additional configuration.
This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as this is the way to tell if the NPU support is present and configured.
This moves pnv_npu2_init() declaration as pseries should be able to use it. This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to call that. This exports pnv_npu2_map_lpar_dev() as following patches will use it from the VFIO driver.
While at it, replace redundant list_for_each_entry_safe() with a simpler list_for_each_entry().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
46a1449d |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Move npu struct from pnv_phb to pci_controller
The powernv PCI code stores NPU data in the pnv_phb struct. The latter is referenced by pci_controller::private_data. We are going to
powerpc/powernv: Move npu struct from pnv_phb to pci_controller
The powernv PCI code stores NPU data in the pnv_phb struct. The latter is referenced by pci_controller::private_data. We are going to have NPU2 support in the pseries platform as well but it does not store any private_data in in the pci_controller struct; and even if it did, it would be a different data structure.
This makes npu a pointer and stores it one level higher in the pci_controller struct.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
ab7032e7 |
| 19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
The skiboot firmware has a hot reset handler which fences the NVIDIA V100 GPU RAM on Witherspoons and makes accesses no-op instead
powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2
The skiboot firmware has a hot reset handler which fences the NVIDIA V100 GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs: https://github.com/open-power/skiboot/commit/fca2b2b839a67
Now we are going to pass V100 via VFIO which most certainly involves KVM guests which are often terminated without getting a chance to offline GPU RAM so we end up with a running machine with misconfigured memory. Accessing this memory produces hardware management interrupts (HMI) which bring the host down.
To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable() via pci_disable_device() which switches NPU2 to a safe mode and prevents HMIs.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
5f639e5f |
| 14-Nov-2018 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv: Remove PCI_MSI ifdef checks
CONFIG_PCI_MSI was made mandatory by commit a311e738b6d8 ("powerpc/powernv: Make PCI non-optional") so the #ifdef checks around CONFIG_PCI_MSI here can b
powerpc/powernv: Remove PCI_MSI ifdef checks
CONFIG_PCI_MSI was made mandatory by commit a311e738b6d8 ("powerpc/powernv: Make PCI non-optional") so the #ifdef checks around CONFIG_PCI_MSI here can be removed entirely.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
c2057701 |
| 18-Nov-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in skiboot's NPU driver does not touch the pci_er
powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in skiboot's NPU driver does not touch the pci_error_type parameter so it might have garbage but the powernv code analyzes it nevertheless.
This initializes pcierr and fstate to zero in all call sites.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
a25de7af |
| 15-Oct-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda: Reduce a number of hooks in pnv_phb
fixup_phb() is never used, this removes it.
pick_m64_pe() and reserve_m64_pe() are always defined for all powernv PHBs: they are initialize
powerpc/powernv/ioda: Reduce a number of hooks in pnv_phb
fixup_phb() is never used, this removes it.
pick_m64_pe() and reserve_m64_pe() are always defined for all powernv PHBs: they are initialized by pnv_ioda_parse_m64_window() which is called unconditionally from pnv_pci_init_ioda_phb() which initializes all known PHB types on powernv so we can open code them.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
f21b0a45 |
| 15-Oct-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda1: Remove dead code for a single device PE
At the moment PNV_IODA_PE_DEV is only used for NPU PEs which are not present on IODA1 machines (i.e. POWER7) so let's remove a piece of
powerpc/powernv/ioda1: Remove dead code for a single device PE
At the moment PNV_IODA_PE_DEV is only used for NPU PEs which are not present on IODA1 machines (i.e. POWER7) so let's remove a piece of dead code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
7e1c4e27 |
| 30-Oct-2018 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done
memblock: stop using implicit alignment to SMP_CACHE_BYTES
When a memblock allocation APIs are called with align = 0, the alignment is implicitly set to SMP_CACHE_BYTES.
Implicit alignment is done deep in the memblock allocator and it can come as a surprise. Not that such an alignment would be wrong even when used incorrectly but it is better to be explicit for the sake of clarity and the prinicple of the least surprise.
Replace all such uses of memblock APIs with the 'align' parameter explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment in the memblock internal allocation functions.
For the case when memblock APIs are used via helper functions, e.g. like iommu_arena_new_node() in Alpha, the helper functions were detected with Coccinelle's help and then manually examined and updated where appropriate.
The direct memblock APIs users were updated using the semantic patch below:
@@ expression size, min_addr, max_addr, nid; @@ ( | - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid) + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid) | - memblock_alloc(size, 0) + memblock_alloc(size, SMP_CACHE_BYTES) | - memblock_alloc_raw(size, 0) + memblock_alloc_raw(size, SMP_CACHE_BYTES) | - memblock_alloc_from(size, 0, min_addr) + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_nopanic(size, 0) + memblock_alloc_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_low(size, 0) + memblock_alloc_low(size, SMP_CACHE_BYTES) | - memblock_alloc_low_nopanic(size, 0) + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES) | - memblock_alloc_from_nopanic(size, 0, min_addr) + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr) | - memblock_alloc_node(size, 0, nid) + memblock_alloc_node(size, SMP_CACHE_BYTES, nid) )
[mhocko@suse.com: changelog update] [akpm@linux-foundation.org: coding-style fixes] [rppt@linux.ibm.com: fix missed uses of implicit alignment] Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Paul Burton <paul.burton@mips.com> [MIPS] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
57c8a661 |
| 30-Oct-2018 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header.
The includes were replaced
mm: remove include/linux/bootmem.h
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header.
The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h>
@@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h>
[sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
eb31d559 |
| 30-Oct-2018 |
Mike Rapoport <rppt@linux.vnet.ibm.com> |
memblock: remove _virt from APIs returning virtual address
The conversion is done using
sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc)
Link: http://lkml.kern
memblock: remove _virt from APIs returning virtual address
The conversion is done using
sed -i 's@memblock_virt_alloc@memblock_alloc@g' \ $(git grep -l memblock_virt_alloc)
Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
db217319 |
| 17-Aug-2018 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv/pci: Work around races in PCI bridge enabling
The generic code is racy when multiple children of a PCI bridge try to enable it simultaneously.
This leads to drivers trying to access
powerpc/powernv/pci: Work around races in PCI bridge enabling
The generic code is racy when multiple children of a PCI bridge try to enable it simultaneously.
This leads to drivers trying to access a device through a not-yet-enabled bridge, and this EEH errors under various circumstances when using parallel driver probing.
There is work going on to fix that properly in the PCI core but it will take some time.
x86 gets away with it because (outside of hotplug), the BIOS enables all the bridges at boot time.
This patch does the same thing on powernv by enabling all bridges that have child devices at boot time, thus avoiding subsequent races. It's suitable for backporting to stable and distros, while the proper PCI fix will probably be significantly more invasive.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
a68bd126 |
| 04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda: Allocate indirect TCE levels on demand
At the moment we allocate the entire TCE table, twice (hardware part and userspace translation cache). This normally works as we normally
powerpc/powernv/ioda: Allocate indirect TCE levels on demand
At the moment we allocate the entire TCE table, twice (hardware part and userspace translation cache). This normally works as we normally have contigous memory and the guest will map entire RAM for 64bit DMA.
However if we have sparse RAM (one example is a memory device), then we will allocate TCEs which will never be used as the guest only maps actual memory for DMA. If it is a single level TCE table, there is nothing we can really do but if it a multilevel table, we can skip allocating TCEs we know we won't need.
This adds ability to allocate only first level, saving memory.
This changes iommu_table::free() to avoid allocating of an extra level; iommu_table::set() will do this when needed.
This adds @alloc parameter to iommu_table::exchange() to tell the callback if it can allocate an extra level; the flag is set to "false" for the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns H_TOO_HARD.
This still requires the entire table to be counted in mm::locked_vm.
To be conservative, this only does on-demand allocation when the usespace cache table is requested which is the case of VFIO.
The example math for a system replicating a powernv setup with NVLink2 in a guest: 16GB RAM mapped at 0x0 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
the table to cover that all with 64K pages takes: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
If we allocate only necessary TCE levels, we will only need: (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect levels).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
090bad39 |
| 04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Add indirect levels to it_userspace
We want to support sparse memory and therefore huge chunks of DMA windows do not need to be mapped. If a DMA window big enough to require 2 or mo
powerpc/powernv: Add indirect levels to it_userspace
We want to support sparse memory and therefore huge chunks of DMA windows do not need to be mapped. If a DMA window big enough to require 2 or more indirect levels, and a DMA window is used to map all RAM (which is a default case for 64bit window), we can actually save some memory by not allocation TCE for regions which we are not going to map anyway.
The hardware tables alreary support indirect levels but we also keep host-physical-to-userspace translation array which is allocated by vmalloc() and is a flat array which might use quite some memory.
This converts it_userspace from vmalloc'ed array to a multi level table.
As the format becomes platform dependend, this replaces the direct access to it_usespace with a iommu_table_ops::useraddrptr hook which returns a pointer to the userspace copy of a TCE; future extension will return NULL if the level was not allocated.
This should not change non-KVM handling of TCE tables and it_userspace will not be allocated for non-KVM tables.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
191c2287 |
| 04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Move TCE manupulation code to its own file
Right now we have allocation code in pci-ioda.c and traversing code in pci.c, let's keep them toghether. However both files are big enough
powerpc/powernv: Move TCE manupulation code to its own file
Right now we have allocation code in pci-ioda.c and traversing code in pci.c, let's keep them toghether. However both files are big enough already so let's move this business to a new file.
While we at it, move the code which links IOMMU table groups to IOMMU tables as it is not specific to any PNV PHB model.
These puts exported symbols from the new file together.
This fixes several warnings from checkpatch.pl like this: "WARNING: Prefer 'unsigned int' to bare use of 'unsigned'".
As this is almost cut-n-paste, there should be no behavioral change.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
da2bb0da |
| 04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Remove useless wrapper
This gets rid of a useless wrapper around pnv_pci_ioda2_table_free_pages().
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kar
powerpc/powernv: Remove useless wrapper
This gets rid of a useless wrapper around pnv_pci_ioda2_table_free_pages().
Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
00c376fd |
| 02-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Add 256M IOMMU page size to the default POWER8 case
The sketchy bypass uses 256M pages so add this page size as well.
This should cause no behavioral change but will be used
powerpc/powernv/ioda2: Add 256M IOMMU page size to the default POWER8 case
The sketchy bypass uses 256M pages so add this page size as well.
This should cause no behavioral change but will be used later.
Fixes: 477afd6ea6 "powerpc/ioda: Use ibm,supported-tce-sizes for IOMMU page size mask" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
8bf6b91a |
| 28-Jun-2018 |
Alastair D'Silva <alastair@d-silva.org> |
Revert "powerpc/powernv: Add support for the cxl kernel api on the real phb"
Remove abandonned capi support for the Mellanox CX4.
This reverts commit 4361b03430d685610e5feea3ec7846e8b9ae795f.
Sign
Revert "powerpc/powernv: Add support for the cxl kernel api on the real phb"
Remove abandonned capi support for the Mellanox CX4.
This reverts commit 4361b03430d685610e5feea3ec7846e8b9ae795f.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
0cfd7335 |
| 28-Jun-2018 |
Alastair D'Silva <alastair@d-silva.org> |
Revert "cxl: Add support for interrupts on the Mellanox CX4"
Remove abandonned capi support for the Mellanox CX4.
This reverts commit a2f67d5ee8d950caaa7a6144cf0bfb256500b73e.
Signed-off-by: Alast
Revert "cxl: Add support for interrupts on the Mellanox CX4"
Remove abandonned capi support for the Mellanox CX4.
This reverts commit a2f67d5ee8d950caaa7a6144cf0bfb256500b73e.
Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
d3d4ffaa |
| 01-Jun-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Reduce upper limit for DMA window size
We use PHB in mode1 which uses bit 59 to select a correct DMA window. However there is mode2 which uses bits 59:55 and allows up to 32 D
powerpc/powernv/ioda2: Reduce upper limit for DMA window size
We use PHB in mode1 which uses bit 59 to select a correct DMA window. However there is mode2 which uses bits 59:55 and allows up to 32 DMA windows per a PE.
Even though documentation does not clearly specify that, it seems that the actual hardware does not support bits 59:55 even in mode1, in other words we can create a window as big as 1<<58 but DMA simply won't work.
This reduces the upper limit from 59 to 55 bits to let the userspace know about the hardware limits.
Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested" Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
44bda4b7 |
| 03-Jul-2018 |
Hari Vyas <hari.vyas@broadcom.com> |
PCI: Fix is_added/is_busmaster race condition
When a PCI device is detected, pdev->is_added is set to 1 and proc and sysfs entries are created.
When the device is removed, pdev->is_added is checked
PCI: Fix is_added/is_busmaster race condition
When a PCI device is detected, pdev->is_added is set to 1 and proc and sysfs entries are created.
When the device is removed, pdev->is_added is checked for one and then device is detached with clearing of proc and sys entries and at end, pdev->is_added is set to 0.
is_added and is_busmaster are bit fields in pci_dev structure sharing same memory location.
A strange issue was observed with multiple removal and rescan of a PCIe NVMe device using sysfs commands where is_added flag was observed as zero instead of one while removing device and proc,sys entries are not cleared. This causes issue in later device addition with warning message "proc_dir_entry" already registered.
Debugging revealed a race condition between the PCI core setting the is_added bit in pci_bus_add_device() and the NVMe driver reset work-queue setting the is_busmaster bit in pci_set_master(). As these fields are not handled atomically, that clears the is_added bit.
Move the is_added bit to a separate private flag variable and use atomic functions to set and retrieve the device addition state. This avoids the race because is_added no longer shares a memory location with is_busmaster.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=200283 Signed-off-by: Hari Vyas <hari.vyas@broadcom.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|
#
98fd72fe |
| 30-May-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Remove redundant free of TCE pages
When IODA2 creates a PE, it creates an IOMMU table with it_ops::free set to pnv_ioda2_table_free() which calls pnv_pci_ioda2_table_free_page
powerpc/powernv/ioda2: Remove redundant free of TCE pages
When IODA2 creates a PE, it creates an IOMMU table with it_ops::free set to pnv_ioda2_table_free() which calls pnv_pci_ioda2_table_free_pages().
Since iommu_tce_table_put() calls it_ops::free when the last reference to the table is released, explicit call to pnv_pci_ioda2_table_free_pages() is not needed so let's remove it.
This should fix double free in the case of PCI hotuplug as pnv_pci_ioda2_table_free_pages() does not reset neither iommu_table::it_base nor ::it_size.
This was not exposed by SRIOV as it uses different code path via pnv_pcibios_sriov_disable().
IODA1 does not inialize it_ops::free so it does not have this issue.
Fixes: c5f7700bbd2e ("powerpc/powernv: Dynamically release PE") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
show more ...
|