| 17fd09c0 | 14-Jun-2021 |
Nicholas Piggin <npiggin@gmail.com> |
target/ppc/spapr: Update H_GET_CPU_CHARACTERISTICS L1D cache flush bits
There are several new L1D cache flush bits added to the hcall which reflect hardware security features for speculative cache a
target/ppc/spapr: Update H_GET_CPU_CHARACTERISTICS L1D cache flush bits
There are several new L1D cache flush bits added to the hcall which reflect hardware security features for speculative cache access issues.
These behaviours are now being specified as negative in order to simplify patched kernel compatibility with older firmware (a new problem found in existing systems would automatically be vulnerable).
[dwg: Technically this changes behaviour for existing machine types. After discussion with Nick, we've determined this is safe, because the worst that will happen if a guest gets the wrong information due to a migration is that it will perform some unnecessary workarounds, but will remain correct and secure (well, as secure as it was going to be anyway). In addition the change only affects cap-cfpc=safe which is not enabled by default, and in fact is not possible to set on any current hardware (though it's expected it will be possible on POWER10)]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20210615044107.1481608-1-npiggin@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| fc8c745d | 25-Jun-2021 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
spapr: Implement Open Firmware client interface
The PAPR platform describes an OS environment that's presented by a combination of a hypervisor and firmware. The features it specifies require collab
spapr: Implement Open Firmware client interface
The PAPR platform describes an OS environment that's presented by a combination of a hypervisor and firmware. The features it specifies require collaboration between the firmware and the hypervisor.
Since the beginning, the runtime component of the firmware (RTAS) has been implemented as a 20 byte shim which simply forwards it to a hypercall implemented in qemu. The boot time firmware component is SLOF - but a build that's specific to qemu, and has always needed to be updated in sync with it. Even though we've managed to limit the amount of runtime communication we need between qemu and SLOF, there's some, and it has become increasingly awkward to handle as we've implemented new features.
This implements a boot time OF client interface (CI) which is enabled by a new "x-vof" pseries machine option (stands for "Virtual Open Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall which implements Open Firmware Client Interface (OF CI). This allows using a smaller stateless firmware which does not have to manage the device tree.
The new "vof.bin" firmware image is included with source code under pc-bios/. It also includes RTAS blob.
This implements a handful of CI methods just to get -kernel/-initrd working. In particular, this implements the device tree fetching and simple memory allocator - "claim" (an OF CI memory allocator) and updates "/memory@0/available" to report the client about available memory.
This implements changing some device tree properties which we know how to deal with, the rest is ignored. To allow changes, this skips fdt_pack() when x-vof=on as not packing the blob leaves some room for appending.
In absence of SLOF, this assigns phandles to device tree nodes to make device tree traversing work.
When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
This adds basic instances support which are managed by a hash map ihandle -> [phandle].
Before the guest started, the used memory is: 0..e60 - the initial firmware 8000..10000 - stack 400000.. - kernel 3ea0000.. - initramdisk
This OF CI does not implement "interpret".
Unlike SLOF, this does not format uninitialized nvram. Instead, this includes a disk image with pre-formatted nvram.
With this basic support, this can only boot into kernel directly. However this is just enough for the petitboot kernel and initradmdisk to boot from any possible source. Note this requires reasonably recent guest kernel with: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
The immediate benefit is much faster booting time which especially crucial with fully emulated early CPU bring up environments. Also this may come handy when/if GRUB-in-the-userspace sees light of the day.
This separates VOF and sPAPR in a hope that VOF bits may be reused by other POWERPC boards which do not support pSeries.
This assumes potential support for booting from QEMU backends such as blockdev or netdev without devices/drivers used.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Message-Id: <20210625055155.2252896-1-aik@ozlabs.ru> Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu> [dwg: Adjusted some includes which broke compile in some more obscure compilation setups] Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| f93c8f14 | 18-May-2021 |
Shivaprasad G Bhat <sbhat@linux.ibm.com> |
spapr: nvdimm: Forward declare and move the definitions
The subsequent patches add definitions which tend to get the compilation to cyclic dependency. So, prepare with forward declarations, move the
spapr: nvdimm: Forward declare and move the definitions
The subsequent patches add definitions which tend to get the compilation to cyclic dependency. So, prepare with forward declarations, move the definitions and clean up.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Message-Id: <162133925415.610.11584121797866216417.stgit@4f1e6f2bd33e> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| 962104f0 | 06-May-2021 |
Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br> |
hw/ppc: moved hcalls that depend on softmmu
The hypercalls h_enter, h_remove, h_bulk_remove, h_protect, and h_read, have been moved to spapr_softmmu.c with the functions they depend on. The function
hw/ppc: moved hcalls that depend on softmmu
The hypercalls h_enter, h_remove, h_bulk_remove, h_protect, and h_read, have been moved to spapr_softmmu.c with the functions they depend on. The functions is_ram_address and push_sregs_to_kvm_pr are not static anymore as functions on both spapr_hcall.c and spapr_softmmu.c depend on them. The hypercalls h_resize_hpt_prepare and h_resize_hpt_commit have been divided, the KVM part stayed in spapr_hcall.c while the softmmu part was moved to spapr_softmmu.c
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br> Message-Id: <20210506163941.106984-2-lucas.araujo@eldorado.org.br> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| b7573092 | 08-Apr-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr.h: increase FDT_MAX_SIZE
Certain SMP topologies stress, e.g. 1 thread/core, 2048 cores and 1 socket, stress the current maximum size of the pSeries FDT:
Calling ibm,client-architecture-suppor
spapr.h: increase FDT_MAX_SIZE
Certain SMP topologies stress, e.g. 1 thread/core, 2048 cores and 1 socket, stress the current maximum size of the pSeries FDT:
Calling ibm,client-architecture-support...qemu-system-ppc64: error creating device tree: (fdt_setprop(fdt, offset, "ibm,processor-segment-sizes", segs, sizeof(segs))): FDT_ERR_NOSPACE
2048 is the default NR_CPUS value for the pSeries kernel. It's expected that users will want QEMU to be able to handle this kind of configuration.
Bumping FDT_MAX_SIZE to 2MB is enough for these setups to be created.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210408204049.221802-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| eb7f80fd | 02-Mar-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr.c: send QAPI event when memory hotunplug fails
Recent changes allowed the pSeries machine to rollback the hotunplug process for the DIMM when the guest kernel signals, via a reconfiguration of
spapr.c: send QAPI event when memory hotunplug fails
Recent changes allowed the pSeries machine to rollback the hotunplug process for the DIMM when the guest kernel signals, via a reconfiguration of the DR connector, that it's not going to release the LMBs.
Let's also warn QAPI listerners about it. One place to do it would be right after the unplug state is cleaned up, spapr_clear_pending_dimm_unplug_state(). This would mean that the function is now doing more than cleaning up the pending dimm state though.
This patch does the following changes in spapr.c:
- send a QAPI event to inform that we experienced a failure in the hotunplug of the DIMM;
- rename spapr_clear_pending_dimm_unplug_state() to spapr_memory_unplug_rollback(). This is a better fit for what the function is now doing, and it makes callers care more about what the function goal is and less about spapr.c internals such as clearing the pending dimm unplug state.
Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210302141019.153729-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| fe1831ef | 22-Feb-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr_drc.c: use DRC reconfiguration to cleanup DIMM unplug state
Handling errors in memory hotunplug in the pSeries machine is more complex than any other device type, because there are all the com
spapr_drc.c: use DRC reconfiguration to cleanup DIMM unplug state
Handling errors in memory hotunplug in the pSeries machine is more complex than any other device type, because there are all the complications that other devices has, and more.
For instance, determining a timeout for a DIMM hotunplug must consider if it's a Hash-MMU or a Radix-MMU guest, because Hash guests takes longer to hotunplug DIMMs. The size of the DIMM is also a factor, given that longer DIMMs naturally takes longer to be hotunplugged from the kernel. And there's also the guest memory usage to be considered: if there's a process that is consuming memory that would be lost by the DIMM unplug, the kernel will postpone the unplug process until the process finishes, and then initiate the regular hotunplug process. The first two considerations are manageable, but the last one is a deal breaker.
There is no sane way for the pSeries machine to determine the memory load in the guest when attempting a DIMM hotunplug - and even if there was a way, the guest can start using all the RAM in the middle of the unplug process and invalidate our previous assumptions - and in result we can't even begin to calculate a timeout for the operation. This means that we can't implement a viable timeout mechanism for memory unplug in pSeries.
Going back to why we would consider an unplug timeout, the reason is that we can't know if the kernel is giving up the unplug. Turns out that, sometimes, we can. Consider a failed memory hotunplug attempt where the kernel will error out with the following message:
'pseries-hotplug-mem: Memory indexed-count-remove failed, adding any removed LMBs'
This happens when there is a LMB that the kernel gave up in removing, and the LMBs previously marked for removal are now being added back. This happens in the pseries kernel in [1], dlpar_memory_remove_by_ic() into dlpar_add_lmb(), and after that update_lmb_associativity_index(). In this function, the kernel is configuring the LMB DRC connector again. Note that this is a valid usage in LOPAR, as stated in section "ibm,configure-connector RTAS Call":
'A subsequent sequence of calls to ibm,configure-connector with the same entry from the “ibm,drc-indexes” or “ibm,drc-info” property will restart the configuration of devices which were not completely configured.'
We can use this kernel behavior in our favor. If a DRC connector reconfiguration for a LMB that we marked as unplug pending happens, this indicates that the kernel changed its mind about the unplug and is reasserting that it will keep using all the LMBs of the DIMM. In this case, it's safe to assume that the whole DIMM device unplug was cancelled.
This patch hops into rtas_ibm_configure_connector() and, in the scenario described above, clear the unplug state for the DIMM device. This will not solve all the problems we still have with memory unplug, but it will cover this case where the kernel reconfigures LMBs after a failed unplug. We are a bit more resilient, without using an unreliable timeout, and we didn't make the remaining error cases any worse.
[1] arch/powerpc/platforms/pseries/hotplug-memory.c
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210222194531.62717-6-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| d1c2e3ce | 22-Feb-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr_drc.c: add hotunplug timeout for CPUs
There is a reliable way to make a CPU hotunplug fail in the pseries machine. Hotplug a CPU A, then offline all other CPUs inside the guest but A. When try
spapr_drc.c: add hotunplug timeout for CPUs
There is a reliable way to make a CPU hotunplug fail in the pseries machine. Hotplug a CPU A, then offline all other CPUs inside the guest but A. When trying to hotunplug A the guest kernel will refuse to do it, because A is now the last online CPU of the guest. PAPR has no 'error callback' in this situation to report back to the platform, so the guest kernel will deny the unplug in silent and QEMU will never know what happened. The unplug pending state of A will remain until the guest is shutdown or rebooted.
Previous attempts of fixing it (see [1] and [2]) were aimed at trying to mitigate the effects of the problem. In [1] we were trying to guess which guest CPUs were online to forbid hotunplug of the last online CPU in the QEMU layer, avoiding the scenario described above because QEMU is now failing in behalf of the guest. This is not robust because the last online CPU of the guest can change while we're in the middle of the unplug process, and our initial assumptions are now invalid. In [2] we were accepting that our unplug process is uncertain and the user should be allowed to spam the IRQ hotunplug queue of the guest in case the CPU hotunplug fails.
This patch presents another alternative, using the timeout infrastructure introduced in the previous patch. CPU hotunplugs in the pSeries machine will now timeout after 15 seconds. This is a long time for a single CPU unplug to occur, regardless of guest load - although the user is *strongly* encouraged to *not* hotunplug devices from a guest under high load - and we can be sure that something went wrong if it takes longer than that for the guest to release the CPU (the same can't be said about memory hotunplug - more on that in the next patch).
Timing out the unplug operation will reset the unplug state of the CPU and allow the user to try it again, regardless of the error situation that prevented the hotunplug to occur. Of all the not so pretty fixes/mitigations for CPU hotunplug errors in pSeries, timing out the operation is an admission that we have no control in the process, and must assume the worst case if the operation doesn't succeed in a sensible time frame.
[1] https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg03353.html [2] https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg04400.html
Reported-by: Xujun Ma <xuma@redhat.com> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414 Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210222194531.62717-5-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| 51254ffb | 22-Feb-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr_drc.c: introduce unplug_timeout_timer
The LoPAR spec provides no way for the guest kernel to report failure of hotplug/hotunplug events. This wouldn't be bad if those operations were granted t
spapr_drc.c: introduce unplug_timeout_timer
The LoPAR spec provides no way for the guest kernel to report failure of hotplug/hotunplug events. This wouldn't be bad if those operations were granted to always succeed, but that's far for the reality.
What ends up happening is that, in the case of a failed hotunplug, regardless of whether it was a QEMU error or a guest misbehavior, the pSeries machine is retaining the unplug state of the device in the running guest. This state is cleanup in machine reset, where it is assumed that this state represents a device that is pending unplug, and the device is hotunpluged from the board. Until the reset occurs, any hotunplug operation of the same device is forbid because there is a pending unplug state.
This behavior has at least one undesirable side effect. A long standing pending unplug state is, more often than not, the result of a hotunplug error. The user had to dealt with it, since retrying to unplug the device is noy allowed, and then in the machine reset we're removing the device from the guest. This means that we're failing the user twice - failed to hotunplug when asked, then hotunplugged without notice.
Solutions to this problem range between trying to predict when the hotunplug will fail and forbid the operation from the QEMU layer, from opening up the IRQ queue to allow for multiple hotunplug attempts, from telling the users to 'reboot the machine if something goes wrong'. The first solution is flawed because we can't fully predict guest behavior from QEMU, the second solution is a trial and error remediation that counts on a hope that the unplug will eventually succeed, and the third is ... well.
This patch introduces a crude, but effective solution to hotunplug errors in the pSeries machine. For each unplug done, we'll timeout after some time. If a certain amount of time passes, we'll cleanup the hotunplug state from the machine. During the timeout period, any unplug operations in the same device will still be blocked. After that, we'll assume that the guest failed the operation, and allow the user to try again. If the timeout is too short we'll prevent legitimate hotunplug situations to occur, so we'll need to overestimate the regular time an unplug operation takes to succeed to account that.
The true solution for the hotunplug errors in the pSeries machines is a PAPR change to allow for the guest to warn the platform about it. For now, the work done in this timeout design can be used for the new PAPR 'abort hcall' in the future, given that for both cases we'll need code to cleanup the existing unplug states of the DRCs.
At this moment we're adding the basic wiring of the timer into the DRC. Next patch will use the timer to timeout failed CPU hotunplugs.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210222194531.62717-4-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|
| 66407069 | 28-Jan-2021 |
Daniel Henrique Barboza <danielhb413@gmail.com> |
spapr_numa.c: create spapr_numa_initial_nvgpu_numa_id() helper
We'll need to check the initial value given to spapr->gpu_numa_id when building the rtas DT, so put it in a helper for easier access an
spapr_numa.c: create spapr_numa_initial_nvgpu_numa_id() helper
We'll need to check the initial value given to spapr->gpu_numa_id when building the rtas DT, so put it in a helper for easier access and to avoid repetition.
Tested-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> Message-Id: <20210128174213.1349181-3-danielhb413@gmail.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
show more ...
|