device.c - OpenGrok history log for /openbmc/linux/drivers/accel/habanalabs/common/device.c

Revision	Date	Author	Comments
# 7f6f26d7	16-Apr-2023	Thomas Zimmermann <tzimmermann@suse.de>	Merge drm/drm-next into drm-misc-next Backmerging drm-next to sync with msm tree. Resolves a conflict between aperture-helper changes and msm's use of those interfaces. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
# ea68a3e9	11-Apr-2023	Joonas Lahtinen <joonas.lahtinen@linux.intel.com>	Merge drm/drm-next into drm-intel-gt-next Need to pull in commit from drm-next (earlier in drm-intel-next): 1eca0778f4b3 ("drm/i915: add struct i915_dsm to wrap dsm members together") In order to merge following patch to drm-intel-gt-next: https://patchwork.freedesktop.org/patch/530942/?series=114925&rev=6 Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
# 838ac90d	11-Apr-2023	Daniel Vetter <daniel.vetter@ffwll.ch>	Merge tag 'drm-habanalabs-next-2023-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next This tag contains additional habanalabs driver changes for v6.4: - uAPI changes: - Add a definition of a new Gaudi2 server type. This is used by userspace to know what is the connectivity between the accelerators inside the server - New features and improvements: - speedup h/w queues test in Gaudi2 to reduce device initialization times. - Firmware related fixes: - Fixes to the handshake protocol during f/w initialization. - Sync f/w events interrupt in hard reset to avoid warning message. - Improvements to extraction of the firmware version. - Misc bug fixes and code cleanups. Notable fixes are: - Multiple fixes for interrupt handling in Gaudi2. - Unmap mapped memory in case TLB invalidation fails. Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> From: Oded Gabbay <ogabbay@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230410124637.GA2441888@ogabbay-vm-u20.habana-labs.com
# 802f25b6	21-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: sync f/w events interrupt in hard reset Receiving events from FW, while the device is in hard reset, causes a warning message in Driver log. The message may point to a problem in the Driver or FW. But It also can appear as a result of events that have been sent from FW just before the hard reset. In order to avoid receiving events from FW while the device is in reset and is already in 'disabled' mode, sync the f/w events interrupt right before setting the device to 'disabled'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 3a8d7c3a	22-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: send disable pci when compute ctx is active Fix an issue in hard reset flow in which the driver didn't send a disable pci message if there was an active compute context. In hard reset, disable pci message should be sent no matter if a compute context exists or not. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# a855f710	21-Mar-2023	Tal Cohen <talcohen@habana.ai>	accel/habanalabs: remove duplicated disable pci msg The disable pci message is sent in reset device. It informs the FW not to raise more EQs. The Driver may ignore received EQs, when the device is in disabled mode. The duplication happens when hard reset is scheduled during compute reset and also performs 'escalate_reset_flow'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 248ed9e2	23-Mar-2023	Cai Huoqing <cai.huoqing@linux.dev>	accel/habanalabs: Remove redundant pci_clear_master Remove pci_clear_master to simplify the code, the bus-mastering is also cleared in do_pci_disable_device, like this: ./drivers/pci/pci.c:2197 static void do_pci_disable_device(struct pci_dev *dev) { u16 pci_command; pci_read_config_word(dev, PCI_COMMAND, &pci_command); if (pci_command & PCI_COMMAND_MASTER) { pci_command &= ~PCI_COMMAND_MASTER; pci_write_config_word(dev, PCI_COMMAND, pci_command); } pcibios_disable_device(dev); }. And dev->is_busmaster is set to 0 in pci_disable_device. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 8ba264f4	30-Mar-2023	Maarten Lankhorst <maarten.lankhorst@linux.intel.com>	Merge remote-tracking branch 'drm/drm-next' into drm-misc-next Backmerge to get rc4. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
# cecdd52a	28-Mar-2023	Rodrigo Vivi <rodrigo.vivi@intel.com>	Merge drm/drm-next into drm-intel-next Catch up with 6.3-rc cycle... Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
# d36d68fd	21-Mar-2023	Dave Airlie <airlied@redhat.com>	Merge tag 'drm-habanalabs-next-2023-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next This tag contains habanalabs driver and accel changes for v6.4: - uAPI changes: - Add opcodes to the CS ioctl to allow user to stall/resume specific engines inside Gaudi2. This is to allow the user to perform power testing/measurements when training different topologies. - Expose in the INFO ioctl the amount of device memory that the driver and f/w reserve for themselves. - Expose in the INFO ioctl a bit-mask of the available rotator engines in Gaudi2. This is to align with other engines that are already exposed. - Expose in the INFO ioctl the register's address of the f/w that should be used to trigger interrupts from within the user's code running in the compute engines. - Add a critical-event bit in the eventfd bitmask so the user will know the event that was received was critical, and a reset will now occur - Expose in the INFO ioctl two new opcodes to fetch information on h/w and f/w events. The events recorded are the events that were reported in the eventfd. - New features and improvements: - Add a dedicated interrupt ID in MSI-X in the device to the notification of an unexpected user-related event in Gaudi2. Handle it in the driver by reporting this event. - Allow the user to fetch the device memory current usage even when the device is undergoing compute-reset (a reset type that only clears the compute engines). - Enable graceful reset mechanism for compute-reset. This will give the user a few seconds before the device is reset. For example, the user can, during that time, perform certain device operations (dump data for debug) or close the device in an orderly fashion.
- Align the decoder with the rest of the engines in regard to notification to the user about interrupts and in regard to performing graceful reset when needed (instead of immediate reset). - Add support for assert interrupt from the TPC engine. - Get the reset type that is necessary to perform per event from the auto-generated irq_map array. - Print the specific reason why a device is still in use when notifying to the user about it (after the user closed the device's FD). - Move to threaded IRQ when handling interrupts of workload completions. - Firmware related fixes: - Fix RAZWI event handler to match newest f/w version. - Read error cause register in dma core events because the f/w doesn't do that. - Increase maximum time to wait for completion of Gaudi2 reset due to f/w bug. - Align to the latest firmware specs. - Enforce the release order of the compute device and dma-buf. i.e increment the device file refcount for any dma-buf that was exported for that device. This will make sure the compute device release function won't be called until the user closes all the FDs of the relevant dma-bufs. Without this change, closing the device's FD before/without closing the dma-buf's FD would always lead to hard-reset of the device. - Fix a link in the drm documentation to correctly point to the accel section. 
- Compilation warnings cleanups - Misc bug fixes and code cleanups Signed-off-by: Dave Airlie <airlied@redhat.com> From: Oded Gabbay <ogabbay@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230320154026.GA766126@ogabbay-vm-u20.habana-labs.com
# e752ab11	20-Mar-2023	Rob Clark <robdclark@chromium.org>	Merge remote-tracking branch 'drm/drm-next' into msm-next Merge drm-next into msm-next to pick up external clk and PM dependencies for improved a6xx GPU reset sequence. Signed-off-by: Rob Clark <robdclark@chromium.org>
# d26a3a6c	17-Mar-2023	Dmitry Torokhov <dmitry.torokhov@gmail.com>	Merge tag 'v6.3-rc2' into next Merge with mainline to get of_property_present() and other newer APIs.
# 2e8e9a89	01-Mar-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release() The memory manager IDR is currently destroyed when user releases the file descriptor. However, at this point the user context might be still held, and memory buffers might be still in use. Later on, calls to release those buffers will fail due to not finding their handles in the IDR, leading to a memory leak. To avoid this leak, split the IDR destruction from the memory manager fini, and postpone it to hpriv_release() when there is no user context and no buffers are used. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 28fbc058	17-Feb-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: use scnprintf() in print_device_in_use_info() compose_device_in_use_info() was added to handle the snprintf() return value in a single place. However, the buffer size in print_device_in_use_info() is set such that it would be enough for the max possible print, so compose_device_in_use_info() is not really needed. Moreover, scnprintf() can be used instead of snprintf(), to save the check if the return value larger than the given size. Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
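The difference between snprintf() and scnprintf() that commit 28fbc058 relies on can be sketched in userspace C. This is only an illustration under assumptions: scnprintf() is a kernel function, so `demo_scnprintf()` below is a hypothetical stand-in that mimics its documented return value (characters actually stored, excluding the trailing NUL), and `demo_compose()` and its strings are invented for the example and are not the driver's code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's scnprintf(): returns the
 * number of characters actually stored in buf (excluding the trailing NUL),
 * whereas snprintf() returns the length the output would have had.
 */
static int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated */
	return n;
}

/*
 * Append two fragments without checking each return value. Because the
 * helper never reports more than it stored, the offset stays inside the
 * buffer. The caller must pass size > 0.
 */
static size_t demo_compose(char *buf, size_t size)
{
	size_t off = 0;

	off += demo_scnprintf(buf + off, size - off, "device is in use:");
	off += demo_scnprintf(buf + off, size - off, " %s", "active context");
	assert(off < size);
	return off;
}
```

With plain snprintf() the same accumulation would need a check after every call, since a truncated write returns a value larger than the remaining space and `size - off` would then wrap around.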
# 86b74d84	15-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: assert return value of hw_fini Since hw_fini return error code for failure indication, we should check its return value. Currently it might only fail upon soft-reset from hl_device_reset. Later patch will add hw_fini failure in case of polling timeout in hard-reset. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# efbd36b2	19-Feb-2023	Sagiv Ozeri <sozeri@habana.ai>	accel/habanalabs: add device id to all threads names Compute driver threads names will start with hlX-, when X is the device id. This will help distinguish them from the NIC thread names. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
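The hlX- naming scheme described in commit efbd36b2 amounts to prefixing each thread name with the device id. A minimal userspace sketch follows; `demo_thread_name()` and the "worker" suffix are hypothetical and only illustrate the prefix format, not the names the driver actually uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a thread name of the form "hl<id>-<suffix>".
 * Returns the value of snprintf(), i.e. the length the full name needs.
 */
static int demo_thread_name(char *buf, size_t size, int dev_id,
			    const char *suffix)
{
	assert(buf != NULL && suffix != NULL);
	return snprintf(buf, size, "hl%d-%s", dev_id, suffix);
}
```

For device id 3 and the made-up suffix "worker", this yields "hl3-worker", so names from different devices can be told apart at a glance.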
# a8c14f53	12-Feb-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: improve readability of engines idle mask print Remove leading zeroes when printing the idle mask to make it clearer. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 3a621af6	13-Feb-2023	Tom Rix <trix@redhat.com>	accel/habanalabs: set hl_capture__err storage-class-specifier to static smatch reports drivers/accel/habanalabs/common/device.c:2619:6: warning: symbol 'hl_capture_hw_err' was not declared. Should it be static? drivers/accel/habanalabs/common/device.c:2641:6: warning: symbol 'hl_capture_fw_err' was not declared. Should it be static? both are only used in device.c, so they should be static Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
# 4a2e9d11	01-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: don't trace cpu accessible dma alloc/free The cpu accessible dma allocations use the gen_pool api which actually does not allocate new memory from the system but manages memory already allocated before. When tracing this together with real dma allocation/free it cause confusing logs like a '0' dma address and a cpu address appearing twice etc. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 1d0f9ad7	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: in hl_device_reset small refactor for readabilty in the out_err flow, combine the two cases of soft-reset since they have mostly common code. In addition unlock reset_info.lock after touching reset count. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 39ab4da9	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft' Because this field is only used for debug print, we can do more precise debug directly instead. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 7810c524	08-Feb-2023	Dafna Hirschfeld <dhirschfeld@habana.ai>	accel/habanalabs: tiny refactor of hl_device_reset for readability Align assignment of reset_upon_device_release to the convention used in this function. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 18d13584	25-Jan-2023	Tomer Tayar <ttayar@habana.ai>	accel/habanalabs: enable graceful reset mechanism for compute-reset The graceful reset mechanism is currently enabled only for reset requests that will end up with hard-reset. In future, reset requests due to errors in some device engines, are going to be modified to request compute-reset, as the much longer hard-reset is not really needed there. To allow it, enable graceful reset also for compute-reset, and reset after user releases the device won't be escalated to hard-reset in those cases. If watchdog expires and user didn't release the device, hard-reset will be initiated in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 57479adb	29-Jan-2023	Koby Elbaz <kelbaz@habana.ai>	accel/habanalabs: disable PCI when escalating compute to hard-reset In case a compute reset has failed or a request for a hard reset has just arrived, then we escalate current reset procedure from compute to hard-reset. In such a case, the FW should be aware of the updated error cause, and if LKD is the one who performs the reset (rather than the FW), then we ask the FW to disable PCI access. We would also like to have relevant debug info and therefore we print the currently escalating reset type. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
# 313e9f63	10-Jan-2023	Moti Haimovski <mhaimovski@habana.ai>	accel/habanalabs: add critical-event bit in notifier Enhance the existing user notifications by adding a HW and FW critical event bits to be used when a HW or FW event occur that requires both SW abort and hard-resetting the chip. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>