History log of /openbmc/linux/drivers/accel/habanalabs/common/device.c (Results 26 – 50 of 67)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 7f6f26d7 16-Apr-2023 Thomas Zimmermann <tzimmermann@suse.de>

Merge drm/drm-next into drm-misc-next

Backmerging drm-next to sync with msm tree. Resolves a conflict
between aperture-helper changes and msm's use of those interfaces.

Signed-off-by: Thomas Zimmer

Merge drm/drm-next into drm-misc-next

Backmerging drm-next to sync with msm tree. Resolves a conflict
between aperture-helper changes and msm's use of those interfaces.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>

show more ...


# ea68a3e9 11-Apr-2023 Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Merge drm/drm-next into drm-intel-gt-next

Need to pull in commit from drm-next (earlier in drm-intel-next):

1eca0778f4b3 ("drm/i915: add struct i915_dsm to wrap dsm members together")

In order to

Merge drm/drm-next into drm-intel-gt-next

Need to pull in commit from drm-next (earlier in drm-intel-next):

1eca0778f4b3 ("drm/i915: add struct i915_dsm to wrap dsm members together")

In order to merge following patch to drm-intel-gt-next:

https://patchwork.freedesktop.org/patch/530942/?series=114925&rev=6

Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

show more ...


# 838ac90d 11-Apr-2023 Daniel Vetter <daniel.vetter@ffwll.ch>

Merge tag 'drm-habanalabs-next-2023-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains additional habanalabs driver changes for v6.4:

- uAPI cha

Merge tag 'drm-habanalabs-next-2023-04-10' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains additional habanalabs driver changes for v6.4:

- uAPI changes:
- Add a definition of a new Gaudi2 server type. This is used by userspace
to know what is the connectivity between the accelerators inside the
server

- New features and improvements:
- speedup h/w queues test in Gaudi2 to reduce device initialization times.

- Firmware related fixes:
- Fixes to the handshake protocol during f/w initialization.
- Sync f/w events interrupt in hard reset to avoid warning message.
- Improvements to extraction of the firmware version.

- Misc bug fixes and code cleanups. Notable fixes are:
- Multiple fixes for interrupt handling in Gaudi2.
- Unmap mapped memory in case TLB invalidation fails.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Oded Gabbay <ogabbay@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20230410124637.GA2441888@ogabbay-vm-u20.habana-labs.com

show more ...


# 802f25b6 21-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: sync f/w events interrupt in hard reset

Receiving events from FW, while the device is in hard reset, causes
a warning message in Driver log. The message may point to a
problem in t

accel/habanalabs: sync f/w events interrupt in hard reset

Receiving events from FW, while the device is in hard reset, causes
a warning message in Driver log. The message may point to a
problem in the Driver or FW. But It also can appear as a result
of events that have been sent from FW just before the hard reset.
In order to avoid receiving events from FW while the device is in reset
and is already in 'disabled' mode, sync the f/w events interrupt right
before setting the device to 'disabled'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 3a8d7c3a 22-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: send disable pci when compute ctx is active

Fix an issue in hard reset flow in which the driver didn't send a
disable pci message if there was an active compute context.
In hard re

accel/habanalabs: send disable pci when compute ctx is active

Fix an issue in hard reset flow in which the driver didn't send a
disable pci message if there was an active compute context.
In hard reset, disable pci message should be sent no matter if
a compute context exists or not.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# a855f710 21-Mar-2023 Tal Cohen <talcohen@habana.ai>

accel/habanalabs: remove duplicated disable pci msg

The disable pci message is sent in reset device. It informs the FW not
to raise more EQs. The Driver may ignore received EQs, when the device
is i

accel/habanalabs: remove duplicated disable pci msg

The disable pci message is sent in reset device. It informs the FW not
to raise more EQs. The Driver may ignore received EQs, when the device
is in disabled mode.
The duplication happens when hard reset is scheduled during compute
reset and also performs 'escalate_reset_flow'.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 248ed9e2 23-Mar-2023 Cai Huoqing <cai.huoqing@linux.dev>

accel/habanalabs: Remove redundant pci_clear_master

Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
stat

accel/habanalabs: Remove redundant pci_clear_master

Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;

pci_read_config_word(dev, PCI_COMMAND, &pci_command);
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
pci_write_config_word(dev, PCI_COMMAND, pci_command);
}

pcibios_disable_device(dev);
}.
And dev->is_busmaster is set to 0 in pci_disable_device.

Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 8ba264f4 30-Mar-2023 Maarten Lankhorst <maarten.lankhorst@linux.intel.com>

Merge remote-tracking branch 'drm/drm-next' into drm-misc-next

Backmerge to get rc4.

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>


# cecdd52a 28-Mar-2023 Rodrigo Vivi <rodrigo.vivi@intel.com>

Merge drm/drm-next into drm-intel-next

Catch up with 6.3-rc cycle...

Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>


# d36d68fd 21-Mar-2023 Dave Airlie <airlied@redhat.com>

Merge tag 'drm-habanalabs-next-2023-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains habanalabs driver and accel changes for v6.4:

- uAPI chan

Merge tag 'drm-habanalabs-next-2023-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next

This tag contains habanalabs driver and accel changes for v6.4:

- uAPI changes:

- Add opcodes to the CS ioctl to allow user to stall/resume specific engines
inside Gaudi2. This is to allow the user to perform power
testing/measurements when training different topologies.

- Expose in the INFO ioctl the amount of device memory that the driver
and f/w reserve for themselves.

- Expose in the INFO ioctl a bit-mask of the available rotator engines
in Gaudi2. This is to align with other engines that are already exposed.

- Expose in the INFO ioctl the register's address of the f/w that should
be used to trigger interrupts from within the user's code running in the
compute engines.

- Add a critical-event bit in the eventfd bitmask so the user will know the
event that was received was critical, and a reset will now occur

- Expose in the INFO ioctl two new opcodes to fetch information on h/w and
f/w events. The events recorded are the events that were reported in the
eventfd.

- New features and improvements:

- Add a dedicated interrupt ID in MSI-X in the device to the notification of
an unexpected user-related event in Gaudi2. Handle it in the driver by
reporting this event.

- Allow the user to fetch the device memory current usage even when the
device is undergoing compute-reset (a reset type that only clears the
compute engines).

- Enable graceful reset mechanism for compute-reset. This will give the
user a few seconds before the device is reset. For example, the user can,
during that time, perform certain device operations (dump data for debug)
or close the device in an orderly fashion.

- Align the decoder with the rest of the engines in regard to notification
to the user about interrupts and in regard to performing graceful reset
when needed (instead of immediate reset).

- Add support for assert interrupt from the TPC engine.

- Get the reset type that is necessary to perform per event from the
auto-generated irq_map array.

- Print the specific reason why a device is still in use when notifying to
the user about it (after the user closed the device's FD).

- Move to threaded IRQ when handling interrupts of workload completions.

- Firmware related fixes:

- Fix RAZWI event handler to match newest f/w version.

- Read error cause register in dma core events because the f/w doesn't
do that.

- Increase maximum time to wait for completion of Gaudi2 reset due to f/w
bug.

- Align to the latest firmware specs.

- Enforce the release order of the compute device and dma-buf.
i.e increment the device file refcount for any dma-buf that was exported
for that device. This will make sure the compute device release function
won't be called until the user closes all the FDs of the relevant
dma-bufs. Without this change, closing the device's FD before/without
closing the dma-buf's FD would always lead to hard-reset of the device.

- Fix a link in the drm documentation to correctly point to the accel section.

- Compilation warnings cleanups

- Misc bug fixes and code cleanups

Signed-off-by: Dave Airlie <airlied@redhat.com>

# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEE7TEboABC71LctBLFZR1NuKta54AFAmQYfcAACgkQZR1NuKta
# 54DB4Af/SuiHZkVXwr+yHPv9El726rz9ZQD7mQtzNmehWGonwAvz15yqocNMUSbF
# JbqE/vrZjvbXrP1Uv5UrlRVdnFHSPV18VnHU4BMS/WOm19SsR6vZ0QOXOoa6/AUb
# w+kF3D//DbFI4/mTGfpH5/pzwu51ti8aVktosPFlHIa8iI8CB4/4IV+ivQ8UW4oK
# HyDRkIvHdRmER7vGOfhwhsr4zdqSlJBYrv3C3Z1dkSYBPW/5ICbiM1UlKycwdYKI
# cajQBSdUQwUCWnI+i8RmSy3kjNO6OE4XRUvTv89F2bQeyK/1rJLG2m2xZR/Ml/o5
# 7Cgvbn0hWZyeqe7OObYiBlSOBSehCA==
# =wclm
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 21 Mar 2023 01:37:36 AEST
# gpg: using RSA key ED311BA00042EF52DCB412C5651D4DB8AB5AE780
# gpg: Can't check signature: No public key
From: Oded Gabbay <ogabbay@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20230320154026.GA766126@ogabbay-vm-u20.habana-labs.com

show more ...


# e752ab11 20-Mar-2023 Rob Clark <robdclark@chromium.org>

Merge remote-tracking branch 'drm/drm-next' into msm-next

Merge drm-next into msm-next to pick up external clk and PM dependencies
for improved a6xx GPU reset sequence.

Signed-off-by: Rob Clark <ro

Merge remote-tracking branch 'drm/drm-next' into msm-next

Merge drm-next into msm-next to pick up external clk and PM dependencies
for improved a6xx GPU reset sequence.

Signed-off-by: Rob Clark <robdclark@chromium.org>

show more ...


# d26a3a6c 17-Mar-2023 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge tag 'v6.3-rc2' into next

Merge with mainline to get of_property_present() and other newer APIs.


# 2e8e9a89 01-Mar-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()

The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context mi

accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()

The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 28fbc058 17-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: use scnprintf() in print_device_in_use_info()

compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_devi

accel/habanalabs: use scnprintf() in print_device_in_use_info()

compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.

Cc: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 86b74d84 15-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: assert return value of hw_fini

Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device

accel/habanalabs: assert return value of hw_fini

Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# efbd36b2 19-Feb-2023 Sagiv Ozeri <sozeri@habana.ai>

accel/habanalabs: add device id to all threads names

Compute driver threads names will start with hlX-*, when X is the
device id.
This will help distinguish them from the NIC thread names.

Signed-o

accel/habanalabs: add device id to all threads names

Compute driver threads names will start with hlX-*, when X is the
device id.
This will help distinguish them from the NIC thread names.

Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# a8c14f53 12-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: improve readability of engines idle mask print

Remove leading zeroes when printing the idle mask to make it clearer.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Ode

accel/habanalabs: improve readability of engines idle mask print

Remove leading zeroes when printing the idle mask to make it clearer.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 3a621af6 13-Feb-2023 Tom Rix <trix@redhat.com>

accel/habanalabs: set hl_capture_*_err storage-class-specifier to static

smatch reports
drivers/accel/habanalabs/common/device.c:2619:6: warning:
symbol 'hl_capture_hw_err' was not declared. Shoul

accel/habanalabs: set hl_capture_*_err storage-class-specifier to static

smatch reports
drivers/accel/habanalabs/common/device.c:2619:6: warning:
symbol 'hl_capture_hw_err' was not declared. Should it be static?
drivers/accel/habanalabs/common/device.c:2641:6: warning:
symbol 'hl_capture_fw_err' was not declared. Should it be static?

both are only used in device.c, so they should be static

Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 4a2e9d11 01-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: don't trace cpu accessible dma alloc/free

The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory alr

accel/habanalabs: don't trace cpu accessible dma alloc/free

The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory already
allocated before. When tracing this together with real dma
allocation/free it cause confusing logs like a '0' dma address and
a cpu address appearing twice etc.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 1d0f9ad7 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in hl_device_reset small refactor for readabilty

in the out_err flow, combine the two cases of soft-reset since
they have mostly common code. In addition unlock reset_info.lock
aft

accel/habanalabs: in hl_device_reset small refactor for readabilty

in the out_err flow, combine the two cases of soft-reset since
they have mostly common code. In addition unlock reset_info.lock
after touching reset count.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 39ab4da9 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'

Because this field is only used for debug print,
we can do more precise debug directly instead.

Signed-off-by: Dafna Hirschfeld <d

accel/habanalabs: in hl_device_reset remove 'hard_instead_of_soft'

Because this field is only used for debug print,
we can do more precise debug directly instead.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 7810c524 08-Feb-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: tiny refactor of hl_device_reset for readability

Align assignment of reset_upon_device_release to the convention used
in this function.

Signed-off-by: Dafna Hirschfeld <dhirschfel

accel/habanalabs: tiny refactor of hl_device_reset for readability

Align assignment of reset_upon_device_release to the convention used
in this function.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 18d13584 25-Jan-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: enable graceful reset mechanism for compute-reset

The graceful reset mechanism is currently enabled only for reset
requests that will end up with hard-reset.
In future, reset reque

accel/habanalabs: enable graceful reset mechanism for compute-reset

The graceful reset mechanism is currently enabled only for reset
requests that will end up with hard-reset.
In future, reset requests due to errors in some device engines, are
going to be modified to request compute-reset, as the much longer
hard-reset is not really needed there.
To allow it, enable graceful reset also for compute-reset, and reset
after user releases the device won't be escalated to hard-reset in those
cases.
If watchdog expires and user didn't release the device, hard-reset will
be initiated in any case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 57479adb 29-Jan-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: disable PCI when escalating compute to hard-reset

In case a compute reset has failed or a request for a hard reset has
just arrived, then we escalate current reset procedure from c

accel/habanalabs: disable PCI when escalating compute to hard-reset

In case a compute reset has failed or a request for a hard reset has
just arrived, then we escalate current reset procedure from compute
to hard-reset.
In such a case, the FW should be aware of the updated error cause,
and if LKD is the one who performs the reset (rather than the FW),
then we ask the FW to disable PCI access.

We would also like to have relevant debug info and therefore
we print the currently escalating reset type.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


# 313e9f63 10-Jan-2023 Moti Haimovski <mhaimovski@habana.ai>

accel/habanalabs: add critical-event bit in notifier

Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW ab

accel/habanalabs: add critical-event bit in notifier

Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW abort and hard-resetting the chip.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>

show more ...


123