Revision tags: v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10, v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51 |
|
#
7ce29357 |
| 06-Jul-2020 |
James Zhu <James.Zhu@amd.com> |
drm/amdgpu/nbio: add aldebaran support
Aldebaran has a new mmBIF_MMSCH1_DOORBELL_RANGE setting.
Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Al
drm/amdgpu/nbio: add aldebaran support
Aldebaran has a new mmBIF_MMSCH1_DOORBELL_RANGE setting.
Signed-off-by: James Zhu <James.Zhu@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43 |
|
#
f8a98f16 |
| 25-May-2020 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: fix incorrect EP_STRAP reg offset for aldebaran
mmRCC_DEV0_EPF0_STRAP0 offset in aldebaran is changed from arcturus
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Kev
drm/amdgpu: fix incorrect EP_STRAP reg offset for aldebaran
mmRCC_DEV0_EPF0_STRAP0 offset in aldebaran is changed from arcturus
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4, v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12 |
|
#
759eb38e |
| 18-Nov-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu: correct mmBIF_SDMA4_DOORBELL_RANGE address for aldebaran
On aldebaran, mmBIF_SDMA4_DOORBELL_RANGE isn't right next to mmBIF_SDMA3_DOORBELL_RANGE.
Signed-off-by: Le Ma <le.ma@amd.com> Re
drm/amdgpu: correct mmBIF_SDMA4_DOORBELL_RANGE address for aldebaran
On aldebaran, mmBIF_SDMA4_DOORBELL_RANGE isn't right next to mmBIF_SDMA3_DOORBELL_RANGE.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
9ca0674a |
| 28-Dec-2020 |
Likun Gao <Likun.Gao@amd.com> |
drm/amdgpu: remove redundant logic related HDP
Remove hdp_flush function from amdgpu_nbio struct as it have been unified into hdp struct. Remove the include about hdp register which was not used. V2
drm/amdgpu: remove redundant logic related HDP
Remove hdp_flush function from amdgpu_nbio struct as it have been unified into hdp struct. Remove the include about hdp register which was not used. V2: Remove hdp golden setting which is unnecessary.
Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
f75e94d8 |
| 04-Aug-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: bypass querying ras error count registers
Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count registe
drm/amdgpu: bypass querying ras error count registers
Once ras recovery is issued by ras sync flood interrupt or ras controller interrupt, add this guard to bypass or execute ras error count register harvest of all IPs.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
6952e99c |
| 10-Apr-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: refine ras related message print
Prefix ras related kernel message logging with PCI device info by replacing DRM_INFO/WARN/ERROR with dev_info/warn/err. This can clearly tell user about
drm/amdgpu: refine ras related message print
Prefix ras related kernel message logging with PCI device info by replacing DRM_INFO/WARN/ERROR with dev_info/warn/err. This can clearly tell user about GPU device information where ras is. And add some other ras message printing to make it more clear and friendly as well.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
b635ae87 |
| 18-Mar-2020 |
Alex Sierra <alex.sierra@amd.com> |
drm/amdgpu: ih doorbell size of range changed for nbio v7.4
[Why] nbio v7.4 size of ih doorbell range is 64 bit. This requires 2 DWords per register.
[How] Change ih doorbell size from 2 to 4. This
drm/amdgpu: ih doorbell size of range changed for nbio v7.4
[Why] nbio v7.4 size of ih doorbell range is 64 bit. This requires 2 DWords per register.
[How] Change ih doorbell size from 2 to 4. This means two Dwords per ring. Current configuration uses two ih rings.
Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
3aa0115d |
| 04-Mar-2020 |
Monk Liu <Monk.Liu@amd.com> |
drm/amdgpu: cleanup all virtualization detection routine
we need to move virt detection much earlier because: 1) HW team confirms us that RCC_IOV_FUNC_IDENTIFIER will always be at DE5 (dw) mmio offs
drm/amdgpu: cleanup all virtualization detection routine
we need to move virt detection much earlier because: 1) HW team confirms us that RCC_IOV_FUNC_IDENTIFIER will always be at DE5 (dw) mmio offset from vega10, this way there is no need to implement detect_hw_virt() routine in each nbio/chip file. for VI SRIOV chip (tonga & fiji), the BIF_IOV_FUNC_IDENTIFIER is at 0x1503
2) we need to acknowledged we are SRIOV VF before we do IP discovery because the IP discovery content will be updated by host everytime after it recieved a new coming "REQ_GPU_INIT_DATA" request from guest (there will be patches for this new handshake soon).
Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
3cd4f618 |
| 13-Feb-2020 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU
When NBIO's RAS error happens, before trigging GPU reset, it's needed to record error counter information, which can corre
drm/amdgpu: record non-zero error counter info in NBIO before resetting GPU
When NBIO's RAS error happens, before trigging GPU reset, it's needed to record error counter information, which can correct the error counter value missed issue when reading from debugfs.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
8831fa6e |
| 24-Dec-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: simplify function return logic
Former return logic is redundant.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexand
drm/amdgpu: simplify function return logic
Former return logic is redundant.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
61934624 |
| 13-Dec-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu
BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it.
Signed-off-by
drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu
BACO reset mode strategy is determined by latter func when calling amdgpu_ras_reset_gpu. So not to confuse audience, drop it.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
5c39d600 |
| 22-Nov-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu: clear uncorrectable parity error status bit
This should be cleared during every nbif uncorrectable error cleanup work.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <H
drm/amdgpu: clear uncorrectable parity error status bit
This should be cleared during every nbif uncorrectable error cleanup work.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
28f87950 |
| 22-Nov-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu: clear ras controller status registers when interrupt occurs
To fix issue that ras controller interrupt cannot be triggered anymore after one time nbif uncorrectable error. And error coun
drm/amdgpu: clear ras controller status registers when interrupt occurs
To fix issue that ras controller interrupt cannot be triggered anymore after one time nbif uncorrectable error. And error count is stored in nbif ras object for query.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.3.11, v5.3.10, v5.3.9, v5.3.8 |
|
#
4a2d9356 |
| 21-Oct-2019 |
Le Ma <Le.Ma@amd.com> |
drm/amdgpu: remove ras global recovery handling from ras_controller_int handler
v2: add notification when ras controller interrupt generates
Signed-off-by: Le Ma <Le.Ma@amd.com> Reviewed-by: Hawkin
drm/amdgpu: remove ras global recovery handling from ras_controller_int handler
v2: add notification when ras controller interrupt generates
Signed-off-by: Le Ma <Le.Ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.3.7, v5.3.6 |
|
#
956f6705 |
| 11-Oct-2019 |
Le Ma <le.ma@amd.com> |
drm/amdgpu/soc15: disable doorbell interrupt as part of BACO entry sequence
Workaround to make RAS recovery work in BACO reset.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Alex Deucher <alexa
drm/amdgpu/soc15: disable doorbell interrupt as part of BACO entry sequence
Workaround to make RAS recovery work in BACO reset.
Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3 |
|
#
3e103fc3 |
| 10-Sep-2019 |
Kent Russell <kent.russell@amd.com> |
Revert "drm/amdgpu/nbio7.4: add hw bug workaround for vega20"
This reverts commit e01f2d41895102d824c6b8f5e011dd5e286d5e8b.
VG20 did not require this workaround, as the fix is in the VBIOS. Leave V
Revert "drm/amdgpu/nbio7.4: add hw bug workaround for vega20"
This reverts commit e01f2d41895102d824c6b8f5e011dd5e286d5e8b.
VG20 did not require this workaround, as the fix is in the VBIOS. Leave VG10/12 workaround as some older shipped cards do not have the VBIOS fix in place, and the kernel workaround is required in those situations
Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
1a3f2e8c |
| 10-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: implement ras query function for pcie bif
ras error query funtionality implementation
Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Review
drm/amdgpu: implement ras query function for pcie bif
ras error query funtionality implementation
Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.2.14, v5.3-rc8, v5.2.13, v5.2.12 |
|
#
52652ef2 |
| 04-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: add ras error query count interface for nbio
Add the interface query_ras_error_count for nbio.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.co
drm/amdgpu: add ras error query count interface for nbio
Add the interface query_ras_error_count for nbio.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
1bd252c5 |
| 10-Sep-2019 |
Guchun Chen <guchun.chen@amd.com> |
drm/amdgpu: remove duplicated header file include
amdgpu_ras.h is already included.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex De
drm/amdgpu: remove duplicated header file include
amdgpu_ras.h is already included.
Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
1c70d3d9 |
| 02-Sep-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu/nbio: switch to amdgpu_nbio_ras_late_init helper function
amdgpu_nbio_ras_late_init is used to init nbio specfic ras debugfs/sysfs node and nbio specific interrupt handler. It can be shar
drm/amdgpu/nbio: switch to amdgpu_nbio_ras_late_init helper function
amdgpu_nbio_ras_late_init is used to init nbio specfic ras debugfs/sysfs node and nbio specific interrupt handler. It can be shared among nbio generations
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
d094aea3 |
| 02-Sep-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: set ip specific ras interface pointer to NULL after free it
to prevent access to dangling pointers
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@
drm/amdgpu: set ip specific ras interface pointer to NULL after free it
to prevent access to dangling pointers
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
7c6e68c7 |
| 13-Sep-2019 |
Andrey Grodzovsky <andrey.grodzovsky@amd.com> |
drm/amdgpu: Avoid HW GPU reset for RAS.
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in
drm/amdgpu: Avoid HW GPU reset for RAS.
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP.
Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem.
v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count
v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
#
9ad1dc29 |
| 29-Aug-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: add ras_late_init callback function for nbio v7_4 (v3)
ras_late_init callback function will be used to do common ras init in late init phase.
v2: call ras_late_fini to do cleanup when f
drm/amdgpu: add ras_late_init callback function for nbio v7_4 (v3)
ras_late_init callback function will be used to do common ras init in late init phase.
v2: call ras_late_fini to do cleanup when fails to enable interrupt
v3: rename sysfs/debugfs node name to pcie_bif_xxx
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13, v5.1.12, v5.1.11, v5.1.10, v5.1.9, v5.1.8 |
|
#
4e644fff |
| 05-Jun-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu: add ras_controller and err_event_athub interrupt support
Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support.
Signed-off-by: Hawking
drm/amdgpu: add ras_controller and err_event_athub interrupt support
Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|
Revision tags: v5.1.7, v5.1.6 |
|
#
4241863a |
| 29-May-2019 |
Hawking Zhang <Hawking.Zhang@amd.com> |
drm/amdgpu/nbio: add functions to query ras specific interrupt status
ras_controller_interrupt and err_event_interrupt are ras specific interrupts. add functions to check their status and ack them i
drm/amdgpu/nbio: add functions to query ras specific interrupt status
ras_controller_interrupt and err_event_interrupt are ras specific interrupts. add functions to check their status and ack them if they are generated. both funcitons should only be invoked in ISR when BIF ring is disabled or even not initialized.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
show more ...
|