Commits · e1fa724dd17a6a9b9934636226e683912d12c876 · openeuler / Kernel

28 Jan, 2021 30 commits

habanalabs: add user available interrupt to hw_ip · e1fa724d

Committed by Ofir Bitton on Jan 06, 2021

In order to support completions that arrive directly to the user,
the driver needs to supply the user with the first available msix
interrupt.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: always try to use the hint address · 8d79ce16

Committed by farah kassabri on Jan 11, 2021

Currently the hint address is ignored when the va block page size
is not a power of 2. We need to support the user hint address in this
case as well, but only if the hint address is aligned to the page size.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

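The alignment rule in the commit above can be sketched in plain C. This is a minimal userspace illustration, not the driver's code: `is_power_of_2_u64` and `hint_addr_is_usable` are hypothetical names. It shows why a bit mask is enough for power-of-2 page sizes while the general case needs a modulo check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when page_size is a non-zero power of 2. */
static bool is_power_of_2_u64(uint64_t page_size)
{
    return page_size != 0 && (page_size & (page_size - 1)) == 0;
}

/*
 * Hypothetical helper: decide whether a user hint address may be honored.
 * The mask test only works for power-of-2 page sizes, so any other page
 * size falls back to a modulo test.
 */
static bool hint_addr_is_usable(uint64_t hint_addr, uint64_t page_size)
{
    if (hint_addr == 0 || page_size == 0)
        return false; /* no hint given, or invalid page size */

    if (is_power_of_2_u64(page_size))
        return (hint_addr & (page_size - 1)) == 0;

    return (hint_addr % page_size) == 0;
}
```

With an assumed 3 MiB page (`0x300000`), a hint of `0x900000` is accepted because it is an exact multiple, while `0x400000` is rejected.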

habanalabs: add security violations dump to debugfs · d2b980f3

Committed by Ofir Bitton on Jan 07, 2021

In order to improve driver security debuggability, we add
security violations dump to debugfs.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: ignore F/W BMC errors in case no BMC present · eea4c255

Committed by Ofir Bitton on Jan 10, 2021

In order to support an operation mode in which BMC is not active,
the driver must not take BMC errors into consideration.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: print sync manager SEI interrupt info · f8bc7f09

Committed by Ofir Bitton on Jan 03, 2021

The driver must print sync manager SEI information upon receiving
an interrupt from FW.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: Use 'dma_set_mask_and_coherent()' · 825b30c4

Committed by Christophe JAILLET on Jan 04, 2021

Axe 'hl_pci_set_dma_mask()' and replace it with an equivalent
'dma_set_mask_and_coherent()' call.

This makes the code a bit less verbose.

It also removes an erroneous comment, because 'hl_pci_set_dma_mask()'
does not try to use a fall-back value.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: remove PCI access to SM block · 423815bf

Committed by Ofir Bitton on Jan 05, 2021

Due to a HW limitation we must remove all direct access to SM
registers. To do that, we will access SM registers using
the HW QMANS.
When possible and no user context is present, we can directly access
the HW QMANS. Whenever there is an active user, the driver will
prepare a pending command buffer list which will be sent upon
user submissions.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: add driver support for internal cb scheduling · d3f139c4

Committed by Ofir Bitton on Nov 18, 2020

In order to support scenarios in which the driver needs access to
HW components but cannot access them directly, we add support for
scheduling command buffers internally.
These command buffers will be transmitted upon the next user command
submission context.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: increment ctx ref from within a cs allocation · 1e3f2536

Committed by Ofir Bitton on Jan 03, 2021

A CS must increment the relevant context reference count.
We want to increment the reference inside the CS allocation function,
as opposed to today, where we increment it outside.
This is logical since we want to avoid explicitly incrementing
the context every time we call the CS allocate function.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

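The ownership change described in the commit above can be sketched as follows. This is a simplified userspace illustration with hypothetical `demo_*` names and a plain integer counter; kernel code would normally use a dedicated reference-count type. The point is only that the allocator takes the context reference itself, so callers cannot forget to.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a context and a command submission (CS). */
struct demo_ctx {
    int refcount;
};

struct demo_cs {
    struct demo_ctx *ctx;
};

/* The reference is taken inside the allocator, not by each caller. */
static struct demo_cs *demo_cs_alloc(struct demo_ctx *ctx)
{
    struct demo_cs *cs = calloc(1, sizeof(*cs));

    if (!cs)
        return NULL; /* no reference was taken, nothing to undo */

    ctx->refcount++;
    cs->ctx = ctx;
    return cs;
}

/* Releasing the CS drops the reference that the allocator took. */
static void demo_cs_free(struct demo_cs *cs)
{
    cs->ctx->refcount--;
    free(cs);
}
```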

habanalabs: separate common code to dedicated folders · 8563e191

Committed by Ofir Bitton on Dec 28, 2020

We separate some of the common code source files to different
folders for better maintainability and testability.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: read device boot errors after cpucp is up · edb07cb6

Committed by Ofir Bitton on Dec 27, 2020

Boot cpu can report errors in various boot stages.
The current implementation does not take into consideration errors
reported in late stages, hence we will check for errors at the
latest stage, when fetching cpucp information.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: report correct dram size in info ioctl · 6769cea8

Committed by Ofir Bitton on Dec 31, 2020

In case MMU is enabled, we must take MMU page size into
consideration when reporting dram size to the user.
This is because the MMU page size can be a value which is NOT
a power-of-2 value. As a result, the total DRAM size (which is always
a power-of-2 value) needs to be rounded down.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

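The rounding described in the commit above comes down to integer division. The sketch below is an illustration only: `round_down_to_page` is a hypothetical name, and the 32 GiB and 48 MiB figures in the note are assumed values, not taken from the driver.

```c
#include <stdint.h>

/*
 * Round total_size down to a whole number of pages.
 * Division and multiplication work for any non-zero page size,
 * whereas a mask such as (size & ~(page - 1)) needs a power of 2.
 */
static uint64_t round_down_to_page(uint64_t total_size, uint64_t page_size)
{
    return (total_size / page_size) * page_size;
}
```

With an assumed 32 GiB of DRAM and a 48 MiB page, 682 whole pages fit, so the reported size would be 32736 MiB rather than 32768 MiB.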

habanalabs: support non power-of-2 DRAM phys page sizes · b19dc67a

Committed by Moti Haimovski on Nov 18, 2020

DRAM physical page sizes depend on the number of HBMs available in
the device. This number is device-dependent and may also be subject
to binning when one or more of the DRAM controllers are found to
be faulty. Such a configuration may lead to partitioning the DRAM
to non-power-of-2 pages.

To support this feature we also need to add infrastructure of address
scrambling.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: remove access to kernel memory using debugfs · a1f85332

Committed by Ofir Bitton on Dec 28, 2020

Accessing kernel allocated memory through debugfs should not
be allowed as it introduces a security vulnerability.
We remove the option to read/write kernel memory for all asics.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: set uninitialized symbol · 266cdfa2

Committed by Ofir Bitton on Dec 22, 2020

Initialize the local variable that is returned by the function, in
case it is never assigned.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: return dram virtual address in info ioctl · 9402a336

Committed by Alon Mizrahi on Dec 23, 2020

When working with DRAM MMU, we should supply the userspace with the
virtual start address of the DRAM instead of the physical one. This
is because the physical one has no meaning for the user as he only
knows the virtual address range.
Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: update to latest hl_boot_if.h · 3abe1040

Committed by Oded Gabbay on Dec 18, 2020

Update to the latest version of this file that the F/W exports.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: add ASIC property of functional HBMs · 1530d468

Committed by Oded Gabbay on Dec 18, 2020

The number of functional HBMs in the same ASIC can be different due
to malfunctioning HBM banks.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: add debug prints for security status · 2e368560

Committed by Ofir Bitton on Dec 16, 2020

In order to have more information while debugging boot issues,
we should print the firmware security status at every boot stage.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: modify memory functions signatures · f19040ce

Committed by Omer Shpigelman on Dec 09, 2020

For consistency, modify all memory ioctl functions to get the ioctl
arguments structure rather than the arguments themselves.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: kernel doc format in memory functions · 3b762f55

Committed by Omer Shpigelman on Dec 09, 2020

Change all memory functions documentation according to kernel doc
format.
Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: replace WARN/WARN_ON with dev_crit in driver · 75d9a2a0

Committed by Alon Mizrahi on Dec 03, 2020

Often WARN is defined in data-centers as BUG and we would like to
avoid hanging the entire server on some internal error of the driver
(important as it might be).

Therefore, use dev_crit instead.
Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: report dram_page_size in hw_ip_info ioctl · 0eda23d7

Committed by Moti Haimovski on Dec 07, 2020

Instead of having it hard-coded as a define, pass it to the user
at runtime.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/goya: move mmu_prepare to context init · e1b85dba

Committed by Ohad Sharabi on Dec 01, 2020

Currently mmu_prepare is located at context switch.
Since we support a single context, there is no reason to reconfigure
the MMU registers on every context switch.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: remove duplicated gaudi packets masks · f8b0f2ec

Committed by Ofir Bitton on Dec 06, 2020

As all packets use the same CTL register masks, we remove duplicated
masks and use common masks instead.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: allow user to pass a staged submission seq · c209e742

Committed by Ofir Bitton on Dec 03, 2020

In order to support the staged submission feature, the user must be
allowed to use the same CS sequence for all submissions in the
same staged submission.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs/gaudi: support CS with no completion · ac6fdbfe

Committed by Ofir Bitton on Dec 03, 2020

As part of the staged submission feature, we need Gaudi to support
command submissions that will never get a completion.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: Init the VM module for kernel context · 8e39e75a

Committed by Ofir Bitton on Nov 12, 2020

In order to reserve VA ranges for kernel memory, we need
to allow the VM module to be initialized with the kernel context.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: refactor MMU locks code · cb6ef0ee

Committed by Ohad Sharabi on Nov 26, 2020

Remove mmu_cache_lock as it protects a section which is already
protected by mmu_lock.

In addition, wrap mmu cache invalidate calls in hl_vm_ctx_fini with
mmu_lock.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: update firmware boot interface · 4c998836

Committed by Oded Gabbay on Dec 04, 2020

Update to latest firmware hl_boot_if.h file.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

22 Jan, 2021 3 commits

habanalabs: disable FW events on device removal · 2dc4a6d7

Committed by Oded Gabbay on Jan 18, 2021

When the device is removed, we need to make sure the F/W won't send us
any more events because during the remove process we disable the
interrupts.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: fix backward compatibility of idle check · f8abaf37

Committed by Oded Gabbay on Jan 18, 2021

We need to take the lower 32 bits of the driver's 64-bit idle mask and put
them in the legacy 32-bit variable that the userspace reads to know the
idle mask.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

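The truncation described in the commit above is a single mask-and-cast. A minimal sketch follows, with `legacy_idle_mask` as a hypothetical name; in kernel code the existing `lower_32_bits()` helper serves the same purpose.

```c
#include <stdint.h>

/* Keep only bits 0..31 of the 64-bit mask for the legacy 32-bit field. */
static uint32_t legacy_idle_mask(uint64_t idle_mask64)
{
    return (uint32_t)(idle_mask64 & 0xFFFFFFFFULL);
}
```

Engines tracked in bits 32 and above are not visible through the legacy field, so older userspace sees only the first 32 entries of the mask.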

habanalabs: zero pci counters packet before submit to FW · 9354f1b4

Committed by Ofir Bitton on Jan 17, 2021

The driver does not zero some pci counters packets before sending
them to FW. This causes an out of sync PI/CI between driver and FW.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


12 Jan, 2021 3 commits

habanalabs: prevent soft lockup during unmap · 9488307a

Committed by Oded Gabbay on Jan 11, 2021

When using a deep learning framework such as TensorFlow or PyTorch, there
are tens of thousands of host memory mappings. When the user frees
all those mappings at the same time, the process of unmapping and
unpinning them can take a long time, which may cause a soft lockup
bug.

To prevent this, we need to free the core to do other things during
the unmapping process. For now, we chose to do it every 32K unmappings
(each unmap is a single 4K page).
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

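The "yield every 32K unmappings" idea from the commit above can be sketched in userspace C. All names here (`UNMAPS_PER_YIELD`, `unmap_all`, the callbacks) are hypothetical. In kernel code the yield step would be a scheduling point such as `cond_resched()`; which call the patch actually uses is not stated in the commit message.

```c
#include <stddef.h>

/* "every 32K unmappings" from the commit text */
#define UNMAPS_PER_YIELD (32 * 1024)

typedef void (*unmap_fn)(size_t page_index);
typedef void (*yield_fn)(void);

/*
 * Unmap num_pages pages, giving up the CPU after every
 * UNMAPS_PER_YIELD pages. Returns how many times it yielded.
 * Either callback may be NULL in this demo.
 */
static size_t unmap_all(size_t num_pages, unmap_fn unmap_one, yield_fn yield)
{
    size_t yields = 0;

    for (size_t i = 0; i < num_pages; i++) {
        if (unmap_one)
            unmap_one(i);

        if ((i + 1) % UNMAPS_PER_YIELD == 0) {
            if (yield)
                yield();
            yields++;
        }
    }

    return yields;
}
```

At 4K per page, 32K unmappings cover 128 MiB of host memory between yields.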

habanalabs: fix reset process in case of failures · aa6df653

Committed by Oded Gabbay on Jan 11, 2021

There are some points in the reset process where, if the code fails
for some reason and the system admin tries to initiate the reset
process again, we will get a kernel panic.

This is because there aren't any protections in different fini
functions that are called during the reset process.

The protections that are added in this patch make sure that if the fini
functions are called multiple times, without calling init functions
between them, there won't be double release of already released
resources.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

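One common way to make a fini function safe to call twice is an "initialized" flag, sketched below in userspace C. The `demo_*` names are hypothetical and the patch may use different guards; this only illustrates the double-release protection the commit describes.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical module state with one owned resource. */
struct demo_module {
    bool initialized;
    int *buf;
};

static int demo_init(struct demo_module *m)
{
    m->buf = calloc(16, sizeof(*m->buf));
    if (!m->buf)
        return -1;

    m->initialized = true;
    return 0;
}

/* Safe to call repeatedly: the second call finds nothing to release. */
static void demo_fini(struct demo_module *m)
{
    if (!m->initialized)
        return;

    free(m->buf);
    m->buf = NULL;
    m->initialized = false;
}
```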

habanalabs: fix dma_addr passed to dma_mmap_coherent · a9d4ef64

Committed by Oded Gabbay on Jan 11, 2021

When doing dma_alloc_coherent in the driver, we add a certain hard-coded
offset to the DMA address before returning to the caller function. This
offset is needed when our device uses this DMA address to perform
outbound transactions to the host.

However, if we want to map the DMA'able memory to the user via
dma_mmap_coherent(), we need to pass the original dma address, without
this offset. Otherwise, we will get an erroneous mapping.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

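The two address views described in the commit above differ by a fixed offset. The sketch below models only that arithmetic: `DEMO_HOST_PHYS_BASE` is an assumed, illustrative value and both helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed offset, for illustration only. */
#define DEMO_HOST_PHYS_BASE 0x8000000000ULL

/* Address the device uses for outbound transactions to the host. */
static uint64_t device_view_addr(uint64_t dma_addr)
{
    return dma_addr + DEMO_HOST_PHYS_BASE;
}

/* Original DMA address, as needed when mapping the buffer to userspace. */
static uint64_t mmap_view_addr(uint64_t device_addr)
{
    return device_addr - DEMO_HOST_PHYS_BASE;
}
```

Passing the device-view address where the original one is expected would shift the mapping by the whole offset, which is the erroneous mapping the commit refers to.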

30 Dec, 2020 1 commit

habanalabs: Fix memleak in hl_device_reset · b000700d

Committed by Dinghao Liu on Dec 26, 2020

When kzalloc() fails, we should execute hl_mmu_fini()
to release the MMU module. It's the same when
hl_ctx_init() fails.
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

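The leak fixed in the commit above is the classic missing-unwind case, usually handled with a `goto` cleanup label. The userspace sketch below uses hypothetical `demo_*` names and `malloc()` in place of `kzalloc()`; it is not the code of `hl_device_reset`.

```c
#include <errno.h>
#include <stdlib.h>

/* Tracks whether the demo MMU module is currently initialized. */
static int demo_mmu_active;

static int demo_mmu_init(void)
{
    demo_mmu_active = 1;
    return 0;
}

static void demo_mmu_fini(void)
{
    demo_mmu_active = 0;
}

/* fail_alloc simulates an allocation failure after the MMU init step. */
static int demo_reset(int fail_alloc)
{
    void *ctx;
    int rc;

    rc = demo_mmu_init();
    if (rc)
        return rc;

    ctx = fail_alloc ? NULL : malloc(64);
    if (!ctx) {
        rc = -ENOMEM;
        goto mmu_fini; /* undo the step that already succeeded */
    }

    free(ctx); /* demo only: real code would keep using the context */
    return 0;

mmu_fini:
    demo_mmu_fini();
    return rc;
}
```

A later failing step, such as context initialization, would jump to the same label after freeing its own allocation.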

28 Dec, 2020 3 commits

habanalabs: fix order of status check · 097c62b6

Committed by Oded Gabbay on Dec 22, 2020

When the device is in reset or needs to be reset, the disabled property
is don't-care.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: register to pci shutdown callback · fcaebc73

Committed by Oded Gabbay on Dec 14, 2020

We need to make sure our device is idle when rebooting a virtual
machine. This is done at the driver level.

The firmware will later handle FLR but we want to be extra safe and
stop the devices until the FLR is handled.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


habanalabs: add validation cs counter, fix misplaced counters · a3fd2830

Committed by Alon Mizrahi on Dec 08, 2020

Up until now validation errors were counted in the parsing field
of the cs_counters struct, so we added a new counter and increased
it when needed.

In addition, there were some locations where only one of the counters
was updated (ctx or aggregate) so add the second one to be updated
as well.
Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>


openeuler / Kernel synced successfully more than 1 year ago
