- 25 5月, 2020 1 次提交
-
-
由 Oded Gabbay 提交于
GAUDI does not support soft-reset as it leaves the NIC ports in an awkward state, where their QMANs were reset but the NIC itself is still working. In addition, there is not much sense in doing soft-reset when training is done on multiple GAUDIs. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NTomer Tayar <ttayar@habana.ai>
-
- 19 5月, 2020 12 次提交
-
-
由 Ofir Bitton 提交于
Instead of writing similar event handling code for each ASIC, move the code to the common firmware file. This code will be used for GAUDI and all future ASICs. In addition, add two new fields to the auto-generated events file: valid and description. This will save the need to manually write the events description in the source code and simplify the code. Signed-off-by: NOfir Bitton <obitton@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
Add the ASIC-dependent code for GAUDI. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the GAUDI ASIC. It also contains the code to initialize the F/W of the GAUDI ASIC and to receive events from the F/W. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
In Gaudi there is a feature of clock gating certain engines. Therefore, add this property to the device structure. In addition, due to a limitation of this feature, the driver needs to dynamically enable or disable this feature during run-time. Therefore, add ASIC interface functions to enable/disable this function from the common code. Moreover, this feature must be turned off when the user wishes to debug the ASIC by reading/writing registers and/or memory through the driver's debugfs. Therefore, add an option to enable/disable clock gating via the debugfs interface. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Gaudi requires longer waiting during reset due to closing of network ports. Add this explanation to the relevant comment in the code and add a dedicated define for this reset timeout period, instead of multiplying another define. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Coresight is not supported on simulator, therefore add a boolean for checking that (currently used by un-upstreamed code). Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Add the following two operations to the CS IOCTL: Signal: The signal operation is basically a command submission, that is created by the driver upon user request. It will be implemented using a dedicated PQE that will increment a specific SOB. There will be a new flag: HL_CS_FLAGS_SIGNAL. When the user set this flag in the CS IOCTL structure, the driver will execute a dedicated code path that will prepare this special PQE and submit it. The user only needs to provide a queue index on which to put the signal. Wait: The wait operation is also a command submission that is created by the driver upon user request. It will be implemented using a dedicated PQE that will contain packets of "ARM a monitor" + FENCE packet. There will be a new flag: HL_CS_FLAGS_WAIT. When the user set this flag in the CS structure, the driver will execute a dedicated code path that will prepare this special PQE and submit it. The user needs to provide the following parameters: 1. queue ID 2. an array of signal_seq numbers and the number of signals to wait on (the length of signal_seq_arr). The IOCTL will return the CS sequence number of the wait it put on the queue ID. Currently, the code supports signal_seq_nr==1. But this API definition will allow us to put a single PQE that waits on multiple signals. To correctly configure the monitor and fence, the driver will need to retrieve the specified signal CS object that contains the relevant SOB and its expected value. In case the signal CS has already been completed, there is no point of adding a wait operation. In this case, the driver will return to the user *without* putting anything on the PQ. The return code should reflect to the user that the signal was completed, as we won't return a CS sequence number for this wait. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Define a structure representing the h/w sync object (SOB). a SOB can contain up to 2^15 values. Each signal CS will increment the SOB by 1, so after some time we will reach the maximum number the SOB can represent. When that happens, the driver needs to move to a different SOB for the signal operation. A SOB can be in 1 of 4 states: 1. Working state with value < 2^15 2. We reached a value of 2^15, but the signal operations weren't completed yet OR there are pending waits on this signal. For the next submission, the driver will move to another SOB. 3. ALL the signal operations on the SOB have finished AND there are no more pending waits on the SOB AND we reached a value of 2^15 (This basically means the refcnt of the SOB is 0 - see explanation below). When that happens, the driver can clear the SOB by simply doing WREG32 0 to it and set the refcnt back to 1. 4. The SOB is cleared and can be used next time by the driver when it needs to reuse an SOB. Per SOB, the driver will maintain a single refcnt, that will be initialized to 1. When a signal or wait operation on this SOB is submitted to the PQ, the refcnt will be incremented. When a signal or wait operation on this SOB completes, the refcnt will be decremented. After the submission of the signal operation that increments the SOB to a value of 2^15, the refcnt is also decremented. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
This feature requires handling h/w resources which are a bit different from one ASIC to the other. Therefore, we need to define a set of interfaces the ASIC code provides to the common code to signal, wait, reset sync object and to reset and init a queue. As this feature is not supported in Goya, provide an empty implementation of those functions. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
This is a pre-requisite to upstreaming GAUDI support. Signal/wait operations are done by the user to perform sync between two Primary Queues (PQs). The sync is done using the sync manager and it is usually resolved inside the device, but sometimes it can be resolved in the host, i.e. the user should be able to wait in the host until a signal has been completed. The mechanism to define signal and wait operations is done by the driver because it needs atomicity and serialization, which is already done in the driver when submitting work to the different queues. To implement this feature, the driver "takes" a couple of h/w resources, and this is reflected by the defines added to the uapi file. The signal/wait operations are done via the existing CS IOCTL, and they use the same data structure. There is a difference in the meaning of some of the parameters, and for that we added unions to make the code more readable. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Ofir Bitton 提交于
Load CPU device boot loader during driver boot time in order to avoid flash write for every boot loader update. To preserve backward-compatibility, skip the device boot load if the device doesn't request it. Signed-off-by: NOfir Bitton <obitton@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Christine Gharzuzi 提交于
Support hwmon_temp_reset_histroy, hwmon_in_reset_history and hwmon_curr_reset attribute which resets the historical highest value. Signed-off-by: NChristine Gharzuzi <cgharzuzi@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Tomer Tayar 提交于
Add a new opcode to the INFO IOCTL that retrieves the device time alongside the host time, to allow a user application that want to measure device time together with host time (such as a profiler) to synchronize these times. Signed-off-by: NTomer Tayar <ttayar@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
- 17 5月, 2020 6 次提交
-
-
由 Oded Gabbay 提交于
When we have DMA QMAN with multiple streams, we need to know whether the command buffer contains at least one DMA packet in order to configure the barriers correctly when adding the 2xMSG_PROT at the end of the JOB. If there is no DMA packet, then there is no need to put engine barrier. This is relevant only for GAUDI as GOYA doesn't have streams so the engine can't be busy by another stream. Reviewed-by: NTomer Tayar <ttayar@habana.ai> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
Retrieve from the firmware the DMA mask value we need to set according to the device's PCI controller configuration. This is needed when working on POWER9 machines, as the device's PCI controller is configured in a different way in those machines. Reviewed-by: NTomer Tayar <ttayar@habana.ai> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
When doing training, the DL framework (e.g. tensorflow) performs hundreds of thousands of memory allocations and mappings. In case the driver needs to perform hard-reset during training, the driver kills the application and unmaps all those memory allocations. Unfortunately, because of that large amount of mappings, the driver isn't able to do that in the current timeout (5 seconds). Therefore, increase the timeout significantly to 30 seconds to avoid situation where the driver resets the device with active mappings, which sometime can cause a kernel bug. BTW, it doesn't mean we will spend all the 30 seconds because the reset thread checks every one second if the unmap operation is done. Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
Move the code of device CPU initialization from being ASIC-Dependent to common code. In addition, add support for the new error reporting feature of the firmware boot code. Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
We want to remove the following restrictions/assumptions in our driver: 1. The H/W queue index is also the completion queue index. 2. The H/W queue index is also the IRQ number of the completion queue. 3. All queues of the same type have consecutive indexes. Therefore we add the support for H/W queues of the same type with nonconsecutive indexes and completion queue index and IRQ number different than the H/W queue index. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Stop-on-error mode in DMA is useful as it stops the transaction immediately upon error e.g. page fault. But it may cause the next command submission to fail as is leaves the DMA in unstable state. Therefore we remove the stop-on-error configuration from the DMA. Stop-on-err is still available for debug. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
- 24 3月, 2020 7 次提交
-
-
由 Oded Gabbay 提交于
If a GAUDI device is present in the system, display an error message that it is not supported by the current kernel. Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Moti Haimovski 提交于
The hl read and write routines implement the hwmon_ops read and write interface routines respectively. These routines are expected to return a completion status when called, which was not the case until this commit. This commit modifies these routines to return 0 upon success and a negative error value upon failure. Signed-off-by: NMoti Haimovski <mhaimovski@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Moti Haimovski 提交于
This commit adds support for offsetting the temperatures reading by a specified value as defined in https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface using the standard sysfs defined for hwmon. This is required by system administrators to inject errors to test their monitoring applications in data centers. Signed-off-by: NMoti Haimovski <mhaimovski@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Moti Haimovski 提交于
Allow debug user to write/read 64-bit data through debugfs. This will expedite the dump process of the (large) internal memories of the device done during debug. Signed-off-by: NMoti Haimovski <mhaimovski@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Tomer Tayar 提交于
As HL_MAX_JOBS_PER_CS is 512, it is possible that more than 255 CS jobs will be submitted for a certain queue. Hence, modify the "jobs_in_queue_cnt" parameter of the "hl_cs" structure to be u16 instead of u8. Signed-off-by: NTomer Tayar <ttayar@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Host memory may be allocated with huge pages. A different virtual range may be used for mapping in this case. Add Huge PCI MMU (HPMMU) properties to support it. This patch is a prerequisite for future ASICs support and has no effect on Goya ASIC as currently a single virtual host range is used for all page sizes. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Pawel Piskorski 提交于
Optimize hl_mmu_map and hl_mmu_unmap by not calling flush(ctx) within per-page loop. Signed-off-by: NPawel Piskorski <ppiskorski@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
- 21 11月, 2019 10 次提交
-
-
由 Omer Shpigelman 提交于
Split the properties used for MMU mappings to DRAM and PCI (host) types. This is a prerequisite for future ASICs support. Note that in Goya ASIC, the PMMU and DMMU are the same (except of page sizes) as only one MMU mechanism is used for both of the mapping types. Hence this patch should not have any effect on current behavior. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Add the ability to invalidate the necessary MMU cache only. This ability is a prerequisite for future ASICs support. Note that in Goya ASIC, a single cache is used for both host/DRAM mappings and hence this patch should not have any effect on current behavior. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Omer Shpigelman 提交于
Some of the functions in the memory module code were too long and/or contained multiple operations that are not always done together. Re-factor the code by dividing those functions to smaller functions which are more readable and maintainable. Signed-off-by: NOmer Shpigelman <oshpigelman@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Oded Gabbay 提交于
The two defines that control the maximum size of a command buffer and the maximum number of JOBS per CS need to be exported to the user as they are part of the API towards user-space. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai>
-
由 Oded Gabbay 提交于
In training, there is a need for a large amount of patching to the recipe. This results in many command buffers contains a lot of DMA packets. The number of command buffers per CS is larger than the current maximum of 64, which is an arbitrary number that is enough for inference, but it has no real affect on the code and/or resources of the host machine. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai>
-
由 Oded Gabbay 提交于
Add a new opcode to the INFO IOCTL to allow the user application to retrieve the ASIC's current and maximum clock rate. The rate is returned in MHz. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NTomer Tayar <ttayar@habana.ai>
-
由 Oded Gabbay 提交于
Reduce latency to memory during TPC kernel execution. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NTomer Tayar <ttayar@habana.ai>
-
由 Tomer Tayar 提交于
This patch adds a support for a new H/W queue type. This type of queue is for DMA and compute engines jobs, for which completion notification are sent by H/W. Command buffer for this queue can be created either through the CB IOCTL and using the retrieved CB handle, or by preparing a buffer on the host or device SRAM/DRAM, and using the device address to that buffer. The patch includes the handling of the 2 options, as well as the initialization of the H/W queue and its jobs scheduling. Signed-off-by: NTomer Tayar <ttayar@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Tomer Tayar 提交于
Jobs on some queues must be provided with a handle to a driver command buffer object, while for other queues, jobs must be provided with an address to a command buffer. Currently the distinction is done based on the queue type, which is less flexible if the same queue type behaves differently on different types of ASICs. This patch adds a new queue property for this target, which is configured per queue type per ASIC type. Signed-off-by: NTomer Tayar <ttayar@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
由 Tomer Tayar 提交于
s/paerser/parser/ s/requeusted/requested/ s/an JOB/a JOB/ Signed-off-by: NTomer Tayar <ttayar@habana.ai> Reviewed-by: NOded Gabbay <oded.gabbay@gmail.com> Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com>
-
- 05 9月, 2019 4 次提交
-
-
由 Oded Gabbay 提交于
When using the macro le32_to_cpu(x), we need to correctly convert x to be __le32 in case it is defined as u32 variable. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NTomer Tayar <ttayar@habana.ai>
-
由 Oded Gabbay 提交于
We want to stop using the acronym KMD. Therefore, replace all locations (except for register names we can't modify) where KMD is written to other terms such as "Linux kernel driver" or "Host kernel driver", etc. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai>
-
由 Oded Gabbay 提交于
Add a new opcode to INFO IOCTL to retrieve aggregate H/W events. i.e. the events counters are NOT cleared upon device reset, but count from the loading of the driver. Add the code to support it in the device event handling function. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai>
-
由 Oded Gabbay 提交于
Users and sysadmins usually want to know what is the device utilization as a level 0 indication if they are efficiently using the device. Add a new opcode to the INFO IOCTL that will return the device utilization over the last period of 100-1000ms. The return value is 0-100, representing as percentage the total utilization rate. Signed-off-by: NOded Gabbay <oded.gabbay@gmail.com> Reviewed-by: NOmer Shpigelman <oshpigelman@habana.ai>
-