- 01 7月, 2021 6 次提交
-
-
由 Luben Tuikov 提交于
Split functionality between read and write, which simplifies the code and exposes areas of optimization and more or less complexity, and take advantage of that. Read and write the table in one go; use a separate stage to decode or encode the data, as opposed to on the fly, which keeps the I2C bus busy. Use a single read/write to read/write the table or at most two if the number of records we're reading/writing wraps around. Check the check-sum of a table in EEPROM on init. Update the checksum at the same time as when updating the table header signature, when the threshold was increased on boot. Take advantage of arithmetic modulo 256, that is, use a byte!, to greatly simplify checksum arithmetic. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Qualify with "ras_". Use kernel's own--don't redefine your own. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
RAS_MAX_RECORD_NUM may mean the maximum record number, as in the maximum house number on your street, or it may mean the maximum number of records, as in the count of records, which is also a number. To make this distinction whether the number is ordinal (index) or cardinal (count), rename this macro to RAS_MAX_RECORD_COUNT. This makes it easy to understand what it refers to, especially when we compute quantities such as, how many records do we have left in the table, especially when there are so many other numbers, quantities and numerical macros around. Also rename the long, amdgpu_ras_eeprom_get_record_max_length() to the more succinct and clear, amdgpu_ras_eeprom_max_record_count(). When computing the threshold, which also deals with counts, i.e. "how many", use cardinal "max_eeprom_records_count", than the quantitative "max_eeprom_records_len". Simplify the logic here and there, as well. Cc: Guchun Chen <guchun.chen@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
The low level EEPROM write method, doesn't return 1, but the number of bytes written. Thus do not compare to 1, instead, compare to greater than 0 for success. Other cleanup: if the lower layers returned -errno, then return that, as opposed to overwriting the error code with one-fits-all -EINVAL. For instance, some return -EAGAIN. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Wrap amdgpu_ras_eeprom_xfer(..., bool write), into amdgpu_ras_eeprom_read() and amdgpu_ras_eeprom_write(), as that makes reading and understanding the code clearer. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Instead of fixing the spelling in amdgpu_ras_eeprom_process_recods(), rename it to, amdgpu_ras_eeprom_xfer(), to look similar to other I2C and protocol transfer (read/write) functions. Also to keep the column span to within reason by using a shorter name. Change the "num" function parameter from "int" to "const u32" since it is the number of items (records) to xfer, i.e. their count, which cannot be a negative number. Also swap the order of parameters, keeping the pointer to records and their number next to each other, while the direction now becomes the last parameter. Cc: Jean Delvare <jdelvare@suse.de> Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Cc: Lijo Lazar <Lijo.Lazar@amd.com> Cc: Stanley Yang <Stanley.Yang@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 19 6月, 2021 2 次提交
-
-
由 Stanley.Yang 提交于
Use SMU to update the bad pages rather than directly accessing the EEPROM from the driver. Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NJohn Clements <john.clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 12 6月, 2021 1 次提交
-
-
由 Guchun Chen 提交于
Use adev_to_drm() to get to the drm_device pointer. Signed-off-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 28 5月, 2021 2 次提交
-
-
由 Luben Tuikov 提交于
On Context Query2 IOCTL return the correctable and uncorrectable errors in O(1) fashion, from cached values, and schedule a delayed work function to calculate and cache them for the next such IOCTL. v2: Cancel pending delayed work at ras_fini(). v3: Remove conditionals when dealing with delayed work manipulation as they're inherently racy. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NAlexander Deucher <Alexander.Deucher@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
The correctable and uncorrectable errors are calculated at each invocation of this function. Therefore, it is highly inefficient to return just one of them based on a Boolean input. If the caller wants both, twice the work would be done. (And this work is O(n^3) on Vega20.) Fix this "interface" to simply return what it had calculated--both values. Let the caller choose what it wants to record, inspect, use. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NAlexander Deucher <Alexander.Deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 20 5月, 2021 2 次提交
-
-
由 Andrey Grodzovsky 提交于
Some of the stuff in amdgpu_device_fini such as HW interrupts disable and pending fences finilization must be done right away on pci_remove while most of the stuff which relates to finilizing and releasing driver data structures can be kept until drm_driver.release hook is called, i.e. when the last device reference is dropped. v4: Change functions prefix early->hw and late->sw Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com
-
由 John Clements 提交于
Only clear RAS error counters if perestent EDC harvesting is not supported Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NJohn Clements <john.clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 11 5月, 2021 11 次提交
-
-
由 Dwaipayan Ray 提交于
Fix a couple of syntax errors and removed one excess parameter in the function documentations which lead to kernel docs build warning. Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NDwaipayan Ray <dwaipayanray1@gmail.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Dennis Li 提交于
For aldebaran, hardware will not clear error status automatically when reading error status register, insteadly driver should set clear bit of the error status register explicitly to clear error status. Signed-off-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Dennis Li 提交于
The original codes use ras status and kernl errno together in the same function, which is a wrong code style. Signed-off-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Oak Zeng 提交于
If RAS is disabled through amdgpu_ras_enable kernel parameter, we should quit the RAS initialization eariler to avoid initialization of some RAS data structure such as sysfs etc. Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NOak Zeng <Oak.Zeng@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Export the runtime-set "ras_hw_enabled" and "ras_enabled" to debugfs, for debugging. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Rename, ras_hw_supported --> ras_hw_enabled, and ras_features --> ras_enabled, to show that ras_enabled is a subset of ras_hw_enabled, which itself is a subset of the ASIC capability. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Move ras_hw_supported into struct amdgpu_dev. The dependency is: struct amdgpu_ras <== struct amdgpu_dev <== ASIC, read as "struct amdgpu_ras depends on struct amdgpu_dev, which depends on the hardware." This can be loosely understood as, "if RAS is supported, which is property of the ASIC (struct amdgpu_dev), then we can access struct amdgpu_ras." v2: Fix a typo: must binary AND in ternary cond in amdgpu_ras.c Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Remove redundant ras->supported, as this value is also stored in adev->ras_features. Use adev->ras_features, as that supercedes "ras", since the latter is its member. The dependency goes like this: ras <== adev->ras_features <== hw_supported, and is read as "ras depends on ras_features, which depends on hw_supported." The arrows show the flow of information, i.e. the dependency update. "hw_supported" should also live in "adev". Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
gfx ras now can be enabled by default in aldebaran Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
add hdp block ras error query and reset support in amdgpu ras error count query and reset interface Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 29 4月, 2021 1 次提交
-
-
由 Hawking Zhang 提交于
Add socket/die information in RAS messages for platforms that support query those information Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 24 4月, 2021 2 次提交
-
-
由 Hawking Zhang 提交于
aldebaran gfx ras is still under development Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 21 4月, 2021 2 次提交
-
-
由 Mukul Joshi 提交于
Reset the RAS error count and error status registers after reading to prevent over reporting error counts on Aldebaran. Signed-off-by: NMukul Joshi <mukul.joshi@amd.com> Reviewed-By: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Dennis Li 提交于
because "sscanf(str, "retire_page")" always return 0, if application use the raw data for error injection, it always wrongly falls into "op == 3". Change to use strstr instead. Signed-off-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 16 4月, 2021 5 次提交
-
-
由 Luben Tuikov 提交于
Add back the double-sscanf so that both decimal and hexadecimal values could be read in, but this time invert the scan so that hexadecimal format with a leading 0x is tried first, and if that fails, then try decimal format. Also use a logical-AND instead of nesting double if-conditional. See commit "drm/amdgpu: Fix a bug for input with double sscanf" Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NJohn Clements <john.clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Imporve the kernel-doc for the RAS sysfs interface. Fix the grammar, fix the context. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Add bad_page_cnt_threshold to debugfs, an optional file system used for debugging, for reporting purposes only--it usually matches the size of EEPROM but may be different depending on the "bad_page_threshold" kernel module option. The "bad_page_cnt_threshold" is a dynamically computed value. It depends on three things: the VRAM size; the size of the EEPROM (or the size allocated to the RAS table therein); and the "bad_page_threshold" module parameter. It is a dynamically computed value, when the amdgpu module is run, on which further parameters and logic depend, and as such it is helpful to see the dynamically computed value in debugfs. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Fix if (ret) --> if (!ret), a bug, for "retire_page", which caused the kernel to recall the method with *pos == end of file, and that bounced back with error. On the first run, we advanced *pos, but returned 0 back to fs layer, also a bug. Fix the logic of the check of the result of amdgpu_reserve_page_direct()--it is 0 on success, and non-zero on error, not the other way around. This patch fixes this bug. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Luben Tuikov 提交于
Remove double-sscanf to scan for %llu and 0x%llx, as that is not going to work! The %llu will consume the "0" in "0x" of your input, and the hex value you think you're entering will always be 0. That is, a valid hex value can never be consumed. On the other hand, just entering a hex number without leading 0x will either be scanned as a string and not match, for instance FAB123, or the leading decimal portion is scanned as the %llu, for instance 123FAB will be scanned as 123, which is not correct. Thus remove the first %llu scan and leave only the %llx scan, removing the leading 0x since %llx can scan either. Addresses are usually always hex values, so this suffices. Cc: Alexander Deucher <Alexander.Deucher@amd.com> Cc: Xinhui Pan <xinhui.pan@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 10 4月, 2021 6 次提交
-
-
由 John Clements 提交于
added support in RAS debugfs to add bad page for isolated page retirement testing Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NJohn Clements <john.clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 John Clements 提交于
In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NJohn Clements <john.clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
gfx ras is only available in cerntain ip generations. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
mmhub ras is only avaiable in cerntain mmhub ip generation. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
umc ras is not managed by gpu driver when gpu is connected to cpu through xgmi. split umc callbacks into ras and non-ras ones so gpu driver only initializes umc ras callbacks when it manages umc ras. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
xgmi ras is not managed by gpu driver when gpu is connected to cpu through xgmi. move all xgmi ras functions to xgmi_ras_funcs so gpu driver only initializes xgmi ras functions when it manages xgmi ras. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NJohn Clements <John.Clements@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-