提交 · bc1c99a5971aa7571e8b9731c28fa32abe12cab8 · openeuler / Kernel

20 11月, 2020 4 次提交

EDAC: Add DDR5 new memory type · bc1c99a5

由 Qiuxu Zhuo 提交于 11月 17, 2020

Add a new entry to 'enum mem_type' and a new string to
'edac_mem_types[]' for DDR5 new memory type.
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>

bc1c99a5

EDAC/i10nm: Use readl() to access MMIO registers · 83ff51c4

由 Qiuxu Zhuo 提交于 11月 17, 2020

Instead of raw access, use readl() to access MMIO registers of
memory controller to avoid possible compiler re-ordering.

Fixes: d4dc89d0 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Cc: <stable@vger.kernel.org>
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>

83ff51c4

EDAC/igen6: Add debugfs interface for Intel client SoC EDAC driver · 2223d8c7

由 Qiuxu Zhuo 提交于 11月 05, 2020

Add debugfs support to fake memory correctable errors to test the
error reporting path and the error address decoding logic in the
igen6_edac driver.

Please note that the fake errors are also reported to EDAC core and
then the CE counter in EDAC sysfs is also increased.
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>

2223d8c7

EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC · 10590a9d

由 Qiuxu Zhuo 提交于 11月 05, 2020

This driver supports Intel client SoC with integrated memory controller
using In-Band ECC(IBECC). The memory correctable and uncorrectable errors
are reported via NMIs. The driver handles the NMIs and decodes the memory
error address to platform specific address. The first IBECC-supported SoC
is Elkhart Lake.

[Tony: s/#include <linux/nmi.h>/#include <asm/nmi.h>/ to fix randconfig build]
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>

10590a9d

06 11月, 2020 1 次提交

EDAC: Add three new memory types · 3b203693

由 Qiuxu Zhuo 提交于 11月 05, 2020

There are {Low-Power DDR3/4, WIO2} types of memory.
Add new entries to 'enum mem_type' and new strings to
'edac_mem_types[]' for the new types.
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>

3b203693

10 10月, 2020 1 次提交

EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh · b4210eab

由 Yazen Ghannam 提交于 10月 09, 2020

AMD Family 19h Models 20h-2Fh use the same PCI IDs as Family 17h Models
70h-7Fh. The same family ops and number of channels also apply.

Use the Family17h Model 70h family_type and ops for Family 19h Models
20h-2Fh. Update the controller name to match the system.
Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201009171803.3214354-1-Yazen.Ghannam@amd.com

b4210eab

18 9月, 2020 2 次提交

EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location · e6bbde8b

由 Xiongfeng Wang 提交于 9月 14, 2020

Reading those sysfs entries gives:

  [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/max_location
  memory 3 [root@localhost /]# cat /sys/devices/system/edac/mc/mc0/dimm0/dimm_location
  memory 0 [root@localhost /]#

Add newlines after the value it prints for better readability.

  [ bp: Make len a signed int and change the check to catch wraparound.
    Increment the pointer p only when the length check passes. Use
    scnprintf(). ]
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/1600051734-8993-1-git-send-email-wangxiongfeng2@huawei.com

e6bbde8b

EDAC/aspeed: Use module_platform_driver() to simplify · 07def587

由 Liu Shixin 提交于 9月 14, 2020

Use module_platform_driver() which makes the code simpler by eliminating
boilerplate code.
Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NJoel Stanley <joel@jms.id.au>
Link: https://lkml.kernel.org/r/20200914065358.3726216-1-liushixin2@huawei.com

07def587

17 9月, 2020 2 次提交

cper,edac,efi: Memory Error Record: bank group/address and chip id · 612b5d50

由 Alex Kluver 提交于 8月 19, 2020

Updates to the UEFI 2.8 Memory Error Record allow splitting the bank field
into bank address and bank group, and using the last 3 bits of the extended
field as a chip identifier.

When needed, print correct version of bank field, bank group, and chip
identification.

Based on UEFI 2.8 Table 299. Memory Error Record.
Signed-off-by: NAlex Kluver <alex.kluver@hpe.com>
Reviewed-by: NRuss Anderson <russ.anderson@hpe.com>
Reviewed-by: NKyle Meyer <kyle.meyer@hpe.com>
Reviewed-by: NSteve Wahl <steve.wahl@hpe.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-3-alex.kluver@hpe.comSigned-off-by: NArd Biesheuvel <ardb@kernel.org>

612b5d50

edac,ghes,cper: Add Row Extension to Memory Error Record · 9baf68cc

由 Alex Kluver 提交于 8月 19, 2020

Memory errors could be printed with incorrect row values since the DIMM
size has outgrown the 16 bit row field in the CPER structure. UEFI
Specification Version 2.8 has increased the size of row by allowing it to
use the first 2 bits from a previously reserved space within the structure.

When needed, add the extension bits to the row value printed.

Based on UEFI 2.8 Table 299. Memory Error Record
Signed-off-by: NAlex Kluver <alex.kluver@hpe.com>
Tested-by: NRuss Anderson <russ.anderson@hpe.com>
Reviewed-by: NSteve Wahl <steve.wahl@hpe.com>
Reviewed-by: NKyle Meyer <kyle.meyer@hpe.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200819143544.155096-2-alex.kluver@hpe.comSigned-off-by: NArd Biesheuvel <ardb@kernel.org>

9baf68cc

15 9月, 2020 2 次提交

EDAC/ghes: Check whether the driver is on the safe list correctly · 251c54ea

由 Borislav Petkov 提交于 9月 11, 2020

With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe,
unregister and probe again a driver.

When ghes_edac is attempted to be loaded on a system which is not on
the safe platforms list, ghes_edac_register() would return early. The
unregister counterpart ghes_edac_unregister() would still attempt to
unregister and exit early at the refcount test, leading to the refcount
underflow below.

In order to not do *anything* on the unregister path too, reuse the
force_load parameter and check it on that path too, before fumbling with
the refcount.

  ghes_edac: ghes_edac_register: entry
  ghes_edac: ghes_edac_register: return -ENODEV
  ------------[ cut here ]------------
  refcount_t: underflow; use-after-free.
  WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100
  Modules linked in:
  CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:refcount_warn_saturate+0xb9/0x100
  Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c
  RSP: 0018:ffffc90000037d58 EFLAGS: 00010292
  RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff
  RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000
  R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0
  Call Trace:
   ghes_edac_unregister
   ghes_remove
   platform_drv_remove
   really_probe
   driver_probe_device
   device_driver_attach
   __driver_attach
   ? device_driver_attach
   ? device_driver_attach
   bus_for_each_dev
   bus_add_driver
   driver_register
   ? bert_init
   ghes_init
   do_one_initcall
   ? rcu_read_lock_sched_held
   kernel_init_freeable
   ? rest_init
   kernel_init
   ret_from_fork
   ...
  ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic

251c54ea

EDAC/ghes: Clear scanned data on unload · cd8100f1

由 Borislav Petkov 提交于 9月 11, 2020

Commit

  b972fdba ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")

didn't clear all the information from the scanned system and, more
specifically, left ghes_hw.num_dimms to its previous value. On a
second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use
the leftover num_dimms value which is not 0 and thus the 0 check in
enumerate_dimms() will get bypassed and it would go directly to the
pointer deref:

  d = &hw->dimms[hw->num_dimms];

which is, of course, NULL:

  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD 0 P4D 0
  Oops: 0002 [#1] PREEMPT SMP
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7
  Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018
  RIP: 0010:enumerate_dimms.cold+0x7b/0x375

Reset the whole ghes_hw on driver unregister so that no stale values are
used on a second system scan.

Fixes: b972fdba ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()")
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic

cd8100f1

09 9月, 2020 1 次提交

EDAC, sb_edac: Simplify switch statement · fbd4ab78

由 Tom Rix 提交于 9月 07, 2020

clang static analyzer reports this problem

sb_edac.c:959:2: warning: Undefined or garbage value
  returned to caller
        return type;
        ^~~~~~~~~~~

This is a false positive.

However by initializing the type to DEV_UNKNOWN the 3 case can be
removed from the switch, saving a comparison and jump.
Signed-off-by: NTom Rix <trix@redhat.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com

fbd4ab78

02 9月, 2020 2 次提交

EDAC/ti: Fix handling of platform_get_irq() error · 66077adb

由 Krzysztof Kozlowski 提交于 8月 27, 2020

platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.

 [ bp: Massage commit message. ]

Fixes: 86a18ee2 ("EDAC, ti: Add support for TI keystone and DRA7xx EDAC")
Signed-off-by: NKrzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NTero Kristo <t-kristo@ti.com>
Link: https://lkml.kernel.org/r/20200827070743.26628-2-krzk@kernel.org

66077adb

EDAC/aspeed: Fix handling of platform_get_irq() error · afce6996

由 Krzysztof Kozlowski 提交于 8月 27, 2020

platform_get_irq() returns a negative error number on error. In such a
case, comparison to 0 would pass the check therefore check the return
value properly, whether it is negative.

 [ bp: Massage commit message. ]

Fixes: 9b7e6242 ("EDAC, aspeed: Add an Aspeed AST2500 EDAC driver")
Signed-off-by: NKrzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NStefan Schaeckeler <schaecsn@gmx.net>
Link: https://lkml.kernel.org/r/20200827070743.26628-1-krzk@kernel.org

afce6996

01 9月, 2020 1 次提交

EDAC/i5100: Fix error handling order in i5100_init_one() · 857a3139

由 Dinghao Liu 提交于 8月 26, 2020

When pci_get_device_func() fails, the driver doesn't need to execute
pci_dev_put(). mci should still be freed, though, to prevent a memory
leak. When pci_enable_device() fails, the error injection PCI device
"einj" doesn't need to be disabled either.

 [ bp: Massage commit message, rename label to "bail_mc_free". ]

Fixes: 52608ba2 ("i5100_edac: probe for device 19 function 0")
Signed-off-by: NDinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200826121437.31606-1-dinghao.liu@zju.edu.cn

857a3139

28 8月, 2020 1 次提交

EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register() · b972fdba

由 Shiju Jose 提交于 8月 27, 2020

After

  b9cae277 ("EDAC/ghes: Scan the system once on driver init")

and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes
a NULL pointer after the second ->probe() (aka ghes_edac_register())
which the config option causes to be called.

This happens because the static variable which holds down whether
the system has been scanned already, doesn't get reset in
ghes_edac_unregister(). Then, on the second probe, ghes_scan_system()
doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining
NULL.

Clear the variable and rename it to something more descriptive so that a
second probe succeeds.

 [ bp: Rewrite commit message. ]

Fixes: b9cae277 ("EDAC/ghes: Scan the system once on driver init")
Suggested-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NShiju Jose <shiju.jose@huawei.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com

b972fdba

24 8月, 2020 1 次提交

treewide: Use fallthrough pseudo-keyword · df561f66

由 Gustavo A. R. Silva 提交于 8月 23, 2020

Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-throughSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

df561f66

20 8月, 2020 1 次提交

x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap · 368d1887

由 Yazen Ghannam 提交于 7月 20, 2020

The Extended Error Code Bitmap (xec_bitmap) for a Scalable MCA bank type
was intended to be used by the kernel to filter out invalid error codes
on a system. However, this is unnecessary after a few product releases
because the hardware will only report valid error codes. Thus, there's
no need for it with future systems.

Remove the xec_bitmap field and all references to it.
Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200720145353.43924-1-Yazen.Ghannam@amd.com

368d1887

18 8月, 2020 1 次提交

EDAC/{i7core,sb,pnd2,skx}: Fix error event severity · 45bc6098

由 Tony Luck 提交于 7月 07, 2020

IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.
Reported-by: NGabriele Paoloni <gabriele.paoloni@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com

45bc6098

17 8月, 2020 4 次提交

EDAC/thunderx: Make symbol lmc_dfs_ents static · bd17e0b7

由 Wei Yongjun 提交于 7月 14, 2020

Symbol 'lmc_dfs_ents' is not used outside of thunderx_edac.c, so
make it static:

  drivers/edac/thunderx_edac.c:457:22: warning:
   symbol 'lmc_dfs_ents' was not declared. Should it be static?
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Acked-by: NRobert Richter <rrichter@marvell.com>
Link: https://lkml.kernel.org/r/20200714142308.46612-1-weiyongjun1@huawei.com

bd17e0b7

EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver · e23a7cde

由 Talel Shenhar 提交于 8月 16, 2020

The Amazon's Annapurna Labs Memory Controller EDAC supports ECC capability
for error detection and correction (Single bit error correction, Double
detection). This driver introduces EDAC driver for that capability.

 [ bp: Remove "EDAC" string from Kconfig tristate as it is redundant. ]
Signed-off-by: NTalel Shenhar <talel@amazon.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NJames Morse <james.morse@arm.com>
Link: https://lkml.kernel.org/r/20200816185551.19108-3-talel@amazon.com

e23a7cde

EDAC/mce_amd: Add new error descriptions for existing types · dc7a8476

由 Yazen Ghannam 提交于 7月 08, 2020

A few existing MCA bank types will have new error types in future SMCA
systems.

Add the descriptions for the new error types.
Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200708153515.1911642-1-Yazen.Ghannam@amd.com

dc7a8476

EDAC: Replace HTTP links with HTTPS ones · 7d4c1ea2

由 Alexander A. Klimov 提交于 7月 08, 2020

Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.

  Deterministic algorithm:
  For each file:
    If not .svg:
      For each line:
        If doesn't contain `\bxmlns\b`:
          For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
  	  If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
              If both the HTTP and HTTPS versions
              return 200 OK and serve the same content:
                Replace HTTP with HTTPS.

 [ bp: Merge all EDAC patches into a single one. ]
Signed-off-by: NAlexander A. Klimov <grandmaster@al2klimov.de>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac
Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de

7d4c1ea2

11 8月, 2020 1 次提交

EDAC/ie31200: Fallback if host bridge device is already initialized · 709ed1bc

由 Jason Baron 提交于 7月 16, 2020

The Intel uncore driver may claim some of the pci ids from ie31200 which
means that the ie31200 edac driver will not initialize them as part of
pci_register_driver().

Let's add a fallback for this case to 'pci_get_device()' to get a
reference on the device such that it can still be configured. This is
similar in approach to other edac drivers.
Signed-off-by: NJason Baron <jbaron@akamai.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com

709ed1bc

23 6月, 2020 1 次提交

x86/mce, EDAC/mce_amd: Print PPIN in machine check records · bb2de0ad

由 Smita Koralahalli 提交于 6月 23, 2020

Print the Protected Processor Identification Number (PPIN) on processors
which support it.

 [ bp: Massage. ]
Signed-off-by: NSmita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com

bb2de0ad

19 6月, 2020 1 次提交

EDAC/amd64: Read back the scrub rate PCI register on F15h · ee470bb2

由 Borislav Petkov 提交于 6月 18, 2020

Commit:

  da92110d ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")

added support for F15h, model 0x60 CPUs but in doing so, missed to read
back SCRCTRL PCI config register on F15h CPUs which are *not* model
0x60. Add that read so that doing

  $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate

can show the previously set DRAM scrub rate.

Fixes: da92110d ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
Reported-by: NAnders Andersson <pipatron@gmail.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> #v4.4..
Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com

ee470bb2

17 6月, 2020 2 次提交

EDAC: Fix reference count leaks · 17ed808a

由 Qiushi Wu 提交于 5月 28, 2020

When kobject_init_and_add() returns an error, it should be handled
because kobject_init_and_add() takes a reference even when it fails. If
this function returns an error, kobject_put() must be called to properly
clean up the memory associated with the object.

Therefore, replace calling kfree() and call kobject_put() and add a
missing kobject_put() in the edac_device_register_sysfs_main_kobj()
error path.

 [ bp: Massage and merge into a single patch. ]

Fixes: b2ed215a ("Kobject: change drivers/edac to use kobject_init_and_add")
Signed-off-by: NQiushi Wu <wu000273@umn.edu>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu
Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu

17ed808a

EDAC/ghes: Scan the system once on driver init · b9cae277

由 Borislav Petkov 提交于 6月 03, 2020

Change the hardware scanning and figuring out how many DIMMs a machine
has to a single, one-time thing which happens once on driver init. After
that scanning completes, struct ghes_hw_desc contains a representation
of the hardware which the driver can then use for later initialization.

Then, copy the DIMM information into the respective EDAC core
representation of those.

Get rid of ghes_edac_dimm_fill and use a struct dimm_info array
directly.

This way, hw detection and further driver initialization is nicely
and logically split. Further additions should all be added to
ghes_scan_system() and the hw representation extended as needed.

There should be no functionality change resulting from this patch.
Signed-off-by: NBorislav Petkov <bp@suse.de>

b9cae277

16 6月, 2020 5 次提交

EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt · b001694d

由 Robert Richter 提交于 5月 19, 2020

The struct members list and ghes of struct ghes_edac_pvt are unused,
remove them. On that occasion, rename it to the shorter name struct
ghes_pvt.
Signed-off-by: NRobert Richter <rrichter@marvell.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com

b001694d

EDAC/ghes: Setup DIMM label from DMI and use it in error reports · cb51a371

由 Robert Richter 提交于 5月 28, 2020

The ghes driver reports errors with 'unknown label' even if the actual
DIMM label is known, e.g.:

 EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
   module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
   page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
   node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
   location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
   DRAM memory)

Fix this by using struct dimm_info's label string in error reports:

 EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
   rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
   page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
   node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
   location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
   DRAM memory)

The labels are initialized by reading the bank and device strings
from DMI. Now, the label information can also read from sysfs. E.g. a
ThunderX2 system will show the following:

  /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
  /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
  /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
  /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
  /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
  /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
  /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
  /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
  /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
  /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
  /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
  /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
  /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
  /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
  /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
  /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0

Since dimm_labels can be rewritten, that label will be used in a later
error report:

  # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
  # # some error injection here
  # dmesg | grep foobar
  [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
  module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
  page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
  node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
  location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
  memory)

 [ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]
Signed-off-by: NRobert Richter <rrichter@marvell.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com

cb51a371

EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations · 8807e155

由 Qiuxu Zhuo 提交于 5月 09, 2020

Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU
stepping specific configurations to {skx,i10nm}_init(), so can delete
the CPU stepping check from 10nm_init().
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com

8807e155

EDAC/mc: Call edac_inc_ue_error() before panic · e9ff6636

由 Zhenzhong Duan 提交于 6月 10, 2020

By calling edac_inc_ue_error() before panic, we get a correct UE error
count for core dump analysis.
Signed-off-by: NZhenzhong Duan <zhenzhong.duan@gmail.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200610065846.3626-2-zhenzhong.duan@gmail.com

e9ff6636

EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier · 30bf38e4

由 Zhenzhong Duan 提交于 6月 10, 2020

Avoid giving it MCE_PRIO_LOWEST priority by default.
Signed-off-by: NZhenzhong Duan <zhenzhong.duan@gmail.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200610065846.3626-1-zhenzhong.duan@gmail.com

30bf38e4

14 6月, 2020 1 次提交

treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624

由 Masahiro Yamada 提交于 6月 14, 2020

Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following commend:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>

a7f7f624

29 5月, 2020 1 次提交

EDAC/amd64: Remove redundant assignment to variable ret in hw_info_get() · f00eb5ff

由 Colin Ian King 提交于 4月 29, 2020

The variable ret is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant so remove it.
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com

f00eb5ff

23 5月, 2020 1 次提交

EDAC/amd64: Add AMD family 17h model 60h PCI IDs · b6bea24d

由 Alexander Monakov 提交于 5月 10, 2020

Add support for AMD Renoir (4000-series Ryzen CPUs).
Signed-off-by: NAlexander Monakov <amonakov@ispras.ru>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Acked-by: NYazen Ghannam <yazen.ghannam@amd.com>
Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru

b6bea24d

20 5月, 2020 1 次提交

EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable · 10320950

由 Qiuxu Zhuo 提交于 5月 15, 2020

The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.

Cc: <stable@vger.kernel.org>
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: NMatthew Riley <mattdr@google.com>
Acked-by: NAristeu Rozanski <aris@redhat.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com

10320950

28 4月, 2020 2 次提交

EDAC/i10nm: Update driver to support different bus number config register offsets · ce206708

由 Qiuxu Zhuo 提交于 4月 24, 2020

The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville
servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from
stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville
servers with CPU stepping >=4, the offset for bus number configuration
register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the
steppings use the updated 0xd0 offset.

Fix the issue by using the appropriate offset for bus number
configuration register according to the CPU model number and stepping.
Reported-by: NJerry Chen <jerry.t.chen@intel.com>
Reported-and-tested-by: NJin Wen <wen.jin@intel.com>
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic

ce206708

EDAC, {skx,i10nm}: Make some configurations CPU model specific · ee5340ab

由 Qiuxu Zhuo 提交于 4月 24, 2020

The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: NQiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic

ee5340ab

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功