- 08 8月, 2016 1 次提交
-
-
由 Lukasz Odzioba 提交于
On Intel Xeon Phi Knights Landing processor family the channels of the memory controller have untypical arrangement - MC0 is mapped to CH3,4,5 and MC1 is mapped to CH0,1,2. This causes the EDAC driver to report the channel name incorrectly. We missed this change earlier, so the code already contains similar comment, but the translation function is incorrect. Without this patch: errors in DIMM_A and DIMM_D were reported in DIMM_D errors in DIMM_B and DIMM_E were reported in DIMM_E errors in DIMM_C and DIMM_F were reported in DIMM_F Correct this. Hubert Chrzaniuk: - rebased to 4.8 - comments and code cleanup Fixes: d0cdf900 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support") Reviewed-by: NTony Luck <tony.luck@intel.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Cc: lukasz.odzioba@intel.com Cc: mchehab@kernel.org Cc: <stable@vger.kernel.org> # v4.5.. Link: http://lkml.kernel.org/r/1469231089-22837-1-git-send-email-lukasz.odzioba@intel.comSigned-off-by: NLukasz Odzioba <lukasz.odzioba@intel.com> [ Boris: Simplify a bit by removing char mc. ] Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 16 7月, 2016 1 次提交
-
-
由 Tony Luck 提交于
In commit 2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection") I broke Knights Landing because I failed to notice that it called a wrapper macro "sbridge_get_all_devices_knl" instead of "sbridge_get_all_devices" like all the other types. Now that we include the processor type in the pci_id_table structure we can skip the wrappers and just have the sbridge_get_all_devices() check the type to decide whether to allow duplicate devices and controllers to have registers spread across buses. Fixes: 2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection") Tested-by: NLukasz Odzioba <lukasz.odzioba@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 6月, 2016 2 次提交
-
-
由 Tony Luck 提交于
In commit 2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection") we switched from using PCI ids to determine which platform we are running on to using CPU model instead. I forgot that Broadwell-DE has its own distinct model number different from Broadwell-EP or -EX. Fixing this isn't just adding a line to the array of cpuids - the exising code assumed a 1:1 mapping between entries in that array and the "enum type" values. Added the type to pci_id_table structure to remove this dependency and allows two Broadwell cpu models. Signed-off-by: NTony Luck <tony.luck@intel.com> Cc: Aristeu Rozanski <arozansk@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Fixes: 2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection") Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
由 Tony Luck 提交于
Broadwell made a small change to the rank target register moving the target rank ID field up from bits 16:19 to bits 20:23. Also found that the offset field grew by one bit in the IVY_BRIDGE to HASWELL transition, so fix the RIR_OFFSET() macro too. Signed-off-by: NTony Luck <tony.luck@intel.com> Cc: stable@vger.kernel.org # v3.19+ Cc: Aristeu Rozanski <arozansk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/2943fb819b1f7e396681165db9c12bb3df0e0b16.1464735623.git.tony.luck@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 03 5月, 2016 1 次提交
-
-
由 Tony Luck 提交于
Instead of picking a random PCI ID from the dozen or so we need to access, just use x86_match_cpu() to pick based on CPU model number. The choosing of PCI devices has been problematic in the past, see 11249e73 ("sb_edac: Fix detection on SNB machines") which fixed problems introduced by d0585cd8 ("sb_edac: Claim a different PCI device"). This is especially ugly if future hardware might not even have EDAC-relevant registers in PCI config space and we would still be required to choose some "random" PCI devices to scan for just so our driver loads. Is this cleaner/clearer? It deletes much more code than it adds. Only tested on Broadwell. The driver loads/unloads and loads again. Still decodes errors too. Signed-off-by: NTony Luck <tony.luck@intel.com> Suggested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 29 4月, 2016 1 次提交
-
-
由 Tony Luck 提交于
Both of these drivers can return NOTIFY_BAD, but this terminates processing other callbacks that were registered later on the chain. Since the driver did nothing to log the error it seems wrong to prevent other interested parties from seeing it. E.g. neither of them had even bothered to check the type of the error to see if it was a memory error before the return NOTIFY_BAD. Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Acked-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/72937355dd92318d2630979666063f8a2853495b.1461864507.git.tony.luck@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 23 4月, 2016 1 次提交
-
-
由 Tony Luck 提交于
In the bad old days the functions from x86_mce_decoder_chain could be called in machine check context. So we used to carefully copy them and defer processing until later. But in f29a7aff ("x86/mce: Avoid potential deadlock due to printk() in MCE context") we switched the logging code to save the record in a genpool, and call the functions that registered to be notified later from a work queue. So drop all the double buffering and do all the work we want to do as soon as sbridge_mce_check_error() is called. Signed-off-by: NTony Luck <tony.luck@intel.com> Cc: Aristeu Rozanski <arozansk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: patrickg@supermicro.com Link: http://lkml.kernel.org/r/100025611cd780d9bca72792b2b2146760da53e0.1460756761.git.tony.luck@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 22 4月, 2016 2 次提交
-
-
由 Tony Luck 提交于
Haswell and Broadwell can be configured to hash the channel interleave function using bits [27:12] of the physical address. On those processor models we must check to see if hashing is enabled (bit21 of the HASWELL_HASYSDEFEATURE2 register) and act accordingly. Based on a patch by patrickg <patrickg@supermicro.com> Tested-by: NPatrick Geary <patrickg@supermicro.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Aristeu Rozanski <arozansk@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tony Luck 提交于
In commit: eb1af3b7 ("Fix computation of channel address") I switched the "sck_way" variable from holding the log2 value read from the h/w to instead be the actual number. Unfortunately it is needed in log2 form when used to shift the address. Tested-by: NPatrick Geary <patrickg@supermicro.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Aristeu Rozanski <arozansk@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac@vger.kernel.org Cc: stable@vger.kernel.org Fixes: eb1af3b7 ("Fix computation of channel address") Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 11 3月, 2016 1 次提交
-
-
由 Luck, Tony 提交于
Large memory Haswell-EX systems with multiple DIMMs per channel were sometimes reporting the wrong DIMM. Found three problems: 1) Debug printouts for socket and channel interleave were not interpreting the register fields correctly. The socket interleave field is a 2^X value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2, 2=3. 3=4). 2) Actual use of the socket interleave value didn't interpret as 2^X 3) Conversion of address to channel address was complicated, and wrong. Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <arozansk@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 3月, 2016 1 次提交
-
-
由 Hubert Chrzaniuk 提交于
Correct a typo introduced by d0cdf900 ("EDAC, sb_edac: Add Knights Landing (Xeon Phi gen 2) support") As a result under some configurations DIMMs were not correctly recognized. Problem affects only Xeon Phi architecture. Signed-off-by: NHubert Chrzaniuk <hubert.chrzaniuk@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Link: http://lkml.kernel.org/r/1457361045-26221-1-git-send-email-hubert.chrzaniuk@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 11 12月, 2015 1 次提交
-
-
由 Hubert Chrzaniuk 提交于
Knights Landing does not come with register that could be used to fetch DIMM width. However the value is fixed for this architecture so it can be hardcoded. Signed-off-by: NHubert Chrzaniuk <hubert.chrzaniuk@intel.com> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Link: http://lkml.kernel.org/r/1449840082-18673-1-git-send-email-hubert.chrzaniuk@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 06 12月, 2015 3 次提交
-
-
由 Jim Snow 提交于
Knights Landing is the next generation architecture for HPC market. KNL introduces concept of a tile and CHA - Cache/Home Agent for memory accesses. Some things are fixed in KNL: () There's single DIMM slot per channel () There's 2 memory controllers with 3 channels each, however, from EDAC standpoint, it is presented as single memory controller with 6 channels. In order to represent 2 MCs w/ 3 CH, it would require major redesign of EDAC core driver. Basically, two functionalities are added/extended: () during driver initialization KNL topology is being recognized, i.e. which channels are populated with what DIMM sizes (knl_get_dimm_capacity function) () handle MCE errors - channel swizzling Reviewed-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NJim Snow <jim.m.snow@intel.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Link: http://lkml.kernel.org/r/1449136134-23706-5-git-send-email-hubert.chrzaniuk@intel.com [ Rebase to 4.4-rc3. ] Signed-off-by: NHubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Jim Snow 提交于
Add options to sbridge_get_all_devices() to allow for duplicate device IDs and devices that are scattered across mulitple PCI buses. Signed-off-by: NJim Snow <jim.m.snow@intel.com> Acked-by: NTony Luck <tony.luck@intel.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Link: http://lkml.kernel.org/r/1449136134-23706-4-git-send-email-hubert.chrzaniuk@intel.com [ Rebase to 4.4-rc3. ] Signed-off-by: NHubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Jim Snow 提交于
SAD limit, interleave mode and DRAM related functionalities are now virtualized, so that overriding them is easier. Signed-off-by: NJim Snow <jim.m.snow@intel.com> Acked-by: NTony Luck <tony.luck@intel.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lukasz.anaczkowski@intel.com Link: http://lkml.kernel.org/r/1449136134-23706-3-git-send-email-hubert.chrzaniuk@intel.com [ Rebase to 4.4-rc3. ] Signed-off-by: NHubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 25 9月, 2015 1 次提交
-
-
由 Seth Jennings 提交于
In commit 7d375bff ("sb_edac: Fix support for systems with two home agents per socket") NUM_CHANNELS was changed to 8 and the channel space was renumerated to handle EN, EP, and EX configurations. The *_mci_bind_devs() functions - except for sbridge_mci_bind_devs() - got a new device presence check in the form of saw_chan_mask. However, sbridge_mci_bind_devs() still uses the NUM_CHANNELS for loop. With the increase in NUM_CHANNELS, this loop fails at index 4 since SB only has 4 TADs. This results in the following error on SB machines: EDAC sbridge: Some needed devices are missing EDAC sbridge: Couldn't find mci handler EDAC sbridge: Couldn't find mci handle This patch adapts the saw_chan_mask logic for sbridge_mci_bind_devs() as well. After this patch: EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (POLLED) EDAC MC1: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#1: DEV 0000:7f:0e.0 (POLLED) Signed-off-by: NSeth Jennings <sjenning@redhat.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Acked-by: NTony Luck <tony.luck@intel.com> Tested-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # v4.2 Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1438798561-10180-1-git-send-email-sjenning@redhat.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 09 9月, 2015 2 次提交
-
-
由 Aristeu Rozanski 提交于
dimm_dev_type has been incorrectly determined in sb_edac. This patch fixes it for Ivy Bridge and Haswell only since nothing like exists for Sandy Bridge. We tested this patch in multiple systems matching the results with the installed memory modules. Acked-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Aristeu Rozanski 提交于
In case the memory banks are populated so the first channel isn't used, the DDRIO PCI device won't be visible and it won't be possible to determine the memory type. Acked-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
- 13 8月, 2015 2 次提交
-
-
由 Borislav Petkov 提交于
This used to flush out MCEs logged during early boot and which were in the MCA registers from a previous system run. No need for that now, since we've moved to a genpool. Suggested-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439396985-12812-7-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Chen, Gong 提交于
Use unified genpool to save Action Optional error events and put Action Optional error handling in the same notification chain as MCE error decoding. Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> [ Fold in subsequent patch from Boris for early boot logging. ] Signed-off-by: NTony Luck <tony.luck@intel.com> [ Correct a lot. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439396985-12812-5-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 6月, 2015 3 次提交
-
-
由 Tony Luck 提交于
Basic support for the single socket Broadwell-DE processor was added back in commit 1f39581a sb_edac: Add support for Broadwell-DE processor This patch extends Broadwell support to cover the two socket "-EP" and four socket "-EX" versions of Broadwell. Only tested on the 2 socket - but this code is largely cloned from the Haswell path. Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Tony Luck 提交于
First noticed a problem on a 4 socket machine where EDAC only reported half the DIMMS. Tracked this down to the code that assumes that systems with two home agents only have two memory channels on each agent. This is true on 2 sockect ("-EP") machines. But four socket ("-EX") machines have four memory channels on each home agent. The old code would have had problems on two socket systems as it did a shuffling trick to make the internals of the code think that the channels from the first agent were '0' and '1', with the second agent providing '2' and '3'. But the code didn't uniformly convert from {ha,channel} tuples to this internal representation. New code always considers up to eight channels. On a machine with a single home agent these map easily to edac channels 0, 1, 2, 3. On machines with two home agents we map using: edac_channel = 4*ha# + channel So on a -EP machine where each home agent supports only two channels we'll fill in channels 0, 1, 4, 5, and on a -EX machine we use all of 0, 1, 2, 3, 4, 5, 6, 7. [mchehab@osg.samsung.com: fold a fixup patch as per Tony's request and fixed a few CodingStyle issues] Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Tony Luck 提交于
typo: "a7mode" chooses whether to use bits {8, 7, 9} or {8, 7, 6} in the algorithm to spread access between memory resources. But the non-a7mode path was incorrectly using GET_BITFIELD(addr, 7, 9) and so picking bits {9, 8, 7} thinko: BIT(1) of the dram_rule registers chooses whether to just use the {8, 7, 6} (or {8, 7, 9}) bits mentioned above as they are, or to XOR them with bits {18, 17, 16} but the code inverted the test. We need the additional XOR when dram_rule{1} == 0. Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
- 09 2月, 2015 1 次提交
-
-
由 Borislav Petkov 提交于
d0585cd8 ("sb_edac: Claim a different PCI device") changed the probing of sb_edac to look for PCI device 0x3ca0: 3f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07) 00: 86 80 a0 3c 00 00 00 00 07 00 80 08 00 00 80 00 ... but we're matching for 0x3ca8, i.e. PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_TA in sbridge_probe() therefore the probing fails. Changing it to probe for 0x3ca0 (PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_HA0), .i.e., the 14.0 device, fixes the issue and driver loads successfully again: [ 2449.013120] EDAC DEBUG: sbridge_init: [ 2449.017029] EDAC sbridge: Seeking for: PCI ID 8086:3ca0 [ 2449.022368] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca0 [ 2449.028498] EDAC sbridge: Seeking for: PCI ID 8086:3ca0 [ 2449.033768] EDAC sbridge: Seeking for: PCI ID 8086:3ca8 [ 2449.039028] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca8 [ 2449.045155] EDAC sbridge: Seeking for: PCI ID 8086:3ca8 ... Add a debug printk while at it to be able to catch the failure in the future and dump driver version on successful load. Fixes: d0585cd8 ("sb_edac: Claim a different PCI device") Cc: stable@vger.kernel.org # 3.18 Acked-by: NAristeu Rozanski <aris@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Acked-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NMauro Carvalho Chehab <m.chehab@samsung.com> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 03 12月, 2014 2 次提交
-
-
由 Tony Luck 提交于
Code will always think there are 16 banks because of a typo Reported-by: Misha Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Tony Luck 提交于
Broadwell-DE is the microserver version of next generation Xeon processors. A whole bunch of new PCIe device ids, but otherwise pretty much the same as Haswell. Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
- 02 12月, 2014 2 次提交
-
-
由 Tony Luck 提交于
Haswell moved the TOLM/TOHM registers to a different device and offset. The sb_edac driver accounted for the change of device, but not for the new offset. There was also a typo in the constant to fill in the low 26 bits (was 0x1ffffff, should be 0x3ffffff). This resulted in a bogus value for the top of low memory: EDAC DEBUG: get_memory_layout: TOLM: 0.032 GB (0x0000000001ffffff) which would result in EDAC refusing to translate addresses for errors above the bogus value and below 4GB: sbridge MC3: HANDLING MCE MEMORY ERROR sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090 sbridge MC3: TSC 0 sbridge MC3: ADDR 2000000 sbridge MC3: MISC 523eac86 sbridge MC3: PROCESSOR 0:306f3 TIME 1414600951 SOCKET 0 APIC 0 MC3: 1 CE Error at TOLM area, on addr 0x02000000 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0) With the fix we see the correct TOLM value: DEBUG: get_memory_layout: TOLM: 2.048 GB (0x000000007fffffff) and we decode address 2000000 correctly: sbridge MC3: HANDLING MCE MEMORY ERROR sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090 sbridge MC3: TSC 0 sbridge MC3: ADDR 2000000 sbridge MC3: MISC 523e1086 sbridge MC3: PROCESSOR 0:306f3 TIME 1414601319 SOCKET 0 APIC 0 DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 0 DEBUG: get_memory_error_data: TAD#0: address 0x0000000002000000 < 0x000000007fffffff, socket interleave 1, channel interleave 4 (offset 0x00000000), index 0, base ch: 0, ch mask: 0x01 DEBUG: get_memory_error_data: RIR#0, limit: 4.095 GB (0x00000000ffffffff), way: 1 DEBUG: get_memory_error_data: RIR#0: channel address 0x00200000 < 0xffffffff, RIR interleave 0, index 0 DEBUG: sbridge_mce_output_error: area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0 MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2000 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0) Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Jim Snow 提交于
Signed-off-by: NJim Snow <jim.snow@intel.com> Signed-off-by: NLukasz Anaczkowski <lukasz.anaczkowski@intel.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
- 09 10月, 2014 3 次提交
-
-
由 Andy Lutomirski 提交于
sb_edac controls a large number of different PCI functions. Rather than registering as a normal PCI driver for all of them, it registers for just one so that it gets probed and, at probe time, it looks for all the others. Coincidentally, the device it registers for also contains the SMBUS registers, so the PCI core will refuse to probe both sb_edac and a future iMC SMBUS driver. The drivers don't actually conflict, so just change sb_edac's device table to probe a different device. An alternative fix would be to merge the two drivers, but sb_edac will also refuse to load on non-ECC systems, whereas i2c_imc would still be useful without ECC. The only user-visible change should be that sb_edac appears to bind a different device. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Cc: Rui Wang <ruiv.wang@gmail.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Andy Lutomirski 提交于
The i2c_imc driver will use two of them, and moving only part of the list seems messier. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NBjorn Helgaas <bhelgaas@google.com> Acked-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
由 Seth Jennings 提交于
Intel IA32 SDM Table 15-14 defines channel 0xf as 'not specified', but EDAC doesn't know about this and returns and INTERNAL ERROR when the channel is greater than NUM_CHANNELS: kernel: [ 1538.886456] CPU 0: Machine Check Exception: 0 Bank 1: 940000000000009f kernel: [ 1538.886669] TSC 2bc68b22e7e812 ADDR 46dae7000 MISC 0 PROCESSOR 0:306e4 TIME 1390414572 SOCKET 0 APIC 0 kernel: [ 1538.971948] EDAC MC1: INTERNAL ERROR: channel value is out of range (15 >= 4) kernel: [ 1538.972203] EDAC MC1: 0 CE memory read error on unknown memory (slot:0 page:0x46dae7 offset:0x0 grain:0 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 channel_mask:1 rank:0) This commit changes sb_edac to forward a channel of -1 to EDAC if the channel is not specified. edac_mc_handle_error() sets the channel to -1 internally after the error message anyway, so this commit should have no effect other than avoiding the INTERNAL ERROR message when the channel is not specified. Signed-off-by: NSeth Jennings <sjenning@redhat.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
-
- 27 6月, 2014 9 次提交
-
-
由 Aristeu Rozanski 提交于
Haswell memory controllers are very similar to Ivy Bridge and Sandy Bridge ones. This patch adds support to Haswell based systems. [m.chehab@samsung.com: Fix CodingStyle issues] Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Mauro Carvalho Chehab 提交于
We should not have spaces before ^I on alignments. Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
When a MC is handled, the correct sbridge_dev is searched based on the node, checking again later with the assumption the first memory controller found is the first socket's memory controller is a bogus assumption. Get rid of it. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
channel_mask will be used in the future to determine which group of memory modules is causing the errors since when mirroring, lockstep and close page are enabled you can't. While that doesn't happen, use the channel_mask to determine the channel instead of relying on the MC event/exception. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
This patch fixes the obvious bug while handling the socket/HA bitmask used in Ivy Bridge memory controllers. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
This patch changes the way devices are searched by using product id instead of device/function numbers. Tested in a Sandy Bridge and a Ivy Bridge machine to make sure everything works properly. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
Haswell has a different way to retrieve RIR limits, make this procedure per model. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
Haswell has a different way to retrieve the node id, make so this procedure can be reimplemented. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
由 Aristeu Rozanski 提交于
Haswell has different register, offset to determine memory type and supports DDR4 in some models. This patch makes it easier to have a different method depending on the memory controller type. Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-