Commits · 665f05e0b836d46b22f0b712eb76d8b7f69a5ea0 · Linux-御风守护者 / linux

03 Jun, 2016 · 2 commits

EDAC, sb_edac: Readd accidentally dropped Broadwell-D support · 665f05e0

Committed by Tony Luck on Jun 02, 2016

In commit

  2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection")

we switched from using PCI ids to determine which platform we are
running on to using CPU model instead.

I forgot that Broadwell-DE has its own distinct model number different
from Broadwell-EP or -EX.

Fixing this isn't just adding a line to the array of cpuids - the
existing code assumed a 1:1 mapping between entries in that array and the
"enum type" values. Add the type to the pci_id_table structure to remove
this dependency and allow two Broadwell CPU models.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Fixes: 2c1ea4c7 ("EDAC, sb_edac: Use cpu family/model in driver detection")
Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>

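A minimal sketch of the table change described above, using a simplified userspace model. The `enum type` values, the `pci_id_table` layout with a `cpu_model` field, the `lookup_type()` helper and the model numbers are all illustrative assumptions, not the driver's actual definitions; check real model numbers against the kernel's `intel-family.h`. Because each entry carries its own type, two CPU models can both map to `BROADWELL` without relying on array position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the driver's "enum type". */
enum type { SANDY_BRIDGE, IVY_BRIDGE, HASWELL, BROADWELL, KNIGHTS_LANDING, TYPE_NONE };

/* Hypothetical, simplified pci_id_table: each entry stores its own type
 * instead of depending on its index in the array. */
struct pci_id_table {
	unsigned int cpu_model;	/* family 6 model number (illustrative values) */
	enum type type;
};

static const struct pci_id_table tables[] = {
	{ 0x2D, SANDY_BRIDGE },
	{ 0x3E, IVY_BRIDGE },
	{ 0x3F, HASWELL },
	{ 0x4F, BROADWELL },		/* Broadwell-EP/EX */
	{ 0x56, BROADWELL },		/* Broadwell-DE: second model, same type */
	{ 0x57, KNIGHTS_LANDING },
};

/* Return the driver type for a CPU model, or TYPE_NONE if unsupported. */
static enum type lookup_type(unsigned int cpu_model)
{
	for (size_t i = 0; i < sizeof(tables) / sizeof(tables[0]); i++)
		if (tables[i].cpu_model == cpu_model)
			return tables[i].type;
	return TYPE_NONE;
}
```

With a positional 1:1 mapping, adding the second Broadwell row would have shifted every later entry's implied type; here it does not.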

EDAC, sb_edac: Fix rank lookup on Broadwell · c7103f65

Committed by Tony Luck on May 31, 2016

Broadwell made a small change to the rank target register moving the
target rank ID field up from bits 16:19 to bits 20:23.

Also found that the offset field grew by one bit in the IVY_BRIDGE to
HASWELL transition, so fix the RIR_OFFSET() macro too.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: stable@vger.kernel.org # v3.19+
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/2943fb819b1f7e396681165db9c12bb3df0e0b16.1464735623.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>

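A minimal sketch of the two field changes, assuming a 32-bit RIR register value. `get_bitfield()`, `rir_rnk_tgt()` and `rir_offset()` are illustrative helpers written in the spirit of the driver's `GET_BITFIELD()`-based macros. The offset field bounds (2:14 before Haswell, 2:15 from Haswell on) are an assumption used here to show a field that grew by one bit.

```c
#include <assert.h>
#include <stdint.h>

enum type { SANDY_BRIDGE, IVY_BRIDGE, HASWELL, BROADWELL };

/* Extract bits lo..hi (inclusive) of v. */
static uint32_t get_bitfield(uint32_t v, unsigned int lo, unsigned int hi)
{
	uint32_t width;

	assert(lo <= hi && hi < 32);
	width = hi - lo + 1;
	if (width == 32)
		return v;
	return (v >> lo) & ((1u << width) - 1);
}

/* Target rank ID: bits 16:19 before Broadwell, bits 20:23 on Broadwell. */
static uint32_t rir_rnk_tgt(enum type t, uint32_t reg)
{
	return t == BROADWELL ? get_bitfield(reg, 20, 23)
			      : get_bitfield(reg, 16, 19);
}

/* Offset: one bit wider from Haswell on (bounds assumed for illustration). */
static uint32_t rir_offset(enum type t, uint32_t reg)
{
	return (t == HASWELL || t == BROADWELL) ? get_bitfield(reg, 2, 15)
						: get_bitfield(reg, 2, 14);
}
```

For a register value of `0x00A30000`, a Broadwell decode yields rank 0xA, while the pre-fix bit positions would yield 0x3.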

03 May, 2016 · 1 commit

EDAC, sb_edac: Use cpu family/model in driver detection · 2c1ea4c7

Committed by Tony Luck on Apr 28, 2016

Instead of picking a random PCI ID from the dozen or so we need to
access, just use x86_match_cpu() to pick based on CPU model number. The
choosing of PCI devices has been problematic in the past, see

  11249e73 ("sb_edac: Fix detection on SNB machines")

which fixed problems introduced by

  d0585cd8 ("sb_edac: Claim a different PCI device").

This is especially ugly if future hardware might not even have
EDAC-relevant registers in PCI config space and we would still be
required to choose some "random" PCI devices to scan for just so our
driver loads.

Is this cleaner/clearer? It deletes much more code than it adds. Only
tested on Broadwell. The driver loads/unloads and loads again. Still
decodes errors too.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Borislav Petkov <bp@suse.de>

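Matching on CPU model starts from the CPUID signature, such as the `PROCESSOR 0:306f3` value printed in the MCE logs elsewhere in this history. Below is a minimal sketch of the family/model decoding, assuming the standard CPUID leaf 1 EAX layout. `x86_family_of()` and `x86_model_of()` are illustrative names, not the kernel's own helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the x86 family from a CPUID leaf 1 EAX signature. */
static unsigned int x86_family_of(uint32_t sig)
{
	unsigned int fam = (sig >> 8) & 0xf;

	if (fam == 0xf)
		fam += (sig >> 20) & 0xff;	/* extended family */
	return fam;
}

/* Decode the x86 model, folding in the extended model for family >= 6. */
static unsigned int x86_model_of(uint32_t sig)
{
	unsigned int model = (sig >> 4) & 0xf;

	if (x86_family_of(sig) >= 6)
		model += ((sig >> 16) & 0xf) << 4;	/* extended model */
	return model;
}
```

Signature 0x306f3 decodes to family 6, model 0x3F (Haswell). Signature 0x306e4 decodes to model 0x3E (Ivy Bridge).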

29 Apr, 2016 · 1 commit

EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback · c4fc1956

Committed by Tony Luck on Apr 29, 2016

Both of these drivers can return NOTIFY_BAD, but this terminates
processing other callbacks that were registered later on the chain.
Since the driver did nothing to log the error it seems wrong to prevent
other interested parties from seeing it. E.g. neither of them had even
bothered to check the type of the error to see if it was a memory error
before returning NOTIFY_BAD.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/72937355dd92318d2630979666063f8a2853495b.1461864507.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>

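Why NOTIFY_BAD hurts later callbacks can be shown with a simplified model of the notifier chain's stop rule. The constants below use the same values as the kernel's `include/linux/notifier.h`. `call_chain()` and the three callbacks are hypothetical stand-ins, not the kernel's `notifier_call_chain()` itself.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as the kernel's include/linux/notifier.h. */
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

typedef int (*notifier_fn)(unsigned long val, void *data);

/* Simplified chain walk: stop as soon as a callback sets NOTIFY_STOP_MASK,
 * so callbacks registered later never see the event. */
static int call_chain(notifier_fn *chain, size_t n, unsigned long val,
		      void *data, size_t *nr_calls)
{
	int ret = NOTIFY_DONE;

	*nr_calls = 0;
	for (size_t i = 0; i < n; i++) {
		ret = chain[i](val, data);
		(*nr_calls)++;
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Old behaviour: NOTIFY_BAD hides the error from later callbacks. */
static int old_decoder(unsigned long val, void *data)
{
	(void)val;
	(void)data;
	return NOTIFY_BAD;
}

/* Fixed behaviour: let the event continue down the chain. */
static int fixed_decoder(unsigned long val, void *data)
{
	(void)val;
	(void)data;
	return NOTIFY_DONE;
}

/* Stand-in for a callback registered later on the chain. */
static int later_logger(unsigned long val, void *data)
{
	(void)val;
	(void)data;
	return NOTIFY_OK;
}
```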

23 Apr, 2016 · 1 commit

EDAC, sb_edac: Remove double buffering of error records · ad08c4e9

Committed by Tony Luck on Apr 15, 2016

In the bad old days the functions from x86_mce_decoder_chain could be
called in machine check context, so we used to carefully copy the error
records and defer processing until later. But in

f29a7aff ("x86/mce: Avoid potential deadlock due to printk() in MCE context")

we switched the logging code to save the record in a genpool, and call
the functions that registered to be notified later from a work queue.

So drop all the double buffering and do all the work we want to do as
soon as sbridge_mce_check_error() is called.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: patrickg@supermicro.com
Link: http://lkml.kernel.org/r/100025611cd780d9bca72792b2b2146760da53e0.1460756761.git.tony.luck@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>


22 Apr, 2016 · 2 commits

x86 EDAC, sb_edac.c: Take account of channel hashing when needed · ea5dfb5f

Committed by Tony Luck on Apr 14, 2016

Haswell and Broadwell can be configured to hash the channel
interleave function using bits [27:12] of the physical address.

On those processor models we must check to see if hashing is
enabled (bit 21 of the HASWELL_HASYSDEFEATURE2 register) and
act accordingly.

Based on a patch by patrickg <patrickg@supermicro.com>
Tested-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

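Below is a minimal sketch of the hash, assuming a 2-bit channel index `idx` and the raw HASYSDEFEATURE2 register value. It follows the XOR folding that the driver's `haswell_chan_hash()` is believed to perform: even address bits 12..26 fold into index bit 0, and odd bits 13..27 fold into bit 1. Treat it as illustrative. `channel_index()` is a hypothetical wrapper that applies the hash only when bit 21 is set.

```c
#include <assert.h>
#include <stdint.h>

/* XOR address bits 27:12 into a 2-bit channel index, two bits at a time. */
static int chan_hash(int idx, uint64_t addr)
{
	assert(idx >= 0 && idx < 4);	/* idx is a 2-bit channel index */
	for (int i = 12; i < 28; i += 2)
		idx ^= (int)((addr >> i) & 3);
	return idx;
}

/* Apply hashing only when bit 21 of HASYSDEFEATURE2 is set. */
static int channel_index(int idx, uint64_t addr, uint32_t hasysdefeature2)
{
	if (hasysdefeature2 & (1u << 21))
		return chan_hash(idx, addr);
	return idx;
}
```

An address with only bit 12 set moves index 0 to 1 when hashing is enabled. With bits 12 and 14 both set, the two contributions cancel.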

x86 EDAC, sb_edac.c: Repair damage introduced when "fixing" channel address · ff15e95c

Committed by Tony Luck on Apr 14, 2016

In commit:

  eb1af3b7 ("Fix computation of channel address")

I switched the "sck_way" variable from holding the log2 value read
from the h/w to instead be the actual number. Unfortunately it
is needed in log2 form when used to shift the address.
Tested-by: Patrick Geary <patrickg@supermicro.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: eb1af3b7 ("Fix computation of channel address")
Signed-off-by: Ingo Molnar <mingo@kernel.org>

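The log2 form matters because dividing an address by a power-of-two way count is a right shift by the log2 of that count. A minimal sketch with hypothetical helper names (`ilog2_u32()`, `remove_interleave()`) follows. The driver's real channel-address computation involves more terms than this.

```c
#include <assert.h>
#include <stdint.h>

/* Integer log2 of a power of two. */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	assert(v != 0 && (v & (v - 1)) == 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Divide out an N-way interleave: the shift amount is log2(ways).
 * Shifting by the way count itself would over-shift the address. */
static uint64_t remove_interleave(uint64_t addr, uint32_t ways)
{
	return addr >> ilog2_u32(ways);
}
```

For 4 ways, the correct shift is 2 (0x1000 becomes 0x400). Shifting by 4 would give 0x100.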

11 Mar, 2016 · 1 commit

EDAC/sb_edac: Fix computation of channel address · eb1af3b7

Committed by Luck, Tony on Mar 09, 2016

Large memory Haswell-EX systems with multiple DIMMs per channel were
sometimes reporting the wrong DIMM.

Found three problems:

 1) Debug printouts for socket and channel interleave were not interpreting
    the register fields correctly. The socket interleave field is a 2^X
    value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2,
    2=3, 3=4).

 2) Actual use of the socket interleave value didn't interpret it as 2^X.

 3) Conversion of address to channel address was complicated, and wrong.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

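The two encodings from item 1 can be sketched as decoding helpers. The function names are hypothetical, and the raw 2-bit field values are assumed to have already been extracted from the TAD register.

```c
#include <assert.h>

/* Socket interleave is a power-of-two encoding: 0=1, 1=2, 2=4, 3=8. */
static unsigned int sck_ways(unsigned int field)
{
	assert(field < 4);
	return 1u << field;
}

/* Channel interleave is X+1: 0=1, 1=2, 2=3, 3=4. */
static unsigned int chn_ways(unsigned int field)
{
	assert(field < 4);
	return field + 1;
}
```

The same raw value 2 means 4 sockets but 3 channels, which is exactly the difference the debug printouts were getting wrong.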

08 Mar, 2016 · 1 commit

EDAC, sb_edac: Fix logic when computing DIMM sizes on Xeon Phi · 83bdaad4

Committed by Hubert Chrzaniuk on Mar 07, 2016

Correct a typo introduced by

  d0cdf900 ("EDAC, sb_edac: Add Knights Landing (Xeon Phi gen 2) support")

As a result, under some configurations DIMMs were not correctly
recognized. The problem affects only the Xeon Phi architecture.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1457361045-26221-1-git-send-email-hubert.chrzaniuk@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>


11 Dec, 2015 · 1 commit

EDAC, sb_edac: Set fixed DIMM width on Xeon Knights Landing · 45f4d3ab

Committed by Hubert Chrzaniuk on Dec 11, 2015

Knights Landing does not have a register that could be used to fetch
the DIMM width. However, the value is fixed for this architecture, so it
can be hardcoded.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449840082-18673-1-git-send-email-hubert.chrzaniuk@intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>


06 Dec, 2015 · 3 commits

EDAC, sb_edac: Add Knights Landing (Xeon Phi gen 2) support · d0cdf900

Committed by Jim Snow on Dec 03, 2015

Knights Landing is the next-generation architecture for the HPC market.

KNL introduces the concept of a tile and of the CHA (Cache/Home Agent)
for memory accesses.

Some things are fixed in KNL:
- There is a single DIMM slot per channel.
- There are 2 memory controllers with 3 channels each. However, from
  the EDAC standpoint, they are presented as a single memory controller
  with 6 channels. Representing 2 MCs with 3 channels each would
  require a major redesign of the EDAC core driver.

Basically, two functionalities are added/extended:
- during driver initialization, the KNL topology is recognized, i.e.
  which channels are populated with which DIMM sizes
  (knl_get_dimm_capacity function)
- MCE error handling: channel swizzling
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-5-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>

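One way to present 2 memory controllers with 3 channels each as a single controller with 6 channels is to flatten the pair into one index. This is a hypothetical mapping for illustration only: `knl_edac_channel()` and `knl_split_channel()` are invented names, the commit does not state the exact numbering the driver uses, and channel swizzling is not modelled here.

```c
#include <assert.h>

#define KNL_MCS		2
#define KNL_CH_PER_MC	3

/* Flatten (memory controller, channel) into one of 6 EDAC channels. */
static int knl_edac_channel(int mc, int ch)
{
	assert(mc >= 0 && mc < KNL_MCS && ch >= 0 && ch < KNL_CH_PER_MC);
	return mc * KNL_CH_PER_MC + ch;
}

/* Inverse mapping, e.g. to report which physical controller was hit. */
static void knl_split_channel(int edac_ch, int *mc, int *ch)
{
	assert(edac_ch >= 0 && edac_ch < KNL_MCS * KNL_CH_PER_MC);
	*mc = edac_ch / KNL_CH_PER_MC;
	*ch = edac_ch % KNL_CH_PER_MC;
}
```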

EDAC, sb_edac: Add support for duplicate device IDs · c1979ba2

Committed by Jim Snow on Dec 03, 2015

Add options to sbridge_get_all_devices() to allow for duplicate device
IDs and devices that are scattered across multiple PCI buses.
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-4-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>


EDAC, sb_edac: Virtualize several hard-coded functions · c59f9c06

Committed by Jim Snow on Dec 03, 2015

SAD limit, interleave mode and DRAM related functionalities are now
virtualized, so that overriding them is easier.
Signed-off-by: Jim Snow <jim.m.snow@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: lukasz.anaczkowski@intel.com
Link: http://lkml.kernel.org/r/1449136134-23706-3-git-send-email-hubert.chrzaniuk@intel.com
[ Rebase to 4.4-rc3. ]
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>


25 Sep, 2015 · 1 commit

EDAC, sb_edac: Fix TAD presence check for sbridge_mci_bind_devs() · 2900ea60

Committed by Seth Jennings on Aug 05, 2015

In commit

  7d375bff ("sb_edac: Fix support for systems with two home agents per socket")

NUM_CHANNELS was changed to 8 and the channel space was renumbered to
handle EN, EP, and EX configurations.

The *_mci_bind_devs() functions, except for sbridge_mci_bind_devs(),
got a new device presence check in the form of saw_chan_mask. However,
sbridge_mci_bind_devs() still loops over all NUM_CHANNELS entries.

With the increase in NUM_CHANNELS, this loop fails at index 4 since
SB only has 4 TADs. This results in the following error on SB machines:

  EDAC sbridge: Some needed devices are missing
  EDAC sbridge: Couldn't find mci handler
  EDAC sbridge: Couldn't find mci handle

This patch adapts the saw_chan_mask logic for sbridge_mci_bind_devs() as
well.

After this patch:

  EDAC MC0: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#0: DEV 0000:3f:0e.0 (POLLED)
  EDAC MC1: Giving out device to module sbridge_edac.c controller Sandy Bridge Socket#1: DEV 0000:7f:0e.0 (POLLED)
Signed-off-by: Seth Jennings <sjenning@redhat.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.2
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1438798561-10180-1-git-send-email-sjenning@redhat.com
Signed-off-by: Borislav Petkov <bp@suse.de>

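Below is a minimal sketch of a mask-based presence check. It assumes a hypothetical list of TAD indices discovered during device binding. On Sandy Bridge, the expected mask covers its 4 TADs (0x0f) rather than all NUM_CHANNELS slots. `all_tads_present()` is an illustrative name, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHANNELS 8

/* Record which TAD devices were actually found, instead of assuming
 * that all NUM_CHANNELS of them exist, then compare with what the
 * platform is expected to have. */
static bool all_tads_present(const int *found_tad, size_t n,
			     unsigned int expected_mask)
{
	unsigned int saw_chan_mask = 0;

	for (size_t i = 0; i < n; i++) {
		assert(found_tad[i] >= 0 && found_tad[i] < NUM_CHANNELS);
		saw_chan_mask |= 1u << found_tad[i];
	}
	return saw_chan_mask == expected_mask;
}
```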

09 Sep, 2015 · 2 commits

sb_edac: correctly fetch DIMM width on Ivy Bridge and Haswell · 12f0721c

Committed by Aristeu Rozanski on Jun 12, 2015

dimm_dev_type has been incorrectly determined in sb_edac. This patch fixes it
for Ivy Bridge and Haswell only, since nothing like it exists for Sandy Bridge.
We tested this patch on multiple systems, matching the results against the
installed memory modules.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


sb_edac: look harder for DDRIO on Haswell systems · 7179385a

Committed by Aristeu Rozanski on Jun 12, 2015

In case the memory banks are populated so the first channel isn't used, the
DDRIO PCI device won't be visible and it won't be possible to determine the
memory type.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


13 Aug, 2015 · 2 commits

x86/mce: Kill drain_mcelog_buffer() · eef4dfa0

Committed by Borislav Petkov on Aug 12, 2015

This used to flush out MCEs logged during early boot and which
were in the MCA registers from a previous system run. No need
for that now, since we've moved to a genpool.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439396985-12812-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


x86/mce: Remove the MCE ring for Action Optional errors · fd4cf79f

Committed by Chen, Gong on Aug 12, 2015

Use unified genpool to save Action Optional error events and put
Action Optional error handling in the same notification chain as
MCE error decoding.
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
[ Fold in subsequent patch from Boris for early boot logging. ]
Signed-off-by: Tony Luck <tony.luck@intel.com>
[ Correct a lot. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1439396985-12812-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>


03 Jun, 2015 · 3 commits

sb_edac: support for Broadwell -EP and -EX · fa2ce64f

Committed by Tony Luck on May 20, 2015

Basic support for the single socket Broadwell-DE processor
was added back in commit 1f39581a
   sb_edac: Add support for Broadwell-DE processor
This patch extends Broadwell support to cover the two
socket "-EP" and four socket "-EX" versions of Broadwell.
Only tested on the 2 socket - but this code is largely
cloned from the Haswell path.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


sb_edac: Fix support for systems with two home agents per socket · 7d375bff

Committed by Tony Luck on May 18, 2015

First noticed a problem on a 4 socket machine where EDAC only reported
half the DIMMs. Tracked this down to the code that assumes that systems
with two home agents only have two memory channels on each agent. This
is true on 2 socket ("-EP") machines. But four socket ("-EX") machines
have four memory channels on each home agent.

The old code would have had problems on two socket systems as it did
a shuffling trick to make the internals of the code think that the
channels from the first agent were '0' and '1', with the second agent
providing '2' and '3'. But the code didn't uniformly convert from
{ha,channel} tuples to this internal representation.

New code always considers up to eight channels.
On a machine with a single home agent these map easily to edac channels
0, 1, 2, 3. On machines with two home agents we map using:
  edac_channel = 4*ha# + channel
So on a -EP machine where each home agent supports only two channels
we'll fill in channels 0, 1, 4, 5, and on a -EX machine we use all of 0,
1, 2, 3, 4, 5, 6, 7.

[mchehab@osg.samsung.com: fold a fixup patch as per Tony's request and fixed
 a few CodingStyle issues]
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

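The mapping formula above can be written as a small helper. The name `edac_channel()` and the range checks are illustrative. Enumerating two channels per agent reproduces the -EP numbering 0, 1, 4, 5, and four channels per agent covers 0 through 7.

```c
#include <assert.h>

/* Map a {home agent, channel} tuple to an EDAC channel: 4*ha + channel. */
static int edac_channel(int ha, int channel)
{
	assert(ha >= 0 && ha < 2 && channel >= 0 && channel < 4);
	return 4 * ha + channel;
}
```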

sb_edac: Fix a typo and a thinko in address handling for Haswell · bb89e714

Committed by Tony Luck on May 18, 2015

typo: "a7mode" chooses whether to use bits {8, 7, 9} or {8, 7, 6}
in the algorithm to spread access between memory resources. But
the non-a7mode path was incorrectly using GET_BITFIELD(addr, 7, 9)
and so picking bits {9, 8, 7}

thinko: BIT(1) of the dram_rule registers chooses whether to just
use the {8, 7, 6} (or {8, 7, 9}) bits mentioned above as they are,
or to XOR them with bits {18, 17, 16} but the code inverted the
test. We need the additional XOR when dram_rule{1} == 0.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

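A minimal sketch of the corrected selection as the commit describes it. `bits()` and `ha_index()` are hypothetical names, and `dram_rule` is assumed to be the raw register value. The index is read as a 3-bit number whose most significant bit comes first in the `{8, 7, 6}` notation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of v; fields here are at most 3 bits. */
static unsigned int bits(uint64_t v, unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi < 64 && hi - lo < 32);
	return (unsigned int)((v >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

/* 3-bit index from address bits {8,7,6}, or {8,7,9} in a7mode (bit 9
 * takes the place of bit 6). When bit 1 of dram_rule is clear, XOR in
 * address bits {18,17,16}. */
static unsigned int ha_index(uint64_t addr, bool a7mode, uint32_t dram_rule)
{
	unsigned int idx;

	if (a7mode)
		idx = (bits(addr, 7, 8) << 1) | bits(addr, 9, 9);
	else
		idx = bits(addr, 6, 8);

	if (!(dram_rule & (1u << 1)))
		idx ^= bits(addr, 16, 18);
	return idx;
}
```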

09 Feb, 2015 · 1 commit

sb_edac: Fix detection on SNB machines · 11249e73

Committed by Borislav Petkov on Feb 05, 2015

d0585cd8 ("sb_edac: Claim a different PCI device") changed the
probing of sb_edac to look for PCI device 0x3ca0:

3f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
00: 86 80 a0 3c 00 00 00 00 07 00 80 08 00 00 80 00
...

but we're matching for 0x3ca8, i.e. PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_TA
in sbridge_probe(), therefore the probing fails.

Changing it to probe for 0x3ca0 (PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_HA0),
i.e., the 14.0 device, fixes the issue and the driver loads successfully
again:

[ 2449.013120] EDAC DEBUG: sbridge_init:
[ 2449.017029] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
[ 2449.022368] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca0
[ 2449.028498] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
[ 2449.033768] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
[ 2449.039028] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca8
[ 2449.045155] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
...

Add a debug printk while at it to be able to catch the failure in the
future and dump driver version on successful load.

Fixes: d0585cd8 ("sb_edac: Claim a different PCI device")
Cc: stable@vger.kernel.org # 3.18
Acked-by: Aristeu Rozanski <aris@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Borislav Petkov <bp@suse.de>

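The config-space dump in this commit shows why the driver must look for 0x3ca0. The first four bytes, `86 80 a0 3c`, are the little-endian vendor and device IDs. The following minimal sketch decodes them. `cfg_read16()` is an illustrative helper, not a kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* PCI config space is little-endian: vendor ID at offset 0, device ID at 2. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}
```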

03 Dec, 2014 · 2 commits

sb_edac: Fix typo computing number of banks · fec53af5

Committed by Tony Luck on Dec 02, 2014

Code will always think there are 16 banks because of a typo

Reported-by: Misha
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


sb_edac: Add support for Broadwell-DE processor · 1f39581a

Committed by Tony Luck on Dec 02, 2014

Broadwell-DE is the microserver version of next generation Xeon
processors.  A whole bunch of new PCIe device ids, but otherwise
pretty much the same as Haswell.
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


02 Dec, 2014 · 2 commits

sb_edac: Fix discovery of top-of-low-memory for Haswell · f7cf2a22

Committed by Tony Luck on Oct 29, 2014

Haswell moved the TOLM/TOHM registers to a different device and offset.
The sb_edac driver accounted for the change of device, but not for the
new offset. There was also a typo in the constant to fill in the low
26 bits (was 0x1ffffff, should be 0x3ffffff).

This resulted in a bogus value for the top of low memory:

EDAC DEBUG: get_memory_layout: TOLM: 0.032 GB (0x0000000001ffffff)

which would result in EDAC refusing to translate addresses for
errors above the bogus value and below 4GB:

sbridge MC3: HANDLING MCE MEMORY ERROR
sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
sbridge MC3: TSC 0
sbridge MC3: ADDR 2000000
sbridge MC3: MISC 523eac86
sbridge MC3: PROCESSOR 0:306f3 TIME 1414600951 SOCKET 0 APIC 0
MC3: 1 CE Error at TOLM area, on addr 0x02000000 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)

With the fix we see the correct TOLM value:

DEBUG: get_memory_layout: TOLM: 2.048 GB (0x000000007fffffff)

and we decode address 2000000 correctly:

sbridge MC3: HANDLING MCE MEMORY ERROR
sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
sbridge MC3: TSC 0
sbridge MC3: ADDR 2000000
sbridge MC3: MISC 523e1086
sbridge MC3: PROCESSOR 0:306f3 TIME 1414601319 SOCKET 0 APIC 0
DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 0
DEBUG: get_memory_error_data: TAD#0: address 0x0000000002000000 < 0x000000007fffffff, socket interleave 1, channel interleave 4 (offset 0x00000000), index 0, base ch: 0, ch mask: 0x01
DEBUG: get_memory_error_data: RIR#0, limit: 4.095 GB (0x00000000ffffffff), way: 1
DEBUG: get_memory_error_data: RIR#0: channel address 0x00200000 < 0xffffffff, RIR interleave 0, index 0
DEBUG: sbridge_mce_output_error: area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0
MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2000 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

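The low-bits constant determines the result. If the register field supplies address bits 31:26 (an assumption for this sketch), the low 26 bits must be filled with 0x3ffffff. A zero field combined with the old constant 0x1ffffff gives the same value as the bogus TOLM line in the log above. `tolm_from_field()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild TOLM from a field holding address bits 31:26; the low
 * 26 bits are implied all-ones (0x3ffffff, not 0x1ffffff). */
static uint64_t tolm_from_field(uint32_t field_31_26)
{
	return ((uint64_t)field_31_26 << 26) | 0x3ffffff;
}
```

A field value of 0x1f produces 0x7fffffff, matching the corrected 2 GB TOLM in the fixed debug output.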

sb_edac: Fix erroneous bytes->gigabytes conversion · 8c009100

Committed by Jim Snow on Nov 18, 2014

Signed-off-by: Jim Snow <jim.snow@intel.com>
Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


09 Oct, 2014 · 3 commits

sb_edac: Claim a different PCI device · d0585cd8

Committed by Andy Lutomirski on Aug 14, 2014

sb_edac controls a large number of different PCI functions.  Rather
than registering as a normal PCI driver for all of them, it
registers for just one so that it gets probed and, at probe time, it
looks for all the others.

Coincidentally, the device it registers for also contains the SMBUS
registers, so the PCI core will refuse to probe both sb_edac and a
future iMC SMBUS driver.  The drivers don't actually conflict, so
just change sb_edac's device table to probe a different device.

An alternative fix would be to merge the two drivers, but sb_edac
will also refuse to load on non-ECC systems, whereas i2c_imc would
still be useful without ECC.

The only user-visible change should be that sb_edac appears to bind
a different device.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Rui Wang <ruiv.wang@gmail.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


Move Intel SNB device ids from sb_edac to pci_ids.h · 68939df1

Committed by Andy Lutomirski on Aug 14, 2014

The i2c_imc driver will use two of them, and moving only part of
the list seems messier.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>


sb_edac: avoid INTERNAL ERROR message in EDAC with unspecified channel · 351fc4a9

Committed by Seth Jennings on Sep 05, 2014

Intel IA32 SDM Table 15-14 defines channel 0xf as 'not specified', but
EDAC doesn't know about this and reports an INTERNAL ERROR when the
channel is greater than or equal to NUM_CHANNELS:

kernel: [ 1538.886456] CPU 0: Machine Check Exception: 0 Bank 1: 940000000000009f
kernel: [ 1538.886669] TSC 2bc68b22e7e812 ADDR 46dae7000 MISC 0 PROCESSOR 0:306e4 TIME 1390414572 SOCKET 0 APIC 0
kernel: [ 1538.971948] EDAC MC1: INTERNAL ERROR: channel value is out of range (15 >= 4)
kernel: [ 1538.972203] EDAC MC1: 0 CE memory read error on unknown memory (slot:0 page:0x46dae7 offset:0x0 grain:0 syndrome:0x0 - area:DRAM err_code:0000:009f socket:1 channel_mask:1 rank:0)

This commit changes sb_edac to forward a channel of -1 to EDAC if the
channel is not specified. edac_mc_handle_error() sets the channel to -1
internally after the error message anyway, so this commit should have no
effect other than avoiding the INTERNAL ERROR message when the channel
is not specified.
Signed-off-by: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>

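The following minimal sketch assumes the compound memory-controller error code layout from the Intel SDM (`0000 0000 1MMM CCCC`, with the channel in bits 3:0). `mcacod_channel()` is an illustrative name. The `err_code:0000:009f` in the log above decodes to channel 0xf, which is forwarded as -1.

```c
#include <assert.h>
#include <stdint.h>

/* Memory-controller MCA error codes look like 0000 0000 1MMM CCCC;
 * CCCC is the channel and 0xf means "not specified". */
static int mcacod_channel(uint16_t mcacod)
{
	int ch = mcacod & 0xf;

	return ch == 0xf ? -1 : ch;
}
```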

27 Jun, 2014 · 9 commits

sb_edac: add support for Haswell based systems · 50d1bb93

Committed by Aristeu Rozanski on Jun 20, 2014

Haswell memory controllers are very similar to Ivy Bridge and Sandy Bridge
ones. This patch adds support for Haswell based systems.

[m.chehab@samsung.com: Fix CodingStyle issues]
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: Fix mix tab/spaces alignments · c41afdca

Committed by Mauro Carvalho Chehab on Jun 26, 2014

We should not have spaces before ^I on alignments.
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: remove bogus assumption on mc ordering · adc61bcd

Committed by Aristeu Rozanski on Jun 02, 2014

When an MC is handled, the correct sbridge_dev is looked up based on the
node. Checking again later, on the assumption that the first memory
controller found is the first socket's memory controller, is bogus.
Get rid of it.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: make minimal use of channel_mask · d7c660b7

Committed by Aristeu Rozanski on Jun 02, 2014

channel_mask will be used in the future to determine which group of memory
modules is causing the errors, since that can't otherwise be determined when
mirroring, lockstep or close page mode is enabled. Until then, use the
channel_mask to determine the channel instead of relying on the MC
event/exception.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>

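The following hypothetical helper shows one way to pick a channel from `channel_mask`: take the lowest set bit. The commit does not say which bit the driver uses when several are set, so treat this as an illustration only. A mask of 1, as in the `channel_mask:1` log lines elsewhere in this history, maps to channel 0.

```c
#include <assert.h>

/* Return the lowest channel set in channel_mask, or -1 if it is empty. */
static int channel_from_mask(unsigned int channel_mask)
{
	for (int ch = 0; ch < 8; ch++)
		if (channel_mask & (1u << ch))
			return ch;
	return -1;
}
```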

sb_edac: fix socket detection on Ivy Bridge controllers · 2ff3a308

Committed by Aristeu Rozanski on Jun 02, 2014

This patch fixes the obvious bug while handling the socket/HA bitmask used in
Ivy Bridge memory controllers.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: search devices using product id · dbc954dd

Committed by Aristeu Rozanski on Jun 02, 2014

This patch changes the way devices are searched, using the product id instead
of device/function numbers. Tested on a Sandy Bridge and an Ivy Bridge machine
to make sure everything works properly.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: make RIR limit retrieval per model · b976bcf2

Committed by Aristeu Rozanski on Jun 02, 2014

Haswell retrieves RIR limits differently, so make this procedure per
model.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: make node id retrieval per model · f14d6892

Committed by Aristeu Rozanski on Jun 02, 2014

Haswell retrieves the node id differently, so make this procedure
reimplementable per model.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: make memory type detection per memory controller · 9e375446

Committed by Aristeu Rozanski on Jun 02, 2014

Haswell uses a different register and offset to determine the memory type and
supports DDR4 in some models. This patch makes it easier to have a different
method depending on the memory controller type.

Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


13 Mar, 2014 · 2 commits

sb_edac: mark MCE messages as KERN_DEBUG · 49856dc9

Committed by Aristeu Rozanski on Mar 11, 2014

Since the driver is decoding the MCE, it's useless to have these
messages printed unless you're debugging a problem in the driver.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>


sb_edac: use "event" instead of "exception" when MC wasnt signaled · cf40f80c

Committed by Aristeu Rozanski on Mar 11, 2014

Corrected Errors are MC events, not exceptions, and reporting them as the
latter might confuse users.
Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>

