1. 22 2月, 2013 2 次提交
  2. 21 2月, 2013 2 次提交
    • M
      edac: lock module owner to avoid error report conflicts · 80cc7d87
      Mauro Carvalho Chehab 提交于
      APEI GHES and i7core_edac/sb_edac currently can be loaded at
      the same time, but those are Highlander modules:
      	"There can be only one".
      
      There are two reasons for that:
      
      1) Each driver assumes that it is the only one registering at
         the EDAC core, as it is driver's responsibility to number
         the memory controllers, and all of them start from 0;
      
      2) If BIOS is handling the memory errors, the OS can't also be
         doing it, as one will mangle with the other.
      
      So, we need to add an module owner's lock at the EDAC core,
      in order to avoid having two different modules handling memory
      errors at the same time. The best way for doing this lock seems
      to use the driver's name, as this is unique, and won't require
      changes on every driver.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      80cc7d87
    • M
      edac: add a new memory layer type · c66b5a79
      Mauro Carvalho Chehab 提交于
      There are some cases where the memory controller layout is
      completely hidden. This is the case of firmware-driven error
      code, like the one provided by GHES. Add a new layer to be
      used on such memory error report mechanisms.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      c66b5a79
  3. 30 1月, 2013 1 次提交
  4. 21 12月, 2012 1 次提交
  5. 28 11月, 2012 2 次提交
  6. 25 10月, 2012 1 次提交
    • M
      edac: Fix the dimm filling for csrows-based layouts · 24bef66e
      Mauro Carvalho Chehab 提交于
      The driver is currently filling data in a wrong way, on drivers
      for csrows-based memory controller, when the first layer is a
      csrow.
      
      This is not easily to notice, as, in general, memories are
      filed in dual, interleaved, symetric mode, as very few memory
      controllers support asymetric modes.
      
      While digging into a bug for i82795_edac driver, the asymetric
      mode there is now working, allowing us to fill the machine with
      4x1GB ranks at channel 0, and 2x512GB at channel 1:
      
      Channel 0 ranks:
      EDAC DEBUG: i82975x_init_csrows: DIMM A0: from page 0x00000000 to 0x0003ffff (size: 0x00040000 pages)
      EDAC DEBUG: i82975x_init_csrows: DIMM A1: from page 0x00040000 to 0x0007ffff (size: 0x00040000 pages)
      EDAC DEBUG: i82975x_init_csrows: DIMM A2: from page 0x00080000 to 0x000bffff (size: 0x00040000 pages)
      EDAC DEBUG: i82975x_init_csrows: DIMM A3: from page 0x000c0000 to 0x000fffff (size: 0x00040000 pages)
      
      Channel 1 ranks:
      EDAC DEBUG: i82975x_init_csrows: DIMM B0: from page 0x00100000 to 0x0011ffff (size: 0x00020000 pages)
      EDAC DEBUG: i82975x_init_csrows: DIMM B1: from page 0x00120000 to 0x0013ffff (size: 0x00020000 pages)
      
      Instead of properly showing the memories as such, before this patch, it
      shows the memory layout as:
      
                +-----------------------------------+
                |                mc0                |
                |  csrow0   |  csrow1   |  csrow2   |
      ----------+-----------------------------------+
      channel1: |  1024 MB  |  1024 MB  |   512 MB  |
      channel0: |  1024 MB  |  1024 MB  |   512 MB  |
      ----------+-----------------------------------+
      
      as if both channels were symetric, grouping the DIMMs on a wrong
      layout.
      
      After this patch, the memory is correctly represented.
      So, for csrows at layers[0], it shows:
      
                +-----------------------------------------------+
                |                      mc0                      |
                |  csrow0   |  csrow1   |  csrow2   |  csrow3   |
      ----------+-----------------------------------------------+
      channel1: |   512 MB  |   512 MB  |     0 MB  |     0 MB  |
      channel0: |  1024 MB  |  1024 MB  |  1024 MB  |  1024 MB  |
      ----------+-----------------------------------------------+
      
      For csrows at layers[1], it shows:
      
              +-----------------------+
              |          mc0          |
              | channel0  | channel1  |
      --------+-----------------------+
      csrow3: |  1024 MB  |     0 MB  |
      csrow2: |  1024 MB  |     0 MB  |
      --------+-----------------------+
      csrow1: |  1024 MB  |   512 MB  |
      csrow0: |  1024 MB  |   512 MB  |
      --------+-----------------------+
      
      So, no matter of what comes first, the information between
      channel and csrow will be properly represented.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      24bef66e
  7. 24 9月, 2012 2 次提交
    • S
      edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs. · faa2ad09
      Shaun Ruffell 提交于
      Fix potential NULL pointer dereference in edac_unregister_sysfs() on
      system boot introduced in 3.6-rc1.
      
      Since commit 7a623c03 ("edac: rewrite the sysfs code to use struct
      device") edac_mc_alloc() no longer initializes embedded kobjects in
      struct mem_ctl_info.  Therefore edac_mc_free() can no longer simply
      decrement a kobject reference count to free the allocated memory unless
      the memory controller driver module had also called edac_mc_add_mc().
      
      Now edac_mc_free() will check if the newly embedded struct device has
      been registered with sysfs before using either the standard device
      release functions or freeing the data structures itself with logic
      pulled out of the error path of edac_mc_alloc().
      
      The BUG this patch resolves for me:
      
        BUG: unable to handle kernel NULL pointer dereference at   (null)
        EIP is at __wake_up_common+0x1a/0x6a
        Process modprobe (pid: 933, ti=f3dc6000 task=f3db9520 task.ti=f3dc6000)
        Call Trace:
          complete_all+0x3f/0x50
          device_pm_remove+0x23/0xa2
          device_del+0x34/0x142
          edac_unregister_sysfs+0x3b/0x5c [edac_core]
          edac_mc_free+0x29/0x2f [edac_core]
          e7xxx_probe1+0x268/0x311 [e7xxx_edac]
          e7xxx_init_one+0x56/0x61 [e7xxx_edac]
          local_pci_probe+0x13/0x15
        ...
      
      Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Signed-off-by: NShaun Ruffell <sruffell@digium.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      faa2ad09
    • F
      edac_mc: fix messy kfree calls in the error path · ef6e7816
      Fengguang Wu 提交于
      coccinelle warns about:
      
      + drivers/edac/edac_mc.c:429:9-23: ERROR: reference preceded by free on line 429
      
         421         if (mci->csrows) {
       > 422                 for (chn = 0; chn < tot_channels; chn++) {
         423                         csr = mci->csrows[chn];
         424                         if (csr) {
       > 425                                 for (chn = 0; chn < tot_channels; chn++)
         426                                          kfree(csr->channels[chn]);
         427                                  kfree(csr);
         428                          }
       > 429                          kfree(mci->csrows[i]);
         430                  }
         431                  kfree(mci->csrows);
         432          }
      
      and that code block seem to mess things up in several ways (double free, memory
      leak, out-of-bound reads etc.):
      
      L422: The iterator "chn" and bound "tot_channels" are totally wrong. Should be
            "row" and "tot_csrows" respectively. Which means either memory leak, or
            out-of-bound reads (which if does not trigger an immediate page fault
            error, will further lead to kfree() on random addresses).
      
      L425: The inner loop is reusing the same iterator "chn" as the outer loop,
            which could lead to premature end of the outer loop, and hence memory leak.
      
      L429: The array index 'i' in mci->csrows[i] is a temporary value used in
            previous loops, and won't change at all in the current loop. Which
            means either out-of-bound read and possibly kfree(random number), or the
            same mci->csrows[i] get freed once and again, and possibly double free
            for the kfree(csr) in L427.
      
      L426/L427: a kfree(csr->channels) is needed in between to avoid leaking the memory.
      
      The buggy code was introduced by commit de3910eb ("edac: change the mem
      allocation scheme to make Documentation/kobject.txt happy") in the 3.6-rc1
      merge window. Fix it by freeing up resources in this order:
      
        free csrows[i]->channels[j]
        free csrows[i]->channels
        free csrows[i]
        free csrows
      
      CC: Mauro Carvalho Chehab <mchehab@redhat.com>
      CC: Shaun Ruffell <sruffell@digium.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef6e7816
  8. 14 8月, 2012 1 次提交
    • T
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo 提交于
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        conver the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NHenrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: NAnton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  9. 12 6月, 2012 9 次提交
    • M
      edac: edac_mc_handle_error(): add an error_count parameter · 9eb07a7f
      Mauro Carvalho Chehab 提交于
      In order to avoid loosing error events, it is desirable to group
      error events together and generate a single trace for several identical
      errors.
      
      The trace API already allows reporting multiple errors. Change the
      handle_error function to also allow that.
      
      The changes at the drivers were made by this small script:
      
      	$file .=$_ while (<>);
      	$file =~ s/(edac_mc_handle_error)\s*\(([^\,]+)\,([^\,]+)\,/$1($2,$3, 1,/g;
      	print $file;
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      9eb07a7f
    • M
      edac: remove arch-specific parameter for the error handler · 03f7eae8
      Mauro Carvalho Chehab 提交于
      Remove the arch-dependent parameter, as it were not used,
      as the MCE tracepoint weren't implemented. It probably doesn't
      make sense to have an MCE-specific tracepoint, as this will
      cost more bytes at the tracepoint, and tracepoint is not free.
      
      The changes at the EDAC drivers were done by this small perl script:
      
      	$file .=$_ while (<>);
      	$file =~ s/(edac_mc_handle_error)\s*\(([^\;]+)\,([^\,\)]+)\s*\)/$1($2)/g;
      	print $file;
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      03f7eae8
    • D
      edac_mc: check for allocation failure in edac_mc_alloc() · 08a4a136
      Dan Carpenter 提交于
      Add a check here for if kzalloc() failed.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      08a4a136
    • M
      edac_mc: Cleanup per-dimm_info debug messages · 6e84d359
      Mauro Carvalho Chehab 提交于
      The edac_mc_alloc() routine allocates one dimm_info device for all
      possible memories, including the non-filled ones. The debug messages
      there are somewhat confusing. So, cleans them, by moving the code
      that prints the memory location to edac_mc, and using it on both
      edac_mc_sysfs and edac_mc.
      
      Also, only dumps information when DIMM/ranks are actually
      filled.
      
      After this patch, a dimm-based memory controller will print the debug
      info as:
      
      [ 1011.380027] EDAC DEBUG: edac_mc_dump_csrow: csrow->csrow_idx = 0
      [ 1011.380029] EDAC DEBUG: edac_mc_dump_csrow:   csrow = ffff8801169be000
      [ 1011.380031] EDAC DEBUG: edac_mc_dump_csrow:   csrow->first_page = 0x0
      [ 1011.380032] EDAC DEBUG: edac_mc_dump_csrow:   csrow->last_page = 0x0
      [ 1011.380034] EDAC DEBUG: edac_mc_dump_csrow:   csrow->page_mask = 0x0
      [ 1011.380035] EDAC DEBUG: edac_mc_dump_csrow:   csrow->nr_channels = 3
      [ 1011.380037] EDAC DEBUG: edac_mc_dump_csrow:   csrow->channels = ffff8801149c2840
      [ 1011.380039] EDAC DEBUG: edac_mc_dump_csrow:   csrow->mci = ffff880117426000
      [ 1011.380041] EDAC DEBUG: edac_mc_dump_channel:   channel->chan_idx = 0
      [ 1011.380042] EDAC DEBUG: edac_mc_dump_channel:     channel = ffff8801149c2860
      [ 1011.380044] EDAC DEBUG: edac_mc_dump_channel:     channel->csrow = ffff8801169be000
      [ 1011.380046] EDAC DEBUG: edac_mc_dump_channel:     channel->dimm = ffff88010fe90400
      ...
      [ 1011.380095] EDAC DEBUG: edac_mc_dump_dimm: dimm0: channel 0 slot 0 mapped as virtual row 0, chan 0
      [ 1011.380097] EDAC DEBUG: edac_mc_dump_dimm:   dimm = ffff88010fe90400
      [ 1011.380099] EDAC DEBUG: edac_mc_dump_dimm:   dimm->label = 'CPU#0Channel#0_DIMM#0'
      [ 1011.380101] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000
      [ 1011.380103] EDAC DEBUG: edac_mc_dump_dimm:   dimm->grain = 8
      [ 1011.380104] EDAC DEBUG: edac_mc_dump_dimm:   dimm->nr_pages = 0x40000
      ...
      
      (a rank-based memory controller would print, instead of "dimm?", "rank?"
       on the above debug info)
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      6e84d359
    • J
      edac: Convert debugfX to edac_dbg(X, · 956b9ba1
      Joe Perches 提交于
      Use a more common debugging style.
      
      Remove __FILE__ uses, add missing newlines,
      coalesce formats and align arguments.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      956b9ba1
    • M
      edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs · dd23cd6e
      Mauro Carvalho Chehab 提交于
      The debug macro already adds that. Most of the work here was
      made by this small script:
      
      $f .=$_ while (<>);
      
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*": /\1"/g;
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*/\1/g;
      $f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*"MC: /\1"/g;
      
      $f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
      $f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;
      $f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
      $f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;
      
      $f =~ s/\"MC\: \\n\"/"MC:\\n"/g;
      
      print $f;
      
      After running the script, manual cleanups were done to fix it the remaining
      places.
      
      While here, removed the __LINE__ on most places, as it doesn't actually give
      useful info on most places.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      dd23cd6e
    • M
      edac: change the mem allocation scheme to make Documentation/kobject.txt happy · de3910eb
      Mauro Carvalho Chehab 提交于
      Kernel kobjects have rigid rules: each container object should be
      dynamically allocated, and can't be allocated into a single kmalloc.
      
      EDAC never obeyed this rule: it has a single malloc function that
      allocates all needed data into a single kzalloc.
      
      As this is not accepted anymore, change the allocation schema of the
      EDAC *_info structs to enforce this Kernel standard.
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Greg K H <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      de3910eb
    • M
      edac: Get rid of the old kobj's from the edac mc code · d90c0089
      Mauro Carvalho Chehab 提交于
      Now that al users for the old kobj raw access are gone,
      we can get rid of the legacy kobj-based structures and
      data.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      d90c0089
    • M
      edac: rewrite the sysfs code to use struct device · 7a623c03
      Mauro Carvalho Chehab 提交于
      The EDAC subsystem uses the old struct sysdev approach,
      creating all nodes using the raw sysfs API. This is bad,
      as the API is deprecated.
      
      As we'll be changing the EDAC API, let's first port the existing
      code to struct device.
      
      There's one drawback on this patch: driver-specific sysfs
      nodes, used by mpc85xx_edac, amd64_edac and i7core_edac
       won't be created anymore. While it would be possible to
      also port the device-specific code, that would mix kobj with
      struct device, with is not recommended. Also, it is easier and nicer
      to move the code to the drivers, instead, as the core can get rid
      of some complex logic that just emulates what the device_add()
      and device_create_file() already does.
      
      The next patches will convert the driver-specific code to use
      the device-specific calls. Then, the remaining bits of the old
      sysfs API will be removed.
      
      NOTE: a per-MC bus is required, otherwise devices with more than
      one memory controller will hit a bug like the one below:
      
      [  819.094946] EDAC DEBUG: find_mci_by_dev: find_mci_by_dev()
      [  819.094948] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device() idx=1
      [  819.094952] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device(): creating device mc1
      [  819.094967] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device creating dimm0, located at channel 0 slot 0
      [  819.094984] ------------[ cut here ]------------
      [  819.100142] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xc1/0xf0()
      [  819.107282] Hardware name: S2600CP
      [  819.111078] sysfs: cannot create duplicate filename '/bus/edac/devices/dimm0'
      [  819.119062] Modules linked in: sb_edac(+) edac_core ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm microcode pcspkr iTCO_wdt iTCO_vendor_support igb i2c_i801 i2c_core sg ioatdma dca sr_mod cdrom sd_mod crc_t10dif ahci libahci isci libsas libata scsi_transport_sas scsi_mod wmi dm_mod [last unloaded: scsi_wait_scan]
      [  819.175748] Pid: 10902, comm: modprobe Not tainted 3.3.0-0.11.el7.v12.2.x86_64 #1
      [  819.184113] Call Trace:
      [  819.186868]  [<ffffffff8105adaf>] warn_slowpath_common+0x7f/0xc0
      [  819.193573]  [<ffffffff8105aea6>] warn_slowpath_fmt+0x46/0x50
      [  819.200000]  [<ffffffff811f53d1>] sysfs_add_one+0xc1/0xf0
      [  819.206025]  [<ffffffff811f5cf5>] sysfs_do_create_link+0x135/0x220
      [  819.212944]  [<ffffffff811f7023>] ? sysfs_create_group+0x13/0x20
      [  819.219656]  [<ffffffff811f5df3>] sysfs_create_link+0x13/0x20
      [  819.226109]  [<ffffffff813b04f6>] bus_add_device+0xe6/0x1b0
      [  819.232350]  [<ffffffff813ae7cb>] device_add+0x2db/0x460
      [  819.238300]  [<ffffffffa0325634>] edac_create_dimm_object+0x84/0xf0 [edac_core]
      [  819.246460]  [<ffffffffa0325e18>] edac_create_sysfs_mci_device+0xe8/0x290 [edac_core]
      [  819.255215]  [<ffffffffa0322e2a>] edac_mc_add_mc+0x5a/0x2c0 [edac_core]
      [  819.262611]  [<ffffffffa03412df>] sbridge_register_mci+0x1bc/0x279 [sb_edac]
      [  819.270493]  [<ffffffffa03417a3>] sbridge_probe+0xef/0x175 [sb_edac]
      [  819.277630]  [<ffffffff813ba4e8>] ? pm_runtime_enable+0x58/0x90
      [  819.284268]  [<ffffffff812f430c>] local_pci_probe+0x5c/0xd0
      [  819.290508]  [<ffffffff812f5ba1>] __pci_device_probe+0xf1/0x100
      [  819.297117]  [<ffffffff812f5bea>] pci_device_probe+0x3a/0x60
      [  819.303457]  [<ffffffff813b1003>] really_probe+0x73/0x270
      [  819.309496]  [<ffffffff813b138e>] driver_probe_device+0x4e/0xb0
      [  819.316104]  [<ffffffff813b149b>] __driver_attach+0xab/0xb0
      [  819.322337]  [<ffffffff813b13f0>] ? driver_probe_device+0xb0/0xb0
      [  819.329151]  [<ffffffff813af5d6>] bus_for_each_dev+0x56/0x90
      [  819.335489]  [<ffffffff813b0d7e>] driver_attach+0x1e/0x20
      [  819.341534]  [<ffffffff813b0980>] bus_add_driver+0x1b0/0x2a0
      [  819.347884]  [<ffffffffa0347000>] ? 0xffffffffa0346fff
      [  819.353641]  [<ffffffff813b19f6>] driver_register+0x76/0x140
      [  819.359980]  [<ffffffff8159f18b>] ? printk+0x51/0x53
      [  819.365524]  [<ffffffffa0347000>] ? 0xffffffffa0346fff
      [  819.371291]  [<ffffffff812f5896>] __pci_register_driver+0x56/0xd0
      [  819.378096]  [<ffffffffa0347054>] sbridge_init+0x54/0x1000 [sb_edac]
      [  819.385231]  [<ffffffff8100203f>] do_one_initcall+0x3f/0x170
      [  819.391577]  [<ffffffff810bcd2e>] sys_init_module+0xbe/0x230
      [  819.397926]  [<ffffffff815bb529>] system_call_fastpath+0x16/0x1b
      [  819.404633] ---[ end trace 1654fdd39556689f ]---
      
      This happens because the bus is not being properly initialized.
      Instead of putting the memory sub-devices inside the memory controller,
      it is putting everything under the same directory:
      
      $ tree /sys/bus/edac/
      /sys/bus/edac/
      ├── devices
      │   ├── all_channel_counts -> ../../../devices/system/edac/mc/mc0/all_channel_counts
      │   ├── csrow0 -> ../../../devices/system/edac/mc/mc0/csrow0
      │   ├── csrow1 -> ../../../devices/system/edac/mc/mc0/csrow1
      │   ├── csrow2 -> ../../../devices/system/edac/mc/mc0/csrow2
      │   ├── dimm0 -> ../../../devices/system/edac/mc/mc0/dimm0
      │   ├── dimm1 -> ../../../devices/system/edac/mc/mc0/dimm1
      │   ├── dimm3 -> ../../../devices/system/edac/mc/mc0/dimm3
      │   ├── dimm6 -> ../../../devices/system/edac/mc/mc0/dimm6
      │   ├── inject_addrmatch -> ../../../devices/system/edac/mc/mc0/inject_addrmatch
      │   ├── mc -> ../../../devices/system/edac/mc
      │   └── mc0 -> ../../../devices/system/edac/mc/mc0
      ├── drivers
      ├── drivers_autoprobe
      ├── drivers_probe
      └── uevent
      
      On a multi-memory controller system, the names "csrow%d" and "dimm%d"
      should be under "mc%d", and not at the main hierarchy level.
      
      So, we need to create a per-MC bus, in order to have its own namespace.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Greg K H <gregkh@linuxfoundation.org>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      7a623c03
  10. 11 6月, 2012 3 次提交
    • C
      edac: Do alignment logic properly in edac_align_ptr() · 8447c4d1
      Chris Metcalf 提交于
      The logic was checking the sizeof the structure being allocated to
      determine whether an alignment fixup was required.  This isn't right;
      what we actually care about is the alignment of the actual pointer that's
      about to be returned.  This became an issue recently because struct
      edac_mc_layer has a size that is not zero modulo eight, so we were
      taking the correctly-aligned pointer and forcing it to be misaligned.
      On Tile this caused an alignment exception.
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      8447c4d1
    • M
      edac: Rename the parent dev to pdev · fd687502
      Mauro Carvalho Chehab 提交于
      As EDAC doesn't use struct device itself, it created a parent dev
      pointer called as "pdev".  Now that we'll be converting it to use
      struct device, instead of struct devsys, this needs to be fixed.
      
      No functional changes.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      fd687502
    • M
      RAS: Add a tracepoint for reporting memory controller events · 53f2d028
      Mauro Carvalho Chehab 提交于
      Add a new tracepoint-based hardware events report method for
      reporting Memory Controller events.
      
      Part of the description bellow is shamelessly copied from Tony
      Luck's notes about the Hardware Error BoF during LPC 2010 [1].
      Tony, thanks for your notes and discussions to generate the
      h/w error reporting requirements.
      
      [1] http://lwn.net/Articles/416669/
      
          We have several subsystems & methods for reporting hardware errors:
      
          1) EDAC ("Error Detection and Correction").  In its original form
          this consisted of a platform specific driver that read topology
          information and error counts from chipset registers and reported
          the results via a sysfs interface.
      
          2) mcelog - x86 specific decoding of machine check bank registers
          reporting in binary form via /dev/mcelog. Recent additions make use
          of the APEI extensions that were documented in version 4.0a of the
          ACPI specification to acquire more information about errors without
          having to rely reading chipset registers directly. A user level
          programs decodes into somewhat human readable format.
      
          3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
          decodes errors reported via machine check bank registers in AMD
          processors to the console log using printk();
      
          Each of these mechanisms has a band of followers ... and none
          of them appear to meet all the needs of all users.
      
      As part of a RAS subsystem, let's encapsulate the memory error hardware
      events into a trace facility.
      
      The tracepoint printk will be displayed like:
      
      mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver_detail]
      
      Where:
             	[quant] is the quantity of errors
      	[error msg] is the driver-specific error message
      		    (e. g. "memory read", "bus error", ...);
      	[location] is the location in terms of memory controller and
      		   branch/channel/slot, channel/slot or csrow/channel;
      	[label] is the memory stick label;
      	[edac_mc detail] describes the address location of the error
      			 and the syndrome;
      	[driver detail] is driver-specifig error message details,
      			when needed/provided (e. g. "area:DMA", ...)
      
      For example:
      
      mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
      
      Of course, any userspace tools meant to handle errors should not parse
      the above data. They should, instead, use the binary fields provided by
      the tracepoint, mapping them directly into their Management Information
      Base.
      
      NOTE: The original patch was providing an additional mechanism for
      MCA-based trace events that also contained MCA error register data.
      However, as no agreement was reached so far for the MCA-based trace
      events, for now, let's add events only for memory errors.
      A latter patch is planned to change the tracepoint, for those types
      of event.
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      53f2d028
  11. 29 5月, 2012 7 次提交
    • M
      edac: Initialize the dimm label with the known information · 5926ff50
      Mauro Carvalho Chehab 提交于
      While userspace doesn't fill the dimm labels, add there the dimm location,
      as described by the used memory model. This could eventually match what
      is described at the dmidecode, making easier for people to identify the
      memory.
      
      For example, on an Intel motherboard where the DMI table is reliable,
      the first memory stick is described as:
      
      Memory Device
      	Array Handle: 0x0029
      	Error Information Handle: Not Provided
      	Total Width: 64 bits
      	Data Width: 64 bits
      	Size: 2048 MB
      	Form Factor: DIMM
      	Set: 1
      	Locator: A1_DIMM0
      	Bank Locator: A1_Node0_Channel0_Dimm0
      	Type: <OUT OF SPEC>
      	Type Detail: Synchronous
      	Speed: 800 MHz
      	Manufacturer: A1_Manufacturer0
      	Serial Number: A1_SerNum0
      	Asset Tag: A1_AssetTagNum0
      	Part Number: A1_PartNum0
      
      The memory named as "A1_DIMM0" is physically located at the first
      memory controller (node 0), at channel 0, dimm slot 0.
      
      After this patch, the memory label will be filled with:
      	/sys/devices/system/edac/mc/csrow0/ch0_dimm_label:mc#0channel#0slot#0
      
      And (after the new EDAC API patches) as:
      	/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:mc#0channel#0slot#0
      
      So, even if the memory label is not initialized on userspace, an useful
      information with the error location is filled there, expecially since
      several systems/motherboards are provided with enough info to map from
      channel/slot (or branch/channel/slot) into the DIMM label. So, letting the
      EDAC core fill it by default is a good thing.
      
      It should noticed that, as the label filling happens at the
      edac_mc_alloc(), drivers can override it to better describe the memories
      (and some actually do it).
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      5926ff50
    • M
      edac: Remove the legacy EDAC ABI · ca0907b9
      Mauro Carvalho Chehab 提交于
      Now that all drivers got converted to use the new ABI, we can
      drop the old one.
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      ca0907b9
    • M
      edac: Change internal representation to work with layers · 4275be63
      Mauro Carvalho Chehab 提交于
      Change the EDAC internal representation to work with non-csrow
      based memory controllers.
      
      There are lots of those memory controllers nowadays, and more
      are coming. So, the EDAC internal representation needs to be
      changed, in order to work with those memory controllers, while
      preserving backward compatibility with the old ones.
      
      The edac core was written with the idea that memory controllers
      are able to directly access csrows.
      
      This is not true for FB-DIMM and RAMBUS memory controllers.
      
      Also, some recent advanced memory controllers don't present a per-csrows
      view. Instead, they view memories as DIMMs, instead of ranks.
      
      So, change the allocation and error report routines to allow
      them to work with all types of architectures.
      
      This will allow the removal of several hacks with FB-DIMM and RAMBUS
      memory controllers.
      
      Also, several tests were done on different platforms using different
      x86 drivers.
      
      TODO: a multi-rank DIMMs are currently represented by multiple DIMM
      entries in struct dimm_info. That means that changing a label for one
      rank won't change the same label for the other ranks at the same DIMM.
      This bug is present since the beginning of the EDAC, so it is not a big
      deal. However, on several drivers, it is possible to fix this issue, but
      it should be a per-driver fix, as the csrow => DIMM arrangement may not
      be equal for all. So, don't try to fix it here yet.
      
      I tried to make this patch as short as possible, preceding it with
      several other patches that simplified the logic here. Yet, as the
      internal API changes, all drivers need changes. The changes are
      generally bigger in the drivers for FB-DIMMs.
      
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      4275be63
    • M
      edac: rewrite edac_align_ptr() · 93e4fe64
      Mauro Carvalho Chehab 提交于
      The edac_align_ptr() function is used to prepare data for a single
      memory allocation kzalloc() call. It counts how many bytes are needed
      by some data structure.
      
      Using it as-is is not that trivial, as the quantity of memory elements
      reserved is not there, but, instead, it is on a next call.
      
      In order to avoid mistakes when using it, move the number of allocated
      elements into it, making easier to use it.
      Reviewed-by: NBorislav Petkov <bp@amd64.org>
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      93e4fe64
    • M
      edac: move nr_pages to dimm struct · a895bf8b
      Mauro Carvalho Chehab 提交于
      The number of pages is a dimm property. Move it to the dimm struct.
      
      After this change, it is possible to add sysfs nodes for the DIMM's that
      will properly represent the DIMM stick properties, including its size.
      
      A TODO fix here is to properly represent dual-rank/quad-rank DIMMs when
      the memory controller represents the memory via chip select rows.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      a895bf8b
    • M
      edac: move dimm properties to struct dimm_info · 084a4fcc
      Mauro Carvalho Chehab 提交于
      On systems based on chip select rows, all channels need to use memories
      with the same properties, otherwise the memories on channels A and B
      won't be recognized.
      
      However, such assumption is not true for all types of memory
      controllers.
      
      Controllers for FB-DIMM's don't have such requirements.
      
      Also, modern Intel controllers seem to be capable of handling such
      differences.
      
      So, we need to get rid of storing the DIMM information into a per-csrow
      data, storing it, instead at the right place.
      
      The first step is to move grain, mtype, dtype and edac_mode to the
      per-dimm struct.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Reviewed-by: NBorislav Petkov <borislav.petkov@amd.com>
      Acked-by: NChris Metcalf <cmetcalf@tilera.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Mark Gross <mark.gross@intel.com>
      Cc: Jason Uhlenkott <juhlenko@akamai.com>
      Cc: Tim Small <tim@buttersideup.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: Olof Johansson <olof@lixom.net>
      Cc: Egor Martovetsky <egor@pasemi.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Joe Perches <joe@perches.com>
      Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hitoshi Mitake <h.mitake@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: James Bottomley <James.Bottomley@parallels.com>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Mike Williams <mike@mikebwilliams.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      084a4fcc
    • M
      edac: Create a dimm struct and move the labels into it · a7d7d2e1
      Mauro Carvalho Chehab 提交于
      The way a DIMM is currently represented implies that they're
      linked into a per-csrow struct. However, some drivers don't see
      csrows, as they're ridden behind some chip like the AMB's
      on FBDIMM's, for example.
      
      This forced drivers to fake^Wvirtualize a csrow struct, and to create
      a mess under csrow/channel original's concept.
      
      Move the DIMM labels into a per-DIMM struct, and add there
      the real location of the socket, in terms of csrow/channel.
      Latter patches will modify the location to properly represent the
      memory architecture.
      
      All other drivers will use a per-csrow type of location.
      Some of those drivers will require a latter conversion, as
      they also fake the csrows internally.
      
      TODO: While this patch doesn't change the existing behavior, on
      csrows-based memory controllers, a csrow/channel pair points to a memory
      rank. There's a known bug at the EDAC core that allows having different
      labels for the same DIMM, if it has more than one rank. A latter patch
      is need to merge the several ranks for a DIMM into the same dimm_info
      struct, in order to avoid having different labels for the same DIMM.
      
      The edac_mc_alloc() will now contain a per-dimm initialization loop that
      will be changed by latter patches in order to match other types of
      memory architectures.
      Reviewed-by: NAristeu Rozanski <arozansk@redhat.com>
      Reviewed-by: NBorislav Petkov <borislav.petkov@amd.com>
      Cc: Doug Thompson <norsk5@yahoo.com>
      Cc: Ranganathan Desikan <ravi@jetztechnologies.com>
      Cc: "Arvind R." <arvino55@gmail.com>
      Cc: "Niklas Söderlund" <niklas.soderlund@ericsson.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      a7d7d2e1
  12. 22 3月, 2012 1 次提交
    • M
      edac: rename channel_info to rank_info · a4b4be3f
      Mauro Carvalho Chehab 提交于
      What it is pointed by a csrow/channel vector is a rank information, and
      not a channel information.
      
      On a traditional architecture, the memory controller directly access the
      memory ranks, via chip select rows. Different ranks at the same DIMM is
      selected via different chip select rows. So, typically, one
      csrow/channel pair means one different DIMM.
      
      On FB-DIMMs, there's a microcontroller chip at the DIMM, called Advanced
      Memory Buffer (AMB) that serves as the interface between the memory
      controller and the memory chips.
      
      The AMB selection is via the DIMM slot, and not via a csrow.
      
      It is up to the AMB to talk with the csrows of the DRAM chips.
      
      So, the FB-DIMM memory controllers see the DIMM slot, and not the DIMM
      rank. RAMBUS is similar.
      
      Newer memory controllers, like the ones found on Intel Sandy Bridge and
      Nehalem, even working with normal DDR3 DIMM's, don't use the usual
      channel A/channel B interleaving schema to provide 128 bits data access.
      
      Instead, they have more channels (3 or 4 channels), and they can use
      several interleaving schemas. Such memory controllers see the DIMMs
      directly on their registers, instead of the ranks, which is better for
      the driver, as its main usageis to point to a broken DIMM stick (the
      Field Repleceable Unit), and not to point to a broken DRAM chip.
      
      The drivers that support such such newer memory architecture models
      currently need to fake information and to abuse on EDAC structures, as
      the subsystem was conceived with the idea that the csrow would always be
      visible by the CPU.
      
      To make things a little worse, those drivers don't currently fake
      csrows/channels on a consistent way, as the concepts there don't apply
      to the memory controllers they're talking with. So, each driver author
      interpreted the concepts using a different logic.
      
      In order to fix it, let's rename the data structure that points into a
      DIMM rank to "rank_info", in order to be clearer about what's stored
      there.
      
      Latter patches will provide a better way to represent the memory
      hierarchy for the other types of memory controller.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      a4b4be3f
  13. 20 3月, 2012 1 次提交
  14. 15 12月, 2011 1 次提交
  15. 27 5月, 2011 1 次提交
  16. 31 3月, 2011 1 次提交
  17. 07 1月, 2011 1 次提交
  18. 09 12月, 2010 1 次提交
    • B
      EDAC: Fix workqueue-related crashes · bb31b312
      Borislav Petkov 提交于
      00740c58 changed edac_core to
      un-/register a workqueue item only if a lowlevel driver supplies a
      polling routine. Normally, when we remove a polling low-level driver, we
      go and cancel all the queued work. However, the workqueue unreg happens
      based on the ->op_state setting, and edac_mc_del_mc() sets this to
      OP_OFFLINE _before_ we cancel the work item, leading to NULL ptr oops on
      the workqueue list.
      
      Fix it by putting the unreg stuff in proper order.
      
      Cc: <stable@kernel.org> #36.x
      Reported-and-tested-by: NTobias Karnat <tobias.karnat@googlemail.com>
      LKML-Reference: <1291201307.3029.21.camel@Tobias-Karnat>
      Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
      bb31b312
  19. 24 10月, 2010 2 次提交
    • M
      i7core_edac: don't use a freed mci struct · accf74ff
      Mauro Carvalho Chehab 提交于
      This is a nasty bug. Since kobject count will be reduced by zero by
      edac_mc_del_mc(), and this triggers the kobj release method, the
      mci memory will be freed automatically. So, all we have left is ctl_name,
      as shown by enabling debug:
      
      [   80.822186] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1020: edac_remove_sysfs_mci_device()  remove_link
      [   80.832590] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 1024: edac_remove_sysfs_mci_device()  remove_mci_instance
      [   80.843776] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 640: edac_mci_control_release() mci instance idx=0 releasing
      [   80.855163] EDAC MC: Removed device 0 for i7core_edac.c i7 core #0: DEV 0000:3f:03.0
      [   80.862936] EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 2089: (null): free structs
      [   80.871134] EDAC DEBUG: in drivers/edac/edac_mc.c, line at 238: edac_mc_free()
      [   80.878379] EDAC DEBUG: in drivers/edac/edac_mc_sysfs.c, line at 726: edac_mc_unregister_sysfs_main_kobj()
      [   80.888043] EDAC DEBUG: in drivers/edac/i7core_edac.c, line at 1232: drivers/edac/i7core_edac.c: i7core_put_devices()
      
      Also, kfree(mci) shouldn't happen at the kobj.release, as it happens
      when edac_remove_sysfs_mci_device() is called, but the logic is:
      	edac_remove_sysfs_mci_device(mci);
      	edac_printk(KERN_INFO, EDAC_MC,
      		"Removed device %d for %s %s: DEV %s\n", mci->mc_idx,
      		mci->mod_name, mci->ctl_name, edac_dev_name(mci));
      So, as the edac_printk() needs the mci struct, this generates an OOPS.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      accf74ff
    • M
      edac_core: Print debug messages at release calls · bbc560ae
      Mauro Carvalho Chehab 提交于
      This is important to track a nasty bug at the free logic.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      bbc560ae