1. 05 Mar, 2014 1 commit
    • T
      powerpc: Eeh: Kill another abuse of irq_desc · b8a9a11b
      Thomas Gleixner committed
      commit 91150af3 (powerpc/eeh: Fix unbalanced enable for IRQ) is
      another brilliant example of trainwreck engineering.
      
      The patch "fixes" the issue of an unbalanced call to irq_enable(),
      which causes a prominent warning, by checking the disabled state of
      the interrupt line and calling conditionally into the core code.
      
      This is wrong in two aspects:
      
      1) The warning is there to tell users that they need to fix their
         asymmetric enable/disable patterns by finding the root cause and
         solving it there.
      
         It's definitely not meant to be worked around by conditionally
         calling into the core code depending on the random state of the irq
         line.
      
         Asymmetric irq_disable/enable calls are a clear sign of wrong usage
         of the interfaces, which has to be cured at the root and not by
         somehow hacking around it.
      
      2) The abuse of a core-internal data structure instead of using the
         proper interfaces to retrieve the information for the 'hack
         around'.
      
         irq_desc is core internal, and that is stated clearly enough.
      
      Replace at least the irq_desc abuse with the proper functions and add
      a big fat comment why this is absurd and completely wrong.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: ppc <linuxppc-dev@lists.ozlabs.org>
      Link: http://lkml.kernel.org/r/20140223212736.562906212@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
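
The point about asymmetric enable/disable calls can be illustrated with a small user-space model of a nested disable counter. This is only a sketch: `struct irq_model`, `model_disable_irq()` and `model_enable_irq()` are hypothetical names invented for illustration and are not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; all names here are hypothetical. */
struct irq_model {
    unsigned int depth;    /* nested disable count; 0 means the line is enabled */
    unsigned int warnings; /* how many unbalanced enables were observed */
};

/* Each disable nests one level deeper. */
void model_disable_irq(struct irq_model *irq)
{
    assert(irq != NULL);
    irq->depth++;
}

/*
 * Each enable must undo exactly one earlier disable. An enable without a
 * matching disable is a caller bug: it is counted as a warning instead of
 * being silently skipped. Returns true when the call was balanced.
 */
bool model_enable_irq(struct irq_model *irq)
{
    assert(irq != NULL);
    if (irq->depth == 0) {
        irq->warnings++;
        return false;
    }
    irq->depth--;
    return true;
}
```

In this model the fix for a warning is to find the caller that issued the extra enable, not to test `depth` before calling.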
  2. 11 Feb, 2014 1 commit
  3. 16 Jan, 2014 1 commit
  4. 15 Jan, 2014 2 commits
    • G
      powerpc/eeh: Handle multiple EEH errors · 7e4e7867
      Gavin Shan committed
      A single PCI-error-related OPAL event can carry multiple EEH
      errors, for example multiple frozen PEs detected on different
      PHBs. Unfortunately, that case was not covered. The patch
      enumerates the return values of eeh_ops::next_error() and changes
      eeh_handle_special_event() and eeh_ops::next_error() to handle all
      existing EEH errors.
      
      As Ben pointed out, we don't need list_for_each_entry_safe() since
      we are not deleting any PHB from the hose_list, and the EEH
      serialized lock should be held while purging EEH events. The patch
      covers those suggestions as well.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
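
The idea of handling every pending error instead of only the first can be sketched as a loop that keeps asking for the next error until none is left. This is a user-space model under stated assumptions: the enum values, `struct err_queue`, `model_next_error()` and `model_handle_special_event()` are hypothetical stand-ins, not the kernel's definitions of eeh_ops::next_error() or eeh_handle_special_event().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not the kernel's definitions. */
enum next_err {
    NEXT_ERR_NONE = 0,
    NEXT_ERR_FROZEN_PE,
    NEXT_ERR_FENCED_PHB
};

/* A fixed list of pending errors standing in for what firmware reports. */
struct err_queue {
    const enum next_err *pending;
    size_t count;
    size_t pos;
};

/* Returns the next pending error, or NEXT_ERR_NONE when drained. */
enum next_err model_next_error(struct err_queue *q)
{
    assert(q != NULL);
    if (q->pos >= q->count)
        return NEXT_ERR_NONE;
    return q->pending[q->pos++];
}

/* Drains the queue completely and returns how many errors were handled. */
size_t model_handle_special_event(struct err_queue *q)
{
    size_t handled = 0;

    while (model_next_error(q) != NEXT_ERR_NONE)
        handled++; /* recovery for the reported error would go here */
    return handled;
}
```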
    • G
      powerpc/eeh: Hotplug improvement · f26c7a03
      Gavin Shan committed
      When an EEH error hits a PCI device before its driver is loaded,
      we use hotplug to recover from the error. During the plug phase,
      the PCI device is probed and its driver is loaded. We then wrongly
      call the error handlers if the driver supports EEH explicitly.
      
      The patch fixes this by introducing the flag EEH_DEV_NO_HANDLER and
      setting it before the PCI device is removed. That way we avoid
      wrongly calling the error handlers of the PCI device after its
      driver has been loaded.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  5. 05 Dec, 2013 1 commit
  6. 24 Jul, 2013 3 commits
    • G
      powerpc/eeh: Fix unbalanced enable for IRQ · 91150af3
      Gavin Shan committed
      The patch fixes the following issue:
      
      Unbalanced enable for IRQ 23
      ------------[ cut here ]------------
      WARNING: at kernel/irq/manage.c:437
      :
      NIP [c00000000016de8c] .__enable_irq+0x11c/0x140
      LR [c00000000016de88] .__enable_irq+0x118/0x140
      Call Trace:
      [c000003ea1f23880] [c00000000016de88] .__enable_irq+0x118/0x140 (unreliable)
      [c000003ea1f23910] [c00000000016df08] .enable_irq+0x58/0xa0
      [c000003ea1f239a0] [c0000000000388b4] .eeh_enable_irq+0xc4/0xe0
      [c000003ea1f23a30] [c000000000038a28] .eeh_report_reset+0x78/0x130
      [c000003ea1f23ac0] [c000000000037508] .eeh_pe_dev_traverse+0x98/0x170
      [c000003ea1f23b60] [c0000000000391ac] .eeh_handle_normal_event+0x2fc/0x3d0
      [c000003ea1f23bf0] [c000000000039538] .eeh_handle_event+0x2b8/0x2c0
      [c000003ea1f23c90] [c000000000039600] .eeh_event_handler+0xc0/0x170
      [c000003ea1f23d30] [c0000000000da9a0] .kthread+0xf0/0x100
      [c000003ea1f23e30] [c00000000000a1dc] .ret_from_kernel_thread+0x5c/0x80
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • G
      powerpc/eeh: Use partial hotplug for EEH unaware drivers · f5c57710
      Gavin Shan committed
      When an EEH error happens to one specific PE, some devices with
      drivers supporting EEH won't expect hotplug on the device. However,
      there might be other devices without a driver, or with a driver
      that lacks EEH support. In that case, we need to do partial hotplug
      to make sure that the PE becomes absolutely quiet during reset.
      Otherwise, the PE reset might fail and lead to failure of error
      recovery.
      
      The current code doesn't handle that 'mixed' case properly, it either
      uses the error callbacks to the drivers, or tries hotplug, but doesn't
      handle a PE (EEH domain) composed of a combination of the two.
      
      The patch intends to support so-called "partial" hotplug for EEH:
      before the reset, we stop and remove the PCI devices that have no
      EEH-sensitive driver. The corresponding EEH devices are not
      detached from their PE but are marked with a special flag. After
      the reset is done, the EEH devices with the special flag are
      scanned one by one.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    • G
      powerpc/eeh: Keep PE during hotplug · 807a827d
      Gavin Shan committed
      When we do normal hotplug, the PE (shadow EEH structure) shouldn't be
      kept around.
      
      However, we need to keep it if the hotplug is an artificial one
      caused by EEH error recovery.
      
      Since we remove the EEH device through the PCI hook
      pcibios_release_device(), the flag "purge_pe" passed to various
      functions is meaningless. So the patch removes the meaningless flag
      and introduces a new flag "EEH_PE_KEEP" to save the PE while doing
      hotplug during EEH error recovery.
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  7. 01 Jul, 2013 1 commit
  8. 20 Jun, 2013 4 commits
  9. 18 Sep, 2012 2 commits
    • G
      powerpc/eeh: Lock module while handling EEH event · feadf7c0
      Gavin Shan committed
      The EEH core talks to the PCI device driver to determine the
      action (pure reset or PCI device removal). During that period the
      driver might be unloaded, which in turn causes a kernel crash as
      follows:
      
      EEH: Detected PCI bus error on PHB#4-PE#10000
      EEH: This PCI device has failed 3 times in the last hour
      lpfc 0004:01:00.0: 0:2710 PCI channel disable preparing for reset
      Unable to handle kernel paging request for data at address 0x00000490
      Faulting instruction address: 0xd00000000e682c90
      cpu 0x1: Vector: 300 (Data Access) at [c000000fc75ffa20]
          pc: d00000000e682c90: .lpfc_io_error_detected+0x30/0x240 [lpfc]
          lr: d00000000e682c8c: .lpfc_io_error_detected+0x2c/0x240 [lpfc]
          sp: c000000fc75ffca0
         msr: 8000000000009032
         dar: 490
       dsisr: 40000000
        current = 0xc000000fc79b88b0
        paca    = 0xc00000000edb0380	 softe: 0	 irq_happened: 0x00
          pid   = 3386, comm = eehd
      enter ? for help
      [c000000fc75ffca0] c000000fc75ffd30 (unreliable)
      [c000000fc75ffd30] c00000000004fd3c .eeh_report_error+0x7c/0xf0
      [c000000fc75ffdc0] c00000000004ee00 .eeh_pe_dev_traverse+0xa0/0x180
      [c000000fc75ffe70] c00000000004ffd8 .eeh_handle_event+0x68/0x300
      [c000000fc75fff00] c0000000000503a0 .eeh_event_handler+0x130/0x1a0
      [c000000fc75fff90] c000000000020138 .kernel_thread+0x54/0x70
      1:mon>
      
      The patch takes a reference on the corresponding driver module
      while the EEH core negotiates with the PCI device driver, so that
      the module can't be unloaded during that period and it is safe to
      call the callbacks.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
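
The pattern of pinning a driver module while its callbacks are in use can be modeled in user space as a reference count taken before the call and dropped after it. This is a sketch only: `struct module_model` and the `model_*` functions are hypothetical names. In the kernel this role is generally played by try_module_get() and module_put(), but whether the patch uses exactly that pair is not stated in the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model of a driver module; all names are hypothetical. */
struct module_model {
    bool unloading;    /* set once removal has started */
    unsigned int refs; /* callers currently relying on the module */
    int (*error_detected)(const struct module_model *self);
};

/* Takes a reference unless the module is already going away. */
bool model_try_get(struct module_model *m)
{
    assert(m != NULL);
    if (m->unloading)
        return false;
    m->refs++;
    return true;
}

/* Drops a reference taken by model_try_get(). */
void model_put(struct module_model *m)
{
    assert(m != NULL && m->refs > 0);
    m->refs--;
}

/* Sample callback: reports how many references pin the module right now. */
int sample_error_detected(const struct module_model *self)
{
    return (int)self->refs;
}

/*
 * Calls the driver callback only while holding a reference, so the module
 * cannot reach refs == 0 during the call. Returns -1 if the module is gone.
 */
int model_report_error(struct module_model *m)
{
    int rc;

    if (!model_try_get(m))
        return -1;
    rc = m->error_detected ? m->error_detected(m) : 0;
    model_put(m);
    return rc;
}
```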
    • G
      powerpc/eeh: Remove EEH PE for normal PCI hotplug · 20ee6a97
      Gavin Shan committed
      Function eeh_rmv_from_parent_pe() can be called from either the
      normal PCI hotplug path or the EEH recovery path. In the former
      case, we need to purge the corresponding PE on removal of the
      associated PE bus.
      
      The patch covers that by passing more information to function
      pcibios_remove_pci_devices() so that we know whether the
      corresponding PE needs to be purged or marked as "invalid".
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  10. 10 Sep, 2012 2 commits
  11. 09 Mar, 2012 8 commits
  12. 06 May, 2011 1 commit
  13. 17 Feb, 2010 1 commit
    • B
      powerpc/eeh: Fix a bug when pci structure is null · 8d3d50bf
      Breno Leitao committed
      During an EEH recovery, the pci_dev structure can be null, mainly
      if an EEH event is detected during a PCI config operation. In this
      case the pci_dev is not known (and is null), and the kernel crashes
      with the following message:
      
      Unable to handle kernel paging request for data at address 0x000000a0
      Faulting instruction address: 0xc00000000006b8b4
      Oops: Kernel access of bad area, sig: 11 [#1]
      
      NIP [c00000000006b8b4] .eeh_event_handler+0x10c/0x1a0
      LR [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
      Call Trace:
      [c0000003a80dff00] [c00000000006b8a8] .eeh_event_handler+0x100/0x1a0
      [c0000003a80dff90] [c000000000031f1c] .kernel_thread+0x54/0x70
      
      The bug occurs because pci_name() tries to access a null pointer.
      This patch just guarantees that pci_name() is not called on null
      pointers.
      Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
      Signed-off-by: Linas Vepstas <linasvepstas@gmail.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
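
The guard described above can be sketched in user space as a wrapper that substitutes a placeholder string when the device pointer is null. The names `struct pci_dev_model`, `model_pci_name()` and `model_safe_pci_name()` are hypothetical, and the "<null>" placeholder is an assumption chosen for illustration, not necessarily what the patch prints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-in for a PCI device; hypothetical, not struct pci_dev. */
struct pci_dev_model {
    const char *name;
};

/* Like the accessor in the bug report, this dereferences its argument. */
const char *model_pci_name(const struct pci_dev_model *pdev)
{
    assert(pdev != NULL); /* a null pointer here models the crash */
    return pdev->name;
}

/* Never passes a null device to the accessor. */
const char *model_safe_pci_name(const struct pci_dev_model *pdev)
{
    return pdev ? model_pci_name(pdev) : "<null>";
}
```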
  14. 09 Feb, 2010 1 commit
  15. 30 Oct, 2009 1 commit
  16. 17 Jun, 2009 1 commit
  17. 15 Apr, 2009 1 commit
    • M
      powerpc/pseries: Set error_state to pci_channel_io_normal in eeh_report_reset() · c58dc575
      Mike Mason committed
      While adding native EEH support to Emulex and QLogic drivers, it was
      discovered that dev->error_state was set to pci_channel_io_normal too
      late in the recovery process. These drivers rely on error_state to
      determine if they can access the device in their slot_reset callback,
      thus error_state needs to be set to pci_channel_io_normal in
      eeh_report_reset(). Below is a detailed explanation (courtesy of Richard
      Lary) as to why this is necessary.
      
      Background:
      PCI MMIO or DMA accesses to a frozen slot generate additional EEH
      errors. If the number of additional EEH errors exceeds EEH_MAX_FAILS the
      adapter will be shut down. To avoid triggering excessive EEH errors and
      an undesirable adapter shutdown, some drivers use the
      pci_channel_offline(dev) wrapper function to return a Boolean value
      based on the value of pci_dev->error_state to determine if PCI MMIO or
      DMA accesses are safe. If the wrapper returns TRUE, drivers must not
      make PCI MMIO or DMA accesses to their hardware.
      
      The pci_dev structure member error_state reflects one of three values,
      1) pci_channel_io_normal, 2) pci_channel_io_frozen, 3)
      pci_channel_io_perm_failure.  Function pci_channel_offline(dev) returns
      TRUE if error_state is pci_channel_io_frozen or pci_channel_io_perm_failure.
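
The relationship between error_state and the offline check described above can be modeled in a few lines of user-space C. This is a sketch: the `CH_IO_*` enumerators, `struct dev_model`, `model_channel_offline()` and `model_driver_mmio()` are hypothetical names that mirror the three states named in the text, not the kernel's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the three error_state values named in the text (model only). */
enum channel_state_model {
    CH_IO_NORMAL = 1,
    CH_IO_FROZEN,
    CH_IO_PERM_FAILURE
};

struct dev_model {
    enum channel_state_model error_state;
};

/* True when the state is frozen or permanently failed. */
bool model_channel_offline(const struct dev_model *dev)
{
    assert(dev != NULL);
    return dev->error_state != CH_IO_NORMAL;
}

/*
 * Driver-side guard: touch the hardware only when the channel is online.
 * Returns true if the (simulated) MMIO access was performed.
 */
bool model_driver_mmio(const struct dev_model *dev, unsigned int *mmio_count)
{
    assert(mmio_count != NULL);
    if (model_channel_offline(dev))
        return false;
    (*mmio_count)++;
    return true;
}
```

A driver whose slot_reset path shares this guard performs no MMIO at all while error_state still reads as frozen, which is why the timing of the state update matters.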
      
      The EEH driver sets pci_dev->error_state to pci_channel_io_frozen at the
      point where the PCI slot is frozen. Currently, the EEH driver restores
      dev->error_state to pci_channel_io_normal in eeh_report_resume() before
      calling the driver's resume callback. However, when the EEH driver calls
      the driver's slot_reset() callback from eeh_report_reset(), it
      incorrectly indicates the error state is still pci_channel_io_frozen.
      
      Waiting until eeh_report_resume() to restore dev->error_state to
      pci_channel_io_normal is too late for Emulex and QLogic FC drivers and
      any other drivers which are designed to use common code paths in these
      two cases: i) those called after the driver's slot_reset() callback and
      ii) those called after the PCI slot is frozen but before the driver's
      slot_reset() callback is called. Case i) covers all driver paths
      executed to reinitialize the hardware after a reset, and case ii)
      covers all code paths executed by driver kernel threads that run
      asynchronously to the main driver thread, such as interrupt handlers
      and worker threads that process driver work queues.
      
      Emulex and QLogic FC drivers are designed with common code paths which
      require that pci_channel_offline(dev) reflect the true state of the
      hardware. The state transitions that the hardware takes from Normal
      Operations to Slot Frozen to Reset to Normal Operations are documented
      in the Power Architecture™ Platform Requirements+ (PAPR+) in Table 75,
      "PE State Control".
      
      PAPR defines the following 3 states:
      
      0 -- Not reset, Not EEH stopped, MMIO load/store allowed, DMA allowed
           (Normal Operations)
      1 -- Reset, Not EEH stopped, MMIO load/store disabled, DMA disabled
      2 -- Not reset, EEH stopped, MMIO load/store disabled, DMA disabled
           (Slot Frozen)
      
      An EEH error places the slot in state 2 (Frozen) and the adapter driver
      is notified that an EEH error was detected. If the adapter driver
      returns PCI_ERS_RESULT_NEED_RESET, the EEH driver calls
      eeh_reset_device() to place the slot into state 1 (Reset), and
      eeh_reset_device() completes by placing the slot into state 0 (Normal
      Operations). Upon return from eeh_reset_device(), the EEH driver calls
      eeh_report_reset(), which then calls the adapter's slot_reset callback.
      At the time the adapter's slot_reset callback is called, the true state
      of the hardware is Normal Operations, and this should be accurately
      reflected by setting dev->error_state to pci_channel_io_normal.
      
      The current implementation of EEH driver does not do so and requires
      this change to correct this deficiency.
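
The three states and the Frozen, Reset, Normal sequence described above can be sketched as a small state machine. This is a user-space model under stated assumptions: the state values follow the 0/1/2 numbering in the text, while the event names and the `model_*` functions are hypothetical and do not correspond to kernel or firmware interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* State numbers follow the list in the text; names are hypothetical. */
enum pe_state_model {
    PE_NORMAL = 0, /* MMIO and DMA allowed */
    PE_RESET = 1,  /* in reset, MMIO and DMA disabled */
    PE_FROZEN = 2  /* EEH stopped, MMIO and DMA disabled */
};

/* Hypothetical events that drive the transitions. */
enum pe_event_model {
    EV_EEH_ERROR,
    EV_ASSERT_RESET,
    EV_DEASSERT_RESET
};

/* Returns the state that follows `s` when `ev` occurs. */
enum pe_state_model model_pe_next(enum pe_state_model s, enum pe_event_model ev)
{
    assert(s == PE_NORMAL || s == PE_RESET || s == PE_FROZEN);
    switch (ev) {
    case EV_EEH_ERROR:
        return PE_FROZEN;
    case EV_ASSERT_RESET:
        return PE_RESET;
    case EV_DEASSERT_RESET:
        /* Leaving reset only makes sense while in reset. */
        return s == PE_RESET ? PE_NORMAL : s;
    }
    return s;
}

/* MMIO and DMA are only allowed in Normal Operations. */
bool model_io_allowed(enum pe_state_model s)
{
    return s == PE_NORMAL;
}
```

In this model the slot is already back in state 0 once reset is deasserted, which matches the argument that the driver-visible error_state should read as normal by the time the slot_reset callback runs.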
      Signed-off-by: Mike Mason <mmlnx@us.ibm.com>
      Acked-by: Linas Vepstas <linasvepstas@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
  18. 11 Feb, 2009 1 commit
  19. 20 Aug, 2008 1 commit
  20. 16 Jun, 2008 1 commit
  21. 11 Dec, 2007 1 commit
  22. 03 Dec, 2007 1 commit
  23. 08 Nov, 2007 2 commits
  24. 14 Jun, 2007 1 commit