1. 18 Apr, 2019 7 commits
  2. 29 Mar, 2019 10 commits
  3. 28 Mar, 2019 3 commits
    • scsi: zfcp: reduce flood of fcrscn1 trace records on multi-element RSCN · c8206579
      Steffen Maier committed
      If an incoming ELS of type RSCN contains more than one element, zfcp
      suboptimally causes repeated erp trigger NOP trace records for each
      previously failed port. These could be ports that went away. It loops over
      each RSCN element and, for each of those, runs an inner loop over all
      zfcp_ports.
      
      The trigger to recover failed ports should be just the reception of some
      RSCN, no matter how many elements it has. So we can loop over failed ports
      separately, and only then loop over each RSCN element to handle the
      non-failed ports.
      
      The call chain was:
      
        zfcp_fc_incoming_rscn
          for (i = 1; i < no_entries; i++)
            _zfcp_fc_incoming_rscn
              list_for_each_entry(port, &adapter->port_list, list)
                if (masked port->d_id match) zfcp_fc_test_link
                if (!port->d_id) zfcp_erp_port_reopen "fcrscn1"   <===
      
      In order to reduce the "flooding" of the REC trace area in such cases, we
      move the handling of failed ports outside of the entries loop:
      
        zfcp_fc_incoming_rscn
          if (no_entries > 1)                                     <===
            list_for_each_entry(port, &adapter->port_list, list)  <===
              if (!port->d_id) zfcp_erp_port_reopen "fcrscn1"     <===
          for (i = 1; i < no_entries; i++)
            _zfcp_fc_incoming_rscn
              list_for_each_entry(port, &adapter->port_list, list)
                if (masked port->d_id match) zfcp_fc_test_link
      
      Abbreviated example trace records before this code change:
      
      Tag            : fcrscn1
      WWPN           : 0x500507630310d327
      ERP want       : 0x02
      ERP need       : 0x02
      
      Tag            : fcrscn1
      WWPN           : 0x500507630310d327
      ERP want       : 0x02
      ERP need       : 0x00                 NOP => superfluous trace record
      
      The last trace entry repeats if there are more than 2 RSCN elements.
      Signed-off-by: Steffen Maier <maier@linux.ibm.com>
      Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
      Reviewed-by: Jens Remus <jremus@linux.ibm.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
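The effect of moving the failed-port loop can be illustrated with a small standalone model. This is only a sketch, not the zfcp code: `struct port`, `reopen_triggers_before()` and `reopen_triggers_after()` are hypothetical names, and it assumes (as the loops starting at 1 suggest) that entry 0 of the RSCN payload is a header rather than an element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a zfcp_port; only d_id matters here. */
struct port {
	unsigned int d_id;	/* 0 models a previously failed port */
};

/* Old shape: every failed port is visited once per RSCN element. */
static int reopen_triggers_before(const struct port *ports, size_t n_ports,
				  int no_entries)
{
	int triggers = 0;

	assert(ports != NULL || n_ports == 0);
	for (int i = 1; i < no_entries; i++)
		for (size_t p = 0; p < n_ports; p++)
			if (!ports[p].d_id)
				triggers++;	/* one "fcrscn1" record */
	return triggers;
}

/* New shape: every failed port is visited once per received RSCN. */
static int reopen_triggers_after(const struct port *ports, size_t n_ports,
				 int no_entries)
{
	int triggers = 0;

	assert(ports != NULL || n_ports == 0);
	if (no_entries > 1)
		for (size_t p = 0; p < n_ports; p++)
			if (!ports[p].d_id)
				triggers++;
	return triggers;
}
```

With one failed port and three RSCN elements (`no_entries == 4`), the old shape counts three triggers and the new shape counts one.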
    • scsi: zfcp: fix scsi_eh host reset with port_forced ERP for non-NPIV FCP devices · 242ec145
      Steffen Maier committed
      Suppose more than one non-NPIV FCP device is active on the same channel.
      Send I/O to storage and have some of the pending I/O run into a SCSI
      command timeout, e.g. due to bit errors on the fibre. Now the error
      situation stops. However, we saw FCP requests continue to time out in the
      channel. The abort will be successful, but the subsequent TUR fails.
      Scsi_eh starts. The LUN reset fails. The target reset fails. The host
      reset only did an FCP device recovery. However, for non-NPIV FCP devices,
      this does not close and reopen ports on the SAN-side if other non-NPIV FCP
      device(s) share the same open ports.
      
      In order to resolve the continuing FCP request timeouts, we need to
      explicitly close and reopen ports on the SAN-side.
      
      This was missing since the beginning of zfcp in v2.6.0 history commit
      ea127f975424 ("[PATCH] s390 (7/7): zfcp host adapter.").
      
      Note: The FSF requests for forced port reopen could run into FSF request
      timeouts due to other reasons. This would trigger an internal FCP device
      recovery. Pending forced port reopen recoveries would get dismissed. So
      some ports might not get fully reopened during this host reset handler.
      However, subsequent I/O would trigger the above described escalation and
      eventually all ports would be forcibly reopened to resolve any continuing
      FCP request timeouts due to earlier bit errors.
      Signed-off-by: Steffen Maier <maier@linux.ibm.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: <stable@vger.kernel.org> #3.0+
      Reviewed-by: Jens Remus <jremus@linux.ibm.com>
      Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
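A rough model of the changed host reset behaviour might look like the following. It is a sketch under assumptions: the types and `host_reset()` are hypothetical, and the ordering (forced port reopen first, then the FCP device recovery) is a guess that the commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for zfcp objects. */
struct port {
	bool forced_reopen_requested;
};

struct adapter {
	bool npiv;			/* true for an NPIV FCP device */
	struct port *ports;
	size_t n_ports;
	bool device_recovery_requested;
};

/* Returns how many ports were asked to close and reopen on the SAN side. */
static size_t host_reset(struct adapter *a)
{
	size_t forced = 0;

	assert(a != NULL);
	if (!a->npiv) {
		/* Other non-NPIV FCP devices may share the open ports, so
		 * the device recovery alone would not reopen them. */
		for (size_t i = 0; i < a->n_ports; i++) {
			a->ports[i].forced_reopen_requested = true;
			forced++;
		}
	}
	a->device_recovery_requested = true;
	return forced;
}
```

In this model an NPIV device skips the extra step entirely, which matches the commit's focus on non-NPIV FCP devices.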
    • scsi: zfcp: fix rport unblock if deleted SCSI devices on Scsi_Host · fe67888f
      Steffen Maier committed
      An already deleted SCSI device can exist on the Scsi_Host and remain there
      because something still holds a reference. A new SCSI device with the same
      H:C:T:L and FCP device, target port WWPN, and FCP LUN can be created. When
      we try to unblock an rport, we still find the deleted SCSI device and
      return early because the zfcp_scsi_dev of that SCSI device is not
      ZFCP_STATUS_COMMON_UNBLOCKED. Hence we fail to unblock the rport, even if
      the new proper SCSI device would be in good state.
      
      Therefore, skip deleted SCSI devices when iterating the sdevs of the shost.
      [cf. __scsi_device_lookup{_by_target}() or scsi_device_get()]
      
      The following abbreviated trace sequence can indicate such a problem:
      
      Area           : REC
      Tag            : ersfs_3
      LUN            : 0x4045400300000000
      WWPN           : 0x50050763031bd327
      LUN status     : 0x40000000     not ZFCP_STATUS_COMMON_UNBLOCKED
      Ready count    : n		not incremented yet
      Running count  : 0x00000000
      ERP want       : 0x01
      ERP need       : 0xc1		ZFCP_ERP_ACTION_NONE
      
      Area           : REC
      Tag            : ersfs_3
      LUN            : 0x4045400300000000
      WWPN           : 0x50050763031bd327
      LUN status     : 0x41000000
      Ready count    : n+1
      Running count  : 0x00000000
      ERP want       : 0x01
      ERP need       : 0x01
      
      ...
      
      Area           : REC
      Level          : 4		only with increased trace level
      Tag            : ertru_l
      LUN            : 0x4045400300000000
      WWPN           : 0x50050763031bd327
      LUN status     : 0x40000000
      Request ID     : 0x0000000000000000
      ERP status     : 0x01800000
      ERP step       : 0x1000
      ERP action     : 0x01
      ERP count      : 0x00
      
      NOT followed by a trace record with tag "scpaddy"
      for WWPN 0x50050763031bd327.
      Signed-off-by: Steffen Maier <maier@linux.ibm.com>
      Fixes: 6f2ce1c6 ("scsi: zfcp: fix rport unblock race with LUN recovery")
      Cc: <stable@vger.kernel.org> #2.6.32+
      Reviewed-by: Jens Remus <jremus@linux.ibm.com>
      Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
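The idea of skipping deleted SCSI devices during the unblock check can be sketched as below. This is an illustrative model only: `struct sdev`, `enum dev_state` and `rport_may_unblock()` are hypothetical and do not mirror the SCSI mid-layer or zfcp data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device states; the kernel has a richer state machine. */
enum dev_state { DEV_RUNNING, DEV_DELETED };

struct sdev {
	enum dev_state state;
	unsigned long long wwpn;	/* target port the device belongs to */
	bool unblocked;			/* models ZFCP_STATUS_COMMON_UNBLOCKED */
};

/* Decide whether the rport for wwpn may be unblocked. */
static bool rport_may_unblock(const struct sdev *devs, size_t n,
			      unsigned long long wwpn)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (devs[i].wwpn != wwpn)
			continue;
		if (devs[i].state == DEV_DELETED)
			continue;	/* stale device must not veto */
		if (!devs[i].unblocked)
			return false;	/* a live device is still blocked */
	}
	return true;
}
```

Without the `DEV_DELETED` check, a stale device that is still referenced would make the function return early with `false`, which is the failure described above.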
  4. 19 Mar, 2019 3 commits
  5. 12 Mar, 2019 1 commit
  6. 11 Mar, 2019 1 commit
    • vfio: ccw: only free cp on final interrupt · 50b7f1b7
      Cornelia Huck committed
      When we get an interrupt for a channel program, it is not
      necessarily the final interrupt; for example, the issuing
      guest may request an intermediate interrupt by specifying
      the program-controlled-interrupt flag on a ccw.
      
      We must not switch the state to idle if the interrupt is not
      yet final; even more importantly, we must not free the translated
      channel program if the interrupt is not yet final, or the host
      can crash during cp rewind.
      
      Fixes: e5f84dba ("vfio: ccw: return I/O results asynchronously")
      Cc: stable@vger.kernel.org # v4.12+
      Reviewed-by: Eric Farman <farman@linux.ibm.com>
      Signed-off-by: Cornelia Huck <cohuck@redhat.com>
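A minimal sketch of the rule "free the channel program and go idle only on the final interrupt" follows. How finality is detected is an assumption here: the model treats an interrupt as final when neither of two illustrative activity bits is set, and all names and bit values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative activity bits, not taken from the kernel headers. */
#define ACT_DEVICE_ACTIVE	0x2u
#define ACT_SUBCHANNEL_ACTIVE	0x4u

struct chan_state {
	bool cp_allocated;	/* translated channel program still held */
	bool idle;
};

/* Handle one interrupt; returns true if it was the final one. */
static bool handle_interrupt(struct chan_state *s, unsigned int activity)
{
	assert(s != NULL);
	bool final = !(activity & (ACT_DEVICE_ACTIVE | ACT_SUBCHANNEL_ACTIVE));

	if (final) {
		s->cp_allocated = false;	/* safe to free only now */
		s->idle = true;
	}
	return final;
}
```

An intermediate interrupt, such as one requested through the program-controlled-interrupt flag, leaves both fields untouched in this model.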
  7. 07 Mar, 2019 2 commits
  8. 06 Mar, 2019 1 commit
    • s390/zcrypt: revisit ap device remove procedure · 01396a37
      Harald Freudenberger committed
      Working with the vfio-ap driver led to a revisit of how an ap
      (queue) device is removed from the driver.
      With the current implementation all the cleanup was done before
      the driver even got notified about the removal. Now the ap
      queue removal is done in 3 steps:
      1) A preparation step, all ap messages within the queue
         are flushed and so the driver does 'receive' them.
         Also a new state AP_STATE_REMOVE assigned to the queue
         makes sure there are no new messages queued in.
      2) Now the driver's remove function is invoked and the
         driver should do the job of cleaning up its internal
         administration lists or whatever. After 2) is done
         it is guaranteed that the driver is not invoked any
         more. On the other hand the driver has to make sure
         that the APQN is not accessed any more after step 2
         is complete.
      3) Now the ap bus code does the job of total cleanup of the
         APQN. A reset with zero is triggered and the state of
         the queue goes to AP_STATE_UNBOUND.
         After step 3) is complete, the ap queue has no pending
         messages and the APQN is cleared and so there are no
         requests and replies lingering around in the firmware
         queue for this APQN. Also the interrupts are disabled.
      
      After these remove steps the ap queue device may be assigned
      to another driver.
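
The three removal steps can be modelled as a tiny state machine. This is a sketch, not the ap bus code: only the state names AP_STATE_REMOVE and AP_STATE_UNBOUND come from the text above, while `Q_ACTIVE`, the struct and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Q_REMOVE and Q_UNBOUND model AP_STATE_REMOVE and AP_STATE_UNBOUND. */
enum q_state { Q_ACTIVE, Q_REMOVE, Q_UNBOUND };

struct queue {
	enum q_state state;
	int pending;		/* messages not yet received by the driver */
	bool driver_bound;
	bool irq_enabled;
};

/* New messages are only accepted while the queue is active. */
static bool queue_msg(struct queue *q)
{
	if (q->state != Q_ACTIVE)
		return false;
	q->pending++;
	return true;
}

/* Step 1: flush pending messages to the driver, block new ones. */
static int prepare_remove(struct queue *q)
{
	int flushed = q->pending;

	q->pending = 0;
	q->state = Q_REMOVE;
	return flushed;
}

/* Step 2: the driver cleans up and is not invoked afterwards. */
static void driver_remove(struct queue *q)
{
	assert(q->state == Q_REMOVE);
	q->driver_bound = false;
}

/* Step 3: the bus resets the queue and disables interrupts. */
static void bus_cleanup(struct queue *q)
{
	assert(!q->driver_bound);
	q->pending = 0;
	q->irq_enabled = false;
	q->state = Q_UNBOUND;
}
```

The asserts encode the ordering constraint: the driver is only notified after the flush, and the bus only resets the queue after the driver has let go.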
      
      Stress testing this remove/probe procedure showed a problem with the
      correct module reference counting. The actual receipt of a reply in
      the driver is done asynchronously with completions. So with a driver
      change on an ap queue the message flush triggers completions, but the
      threads waiting for the completions may run at a time when the queue
      already has the new driver assigned. So the module_put() at receive
      time needs to be done on the driver module which queued the ap
      message. This change is also part of this patch.
      Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
      Reviewed-by: Ingo Franzki <ifranzki@linux.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
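The reference counting point can be illustrated by remembering, per message, which driver module queued it. This is a hedged sketch with hypothetical types (`struct drv_module`, `struct message`) and plain integer counters standing in for the kernel's module get/put calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver module with a reference count. */
struct drv_module {
	int refcount;
};

struct message {
	struct drv_module *owner;	/* module that queued this message */
};

/* Queue time: take a reference on the queuing driver and remember it. */
static void msg_queue(struct message *m, struct drv_module *drv)
{
	assert(m != NULL && drv != NULL);
	drv->refcount++;
	m->owner = drv;
}

/* Receive time: drop the reference on the module that queued the
 * message, not on whichever driver is bound to the queue by now. */
static void msg_complete(struct message *m)
{
	assert(m != NULL && m->owner != NULL);
	m->owner->refcount--;
	m->owner = NULL;
}
```

If the completion instead dropped a reference on the currently bound driver, the old module's count would stay too high and the new module's count would go too low after a driver change.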
  9. 01 Mar, 2019 10 commits
  10. 27 Feb, 2019 2 commits
    • s390/cio: Use cpa range elsewhere within vfio-ccw · 2904337f
      Eric Farman committed
      Since we have a little function to see whether a channel
      program address falls within a range of CCWs, let's use
      it in the other places of code that make these checks.
      
      (Why isn't ccw_head fully removed?  Well, because this
      way some long lines don't have to be reflowed.)
      Signed-off-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20190222183941.29596-3-farman@linux.ibm.com>
      Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
      Signed-off-by: Cornelia Huck <cohuck@redhat.com>
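A range check of the kind this commit reuses could look roughly like this. It is a sketch: `cpa_in_range()` and `CCW_SIZE` are hypothetical names, the 8-byte CCW size is taken from the TIC commit message in this listing, and alignment of the address is deliberately not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CCW_SIZE 8u	/* bytes per CCW */

/* True if cpa lies between the first and the last CCW of a chain that
 * starts at head and holds len CCWs. */
static bool cpa_in_range(uint32_t cpa, uint32_t head, int len)
{
	assert(len > 0);
	uint32_t tail = head + (uint32_t)(len - 1) * CCW_SIZE;

	return head <= cpa && cpa <= tail;
}
```

Centralising the comparison in one helper keeps the boundary arithmetic (the `len - 1`) in a single place instead of repeating it at every call site.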
    • s390/cio: Fix vfio-ccw handling of recursive TICs · 48bd0eee
      Eric Farman committed
      The routine ccwchain_calc_length() is tasked with looking at a
      channel program, seeing how many CCWs are chained together by
      the presence of the Chain-Command flag, and returning a count
      to the caller.
      
      Previously, it also considered a Transfer-in-Channel CCW as being
      an appropriate mechanism for chaining.  The problem at the time
      was that the TIC CCW will almost certainly not go to the next CCW
      in memory (because the CC flag would be sufficient), and so
      advancing to the next 8 bytes will cause us to read potentially
      invalid memory.  So that comparison was removed, and the target
      of the TIC is processed as a new chain.
      
      This is fine when a TIC goes to a new chain (consider a NOP+TIC to
      a channel program that is being redriven), but there is another
      scenario where this falls apart.  A TIC can be used to "rewind"
      a channel program, for example to find a particular record on a
      disk with various orientation CCWs.  In this case, we DO want to
      consider the memory after the TIC since the TIC will be skipped
      once the requested criterion is met.  This is due to the Status
      Modifier presented by the device, though software doesn't need to
      operate on it beyond understanding the behavior change of how the
      channel program is executed.
      
      So to handle this, we will re-introduce the check for a TIC CCW
      but limit it by examining the target of the TIC.  If the TIC
      doesn't go back into the current chain, then current behavior
      applies; we should stop counting CCWs and let the target of the
      TIC be handled as a new chain.  But, if the TIC DOES go back into
      the current chain, then we need to keep looking at the memory after
      the TIC for when the channel breaks out of the TIC loop.  We can't
      use tic_target_chain_exists() because the chain in question hasn't
      been built yet, so we will redefine that comparison with some small
      functions to make it more readable and to permit refactoring later.
      
      Fixes: 405d566f ("vfio-ccw: Don't assume there are more ccws after a TIC")
      Signed-off-by: Eric Farman <farman@linux.ibm.com>
      Message-Id: <20190222183941.29596-2-farman@linux.ibm.com>
      Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
      Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
      Signed-off-by: Cornelia Huck <cohuck@redhat.com>
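The counting rule described in this commit can be sketched over an in-memory array of CCWs. This is a simplified model, not ccwchain_calc_length() itself: the struct layout, the constants and `chain_length()` are illustrative assumptions, and CCW i is assumed to live at address `base + 8 * i`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_TIC		0x08u	/* Transfer-in-Channel (illustrative) */
#define FLAG_CC		0x40u	/* Chain-Command flag (illustrative) */
#define CCW_SIZE	8u

/* Simplified CCW: command code, flags, data/target address. */
struct ccw {
	uint8_t cmd;
	uint8_t flags;
	uint32_t cda;
};

/* Does this TIC jump back into the cnt CCWs counted so far? */
static bool tic_within_chain(const struct ccw *c, uint32_t base, int cnt)
{
	uint32_t tail = base + (uint32_t)(cnt - 1) * CCW_SIZE;

	return c->cmd == CMD_TIC && base <= c->cda && c->cda <= tail;
}

/* Count the CCWs of the chain at cp; -1 if no end within avail CCWs. */
static int chain_length(const struct ccw *cp, int avail, uint32_t base)
{
	int cnt = 0;

	assert(cp != 0 && avail > 0);
	for (int i = 0; i < avail; i++) {
		cnt++;
		bool chained = (cp[i].flags & FLAG_CC) != 0;

		/* Stop unless command chaining continues or a TIC
		 * rewinds into the current chain. */
		if (!chained && !tic_within_chain(&cp[i], base, cnt))
			return cnt;
	}
	return -1;
}
```

A TIC that targets another chain ends the count at the TIC, while a TIC that rewinds into the current chain lets the count continue past it.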