1. 07 7月, 2017 13 次提交
    • B
      xen/scsiback: Make TMF processing slightly faster · af90e84d
      Bart Van Assche 提交于
      Target drivers must guarantee that struct se_cmd and struct se_tmr_req
      exist as long as target_tmr_work() is in progress. Since the last
      access by the LIO core is a call to .check_stop_free() and since the
      Xen scsiback .check_stop_free() drops a reference to the TMF, it is
      already guaranteed that the struct se_cmd that corresponds to the TMF
      exists as long as target_tmr_work() is in progress. Hence change the
      second argument of transport_generic_free_cmd() from 1 into 0.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: xen-devel@lists.xenproject.org
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      af90e84d
    • B
      xen/scsiback: Replace a waitqueue and a counter by a completion · e3eac124
      Bart Van Assche 提交于
      This patch simplifies the implementation of the scsiback driver
      but does not change its behavior.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: xen-devel@lists.xenproject.org
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      e3eac124
    • B
      xen/scsiback: Fix a TMR related use-after-free · 9f4ab18a
      Bart Van Assche 提交于
      scsiback_release_cmd() must not dereference se_cmd->se_tmr_req
      because that memory is freed by target_free_cmd_mem() before
      scsiback_release_cmd() is called. Fix this use-after-free by
      inlining struct scsiback_tmr into struct vscsibk_pend.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Cc: xen-devel@lists.xenproject.org
      Cc: <stable@vger.kernel.org> # 3.18+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      9f4ab18a
    • B
      IB/srpt: Make a debug statement in srpt_abort_cmd() more informative · 13fdd445
      Bart Van Assche 提交于
      Do not only report the state of the I/O context before srpt_abort_cmd()
      was called but also the new state assigned by srpt_abort_cmd()
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andy Grover <agrover@redhat.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      13fdd445
    • B
      target: Fix a deadlock between the XCOPY code and iSCSI session shutdown · d877d727
      Bart Van Assche 提交于
      Move the code for parsing an XCOPY command from the context of
      the iSCSI receiver thread to the context of the XCOPY workqueue.
      Keep the simple XCOPY checks in the context of the iSCSI receiver
      thread. Move the code for allocating and freeing struct xcopy_op
      from the code that parses an XCOPY command to its caller.
      
      This patch fixes the following deadlock:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      4.10.0-rc7-dbg+ #1 Not tainted
      -------------------------------------------------------
      rmdir/13321 is trying to acquire lock:
       (&sess->cmdsn_mutex){+.+.+.}, at: [<ffffffffa02cb47d>] iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]
      
      but task is already holding lock:
       (&sb->s_type->i_mutex_key#14){++++++}, at: [<ffffffff811c6e20>] vfs_rmdir+0x50/0x140
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      -> #1 (&sb->s_type->i_mutex_key#14){++++++}:
       lock_acquire+0x71/0x90
       down_write+0x3f/0x70
       configfs_depend_item+0x3a/0xb0 [configfs]
       target_depend_item+0x13/0x20 [target_core_mod]
       target_xcopy_locate_se_dev_e4+0xdd/0x1a0 [target_core_mod]
       target_do_xcopy+0x34b/0x970 [target_core_mod]
       __target_execute_cmd+0x22/0xa0 [target_core_mod]
       target_execute_cmd+0x233/0x2c0 [target_core_mod]
       iscsit_execute_cmd+0x208/0x270 [iscsi_target_mod]
       iscsit_sequence_cmd+0x10b/0x190 [iscsi_target_mod]
       iscsit_get_rx_pdu+0x37d/0xcd0 [iscsi_target_mod]
       iscsi_target_rx_thread+0x6e/0xa0 [iscsi_target_mod]
       kthread+0x102/0x140
       ret_from_fork+0x31/0x40
      
      -> #0 (&sess->cmdsn_mutex){+.+.+.}:
       __lock_acquire+0x10e6/0x1260
       lock_acquire+0x71/0x90
       mutex_lock_nested+0x5f/0x670
       iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]
       iscsit_close_session+0xac/0x200 [iscsi_target_mod]
       lio_tpg_close_session+0x9f/0xb0 [iscsi_target_mod]
       target_shutdown_sessions+0xc3/0xd0 [target_core_mod]
       core_tpg_del_initiator_node_acl+0x91/0x140 [target_core_mod]
       target_fabric_nacl_base_release+0x20/0x30 [target_core_mod]
       config_item_release+0x5a/0xc0 [configfs]
       config_item_put+0x1d/0x1f [configfs]
       configfs_rmdir+0x1a6/0x300 [configfs]
       vfs_rmdir+0xb7/0x140
       do_rmdir+0x1f4/0x200
       SyS_rmdir+0x11/0x20
       entry_SYSCALL_64_fastpath+0x23/0xc6
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&sb->s_type->i_mutex_key#14);
                                     lock(&sess->cmdsn_mutex);
                                     lock(&sb->s_type->i_mutex_key#14);
        lock(&sess->cmdsn_mutex);
      
       *** DEADLOCK ***
      
      3 locks held by rmdir/13321:
       #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff811e1aff>] mnt_want_write+0x1f/0x50
       #1:  (&default_group_class[depth - 1]#2/1){+.+.+.}, at: [<ffffffff811cc8ce>] do_rmdir+0x15e/0x200
       #2:  (&sb->s_type->i_mutex_key#14){++++++}, at: [<ffffffff811c6e20>] vfs_rmdir+0x50/0x140
      
      stack backtrace:
      CPU: 2 PID: 13321 Comm: rmdir Not tainted 4.10.0-rc7-dbg+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Call Trace:
       dump_stack+0x86/0xc3
       print_circular_bug+0x1c7/0x220
       __lock_acquire+0x10e6/0x1260
       lock_acquire+0x71/0x90
       mutex_lock_nested+0x5f/0x670
       iscsit_free_all_ooo_cmdsns+0x2d/0xb0 [iscsi_target_mod]
       iscsit_close_session+0xac/0x200 [iscsi_target_mod]
       lio_tpg_close_session+0x9f/0xb0 [iscsi_target_mod]
       target_shutdown_sessions+0xc3/0xd0 [target_core_mod]
       core_tpg_del_initiator_node_acl+0x91/0x140 [target_core_mod]
       target_fabric_nacl_base_release+0x20/0x30 [target_core_mod]
       config_item_release+0x5a/0xc0 [configfs]
       config_item_put+0x1d/0x1f [configfs]
       configfs_rmdir+0x1a6/0x300 [configfs]
       vfs_rmdir+0xb7/0x140
       do_rmdir+0x1f4/0x200
       SyS_rmdir+0x11/0x20
       entry_SYSCALL_64_fastpath+0x23/0xc6
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andy Grover <agrover@redhat.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      d877d727
    • B
      target: Use {get,put}_unaligned_be*() instead of open coding these functions · a85d667e
      Bart Van Assche 提交于
      Introduce the function get_unaligned_be24(). Use {get,put}_unaligned_be*()
      where appropriate. This patch does not change any functionality.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Cc: Andy Grover <agrover@redhat.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      a85d667e
    • B
      target: Fix transport_init_se_cmd() · f2b72d6a
      Bart Van Assche 提交于
      Avoid that aborting a command before it has been submitted onto
      a workqueue triggers the following warning:
      
      INFO: trying to register non-static key.
      the code is fine but needs lockdep annotation.
      turning off the locking correctness validator.
      CPU: 3 PID: 46 Comm: kworker/u8:1 Not tainted 4.12.0-rc2-dbg+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
      Workqueue: tmr-iblock target_tmr_work [target_core_mod]
      Call Trace:
       dump_stack+0x86/0xcf
       register_lock_class+0xe8/0x570
       __lock_acquire+0xa1/0x11d0
       lock_acquire+0x59/0x80
       flush_work+0x42/0x2b0
       __cancel_work_timer+0x10c/0x180
       cancel_work_sync+0xb/0x10
       core_tmr_lun_reset+0x352/0x740 [target_core_mod]
       target_tmr_work+0xd6/0x130 [target_core_mod]
       process_one_work+0x1ca/0x3f0
       worker_thread+0x49/0x3b0
       kthread+0x109/0x140
       ret_from_fork+0x31/0x40
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      f2b72d6a
    • B
      target: Remove se_device.dev_list · 9f2f3428
      Bart Van Assche 提交于
      The last user of se_device.dev_list was removed through commit
      0fd97ccf ("target: kill struct se_subsystem_dev"). Hence
      also remove se_device.dev_list.
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Andy Grover <agrover@redhat.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      9f2f3428
    • B
      target: Use symbolic value for WRITE_VERIFY_16 · 3e182db7
      Bart Van Assche 提交于
      Now that a symbolic value has been introduced for WRITE_VERIFY_16,
      use it. This patch does not change any functionality.
      
      References: commit c2d26f18 ("target: Add WRITE_VERIFY_16")
      Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
      Cc: Andy Grover <agrover@redhat.com>
      Cc: David Disseldorp <ddiss@suse.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      3e182db7
    • N
      qla2xxx: Convert QLA_TGT_ABTS to TARGET_SCF_LOOKUP_LUN_FROM_TAG · eb5ae233
      Nicholas Bellinger 提交于
      Following Himanshu's earlier patch to drop the redundant tag
      lookup within __qlt_24xx_handle_abts(), go ahead and drop this
      now QLA_TGT_ABTS can use TARGET_SCF_LOOKUP_LUN_FROM_TAG and
      have target_submit_tmr() do this from common code.
      Reviewed-by: NHimanshu Madhani <himanshu.madhani@cavium.com>
      Acked-by: NHimanshu Madhani <himanshu.madhani@cavium.com>
      Reviewed-by: NQuinn Tran <quinn.tran@cavium.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      eb5ae233
    • N
      target: Add TARGET_SCF_LOOKUP_LUN_FROM_TAG support for ABORT_TASK · 5465e7d3
      Nicholas Bellinger 提交于
      This patch introduces support in target_submit_tmr() for locating a
      unpacked_lun from an existing se_cmd->tag during ABORT_TASK.
      
      When TARGET_SCF_LOOKUP_LUN_FROM_TAG is set, target_submit_tmr()
      will do the extra lookup via target_lookup_lun_from_tag() and
      subsequently invoke transport_lookup_tmr_lun() so a proper
      percpu se_lun->lun_ref is taken before workqueue dispatch into
      se_device->tmr_wq happens.
      
      Aside from the extra target_lookup_lun_from_tag(), the existing
      code-path remains unchanged.
      Reviewed-by: NHimanshu Madhani <himanshu.madhani@cavium.com>
      Reviewed-by: NQuinn Tran <quinn.tran@cavium.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      5465e7d3
    • N
      target: Add support for TMR percpu reference counting · eeb64d23
      Nicholas Bellinger 提交于
      This patch introduces TMR percpu reference counting using
      se_lun->lun_ref in transport_lookup_tmr_lun(), following
      how existing non TMR per se_lun reference counting works
      within transport_lookup_cmd_lun().
      
      It also adds explicit transport_lun_remove_cmd() calls to
      drop the reference in the three tmr related locations that
      invoke transport_cmd_check_stop_to_fabric();
      
         - target_tmr_work() during normal ->queue_tm_rsp()
         - target_complete_tmr_failure() during error ->queue_tm_rsp()
         - transport_generic_handle_tmr() during early failure
      
      Also, note the exception paths in transport_generic_free_cmd()
      and transport_cmd_finish_abort() already check SCF_SE_LUN_CMD,
      and will invoke transport_lun_remove_cmd() when necessary.
      Reviewed-by: NHimanshu Madhani <himanshu.madhani@cavium.com>
      Reviewed-by: NQuinn Tran <quinn.tran@cavium.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      eeb64d23
    • J
      target: reject COMPARE_AND_WRITE if emulate_caw is not set · 12f66e4a
      Jiang Yi 提交于
      In struct se_dev_attrib, there is a field emulate_caw exposed
      as a /sys/kernel/config/target/core/$HBA/$DEV/attrib/.
      
      If this field is set zero, it means the corresponding struct se_device
      does not support the scsi cmd COMPARE_AND_WRITE
      
      In function sbc_parse_cdb(), go ahead and reject scsi COMPARE_AND_WRITE
      if emulate_caw is not set, because it has been explicitly disabled
      from user-space.
      
      (Make pr_err ratelimited - nab)
      Signed-off-by: NJiang Yi <jiangyilism@gmail.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      12f66e4a
  2. 09 6月, 2017 10 次提交
    • N
      iscsi-target: Avoid holding ->tpg_state_lock during param update · eceb4459
      Nicholas Bellinger 提交于
      As originally reported by Jia-Ju, iscsit_tpg_enable_portal_group()
      holds iscsi_portal_group->tpg_state_lock while updating AUTHMETHOD
      via iscsi_update_param_value(), which performs a GFP_KERNEL
      allocation.
      
      However, since iscsit_tpg_enable_portal_group() is already protected
      by iscsit_get_tpg() -> iscsi_portal_group->tpg_access_lock in it's
      parent caller, ->tpg_state_lock only needs to be held when setting
      TPG_STATE_ACTIVE.
      Reported-by: NJia-Ju Bai <baijiaju1990@163.com>
      Reviewed-by: NJia-Ju Bai <baijiaju1990@163.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      eceb4459
    • N
      target/configfs: Kill se_lun->lun_link_magic · 9ae0e9ad
      Nicholas Bellinger 提交于
      Instead of using a hardcoded magic value in se_lun when verifying
      a target config_item symlink source during target_fabric_mappedlun_link(),
      go ahead and use target_fabric_port_item_ops directly instead.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      9ae0e9ad
    • N
      target/configfs: Kill se_device->dev_link_magic · c17cd249
      Nicholas Bellinger 提交于
      Instead of using a hardcoded magic value in se_device when verifying
      a target config_item symlink source during target_fabric_port_link(),
      go ahead and use target_core_dev_item_ops directly instead.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      c17cd249
    • N
      target/iblock: Convert WRITE_SAME to blkdev_issue_zeroout · 2237498f
      Nicholas Bellinger 提交于
      The people who are actively using iblock_execute_write_same_direct() are
      doing so in the context of ESX VAAI BlockZero, together with
      EXTENDED_COPY and COMPARE_AND_WRITE primitives.
      
      In practice though I've not seen any users of IBLOCK WRITE_SAME for
      anything other than VAAI BlockZero, so just using blkdev_issue_zeroout()
      when available, and falling back to iblock_execute_write_same() if the
      WRITE_SAME buffer contains anything other than zeros should be OK.
      
      (Hook up max_write_zeroes_sectors to signal LBPRZ feature bit in
       target_configure_unmap_from_queue - nab)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      2237498f
    • M
      ibmvscsis: Enable Logical Partition Migration Support · 464fd641
      Michael Cyr 提交于
      Changes to support a new mechanism from phyp to better synchronize the
      logical partition migration (LPM) of the client partition.
      This includes a new VIOCTL to register that we support this new
      functionality, and 2 new Transport Event types, and finally another
      new VIOCTL to let phyp know once we're ready for the Suspend.
      Signed-off-by: NMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      464fd641
    • B
      vhost/scsi: Don't reinvent the wheel but use existing llist API · 12bdcbd5
      Byungchul Park 提交于
      Although llist provides proper APIs, they are not used. Make them used.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Acked-by: NNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      12bdcbd5
    • G
      target: remove dead code · fb418240
      Gustavo A. R. Silva 提交于
      Local variable _ret_ is assigned to a constant value and it is never
      updated again. Remove this variable and the dead code it guards.
      
      Addresses-Coverity-ID: 140761
      Signed-off-by: NGustavo A. R. Silva <garsilva@embeddedor.com>
      Reviewed-by: NTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      fb418240
    • N
      iscsi-target: Reject immediate data underflow larger than SCSI transfer length · abb85a9b
      Nicholas Bellinger 提交于
      When iscsi WRITE underflow occurs there are two different scenarios
      that can happen.
      
      Normally in practice, when an EDTL vs. SCSI CDB TRANSFER LENGTH
      underflow is detected, the iscsi immediate data payload is the
      smaller SCSI CDB TRANSFER LENGTH.
      
      That is, when a host fabric LLD is using a fixed size EDTL for
      a specific control CDB, the SCSI CDB TRANSFER LENGTH and actual
      SCSI payload ends up being smaller than EDTL.  In iscsi, this
      means the received iscsi immediate data payload matches the
      smaller SCSI CDB TRANSFER LENGTH, because there is no more
      SCSI payload to accept beyond SCSI CDB TRANSFER LENGTH.
      
      However, it's possible for a malicous host to send a WRITE
      underflow where EDTL is larger than SCSI CDB TRANSFER LENGTH,
      but incoming iscsi immediate data actually matches EDTL.
      
      In the wild, we've never had a iscsi host environment actually
      try to do this.
      
      For this special case, it's wrong to truncate part of the
      control CDB payload and continue to process the command during
      underflow when immediate data payload received was larger than
      SCSI CDB TRANSFER LENGTH, so go ahead and reject and drop the
      bogus payload as a defensive action.
      
      Note this potential bug was originally relaxed by the following
      for allowing WRITE underflow in MSFT FCP host environments:
      
         commit c72c5250
         Author: Roland Dreier <roland@purestorage.com>
         Date:   Wed Jul 22 15:08:18 2015 -0700
      
            target: allow underflow/overflow for PR OUT etc. commands
      
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      abb85a9b
    • N
      iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP · 105fa2f4
      Nicholas Bellinger 提交于
      This patch fixes a BUG() in iscsit_close_session() that could be
      triggered when iscsit_logout_post_handler() execution from within
      tx thread context was not run for more than SECONDS_FOR_LOGOUT_COMP
      (15 seconds), and the TCP connection didn't already close before
      then forcing tx thread context to automatically exit.
      
      This would manifest itself during explicit logout as:
      
      [33206.974254] 1 connection(s) still exist for iSCSI session to iqn.1993-08.org.debian:01:3f5523242179
      [33206.980184] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 2100.772 msecs
      [33209.078643] ------------[ cut here ]------------
      [33209.078646] kernel BUG at drivers/target/iscsi/iscsi_target.c:4346!
      
      Normally when explicit logout attempt fails, the tx thread context
      exits and iscsit_close_connection() from rx thread context does the
      extra cleanup once it detects conn->conn_logout_remove has not been
      cleared by the logout type specific post handlers.
      
      To address this special case, if the logout post handler in tx thread
      context detects conn->tx_thread_active has already been cleared, simply
      return and exit in order for existing iscsit_close_connection()
      logic from rx thread context do failed logout cleanup.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: NGary Guo <ghg@datera.io>
      Tested-by: NChu Yuan Lin <cyl@datera.io>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      105fa2f4
    • N
      target: Fix kref->refcount underflow in transport_cmd_finish_abort · 73d4e580
      Nicholas Bellinger 提交于
      This patch fixes a se_cmd->cmd_kref underflow during CMD_T_ABORTED
      when a fabric driver drops it's second reference from below the
      target_core_tmr.c based callers of transport_cmd_finish_abort().
      
      Recently with the conversion of kref to refcount_t, this bug was
      manifesting itself as:
      
      [705519.601034] refcount_t: underflow; use-after-free.
      [705519.604034] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 20116.512 msecs
      [705539.719111] ------------[ cut here ]------------
      [705539.719117] WARNING: CPU: 3 PID: 26510 at lib/refcount.c:184 refcount_sub_and_test+0x33/0x51
      
      Since the original kref atomic_t based kref_put() didn't check for
      underflow and only invoked the final callback when zero was reached,
      this bug did not manifest in practice since all se_cmd memory is
      using preallocated tags.
      
      To address this, go ahead and propigate the existing return from
      transport_put_cmd() up via transport_cmd_finish_abort(), and
      change transport_cmd_finish_abort() + core_tmr_handle_tas_abort()
      callers to only do their local target_put_sess_cmd() if necessary.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: NGary Guo <ghg@datera.io>
      Tested-by: NChu Yuan Lin <cyl@datera.io>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      73d4e580
  3. 01 6月, 2017 2 次提交
    • J
      iscsi-target: Always wait for kthread_should_stop() before kthread exit · 5e0cf5e6
      Jiang Yi 提交于
      There are three timing problems in the kthread usages of iscsi_target_mod:
      
       - np_thread of struct iscsi_np
       - rx_thread and tx_thread of struct iscsi_conn
      
      In iscsit_close_connection(), it calls
      
       send_sig(SIGINT, conn->tx_thread, 1);
       kthread_stop(conn->tx_thread);
      
      In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive
      SIGINT the kthread will exit without checking the return value of
      kthread_should_stop().
      
      So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...)
      and kthread_stop(...), the kthread_stop() will try to stop an already
      stopped kthread.
      
      This is invalid according to the documentation of kthread_stop().
      
      (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and
       early iscsi_target_rx_thread failure case - nab)
      Signed-off-by: NJiang Yi <jiangyilism@gmail.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      5e0cf5e6
    • N
      iscsi-target: Fix initial login PDU asynchronous socket close OOPs · 25cdda95
      Nicholas Bellinger 提交于
      This patch fixes a OOPs originally introduced by:
      
         commit bb048357
         Author: Nicholas Bellinger <nab@linux-iscsi.org>
         Date:   Thu Sep 5 14:54:04 2013 -0700
      
         iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
      
      which would trigger a NULL pointer dereference when a TCP connection
      was closed asynchronously via iscsi_target_sk_state_change(), but only
      when the initial PDU processing in iscsi_target_do_login() from iscsi_np
      process context was blocked waiting for backend I/O to complete.
      
      To address this issue, this patch makes the following changes.
      
      First, it introduces some common helper functions used for checking
      socket closing state, checking login_flags, and atomically checking
      socket closing state + setting login_flags.
      
      Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
      connection has dropped via iscsi_target_sk_state_change(), but the
      initial PDU processing within iscsi_target_do_login() in iscsi_np
      context is still running.  For this case, it sets LOGIN_FLAGS_CLOSED,
      but doesn't invoke schedule_delayed_work().
      
      The original NULL pointer dereference case reported by MNC is now handled
      by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
      transitioning to FFP to determine when the socket has already closed,
      or iscsi_target_start_negotiation() if the login needs to exchange
      more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
      closed.  For both of these cases, the cleanup up of remaining connection
      resources will occur in iscsi_target_start_negotiation() from iscsi_np
      process context once the failure is detected.
      
      Finally, to handle to case where iscsi_target_sk_state_change() is
      called after the initial PDU procesing is complete, it now invokes
      conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
      existing iscsi_target_sk_check_close() checks detect connection failure.
      For this case, the cleanup of remaining connection resources will occur
      in iscsi_target_do_login_rx() from delayed workqueue process context
      once the failure is detected.
      Reported-by: NMike Christie <mchristi@redhat.com>
      Reviewed-by: NMike Christie <mchristi@redhat.com>
      Tested-by: NMike Christie <mchristi@redhat.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Reported-by: NHannes Reinecke <hare@suse.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Varun Prakash <varun@chelsio.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      25cdda95
  4. 24 5月, 2017 1 次提交
    • M
      tcmu: fix crash during device removal · f3cdbe39
      Mike Christie 提交于
      We currently do
      
      tcmu_free_device ->tcmu_netlink_event(TCMU_CMD_REMOVED_DEVICE) ->
      uio_unregister_device -> kfree(tcmu_dev).
      
      The problem is that the kernel does not wait for userspace to
      do the close() on the uio device before freeing the tcmu_dev.
      We can then hit a race where the kernel frees the tcmu_dev before
      userspace does close() and so when close() -> release -> tcmu_release
      is done, we try to access a freed tcmu_dev.
      
      This patch made over the target-pending master branch moves the freeing
      of the tcmu_dev to when the last reference has been dropped.
      
      This also fixes a leak where if tcmu_configure_device was not called on a
      device we did not free udev->name which was allocated at tcmu_alloc_device time.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      f3cdbe39
  5. 16 5月, 2017 3 次提交
    • N
      target: Re-add check to reject control WRITEs with overflow data · 4ff83daa
      Nicholas Bellinger 提交于
      During v4.3 when the overflow/underflow check was relaxed by
      commit c72c5250:
      
        commit c72c5250
        Author: Roland Dreier <roland@purestorage.com>
        Date:   Wed Jul 22 15:08:18 2015 -0700
      
             target: allow underflow/overflow for PR OUT etc. commands
      
      to allow underflow/overflow for Windows compliance + FCP, a
      consequence was to allow control CDBs to process overflow
      data for iscsi-target with immediate data as well.
      
      As per Roland's original change, continue to allow underflow
      cases for control CDBs to make Windows compliance + FCP happy,
      but until overflow for control CDBs is supported tree-wide,
      explicitly reject all control WRITEs with overflow following
      pre v4.3.y logic.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      4ff83daa
    • B
      ibmvscsis: Fix the incorrect req_lim_delta · 75dbf2d3
      Bryant G. Ly 提交于
      The current code is not correctly calculating the req_lim_delta.
      
      We want to make sure vscsi->credit is always incremented when
      we do not send a response for the scsi op. Thus for the case where
      there is a successfully aborted task we need to make sure the
      vscsi->credit is incremented.
      
      v2 - Moves the original location of the vscsi->credit increment
      to a better spot. Since if we increment credit, the next command
      we send back will have increased req_lim_delta. But we probably
      shouldn't be doing that until the aborted cmd is actually released.
      Otherwise the client will think that it can send a new command, and
      we could find ourselves short of command elements. Not likely, but could
      happen.
      
      This patch depends on both:
      commit 25e78531 ("ibmvscsis: Do not send aborted task response")
      commit 98883f1b ("ibmvscsis: Clear left-over abort_cmd pointers")
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: NMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      75dbf2d3
    • B
      ibmvscsis: Clear left-over abort_cmd pointers · 98883f1b
      Bryant G. Ly 提交于
      With the addition of ibmvscsis->abort_cmd pointer within
      commit 25e78531 ("ibmvscsis: Do not send aborted task response"),
      make sure to explicitly NULL these pointers when clearing
      DELAY_SEND flag.
      
      Do this for two cases, when getting the new new ibmvscsis
      descriptor in ibmvscsis_get_free_cmd() and before posting
      the response completion in ibmvscsis_send_messages().
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: NMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      98883f1b
  6. 14 5月, 2017 5 次提交
    • L
      Linux 4.12-rc1 · 2ea659a9
      Linus Torvalds 提交于
      2ea659a9
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · cd636458
      Linus Torvalds 提交于
      Pull some more input subsystem updates from Dmitry Torokhov:
       "An updated xpad driver with a few more recognized device IDs, and a
        new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
        via SPI bus"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: cros_ec_keyb - remove extraneous 'const'
        Input: add support for PlayStation 1/2 joypads connected via SPI
        Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
        Input: xpad - sync supported devices with xboxdrv
        Input: xpad - sort supported devices by USB ID
      cd636458
    • L
      Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs · b53c4d5e
      Linus Torvalds 提交于
      Pull UBI/UBIFS updates from Richard Weinberger:
      
       - new config option CONFIG_UBIFS_FS_SECURITY
      
       - minor improvements
      
       - random fixes
      
      * tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
        ubi: Add debugfs file for tracking PEB state
        ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
        ubifs: Remove unnecessary assignment
        ubifs: Fix cut and paste error on sb type comparisons
        ubi: fastmap: Fix slab corruption
        ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
        ubi: Make mtd parameter readable
        ubi: Fix section mismatch
      b53c4d5e
    • L
      Merge branch 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · ec059019
      Linus Torvalds 提交于
      Pull UML fixes from Richard Weinberger:
       "No new stuff, just fixes"
      
      * 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: Add missing NR_CPUS include
        um: Fix to call read_initrd after init_bootmem
        um: Include kbuild.h instead of duplicating its macros
        um: Fix PTRACE_POKEUSER on x86_64
        um: Set number of CPUs
        um: Fix _print_addr()
      ec059019
    • L
      Merge branch 'akpm' (patches from Andrew) · 1251704a
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, docs: update memory.stat description with workingset* entries
        mm: vmscan: scan until it finds eligible pages
        mm, thp: copying user pages must schedule on collapse
        dax: fix PMD data corruption when fault races with write
        dax: fix data corruption when fault races with write
        ext4: return to starting transaction in ext4_dax_huge_fault()
        mm: fix data corruption due to stale mmap reads
        dax: prevent invalidation of mapped DAX entries
        Tigran has moved
        mm, vmalloc: fix vmalloc users tracking properly
        mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
        gcov: support GCC 7.1
        mm, vmstat: Remove spurious WARN() during zoneinfo print
        time: delete current_fs_time()
        hwpoison, memcg: forcibly uncharge LRU pages
      1251704a
  7. 13 5月, 2017 6 次提交
    • R
      mm, docs: update memory.stat description with workingset* entries · b340959e
      Roman Gushchin 提交于
      Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
      workingset transition") introduced three new entries in memory stat
      file:
      
       - workingset_refault
       - workingset_activate
       - workingset_nodereclaim
      
      This commit adds a corresponding description to the cgroup v2 docs.
      
      Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b340959e
    • M
      mm: vmscan: scan until it finds eligible pages · 791b48b6
      Minchan Kim 提交于
      Although there are a ton of free swap and anonymous LRU page in elgible
      zones, OOM happened.
      
        balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
        CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
         oom_kill_process+0x21d/0x3f0
         out_of_memory+0xd8/0x390
         __alloc_pages_slowpath+0xbc1/0xc50
         __alloc_pages_nodemask+0x1a5/0x1c0
         pte_alloc_one+0x20/0x50
         __pte_alloc+0x1e/0x110
         __handle_mm_fault+0x919/0x960
         handle_mm_fault+0x77/0x120
         __do_page_fault+0x27a/0x550
         trace_do_page_fault+0x43/0x150
         do_async_page_fault+0x2c/0x90
         async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:424716 inactive_anon:65314 isolated_anon:0
         active_file:52 inactive_file:46 isolated_file:0
         unevictable:0 dirty:27 writeback:0 unstable:0
         slab_reclaimable:3967 slab_unreclaimable:4125
         mapped:133 shmem:43 pagetables:1674 bounce:0
         free:4637 free_pcp:225 free_cma:0
        Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 992 992 1952
        DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 959
        Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
        DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
        Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
        378 total pagecache pages
        17 pages in swap cache
        Swap cache stats: add 17325, delete 17302, find 0/27
        Free swap  = 978940kB
        Total swap = 1048572kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        12629 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
        [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
        [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
        [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd
      
      With investigation, skipping page of isolate_lru_pages makes reclaim
      void because it returns zero nr_taken easily so LRU shrinking is
      effectively nothing and just increases priority aggressively.  Finally,
      OOM happens.
      
      The problem is that get_scan_count determines nr_to_scan with eligible
      zones so although priority drops to zero, it couldn't reclaim any pages
      if the LRU contains mostly ineligible pages.
      
      get_scan_count:
      
              size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
      	size = size >> sc->priority;
      
      Assumes sc->priority is 0 and LRU list is as follows.
      
      	N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H
      
      (Ie, small eligible pages are in the head of LRU but others are
       almost ineligible pages)
      
      In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
      tail of the LRU are not eligible pages.  If get_scan_count counts
      skipped pages, it doesn't reclaim any pages remained after scanning 4
      pages so it ends up OOM happening.
      
      This patch makes isolate_lru_pages try to scan pages until it encounters
      eligible zones's pages.
      
      [akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
      Fixes: 3db65812 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
      Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      791b48b6
    • D
      mm, thp: copying user pages must schedule on collapse · 338a16ba
      David Rientjes 提交于
      We have encountered need_resched warnings in __collapse_huge_page_copy()
      while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.
      
      mm->mmap_sem is held for write, but the iteration is well bounded.
      
      Reschedule as needed.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      338a16ba
    • R
      dax: fix PMD data corruption when fault races with write · 876f2946
      Ross Zwisler 提交于
      This is based on a patch from Jan Kara that fixed the equivalent race in
      the DAX PTE fault path.
      
      Currently DAX PMD read fault can race with write(2) in the following
      way:
      
      CPU1 - write(2)                 CPU2 - read fault
                                      dax_iomap_pmd_fault()
                                        ->iomap_begin() - sees hole
      
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      
                                        grab_mapping_entry()
      				  - we add huge zero page to the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      876f2946
    • J
      dax: fix data corruption when fault races with write · 13e451fd
      Jan Kara 提交于
      Currently DAX read fault can race with write(2) in the following way:
      
      CPU1 - write(2)			CPU2 - read fault
      				dax_iomap_pte_fault()
      				  ->iomap_begin() - sees hole
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      				  grab_mapping_entry()
      				  - we add zero page in the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13e451fd
    • J
      ext4: return to starting transaction in ext4_dax_huge_fault() · fb26a1cb
      Jan Kara 提交于
      DAX will return to locking exceptional entry before mapping blocks for a
      page fault to fix possible races with concurrent writes.  To avoid lock
      inversion between exceptional entry lock and transaction start, start
      the transaction already in ext4_dax_huge_fault().
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb26a1cb