1. 07 7月, 2017 7 次提交
  2. 09 6月, 2017 10 次提交
    • N
      iscsi-target: Avoid holding ->tpg_state_lock during param update · eceb4459
      Nicholas Bellinger 提交于
      As originally reported by Jia-Ju, iscsit_tpg_enable_portal_group()
      holds iscsi_portal_group->tpg_state_lock while updating AUTHMETHOD
      via iscsi_update_param_value(), which performs a GFP_KERNEL
      allocation.
      
      However, since iscsit_tpg_enable_portal_group() is already protected
      by iscsit_get_tpg() -> iscsi_portal_group->tpg_access_lock in it's
      parent caller, ->tpg_state_lock only needs to be held when setting
      TPG_STATE_ACTIVE.
      Reported-by: NJia-Ju Bai <baijiaju1990@163.com>
      Reviewed-by: NJia-Ju Bai <baijiaju1990@163.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      eceb4459
    • N
      target/configfs: Kill se_lun->lun_link_magic · 9ae0e9ad
      Nicholas Bellinger 提交于
      Instead of using a hardcoded magic value in se_lun when verifying
      a target config_item symlink source during target_fabric_mappedlun_link(),
      go ahead and use target_fabric_port_item_ops directly instead.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      9ae0e9ad
    • N
      target/configfs: Kill se_device->dev_link_magic · c17cd249
      Nicholas Bellinger 提交于
      Instead of using a hardcoded magic value in se_device when verifying
      a target config_item symlink source during target_fabric_port_link(),
      go ahead and use target_core_dev_item_ops directly instead.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      c17cd249
    • N
      target/iblock: Convert WRITE_SAME to blkdev_issue_zeroout · 2237498f
      Nicholas Bellinger 提交于
      The people who are actively using iblock_execute_write_same_direct() are
      doing so in the context of ESX VAAI BlockZero, together with
      EXTENDED_COPY and COMPARE_AND_WRITE primitives.
      
      In practice though I've not seen any users of IBLOCK WRITE_SAME for
      anything other than VAAI BlockZero, so just using blkdev_issue_zeroout()
      when available, and falling back to iblock_execute_write_same() if the
      WRITE_SAME buffer contains anything other than zeros should be OK.
      
      (Hook up max_write_zeroes_sectors to signal LBPRZ feature bit in
       target_configure_unmap_from_queue - nab)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      2237498f
    • M
      ibmvscsis: Enable Logical Partition Migration Support · 464fd641
      Michael Cyr 提交于
      Changes to support a new mechanism from phyp to better synchronize the
      logical partition migration (LPM) of the client partition.
      This includes a new VIOCTL to register that we support this new
      functionality, and 2 new Transport Event types, and finally another
      new VIOCTL to let phyp know once we're ready for the Suspend.
      Signed-off-by: NMichael Cyr <mikecyr@us.ibm.com>
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      464fd641
    • B
      vhost/scsi: Don't reinvent the wheel but use existing llist API · 12bdcbd5
      Byungchul Park 提交于
      Although llist provides proper APIs, they are not used. Make them used.
      Signed-off-by: NByungchul Park <byungchul.park@lge.com>
      Acked-by: NNicholas Bellinger <nab@linux-iscsi.org>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      12bdcbd5
    • G
      target: remove dead code · fb418240
      Gustavo A. R. Silva 提交于
      Local variable _ret_ is assigned to a constant value and it is never
      updated again. Remove this variable and the dead code it guards.
      
      Addresses-Coverity-ID: 140761
      Signed-off-by: NGustavo A. R. Silva <garsilva@embeddedor.com>
      Reviewed-by: NTyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      fb418240
    • N
      iscsi-target: Reject immediate data underflow larger than SCSI transfer length · abb85a9b
      Nicholas Bellinger 提交于
      When iscsi WRITE underflow occurs there are two different scenarios
      that can happen.
      
      Normally in practice, when an EDTL vs. SCSI CDB TRANSFER LENGTH
      underflow is detected, the iscsi immediate data payload is the
      smaller SCSI CDB TRANSFER LENGTH.
      
      That is, when a host fabric LLD is using a fixed size EDTL for
      a specific control CDB, the SCSI CDB TRANSFER LENGTH and actual
      SCSI payload ends up being smaller than EDTL.  In iscsi, this
      means the received iscsi immediate data payload matches the
      smaller SCSI CDB TRANSFER LENGTH, because there is no more
      SCSI payload to accept beyond SCSI CDB TRANSFER LENGTH.
      
      However, it's possible for a malicous host to send a WRITE
      underflow where EDTL is larger than SCSI CDB TRANSFER LENGTH,
      but incoming iscsi immediate data actually matches EDTL.
      
      In the wild, we've never had a iscsi host environment actually
      try to do this.
      
      For this special case, it's wrong to truncate part of the
      control CDB payload and continue to process the command during
      underflow when immediate data payload received was larger than
      SCSI CDB TRANSFER LENGTH, so go ahead and reject and drop the
      bogus payload as a defensive action.
      
      Note this potential bug was originally relaxed by the following
      for allowing WRITE underflow in MSFT FCP host environments:
      
         commit c72c5250
         Author: Roland Dreier <roland@purestorage.com>
         Date:   Wed Jul 22 15:08:18 2015 -0700
      
            target: allow underflow/overflow for PR OUT etc. commands
      
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      abb85a9b
    • N
      iscsi-target: Fix delayed logout processing greater than SECONDS_FOR_LOGOUT_COMP · 105fa2f4
      Nicholas Bellinger 提交于
      This patch fixes a BUG() in iscsit_close_session() that could be
      triggered when iscsit_logout_post_handler() execution from within
      tx thread context was not run for more than SECONDS_FOR_LOGOUT_COMP
      (15 seconds), and the TCP connection didn't already close before
      then forcing tx thread context to automatically exit.
      
      This would manifest itself during explicit logout as:
      
      [33206.974254] 1 connection(s) still exist for iSCSI session to iqn.1993-08.org.debian:01:3f5523242179
      [33206.980184] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 2100.772 msecs
      [33209.078643] ------------[ cut here ]------------
      [33209.078646] kernel BUG at drivers/target/iscsi/iscsi_target.c:4346!
      
      Normally when explicit logout attempt fails, the tx thread context
      exits and iscsit_close_connection() from rx thread context does the
      extra cleanup once it detects conn->conn_logout_remove has not been
      cleared by the logout type specific post handlers.
      
      To address this special case, if the logout post handler in tx thread
      context detects conn->tx_thread_active has already been cleared, simply
      return and exit in order for existing iscsit_close_connection()
      logic from rx thread context do failed logout cleanup.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: NGary Guo <ghg@datera.io>
      Tested-by: NChu Yuan Lin <cyl@datera.io>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      105fa2f4
    • N
      target: Fix kref->refcount underflow in transport_cmd_finish_abort · 73d4e580
      Nicholas Bellinger 提交于
      This patch fixes a se_cmd->cmd_kref underflow during CMD_T_ABORTED
      when a fabric driver drops it's second reference from below the
      target_core_tmr.c based callers of transport_cmd_finish_abort().
      
      Recently with the conversion of kref to refcount_t, this bug was
      manifesting itself as:
      
      [705519.601034] refcount_t: underflow; use-after-free.
      [705519.604034] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 20116.512 msecs
      [705539.719111] ------------[ cut here ]------------
      [705539.719117] WARNING: CPU: 3 PID: 26510 at lib/refcount.c:184 refcount_sub_and_test+0x33/0x51
      
      Since the original kref atomic_t based kref_put() didn't check for
      underflow and only invoked the final callback when zero was reached,
      this bug did not manifest in practice since all se_cmd memory is
      using preallocated tags.
      
      To address this, go ahead and propigate the existing return from
      transport_put_cmd() up via transport_cmd_finish_abort(), and
      change transport_cmd_finish_abort() + core_tmr_handle_tas_abort()
      callers to only do their local target_put_sess_cmd() if necessary.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Tested-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: stable@vger.kernel.org # 3.14+
      Tested-by: NGary Guo <ghg@datera.io>
      Tested-by: NChu Yuan Lin <cyl@datera.io>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      73d4e580
  3. 01 6月, 2017 2 次提交
    • J
      iscsi-target: Always wait for kthread_should_stop() before kthread exit · 5e0cf5e6
      Jiang Yi 提交于
      There are three timing problems in the kthread usages of iscsi_target_mod:
      
       - np_thread of struct iscsi_np
       - rx_thread and tx_thread of struct iscsi_conn
      
      In iscsit_close_connection(), it calls
      
       send_sig(SIGINT, conn->tx_thread, 1);
       kthread_stop(conn->tx_thread);
      
      In conn->tx_thread, which is iscsi_target_tx_thread(), when it receive
      SIGINT the kthread will exit without checking the return value of
      kthread_should_stop().
      
      So if iscsi_target_tx_thread() exit right between send_sig(SIGINT...)
      and kthread_stop(...), the kthread_stop() will try to stop an already
      stopped kthread.
      
      This is invalid according to the documentation of kthread_stop().
      
      (Fix -ECONNRESET logout handling in iscsi_target_tx_thread and
       early iscsi_target_rx_thread failure case - nab)
      Signed-off-by: NJiang Yi <jiangyilism@gmail.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      5e0cf5e6
    • N
      iscsi-target: Fix initial login PDU asynchronous socket close OOPs · 25cdda95
      Nicholas Bellinger 提交于
      This patch fixes a OOPs originally introduced by:
      
         commit bb048357
         Author: Nicholas Bellinger <nab@linux-iscsi.org>
         Date:   Thu Sep 5 14:54:04 2013 -0700
      
         iscsi-target: Add sk->sk_state_change to cleanup after TCP failure
      
      which would trigger a NULL pointer dereference when a TCP connection
      was closed asynchronously via iscsi_target_sk_state_change(), but only
      when the initial PDU processing in iscsi_target_do_login() from iscsi_np
      process context was blocked waiting for backend I/O to complete.
      
      To address this issue, this patch makes the following changes.
      
      First, it introduces some common helper functions used for checking
      socket closing state, checking login_flags, and atomically checking
      socket closing state + setting login_flags.
      
      Second, it introduces a LOGIN_FLAGS_INITIAL_PDU bit to know when a TCP
      connection has dropped via iscsi_target_sk_state_change(), but the
      initial PDU processing within iscsi_target_do_login() in iscsi_np
      context is still running.  For this case, it sets LOGIN_FLAGS_CLOSED,
      but doesn't invoke schedule_delayed_work().
      
      The original NULL pointer dereference case reported by MNC is now handled
      by iscsi_target_do_login() doing a iscsi_target_sk_check_close() before
      transitioning to FFP to determine when the socket has already closed,
      or iscsi_target_start_negotiation() if the login needs to exchange
      more PDUs (eg: iscsi_target_do_login returned 0) but the socket has
      closed.  For both of these cases, the cleanup up of remaining connection
      resources will occur in iscsi_target_start_negotiation() from iscsi_np
      process context once the failure is detected.
      
      Finally, to handle to case where iscsi_target_sk_state_change() is
      called after the initial PDU procesing is complete, it now invokes
      conn->login_work -> iscsi_target_do_login_rx() to perform cleanup once
      existing iscsi_target_sk_check_close() checks detect connection failure.
      For this case, the cleanup of remaining connection resources will occur
      in iscsi_target_do_login_rx() from delayed workqueue process context
      once the failure is detected.
      Reported-by: NMike Christie <mchristi@redhat.com>
      Reviewed-by: NMike Christie <mchristi@redhat.com>
      Tested-by: NMike Christie <mchristi@redhat.com>
      Cc: Mike Christie <mchristi@redhat.com>
      Reported-by: NHannes Reinecke <hare@suse.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Varun Prakash <varun@chelsio.com>
      Cc: <stable@vger.kernel.org> # v3.12+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      25cdda95
  4. 24 5月, 2017 1 次提交
    • M
      tcmu: fix crash during device removal · f3cdbe39
      Mike Christie 提交于
      We currently do
      
      tcmu_free_device ->tcmu_netlink_event(TCMU_CMD_REMOVED_DEVICE) ->
      uio_unregister_device -> kfree(tcmu_dev).
      
      The problem is that the kernel does not wait for userspace to
      do the close() on the uio device before freeing the tcmu_dev.
      We can then hit a race where the kernel frees the tcmu_dev before
      userspace does close() and so when close() -> release -> tcmu_release
      is done, we try to access a freed tcmu_dev.
      
      This patch made over the target-pending master branch moves the freeing
      of the tcmu_dev to when the last reference has been dropped.
      
      This also fixes a leak where if tcmu_configure_device was not called on a
      device we did not free udev->name which was allocated at tcmu_alloc_device time.
      Signed-off-by: NMike Christie <mchristi@redhat.com>
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      f3cdbe39
  5. 16 5月, 2017 3 次提交
    • N
      target: Re-add check to reject control WRITEs with overflow data · 4ff83daa
      Nicholas Bellinger 提交于
      During v4.3 when the overflow/underflow check was relaxed by
      commit c72c5250:
      
        commit c72c5250
        Author: Roland Dreier <roland@purestorage.com>
        Date:   Wed Jul 22 15:08:18 2015 -0700
      
             target: allow underflow/overflow for PR OUT etc. commands
      
      to allow underflow/overflow for Windows compliance + FCP, a
      consequence was to allow control CDBs to process overflow
      data for iscsi-target with immediate data as well.
      
      As per Roland's original change, continue to allow underflow
      cases for control CDBs to make Windows compliance + FCP happy,
      but until overflow for control CDBs is supported tree-wide,
      explicitly reject all control WRITEs with overflow following
      pre v4.3.y logic.
      Reported-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: <stable@vger.kernel.org> # v4.3+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      4ff83daa
    • B
      ibmvscsis: Fix the incorrect req_lim_delta · 75dbf2d3
      Bryant G. Ly 提交于
      The current code is not correctly calculating the req_lim_delta.
      
      We want to make sure vscsi->credit is always incremented when
      we do not send a response for the scsi op. Thus for the case where
      there is a successfully aborted task we need to make sure the
      vscsi->credit is incremented.
      
      v2 - Moves the original location of the vscsi->credit increment
      to a better spot. Since if we increment credit, the next command
      we send back will have increased req_lim_delta. But we probably
      shouldn't be doing that until the aborted cmd is actually released.
      Otherwise the client will think that it can send a new command, and
      we could find ourselves short of command elements. Not likely, but could
      happen.
      
      This patch depends on both:
      commit 25e78531 ("ibmvscsis: Do not send aborted task response")
      commit 98883f1b ("ibmvscsis: Clear left-over abort_cmd pointers")
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: NMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      75dbf2d3
    • B
      ibmvscsis: Clear left-over abort_cmd pointers · 98883f1b
      Bryant G. Ly 提交于
      With the addition of ibmvscsis->abort_cmd pointer within
      commit 25e78531 ("ibmvscsis: Do not send aborted task response"),
      make sure to explicitly NULL these pointers when clearing
      DELAY_SEND flag.
      
      Do this for two cases, when getting the new new ibmvscsis
      descriptor in ibmvscsis_get_free_cmd() and before posting
      the response completion in ibmvscsis_send_messages().
      Signed-off-by: NBryant G. Ly <bryantly@linux.vnet.ibm.com>
      Reviewed-by: NMichael Cyr <mikecyr@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org> # v4.8+
      Signed-off-by: NNicholas Bellinger <nab@linux-iscsi.org>
      98883f1b
  6. 14 5月, 2017 5 次提交
    • L
      Linux 4.12-rc1 · 2ea659a9
      Linus Torvalds 提交于
      2ea659a9
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · cd636458
      Linus Torvalds 提交于
      Pull some more input subsystem updates from Dmitry Torokhov:
       "An updated xpad driver with a few more recognized device IDs, and a
        new psxpad-spi driver, allowing connecting Playstation 1 and 2 joypads
        via SPI bus"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: cros_ec_keyb - remove extraneous 'const'
        Input: add support for PlayStation 1/2 joypads connected via SPI
        Input: xpad - add USB IDs for Mad Catz Brawlstick and Razer Sabertooth
        Input: xpad - sync supported devices with xboxdrv
        Input: xpad - sort supported devices by USB ID
      cd636458
    • L
      Merge tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs · b53c4d5e
      Linus Torvalds 提交于
      Pull UBI/UBIFS updates from Richard Weinberger:
      
       - new config option CONFIG_UBIFS_FS_SECURITY
      
       - minor improvements
      
       - random fixes
      
      * tag 'upstream-4.12-rc1' of git://git.infradead.org/linux-ubifs:
        ubi: Add debugfs file for tracking PEB state
        ubifs: Fix a typo in comment of ioctl2ubifs & ubifs2ioctl
        ubifs: Remove unnecessary assignment
        ubifs: Fix cut and paste error on sb type comparisons
        ubi: fastmap: Fix slab corruption
        ubifs: Add CONFIG_UBIFS_FS_SECURITY to disable/enable security labels
        ubi: Make mtd parameter readable
        ubi: Fix section mismatch
      b53c4d5e
    • L
      Merge branch 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml · ec059019
      Linus Torvalds 提交于
      Pull UML fixes from Richard Weinberger:
       "No new stuff, just fixes"
      
      * 'for-linus-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
        um: Add missing NR_CPUS include
        um: Fix to call read_initrd after init_bootmem
        um: Include kbuild.h instead of duplicating its macros
        um: Fix PTRACE_POKEUSER on x86_64
        um: Set number of CPUs
        um: Fix _print_addr()
      ec059019
    • L
      Merge branch 'akpm' (patches from Andrew) · 1251704a
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "15 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, docs: update memory.stat description with workingset* entries
        mm: vmscan: scan until it finds eligible pages
        mm, thp: copying user pages must schedule on collapse
        dax: fix PMD data corruption when fault races with write
        dax: fix data corruption when fault races with write
        ext4: return to starting transaction in ext4_dax_huge_fault()
        mm: fix data corruption due to stale mmap reads
        dax: prevent invalidation of mapped DAX entries
        Tigran has moved
        mm, vmalloc: fix vmalloc users tracking properly
        mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin
        gcov: support GCC 7.1
        mm, vmstat: Remove spurious WARN() during zoneinfo print
        time: delete current_fs_time()
        hwpoison, memcg: forcibly uncharge LRU pages
      1251704a
  7. 13 5月, 2017 12 次提交
    • R
      mm, docs: update memory.stat description with workingset* entries · b340959e
      Roman Gushchin 提交于
      Commit 4b4cea91691d ("mm: vmscan: fix IO/refault regression in cache
      workingset transition") introduced three new entries in memory stat
      file:
      
       - workingset_refault
       - workingset_activate
       - workingset_nodereclaim
      
      This commit adds a corresponding description to the cgroup v2 docs.
      
      Link: http://lkml.kernel.org/r/1494530293-31236-1-git-send-email-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b340959e
    • M
      mm: vmscan: scan until it finds eligible pages · 791b48b6
      Minchan Kim 提交于
      Although there are a ton of free swap and anonymous LRU page in elgible
      zones, OOM happened.
      
        balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
        CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
         oom_kill_process+0x21d/0x3f0
         out_of_memory+0xd8/0x390
         __alloc_pages_slowpath+0xbc1/0xc50
         __alloc_pages_nodemask+0x1a5/0x1c0
         pte_alloc_one+0x20/0x50
         __pte_alloc+0x1e/0x110
         __handle_mm_fault+0x919/0x960
         handle_mm_fault+0x77/0x120
         __do_page_fault+0x27a/0x550
         trace_do_page_fault+0x43/0x150
         do_async_page_fault+0x2c/0x90
         async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:424716 inactive_anon:65314 isolated_anon:0
         active_file:52 inactive_file:46 isolated_file:0
         unevictable:0 dirty:27 writeback:0 unstable:0
         slab_reclaimable:3967 slab_unreclaimable:4125
         mapped:133 shmem:43 pagetables:1674 bounce:0
         free:4637 free_pcp:225 free_cma:0
        Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 992 992 1952
        DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 959
        Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
        DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
        Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
        378 total pagecache pages
        17 pages in swap cache
        Swap cache stats: add 17325, delete 17302, find 0/27
        Free swap  = 978940kB
        Total swap = 1048572kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        12629 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
        [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
        [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
        [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd
      
      With investigation, skipping page of isolate_lru_pages makes reclaim
      void because it returns zero nr_taken easily so LRU shrinking is
      effectively nothing and just increases priority aggressively.  Finally,
      OOM happens.
      
      The problem is that get_scan_count determines nr_to_scan with eligible
      zones so although priority drops to zero, it couldn't reclaim any pages
      if the LRU contains mostly ineligible pages.
      
      get_scan_count:
      
              size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
      	size = size >> sc->priority;
      
      Assumes sc->priority is 0 and LRU list is as follows.
      
      	N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H
      
      (Ie, small eligible pages are in the head of LRU but others are
       almost ineligible pages)
      
      In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
      tail of the LRU are not eligible pages.  If get_scan_count counts
      skipped pages, it doesn't reclaim any pages remained after scanning 4
      pages so it ends up OOM happening.
      
      This patch makes isolate_lru_pages try to scan pages until it encounters
      eligible zones's pages.
      
      [akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
      Fixes: 3db65812 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
      Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      791b48b6
    • D
      mm, thp: copying user pages must schedule on collapse · 338a16ba
      David Rientjes 提交于
      We have encountered need_resched warnings in __collapse_huge_page_copy()
      while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.
      
      mm->mmap_sem is held for write, but the iteration is well bounded.
      
      Reschedule as needed.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      338a16ba
    • R
      dax: fix PMD data corruption when fault races with write · 876f2946
      Ross Zwisler 提交于
      This is based on a patch from Jan Kara that fixed the equivalent race in
      the DAX PTE fault path.
      
      Currently DAX PMD read fault can race with write(2) in the following
      way:
      
      CPU1 - write(2)                 CPU2 - read fault
                                      dax_iomap_pmd_fault()
                                        ->iomap_begin() - sees hole
      
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      
                                        grab_mapping_entry()
      				  - we add huge zero page to the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      876f2946
    • J
      dax: fix data corruption when fault races with write · 13e451fd
      Jan Kara 提交于
      Currently DAX read fault can race with write(2) in the following way:
      
      CPU1 - write(2)			CPU2 - read fault
      				dax_iomap_pte_fault()
      				  ->iomap_begin() - sees hole
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      				  grab_mapping_entry()
      				  - we add zero page in the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13e451fd
    • J
      ext4: return to starting transaction in ext4_dax_huge_fault() · fb26a1cb
      Jan Kara 提交于
      DAX will return to locking exceptional entry before mapping blocks for a
      page fault to fix possible races with concurrent writes.  To avoid lock
      inversion between exceptional entry lock and transaction start, start
      the transaction already in ext4_dax_huge_fault().
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb26a1cb
    • J
      mm: fix data corruption due to stale mmap reads · cd656375
      Jan Kara 提交于
      Currently, we didn't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That could result in e.g. 2MiB zero page being mapped into
      page tables while there were already underlying blocks allocated and
      thus data seen through mmap were different from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd656375
    • R
      dax: prevent invalidation of mapped DAX entries · 4636e70b
      Ross Zwisler 提交于
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reported-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4636e70b
    • A
      Tigran has moved · cea58224
      Andrew Morton 提交于
      Cc: Tigran Aivazian <aivazian.tigran@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cea58224
    • M
      mm, vmalloc: fix vmalloc users tracking properly · 8594a21c
      Michal Hocko 提交于
      Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has
      pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
      turned out to be a bad idea for some architectures.  E.g.  m68k fails
      with
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                          from arch/m68k/include/asm/pgtable.h:4,
                          from include/linux/vmalloc.h:9,
                          from arch/m68k/kernel/module.c:9:
         arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
      >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
          #define pgd_offset_k(address) pgd_offset(&init_mm, address)
      
      as spotted by kernel build bot. nios2 fails for other reason
      
        In file included from include/asm-generic/io.h:767:0,
                         from arch/nios2/include/asm/io.h:61,
                         from include/linux/io.h:25,
                         from arch/nios2/include/asm/pgtable.h:18,
                         from include/linux/mm.h:70,
                         from include/linux/pid_namespace.h:6,
                         from include/linux/ptrace.h:9,
                         from arch/nios2/include/uapi/asm/elf.h:23,
                         from arch/nios2/include/asm/elf.h:22,
                         from include/linux/elf.h:4,
                         from include/linux/module.h:15,
                         from init/main.c:16:
        include/linux/vmalloc.h: In function '__vmalloc_node_flags':
        include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?
      
      which is due to the newly added #include <asm/pgtable.h>, which on nios2
      includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
      again includes <linux/vmalloc.h>.
      
      Tweaking that around just turns out a bigger headache than necessary.
      This patch reverts 1f5307b1 and reimplements the original fix in a
      different way.  __vmalloc_node_flags can stay static inline which will
      cover vmalloc* functions.  We only have one external user
      (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
      provide the caller directly.  This is much simpler and it doesn't really
      need any games with header files.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: revert old comment]
        Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
      Fixes: 1f5307b1 ("mm, vmalloc: properly track vmalloc users")
      Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tobias Klauser <tklauser@distanz.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8594a21c
    • S
      mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin · 835152a2
      SeongJae Park 提交于
      One return case of `__collapse_huge_page_swapin()` does not invoke
      tracepoint while every other return case does.  This commit adds a
      tracepoint invocation for the case.
      
      Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.comSigned-off-by: NSeongJae Park <sj38.park@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      835152a2
    • M
      gcov: support GCC 7.1 · 05384213
      Martin Liska 提交于
      Starting from GCC 7.1, __gcov_exit is a new symbol expected to be
      implemented in a profiling runtime.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mliska@suse.cz: v2]
        Link: http://lkml.kernel.org/r/e63a3c59-0149-c97e-4084-20ca8f146b26@suse.cz
      Link: http://lkml.kernel.org/r/8c4084fa-3885-29fe-5fc4-0d4ca199c785@suse.czSigned-off-by: NMartin Liska <mliska@suse.cz>
      Acked-by: NPeter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05384213