1. 08 10月, 2019 1 次提交
    • T
      writeback: fix use-after-free in finish_writeback_work() · 8e00c4e9
      Tejun Heo 提交于
      finish_writeback_work() reads @done->waitq after decrementing
      @done->cnt.  However, once @done->cnt reaches zero, @done may be freed
      (from stack) at any moment and @done->waitq can contain something
      unrelated by the time finish_writeback_work() tries to read it.  This
      led to the following crash.
      
        "BUG: kernel NULL pointer dereference, address: 0000000000000002"
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 0 P4D 0
        Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
        CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted
        ...
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30
        Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90
        RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002
        RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0
        Call Trace:
         __wake_up_common_lock+0x63/0xc0
         wb_workfn+0xd2/0x3e0
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         kthread+0x111/0x130
         ret_from_fork+0x1f/0x30
      
      Fix it by reading and caching @done->waitq before decrementing
      @done->cnt.
      
      Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com
      Fixes: 5b9cce4c ("writeback: Generalize and expose wb_completion")
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Debugged-by: NChris Mason <clm@fb.com>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e00c4e9
  2. 30 8月, 2019 1 次提交
  3. 27 8月, 2019 2 次提交
  4. 16 8月, 2019 2 次提交
    • T
      writeback, cgroup: inode_switch_wbs() shouldn't give up on wb_switch_rwsem trylock fail · 6444f47e
      Tejun Heo 提交于
      As inode wb switching may make sync(2) miss some inodes, they're
      synchronized using wb_switch_rwsem so that no wb switching happens
      while sync(2) is in progress.  In addition to synchronizing the actual
      switching, the rwsem is also used to prevent queueing new switch
      attempts while sync(2) is in progress.  This is to avoid queueing too
      many instances while the rwsem is held by sync(2).  Unfortunately,
      this is too agressive and can block wb switching for a long time if
      sync(2) is frequent.
      
      The goal is avoiding expolding the number of scheduled switches, not
      avoiding scheduling anything.  Let's use wb_switch_rwsem only for
      synchronizing the actual switching and sync(2) and use
      isw_nr_in_flight instead for limiting the maximum number of scheduled
      switches.  The limit is set to 1024 which should be more than enough
      while still avoiding extreme situations.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6444f47e
    • T
      writeback, cgroup: Adjust WB_FRN_TIME_CUT_DIV to accelerate foreign inode switching · 55a694df
      Tejun Heo 提交于
      WB_FRN_TIME_CUT_DIV is used to tell the foreign inode detection logic
      to ignore short writeback rounds to prevent getting confused by a
      burst of short writebacks.  The parameter is currently 2 meaning that
      anything smaller than half of the running average writback duration
      will be ignored.
      
      This is unnecessarily aggressive.  The detection logic uses 16 history
      slots and is already reasonably protected against some short bursts
      confusing it and the current parameter can lead to tens of seconds of
      missed detection depending on the writeback pattern.
      
      Let's change the parameter to 8, so that it only ignores writeback
      with are smaller than 12.5% of the current running average.
      
      v2: Add comment explaining what's going on with the foreign detection
          parameters.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      55a694df
  5. 10 7月, 2019 3 次提交
  6. 16 6月, 2019 1 次提交
    • T
      blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration · 66311422
      Tejun Heo 提交于
      wbc_account_io() collects information on cgroup ownership of writeback
      pages to determine which cgroup should own the inode.  Pages can stay
      associated with dead memcgs but we want to avoid attributing IOs to
      dead blkcgs as much as possible as the association is likely to be
      stale.  However, currently, pages associated with dead memcgs
      contribute to the accounting delaying and/or confusing the
      arbitration.
      
      Fix it by ignoring pages associated with dead memcgs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      66311422
  7. 21 5月, 2019 1 次提交
  8. 19 5月, 2019 1 次提交
  9. 23 1月, 2019 1 次提交
  10. 21 10月, 2018 1 次提交
  11. 04 5月, 2018 1 次提交
    • J
      bdi: Fix oops in wb_workfn() · b8b78495
      Jan Kara 提交于
      Syzbot has reported that it can hit a NULL pointer dereference in
      wb_workfn() due to wb->bdi->dev being NULL. This indicates that
      wb_workfn() was called for an already unregistered bdi which should not
      happen as wb_shutdown() called from bdi_unregister() should make sure
      all pending writeback works are completed before bdi is unregistered.
      Except that wb_workfn() itself can requeue the work with:
      
      	mod_delayed_work(bdi_wq, &wb->dwork, 0);
      
      and if this happens while wb_shutdown() is waiting in:
      
      	flush_delayed_work(&wb->dwork);
      
      the dwork can get executed after wb_shutdown() has finished and
      bdi_unregister() has cleared wb->bdi->dev.
      
      Make wb_workfn() use wakeup_wb() for requeueing the work which takes all
      the necessary precautions against racing with bdi unregistration.
      
      CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      CC: Tejun Heo <tj@kernel.org>
      Fixes: 839a8e86Reported-by: Nsyzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b8b78495
  12. 21 4月, 2018 1 次提交
    • G
      writeback: safer lock nesting · 2e898e4c
      Greg Thelen 提交于
      lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
      the page's memcg is undergoing move accounting, which occurs when a
      process leaves its memcg for a new one that has
      memory.move_charge_at_immigrate set.
      
      unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
      the given inode is switching writeback domains.  Switches occur when
      enough writes are issued from a new domain.
      
      This existing pattern is thus suspicious:
          lock_page_memcg(page);
          unlocked_inode_to_wb_begin(inode, &locked);
          ...
          unlocked_inode_to_wb_end(inode, locked);
          unlock_page_memcg(page);
      
      If both inode switch and process memcg migration are both in-flight then
      unlocked_inode_to_wb_end() will unconditionally enable interrupts while
      still holding the lock_page_memcg() irq spinlock.  This suggests the
      possibility of deadlock if an interrupt occurs before unlock_page_memcg().
      
          truncate
          __cancel_dirty_page
          lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                          test_clear_page_writeback
                                          lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
      
      Due to configuration limitations this deadlock is not currently possible
      because we don't mix cgroup writeback (a cgroupv2 feature) and
      memory.move_charge_at_immigrate (a cgroupv1 feature).
      
      If the kernel is hacked to always claim inode switching and memcg
      moving_account, then this script triggers lockup in less than a minute:
      
        cd /mnt/cgroup/memory
        mkdir a b
        echo 1 > a/memory.move_charge_at_immigrate
        echo 1 > b/memory.move_charge_at_immigrate
        (
          echo $BASHPID > a/cgroup.procs
          while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
          done
        ) &
        while true; do
          sync
        done &
        sleep 1h &
        SLEEP=$!
        while true; do
          echo $SLEEP > a/cgroup.procs
          echo $SLEEP > b/cgroup.procs
        done
      
      The deadlock does not seem possible, so it's debatable if there's any
      reason to modify the kernel.  I suggest we should to prevent future
      surprises.  And Wang Long said "this deadlock occurs three times in our
      environment", so there's more reason to apply this, even to stable.
      Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
      see "[PATCH for-4.4] writeback: safer lock nesting"
      https://lkml.org/lkml/2018/4/11/146
      
      Wang Long said "this deadlock occurs three times in our environment"
      
      [gthelen@google.com: v4]
        Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
      [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
      Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
      Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
      Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Reported-by: NWang Long <wanglong19@meituan.com>
      Acked-by: NWang Long <wanglong19@meituan.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>	[v4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e898e4c
  13. 12 4月, 2018 1 次提交
  14. 28 3月, 2018 1 次提交
  15. 07 1月, 2018 1 次提交
  16. 28 11月, 2017 1 次提交
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds 提交于
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  17. 10 10月, 2017 1 次提交
  18. 05 10月, 2017 1 次提交
    • J
      writeback: eliminate work item allocation in bd_start_writeback() · 85009b4f
      Jens Axboe 提交于
      Handle start-all writeback like we do periodic or kupdate
      style writeback - by marking the bdi_writeback as needing a full
      flush, and simply waking the thread. This eliminates the need to
      allocate and queue a specific work item just for this purpose.
      
      After this change, we truly only ever have one of them running at
      any point in time. We mark the need to start all flushes, and the
      writeback thread will clear it once it has processed the request.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      85009b4f
  19. 03 10月, 2017 7 次提交
  20. 13 7月, 2017 1 次提交
  21. 16 5月, 2017 1 次提交
  22. 13 3月, 2017 1 次提交
  23. 13 12月, 2016 1 次提交
  24. 10 8月, 2016 1 次提交
    • K
      mm, writeback: flush plugged IO in wakeup_flusher_threads() · 51350ea0
      Konstantin Khlebnikov 提交于
      I've found funny live-lock between raid10 barriers during resync and
      memory controller hard limits. Inside mpage_readpages() task holds on to
      its plug bio which blocks the barrier in raid10. Its memory cgroup have
      no free memory thus the task goes into reclaimer but all reclaimable
      pages are dirty and cannot be written because raid10 is rebuilding and
      stuck on the barrier.
      
      Common flush of such IO in schedule() never happens, because the caller
      doesn't go to sleep.
      
      Lock is 'live' because changing memory limit or killing tasks which
      holds that stuck bio unblock whole progress.
      
      That was what happened in 3.18.x but I see no difference in upstream
      logic.  Theoretically this might happen even without memory cgroup.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      51350ea0
  25. 05 8月, 2016 1 次提交
    • J
      writeback: Write dirty times for WB_SYNC_ALL writeback · dc5ff2b1
      Jan Kara 提交于
      Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
      queue_io() so that inodes which have only dirty timestamps are properly
      written on fsync(2) and sync(2). However there are other call sites -
      most notably going through write_inode_now() - which expect inode to be
      clean after WB_SYNC_ALL writeback. This is not currently true as we do
      not clear I_DIRTY_TIME in __writeback_single_inode() even for
      WB_SYNC_ALL writeback in all the cases. This then resulted in the
      following oops because bdev_write_inode() did not clean the inode and
      writeback code later stumbled over a dirty inode with detached wb.
      
        general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
        Modules linked in:
        CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Workqueue: writeback wb_workfn (flush-11:0)
        task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
        RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
        locked_inode_to_wb_and_lock_list+0xa2/0x750
        RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
        RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
        RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
        R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
        R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
        FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
        DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
        Stack:
         ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
         ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
         ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
        Call Trace:
         [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
         [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
         [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
         [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
         [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
         [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
         [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
         [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
         [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
         [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
        Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
        00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
        80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
        RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
        RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
        fs/fs-writeback.c:281
         RSP <ffff88006cdaf7d0>
        ---[ end trace 986a4d314dcb2694 ]---
      
      Fix the problem by making sure __writeback_single_inode() writes inode
      only with dirty times in WB_SYNC_ALL mode.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc5ff2b1
  26. 29 7月, 2016 1 次提交
  27. 27 7月, 2016 2 次提交
  28. 01 7月, 2016 1 次提交
    • T
      writeback: inode cgroup wb switch should not call ihold() · 74524955
      Tahsin Erdogan 提交于
      Asynchronous wb switching of inodes takes an additional ref count on an
      inode to make sure inode remains valid until switchover is completed.
      
      However, anyone calling ihold() must already have a ref count on inode,
      but in this case inode->i_count may already be zero:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
      CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: writeback wb_workfn (flush-8:16)
       0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
       0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
       ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
      Call Trace:
       [<ffffffff805990af>] dump_stack+0x4d/0x6e
       [<ffffffff80268702>] __warn+0xc2/0xe0
       [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
       [<ffffffff8035b4ab>] ihold+0x2b/0x30
       [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
       [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
       [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
       [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
       [<ffffffff8036a147>] wb_workfn+0xd7/0x280
       [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
       [<ffffffff8027bb09>] process_one_work+0x129/0x300
       [<ffffffff8027be06>] worker_thread+0x126/0x480
       [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
       [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
       [<ffffffff80280ff4>] kthread+0xc4/0xe0
       [<ffffffff80335578>] ? kfree+0xc8/0x100
       [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
       [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
      ---[ end trace aaefd2fd9f306bc4 ]---
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      74524955
  29. 21 5月, 2016 1 次提交
    • T
      mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Tetsuo Handa 提交于
      When writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under OOM-livelock situation), we can observe flood of order-0 page
      allocation failure messages caused by complete depletion of memory
      reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
      Since I/O is stalling, allocating writeback requests forever shall
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      failed, we don't need to allow wb_start_writeback() to use memory
      reserves.
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
      Assuming that somebody will find a better solution, let's apply this
      patch for now to stop bleeding, for this problem frequently prevents me
      from testing OOM livelock condition.
      
      Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.czSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78ebc2f7