1. 16 May, 2019 (2 commits)
    • fs/writeback: Attach inode's wb to root if needed · dda65ea8
      Committed by luanshi
There may be tons of files queued for writeback while the cgroup that
      the writeback is associated with has already died. In this case we try
      to reassociate the inode with another writeback, but that may be
      impossible because the writeback associated with the dead cgroup is the
      only valid one. In this case a new writeback is allocated, initialized
      and associated with the inode over and over again until all data
      resident in the inode's page cache have been flushed to disk. This
      causes unnecessarily high system load.
      
This fixes the issue by forcing the inode to move to the root cgroup's
      writeback when the previously bound cgroup dies. With this change, no
      more unnecessary writebacks are created and populated, and the system
      load decreased by about 6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
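
      A minimal sketch of the idea (the helper name here is hypothetical and
      the dying-cgroup test is an assumption based on the cgwb percpu_ref in
      mm/backing-dev.c; the actual patch in this tree may differ):

          /*
           * Hypothetical helper: if the wb an inode would stay attached to
           * belongs to a dead cgroup, fall back to the bdi's embedded root
           * wb, which never dies, instead of allocating fresh wbs forever.
           */
          static struct bdi_writeback *wb_or_root(struct inode *inode,
          					struct bdi_writeback *wb)
          {
          	struct backing_dev_info *bdi = inode_to_bdi(inode);

          	if (wb != &bdi->wb && percpu_ref_is_dying(&wb->refcnt))
          		return &bdi->wb;	/* root wb is always valid */
          	return wb;
          }
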
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • fs/writeback: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount · 6644f956
      Committed by Jiufei Xue
synchronize_rcu() does not wait for call_rcu() callbacks, so an inode
      wb switch may not yet have been queued on the workqueue when
      synchronize_rcu() returns. Thus previously scheduled switches were not
      finished even after flushing the workqueue, which caused the NULL
      pointer dereference shown below.
      
      VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds.  Have a nice day...
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
      [<ffffffff8126a303>] evict+0xb3/0x180
      [<ffffffff8126a760>] iput+0x1b0/0x230
      [<ffffffff8127c690>] inode_switch_wbs_work_fn+0x3c0/0x6a0
      [<ffffffff810a5b2e>] worker_thread+0x4e/0x490
      [<ffffffff810a5ae0>] ? process_one_work+0x410/0x410
      [<ffffffff810ac056>] kthread+0xe6/0x100
      [<ffffffff8173c199>] ret_from_fork+0x39/0x50
      
Replace the synchronize_rcu() call with rcu_barrier() to wait for all
      pending callbacks to finish. Also increment isw_nr_in_flight only
      after call_rcu() in inode_switch_wbs(), which makes more sense.
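
      A sketch of the fixed umount-side flush, following the upstream
      fs/fs-writeback.c of that era:

          void cgroup_writeback_umount(void)
          {
          	if (atomic_read(&isw_nr_in_flight)) {
          		/*
          		 * Use rcu_barrier() to wait for all pending callbacks
          		 * so that every in-flight switch has reached isw_wq
          		 * before the workqueue is flushed.
          		 */
          		rcu_barrier();
          		flush_workqueue(isw_wq);
          	}
          }

      Correspondingly, inode_switch_wbs() now does
      atomic_inc(&isw_nr_in_flight) only after call_rcu() has queued the
      switch.
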
Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  2. 06 Mar, 2019 (1 commit)
  3. 04 May, 2018 (1 commit)
    • bdi: Fix oops in wb_workfn() · b8b78495
      Committed by Jan Kara
      Syzbot has reported that it can hit a NULL pointer dereference in
      wb_workfn() due to wb->bdi->dev being NULL. This indicates that
      wb_workfn() was called for an already unregistered bdi which should not
      happen as wb_shutdown() called from bdi_unregister() should make sure
      all pending writeback works are completed before bdi is unregistered.
      Except that wb_workfn() itself can requeue the work with:
      
      	mod_delayed_work(bdi_wq, &wb->dwork, 0);
      
      and if this happens while wb_shutdown() is waiting in:
      
      	flush_delayed_work(&wb->dwork);
      
      the dwork can get executed after wb_shutdown() has finished and
      bdi_unregister() has cleared wb->bdi->dev.
      
Make wb_workfn() use wb_wakeup() for requeueing the work, which takes
      all the necessary precautions against racing with bdi unregistration.
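
      For reference, the wb_wakeup() helper (as in fs/fs-writeback.c of that
      era) only requeues the work while the wb is still registered, under
      wb->work_lock:

          static void wb_wakeup(struct bdi_writeback *wb)
          {
          	spin_lock_bh(&wb->work_lock);
          	if (test_bit(WB_registered, &wb->state))
          		mod_delayed_work(bdi_wq, &wb->dwork, 0);
          	spin_unlock_bh(&wb->work_lock);
          }
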
      
CC: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      CC: Tejun Heo <tj@kernel.org>
      Fixes: 839a8e86
      Reported-by: syzbot <syzbot+9873874c735f2892e7e9@syzkaller.appspotmail.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 21 Apr, 2018 (1 commit)
    • writeback: safer lock nesting · 2e898e4c
      Committed by Greg Thelen
      lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
      the page's memcg is undergoing move accounting, which occurs when a
      process leaves its memcg for a new one that has
      memory.move_charge_at_immigrate set.
      
      unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
      the given inode is switching writeback domains.  Switches occur when
      enough writes are issued from a new domain.
      
      This existing pattern is thus suspicious:
          lock_page_memcg(page);
          unlocked_inode_to_wb_begin(inode, &locked);
          ...
          unlocked_inode_to_wb_end(inode, locked);
          unlock_page_memcg(page);
      
If an inode switch and a process memcg migration are both in flight,
      then unlocked_inode_to_wb_end() will unconditionally enable interrupts
      while still holding the lock_page_memcg() irq spinlock.  This suggests
      the possibility of deadlock if an interrupt occurs before
      unlock_page_memcg().
      
          truncate
          __cancel_dirty_page
          lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
          <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                          test_clear_page_writeback
                                          lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
      
      Due to configuration limitations this deadlock is not currently possible
      because we don't mix cgroup writeback (a cgroupv2 feature) and
      memory.move_charge_at_immigrate (a cgroupv1 feature).
      
      If the kernel is hacked to always claim inode switching and memcg
      moving_account, then this script triggers lockup in less than a minute:
      
        cd /mnt/cgroup/memory
        mkdir a b
        echo 1 > a/memory.move_charge_at_immigrate
        echo 1 > b/memory.move_charge_at_immigrate
        (
          echo $BASHPID > a/cgroup.procs
          while true; do
            dd if=/dev/zero of=/mnt/big bs=1M count=256
          done
        ) &
        while true; do
          sync
        done &
        sleep 1h &
        SLEEP=$!
        while true; do
          echo $SLEEP > a/cgroup.procs
          echo $SLEEP > b/cgroup.procs
        done
      
The deadlock does not seem possible with current configurations, so
      it's debatable whether there's any reason to modify the kernel.  I
      suggest we do, to prevent future surprises.  And Wang Long said "this
      deadlock occurs three times in our environment", so there's more
      reason to apply this, even to stable.
      Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
      see "[PATCH for-4.4] writeback: safer lock nesting"
      https://lkml.org/lkml/2018/4/11/146
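
      The merged fix replaces the bool with a wb_lock_cookie that also
      records the saved interrupt state, so unlocked_inode_to_wb_end()
      restores rather than unconditionally enables interrupts.  A condensed
      sketch (the page-cache lock spelling varies by kernel version; older
      kernels use mapping->tree_lock as shown here, newer ones take the
      equivalent xa_lock of mapping->i_pages):

          struct wb_lock_cookie {
          	bool locked;
          	unsigned long flags;
          };

          static inline struct bdi_writeback *
          unlocked_inode_to_wb_begin(struct inode *inode,
          			   struct wb_lock_cookie *cookie)
          {
          	rcu_read_lock();
          	/* paired with store_release in inode_switch_wbs_work_fn() */
          	cookie->locked = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
          	if (unlikely(cookie->locked))
          		spin_lock_irqsave(&inode->i_mapping->tree_lock,
          				  cookie->flags);
          	return inode->i_wb;
          }

          static inline void unlocked_inode_to_wb_end(struct inode *inode,
          					    struct wb_lock_cookie *cookie)
          {
          	if (unlikely(cookie->locked))
          		spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
          				       cookie->flags);
          	rcu_read_unlock();
          }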
      
      [gthelen@google.com: v4]
        Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
      [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
      Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
      Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
      Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Wang Long <wanglong19@meituan.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: <stable@vger.kernel.org>	[v4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 12 Apr, 2018 (1 commit)
  6. 28 Mar, 2018 (1 commit)
  7. 07 Jan, 2018 (1 commit)
  8. 28 Nov, 2017 (1 commit)
    • Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Committed by Linus Torvalds
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 Oct, 2017 (1 commit)
  10. 05 Oct, 2017 (1 commit)
    • writeback: eliminate work item allocation in bd_start_writeback() · 85009b4f
      Committed by Jens Axboe
      Handle start-all writeback like we do periodic or kupdate
      style writeback - by marking the bdi_writeback as needing a full
      flush, and simply waking the thread. This eliminates the need to
      allocate and queue a specific work item just for this purpose.
      
      After this change, we truly only ever have one of them running at
      any point in time. We mark the need to start all flushes, and the
      writeback thread will clear it once it has processed the request.
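
      A sketch of the resulting wb_start_writeback(), close to the upstream
      version; the WB_start_all state bit replaces the allocated work item:

          static void wb_start_writeback(struct bdi_writeback *wb,
          			       enum wb_reason reason)
          {
          	if (!wb_has_dirty_io(wb))
          		return;

          	/*
          	 * Callers like vmscan can invoke this at very high
          	 * frequency, so allow only one start-all request pending
          	 * and in flight at a time.
          	 */
          	if (test_bit(WB_start_all, &wb->state) ||
          	    test_and_set_bit(WB_start_all, &wb->state))
          		return;

          	wb->start_all_reason = reason;
          	wb_wakeup(wb);
          }
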
Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 03 Oct, 2017 (7 commits)
  12. 13 Jul, 2017 (1 commit)
  13. 16 May, 2017 (1 commit)
  14. 13 Mar, 2017 (1 commit)
  15. 13 Dec, 2016 (1 commit)
  16. 10 Aug, 2016 (1 commit)
    • mm, writeback: flush plugged IO in wakeup_flusher_threads() · 51350ea0
      Committed by Konstantin Khlebnikov
I've found a funny livelock between raid10 barriers during resync and
      memory controller hard limits. Inside mpage_readpages() the task holds
      on to its plug bio, which blocks the barrier in raid10. Its memory
      cgroup has no free memory, so the task goes into the reclaimer, but
      all reclaimable pages are dirty and cannot be written because raid10
      is rebuilding and stuck on the barrier.
      
      The common flush of such IO in schedule() never happens, because the
      caller doesn't go to sleep.
      
      The lock is 'live' because changing the memory limit or killing the
      tasks which hold the stuck bio unblocks the whole progress.
      
      That was what happened in 3.18.x, but I see no difference in the
      upstream logic.  Theoretically this might happen even without a memory
      cgroup.
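
      The merged fix submits any plugged IO before waking the flushers; a
      sketch of the hunk added at the top of wakeup_flusher_threads():

          void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
          {
          	struct backing_dev_info *bdi;

          	/*
          	 * If we are expecting writeback progress we must submit
          	 * plugged IO.
          	 */
          	if (blk_needs_flush_plug(current))
          		blk_schedule_flush_plug(current);

          	/* ... then wake the per-bdi flushers as before ... */
          }
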
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  17. 05 Aug, 2016 (1 commit)
    • writeback: Write dirty times for WB_SYNC_ALL writeback · dc5ff2b1
      Committed by Jan Kara
Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
      queue_io() so that inodes which have only dirty timestamps are properly
      written on fsync(2) and sync(2). However there are other call sites -
      most notably those going through write_inode_now() - which expect the
      inode to be clean after WB_SYNC_ALL writeback. This is not currently
      true, as we do not clear I_DIRTY_TIME in __writeback_single_inode()
      even for WB_SYNC_ALL writeback in all cases. This then resulted in the
      following oops, because bdev_write_inode() did not clean the inode and
      the writeback code later stumbled over a dirty inode with a detached wb.
      
        general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
        Modules linked in:
        CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Workqueue: writeback wb_workfn (flush-11:0)
        task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
        RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
        locked_inode_to_wb_and_lock_list+0xa2/0x750
        RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
        RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
        RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
        R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
        R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
        FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
        DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
        Stack:
         ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
         ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
         ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
        Call Trace:
         [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
         [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
         [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
         [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
         [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
         [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
         [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
         [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
         [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
         [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
        Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
        00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
        80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
        RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
        RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
        fs/fs-writeback.c:281
         RSP <ffff88006cdaf7d0>
        ---[ end trace 986a4d314dcb2694 ]---
      
Fix the problem by making sure __writeback_single_inode() writes an
      inode that has only dirty timestamps when running in WB_SYNC_ALL mode.
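
      A sketch of the resulting check in __writeback_single_inode(), close
      to the upstream hunk: dirty timestamps are now expired for WB_SYNC_ALL
      and sync(2) writeback too, not only when the dirtytime interval has
      elapsed:

          if ((inode->i_state & I_DIRTY_TIME) &&
              (wbc->sync_mode == WB_SYNC_ALL || wbc->for_sync ||
               time_after(jiffies, inode->dirtied_time_when +
          		    dirtytime_expire_interval * HZ))) {
          	trace_writeback_lazytime(inode);
          	/* promote I_DIRTY_TIME to I_DIRTY_SYNC so the inode is written */
          	mark_inode_dirty_sync(inode);
          }
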
Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  18. 29 Jul, 2016 (1 commit)
  19. 27 Jul, 2016 (2 commits)
  20. 01 Jul, 2016 (1 commit)
    • writeback: inode cgroup wb switch should not call ihold() · 74524955
      Committed by Tahsin Erdogan
Asynchronous wb switching of inodes takes an additional ref count on an
      inode to make sure the inode remains valid until the switchover is
      completed.
      
      However, anyone calling ihold() must already hold a ref count on the
      inode, but here inode->i_count may already be zero:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
      CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: writeback wb_workfn (flush-8:16)
       0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
       0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
       ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
      Call Trace:
       [<ffffffff805990af>] dump_stack+0x4d/0x6e
       [<ffffffff80268702>] __warn+0xc2/0xe0
       [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
       [<ffffffff8035b4ab>] ihold+0x2b/0x30
       [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
       [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
       [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
       [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
       [<ffffffff8036a147>] wb_workfn+0xd7/0x280
       [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
       [<ffffffff8027bb09>] process_one_work+0x129/0x300
       [<ffffffff8027be06>] worker_thread+0x126/0x480
       [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
       [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
       [<ffffffff80280ff4>] kthread+0xc4/0xe0
       [<ffffffff80335578>] ? kfree+0xc8/0x100
       [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
       [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
      ---[ end trace aaefd2fd9f306bc4 ]---
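
      The fix replaces ihold() with __iget() taken under inode->i_lock, at a
      point where inode_switch_wbs() has already checked that the inode is
      not being freed.  A condensed sketch of that section:

          /* while holding I_WB_SWITCH, no one else can update the association */
          spin_lock(&inode->i_lock);
          if (!(inode->i_sb->s_flags & MS_ACTIVE) ||
              inode->i_state & (I_WB_SWITCH | I_FREEING) ||
              inode_to_wb(inode) == isw->new_wb) {
          	spin_unlock(&inode->i_lock);
          	goto out_free;
          }
          inode->i_state |= I_WB_SWITCH;
          __iget(inode);	/* safe: i_lock held and inode not I_FREEING */
          spin_unlock(&inode->i_lock);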
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  21. 21 May, 2016 (1 commit)
    • mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Committed by Tetsuo Handa
When a writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under an OOM-livelock situation), we can observe a flood of order-0
      page allocation failure messages caused by complete depletion of
      memory reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
Since I/O is stalling, allocating writeback requests forever will
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      fails, we don't need to allow wb_start_writeback() to use memory
      reserves.
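
      The fix is a gfp-flags change in wb_start_writeback(), as merged
      upstream:

          /*
           * This is WB_SYNC_NONE writeback, so if allocation fails just
           * wake up the flusher for old dirty data writeback.
           */
          work = kzalloc(sizeof(*work),
          	       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
          if (!work) {
          	trace_writeback_nowork(wb);
          	wb_wakeup(wb);
          	return;
          }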
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
Assuming that somebody will find a better solution, let's apply this
      patch for now to stop the bleeding, as this problem frequently
      prevents me from testing the OOM-livelock condition.
      
Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 05 Apr, 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in the page cache are special.  They
      are not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
The only adjustment after coccinelle is the revert of changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 20 Mar, 2016 (2 commits)
    • writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode · aaf25593
      Committed by Tejun Heo
When cgroup writeback is in use, there can be multiple wb's
      (bdi_writeback's) per bdi and an inode may switch among them
      dynamically.  In a couple of places the wrong wb was used, leading to
      operations being performed on the wrong list under the wrong lock and
      corrupting the io lists.
      
      * writeback_single_inode() was taking @wb parameter and used it to
        remove the inode from io lists if it becomes clean after writeback.
        The callers of this function were always passing in the root wb
        regardless of the actual wb that the inode was associated with,
        which could also change while writeback is in progress.
      
        Fix it by dropping the @wb parameter and using
        inode_to_wb_and_lock_list() to determine and lock the associated wb.
      
      * After writeback_sb_inodes() writes out an inode, it re-locks @wb and
        inode to remove it from or move it to the right io list.  It assumes
        that the inode is still associated with @wb; however, the inode may
        have switched to another wb while writeback was in progress.
      
        Fix it by using inode_to_wb_and_lock_list() to determine and lock
        the associated wb after writeback is complete.  As the function
        requires the original @wb->list_lock locked for the next iteration,
        in the unlikely case where the inode has changed association, switch
        the locks.
      
      Kudos to Tahsin for pinpointing these subtle breakages.
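
      A condensed sketch of the second fix in writeback_sb_inodes(), close
      to the upstream hunk: the inode's current wb is looked up again after
      writeback and, if it changed, the locks are switched so the caller
      still ends up holding the original @wb->list_lock:

          /*
           * Requeue @inode if still dirty.  Be careful as @inode may
           * have been switched to another wb in the meantime.
           */
          tmp_wb = inode_to_wb_and_lock_list(inode);
          spin_lock(&inode->i_lock);
          if (!(inode->i_state & I_DIRTY_ALL))
          	wrote++;
          requeue_inode(inode, tmp_wb, &wbc);
          inode_sync_complete(inode);
          spin_unlock(&inode->i_lock);

          if (unlikely(tmp_wb != wb)) {
          	spin_unlock(&tmp_wb->list_lock);
          	spin_lock(&wb->list_lock);
          }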
Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
      Reported-and-diagnosed-by: Tahsin Erdogan <tahsin@google.com>
      Tested-by: Tahsin Erdogan <tahsin@google.com>
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() · 614a4e37
      Committed by Tejun Heo
      locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
      the target inode, unlocks inode, locks the wb's list_lock and verifies
      that the inode is still associated with the wb.  To prevent the wb
      going away between dropping inode lock and acquiring list_lock, the wb
      is pinned while inode lock is held.  The wb reference is put right
      after acquiring list_lock citing that the wb won't be dereferenced
      anymore.
      
This isn't true.  If the inode is still associated with the wb, the
      inode holds a reference and it's safe to return the wb; however, if
      the inode has been switched, the wb still needs to be unlocked, which
      is a dereference and can lead to a use-after-free if it races with wb
      destruction.
      
      Fix it by putting the reference after releasing list_lock.
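
      A sketch of the fixed retry loop in locked_inode_to_wb_and_lock_list(),
      close to upstream; in the switched case wb_put() now comes after the
      unlock:

          wb_get(wb);
          spin_unlock(&inode->i_lock);
          spin_lock(&wb->list_lock);

          /* i_wb may have changed inbetween, can't use inode_to_wb() */
          if (likely(wb == inode->i_wb)) {
          	wb_put(wb);	/* @inode already has ref */
          	return wb;
          }

          spin_unlock(&wb->list_lock);
          wb_put(wb);		/* only now is it safe to drop our ref */
          cpu_relax();
          spin_lock(&inode->i_lock);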
Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 87e1d789 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  24. 04 Mar, 2016 (1 commit)
    • writeback: flush inode cgroup wb switches instead of pinning super_block · a1a0e23e
      Committed by Tejun Heo
      If cgroup writeback is in use, inodes can be scheduled for
      asynchronous wb switching.  Before 5ff8eaac ("writeback: keep
      superblock pinned during cgroup writeback association switches"), this
      could race with umount leading to super_block being destroyed while
      inodes are pinned for wb switching.  5ff8eaac fixed it by bumping
      s_active while wb switches are in flight; however, this allowed
in-flight wb switches to make umounts asynchronous when the userland
      expected synchronous behavior - e.g. fsck immediately following umount
      may fail because the device is still busy.
      
      This patch removes the problematic super_block pinning and instead
      makes generic_shutdown_super() flush in-flight wb switches.  wb
      switches are now executed on a dedicated isw_wq so that they can be
      flushed and isw_nr_in_flight keeps track of the number of in-flight wb
      switches so that flushing can be avoided in most cases.
      
v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
          check in inode_switch_wbs() as Jan and Al suggested.
Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Tahsin Erdogan <tahsin@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: 5ff8eaac ("writeback: keep superblock pinned during cgroup writeback association switches")
      Cc: stable@vger.kernel.org #v4.5
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  25. 17 Feb, 2016 (1 commit)
    • writeback: keep superblock pinned during cgroup writeback association switches · 5ff8eaac
      Committed by Tejun Heo
      If cgroup writeback is in use, an inode is associated with a cgroup
      for writeback.  If the inode's main dirtier changes to another cgroup,
the association gets updated asynchronously.  Nothing was pinning the
      superblock while such switches were in progress, and the superblock
      could go away while async switching was pending or in progress, leading
      to crashes like the following.
      
       kernel BUG at fs/jbd2/transaction.c:319!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
       Hardware name: Google Google, BIOS Google 01/01/2011
       Workqueue: events inode_switch_wbs_work_fn
       task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
       RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
       RSP: 0018:ffff880209267c30  EFLAGS: 00010202
       ...
       Call Trace:
        [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
        [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
        [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
        [<ffffffff8035338b>] evict+0xbb/0x190
        [<ffffffff80354190>] iput+0x130/0x190
        [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
        [<ffffffff80279819>] process_one_work+0x129/0x300
        [<ffffffff80279b16>] worker_thread+0x126/0x480
        [<ffffffff8027ed14>] kthread+0xc4/0xe0
        [<ffffffff809771df>] ret_from_fork+0x3f/0x70
      
      Fix it by bumping s_active while cgroup association switching is in
      flight.
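
      A minimal sketch of the pinning (assumed shape, inferred from the
      commit text: an s_active bump when the switch is queued, paired with
      deactivate_super() once the work item is done):

          /* in inode_switch_wbs(), before queueing the switch work: */
          atomic_inc(&inode->i_sb->s_active);	/* pin the superblock */

          /* in inode_switch_wbs_work_fn(), after the switch completes: */
          struct super_block *sb = inode->i_sb;

          iput(inode);
          deactivate_super(sb);			/* unpin */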
Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Tahsin Erdogan <tahsin@google.com>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Cc: stable@vger.kernel.org #v4.5+
      Signed-off-by: Jens Axboe <axboe@fb.com>
  26. 16 Jan, 2016 (1 commit)
  27. 11 Nov, 2015 (1 commit)
  28. 10 Nov, 2015 (1 commit)
  29. 06 Nov, 2015 (1 commit)
    • mm/filemap.c: make global sync not clear error status of individual inodes · aa750fd7
      Committed by Junichi Nomura
filemap_fdatawait() is a function that waits for ongoing writeback to
      complete, but it also consumes and clears the error status of the
      mapping set during writeback.
      
      The latter functionality is critical for applications to detect
      writeback errors with system calls like fsync(2)/fdatasync(2).
      
      However, filemap_fdatawait() is also used by sync(2) and the FIFREEZE
      ioctl, which don't check the error status of individual mappings.
      
      As a result, fsync() may not be able to detect writeback error if events
      happen in the following order:
      
         Application                    System admin
         ----------------------------------------------------------
         write data on page cache
                                        Run sync command
                                        writeback completes with error
                                        filemap_fdatawait() clears error
         fsync returns success
         (but the data is not on disk)
      
      This patch adds filemap_fdatawait_keep_errors() for call sites where
      writeback error is not handled so that they don't clear error status.
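
      A sketch of the added helper, close to the upstream version: it waits
      like filemap_fdatawait() but skips the step that consumes the
      mapping's AS_EIO/AS_ENOSPC error flags:

          void filemap_fdatawait_keep_errors(struct address_space *mapping)
          {
          	loff_t i_size = i_size_read(mapping->host);

          	if (i_size == 0)
          		return;

          	/* wait for writeback, but do NOT clear the mapping's error bits */
          	__filemap_fdatawait_range(mapping, 0, i_size - 1);
          }
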
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  30. 28 Oct, 2015 (1 commit)
    • fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() · b33e18f6
      Committed by Tejun Heo
      bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
      to walk @bdi->wb_list.  To set up the initial iteration
      condition, it uses list_entry_rcu() to calculate the entry
      pointer corresponding to the list head; however, this isn't an
      actual RCU dereference and using list_entry_rcu() for it ended
      up breaking a proposed list_entry_rcu() change because it was
feeding a non-lvalue pointer into the macro.
      
      Don't use the RCU variant for simple pointer offsetting.  Use
      list_entry() instead.
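
      The change itself is a one-liner in bdi_split_work_to_wbs(); a sketch:

          /*
           * Set up the iteration cursor at the list head itself.  This is
           * plain pointer arithmetic, not an RCU dereference, so the plain
           * list_entry() is the right tool.
           */
          struct bdi_writeback *wb = list_entry(&bdi->wb_list,
          				      struct bdi_writeback, bdi_node);

          list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
          	/* actual members are walked under RCU as before */
          }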
Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Patrick Marlier <patrick.marlier@gmail.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  31. 13 Oct, 2015 (1 commit)
    • writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Committed by Tejun Heo
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
This patch adds an RCU-protected @bdi->wb_list which lists all wb's
      belonging to that bdi.  wb's are added on creation and removed on
      release rather than at the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list, and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
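
      A sketch of the replacement iteration pattern over the new list (names
      as upstream):

          struct bdi_writeback *wb;

          rcu_read_lock();
          list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
          	/*
          	 * Covers every wb of the bdi, dying ones included, so
          	 * sync-style operations no longer miss their dirty inodes.
          	 */
          	wb_wakeup(wb);
          }
          rcu_read_unlock();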
Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>