1. 10 8月, 2016 1 次提交
    • K
      mm, writeback: flush plugged IO in wakeup_flusher_threads() · 51350ea0
      Konstantin Khlebnikov 提交于
      I've found funny live-lock between raid10 barriers during resync and
      memory controller hard limits. Inside mpage_readpages() task holds on to
      its plug bio which blocks the barrier in raid10. Its memory cgroup have
      no free memory thus the task goes into reclaimer but all reclaimable
      pages are dirty and cannot be written because raid10 is rebuilding and
      stuck on the barrier.
      
      Common flush of such IO in schedule() never happens, because the caller
      doesn't go to sleep.
      
      Lock is 'live' because changing memory limit or killing tasks which
      holds that stuck bio unblock whole progress.
      
      That was what happened in 3.18.x but I see no difference in upstream
      logic.  Theoretically this might happen even without memory cgroup.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      51350ea0
  2. 05 8月, 2016 1 次提交
    • J
      writeback: Write dirty times for WB_SYNC_ALL writeback · dc5ff2b1
      Jan Kara 提交于
      Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
      queue_io() so that inodes which have only dirty timestamps are properly
      written on fsync(2) and sync(2). However there are other call sites -
      most notably going through write_inode_now() - which expect inode to be
      clean after WB_SYNC_ALL writeback. This is not currently true as we do
      not clear I_DIRTY_TIME in __writeback_single_inode() even for
      WB_SYNC_ALL writeback in all the cases. This then resulted in the
      following oops because bdev_write_inode() did not clean the inode and
      writeback code later stumbled over a dirty inode with detached wb.
      
        general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
        Modules linked in:
        CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Workqueue: writeback wb_workfn (flush-11:0)
        task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
        RIP: 0010:[<ffffffff818884d2>]  [<ffffffff818884d2>]
        locked_inode_to_wb_and_lock_list+0xa2/0x750
        RSP: 0018:ffff88006cdaf7d0  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
        RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
        RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
        R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
        R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
        FS:  0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
        DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
        Stack:
         ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
         ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
         ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
        Call Trace:
         [<     inline     >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
         [<ffffffff8188e12e>] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
         [<ffffffff8188efa4>] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
         [<ffffffff8188f9ae>] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
         [<     inline     >] wb_do_writeback fs/fs-writeback.c:1844
         [<ffffffff81891079>] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
         [<ffffffff813bcd1e>] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
         [<ffffffff813bdc2b>] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
         [<ffffffff813cdeef>] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
         [<ffffffff867bc5d2>] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
        Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
        00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03 <42>
        80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
        RIP  [<     inline     >] wb_get include/linux/backing-dev-defs.h:212
        RIP  [<ffffffff818884d2>] locked_inode_to_wb_and_lock_list+0xa2/0x750
        fs/fs-writeback.c:281
         RSP <ffff88006cdaf7d0>
        ---[ end trace 986a4d314dcb2694 ]---
      
      Fix the problem by making sure __writeback_single_inode() writes inode
      only with dirty times in WB_SYNC_ALL mode.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc5ff2b1
  3. 29 7月, 2016 1 次提交
  4. 27 7月, 2016 2 次提交
  5. 01 7月, 2016 1 次提交
    • T
      writeback: inode cgroup wb switch should not call ihold() · 74524955
      Tahsin Erdogan 提交于
      Asynchronous wb switching of inodes takes an additional ref count on an
      inode to make sure inode remains valid until switchover is completed.
      
      However, anyone calling ihold() must already have a ref count on inode,
      but in this case inode->i_count may already be zero:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
      CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
      01/01/2011
      Workqueue: writeback wb_workfn (flush-8:16)
       0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
       0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
       ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
      Call Trace:
       [<ffffffff805990af>] dump_stack+0x4d/0x6e
       [<ffffffff80268702>] __warn+0xc2/0xe0
       [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
       [<ffffffff8035b4ab>] ihold+0x2b/0x30
       [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
       [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
       [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
       [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
       [<ffffffff8036a147>] wb_workfn+0xd7/0x280
       [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
       [<ffffffff8027bb09>] process_one_work+0x129/0x300
       [<ffffffff8027be06>] worker_thread+0x126/0x480
       [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
       [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
       [<ffffffff80280ff4>] kthread+0xc4/0xe0
       [<ffffffff80335578>] ? kfree+0xc8/0x100
       [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
       [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
      ---[ end trace aaefd2fd9f306bc4 ]---
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      74524955
  6. 21 5月, 2016 1 次提交
    • T
      mm,writeback: don't use memory reserves for wb_start_writeback · 78ebc2f7
      Tetsuo Handa 提交于
      When writeback operation cannot make forward progress because memory
      allocation requests needed for doing I/O cannot be satisfied (e.g.
      under OOM-livelock situation), we can observe flood of order-0 page
      allocation failure messages caused by complete depletion of memory
      reserves.
      
      This is caused by unconditionally allocating "struct wb_writeback_work"
      objects using GFP_ATOMIC from PF_MEMALLOC context.
      
      __alloc_pages_nodemask() {
        __alloc_pages_slowpath() {
          __alloc_pages_direct_reclaim() {
            __perform_reclaim() {
              current->flags |= PF_MEMALLOC;
              try_to_free_pages() {
                do_try_to_free_pages() {
                  wakeup_flusher_threads() {
                    wb_start_writeback() {
                      kzalloc(sizeof(*work), GFP_ATOMIC) {
                        /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                      }
                    }
                  }
                }
              }
              current->flags &= ~PF_MEMALLOC;
            }
          }
        }
      }
      
      Since I/O is stalling, allocating writeback requests forever shall
      deplete memory reserves.  Fortunately, since wb_start_writeback() can
      fall back to wb_wakeup() when allocating "struct wb_writeback_work"
      failed, we don't need to allow wb_start_writeback() to use memory
      reserves.
      
        Mem-Info:
        active_anon:289393 inactive_anon:2093 isolated_anon:29
         active_file:10838 inactive_file:113013 isolated_file:859
         unevictable:0 dirty:108531 writeback:5308 unstable:0
         slab_reclaimable:5526 slab_unreclaimable:7077
         mapped:9970 shmem:2159 pagetables:2387 bounce:0
         free:3042 free_pcp:0 free_cma:0
        Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
        Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        126847 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
        Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
        kthreadd: page allocation failure: order:0, mode:0x2200020
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          warn_alloc_failed+0xf7/0x150
          __alloc_pages_nodemask+0x23f/0xa60
          alloc_pages_current+0x87/0x110
          new_slab+0x3a1/0x440
          ___slab_alloc+0x3cf/0x590
          __slab_alloc.isra.64+0x18/0x1d
          kmem_cache_alloc+0x11c/0x150
          wb_start_writeback+0x39/0x90
          wakeup_flusher_threads+0x7f/0xf0
          do_try_to_free_pages+0x1f9/0x410
          try_to_free_pages+0x94/0xc0
          __alloc_pages_nodemask+0x566/0xa60
          alloc_pages_current+0x87/0x110
          __page_cache_alloc+0xaf/0xc0
          pagecache_get_page+0x88/0x260
          grab_cache_page_write_begin+0x21/0x40
          xfs_vm_write_begin+0x2f/0xf0
          generic_perform_write+0xca/0x1c0
          xfs_file_buffered_aio_write+0xcc/0x1f0
          xfs_file_write_iter+0x84/0x140
          __vfs_write+0xc7/0x100
          vfs_write+0x9d/0x190
          SyS_write+0x50/0xc0
          entry_SYSCALL_64_fastpath+0x12/0x6a
        Mem-Info:
        active_anon:293335 inactive_anon:2093 isolated_anon:0
         active_file:10829 inactive_file:110045 isolated_file:32
         unevictable:0 dirty:109275 writeback:822 unstable:0
         slab_reclaimable:5489 slab_unreclaimable:10070
         mapped:9999 shmem:2159 pagetables:2420 bounce:0
         free:3 free_pcp:0 free_cma:0
        Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
        lowmem_reserve[]: 0 1732 1732 1732
        Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        123086 total pagecache pages
        0 pages in swap cache
        Swap cache stats: add 0, delete 0, find 0/0
        Free swap  = 0kB
        Total swap = 0kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        76348 pages reserved
        0 pages hwpoisoned
        SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
          cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
          node 0: slabs: 3218, objs: 205952, free: 0
        file_io.00: page allocation failure: order:0, mode:0x2200020
        CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
      
      Assuming that somebody will find a better solution, let's apply this
      patch for now to stop bleeding, for this problem frequently prevents me
      from testing OOM livelock condition.
      
      Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.czSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78ebc2f7
  7. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  8. 20 3月, 2016 2 次提交
    • T
      writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode · aaf25593
      Tejun Heo 提交于
      When cgroup writeback is in use, there can be multiple wb's
      (bdi_writeback's) per bdi and an inode may switch among them
      dynamically.  In a couple places, the wrong wb was used leading to
      performing operations on the wrong list under the wrong lock
      corrupting the io lists.
      
      * writeback_single_inode() was taking @wb parameter and used it to
        remove the inode from io lists if it becomes clean after writeback.
        The callers of this function were always passing in the root wb
        regardless of the actual wb that the inode was associated with,
        which could also change while writeback is in progress.
      
        Fix it by dropping the @wb parameter and using
        inode_to_wb_and_lock_list() to determine and lock the associated wb.
      
      * After writeback_sb_inodes() writes out an inode, it re-locks @wb and
        inode to remove it from or move it to the right io list.  It assumes
        that the inode is still associated with @wb; however, the inode may
        have switched to another wb while writeback was in progress.
      
        Fix it by using inode_to_wb_and_lock_list() to determine and lock
        the associated wb after writeback is complete.  As the function
        requires the original @wb->list_lock locked for the next iteration,
        in the unlikely case where the inode has changed association, switch
        the locks.
      
      Kudos to Tahsin for pinpointing these subtle breakages.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.comReported-and-diagnosed-by: NTahsin Erdogan <tahsin@google.com>
      Tested-by: NTahsin Erdogan <tahsin@google.com>
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aaf25593
    • T
      writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list() · 614a4e37
      Tejun Heo 提交于
      locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
      the target inode, unlocks inode, locks the wb's list_lock and verifies
      that the inode is still associated with the wb.  To prevent the wb
      going away between dropping inode lock and acquiring list_lock, the wb
      is pinned while inode lock is held.  The wb reference is put right
      after acquiring list_lock citing that the wb won't be dereferenced
      anymore.
      
      This isn't true.  If the inode is still associated with the wb, the
      inode has reference and it's safe to return the wb; however, if inode
      has been switched, the wb still needs to be unlocked which is a
      dereference and can lead to use-after-free if it it races with wb
      destruction.
      
      Fix it by putting the reference after releasing list_lock.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 87e1d789 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
      Cc: stable@vger.kernel.org # v4.2+
      Tested-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      614a4e37
  9. 04 3月, 2016 1 次提交
    • T
      writeback: flush inode cgroup wb switches instead of pinning super_block · a1a0e23e
      Tejun Heo 提交于
      If cgroup writeback is in use, inodes can be scheduled for
      asynchronous wb switching.  Before 5ff8eaac ("writeback: keep
      superblock pinned during cgroup writeback association switches"), this
      could race with umount leading to super_block being destroyed while
      inodes are pinned for wb switching.  5ff8eaac fixed it by bumping
      s_active while wb switches are in flight; however, this allowed
      in-flight wb switches to make umounts asynchronous when the userland
      expected synchronosity - e.g. fsck immediately following umount may
      fail because the device is still busy.
      
      This patch removes the problematic super_block pinning and instead
      makes generic_shutdown_super() flush in-flight wb switches.  wb
      switches are now executed on a dedicated isw_wq so that they can be
      flushed and isw_nr_in_flight keeps track of the number of in-flight wb
      switches so that flushing can be avoided in most cases.
      
      v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
          check in inode_switch_wbs() as Jan an Al suggested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NTahsin Erdogan <tahsin@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: 5ff8eaac ("writeback: keep superblock pinned during cgroup writeback association switches")
      Cc: stable@vger.kernel.org #v4.5
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a1a0e23e
  10. 17 2月, 2016 1 次提交
    • T
      writeback: keep superblock pinned during cgroup writeback association switches · 5ff8eaac
      Tejun Heo 提交于
      If cgroup writeback is in use, an inode is associated with a cgroup
      for writeback.  If the inode's main dirtier changes to another cgroup,
      the association gets updated asynchronously.  Nothing was pinning the
      superblock while such switches are in progress and superblock could go
      away while async switching is pending or in progress leading to
      crashes like the following.
      
       kernel BUG at fs/jbd2/transaction.c:319!
       invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
       Hardware name: Google Google, BIOS Google 01/01/2011
       Workqueue: events inode_switch_wbs_work_fn
       task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
       RIP: 0010:[<ffffffff803e6922>]  [<ffffffff803e6922>] start_this_handle+0x382/0x3e0
       RSP: 0018:ffff880209267c30  EFLAGS: 00010202
       ...
       Call Trace:
        [<ffffffff803e6be4>] jbd2__journal_start+0xf4/0x190
        [<ffffffff803cfc7e>] __ext4_journal_start_sb+0x4e/0x70
        [<ffffffff803b31ec>] ext4_evict_inode+0x12c/0x3d0
        [<ffffffff8035338b>] evict+0xbb/0x190
        [<ffffffff80354190>] iput+0x130/0x190
        [<ffffffff80360223>] inode_switch_wbs_work_fn+0x343/0x4c0
        [<ffffffff80279819>] process_one_work+0x129/0x300
        [<ffffffff80279b16>] worker_thread+0x126/0x480
        [<ffffffff8027ed14>] kthread+0xc4/0xe0
        [<ffffffff809771df>] ret_from_fork+0x3f/0x70
      
      Fix it by bumping s_active while cgroup association switching is in
      flight.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NTahsin Erdogan <tahsin@google.com>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: d10c8095 ("writeback: implement foreign cgroup inode bdi_writeback switching")
      Cc: stable@vger.kernel.org #v4.5+
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5ff8eaac
  11. 16 1月, 2016 1 次提交
  12. 11 11月, 2015 1 次提交
  13. 10 11月, 2015 1 次提交
  14. 06 11月, 2015 1 次提交
    • J
      mm/filemap.c: make global sync not clear error status of individual inodes · aa750fd7
      Junichi Nomura 提交于
      filemap_fdatawait() is a function to wait for on-going writeback to
      complete but also consume and clear error status of the mapping set during
      writeback.
      
      The latter functionality is critical for applications to detect writeback
      error with system calls like fsync(2)/fdatasync(2).
      
      However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
      which don't check error status of individual mappings.
      
      As a result, fsync() may not be able to detect writeback error if events
      happen in the following order:
      
         Application                    System admin
         ----------------------------------------------------------
         write data on page cache
                                        Run sync command
                                        writeback completes with error
                                        filemap_fdatawait() clears error
         fsync returns success
         (but the data is not on disk)
      
      This patch adds filemap_fdatawait_keep_errors() for call sites where
      writeback error is not handled so that they don't clear error status.
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa750fd7
  15. 28 10月, 2015 1 次提交
    • T
      fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs() · b33e18f6
      Tejun Heo 提交于
      bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
      to walk @bdi->wb_list.  To set up the initial iteration
      condition, it uses list_entry_rcu() to calculate the entry
      pointer corresponding to the list head; however, this isn't an
      actual RCU dereference and using list_entry_rcu() for it ended
      up breaking a proposed list_entry_rcu() change because it was
      feeding an non-lvalue pointer into the macro.
      
      Don't use the RCU variant for simple pointer offsetting.  Use
      list_entry() instead.
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dipankar Sarma <dipankar@in.ibm.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Patrick Marlier <patrick.marlier@gmail.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pranith kumar <bobby.prani@gmail.com>
      Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b33e18f6
  16. 13 10月, 2015 2 次提交
    • T
      writeback: bdi_writeback iteration must not skip dying ones · b817525a
      Tejun Heo 提交于
      bdi_for_each_wb() is used in several places to wake up or issue
      writeback work items to all wb's (bdi_writeback's) on a given bdi.
      The iteration is performed by walking bdi->cgwb_tree; however, the
      tree only indexes wb's which are currently active.
      
      For example, when a memcg gets associated with a different blkcg, the
      old wb is removed from the tree so that the new one can be indexed.
      The old wb starts dying from then on but will linger till all its
      inodes are drained.  As these dying wb's may still host dirty inodes,
      writeback operations which affect all wb's must include them.
      bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
      failing to sync the inodes belonging to those wb's.
      
      This patch adds a RCU protected @bdi->wb_list which lists all wb's
      beloinging to that bdi.  wb's are added on creation and removed on
      release rather than on the start of destruction.  bdi_for_each_wb()
      usages are replaced with list_for_each[_continue]_rcu() iterations
      over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.
      
      v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
          fixed and unnecessary list head severing in cgwb_bdi_destroy()
          removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NArtem Bityutskiy <dedekind1@gmail.com>
      Fixes: ebe41ab0 ("writeback: implement bdi_for_each_wb()")
      Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b817525a
    • T
      writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback() · 6fdf860f
      Tejun Heo 提交于
      wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
      unfortunately, it was always waking up bdi->wb instead of the wb being
      walked.  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 001fe6f6 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
      Reviewed-by: NJan Kara <jack@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6fdf860f
  17. 20 9月, 2015 1 次提交
    • C
      fs-writeback: unplug before cond_resched in writeback_sb_inodes · 590dca3a
      Chris Mason 提交于
      Commit 505a666e ("writeback: plug writeback in wb_writeback() and
      writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
      which increases the merge rate when relatively contiguous small files
      are written by the filesystem.  It helps both on flash and spindles.
      
      For an fs_mark workload creating 4K files in parallel across 8 drives,
      this commit improves performance ~9% more by unplugging before calling
      cond_resched().  cond_resched() doesn't trigger an implicit unplug, so
      explicitly getting the IO down to the device before scheduling reduces
      latencies for anyone waiting on clean pages.
      
      It also cuts down on how often we use kblockd to unplug, which means
      less work bouncing from one workqueue to another.
      
      Many more details about how we got here:
      
        https://lkml.org/lkml/2015/9/11/570Signed-off-by: NChris Mason <clm@fb.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      590dca3a
  18. 13 9月, 2015 1 次提交
    • L
      writeback: plug writeback in wb_writeback() and writeback_inodes_wb() · 505a666e
      Linus Torvalds 提交于
      We had to revert the pluggin in writeback_sb_inodes() because the
      wb->list_lock is held, but we could easily plug at a higher level before
      taking that lock, and unplug after releasing it.  This does that.
      
      Chris will run performance numbers, just to verify that this approach is
      comparable to the alternative (we could just drop and re-take the lock
      around the blk_finish_plug() rather than these two commits.
      
      I'd have preferred waiting for actual performance numbers before picking
      one approach over the other, but I don't want to release rc1 with the
      known "sleeping function called from invalid context" issue, so I'll
      pick this cleanup version for now.  But if the numbers show that we
      really want to plug just at the writeback_sb_inodes() level, and we
      should just play ugly games with the spinlock, we'll switch to that.
      
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      505a666e
  19. 12 9月, 2015 1 次提交
    • L
      Revert "writeback: plug writeback at a high level" · 0ba13fd1
      Linus Torvalds 提交于
      This reverts commit d353d758.
      
      Doing the block layer plug/unplug inside writeback_sb_inodes() is
      broken, because that function is actually called with a spinlock held:
      wb->list_lock, as pointed out by Chris Mason.
      
      Chris suggested just dropping and re-taking the spinlock around the
      blk_finish_plug() call (the plgging itself can happen under the
      spinlock), and that would technically work, but is just disgusting.
      
      We do something fairly similar - but not quite as disgusting because we
      at least have a better reason for it - in writeback_single_inode(), so
      it's not like the caller can depend on the lock being held over the
      call, but in this case there just isn't any good reason for that
      "release and re-take the lock" pattern.
      
      [ In general, we should really strive to avoid the "release and retake"
        pattern for locks, because in the general case it can easily cause
        subtle bugs when the caller caches any state around the call that
        might be invalidated by dropping the lock even just temporarily. ]
      
      But in this case, the plugging should be easy to just move up to the
      callers before the spinlock is taken, which should even improve the
      effectiveness of the plug.  So there is really no good reason to play
      games with locking here.
      
      I'll send off a test-patch so that Dave Chinner can verify that that
      plug movement works.  In the meantime this just reverts the problematic
      commit and adds a comment to the function so that we hopefully don't
      make this mistake again.
      Reported-by: NChris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ba13fd1
  20. 26 8月, 2015 1 次提交
    • T
      writeback: sync_inodes_sb() must write out I_DIRTY_TIME inodes and always call wait_sb_inodes() · 006a0973
      Tejun Heo 提交于
      e7972912 ("writeback: don't issue wb_writeback_work if clean")
      updated writeback path to avoid kicking writeback work items if there
      are no inodes to be written out; unfortunately, the avoidance logic
      was too aggressive and broke sync_inodes_sb().
      
      * sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME
        inodes dont't contribute to bdi/wb_has_dirty_io() tests and were
        being skipped over.
      
      * inodes are taken off wb->b_dirty/io/more_io lists after writeback
        starts on them.  sync_inodes_sb() skipping wait_sb_inodes() when
        bdi_has_dirty_io() breaks it by making it return while writebacks
        are in-flight.
      
      This patch fixes the breakages by
      
      * Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs().
        The callers are already testing the condition.
      
      * Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that
        it always calls into bdi_split_work_to_wbs() and wait_sb_inodes().
      
      * Making bdi_split_work_to_wbs() consider the b_dirty_time list for
        WB_SYNC_ALL writebacks.
      
      Kudos to Eryu, Dave and Jan for tracking down the issue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: e7972912 ("writeback: don't issue wb_writeback_work if clean")
      Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.comReported-and-bisected-by: NEryu Guan <eguan@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Ted Ts'o <tytso@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      006a0973
  21. 19 8月, 2015 4 次提交
    • T
      writeback: update writeback tracepoints to report cgroup · 5634cc2a
      Tejun Heo 提交于
      The following tracepoints are updated to report the cgroup used during
      cgroup writeback.
      
      * writeback_write_inode[_start]
      * writeback_queue
      * writeback_exec
      * writeback_start
      * writeback_written
      * writeback_wait
      * writeback_nowork
      * writeback_wake_background
      * wbc_writepage
      * writeback_queue_io
      * bdi_dirty_ratelimit
      * balance_dirty_pages
      * writeback_sb_inodes_requeue
      * writeback_single_inode[_start]
      
      Note that writeback_bdi_register is separated out from writeback_class
      as reporting cgroup doesn't make sense to it.  Tracepoints which take
      bdi are updated to take bdi_writeback instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5634cc2a
    • T
      60292bcc
    • T
      writeback: remove wb_writeback_work->single_wait/done · 8a1270cd
      Tejun Heo 提交于
      wb_writeback_work->single_wait/done are used for the wait mechanism
      for synchronous wb_work (wb_writeback_work) items which are issued
      when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
      wb_work items; however, there's no reason to use a separate wait
      mechanism for this.  bdi_split_work_to_wbs() can simply use on-stack
      fallback wb_work item and separate wb_completion to wait for it.
      
      This patch removes wb_work->single_wait/done and the related code and
      make bdi_split_work_to_wbs() use on-stack fallback wb_work and
      wb_completion instead.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8a1270cd
    • T
      writeback: bdi_for_each_wb() iteration is memcg ID based not blkcg · 1ed8d48c
      Tejun Heo 提交于
      wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
      earlier implementation, wb's were keyed by blkcg ID.
      bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
      allows iterations to start from an arbitrary ID which is used to
      interrupt and resume iterations.
      
      Unfortunately, while changing wb to be keyed by memcg ID instead of
      blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
      are keyed by blkcg ID.  This doesn't affect iterations which don't get
      interrupted but bdi_split_work_to_wbs() makes use of iteration
      resuming on allocation failures and thus may incorrectly skip or
      repeat wb's.
      
      Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
      blkcg IDs and updating bdi_split_work_to_wbs() accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1ed8d48c
  22. 18 8月, 2015 4 次提交
    • D
      inode: rename i_wb_list to i_io_list · c7f54084
      Dave Chinner 提交于
      There's a small consistency problem between the inode and writeback
      naming. Writeback calls the "for IO" inode queues b_io and
      b_more_io, but the inode calls these the "writeback list" or
      i_wb_list. This makes it hard to an new "under writeback" list to
      the inode, or call it an "under IO" list on the bdi because either
      way we'll have writeback on IO and IO on writeback and it'll just be
      confusing. I'm getting confused just writing this!
      
      So, rename the inode "for IO" list variable to i_io_list so we can
      add a new "writeback list" in a subsequent patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NDave Chinner <dchinner@redhat.com>
      c7f54084
    • D
      sync: serialise per-superblock sync operations · e97fedb9
      Dave Chinner 提交于
      When competing sync(2) calls walk the same filesystem, they need to
      walk the list of inodes on the superblock to find all the inodes
      that we need to wait for IO completion on. However, when multiple
      wait_sb_inodes() calls do this at the same time, they contend on the
      the inode_sb_list_lock and the contention causes system wide
      slowdowns. In effect, concurrent sync(2) calls can take longer and
      burn more CPU than if they were serialised.
      
      Stop the worst of the contention by adding a per-sb mutex to wrap
      around wait_sb_inodes() so that we only execute one sync(2) IO
      completion walk per superblock superblock at a time and hence avoid
      contention being triggered by concurrent sync(2) calls.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NDave Chinner <dchinner@redhat.com>
      e97fedb9
    • D
      inode: convert inode_sb_list_lock to per-sb · 74278da9
      Dave Chinner 提交于
      The process of reducing contention on per-superblock inode lists
      starts with moving the locking to match the per-superblock inode
      list. This takes the global lock out of the picture and reduces the
      contention problems to within a single filesystem. This doesn't get
      rid of contention as the locks still have global CPU scope, but it
      does isolate operations on different superblocks form each other.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NDave Chinner <dchinner@redhat.com>
      74278da9
    • D
      writeback: plug writeback at a high level · d353d758
      Dave Chinner 提交于
      Doing writeback on lots of little files causes terrible IOPS storms
      because of the per-mapping writeback plugging we do. This
      essentially causes imeediate dispatch of IO for each mapping,
      regardless of the context in which writeback is occurring.
      
      IOWs, running a concurrent write-lots-of-small 4k files using fsmark
      on XFS results in a huge number of IOPS being issued for data
      writes.  Metadata writes are sorted and plugged at a high level by
      XFS, so aggregate nicely into large IOs. However, data writeback IOs
      are dispatched in individual 4k IOs, even when the blocks of two
      consecutively written files are adjacent.
      
      Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
      metadata CRCs enabled.
      
      Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
      
      Test:
      
      $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
      /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
      /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
      /mnt/scratch/6  -d  /mnt/scratch/7
      
      Result:
      
      		wall	sys	create rate	Physical write IO
      		time	CPU	(avg files/s)	 IOPS	Bandwidth
      		-----	-----	------------	------	---------
      unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
      patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
      improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
      
      If I use zero length files, this workload at about 500 IOPS, so
      plugging drops the data IOs from roughly 25,500/s to 1000/s.
      3 lines of code, 35% better throughput for 15% less CPU.
      
      The benefits of plugging at this layer are likely to be higher for
      spinning media as the IO patterns for this workload are going make a
      much bigger difference on high IO latency devices.....
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      d353d758
  23. 24 7月, 2015 1 次提交
  24. 18 6月, 2015 1 次提交
    • T
      writeback: do foreign inode detection iff cgroup writeback is enabled · dd73e4b7
      Tejun Heo 提交于
      Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
      flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
      foreign inode detection and migration logic still ends up activating
      cgroup writeback which is unexpected.  This patch ensures that the
      foreign inode detection logic stays disabled when inode_cgwb_enabled()
      is false by not associating writeback_control's with bdi_writeback's.
      
      This also avoids unnecessary operations in wbc_init_bio(),
      wbc_account_io() and wbc_detach_inode() for filesystems which don't
      support cgroup writeback.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dd73e4b7
  25. 02 6月, 2015 7 次提交
    • T
      writeback: disassociate inodes from dying bdi_writebacks · e8a7abf5
      Tejun Heo 提交于
      For the purpose of foreign inode detection, wb's (bdi_writeback's) are
      identified by the associated memcg ID.  As we create a separate wb for
      each memcg, this is enough to identify the active wb's; however, when
      blkcg is enabled or disabled higher up in the hierarchy, the mapping
      between memcg and blkcg changes which in turn creates a new wb to
      service the new mapping.  The old wb is unlinked from index and
      released after all references are drained.  The foreign inode
      detection logic can't detect this condition because both the old and
      new wb's point to the same memcg and thus never decides to move inodes
      attached to the old wb to the new one.
      
      This patch adds logic to initiate switching immediately in
      wbc_attach_and_unlock_inode() if the associated wb is dying.  We can
      make the usual foreign detection logic to distinguish the different
      wb's mapped to the memcg but the dying wb is never gonna be in active
      service again and there's no point in tracking the usage history and
      reaching the switch verdict after enough data points are collected.
      It's already known that the wb has to be switched.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      e8a7abf5
    • T
      writeback: implement foreign cgroup inode bdi_writeback switching · d10c8095
      Tejun Heo 提交于
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and transfers the ownership to it.  The previous patches
      implemented the foreign condition detection mechanism and laid the
      groundwork.  This patch implements the actual switching.
      
      With the previously implemented [unlocked_]inode_to_wb_and_list_lock()
      and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
      mapping->tree_lock gives us full exclusion against all wb operations
      on the target inode.  inode_switch_wb_work_fn() grabs all the locks
      and transfers the inode atomically along with its RECLAIMABLE and
      WRITEBACK stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d10c8095
    • T
      writeback: add lockdep annotation to inode_to_wb() · aaa2cacf
      Tejun Heo 提交于
      With the previous three patches, all operations which acquire wb from
      inode are either under one of inode->i_lock, mapping->tree_lock or
      wb->list_lock or protected by unlocked_inode_to_wb transaction.  This
      will be depended upon by foreign inode wb switching.
      
      This patch adds lockdep assertion to inode_to_wb() so that usages
      outside the above list locks can be caught easily.  There are three
      exceptions.
      
      * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
        wb may not be the inode's.  Ensuring that is the function's role
        after all.  Updated to deref inode->i_wb directly.
      
      * inode_wb_stat_unlocked_begin() is usually protected by combination
        of !I_WB_SWITCH and rcu_read_lock().  Updated to deref inode->i_wb
        directly.
      
      * inode_congested() wants to test whether inode->i_wb is set before
        starting the transaction.  Added inode_to_wb_is_valid() which tests
        inode->i_wb directly.
      
      v5: might_lock() removed.  It annotates that the lock is grabbed w/
          irq enabled which isn't the case and triggering lockdep warning
          spuriously.
      
      v4: might_lock() added to unlocked_inode_to_wb_begin().
      
      v3: inode_congested() conversion added.
      
      v2: locked_inode_to_wb_and_lock_list() was missing in the first
          version.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      aaa2cacf
    • T
      writeback: use unlocked_inode_to_wb transaction in inode_congested() · 5cb8b824
      Tejun Heo 提交于
      Similar to wb stat updates, inode_congested() accesses the associated
      wb of an inode locklessly, which will break with foreign inode wb
      switching.  This path updates inode_congested() to use unlocked inode
      wb access transaction introduced by the previous patch.
      
      Combined with the previous two patches, this makes all wb list and
      access operations to be protected by either of inode->i_lock,
      wb->list_lock, or mapping->tree_lock while wb switching is in
      progress.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      5cb8b824
    • T
      writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo 提交于
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch build the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a give inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat updates are wrapped in transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      682aa8e1
    • T
      writeback: implement [locked_]inode_to_wb_and_lock_list() · 87e1d789
      Tejun Heo 提交于
      cgroup writeback currently assumes that inode to wb association
      doesn't change; however, with the planned foreign inode wb switching
      mechanism, the association will change dynamically.
      
      When an inode needs to be put on one of the IO lists of its wb, the
      current code simply calls inode_to_wb() and locks the returned wb;
      however, with the planned wb switching, the association may change
      before locking the wb and may even get released.
      
      This patch implements [locked_]inode_to_wb_and_lock_list() which pins
      the associated wb while holding i_lock, releases it, acquires
      wb->list_lock and verifies that the association hasn't changed
      inbetween.  As the association will be protected by both locks among
      other things, this guarantees that the wb is the inode's associated wb
      until the list_lock is released.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      87e1d789
    • T
      writeback: implement foreign cgroup inode detection · 2a814908
      Tejun Heo 提交于
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and will transfer the ownership to it.  To avoid
      unnnecessary oscillation, the detection mechanism keeps track of
      history and gives out the switch verdict only if the foreign usage
      pattern is stable over a certain amount of time and/or writeback
      attempts.
      
      The detection mechanism has fairly low space and computation overhead.
      It adds 8 bytes to struct inode (one int and two u16's) and minimal
      amount of calculation per IO.  The detection mechanism converges to
      the correct answer usually in several seconds of IO time when there's
      a clear majority dirtier.  Even when there isn't, it can reach an
      acceptable answer fairly quickly under most circumstances.
      
      Please see wb_detach_inode() for more details.
      
      This patch only implements detection.  Following patches will
      implement actual switching.
      
      v2: wbc_account_io() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      2a814908