• G
    writeback: safer lock nesting · 2e898e4c
    Greg Thelen 提交于
    lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.
    
    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains.  Switches occur when
    enough writes are issued from a new domain.
    
    This existing pattern is thus suspicious:
        lock_page_memcg(page);
        unlocked_inode_to_wb_begin(inode, &locked);
        ...
        unlocked_inode_to_wb_end(inode, locked);
        unlock_page_memcg(page);
    
    If both inode switch and process memcg migration are both in-flight then
    unlocked_inode_to_wb_end() will unconditionally enable interrupts while
    still holding the lock_page_memcg() irq spinlock.  This suggests the
    possibility of deadlock if an interrupt occurs before unlock_page_memcg().
    
        truncate
        __cancel_dirty_page
        lock_page_memcg
        unlocked_inode_to_wb_begin
        unlocked_inode_to_wb_end
        <interrupts mistakenly enabled>
                                        <interrupt>
                                        end_page_writeback
                                        test_clear_page_writeback
                                        lock_page_memcg
                                        <deadlock>
        unlock_page_memcg
    
    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).
    
    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers lockup in less than a minute:
    
      cd /mnt/cgroup/memory
      mkdir a b
      echo 1 > a/memory.move_charge_at_immigrate
      echo 1 > b/memory.move_charge_at_immigrate
      (
        echo $BASHPID > a/cgroup.procs
        while true; do
          dd if=/dev/zero of=/mnt/big bs=1M count=256
        done
      ) &
      while true; do
        sync
      done &
      sleep 1h &
      SLEEP=$!
      while true; do
        echo $SLEEP > a/cgroup.procs
        echo $SLEEP > b/cgroup.procs
      done
    
    The deadlock does not seem possible, so it's debatable if there's any
    reason to modify the kernel.  I suggest we should to prevent future
    surprises.  And Wang Long said "this deadlock occurs three times in our
    environment", so there's more reason to apply this, even to stable.
    Stable 4.4 has minor conflicts applying this patch.  For a clean 4.4 patch
    see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146
    
    Wang Long said "this deadlock occurs three times in our environment"
    
    [gthelen@google.com: v4]
      Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: NGreg Thelen <gthelen@google.com>
    Reported-by: NWang Long <wanglong19@meituan.com>
    Acked-by: NWang Long <wanglong19@meituan.com>
    Acked-by: NMichal Hocko <mhocko@suse.com>
    Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Nicholas Piggin <npiggin@gmail.com>
    Cc: <stable@vger.kernel.org>	[v4.2+]
    Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
    2e898e4c
backing-dev.h 13.6 KB