1. 10 May 2020 (4 commits)
  2. 07 May 2020 (1 commit)
  3. 02 Apr 2020 (1 commit)
    • blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it · d866dbf6
      Committed by Tejun Heo
      blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
      don't get offlined while there are active cgwbs on them.  However, it
      ends up making offlining unordered sometimes causing parents to be
      offlined before children.
      
      To fix it, we want child blkcgs to pin the parents' online states,
      turning the refcnt into a more generic online pinning mechanism.
      
      In preparation:
      
      * blkcg->cgwb_refcnt -> blkcg->online_pin
      * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
      * Take them out of CONFIG_CGROUP_WRITEBACK
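      The pinning idea above can be sketched as a small user-space model. This
      is a hedged illustration, not the kernel code: the struct and function
      names (blkcg_model, blkcg_model_pin_online(), ...) are hypothetical
      stand-ins for blkcg_pin_online()/blkcg_unpin_online(), and the real
      kernel uses atomic refcounts and proper locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of online pinning: a child holds a pin on its parent's
 * online state, so a parent can only go offline after all children
 * (and active cgwbs) have dropped their pins. */
struct blkcg_model {
        int online_pin;                 /* pins against going offline */
        bool offlined;                  /* set when the last pin drops */
        struct blkcg_model *parent;
};

static void blkcg_model_pin_online(struct blkcg_model *cg)
{
        cg->online_pin++;
}

static void blkcg_model_unpin_online(struct blkcg_model *cg)
{
        if (--cg->online_pin == 0) {
                cg->offlined = true;    /* "offline" this blkcg */
                /* release the pin this child held on its parent, so
                 * parents always go offline after their children */
                if (cg->parent)
                        blkcg_model_unpin_online(cg->parent);
        }
}
```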
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d866dbf6
  4. 01 Feb 2020 (1 commit)
    • memcg: fix a crash in wb_workfn when a device disappears · 68f23b89
      Committed by Theodore Ts'o
      Without memcg, there is a one-to-one mapping between the bdi and
      bdi_writeback structures.  In this world, things are fairly
      straightforward; the first thing bdi_unregister() does is to shut down
      the bdi_writeback structure (or wb), and part of that shutdown ensures
      that no other work is queued against the wb, and that the wb is fully
      drained.
      
      With memcg, however, there is a one-to-many relationship between the bdi
      and bdi_writeback structures; that is, there are multiple wb objects
      which can all point to a single bdi.  There is a refcount which prevents
      the bdi object from being released (and hence, unregistered).  So in
      theory, bdi_unregister() *should* only get called once its refcount
      goes to zero (bdi_put will drop the refcount, and when it is zero,
      release_bdi gets called, which calls bdi_unregister).
      
      Unfortunately, del_gendisk() in block/genhd.c never got the memo about
      the Brave New memcg World, and calls bdi_unregister directly.  It does
      this without informing the file system, or the memcg code, or anything
      else.  This causes the root wb associated with the bdi to be
      unregistered, but none of the memcg-specific wbs are shut down.  So when
      one of these wbs is woken up to do delayed work, it tries to
      dereference its wb->bdi->dev to fetch the device name, but
      unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
      called by del_gendisk().  As a result, *boom*.
      
      Fortunately, it looks like the rest of the writeback path is perfectly
      happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
      create a bdi_dev_name() function which can handle bdi->dev being NULL.
      This also allows us to bulletproof the writeback tracepoints to prevent
      them from dereferencing a NULL pointer and crashing the kernel if one is
      tracing with memcg's enabled, and an iSCSI device dies or a USB storage
      stick is pulled.
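      The shape of such a helper can be sketched as a minimal user-space model
      with made-up struct names; the only property it assumes from the text
      above is that the helper must tolerate a NULL device pointer (the actual
      kernel bdi_dev_name() differs in detail):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal model of a NULL-tolerant device-name helper.  Struct names
 * are invented; the key property is that we never dereference a dev
 * pointer that bdi_unregister() may already have cleared. */
struct device_model { const char *name; };
struct bdi_model { struct device_model *dev; };

static const char *bdi_model_dev_name(const struct bdi_model *bdi)
{
        if (!bdi || !bdi->dev)
                return "(unknown)";     /* device already gone */
        return bdi->dev->name;
}
```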
      
      The most common way of triggering this will be hotremoval of a device
      while writeback with memcg enabled is going on.  It was triggering
      several times a day in a heavily loaded production environment.
      
      Google Bug Id: 145475544
      
      Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
      Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68f23b89
  5. 06 Oct 2019 (1 commit)
    • bdi: Do not use freezable workqueue · a2b90f11
      Committed by Mika Westerberg
      A removable block device, such as an NVMe drive or an SSD connected over
      Thunderbolt, can be hot-removed at any time, including while the system
      is suspended. When the device is hot-removed during suspend and the
      system gets resumed, the kernel first resumes devices and then thaws
      userspace, including freezable workqueues. What happens in that case is
      that the NVMe driver notices
      that the device is unplugged and removes it from the system. This ends
      up calling bdi_unregister() for the gendisk which then schedules
      wb_workfn() to be run one more time.
      
      However, since bdi_wq is still frozen, the flush_delayed_work() call in
      wb_shutdown() blocks forever, halting the system resume process. The
      user sees this as a hang, since nothing is happening anymore.
      
      Triggering sysrq-w reveals this:
      
        Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
        Call Trace:
         ? __schedule+0x2c5/0x630
         ? wait_for_completion+0xa4/0x120
         schedule+0x3e/0xc0
         schedule_timeout+0x1c9/0x320
         ? resched_curr+0x1f/0xd0
         ? wait_for_completion+0xa4/0x120
         wait_for_completion+0xc3/0x120
         ? wake_up_q+0x60/0x60
         __flush_work+0x131/0x1e0
         ? flush_workqueue_prep_pwqs+0x130/0x130
         bdi_unregister+0xb9/0x130
         del_gendisk+0x2d2/0x2e0
         nvme_ns_remove+0xed/0x110 [nvme_core]
         nvme_remove_namespaces+0x96/0xd0 [nvme_core]
         nvme_remove+0x5b/0x160 [nvme]
         pci_device_remove+0x36/0x90
         device_release_driver_internal+0xdf/0x1c0
         nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
         process_one_work+0x1c2/0x3f0
         worker_thread+0x48/0x3e0
         kthread+0x100/0x140
         ? current_work+0x30/0x30
         ? kthread_park+0x80/0x80
         ret_from_fork+0x35/0x40
      
      This is not limited to NVMe, so exactly the same issue can be reproduced
      by hot-removing an SSD (over Thunderbolt) while the system is suspended.
      
      Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.
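      The hang condition can be modelled in a few lines of user-space C.
      Purely illustrative: the struct and field names are invented, and the
      point is only the predicate that pending work on a frozen freezable
      queue cannot be flushed:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy predicate for the hang: during resume the system-wide freeze is
 * still in effect for freezable workqueues, so pending work on such a
 * queue cannot run and flushing it blocks forever. */
struct wq_model {
        bool freezable;         /* queue created with WQ_FREEZABLE */
        bool system_frozen;     /* freezable wqs not yet thawed on resume */
        bool work_pending;      /* e.g. wb_workfn() queued by bdi_unregister() */
};

static bool wq_model_flush_completes(const struct wq_model *wq)
{
        /* a frozen freezable queue cannot run its pending work */
        if (wq->work_pending && wq->freezable && wq->system_frozen)
                return false;
        return true;
}
```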
      Reported-by: AceLan Kao <acelan.kao@canonical.com>
      Link: https://marc.info/?l=linux-kernel&m=138695698516487
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
      Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a2b90f11
  6. 27 Aug 2019 (2 commits)
  7. 03 Jun 2019 (1 commit)
  8. 21 May 2019 (1 commit)
  9. 23 Jan 2019 (1 commit)
  10. 01 Sep 2018 (1 commit)
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Committed by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkcg.
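      The staged teardown above can be sketched as a user-space model. All
      names here (blkcg_m, blkcg_m_cgwb_put(), ...) are illustrative
      stand-ins for blkcg_css_offline()/blkcg_destroy_blkgs()/blkcg_css_free();
      the real kernel uses atomic refcounts:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the staged teardown: blkg destruction is gated on the
 * cgwb count, and the final free is gated on the css refcount held by
 * each blkg. */
struct blkcg_m {
        int cgwb_refcnt;        /* one base ref + one per live cgwb */
        int css_refcnt;         /* one base ref + one per blkg */
        int nr_blkgs;
        bool blkgs_destroyed, freed;
};

static void blkcg_m_css_put(struct blkcg_m *cg)
{
        if (--cg->css_refcnt == 0)
                cg->freed = true;               /* blkcg_css_free() */
}

static void blkcg_m_cgwb_put(struct blkcg_m *cg)
{
        if (--cg->cgwb_refcnt == 0) {
                cg->blkgs_destroyed = true;     /* blkcg_destroy_blkgs() */
                while (cg->nr_blkgs-- > 0)
                        blkcg_m_css_put(cg);    /* each blkg drops its css ref */
        }
}

static void blkcg_m_offline(struct blkcg_m *cg)
{
        blkcg_m_cgwb_put(cg);   /* blkcg_css_offline(): put the base ref */
}
```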
      
      It seems that in the past blk-throttle didn't do the most understandable
      things by taking data from a blkg while associating with current. So the
      simplification and unification of what blk-throttle is doing exposed
      this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      59b57717
  11. 23 Aug 2018 (2 commits)
  12. 23 Jun 2018 (1 commit)
    • bdi: Fix another oops in wb_workfn() · 3ee7e869
      Committed by Jan Kara
      syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
      wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
      WB_shutting_down after wb->bdi->dev became NULL. This indicates that
      bdi_unregister() failed to call wb_shutdown() on one of the wb objects.
      
      The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
      drops bdi's reference to wb structures before going through the list of
      wbs again and calling wb_shutdown() on each of them. This way the loop
      iterating through all wbs can easily miss a wb if that wb has already
      passed through cgwb_remove_from_bdi_list(), called from wb_shutdown()
      from cgwb_release_workfn(), and as a result fully shut down the bdi although
      wb_workfn() for this wb structure is still running. In fact there are
      also other ways cgwb_bdi_unregister() can race with
      cgwb_release_workfn() leading e.g. to use-after-free issues:
      
      CPU1                            CPU2
                                      cgwb_bdi_unregister()
                                        cgwb_kill(*slot);
      
      cgwb_release()
        queue_work(cgwb_release_wq, &wb->release_work);
      cgwb_release_workfn()
                                        wb = list_first_entry(&bdi->wb_list, ...)
                                        spin_unlock_irq(&cgwb_lock);
        wb_shutdown(wb);
        ...
        kfree_rcu(wb, rcu);
                                        wb_shutdown(wb); -> oops use-after-free
      
      We solve these issues by synchronizing writeback structure shutdown from
      cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
      way we also no longer need synchronization using WB_shutting_down, as the
      mutex provides it for the CONFIG_CGROUP_WRITEBACK case, and without
      CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once, from
      bdi_unregister().
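      The essence of the fix, shutdown made idempotent under a mutex so both
      paths can call it safely, can be modelled in user-space with pthreads.
      A sketch only: the names are invented, and the kernel synchronizes much
      more than a single flag:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Toy model of idempotent shutdown under a mutex: whichever of the two
 * paths gets there first performs the teardown; the other sees the
 * flag and returns without touching freed state. */
struct wb_model {
        pthread_mutex_t lock;
        bool shut_down;
        int teardown_runs;      /* how many times real teardown ran */
};

static void wb_model_shutdown(struct wb_model *wb)
{
        pthread_mutex_lock(&wb->lock);
        if (!wb->shut_down) {
                wb->shut_down = true;
                wb->teardown_runs++;    /* the real work happens once */
        }
        pthread_mutex_unlock(&wb->lock);
}
```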
      Reported-by: syzbot <syzbot+4a7438e774b21ddd8eca@syzkaller.appspotmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3ee7e869
  13. 08 Jun 2018 (1 commit)
  14. 24 May 2018 (1 commit)
    • bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue · f1834646
      Committed by Tejun Heo
      From: Tejun Heo <tj@kernel.org>
      Date: Wed, 23 May 2018 10:29:00 -0700
      
      cgwb_release() punts the actual release to cgwb_release_workfn() on
      system_wq.  Depending on the number of cgroups or block devices, there
      can be a lot of cgwb_release_workfn() in flight at the same time.
      
      We're periodically seeing close to 256 kworkers getting stuck with the
      following stack trace, and over time the entire system gets stuck.
      
        [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
        [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
        [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
        [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
        [<ffffffff811cd645>] wb_exit+0x85/0xa0
        [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
        [<ffffffff810a68d0>] process_one_work+0x150/0x410
        [<ffffffff810a71fd>] worker_thread+0x6d/0x520
        [<ffffffff810ad3dc>] kthread+0x12c/0x160
        [<ffffffff81969019>] ret_from_fork+0x29/0x40
        [<ffffffffffffffff>] 0xffffffffffffffff
      
      The events leading to the lockup are...
      
      1. A lot of cgwb_release_workfn() is queued at the same time and all
         system_wq kworkers are assigned to execute them.
      
      2. They all end up calling synchronize_rcu_expedited().  One of them
         wins and tries to perform the expedited synchronization.
      
      3. However, that involves queueing rcu_exp_work to system_wq and
         waiting for it.  Because #1 is holding all available kworkers on
         system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
         is waiting for synchronize_rcu_expedited() which in turn is waiting
         for cgwb_release_workfn() to free up some of the kworkers.
      
      We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
      same time.  There's nothing to be gained from that.  This patch
      updates cgwb release path to use a dedicated percpu workqueue with
      @max_active of 1.
      
      While this resolves the problem at hand, it might be a good idea to
      isolate rcu_exp_work to its own workqueue too as it can be used from
      various paths and is prone to this sort of indirect A-A deadlocks.
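      The deadlock condition can be captured as a small predicate. This is a
      toy model, not kernel code: the function name and parameters are
      invented, and it only encodes the observation that if every system_wq
      worker is occupied by work that itself waits on another system_wq work
      item, nothing can make progress:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy predicate for the indirect A-A deadlock: release work occupies
 * system_wq kworkers while waiting on rcu_exp_work, and rcu_exp_work
 * itself needs a free system_wq worker to run. */
static bool would_deadlock(int nr_kworkers, int nr_release_workfns,
                           bool release_on_dedicated_wq)
{
        /* system_wq workers tied up by cgwb_release_workfn() */
        int busy = release_on_dedicated_wq ? 0 : nr_release_workfns;

        if (busy > nr_kworkers)
                busy = nr_kworkers;
        /* rcu_exp_work needs at least one free system_wq worker */
        return busy == nr_kworkers;
}
```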
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f1834646
  15. 03 May 2018 (2 commits)
  16. 12 Apr 2018 (1 commit)
    • mm/vmscan: don't mess with pgdat->flags in memcg reclaim · e3c1ac58
      Committed by Andrey Ryabinin
      memcg reclaim may alter pgdat->flags based on the state of LRU lists in
      the cgroup and its children.  PGDAT_WRITEBACK may force kswapd to sleep
      in congested_wait(), PGDAT_DIRTY may force kswapd to write back
      filesystem pages.  But the worst here is PGDAT_CONGESTED, since it may
      force all direct reclaims to stall in wait_iff_congested().  Note that
      only kswapd has the power to clear any of these bits.  This might just
      never happen if cgroup limits are configured that way.  So all direct
      reclaims will stall as long as we have some congested bdi in the system.
      
      Leave all pgdat->flags manipulations to kswapd.  Since kswapd scans the
      whole pgdat and only kswapd can clear pgdat->flags once the node is
      balanced, it's reasonable to leave all decisions about node state to
      kswapd.
      
      Why only kswapd?  Why not allow global direct reclaim to change these
      flags?  Because currently only kswapd can clear these flags.  I'm less
      worried about the case when PGDAT_CONGESTED is falsely not set, and more
      worried about the case when it is falsely set.  If a direct reclaimer
      sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
      problem is sorted out, kswapd will be woken up and clear the flag?  It
      seems like there is no such guarantee.  E.g.  direct reclaimers may
      eventually balance the pgdat and kswapd simply won't wake up (see
      wakeup_kswapd()).
      
      Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
      now loses its congestion throttling mechanism.  Add per-cgroup
      congestion state and throttle cgroup2 reclaimers if the memcg is in a
      congested state.
      
      Currently there is no need for per-cgroup PGDAT_WRITEBACK and
      PGDAT_DIRTY bits, since they alter only kswapd behavior.
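      The per-cgroup, per-node congestion state can be sketched as a simple
      bitmask model. The names below are illustrative (the kernel stores a
      congested flag in the memcg's per-node structure), and real node ids
      come from the pgdat:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of per-cgroup, per-node congestion state: memcg reclaim
 * marks only its own (memcg, node) pair instead of the global
 * PGDAT_CONGESTED, so only that cgroup's reclaimers throttle. */
struct memcg_model {
        unsigned long congested_mask;   /* one bit per node id */
};

static void memcg_set_congested(struct memcg_model *m, int nid)
{
        m->congested_mask |= 1UL << nid;
}

static void memcg_clear_congested(struct memcg_model *m, int nid)
{
        m->congested_mask &= ~(1UL << nid);
}

static bool memcg_should_throttle(const struct memcg_model *m, int nid)
{
        return (m->congested_mask >> nid) & 1UL;
}
```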
      
      The problem could be easily demonstrated by creating heavy congestion in
      one cgroup:
      
          echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
          mkdir -p /sys/fs/cgroup/congester
          echo 512M > /sys/fs/cgroup/congester/memory.max
          echo $$ > /sys/fs/cgroup/congester/cgroup.procs
          /* generate a lot of dirty data on a slow HDD */
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
          ....
          while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
      
      and some job in another cgroup:
      
          mkdir /sys/fs/cgroup/victim
          echo 128M > /sys/fs/cgroup/victim/memory.max
      
          # time cat /dev/sda > /dev/null
          real    10m15.054s
          user    0m0.487s
          sys     1m8.505s
      
      According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
      of the time sleeping there.
      
      With the patch, cat doesn't waste time anymore:
      
          # time cat /dev/sda > /dev/null
          real    5m32.911s
          user    0m0.411s
          sys     0m56.664s
      
      [aryabinin@virtuozzo.com: congestion state should be per-node]
        Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
      [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
        Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3c1ac58
  17. 06 Apr 2018 (1 commit)
  18. 01 Mar 2018 (1 commit)
  19. 22 Dec 2017 (1 commit)
  20. 20 Nov 2017 (2 commits)
  21. 06 Oct 2017 (1 commit)
  22. 12 Sep 2017 (1 commit)
  23. 21 Apr 2017 (5 commits)
  24. 23 Mar 2017 (6 commits)
    • bdi: Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() · b1c51afc
      Committed by Jan Kara
      Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
      from bdi_unregister() which is not necessarily called from bdi_destroy()
      and thus the name is somewhat misleading.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b1c51afc
    • bdi: Do not wait for cgwbs release in bdi_unregister() · 4514451e
      Committed by Jan Kara
      Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
      (called from bdi_unregister()). That is, however, unnecessary now that
      cgwb->bdi is a proper refcounted reference (thus bdi cannot get
      released before all cgwbs are released) and when cgwb_bdi_destroy()
      shuts down writeback directly.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4514451e
    • bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy() · 5318ce7d
      Committed by Jan Kara
      Until now we waited for all cgwbs to get freed in cgwb_bdi_destroy(),
      which also means that writeback has been shut down on them. Since this
      wait is going away, directly shut down writeback on cgwbs from
      cgwb_bdi_destroy() to avoid live writeback structures after
      bdi_unregister() has finished. To make that safe with concurrent
      shutdown from cgwb_release_workfn(), we also have to make sure
      wb_shutdown() returns only after the bdi_writeback structure is really
      shut down.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5318ce7d
    • bdi: Unify bdi->wb_list handling for root wb_writeback · e8cb72b3
      Committed by Jan Kara
      Currently the root wb_writeback structure is added to bdi->wb_list in
      bdi_init() and never removed. That is different from all other
      wb_writeback structures, which get added to the list when created and
      removed from it before wb_shutdown().
      
      So move list addition of root bdi_writeback to bdi_register() and list
      removal of all wb_writeback structures to wb_shutdown(). That way a
      wb_writeback structure is on bdi->wb_list if and only if it can handle
      writeback, which will make it easier for us to handle shutdown of all
      wb_writeback structures in bdi_unregister().
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      e8cb72b3
    • bdi: Make wb->bdi a proper reference · 810df54a
      Committed by Jan Kara
      Make wb->bdi a proper refcounted reference to bdi for all bdi_writeback
      structures except for the one embedded inside struct backing_dev_info.
      That will allow us to simplify bdi unregistration.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      810df54a
    • bdi: Mark congested->bdi as internal · b7d680d7
      Committed by Jan Kara
      The congested->bdi pointer is used only to be able to remove the
      congested structure from bdi->cgwb_congested_tree on structure release.
      Moreover, the pointer can become NULL when we unregister the bdi. Rename
      the field to __bdi and add a comment to make it more explicit that this
      is internal to the memcg writeback code; people should not use the
      field, as such use will likely be race prone.
      
      We do not bother with converting congested->bdi to a proper refcounted
      reference. It would be slightly ugly to special-case bdi->wb.congested
      to avoid effectively a cyclic reference of bdi to itself, and the
      reference gets cleared from bdi_unregister(), making it impossible to
      reference a freed bdi.
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b7d680d7