1. 17 Jan, 2020 1 commit
    • alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Committed by Xunlei Pang
      We reserve some fields beforehand in core structures that are prone to
      change, so that it does not hurt when extra fields have to be added for
      a hotfix, thereby increasing the hotfix success rate. With this
      enhancement we can even hot-add features.

      After reserving, cache behavior is normally unaffected, as the reserved
      fields (usually at the tail) are never accessed.

      The following structures are currently involved:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
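A minimal userspace sketch of the reserved-field idea described in this commit. The struct names, field names, and the number of reserved slots are hypothetical and are not taken from the Cloud Kernel source. It only illustrates that a hotfix can turn a reserved slot into a real field without changing the structure's size or the offsets of existing members.

```c
#include <stddef.h>

/* Hypothetical layout as shipped: real fields plus reserved tail slots. */
struct demo_zone_v1 {
    unsigned long managed_pages;
    int node;
    unsigned long reserved[4];      /* not accessed until a hotfix needs it */
};

/* Hypothetical layout after a hotfix claims one reserved slot. */
struct demo_zone_v2 {
    unsigned long managed_pages;
    int node;
    unsigned long hotfix_counter;   /* occupies the former reserved[0] */
    unsigned long reserved[3];
};

/* Compile-time checks: total size and existing offsets stay the same. */
_Static_assert(sizeof(struct demo_zone_v1) == sizeof(struct demo_zone_v2),
               "hotfix must not change the structure size");
_Static_assert(offsetof(struct demo_zone_v1, node) ==
               offsetof(struct demo_zone_v2, node),
               "existing members must not move");

/* Runtime view of the same property; returns 1 when layouts are compatible. */
static int demo_layouts_compatible(void)
{
    return sizeof(struct demo_zone_v1) == sizeof(struct demo_zone_v2) &&
           offsetof(struct demo_zone_v1, reserved) ==
           offsetof(struct demo_zone_v2, hotfix_counter);
}
```

Because the new field sits exactly where `reserved[0]` used to be, code built against the old layout would still find every pre-existing member at the same offset.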
  2. 22 Nov, 2017 1 commit
    • block/laptop_mode: Convert timers to use timer_setup() · bca237a5
      Committed by Kees Cook
      In preparation for unconditionally passing the struct timer_list pointer to
      all timer callbacks, switch to using the new timer_setup() and from_timer()
      to pass the timer pointer explicitly.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
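The conversion pattern this commit describes can be mimicked in userspace as follows. This is only a sketch: `struct timer_list` here is a stand-in, and `demo_timer_setup()`, `demo_container_of()`, `demo_fire()`, and `struct demo_queue` are hypothetical names. In the kernel the corresponding helpers are `timer_setup()` and `from_timer()`, the latter being a `container_of()` wrapper.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's struct timer_list. */
struct timer_list {
    void (*function)(struct timer_list *t);
};

/* Member pointer -> enclosing object, the idea behind from_timer(). */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical object that embeds a timer. */
struct demo_queue {
    int laptop_mode_fired;
    struct timer_list laptop_mode_timer;
};

/* Registers a callback that receives the timer pointer itself. */
static void demo_timer_setup(struct timer_list *t,
                             void (*fn)(struct timer_list *))
{
    t->function = fn;
}

/* The callback recovers its owner from the timer pointer. */
static void demo_laptop_timer_fn(struct timer_list *t)
{
    struct demo_queue *q =
        demo_container_of(t, struct demo_queue, laptop_mode_timer);

    q->laptop_mode_fired++;
}

/* Simulates the timer core invoking the callback on expiry. */
static void demo_fire(struct timer_list *t)
{
    t->function(t);
}
```

The callback no longer needs an opaque data argument, since the timer pointer alone is enough to locate the enclosing structure.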
  3. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Committed by Greg Kroah-Hartman
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that were:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
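The message notes that header and source .c files need different comment types for the SPDX tag. The sketch below shows that distinction, following the convention in the kernel's license-rules documentation (a `//` line comment in C sources, a `/* */` block comment in headers). The functions `demo_has_suffix()` and `demo_spdx_line()` are hypothetical and are not the script that was actually used to generate the patches.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 when `name` ends with `suffix`, 0 otherwise. */
static int demo_has_suffix(const char *name, const char *suffix)
{
    size_t n = strlen(name), s = strlen(suffix);

    return n >= s && strcmp(name + n - s, suffix) == 0;
}

/*
 * Writes the SPDX tag line for `filename` into `buf`.
 * Headers get a block comment, other files a line comment.
 * Returns the value of snprintf().
 */
static int demo_spdx_line(const char *filename, const char *license,
                          char *buf, size_t len)
{
    if (demo_has_suffix(filename, ".h"))
        return snprintf(buf, len, "/* SPDX-License-Identifier: %s */",
                        license);
    return snprintf(buf, len, "// SPDX-License-Identifier: %s", license);
}
```

This sketch only distinguishes `.h` from everything else; a real tool would also have to handle Makefiles, scripts, and assembly, which use other comment syntaxes.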
  4. 10 Oct, 2017 1 commit
  5. 05 Oct, 2017 1 commit
    • writeback: eliminate work item allocation in bd_start_writeback() · 85009b4f
      Committed by Jens Axboe
      Handle start-all writeback like we do periodic or kupdate
      style writeback - by marking the bdi_writeback as needing a full
      flush, and simply waking the thread. This eliminates the need to
      allocate and queue a specific work item just for this purpose.
      
      After this change, we truly only ever have one of them running at
      any point in time. We mark the need to start all flushes, and the
      writeback thread will clear it once it has processed the request.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
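The "mark a flag and wake the thread" approach described above can be sketched in userspace with a C11 atomic flag. All names here (`struct demo_wb`, `demo_start_writeback()`, `demo_flusher_run()`) are hypothetical, and the wakeup is reduced to a counter. This is an illustration of the idea, not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-device writeback state. */
struct demo_wb {
    atomic_bool start_all;  /* a "flush everything" request is pending */
    int wakeups;            /* stand-in for waking the flusher thread */
};

static void demo_wb_init(struct demo_wb *wb)
{
    atomic_init(&wb->start_all, false);
    wb->wakeups = 0;
}

/*
 * Requests a full flush. No work item is allocated: the request is a
 * single flag. Returns true only if this call newly raised the flag.
 */
static bool demo_start_writeback(struct demo_wb *wb)
{
    if (atomic_exchange(&wb->start_all, true))
        return false;       /* already pending, at most one in flight */
    wb->wakeups++;
    return true;
}

/*
 * Flusher side: processes a pending request, then clears the flag.
 * Returns true if there was a request to process.
 */
static bool demo_flusher_run(struct demo_wb *wb)
{
    if (!atomic_load(&wb->start_all))
        return false;
    /* ... the actual writeback would happen here ... */
    atomic_store(&wb->start_all, false);
    return true;
}
```

Repeated requests while one is pending collapse into the same flag, which matches the message's point that only one such flush is ever outstanding.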
  6. 03 Oct, 2017 2 commits
  7. 23 Mar, 2017 1 commit
    • block: Fix oops in locked_inode_to_wb_and_lock_list() · f759741d
      Committed by Jan Kara
      When block device is closed, we call inode_detach_wb() in __blkdev_put()
      which sets inode->i_wb to NULL. That is contrary to expectations that
      inode->i_wb stays valid once set during the whole inode's lifetime and
      leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
      inode_to_wb() returned NULL.
      
      The reason why we called inode_detach_wb() is not valid anymore though.
      BDI is guaranteed to stay around until we call bdi_put() from
      bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
      moment.
      
      Also add a warning to catch if someone uses inode_detach_wb() in a
      dangerous way.
      Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  8. 25 Feb, 2017 1 commit
  9. 26 Dec, 2016 1 commit
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Committed by Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that clears the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
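A userspace sketch of why keeping the lock bit and the waiters bit in the same word is useful: one atomic read-modify-write can clear the lock bit and report whether anyone is waiting. The names `struct demo_page`, `DEMO_PG_LOCKED`, `DEMO_PG_WAITERS`, and the helper functions are hypothetical. The kernel uses its own page-flag helpers and waitqueue locking, which this does not reproduce.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_PG_LOCKED  (1u << 0)   /* page is locked */
#define DEMO_PG_WAITERS (1u << 1)   /* tasks may be on the waitqueue */

/* Hypothetical page with both bits in one flags word. */
struct demo_page {
    atomic_uint flags;
};

/* Tries to take the page lock; returns true on success. */
static bool demo_trylock_page(struct demo_page *p)
{
    unsigned int old = atomic_fetch_or(&p->flags, DEMO_PG_LOCKED);

    return !(old & DEMO_PG_LOCKED);
}

/* A sleeper would set this (under the waitqueue lock) before sleeping. */
static void demo_mark_waiters(struct demo_page *p)
{
    atomic_fetch_or(&p->flags, DEMO_PG_WAITERS);
}

/*
 * Clears the lock bit. The returned old value already tells us whether
 * the waiters bit was set, so no separate waitqueue check (and no extra
 * cacheline load) is needed in the common, uncontended case.
 * Returns true if the wake-up path has to run.
 */
static bool demo_unlock_page(struct demo_page *p)
{
    unsigned int old = atomic_fetch_and(&p->flags, ~DEMO_PG_LOCKED);

    return (old & DEMO_PG_WAITERS) != 0;
}
```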
  10. 03 Nov, 2016 2 commits
  11. 01 Nov, 2016 1 commit
  12. 08 Oct, 2016 1 commit
    • mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Committed by Michal Hocko
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during the reclaim.  Too many pages could be
      put under writeback therefore LRUs would be full of unreclaimable pages
      until the IO completes and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced since then in the
      reclaim path though.  Writers are throttled by balance_dirty_pages when
      initiating the buffered IO and later during the memory pressure, the
      direct reclaim is throttled by wait_iff_congested if the node is
      considered congested by dirty pages on LRUs and the underlying bdi is
      congested by the queued IO.  The kswapd is throttled as well if it
      encounters pages marked for immediate reclaim or under writeback which
      signals that there are too many pages under writeback already.
      Finally should_reclaim_retry does congestion_wait if the reclaim cannot
      make any progress and there are too many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  In a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure so yet another
      throttling point doesn't really seem helpful.  Quite the contrary, Mikulas
      Patocka has reported that swap backed by dm-crypt doesn't work properly
      because the swapout IO cannot make sufficient progress as the writeout
      path depends on dm_crypt worker which has to allocate memory to perform
      the encryption.  In order to guarantee a forward progress it relies on
      the mempool allocator.  mempool_alloc(), however, prefers to use the
      underlying (usually page) allocator before it grabs objects from the
      pool.  Such an allocation can dive into the memory reclaim and
      consequently to throttle_vm_writeout.  If there are too many dirty pages
      or pages under writeback it will get throttled even though it is in
      fact a flusher to clear pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very
      helpful anymore.
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which has introduced that [1].  Small
      virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
      rather slow NFS in a sync mode on the host) with 8 parallel writers each
      writing 1G worth of data.  As soon as the pagecache fills up and the
      direct reclaim hits then I start anon memory consumer in a loop
      (allocating 300M and exiting after populating it) in the background to
      make the memory pressure even stronger as well as to disrupt the steady
      state for the IO.  The direct reclaim is throttled because of the
      congestion as well as kswapd hitting congestion_wait due to nr_immediate
      but throttle_vm_writeout doesn't ever trigger the sleep throughout the
      test.  Dirty+writeback are close to nr_dirty_threshold with some
      fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 29 Jul, 2016 1 commit
  14. 27 Jul, 2016 1 commit
  15. 04 Mar, 2016 1 commit
    • writeback: flush inode cgroup wb switches instead of pinning super_block · a1a0e23e
      Committed by Tejun Heo
      If cgroup writeback is in use, inodes can be scheduled for
      asynchronous wb switching.  Before 5ff8eaac ("writeback: keep
      superblock pinned during cgroup writeback association switches"), this
      could race with umount leading to super_block being destroyed while
      inodes are pinned for wb switching.  5ff8eaac fixed it by bumping
      s_active while wb switches are in flight; however, this allowed
      in-flight wb switches to make umounts asynchronous when the userland
      expected synchronous behavior - e.g. fsck immediately following umount may
      fail because the device is still busy.
      
      This patch removes the problematic super_block pinning and instead
      makes generic_shutdown_super() flush in-flight wb switches.  wb
      switches are now executed on a dedicated isw_wq so that they can be
      flushed and isw_nr_in_flight keeps track of the number of in-flight wb
      switches so that flushing can be avoided in most cases.
      
      v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
          check in inode_switch_wbs() as Jan and Al suggested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Tahsin Erdogan <tahsin@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: 5ff8eaac ("writeback: keep superblock pinned during cgroup writeback association switches")
      Cc: stable@vger.kernel.org #v4.5
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  16. 02 Jun, 2015 12 commits
    • writeback: implement foreign cgroup inode detection · 2a814908
      Committed by Tejun Heo
      As concurrent write sharing of an inode is expected to be very rare
      and memcg only tracks page ownership on first-use basis severely
      confining the usefulness of such sharing, cgroup writeback tracks
      ownership per-inode.  While the support for concurrent write sharing
      of an inode is deemed unnecessary, an inode being written to by
      different cgroups at different points in time is a lot more common,
      and, more importantly, charging only by first-use can too readily lead
      to grossly incorrect behaviors (single foreign page can lead to
      gigabytes of writeback to be incorrectly attributed).
      
      To resolve this issue, cgroup writeback detects the majority dirtier
      of an inode and will transfer the ownership to it.  To avoid
      unnecessary oscillation, the detection mechanism keeps track of
      history and gives out the switch verdict only if the foreign usage
      pattern is stable over a certain amount of time and/or writeback
      attempts.
      
      The detection mechanism has fairly low space and computation overhead.
      It adds 8 bytes to struct inode (one int and two u16's) and minimal
      amount of calculation per IO.  The detection mechanism converges to
      the correct answer usually in several seconds of IO time when there's
      a clear majority dirtier.  Even when there isn't, it can reach an
      acceptable answer fairly quickly under most circumstances.
      
      Please see wbc_detach_inode() for more details.
      
      This patch only implements detection.  Following patches will
      implement actual switching.
      
      v2: wbc_account_io() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
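A deliberately simplified sketch of history-based detection as described in this message: each writeback round records whether a foreign cgroup was the majority dirtier, and a switch is only suggested once that pattern is stable across the recorded history. The 16-slot history, the "more than half" threshold, and all names are assumptions made for illustration. The kernel function the message refers to also weighs rounds by time, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HIST_SLOTS     16                      /* assumed history length */
#define DEMO_HIST_THRESHOLD (DEMO_HIST_SLOTS / 2)   /* assumed verdict bar */

/* Hypothetical per-inode detection state: one bit per past round. */
struct demo_inode_frn {
    uint16_t history;
};

/* Counts the set bits in a 16-bit value. */
static int demo_popcount16(uint16_t v)
{
    int n = 0;

    while (v) {
        n += v & 1;
        v >>= 1;
    }
    return n;
}

/*
 * Records one writeback round. `foreign_majority` says whether a cgroup
 * other than the current owner dirtied most of the data in this round.
 * Returns true when more than half of the remembered rounds were
 * foreign, i.e. when switching ownership looks justified.
 */
static bool demo_record_round(struct demo_inode_frn *f, bool foreign_majority)
{
    f->history = (uint16_t)((f->history << 1) | (foreign_majority ? 1u : 0u));
    return demo_popcount16(f->history) > DEMO_HIST_THRESHOLD;
}
```

With these assumed constants, an inode whose dirtier alternates every round never crosses the threshold, which is the kind of oscillation the history is meant to suppress.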
    • writeback: make writeback_control track the inode being written back · b16b1deb
      Committed by Tejun Heo
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As currently, once established, inode to wb association never changes,
      going through wbc when initializing bio's doesn't cause any behavior
      changes.
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() · 21c6321f
      Committed by Tejun Heo
      Currently, majority of cgroup writeback support including all the
      above functions are implemented in include/linux/backing-dev.h and
      mm/backing-dev.c; however, the portion closely related to writeback
      logic implemented in include/linux/writeback.h and mm/page-writeback.c
      will expand to support foreign writeback detection and correction.
      
      This patch moves wb[_try]_get() and wb_put() to
      include/linux/backing-dev-defs.h so that they can be used from
      writeback.h and inode_{attach|detach}_wb() to writeback.h and
      page-writeback.c.
      
      This is pure reorganization and doesn't introduce any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes · 2529bb3a
      Committed by Tejun Heo
      The amount of available memory to a memcg wb_domain can change as
      memcg configuration changes.  A domain's ->dirty_limit exists to
      smooth out sudden drops in dirty threshold; however, when a domain's
      size actually drops significantly, it hinders the dirty throttling
      from adjusting to the new configuration leading to unexpected
      behaviors including unnecessary OOM kills.
      
      This patch resolves the issue by adding wb_domain_size_changed() which
      resets ->dirty_limit[_tstmp] and making memcg call it on configuration
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
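A small sketch of why a smoothed limit can lag behind a real drop in domain size, and what a reset does about it. The smoothing rule used here (follow increases immediately, decay towards the threshold by 1/32 of the gap per update) is an assumption chosen for illustration, and `struct demo_domain`, `demo_update_limit()`, and `demo_domain_size_changed()` are hypothetical names, not the kernel's code.

```c
/* Hypothetical writeback domain with a smoothed dirty limit. */
struct demo_domain {
    unsigned long dirty_limit;
};

/*
 * Periodic update: the limit follows the threshold up immediately but
 * only creeps down, never going below the current dirty count.
 */
static void demo_update_limit(struct demo_domain *d,
                              unsigned long thresh, unsigned long dirty)
{
    unsigned long floor = thresh > dirty ? thresh : dirty;

    if (d->dirty_limit < thresh) {
        d->dirty_limit = thresh;
        return;
    }
    if (d->dirty_limit > floor)
        d->dirty_limit -= (d->dirty_limit - floor) >> 5;
}

/*
 * Called when the domain's size is reconfigured: forget the smoothed
 * value so the next update adopts the new threshold at once.
 */
static void demo_domain_size_changed(struct demo_domain *d)
{
    d->dirty_limit = 0;
}
```

Without the reset, a limit of 1000 facing a new threshold of 200 would only drop to 975 on the next update. After the reset it lands on 200 directly.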
    • writeback: implement memcg wb_domain · 841710aa
      Committed by Tejun Heo
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportionately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      The previous patches laid the groundwork to support the two wb_domains
      and this patch implements memcg wb_domain.  memcg->cgwb_domain is
      initialized on css online and destroyed on css release,
      wb->memcg_completions is added, and __wb_writeout_inc() is updated to
      increment completions against both global and memcg wb_domains.
      
      The following patches will update balance_dirty_pages() and its
      subroutines to actually consider memcg wb_domain for throttling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
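The idea of one wb being measured in two domains can be sketched with plain counters. This is an assumption-laden simplification: the kernel tracks decaying proportions rather than raw completion counts, and `struct demo_domain`, `struct demo_wb`, `demo_writeout_inc()`, and `demo_share()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical writeback domain: total completions seen in it. */
struct demo_domain {
    uint64_t completions;
};

/* Hypothetical wb that belongs to the global domain and maybe a memcg one. */
struct demo_wb {
    uint64_t completions;           /* counted against the global domain */
    uint64_t memcg_completions;     /* counted against the memcg domain */
    struct demo_domain *global;
    struct demo_domain *memcg;      /* NULL when no memcg domain applies */
};

/* One writeout completed: account it in every domain the wb belongs to. */
static void demo_writeout_inc(struct demo_wb *wb)
{
    wb->completions++;
    wb->global->completions++;
    if (wb->memcg) {
        wb->memcg_completions++;
        wb->memcg->completions++;
    }
}

/* Portion of `dirtyable` proportional to the wb's part of a domain. */
static uint64_t demo_share(uint64_t dirtyable, uint64_t wb_part,
                           uint64_t domain_total)
{
    if (domain_total == 0)
        return 0;
    return dirtyable * wb_part / domain_total;
}
```

The same wb can thus hold a 75% share of the global domain while being the only writer, and so holding the full share, in its memcg domain.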
    • writeback: move over_bground_thresh() to mm/page-writeback.c · aa661bbe
      Committed by Tejun Heo
      and rename it to wb_over_bg_thresh().  The function is closely tied to
      the dirty throttling mechanism implemented in page-writeback.c.  This
      relocation will allow future updates necessary for cgroup writeback
      support.
      
      While at it, add function comment.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: move global_dirty_limit into wb_domain · dcc25ae7
      Committed by Tejun Heo
      This patch is a part of the series to define wb_domain which
      represents a domain that wb's (bdi_writeback's) belong to and are
      measured against each other in.  This will enable IO backpressure
      propagation for cgroup writeback.
      
      global_dirty_limit exists to regulate the global dirty threshold which
      is a property of the wb_domain.  This patch moves hard_dirty_limit,
      dirty_lock, and update_time into wb_domain.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • writeback: implement wb_domain · 380c27ca
      Committed by Tejun Heo
      Dirtyable memory is distributed to a wb (bdi_writeback) according to
      the relative bandwidth the wb is writing out in the whole system.
      This distribution is global - each wb is measured against all other
      wb's and gets the proportinately sized portion of the memory in the
      whole system.
      
      For cgroup writeback, the amount of dirtyable memory is scoped by
      memcg and thus each wb would need to be measured and controlled in its
      memcg.  IOW, a wb will belong to two writeback domains - the global
      and memcg domains.
      
      Currently, the state that makes up the global writeback domain is
      scattered across a number of globals.  This patch starts collecting it
      into struct wb_domain.
      
      * fprop_global which serves as the basis for proportional bandwidth
        measurement and its period timer are moved into struct wb_domain.
      
      * global_wb_domain hosts the states for the global domain.
      
      * While at it, flatten wb_writeout_fraction() into its callers.  This
        thin wrapper doesn't provide any actual benefits while getting in
        the way.
      
      This is pure reorganization and doesn't introduce any behavioral
      changes.
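
      The proportional distribution described at the top of this message can
      be sketched in user space as follows.  This is a simplified model and
      not the kernel's code: struct wb_model and wb_share() are hypothetical
      names, and the real computation is more involved than a plain ratio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one wb: only its recent writeout bandwidth. */
struct wb_model {
    uint64_t write_bandwidth;   /* e.g. pages per second (illustrative) */
};

/*
 * Share of @dirtyable pages given to wbs[idx], proportional to its
 * bandwidth relative to the sum over the whole domain.  Falls back to an
 * even split when no bandwidth has been observed yet.  Overflow of the
 * product is ignored for brevity.
 */
uint64_t wb_share(const struct wb_model *wbs, size_t nr, size_t idx,
                  uint64_t dirtyable)
{
    uint64_t total = 0;

    assert(nr > 0 && idx < nr);
    for (size_t i = 0; i < nr; i++)
        total += wbs[i].write_bandwidth;
    if (total == 0)
        return dirtyable / nr;
    return dirtyable * wbs[idx].write_bandwidth / total;
}
```

      With two wbs writing at 100 and 300 units, a domain of 1000 dirtyable
      pages splits 250/750.  For cgroup writeback the same kind of ratio
      would presumably be evaluated once per domain a wb belongs to.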
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: reorganize [__]wb_update_bandwidth() · 8a731799
      Tejun Heo committed
      __wb_update_bandwidth() is called from two places -
      mm/page-writeback.c::balance_dirty_pages() and
      fs/fs-writeback.c::wb_writeback().  The latter updates only the
      write bandwidth while the former also deals with the dirty ratelimit.
      The two callsites are distinguished by whether @thresh parameter is
      zero or not, which is cryptic.  In addition, the two files define
      their own different versions of wb_update_bandwidth() on top of
      __wb_update_bandwidth(), which is confusing to say the least.  This
      patch cleans up [__]wb_update_bandwidth() in the following ways.
      
      * __wb_update_bandwidth() now takes explicit @update_ratelimit
        parameter to gate dirty ratelimit handling.
      
      * mm/page-writeback.c::wb_update_bandwidth() is flattened into its
        caller - balance_dirty_pages().
      
      * fs/fs-writeback.c::wb_update_bandwidth() is moved to
        mm/page-writeback.c and __wb_update_bandwidth() is made static.
      
      * While at it, add a lockdep assertion to __wb_update_bandwidth().
      
      Except for the lockdep addition, this is pure reorganization and
      doesn't introduce any behavioral changes.
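
      The move from a magic zero @thresh to an explicit flag can be
      illustrated with a small user-space sketch.  All names below are
      hypothetical stand-ins; the counters only record which part of the
      update ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping: counts how often each part was updated. */
struct bw_state {
    unsigned int bandwidth_updates;
    unsigned int ratelimit_updates;
};

/* Stand-in for __wb_update_bandwidth(): the flag gates ratelimit handling. */
void update_bandwidth_core(struct bw_state *st, bool update_ratelimit)
{
    assert(st != NULL);
    st->bandwidth_updates++;
    if (update_ratelimit)
        st->ratelimit_updates++;
}

/* Writeback path: only the write bandwidth is refreshed. */
void update_from_writeback(struct bw_state *st)
{
    update_bandwidth_core(st, false);
}

/* Throttling path: the dirty ratelimit is refreshed as well. */
void update_from_throttle(struct bw_state *st)
{
    update_bandwidth_core(st, true);
}
```

      Each call site now states its intent through the wrapper it uses
      instead of encoding it in the value of an unrelated parameter.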
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: clean up wb_dirty_limit() · 0d960a38
      Tejun Heo committed
      The function name wb_dirty_limit(), its argument @dirty and the local
      variable @wb_dirty are mortally confusing given that the function
      calculates per-wb threshold value not dirty pages, especially given
      that @dirty and @wb_dirty are used elsewhere for dirty pages.
      
      Let's rename the function to wb_calc_thresh() and wb_dirty to
      wb_thresh.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: restructure try_writeback_inodes_sb[_nr]() · f30a7d0c
      Tejun Heo committed
      try_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
      handles s_umount locking and skips if writeback is already in
      progress.  The in-progress test is performed on the root wb
      (bdi_writeback) which isn't sufficient for cgroup writeback support.
      The test must be done per-wb.
      
      To prepare for the change, this patch factors out
      __writeback_inodes_sb_nr() from writeback_inodes_sb_nr(), adds
      @skip_if_busy and moves the in-progress test right before queueing the
      wb_writeback_work.  try_writeback_inodes_sb_nr() now just grabs
      s_umount and invokes __writeback_inodes_sb_nr() with @skip_if_busy
      set.  This way, later addition of multiple wb handling can
      skip only the wb's which already have writeback in progress.
      
      This swaps the order between the in-progress test and the s_umount
      test, which can flip the return value when writeback is in progress and
      s_umount is being held by someone else, but this shouldn't cause any
      meaningful difference.  It's a fringe condition and the return value is an
      unsynchronized hint anyway.
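
      A minimal user-space sketch of the resulting control flow, under the
      assumption that the s_umount trylock and the in-progress test can each
      be modelled as a boolean.  struct sb_model and both functions are
      hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sb_model {
    bool umount_held_by_other;   /* models a failing trylock on s_umount */
    bool writeback_in_progress;  /* models the in-progress test on the wb */
    unsigned int queued;         /* number of work items queued so far */
};

/* Stand-in for __writeback_inodes_sb_nr(): true if work was queued. */
bool queue_writeback(struct sb_model *sb, bool skip_if_busy)
{
    assert(sb != NULL);
    /* The busy test now sits right before queueing the work item. */
    if (skip_if_busy && sb->writeback_in_progress)
        return false;
    sb->queued++;
    return true;
}

/* Stand-in for try_writeback_inodes_sb_nr(): lock first, then delegate. */
bool try_queue_writeback(struct sb_model *sb)
{
    assert(sb != NULL);
    if (sb->umount_held_by_other)
        return false;
    return queue_writeback(sb, true);
}
```

      Because the busy test lives in the inner function, a later version
      that iterates over several wb's could skip just the busy ones.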
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: move bandwidth related fields from backing_dev_info into bdi_writeback · a88a341a
      Tejun Heo committed
      Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
      and the role of the separation is unclear.  For cgroup support for
      writeback IOs, a bdi will be updated to host multiple wb's where each
      wb serves writeback IOs of a different cgroup on the bdi.  To achieve
      that, a wb should carry all states necessary for servicing writeback
      IOs for a cgroup independently.
      
      This patch moves bandwidth related fields from backing_dev_info into
      bdi_writeback.
      
      * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
        write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
        balanced_dirty_ratelimit, completions and dirty_exceeded.
      
      * writeback_chunk_size() and over_bground_thresh() now take @wb
        instead of @bdi.
      
      * bdi_writeout_fraction(bdi, ...)        -> wb_writeout_fraction(wb, ...)
        bdi_dirty_limit(bdi, ...)              -> wb_dirty_limit(wb, ...)
        bdi_position_ratio(bdi, ...)           -> wb_position_ratio(wb, ...)
        bdi_update_write_bandwidth(bdi, ...)   -> wb_update_write_bandwidth(wb, ...)
        [__]bdi_update_bandwidth(bdi, ...)     -> [__]wb_update_bandwidth(wb, ...)
        bdi_{max|min}_pause(bdi, ...)          -> wb_{max|min}_pause(wb, ...)
        bdi_dirty_limits(bdi, ...)             -> wb_dirty_limits(wb, ...)
      
      * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
        respectively.  Note that explicit zeroing is dropped in the process
        as wb's are cleared in entirety anyway.
      
      * As there's still only one bdi_writeback per backing_dev_info, all
        uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
        introducing no behavior changes.
      
      v2: Typo in description fixed as suggested by Jan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  17. 18 Mar, 2015 1 commit
  18. 09 Jan, 2015 1 commit
    • J
      mm: protect set_page_dirty() from ongoing truncation · 2d6d7f98
      Johannes Weiner committed
      Tejun, while reviewing the code, spotted the following race condition
      between the dirtying and truncation of a page:
      
      __set_page_dirty_nobuffers()       __delete_from_page_cache()
        if (TestSetPageDirty(page))
                                           page->mapping = NULL
                                           if (PageDirty())
                                             dec_zone_page_state(page, NR_FILE_DIRTY);
                                             dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
          if (page->mapping)
            account_page_dirtied(page)
              __inc_zone_page_state(page, NR_FILE_DIRTY);
              __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      
      which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
      
      Dirtiers usually lock out truncation, either by holding the page lock
      directly, or in case of zap_pte_range(), by pinning the mapcount with
      the page table lock held.  The notable exception to this rule, though,
      is do_wp_page(), for which this race exists.  However, do_wp_page()
      already waits for a locked page to unlock before setting the dirty bit,
      in order to prevent a race where clear_page_dirty() misses the page bit
      in the presence of dirty ptes.  Upgrade that wait to a fully locked
      set_page_dirty() to also cover the situation explained above.
      
      Afterwards, the code in set_page_dirty() dealing with a truncation race
      is no longer needed.  Remove it.
      Reported-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 16 Jul, 2014 1 commit
    • N
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown committed
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{,_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
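
      The caller-side convention from the last point can be sketched in user
      space.  wait_on_bit_model() is a hypothetical stand-in that only mimics
      the return contract described above (non-zero when the wait is
      abandoned because of a pending signal), and -EINTR is just one error
      code a caller might pick.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical counterpart of the 'mode' argument. */
enum wait_mode { WAIT_UNINTERRUPTIBLE, WAIT_INTERRUPTIBLE };

/* Returns non-zero only if the mode allows interruption and a signal is
 * pending; no specific error code is implied by the value. */
int wait_on_bit_model(enum wait_mode mode, bool signal_pending)
{
    assert(mode == WAIT_UNINTERRUPTIBLE || mode == WAIT_INTERRUPTIBLE);
    return (mode == WAIT_INTERRUPTIBLE && signal_pending) ? 1 : 0;
}

/* The caller checks for non-zero and supplies its own error code. */
int acquire_resource(enum wait_mode mode, bool signal_pending)
{
    if (wait_on_bit_model(mode, signal_pending))
        return -EINTR;   /* the caller's choice, illustrative only */
    return 0;
}
```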
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible".
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps' (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since the first version of this patch (against 3.15) two new action
      functions appeared, one in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 08 Apr, 2014 1 commit
  21. 22 Feb, 2014 1 commit
    • J
      Revert "writeback: do not sync data dirtied after sync start" · 0dc83bd3
      Jan Kara committed
      This reverts commit c4a391b5. Dave
      Chinner <david@fromorbit.com> has reported the commit may cause some
      inodes to be left out from sync(2). This is because we can call
      redirty_tail() for some inode (which sets i_dirtied_when to current time)
      after sync(2) has started or similarly requeue_inode() can set
      i_dirtied_when to current time if writeback had to skip some pages. The
      real problem is in the functions clobbering i_dirtied_when but fixing
      that isn't trivial so revert is a safer choice for now.
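
      The failure mode can be reproduced with a toy model, assuming the
      reverted commit skipped any inode whose i_dirtied_when is later than
      the sync start.  The struct and function names are hypothetical, and
      timestamps are plain integers without jiffies wraparound handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct inode_model {
    unsigned long dirtied_when;  /* models i_dirtied_when */
};

/* Models redirty_tail()/requeue_inode() stamping the inode with "now". */
void redirty_model(struct inode_model *inode, unsigned long now)
{
    assert(inode != NULL);
    inode->dirtied_when = now;
}

/* Cutoff test assumed for the reverted commit: newer than sync start. */
bool skipped_by_sync(const struct inode_model *inode, unsigned long sync_start)
{
    assert(inode != NULL);
    return inode->dirtied_when > sync_start;
}
```

      An inode dirtied at time 10 is covered by a sync that starts at 30,
      but once it is requeued at time 31 the same test skips it even though
      its data predates the sync.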
      
      CC: stable@vger.kernel.org # >= 3.13
      Signed-off-by: Jan Kara <jack@suse.cz>
  22. 13 Nov, 2013 1 commit
    • J
      writeback: do not sync data dirtied after sync start · c4a391b5
      Jan Kara committed
      When there are processes heavily creating small files while sync(2) is
      running, it can easily happen that quite a few new files are created
      between the WB_SYNC_NONE and WB_SYNC_ALL passes of sync(2).  That can happen
      especially if there are several busy filesystems (remember that sync
      traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
      fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
      (e.g.  causes a transaction commit and cache flush for each inode in
      ext3), resulting sync(2) times are rather large.
      
      The following script reproduces the problem:
      
        function run_writers
        {
          for (( i = 0; i < 10; i++ )); do
            mkdir $1/dir$i
            for (( j = 0; j < 40000; j++ )); do
              dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
            done &
          done
        }
      
        for dir in "$@"; do
          run_writers $dir
        done
      
        sleep 40
        time sync
      
      Fix the problem by disregarding inodes dirtied after sync(2) was called
      in the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes
      a time stamp when sync has started which is used for setting up work for
      flusher threads.
      
      To give some numbers, when the above script is run on two ext4 filesystems
      on a simple SATA drive, the average sync time from 10 runs is 267.549
      seconds with standard deviation 104.799426.  With the patched kernel,
      the average sync time from 10 runs is 2.995 seconds with standard
      deviation 0.096.
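
      The idea of the fix, a WB_SYNC_ALL pass bounded by the sync start
      time, can be sketched as a loop over a toy inode list.  This is an
      illustration with hypothetical names (struct dirty_inode,
      sync_all_pass()), not the code of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dirty_inode {
    unsigned long dirtied_when;  /* when the inode was first dirtied */
    bool written;                /* set once the pass has handled it */
};

/*
 * Write out every inode dirtied at or before @sync_start and leave newer
 * ones for later writeback.  Returns how many inodes were written.
 */
size_t sync_all_pass(struct dirty_inode *inodes, size_t nr,
                     unsigned long sync_start)
{
    size_t done = 0;

    assert(inodes != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (inodes[i].dirtied_when > sync_start)
            continue;   /* dirtied after sync(2) was called */
        inodes[i].written = true;
        done++;
    }
    return done;
}
```

      Files created while sync(2) is already running no longer extend the
      slow pass, which is consistent with the drop in sync time reported
      above.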
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 12 Sep, 2013 1 commit
  24. 10 Jul, 2013 3 commits
  25. 03 Jul, 2013 1 commit
    • D
      sync: don't block the flusher thread waiting on IO · 7747bd4b
      Dave Chinner committed
      When sync does its WB_SYNC_ALL writeback, it issues data IO and
      then immediately waits for IO completion. This is done in the
      context of the flusher thread, and hence completely ties up the
      flusher thread for the backing device until all the dirty inodes
      have been synced. On filesystems that are dirtying inodes constantly
      and quickly, this means the flusher thread can be tied up for
      minutes per sync call and hence badly affect system level write IO
      performance as the page cache cannot be cleaned quickly.
      
      We already have a wait loop for IO completion for sync(2), so cut
      this out of the flusher thread and delegate it to wait_sb_inodes().
      Hence we can do rapid IO submission, and then wait for it all to
      complete.
      
      Effect of sync on fsmark before the patch:
      
      FSUse%        Count         Size    Files/sec     App Overhead
      .....
           0       640000         4096      35154.6          1026984
           0       720000         4096      36740.3          1023844
           0       800000         4096      36184.6           916599
           0       880000         4096       1282.7          1054367
           0       960000         4096       3951.3           918773
           0      1040000         4096      40646.2           996448
           0      1120000         4096      43610.1           895647
           0      1200000         4096      40333.1           921048
      
      And a single sync pass took:
      
        real    0m52.407s
        user    0m0.000s
        sys     0m0.090s
      
      After the patch, there is no impact on fsmark results, and each
      individual sync(2) operation run concurrently with the same fsmark
      workload takes roughly 7s:
      
        real    0m6.930s
        user    0m0.000s
        sys     0m0.039s
      
      IOW, sync is 7-8x faster on a busy filesystem and does not have an
      adverse impact on ongoing async data write operations.
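
      As a quick check of the quoted factor, using only the two `real`
      timings listed above:

```latex
\[
\frac{52.407\ \text{s}}{6.930\ \text{s}} \approx 7.56
\]
```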
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>