1. 15 1月, 2020 33 次提交
  2. 02 1月, 2020 3 次提交
    • J
      x86/mm: Split vmalloc_sync_all() · 4797417e
      Joerg Roedel 提交于
      commit 1a0a610d5f056c6195ae9808962477a94d1d72c8 upstream.
      
      Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
      vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance back,
      split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the above
      mentioned commit.
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Reported-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      4797417e
    • M
      block: fix .bi_size overflow · 842ed2ab
      Ming Lei 提交于
      commit 79d08f89bb1b5c2c1ff90d9bb95497ab9e8aa7e0 upstream
      
      'bio->bi_iter.bi_size' is 'unsigned int', which at most hold 4G - 1
      bytes.
      
      Before 07173c3ec276 ("block: enable multipage bvecs"), one bio can
      include very limited pages, and usually at most 256, so the fs bio
      size won't be bigger than 1M bytes most of times.
      
      Since we support multi-page bvec, in theory one fs bio really can
      be added > 1M pages, especially in case of hugepage, or big writeback
      with too many dirty pages. Then there is chance in which .bi_size
      is overflowed.
      
      Fixes this issue by using bio_full() to check if the added segment may
      overflow .bi_size.
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Cc: Liu Yiding <liuyd.fnst@cn.fujitsu.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: linux-xfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 07173c3ec276 ("block: enable multipage bvecs")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      842ed2ab
    • H
      mm, swap: fix race between swapoff and some swap operations · 8afafd92
      Huang Ying 提交于
      commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.
      
      [zhuhui] Change SWP_VALID to (1 << 12).
      [Joseph] call the new __swap_count() in swap_slot_free_notify()
      
      When swapin is performed, after getting the swap entry information from
      the page table, system will swap in the swap entry, without any lock held
      to prevent the swap device from being swapoff.  This may cause the race
      like below,
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done when system shutdown only, the race may
      not hit many people in practice.  But it is still a race need to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid via preventing the swap device from being swapoff, until
      put_swap_device() is called.
      
      Because swapoff() is very rare code path, to make the normal path runs as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
      reference count is used to implement get/put_swap_device().  >From
      get_swap_device() to put_swap_device(), RCU reader side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
      
      In addition to swap_map, cluster_info, etc.  data structure in the struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache looking up and swapoff
      too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapoff when its data
      structure is being accessed.  The overhead in hot-path of both methods is
      similar.  The advantages of RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      8afafd92
  3. 27 12月, 2019 4 次提交
    • T
      blkcg: implement blk-iocost · e383d72b
      Tejun Heo 提交于
      commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
      
      This patchset implements IO cost model based work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such model.  The more accurate the cost
      model the better but the controller adapts based on IO completion
      latency and as long as the relative costs across differents IO
      patterns are consistent and sensible, it'll adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      [Joseph: fix confilcts with ioc_rqos_throttle()]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e383d72b
    • T
      cgroup: add cgroup_parse_float() · e4b4935f
      Tejun Heo 提交于
      commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e4b4935f
    • T
      blk-mq: add optional request->alloc_time_ns · 378f7c75
      Tejun Heo 提交于
      commit 6f816b4b746c2241540e537682d30d8e9997d674 upstream.
      
      There are currently two start time timestamps - start_time_ns and
      io_start_time_ns.  The former marks the request allocation and and the
      second issue-to-device time.  The planned io.weight controller needs
      to measure the total time bios take to execute after it leaves rq_qos
      including the time spent waiting for request to become available,
      which can easily dominate on saturated devices.
      
      This patch adds request->alloc_time_ns which records when the request
      allocation attempt started.  As it isn't used for the usual stats,
      make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
      QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
      no users and it's active only on queues which need it even when
      compiled in.
      
      v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
          gating as suggested by Jens.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      378f7c75
    • T
      blkcg: separate blkcg_conf_get_disk() out of blkg_conf_prep() · cf569826
      Tejun Heo 提交于
      commit 015d254cb02b6d8eec4b3366274bf4672f9e0b64 upstream.
      
      Separate out blkcg_conf_get_disk() so that it can be used by blkcg
      policy interface file input parsers before the policy is actually
      enabled.  This doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      cf569826