1. 03 10月, 2011 4 次提交
    • W
      writeback: dirty rate control · be3ffa27
      Wu Fengguang 提交于
      It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
      when there are N dd tasks.
      
      On write() syscall, use bdi->dirty_ratelimit
      ============================================
      
          balance_dirty_pages(pages_dirtied)
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              pause = pages_dirtied / task_ratelimit;
              sleep(pause);
          }
      
      On every 200ms, update bdi->dirty_ratelimit
      ===========================================
      
          bdi_update_dirty_ratelimit()
          {
              task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
              balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
              bdi->dirty_ratelimit = balanced_dirty_ratelimit
          }
      
      Estimation of balanced bdi->dirty_ratelimit
      ===========================================
      
      balanced task_ratelimit
      -----------------------
      
      balance_dirty_pages() needs to throttle tasks dirtying pages such that
      the total amount of dirty pages stays below the specified dirty limit in
      order to avoid memory deadlocks. Furthermore we desire fairness in that
      tasks get throttled proportionally to the amount of pages they dirty.
      
      IOW we want to throttle tasks such that we match the dirty rate to the
      writeout bandwidth, this yields a stable amount of dirty pages:
      
              dirty_rate == write_bw                                          (1)
      
      The fairness requirement gives us:
      
              task_ratelimit = balanced_dirty_ratelimit
                             == write_bw / N                                  (2)
      
      where N is the number of dd tasks.  We don't know N beforehand, but
      still can estimate balanced_dirty_ratelimit within 200ms.
      
      Start by throttling each dd task at rate
      
              task_ratelimit = task_ratelimit_0                               (3)
                               (any non-zero initial value is OK)
      
      After 200ms, we measured
      
              dirty_rate = # of pages dirtied by all dd's / 200ms
              write_bw   = # of pages written to the disk / 200ms
      
      For the aggressive dd dirtiers, the equality holds
      
              dirty_rate == N * task_rate
                         == N * task_ratelimit_0                              (4)
      Or
              task_ratelimit_0 == dirty_rate / N                              (5)
      
      Now we conclude that the balanced task ratelimit can be estimated by
      
                                                            write_bw
              balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                            dirty_rate
      
      Because with (4) and (5) we can get the desired equality (1):
      
                                                             write_bw
              balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                             dirty_rate
                                       == write_bw / N
      
      Then using the balanced task ratelimit we can compute task pause times like:
      
              task_pause = task->nr_dirtied / task_ratelimit
      
      task_ratelimit with position control
      ------------------------------------
      
      However, while the above gives us means of matching the dirty rate to
      the writeout bandwidth, it at best provides us with a stable dirty page
      count (assuming a static system). In order to control the dirty page
      count such that it is high enough to provide performance, but does not
      exceed the specified limit we need another control.
      
      The dirty position control works by extending (2) to
      
              task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)
      
      where pos_ratio is a negative feedback function that subjects to
      
      1) f(setpoint) = 1.0
      2) df/dx < 0
      
      That is, if the dirty pages are ABOVE the setpoint, we throttle each
      task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
      pages are created less fast than they are cleaned, thus DROP to the
      setpoints (and the reverse).
      
      Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
      remains CONSTANT for the past 200ms, we get
      
              task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)
      
      Putting (8) into (6), we get the formula used in
      bdi_update_dirty_ratelimit():
      
                                                      write_bw
              balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                                      dirty_rate
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      be3ffa27
    • W
      writeback: add bg_threshold parameter to __bdi_update_bandwidth() · af6a3113
      Wu Fengguang 提交于
      No behavior change.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      af6a3113
    • W
      writeback: dirty position control · 6c14ae1e
      Wu Fengguang 提交于
      bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
      that the resulted task rate limit can drive the dirty pages back to the
      global/bdi setpoints.
      
      Old scheme is,
                                                |
                                 free run area  |  throttle area
        ----------------------------------------+---------------------------->
                                          thresh^                  dirty pages
      
      New scheme is,
      
        ^ task rate limit
        |
        |            *
        |             *
        |              *
        |[free run]      *      [smooth throttled]
        |                  *
        |                     *
        |                         *
        ..bdi->dirty_ratelimit..........*
        |                               .     *
        |                               .          *
        |                               .              *
        |                               .                 *
        |                               .                    *
        +-------------------------------.-----------------------*------------>
                                setpoint^                  limit^  dirty pages
      
      The slope of the bdi control line should be
      
      1) large enough to pull the dirty pages to setpoint reasonably fast
      
      2) small enough to avoid big fluctuations in the resulted pos_ratio and
         hence task ratelimit
      
      Since the fluctuation range of the bdi dirty pages is typically observed
      to be within 1-second worth of data, the bdi control line's slope is
      selected to be a linear function of bdi write bandwidth, so that it can
      adapt to slow/fast storage devices well.
      
      Assume the bdi control line
      
      	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
      
      where k is the negative slope.
      
      If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
      are fluctuating in range
      
      	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
      
      we get slope
      
      	k = - 1 / (8 * write_bw)
      
      Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
      
      	x_intercept = bdi_setpoint + 8 * write_bw
      
      The global/bdi slopes are nicely complementing each other when the
      system has only one major bdi (indicated by bdi_thresh ~= thresh):
      
      1) slope of global control line    => scaling to the control scope size
      2) slope of main bdi control line  => scaling to the writeout bandwidth
      
      so that
      
      - in memory tight systems, (1) becomes strong enough to squeeze dirty
        pages inside the control scope
      
      - in large memory systems where the "gravity" of (1) for pulling the
        dirty pages to setpoint is too weak, (2) can back (1) up and drive
        dirty pages to bdi_setpoint ~= setpoint reasonably fast.
      
      Unfortunately in JBOD setups, the fluctuation range of bdi threshold
      is related to memory size due to the interferences between disks.  In
      this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
      
      Given equations
      
              span = x_intercept - bdi_setpoint
              k = df/dx = - 1 / span
      
      and the extremum values
      
              span = bdi_thresh
              dx = bdi_thresh
      
      we get
      
              df = - dx / span = - 1.0
      
      That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
      task ratelimit will fluctuate by -100%.
      
      peter: use 3rd order polynomial for the global control line
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      6c14ae1e
    • W
      writeback: account per-bdi accumulated dirtied pages · c8e28ce0
      Wu Fengguang 提交于
      Introduce the BDI_DIRTIED counter. It will be used for estimating the
      bdi's dirty bandwidth.
      
      CC: Jan Kara <jack@suse.cz>
      CC: Michael Rubin <mrubin@google.com>
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      c8e28ce0
  2. 15 9月, 2011 7 次提交
  3. 03 9月, 2011 2 次提交
  4. 27 8月, 2011 1 次提交
  5. 26 8月, 2011 4 次提交
  6. 19 8月, 2011 1 次提交
    • W
      squeeze max-pause area and drop pass-good area · bb082295
      Wu Fengguang 提交于
      Revert the pass-good area introduced in ffd1f609 ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      Using deadline scheduler also has a regression, but not that big as CFQ,
      so this suggests we have some write starvation.
      
      The test logs show that
      
      - the disks are sometimes under utilized
      
      - global dirty pages sometimes rush high to the pass-good area for
        several hundred seconds, while in the mean time some bdi dirty pages
        drop to very low value (bdi_dirty << bdi_thresh).  Then suddenly the
        global dirty pages dropped under global dirty threshold and bdi_dirty
        rush very high (for example, 2 times higher than bdi_thresh). During
        which time balance_dirty_pages() is not called at all.
      
      So the problems are
      
      1) The random writes progress so slow that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".
      
      2) The max-pause logic ignored task_bdi_thresh and thus opens the possibility
         for some bdi's to over dirty pages, leading to (bdi_dirty >> bdi_thresh)
         and then (bdi_thresh >> bdi_dirty) for others.
      
      3) The higher max-pause/pass-good thresholds somehow leads to the bad
         swing of dirty pages.
      
      The fix is to allow the task to slightly dirty over task_bdi_thresh, but
      no way to exceed bdi_dirty and/or global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.
      Reported-by: NLi Shaohua <shaohua.li@intel.com>
      Tested-by: NLi Shaohua <shaohua.li@intel.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      bb082295
  7. 18 8月, 2011 1 次提交
  8. 15 8月, 2011 1 次提交
  9. 10 8月, 2011 2 次提交
    • M
      Revert "memcg: get rid of percpu_charge_mutex lock" · 9f50fad6
      Michal Hocko 提交于
      This reverts commit 8521fc50.
      
      The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
      bit operations is sufficient but that is not true.  Johannes Weiner has
      reported a crash during parallel memory cgroup removal:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
        IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
        Oops: 0000 [#1] PREEMPT SMP
        Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
        RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
        RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
        Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
        Call Trace:
         [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
         [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
         [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
         [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
         [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
         [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
         [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
         [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
         [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
      
      We are crashing because we try to dereference cached memcg when we are
      checking whether we should wait for draining on the cache.  The cache is
      already cleaned up, though.
      
      There is also a theoretical chance that the cached memcg gets freed
      between we test for the FLUSHING_CACHED_CHARGE and dereference it in
      mem_cgroup_same_or_subtree:
      
              CPU0                    CPU1                         CPU2
        mem=stock->cached
        stock->cached=NULL
                                    clear_bit
                                                              test_and_set_bit
        test_bit()                    ...
        <preempted>             mem_cgroup_destroy
        use after free
      
      The percpu_charge_mutex protected from this race because sync draining
      is exclusive.
      
      It is safer to revert now and come up with a more parallel
      implementation later.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NJohannes Weiner <jweiner@redhat.com>
      Acked-by: NJohannes Weiner <jweiner@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f50fad6
    • C
      slub: Fix partial count comparison confusion · 81107188
      Christoph Lameter 提交于
      deactivate_slab() has the comparison if more than the minimum number of
      partial pages are in the partial list wrong. An effect of this may be that
      empty pages are not freed from deactivate_slab(). The result could be an
      OOM due to growth of the partial slabs per node. Frees mostly occur from
      __slab_free which is okay so this would only affect use cases where a lot
      of switching around of per cpu slabs occur.
      
      Switching per cpu slabs occurs with high frequency if debugging options are
      enabled.
      Reported-and-tested-by: NXiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      81107188
  10. 09 8月, 2011 2 次提交
    • A
      slub: fix check_bytes() for slub debugging · ef62fb32
      Akinobu Mita 提交于
      The check_bytes() function is used by slub debugging.  It returns a pointer
      to the first unmatching byte for a character in the given memory area.
      
      If the character for matching byte is greater than 0x80, check_bytes()
      doesn't work.  Becuase 64-bit pattern is generated as below.
      
      	value64 = value | value << 8 | value << 16 | value << 24;
      	value64 = value64 | value64 << 32;
      
      The integer promotions are performed and sign-extended as the type of value
      is u8.  The upper 32 bits of value64 is 0xffffffff in the first line, and
      the second line has no effect.
      
      This fixes the 64-bit pattern generation.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Reviewed-by: NMarcin Slusarz <marcin.slusarz@gmail.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      ef62fb32
    • C
      slub: Fix full list corruption if debugging is on · 6fbabb20
      Christoph Lameter 提交于
      When a slab is freed by __slab_free() and the slab can only contain a
      single object ever then it was full (and therefore not on the partial
      lists but on the full list in the debug case) before we reached
      slab_empty.
      
      This caused the following full list corruption when SLUB debugging was enabled:
      
        [ 5913.233035] ------------[ cut here ]------------
        [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        [ 5913.233101] Hardware name: Adamo 13
        [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
        [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
        [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
        [ 5913.233213] Call Trace:
        [ 5913.233213]  <IRQ>  [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b
        [ 5913.233213]  [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48
        [ 5913.233213]  [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98
        [ 5913.233213]  [<ffffffff8127e7da>] list_del+0xe/0x2d
        [ 5913.233213]  [<ffffffff814e0430>] __slab_free+0x1db/0x235
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff81133085>] kmem_cache_free+0x88/0x102
        [ 5913.233213]  [<ffffffff811706ab>] bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706e1>] bio_free+0x34/0x64
        [ 5913.233213]  [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14
        [ 5913.233213]  [<ffffffff8116fef6>] bio_put+0x2b/0x2d
        [ 5913.233213]  [<ffffffff813dccab>] clone_endio+0x9e/0xb4
        [ 5913.233213]  [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f
        [ 5913.233213]  [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt]
        [ 5913.233213]  [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt]
      
      [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]
      
      Make sure that we remove such a slab also from the full lists.
      Reported-and-tested-by: NDave Jones <davej@redhat.com>
      Reported-and-tested-by: NXiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      6fbabb20
  11. 04 8月, 2011 15 次提交
    • P
      slab, lockdep: Annotate the locks before using them · 30765b92
      Peter Zijlstra 提交于
      Fernando found we hit the regular OFF_SLAB 'recursion' before we
      annotate the locks, cure this.
      
      The relevant portion of the stack-trace:
      
      > [    0.000000]  [<c085e24f>] rt_spin_lock+0x50/0x56
      > [    0.000000]  [<c04fb406>] __cache_free+0x43/0xc3
      > [    0.000000]  [<c04fb23f>] kmem_cache_free+0x6c/0xdc
      > [    0.000000]  [<c04fb2fe>] slab_destroy+0x4f/0x53
      > [    0.000000]  [<c04fb396>] free_block+0x94/0xc1
      > [    0.000000]  [<c04fc551>] do_tune_cpucache+0x10b/0x2bb
      > [    0.000000]  [<c04fc8dc>] enable_cpucache+0x7b/0xa7
      > [    0.000000]  [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61
      > [    0.000000]  [<c0bba687>] start_kernel+0x24c/0x363
      > [    0.000000]  [<c0bba0ba>] i386_start_kernel+0xa9/0xaf
      Reported-by: NFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptopSigned-off-by: NIngo Molnar <mingo@elte.hu>
      30765b92
    • P
      slab, lockdep: Annotate slab -> rcu -> debug_object -> slab · 83835b3d
      Peter Zijlstra 提交于
      Lockdep thinks there's lock recursion through:
      
      	kmem_cache_free()
      	  cache_flusharray()
      	    spin_lock(&l3->list_lock)  <----------------.
      	    free_block()                                |
      	      slab_destroy()                            |
      		call_rcu()                              |
      		  debug_object_activate()               |
      		    debug_object_init()                 |
      		      __debug_object_init()             |
      			kmem_cache_alloc()              |
      			  cache_alloc_refill()          |
      			    spin_lock(&l3->list_lock) --'
      
      Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no
      actual possibility of recursing. Luckily debug objects marks it slab
      with SLAB_DEBUG_OBJECTS so we can identify the thing.
      
      Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special
      lockdep key so that lockdep sees its a different cachep.
      
      Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU |
      SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble.
      Reported-and-tested-by: NSebastian Siewior <sebastian@breakpoint.cc>
      [ fixes to the initial patch ]
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
      83835b3d
    • H
      mm: clarify the radix_tree exceptional cases · 8079b1c8
      Hugh Dickins 提交于
      Make the radix_tree exceptional cases, mostly in filemap.c, clearer.
      
      It's hard to devise a suitable snappy name that illuminates the use by
      shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
      generality.  And akpm points out that /* radix_tree_deref_retry(page) */
      comments look like calls that have been commented out for unknown
      reason.
      
      Skirt the naming difficulty by rearranging these blocks to handle the
      transient radix_tree_deref_retry(page) case first; then just explain the
      remaining shmem/tmpfs swap case in a comment.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8079b1c8
    • H
      tmpfs radix_tree: locate_item to speed up swapoff · e504f3fd
      Hugh Dickins 提交于
      We have already acknowledged that swapoff of a tmpfs file is slower than
      it was before conversion to the generic radix_tree: a little slower
      there will be acceptable, if the hotter paths are faster.
      
      But it was a shock to find swapoff of a 500MB file 20 times slower on my
      laptop, taking 10 minutes; and at that rate it significantly slows down
      my testing.
      
      Now, most of that turned out to be overhead from PROVE_LOCKING and
      PROVE_RCU: without those it was only 4 times slower than before; and
      more realistic tests on other machines don't fare as badly.
      
      I've tried a number of things to improve it, including tagging the swap
      entries, then doing lookup by tag: I'd expected that to halve the time,
      but in practice it's erratic, and often counter-productive.
      
      The only change I've so far found to make a consistent improvement, is
      to short-circuit the way we go back and forth, gang lookup packing
      entries into the array supplied, then shmem scanning that array for the
      target entry.  Scanning in place doubles the speed, so it's now only
      twice as slow as before (or three times slower when the PROVEs are on).
      
      So, add radix_tree_locate_item() as an expedient, once-off,
      single-caller hack to do the lookup directly in place.  #ifdef it on
      CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
      applicability as save space in other configurations.  And, sadly,
      #include sched.h for cond_resched().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e504f3fd
    • H
      mm: a few small updates for radix-swap · 31475dd6
      Hugh Dickins 提交于
      Remove PageSwapBacked (!page_is_file_cache) cases from
      add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
      go through shmem_add_to_page_cache().
      
      Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
      and add a comment on swap entries to invalidate_mapping_pages().
      
      And mincore_page() uses find_get_page() on what might be shmem or a
      tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
      find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31475dd6
    • H
      tmpfs: use kmemdup for short symlinks · 69f07ec9
      Hugh Dickins 提交于
      But we've not yet removed the old swp_entry_t i_direct[16] from
      shmem_inode_info.  That's because it was still being shared with the
      inline symlink.  Remove it now (saving 64 or 128 bytes from shmem inode
      size), and use kmemdup() for short symlinks, say, those up to 128 bytes.
      
      I wonder why mpol_free_shared_policy() is done in shmem_destroy_inode()
      rather than shmem_evict_inode(), where we usually do such freeing? I
      guess it doesn't matter, and I'm not into NUMA mpol testing right now.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69f07ec9
    • H
      tmpfs: convert shmem_writepage and enable swap · 6922c0c7
      Hugh Dickins 提交于
      Convert shmem_writepage() to use shmem_delete_from_page_cache() to use
      shmem_radix_tree_replace() to substitute swap entry for page pointer
      atomically in the radix tree.
      
      As with shmem_add_to_page_cache(), it's not entirely satisfactory to be
      copying such code from delete_from_swap_cache, but again judged easier
      to sell than making its other callers go through the extras.
      
      Remove the toy implementation's shmem_put_swap() and shmem_get_swap(),
      now unreferenced, and the hack to disable swap: it's now good to go.
      
      The way things have worked out, info->lock no longer helps to guard the
      shmem_swaplist: we increment swapped under shmem_swaplist_mutex only.
      That global mutex exclusion between shmem_writepage() and shmem_unuse()
      is not pretty, and we ought to find another way; but it's been forced on
      us by recent race discoveries, not a consequence of this patchset.
      
      And what has become of the WARN_ON_ONCE(1) free_swap_and_cache() if a
      swap entry was found already present? That's no longer possible, the
      (unknown) one inserting this page into filecache would hit the swap
      entry occupying that slot.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6922c0c7
    • H
      tmpfs: convert mem_cgroup shmem to radix-swap · aa3b1895
      Hugh Dickins 提交于
      Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
      had to move swappage to filecache with GFP_NOWAIT.
      
      Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
      moving its call out from shmem_add_to_page_cache() to two of thats three
      callers.  But leave it doing mem_cgroup_uncharge_cache_page() on error:
      although asymmetrical, it's easier for all 3 callers to handle.
      
      These two changes would also be appropriate if anyone were to start
      using shmem_read_mapping_page_gfp() with GFP_NOWAIT.
      
      Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
      radix_tree_exceptional_entry() to get what it needs for itself.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa3b1895
    • H
      tmpfs: convert shmem_getpage_gfp to radix-swap · 54af6042
      Hugh Dickins 提交于
      Convert shmem_getpage_gfp(), the engine-room of shmem, to expect page or
      swap entry returned from radix tree by find_lock_page().
      
      Whereas the repetitive old method proceeded mainly under info->lock,
      dropping and repeating whenever one of the conditions needed was not
      met, now we can proceed without it, leaving shmem_add_to_page_cache() to
      check for a race.
      
      This way there is no need to preallocate a page, no need for an early
      radix_tree_preload(), no need for mem_cgroup_shmem_charge_fallback().
      
      Move the error unwinding down to the bottom instead of repeating it
      throughout.  ENOSPC handling is a little different from before: there is
      no longer any race between find_lock_page() and finding swap, but we can
      arrive at ENOSPC before calling shmem_recalc_inode(), which might
      occasionally discover freed space.
      
      Be stricter to check i_size before returning.  info->lock is used for
      little but alloced, swapped, i_blocks updates.  Move i_blocks updates
      out from under the max_blocks check, so even an unlimited size=0 mount
      can show accurate du.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54af6042
    • H
      tmpfs: convert shmem_unuse_inode to radix-swap · 46f65ec1
      Hugh Dickins 提交于
      Convert shmem_unuse_inode() to use a lockless gang lookup of the radix
      tree, searching for matching swap.
      
      This is somewhat slower than the old method: because of repeated radix
      tree descents, because of copying entries up, but probably most because
      the old method noted and skipped once a vector page was cleared of swap.
      Perhaps we can devise a use of radix tree tagging to achieve that later.
      
      shmem_add_to_page_cache() uses shmem_radix_tree_replace() to compensate
      for the lockless lookup by checking that the expected entry is in place,
      under lock.  It is not very satisfactory to be copying this much from
      add_to_page_cache_locked(), but I think easier to sell than insisting
      that every caller of add_to_page_cache*() go through the extras.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46f65ec1
    • H
      tmpfs: convert shmem_truncate_range to radix-swap · 7a5d0fbb
      Hugh Dickins 提交于
      Disable the toy swapping implementation in shmem_writepage() - it's hard
      to support two schemes at once - and convert shmem_truncate_range() to a
      lockless gang lookup of swap entries along with pages, freeing both.
      
      Since the second loop tightens its noose until all entries of either
      kind have been squeezed out (and we shall make sure that there's not an
      instant when neither is visible), there is no longer a need for yet
      another pass below.
      
      shmem_radix_tree_replace() compensates for the lockless lookup by
      checking that the expected entry is in place, under lock, before
      replacing it.  Here it just deletes, but will be used in later patches
      to substitute swap entry for page or page for swap entry.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a5d0fbb
    • H
      tmpfs: copy truncate_inode_pages_range · bda97eab
      Hugh Dickins 提交于
      Bring truncate.c's code for truncate_inode_pages_range() inline into
      shmem_truncate_range(), replacing its first call (there's a followup
      call below, but leave that one, it will disappear next).
      
      Don't play with it yet, apart from leaving out the cleancache flush, and
      (importantly) the nrpages == 0 skip, and moving shmem_setattr()'s
      partial page preparation into its partial page handling.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bda97eab
    • H
      tmpfs: miscellaneous trivial cleanups · 41ffe5d5
      Hugh Dickins 提交于
      While it's at its least, make a number of boring nitpicky cleanups to
      shmem.c, mostly for consistency of variable naming.  Things like "swap"
      instead of "entry", "pgoff_t index" instead of "unsigned long idx".
      
      And since everything else here is prefixed "shmem_", better change
      init_tmpfs() to shmem_init().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ffe5d5
    • H
      tmpfs: demolish old swap vector support · 285b2c4f
      Hugh Dickins 提交于
      The maximum size of a shmem/tmpfs file has been limited by the maximum
      size of its triple-indirect swap vector.  With 4kB page size, maximum
      filesize was just over 2TB on a 32-bit kernel, but sadly one eighth of
      that on a 64-bit kernel.  (With 8kB page size, maximum filesize was just
      over 4TB on a 64-bit kernel, but 16TB on a 32-bit kernel,
      MAX_LFS_FILESIZE being then more restrictive than swap vector layout.)
      
      It's a shame that tmpfs should be more restrictive than ramfs, and this
      limitation has now been noticed.  Add another level to the swap vector?
      No, it became obscure and hard to maintain, once I complicated it to
      make use of highmem pages nine years ago: better choose another way.
      
      Surely, if 2.4 had had the radix tree pagecache introduced in 2.5, then
      tmpfs would never have invented its own peculiar radix tree: we would
      have fitted swap entries into the common radix tree instead, in much the
      same way as we fit swap entries into page tables.
      
      And why should each file have a separate radix tree for its pages and
      for its swap entries? The swap entries are required precisely where and
      when the pages are not.  We want to put them together in a single radix
      tree: which can then avoid much of the locking which was needed to
      prevent them from being exchanged underneath us.
      
      This also avoids the waste of memory devoted to swap vectors, first in
      the shmem_inode itself, then at least two more pages once a file grew
      beyond 16 data pages (pages accounted by df and du, but not by memcg).
      Allocated upfront, to avoid allocation when under swapping pressure, but
      pure waste when CONFIG_SWAP is not set - I have never spattered around
      the ifdefs to prevent that, preferring this move to sharing the common
      radix tree instead.
      
      There are three downsides to sharing the radix tree.  One, that it binds
      tmpfs more tightly to the rest of mm, either requiring knowledge of swap
      entries in radix tree there, or duplication of its code here in shmem.c.
      I believe that the simplications and memory savings (and probable higher
      performance, not yet measured) justify that.
      
      Two, that on HIGHMEM systems with SWAP enabled, it's the lowmem radix
      nodes that cannot be freed under memory pressure - whereas before it was
      the less precious highmem swap vector pages that could not be freed.
      I'm hoping that 64-bit has now been accessible for long enough, that the
      highmem argument has grown much less persuasive.
      
      Three, that swapoff is slower than it used to be on tmpfs files, since
      it's using a simple generic mechanism not tailored to it: I find this
      noticeable, and shall want to improve, but maybe nobody else will
      notice.
      
      So...  now remove most of the old swap vector code from shmem.c.  But,
      for the moment, keep the simple i_direct vector of 16 pages, with simple
      accessors shmem_put_swap() and shmem_get_swap(), as a toy implementation
      to help mark where swap needs to be handled in subsequent patches.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      285b2c4f
    • H
      mm: let swap use exceptional entries · a2c16d6c
      Hugh Dickins 提交于
      If swap entries are to be stored along with struct page pointers in a
      radix tree, they need to be distinguished as exceptional entries.
      
      Most of the handling of swap entries in radix tree will be contained in
      shmem.c, but a few functions in filemap.c's common code need to check
      for their appearance: find_get_page(), find_lock_page(),
      find_get_pages() and find_get_pages_contig().
      
      So as not to slow their fast paths, tuck those checks inside the
      existing checks for unlikely radix_tree_deref_slot(); except for
      find_lock_page(), where it is an added test.  And make it a BUG in
      find_get_pages_tag(), which is not applied to tmpfs files.
      
      A part of the reason for eliminating shmem_readpage() earlier, was to
      minimize the places where common code would need to allow for swap
      entries.
      
      The swp_entry_t known to swapfile.c must be massaged into a slightly
      different form when stored in the radix tree, just as it gets massaged
      into a pte_t when stored in page tables.
      
      In an i386 kernel this limits its information (type and page offset) to
      30 bits: given 32 "types" of swapfile and 4kB pagesize, that's a maximum
      swapfile size of 128GB.  Which is less than the 512GB we previously
      allowed with X86_PAE (where the swap entry can occupy the entire upper
      32 bits of a pte_t), but not a new limitation on 32-bit without PAE; and
      there's not a new limitation on 64-bit (where swap filesize is already
      limited to 16TB by a 32-bit page offset).  Thirty areas of 128GB is
      probably still enough swap for a 64GB 32-bit machine.
      
      Provide swp_to_radix_entry() and radix_to_swp_entry() conversions, and
      enforce filesize limit in read_swap_header(), just as for ptes.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2c16d6c
反馈
建议
客服 返回
顶部