1. 08 Dec, 2011 3 commits
    • writeback: set max_pause to lowest value on zero bdi_dirty · 82e230a0
      Wu Fengguang committed
      Some traces show many bdi_dirty=0 lines where the value would actually
      be small but non-zero without the accounting errors in the per-cpu bdi stats.
      
      In this case the max pause time should really be set to the smallest
      (non-zero) value to avoid IO queue underrun and improve throughput.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
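The effect described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `max_pause_sketch()`, `HZ_SKETCH`, and `MIN_PAUSE_SKETCH` are hypothetical names and values chosen for illustration, and the scaling formula is an assumed stand-in, not the kernel's code. Only the 200ms upper bound comes from the commit log on this page.

```c
/* Illustrative constants; HZ_SKETCH and MIN_PAUSE_SKETCH are assumptions. */
#define HZ_SKETCH         1000UL            /* assumed jiffies per second */
#define MAX_PAUSE_SKETCH  (HZ_SKETCH / 5)   /* 200ms upper bound */
#define MIN_PAUSE_SKETCH  4UL               /* assumed smallest non-zero pause */

/*
 * Hypothetical helper: cap the pause (in jiffies) in proportion to the
 * bdi's dirty pages relative to its write bandwidth (pages per second).
 * With bdi_dirty == 0 the result is the smallest non-zero pause, so the
 * IO queue is less likely to run dry while the task sleeps.
 */
static unsigned long max_pause_sketch(unsigned long bdi_dirty,
                                      unsigned long write_bw)
{
	unsigned long t = bdi_dirty * HZ_SKETCH / (8 * write_bw + 1);

	if (t < MIN_PAUSE_SKETCH)
		t = MIN_PAUSE_SKETCH;
	if (t > MAX_PAUSE_SKETCH)
		t = MAX_PAUSE_SKETCH;
	return t;
}
```

In this model a zero (or mis-accounted) bdi_dirty yields the minimum pause rather than skipping the limit entirely.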
    • writeback: permit through good bdi even when global dirty exceeded · c5c6343c
      Wu Fengguang committed
      On a system with 1 local mount and 1 NFS mount, if the NFS server
      stops responding while dd is writing to the NFS mount, the NFS dirty
      pages may exceed the global dirty limit and _every_ task that writes
      will be blocked. The whole system appears unresponsive.
      
      The workaround is to let through the bdi's that have only a small
      number of dirty pages. The number chosen (bdi_stat_error pages) is not
      enough for the local disk to run at optimal throughput, but it is
      enough to keep the system responsive on a broken NFS mount. The user can
      then kill the dirtiers on the NFS mount and increase the global dirty
      limit to bring up the local disk's throughput.
      
      It risks allowing dirty pages to grow much larger than the global dirty
      limit when there are 1000+ mounts, however that's very unlikely to happen,
      especially in low memory profiles.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
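The workaround described above can be modelled as a simple predicate. This is a minimal sketch, not the kernel code: `may_proceed_sketch()` is a hypothetical name, and the real throttling loop takes more state into account than the four counters used here.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a dirtier may proceed.
 * All arguments are page counts.
 */
static bool may_proceed_sketch(unsigned long nr_dirty,
                               unsigned long dirty_thresh,
                               unsigned long bdi_dirty,
                               unsigned long bdi_stat_error)
{
	/* Global limit not exceeded: no hard block in this simplified model. */
	if (nr_dirty < dirty_thresh)
		return true;

	/*
	 * Global limit exceeded (e.g. by a stalled NFS mount): still let
	 * through a bdi that holds only a small number of dirty pages.
	 */
	return bdi_dirty <= bdi_stat_error;
}
```

Because every bdi may hold up to bdi_stat_error pages beyond the limit, the worst-case overshoot in this model grows with the number of mounts, which is the risk the commit message accepts.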
    • writeback: comment on the bdi dirty threshold · aed21ad2
      Wu Fengguang committed
      We do "floating proportions" to let active devices grow their target
      share of dirty pages and stalled/inactive devices decrease their target
      share over time.
      
      It works well except in the case of "an inactive disk suddenly goes
      busy", where the initial target share may be too small. To mitigate
      this, bdi_position_ratio() has the below line to raise a small
      bdi_thresh when it's safe to do so, so that the disk is fed with enough
      dirty pages for efficient IO and, in turn, a fast ramp-up of bdi_thresh:
      
              bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
      
      balance_dirty_pages() normally does negative feedback control which
      adjusts ratelimit to balance the bdi dirty pages around the target.
      In some extreme cases when that is not enough, it will have to block
      the tasks completely until the bdi dirty pages drop below bdi_thresh.
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
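The max() line quoted in the message above can be tried in isolation. The sketch below is a user-space illustration: `raise_bdi_thresh_sketch()` is a hypothetical wrapper, and the guard against `dirty > limit` is an assumption added so the unsigned subtraction cannot wrap.

```c
/*
 * Hypothetical wrapper around
 *     bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
 * All arguments are page counts.
 */
static unsigned long raise_bdi_thresh_sketch(unsigned long bdi_thresh,
                                             unsigned long limit,
                                             unsigned long dirty)
{
	/* Guard added for this sketch: no headroom if dirty exceeds limit. */
	unsigned long headroom = limit > dirty ? limit - dirty : 0;
	unsigned long floor = headroom / 8;

	return bdi_thresh > floor ? bdi_thresh : floor;
}
```

For example, with limit = 1000 and dirty = 200, a bdi whose proportional share has decayed to 10 pages is raised to 100 pages, while a bdi already at 500 pages is left unchanged.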
  2. 02 Dec, 2011 1 commit
  3. 24 Nov, 2011 1 commit
  4. 23 Nov, 2011 2 commits
    • percpu: fix chunk range calculation · a855b84c
      Tejun Heo committed
      Percpu allocator recorded the cpus which map to the first and last
      units in pcpu_first/last_unit_cpu respectively and used them to
      determine the address range of a chunk - e.g. it assumed that the
      first unit has the lowest address in a chunk while the last unit has
      the highest address.
      
      This simply isn't true.  Groups in a chunk can have arbitrary positive
      or negative offsets from the previous one and there is no guarantee
      that the first unit occupies the lowest offset while the last one the
      highest.
      
      Fix it by actually comparing unit offsets to determine cpus occupying
      the lowest and highest offsets.  Also, rename pcpu_first/last_unit_cpu
      to pcpu_low/high_unit_cpu to avoid confusion.
      
      The chunk address range is used to flush cache on vmalloc area
      map/unmap and decide whether a given address is in the first chunk by
      per_cpu_ptr_to_phys() and the bug was discovered by invalid
      per_cpu_ptr_to_phys() translation for crash_note.
      
      Kudos to Dave Young for tracking down the problem.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: WANG Cong <xiyou.wangcong@gmail.com>
      Reported-by: Dave Young <dyoung@redhat.com>
      Tested-by: Dave Young <dyoung@redhat.com>
      LKML-Reference: <4EC21F67.10905@redhat.com>
      Cc: stable@kernel.org
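The fix described above amounts to a min/max scan over the unit offsets instead of trusting unit order. The following is a user-space sketch only: `struct unit_range_sketch`, `find_low_high_units_sketch()`, and the flat offset array are hypothetical stand-ins for the allocator's own data structures.

```c
#include <stddef.h>

/* Hypothetical result type: CPUs whose units sit lowest and highest. */
struct unit_range_sketch {
	size_t low_cpu;
	size_t high_cpu;
};

/*
 * Compare the per-CPU unit offsets directly. Offsets may be negative
 * relative to the first unit, so signed values are used.
 * Requires nr_cpus >= 1.
 */
static struct unit_range_sketch
find_low_high_units_sketch(const long *unit_off, size_t nr_cpus)
{
	struct unit_range_sketch r = { 0, 0 };

	for (size_t cpu = 1; cpu < nr_cpus; cpu++) {
		if (unit_off[cpu] < unit_off[r.low_cpu])
			r.low_cpu = cpu;
		if (unit_off[cpu] > unit_off[r.high_cpu])
			r.high_cpu = cpu;
	}
	return r;
}
```

With offsets {4096, 0, -8192, 12288}, the lowest unit belongs to CPU 2 and the highest to CPU 3, even though CPU 0 maps to the first unit.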
    • percpu: rename pcpu_mem_alloc to pcpu_mem_zalloc · 90459ce0
      Bob Liu committed
      Currently pcpu_mem_alloc() always returns zeroed memory. So rename it
      so that callers such as pcpu_get_pages_and_bitmap() know they do not
      need to reinitialize it.
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 17 Nov, 2011 3 commits
    • writeback: remove vm_dirties and task->dirties · 468e6a20
      Wu Fengguang committed
      They are not used any more.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
    • writeback: hard throttle 1000+ dd on a slow USB stick · 1df64719
      Wu Fengguang committed
      The sleep-based balance_dirty_pages() can pause at most MAX_PAUSE=200ms
      for every 4KB page dirtied, which means it cannot throttle a task below
      4KB/200ms=20KB/s. So when more than 512 dd tasks write to a
      10MB/s USB stick, its bdi dirty pages can grow out of control.
      
      Even if we increase MAX_PAUSE, the minimal task_ratelimit of 1
      still means a limit of 4KB/s.
      
      They can eventually be safeguarded by the global limit check
      (nr_dirty < dirty_thresh). However, if someone is also writing to an
      HDD at the same time, it'll get poor HDD write performance.
      
      We at least want to maintain good write performance for other devices
      when one device is attacked by some "massive parallel" workload,
      suffers from slow write bandwidth, or somehow gets stalled due to some
      error condition (e.g. NFS server not responding).
      
      For a stalled device, we need to completely block its dirtiers, too,
      before its bdi dirty pages grow all the way up to the global limit and
      leave no space for the other functional devices.
      
      So change the loop exit condition to
      
      	/*
      	 * Always enforce global dirty limit; also enforce bdi dirty limit
      	 * if the normal max_pause sleeps cannot keep things under control.
      	 */
      	if (nr_dirty < dirty_thresh &&
      	    (bdi_dirty < bdi_thresh || bdi->dirty_ratelimit > 1))
      		break;
      
      which can be further simplified to
      
      	if (task_ratelimit)
      		break;
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
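The arithmetic and the first form of the loop exit condition above can be checked with a short user-space sketch. The helper names (`min_task_rate_kbps()`, `max_throttleable_writers()`, `may_exit_throttle_loop()`) are hypothetical; the 4KB page size, the 200ms MAX_PAUSE, and the condition itself are taken from the commit message.

```c
#include <stdbool.h>

#define PAGE_SIZE_KB  4UL    /* one 4KB page, as in the commit message */
#define MAX_PAUSE_MS  200UL  /* MAX_PAUSE, as in the commit message */

/* Lowest rate a task can be held to when it sleeps MAX_PAUSE per page. */
static unsigned long min_task_rate_kbps(void)
{
	return PAGE_SIZE_KB * 1000UL / MAX_PAUSE_MS;
}

/* Writer count beyond which per-page sleeps can no longer hold the device. */
static unsigned long max_throttleable_writers(unsigned long dev_bw_kbps)
{
	return dev_bw_kbps / min_task_rate_kbps();
}

/*
 * Always enforce the global dirty limit; also enforce the bdi dirty limit
 * when the rate limit has already bottomed out at 1.
 */
static bool may_exit_throttle_loop(unsigned long nr_dirty,
                                   unsigned long dirty_thresh,
                                   unsigned long bdi_dirty,
                                   unsigned long bdi_thresh,
                                   unsigned long dirty_ratelimit)
{
	return nr_dirty < dirty_thresh &&
	       (bdi_dirty < bdi_thresh || dirty_ratelimit > 1);
}
```

Here `min_task_rate_kbps()` evaluates to 20 and `max_throttleable_writers(10 * 1024)` to 512, matching the figures in the message. A bdi that is over its threshold with a ratelimit of 1 keeps its dirtiers blocked even while the global count is under the limit.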
    • xen: map foreign pages for shared rings by updating the PTEs directly · cd12909c
      David Vrabel committed
      When mapping a foreign page with xenbus_map_ring_valloc() with the
      GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
      pass a pointer to the PTE (in init_mm).
      
      After the page is mapped, the usual fault mechanism can be used to
      update additional MMs.  This allows the vmalloc_sync_all() to be
      removed from alloc_vm_area().
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      [v1: Squashed fix by Michal for no-mmu case]
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Michal Simek <monstr@monstr.eu>
  6. 16 Nov, 2011 3 commits
  7. 11 Nov, 2011 1 commit
    • backing-dev: ensure wakeup_timer is deleted · 7a401a97
      Rabin Vincent committed
      bdi_prune_sb() in bdi_unregister() attempts to remove the bdi links
      from all super_blocks and then del_timer_sync() the writeback timer.
      
      However, this can race with __mark_inode_dirty(), leading to
      bdi_wakeup_thread_delayed() rearming the writeback timer on the bdi
      we're unregistering, after we've called del_timer_sync().
      
      This can end up with the bdi being freed with an active timer inside it,
      as in the case of the following dump after the removal of an SD card.
      
      Fix this by redoing the del_timer_sync() in bdi_destroy().
      
       ------------[ cut here ]------------
       WARNING: at /home/rabin/kernel/arm/lib/debugobjects.c:262 debug_print_object+0x9c/0xc8()
       ODEBUG: free active (active state 0) object type: timer_list hint: wakeup_timer_fn+0x0/0x180
       Modules linked in:
       Backtrace:
       [<c00109dc>] (dump_backtrace+0x0/0x110) from [<c0236e4c>] (dump_stack+0x18/0x1c)
        r6:c02bc638 r5:00000106 r4:c79f5d18 r3:00000000
       [<c0236e34>] (dump_stack+0x0/0x1c) from [<c0025e6c>] (warn_slowpath_common+0x54/0x6c)
       [<c0025e18>] (warn_slowpath_common+0x0/0x6c) from [<c0025f28>] (warn_slowpath_fmt+0x38/0x40)
        r8:20000013 r7:c780c6f0 r6:c031613c r5:c780c6f0 r4:c02b1b29
       r3:00000009
       [<c0025ef0>] (warn_slowpath_fmt+0x0/0x40) from [<c015eb4c>] (debug_print_object+0x9c/0xc8)
        r3:c02b1b29 r2:c02bc662
       [<c015eab0>] (debug_print_object+0x0/0xc8) from [<c015f574>] (debug_check_no_obj_freed+0xac/0x1dc)
        r6:c7964000 r5:00000001 r4:c7964000
       [<c015f4c8>] (debug_check_no_obj_freed+0x0/0x1dc) from [<c00a9e38>] (kmem_cache_free+0x88/0x1f8)
       [<c00a9db0>] (kmem_cache_free+0x0/0x1f8) from [<c014286c>] (blk_release_queue+0x70/0x78)
       [<c01427fc>] (blk_release_queue+0x0/0x78) from [<c015290c>] (kobject_release+0x70/0x84)
        r5:c79641f0 r4:c796420c
       [<c015289c>] (kobject_release+0x0/0x84) from [<c0153ce4>] (kref_put+0x68/0x80)
        r7:00000083 r6:c74083d0 r5:c015289c r4:c796420c
       [<c0153c7c>] (kref_put+0x0/0x80) from [<c01527d0>] (kobject_put+0x48/0x5c)
        r5:c79643b4 r4:c79641f0
       [<c0152788>] (kobject_put+0x0/0x5c) from [<c013ddd8>] (blk_cleanup_queue+0x68/0x74)
        r4:c7964000
       [<c013dd70>] (blk_cleanup_queue+0x0/0x74) from [<c01a6370>] (mmc_blk_put+0x78/0xe8)
        r5:00000000 r4:c794c400
       [<c01a62f8>] (mmc_blk_put+0x0/0xe8) from [<c01a64b4>] (mmc_blk_release+0x24/0x38)
        r5:c794c400 r4:c0322824
       [<c01a6490>] (mmc_blk_release+0x0/0x38) from [<c00de11c>] (__blkdev_put+0xe8/0x170)
        r5:c78d5e00 r4:c74083c0
       [<c00de034>] (__blkdev_put+0x0/0x170) from [<c00de2c0>] (blkdev_put+0x11c/0x12c)
        r8:c79f5f70 r7:00000001 r6:c74083d0 r5:00000083 r4:c74083c0
       r3:00000000
       [<c00de1a4>] (blkdev_put+0x0/0x12c) from [<c00b0724>] (kill_block_super+0x60/0x6c)
        r7:c7942300 r6:c79f4000 r5:00000083 r4:c74083c0
       [<c00b06c4>] (kill_block_super+0x0/0x6c) from [<c00b0a94>] (deactivate_locked_super+0x44/0x70)
        r6:c79f4000 r5:c031af64 r4:c794dc00 r3:c00b06c4
       [<c00b0a50>] (deactivate_locked_super+0x0/0x70) from [<c00b1358>] (deactivate_super+0x6c/0x70)
        r5:c794dc00 r4:c794dc00
       [<c00b12ec>] (deactivate_super+0x0/0x70) from [<c00c88b0>] (mntput_no_expire+0x188/0x194)
        r5:c794dc00 r4:c7942300
       [<c00c8728>] (mntput_no_expire+0x0/0x194) from [<c00c95e0>] (sys_umount+0x2e4/0x310)
        r6:c7942300 r5:00000000 r4:00000000 r3:00000000
       [<c00c92fc>] (sys_umount+0x0/0x310) from [<c000d940>] (ret_fast_syscall+0x0/0x30)
       ---[ end trace e5c83c92ada51c76 ]---
      
      Cc: stable@kernel.org
      Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 07 Nov, 2011 1 commit
  9. 03 Nov, 2011 9 commits
  10. 02 Nov, 2011 1 commit
  11. 01 Nov, 2011 15 commits