1. 11 9月, 2009 2 次提交
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
    • J
      writeback: move dirty inodes from super_block to backing_dev_info · 66f3b8e2
      Jens Axboe 提交于
      This is a first step at introducing per-bdi flusher threads. We should
      have no change in behaviour, although sb_has_dirty_inodes() is now
      ridiculously expensive, as there's no easy way to answer that question.
      Not a huge problem, since it'll be deleted in subsequent patches.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      66f3b8e2
  2. 09 9月, 2009 1 次提交
  3. 06 9月, 2009 2 次提交
  4. 04 9月, 2009 1 次提交
  5. 01 9月, 2009 1 次提交
    • T
      percpu: don't assume existence of cpu0 · 04a13c7c
      Tejun Heo 提交于
      percpu incorrectly assumed that cpu0 was always there which led to the
      following warning and eventual oops on sparc machines w/o cpu0.
      
        WARNING: at mm/percpu.c:651 pcpu_map+0xdc/0x100()
        Modules linked in:
        Call Trace:
          [000000000045eb70] warn_slowpath_common+0x50/0xa0
          [000000000045ebdc] warn_slowpath_null+0x1c/0x40
          [00000000004d493c] pcpu_map+0xdc/0x100
          [00000000004d59a4] pcpu_alloc+0x3e4/0x4e0
          [00000000004d5af8] __alloc_percpu+0x18/0x40
          [00000000005b112c] __percpu_counter_init+0x4c/0xc0
        ...
        Unable to handle kernel NULL pointer dereference
        ...
         I7: <sysfs_new_dirent+0x30/0x120>
         Disabling lock debugging due to kernel taint
         Caller[000000000053c1b0]: sysfs_new_dirent+0x30/0x120
         Caller[000000000053c7a4]: create_dir+0x24/0xc0
         Caller[000000000053c870]: sysfs_create_dir+0x30/0x80
         Caller[00000000005990e8]: kobject_add_internal+0xc8/0x200
        ...
         Kernel panic - not syncing: Attempted to kill the idle task!
      
      This patch fixes the problem by backporting parts from devel branch to
      make percpu core not depend on the existence of cpu0.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Cc: David Miller <davem@davemloft.net>
      04a13c7c
  6. 27 8月, 2009 1 次提交
  7. 19 8月, 2009 3 次提交
  8. 17 8月, 2009 1 次提交
    • E
      Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab
      Eric Paris 提交于
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a seperate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      788084ab
  9. 14 8月, 2009 2 次提交
    • A
      percpu: use the right flag for get_vm_area() · 142d44b0
      Amerigo Wang 提交于
      get_vm_area() only accepts VM_* flags, not GFP_*.
      
      And according to the doc of get_vm_area(), here should be
      VM_ALLOC.
      Signed-off-by: NWANG Cong <amwang@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      142d44b0
    • T
      percpu, sparc64: fix sparse possible cpu map handling · 74d46d6b
      Tejun Heo 提交于
      percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
      is incorrect if cpu_possible_map contains holes.  This causes percpu
      code to access beyond allocated memories and vmalloc areas.  On a
      sparc64 machine with cpus 0 and 2 (u60), this triggers the following
      warning or fails boot.
      
       WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
       Modules linked in:
       Call Trace:
        [00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
        [00000000004b1840] map_vm_area+0x20/0x60
        [00000000004b1950] __vmalloc_area_node+0xd0/0x160
        [0000000000593434] deflate_init+0x14/0xe0
        [0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
        [00000000005844f0] crypto_alloc_base+0x50/0xa0
        [000000000058b898] alg_test_comp+0x18/0x80
        [000000000058dad4] alg_test+0x54/0x180
        [000000000058af00] cryptomgr_test+0x40/0x60
        [0000000000473098] kthread+0x58/0x80
        [000000000042b590] kernel_thread+0x30/0x60
        [0000000000472fd0] kthreadd+0xf0/0x160
       ---[ end trace 429b268a213317ba ]---
      
      This patch fixes generic percpu functions and sparc64
      setup_per_cpu_areas() so that they handle sparse cpu_possible_map
      properly.
      
      Please note that on x86, cpu_possible_map() doesn't contain holes and
      thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
      any behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      74d46d6b
  10. 10 8月, 2009 1 次提交
  11. 08 8月, 2009 1 次提交
    • K
      mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware · 4bfc4495
      KAMEZAWA Hiroyuki 提交于
      At first, init_task's mems_allowed is initialized as this.
       init_task->mems_allowed == node_state[N_POSSIBLE]
      
      And cpuset's top_cpuset mask is initialized as this
       top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]
      
      Before 2.6.29:
      policy's mems_allowed is initialized as this.
      
        1. update tasks->mems_allowed by its cpuset->mems_allowed.
        2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)
      
      Updating task's mems_allowed in reference to top_cpuset's one.
      cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.
      
      In 2.6.30: After commit 58568d2a
      ("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
      is initialized as this.
      
        1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask)
      
      Here, if task is in top_cpuset, task->mems_allowed is not updated from
      init's one.  Assume user excutes command as #numactrl --interleave=all
      ,....
      
        policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK)
      
      Then, policy's mems_allowd can includes a possible node, which has no pgdat.
      
      MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this
      directly.
      
        NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL
      
      Then, what's we need is making policy->mems_allowed be aware of
      N_HIGH_MEMORY.  This patch does that.  But to do so, extra nodemask will
      be on statck.  Because I know cpumask has a new interface of
      CPUMASK_ALLOC(), I added it to node.
      
      This patch stands on old behavior.  But I feel this fix itself is just a
      Band-Aid.  But to do fundametal fix, we have to take care of memory
      hotplug and it takes time.  (task->mems_allowd should be N_HIGH_MEMORY, I
      think.)
      
      mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask
      should be includes only online nodes.
      
      In old behavior, this is guaranteed by frequent reference to cpuset's
      code.  Now, most of them are removed and mempolicy has to check it by
      itself.
      
      To do check, a few nodemask_t will be used for calculating nodemask.  But,
      size of nodemask_t can be big and it's not good to allocate them on stack.
      
      Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area.
      NODEMASK_ALLOC/FREE shoudl be there.
      
      [akpm@linux-foundation.org: cleanups & tweaks]
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bfc4495
  12. 30 7月, 2009 7 次提交
  13. 28 7月, 2009 1 次提交
    • B
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb() · 9e1b32ca
      Benjamin Herrenschmidt 提交于
      mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()
      
      Upcoming paches to support the new 64-bit "BookE" powerpc architecture
      will need to have the virtual address corresponding to PTE page when
      freeing it, due to the way the HW table walker works.
      
      Basically, the TLB can be loaded with "large" pages that cover the whole
      virtual space (well, sort-of, half of it actually) represented by a PTE
      page, and which contain an "indirect" bit indicating that this TLB entry
      RPN points to an array of PTEs from which the TLB can then create direct
      entries. Thus, in order to invalidate those when PTE pages are deleted,
      we need the virtual address to pass to tlbilx or tlbivax instructions.
      
      The old trick of sticking it somewhere in the PTE page struct page sucks
      too much, the address is almost readily available in all call sites and
      almost everybody implemets these as macros, so we may as well add the
      argument everywhere. I added it to the pmd and pud variants for consistency.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
      Acked-by: NNick Piggin <npiggin@suse.de>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e1b32ca
  14. 11 7月, 2009 1 次提交
  15. 10 7月, 2009 1 次提交
  16. 08 7月, 2009 6 次提交
  17. 07 7月, 2009 4 次提交
  18. 02 7月, 2009 1 次提交
    • I
      kmemleak: Fix scheduling-while-atomic bug · 57d81f6f
      Ingo Molnar 提交于
      One of the kmemleak changes caused the following
      scheduling-while-holding-the-tasklist-lock regression on x86:
      
      BUG: sleeping function called from invalid context at mm/kmemleak.c:795
      in_atomic(): 1, irqs_disabled(): 0, pid: 1737, name: kmemleak
      2 locks held by kmemleak/1737:
       #0:  (scan_mutex){......}, at: [<c10c4376>] kmemleak_scan_thread+0x45/0x86
       #1:  (tasklist_lock){......}, at: [<c10c3bb4>] kmemleak_scan+0x1a9/0x39c
      Pid: 1737, comm: kmemleak Not tainted 2.6.31-rc1-tip #59266
      Call Trace:
       [<c105ac0f>] ? __debug_show_held_locks+0x1e/0x20
       [<c102e490>] __might_sleep+0x10a/0x111
       [<c10c38d5>] scan_yield+0x17/0x3b
       [<c10c3970>] scan_block+0x39/0xd4
       [<c10c3bc6>] kmemleak_scan+0x1bb/0x39c
       [<c10c4331>] ? kmemleak_scan_thread+0x0/0x86
       [<c10c437b>] kmemleak_scan_thread+0x4a/0x86
       [<c104d73e>] kthread+0x6e/0x73
       [<c104d6d0>] ? kthread+0x0/0x73
       [<c100959f>] kernel_thread_helper+0x7/0x10
      kmemleak: 834 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      
      The bit causing it is highly dubious:
      
      static void scan_yield(void)
      {
              might_sleep();
      
              if (time_is_before_eq_jiffies(next_scan_yield)) {
                      schedule();
                      next_scan_yield = jiffies + jiffies_scan_yield;
              }
      }
      
      It called deep inside the codepath and in a conditional way,
      and that is what crapped up when one of the new scan_block()
      uses grew a tasklist_lock dependency.
      
      This minimal patch removes that yielding stuff and adds the
      proper cond_resched().
      
      The background scanning thread could probably also be reniced
      to +10.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57d81f6f
  19. 01 7月, 2009 3 次提交