1. 31 Oct, 2011 2 commits
  2. 28 Oct, 2011 1 commit
    • vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriately · 39be79c1
      Jeff Layton committed
      Currently, when you call iov_iter_advance, the pointer to the iovec
      array can be incremented, but the nr_segs value in the iov_iter struct
      is not decremented. The result is an iov_iter struct with an nr_segs
      value that goes beyond the end of the array.
      
      While I'm not aware of anything that's specifically broken by this, it
      seems odd and a bit dangerous not to decrement that value. If someone
      were to trust the nr_segs value to be correct, then they could end up
      walking off the end of the array.
      
      Changing this might also provide some micro-optimization when dealing
      with the last iovec in an array. Many of the other routines that deal
      with iov_iter have optimized codepaths when nr_segs == 1.
      
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
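
The idea in the message above can be illustrated with a small userspace sketch. It is not the kernel's code: `struct demo_iter` and `demo_iter_advance()` are hypothetical stand-ins for `struct iov_iter` and `iov_iter_advance()`, and the sketch assumes the iterator should stay on the last segment rather than step past it.

```c
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

/* Hypothetical, simplified stand-in for the kernel's struct iov_iter. */
struct demo_iter {
    const struct iovec *iov;  /* current segment */
    size_t nr_segs;           /* segments left, including the current one */
    size_t iov_offset;        /* bytes already consumed in the current segment */
    size_t count;             /* total bytes left */
};

/* Advance by `bytes`, keeping nr_segs in step with the iov pointer. */
static void demo_iter_advance(struct demo_iter *i, size_t bytes)
{
    if (bytes > i->count)
        bytes = i->count;
    i->count -= bytes;

    while (bytes > 0 && i->nr_segs > 0) {
        size_t left = i->iov->iov_len - i->iov_offset;
        size_t step = bytes < left ? bytes : left;

        i->iov_offset += step;
        bytes -= step;

        if (i->iov_offset < i->iov->iov_len)
            break;              /* segment not exhausted, so bytes is 0 */
        if (i->nr_segs <= 1)
            break;              /* assumption: never step past the last segment */

        /* Move to the next segment and account for it in nr_segs. */
        i->iov++;
        i->nr_segs--;
        i->iov_offset = 0;
    }
}
```

With three segments of 4, 4 and 8 bytes, advancing by 6 leaves the iterator on the second segment with `nr_segs == 2`, so a caller that trusts `nr_segs` cannot walk off the end of the array.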
  3. 20 Oct, 2011 1 commit
    • mm: fix race between mremap and removing migration entry · 486cf46f
      Hugh Dickins committed
      I don't usually pay much attention to the stale "? " addresses in
      stack backtraces, but this lucky report from Pawel Sikora hints that
      mremap's move_ptes() has inadequate locking against page migration.
      
       3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
       kernel BUG at include/linux/swapops.h:105!
       RIP: 0010:[<ffffffff81127b76>]  [<ffffffff81127b76>]
                             migration_entry_wait+0x156/0x160
        [<ffffffff811016a1>] handle_pte_fault+0xae1/0xaf0
        [<ffffffff810feee2>] ? __pte_alloc+0x42/0x120
        [<ffffffff8112c26b>] ? do_huge_pmd_anonymous_page+0xab/0x310
        [<ffffffff81102a31>] handle_mm_fault+0x181/0x310
        [<ffffffff81106097>] ? vma_adjust+0x537/0x570
        [<ffffffff81424bed>] do_page_fault+0x11d/0x4e0
        [<ffffffff81109a05>] ? do_mremap+0x2d5/0x570
        [<ffffffff81421d5f>] page_fault+0x1f/0x30
      
      mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
      and pagetable locks, were good enough before page migration (with its
      requirement that every migration entry be found) came in, and enough
      while migration always held mmap_sem; but not enough nowadays, when
      there's memory hotremove and compaction.
      
      The danger is that move_ptes() lets a migration entry dodge around
      behind remove_migration_pte()'s back, so it's in the old location when
      looking at the new, then in the new location when looking at the old.
      
      Either mremap's move_ptes() must additionally take anon_vma lock(), or
      migration's remove_migration_pte() must stop peeking for is_swap_entry()
      before it takes pagetable lock.
      
      Consensus chooses the latter: we prefer to add overhead to migration
      than to mremapping, which gets used by JVMs and by exec stack setup.

      Reported-and-tested-by: Paweł Sikora <pluto@agmk.net>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 28 Sep, 2011 3 commits
    • slub: Discard slab page when node partial > minimum partial number · dcc3be6a
      Alex Shi committed
      A slab should be discarded when the node's partial count exceeds
      min_partial.  Otherwise, the node's partial slabs may eat up all memory.

      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: correct comments error for per cpu partial · 9f264904
      Alex Shi committed
      Correct comment errors that mistake the number of cpu partial objects
      for a number of pages, which may mislead readers.

      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • mm: restrict access to slab files under procfs and sysfs · ab067e99
      Vasiliy Kulikov committed
      Historically /proc/slabinfo and files under /sys/kernel/slab/* have
      world read permissions and are accessible to the world.  slabinfo
      contains rather private information related both to the kernel and
      userspace tasks.  Depending on the situation, it might reveal either
      private information per se or information useful to make another
      targeted attack.  Some examples of what can be learned by
      reading/watching for /proc/slabinfo entries:
      
      1) dentry (and different *inode*) number might reveal other processes fs
      activity.  The number of dentry "active objects" doesn't strictly show
      file count opened/touched by a process, however, there is a good
      correlation between them.  The patch "proc: force dcache drop on
      unauthorized access" relies on the privacy of dentry count.
      
      2) different inode entries might reveal the same information as (1), but
      these are more fine-grained counters.  If a filesystem is mounted in a
      private mount point (or even a private namespace) and fs type differs from
      other mounted fs types, fs activity in this mount point/namespace is
      revealed.  If there is a single ecryptfs mount point, the whole fs
      activity of a single user is revealed.  The number of files in an
      ecryptfs mount point is private information per se.
      
      3) fuse_* reveals number of files / fs activity of a user in a user
      private mount point.  It is approx. the same severity as ecryptfs
      infoleak in (2).
      
      4) sysfs_dir_cache similar to (2) reveals devices' addition/removal,
      which can be otherwise hidden by "chmod 0700 /sys/".  With 0444 slabinfo
      the precise number of sysfs files is known to the world.
      
      5) buffer_head might reveal some kernel activity.  With other
      information leaks an attacker might identify what specific kernel
      routines generate buffer_head activity.
      
      6) *kmalloc* infoleaks are very situational.  An attacker has to watch
      the specific kmalloc size entry and filter out the noise from unrelated
      kernel activity.  If the victim system is relatively quiet, the attacker
      might get rather precise counters.
      
      Additional information sources might significantly increase the slabinfo
      infoleak benefits.  E.g. if an attacker knows that the processes
      activity on the system is very low (only core daemons like syslog and
      cron), he may run setxid binaries / trigger local daemon activity /
      trigger network services activity / await sporadic cron jobs activity
      / etc. and get rather precise counters for fs and network activity of
      these privileged tasks, which is unknown otherwise.
      
      Also, hiding slabinfo and /sys/kernel/slab/* is one step to complicate
      exploitation of kernel heap overflows (and possibly, other bugs).  The
      related discussion:
      
      http://thread.gmane.org/gmane.linux.kernel/1108378
      
      To keep compatibility with the old permission model, where a non-root
      monitoring daemon could watch for kernel memleaks through slabinfo, one
      should do:
      
          groupadd slabinfo
          usermod -a -G slabinfo $MONITOR_USER
      
      And add the following commands to init scripts (to mountall.conf in
      Ubuntu's upstart case):
      
          chmod g+r /proc/slabinfo /sys/kernel/slab/*/*
          chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/*

      Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: Kees Cook <kees@ubuntu.com>
      Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Christoph Lameter <cl@gentwo.org>
      Acked-by: David Rientjes <rientjes@google.com>
      CC: Valdis.Kletnieks@vt.edu
      CC: Linus Torvalds <torvalds@linux-foundation.org>
      CC: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  5. 15 Sep, 2011 8 commits
  6. 14 Sep, 2011 1 commit
  7. 03 Sep, 2011 2 commits
  8. 27 Aug, 2011 2 commits
  9. 26 Aug, 2011 4 commits
  10. 24 Aug, 2011 1 commit
  11. 20 Aug, 2011 6 commits
    • slub: per cpu cache for partial pages · 49e22585
      Christoph Lameter committed
      Allow filling out the rest of the kmem_cache_cpu cacheline with pointers to
      partial pages. The partial page list is used in slab_free() to avoid
      per node lock taking.
      
      In __slab_alloc() we can then take multiple partial pages off the per
      node partial list in one go reducing node lock pressure.
      
      We can also use the per cpu partial list in slab_alloc() to avoid scanning
      partial lists for pages with free objects.
      
      The main effect of a per cpu partial list is that the per node list_lock
      is taken for batches of partial pages instead of individual ones.
      
      Potential future enhancements:
      
      1. The pickup from the partial list could perhaps be done without disabling
         interrupts with some work. The free path already puts the page into the
         per cpu partial list without disabling interrupts.
      
      2. __slab_free() may have some code paths that could use optimization.
      
      Performance (hackbench "Time:" values):

      Command                           Before      After
      ./hackbench 100 process 200000    1953.047    1564.614
      ./hackbench 100 process 20000     207.176     156.940
      ./hackbench 100 process 20000     204.468     156.940
      ./hackbench 100 process 20000     204.879     158.772
      ./hackbench 10 process 20000      20.153      15.853
      ./hackbench 10 process 20000      20.153      15.986
      ./hackbench 10 process 20000      19.363      16.111
      ./hackbench 1 process 20000       2.518       2.307
      ./hackbench 1 process 20000       2.258       2.339
      ./hackbench 1 process 20000       2.864       2.163

      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
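
The batching effect described above can be sketched in plain C. This is only an illustration under stated assumptions, not SLUB's implementation: all `demo_*` names and the `DEMO_BATCH` threshold are hypothetical, and a counter stands in for taking the per node `list_lock`.

```c
#include <stddef.h>

struct demo_page {
    struct demo_page *next;
};

/* Shared per-node state; the list would be protected by a lock. */
struct demo_node {
    struct demo_page *partial;
    unsigned long lock_taken;   /* counts how often the lock would be taken */
};

/* Private per-CPU state; no lock is needed to touch it. */
struct demo_cpu {
    struct demo_page *partial;
    unsigned int nr;            /* pages currently on the per-CPU list */
};

#define DEMO_BATCH 4            /* hypothetical per-CPU list limit */

/* Move the whole per-CPU list to the node list under one lock acquisition. */
static void demo_drain(struct demo_cpu *c, struct demo_node *n)
{
    struct demo_page *p;

    if (c->partial == NULL)
        return;

    n->lock_taken++;            /* stands in for spin_lock(&n->list_lock) */
    while ((p = c->partial) != NULL) {
        c->partial = p->next;
        p->next = n->partial;
        n->partial = p;
    }
    c->nr = 0;
    /* the matching unlock would go here */
}

/* Put a partial page on the per-CPU list, draining a full batch first. */
static void demo_put_partial(struct demo_cpu *c, struct demo_node *n,
                             struct demo_page *p)
{
    if (c->nr >= DEMO_BATCH)
        demo_drain(c, n);

    p->next = c->partial;
    c->partial = p;
    c->nr++;
}
```

In this sketch, putting nine pages costs two lock acquisitions instead of nine, which is the kind of saving the hackbench numbers are meant to reflect.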
    • slub: return object pointer from get_partial() / new_slab(). · 497b66f2
      Christoph Lameter committed
      There is no longer any need to return the pointer to a slab page from
      get_partial(), since the page reference can be stored in the
      kmem_cache_cpu structure's "page" field.

      Return an object pointer instead.

      That in turn allows a simplification of the spaghetti code in __slab_alloc().

      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: pass kmem_cache_cpu pointer to get_partial() · acd19fd1
      Christoph Lameter committed
      Pass the kmem_cache_cpu pointer to get_partial(). That way
      we can avoid the this_cpu_write() statements.

      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: Prepare inuse field in new_slab() · e6e82ea1
      Christoph Lameter committed
      inuse will always be set to page->objects. There is no point in
      initializing the field to zero in new_slab() and then overwriting
      the value in __slab_alloc().

      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: Remove useless statements in __slab_alloc · 7db0d705
      Christoph Lameter committed
      Two statements in __slab_alloc() do not have any effect.
      
      1. c->page is already set to NULL by deactivate_slab() called right before.
      
      2. gfpflags are masked in new_slab() before being passed to the page
         allocator. There is no need to mask gfpflags in __slab_alloc in particular
         since most frequent processing in __slab_alloc does not require the use of a
         gfpmask.
      
      Cc: torvalds@linux-foundation.org
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: free slabs without holding locks · 69cb8e6b
      Christoph Lameter committed
      There are two situations in which slub holds a lock while releasing
      pages:
      
      	A. During kmem_cache_shrink()
      	B. During kmem_cache_close()
      
      For A, build a list while holding the lock and then release the pages
      later. In case of B, we are the last remaining user of the slab, so
      there is no need to take the list lock.
      
      After this patch all calls to the page allocator to free pages are
      done without holding any spinlocks. kmem_cache_destroy() will still
      hold the slub_lock semaphore.

      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
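
The approach for case A (collect under the lock, free afterwards) can be sketched in userspace C. This is a minimal illustration, not the kernel code: `demo_slab`, `demo_node` and `demo_shrink()` are hypothetical names, a pthread mutex stands in for the spinlock, and `free()` stands in for the page allocator.

```c
#include <pthread.h>
#include <stdlib.h>

struct demo_slab {
    struct demo_slab *next;
    int inuse;                  /* objects still allocated from this slab */
};

struct demo_node {
    pthread_mutex_t lock;       /* stands in for the per node list lock */
    struct demo_slab *partial;  /* singly linked partial list */
};

/* Unlink empty slabs under the lock, then free them after dropping it.
 * Returns the number of slabs freed. */
static int demo_shrink(struct demo_node *n)
{
    struct demo_slab *discard = NULL;
    struct demo_slab **pp;
    struct demo_slab *s;
    int freed = 0;

    pthread_mutex_lock(&n->lock);
    pp = &n->partial;
    while ((s = *pp) != NULL) {
        if (s->inuse == 0) {
            *pp = s->next;      /* unlink from the partial list */
            s->next = discard;  /* park on a private discard list */
            discard = s;
        } else {
            pp = &s->next;
        }
    }
    pthread_mutex_unlock(&n->lock);

    /* No lock is held while memory goes back to the allocator. */
    while (discard != NULL) {
        s = discard;
        discard = s->next;
        free(s);
        freed++;
    }
    return freed;
}
```

The private discard list is only reachable from the calling thread, which is why the second loop needs no locking.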
  12. 19 Aug, 2011 1 commit
    • squeeze max-pause area and drop pass-good area · bb082295
      Wu Fengguang committed
      Revert the pass-good area introduced in ffd1f609 ("writeback:
      introduce max-pause and pass-good dirty limits") and make the max-pause
      area smaller and safe.
      
      This fixes ~30% performance regression in the ext3 data=writeback
      fio_mmap_randwrite_64k/fio_mmap_randrw_64k test cases, where there are
      12 JBOD disks, on each disk runs 8 concurrent tasks doing reads+writes.
      
      The deadline scheduler also shows a regression, though not as large as
      with CFQ, which suggests we have some write starvation.
      
      The test logs show that

      - the disks are sometimes underutilized

      - global dirty pages sometimes rush high into the pass-good area for
        several hundred seconds, while in the meantime some bdi dirty pages
        drop to a very low value (bdi_dirty << bdi_thresh).  Then the global
        dirty pages suddenly drop under the global dirty threshold and
        bdi_dirty rushes very high (for example, 2 times higher than
        bdi_thresh), during which time balance_dirty_pages() is not called
        at all.
      
      So the problems are

      1) The random writes progress so slowly that they break the assumption of
         the max-pause logic that "8 pages per 200ms is typically more than
         enough to curb heavy dirtiers".

      2) The max-pause logic ignored task_bdi_thresh and thus opened the
         possibility for some bdi's to over dirty pages, leading to
         (bdi_dirty >> bdi_thresh) and then (bdi_thresh >> bdi_dirty) for others.

      3) The higher max-pause/pass-good thresholds somehow lead to the bad
         swing of dirty pages.

      The fix is to allow the task to dirty slightly over task_bdi_thresh, but
      never to exceed bdi_dirty and/or global dirty_thresh.
      
      Tests show that it fixed the JBOD regression completely (both behavior
      and performance), while still being able to cut down large pause times
      in balance_dirty_pages() for single-disk cases.

      Reported-by: Li Shaohua <shaohua.li@intel.com>
      Tested-by: Li Shaohua <shaohua.li@intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  13. 18 Aug, 2011 1 commit
  14. 15 Aug, 2011 1 commit
  15. 10 Aug, 2011 2 commits
    • Revert "memcg: get rid of percpu_charge_mutex lock" · 9f50fad6
      Michal Hocko committed
      This reverts commit 8521fc50.
      
      The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
      bit operations is sufficient but that is not true.  Johannes Weiner has
      reported a crash during parallel memory cgroup removal:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
        IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
        Oops: 0000 [#1] PREEMPT SMP
        Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
        RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
        RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
        Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
        Call Trace:
         [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
         [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
         [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
         [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
         [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
         [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
         [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
         [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
         [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
      
      We are crashing because we try to dereference cached memcg when we are
      checking whether we should wait for draining on the cache.  The cache is
      already cleaned up, though.
      
      There is also a theoretical chance that the cached memcg gets freed
      between the time we test FLUSHING_CACHED_CHARGE and the time we
      dereference it in mem_cgroup_same_or_subtree:
      
              CPU0                    CPU1                         CPU2
        mem=stock->cached
        stock->cached=NULL
                                    clear_bit
                                                              test_and_set_bit
        test_bit()                    ...
        <preempted>             mem_cgroup_destroy
        use after free
      
      The percpu_charge_mutex protected against this race because sync
      draining is exclusive.
      
      It is safer to revert now and come up with a more parallel
      implementation later.

      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: Johannes Weiner <jweiner@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: Fix partial count comparison confusion · 81107188
      Christoph Lameter committed
      deactivate_slab() gets wrong the comparison that checks whether more
      than the minimum number of partial pages are on the partial list. An
      effect of this may be that empty pages are not freed from
      deactivate_slab(). The result could be an OOM due to growth of the
      partial slabs per node. Frees mostly occur from __slab_free, which is
      okay, so this would only affect use cases where a lot of switching
      around of per cpu slabs occurs.
      
      Switching per cpu slabs occurs with high frequency if debugging options are
      enabled.

      Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
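
The intended check can be written as a small predicate. This is a sketch of the rule as the message states it, not the code from `deactivate_slab()`; the function name and parameters are hypothetical.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a slab page should go back to the
 * page allocator instead of staying on the node's partial list.
 *
 *   inuse        objects still allocated from the slab
 *   nr_partial   partial slabs currently on the node
 *   min_partial  number of partial slabs the node is allowed to keep
 */
static bool demo_should_discard(unsigned int inuse,
                                unsigned long nr_partial,
                                unsigned long min_partial)
{
    /* Only empty slabs are candidates, and only once the node already
     * holds more than its minimum number of partial slabs. */
    return inuse == 0 && nr_partial > min_partial;
}
```

With the comparison reversed, an empty slab on a node that already holds many partial slabs would be kept rather than freed, which matches the growth the message warns about.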
  16. 09 Aug, 2011 2 commits
    • slub: fix check_bytes() for slub debugging · ef62fb32
      Akinobu Mita committed
      The check_bytes() function is used by slub debugging.  It returns a pointer
      to the first byte in the given memory area that does not match a given
      character.

      If the character to match is 0x80 or greater, check_bytes() doesn't
      work, because the 64-bit pattern is generated as below.
      
      	value64 = value | value << 8 | value << 16 | value << 24;
      	value64 = value64 | value64 << 32;
      
      Because value has type u8, it is promoted to int, and the result is
      sign-extended when it is widened to 64 bits.  The upper 32 bits of
      value64 are 0xffffffff after the first line, so the second line has no
      effect.
      
      This fixes the 64-bit pattern generation.

      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Reviewed-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
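
The sign-extension problem can be reproduced in userspace. The sketch below is an illustration, not the patch itself: `demo_pattern_buggy()` emulates the promoted-int arithmetic with fixed-width casts (assuming the usual two's-complement wraparound on conversion to `int32_t`), and `demo_pattern_fixed()` shows one way to build the pattern entirely in 64-bit unsigned arithmetic. Both function names are hypothetical.

```c
#include <stdint.h>

/* Emulates the pattern generation described above. The original
 * expression shifts a promoted int; here the 32-bit value is built
 * unsigned and then reinterpreted as signed to avoid signed overflow. */
static uint64_t demo_pattern_buggy(uint8_t value)
{
    /* Same bits as value | value << 8 | value << 16 | value << 24. */
    int32_t v32 = (int32_t)((uint32_t)value * 0x01010101u);

    /* Widening a negative int to 64 bits sets the upper 32 bits. */
    uint64_t v64 = (uint64_t)(int64_t)v32;

    return v64 | v64 << 32;
}

/* One way to avoid the problem: widen first, then replicate the byte. */
static uint64_t demo_pattern_fixed(uint8_t value)
{
    uint64_t v64 = value;

    v64 |= v64 << 8;
    v64 |= v64 << 16;
    v64 |= v64 << 32;
    return v64;
}
```

For a byte such as 0x5a both helpers agree, while for 0xbb the emulated original produces 0xffffffffbbbbbbbb instead of 0xbbbbbbbbbbbbbbbb.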
    • slub: Fix full list corruption if debugging is on · 6fbabb20
      Christoph Lameter committed
      When a slab is freed by __slab_free() and the slab can only contain a
      single object ever then it was full (and therefore not on the partial
      lists but on the full list in the debug case) before we reached
      slab_empty.
      
      This caused the following full list corruption when SLUB debugging was enabled:
      
        [ 5913.233035] ------------[ cut here ]------------
        [ 5913.233097] WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
        [ 5913.233101] Hardware name: Adamo 13
        [ 5913.233105] list_del corruption. prev->next should be ffffea000434fd20, but was ffffea0004199520
        [ 5913.233108] Modules linked in: nfs fscache fuse ebtable_nat ebtables ppdev parport_pc lp parport ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss xt_CHECKSUM sunrpc iptable_mangle bridge stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables rfcomm bnep arc4 iwlagn snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_intel btusb mac80211 snd_hda_codec bluetooth snd_hwdep snd_seq snd_seq_device snd_pcm usb_debug dell_wmi sparse_keymap cdc_ether usbnet cdc_acm uvcvideo cdc_wdm mii cfg80211 snd_timer dell_laptop videodev dcdbas snd microcode v4l2_compat_ioctl32 soundcore joydev tg3 pcspkr snd_page_alloc iTCO_wdt i2c_i801 rfkill iTCO_vendor_support wmi virtio_net kvm_intel kvm ipv6 xts gf128mul dm_crypt i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
        [ 5913.233213] Pid: 0, comm: swapper Not tainted 3.0.0+ #127
        [ 5913.233213] Call Trace:
        [ 5913.233213]  <IRQ>  [<ffffffff8105df18>] warn_slowpath_common+0x83/0x9b
        [ 5913.233213]  [<ffffffff8105dfd3>] warn_slowpath_fmt+0x46/0x48
        [ 5913.233213]  [<ffffffff8127e7c1>] __list_del_entry+0x8d/0x98
        [ 5913.233213]  [<ffffffff8127e7da>] list_del+0xe/0x2d
        [ 5913.233213]  [<ffffffff814e0430>] __slab_free+0x1db/0x235
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706ab>] ? bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff81133085>] kmem_cache_free+0x88/0x102
        [ 5913.233213]  [<ffffffff811706ab>] bvec_free_bs+0x35/0x37
        [ 5913.233213]  [<ffffffff811706e1>] bio_free+0x34/0x64
        [ 5913.233213]  [<ffffffff813dc390>] dm_bio_destructor+0x12/0x14
        [ 5913.233213]  [<ffffffff8116fef6>] bio_put+0x2b/0x2d
        [ 5913.233213]  [<ffffffff813dccab>] clone_endio+0x9e/0xb4
        [ 5913.233213]  [<ffffffff8116f7dd>] bio_endio+0x2d/0x2f
        [ 5913.233213]  [<ffffffffa00148da>] crypt_dec_pending+0x5c/0x8b [dm_crypt]
        [ 5913.233213]  [<ffffffffa00150a9>] crypt_endio+0x78/0x81 [dm_crypt]
      
      [ Full discussion here: https://lkml.org/lkml/2011/8/4/375 ]
      
      Make sure that we remove such a slab also from the full lists.

      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Xiaotian Feng <xtfeng@gmail.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  17. 04 Aug, 2011 2 commits