1. 15 May, 2008 1 commit
    • fix SMP data race in pagetable setup vs walking · 362a61ad
      Nick Piggin committed
      There is a possible data race in the page table walking code. After the split
      ptlock patches, it actually seems to have been introduced to the core code, but
      even before that I think it would have impacted some architectures (powerpc
      and sparc64, at least, walk the page tables without taking locks; see, e.g.,
      find_linux_pte()).
      
      The race is as follows:
      The pte page is allocated, zeroed, and its struct page gets its spinlock
      initialized. The mm-wide ptl is then taken, and then the pte page is inserted
      into the pagetables.
      
      At this point, the spinlock is not guaranteed to have ordered the previous
      stores to initialize the pte page with the subsequent store to put it in the
      page tables. So another Linux page table walker might be walking down (without
      any locks, because we have split-leaf-ptls), and find that new pte we've
      inserted. It might try to take the spinlock before the store from the other
      CPU initializes it. And subsequently it might read a pte_t out before stores
      from the other CPU have cleared the memory.
      
      There are also similar races in higher levels of the page tables. They
      obviously don't involve the spinlock, but could see uninitialized memory.
      
      Arch code and hardware pagetable walkers that walk the pagetables without
      locks could see similar uninitialized memory problems, regardless of whether
      split ptes are enabled or not.
      
      I prefer to put the barriers in core code, because that's where the higher
      level logic happens, but the page table accessors are per-arch, and I don't
      think open-coding them everywhere is an option. I'll put the read-side
      barriers in alpha arch code for now (other architectures perform
      data-dependent loads in order).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
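
The ordering requirement described in this commit can be illustrated in userspace with C11 atomics. This is a minimal sketch, not the kernel patch: toy_pte_page, toy_pmd_slot, toy_publish() and toy_walk() are hypothetical names, a release store stands in for the write barrier issued before the new page is linked in, and an acquire load stands in for the read-side barrier that the message says alpha needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TOY_PTRS_PER_PTE 8

/* Hypothetical stand-in for a pte page plus its per-page lock state. */
struct toy_pte_page {
    int lock_initialized;                /* models the split ptlock init */
    unsigned long pte[TOY_PTRS_PER_PTE]; /* must read as zero once visible */
};

/* Hypothetical stand-in for the pmd entry that points at the pte page. */
static _Atomic(struct toy_pte_page *) toy_pmd_slot;

/* Writer: initialize the page fully, then publish it with release ordering. */
static void toy_publish(struct toy_pte_page *page)
{
    assert(page != NULL);
    memset(page->pte, 0, sizeof(page->pte));
    page->lock_initialized = 1;
    /* Release: the stores above are ordered before the pointer store. */
    atomic_store_explicit(&toy_pmd_slot, page, memory_order_release);
}

/* Lockless walker: the acquire load pairs with the release in toy_publish(). */
static struct toy_pte_page *toy_walk(void)
{
    return atomic_load_explicit(&toy_pmd_slot, memory_order_acquire);
}

/* Returns 1 if a walker that finds the page also finds it initialized. */
static int toy_walker_sees_initialized(void)
{
    struct toy_pte_page *page = toy_walk();
    size_t i;

    if (page == NULL)
        return 1; /* nothing published yet, so nothing to check */
    if (!page->lock_initialized)
        return 0;
    for (i = 0; i < TOY_PTRS_PER_PTE; i++)
        if (page->pte[i] != 0)
            return 0;
    return 1;
}
```

Without the release/acquire pair, a walker on another CPU could in principle observe the pointer before the zeroing or the lock initialization, which is the race the message describes.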
  2. 13 May, 2008 2 commits
  3. 09 May, 2008 1 commit
  4. 07 May, 2008 2 commits
    • vfs: splice remove_suid() cleanup · 7f3d4ee1
      Miklos Szeredi committed
      generic_file_splice_write() duplicates remove_suid() just because it
      doesn't hold i_mutex.  But it grabs i_mutex inside splice_from_pipe()
      anyway, so this is rather pointless.
      
      Move locking to generic_file_splice_write() and call remove_suid() and
      __splice_from_pipe() instead.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • x86: fix PAE pmd_bad bootup warning · aeed5fce
      Hugh Dickins committed
      Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.
      
      That came from 9fc34113 x86: debug pmd_bad();
      but we understand now that the typecasting was wrong for PAE in the previous
      version: pagetable pages above 4GB looked bad and stopped Arjan from booting.
      
      And revert that cded932b x86: fix pmd_bad
      and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
      weaken every pmd_bad and pud_bad check to let huge pages slip through - in
      part they check that we _don't_ have a huge page where it's not expected.
      
      Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
      been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
      junk in the upper word is good; and x86_64 should follow x86_32's stricter
      comparison, to stop thinking any subset of required bits is good); but that
      should be a later patch.
      
      Fix Hans' good observation that follow_page() will never find pmd_huge()
      because that would have already failed the pmd_bad test: test pmd_huge in
      between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
      No, once it's a hugepage entry, it can get quite far from a good pmd: for
      example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.
      
      However... though follow_page() contains this and another test for huge
      pages, so it's nice to keep it working on them, where does it actually get
      called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
      call alternative hugetlb processing, as do unmap_vmas() and others.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Earlier-version-tested-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jeff Chua <jeff.chua.linux@gmail.com>
      Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
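
The test ordering this commit describes for follow_page() (none, then huge, then bad) can be sketched with a toy model. Everything here is an assumption made for illustration: the TOY_ flag bits, the 0xff flag mask and the toy_ helpers are hypothetical and do not match the real x86 definitions.

```c
#include <assert.h>

/* Hypothetical flag bits; the real x86 values and masks differ. */
#define TOY_PRESENT  0x01UL
#define TOY_RW       0x02UL
#define TOY_USER     0x04UL
#define TOY_ACCESSED 0x20UL
#define TOY_DIRTY    0x40UL
#define TOY_PSE      0x80UL /* marks a huge-page entry */
#define TOY_TABLE    (TOY_PRESENT | TOY_RW | TOY_ACCESSED | TOY_DIRTY)

enum toy_pmd_kind { TOY_PMD_NONE, TOY_PMD_HUGE, TOY_PMD_BAD, TOY_PMD_TABLE };

static int toy_pmd_none(unsigned long pmd) { return pmd == 0; }
static int toy_pmd_huge(unsigned long pmd) { return (pmd & TOY_PSE) != 0; }

/* Strict check: the flag bits (ignoring USER) must equal the table pattern. */
static int toy_pmd_bad(unsigned long pmd)
{
    return (pmd & 0xffUL & ~TOY_USER) != TOY_TABLE;
}

static enum toy_pmd_kind toy_classify(unsigned long pmd)
{
    if (toy_pmd_none(pmd))
        return TOY_PMD_NONE;
    if (toy_pmd_huge(pmd))   /* tested before the bad check on purpose */
        return TOY_PMD_HUGE;
    if (toy_pmd_bad(pmd))
        return TOY_PMD_BAD;
    return TOY_PMD_TABLE;
}
```

In this model a PROT_NONE-like huge entry (TOY_PSE | TOY_ACCESSED) fails the strict toy_pmd_bad() check, so it is only classified as huge because the huge test runs first.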
  5. 02 May, 2008 2 commits
  6. 01 May, 2008 3 commits
  7. 30 April, 2008 12 commits
    • revert "memory hotplug: allocate usemap on the section with pgdat" · 51674644
      Andrew Morton committed
      This:
      
      commit 86f6dae1
      Author: Yasunori Goto <y-goto@jp.fujitsu.com>
      Date:   Mon Apr 28 02:13:33 2008 -0700
      
          memory hotplug: allocate usemap on the section with pgdat
      
          With this patch, usemaps are allocated on the section which has pgdat.

          Because usemap size is very small, many other sections' usemaps are allocated
          on only one page.  If a section has a usemap, it can't be removed until the
          other sections are removed.  This dependency is not desirable for memory removal.

          Pgdat has a similar feature.  When a section has a pgdat area, it must be the
          last section removed on the node.  So, if section A has pgdat and section B
          has the usemap for section A, neither section can be removed because they
          depend on each other.

          To solve this issue, this patch collects usemaps on the same section as pgdat.
          If other sections don't have any dependency, this section will finally be able
          to be removed.
          Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
          Cc: Badari Pulavarty <pbadari@us.ibm.com>
          Cc: Yinghai Lu <yhlu.kernel@gmail.com>
          Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
          Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
          Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      broke davem's sparc64 bootup.  Revert it while we work out what went wrong.
      
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix warning on memory offline · 3a902c5f
      Nick Piggin committed
      KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
      comes from a page migration caller.
      
      WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
      Call Trace:
       [<a000000100015220>] show_stack+0x80/0xa0
       [<a000000100015270>] dump_stack+0x30/0x60
       [<a000000100089ed0>] warn_on_slowpath+0x90/0xe0
       [<a0000001001f8b10>] __set_page_dirty+0x330/0x360
       [<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280
       [<a00000010012fec0>] set_page_dirty+0xc0/0x260
       [<a000000100195670>] migrate_page_copy+0x5d0/0x5e0
       [<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0
       [<a000000100195eb0>] migrate_pages+0x770/0xe00
      
      What was happening is that migrate_page_copy wants to transfer the PG_dirty
      bit from old page to new page, so what it would do is set_page_dirty(newpage).
      However set_page_dirty() is used to set the entire page dirty, whereas in
      this case, only part of the page was dirty, and it also was not uptodate.
      
      Marking the whole page dirty with set_page_dirty would lead to corruption or
      unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
      buffers.
      
      Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
      however in the interests of keeping the change minimal...
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove remaining __FUNCTION__ occurrences · d40cee24
      Harvey Harrison committed
      __FUNCTION__ is gcc-specific, use __func__
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
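
As a small illustration of the replacement, __func__ is the predefined identifier standardized in C99 and needs no compiler extension. The helper name toy_log_prefix() is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds a log prefix from the standard identifier. */
static int toy_log_prefix(char *buf, size_t len)
{
    /* __func__ names the enclosing function, here "toy_log_prefix". */
    return snprintf(buf, len, "%s: ", __func__);
}
```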
    • infrastructure to debug (dynamic) objects · 3ac7fe5a
      Thomas Gleixner committed
      We can see an ever repeating problem pattern with objects of any kind in the
      kernel:
      
      1) freeing of active objects
      2) reinitialization of active objects
      
      Both problems can be hard to debug because the crash happens at a point where
      we have no chance to decode the root cause anymore.  One problem spot is
      kernel timers, where the detection of the problem often happens in interrupt
      context and usually causes the machine to panic.
      
      While working on a timer related bug report I had to hack specialized code
      into the timer subsystem to get a reasonable hint for the root cause.  This
      debug hack was fine for temporary use, but far from a mergeable solution due
      to the intrusiveness into the timer code.
      
      The code further lacked the ability to detect and report the root cause
      instantly and keep the system operational.
      
      Keeping the system operational is important to get hold of the debug
      information without special debugging aids like serial consoles and special
      knowledge of the bug reporter.
      
      The problems described above are not restricted to timers, but timers tend to
      expose it usually in a full system crash.  Other objects are less explosive,
      but the symptoms caused by such mistakes can be even harder to debug.
      
      Instead of creating specialized debugging code for the timer subsystem a
      generic infrastructure is created which allows developers to verify their code
      and provides an easy to enable debug facility for users in case of trouble.
      
      The debugobjects core code keeps track of operations on static and dynamic
      objects by inserting them into a hashed list and sanity checking them on
      object operations and provides additional checks whenever kernel memory is
      freed.
      
      The tracked object operations are:
      - initializing an object
      - adding an object to a subsystem list
      - deleting an object from a subsystem list
      
      Each operation is sanity checked before it is executed, and the
      subsystem-specific code can provide a fixup function that prevents the
      operation from causing damage.  When the sanity check triggers, a warning
      message and a stack trace are printed.
      
      The list of operations can be extended if the need arises.  For now it's
      limited to the requirements of the first user (timers).
      
      The core code enqueues the objects into hash buckets.  The hash index is
      generated from the address of the object to simplify the lookup for the check
      on kfree/vfree.  Each bucket has its own spinlock to avoid contention on a
      global lock.
      
      The debug code can be compiled in without being active.  The runtime overhead
      is minimal and could be optimized by asm alternatives.  A kernel command line
      option enables the debugging code.
      
      Thanks to Ingo Molnar for review, suggestions and cleanup patches.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Greg KH <greg@kroah.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
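
The tracking scheme described in this commit (objects hashed by address into buckets, with a state check on every operation and on free) can be sketched as a userspace toy. This is not the debugobjects API: all toy_ names, the three states and the fixed-size pool are assumptions, and the per-bucket spinlock and the fixup callbacks are left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKETS  16
#define TOY_MAX_OBJS 32

enum toy_state { TOY_STATE_NONE, TOY_STATE_INIT, TOY_STATE_ACTIVE };

struct toy_entry {
    const void *addr;
    enum toy_state state;
    struct toy_entry *next;
};

static struct toy_entry toy_pool[TOY_MAX_OBJS];
static size_t toy_pool_used;
static struct toy_entry *toy_bucket[TOY_BUCKETS];

/* The hash index is derived from the object address. */
static size_t toy_hash(const void *addr)
{
    return (size_t)(((uintptr_t)addr >> 4) % TOY_BUCKETS);
}

static struct toy_entry *toy_lookup(const void *addr, int create)
{
    size_t b = toy_hash(addr);
    struct toy_entry *e;

    for (e = toy_bucket[b]; e != NULL; e = e->next)
        if (e->addr == addr)
            return e;
    if (!create || toy_pool_used == TOY_MAX_OBJS)
        return NULL;
    e = &toy_pool[toy_pool_used++];
    e->addr = addr;
    e->state = TOY_STATE_NONE;
    e->next = toy_bucket[b];
    toy_bucket[b] = e;
    return e;
}

/* Each check returns 0 when the operation is sane and -1 when it is a bug. */
static int toy_object_init(const void *addr)
{
    struct toy_entry *e = toy_lookup(addr, 1);

    if (e == NULL || e->state == TOY_STATE_ACTIVE)
        return -1; /* pool exhausted, or reinitialization of an active object */
    e->state = TOY_STATE_INIT;
    return 0;
}

/* Models adding the object to a subsystem list. */
static int toy_object_activate(const void *addr)
{
    struct toy_entry *e = toy_lookup(addr, 0);

    if (e == NULL || e->state != TOY_STATE_INIT)
        return -1;
    e->state = TOY_STATE_ACTIVE;
    return 0;
}

/* Models deleting the object from a subsystem list. */
static int toy_object_deactivate(const void *addr)
{
    struct toy_entry *e = toy_lookup(addr, 0);

    if (e == NULL || e->state != TOY_STATE_ACTIVE)
        return -1;
    e->state = TOY_STATE_INIT;
    return 0;
}

/* Models the extra check made when the memory holding the object is freed. */
static int toy_object_free(const void *addr)
{
    struct toy_entry *e = toy_lookup(addr, 0);

    if (e == NULL)
        return 0;  /* untracked memory */
    if (e->state == TOY_STATE_ACTIVE)
        return -1; /* freeing an active object */
    e->state = TOY_STATE_NONE;
    return 0;
}
```

The two failure cases correspond to the two problem patterns listed in the message: reinitializing an active object and freeing an active object.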
    • mm: Add NR_WRITEBACK_TEMP counter · fc3ba692
      Miklos Szeredi committed
      Fuse will use temporary buffers to write back dirty data from memory mappings
      (normal writes are done synchronously).  This is needed, because there cannot
      be any guarantee about the time in which a write will complete.
      
      By using temporary buffers, from the MM's point of view the page is written
      back immediately.  If the writeout was due to memory pressure, this
      effectively migrates data from a full zone to a less full zone.
      
      This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
      as temporary buffers.
      
      [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bdi: export bdi_writeout_inc() · dd5656e5
      Miklos Szeredi committed
      Fuse needs this for writable mmap support.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bdi: add separate writeback accounting capability · e4ad08fe
      Miklos Szeredi committed
      Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB.  If this flag is
      set, then don't update the per-bdi writeback stats from
      test_set_page_writeback() and test_clear_page_writeback().
      
      Misc cleanups:
      
       - convert bdi_cap_writeback_dirty() and friends to static inline functions
       - create a flag that includes all three dirty/writeback related flags,
         since almost all users will want to have them together
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
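
The capability-flag pattern described above can be sketched as follows. Only BDI_CAP_NO_ACCT_WB and bdi_cap_writeback_dirty() are named in the commit message; the other flag names, all bit values, the combined flag name, struct toy_bdi and the toy_ helpers are assumptions made for this example.

```c
#include <assert.h>

/* Hypothetical bit values and names, prefixed TOY_ to mark them as such. */
#define TOY_BDI_CAP_NO_ACCT_DIRTY 0x01u
#define TOY_BDI_CAP_NO_WRITEBACK  0x02u
#define TOY_BDI_CAP_NO_ACCT_WB    0x04u

/* One flag that combines all three dirty/writeback related flags. */
#define TOY_BDI_CAP_NO_ACCT_AND_WRITEBACK \
    (TOY_BDI_CAP_NO_ACCT_DIRTY | TOY_BDI_CAP_NO_WRITEBACK | \
     TOY_BDI_CAP_NO_ACCT_WB)

struct toy_bdi {
    unsigned int capabilities;
};

/* Static inline helpers instead of macros, so arguments are type checked. */
static inline int toy_bdi_cap_writeback_dirty(const struct toy_bdi *bdi)
{
    return !(bdi->capabilities & TOY_BDI_CAP_NO_WRITEBACK);
}

static inline int toy_bdi_cap_account_dirty(const struct toy_bdi *bdi)
{
    return !(bdi->capabilities & TOY_BDI_CAP_NO_ACCT_DIRTY);
}

static inline int toy_bdi_cap_account_writeback(const struct toy_bdi *bdi)
{
    return !(bdi->capabilities & TOY_BDI_CAP_NO_ACCT_WB);
}

/* Sketch of a writeback-start hook: update the per-bdi stat only if allowed. */
static void toy_test_set_page_writeback(const struct toy_bdi *bdi,
                                        unsigned long *bdi_writeback_stat)
{
    if (toy_bdi_cap_account_writeback(bdi))
        (*bdi_writeback_stat)++;
}
```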
    • mm: bdi: move statistics to debugfs · 76f1418b
      Miklos Szeredi committed
      Move BDI statistics to debugfs:
      
         /sys/kernel/debug/bdi/<bdi>/stats
      
      Use postcore_initcall() to initialize the sysfs class and debugfs,
      because debugfs is initialized in core_initcall().
      
      Update descriptions in ABI documentation.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bdi: allow setting a maximum for the bdi dirty limit · a42dde04
      Peter Zijlstra committed
      Add "max_ratio" to /sys/class/bdi.  This indicates the maximum percentage of
      the global dirty threshold allocated to this bdi.
      
      [mszeredi@suse.cz]
      
       - fix parsing in max_ratio_store().
       - export bdi_set_max_ratio() to modules
       - limit bdi_dirty with bdi->max_ratio
       - document new sysfs attribute
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: bdi: allow setting a minimum for the bdi dirty limit · 189d3c4a
      Peter Zijlstra committed
      Under normal circumstances each device is given a part of the total write-back
      cache that relates to its current avg writeout speed in relation to the other
      devices.
      
      min_ratio - allows one to assign a minimum portion of the write-back cache to
      a particular device.  This is useful in situations where you might want to
      provide a minimum QoS.  (One request for this feature came from flash based
      storage people who wanted to avoid writing out at all costs - they of course
      needed some pdflush hacks as well)
      
      max_ratio - allows one to assign a maximum portion of the dirty limit to a
      particular device.  This is useful in situations where you want to avoid one
      device taking all or most of the write-back cache, e.g. an NFS mount that is
      prone to get stuck, or a FUSE mount which you don't trust to play fair.
      
      Add "min_ratio" to /sys/class/bdi.  This indicates the minimum percentage of
      the global dirty threshold allocated to this bdi.
      
      [mszeredi@suse.cz]
      
       - fix parsing in min_ratio_store()
       - document new sysfs attribute
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
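
How min_ratio and max_ratio could bound a device's portion of the dirty limit can be sketched with integer arithmetic. This is an assumed formula for illustration, not the kernel's code: the function name, its parameters, and the idea of first setting aside the sum of all devices' minimums (sum_min_ratio) are hypothetical. The writeout share is passed as a fraction share_num/share_den.

```c
#include <assert.h>

/* Hypothetical helper: per-device dirty limit in pages. */
static unsigned long toy_bdi_dirty_limit(unsigned long global_dirty,
                                         unsigned long share_num,
                                         unsigned long share_den,
                                         unsigned int min_ratio,
                                         unsigned int max_ratio,
                                         unsigned int sum_min_ratio)
{
    unsigned long limit, cap;

    assert(share_den != 0 && share_num <= share_den);
    assert(min_ratio <= max_ratio && max_ratio <= 100);
    assert(min_ratio <= sum_min_ratio && sum_min_ratio <= 100);

    /* Split what is left after all reserved minimums by writeout share. */
    limit = global_dirty * (100 - sum_min_ratio) / 100;
    limit = limit * share_num / share_den;
    /* Add this device's guaranteed minimum portion. */
    limit += global_dirty * min_ratio / 100;
    /* Never exceed the configured maximum portion. */
    cap = global_dirty * max_ratio / 100;
    return limit > cap ? cap : limit;
}
```

With a global limit of 1000 pages, a device holding half of the recent writeout with min_ratio 10 and max_ratio 40 would compute 450 + 100 = 550 pages and be capped at 400. An idle device with min_ratio 10 still keeps 100 pages. The sketch ignores overflow for very large page counts.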
    • mm: bdi: export BDI attributes in sysfs · cf0ca9fe
      Peter Zijlstra committed
      Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
      This allows us to see and set the various BDI specific variables.
      
      In particular this properly exposes the read-ahead window for all relevant
      users and /sys/block/<block>/queue/read_ahead_kb should be deprecated.
      
      With patient help from Kay Sievers and Greg KH
      
      [mszeredi@suse.cz]
      
       - split off NFS and FUSE changes into separate patches
       - document new sysfs attributes under Documentation/ABI
       - do bdi_class_init as a core_initcall, otherwise the "default" BDI
         won't be initialized
       - remove bdi_init_fmt macro, it's not used very much
      
      [akpm@linux-foundation.org: fix ia64 warning]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Acked-by: Greg KH <greg@kroah.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • /proc/pagetypeinfo: fix output for memoryless nodes · 41b25a37
      KOSAKI Motohiro committed
      On a memoryless node, /proc/pagetypeinfo displays slightly odd output.
      This patch fixes it.
      
      Output example (the header is printed, but no data is printed)
      --------------------------------------------------------------
      Page block order: 14
      Pages per block:  16384
      
      Free pages count per migrate type at order       0      1      2      3      4      5    \
        6      7      8      9     10     11     12     13     14     15     16
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate
      Page block order: 14
      Pages per block:  16384
      
      Free pages count per migrate type at order       0      1      2      3      4      5    \
        6      7      8      9     10     11     12     13     14     15     16
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 29 April, 2008 17 commits