1. 29 Apr, 2008 2 commits
  2. 28 Apr, 2008 38 commits
    • A
      FAT_VALID_MEDIA(): remove pointless test · 73f20e58
      Andrew Morton committed
      The on-disk media specification field in FAT is only 8-bits, so testing for
      <=0xff is pointless, and can generate a "comparison is always true due to
      limited range of data type" warning.
      
      While we're there, convert FAT_VALID_MEDIA() into a C function - the present
      implementation is buggy: it generates either one or two references to its
      argument.
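      
      A minimal sketch of the double-evaluation problem, not the actual kernel
      code: the macro body, the helper next_media_byte() and the counter
      read_count are assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed macro form: the argument text is expanded three times, so an
 * argument with side effects may be evaluated more than once. */
#define FAT_VALID_MEDIA_MACRO(x) \
    ((0xF8 <= (x) && (x) <= 0xFF) || (x) == 0xF0)

/* Function form: the argument is evaluated exactly once, and the 8-bit
 * parameter makes any "<= 0xff" test redundant. */
static inline int fat_valid_media(uint8_t media)
{
    return media >= 0xF8 || media == 0xF0;
}

/* Hypothetical helper with a side effect: counts how often it is called. */
int read_count;

int next_media_byte(void)
{
    read_count++;
    return 0xF0;
}
```

      Passing next_media_byte() to the macro calls the helper more than once,
      while the function form calls it exactly once.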
      
      Cc: Frank Seidel <fseidel@suse.de>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73f20e58
    • O
      fat: Update free_clusters even if it is untrusted · 606e423e
      OGAWA Hirofumi committed
      Currently, free_clusters is not updated until it is trusted, because
      Windows doesn't update it correctly.
      
      But if the user is using the Linux FAT driver, it updates free_clusters
      correctly.  So update free_clusters even if it is untrusted; if
      free_clusters was correct, it now stays correct.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      606e423e
    • O
      fat: Add allow_utime option · 1ae43f82
      OGAWA Hirofumi committed
      Normally utime(2) checks that the current process is the owner of the
      file or has the CAP_FOWNER capability.  But the FAT filesystem doesn't
      store uid/gid on disk, so the normal check is too inflexible.
      
      With this option you can relax it.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ae43f82
    • O
      fat: fat_notify_change() and check_mode() cleanup · 1278fdd3
      OGAWA Hirofumi committed
      - Rename fat_notify_change() to fat_setattr()
      - check_mode() cleanup
      - Change layout of code
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1278fdd3
    • J
      reiserfs: unpack tails on quota files · d5dee5c3
      Jan Kara committed
      Quota files cannot have tails because the quota_write and quota_read
      functions do not support them.  So far, when a quota file did have a tail,
      we just refused to turn quotas on for it.  Sadly, this check has been wrong,
      so there are now plenty of installations where quota files don't have the
      NOTAIL flag set, and after fixing the check they suddenly fail to turn
      quotas on.  Since it's easy to unpack the tail from the kernel, do this from
      reiserfs_quota_on(), which solves the problem and is generally nicer to
      users anyway.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: <urhausen@urifabi.net>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5dee5c3
    • D
      md: introduce get_priority_stripe() to improve raid456 write performance · 8b3e6cdc
      Dan Williams committed
      Improve write performance by preventing the delayed_list from dumping all its
      stripes onto the handle_list in one shot.  Delayed stripes are now further
      delayed by being held on the 'hold_list'.  The 'hold_list' is bypassed when:
      
        * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list'
        * 'handle_list' is empty and i/o is being done to satisfy full stripe-width
          write requests
        * 'bypass_count' is less than 'bypass_threshold'.  By default the threshold
          is 1, i.e. every other stripe handled is a preread stripe provided the
          top two conditions are false.
      
      Benchmark data:
      System: 2x Xeon 5150, 4x SATA, mem=1GB
      Baseline: 2.6.24-rc7
      Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
      Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
        * patched:  +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512)
      
      Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS)
        * patched: +13%
        * patched + preread_bypass_threshold = 0: +37%
      
      Changes since v1:
      * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and
        make it configurable.  This defaults to fairness and modest performance
        gains out of the box.
      Changes since v2:
      * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not
        necessary, the important change was clearing STRIPE_DELAYED in
        add_stripe_bio and this has been moved out to make_request for the hang
        fix.
      * [neilb@suse.de]: simplify get_priority_stripe
      * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is
        sampled empty (+11%)
      * [dan.j.williams@intel.com]: decrement the bypass_count at the detection
        of stripes being naturally promoted off of hold_list +2%.  Note, resetting
        bypass_count instead of decrementing on these events yields +4% but that is
        probably too aggressive.
      Changes since v3:
      * cosmetic fixups
      Tested-by: James W. Laferriere <babydr@baby-dragons.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b3e6cdc
    • A
      PM/gxfb: add hook to PM console layer that allows disabling of suspend VT switch · b6f448e9
      Andres Salomon committed
      Prior to suspend, we allocate and switch to a new VT; after suspend, we switch
      back to the original VT.  This can be slow, and is completely unnecessary if
      the framebuffer we're using can restore video properly.
      
      This adds a hook that allows drivers to select whether or not to do this vt
      switch, and changes the gxfb driver to call this hook.  It also adds a module
      param to gxfb to allow controlling of the vt switch (defaulting to no switch).
      
      (Note: I'm not convinced that console_sem is the best way to protect this, but
      we should probably have some form of locking..)
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Andres Salomon <dilinger@debian.org>
      Cc: Jordan Crouse <jordan.crouse@amd.com>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6f448e9
    • A
      fb: add support for foreign endianness · e4c690e0
      Anton Vorontsov committed
      Add support for the framebuffers with non-native endianness.  This is done via
      FBINFO_FOREIGN_ENDIAN flag that will be used by the drivers.  Depending on the
      host endianness this flag will be overwritten by FBINFO_BE_MATH internal flag,
      or cleared.
      
      Tested to work on MPC8360E-RDK (BE) + Fujitsu MINT framebuffer (LE).
      Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
      Cc: "Antonino A. Daplas" <adaplas@pol.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: <Valdis.Kletnieks@vt.edu>
      Cc: Clemens Koller <clemens.koller@anagramm.de>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4c690e0
    • I
      i2o: remove static inline forward declarations · 73fcdc9e
      Ilpo Järvinen committed
      Nothing between these forward declarations and the later definitions
      (with bodies) needs them.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73fcdc9e
    • A
      quota: convert stub functions from macros into inlines · 50f8c370
      Andrew Morton committed
      Fixes things like this:
      
      fs/super.c: In function `deactivate_super':
      fs/super.c:182: warning: statement with no effect
      fs/super.c: In function `do_remount_sb':
      fs/super.c:644: warning: statement with no effect
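      
      A minimal sketch of why a macro stub triggers this warning and an inline
      stub does not.  The names DQUOT_OFF_MACRO and dquot_off_stub are
      hypothetical, and struct super_block is only an opaque stand-in for the
      kernel type.

```c
/* Opaque stand-in for the kernel's super_block type. */
struct super_block;

/* Macro stub: a call used as a statement expands to "(0);", which gcc
 * reports as "statement with no effect". */
#define DQUOT_OFF_MACRO(sb) (0)

/* Inline stub: still compiles away to nothing, but it is a real call
 * expression, so the warning disappears and the argument is type-checked. */
static inline int dquot_off_stub(struct super_block *sb)
{
    (void)sb; /* unused when quota support is compiled out */
    return 0;
}
```

      Both forms evaluate to 0; only the macro form produces the warning when
      its result is discarded.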
      
      Cc: Jan Kara <jack@ucw.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50f8c370
    • J
      quota: quota core changes for quotaon on remount · 0ff5af83
      Jan Kara committed
      Currently, we just turn quotas off on remount of a filesystem to the
      read-only state.  The patch below adds the necessary framework so that we
      can turn quotas off on remount RO but automatically re-enable them when the
      filesystem is remounted to the RW state.  All we need to do is keep
      references to the inodes of quota files when remounting RO and use these
      references to re-enable quotas when remounting RW.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ff5af83
    • J
      quota: various style cleanups · 03f6e92b
      Jan Kara committed
      Cleanups in quota code:
        Change __inline__ to inline.
        Change some macros to inline functions.
        Remove vfs_quota_off_mount() macro.
        DQUOT_OFF() should be (0) if CONFIG_QUOTA is disabled.
        Move declaration of mark_dquot_dirty and dirty_dquot from quota.h to dquot.c
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03f6e92b
    • A
      quota: do not allow setting of quota limits to too high values · 338bf9af
      Andrew Perepechko committed
      We should check that quota limits set via Q_SETQUOTA do not exceed the
      limits which the quota format is able to handle.
      Signed-off-by: Andrew Perepechko <andrew.perepechko@sun.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      338bf9af
    • M
      kprobes: add (un)register_jprobes for batch registration · 26b31c19
      Masami Hiramatsu committed
      Introduce unregister_/register_jprobes() for jprobe batch registration.
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26b31c19
    • M
      kprobes: add (un)register_kretprobes for batch registration · 4a296e07
      Masami Hiramatsu committed
      Introduce unregister_/register_kretprobes() for kretprobe batch registration.
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a296e07
    • M
      kprobes: add (un)register_kprobes for batch registration · 9861668f
      Masami Hiramatsu committed
      Introduce unregister_/register_kprobes() for kprobe batch registration.  This
      can reduce the time spent waiting in synchronize_sched() when a lot of probes
      have to be unregistered at once.
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9861668f
    • M
      list.h: add list_is_singular() · 99602572
      Masami Hiramatsu committed
      Add list_is_singular() to check whether a list has just one entry.
      
      list_is_singular() is useful for checking whether a list_head which has been
      temporarily allocated for listing objects can be released or not.
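      
      A minimal userspace sketch of the check, assuming a circular doubly linked
      list in the style of the kernel's list.h.  The struct and the helpers
      init_list_head(), list_add() and list_empty() are redefined here for
      illustration and are not copied from the patch.

```c
/* Circular doubly linked list node, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static inline void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static inline void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

static inline int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Exactly one entry: the list is not empty, and the first and last
 * entries are the same node. */
static inline int list_is_singular(const struct list_head *head)
{
    return !list_empty(head) && head->next == head->prev;
}
```

      An empty list and a list with two or more entries both yield 0; only a
      list holding a single entry yields 1.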
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99602572
    • S
      kprobes: prevent probing of preempt_schedule() · 3d8d996e
      Srinivasa Ds committed
      Prohibit users from probing preempt_schedule().  One way of prohibiting the
      user from probing a function is to mark it with __kprobes.  But this method
      doesn't work for functions that are already placed in a different section,
      like preempt_schedule() (which belongs to the __sched section).  So we use a
      blacklist approach to refuse probes on these functions.
      
      In the blacklist approach we record each blacklisted function's starting
      address and size in the kprobe_blacklist structure.  Then we check the
      user-specified address against the start and end of each blacklisted
      function, so any attempt to register a probe on a blacklisted function is
      rejected.
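      
      A minimal sketch of the range check described above.  The struct layout,
      field names and the function addr_is_blacklisted() are assumptions for
      illustration; the kernel's own kprobe_blacklist code may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blacklist entry: function name, start address and size. */
struct blacklist_entry {
    const char *name;
    uintptr_t start_addr;
    size_t range;
};

/* Return 1 if addr falls inside [start_addr, start_addr + range) of any
 * blacklisted function, 0 otherwise. */
static inline int addr_is_blacklisted(const struct blacklist_entry *list,
                                      size_t count, uintptr_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= list[i].start_addr &&
            addr < list[i].start_addr + list[i].range)
            return 1;
    }
    return 0;
}
```

      A registration path would call such a check first and reject the probe
      when it returns 1.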
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
      Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d8d996e
    • K
      VT notifier extension for accessibility · 0341a4d0
      Karl Dahlke committed
      Some accessibility modules need to be able to catch the output on the
      console before the VT interpretation, and possibly swallow it.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0341a4d0
    • M
      sm501: add uart support · 61711f8f
      Magnus Damm committed
      This patch extends the sm501 mfd with 8250 uart support.  We're currently
      doing this in the board specific r2d-1 code already, but it would be nice to
      move things into the mfd since it's more chip specific than board specific.
      Signed-off-by: Magnus Damm <damm@igel.co.jp>
      Cc: Ben Dooks <ben-linux@fluff.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61711f8f
    • T
      x86: configurable DMI scanning code · 7ae9392c
      Thomas Petazzoni committed
      Turn CONFIG_DMI into a selectable option if EMBEDDED is defined, in
      order to be able to remove the DMI table scanning code if it's not
      needed, and then reduce the kernel code size.
      
      With CONFIG_DMI (i.e. before):
      
         text    data     bss     dec     hex filename
      1076076  128656   98304 1303036  13e1fc vmlinux
      
      Without CONFIG_DMI (i.e. after):
      
         text    data     bss     dec     hex filename
      1068092  126308   98304 1292704  13b9a0 vmlinux
      
      Result:
      
         text    data     bss     dec     hex filename
        -7984   -2348       0  -10332   -285c vmlinux
      
      The new option appears in "Processor type and features", only when
      CONFIG_EMBEDDED is defined.
      
      This patch is part of the Linux Tiny project, and is based on previous work
      done by Matt Mackall <mpm@selenic.com>.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Anvin" <hpa@zytor.com>
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ae9392c
    • J
      synclink drivers bool conversion · 0fab6de0
      Joe Perches committed
      Remove more TRUE/FALSE defines and uses
      Remove == TRUE tests
      Convert BOOLEAN to bool
      Convert int to bool where appropriate
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Paul Fulghum <paulkf@microgate.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fab6de0
    • H
      ncpfs: add prototypes to ncp_fs.h · cdf88037
      Harvey Harrison committed
      Removes some externs from C files, noticed from the sparse warnings:
      fs/ncpfs/dir.c:90:26: warning: symbol 'ncp_root_dentry_operations' was not declared. Should it be static?
      fs/ncpfs/symlink.c:107:5: warning: symbol 'ncp_symlink' was not declared. Should it be static?
      fs/ncpfs/symlink.c:101:39: warning: symbol 'ncp_symlink_aops' was not declared. Should it be static?
      Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
      Acked-by: Petr Vandrovec <VANDROVE@vc.cvut.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdf88037
    • A
      capabilities: implement per-process securebits · 3898b1b4
      Andrew G. Morgan committed
      Filesystem capability support makes it possible to do away with (set)uid-0
      based privilege and use capabilities instead.  That is, with filesystem
      support for capabilities but without this present patch, it is (conceptually)
      possible to manage a system with capabilities alone and never need to obtain
      privilege via (set)uid-0.
      
      Of course, conceptually isn't quite the same as currently possible since few
      user applications, certainly not enough to run a viable system, are currently
      prepared to leverage capabilities to exercise privilege.  Further, many
      applications exist that may never get upgraded in this way, and the kernel
      will continue to want to support their setuid-0 base privilege needs.
      
      Where pure-capability applications evolve and replace setuid-0 binaries, it is
      desirable that there be a mechanism by which they can contain their
      privilege.  In addition to leveraging the per-process bounding and inheritable
      sets, this should include suppressing the privilege of the uid-0 superuser
      from the process's tree of children.
      
      The feature added by this patch can be leveraged to suppress the privilege
      associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
      initiate, and only immediately affects the 'current' process (it is inherited
      through fork()/exec()).  This reimplementation differs significantly from the
      historical support for securebits which was system-wide, unwieldy and which
      has ultimately withered to a dead relic in the source of the modern kernel.
      
      With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
      all legacy privilege (through uid=0) for itself and all subsequently
      fork()'d/exec()'d children with:
      
        prctl(PR_SET_SECUREBITS, 0x2f);
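      
      A hedged breakdown of the 0x2f value used above.  The bit positions are
      assumed to mirror <linux/securebits.h>; they are redefined under local,
      hypothetical SK_* names so the sketch builds without kernel headers.

```c
/* Assumed securebits layout: each flag is followed by its "locked" bit. */
#define SK_NOROOT                 (1u << 0) /* uid 0 gains no capabilities on exec */
#define SK_NOROOT_LOCKED          (1u << 1)
#define SK_NO_SETUID_FIXUP        (1u << 2) /* uid changes do not adjust capability sets */
#define SK_NO_SETUID_FIXUP_LOCKED (1u << 3)
#define SK_KEEP_CAPS              (1u << 4) /* left clear in 0x2f */
#define SK_KEEP_CAPS_LOCKED       (1u << 5)

/* Compose the mask that drops legacy uid-0 privilege and locks that choice. */
static inline unsigned int drop_legacy_root_mask(void)
{
    return SK_NOROOT | SK_NOROOT_LOCKED |
           SK_NO_SETUID_FIXUP | SK_NO_SETUID_FIXUP_LOCKED |
           SK_KEEP_CAPS_LOCKED;
}
```

      Under these assumptions the mask equals 0x2f, the value passed as the
      second argument to prctl(PR_SET_SECUREBITS, ...) in the commit message.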
      
      This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
      enabled at configure time.
      
      [akpm@linux-foundation.org: fix uninitialised var warning]
      [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
      Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Reviewed-by: James Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Paul Moore <paul.moore@hp.com>
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3898b1b4
    • K
      mm: fix broken gfp_zone with __GFP_THISNODE · 8cece85e
      KAMEZAWA Hiroyuki committed
      This hack, "base = MAX_NR_ZONES", for __GFP_THISNODE was used with the old
      zonelists.
      
      Now the new zonelist[] has a list for __GFP_THISNODE, so this hack is
      incorrect and should be removed.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cece85e
    • Y
      memory hotplug: make alloc_bootmem_section() · e70260aa
      Yasunori Goto committed
      alloc_bootmem_section() allocates memory from a specified section.  A later
      patch uses it to keep the usemap in the same section as the pgdat.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e70260aa
    • Y
      memory hotplug: register section/node id to free · 04753278
      Yasunori Goto committed
      This patch set frees pages which were allocated by bootmem, for
      memory hot-remove.  Some memory management structures are allocated by
      bootmem, e.g. memmap.
      
      To remove memory physically, some of them must be freed depending on the
      circumstances.  This patch set lays the groundwork for freeing those pages,
      and frees memmaps.
      
      The basic idea is to use the remaining members of struct page to remember
      information about the users of bootmem (section number or node id).  When a
      section is being removed, the kernel can check it.  With this information,
      several issues can be solved.
      
        1) When the memmap of the section being removed is allocated on another
           section by bootmem, it should/can be freed.
        2) When the memmap of the section being removed is allocated on the
           same section, it shouldn't be freed, because the section has to be
           logically offlined already and all its pages must be isolated from the
           page allocator.  If it were freed, the page allocator might use memory
           which will be removed physically soon.
        3) When the section being removed holds another section's memmap, the
           kernel will be able to easily show the user which section should be
           removed before it.  (Not implemented yet.)
        4) In case 2) above, page isolation will be able to check for and skip
           the memmap's pages during logical memory offline (offline_pages()).
           The current page isolation code fails in this case because such a page
           is just a reserved page, and the code can't tell whether the page can
           be removed or not.  With this patch set it will be able to.
           (Not implemented yet.)
        5) Node information such as pgdat has similar issues, which can be
           solved the same way.
           (Not implemented yet, but the node id is already remembered in the pages.)
      
      Fortunately, the current bootmem allocator just keeps the PageReserved flag
      and doesn't use any other members of struct page.  The users of
      bootmem don't use them either.
      
      This patch:
      
      This registers the id of the node or section.  The kernel can then
      distinguish which node/section uses the pages allocated by bootmem.  This is
      the basis for hot-removing sections or nodes.
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04753278
    • G
      hugetlbfs: architecture header cleanup · 6d779079
      Gerald Schaefer committed
      This patch moves all architecture functions for hugetlb to architecture header
      files (include/asm-foo/hugetlb.h) and converts all macros to inline functions.
      It also removes (!) ARCH_HAS_HUGEPAGE_ONLY_RANGE,
      ARCH_HAS_HUGETLB_FREE_PGD_RANGE, ARCH_HAS_PREPARE_HUGEPAGE_RANGE,
      ARCH_HAS_SETCLEAR_HUGE_PTE and ARCH_HAS_HUGETLB_PREFAULT_HOOK.
      
      Getting rid of the ARCH_HAS_xxx #ifdef and macro fugliness should increase
      readability and maintainability, at the price of some code duplication.  An
      asm-generic common part would have reduced the loc, but we would end up with
      new ARCH_HAS_xxx defines eventually.
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d779079
    • L
      mempolicy: use struct mempolicy pointer in shmem_sb_info · 71fe804b
      Lee Schermerhorn committed
      This patch replaces the mempolicy mode, mode_flags, and nodemask in the
      shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
      This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
      inode.c and simplifies the interfaces.
      
      mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
      pointer arg, a struct mempolicy pointer on success.  For MPOL_DEFAULT, the
      returned pointer is NULL.  Further, mpol_parse_str() now takes a 'no_context'
      argument that causes the input nodemask to be stored in the w.user_nodemask of
      the created mempolicy for use when the mempolicy is installed in a tmpfs inode
      shared policy tree.  At that time, any cpuset contextualization is applied to
      the original input nodemask.  This preserves the previous behavior where the
      input nodemask was stored in the superblock.  We can think of the returned
      mempolicy as "context free".
      
      Because mpol_parse_str() is now calling mpol_new(), we can remove from
      mpol_to_str() the semantic checks that mpol_new() already performs.
      
      Add 'no_context' parameter to mpol_to_str() to specify that it should format
      the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
      
      Change mpol_shared_policy_init() to take a pointer to a "context free" struct
      mempolicy and to create a new, "contextualized" mempolicy using the mode,
      mode_flags and user_nodemask from the input mempolicy.
      
        Note: we know that the mempolicy passed to mpol_to_str() or
        mpol_shared_policy_init() from a tmpfs superblock is "context free".  This
        is currently the only instance thereof.  However, if we found more uses for
        this concept, and introduced any ambiguity as to whether a mempolicy was
        context free or not, we could add another internal mode flag to identify
        context free mempolicies.  Then, we could remove the 'no_context' argument
        from mpol_to_str().
      
      Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
      if one exists, to pass to mpol_shared_policy_init().  We must add the
      reference under the sb stat_lock to prevent races with replacement of the mpol
      by remount.  This reference is removed in mpol_shared_policy_init().
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: yet another build fix]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71fe804b
    • L
      mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Lee Schermerhorn committed
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         Note 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimiter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • L
      mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy · fc36b8d3
      Lee Schermerhorn committed
      Now that we're using "preferred local" policy for system default, we need to
      make this as fast as possible.  Because of the variable size of the mempolicy
      structure [based on size of nodemasks], the preferred_node may be in a
      different cacheline from the mode.  This can result in accessing an extra
      cacheline in the normal case of system default policy.  I suspect this is the
      cause of an observed 2-3% slowdown in page fault testing relative to a kernel
      without this patch series.
      
      To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
      flags member which is guaranteed [?] to be in the same cacheline as the mode
      itself.
      
      Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
      for both anon and shmem segments with system default and vma [preferred local]
      policy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc36b8d3
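A minimal sketch of the idea in this commit: test a flag stored next to the mode so that the common "local" case never reads preferred_node. The struct layout, the sk_ names and the flag's bit value are assumptions for illustration and do not reproduce the kernel's struct mempolicy.

```c
#include <assert.h>
#include <stddef.h>

#define SK_F_LOCAL (1u << 1)    /* illustrative bit value */

/* Simplified stand-in for struct mempolicy. */
struct sk_mempolicy {
    int refcnt;
    unsigned short mode;
    unsigned short flags;       /* adjacent to mode */
    short preferred_node;       /* may sit further away in the real structure */
};

/* Pick the node to allocate from for a "preferred" style policy. */
static int sk_preferred_nid(const struct sk_mempolicy *pol, int local_node)
{
    assert(pol != NULL);
    if (pol->flags & SK_F_LOCAL)
        return local_node;      /* preferred_node is never touched */
    return pol->preferred_node;
}
```

Because mode and flags are adjacent members here, a caller that has already read the mode will normally find the flags in the same cacheline.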
    • L
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn committed
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superfluous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy reference counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
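The conditional put and conditional copy described in this commit might look roughly like the sketch below. It is a user-space illustration under stated assumptions: the sk_ names are hypothetical, the reference count is a plain int rather than an atomic, and "freeing" is only recorded in a counter.

```c
#include <assert.h>
#include <stddef.h>

#define SK_F_SHARED (1u << 0)   /* illustrative bit value */

struct sk_policy {
    int refcnt;                 /* plain int here; the kernel uses an atomic */
    unsigned short mode;
    unsigned short flags;
};

static int sk_freed;            /* counts policies whose last ref was dropped */

/* Drop one reference; "free" the policy when the count reaches zero. */
static void sk_put(struct sk_policy *pol)
{
    assert(pol->refcnt > 0);
    if (--pol->refcnt == 0)
        sk_freed++;
}

/* Only shared policies took an extra reference on lookup, so only they unref. */
static void sk_cond_put(struct sk_policy *pol)
{
    if (pol && (pol->flags & SK_F_SHARED))
        sk_put(pol);
}

/*
 * Copy a shared policy into caller-provided (e.g. on-stack) storage, clear
 * the shared flag on the copy, and drop the lookup reference on the original.
 * Non-shared or NULL policies are returned unchanged.
 */
static struct sk_policy *sk_cond_copy(struct sk_policy *to, struct sk_policy *from)
{
    if (!from || !(from->flags & SK_F_SHARED))
        return from;
    *to = *from;                /* valid only because the struct holds no pointers */
    to->flags &= ~SK_F_SHARED;
    sk_put(from);
    return to;
}
```

The copy can then be handed to code that calls the conditional put several times, since a put on the copy is a no-op.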
    • L
      mempolicy: document {set|get}_policy() vm_ops APIs · a6020ed7
      Lee Schermerhorn committed
      Document mempolicy return value reference semantics assumed by the rest of the
      mempolicy code for the set_ and get_policy vm_ops in <linux/mm.h>--where the
      prototypes are defined--to inform any future mempolicy vm_op writers what the
      rest of the subsystem expects of them.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6020ed7
    • L
      mempolicy: mark shared policies for unref · aab0b102
      Lee Schermerhorn committed
      As part of yet another rework of mempolicy reference counting, we want to be
      able to identify shared policies efficiently, because they have an extra ref
      taken on lookup that needs to be removed when we're finished using the policy.
      
        Note:  the extra ref is required because the policies are
        shared between tasks/processes and can be changed/freed
        by one task while another task is using them--e.g., for
        page allocation.
      
      Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
      indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
      member of the struct mempolicy added by David.  MPOL_F_SHARED, and any future
      "internal mode flags" are reserved from bit zero up, as they will never be
      passed in the upper bits of the mode argument of a mempolicy API.
      
      I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
      rb-tree.  I don't need/want to clear the flag when removing from the tree as the
      mempolicy is freed [unref'd] internally to the sp_delete() function.  However,
      a task could hold another reference on this mempolicy from a prior lookup.  We
      need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
      unref, eventually freeing, the mempolicy.
      
      A later patch in this series will introduce a function to conditionally unref
      [mpol_free] a policy.  The MPOL_F_SHARED flag is one reason [currently the
      only reason] to unref/free a policy via the conditional free.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aab0b102
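The bit allocation described above, with user-visible mode flags in the upper bits and internal flags growing from bit zero, can be sketched like this. The sk_ names and the specific bit positions are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* User-visible mode flag, passed in the upper bits of the mode argument. */
#define SK_F_STATIC_NODES (1u << 15)
#define SK_USER_FLAGS     SK_F_STATIC_NODES

/* Internal flag, allocated from bit zero up; never supplied by user space. */
#define SK_F_SHARED       (1u << 0)

/* Separate a combined user-supplied mode argument into mode and flags. */
static void sk_split_mode(unsigned int arg, unsigned short *mode,
                          unsigned short *flags)
{
    assert(mode && flags);
    *flags = (unsigned short)(arg & SK_USER_FLAGS);
    *mode = (unsigned short)(arg & ~SK_USER_FLAGS);
}
```

Since the two flag ranges do not overlap, an internal flag can later be OR-ed into the same flags member without disturbing anything the user passed in.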
    • L
      mempolicy: rename struct mempolicy 'policy' member to 'mode' · 45c4745a
      Lee Schermerhorn committed
      The terms 'policy' and 'mode' are both used in various places to describe the
      semantics of the value stored in the 'policy' member of struct mempolicy.
      Furthermore, the term 'policy' is used to refer to that member, to the entire
      struct mempolicy and to the more abstract concept of the tuple consisting of a
      "mode" and an optional node or set of nodes.  Recently, we have added "mode
      flags" that are passed in the upper bits of the 'mode' [or sometimes,
      'policy'] member of the numa APIs.
      
      I'd like to resolve this confusion, which perhaps only exists in my mind, by
      renaming the 'policy' member to 'mode' throughout, and fixing up the
      Documentation.  Man pages will be updated separately.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45c4745a
    • L
      mempolicy: rename mpol_copy to mpol_dup · 846a16bf
      Lee Schermerhorn committed
      This patch renames mpol_copy() to mpol_dup() because, well, that's what it
      does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
      existing mempolicy, allocates a new one and copies the contents.
      
      In a later patch, I want to use the name mpol_copy() to copy the contents from
      one mempolicy to another like, e.g., strcpy() does for strings.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846a16bf
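The strdup()/strcpy() analogy can be made concrete with a small user-space sketch. The sk_ names and the simplified struct are hypothetical, and malloc() stands in for whatever allocator the kernel code uses.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_policy {
    int refcnt;
    unsigned short mode;
    unsigned short flags;
};

/* Like strdup(): allocate a new object and copy the contents into it. */
static struct sk_policy *sk_dup(const struct sk_policy *old)
{
    assert(old != NULL);
    struct sk_policy *copy = malloc(sizeof(*copy));
    if (!copy)
        return NULL;
    *copy = *old;
    copy->refcnt = 1;           /* the new object starts with its own reference */
    return copy;
}

/* Like strcpy(): copy the contents into storage the caller already owns. */
static struct sk_policy *sk_copy(struct sk_policy *to, const struct sk_policy *from)
{
    assert(to && from);
    *to = *from;
    return to;
}
```

The dup form hands ownership of a fresh allocation to the caller, while the copy form never allocates and so suits on-stack destinations.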
    • L
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn committed
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
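The lifetime rule in the last paragraph, where the mempolicy outlives the shared_policy as long as some task still holds a lookup reference, can be illustrated with a toy get/put pair. The sk_ names are hypothetical, the count is a plain int instead of an atomic, and freeing is represented by a flag.

```c
#include <assert.h>
#include <stddef.h>

struct sk_policy {
    int refcnt;
    int freed;                  /* set when the last reference is dropped */
};

/* Take a reference, as a lookup on behalf of an allocating task would. */
static void sk_get(struct sk_policy *pol)
{
    assert(pol && !pol->freed);
    pol->refcnt++;
}

/* Release a reference; the object goes away only with the last one. */
static void sk_put(struct sk_policy *pol)
{
    assert(pol && pol->refcnt > 0);
    if (--pol->refcnt == 0)
        pol->freed = 1;         /* real code would release the memory here */
}
```

In this model, tearing down the shared_policy is just one more put: the policy survives until the task that looked it up has also called put.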
    • A
      Subject: [PATCH] hugetlb: vmstat events for huge page allocations · 3b116300
      Adam Litke committed
      Allocating huge pages directly from the buddy allocator is not guaranteed to
      succeed.  Success depends on several factors (such as the amount of physical
      memory available and the level of fragmentation).  With the addition of
      dynamic hugetlb pool resizing, allocations can occur much more frequently.
      For these reasons it is desirable to keep track of huge page allocation
      successes and failures.
      
      Add two new vmstat entries to track huge page allocations that succeed and
      fail.  The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
      being enabled.
      
      [akpm@linux-foundation.org: reduced ifdeffery]
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
      Tested-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Andy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b116300
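The accounting added by this commit amounts to bumping one of two counters after each attempt to get a huge page from the buddy allocator. The sketch below models that in user space; the event names, the counter array and sk_account_alloc() are assumptions for illustration and are not the kernel's vmstat interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event identifiers, one per outcome. */
enum sk_event { SK_HTLB_ALLOC_SUCCESS, SK_HTLB_ALLOC_FAIL, SK_NR_EVENTS };

/* Global counters; the kernel keeps comparable counters per CPU. */
static unsigned long sk_events[SK_NR_EVENTS];

static void sk_count_event(enum sk_event e)
{
    assert(e < SK_NR_EVENTS);
    sk_events[e]++;
}

/*
 * Record the outcome of one allocation attempt and pass the result through.
 * A NULL page means the allocator could not satisfy the request.
 */
static void *sk_account_alloc(void *page)
{
    sk_count_event(page ? SK_HTLB_ALLOC_SUCCESS : SK_HTLB_ALLOC_FAIL);
    return page;
}
```

Comparing the two counters over time gives a rough picture of how often fragmentation or memory pressure defeats dynamic pool resizing.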