1. 20 Oct, 2007 1 commit
    • pid namespaces: define is_global_init() and is_container_init() · b460cbc5
      Serge E. Hallyn committed
      is_init() is an ambiguous name for the pid==1 check.  Split it into
      is_global_init() and is_container_init().
      
      A container init has its tsk->pid == 1.
      
      A global init also has its tsk->pid == 1 and its active pid namespace
      is the init_pid_ns.  But rather than check the active pid namespace,
      compare the task structure with 'init_pid_ns.child_reaper', which is
      initialized during boot to the /sbin/init process and never changes.
      
      Changelog:
      
      	2.6.22-rc4-mm2-pidns1:
      	- Use 'init_pid_ns.child_reaper' to determine if a given task is the
      	  global init (/sbin/init) process. This would improve performance
      	  and remove dependence on the task_pid().
      
      	2.6.21-mm2-pidns2:
      
      	- [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
      	  ppc,avr32}/traps.c for the _exception() call to is_global_init().
      	  This way, we kill only the container if the container's init has a
      	  bug rather than force a kernel panic.
      
      [akpm@linux-foundation.org: fix comment]
      [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
      [bunk@stusta.de: kernel/pid.c: remove unused exports]
      [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzel <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 19 Oct, 2007 9 commits
    • V3 file capabilities: alter behavior of cap_setpcap · 72c2d582
      Andrew Morgan committed
      The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1,
      can change the capabilities of another process, p2.  This is not the
      meaning that was intended for this capability at all, and this
      implementation came about purely because, without filesystem capabilities,
      there was no way to use capabilities without one process bestowing them on
      another.
      
      Since we now have filesystem support for capabilities we can fix the
      implementation of CAP_SETPCAP.
      
      The most significant thing about this change is that, with it in effect, no
      process can set the capabilities of another process.
      
      The capabilities of a program are set via the capability convolution
      rules:
      
         pI(post-exec) = pI(pre-exec)
         pP(post-exec) = (X(aka cap_bset) & fP) | (pI(post-exec) & fI)
         pE(post-exec) = fE ? pP(post-exec) : 0
      
      at exec() time.  As such, the only influence the pre-exec() program can
      have on the post-exec() program's capabilities is through the pI
      capability set.
      
      The correct implementation for CAP_SETPCAP (and that enabled by this patch)
      is that it can be used to add extra pI capabilities to the current process
      - to be picked up by subsequent exec()s when the above convolution rules
      are applied.
      
      Here is how it works:
      
      Let's say we have a process, p. It has capability sets pE, pP and pI.
      Generally, p can change the value of its own pI to pI' where
      
         (pI' & ~pI) & ~pP = 0.
      
      That is, the only new things in pI' that were not present in pI need to
      be present in pP.
      
      The role of CAP_SETPCAP is basically to permit changes to pI beyond
      the above:
      
         if (pE & CAP_SETPCAP) {
            pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0  */
         }
      
      This capability is useful for things like login, which (say, via
      pam_cap) might want to raise certain inheritable capabilities for use
      by the children of the logged-in user's shell, but those capabilities
      are not useful to or needed by the login program itself.
      
      One such use might be to limit who can run ping. You set the
      capabilities of the 'ping' program to be "= cap_net_raw+i", and then
      only shells that have (pI & CAP_NET_RAW) will be able to run
      it. Without CAP_SETPCAP implemented as described above, login(pam_cap)
      would have to also have (pP & CAP_NET_RAW) in order to raise this
      capability and pass it on through the inheritable set.
      Signed-off-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
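The rules in this commit message can be tried out with a small user-space model. The sketch below is not kernel code: the function names and the fully permissive bounding set `ALL` are assumptions made for illustration, and capability sets are modelled as plain integer bitmasks. The two capability bit positions are the ones defined in `<linux/capability.h>`.

```python
# Bit positions as defined in <linux/capability.h>.
CAP_SETPCAP = 1 << 8
CAP_NET_RAW = 1 << 13

# Hypothetical bounding set X that permits everything (illustration only).
ALL = (1 << 32) - 1


def caps_after_exec(pI, X, fP, fI, fE):
    """Apply the exec() rules quoted above; returns (pI, pP, pE)."""
    new_pI = pI
    new_pP = (X & fP) | (new_pI & fI)
    new_pE = new_pP if fE else 0
    return new_pI, new_pP, new_pE


def may_set_inheritable(pE, pP, pI, new_pI):
    """May a process replace its own pI with new_pI?"""
    if pE & CAP_SETPCAP:
        return True                   # pI' = anything
    added = new_pI & ~pI              # bits new in pI'
    return (added & ~pP) == 0         # each must already be in pP


# login(pam_cap) holding CAP_SETPCAP but not CAP_NET_RAW may still raise
# CAP_NET_RAW in its inheritable set; a process with no capabilities may not.
print(may_set_inheritable(CAP_SETPCAP, CAP_SETPCAP, 0, CAP_NET_RAW))
print(may_set_inheritable(0, 0, 0, CAP_NET_RAW))

# ping carries "= cap_net_raw+i", i.e. fI = CAP_NET_RAW and fP = 0, so only
# a caller with CAP_NET_RAW in pI ends up with it in pP after exec().
print(caps_after_exec(CAP_NET_RAW, ALL, 0, CAP_NET_RAW, False))
print(caps_after_exec(0, ALL, 0, CAP_NET_RAW, False))
```

In the ping example, the permitted set after exec() is non-empty only for the shell whose pI already held CAP_NET_RAW, which is the access restriction the message describes.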
    • sysctl: deprecate sys_sysctl in a user space visible fashion. · 7058cb02
      Eric W. Biederman committed
      After adding checking to register_sysctl_table and finding a whole new set
      of bugs, missed by countless code reviews and testers, I have finally lost
      patience with the binary sysctl interface.
      
      The binary sysctl interface has been sort of deprecated for years and
      finding a user space program that uses the syscall is more difficult than
      finding a needle in a haystack.  Problems continue to crop up with the
      in-kernel implementation.  So since supporting something that no one uses is
      silly, deprecate sys_sysctl with a sufficient grace period and notice so that
      the handful of user space applications that care can be fixed or replaced.
      
      The /proc/sys sysctl interface that people use will continue to be
      supported indefinitely.
      
      This patch moves the tested warning about sysctls out of the sys_sysctl
      path into a separate path called from both implementations of
      sys_sysctl, and it adds a proper entry into
      Documentation/feature-removal-schedule.
      
      This allows us to revisit this in a couple of years' time and actually kill
      sys_sysctl.
      
      [lethal@linux-sh.org: sysctl: Fix syscall disabled build]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: Error on bad sysctl tables · fc6cd25b
      Eric W. Biederman committed
      After going through the kernel's sysctl tables several times it has become
      clear that code review and testing are just not effective in preventing
      problematic sysctl tables from being used in the stable kernel.  I certainly
      can't seem to fix the problems as fast as they are introduced.
      
      Therefore this patch adds sysctl_check_table which is called when a sysctl
      table is registered and checks to see if we have a problematic sysctl table.
      
      The biggest part of the code is the table of valid binary sysctl entries, but
      since we have frozen our set of binary sysctls this table should not need to
      change, and it makes it much easier to detect when someone unintentionally
      adds a new binary sysctl value.
      
      As best as I can determine all of the several hundred errors spewed on boot up
      now are legitimate.
      
      [bunk@kernel.org: kernel/sysctl_check.c must #include <linux/string.h>]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: remove the cad_pid binary sysctl path · c65f9239
      Eric W. Biederman committed
      It looks like we inadvertently killed the cad_pid binary sysctl support when
      cad_pid was changed to be a struct pid.  Since no one has complained, just
      remove the binary path.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: simplify the pty sysctl logic · 35834ca1
      Eric W. Biederman committed
      Instead of having a bunch of ifdefs in sysctl.c move all of the pty sysctl
      logic into drivers/char/pty.c
      
      As well as cleaning up the logic this prevents sysctl_check_table from
      complaining that the root table has a NULL data pointer on something with
      generic methods.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: remove the binary interface for aio-nr, aio-max-nr, acpi_video_flags · 0d135a4a
      Eric W. Biederman committed
      aio-nr, aio-max-nr, acpi_video_flags are unsigned long values which sysctl
      does not handle properly with a 64bit kernel and a 32bit user space.
      
      Since no one is likely to be using the binary sysctl values and the ascii
      interface still works, this patch just removes support for the binary sysctl
      interface from the kernel.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
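The commit message does not spell out how the binary path mishandles these values, so the following is only an illustration of the underlying size mismatch: an `unsigned long` is 8 bytes in a 64-bit kernel and 4 bytes in a 32-bit user space. The sample value is hypothetical, and little-endian byte order is assumed.

```python
import struct

# Hypothetical counter value that needs more than 32 bits.
value = 0x1_0000_0010

# 8-byte unsigned long, as a 64-bit kernel stores it (little-endian assumed).
kernel_bytes = struct.pack("<Q", value)

# 4-byte unsigned long, as a 32-bit user-space program expects it.
user_size = struct.calcsize("<L")

# A caller that supplies a 4-byte buffer can only ever see the low half.
(seen,) = struct.unpack("<L", kernel_bytes[:user_size])

print(len(kernel_bytes), user_size, hex(value), hex(seen))
```

The text interface under /proc/sys avoids the problem because the value crosses the boundary as a decimal string rather than as a fixed-width binary field.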
    • sysctl: remove binary sysctl support where it clearly doesn't work · f5ead5ce
      Eric W. Biederman committed
      These functions are all wrapper functions for the proc interface that are
      needed for them to work correctly.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Acked-by: Andrew Morgan <morgan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: Factor out sysctl_data. · 49a0c458
      Eric W. Biederman committed
      There has been no easy way to wrap the default sysctl strategy routine except
      for returning 0, which is not always what we want.  The few instances I have
      seen that want different behaviour have written their own version of
      sysctl_data.  While not too hard, it is unnecessary code and has the potential
      for extra bugs.
      
      So to make these situations easier and make that part of sysctl more symmetric
      I have factored sysctl_data out of do_sysctl_strategy and exported it as a
      function everyone can use.
      
      Further, having sysctl_data be an explicit function makes checking for badly
      formed sysctl tables much easier.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl core: Stop using the unnecessary ctl_table typedef · d8217f07
      Eric W. Biederman committed
      In sysctl.h the typedef struct ctl_table ctl_table violates coding style, isn't
      needed, and is a bit of a nuisance because it makes it harder to recognize
      ctl_table is a type name.
      
      So this patch removes it from the generic sysctl code.  Hopefully I will have
      enough energy to send the rest of my patches to remove it from
      the rest of the kernel.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 17 Oct, 2007 4 commits
  4. 15 Oct, 2007 4 commits
  5. 14 Oct, 2007 1 commit
    • minimal build fixes for uml (fallout from x86 merge) · 2b8232ce
      Al Viro committed
       a) include/asm-um/arch can't just point to include/asm-$(SUBARCH) now
       b) arch/{i386,x86_64}/crypto are merged now
       c) subarch-obj needed changes
       d) cpufeature_64.h should pull "cpufeature_32.h", not <asm/cpufeature_32.h>
          since it can be included from asm-um/cpufeature.h
       e) in case of uml-i386 we need CONFIG_X86_32 for make and gcc, but not
          for Kconfig
       f) sysctl.c shouldn't do vdso_enabled for uml-i386 (actually, that one
          should be registered from corresponding arch/*/kernel/*, with ifdef
          going away; that's a separate patch, though).
      
      With that and with Stephen's patch ("[PATCH net-2.6] uml: hard_header fix")
      we have uml allmodconfig building both on i386 and amd64.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 12 Oct, 2007 1 commit
  7. 20 Sep, 2007 1 commit
    • sched: add /proc/sys/kernel/sched_compat_yield · 1799e35d
      Ingo Molnar committed
      add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
      more aggressive, by moving the yielding task to the last position
      in the rbtree.
      
      with sched_compat_yield=0:
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        2539 mingo     20   0  1576  252  204 R   50  0.0   0:02.03 loop_yield
        2541 mingo     20   0  1576  244  196 R   50  0.0   0:02.05 loop
      
      with sched_compat_yield=1:
      
         PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
        2584 mingo     20   0  1576  248  196 R   99  0.0   0:52.45 loop
        2582 mingo     20   0  1576  256  204 R    0  0.0   0:00.00 loop_yield
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  8. 26 Aug, 2007 3 commits
  9. 20 Aug, 2007 1 commit
  10. 12 Aug, 2007 1 commit
  11. 30 Jul, 2007 1 commit
  12. 25 Jul, 2007 1 commit
  13. 23 Jul, 2007 1 commit
    • x86: i386-show-unhandled-signals-v3 · abd4f750
      Masoud Asgharifard Sharbiani committed
      This patch makes the i386 behave the same way that x86_64 does when a
      segfault happens.  A line gets printed to the kernel log so that tools
      that need to check for failures can behave more uniformly between
      architectures.  The message can be disabled by setting the
      debug.show_unhandled_signals sysctl variable to 0 (or by doing echo 0 >
      /proc/sys/debug/exception-trace).
      
      Also, all of the lines being printed are now using printk_ratelimit() to
      deny the ability of DoS from a local user with a program like the
      following:
      
      main()
      {
             while (1)
                     if (!fork()) *(int *)0 = 0;
      }
      
      This new revision also includes the fix that Andrew did which got rid of
      the new sysctl that was added to the system in earlier versions of this.
      Also, 'show-unhandled-signals' sysctl has been renamed back to the old
      'exception-trace' to avoid breakage of people's scripts.
      
      AK: Enabling by default for i386 will be likely controversial, but let's see what happens
      AK: Really folks, before complaining just fix your segfaults
      AK: I bet this will find a lot of silent issues
      Signed-off-by: Masoud Sharbiani <masouds@google.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      [ Personally, I've found the complaints useful on x86-64, so I'm all for
        this. That said, I wonder if we could do it more prettily..   -Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
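The reason rate limiting defuses the fork loop shown in this commit message can be seen in a small model. This is a fixed-window sketch with illustrative numbers, not the kernel's printk_ratelimit() implementation; the class name, the interval and the burst size are all assumptions.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds (fixed window)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.printed = 0
        self.suppressed = 0

    def allow(self, now):
        # Start a new window once the previous one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.suppressed += 1
        return False


# Illustrative values: at most 10 log lines per 5 seconds.
limiter = RateLimit(interval=5.0, burst=10)

# 1000 segfaults arriving in the same instant, as a fork loop might produce.
logged = sum(limiter.allow(now=0.0) for _ in range(1000))
print(logged, limiter.suppressed)
```

However fast the children crash, the log grows by a bounded number of lines per interval, so a local user cannot flood it.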
  14. 20 Jul, 2007 5 commits
    • kernel/sysctl.c: finish off the warning comments · ed2c12f3
      Andrew Morton committed
      I've been chasing these comments around this file all week.  Hopefully we're
      straight now.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lockstat: core infrastructure · f20786ff
      Peter Zijlstra committed
      Introduce the core lock statistics code.
      
      Lock statistics provides lock wait-time and hold-time (as well as the count
      of corresponding contention and acquisition events). Also, the first few
      call-sites that encounter contention are tracked.
      
      Lock wait-time is the time spent waiting on the lock. This provides insight
      into the locking scheme, that is, a heavily contended lock is indicative of
      a too coarse locking scheme.
      
      Lock hold-time is the duration the lock was held; this provides a reference for
      the wait-time numbers, so they can be put into perspective.
      
        1)
          lock
        2)
          ... do stuff ..
          unlock
        3)
      
      The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
      hold-time.
      
      The lockdep held-lock tracking code is reused, because it already collects locks
      into meaningful groups (classes), and because it is an existing infrastructure
      for lock instrumentation.
      
      Currently lockdep tracks lock acquisition with two hooks:
      
        lock()
          lock_acquire()
          _lock()
      
       ... code protected by lock ...
      
        unlock()
          lock_release()
          _unlock()
      
      We need to extend this with two more hooks, in order to measure contention.
      
        lock_contended() - used to measure contention events
        lock_acquired()  - completion of the contention
      
      These are then placed the following way:
      
        lock()
          lock_acquire()
          if (!_try_lock())
            lock_contended()
            _lock()
            lock_acquired()
      
       ... do locked stuff ...
      
        unlock()
          lock_release()
          _unlock()
      
      (Note: the try_lock() 'trick' is used to avoid instrumenting all platform
             dependent lock primitive implementations.)
      
      It is also possible to toggle the two lockdep features at runtime using:
      
        /proc/sys/kernel/prove_locking
        /proc/sys/kernel/lock_stat
      
      (esp. turning off the O(n^2) prove_locking functionality can help)
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: nuke unneeded ifdefs]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
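The hook placement described in this commit message, including the try_lock() trick, can be mimicked in user space. The sketch below wraps Python's `threading.Lock`; the class name `StatLock` and its fields are assumptions for illustration, and it has nothing to do with how lockdep stores its data.

```python
import threading
import time


class StatLock:
    """A lock that records wait-time, hold-time and contention counts."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.contentions = 0
        self.wait_time = 0.0
        self.hold_time = 0.0
        self._acquired_at = 0.0

    def acquire(self):
        t1 = time.perf_counter()                    # point 1
        if not self._lock.acquire(blocking=False):  # the try_lock() trick
            # Updated before the lock is held, so this counter is only
            # approximate when several threads contend at once.
            self.contentions += 1                   # lock_contended()
            self._lock.acquire()                    # _lock()
        t2 = time.perf_counter()                    # point 2: lock_acquired()
        self.acquisitions += 1
        self.wait_time += t2 - t1
        self._acquired_at = t2

    def release(self):
        # point 3: accumulate hold-time while still holding the lock
        self.hold_time += time.perf_counter() - self._acquired_at
        self._lock.release()


lock = StatLock()
lock.acquire()
lock.release()
print(lock.acquisitions, lock.contentions)
```

Because the fast path is a single non-blocking attempt, the uncontended case pays only for two clock reads, and the underlying lock primitive needs no changes.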
    • coredump masking: bound suid_dumpable sysctl · 76fdbb25
      Kawai, Hidehiro committed
      This patch series is version 5 of the core dump masking feature, which
      controls which VMAs should be dumped based on their memory types and
      per-process flags.
      
      I adopted most of Andrew's suggestions from the previous version.  He also
      suggested using a system call instead of the /proc/<pid>/ interface, but I
      decided to keep using the latter because adding a new system call with a pid
      argument would have a big impact on the kernel.
      
      You can access the per-process flags via /proc/<pid>/coredump_filter
      interface.  coredump_filter represents a bitmask of memory types, and if a bit
      is set, VMAs of corresponding memory type are written into a core file when
      the process is dumped.  The bitmask is inherited from the parent process when
      a process is created.
      
      The original purpose is to avoid a long system slowdown when a number of
      processes which share a huge shared memory are dumped at the same time.  To
      achieve this purpose, this patch series adds an ability to suppress dumping
      anonymous shared memory for specified processes.  In this version, three other
      memory types are also supported.
      
      Here are the coredump_filter bits:
        bit 0: anonymous private memory
        bit 1: anonymous shared memory
        bit 2: file-backed private memory
        bit 3: file-backed shared memory
      
      The default value of coredump_filter is 0x3.  This means the new core dump
      routine behaves the same as the conventional one by default.
      
      In this version, coredump_filter bits and mm.dumpable are merged into
      mm.flags, and it is accessed by atomic bitops.
      
      The supported core file formats are ELF and ELF-FDPIC.  ELF has been tested,
      but ELF-FDPIC has not been built and tested because I don't have the test
      environment.
      
      This patch limits a value of suid_dumpable sysctl to the range of 0 to 2.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
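The bit assignments listed in this commit message are easy to decode by hand, but a short helper makes the default value concrete. The function names are made up for this sketch, and it assumes the value read from /proc/<pid>/coredump_filter is hexadecimal text.

```python
# Bit meanings as listed in the commit message.
COREDUMP_FILTER_BITS = {
    0: "anonymous private memory",
    1: "anonymous shared memory",
    2: "file-backed private memory",
    3: "file-backed shared memory",
}

DEFAULT_FILTER = 0x3


def parse_filter(text):
    """Parse the mask, assuming it is read as hexadecimal text."""
    return int(text.strip(), 16)


def dumped_memory_types(mask):
    """Return the memory types that would be written to the core file."""
    return [name for bit, name in COREDUMP_FILTER_BITS.items() if mask & (1 << bit)]


print(dumped_memory_types(DEFAULT_FILTER))

# Suppress anonymous shared memory, the case the series was written for.
no_shared = DEFAULT_FILTER & ~(1 << 1)
print(hex(no_shared), dumped_memory_types(no_shared))
```

Clearing bit 1 leaves 0x1, so only anonymous private memory is dumped and the large shared segment is skipped.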
    • audit: rework execve audit · bdf4c48a
      Peter Zijlstra committed
      The purpose of audit_bprm() is to log the argv array to a userspace daemon at
      the end of the execve system call.  Since user-space hasn't had time to run,
      this array is still in pristine state on the process' stack; so no need to
      copy it, we can just grab it from there.
      
      In order to minimize the damage to audit_log_*() copy each string into a
      temporary kernel buffer first.
      
      Currently the audit code requires that the full argument vector fits in a
      single packet.  So currently it does clip the argv size to a (sysctl) limit,
      but only when execve auditing is enabled.
      
      If the audit protocol gets extended to allow for multiple packets this check
      can be removed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ollie Wild <aaw@google.com>
      Cc: <linux-audit@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM: Integrate beeping flag with existing acpi_sleep flags · 77afcf78
      Pavel Machek committed
      Move "debug during resume from s2ram" into the variable we already use
      for real-mode flags to simplify code. It also closes a nasty trap for
      the user in acpi_sleep_setup; the order of parameters actually mattered
      there, with acpi_sleep=s3_bios,s3_mode doing something different from
      acpi_sleep=s3_mode,s3_bios.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 18 Jul, 2007 3 commits
    • Add common orderly_poweroff() · 10a0a8d4
      Jeremy Fitzhardinge committed
      Various pieces of code around the kernel want to be able to trigger an
      orderly poweroff.  This pulls them together into a single
      implementation.
      
      By default the poweroff command is /sbin/poweroff, but it can be set
      via sysctl: kernel/poweroff_cmd.  This is split at whitespace, so it
      can include command-line arguments.
      
      This patch replaces four other instances of invoking either "poweroff"
      or "shutdown -h now": two sbus drivers, and acpi thermal
      management.
      
      sparc64 has its own "powerd"; still need to determine whether it should
      be replaced by orderly_poweroff().
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Acked-by: Len Brown <lenb@kernel.org>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David S. Miller <davem@davemloft.net>
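Splitting the command at whitespace, as this commit message describes for kernel/poweroff_cmd, amounts to the following in user-space terms. This is only a model of the described behaviour, not the kernel code; the helper name and the `/sbin/shutdown` path are assumptions, while `/sbin/poweroff` is the default stated in the message.

```python
DEFAULT_POWEROFF_CMD = "/sbin/poweroff"


def poweroff_argv(cmd=DEFAULT_POWEROFF_CMD):
    """Split the configured command at whitespace into an argv list."""
    return cmd.split()


print(poweroff_argv())

# A command with arguments; the path is hypothetical.
print(poweroff_argv("/sbin/shutdown -h now"))
```

A plain whitespace split has no quoting rules, so under this model an argument containing a space cannot be expressed.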
    • proper prototype for proc_nr_files() · 62239ac2
      Adrian Bunk committed
      Add a proper prototype for proc_nr_files() in include/linux/fs.h
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Allow huge page allocations to use GFP_HIGH_MOVABLE · 396faf03
      Mel Gorman committed
      Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
      as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
      can be used to satisfy hugepage allocations even when the system has been
      running a long time.  This allows an administrator to resize the hugepage pool
      at runtime depending on the size of ZONE_MOVABLE.
      
      This patch adds a new sysctl called hugepages_treat_as_movable.  When a
      non-zero value is written to it, future allocations for the huge page pool
      will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
      introduce additional external fragmentation of note as huge pages are always
      the largest contiguous block we care about.
      
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 17 Jul, 2007 3 commits