1. 26 10月, 2010 2 次提交
  2. 23 10月, 2010 2 次提交
    • L
      Revert "tty: Add a new file /proc/tty/consoles" · 6c2754c2
      Linus Torvalds 提交于
      This reverts commit f4a3e0bc.  Jiri
      Sladby points out that the tty structure we're using may already be
      gone, and Al Viro doesn't hold back in complaining about the random
      loading of 'filp->private_data' which doesn't have to be a pointer at
      all, nor does checking the magic field for TTY_MAGIC prove anything.
      
      Belated review by Al:
      
       "a) global variable depending on stdin of the last opener? Affecting
           output of read(2)? Really?
      
        b) iterator is broken; list should be locked in ->start(), unlocked in
           ->stop() and *NOT* unlocked/relocked in ->next()
      
        c) ->show() ought to do nothing in case of ->device == NULL, instead
           of skipping those in ->next()/->start()
      
        d) regardless of the merits of the bright idea about asterisk at that
           line in output *and* regardless of (a), the implementation is not
           only atrociously ugly, it's actually very likely to be a roothole.
           Verifying that Cthulhu knows what number happens to be address of a
           tty_struct by blindly dereferencing memory at that address...
           Ouch.
      
        Please revert that crap."
      
      And Christoph pipes in and NAK's the approach of walking fd tables etc
      too.  So it's pretty unanimous.
      Noticed-by: NJri Slaby <jslaby@suse.cz>
      Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Werner Fink <werner@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c2754c2
    • D
      tty: Add a new file /proc/tty/consoles · f4a3e0bc
      Dr. Werner Fink 提交于
      Add a new file /proc/tty/consoles to be able to determine the registered
      system console lines.  If the reading process holds /dev/console open at
      the regular standard input stream the active device will be marked by an
      asterisk.  Show possible operations and also decode the used flags of
      the listed console lines.
      Signed-off-by: NWerner Fink <werner@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      f4a3e0bc
  3. 15 10月, 2010 1 次提交
    • A
      llseek: automatically add .llseek fop · 6038f373
      Arnd Bergmann 提交于
      All file_operations should get a .llseek operation so we can make
      nonseekable_open the default for future file operations without a
      .llseek pointer.
      
      The three cases that we can automatically detect are no_llseek, seq_lseek
      and default_llseek. For cases where we can we can automatically prove that
      the file offset is always ignored, we use noop_llseek, which maintains
      the current behavior of not returning an error from a seek.
      
      New drivers should normally not use noop_llseek but instead use no_llseek
      and call nonseekable_open at open time.  Existing drivers can be converted
      to do the same when the maintainer knows for certain that no user code
      relies on calling seek on the device file.
      
      The generated code is often incorrectly indented and right now contains
      comments that clarify for each added line why a specific variant was
      chosen. In the version that gets submitted upstream, the comments will
      be gone and I will manually fix the indentation, because there does not
      seem to be a way to do that using coccinelle.
      
      Some amount of new code is currently sitting in linux-next that should get
      the same modifications, which I will do at the end of the merge window.
      
      Many thanks to Julia Lawall for helping me learn to write a semantic
      patch that does all this.
      
      ===== begin semantic patch =====
      // This adds an llseek= method to all file operations,
      // as a preparation for making no_llseek the default.
      //
      // The rules are
      // - use no_llseek explicitly if we do nonseekable_open
      // - use seq_lseek for sequential files
      // - use default_llseek if we know we access f_pos
      // - use noop_llseek if we know we don't access f_pos,
      //   but we still want to allow users to call lseek
      //
      @ open1 exists @
      identifier nested_open;
      @@
      nested_open(...)
      {
      <+...
      nonseekable_open(...)
      ...+>
      }
      
      @ open exists@
      identifier open_f;
      identifier i, f;
      identifier open1.nested_open;
      @@
      int open_f(struct inode *i, struct file *f)
      {
      <+...
      (
      nonseekable_open(...)
      |
      nested_open(...)
      )
      ...+>
      }
      
      @ read disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      <+...
      (
         *off = E
      |
         *off += E
      |
         func(..., off, ...)
      |
         E = *off
      )
      ...+>
      }
      
      @ read_no_fpos disable optional_qualifier exists @
      identifier read_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ write @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      expression E;
      identifier func;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      <+...
      (
        *off = E
      |
        *off += E
      |
        func(..., off, ...)
      |
        E = *off
      )
      ...+>
      }
      
      @ write_no_fpos @
      identifier write_f;
      identifier f, p, s, off;
      type ssize_t, size_t, loff_t;
      @@
      ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
      {
      ... when != off
      }
      
      @ fops0 @
      identifier fops;
      @@
      struct file_operations fops = {
       ...
      };
      
      @ has_llseek depends on fops0 @
      identifier fops0.fops;
      identifier llseek_f;
      @@
      struct file_operations fops = {
      ...
       .llseek = llseek_f,
      ...
      };
      
      @ has_read depends on fops0 @
      identifier fops0.fops;
      identifier read_f;
      @@
      struct file_operations fops = {
      ...
       .read = read_f,
      ...
      };
      
      @ has_write depends on fops0 @
      identifier fops0.fops;
      identifier write_f;
      @@
      struct file_operations fops = {
      ...
       .write = write_f,
      ...
      };
      
      @ has_open depends on fops0 @
      identifier fops0.fops;
      identifier open_f;
      @@
      struct file_operations fops = {
      ...
       .open = open_f,
      ...
      };
      
      // use no_llseek if we call nonseekable_open
      ////////////////////////////////////////////
      @ nonseekable1 depends on !has_llseek && has_open @
      identifier fops0.fops;
      identifier nso ~= "nonseekable_open";
      @@
      struct file_operations fops = {
      ...  .open = nso, ...
      +.llseek = no_llseek, /* nonseekable */
      };
      
      @ nonseekable2 depends on !has_llseek @
      identifier fops0.fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...  .open = open_f, ...
      +.llseek = no_llseek, /* open uses nonseekable */
      };
      
      // use seq_lseek for sequential files
      /////////////////////////////////////
      @ seq depends on !has_llseek @
      identifier fops0.fops;
      identifier sr ~= "seq_read";
      @@
      struct file_operations fops = {
      ...  .read = sr, ...
      +.llseek = seq_lseek, /* we have seq_read */
      };
      
      // use default_llseek if there is a readdir
      ///////////////////////////////////////////
      @ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier readdir_e;
      @@
      // any other fop is used that changes pos
      struct file_operations fops = {
      ... .readdir = readdir_e, ...
      +.llseek = default_llseek, /* readdir is present */
      };
      
      // use default_llseek if at least one of read/write touches f_pos
      /////////////////////////////////////////////////////////////////
      @ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read.read_f;
      @@
      // read fops use offset
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = default_llseek, /* read accesses f_pos */
      };
      
      @ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ... .write = write_f, ...
      +	.llseek = default_llseek, /* write accesses f_pos */
      };
      
      // Use noop_llseek if neither read nor write accesses f_pos
      ///////////////////////////////////////////////////////////
      
      @ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      identifier write_no_fpos.write_f;
      @@
      // write fops use offset
      struct file_operations fops = {
      ...
       .write = write_f,
       .read = read_f,
      ...
      +.llseek = noop_llseek, /* read and write both use no f_pos */
      };
      
      @ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier write_no_fpos.write_f;
      @@
      struct file_operations fops = {
      ... .write = write_f, ...
      +.llseek = noop_llseek, /* write uses no f_pos */
      };
      
      @ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      identifier read_no_fpos.read_f;
      @@
      struct file_operations fops = {
      ... .read = read_f, ...
      +.llseek = noop_llseek, /* read uses no f_pos */
      };
      
      @ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
      identifier fops0.fops;
      @@
      struct file_operations fops = {
      ...
      +.llseek = noop_llseek, /* no read or write fn */
      };
      ===== End semantic patch =====
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Julia Lawall <julia@diku.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      6038f373
  4. 02 10月, 2010 1 次提交
  5. 23 9月, 2010 2 次提交
    • K
      /proc/pid/smaps: fix dirty pages accounting · 1c2499ae
      KOSAKI Motohiro 提交于
      Currently, /proc/<pid>/smaps has wrong dirty pages accounting.
      Shared_Dirty and Private_Dirty output only pte dirty pages and ignore
      PG_dirty page flag.  It is difference against documentation, but also
      inconsistent against Referenced field.  (Referenced checks both pte and
      page flags)
      
      This patch fixes it.
      
      Test program:
      
       large-array.c
       ---------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <unistd.h>
      
       char array[1*1024*1024*1024L];
      
       int main(void)
       {
               memset(array, 1, sizeof(array));
               pause();
      
               return 0;
       }
       ---------------------------------------------------
      
      Test case:
       1. run ./large-array
       2. cat /proc/`pidof large-array`/smaps
       3. swapoff -a
       4. cat /proc/`pidof large-array`/smaps again
      
      Test result:
       <before patch>
      
      00601000-40601000 rw-p 00000000 00:00 0
      Size:            1048576 kB
      Rss:             1048576 kB
      Pss:             1048576 kB
      Shared_Clean:          0 kB
      Shared_Dirty:          0 kB
      Private_Clean:    218992 kB   <-- showed pages as clean incorrectly
      Private_Dirty:    829584 kB
      Referenced:       388364 kB
      Swap:                  0 kB
      KernelPageSize:        4 kB
      MMUPageSize:           4 kB
      
       <after patch>
      
      00601000-40601000 rw-p 00000000 00:00 0
      Size:            1048576 kB
      Rss:             1048576 kB
      Pss:             1048576 kB
      Shared_Clean:          0 kB
      Shared_Dirty:          0 kB
      Private_Clean:         0 kB
      Private_Dirty:   1048576 kB  <-- fixed
      Referenced:       388480 kB
      Swap:                  0 kB
      KernelPageSize:        4 kB
      MMUPageSize:           4 kB
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c2499ae
    • A
      /proc/vmcore: fix seeking · c227e690
      Arnd Bergmann 提交于
      Commit 73296bc6 ("procfs: Use generic_file_llseek in /proc/vmcore")
      broke seeking on /proc/vmcore.  This changes it back to use default_llseek
      in order to restore the original behaviour.
      
      The problem with generic_file_llseek is that it only allows seeks up to
      inode->i_sb->s_maxbytes, which is zero on procfs and some other virtual
      file systems.  We should merge generic_file_llseek and default_llseek some
      day and clean this up in a proper way, but for 2.6.35/36, reverting vmcore
      is the safer solution.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Reported-by: NCAI Qian <caiqian@redhat.com>
      Tested-by: NCAI Qian <caiqian@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c227e690
  6. 10 9月, 2010 2 次提交
  7. 16 8月, 2010 1 次提交
    • L
      mm: fix up some user-visible effects of the stack guard page · d7824370
      Linus Torvalds 提交于
      This commit makes the stack guard page somewhat less visible to user
      space. It does this by:
      
       - not showing the guard page in /proc/<pid>/maps
      
         It looks like lvm-tools will actually read /proc/self/maps to figure
         out where all its mappings are, and effectively do a specialized
         "mlockall()" in user space.  By not showing the guard page as part of
         the mapping (by just adding PAGE_SIZE to the start for grows-up
         pages), lvm-tools ends up not being aware of it.
      
       - by also teaching the _real_ mlock() functionality not to try to lock
         the guard page.
      
         That would just expand the mapping down to create a new guard page,
         so there really is no point in trying to lock it in place.
      
      It would perhaps be nice to show the guard page specially in
      /proc/<pid>/maps (or at least mark grow-down segments some way), but
      let's not open ourselves up to more breakage by user space from programs
      that depends on the exact deails of the 'maps' file.
      
      Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
      source code to see what was going on with the whole new warning.
      
      Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be
      Reported-by: NHenrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7824370
  8. 14 8月, 2010 1 次提交
  9. 11 8月, 2010 3 次提交
  10. 10 8月, 2010 6 次提交
    • D
      oom: deprecate oom_adj tunable · 51b1bd2a
      David Rientjes 提交于
      /proc/pid/oom_adj is now deprecated so that that it may eventually be
      removed.  The target date for removal is August 2012.
      
      A warning will be printed to the kernel log if a task attempts to use this
      interface.  Future warning will be suppressed until the kernel is rebooted
      to prevent spamming the kernel log.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51b1bd2a
    • D
      oom: badness heuristic rewrite · a63d83f4
      David Rientjes 提交于
      This a complete rewrite of the oom killer's badness() heuristic which is
      used to determine which task to kill in oom conditions.  The goal is to
      make it as simple and predictable as possible so the results are better
      understood and we end up killing the task which will lead to the most
      memory freeing while still respecting the fine-tuning from userspace.
      
      Instead of basing the heuristic on mm->total_vm for each task, the task's
      rss and swap space is used instead.  This is a better indication of the
      amount of memory that will be freeable if the oom killed task is chosen
      and subsequently exits.  This helps specifically in cases where KDE or
      GNOME is chosen for oom kill on desktop systems instead of a memory
      hogging task.
      
      The baseline for the heuristic is a proportion of memory that each task is
      currently using in memory plus swap compared to the amount of "allowable"
      memory.  "Allowable," in this sense, means the system-wide resources for
      unconstrained oom conditions, the set of mempolicy nodes, the mems
      attached to current's cpuset, or a memory controller's limit.  The
      proportion is given on a scale of 0 (never kill) to 1000 (always kill),
      roughly meaning that if a task has a badness() score of 500 that the task
      consumes approximately 50% of allowable memory resident in RAM or in swap
      space.
      
      The proportion is always relative to the amount of "allowable" memory and
      not the total amount of RAM systemwide so that mempolicies and cpusets may
      operate in isolation; they shall not need to know the true size of the
      machine on which they are running if they are bound to a specific set of
      nodes or mems, respectively.
      
      Root tasks are given 3% extra memory just like __vm_enough_memory()
      provides in LSMs.  In the event of two tasks consuming similar amounts of
      memory, it is generally better to save root's task.
      
      Because of the change in the badness() heuristic's baseline, it is also
      necessary to introduce a new user interface to tune it.  It's not possible
      to redefine the meaning of /proc/pid/oom_adj with a new scale since the
      ABI cannot be changed for backward compatability.  Instead, a new tunable,
      /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
      be used to polarize the heuristic such that certain tasks are never
      considered for oom kill while others may always be considered.  The value
      is added directly into the badness() score so a value of -500, for
      example, means to discount 50% of its memory consumption in comparison to
      other tasks either on the system, bound to the mempolicy, in the cpuset,
      or sharing the same memory controller.
      
      /proc/pid/oom_adj is changed so that its meaning is rescaled into the
      units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
      these per-task tunables will rescale the value of the other to an
      equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
      a bitshift on the badness score, it now shares the same linear growth as
      /proc/pid/oom_score_adj but with different granularity.  This is required
      so the ABI is not broken with userspace applications and allows oom_adj to
      be deprecated for future removal.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a63d83f4
    • A
      oom: move badness() declaration into oom.h · 74bcbf40
      Andrew Morton 提交于
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74bcbf40
    • K
      oom: /proc/<pid>/oom_score treat kernel thread honestly · 26ebc984
      KOSAKI Motohiro 提交于
      If a kernel thread is using use_mm(), badness() returns a positive value.
      This is not a big issue because caller take care of it correctly.  But
      there is one exception, /proc/<pid>/oom_score calls badness() directly and
      doesn't care that the task is a regular process.
      
      Another example, /proc/1/oom_score return !0 value.  But it's unkillable.
      This incorrectness makes administration a little confusing.
      
      This patch fixes it.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26ebc984
    • A
      switch procfs to ->evict_inode() · 8267952b
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8267952b
    • C
      remove inode_setattr · 1025774c
      Christoph Hellwig 提交于
      Replace inode_setattr with opencoded variants of it in all callers.  This
      moves the remaining call to vmtruncate into the filesystem methods where it
      can be replaced with the proper truncate sequence.
      
      In a few cases it was obvious that we would never end up calling vmtruncate
      so it was left out in the opencoded variant:
      
       spufs: explicitly checks for ATTR_SIZE earlier
       btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
       ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
      
      In addition to that ncpfs called inode_setattr with handcrafted iattrs,
      which allowed to trim down the opencoded variant.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      1025774c
  11. 30 7月, 2010 1 次提交
    • D
      CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials · de09a977
      David Howells 提交于
      It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
      credentials by incrementing their usage count after their replacement by the
      task being accessed.
      
      What happens is that get_task_cred() can race with commit_creds():
      
      	TASK_1			TASK_2			RCU_CLEANER
      	-->get_task_cred(TASK_2)
      	rcu_read_lock()
      	__cred = __task_cred(TASK_2)
      				-->commit_creds()
      				old_cred = TASK_2->real_cred
      				TASK_2->real_cred = ...
      				put_cred(old_cred)
      				  call_rcu(old_cred)
      		[__cred->usage == 0]
      	get_cred(__cred)
      		[__cred->usage == 1]
      	rcu_read_unlock()
      							-->put_cred_rcu()
      							[__cred->usage == 1]
      							panic()
      
      However, since a tasks credentials are generally not changed very often, we can
      reasonably make use of a loop involving reading the creds pointer and using
      atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.
      
      If successful, we can safely return the credentials in the knowledge that, even
      if the task we're accessing has released them, they haven't gone to the RCU
      cleanup code.
      
      We then change task_state() in procfs to use get_task_cred() rather than
      calling get_cred() on the result of __task_cred(), as that suffers from the
      same problem.
      
      Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
      tripped when it is noticed that the usage count is not zero as it ought to be,
      for example:
      
      kernel BUG at kernel/cred.c:168!
      invalid opcode: 0000 [#1] SMP
      last sysfs file: /sys/kernel/mm/ksm/run
      CPU 0
      Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
      745
      RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
      RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
      RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
      RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
      RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
      R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
      FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
      Stack:
       ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
      <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
      <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
      Call Trace:
       [<ffffffff810698cd>] put_cred+0x13/0x15
       [<ffffffff81069b45>] commit_creds+0x16b/0x175
       [<ffffffff8106aace>] set_current_groups+0x47/0x4e
       [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
       [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
      Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
      48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
      04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
      RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
       RSP <ffff88019e7e9eb8>
      ---[ end trace df391256a100ebdd ]---
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de09a977
  12. 30 6月, 2010 1 次提交
    • M
      nommu: add '[stack]' label to /proc/pid/maps output · 3c26c9d9
      Mike Frysinger 提交于
      Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack
      of the original thread after execve.  This is largely based on the MMU code.
      Subsidiary thread stacks are not indicated.
      
      For FDPIC, we now get:
      
      	root:/> cat /proc/self/maps
      	02064000-02067ccc rw-p 0004d000 00:01 22         /bin/busybox
      	0206e000-0206f35c rw-p 00006000 00:01 295        /lib/ld-uClibc.so.0
      	025f0000-025f6f0c r-xs 00000000 00:01 295        /lib/ld-uClibc.so.0
      	02680000-026ba6b0 r-xs 00000000 00:01 297        /lib/libc.so.0
      	02700000-0274d384 r-xs 00000000 00:01 22         /bin/busybox
      	02816000-02817000 rw-p 00000000 00:00 0
      	02848000-0284c0d8 rw-p 00000000 00:00 0
      	02860000-02880000 rw-p 00000000 00:00 0          [stack]
      
      The semi-downside here is that for FLAT, we get:
      
      	root:/> cat /proc/155/maps
      	029f0000-029f9000 rwxp 00000000 00:00 0          [stack]
      
      The reason being that FLAT combines a whole lot of stuff into one map
      (including the stack).  But this isn't any worse than the current output
      (which is nothing), so screw it.
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c26c9d9
  13. 14 6月, 2010 1 次提交
  14. 28 5月, 2010 4 次提交
  15. 25 5月, 2010 1 次提交
  16. 17 5月, 2010 1 次提交
    • F
      procfs: Kill the bkl in ioctl · c2f98050
      Frederic Weisbecker 提交于
      There are no more users of procfs that implement the ioctl
      callback. Drop the bkl from this path and warn on any use
      of this callback.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      c2f98050
  17. 12 5月, 2010 1 次提交
    • R
      revert "procfs: provide stack information for threads" and its fixup commits · 34441427
      Robin Holt 提交于
      Originally, commit d899bf7b ("procfs: provide stack information for
      threads") attempted to introduce a new feature for showing where the
      threadstack was located and how many pages are being utilized by the
      stack.
      
      Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
      applied to fix the NO_MMU case.
      
      Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
      64-bit") was applied to fix a bug in ia32 executables being loaded.
      
      Commit 9ebd4eba ("procfs: fix /proc/<pid>/stat stack pointer for kernel
      threads") was applied to fix a bug which had kernel threads printing a
      userland stack address.
      
      Commit 1306d603 ('proc: partially revert "procfs: provide stack
      information for threads"') was then applied to revert the stack pages
      being used to solve a significant performance regression.
      
      This patch nearly undoes the effect of all these patches.
      
      The reason for reverting these is it provides an unusable value in
      field 28.  For x86_64, a fork will result in the task->stack_start
      value being updated to the current user top of stack and not the stack
      start address.  This unpredictability of the stack_start value makes
      it worthless.  That includes the intended use of showing how much stack
      space a thread has.
      
      Other architectures will get different values.  As an example, ia64
      gets 0.  The do_fork() and copy_process() functions appear to treat the
      stack_start and stack_size parameters as architecture specific.
      
      I only partially reverted c44972f1 ("procfs: disable per-task stack usage
      on NOMMU") .  If I had completely reverted it, I would have had to change
      mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
      configured.  Since I could not test the builds without significant effort,
      I decided to not change mm/Makefile.
      
      I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
      information for threads on 64-bit") .  I left the KSTK_ESP() change in
      place as that seemed worthwhile.
      Signed-off-by: NRobin Holt <holt@sgi.com>
      Cc: Stefani Seibold <stefani@seibold.net>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34441427
  18. 28 4月, 2010 1 次提交
  19. 09 4月, 2010 4 次提交
    • F
      procfs: Use generic_file_llseek in /proc/vmcore · 73296bc6
      Frederic Weisbecker 提交于
      /proc/vmcore has no llseek and then falls down to use default_llseek.
      This is racy against read_vmcore() that directly manipulates fpos
      but it doesn't hold the bkl there so using it in llseek doesn't
      protect anything.
      
      Let's use generic_file_llseek() instead.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      73296bc6
    • F
      procfs: Use generic_file_llseek in /proc/kmsg · 41775e29
      Frederic Weisbecker 提交于
      No need to hold the bkl to seek here, none of the other
      fops callbacks use it.
      
      Use generic_file_llseek explicitly.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      41775e29
    • F
      procfs: Use generic_file_llseek in /proc/kcore · 34aacb29
      Frederic Weisbecker 提交于
      /proc/kcore has no llseek and then falls down to use default_llseek.
      This is racy against read_kcore() that directly manipulates fpos
      but it doesn't hold the bkl there so using it in llseek doesn't
      protect anything.
      
      Let's use generic_file_llseek() instead.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      34aacb29
    • A
      procfs: Kill BKL in llseek on proc base · 87df8424
      Arnd Bergmann 提交于
      We don't use the BKL elsewhere, so use generic_file_llseek
      so we can avoid default_llseek taking the BKL.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      [restore proc_fdinfo_file_operations as non-seekable]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      87df8424
  20. 07 4月, 2010 1 次提交
    • N
      pagemap: fix pfn calculation for hugepage · 116354d1
      Naoya Horiguchi 提交于
      When we look into pagemap using page-types with option -p, the value of
      pfn for hugepages looks wrong (see below.) This is because pte was
      evaluated only once for one vma although it should be updated for each
      hugepage.  This patch fixes it.
      
        $ page-types -p 3277 -Nl -b huge
        voffset   offset  len     flags
        7f21e8a00 11e400  1       ___U___________H_G________________
        7f21e8a01 11e401  1ff     ________________TG________________
                     ^^^
        7f21e8c00 11e400  1       ___U___________H_G________________
        7f21e8c01 11e401  1ff     ________________TG________________
                     ^^^
      
      One hugepage contains 1 head page and 511 tail pages in x86_64 and each
      two lines represent each hugepage.  Voffset and offset mean virtual
      address and physical address in the page unit, respectively.  The
      different hugepages should not have the same offset value.
      
      With this patch applied:
      
        $ page-types -p 3386 -Nl -b huge
        voffset   offset   len    flags
        7fec7a600 112c00   1      ___UD__________H_G________________
        7fec7a601 112c01   1ff    ________________TG________________
                     ^^^
        7fec7a800 113200   1      ___UD__________H_G________________
        7fec7a801 113201   1ff    ________________TG________________
                     ^^^
                     OK
      
      More info:
      
      - This patch modifies walk_page_range()'s hugepage walker.  But the
        change only affects pagemap_read(), which is the only caller of hugepage
        callback.
      
      - Without this patch, hugetlb_entry() callback is called per vma, that
        doesn't match the natural expectation from its name.
      
      - With this patch, hugetlb_entry() is called per hugepte entry and the
        callback can become much simpler.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMatt Mackall <mpm@selenic.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      116354d1
  21. 06 4月, 2010 1 次提交
  22. 05 4月, 2010 1 次提交
    • K
      proc: pagemap: Hold mmap_sem during page walk · d82ef020
      KAMEZAWA Hiroyuki 提交于
      In initial design, walk_page_range() was designed just for walking page
      table and it didn't require mmap_sem.  Now, find_vma() etc..  are used
      in walk_page_range() and we need mmap_sem around it.
      
      This patch adds mmap_sem around walk_page_range().
      
      Because /proc/<pid>/pagemap's callback routine use put_user(), we have
      to get rid of it to do sane fix.
      
      Changelog: 2010/Apr/2
       - fixed start_vaddr and end overflow
      Changelog: 2010/Apr/1
       - fixed start_vaddr calculation
       - removed unnecessary cast.
       - removed unnecessary change in smaps.
       - use GFP_TEMPORARY instead of GFP_KERNEL
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: San Mehat <san@google.com>
      Cc: Brian Swetland <swetland@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      [ Fixed kmalloc failure return code as per Matt ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d82ef020
  23. 01 4月, 2010 1 次提交
    • O
      oom: fix the unsafe usage of badness() in proc_oom_score() · b95c35e7
      Oleg Nesterov 提交于
      proc_oom_score(task) has a reference to task_struct, but that is all.
      If this task was already released before we take tasklist_lock
      
      	- we can't use task->group_leader, it points to nowhere
      
      	- it is not safe to call badness() even if this task is
      	  ->group_leader, has_intersects_mems_allowed() assumes
      	  it is safe to iterate over ->thread_group list.
      
      	- even worse, badness() can hit ->signal == NULL
      
      Add the pid_alive() check to ensure __unhash_process() was not called.
      
      Also, use "task" instead of task->group_leader. badness() should return
      the same result for any sub-thread. Currently this is not true, but
      this should be changed anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b95c35e7