1. 26 Jul, 2008 40 commits
    • O
      workqueues: implement flush_work() · db700897
      Oleg Nesterov committed
      Most users of flush_workqueue() can be changed to use cancel_work_sync(),
      but sometimes we really need to wait for the completion and cancelling is not
      an option. schedule_on_each_cpu() is a good example.
      
      Add the new helper, flush_work(work), which waits for the completion of the
      specific work_struct. More precisely, it "flushes" the result of the last
      queue_work() which is visible to the caller.
      
      For example, this code
      
      	queue_work(wq, work);
      	/* WINDOW */
      	queue_work(wq, work);
      
      	flush_work(work);
      
      doesn't necessarily work "as expected". What can happen in the WINDOW above is
      
      	- wq starts the execution of work->func()
      
      	- the caller migrates to another CPU
      
      now, after the 2nd queue_work() this work is active on the previous CPU, and
      at the same time it is queued on another. In this case flush_work(work) may
      return before the first work->func() completes.
      
      It is trivial to add another helper
      
      	int flush_work_sync(struct work_struct *work)
      	{
      		return flush_work(work) || wait_on_work(work);
      	}
      
      which works "more correctly", but it has to iterate over all CPUs and thus
      it is much slower than flush_work().
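
The WINDOW scenario can be replayed in a small userspace model. This is only a sketch: `struct toy_work`, the per-CPU flags and every `toy_*` function are hypothetical stand-ins, not kernel APIs, and "flushing" is modelled as simply finishing the work on the CPUs that are visited.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2	/* hypothetical: a two-CPU toy model */

/* Toy stand-in for a work item; not the kernel's struct work_struct. */
struct toy_work {
	bool running[NR_CPUS];	/* work->func() is executing on this CPU */
	bool pending[NR_CPUS];	/* work is queued on this CPU */
	int last_cpu;		/* CPU used by the most recent queueing */
};

/* Queue on the caller's current CPU. */
static void toy_queue(struct toy_work *w, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	w->pending[cpu] = true;
	w->last_cpu = cpu;
}

/* The worker on @cpu picks the work up and starts running it. */
static void toy_start(struct toy_work *w, int cpu)
{
	w->pending[cpu] = false;
	w->running[cpu] = true;
}

/* Wait only for the instance queued last, i.e. the one on last_cpu. */
static void toy_flush(struct toy_work *w)
{
	w->pending[w->last_cpu] = false;
	w->running[w->last_cpu] = false;
}

/* Slower variant: visit every CPU, in the spirit of flush_work_sync(). */
static void toy_flush_sync(struct toy_work *w)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		w->pending[cpu] = false;
		w->running[cpu] = false;
	}
}

static bool toy_busy(const struct toy_work *w)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (w->running[cpu] || w->pending[cpu])
			return true;
	return false;
}

/* Replays the WINDOW case; returns true if work is still busy afterwards. */
static bool toy_window_scenario(bool use_sync)
{
	struct toy_work w = { .last_cpu = 0 };

	toy_queue(&w, 0);	/* first queue_work() on CPU 0 */
	toy_start(&w, 0);	/* WINDOW: work->func() starts on CPU 0 */
	toy_queue(&w, 1);	/* caller migrated; second queue_work() on CPU 1 */

	if (use_sync)
		toy_flush_sync(&w);
	else
		toy_flush(&w);
	return toy_busy(&w);
}
```

In this model the plain flush returns while the first instance is still running on CPU 0, and only the all-CPU variant leaves the work idle.
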
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Max Krasnyansky <maxk@qualcomm.com>
      Acked-by: Jarek Poplawski <jarkao2@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      coredump: kill mm->core_done · a94e2d40
      Oleg Nesterov committed
      Now that we have the core_state->dumper list we can use it to wake up the
      sub-threads waiting for the coredump completion.

      This uglifies the code and .text grows by 47 bytes, but on the other hand
      mm_struct shrinks by sizeof(struct completion).  Also, with this change we
      can decouple exit_mm() from the coredumping code.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      coredump: construct the list of coredumping threads at startup time · b564daf8
      Oleg Nesterov committed
      binfmt->core_dump() has to iterate over all the threads in the system in
      order to find the coredumping threads and construct the list using
      GFP_ATOMIC allocations.
      
      With this patch each thread allocates the list node on exit_mm()'s stack and
      adds itself to the list.
      
      This allows us to do further changes:
      
      	- simplify ->core_dump()
      
      	- change exit_mm() to clear ->mm first, then wait for ->core_done.
      	  this makes the coredumping process visible to oom_kill
      
      	- kill mm->core_done
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      coredump: turn core_state->nr_threads into atomic_t · c5f1cc8c
      Oleg Nesterov committed
      Turn core_state->nr_threads into atomic_t and kill the now unneeded
      down_write(&mm->mmap_sem) in exit_mm().
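
Why an atomic counter can replace the lock is easiest to see in a userspace analogue. The sketch below uses C11 atomics; `struct toy_core_state`, its `startup_done` field and the `toy_*` helpers are hypothetical names for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy counterpart of core_state; only nr_threads is named in the text. */
struct toy_core_state {
	atomic_int nr_threads;	/* threads that still have to check in */
	bool startup_done;	/* stands in for the startup completion */
};

static void toy_core_state_init(struct toy_core_state *cs, int nr)
{
	assert(nr > 0);
	atomic_init(&cs->nr_threads, nr);
	cs->startup_done = false;
}

/*
 * Called once by each exiting thread. The decrement is atomic, so no
 * lock is needed to decide which thread is the last one.
 */
static bool toy_thread_checks_in(struct toy_core_state *cs)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&cs->nr_threads, 1) == 1) {
		cs->startup_done = true;	/* last thread wakes the dumper */
		return true;
	}
	return false;
}
```

Exactly one caller observes the transition to zero, which is the property the lock was previously providing.
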
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      coredump: move mm->core_waiters into struct core_state · 999d9fc1
      Oleg Nesterov committed
      Move mm->core_waiters into "struct core_state" allocated on stack.  This
      shrinks mm_struct a little bit and allows further changes.
      
      This patch mostly does s/core_waiters/core_state.  The only essential
      change is that coredump_wait() must clear mm->core_state before return.
      
      The coredump_wait() path is uglified and .text grows by 30 bytes; this
      is fixed by the next patch.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      coredump: turn mm->core_startup_done into the pointer to struct core_state · 32ecb1f2
      Oleg Nesterov committed
      mm->core_startup_done points to "struct completion startup_done" allocated
      on the coredump_wait()'s stack.  Introduce the new structure, core_state,
      which holds this "struct completion".  This way we can add more info
      visible to the threads participating in coredump without enlarging
      mm_struct.
      
      No changes in affected .o files.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      kill PF_BORROWED_MM in favour of PF_KTHREAD · 246bb0b1
      Oleg Nesterov committed
      Kill PF_BORROWED_MM.  Change use_mm/unuse_mm to not play with ->flags, and
      do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.
      
      No functional changes yet.  But this allows us to do further
      fixes/cleanups.
      
      oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
      kthreads, this is wrong because of use_mm().  The problem with
      PF_BORROWED_MM is that we need task_lock() to avoid races.  With this
      patch we can check PF_KTHREAD directly, or use a simple lockless helper:
      
      	/* The result must not be dereferenced !!! */
      	struct mm_struct *__get_task_mm(struct task_struct *tsk)
      	{
      		if (tsk->flags & PF_KTHREAD)
      			return NULL;
      		return tsk->mm;
      	}
      
      Note also ecard_task().  It runs with ->mm != NULL, but it's a kernel
      thread without PF_BORROWED_MM.
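
The difference between the two filters can be shown with a userspace model. Everything here is illustrative: `struct toy_task`, `TOY_PF_KTHREAD` (with a made-up value) and the `toy_*` helpers are assumptions, not the kernel's types or flag values.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PF_KTHREAD 0x1	/* hypothetical flag value, not the kernel's */

struct toy_mm { int users; };

struct toy_task {
	unsigned int flags;
	struct toy_mm *mm;
};

/* Old-style filter: treats any task that has an mm as a user task. */
static int looks_like_user_task_by_mm(const struct toy_task *t)
{
	return t->mm != NULL;
}

/* Flag-based filter: stays correct while a kthread borrows an mm. */
static int is_user_task_by_flag(const struct toy_task *t)
{
	return !(t->flags & TOY_PF_KTHREAD);
}

/* Models use_mm(): a kernel thread temporarily adopts a user mm. */
static void toy_use_mm(struct toy_task *kthread, struct toy_mm *mm)
{
	assert(kthread->flags & TOY_PF_KTHREAD);
	kthread->mm = mm;
}
```

Once the kernel thread has borrowed an mm, the pointer check misclassifies it while the flag check does not.
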
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      introduce PF_KTHREAD flag · 7b34e428
      Oleg Nesterov committed
      Introduce the new PF_KTHREAD flag to mark the kernel threads.  It is set
      by INIT_TASK() and copied to the forked children (we could set it in
      kthreadd() along with PF_NOFREEZE instead).
      
      daemonize() was changed as well.  In that case testing of PF_KTHREAD is
      racy, but daemonize() is hopeless anyway.
      
      This flag is cleared in do_execve(), before search_binary_handler().
      Probably not the best place, we can do this in exec_mmap() or in
      start_thread(), or clear it along with PF_FORKNOEXEC.  But I think this
      doesn't matter in practice, and if do_execve() fails kthread should die
      soon.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      ptrace: give more respect to SIGKILL · 364d3c13
      Oleg Nesterov committed
      ptrace_stop() has some complicated checks to prevent the scheduling in the
      TASK_TRACED state with the pending SIGKILL, but these checks are racy, and
      they depend on arch_ptrace_stop_needed().
      
      This patch assumes that the traced task should die asap if it was killed by
      SIGKILL, in that case schedule()->signal_pending_state() has no reason to
      ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty
      special case.
      
      Note: do_exit()->ptrace_notify() is special, the killed task can already
      dequeue SIGKILL at this point. Another indication that fatal_signal_pending()
      is not exactly right.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      res_counter: limit change support ebusy · 12b98044
      KAMEZAWA Hiroyuki committed
      Add an interface to set the limit.  This is necessary for the memory
      resource controller because it shrinks usage when the limit is set.
      
      Other controllers may not need this interface to shrink usage because
      shrinking is not necessary or impossible.
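
A minimal sketch of what a limit setter with -EBUSY semantics might look like, judging only from the commit title. `struct toy_res_counter` and `toy_set_limit()` are hypothetical, and locking is omitted; the real interface may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy counter; the kernel's struct res_counter has more fields and a lock. */
struct toy_res_counter {
	unsigned long long usage;
	unsigned long long limit;
};

/*
 * Refuse a new limit that is below the current usage. The caller (for
 * example a memory controller) may then shrink usage and try again.
 */
static int toy_set_limit(struct toy_res_counter *cnt, unsigned long long limit)
{
	assert(cnt != NULL);
	if (cnt->usage > limit)
		return -EBUSY;
	cnt->limit = limit;
	return 0;
}
```

A controller that cannot shrink usage would simply propagate the error instead of retrying.
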
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: helper function for reclaim from shmem. · c9b0ed51
      KAMEZAWA Hiroyuki committed
      A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
      replacing the non-standard usage of mem_cgroup_charge/uncharge.

      Now, shmem calls mem_cgroup_charge() just to reclaim some pages from
      mem_cgroup.  In general, shmem is used by some process group and not as a
      global resource (like file caches).  So, it's reasonable to reclaim pages
      from the mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki committed
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely patch
       5/5 ... remove redundant check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which were used only for
      refcnt management.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: better migration handling · e8589cc1
      KAMEZAWA Hiroyuki committed
      This patch changes page migration under memory controller to use a
      different algorithm.  (thanks to Christoph for new idea.)
      
      Before:
       - page_cgroup is migrated from an old page to a new page.
      After:
       - a new page is accounted, no reuse of page_cgroup.
      
      Pros:
      
       - We can avoid complicated lock dependencies and races in migration.
      
      Cons:
      
       - new param to mem_cgroup_charge_common().
      
       - mem_cgroup_getref() is added for handling ref_cnt ping-pong.
      
      This version simplifies the complicated lock dependencies in page migration
      under the memory resource controller.

      The new refcnt sequence is as follows.

      a mapped page:
        prepare_migration() ..... +1 to NEW page
        try_to_unmap()      ..... all refs to OLD page are gone.
        move_pages()        ..... +1 to NEW page if page cache.
        remap...            ..... all refs from *map* are added to NEW one.
        end_migration()     ..... -1 to NEW page.

        page's mapcount + (page_is_cache) refs are added to NEW one.
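
The bookkeeping in the sequence above can be checked with a few lines of arithmetic. This is a sketch only; `toy_new_page_refs()` is a hypothetical helper that adds up the steps for the NEW page and is not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Net references held on the NEW page once migration has finished. */
static int toy_new_page_refs(int mapcount, bool page_is_cache)
{
	int refs = 0;

	assert(mapcount >= 0);
	refs += 1;		/* prepare step: +1 to NEW page */
	/* try_to_unmap(): refs to the OLD page go away; NEW is unaffected */
	if (page_is_cache)
		refs += 1;	/* move_pages(): +1 if page cache */
	refs += mapcount;	/* remap: one ref per restored mapping */
	refs -= 1;		/* end_migration(): drop the prepare-step ref */
	return refs;
}
```

The temporary reference taken before migration cancels against the one dropped at the end, leaving mapcount plus one extra reference for page cache pages.
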
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      cgroup_clone: use pid of newly created task for new cgroup · e885dcde
      Serge E. Hallyn committed
      cgroup_clone creates a new cgroup with the pid of the task.  This works
      correctly for unshare, but for clone cgroup_clone is called from
      copy_namespaces inside copy_process, which happens before the new pid is
      created.  As a result, the new cgroup was created with current's pid.
      This patch:
      
      	1. Moves the call inside copy_process to after the new pid
      	   is created
      	2. Passes the struct pid into ns_cgroup_clone (as it is not
      	   yet attached to the task)
      	3. Passes a name from ns_cgroup_clone() into cgroup_clone()
      	   so as to keep cgroup_clone() itself simpler
      	4. Uses pid_vnr() to get the process id value, so that the
      	   pid used to name the new cgroup is always the pid as it
      	   would be known to the task which did the cloning or
      	   unsharing.  I think that is the most intuitive thing to
      	   do.  This way, task t1 does clone(CLONE_NEWPID) to get
      	   t2, which does clone(CLONE_NEWPID) to get t3, then the
      	   cgroup for t3 will be named for the pid by which t2 knows
      	   t3.
      
      (Thanks to Dan Smith for finding the main bug)
      
      Changelog:
      	June 11: Incorporate Paul Menage's feedback:  don't pass
      	         NULL to ns_cgroup_clone from unshare, and reduce
      		 patch size by using 'nodename' in cgroup_clone.
      	June 10: Original version
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Serge Hallyn <serge@us.ibm.com>
      Acked-by: Paul Menage <menage@google.com>
      Tested-by: Dan Smith <danms@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      cgroup files: convert res_counter_write() to be a cgroups write_string() handler · 856c13aa
      Paul Menage committed
      Currently res_counter_write() is a raw file handler even though it's
      ultimately taking a number, since in some cases it wants to
      pre-process the string when converting it to a number.
      
      This patch converts res_counter_write() from a raw file handler to a
      write_string() handler; this allows some of the boilerplate
      copying/locking/checking to be removed, and simplifies the cleanup path,
      since these functions are now performed by the cgroups framework.
      
      [lizf@cn.fujitsu.com: build fix]
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      cgroups: misc cleanups to write_string patchset · 84eea842
      Paul Menage committed
      This patch contains cleanups suggested by reviewers for the recent
      write_string() patchset:
      
      - pair cgroup_lock_live_group() with cgroup_unlock() in cgroup.c for
        clarity, rather than directly unlocking cgroup_mutex.
      
      - make the return type of cgroup_lock_live_group() a bool
      
      - use a #define'd constant for the local buffer size in read/write functions
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      cgroup files: move the release_agent file to use typed handlers · e788e066
      Paul Menage committed
      Adds cgroup_release_agent_write() and cgroup_release_agent_show()
      methods to handle writing/reading the path to a cgroup hierarchy's
      release agent. As a result, cgroup_common_file_read() is now unnecessary.
      
      As part of the change, a previously-tolerated race in
      cgroup_release_agent() is avoided by copying the current
      release_agent_path prior to calling call_usermode_helper().
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      cgroup files: add write_string cgroup control file method · db3b1497
      Paul Menage committed
      This patch adds a write_string() method for cgroups control files. The
      semantics are that a buffer is copied from userspace to kernelspace
      and the handler function invoked on that buffer.  The buffer is
      guaranteed to be nul-terminated, and no longer than max_write_len
      (defaulting to 64 bytes if unspecified). Later patches will convert
      existing raw file write handlers in control group subsystems to use
      this method.
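
The copy-then-dispatch semantics described above might be modelled in userspace roughly as follows. All names (`toy_write_string()`, `toy_write_string_fn`, `TOY_BUF_SIZE`) are hypothetical, `memcpy()` stands in for the copy from userspace, and the error codes are assumptions rather than what the cgroups framework actually returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_DEFAULT_MAX_WRITE_LEN 64	/* default stated in the commit text */
#define TOY_BUF_SIZE 256		/* hypothetical bound for this sketch */

typedef int (*toy_write_string_fn)(const char *buffer);

/* Copy at most max_write_len bytes, nul-terminate, then call the handler. */
static int toy_write_string(const char *src, size_t nbytes,
			    size_t max_write_len, toy_write_string_fn handler)
{
	char buf[TOY_BUF_SIZE];

	assert(src != NULL && handler != NULL);
	if (max_write_len == 0)
		max_write_len = TOY_DEFAULT_MAX_WRITE_LEN;
	if (max_write_len >= sizeof(buf))
		return -EINVAL;		/* sketch-only limitation */
	if (nbytes > max_write_len)
		return -E2BIG;		/* assumed error code */

	memcpy(buf, src, nbytes);	/* stands in for the userspace copy */
	buf[nbytes] = '\0';		/* handler always sees a C string */
	return handler(buf);
}

/* Example handler: reports how many characters it received. */
static int toy_len_handler(const char *buffer)
{
	return (int)strlen(buffer);
}
```

The handler never has to deal with length checks or termination itself, which is the boilerplate the typed method is meant to absorb.
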
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Acked-by: Balbir Singh <balbir@in.ibm.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      cgroup files: clean up whitespace in struct cftype · ce16b49d
      Paul Menage committed
      This patch removes some extraneous spaces from method declarations in
      struct cftype, to fit in with conventional kernel style.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      Mark res_counter_charge(_locked) with __must_check · f2992db2
      Pavel Emelyanov committed
      Ignoring their return values may result in counter underflow in the future -
      when the value charged will be uncharged (or in "leaks" - when the value is
      not uncharged).
      
      This also prevents the charging routines from being used to decrement the
      counter value (i.e. uncharge it) ;)
      
      (Current code works OK with res_counter, however :) )
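
A userspace illustration of the idea, assuming GCC or Clang: `toy_must_check` spells out the `warn_unused_result` attribute, and `struct toy_counter` with its helpers is a hypothetical stand-in for res_counter, not the kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Userspace spelling of a must-check annotation (GCC/Clang attribute). */
#define toy_must_check __attribute__((warn_unused_result))

struct toy_counter {
	unsigned long usage;
	unsigned long limit;
};

/* Fails without changing usage when the limit would be exceeded. */
static toy_must_check int toy_charge(struct toy_counter *c, unsigned long val)
{
	if (c->usage + val > c->limit)
		return -ENOMEM;
	c->usage += val;
	return 0;
}

static void toy_uncharge(struct toy_counter *c, unsigned long val)
{
	assert(val <= c->usage);	/* would underflow otherwise */
	c->usage -= val;
}

/* Correct caller: uncharge only what was successfully charged. */
static int toy_charge_then_release(struct toy_counter *c, unsigned long val)
{
	int ret = toy_charge(c, val);

	if (ret)
		return ret;	/* nothing was charged, so nothing to undo */
	toy_uncharge(c, val);
	return 0;
}
```

A caller that dropped the return value of `toy_charge()` and uncharged unconditionally would trip the underflow assertion on a failed charge; the attribute turns that mistake into a compiler warning.
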
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      quota: implement sending information via netlink about user below quota · 657d3bfa
      Jan Kara committed
      Sometimes it may be useful for userspace to know (e.g.  for some hosting
      guys) that some user stopped exceeding his hardlimit or softlimit in
      quotas.  Implement sending of such events to userspace via quota netlink
      protocol so that they don't have to poll for such events.  Based on idea
      and initial implementation by Vladislav Bogdanov.
      
      Cc: Vladislav Bogdanov <slava@nsys.by>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      03b06343
    • J
      quota: move function-macros from quota.h to quotaops.h · 74abb989
      Jan Kara committed
      Move declarations of some macros, which should in fact be functions, to
      quotaops.h.  This way they can later be converted to inline functions
      because we can now use declarations from quota.h.  Also add necessary
      includes of quotaops.h to a few files.
      
      [akpm@linux-foundation.org: fix JFS build]
      [akpm@linux-foundation.org: fix UFS build]
      [vegard.nossum@gmail.com: fix QUOTA=n build]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Arjen Pool <arjenpool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      quota: cleanup loop in sync_dquots() · 02a55ca8
      Jan Kara committed
      Make loop in sync_dquots() checking whether there's something to write
      more readable, remove useless variable and macro info_any_dirty() which
      is used only in this place.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      quota: rename quota functions from upper case, make bigger ones non-inline · b85f4b87
      Jan Kara committed
      Cleanup quotaops.h: Rename functions from uppercase to lowercase (and
      define backward compatibility macros), move larger functions to dquot.c
      and make them non-inline.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      fatfs: add UTC timestamp option · b271e067
      Joe Peterson committed
      Provide a new mount option ("tz=UTC") for DOS (vfat/msdos) filesystems,
      allowing timestamps to be in coordinated universal time (UTC) rather than
      local time in applications where doing this is advantageous.
      
      In particular, portable devices that use fat/vfat (such as digital
      cameras) can benefit from using UTC in their internal clocks, thus
      avoiding daylight saving time errors and general time ambiguity issues.
      The user of the device does not have to worry about changing the time when
      moving from place to place or when daylight saving changes.
      
      The new mount option, when set, disables the counter-adjustment that Linux
      currently makes to FAT timestamp info in anticipation of the normal
      userspace time zone correction.  When used in this new mode, all daylight
      saving time and time zone handling is done in userspace as is normal for
      many other filesystems (like ext3).  The default mode, which remains
      unchanged, is still appropriate when mounting volumes written in Windows
      (because of its use of local time).
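
The effect of the option on timestamp conversion can be sketched as below. `toy_fat_to_unix()` and its parameters are hypothetical; the sketch assumes the counter-adjustment is a simple offset of `minuteswest` minutes, which may not match the driver in every detail.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Convert an on-disk timestamp (already expressed in seconds) to Unix
 * time. Without tz_utc the stored value is treated as local time and
 * shifted by the system's offset west of UTC; with tz_utc it is taken
 * as UTC and returned unchanged.
 */
static long long toy_fat_to_unix(long long fat_secs, int minuteswest,
				 bool tz_utc)
{
	assert(minuteswest > -24 * 60 && minuteswest < 24 * 60);
	if (!tz_utc)
		fat_secs += (long long)minuteswest * 60;
	return fat_secs;
}
```

With the option set, all zone and daylight saving handling is left to userspace, as the text above describes.
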
      
      I originally based this patch on one submitted last year by Paul Collins,
      but I updated it to work with current source and changed variable/option
      naming.  Ogawa Hirofumi (who maintains these filesystems) and I discussed
      this patch at length on lkml, and he suggested using the option name in
      the attached version of the patch.  Barry Bouwsma pointed out a good
      addition to the patch as well.
      Signed-off-by: Joe Peterson <joe@skyrush.com>
      Signed-off-by: Paul Collins <paul@ondioline.org>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Barry Bouwsma <free_beer_for_all@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      remove unused #include <linux/dirent.h>'s · e8938a62
      Adrian Bunk committed
      Remove some unused #include <linux/dirent.h>'s.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      remove the in-kernel struct dirent{,64} · cf6ae8b5
      Adrian Bunk committed
      The kernel struct dirent{,64} were different from the ones in
      userspace.
      
      Even worse, we exported the kernel ones to userspace.
      
      But after the fat usages are fixed we can remove the conflicting
      kernel versions.
      Reviewed-by: H. Peter Anvin <hpa@kernel.org>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • R
      msdos fs: remove unsettable atari option · 7557bc66
      Rene Scharfe committed
      It has been impossible to set the option 'atari' of the MSDOS filesystem
      for several years.  Since nobody seems to have missed it, let's remove its
      remains.
      Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • O
      fat: fix VFAT_IOCTL_READDIR_xxx and cleanup for userland · 4596c8aa
      OGAWA Hirofumi committed
      "struct dirent" is a kernel type here, but is a **different type** in
      userspace!  This means both the structure and the IOCTL number are wrong!
      
      So this adds a new "struct __fat_dirent" to generate the correct IOCTL
      number, and moves the kernel-only definitions under __KERNEL__.
      Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4596c8aa
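
Why a different struct layout yields a different IOCTL number can be sketched as follows. On Linux, the _IOR() macro encodes the size of its type argument into the request number, so two layouts of different size produce two different numbers for the same command. The two demo_* structs below are hypothetical layouts chosen only so that their sizes differ; they are not the real kernel or libc definitions, and the sketch assumes a Linux userspace where <sys/ioctl.h> provides _IOR and _IOC_SIZE.

```c
#include <sys/ioctl.h>   /* _IOR() and _IOC_SIZE() on Linux */

/* Hypothetical "kernel side" layout (illustration only). */
struct demo_kernel_dirent {
	long d_ino;
	long d_off;
	unsigned short d_reclen;
	char d_name[256];
};

/* Hypothetical "userspace side" layout with one extra field,
 * so that sizeof() differs from the struct above. */
struct demo_user_dirent {
	long d_ino;
	long d_off;
	long d_extra;
	unsigned short d_reclen;
	char d_name[256];
};

/* Request number as computed when the header sees the first layout. */
unsigned long demo_nr_kernel(void)
{
	return _IOR('r', 1, struct demo_kernel_dirent[2]);
}

/* Same command, but computed against the second layout. */
unsigned long demo_nr_user(void)
{
	return _IOR('r', 1, struct demo_user_dirent[2]);
}
```

Because the size field differs, demo_nr_kernel() and demo_nr_user() return different values, which is the kind of mismatch a dedicated, fixed-layout struct avoids.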
    • J
      reiserfs: convert j_commit_lock to mutex · 90415dea
      Jeff Mahoney committed
      j_commit_lock is a semaphore but is used as if it were a mutex.  This patch
      converts it to a mutex.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Edward Shishkin <edward.shishkin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90415dea
    • J
      reiserfs: convert j_flush_sem to mutex · afe70259
      Jeff Mahoney committed
      j_flush_sem is a semaphore but is used as if it were a mutex.  This patch
      converts it to a mutex.
      
      [akpm@linux-foundation.org: fix mutex_trylock retval treatment]
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Edward Shishkin <edward.shishkin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afe70259
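
The "mutex_trylock retval treatment" note refers to a common pitfall in such conversions: the kernel's down_trylock() returns 0 when the semaphore was acquired, while mutex_trylock() returns 1 when the mutex was acquired, so the test has to be inverted. The following is a userspace sketch only; the demo_* functions are hypothetical stand-ins that mimic the two return conventions with a C11 atomic_flag, not the kernel implementations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a lock; initialise with { ATOMIC_FLAG_INIT }. */
struct demo_lock {
	atomic_flag held;
};

/* Semaphore convention (like down_trylock): 0 = acquired, 1 = contended. */
int demo_down_trylock(struct demo_lock *l)
{
	return atomic_flag_test_and_set(&l->held) ? 1 : 0;
}

/* Mutex convention (like mutex_trylock): 1 = acquired, 0 = contended. */
int demo_mutex_trylock(struct demo_lock *l)
{
	return atomic_flag_test_and_set(&l->held) ? 0 : 1;
}

void demo_unlock(struct demo_lock *l)
{
	atomic_flag_clear(&l->held);
}

/* Before the conversion: nonzero from the trylock means "did not get it". */
bool demo_try_flush_old(struct demo_lock *l)
{
	if (demo_down_trylock(l))
		return false;
	/* ... critical section ... */
	demo_unlock(l);
	return true;
}

/* After the conversion: the condition must be negated. */
bool demo_try_flush_new(struct demo_lock *l)
{
	if (!demo_mutex_trylock(l))
		return false;
	/* ... critical section ... */
	demo_unlock(l);
	return true;
}
```

Replacing the call without negating the condition would treat a successful acquisition as a failure, and a failed one as success.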
    • J
      reiserfs: convert j_lock to mutex · f68215c4
      Jeff Mahoney committed
      j_lock is a semaphore but is used as if it were a mutex.  This patch converts
      it to a mutex.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Edward Shishkin <edward.shishkin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f68215c4
    • A
      coda: remove CODA_FS_OLD_API · de0ca06a
      Adrian Bunk committed
      While fixing CONFIG_ leakages to the userspace kernel headers I ran into
      CODA_FS_OLD_API.
      
      After five years, are there still people left using the old API?
      Especially considering that you have to choose at compile time which API
      to support in the kernel (and distributions have tended to offer the new
      API for some time).
      
      Jan: "The old API can definitely go.  Around the time the new
            interface went in there were some non-Coda userspace file system
            implementations that took a while longer to convert to the new API,
            but by now they all switched to the new interface or in some cases
            to a FUSE-based solution."
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de0ca06a
    • D
      ext3: handle corrupted orphan list at mount · ae76dd9a
      Duane Griffin committed
      If the orphan node list includes valid, untruncatable nodes with nlink > 0,
      the ext3_orphan_cleanup loop which attempts to delete them will not do so,
      causing it to loop forever. Fix by checking for such nodes in the
      ext3_orphan_get function.
      
      This patch fixes the second case (image hdb.20000009.softlockup.gz)
      reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: printk warning fix]
      Signed-off-by: Duane Griffin <duaneg@dghda.com>
      Cc: <linux-ext4@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae76dd9a
    • S
      ext2: fix typo in Hurd part of include/linux/ext2_fs.h · 50c33a84
      Samuel Thibault committed
      Fix typo in Hurd part of include/linux/ext2_fs.h
      
      The ';' here is redundant and can even cause problems.  This is actually not
      used by the Linux kernel, but it is exposed in GNU/Hurd.
      Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50c33a84
    • E
      gpio: max732x driver · bbcd6d54
      Eric Miao committed
      This adds a driver supporting a family of I2C port expanders from Maxim,
      which includes the MAX7319 and MAX7320-7327 chips.
      
      [dbrownell@users.sourceforge.net: minor fixes]
      Signed-off-by: Jack Ren <jack.ren@marvell.com>
      Signed-off-by: Eric Miao <eric.miao@marvell.com>
      Acked-by: Jean Delvare <khali@linux-fr.org>
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbcd6d54
    • D
      gpio: mcp23s08 handles multiple chips per chipselect · 8f1cc3b1
      David Brownell committed
      Teach the mcp23s08 driver about a curious feature of these chips: up to
      four of them can share the same chipselect, with the SPI signals wired in
      parallel, by matching two bits in the first protocol byte against two
      address lines on the chip.
      
      This is handled by three software changes:
      
        * Platform data now holds an array of per-chip structs, not
          just one chip's address and pullup configuration.
      
        * Probe() and remove() now use another level of structure,
          wrapping an instance of the original structure for each
          mcp23s08 chip sharing that chipselect.
      
        * The HAEN bit is set, so that the hardware address bits can no
          longer be ignored (boot firmware may not have enabled them).
      
      The "one struct per chip" approach preserves the guts of the current code,
      but platform_data will need minor changes.
      
          OLD:
      	/* incorrect "slave" ID may not have mattered */
      	.slave = 3,
      	.pullups = BIT(3) | BIT(1) | BIT(0),
      
          NEW:
      	/* slave address _must_ match chip's wiring */
      	.chip[3] = {
      		.is_present = true,
      		.pullups = BIT(3) | BIT(1) | BIT(0),
      	},
      
      There's no change in how things _behave_ for spi_device nodes with a
      single mcp23s08 chip.  New multi-chip configurations assign GPIOs in
      sequence, without holes.  The spi_device just resembles a bigger
      controller, but internally it has multiple gpio_chip instances.
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f1cc3b1
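
The addressing scheme and the "GPIOs in sequence, without holes" rule from the mcp23s08 commit above can be sketched in plain C. This is an illustration under stated assumptions, not the driver's code: the opcode layout (fixed 0100 0 prefix, two address bits, then a read/write bit) is my reading of the MCP23S08 datasheet, 8 GPIOs per chip is assumed, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_CHIPS      4   /* up to four chips per chipselect */
#define DEMO_GPIOS_PER_CHIP 8   /* assumption: 8 GPIOs on each chip */

/* Hypothetical per-chip entry, modelled on the NEW platform data. */
struct demo_chip_info {
	bool is_present;
	uint8_t pullups;
};

/* First protocol byte: 0100 0 A1 A0 R/W (assumed datasheet layout).
 * The chip only responds if A1/A0 match its address pins. */
uint8_t demo_opcode(unsigned addr, bool read)
{
	assert(addr < DEMO_MAX_CHIPS);
	return (uint8_t)(0x40 | (addr << 1) | (read ? 1 : 0));
}

/* Assign GPIO bases to the present chips in address order, without
 * holes. Absent chips get base -1. Returns the total GPIO count. */
int demo_assign_bases(const struct demo_chip_info *chips, int first_gpio,
		      int *base_out)
{
	int next = first_gpio;

	for (int i = 0; i < DEMO_MAX_CHIPS; i++) {
		if (chips[i].is_present) {
			base_out[i] = next;
			next += DEMO_GPIOS_PER_CHIP;
		} else {
			base_out[i] = -1;
		}
	}
	return next - first_gpio;
}
```

With chips present only at addresses 1 and 3 and a first GPIO of 100, the sketch assigns bases 100 and 108, so the two chips look like one 16-GPIO controller.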
    • D
      gpio: sysfs interface · d8f388d8
      David Brownell committed
      This adds a simple sysfs interface for GPIOs.
      
          /sys/class/gpio
              /export ... asks the kernel to export a GPIO to userspace
              /unexport ... to return a GPIO to the kernel
              /gpioN ... for each exported GPIO #N
                  /value ... always readable, writes fail for input GPIOs
                  /direction ... r/w as: in, out (default low); write high, low
              /gpiochipN ... for each gpiochip; #N is its first GPIO
                  /base ... (r/o) same as N
                  /label ... (r/o) descriptive, not necessarily unique
                  /ngpio ... (r/o) number of GPIOs; numbered N .. N+(ngpio - 1)
      
      GPIOs claimed by kernel code may be exported by its owner using a new
      gpio_export() call, which should be most useful for driver debugging.
      Such exports may optionally be done without a "direction" attribute.
      
      Userspace may ask to take over a GPIO by writing to a sysfs control file,
      helping to cope with incomplete board support or other "one-off"
      requirements that don't merit full kernel support:
      
        echo 23 > /sys/class/gpio/export
      	... will gpio_request(23, "sysfs") and gpio_export(23);
      	use /sys/class/gpio/gpio23/direction to (re)configure it,
      	when that GPIO can be used as both input and output.
        echo 23 > /sys/class/gpio/unexport
      	... will gpio_free(23), when it was exported as above
      
      The extra D-space footprint is a few hundred bytes, except for the sysfs
      resources associated with each exported GPIO.  The additional I-space
      footprint is about two thirds of the current size of gpiolib (!).  Since
      no /dev node creation is involved, no "udev" support is needed.
      
      Related changes:
      
        * This adds a device pointer to "struct gpio_chip".  When GPIO
          providers initialize that, sysfs gpio class devices become children of
          that device instead of being "virtual" devices.
      
        * The (few) gpio_chip providers which have such a device node have
          been updated.
      
        * Some gpio_chip drivers also needed to update their module "owner"
          field ...  for which missing kerneldoc was added.
      
        * Some gpio_chips don't support input GPIOs.  Those GPIOs are now
          flagged appropriately when the chip is registered.
      
      Based on previous patches, and discussion both on and off LKML.
      
      A Documentation/ABI/testing/sysfs-gpio update is ready to submit once this
      merges to mainline.
      
      [akpm@linux-foundation.org: a few maintenance build fixes]
      Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
      Cc: Guennadi Liakhovetski <g.liakhovetski@pengutronix.de>
      Cc: Greg KH <greg@kroah.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8f388d8
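
The same export sequence that the sysfs commit above shows with echo can be driven from a C program. The sketch below is a minimal illustration, assuming the /sys/class/gpio layout described in that commit message. The demo_* helper names and the root parameter are hypothetical; root is passed in so the code can be pointed at a scratch directory instead of the real sysfs tree.

```c
#include <stdio.h>

/* Write a short string to <root>/<name>. Returns 0 on success, -1 on error. */
int demo_sysfs_write(const char *root, const char *name, const char *value)
{
	char path[256];
	FILE *f;
	int n = snprintf(path, sizeof(path), "%s/%s", root, name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;              /* path would have been truncated */
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1; /* the write may only fail on flush */
}

/* Export a GPIO and configure it as an output ("out" defaults to low). */
int demo_gpio_export_as_output(const char *root, unsigned gpio)
{
	char num[16];
	char attr[64];

	snprintf(num, sizeof(num), "%u", gpio);
	if (demo_sysfs_write(root, "export", num) != 0)
		return -1;
	snprintf(attr, sizeof(attr), "gpio%u/direction", gpio);
	return demo_sysfs_write(root, attr, "out");
}
```

On a system with this interface, demo_gpio_export_as_output("/sys/class/gpio", 23) would correspond to the echo 23 example, followed by writing "out" to the direction attribute. It fails with -1 if the GPIO cannot be exported or has no direction attribute.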
    • S
      kprobes: improve kretprobe scalability with hashed locking · ef53d9c5
      Srinivasa D S committed
      Currently the lists of kretprobe instances are stored in the kretprobe
      object (as used_instances and free_instances) and in the kretprobe hash
      table.  One global kretprobe lock serialises access to these lists, so
      only one kretprobe handler can execute at a time.  This hurts system
      performance, particularly on SMP systems and when return probes are set
      on a lot of functions (such as all system calls).
      
      The solution proposed here uses fine-grained locks, which perform better
      on SMP systems than the present kretprobe implementation.
      
      Solution:
      
       1) Instead of having one global lock to protect the kretprobe instances
          present in the kretprobe object and the kretprobe hash table, we
          will have two locks: one lock protecting the kretprobe hash table
          and another lock for the kretprobe object.
      
       2) We hold the lock present in the kretprobe object while we modify a
          kretprobe instance in the kretprobe object, and we hold the
          per-hash-list lock while modifying kretprobe instances present in
          that hash list.  To prevent deadlock, we never grab a per-hash-list
          lock while holding a kretprobe lock.
      
       3) We can remove used_instances from struct kretprobe, as we can
          track the used instances of a kretprobe using the kretprobe hash
          table.
      
      Time duration for kernel compilation ("make -j 8") on an 8-way ppc64 system
      with return probes set on all system calls looks like this.
      
                cacheline-       non-cacheline-    unpatched
                aligned patch    aligned patch     kernel
      ===========================================================
      real      9m46.784s        9m54.412s         10m2.450s
      user      40m5.715s        40m7.142s         40m4.273s
      sys       2m57.754s        2m58.583s         3m17.430s
      ===========================================================
      
      Time duration for kernel compilation ("make -j 8") on the same system, when
      the kernel is not probed.
      =========================
      real      9m26.389s
      user      40m8.775s
      sys       2m7.283s
      =========================
      Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
      Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef53d9c5
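
The hashed-locking idea in the kretprobe commit above, one lock per hash list instead of one global lock, can be sketched in userspace C. This is only an illustration of the general technique, not the kprobes code. The table size, the multiplicative hash, the C11 atomic_flag spinlock, and all demo_* names are assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_HASH_BITS  6
#define DEMO_TABLE_SIZE (1u << DEMO_HASH_BITS)

struct demo_bucket {
	atomic_flag lock;   /* per-hash-list lock */
	int ninstances;     /* stand-in for the list of instances */
};

static struct demo_bucket demo_table[DEMO_TABLE_SIZE];

/* Put every bucket lock into the unlocked state. Call once at startup. */
void demo_table_init(void)
{
	for (unsigned i = 0; i < DEMO_TABLE_SIZE; i++) {
		atomic_flag_clear(&demo_table[i].lock);
		demo_table[i].ninstances = 0;
	}
}

/* Map a key (e.g. a task pointer) to a bucket index; assumed hash. */
unsigned demo_hash(const void *key)
{
	uint64_t k = (uint64_t)(uintptr_t)key;

	return (unsigned)((k * 0x9E3779B97F4A7C15ull) >> (64 - DEMO_HASH_BITS));
}

/* Spin until this key's bucket lock is ours; other buckets stay free. */
static struct demo_bucket *demo_bucket_lock(const void *key)
{
	struct demo_bucket *b = &demo_table[demo_hash(key)];

	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;
	return b;
}

static void demo_bucket_unlock(struct demo_bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Record one instance under the bucket lock only. */
void demo_add_instance(const void *key)
{
	struct demo_bucket *b = demo_bucket_lock(key);

	b->ninstances++;
	demo_bucket_unlock(b);
}

/* Number of instances recorded in the bucket that key hashes to. */
int demo_bucket_count(const void *key)
{
	struct demo_bucket *b = demo_bucket_lock(key);
	int n = b->ninstances;

	demo_bucket_unlock(b);
	return n;
}
```

Threads whose keys hash to different buckets never contend for the same lock, which is where the scalability gain comes from. A second, per-object lock would need a fixed ordering relative to the bucket locks, as item 2 of the commit message describes.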