1. 08 Apr, 2014 40 commits
    • lib/idr.c: use RCU_INIT_POINTER(x, NULL) · 3f59b067
      Monam Agarwal committed
      Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
      
      The rcu_assign_pointer() ensures that the initialization of a structure
      is carried out before storing a pointer to that structure.  And in the
      case of the NULL pointer, there is no structure to initialize.
      
      So, rcu_assign_pointer(p, NULL) can be safely converted to
      RCU_INIT_POINTER(p, NULL)
      Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • idr: remove dead code · 90ae3ae5
      Stephen Hemminger committed
      Remove no longer used deprecated code, and make local functions
      static.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Jean Delvare <jdelvare@suse.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: George Spelvin <linux@horizon.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmcore: continue vmcore initialization if PT_NOTE is found empty · c4082f36
      WANG Chao committed
      Currently, vmcore initialization fails when an empty PT_NOTE is
      detected.  This is too harsh, because PT_NOTE can legitimately be
      empty: for example, if a CPU was offlined but the kdump service was
      never restarted, then after a crash the PT_NOTE program header is
      present but contains no data.  It is better to warn about the empty
      PT_NOTE and continue initializing vmcore.

      Ultimately the multiple PT_NOTE segments are merged into a single one,
      and all empty PT_NOTE segments are discarded naturally during the
      merge.  So an empty PT_NOTE is not visible to user space and vmcore is
      as good as expected.
      Signed-off-by: WANG Chao <chaowang@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Greg Pearson <greg.pearson@hp.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/crash_dump.h: add vmcore_cleanup() prototype · 82e0703b
      Rashika Kheria committed
      Eliminate the following warning in fs/proc/vmcore.c:
      
        fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]
      
      [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
      Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: WSTOPPED|WCONTINUED doesn't work if a zombie leader is traced by another process · 7c733eb3
      Oleg Nesterov committed
      Even if the main thread is dead the process still can stop/continue.
      However, if the leader is ptraced wait_consider_task(ptrace => false)
      always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
      WCONTINUED can never work for the natural parent in this case.
      
      Move the "A zombie ptracee is only visible to its ptracer" check into the
      "if (!delay_group_leader(p))" block.  ->notask_error is cleared by the
      "fall through" code below.
      
      This depends on the previous change, wait_task_stopped/continued must be
      avoided if !delay_group_leader() and the tracer is ->real_parent.
      Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
      is already dead (single-threaded or not).  If it is traced by another task
      then the "stopped" state is fine until the debugger detaches and reveals a
      zombie state.
      
      Stupid test-case:
      
      	void *tfunc(void *arg)
      	{
      		sleep(1);	// wait for zombie leader
      		raise(SIGSTOP);
      		exit(0x13);
      		return NULL;
      	}
      
      	int run_child(void)
      	{
      		pthread_t thread;
      
      		if (!fork()) {
      			int tracee = getppid();
      
      			assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
      			do
      				ptrace(PTRACE_CONT, tracee, 0,0);
      			while (wait(NULL) > 0);
      
      			return 0;
      		}
      
      		sleep(1);	// wait for PTRACE_ATTACH
      		assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
      		pthread_exit(NULL);
      	}
      
      	int main(void)
      	{
      		int child, stat;
      
      		child = fork();
      		if (!child)
      			return run_child();
      
      		assert(child == waitpid(-1, &stat, WSTOPPED));
      		assert(stat == 0x137f);
      
      		kill(child, SIGCONT);
      
      		assert(child == waitpid(-1, &stat, WCONTINUED));
      		assert(stat == 0xffff);
      
      		assert(child == waitpid(-1, &stat, 0));
      		assert(stat == 0x1300);
      
      		return 0;
      	}
      
      Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
      never called.
      
      Note: this doesn't fix all problems with a zombie delay_group_leader(),
      WCONTINUED | WEXITED check is not exactly right.  debugger can't assume it
      will be notified if another thread reaps the whole thread group.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: WSTOPPED|WCONTINUED hangs if a zombie child is traced by real_parent · 377d75da
      Oleg Nesterov committed
      "A zombie is only visible to its ptracer" logic in wait_consider_task()
      is very wrong. Trivial test-case:
      
      	#include <unistd.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <assert.h>
      
      	int main(void)
      	{
      		int child = fork();
      
      		if (!child) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			return 0x23;
      		}
      
      		assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
      		assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
      		return 0;
      	}
      
      it hangs in waitpid(WSTOPPED) despite the fact it has a single zombie
      child.  This is because wait_consider_task(ptrace => 0) sees p->ptrace and
      clears ->notask_error assuming that the debugger should detach and notify
      us.
      
      Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
      child is traced by us.  This really simplifies the logic and allows us to
      do more fixes, see the next changes.  This also hides the unwanted group
      stop state automatically, we can remove another ptrace_reparented() check.
      
      Unfortunately, this adds the following behavioural changes:
      
      	1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
      	   a natural child if it is traced by the caller's sub-thread.
      
      	   Hopefully nobody will ever notice this change, and I think
      	   that nobody should rely on this behaviour anyway.
      
      	2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
      	   it is real parent.
      
      	   While this change comes as a side effect, I think it is good
      	   by itself. The group continued state can not be consumed by
      	   another process in this case, it doesn't depend on ptrace,
      	   it doesn't make sense to hide it from real parent.
      
      	   Perhaps we should add the thread_group_leader() check before
      	   wait_task_continued()? May be, but this shouldn't depend on
      	   ptrace_reparented().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: swap EXIT_ZOMBIE and EXIT_DEAD to hide EXIT_TRACE from user-space · ad86622b
      Oleg Nesterov committed
      get_task_state() uses the most significant bit to report the state to
      user-space, this means that EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD transition
      can be noticed via /proc as Z -> X -> Z change.  Note that this was
      possible even before EXIT_TRACE was introduced.
      
      This is not really bad but imho it makes sense to hide EXIT_TRACE from
      user-space completely.  So the patch simply swaps EXIT_ZOMBIE and
      EXIT_DEAD, this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: completely ignore the EXIT_DEAD tasks · b3ab0316
      Oleg Nesterov committed
      Now that EXIT_DEAD is the terminal state it doesn't make sense to call
      eligible_child() or security_task_wait() if the task is really dead.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: use EXIT_TRACE only if thread_group_leader(zombie) · b4360690
      Oleg Nesterov committed
      wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
      ptrace_reparented().  This is suboptimal and a bit confusing: we do not
      need do_notify_parent(p) if !thread_group_leader(p) and in this case we
      also do not need ptrace_unlink(), we can rely on ptrace_release_task().
      
      Change wait_task_zombie() to check thread_group_leader() along with
      ptrace_reparented() and simplify the final p->exit_state transition.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: introduce EXIT_TRACE to avoid the racy EXIT_DEAD->EXIT_ZOMBIE transition · abd50b39
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.  This was fixed by
      the previous commit, but it was a temporary hack.
      
      1. Add the new exit_state, EXIT_TRACE. It means that the task is the
         traced zombie, debugger is going to detach and notify its natural
         parent.
      
         This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
         can avoid the changes in proc/kgdb code, get_task_state() still
         reports "X (dead)" in this case.
      
         Note: with or without this change userspace can see Z -> X -> Z
         transition. Not really bad, but probably makes sense to fix.
      
      2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
         if we need to notify the ->real_parent.
      
      3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
         is always the final state we can safely ignore such a task.
      
      4. Change wait_consider_task() to check EXIT_TRACE separately and kill
         the racy and no longer needed ptrace_reparented() case.
      
         If ptrace == T an EXIT_TRACE thread should be simply ignored, the
         owner of this state is going to ptrace_unlink() this task. We can
         pretend that it was already removed from ->ptraced list.
      
         Otherwise we should skip this thread too but clear ->notask_error,
         we must be the natural parent and debugger is going to untrace and
         notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
         even if the task was already untraced.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • wait: fix reparent_leader() vs EXIT_DEAD->EXIT_ZOMBIE race · dfccbb5e
      Oleg Nesterov committed
      wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
      drops tasklist_lock.  If this task is not the natural child and it is
      traced, we change its state back to EXIT_ZOMBIE for ->real_parent.
      
      The last transition is racy, this is even documented in 50b8d257
      "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
      race".  wait_consider_task() tries to detect this transition and clear
      ->notask_error but we can't rely on ptrace_reparented(), debugger can
      exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.
      
      And there is another problem which was missed before: this transition
      can also race with reparent_leader() which doesn't reset ->exit_signal if
      EXIT_DEAD, assuming that this task must be reaped by someone else.  So
      the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
      /sbin/init doesn't use __WALL it becomes unreapable.

      Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
      Note: this is a simple temporary hack for -stable, it doesn't try to
      solve all problems, it will be reverted by the next changes.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
      Reported-by: Michal Schmidt <mschmidt@redhat.com>
      Tested-by: Michal Schmidt <mschmidt@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: kill bprm->tcomm[], simplify the "basename" logic · 23aebe16
      Oleg Nesterov committed
      Starting from commit c4ad8f98 ("execve: use 'struct filename *' for
      executable name passing") bprm->filename can not go away after
      flush_old_exec(), so we do not need to save the binary name in
      bprm->tcomm[] added by 96e02d15 ("exec: fix use-after-free bug in
      setup_new_exec()").
      
      And there was never a need for filename_to_taskname-like code, we can
      simply do set_task_comm(kbasename(filename)).
      
      This patch has to change set_task_comm() and trace_task_rename() to
      accept "const char *", but I think this change is also good.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: make /proc/*/pagemap 0400 · 32ed74a4
      Djalal Harouni committed
      The /proc/*/pagemap files contain sensitive information and currently
      their mode is 0444.  Change this to 0400, so the VFS will prevent
      unprivileged processes from getting file descriptors on arbitrary
      privileged /proc/*/pagemap files.
      
      This reduces the scope of address space leaking and bypasses by protecting
      already running processes.
      Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • procfs: make /proc/*/{stack,syscall,personality} 0400 · 35a35046
      Djalal Harouni committed
      These procfs files contain sensitive information and currently their
      mode is 0444.  Change this to 0400, so the VFS will be able to block
      unprivileged processes from getting file descriptors on arbitrary
      privileged /proc/*/{stack,syscall,personality} files.
      
      This reduces the scope of ASLR leaking and bypasses by protecting already
      running processes.
      Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/inode.c: use RCU_INIT_POINTER(x, NULL) · 1c44dbc8
      Monam Agarwal committed
      Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)
      
      The rcu_assign_pointer() ensures that the initialization of a structure
      is carried out before storing a pointer to that structure.  And in the
      case of the NULL pointer, there is no structure to initialize.  So,
      rcu_assign_pointer(p, NULL) can be safely converted to
      RCU_INIT_POINTER(p, NULL)
      Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • proc: show mnt_id in /proc/pid/fdinfo · 49d063cb
      Andrey Vagin committed
      Currently we don't have a way to determine from which mount point a
      file has been opened.  This information is required for proper dumping
      and restoring of file descriptors due to the presence of mount
      namespaces.  It's possible that two file descriptors are opened using
      the same path, but one fd references a mount point from one namespace
      while the other fd references one from another namespace.
      
      $ ls -l /proc/1/fd/1
      lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null
      
      $ cat /proc/1/fdinfo/1
      pos:	0
      flags:	0100002
      mnt_id:	16
      
      $ cat /proc/1/mountinfo | grep ^16
      16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/proc/meminfo: meminfo_proc_show(): fix typo in comment · f0b5664b
      Luiz Capitulino committed
      It should read "reclaimable slab" and not "reclaimable swap".
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/exit.c: call proc_exit_connector() after exit_state is set · ef982393
      Guillaume Morin committed
      The process events connector delivers a notification when a process
      exits.  This is really convenient for a process that spawns children and
      wants to monitor them through an epoll()-able interface.

      Unfortunately, there is a small window between when the event is
      delivered and when the child becomes wait()-able.

      This creates a race if the parent wants to make sure that it knows
      about the exit, e.g.:
      
      pid_t pid = fork();
      if (pid > 0) {
      	register_interest_for_pid(pid);
      	if (waitpid(pid, NULL, WNOHANG) > 0)
      	{
      	  /* We might have raced with exit() */
      	}
      	return;
      }
      
      /* Child */
      execve(...)
      
      register_interest_for_pid() would be telling the connector socket
      reader to pay attention to events related to pid.
      
      Though this is not a bug, I think it would make the connector a bit more
      usable if this race was closed by simply moving the call to
      proc_exit_connector() from just before exit_notify() to right after.
      
      Oleg said:
      
      : Even with this patch the code above is still "racy" if the child is
      : multi-threaded.  Plus it should obviously filter-out subthreads.  And
      : afaics there is no way to make it reliable, even if you change the code
      : above so that waitpid() is called only after the last thread exits WNOHANG
      : still can fail.
      Signed-off-by: Guillaume Morin <guillaume@morinfr.org>
      Cc: Matt Helsley <matt.helsley@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: move check_stack_usage() to the end of do_exit() · 4bcb8232
      Oleg Nesterov committed
      It is not clear why check_stack_usage() is called so early and thus it
      never checks the stack usage in, say, exit_notify() or
      flush_ptrace_hw_breakpoint() or other functions which are only called by
      do_exit().
      
      Move the callsite down to the last preempt_disable/schedule.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: call disassociate_ctty() before exit_task_namespaces() · c39df5fa
      Oleg Nesterov committed
      Commit 8aac6270 ("move exit_task_namespaces() outside of
      exit_notify()") breaks pppd and the exiting service crashes the kernel:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: ppp_register_channel+0x13/0x20 [ppp_generic]
          Call Trace:
            ppp_asynctty_open+0x12b/0x170 [ppp_async]
            tty_ldisc_open.isra.2+0x27/0x60
            tty_ldisc_hangup+0x1e3/0x220
            __tty_hangup+0x2c4/0x440
            disassociate_ctty+0x61/0x270
            do_exit+0x7f2/0xa50
      
      ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.
      
      Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
      sense to delay it after perf_event_exit_task() or cgroup_exit().
      
      This also allows the use of task_work_add() inside the (nontrivial) code
      paths in disassociate_ctty().
      
      Investigated by Peter Hurley.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Sree Harsha Totakura <sreeharsha@totakura.in>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: remove unnecessary parentheses · 5d2d42de
      SeongJae Park committed
      Fix the following trivial checkpatch error:
      
        ERROR: return is not a function, parentheses are not required
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: support multiple swap devices · 60105e12
      Minchan Kim committed
      Cai Liu reported that zbud pool page counting is broken when multiple
      swap devices are used: it counts only one swap device instead of all of
      them, so zswap cannot control writeback properly.  The result is
      unnecessary writeback, or no writeback when we really should write back.

      IOW, it made zswap crazy.

      Another problem in zswap is:

      For example, assume we use two swap devices, A and B, with different
      priorities.  A was charged 19% a long time ago and is now full, so the
      VM starts to use B, and B has recently been charged 1%.  That means the
      zswap charge (19% + 1%) is full under the default limit.  Then, if the
      VM wants to swap out more pages into B, zbud_reclaim_page() would evict
      one of the pages in B's pool, and this would repeat continuously.  This
      is a complete reversal of LRU order, and swap thrashing would happen
      in B.

      This patch makes zswap handle multiple swap devices by creating *a*
      zbud pool that is shared by all of them, so all zswap pages across the
      swap devices are kept in LRU order, which prevents the two problems
      above.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Cai Liu <cai.liu@samsung.com>
      Suggested-by: Weijie Yang <weijie.yang.kh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: update zsmalloc in comment to zbud · 6335b193
      SeongJae Park committed
      zswap used zsmalloc before and now uses zbud.  But some comments still
      say it uses zsmalloc.  Fix these trivial problems.
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: fix trivial typo and arrange indentation · 6b452516
      SeongJae Park committed
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: support REQ_DISCARD · f4659d8e
      Joonsoo Kim committed
      zram is a RAM-based block device and can be used as the backend of a
      filesystem.  When a filesystem deletes a file, it normally does nothing
      to the file's data blocks; it only updates the file's metadata.  This
      behavior is not a problem on a disk-based block device, but it is a
      problem on a RAM-based block device, since the memory used for the data
      blocks cannot be freed.  REQ_DISCARD exists to overcome this
      disadvantage.  If the block device supports REQ_DISCARD and the
      filesystem is mounted with the discard option, the filesystem sends
      REQ_DISCARD to the block device whenever data blocks are discarded.  All
      we have to do is handle this request.

      This patch sets QUEUE_FLAG_DISCARD and handles the REQ_DISCARD request.
      With it, memory used by zram can be freed when it is no longer in use.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use scnprintf() in attrs show() methods · 56b4e8cb
      Sergey Senozhatsky committed
      sysfs.txt documentation lists the following requirements:
      
       - The buffer will always be PAGE_SIZE bytes in length. On i386, this
         is 4096.
      
       - show() methods should return the number of bytes printed into the
         buffer. This is the return value of scnprintf().
      
       - show() should always use scnprintf().
      
      Use scnprintf() in show() functions.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56b4e8cb
    • M
      zram: propagate error to user · 60a726e3
      Minchan Kim committed
      When zcomp has been initialized as single-stream, max_comp_streams
      cannot be changed without a zram reset.  The current interface,
      however, reports no error to the user and even updates the
      max_comp_streams value although the change has no effect, which is
      very confusing.
      
      This patch prevents max_comp_streams from being changed when zcomp was
      initialized as a single-stream zcomp and returns an error to the user
      (e.g. to echo).
      
      [akpm@linux-foundation.org: don't return with the lock held, per Sergey]
      [fengguang.wu@intel.com: fix coccinelle warnings]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60a726e3
    • S
      zram: return error-valued pointer from zcomp_create() · fcfa8d95
      Sergey Senozhatsky committed
      Instead of returning just NULL, return ERR_PTR from zcomp_create() if
      creation of the compressing backend has failed: ERR_PTR(-EINVAL) for an
      unsupported compression algorithm request, ERR_PTR(-ENOMEM) for an
      allocation (zcomp or compression stream) error.
      
      Check the value returned from zcomp_create() with IS_ERR() in
      disksize_store() and set the return code to PTR_ERR().
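      
      The error-pointer convention can be sketched in userspace as follows.
      The three helpers are simplified re-creations of the kernel macros, and
      `struct backend', `backend_create()' and `backend_setup()' are
      hypothetical names that only imitate the shape of the change; they are
      not the zcomp code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical backend object for this sketch. */
struct backend {
	const char *name;
};

/* Returns a valid pointer on success or an encoded errno on failure. */
struct backend *backend_create(const char *name)
{
	struct backend *b;

	if (strcmp(name, "lzo") != 0)
		return ERR_PTR(-EINVAL);	/* unsupported algorithm */
	b = malloc(sizeof(*b));
	if (!b)
		return ERR_PTR(-ENOMEM);	/* allocation failure */
	b->name = "lzo";
	return b;
}

/* Caller side: turn an error pointer back into an errno-style code. */
long backend_setup(const char *name, struct backend **out)
{
	struct backend *b = backend_create(name);

	if (IS_ERR(b))
		return PTR_ERR(b);
	*out = b;
	return 0;
}
```

      The benefit over a bare NULL is that the caller can distinguish an
      invalid request from memory exhaustion and report the matching error.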
      
      Change suggested by Jerome Marchand.
      
      [akpm@linux-foundation.org: clean up error recovery flow]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fcfa8d95
    • S
      zram: move comp allocation out of init_lock · d61f98c7
      Sergey Senozhatsky committed
      While fixing lockdep spew of ->init_lock reported by Sasha Levin [1],
      Minchan Kim noted [2] that it's better to move compression backend
      allocation (using GFP_KERNEL) out of the ->init_lock lock, the same way
      as with zram_meta_alloc(), in order to prevent the same lockdep spew.
      
      [1] https://lkml.org/lkml/2014/2/27/337
      [2] https://lkml.org/lkml/2014/3/3/32
      
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d61f98c7
    • S
      zram: add lz4 algorithm backend · 6e76668e
      Sergey Senozhatsky committed
      Introduce LZ4 compression backend and make it available for selection.
      LZ4 support is optional and requires the user to set the
      ZRAM_LZ4_COMPRESS config option.  The default compression backend is LZO.
      
      TEST
      
      (x86_64, core i5, 2 cores + 2 hyperthreading, zram disk size 1G,
      ext4 file system, 3 compression streams)
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             Test           LZO           LZ4
      ----------------------------------------------
        Initial write   1642744.62    1317005.09
              Rewrite   2498980.88    1800645.16
                 Read   3957026.38    5877043.75
              Re-read   3950997.38    5861847.00
         Reverse Read   2937114.56    5047384.00
          Stride read   2948163.19    4929587.38
          Random read   3292692.69    4880793.62
       Mixed workload   1545602.62    3502940.38
         Random write   2448039.75    1758786.25
               Pwrite   1670051.03    1338329.69
                Pread   2530682.00    5097177.62
               Fwrite   3232085.62    3275942.56
                Fread   6306880.25    6645271.12
      
      So on my system LZ4 is slower in write-only tests, while it performs
      better in read-only and mixed (reads + writes) tests.
      
      Official LZ4 benchmarks are available at http://code.google.com/p/lz4/
      (the Linux kernel uses revision r90).
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e76668e
    • S
      zram: make compression algorithm selection possible · e46b8a03
      Sergey Senozhatsky committed
      Add and document the `comp_algorithm' device attribute.  This attribute
      shows the supported compression algorithms, with the currently selected
      one in square brackets:
      
      	cat /sys/block/zram0/comp_algorithm
      	[lzo] lz4
      
      and changes the selected compression algorithm:
      
      	echo lzo > /sys/block/zram0/comp_algorithm
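      
      The output format shown above can be produced along these lines.  This
      is a userspace sketch only: the `backends' table and
      `algorithms_show()' are hypothetical, and the real attribute may build
      its string differently.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical list of built-in backends for this sketch. */
static const char *const backends[] = { "lzo", "lz4" };

/*
 * Write all backend names into buf, space separated, with the currently
 * selected one in square brackets.  Returns the number of bytes written
 * (excluding the trailing NUL); output is truncated if buf is too small.
 */
size_t algorithms_show(const char *selected, char *buf, size_t size)
{
	size_t count = sizeof(backends) / sizeof(backends[0]);
	size_t used = 0;

	if (size == 0)
		return 0;
	buf[0] = '\0';
	for (size_t i = 0; i < count && used < size - 1; i++) {
		size_t room = size - used;
		int n;

		if (strcmp(backends[i], selected) == 0)
			n = snprintf(buf + used, room, "[%s] ", backends[i]);
		else
			n = snprintf(buf + used, room, "%s ", backends[i]);
		if (n < 0)
			break;
		/* snprintf() reports the untruncated length; clamp it. */
		used += (size_t)n < room ? (size_t)n : room - 1;
	}
	if (used > 0 && buf[used - 1] == ' ')
		buf[used - 1] = '\n';	/* trailing space becomes the newline */
	return used;
}
```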
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e46b8a03
    • S
      zram: add set_max_streams knob · fe8eb122
      Sergey Senozhatsky committed
      This patch makes it possible to change max_comp_streams on an
      initialised zcomp.
      
      Introduce zcomp set_max_streams() knob, zcomp_strm_multi_set_max_streams()
      and zcomp_strm_single_set_max_streams() callbacks to change streams limit
      for zcomp_strm_multi and zcomp_strm_single, respectively.  set_max_streams
      for single stream zcomp does nothing.
      
      If the user has lowered the limit, zcomp_strm_multi_set_max_streams()
      attempts to immediately free extra streams (as many as it can, depending
      on idle streams availability).
      
      Note, this patch does not allow changing the stream 'policy' from single
      to multi stream (or vice versa) on an already initialised compression
      backend.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe8eb122
    • S
      zram: add multi stream functionality · beca3ec7
      Sergey Senozhatsky committed
      Existing zram (zcomp) implementation has only one compression stream
      (buffer and algorithm private part), so in order to prevent data
      corruption only one write (compress operation) can use this compression
      stream, forcing all concurrent write operations to wait for stream lock
      to be released.  This patch changes zcomp to keep a compression streams
      list of user-defined size (via sysfs device attr).  Each write operation
      still exclusively holds compression stream, the difference is that we
      can have N write operations (depending on size of streams list)
      executing in parallel.  See TEST section later in commit message for
      performance data.
      
      Introduce struct zcomp_strm_multi and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_multi has a list of idle
      zcomp_strm structs, spinlock to protect idle list and wait queue, making
      it possible to perform parallel compressions.
      
      The following set of functions is added:
      - zcomp_strm_multi_find()/zcomp_strm_multi_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_multi_create()/zcomp_strm_multi_destroy()
        create and destroy zcomp_strm_multi
      
      zcomp ->strm_find() and ->strm_release() callbacks are set during
      initialisation to zcomp_strm_multi_find()/zcomp_strm_multi_release()
      correspondingly.
      
      Each time zcomp issues a zcomp_strm_multi_find() call, the following set
      of operations is performed:
      
      - spin lock strm_lock
      - if idle list is not empty, remove zcomp_strm from idle list, spin
        unlock and return zcomp stream pointer to caller
      - if idle list is empty, the current task adds itself to the wait queue.
        It will be woken up by the zcomp_strm_multi_release() caller.
      
      zcomp_strm_multi_release():
      - spin lock strm_lock
      - add zcomp stream to idle list
      - spin unlock, wake up sleeper
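      
      The find/release protocol above can be sketched in userspace with
      POSIX threads.  This is an analogy, not the kernel code: a
      pthread_mutex_t stands in for the strm_lock spinlock, a pthread_cond_t
      stands in for the wait queue, and `struct stream', `struct stream_pool',
      `pool_init()', `pool_find()' and `pool_release()' are hypothetical
      names.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_STREAMS 4	/* assumed upper bound for this sketch */

/* Hypothetical stand-in for zcomp_strm: a per-stream work buffer. */
struct stream {
	struct stream *next;	/* link in the idle list */
	unsigned char buffer[4096];
};

/* Idle list plus the lock and wait queue that guard it. */
struct stream_pool {
	pthread_mutex_t lock;	/* stands in for the strm_lock spinlock */
	pthread_cond_t wait;	/* stands in for the wait queue */
	struct stream *idle;
	struct stream storage[MAX_STREAMS];
};

/* Put nr_streams streams on the idle list. Returns 0 on success. */
int pool_init(struct stream_pool *pool, int nr_streams)
{
	if (nr_streams < 1 || nr_streams > MAX_STREAMS)
		return -1;
	pthread_mutex_init(&pool->lock, NULL);
	pthread_cond_init(&pool->wait, NULL);
	pool->idle = NULL;
	for (int i = 0; i < nr_streams; i++) {
		pool->storage[i].next = pool->idle;
		pool->idle = &pool->storage[i];
	}
	return 0;
}

/* Take a stream off the idle list, sleeping until one is released. */
struct stream *pool_find(struct stream_pool *pool)
{
	struct stream *s;

	pthread_mutex_lock(&pool->lock);
	while (!pool->idle)
		pthread_cond_wait(&pool->wait, &pool->lock);
	s = pool->idle;
	pool->idle = s->next;
	pthread_mutex_unlock(&pool->lock);
	return s;
}

/* Put a stream back on the idle list and wake one sleeper. */
void pool_release(struct stream_pool *pool, struct stream *s)
{
	pthread_mutex_lock(&pool->lock);
	s->next = pool->idle;
	pool->idle = s;
	pthread_mutex_unlock(&pool->lock);
	pthread_cond_signal(&pool->wait);
}
```

      Each caller owns its stream exclusively between pool_find() and
      pool_release(), so up to nr_streams compressions can proceed in
      parallel while further callers sleep.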
      
      Minchan Kim reported that the spinlock-based locking scheme has
      demonstrated a severe performance regression for the single compression
      stream case, compared to the mutex-based one (see
      https://lkml.org/lkml/2014/2/18/16):
      
      base                      spinlock                    mutex
      
      ==Initial write           ==Initial write             ==Initial  write
      records:  5               records:  5                 records:   5
      avg:      1642424.35      avg:      699610.40         avg:       1655583.71
      std:      39890.95(2.43%) std:      232014.19(33.16%) std:       52293.96
      max:      1690170.94      max:      1163473.45        max:       1697164.75
      min:      1568669.52      min:      573429.88         min:       1553410.23
      ==Rewrite                 ==Rewrite                   ==Rewrite
      records:  5               records:  5                 records:   5
      avg:      1611775.39      avg:      501406.64         avg:       1684419.11
      std:      17144.58(1.06%) std:      15354.41(3.06%)   std:       18367.42
      max:      1641800.95      max:      531356.78         max:       1706445.84
      min:      1593515.27      min:      488817.78         min:       1655335.73
      
      When only one compression stream is available, a mutex with spin on
      owner tends to perform much better than frequent wait_event()/wake_up().
      This is why single stream is implemented as a special case with mutex
      locking.
      
      Introduce and document zram device attribute max_comp_streams.  This
      attr shows and stores current zcomp's max number of zcomp streams
      (max_strm).  Extend zcomp's zcomp_create() with `max_strm' parameter.
      `max_strm' limits the number of zcomp_strm structs in compression
      backend's idle list (max_comp_streams).
      
      max_comp_streams is used during initialisation as follows:
      -- passing max_strm equal to 1 to zcomp_create() will initialise zcomp
      using single compression stream zcomp_strm_single (mutex-based locking).
      -- passing max_strm greater than 1 to zcomp_create() will initialise zcomp
      using multi compression stream zcomp_strm_multi (spinlock-based locking).
      
      The default max_comp_streams value is 1, meaning that zram with a single
      stream will be initialised.
      
      A later patch will introduce a configuration knob to change
      max_comp_streams on an already initialised and used zcomp.
      
      TEST
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test           base       1 strm (mutex)     3 strm (spinlock)
      -----------------------------------------------------------------------
       Initial write      589286.78       583518.39          718011.05
             Rewrite      604837.97       596776.38         1515125.72
        Random write      584120.11       595714.58         1388850.25
              Pwrite      535731.17       541117.38          739295.27
              Fwrite     1418083.88      1478612.72         1484927.06
      
      Usage example:
      set max_comp_streams to 4
              echo 4 > /sys/block/zram0/max_comp_streams
      
      show current max_comp_streams (default value is 1).
              cat /sys/block/zram0/max_comp_streams
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      beca3ec7
    • S
      zram: factor out single stream compression · 9cc97529
      Sergey Senozhatsky committed
      This is a preparation patch to add multi stream support to zcomp.
      
      Introduce struct zcomp_strm_single and a set of functions to manage
      zcomp_strm stream access.  zcomp_strm_single implements a single
      compression stream, the same way as the current zcomp implementation.
      This moves zcomp_strm stream control and locking out of zcomp, so the
      compressing backend zcomp is not aware of the required locking.
      
      Single and multi streams require different locking schemes.  Minchan Kim
      reported that the spinlock-based locking scheme (which is used in the
      multi stream implementation) has demonstrated a severe performance
      regression for the single compression stream case, compared to the
      mutex-based one; see https://lkml.org/lkml/2014/2/18/16
      
      The following set of functions is added:
      - zcomp_strm_single_find()/zcomp_strm_single_release()
        find and release a compression stream, implement required locking
      - zcomp_strm_single_create()/zcomp_strm_single_destroy()
        create and destroy zcomp_strm_single
      
      New ->strm_find() and ->strm_release() callbacks are added to zcomp and
      set to zcomp_strm_single_find() and zcomp_strm_single_release() during
      initialisation.  Instead of direct locking and zcomp_strm access from
      zcomp_strm_find() and zcomp_strm_release(), zcomp now calls ->strm_find()
      and ->strm_release() correspondingly.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cc97529
    • S
      zram: use zcomp compressing backends · b7ca232e
      Sergey Senozhatsky committed
      Do not perform direct LZO compress/decompress calls, initialise
      and use zcomp LZO backend (single compression stream) instead.
      
      [akpm@linux-foundation.org: resolve conflicts with zram-delete-zram_init_device-fix.patch]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7ca232e
    • S
      zram: introduce compressing backend abstraction · e7e1ef43
      Sergey Senozhatsky committed
      ZRAM performs direct LZO compression algorithm calls, making it the one
      and only option.  While LZO generally performs well, the LZ4 algorithm
      tends to have faster decompression (see http://code.google.com/p/lz4/
      for the full report):
      
      	Name            Ratio  C.speed D.speed
      	                        MB/s    MB/s
      	LZ4 (r101)      2.084    422    1820
      	LZO 2.06        2.106    414     600
      
      Thus, users who have mostly read (decompress) usage scenarios or a mixed
      workflow (writes with a relatively high number of read ops) will benefit
      from using the LZ4 compression backend.
      
      Introduce compressing backend abstraction zcomp in order to support
      multiple compression algorithms with the following set of operations:
      
              .create
              .destroy
              .compress
              .decompress
      
      Schematically, zram write() usually consists of the following steps:
      0) preparation (decompression of partial IO, etc.)
      1) lock buffer_lock mutex (protects meta compress buffers)
      2) compress (using meta compress buffers)
      3) alloc and map zs_pool object
      4) copy compressed data (from meta compress buffers) to object allocated by 3)
      5) free previous pool page, assign a new one
      6) unlock buffer_lock mutex
      
      As we can see, compressing buffers must remain untouched from 1) to 4),
      because, otherwise, concurrent write() can overwrite data.  At the same
      time, zram_meta must be aware of a) specific compression algorithm memory
      requirements and b) necessary locking to protect compression buffers.  To
      remove requirement a), a new struct zcomp_strm is introduced, which
      contains a compress/decompress `buffer' and compression algorithm
      `private' part.  struct zcomp, in turn, implements zcomp_strm stream
      handling and locking, which removes requirement b) from zram meta.  zcomp
      ->create() and ->destroy(), respectively, allocate and deallocate the
      algorithm specific zcomp_strm `private' part.
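      
      A rough userspace sketch of such an abstraction follows.  It is not
      the zcomp code: `struct strm', `struct backend_ops', the function
      signatures and the toy "store" backend (which copies data unchanged
      instead of compressing it) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define STRM_BUF_SIZE 4096	/* assumed scratch buffer size */

/* Per-stream state: a scratch buffer plus algorithm-private data. */
struct strm {
	unsigned char buffer[STRM_BUF_SIZE];
	void *private;
};

/* Operations every backend provides: create, destroy, compress, decompress. */
struct backend_ops {
	const char *name;
	void *(*create)(void);
	void (*destroy)(void *private);
	int (*compress)(const unsigned char *src, size_t src_len,
			unsigned char *dst, size_t *dst_len, void *private);
	int (*decompress)(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t dst_cap);
};

/* Toy "store" backend: copies data unchanged.  A stand-in, not LZO. */
static void *store_create(void)
{
	return calloc(1, sizeof(size_t));
}

static void store_destroy(void *private)
{
	free(private);
}

static int store_compress(const unsigned char *src, size_t src_len,
			  unsigned char *dst, size_t *dst_len, void *private)
{
	if (src_len > STRM_BUF_SIZE)
		return -1;
	memcpy(dst, src, src_len);
	*dst_len = src_len;
	*(size_t *)private += 1;	/* private state: number of calls */
	return 0;
}

static int store_decompress(const unsigned char *src, size_t src_len,
			    unsigned char *dst, size_t dst_cap)
{
	if (src_len > dst_cap)
		return -1;
	memcpy(dst, src, src_len);
	return 0;
}

const struct backend_ops store_backend = {
	.name = "store",
	.create = store_create,
	.destroy = store_destroy,
	.compress = store_compress,
	.decompress = store_decompress,
};

/* Allocate a stream and its algorithm-private part via ->create(). */
struct strm *strm_alloc(const struct backend_ops *ops)
{
	struct strm *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->private = ops->create();
	if (!s->private) {
		free(s);
		return NULL;
	}
	return s;
}

/* Release the private part via ->destroy(), then the stream itself. */
void strm_free(const struct backend_ops *ops, struct strm *s)
{
	ops->destroy(s->private);
	free(s);
}
```

      The caller only sees the ops table and the stream, so it needs no
      knowledge of how much memory a given algorithm requires.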
      
      Every zcomp has a zcomp stream and a mutex to protect its compression
      stream.  Stream usage semantics remain the same -- only one write can hold
      the stream lock and use its buffers.  zcomp_strm_find() turns the caller
      into the exclusive user of a stream (holding the stream mutex until zram
      releases the stream), and zcomp_strm_release() makes the zcomp stream
      available (unlocks the stream mutex).  Hence no concurrent write
      (compression) operations are possible at the moment.
      
      iozone -t 3 -R -r 16K -s 60M -I +Z
      
             test            base           patched
      --------------------------------------------------
        Initial write      597992.91       591660.58
              Rewrite      609674.34       616054.97
                 Read     2404771.75      2452909.12
              Re-read     2459216.81      2470074.44
         Reverse Read     1652769.66      1589128.66
          Stride read     2202441.81      2202173.31
          Random read     2236311.47      2276565.31
       Mixed workload     1423760.41      1709760.06
         Random write      579584.08       615933.86
               Pwrite      597550.02       594933.70
                Pread     1703672.53      1718126.72
               Fwrite     1330497.06      1461054.00
                Fread     3922851.00      3957242.62
      
      Usage examples:
      
      	comp = zcomp_create(NAME) /* NAME e.g. "lzo" */
      
      which initialises compressing backend if requested algorithm is supported.
      
      Compress:
      	zstrm = zcomp_strm_find(comp)
      	zcomp_compress(comp, zstrm, src, &dst_len)
      	[..] /* copy compressed data */
      	zcomp_strm_release(comp, zstrm)
      
      Decompress:
      	zcomp_decompress(comp, src, src_len, dst);
      
      Free compressing backend and its zcomp stream:
      	zcomp_destroy(comp)
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7e1ef43
    • S
      zram: delete zram_init_device() · b67d1ec1
      Sergey Senozhatsky committed
      Allocate a new `zram_meta' in disksize_store() only for an uninitialised
      zram device, saving a number of allocations and deallocations when
      disksize_store() is called on a device that is currently in use.  At the
      same time, the zram_meta stack variable is not necessary, because we can
      set ->meta directly.  There is also no need to set the QUEUE_FLAG_NONROT
      queue flag on every disksize_store(); set it once during device creation.
      
      [minchan@kernel.org: handle zram->meta alloc fail case]
      [minchan@kernel.org: prevent lockdep spew of init_lock]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67d1ec1
    • S
      zram: document failed_reads, failed_writes stats · 8dd1d324
      Sergey Senozhatsky committed
      Document `failed_reads' and `failed_writes' device attributes.
      Remove info about `discard' - there is no such zram attr.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dd1d324
    • S
      zram: move zram size warning to documentation · e64cd51d
      Sergey Senozhatsky committed
      Move zram warning about disksize and size of memory correlation to zram
      documentation.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e64cd51d
    • S
      zram: drop not used table `count' member · 59fc86a4
      Sergey Senozhatsky committed
      struct table `count' member is not used.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59fc86a4