1. 15 Sep, 2013 1 commit
  2. 14 Sep, 2013 1 commit
    • L
      vfs: fix dentry LRU list handling and nr_dentry_unused accounting · 89dc77bc
      Linus Torvalds authored
      The LRU list changes interacted badly with our nr_dentry_unused
      accounting, and even worse with the new DCACHE_LRU_LIST bit logic.
      
      This introduces helper functions to make sure everything follows the
      proper dcache d_lru list rules: the dentry cache is complicated by the
      fact that some of the hotpaths don't even want to look at the LRU list
      at all, and the fact that we use the same list entry in the dentry for
      both the LRU list and for our temporary shrinking lists when removing
      things from the LRU.
      
      The helper functions temporarily have some extra sanity checking for the
      flag bits that have to match the current LRU state of the dentry.  We'll
      remove that before the final 3.12 release, but considering how easy it
      is to get wrong, this first cleanup version has some very particular
      sanity checking.
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89dc77bc
  3. 13 Sep, 2013 8 commits
  4. 12 Sep, 2013 30 commits
    • M
      xfs: remove dead code from xlog_recover_inode_pass2 · 08474ed6
      Mark Tinguely authored
      Additional code in the error handler of xlog_recover_inode_pass2()
      results in the following error:
      
      static checker warning: "fs/xfs/xfs_log_recover.c:2999
      xlog_recover_inode_pass2()
      	 info: ignoring unreachable code."
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      08474ed6
    • D
      xfs: = vs == typo in ASSERT() · aa9e1040
      Dan Carpenter authored
      There is a '=' vs '==' typo so the ASSERT()s are always true.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      aa9e1040
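To illustrate why such a typo defeats the check, here is a small user-space sketch. The names check_with_typo() and check_fixed() are hypothetical and are not XFS code; each returns the value of a condition of the kind that would be passed to ASSERT().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for an assertion condition; not XFS code. */

/* Buggy form: '=' assigns 1 to the local copy of state, so the
 * condition is nonzero no matter what the caller passed in. */
static bool check_with_typo(int state)
{
	return (state = 1) != 0;
}

/* Fixed form: '==' compares, so the result depends on the input. */
static bool check_fixed(int state)
{
	return state == 1;
}
```

With the typo, an assertion built on the first form can never fire, which is presumably why the original ASSERT()s were described as always true.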
    • R
      initmpfs: move rootfs code from fs/ramfs/ to init/ · 57f150a5
      Rob Landley authored
      When the rootfs code was a wrapper around ramfs, having them in the same
      file made sense.  Now that it can wrap another filesystem type, move it in
      with the init code instead.
      
      This also allows a subsequent patch to access rootfstype= command line
      arg.
      Signed-off-by: Rob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57f150a5
    • R
      initmpfs: move bdi setup from init_rootfs to init_ramfs · 4bbee76b
      Rob Landley authored
      Even though ramfs hasn't got a backing device, commit e0bf68dd ("mm:
      bdi init hooks") added one anyway, and put the initialization in
      init_rootfs() since that's the first user, leaving it out of init_ramfs()
      to avoid duplication.
      
      But initmpfs uses init_tmpfs() instead, so move the init into the
      filesystem's init function, add a "once" guard to prevent duplicate
      initialization, and call the filesystem init from rootfs init.
      
      This goes part of the way to allowing ramfs to be built as a module.
      
      [akpm@linux-foundation.org; using bit 1 was odd]
      Signed-off-by: Rob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bbee76b
    • R
      initmpfs: replace MS_NOUSER in initramfs · 137fdcc1
      Rob Landley authored
      Mounting MS_NOUSER prevents --bind mounts from rootfs.  Prevent new rootfs
      mounts with a different mechanism that doesn't affect bind mounts.
      Signed-off-by: Rob Landley <rob@landley.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jim Cromie <jim.cromie@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      137fdcc1
    • J
      lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt · 5e4c0d97
      Jan Kara authored
      With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
      one such possible user), the following race can happen:
      
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      <interrupt>
      ...
      radix_tree_preload()
      ...
      radix_tree_insert()
        radix_tree_node_alloc()
          if (rtp->nr) {
            ret = rtp->nodes[rtp->nr - 1];
      
      And we give out one radix tree node twice.  That clearly results in radix
      tree corruption with different results (usually OOPS) depending on which
      two users of radix tree race.
      
      We fix the problem by making radix_tree_node_alloc() always allocate fresh
      radix tree nodes when in interrupt.  Using preloading when in interrupt
      doesn't make sense since all the allocations have to be atomic anyway and
      we cannot steal nodes from process-context users because some users rely
      on radix_tree_insert() succeeding after radix_tree_preload().
      in_interrupt() check is somewhat ugly but we cannot simply key off passed
      gfp_mask as that is acquired from root_gfp_mask() and thus the same for
      all preload users.
      
      Another part of the fix is to avoid node preallocation in
      radix_tree_preload() when passed gfp_mask doesn't allow waiting.  Again,
      preallocation in such case doesn't make sense and when preallocation would
      happen in interrupt we could possibly leak some allocated nodes.  However,
      some users of radix_tree_preload() require following radix_tree_insert()
      to succeed.  To avoid unexpected effects for these users,
      radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
      and we provide a new function radix_tree_maybe_preload() for those users
      which get different gfp mask from different call sites and which are
      prepared to handle radix_tree_insert() failure.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e4c0d97
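The core idea of the fix above (bypass the preload pool when running in interrupt context) can be sketched in user space as follows. This is a simplified model under stated assumptions, not the kernel code: struct preload_pool, pool_preload(), node_alloc() and the in_irq flag (standing in for in_interrupt()) are all hypothetical names, and calloc() stands in for the slab allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define POOL_MAX 4

struct node { int dummy; };

/* Hypothetical model of a per-CPU preload pool. */
struct preload_pool {
	int nr;
	struct node *nodes[POOL_MAX];
};

/* Refill the pool from process context. Returns 0 on success. */
static int pool_preload(struct preload_pool *p)
{
	while (p->nr < POOL_MAX) {
		struct node *n = calloc(1, sizeof(*n));

		if (!n)
			return -1;
		p->nodes[p->nr++] = n;
	}
	return 0;
}

/* in_irq models in_interrupt(): an interrupt-context caller never
 * touches the pool, so it cannot race with a process-context user
 * that is in the middle of taking a node from it. */
static struct node *node_alloc(struct preload_pool *p, bool in_irq)
{
	if (!in_irq && p->nr) {
		struct node *n = p->nodes[p->nr - 1];

		p->nodes[p->nr - 1] = NULL;
		p->nr--;
		return n;
	}
	/* Interrupt context (or empty pool): always allocate fresh. */
	return calloc(1, sizeof(struct node));
}
```

In this model an allocation made with in_irq set leaves the pool count unchanged, so the preloaded nodes stay reserved for the process-context caller that relies on its insert succeeding.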
    • D
      affs: use loff_t in affs_truncate() · 63259326
      Dan Carpenter authored
      It seems pretty unlikely that AFFS supports files over 4GB but we may as
      well use loff_t, just for cleanliness' sake, instead of truncating it to
      32 bits.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Marco Stornelli <marco.stornelli@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63259326
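As a rough illustration of the truncation being avoided, the sketch below stores a 64-bit size in a 32-bit variable. It uses int64_t as a user-space stand-in for the kernel's loff_t, and the helper names are hypothetical rather than taken from the AFFS code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; int64_t stands in for the kernel's loff_t. */

/* Keeping the size in a 32-bit variable silently drops the high bits. */
static uint32_t size_as_u32(int64_t size)
{
	return (uint32_t)size;
}

/* Keeping it in a 64-bit variable preserves the full value. */
static int64_t size_as_loff(int64_t size)
{
	return size;
}
```

For a 5 GiB size the 32-bit variant wraps to 1 GiB, while the 64-bit variant keeps the original value.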
    • M
      vmcore: enable /proc/vmcore mmap for s390 · 11e376a3
      Michael Holzheu authored
      The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" now
      allows mmap to be used on s390 as well.
      
      So enable mmap for s390 again.
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11e376a3
    • M
      vmcore: introduce remap_oldmem_pfn_range() · 9cb21813
      Michael Holzheu authored
      For zfcpdump we can't map the HSA storage because it is only available via
      a read interface.  Therefore, for the new vmcore mmap feature we have to
      introduce a new mechanism to create mappings on demand.
      
      This patch introduces a new architecture function remap_oldmem_pfn_range()
      that should be used to create mappings with remap_pfn_range() for oldmem
      areas that can be directly mapped.  For zfcpdump this is everything
      besides the HSA memory.  For the areas that are not mapped by
      remap_oldmem_pfn_range(), a new generic vmcore fault handler
      mmap_vmcore_fault() is called.
      
      This handler works as follows:
      
      * Get already available or new page from page cache (find_or_create_page)
      * Check if /proc/vmcore page is filled with data (PageUptodate)
      * If yes:
        Return that page
      * If no:
        Fill page using __vmcore_read(), set PageUptodate, and return page
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cb21813
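The fault handler steps listed in this commit message can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: struct cached_page, fault_page() and read_backing() are hypothetical stand-ins for the page cache lookup, mmap_vmcore_fault() and __vmcore_read(), and a tiny static array plays the role of the page cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ 16
#define NPAGES  4

/* Hypothetical model of a page cache entry. */
struct cached_page {
	bool present;	/* entry exists in the cache */
	bool uptodate;	/* models PageUptodate */
	char data[PAGE_SZ];
};

static struct cached_page cache[NPAGES];
static int fill_count;	/* how often the backing store was read */

/* Stand-in for __vmcore_read(): fills the page with a per-page byte. */
static void read_backing(unsigned int idx, char *buf)
{
	memset(buf, 'A' + (int)idx, PAGE_SZ);
	fill_count++;
}

/* Sketch of the fault path: find or create the page, fill it only if
 * it is not yet up to date, then return it. */
static struct cached_page *fault_page(unsigned int idx)
{
	struct cached_page *pg;

	if (idx >= NPAGES)
		return NULL;

	pg = &cache[idx];
	pg->present = true;		/* models find_or_create_page() */
	if (!pg->uptodate) {
		read_backing(idx, pg->data);
		pg->uptodate = true;
	}
	return pg;
}
```

A second fault on the same index returns the cached page without reading the backing store again, which is the point of the PageUptodate check.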
    • M
      vmcore: introduce ELF header in new memory feature · be8a8d06
      Michael Holzheu authored
      For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
      (zfcpdump).  We have support where the first HSA_SIZE bytes are saved into
      a hypervisor owned memory area (HSA) before the kdump kernel is booted.
      When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.
      
      The advantages of this mechanism are:
      
       * No crashkernel memory has to be defined in the old kernel.
       * Early boot problems (before kexec_load has been done) can be dumped
       * Non-Linux systems can be dumped.
      
      We modify the s390 copy_oldmem_page() function to read from the HSA memory
      if memory below HSA_SIZE bytes is requested.
      
      Since we cannot use the kexec tool to load the kernel in this scenario,
      we have to build the ELF header in the 2nd (kdump/new) kernel.
      
      So with the following patch set we would like to make it possible to
      create the ELF header for /proc/vmcore in the 2nd kernel's memory.
      
      The following steps are done during zfcpdump execution:
      
      1.  Production system crashes
      2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
      3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
      4.  Boot loader loads kernel into low memory area
      5.  Kernel boots and uses only HSA_SIZE bytes of memory
      6.  Kernel saves registers of non-boot CPUs
      7.  Kernel does memory detection for dump memory map
      8.  Kernel creates ELF header for /proc/vmcore
      9.  /proc/vmcore uses this header for initialization
      10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
          - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
          - copy_oldmem_page() copies from real memory for memory above HSA_SIZE
      
      Currently for s390 we create the ELF core header in the 2nd kernel with a
      small trick.  We relocate the addresses in the ELF header in a way that
      for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
      and the read_from_oldmem() returns the correct data.  This allows the
      /proc/vmcore code to use the ELF header in the 2nd kernel.
      
      This patch:
      
      Exchange the old mechanism with the new and much cleaner function call
      override feature that now officially allows the ELF core header to be
      created in the 2nd kernel.
      
      To use the new feature the following functions have to be defined
      by the architecture backend code to read from new memory:
      
       * elfcorehdr_alloc: Allocate ELF header
       * elfcorehdr_free: Free the memory of the ELF header
       * elfcorehdr_read: Read from ELF header
       * elfcorehdr_read_notes: Read from ELF notes
      Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Jan Willeke <willeke@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be8a8d06
    • O
      exec: cleanup the error handling in search_binary_handler() · 6b3c538f
      Oleg Nesterov authored
      The error handling and ret-from-loop look confusing and inconsistent.
      
      - "retval >= 0" simply returns
      
      - "!bprm->file" returns too but with read_unlock() because
         binfmt_lock was already re-acquired
      
      - "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
        relies on the same check after the main loop
      
      Consolidate these checks into a single if/return statement.
      
      need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
      the main loop are not needed.  This is only for pathological and
      impossible list_empty(&formats) case.
      
      It is not clear why we check "bprm->mm == NULL"; probably this
      should be removed.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b3c538f
    • O
      exec: don't retry if request_module() fails · 4e0621a0
      Oleg Nesterov authored
      A separate one-liner for better documentation.
      
      It doesn't make sense to retry if request_module() fails to exec
      /sbin/modprobe, add the additional "request_module() < 0" check.
      
      However, this logic still doesn't look exactly right:
      
      1. It would be better to check "request_module() != 0", the user
         space modprobe process should report the correct exit code.
         But I didn't dare to add the user-visible change.
      
      2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
         to exec a "#!path-to-unsupported-binary" script. In this case
         request_module() + "retry" will be done twice: first by the
         "depth == 1" code, and then again by the "depth == 0" caller
         which doesn't make sense.
      
      3. And note that in the case above bprm->buf was already changed
         by load_script()->prepare_binprm(), so this looks even more
         ugly.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e0621a0
    • O
      exec: cleanup the CONFIG_MODULES logic · cb7b6b1c
      Oleg Nesterov authored
      search_binary_handler() uses "for (try=0; try<2; try++)" to avoid "goto"
      but the code looks too complicated and horrible imho.  We still need to
      check "try == 0" before request_module() and add the additional "break"
      for !CONFIG_MODULES case.
      
      Kill this loop and use a simple "bool need_retry" + "goto retry".  The
      code looks much simpler and we do not even need ifdef's, gcc can optimize
      out the "if (need_retry)" block if !IS_ENABLED().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb7b6b1c
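The "bool need_retry" + "goto retry" shape described in this commit can be sketched as below. This is a simplified user-space model, not the code in fs/exec.c: try_handlers(), load_module(), search_handler() and the modules_enabled flag (standing in for IS_ENABLED(CONFIG_MODULES)) are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int modules_enabled = 1;	/* models IS_ENABLED(CONFIG_MODULES) */
static int handler_loaded;	/* set once a matching handler exists */
static int load_attempts;	/* counts calls to load_module() */

/* Hypothetical: walk the handler list; -ENOEXEC if nothing matched. */
static int try_handlers(void)
{
	return handler_loaded ? 0 : -ENOEXEC;
}

/* Hypothetical stand-in for request_module(). */
static void load_module(void)
{
	load_attempts++;
	handler_loaded = 1;
}

static int search_handler(void)
{
	bool need_retry = modules_enabled;
	int ret;

retry:
	ret = try_handlers();
	if (ret != -ENOEXEC)
		return ret;

	/* Retry at most once, and only if module loading is possible. */
	if (need_retry) {
		load_module();
		need_retry = false;
		goto retry;
	}
	return ret;
}
```

Because need_retry starts out false when module support is disabled, the retry block is dead code in that configuration and a compiler can drop it without any ifdef.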
    • O
      exec: kill ->load_binary != NULL check in search_binary_handler() · 92eaa565
      Oleg Nesterov authored
      search_binary_handler() checks ->load_binary != NULL for no reason, this
      method should be always defined.  Turn this check into WARN_ON() and move
      it into __register_binfmt().
      
      Also, kill the function pointer.  The current code looks confusing, as if
      ->load_binary can go away after read_unlock(&binfmt_lock).  But we rely on
      module_get(fmt->module), this fmt can't be changed or unregistered,
      otherwise this code is buggy anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92eaa565
    • O
      exec: move allow_write_access/fput to exec_binprm() · 52f14282
      Oleg Nesterov authored
      When search_binary_handler() succeeds it does allow_write_access() and
      fput(), then it clears bprm->file to ensure the caller will not do the
      same.
      
      We can simply move this code to exec_binprm() which is called only once.
      In fact we could move this to free_bprm() and remove the same code in
      do_execve_common's error path.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52f14282
    • O
      exec: proc_exec_connector() should be called only once · 9beb266f
      Oleg Nesterov authored
      A separate one-liner with the minor fix.
      
      PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
      least twice if search_binary_handler() is called by ->load_binary()
      recursively, say, load_script().
      
      Move it to exec_binprm(), this is "depth == 0" code too.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9beb266f
    • O
      exec: kill "int depth" in search_binary_handler() · 131b2f9f
      Oleg Nesterov authored
      Nobody except search_binary_handler() should touch ->recursion_depth, "int
      depth" buys nothing but complicates the code, kill it.
      
      Probably we should also kill "fn" and the !NULL check, ->load_binary
      should be always defined.  And it can not go away after read_unlock() or
      this code is buggy anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      131b2f9f
    • O
      exec: introduce exec_binprm() for "depth == 0" code · 5d1baf3b
      Oleg Nesterov authored
      task_pid_nr_ns() and trace/ptrace code in the middle of the recursive
      search_binary_handler() looks confusing and imho annoying.  We only need
      this code if "depth == 0", let's add a simple helper which calls
      search_binary_handler() and does trace_sched_process_exec() +
      ptrace_event().
      
      The patch also moves the setting of task->did_exec, we need to do this
      only once.
      
      Note: we can kill either task->did_exec or PF_FORKNOEXEC.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Zach Levis <zml@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d1baf3b
    • O
      proc: make proc_fd_permission() thread-friendly · 96d0df79
      Oleg Nesterov authored
      proc_fd_permission() says "process can still access /proc/self/fd after it
      has executed a setuid()", but the "task_pid() == proc_pid()" check only
      helps if the task is the group leader, since /proc/self points to
      /proc/<leader-pid>.
      
      Change this check to use task_tgid() so that the whole thread group can
      access its /proc/self/fd or /proc/<tid-of-sub-thread>/fd.
      
      Notes:
      	- CLONE_THREAD does not require CLONE_FILES so task->files
      	  can differ, but I don't think this can lead to any security
      	  problem. And this matches same_thread_group() in
      	  __ptrace_may_access().
      
      	- /proc/self should probably point to /proc/<thread-tid>, but
      	  it is too late to change the rules. Perhaps it makes sense
      	  to add /proc/thread though.
      
      Test-case:
      
      	#include <assert.h>
      	#include <dirent.h>
      	#include <pthread.h>
      
      	void *tfunc(void *arg)
      	{
      		assert(opendir("/proc/self/fd"));
      		return NULL;
      	}
      
      	int main(void)
      	{
      		pthread_t t;
      		pthread_create(&t, NULL, tfunc, NULL);
      		pthread_join(t, NULL);
      		return 0;
      	}
      
      fails if, say, this executable is not readable and suid_dumpable = 0.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96d0df79
    • C
      fs/proc/task_mmu.c: check the return value of mpol_to_str() · a3c03992
      Chen Gang authored
      mpol_to_str() may fail and not fill the buffer (e.g. with -EINVAL), so we
      need to check its return value; otherwise the buffer may not be
      NUL-terminated and the next seq_printf() will cause problems.
      
      The failure return needs to come after mpol_cond_put() to match
      get_vma_policy().
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3c03992
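The pattern of checking a formatter's return value before using its buffer can be sketched in user space as follows. policy_to_str() and show_policy() are hypothetical stand-ins for mpol_to_str() and its caller; the mode values and strings are made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter: on failure it returns a negative errno and
 * leaves buf untouched, so buf must not be printed in that case. */
static int policy_to_str(char *buf, size_t len, int mode)
{
	int n;

	if (mode < 0 || mode > 1)
		return -EINVAL;
	n = snprintf(buf, len, "%s", mode ? "interleave" : "default");
	return (n < 0 || (size_t)n >= len) ? -ENOSPC : 0;
}

/* Hypothetical caller: format first, run cleanup that both paths need,
 * and only then bail out or print. */
static int show_policy(char *out, size_t outlen, int mode)
{
	char buf[16];	/* deliberately left uninitialized */
	int ret = policy_to_str(buf, sizeof(buf), mode);

	/* Cleanup shared by both paths would go here; in the commit this
	 * is where mpol_cond_put() runs before the failure return. */
	if (ret < 0)
		return ret;

	snprintf(out, outlen, " %s", buf);
	return 0;
}
```

Returning before the print means the uninitialized buffer is never read when formatting fails.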
    • A
      fs/file_table.c:fput(): make comment more truthful · be49b30a
      Andrew Morton authored
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be49b30a
    • S
      coredump: add new %P variable in core_pattern · 65aafb1e
      Stéphane Graber authored
      Add a new %P variable to be used in core_pattern.  This variable contains
      the global PID (PID in the init namespace) as %p contains the PID in the
      current namespace which isn't always what we want.
      
      The main use for this is to make it easier to handle crashes that happened
      within a container.  With this new variable it's possible to have the
      crashes dumped into the container or forwarded to the host with the right
      PID (from the host's point of view).
      Signed-off-by: Stéphane Graber <stgraber@ubuntu.com>
      Reported-by: Hans Feldt <hans.feldt@ericsson.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65aafb1e
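The difference between %p and %P can be illustrated with a small user-space pattern expander. This is a sketch only: expand_pattern() is a hypothetical function, it handles just %p, %P and %%, and its treatment of a trailing '%' (an error) and of unknown specifiers (dropped) is a choice made for this example rather than a statement about the kernel's behaviour.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expander: %p is the PID as seen in the crashing task's
 * namespace, %P the PID as seen from the initial namespace. */
static int expand_pattern(char *out, size_t len, const char *pat,
			  int ns_pid, int global_pid)
{
	size_t used = 0;

	if (len == 0)
		return -1;
	out[0] = '\0';

	for (; *pat; pat++) {
		int n = 0;

		if (*pat != '%') {
			n = snprintf(out + used, len - used, "%c", *pat);
		} else {
			pat++;
			switch (*pat) {
			case '\0':	/* trailing '%': error in this sketch */
				return -1;
			case 'p':
				n = snprintf(out + used, len - used, "%d", ns_pid);
				break;
			case 'P':
				n = snprintf(out + used, len - used, "%d", global_pid);
				break;
			case '%':
				n = snprintf(out + used, len - used, "%%");
				break;
			default:	/* unknown specifier: dropped here */
				break;
			}
		}
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* output would not fit */
		used += (size_t)n;
	}
	return 0;
}
```

For a task that is PID 7 inside a container and PID 4242 on the host, the pattern "core.%p.%P" expands to "core.7.4242" in this model.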
    • V
      hfsplus: integrate POSIX ACLs support into driver · b4c1107c
      Vyacheslav Dubeyko authored
      Integrate implemented POSIX ACLs support into hfsplus driver.
      Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4c1107c
    • V
      hfsplus: implement POSIX ACLs support · eef80d4a
      Vyacheslav Dubeyko authored
      Implement POSIX ACLs support in hfsplus driver.
      Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eef80d4a
    • V
      hfsplus: add necessary declarations for POSIX ACLs support · 2c92057e
      Vyacheslav Dubeyko authored
      This patchset implements POSIX ACLs support in hfsplus driver.
      
      Mac OS X beginning with version 10.4 ("Tiger") supports NFSv4 ACLs, which
      are part of the NFSv4 standard.  HFS+ stores ACLs in the form of
      specially named extended attributes (com.apple.system.Security).
      
      But this patchset doesn't use "com.apple.system.Security" extended
      attributes.  It implements support of POSIX ACLs in the form of extended
      attributes with names "system.posix_acl_access" and
      "system.posix_acl_default".  These xattrs are interpreted only under
      Linux; POSIX ACLs have no meaning under Mac OS X.  Thereby, this patch
      set provides the opportunity to use POSIX ACLs under Linux on an HFS+
      filesystem.
      
      This patch:
      
      Add the CONFIG_HFSPLUS_FS_POSIX_ACL kernel configuration option, the
      DBG_ACL_MOD debugging flag and an acl.h file with declarations of the
      essential functions for supporting POSIX ACLs in the hfsplus driver.
      Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c92057e
    • E
      epoll: add a reschedule point in ep_free() · 91cf5ab6
      Eric Dumazet authored
      ep_free() might iterate on a huge set of epitems and hold cpu too long.
      Add two cond_resched() in order to yield cpu to other tasks.  This is safe
      as we only hold mutexes in this function.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Acked-by: Eric Wong <normalperson@yhbt.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91cf5ab6
    • G
      fs/bio-integrity: fix a potential mem leak · bc5c8f07
      Gu Zheng authored
      Free the bio_integrity_pool in the fail path of biovec_create_pool in
      function bioset_integrity_create().
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc5c8f07
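The kind of leak fixed here (a first allocation that is not released when a second one fails) can be sketched in user space as follows. struct set, set_integrity_create() and the fail_second test hook are hypothetical names, and malloc()/free() stand in for the kernel's pool creation and destruction calls.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of an object that owns two pools. */
struct set {
	void *integrity_pool;
	void *bvec_pool;
};

static int fail_second;	/* test hook: force the second allocation to fail */

static int set_integrity_create(struct set *s)
{
	s->integrity_pool = malloc(64);
	if (!s->integrity_pool)
		return -1;

	s->bvec_pool = fail_second ? NULL : malloc(64);
	if (!s->bvec_pool) {
		/* Fail path: release the first pool instead of leaking it. */
		free(s->integrity_pool);
		s->integrity_pool = NULL;
		return -1;
	}
	return 0;
}
```

Resetting the pointer after freeing also keeps a later teardown from freeing the same pool twice in this model.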
    • J
      writeback: fix race that cause writeback hung · 146d7009
      Junxiao Bi authored
      There is a race between marking an inode dirty and the writeback thread;
      see the following scenario.  In this case, the writeback thread will not
      run even though there is dirty_io.
      
      __mark_inode_dirty()                                          bdi_writeback_workfn()
      	...                                                       	...
      	spin_lock(&inode->i_lock);
      	...
      	if (bdi_cap_writeback_dirty(bdi)) {
      	    <<< assume wb has dirty_io, so wakeup_bdi is false.
      	    <<< the following inode_dirty also have wakeup_bdi false.
      	    if (!wb_has_dirty_io(&bdi->wb))
      		    wakeup_bdi = true;
      	}
      	spin_unlock(&inode->i_lock);
      	                                                            <<< assume last dirty_io is removed here.
      	                                                            pages_written = wb_do_writeback(wb);
      	                                                            ...
      	                                                            <<< work_list empty and wb has no dirty_io,
      	                                                            <<< delayed_work will not be queued.
      	                                                            if (!list_empty(&bdi->work_list) ||
      	                                                                (wb_has_dirty_io(wb) && dirty_writeback_interval))
      	                                                                queue_delayed_work(bdi_wq, &wb->dwork,
      	                                                                    msecs_to_jiffies(dirty_writeback_interval * 10));
      	spin_lock(&bdi->wb.list_lock);
      	inode->dirtied_when = jiffies;
      	<<< new dirty_io is added.
      	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
      	spin_unlock(&bdi->wb.list_lock);
      
      	<<< though there is dirty_io, but wakeup_bdi is false,
      	<<< so writeback thread will not be waked up and
      	<<< the new dirty_io will not be flushed.
      	if (wakeup_bdi)
      	    bdi_wakeup_thread_delayed(bdi);
      
      Writeback will not run until a new flush work is queued.  This may cause
      a lot of dirty pages to stay in memory for a long time.
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      146d7009
    • M
      mm/page-writeback.c: add strictlimit feature · 5a537485
      Maxim Patlasov authored
      The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
      unprivileged users) from growing a large number of dirty pages before
      throttling.  For such filesystems balance_dirty_pages always checks bdi
      counters against bdi limits.  I.e.  even if global "nr_dirty" is under
      "freerun", it's not allowed to skip bdi checks.  The only use case for now
      is fuse: it sets bdi max_ratio to 1% by default and system administrators
      are supposed to expect that this limit won't be exceeded.
      
      The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
      filesystem may set the flag when it initializes its BDI.
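
The decision described in this commit message can be modelled in userspace roughly as follows. This is a minimal sketch under stated assumptions, not the kernel's balance_dirty_pages: struct dirty_state, its fields and should_throttle() are hypothetical names, and the real code works with several thresholds and pause calculations that are left out here.

```c
#include <stdbool.h>

/* Hypothetical, simplified snapshot of the counters involved.
 * These are NOT the kernel's data structures. */
struct dirty_state {
    unsigned long nr_dirty;   /* global number of dirty pages */
    unsigned long freerun;    /* global freerun ceiling */
    unsigned long bdi_dirty;  /* dirty pages accounted to this BDI */
    unsigned long bdi_limit;  /* this BDI's limit (one threshold, assumed) */
    bool strictlimit;         /* models the BDI_CAP_STRICTLIMIT flag */
};

/* Returns true when the writer should be throttled in this model. */
bool should_throttle(const struct dirty_state *s)
{
    /* Without strictlimit, staying under the global freerun ceiling
     * is enough to skip the per-BDI check entirely. */
    if (!s->strictlimit && s->nr_dirty <= s->freerun)
        return false;

    /* With strictlimit the per-BDI check is never skipped. */
    return s->bdi_dirty > s->bdi_limit;
}
```

With the same counters (global dirty pages well under freerun, but the BDI over its own limit), the model returns false without the flag and true with it, which is the behavioural difference the message describes.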
      
      The problematic scenario comes from the fact that nobody pays attention to
      the NR_WRITEBACK_TEMP counter (i.e.  the number of pages under fuse
      writeback).  The implementation of fuse writeback releases the original
      page (by calling end_page_writeback) almost immediately.  A fuse request
      queued for real processing bears a copy of the original page.  Hence, if
      the userspace fuse daemon doesn't finalize write requests in a timely
      manner, an aggressive mmap writer can fill virtually all memory with
      those temporary fuse page copies.  They are carefully accounted in
      NR_WRITEBACK_TEMP, but nobody cares.
      
      To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
      problem" as a shortcut for "the possibility of uncontrolled growth of the
      amount of RAM consumed by temporary pages allocated by kernel fuse to
      process writeback".
      
      The problem was very easy to reproduce.  There is a trivial example
      filesystem implementation in the fuse userspace distribution:
      fusexmp_fh.c.  I added "sleep(1);" to the write methods, then recompiled
      and mounted it.  Then I created a huge file on the mount point and ran a
      simple program which mmap-ed the file into a memory region, then wrote
      data to the region.  An hour later I observed almost all RAM consumed by
      fuse writeback.  Since then some unrelated changes in kernel fuse made it
      more difficult to reproduce, but it is still possible now.
      
      Putting this theoretical happens-in-the-lab thing aside, there is another
      thing that really hurts real-world (FUSE) users.  This is the
      write-through page cache policy FUSE currently uses.  I.e.  when handling
      write(2), kernel fuse populates the page cache and flushes user data to
      the server synchronously.  This is highly suboptimal.  Pavel Emelyanov's
      patches ("writeback cache policy") solve the problem, but they also make
      resolving the NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise,
      simply copying a huge file to a fuse mount would result in memory
      starvation.  Miklos, the maintainer of FUSE, believes the strictlimit
      feature is the way to go.
      
      Finally, putting FUSE topics aside, there is one more use case for the
      strictlimit feature.  Using a slow USB stick (mass storage) in a machine
      with a huge amount of RAM installed is a well-known pain.  Let's do a
      simple computation.  Assuming 64GB of RAM installed, the existing
      implementation of balance_dirty_pages will start throttling only after
      9.6GB of RAM becomes dirty (freerun == 15% of total RAM).  So, the command
      "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
      subsequent "umount /media/my-usb-storage/" will take more than two hours
      if the effective throughput of the storage is, say, 1MB/sec.
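
The arithmetic in the USB example can be checked with a small sketch. The helper names freerun_mb() and flush_seconds() are hypothetical, and the model assumes freerun is a flat percentage of total RAM and that the device sustains a constant throughput, which is only the simplification used in the message itself.

```c
#include <stdint.h>

/* Dirty-memory ceiling below which writers are not throttled,
 * modelled as a flat percentage of total RAM (15% in the text).
 * Integer division truncates the result to whole megabytes. */
uint64_t freerun_mb(uint64_t ram_mb, unsigned percent)
{
    return ram_mb * percent / 100;
}

/* Seconds needed to flush dirty_mb megabytes at a constant
 * throughput of mb_per_sec megabytes per second. */
uint64_t flush_seconds(uint64_t dirty_mb, uint64_t mb_per_sec)
{
    return dirty_mb / mb_per_sec;
}
```

For 64GB of RAM, freerun_mb(64 * 1024, 15) gives 9830 MB, roughly 9.6GB, so a 9GB copy fits entirely under freerun. Flushing it later takes flush_seconds(9 * 1024, 1) = 9216 seconds, about 2.56 hours at 1MB/sec.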
      
      After inclusion of the strictlimit feature, it will be trivial to add a
      knob (e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
      demand, manually or via a udev rule.  Maybe I'm wrong, but it seems to be
      quite a natural desire to limit the amount of dirty memory for some
      devices we do not fully trust (in the sense of sustainable throughput).
      
      [akpm@linux-foundation.org: fix warning in page-writeback.c]
      Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a537485
    • W
      mm/writeback: make writeback_inodes_wb static · 7d9f073b
      Wanpeng Li committed
      It's not used globally and could be static.
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d9f073b