1. 17 December 2014, 2 commits
  2. 15 December 2014, 1 commit
    • isofs: Fix infinite looping over CE entries · f54e18f1
      Jan Kara committed
      Rock Ridge extensions define so-called Continuation Entries (CE), which
      describe where further space with Rock Ridge data is located. A corrupted
      isofs image can contain an arbitrarily long chain of these, including one
      that forms a loop, causing the kernel to loop forever when traversing the
      entries.
      
      Limit the traversal to 32 entries which should be more than enough space
      to store all the Rock Ridge data.
      Reported-by: P J P <ppandit@redhat.com>
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      f54e18f1
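      To illustrate the guard described above, here is a minimal, stand-alone C
      sketch of capping a chain walk at 32 continuation entries. The ce_ref
      structure and load_next callback are hypothetical names, not the isofs
      parser itself:

        /* Illustration only, not the isofs code: follow a chain of
         * continuation records but refuse to walk more than a fixed number,
         * so a corrupted, looping chain cannot hang the traversal. */
        #define RR_MAX_CE_ENTRIES 32            /* limit described in the commit */

        struct ce_ref {                         /* hypothetical "where is the next CE" */
                unsigned long block;
                unsigned long offset;
                unsigned long size;             /* 0 means "no further continuation" */
        };

        /* load_next() reads the record at *ce and rewrites *ce to point at the
         * following continuation entry (or sets size to 0).  Returns 0 on success. */
        static int walk_continuations(struct ce_ref *ce,
                                      int (*load_next)(struct ce_ref *ce))
        {
                int entries = 0;

                while (ce->size) {
                        if (++entries > RR_MAX_CE_ENTRIES)
                                return -1;      /* corrupted image: bail out */
                        if (load_next(ce))
                                return -1;
                }
                return 0;
        }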
  3. 14 December 2014, 16 commits
    • aio: Skip timer for io_getevents if timeout=0 · 5f785de5
      Fam Zheng committed
      In this case it is basically polling, so do not involve a timer at all;
      that would hurt performance for application event loops.

      In an arbitrary test, the elapsed time of the io_getevents syscall
      dropped from 50000+ nanoseconds to a few hundred.
      Signed-off-by: Fam Zheng <famz@redhat.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      5f785de5
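      A user-space sketch of the timeout=0 case this entry optimizes: poll for
      completions with a zeroed timespec, using raw syscalls so no libaio is
      needed (error handling kept minimal):

        /* gcc -O2 -o aio_poll aio_poll.c */
        #include <linux/aio_abi.h>      /* aio_context_t, struct io_event */
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <time.h>
        #include <unistd.h>

        static long sys_io_setup(unsigned nr, aio_context_t *ctx)
        {
                return syscall(__NR_io_setup, nr, ctx);
        }

        static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                                     struct io_event *ev, struct timespec *ts)
        {
                return syscall(__NR_io_getevents, ctx, min_nr, nr, ev, ts);
        }

        int main(void)
        {
                aio_context_t ctx = 0;
                struct io_event events[8];
                struct timespec zero = { 0, 0 };        /* timeout == 0: pure poll */

                if (sys_io_setup(8, &ctx) < 0) {
                        perror("io_setup");
                        return 1;
                }

                /* Returns immediately with whatever has completed (here: nothing).
                 * With this patch the kernel no longer arms a timer for this case. */
                long n = sys_io_getevents(ctx, 0, 8, events, &zero);
                printf("completed events: %ld\n", n);

                syscall(__NR_io_destroy, ctx);
                return 0;
        }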
    • aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov committed
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks' state
      and restore it back exactly as it was.  One piece of that state is the
      aio rings set up with the io_setup() call.  There are (almost) no
      problems dumping them; the problem is restoring them -- if I dump a task
      with an aio ring originally mapped at address A, I want to restore it at
      exactly the same address A.  Unfortunately, io_setup() does not allow
      for that -- it mmaps the ring at whatever place mm finds appropriate
      (it calls do_mmap_pgoff() with a zero address and without the MAP_FIXED
      flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      to the address A (where it was seen before the dump).  The problem is
      that the ring's virtual address is passed back to user space as the
      context ID, and this ID is then used as the search key by all the other
      io_foo() calls.  Reworking this ID to be just some integer doesn't seem
      to work, as the value is already used by libaio as a pointer through
      which the library accesses the aio metadata.
      
      So, to make restore work we need to make sure that
      
      a) the ring is mapped at the desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on an aio region update the
      kioctx's user_id and mmap_base values.
      
      Here is the second issue mentioned at the beginning.  If (regardless of
      the C/R dances above) someone creates an io context with io_setup(),
      then mremap()-s the ring and then destroys the context, the kill_ioctx()
      routine will call munmap() on the wrong (old) address.  This results in
      a) the aio ring remaining in memory and b) some other vma getting
      unexpectedly unmapped.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
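      A hedged user-space sketch of the restore pattern described above,
      assuming a kernel with this change.  The target range here is reserved
      with an anonymous mmap() only for illustration; criu would use the
      address recorded at dump time, and error handling is minimal:

        /* gcc -O2 -o aio_remap aio_remap.c */
        #define _GNU_SOURCE
        #include <linux/aio_abi.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Length of the VMA containing addr, parsed from /proc/self/maps. */
        static size_t vma_len(unsigned long addr)
        {
                unsigned long start, end;
                char line[256];
                FILE *f = fopen("/proc/self/maps", "r");

                if (!f)
                        return 0;
                while (fgets(line, sizeof(line), f)) {
                        if (sscanf(line, "%lx-%lx", &start, &end) != 2)
                                continue;
                        if (addr >= start && addr < end) {
                                fclose(f);
                                return end - start;
                        }
                }
                fclose(f);
                return 0;
        }

        int main(void)
        {
                aio_context_t ctx = 0;
                void *target, *ring;
                size_t len;

                if (syscall(__NR_io_setup, 8, &ctx) < 0) {
                        perror("io_setup");
                        return 1;
                }
                len = vma_len((unsigned long)ctx);      /* ring size as the kernel mapped it */

                /* Reserve a destination range; a real restore would use the
                 * address recorded at dump time instead of letting mmap() pick one. */
                target = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (!len || target == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                ring = mremap((void *)ctx, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, target);
                if (ring == MAP_FAILED) {
                        perror("mremap");
                        return 1;
                }
                /* With this patch kioctx->user_id follows the ring, so the new
                 * address is the context id to use from now on. */
                syscall(__NR_io_destroy, (aio_context_t)(unsigned long)ring);
                return 0;
        }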
    • fsnotify: remove destroy_list from fsnotify_mark · 37d469e7
      Jan Kara committed
      destroy_list is used to track marks that still need to wait for the end
      of an SRCU period before they can be freed.  However, by the time a mark
      is added to destroy_list it is no longer on its group's list of marks,
      so we can reuse fsnotify_mark->g_list for queueing onto destroy_list.
      This saves two pointers in each fsnotify_mark.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37d469e7
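      A hedged, trimmed-down sketch of the idea (hypothetical struct and list
      names, not the real fsnotify code): since a mark leaves its group's list
      before it is queued for destruction, one list_head can serve both lists
      in turn:

        #include <linux/list.h>
        #include <linux/spinlock.h>

        /* Hypothetical, trimmed-down mark: one list_head does double duty. */
        struct mark_sketch {
                struct list_head g_list;        /* on the group's mark list, OR     */
                                                /* later on the global destroy list */
        };

        static LIST_HEAD(destroy_list_sketch);
        static DEFINE_SPINLOCK(destroy_lock_sketch);

        /* Called after the mark has been removed from its group's list
         * (list_del_init(&mark->g_list)), so g_list is free to be reused. */
        static void queue_for_destruction(struct mark_sketch *mark)
        {
                spin_lock(&destroy_lock_sketch);
                list_add(&mark->g_list, &destroy_list_sketch);
                spin_unlock(&destroy_lock_sketch);
        }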
    • fsnotify: unify inode and mount marks handling · 0809ab69
      Jan Kara committed
      There's a lot of common code in inode and mount marks handling.  Factor it
      out to a common helper function.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0809ab69
    • fallocate: create FAN_MODIFY and IN_MODIFY events · 820c12d5
      Heinrich Schuchardt committed
      The fanotify and inotify APIs can be used to monitor changes of the
      file system.  The fallocate() system call modifies files, so it should
      trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
      events.  The most interesting case is FALLOC_FL_COLLAPSE_RANGE, because
      this flag makes it possible to create arbitrary file content from
      random data.
      
      This patch adds the missing call to fsnotify_modify().
      
      The FAN_MODIFY and IN_MODIFY events will be created when fallocate()
      succeeds.  They will even be created if the file length remains
      unchanged, e.g. when calling fallocate() with the flag FALLOC_FL_KEEP_SIZE.
      
      This logic was primarily chosen to keep the coding simple.
      
      It resembles the logic of the write() system call.
      
      When we call write() we always create a FAN_MODIFY event, even in the case
      of overwriting with identical data.
      
      Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
      actually changed.
      
      Furthermore, even if the file size remains unchanged, fallocate() may
      influence whether a subsequent write() will succeed, and hence the
      fallocate() call may be considered a modification.
      
      The fallocate(2) man page teaches: After a successful call, subsequent
      writes into the range specified by offset and len are guaranteed not to
      fail because of lack of disk space.
      
      So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
      different outcomes of a subsequent write depending on the values of offset
      and len.
      Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Eric Paris <eparis@parisplace.org>
      Cc: John McCutchan <john@johnmccutchan.com>
      Cc: Robert Love <rlove@rlove.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      820c12d5
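      A small user-space check of the behaviour this entry adds, assuming a
      kernel with the patch (on older kernels the final read() simply blocks
      because no event is generated); error handling kept minimal:

        /* gcc -O2 -o falloc_notify falloc_notify.c */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/inotify.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[4096] __attribute__((aligned(8)));
                int ifd = inotify_init1(0);
                int fd  = open("testfile", O_CREAT | O_WRONLY, 0644);

                if (ifd < 0 || fd < 0) {
                        perror("setup");
                        return 1;
                }
                inotify_add_watch(ifd, "testfile", IN_MODIFY);

                /* Preallocate 1 MiB without changing i_size; with this patch
                 * the kernel still emits IN_MODIFY for the file. */
                if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) < 0)
                        perror("fallocate");

                ssize_t n = read(ifd, buf, sizeof(buf));
                if (n > 0) {
                        struct inotify_event *ev = (struct inotify_event *)buf;
                        printf("got event mask 0x%x (IN_MODIFY is 0x%x)\n",
                               ev->mask, IN_MODIFY);
                }
                return 0;
        }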
    • fs/affs/file.c: remove obsolete pagesize check · 92cab82b
      Fabian Frederick committed
      The Linux kernel does not support page sizes below 4 KB, so the check is obsolete.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92cab82b
    • fs/affs/file.c: add support to O_DIRECT · 9abb4083
      Fabian Frederick committed
      Based on ext2_direct_IO.

      Tested with O_DIRECT file opens and with sysbench/mariadb
      (update_non_index test) on a volume created with mkaffs, showing a 1%
      improvement in written queries.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9abb4083
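      A generic user-space O_DIRECT write of the kind this entry enables on
      affs (the file path is a placeholder; O_DIRECT requires block-aligned
      buffer, offset and length):

        /* gcc -O2 -o dio_write dio_write.c */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                const char *path = argc > 1 ? argv[1] : "testfile";  /* e.g. a file on an affs mount */
                void *buf;
                int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

                if (fd < 0) {
                        perror("open(O_DIRECT)");
                        return 1;
                }
                /* O_DIRECT I/O must use aligned buffers, offsets and sizes. */
                if (posix_memalign(&buf, 4096, 4096)) {
                        perror("posix_memalign");
                        return 1;
                }
                memset(buf, 'A', 4096);

                if (pwrite(fd, buf, 4096, 0) != 4096)
                        perror("pwrite");

                close(fd);
                free(buf);
                return 0;
        }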
    • fs/affs/amigaffs.c: use va_format instead of buffer/vnsprintf · 1ee54b09
      Fabian Frederick committed
      - Remove ErrorBuffer and use %pV.

      - Add __printf to enable argument mismatch warnings.

      Original patch by Joe Perches.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ee54b09
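      A hedged sketch of the %pV pattern referred to above (not necessarily
      the exact affs code): wrap the caller's format and va_list in a struct
      va_format and let a single printk expand it, with __printf() providing
      compile-time argument checking:

        #include <linux/fs.h>
        #include <linux/kernel.h>
        #include <linux/printk.h>

        /* __printf(2, 3) lets the compiler check arguments against the format. */
        __printf(2, 3)
        static void affs_error_sketch(struct super_block *sb, const char *fmt, ...)
        {
                struct va_format vaf;
                va_list args;

                va_start(args, fmt);
                vaf.fmt = fmt;
                vaf.va = &args;
                /* %pV expands the wrapped format/args inside this one printk,
                 * so no intermediate ErrorBuffer is needed. */
                pr_crit("AFFS error (device %s): %pV\n", sb->s_id, &vaf);
                va_end(args);
        }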
    • fs/affs/file.c: forward declaration clean-up · 7633978b
      Fabian Frederick committed
      - Move file_operations to avoid forward declarations.

      - Remove unused declarations.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7633978b
    • syscalls: implement execveat() system call · 51f39a1f
      David Drysdale committed
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
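      A user-space sketch of the /proc-free fexecve() usage described above,
      invoking the new syscall directly (the syscall number shown is the
      x86_64 one; adjust for other architectures):

        /* gcc -O2 -o exec_by_fd exec_by_fd.c */
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        #ifndef __NR_execveat
        #define __NR_execveat 322               /* x86_64 */
        #endif
        #ifndef AT_EMPTY_PATH
        #define AT_EMPTY_PATH 0x1000
        #endif

        int main(void)
        {
                /* Open the binary first; no /proc is needed to exec it later. */
                int fd = open("/bin/echo", O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                char *argv[] = { "echo", "hello from execveat", NULL };
                char *envp[] = { NULL };

                /* Empty filename + AT_EMPTY_PATH: execute the file fd refers to. */
                syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
                perror("execveat");             /* reached only if the exec failed */
                return 1;
        }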
    • fat: fix data past EOF resulting from fsx testsuite · c0ef0cc9
      Namjae Jeon committed
      When running fsx in direct I/O mode, it reported data past EOF issues.
      
        fsx ./file2 -Z -r 4096 -w 4096
        ...
        ..
        truncating to largest ever: 0x907c
        fallocating to largest ever: 0x11137
        truncating to largest ever: 0x2c6fe
        truncating to largest ever: 0x2cfdf
        fallocating to largest ever: 0x40000
        Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
        ...
        ..
      
      The reason is that on a truncate down, zeroing of the final partial
      block does not happen when the new size is not block-aligned.  The path
      truncate_setsize()->truncate_inode_pages()->truncate_inode_pages_range()
      does handle the partial zero-out, but it retrieves the page with
      find_lock_page(), which only looks the page up in the page cache; with
      direct I/O there is no cached page, so the zero-out never happens.

      Add a truncate-page helper for the FAT filesystem based on
      block_truncate_page() and invoke it to zero out the tail when the offset
      is not aligned to the block size.
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
      Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0ef0cc9
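      A hedged sketch of what such a helper looks like; block_truncate_page()
      and fat_get_block() are existing kernel interfaces, but the wrapper
      below is illustrative rather than the patch verbatim:

        #include <linux/buffer_head.h>
        #include <linux/fs.h>
        /* fat_get_block() is declared in fs/fat/fat.h inside the FAT driver. */

        /* Zero the part of the last block beyond 'from', going through the
         * block layer rather than only the page cache, so the tail is zeroed
         * even for files accessed with direct I/O. */
        static int fat_block_truncate_page_sketch(struct inode *inode, loff_t from)
        {
                return block_truncate_page(inode->i_mapping, from, fat_get_block);
        }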
    • befs: remove dead code · f441ada0
      Jan Kara committed
      Coverity id: 1042674
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f441ada0
    • fs, seq_file: fallback to vmalloc instead of oom kill processes · 5cec38ac
      David Rientjes committed
      Since commit 058504ed ("fs/seq_file: fallback to vmalloc allocation"),
      seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
      memory fails.  This was done to address order-4 slab allocations for
      reading /proc/stat on large machines and noticed because
      PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
      allocator when allocating new slab for such high-order allocations.
      
      Contiguous memory isn't necessary for the caller of seq_buf_alloc(), however.
      Other GFP_KERNEL high-order allocations that are <=
      PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and
      oom kill processes as a result.
      
      We don't want to kill processes so that we can allocate contiguous memory
      in situations when contiguous memory isn't necessary.
      
      This patch does the kmalloc() allocation with __GFP_NORETRY for high-order
      allocations.  This still utilizes memory compaction and direct reclaim in
      the allocation path, the only difference is that it will fail immediately
      instead of oom kill processes when out of memory.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5cec38ac
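      A hedged sketch of the allocation pattern described, close to but not
      claimed to be the exact patch:

        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *seq_buf_alloc_sketch(unsigned long size)
        {
                void *buf;

                /* __GFP_NORETRY: fail fast instead of OOM-killing for a
                 * high-order allocation; __GFP_NOWARN: the vmalloc() fallback
                 * makes the kmalloc() failure uninteresting. */
                buf = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
                if (!buf)
                        buf = vmalloc(size);
                return buf;
        }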
    • mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner committed
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • mm: convert i_mmap_mutex to rwsem · c8c06efa
      Davidlohr Bueso committed
      The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
      similar data, one for file-backed pages and the other for anon memory.
      To this end, this lock can also be an rwsem.  In addition, there are
      some important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    • mm: use new helper functions around the i_mmap_mutex · 83cde9e8
      Davidlohr Bueso committed
      Convert all open coded mutex_lock/unlock calls to the
      i_mmap_[lock/unlock]_write() helpers.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83cde9e8
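      A hedged sketch of the shape of these helpers, assuming the rwsem-backed
      field introduced by the conversion in the entry above (i_mmap_rwsem);
      the _sketch suffix marks them as illustrative:

        #include <linux/fs.h>
        #include <linux/rwsem.h>

        /* Thin wrappers that hide which lock type backs i_mmap.  After the
         * rwsem conversion they boil down to plain down/up_write calls. */
        static inline void i_mmap_lock_write_sketch(struct address_space *mapping)
        {
                down_write(&mapping->i_mmap_rwsem);
        }

        static inline void i_mmap_unlock_write_sketch(struct address_space *mapping)
        {
                up_write(&mapping->i_mmap_rwsem);
        }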
  4. 13 December 2014, 1 commit
  5. 12 December 2014, 7 commits
  6. 11 December 2014, 13 commits
    • make default ->i_fop have ->open() fail with ENXIO · bd9b51e7
      Al Viro committed
      As it is, default ->i_fop has NULL ->open() (along with all other methods).
      The only case where it matters is reopening (via procfs symlink) a file that
      didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
      to something sane (default would fail on read/write/ioctl/etc.).
      
      	Unfortunately, such a case exists - alloc_file() users, especially
      anon_inode_getfile() ones.  There we have tons of opened files of very
      different kinds sharing the same inode.  As a result, an attempt to reopen
      those via procfs succeeds and you get a descriptor you can't do anything with.
      
      	Moreover, in case of sockets we set ->i_fop that will only be used
      on such reopen attempts - and put a failing ->open() into it to make sure
      those do not succeed.
      
      	It would be simpler to put such ->open() into default ->i_fop and leave
      it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
      	* everything going through do_dentry_open() works as it used to
      	* sock_no_open() kludge is gone
      	* attempts to reopen anon-inode files fail as they really ought to
      	* ditto for aio_private_file()
      	* ditto for perfmon - this one actually tried to imitate sock_no_open()
      trick, but failed to set ->i_fop, so in the current tree reopens succeed and
      yield completely useless descriptor.  Intent clearly had been to fail with
      -ENXIO on such reopens; now it actually does.
      	* everything else that used alloc_file() keeps working - it has ->i_fop
      set for its inodes anyway
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bd9b51e7
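      A hedged sketch of the idea: a default ->open() that always fails with
      -ENXIO, installed in the default file_operations so reopen attempts on
      files that never came through ->i_fop are rejected (names here are
      illustrative, not necessarily the patch verbatim):

        #include <linux/errno.h>
        #include <linux/fs.h>

        /* Reopening an inode whose file was not obtained through ->i_fop makes
         * no sense; fail such attempts outright. */
        static int no_open_sketch(struct inode *inode, struct file *file)
        {
                return -ENXIO;
        }

        /* Installed as the default ->i_fop so the failure happens automatically. */
        const struct file_operations def_no_open_fops_sketch = {
                .open = no_open_sketch,
        };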
    • make nameidata completely opaque outside of fs/namei.c · 1f55a6ec
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      1f55a6ec
    • kill proc_ns completely · 3d3d35b1
      Al Viro committed
      procfs inodes need only the ns_ops part; nsfs inodes don't need it at all.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3d3d35b1
    • take the targets of /proc/*/ns/* symlinks to separate fs · e149ed2b
      Al Viro committed
      New pseudo-filesystem: nsfs.  Targets of /proc/*/ns/* live there now.
      It's not mountable (not even registered, so it's not in /proc/filesystems,
      etc.).  Files on it *are* bindable - we explicitly permit that in do_loopback().
      
      This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
      get_proc_ns() is a macro now (it's simply returning ->i_private; would
      have been an inline, if not for header ordering headache).
      proc_ns_inode() is an ex-parrot.  The interface used in procfs is
      ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).
      
      Dentries and inodes are never hashed; a non-counting reference to dentry
      is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
      if present.  See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
      of that mechanism.
      
      As a result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
      it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
      from ns_get_path().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      e149ed2b
    • exit: proc: don't try to flush /proc/tgid/task/tgid · c35a7f18
      Oleg Nesterov committed
      proc_flush_task_mnt() always tries to flush task/pid, but this is
      pointless if we reap the leader. d_invalidate() is recursive, and
      if nothing else the next d_hash_and_lookup(tgid) should fail anyway.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c35a7f18
    • fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp · ddbc22e2
      Rasmus Villemoes committed
      Relying on the sign (after casting to int) of the difference of two
      quantities for comparison is usually wrong.  For example, should a-b
      turn out to be 2^31, the return value of cmp(a,b) is -2^31; but that
      would also be the return value from cmp(b, a).  So a compares less than
      b and b compares less than a.  One can also easily find three values
      a,b,c such that a compares less than b, b compares less than c, but a
      does not compare less than c.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddbc22e2
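      A small stand-alone program demonstrating the failure mode described
      above, next to the usual branch-free three-way comparison:

        /* gcc -O2 -o cmpdemo cmpdemo.c */
        #include <stdint.h>
        #include <stdio.h>

        /* Buggy pattern: the sign of the truncated difference is not a valid ordering. */
        static int cmp_sub(uint32_t a, uint32_t b)
        {
                return (int)(a - b);
        }

        /* Correct three-way comparison: -1, 0 or 1. */
        static int cmp_3way(uint32_t a, uint32_t b)
        {
                return (a > b) - (a < b);
        }

        int main(void)
        {
                uint32_t a = 0, b = 0x80000000u;        /* a - b == 2^31 (mod 2^32) */

                /* Both directions come out negative, i.e. a < b and b < a. */
                printf("cmp_sub(a,b)=%d  cmp_sub(b,a)=%d\n", cmp_sub(a, b), cmp_sub(b, a));
                printf("cmp_3way(a,b)=%d cmp_3way(b,a)=%d\n", cmp_3way(a, b), cmp_3way(b, a));
                return 0;
        }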
    • nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races · 705304a8
      Ryusuke Konishi committed
      Same story as in commit 41080b5a ("nfsd race fixes: ext2") (a similar
      ext2 fix), except that nilfs2 needs to use insert_inode_locked4() instead
      of insert_inode_locked(), and a bug in the check for dead inodes needs to
      be fixed.
      
      If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
      insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
      the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.
      
      If nilfs_iget() is called before nilfs_new_inode() calls
      insert_inode_locked4(), it will create an in-core inode and read its
      data from the on-disk inode.  But, nilfs_iget() will find i_nlink equals
      zero and fail at nilfs_read_inode_common(), which will lead it to call
      iget_failed() and cleanly fail.
      
      However, this sanity check doesn't work as expected for reused on-disk
      inodes because they leave a non-zero value in i_mode field and it
      hinders the test of i_nlink.  This patch also fixes the issue by
      removing the test on i_mode that nilfs2 doesn't need.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      705304a8
    • nilfs2: deletion of an unnecessary check before the function call "iput" · 72b9918e
      Markus Elfring committed
      The iput() function tests whether its argument is NULL and then returns
      immediately.  Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72b9918e
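      For reference, a minimal sketch of the redundant pattern and its
      replacement (iput() already ignores a NULL inode):

        #include <linux/fs.h>

        static void put_inode_sketch(struct inode *inode)
        {
                /* Redundant guard of the kind the patch removes:
                 *
                 *      if (inode)
                 *              iput(inode);
                 *
                 * iput() returns immediately for a NULL inode, so this suffices: */
                iput(inode);
        }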
    • nilfs2: avoid duplicate segment construction for fsync() · 75dc857c
      Andreas Rohner committed
      This patch removes filemap_write_and_wait_range() from nilfs_sync_file(),
      because it triggers a data segment construction by calling
      nilfs_writepages() with WB_SYNC_ALL.  A data segment construction does not
      remove the inode from the i_dirty list and it does not clear the
      NILFS_I_DIRTY flag.  Therefore nilfs_inode_dirty() still returns true,
      which leads to an unnecessary duplicate segment construction in
      nilfs_sync_file().
      
      A call to filemap_write_and_wait_range() is not needed, because NILFS2
      does not rely on the generic writeback mechanisms.  Instead it implements
      its own mechanism to collect all dirty pages and write them into segments.
       It is more efficient to initiate the segment construction directly in
      nilfs_sync_file() without the detour over filemap_write_and_wait_range().
      
      Additionally the lock of i_mutex is not needed, because all code blocks
      that are protected by i_mutex are also protected by a NILFS transaction:
      
        Function                i_mutex     nilfs_transaction
        ------------------------------------------------------
        nilfs_ioctl_setflags:   yes         yes
        nilfs_fiemap:           yes         no
        nilfs_write_begin:      yes         yes
        nilfs_write_end:        yes         yes
        nilfs_lookup:           yes         no
        nilfs_create:           yes         yes
        nilfs_link:             yes         yes
        nilfs_mknod:            yes         yes
        nilfs_symlink:          yes         yes
        nilfs_mkdir:            yes         yes
        nilfs_unlink:           yes         yes
        nilfs_rmdir:            yes         yes
        nilfs_rename:           yes         yes
        nilfs_setattr:          yes         yes
      
      For nilfs_lookup() i_mutex is held for the parent directory, to protect it
      from modification.  The segment construction does not modify directory
      inodes, so no lock is needed.
      
      nilfs_fiemap() reads the block layout on the disk, by using
      nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.
      Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75dc857c
    • ncpfs: return proper error from NCP_IOC_SETROOT ioctl · a682e9c2
      Jan Kara committed
      If some error happens in the NCP_IOC_SETROOT ioctl, the appropriate error
      return value is (in most cases) just overwritten before we return.  This
      can result in reporting success to userspace although an error happened.
      
      This bug was introduced by commit 2e54eb96 ("BKL: Remove BKL from
      ncpfs").  Propagate the errors correctly.
      
      Coverity id: 1226925.
      
      Fixes: 2e54eb96 ("BKL: Remove BKL from ncpfs")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Petr Vandrovec <petr@vandrovec.name>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a682e9c2
    • fs/binfmt_elf.c: fix internal inconsistency relating to vma dump size · 52f5592e
      Jungseung Lee committed
      vma_dump_size() has been used several times on actual dumper and it is
      supposed to return the same value for the same vma.  But vma_dump_size()
      could return different values for same vma.
      
      The known problem case is concurrent shared memory removal.  If a vma is
      used for shared memory and that shared memory is removed between writing
      the program header and dumping the vma's memory, the result is a dump
      file which is internally inconsistent.
      
      To fix the problem, take the dump size once as a baseline, store it in
      vma_filesz, and always use that stored size.  Consistency with reality is
      not actually guaranteed, but that is tolerable since the dump stays fully
      consistent with the baseline.
      Signed-off-by: Jungseung Lee <js07.lee@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52f5592e
    • fs/binfmt_misc.c: use GFP_KERNEL instead of GFP_USER · f7e1ad1a
      Andrew Morton committed
      GFP_USER means "honour cpuset nodes-allowed beancounting".  These are
      regular old kernel objects and there seems no reason to give them this
      treatment.
      Acked-by: Mike Frysinger <vapier@gentoo.org>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7e1ad1a
    • binfmt_misc: clean up code style a bit · e6084d4a
      Mike Frysinger committed
      Clean up various coding style issues that checkpatch complains about.
      No functional changes here.
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6084d4a