1. 17 Feb, 2015 1 commit
  2. 13 Feb, 2015 1 commit
    •
      fs: consolidate {nr,free}_cached_objects args in shrink_control · 4101b624
      Vladimir Davydov committed
      We are going to make FS shrinkers memcg-aware.  To achieve that, we will
      have to pass the memcg to scan to the nr_cached_objects and
      free_cached_objects VFS methods, which currently take only the NUMA node
      to scan.  Since the shrink_control structure already holds the node, and
      the memcg to scan will be added to it when we introduce memcg-aware
      vmscan, let us consolidate the methods' arguments in this structure to
      keep things clean.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4101b624
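      The consolidation described above can be sketched with mock types.
      This is only an illustration: every name ending in _sketch is
      hypothetical and does not reproduce the kernel's definitions, and the
      two-node array exists only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES_SKETCH 2

/* Hypothetical stand-ins for kernel types; not the real definitions. */
struct mem_cgroup_sketch { int id; };
struct super_block_sketch { long cached[NR_NODES_SKETCH]; /* objects per NUMA node */ };

/* Scan parameters bundled in one structure, as the commit message describes. */
struct shrink_control_sketch {
    int nid;                          /* NUMA node to scan */
    struct mem_cgroup_sketch *memcg;  /* memcg to scan, to be added later */
};

/* Old style: every new scan parameter would widen the method signature. */
long nr_cached_objects_old(struct super_block_sketch *sb, int nid)
{
    assert(nid >= 0 && nid < NR_NODES_SKETCH);
    return sb->cached[nid];
}

/* New style: the method receives the control structure and reads what it needs. */
long nr_cached_objects_new(struct super_block_sketch *sb,
                           struct shrink_control_sketch *sc)
{
    assert(sc->nid >= 0 && sc->nid < NR_NODES_SKETCH);
    return sb->cached[sc->nid];
}
```

      With this shape, adding the memcg later changes only the structure, not
      the signature of every filesystem method.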
  3. 11 Feb, 2015 3 commits
    •
      rmap: drop support of non-linear mappings · 27ba0644
      Kirill A. Shutemov committed
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ba0644
    •
      mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub · d83a08db
      Kirill A. Shutemov committed
      Nobody uses it anymore.
      
      [akpm@linux-foundation.org: fix filemap_xip.c]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d83a08db
    •
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov committed
      remap_file_pages(2) was invented to efficiently map parts of a
      huge file into a limited 32-bit virtual address space, such as in database
      workloads.
      
      Nonlinear mappings are a pain to support, and it seems there are no
      legitimate use cases nowadays since 64-bit systems are widely available.
      
      Let's drop it and get rid of all this special-cased code.
      
      The patch replaces the syscall with an emulation which creates a new VMA on
      each remap_file_pages() call, unless it can be merged with an adjacent
      one.
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are a few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test the performance impact, I've written a small test case which
      demonstrates pretty much the worst-case scenario: map a 4G shmfs file, write
      the page's pgoff to the beginning of every page, remap the pages in reverse
      order, and read every page.
      
      The test creates 1 million VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              /* Tag every page with its own page offset. */
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              /* Remap the pages in reverse order, one page per call. */
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              /* Page i must now show the tag of the mirrored page. */
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
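      The idea behind the emulation, one ordinary mapping per remapped range,
      can be illustrated from user space with mmap() and MAP_FIXED.  This is a
      sketch under assumptions, not the kernel's implementation: the helper
      names remap_via_mmap() and swap_two_pages_demo() are made up for the
      example, and a temporary file stands in for the shmfs file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map file pages starting at pgoff over an existing mapping at addr. */
int remap_via_mmap(void *addr, size_t size, int fd, size_t pgoff, size_t page_size)
{
    void *p;

    assert(size % page_size == 0);
    p = mmap(addr, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, (off_t)(pgoff * page_size));
    return p == MAP_FAILED ? -1 : 0;
}

/* Swap the two pages of a two-page file view; returns 0 on success. */
int swap_two_pages_demo(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    unsigned char *p;
    int fd, ok;

    if (!f)
        return -1;
    fd = fileno(f);
    if (ftruncate(fd, (off_t)(2 * page)) != 0) {
        fclose(f);
        return -1;
    }
    p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }
    p[0] = 'A';        /* tag file page 0 */
    p[page] = 'B';     /* tag file page 1 */

    /* Each call replaces one page of the view with a separate mapping. */
    if (remap_via_mmap(p, page, fd, 1, page) ||
        remap_via_mmap(p + page, page, fd, 0, page)) {
        munmap(p, 2 * page);
        fclose(f);
        return -1;
    }
    ok = (p[0] == 'B' && p[page] == 'A');
    munmap(p, 2 * page);
    fclose(f);
    return ok ? 0 : -1;
}
```

      Each remapped range that cannot be merged with a neighbour costs one
      VMA, which is why the test case above needs a raised vm.max_map_count.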
  4. 03 Feb, 2015 2 commits
    •
      fs: add FL_LAYOUT lease type · 11afe9f7
      Christoph Hellwig committed
      This (ab-)uses the file locking code to allow filesystems to recall
      outstanding pNFS layouts on a file.  This new lease type is similar but
      not quite the same as FL_DELEG.  A FL_LAYOUT lease can always be granted,
      and a per-filesystem lock (XFS iolock for the initial implementation)
      ensures that no FL_LAYOUT leases are granted when we would need to recall them.
      
      Also included are changes that allow multiple outstanding read
      leases of different types on the same file as long as they have a
      different owner.  This wasn't a problem until now as nfsd never set
      FL_LEASE leases, and no one else used FL_DELEG leases, but given that
      nfsd will also issue FL_LAYOUT leases we will have to handle it now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      11afe9f7
    •
      Make super_blocks and sb_lock static · 15d0f5ea
      Al Viro committed
      The only user outside of fs/super.c is gone now.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      15d0f5ea
  5. 22 Jan, 2015 1 commit
  6. 21 Jan, 2015 2 commits
  7. 17 Jan, 2015 8 commits
  8. 09 Jan, 2015 1 commit
  9. 14 Dec, 2014 5 commits
    •
      aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov committed
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore it exactly as it was. One of the bits of a task's state
      is the rings set up with the io_setup() call. There are (almost) no problems
      in dumping them, but there is a problem restoring them -- if I dump a task
      with an aio ring originally mapped at address A, I want to restore it
      back at exactly the same address A. Unfortunately, io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      This is where the 2nd issue I mentioned at the beginning of this mail
      appears.  If (regardless of the C/R dances I do) someone creates an io
      context with io_setup(), then mremap()-s the ring and then destroys the
      context, the kill_ioctx() routine will call munmap() on the wrong (old)
      address.  This will result in a) the aio ring remaining in memory and
      b) some other vma getting unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
    •
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale committed
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
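      The AT_EMPTY_PATH case described above can be tried from user space
      through the raw syscall.  A minimal sketch, assuming the installed
      kernel headers define SYS_execveat; exec_by_fd() and run_by_fd() are
      hypothetical helper names, and the fallback AT_EMPTY_PATH value is the
      one from the Linux UAPI headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000   /* value from the Linux UAPI headers */
#endif

/* Execute the file that fd refers to; returns only if the exec failed. */
int exec_by_fd(int fd, char *const argv[], char *const envp[])
{
    assert(fd >= 0);
    /* Empty filename plus AT_EMPTY_PATH: run the file behind fd itself. */
    return (int)syscall(SYS_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
}

/* Fork, exec `path` through an open descriptor, return its exit status. */
int run_by_fd(const char *path)
{
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int fd = open(path, O_RDONLY);

        if (fd >= 0)
            exec_by_fd(fd, argv, envp);
        _exit(127);    /* open or exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

      Calling run_by_fd("/bin/true") should return 0 on a kernel that has the
      syscall.  The descriptor is opened without O_CLOEXEC, which matters for
      scripts, as the commit message notes.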
    •
      mm/rmap: share the i_mmap_rwsem · 3dec0ba0
      Davidlohr Bueso committed
      Similarly to the anon memory counterpart, we can share the mapping's lock
      ownership as the interval tree is not modified when doing the walk,
      only the file page.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3dec0ba0
    •
      mm: convert i_mmap_mutex to rwsem · c8c06efa
      Davidlohr Bueso committed
      The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
      similar data, one for file backed pages and the other for anon memory.  To
      this end, this lock can also be a rwsem.  In addition, there are some
      important opportunities to share the lock when there are no tree
      modifications.
      
      This conversion is straightforward.  For now, all users take the write
      lock.
      
      [sfr@canb.auug.org.au: update fremap.c]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8c06efa
    •
      mm,fs: introduce helpers around the i_mmap_mutex · 8b28f621
      Davidlohr Bueso committed
      This series is a continuation of the conversion of the i_mmap_mutex to
      rwsem, following what we have for the anon memory counterpart, and it
      incorporates Hugh's feedback from the first iteration.
      
      Ultimately, the most obvious paths that require exclusive ownership of the
      lock are those where we modify the VMA interval tree, via the
      vma_interval_tree_insert() and vma_interval_tree_remove() families.  Cases
      such as unmapping, where the ptes' content is changed but the tree remains
      untouched, should make it safe to share the i_mmap_rwsem.
      
      As such, the code of course is straightforward, however the devil is very
      much in the details.  While it's been tested on a number of workloads
      without anything exploding, I would not be surprised if there are some
      less documented/known assumptions about the lock that could suffer from
      these changes.  Or maybe I'm just missing something, but either way I
      believe it's at the point where it could use more eyes and hopefully some
      time in linux-next.
      
      Because the lock type conversion is the heart of this patchset,
      it's worth noting a few comparisons between mutex vs rwsem (xadd):
      
        (i) Same size, no extra footprint.
      
        (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
             exclusive lock ownership.
      
        (iii) Both can be slightly unfair wrt exclusive ownership, with
              writer lock stealing properties, not necessarily respecting
              FIFO order for granting the lock when contended.
      
        (iv) Mutexes can be slightly faster than rwsems when
             the lock is non-contended.
      
        (v) Both suck at performance for debug (slowpaths), which
            shouldn't matter anyway.
      
      Sharing the lock is obviously beneficial, and sem writer ownership is
      close enough to mutexes.  The biggest winner of these changes is
      migration.
      
      As for concrete numbers, the following performance results are for a
      4-socket 60-core IvyBridge-EX with 130Gb of RAM.
      
      Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well
      with this set, with a steady ~60% throughput (jpm) increase for alltests
      and up to ~30% for disk for high amounts of concurrency.  Lower counts of
      workload users (< 100) do not show much difference at all, so at least
      no regressions.
      
                          3.18-rc1            3.18-rc1-i_mmap_rwsem
      alltests-100     17918.72 (  0.00%)    28417.97 ( 58.59%)
      alltests-200     16529.39 (  0.00%)    26807.92 ( 62.18%)
      alltests-300     16591.17 (  0.00%)    26878.08 ( 62.00%)
      alltests-400     16490.37 (  0.00%)    26664.63 ( 61.70%)
      alltests-500     16593.17 (  0.00%)    26433.72 ( 59.30%)
      alltests-600     16508.56 (  0.00%)    26409.20 ( 59.97%)
      alltests-700     16508.19 (  0.00%)    26298.58 ( 59.31%)
      alltests-800     16437.58 (  0.00%)    26433.02 ( 60.81%)
      alltests-900     16418.35 (  0.00%)    26241.61 ( 59.83%)
      alltests-1000    16369.00 (  0.00%)    26195.76 ( 60.03%)
      alltests-1100    16330.11 (  0.00%)    26133.46 ( 60.03%)
      alltests-1200    16341.30 (  0.00%)    26084.03 ( 59.62%)
      alltests-1300    16304.75 (  0.00%)    26024.74 ( 59.61%)
      alltests-1400    16231.08 (  0.00%)    25952.35 ( 59.89%)
      alltests-1500    16168.06 (  0.00%)    25850.58 ( 59.89%)
      alltests-1600    16142.56 (  0.00%)    25767.42 ( 59.62%)
      alltests-1700    16118.91 (  0.00%)    25689.58 ( 59.38%)
      alltests-1800    16068.06 (  0.00%)    25599.71 ( 59.32%)
      alltests-1900    16046.94 (  0.00%)    25525.92 ( 59.07%)
      alltests-2000    16007.26 (  0.00%)    25513.07 ( 59.38%)
      
      disk-100          7582.14 (  0.00%)     7257.48 ( -4.28%)
      disk-200          6962.44 (  0.00%)     7109.15 (  2.11%)
      disk-300          6435.93 (  0.00%)     6904.75 (  7.28%)
      disk-400          6370.84 (  0.00%)     6861.26 (  7.70%)
      disk-500          6353.42 (  0.00%)     6846.71 (  7.76%)
      disk-600          6368.82 (  0.00%)     6806.75 (  6.88%)
      disk-700          6331.37 (  0.00%)     6796.01 (  7.34%)
      disk-800          6324.22 (  0.00%)     6788.00 (  7.33%)
      disk-900          6253.52 (  0.00%)     6750.43 (  7.95%)
      disk-1000         6242.53 (  0.00%)     6855.11 (  9.81%)
      disk-1100         6234.75 (  0.00%)     6858.47 ( 10.00%)
      disk-1200         6312.76 (  0.00%)     6845.13 (  8.43%)
      disk-1300         6309.95 (  0.00%)     6834.51 (  8.31%)
      disk-1400         6171.76 (  0.00%)     6787.09 (  9.97%)
      disk-1500         6139.81 (  0.00%)     6761.09 ( 10.12%)
      disk-1600         4807.12 (  0.00%)     6725.33 ( 39.90%)
      disk-1700         4669.50 (  0.00%)     5985.38 ( 28.18%)
      disk-1800         4663.51 (  0.00%)     5972.99 ( 28.08%)
      disk-1900         4674.31 (  0.00%)     5949.94 ( 27.29%)
      disk-2000         4668.36 (  0.00%)     5834.93 ( 24.99%)
      
      In addition, a 67.5% increase in successfully migrated NUMA pages, thus
      improving node locality.
      
      The patch layout is simple but designed for bisection (in case reversion
      is needed if the changes break upstream) and easier review:
      
      o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
      o Patches 5-10 share the lock in specific paths, each patch
        details the rationale behind why it should be safe.
      
      This patchset has been tested with: postgres 9.4 (with brand new hugetlb
      support), hugetlbfs test suite (all tests pass, in fact more tests pass
      with these changes than with an upstream kernel), ltp, aim7 benchmarks,
      memcached and iozone with the -B option for mmap'ing.  *Untested* paths
      are nommu, memory-failure, uprobes and xip.
      
      This patch (of 8):
      
      Various parts of the kernel acquire and release this mutex, so add
      i_mmap_lock_write() and i_mmap_unlock_write() helper functions that will
      encapsulate this logic.  The next patch will make use of these.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b28f621
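      A user-space analogue of the helper approach, using a POSIX rwlock in
      place of the kernel rwsem.  This is a sketch only: address_space_sketch
      is a mock type, the read-side helpers are an assumption added for the
      illustration (the message above names only the write-side ones), and
      i_mmap_sharing_demo() is a made-up self-check.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Mock of the object that owns the lock; not the kernel structure. */
struct address_space_sketch {
    pthread_rwlock_t i_mmap_rwsem;
};

/* Exclusive ownership: for paths that modify the interval tree. */
void i_mmap_lock_write(struct address_space_sketch *m)
{
    pthread_rwlock_wrlock(&m->i_mmap_rwsem);
}

void i_mmap_unlock_write(struct address_space_sketch *m)
{
    pthread_rwlock_unlock(&m->i_mmap_rwsem);
}

/* Shared ownership (assumed helpers): for walks that leave the tree alone. */
void i_mmap_lock_read(struct address_space_sketch *m)
{
    pthread_rwlock_rdlock(&m->i_mmap_rwsem);
}

void i_mmap_unlock_read(struct address_space_sketch *m)
{
    pthread_rwlock_unlock(&m->i_mmap_rwsem);
}

/* Returns 0 if two readers can share the lock while a writer is refused. */
int i_mmap_sharing_demo(void)
{
    struct address_space_sketch m;
    int second_reader, writer;

    if (pthread_rwlock_init(&m.i_mmap_rwsem, NULL) != 0)
        return -1;
    i_mmap_lock_read(&m);
    second_reader = pthread_rwlock_tryrdlock(&m.i_mmap_rwsem);
    writer = pthread_rwlock_trywrlock(&m.i_mmap_rwsem);
    assert(writer != 0);             /* a writer must not get in past readers */
    if (second_reader == 0)
        i_mmap_unlock_read(&m);
    i_mmap_unlock_read(&m);
    pthread_rwlock_destroy(&m.i_mmap_rwsem);
    return (second_reader == 0 && writer == EBUSY) ? 0 : -1;
}
```

      Hiding the lock behind helpers means callers do not change when the
      lock type does, which is what makes the later mutex-to-rwsem switch a
      small patch.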
  10. 11 Dec, 2014 1 commit
    •
      make default ->i_fop have ->open() fail with ENXIO · bd9b51e7
      Al Viro committed
      As it is, default ->i_fop has NULL ->open() (along with all other methods).
      The only case where it matters is reopening (via procfs symlink) a file that
      didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
      to something sane (default would fail on read/write/ioctl/etc.).
      
      	Unfortunately, such case exists - alloc_file() users, especially
      anon_get_file() ones.  There we have tons of opened files of very different
      kinds sharing the same inode.  As a result, an attempt to reopen those via
      procfs succeeds and you get a descriptor you can't do anything with.
      
      	Moreover, in case of sockets we set ->i_fop that will only be used
      on such reopen attempts - and put a failing ->open() into it to make sure
      those do not succeed.
      
      	It would be simpler to put such ->open() into default ->i_fop and leave
      it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
      	* everything going through do_dentry_open() works as it used to
      	* sock_no_open() kludge is gone
      	* attempts to reopen anon-inode files fail as they really ought to
      	* ditto for aio_private_file()
      	* ditto for perfmon - this one actually tried to imitate sock_no_open()
      trick, but failed to set ->i_fop, so in the current tree reopens succeed and
      yield a completely useless descriptor.  The intent clearly had been to fail with
      -ENXIO on such reopens; now it actually does.
      	* everything else that used alloc_file() keeps working - it has ->i_fop
      set for its inodes anyway
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bd9b51e7
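      The difference between a NULL ->open() and a failing default one can be
      sketched with mock types.  All names ending in _sketch are hypothetical;
      only the -ENXIO result and the "NULL ->open() lets the reopen succeed"
      behaviour come from the message above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct inode_sketch;
struct file_sketch;

/* Mock method table with only the method that matters here. */
struct file_operations_sketch {
    int (*open)(struct inode_sketch *inode, struct file_sketch *file);
};

struct inode_sketch {
    const struct file_operations_sketch *i_fop;
};

/* Default ->open(): refuse, so a reopen cannot yield a useless descriptor. */
int no_open(struct inode_sketch *inode, struct file_sketch *file)
{
    (void)inode;
    (void)file;
    return -ENXIO;
}

const struct file_operations_sketch default_fops_sketch = { .open = no_open };

/* New inodes start with the failing default until a filesystem overrides it. */
void inode_init_sketch(struct inode_sketch *inode)
{
    inode->i_fop = &default_fops_sketch;
}

/* Reopen path: a NULL ->open() means "nothing to do", so the open succeeds. */
int reopen_sketch(struct inode_sketch *inode, struct file_sketch *file)
{
    const struct file_operations_sketch *fop = inode->i_fop;

    assert(inode != NULL);
    if (fop && fop->open)
        return fop->open(inode, file);
    return 0;
}
```

      An inode left with the default table now fails a reopen with -ENXIO,
      while an inode whose table has no ->open() still opens as before.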
  11. 20 Nov, 2014 1 commit
  12. 18 Nov, 2014 1 commit
  13. 17 Nov, 2014 1 commit
    •
      fs: add freeze_super/thaw_super fs hooks · 48b6bca6
      Benjamin Marzinski committed
      Currently, freezing a filesystem involves calling freeze_super, which locks
      sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
      hard for gfs2 (and potentially other cluster filesystems) to use the vfs
      freezing code to do freezes on all the cluster nodes.
      
      In order to communicate that a freeze has been requested, and to make sure
      that only one node is trying to freeze at a time, gfs2 uses a glock
      (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
      this lock before calling freeze_super. This means that two nodes can
      attempt to freeze the filesystem by both calling freeze_super, acquiring
      the sb->s_umount lock, and then attempting to grab the cluster glock
      sd_freeze_gl. Only one will succeed, and the other will be stuck in
      freeze_super, making it impossible to finish freezing the node.
      
      To solve this problem, this patch adds the freeze_super and thaw_super
      hooks.  If a filesystem implements these hooks, they are called instead of
      the vfs freeze_super and thaw_super functions. This means that every
      filesystem that implements these hooks must call the vfs freeze_super and
      thaw_super functions itself within the hook function to make use of the vfs
      freezing code.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      48b6bca6
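      The dispatch rule described above, with the hook wrapping the generic
      code, can be sketched as follows.  The types and function names are
      mock-ups for illustration, and the cluster_lock_held flag merely stands
      in for acquiring a cluster-wide lock such as the gfs2 glock.

```c
#include <assert.h>
#include <stddef.h>

struct sb_sketch;

/* Mock super_operations with the optional hook only. */
struct super_ops_sketch {
    int (*freeze_super)(struct sb_sketch *sb);
};

struct sb_sketch {
    const struct super_ops_sketch *s_op;
    int cluster_lock_held;   /* stands in for a cluster-wide lock */
    int frozen;
};

/* Stand-in for the generic VFS freezing code. */
int vfs_freeze_super_sketch(struct sb_sketch *sb)
{
    sb->frozen = 1;
    return 0;
}

/* Caller side: use the filesystem hook if present, else the generic code. */
int do_freeze_sketch(struct sb_sketch *sb)
{
    assert(sb != NULL);
    if (sb->s_op && sb->s_op->freeze_super)
        return sb->s_op->freeze_super(sb);
    return vfs_freeze_super_sketch(sb);
}

/* Cluster filesystem hook: take the cluster lock first, then freeze. */
int cluster_freeze_super_sketch(struct sb_sketch *sb)
{
    int ret;

    sb->cluster_lock_held = 1;
    ret = vfs_freeze_super_sketch(sb);   /* the hook must call this itself */
    if (ret)
        sb->cluster_lock_held = 0;       /* undo on failure */
    return ret;
}
```

      Because the hook runs before the generic code, the cluster lock is
      taken before s_umount, which avoids the ordering problem described in
      the message.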
  14. 10 Nov, 2014 3 commits
  15. 08 Nov, 2014 1 commit
  16. 06 Nov, 2014 1 commit
  17. 01 Nov, 2014 2 commits
  18. 31 Oct, 2014 1 commit
    •
      Return short read or 0 at end of a raw device, not EIO · b2de525f
      David Jeffery committed
      Author: David Jeffery <djeffery@redhat.com>
      Changes to the basic direct I/O code have broken the raw driver when reading
      to the end of a raw device.  Instead of returning a short read for a read that
      extends partially beyond the device's end or 0 when at the end of the device,
      these reads now return EIO.
      
      The raw driver needs the same end of device handling as was added for normal
      block devices.  Using blkdev_read_iter, which has the needed size checks,
      prevents the EIO conditions at the end of the device.
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b2de525f
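      The end-of-device behaviour described above amounts to clamping the
      request against the device size.  A minimal sketch of that kind of size
      check; clamp_read_to_device() is a hypothetical helper, not the code of
      blkdev_read_iter.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes a read of `count` bytes at `pos` may return on a device
 * of `dev_size` bytes: 0 at or past the end, a short count if the request
 * crosses the end, the full count otherwise.
 */
size_t clamp_read_to_device(unsigned long long pos, size_t count,
                            unsigned long long dev_size)
{
    unsigned long long left;

    if (pos >= dev_size)
        return 0;                      /* at the end: report EOF, not EIO */
    left = dev_size - pos;
    assert(left > 0);
    return count < left ? count : (size_t)left;
}
```

      For a 1024-byte device, a 512-byte read at offset 768 yields 256 bytes
      and the same read at offset 1024 yields 0.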
  19. 29 Oct, 2014 1 commit
  20. 24 Oct, 2014 3 commits
    •
      fs: limit filesystem stacking depth · 69c433ed
      Miklos Szeredi committed
      Add a simple read-only counter to super_block that indicates how deep this
      is in the stack of filesystems.  Previously ecryptfs was the only stackable
      filesystem and it explicitly disallowed multiple layers of itself.
      
      Overlayfs, however, can be stacked recursively and also may be stacked
      on top of ecryptfs or vice versa.
      
      To limit the kernel stack usage we must limit the depth of the
      filesystem stack.  Initially the limit is set to 2.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      69c433ed
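      One way a stacking filesystem could use such a counter is sketched
      below.  Only the limit of 2 comes from the message above; the rule
      "depth = deepest underlying layer + 1", the -EINVAL result and all
      names are assumptions made for the illustration.

```c
#include <assert.h>
#include <errno.h>

#define MAX_STACK_DEPTH_SKETCH 2   /* the initial limit named in the message */

/* Mock super_block carrying only the depth counter. */
struct sb_depth_sketch {
    int s_stack_depth;             /* 0 for a filesystem that is not stacked */
};

/* Compute the depth of a new stacked filesystem; refuse if it is too deep. */
int stack_on_sketch(struct sb_depth_sketch *sb,
                    const struct sb_depth_sketch *upper,
                    const struct sb_depth_sketch *lower)
{
    int deepest = upper->s_stack_depth > lower->s_stack_depth ?
                  upper->s_stack_depth : lower->s_stack_depth;
    int depth = deepest + 1;

    assert(deepest >= 0);
    if (depth > MAX_STACK_DEPTH_SKETCH)
        return -EINVAL;            /* assumed error code */
    sb->s_stack_depth = depth;
    return 0;
}
```

      Under these assumptions a stack on plain filesystems gets depth 1, a
      stack on top of that gets depth 2, and a third level is refused.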
    •
      vfs: add whiteout support · 787fb6bc
      Miklos Szeredi committed
      Whiteout isn't actually a new file type, but is represented as a char
      device (Linus's idea) with 0/0 device number.
      
      This has several advantages compared to introducing a new whiteout file
      type:
      
       - no userspace API changes (e.g. trivial to make backups of upper layer
         filesystem, without losing whiteouts)
      
       - no fs image format changes (you can boot an old kernel/fsck without
         whiteout support and things won't break)
      
       - implementation is trivial
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      787fb6bc
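      Since a whiteout is just a char device with a 0/0 device number, user
      space can recognise one from ordinary stat data.  A minimal sketch;
      is_whiteout() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* True if the stat data describes a char device with device number 0/0. */
bool is_whiteout(const struct stat *st)
{
    assert(st != NULL);
    return S_ISCHR(st->st_mode) &&
           major(st->st_rdev) == 0 && minor(st->st_rdev) == 0;
}
```

      A backup tool that already preserves device nodes therefore preserves
      whiteouts without knowing about them, which is the first advantage
      listed above.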
    •
      vfs: export check_sticky() · cbdf35bc
      Miklos Szeredi committed
      It's already duplicated in btrfs and about to be used in overlayfs too.
      
      Move the sticky bit check to an inline helper and call the out-of-line
      helper only in the unlikely case of the sticky bit being set.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      cbdf35bc
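      The split between a cheap inline test and an out-of-line slow path can
      be sketched with mock types.  The structure, the function names and the
      ownership rule in the slow path are assumptions for illustration, not
      the kernel's code.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Mock inode with just the fields the check needs. */
struct inode_sticky_sketch {
    mode_t i_mode;
    uid_t i_uid;
};

/* Out-of-line slow path: nonzero means the caller may not remove the entry. */
int slow_check_sticky_sketch(const struct inode_sticky_sketch *dir,
                             const struct inode_sticky_sketch *inode,
                             uid_t fsuid, bool privileged)
{
    if (fsuid == inode->i_uid)
        return 0;                  /* owner of the file */
    if (fsuid == dir->i_uid)
        return 0;                  /* owner of the directory */
    return !privileged;
}

/* Inline fast path: most directories are not sticky, so return early. */
static inline int check_sticky_sketch(const struct inode_sticky_sketch *dir,
                                      const struct inode_sticky_sketch *inode,
                                      uid_t fsuid, bool privileged)
{
    assert(dir != NULL && inode != NULL);
    if (!(dir->i_mode & S_ISVTX))
        return 0;
    return slow_check_sticky_sketch(dir, inode, fsuid, privileged);
}
```

      Keeping only the bit test inline means the common case costs one
      comparison, and the function call is paid only for sticky directories.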