  1. 06 Feb, 2018 1 commit
    • afs: Support the AFS dynamic root · 4d673da1
      Committed by David Howells
      Support the AFS dynamic root which is a pseudo-volume that doesn't connect
      to any server resource, but rather is just a root directory that
      dynamically creates mountpoint directories where the name of such a
      directory is the name of the cell.
      
      Such a mount can be created thus:
      
      	mount -t afs none /afs -o dyn
      
      Dynamic root superblocks aren't shared except by bind mounts and
      propagation.  Cell root volumes can then be mounted by referring to them by
      name, e.g.:
      
      	ls /afs/grand.central.org/
      	ls /afs/.grand.central.org/
      
      The kernel will upcall to consult the DNS if the address wasn't supplied
      directly.
      Signed-off-by: David Howells <dhowells@redhat.com>
  2. 02 Jan, 2018 1 commit
    • afs: Fix unlink · 440fbc3a
      Committed by David Howells
      Repeating creation and deletion of a file on an afs mount will run the box
      out of memory, e.g.:
      
      	dd if=/dev/zero of=/afs/scratch/m0 bs=$((1024*1024)) count=512
      	rm /afs/scratch/m0
      
      The problem seems to be that it's not properly decrementing the nlink count
      so that the inode can be scrapped.
      
      Note that this doesn't fix local creation followed by remote deletion.
      That's harder to handle and will require a separate patch as we're not told
      that the file has been deleted - only that the directory has changed.
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
  3. 24 Nov, 2017 3 commits
  4. 13 Nov, 2017 5 commits
    • afs: Introduce a file-private data record · 215804a9
      Committed by David Howells
      Introduce a file-private data record for kAFS and put the key into it
      rather than storing the key in file->private_data.
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Fix directory read/modify race · dab17c1a
      Committed by David Howells
      Because parsing of the directory wasn't being done under any sort of lock,
      the pages holding the directory content can get invalidated whilst the
      parsing is ongoing.
      
      Further, the directory page check function gets called outside of the page
      lock, so if the page gets cleared or updated, this may return reports of
      bad magic numbers in the directory page.
      
      Also, the directory may change size whilst checking and parsing are
      ongoing, so more care needs to be taken here.
      
      Fix this by:
      
       (1) Perform the page check from the page filling function before we set
           PageUptodate and drop the page lock.
      
       (2) Check for the file having shrunk and the page having been abandoned
           before checking the page contents.
      
       (3) Lock the page whilst parsing it for the directory iterator.
      
      Whilst we're at it, add a tracepoint to report check failure.
      Signed-off-by: David Howells <dhowells@redhat.com>
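The locking discipline in fixes (1) and (3) above can be modelled in user space. This is only an illustrative sketch, not the kernel code: `DirPage`, `fill_page`, `iterate_page` and the `DIR_MAGIC` value are hypothetical names invented here, and fix (2), the shrunken-file check, is left out.

```python
import threading

DIR_MAGIC = 0x1234  # hypothetical placeholder, not the real AFS directory magic


class DirPage:
    """Toy stand-in for a page-cache page holding directory blocks."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of the page lock
        self.uptodate = False
        self.blocks = []


def check_page(blocks):
    """Return True if every block carries the expected magic number."""
    return all(b.get("magic") == DIR_MAGIC for b in blocks)


def fill_page(page, fetched_blocks):
    """Fix (1): validate while still holding the lock, before marking uptodate."""
    with page.lock:
        if not check_page(fetched_blocks):
            return False
        page.blocks = list(fetched_blocks)
        page.uptodate = True
        return True


def iterate_page(page):
    """Fix (3): hold the lock while parsing so the contents cannot change."""
    with page.lock:
        if not page.uptodate:
            return []
        return [b["name"] for b in page.blocks]
```

Because the check runs before the page becomes visible as up to date, a reader never parses a page that has not been validated.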
    • afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      Committed by David Howells
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      servers.
      
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      
      To this end, the following structural changes are made:
      
       (1) Server record management is overhauled:
      
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
      
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
      
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
      
           (d) Server records are now keyed by their UUID within the namespace.
      
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           	 parameter.
      
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
      
           (g) The servers list is now in /proc/fs/afs/servers.
      
       (2) Volume record management is overhauled:
      
           (a) An RCU-replaceable server list is introduced.  This tracks both
           	 servers and their corresponding callback interests.
      
           (b) The superblock is now keyed on cell record and numeric volume ID.
      
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
      
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
      
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
      
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      
      and the following procedural changes are made:
      
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
      
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
      
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
      
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
      
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
      
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
      
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           	 moment.
      
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           	 salvaging.
      
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
      
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
      
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
      
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
      
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
      
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
      
      Notes:
      
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
      
       (2) VBUSY is retried forever for the moment at intervals of 1s.
      
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      Signed-off-by: David Howells <dhowells@redhat.com>
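The abort handling described in procedural change (4) amounts to a small decision table. The sketch below is a user-space model under stated assumptions: the `Abort` and `Action` enums and `next_action()` are hypothetical names, the abort codes are symbolic rather than the real numeric values, and applying the statfs() exception only to the aborts that would otherwise sleep is one reading of item (e).

```python
from enum import Enum, auto


class Abort(Enum):
    VMOVED = auto()
    VBUSY = auto()
    VRESTARTING = auto()
    VSALVAGING = auto()
    VOFFLINE = auto()
    VNOVOL = auto()
    OTHER = auto()  # any abort not discussed in the commit message


class Action(Enum):
    RECHECK_VOLUME_AND_RESTART = auto()
    SLEEP_AND_RETRY = auto()
    RECHECK_VLDB = auto()
    RETURN_ERROR = auto()


# Aborts that are handled by sleeping; VOFFLINE is treated as VBUSY for now.
_WOULD_SLEEP = {Abort.VBUSY, Abort.VRESTARTING, Abort.VSALVAGING, Abort.VOFFLINE}


def next_action(abort, in_statfs=False):
    """Pick the next step of the fileserver iteration for a remote abort."""
    if abort is Abort.VMOVED:
        return Action.RECHECK_VOLUME_AND_RESTART
    if abort in _WOULD_SLEEP:
        # Assumption: statfs() returns an error instead of sleeping,
        # so that the umount program is not blocked.
        return Action.RETURN_ERROR if in_statfs else Action.SLEEP_AND_RETRY
    if abort is Abort.VNOVOL:
        return Action.RECHECK_VLDB
    return Action.RETURN_ERROR
```

Keeping the abort code around, rather than translating it to an errno straight away, is what makes a table like this possible.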
    • afs: Overhaul the callback handling · c435ee34
      Committed by David Howells
      Overhaul the AFS callback handling by the following means:
      
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
      
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
      
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
      
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
           ilookup_nowait().
      
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
      
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
      
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
      
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
      
      This has the following consequences:
      
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
      
       (2) Mass invalidation isn't seen until someone calls afs_validate().
      
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock(), since inodes are destroyed
           under RCU conditions?
      Signed-off-by: David Howells <dhowells@redhat.com>
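The mass-break counter from point (4) can be sketched as follows. This is a toy model, not the kernel implementation: the `Server` and `Vnode` classes and `validate()` are hypothetical, only the counter name `cb_s_break` is taken from the commit message, and no locking is shown.

```python
class Server:
    """Toy fileserver record with a counter of mass callback breaks."""

    def __init__(self):
        self.cb_s_break = 0

    def mass_break(self):
        # One increment invalidates every vnode without walking a list.
        self.cb_s_break += 1


class Vnode:
    """Toy vnode that remembers the counter value it last saw."""

    def __init__(self, server):
        self.server = server
        self.cb_s_break = server.cb_s_break


def validate(vnode):
    """Return True if the cached state is still covered by a callback.

    On a mismatch the caller would refetch status from the server; here
    the vnode's copy is simply resynchronised.
    """
    if vnode.cb_s_break != vnode.server.cb_s_break:
        vnode.cb_s_break = vnode.server.cb_s_break
        return False
    return True
```

This also illustrates consequence (2): nothing happens to a vnode at the moment of the mass break; the invalidation is only noticed on the next `validate()` call.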
    • afs: Push the net ns pointer to more places · 9ed900b1
      Committed by David Howells
      Push the network namespace pointer to more places in AFS, including the
      afs_server structure (which doesn't hold a ref on the netns).
      
      In particular, afs_put_cell() now requires a net ns parameter so that
      it can safely alter the netns after decrementing the cell usage count - the
      cell will be deallocated by a background thread after being cached for a
      period, which means that it's not safe to access it after reducing its
      usage count.
      Signed-off-by: David Howells <dhowells@redhat.com>
  5. 10 Jul, 2017 1 commit
    • afs: Add metadata xattrs · d3e3b7ea
      Committed by David Howells
      Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
      available.  The following xattrs are now available:
      
       - "afs.cell"
      
         The name of the cell in which the vnode's volume resides.
      
       - "afs.fid"
      
         The volume ID, vnode ID and vnode uniquifier of the file as three hex
         numbers separated by colons.
      
       - "afs.volume"
      
         The name of the volume in which the vnode resides.
      
      For example:
      
      	# getfattr -d -m ".*" /mnt/scratch
      	getfattr: Removing leading '/' from absolute path names
      	# file: mnt/scratch
      	afs.cell="mycell.myorg.org"
      	afs.fid="10000b:1:1"
      	afs.volume="scratch"
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
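As a small illustration of the "afs.fid" format described above (three hex numbers separated by colons), the value can be split back into its parts. `AfsFid` and `parse_afs_fid()` are hypothetical helper names used only for this sketch.

```python
from typing import NamedTuple


class AfsFid(NamedTuple):
    volume: int  # volume ID
    vnode: int   # vnode ID
    unique: int  # vnode uniquifier


def parse_afs_fid(value: str) -> AfsFid:
    """Parse an "afs.fid" xattr value such as "10000b:1:1"."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated fields: {value!r}")
    volume, vnode, unique = (int(p, 16) for p in parts)
    return AfsFid(volume, vnode, unique)
```

On Linux the raw value could be read with `os.getxattr(path, "afs.fid")`, which returns bytes that need decoding first; that step requires an AFS mount and is not shown.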
  6. 28 Feb, 2017 1 commit
  7. 27 Sep, 2016 2 commits
    • fs: rename "rename2" i_op to "rename" · 2773bf00
      Committed by Miklos Szeredi
      Generated patch:
      
      sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
      sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • fs: make remaining filesystems use .rename2 · 1cd66c93
      Committed by Miklos Szeredi
      This is trivial to do:
      
       - add flags argument to foo_rename()
       - check if flags is zero
       - assign foo_rename() to .rename2 instead of .rename
      
      This doesn't mean it's impossible to support RENAME_NOREPLACE for these
      filesystems, but it is not trivial, like for local filesystems.
      RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
      for a file to be created on one host while it is overwritten by rename on
      another host).
      
      Filesystems converted:
      
      9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.
      
      After this, we can get rid of the duplicate interfaces for rename.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: David Howells <dhowells@redhat.com> [AFS]
      Acked-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jan Harkes <jaharkes@cs.cmu.edu>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
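The three-step conversion listed above (add a flags argument, check that it is zero, hook the function up as .rename2) can be mirrored in a user-space model. `foo_rename()` here operates on a plain dict standing in for a directory, which is an assumption made purely for illustration.

```python
import errno

RENAME_NOREPLACE = 1 << 0  # flag bit as defined by the Linux renameat2() ABI


def foo_rename(entries, old_name, new_name, flags=0):
    """Toy rename for a filesystem that supports no rename flags.

    Returns 0 on success or a negative errno, in the kernel's style.
    """
    if flags != 0:
        # No flag is supported, so refuse rather than silently ignore it.
        return -errno.EINVAL
    if old_name not in entries:
        return -errno.ENOENT
    entries[new_name] = entries.pop(old_name)
    return 0
```

Rejecting unknown flags keeps the caller from assuming an atomicity guarantee that the filesystem does not provide.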
  8. 11 May, 2016 1 commit
  9. 03 May, 2016 1 commit
    • make ext2_get_page() and friends work without external serialization · be5b82db
      Committed by Al Viro
      Right now ext2_get_page() (and its analogues in a bunch of other filesystems)
      relies upon the directory being locked - the way it sets and tests Checked and
      Error bits would be racy without that.  Switch to a slightly different scheme,
      _not_ setting Checked in case of failure.  That way the logic becomes
      	if Checked => OK
      	else if Error => fail
      	else if !validate => fail
      	else => OK
      with validation setting Checked or Error on success and failure respectively,
      and returning which one happened.  Equivalent to the current logic, but unlike
      the current logic, not sensitive to the order of set_bit, test_bit getting
      reordered by the CPU, etc.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
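The four-branch scheme in the message above translates directly into code. The sketch below is a user-space model only: the `Page` class with two booleans and the `contents_ok` argument (standing in for the real content validation) are assumptions made for illustration.

```python
class Page:
    """Toy page carrying the two state bits discussed above."""

    def __init__(self):
        self.checked = False
        self.error = False


def validate(page, contents_ok):
    """Set exactly one of Checked or Error and report which one."""
    if contents_ok:
        page.checked = True
        return True
    page.error = True  # Checked is deliberately NOT set on failure
    return False


def page_ok(page, contents_ok):
    """if Checked => OK; else if Error => fail; else result of validate."""
    if page.checked:
        return True
    if page.error:
        return False
    return validate(page, contents_ok)
```

Since a failed validation never sets Checked, seeing Checked alone is enough to trust the page, whatever order the two bits are observed in.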
  10. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion as to whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
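To make the effect of the listed substitutions concrete, here is a rough textual approximation using regular expressions. It is not coccinelle and does not parse C the way the semantic patch does; `convert()` and its rule table are hypothetical and only cover simple spellings of each pattern.

```python
import re

# Order matters: the shift rule must run before PAGE_CACHE_SHIFT is renamed.
_RULES = [
    # <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT)  ->  <foo>  (likewise for >>)
    (re.compile(r"\s*(?:<<|>>)\s*\(\s*PAGE_CACHE_SHIFT\s*-\s*PAGE_SHIFT\s*\)"), ""),
    # PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN}  ->  PAGE_{SIZE,SHIFT,MASK,ALIGN}
    (re.compile(r"\bPAGE_CACHE_(SIZE|SHIFT|MASK|ALIGN)\b"), r"PAGE_\1"),
    # page_cache_get()  ->  get_page()
    (re.compile(r"\bpage_cache_get\b"), "get_page"),
    # page_cache_release()  ->  put_page()
    (re.compile(r"\bpage_cache_release\b"), "put_page"),
]


def convert(source: str) -> str:
    """Apply each textual rule in turn to a fragment of C source."""
    for pattern, replacement in _RULES:
        source = pattern.sub(replacement, source)
    return source
```

A regex pass like this can misfire inside comments or strings, which is one reason a syntax-aware tool was used for the real patch.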
  11. 16 Apr, 2015 1 commit
  12. 20 Nov, 2014 1 commit
  13. 01 Nov, 2014 1 commit
  14. 09 Oct, 2014 1 commit
  15. 30 Sep, 2013 1 commit
  16. 08 Sep, 2013 1 commit
  17. 06 Sep, 2013 1 commit
  18. 29 Jun, 2013 1 commit
  19. 23 Feb, 2013 1 commit
  20. 14 Jul, 2012 3 commits
  21. 04 Jan, 2012 2 commits
  22. 16 Jun, 2011 1 commit
  23. 28 May, 2011 1 commit
  24. 26 May, 2011 2 commits
  25. 16 Jan, 2011 1 commit
  26. 13 Jan, 2011 1 commit
  27. 07 Jan, 2011 3 commits
    • fs: rcu-walk aware d_revalidate method · 34286d66
      Committed by Nick Piggin
      Require filesystems be aware of .d_revalidate being called in rcu-walk
      mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
      -ECHILD from all implementations.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
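The "simple push down" described above means every d_revalidate implementation bails out when called in rcu-walk mode. A minimal user-space model follows; the numeric value given to `LOOKUP_RCU` is a placeholder chosen for this sketch, and the function signature is simplified.

```python
import errno

LOOKUP_RCU = 0x40  # placeholder bit value for illustration only


def d_revalidate(dentry_valid, flags):
    """Toy d_revalidate: 1 = still valid, 0 = invalid, negative errno on error."""
    if flags & LOOKUP_RCU:
        # Cannot block in rcu-walk mode; ask the VFS to retry in ref-walk mode.
        return -errno.ECHILD
    return 1 if dentry_valid else 0
```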
    • fs: dcache reduce branches in lookup path · fb045adb
      Committed by Nick Piggin
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
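The idea of caching "which d_ops are set" as dentry flags can be sketched like this. It is a user-space model under assumptions: the flag bit values are placeholders, the operations table is a plain dict, and `needs_revalidate()` is a hypothetical helper showing the fast-path test.

```python
# Placeholder bit values; only the flag names follow the kernel's naming.
DCACHE_OP_HASH = 0x1
DCACHE_OP_COMPARE = 0x2
DCACHE_OP_REVALIDATE = 0x4
DCACHE_OP_DELETE = 0x8

_OP_FLAGS = {
    "d_hash": DCACHE_OP_HASH,
    "d_compare": DCACHE_OP_COMPARE,
    "d_revalidate": DCACHE_OP_REVALIDATE,
    "d_delete": DCACHE_OP_DELETE,
}


class Dentry:
    def __init__(self):
        self.d_flags = 0
        self.d_op = None


def d_set_d_op(dentry, ops):
    """Install an ops table and record which operations it provides."""
    dentry.d_op = ops
    if not ops:
        return
    for name, flag in _OP_FLAGS.items():
        if ops.get(name) is not None:
            dentry.d_flags |= flag


def needs_revalidate(dentry):
    """Fast-path test: one flag check, no need to look at d_op at all."""
    return bool(dentry.d_flags & DCACHE_OP_REVALIDATE)
```

Routing every assignment through a setter is what keeps the flags and the ops table in sync, which is why the commit rewrote direct `d_op` assignments.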
    • fs: change d_delete semantics · fe15ce44
      Committed by Nick Piggin
      Change d_delete from a dentry deletion notification to dentry caching
      advice, more like ->drop_inode. Require it to be constant and idempotent,
      and not to take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>