1. 25 2月, 2017 3 次提交
  2. 21 2月, 2017 2 次提交
    • C
      nfsd: special case truncates some more · 783112f7
      Christoph Hellwig 提交于
      Both the NFS protocols and the Linux VFS use a setattr operation with a
      bitmap of attributes to set to set various file attributes including the
      file size and the uid/gid.
      
      The Linux syscalls never mix size updates with unrelated updates like
      the uid/gid, and some file systems like XFS and GFS2 rely on the fact
      that truncates don't update random other attributes, and many other file
      systems handle the case but do not update the other attributes in the
      same transaction.  NFSD on the other hand passes the attributes it gets
      on the wire more or less directly through to the VFS, leading to updates
      the file systems don't expect.  XFS at least has an assert on the
      allowed attributes, which caught an unusual NFS client setting the size
      and group at the same time.
      
      To handle this issue properly this splits the notify_change call in
      nfsd_setattr into two separate ones.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Tested-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      783112f7
    • C
      nfsd: minor nfsd_setattr cleanup · 758e99fe
      Christoph Hellwig 提交于
      Simplify exit paths, size_change use.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: stable@kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      758e99fe
  3. 18 2月, 2017 6 次提交
  4. 10 2月, 2017 1 次提交
  5. 07 2月, 2017 1 次提交
    • N
      NFSDv4: use export cache flushtime for changeid on V4ROOT objects. · b8800921
      NeilBrown 提交于
      If you change the set of filesystems that are exported, then
      the contents of various directories in the NFSv4 pseudo-root
      is likely to change.  However the change-id of those
      directories is currently tied to the underlying directory,
      so the client may not see the changes in a timely fashion.
      
      This patch changes the change-id number to be derived from the
      "flush_time" of the export cache.  Whenever any changes are
      made to the set of exported filesystems, this flush_time is
      updated.  The result is that clients see changes to the set
      of exported filesystems much more quickly, often immediately.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      b8800921
  6. 04 2月, 2017 1 次提交
    • M
      fs: break out of iomap_file_buffered_write on fatal signals · d1908f52
      Michal Hocko 提交于
      Tetsuo has noticed that an OOM stress test which performs large write
      requests can cause the full memory reserves depletion.  He has tracked
      this down to the following path
      
      	__alloc_pages_nodemask+0x436/0x4d0
      	alloc_pages_current+0x97/0x1b0
      	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
      	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
      	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
      	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
      	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
      	? iomap_write_end+0x80/0x80             fs/iomap.c:150
      	iomap_apply+0xb3/0x130                  fs/iomap.c:79
      	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
      	? iomap_write_end+0x80/0x80
      	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
      	? remove_wait_queue+0x59/0x60
      	xfs_file_write_iter+0x90/0x130 [xfs]
      	__vfs_write+0xe5/0x140
      	vfs_write+0xc7/0x1f0
      	? syscall_trace_enter+0x1d0/0x380
      	SyS_write+0x58/0xc0
      	do_syscall_64+0x6c/0x200
      	entry_SYSCALL64_slow_path+0x25/0x25
      
      the oom victim has access to all memory reserves to make a forward
      progress to exit easier.  But iomap_file_buffered_write and other
      callers of iomap_apply loop to complete the full request.  We need to
      check for fatal signals and back off with a short write instead.
      
      As the iomap_apply delegates all the work down to the actor we have to
      hook into those.  All callers that work with the page cache are calling
      iomap_write_begin so we will check for signals there.  dax_iomap_actor
      has to handle the situation explicitly because it copies data to the
      userspace directly.  Other callers like iomap_page_mkwrite work on a
      single page or iomap_fiemap_actor do not allocate memory based on the
      given len.
      
      Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path")
      Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1908f52
  7. 01 2月, 2017 13 次提交
    • D
      fscache: Fix dead object requeue · e26bfebd
      David Howells 提交于
      Under some circumstances, an fscache object can become queued such that it
      fscache_object_work_func() can be called once the object is in the
      OBJECT_DEAD state.  This results in the kernel oopsing when it tries to
      invoke the handler for the state (which is hard coded to 0x2).
      
      The way this comes about is something like the following:
      
       (1) The object dispatcher is processing a work state for an object.  This
           is done in workqueue context.
      
       (2) An out-of-band event comes in that isn't masked, causing the object to
           be queued, say EV_KILL.
      
       (3) The object dispatcher finishes processing the current work state on
           that object and then sees there's another event to process, so,
           without returning to the workqueue core, it processes that event too.
           It then follows the chain of events that initiates until we reach
           OBJECT_DEAD without going through a wait state (such as
           WAIT_FOR_CLEARANCE).
      
           At this point, object->events may be 0, object->event_mask will be 0
           and oob_event_mask will be 0.
      
       (4) The object dispatcher returns to the workqueue processor, and in due
           course, this sees that the object's work item is still queued and
           invokes it again.
      
       (5) The current state is a work state (OBJECT_DEAD), so the dispatcher
           jumps to it - resulting in an OOPS.
      
      When I'm seeing this, the work state in (1) appears to have been either
      LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is
      fscache_osm_lookup_oob).
      
      The window for (2) is very small:
      
       (A) object->event_mask is cleared whilst the event dispatch process is
           underway - though there's no memory barrier to force this to the top
           of the function.
      
           The window, therefore is from the time the object was selected by the
           workqueue processor and made requeueable to the time the mask was
           cleared.
      
       (B) fscache_raise_event() will only queue the object if it manages to set
           the event bit and the corresponding event_mask bit was set.
      
           The enqueuement is then deferred slightly whilst we get a ref on the
           object and get the per-CPU variable for workqueue congestion.  This
           slight deferral slightly increases the probability by allowing extra
           time for the workqueue to make the item requeueable.
      
      Handle this by giving the dead state a processor function and checking the
      for the dead state address rather than seeing if the processor function is
      address 0x2.  The dead state processor function can then set a flag to
      indicate that it's occurred and give a warning if it occurs more than once
      per object.
      
      If this race occurs, an oops similar to the following is seen (note the RIP
      value):
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
      IP: [<0000000000000002>] 0x1
      PGD 0
      Oops: 0010 [#1] SMP
      Modules linked in: ...
      CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
      Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
      Workqueue: fscache_object fscache_object_work_func [fscache]
      task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000
      RIP: 0010:[<0000000000000002>]  [<0000000000000002>] 0x1
      RSP: 0018:ffff880717547df8  EFLAGS: 00010202
      RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200
      RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480
      RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510
      R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000
      R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600
      FS:  0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
       ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00
       ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000
       ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900
      Call Trace:
       [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache]
       [<ffffffff8109d5db>] process_one_work+0x17b/0x470
       [<ffffffff8109e4ac>] worker_thread+0x21c/0x400
       [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
       [<ffffffff810a5acf>] kthread+0xcf/0xe0
       [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
       [<ffffffff816460d8>] ret_from_fork+0x58/0x90
       [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJeremy McNicoll <jeremymc@redhat.com>
      Tested-by: NFrank Sorenson <sorenson@redhat.com>
      Tested-by: NBenjamin Coddington <bcodding@redhat.com>
      Reviewed-by: NBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e26bfebd
    • D
      fscache: Clear outstanding writes when disabling a cookie · 6bdded59
      David Howells 提交于
      fscache_disable_cookie() needs to clear the outstanding writes on the
      cookie it's disabling because they cannot be completed after.
      
      Without this, fscache_nfs_open_file() gets stuck because it disables the
      cookie when the file is opened for writing but can't uncache the pages till
      afterwards - otherwise there's a race between the open routine and anyone
      who already has it open R/O and is still reading from it.
      
      Looking in /proc/pid/stack of the offending process shows:
      
      [<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache]
      [<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache]
      [<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs]
      [<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4]
      [<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7
      [<ffffffff811743ac>] vfs_open+0x5c/0x65
      [<ffffffff81184185>] path_openat+0x785/0x8fb
      [<ffffffff81184343>] do_filp_open+0x48/0x9e
      [<ffffffff81174710>] do_sys_open+0x13b/0x1cb
      [<ffffffff811747b9>] SyS_open+0x19/0x1b
      [<ffffffff81001c44>] do_syscall_64+0x80/0x17a
      [<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a
      [<ffffffffffffffff>] 0xffffffffffffffff
      Reported-by: NJianhong Yin <jiyin@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6bdded59
    • D
      FS-Cache: Initialise stores_lock in netfs cookie · 62deb818
      David Howells 提交于
      Initialise the stores_lock in fscache netfs cookies.  Technically, it
      shouldn't be necessary, since the netfs cookie is an index and stores no
      data, but initialising it anyway adds insignificant overhead.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      Acked-by: NSteve Dickson <steved@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      62deb818
    • J
      nfsd: opt in to labeled nfs per export · 32ddd944
      J. Bruce Fields 提交于
      Currently turning on NFSv4.2 results in 4.2 clients suddenly seeing the
      individual file labels as they're set on the server.  This is not what
      they've previously seen, and not appropriate in may cases.  (In
      particular, if clients have heterogenous security policies then one
      client's labels may not even make sense to another.)  Labeled NFS should
      be opted in only in those cases when the administrator knows it makes
      sense.
      
      It's helpful to be able to turn 4.2 on by default, and otherwise the
      protocol upgrade seems free of regressions.  So, default labeled NFS to
      off and provide an export flag to reenable it.
      
      Users wanting labeled NFS support on an export will henceforth need to:
      
      	- make sure 4.2 support is enabled on client and server (as
      	  before), and
      	- upgrade the server nfs-utils to a version supporting the new
      	  "security_label" export flag.
      	- set that "security_label" flag on the export.
      
      This is commit may be seen as a regression to anyone currently depending
      on security labels.  We believe those cases are currently rare.
      
      Reported-by: tibbs@math.uh.edu
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      32ddd944
    • J
      nfsd: constify nfsd_suppatttrs · 5cf23dbb
      J. Bruce Fields 提交于
      To keep me from accidentally writing to this again....
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      5cf23dbb
    • S
      lockd: initialize sin6_scope_id in lockd_inet6addr_event() · c01410f7
      Scott Mayhew 提交于
      I noticed this was missing when I was testing with link local addresses.
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      c01410f7
    • S
      nfsd: initialize sin6_scope_id in nfsd_inet6addr_event() · 7b19824d
      Scott Mayhew 提交于
      I noticed this was missing when I was testing with link local addresses.
      Signed-off-by: NScott Mayhew <smayhew@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      7b19824d
    • K
      NFSD: Remove unused value inode in nfsd_vfs_write · 865d50b2
      Kinglong Mee 提交于
      This is just cleanup, no change in functionality.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      865d50b2
    • K
      NFSD: cleanup dead codes and values in nfsd_write · 52e380e0
      Kinglong Mee 提交于
      This is just cleanup, no change in functionality.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      52e380e0
    • K
      NFSD: pass an integer for stable type to nfsd_vfs_write · 54bbb7d2
      Kinglong Mee 提交于
      After fae5096a "nfsd: assume writeable exportabled filesystems have
      f_sync" we no longer modify this argument.
      
      This is just cleanup, no change in functionality.
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      54bbb7d2
    • N
      NFSD: correctly range-check v4.x minor version when setting versions. · e35659f1
      NeilBrown 提交于
      Writing to /proc/fs/nfsd/versions allows individual major versions
      and NFSv4 minor versions to be enabled or disabled.
      
      However NFSv4.0 cannot currently be disabled, thought there is no good reason.
      Also the minor number is parsed as a 'long' but used as an 'int'
      so '4294967297' will be incorrectly treated as '1'.
      
      This patch removes the test on 'minor == 0' and switches to kstrtouint()
      to get correct range checking.
      
      When reading from /proc/fs/nfsd/versions, 4.0 is current not reported.
      To allow the disabling for v4.0 to be visible, while maintaining
      backward compatibility, change code to report "-4.0" if appropriate, but
      not "+4.0".
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      e35659f1
    • C
      nfsd: special case truncates some more · 41f53350
      Christoph Hellwig 提交于
      Both the NFS protocols and the Linux VFS use a setattr operation with a
      bitmap of attributs to set to set various file attributes including the
      file size and the uid/gid.
      
      The Linux syscalls never mixes size updates with unrelated updates like
      the uid/gid, and some file systems like XFS and GFS2 rely on the fact
      that truncates might not update random other attributes, and many other
      file systems handle the case but do not update the different attributes
      in the same transaction.  NFSD on the other hand passes the attributes
      it gets on the wire more or less directly through to the VFS, leading to
      updates the file systems don't expect.  XFS at least has an assert on
      the allowed attributes, which caught an unusual NFS client setting the
      size and group at the same time.
      
      To handle this issue properly this switches nfsd to call vfs_truncate
      for size changes, and then handle all other attributes through
      notify_change.  As a side effect this also means less boilerplace code
      around the size change as we can now reuse the VFS code.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      41f53350
    • K
      NFSD: Fix a null reference case in find_or_create_lock_stateid() · d19fb70d
      Kinglong Mee 提交于
      nfsd assigns the nfs4_free_lock_stateid to .sc_free in init_lock_stateid().
      
      If nfsd doesn't go through init_lock_stateid() and put stateid at end,
      there is a NULL reference to .sc_free when calling nfs4_put_stid(ns).
      
      This patch let the nfs4_stid.sc_free assignment to nfs4_alloc_stid().
      
      Cc: stable@vger.kernel.org
      Fixes: 356a95ec "nfsd: clean up races in lock stateid searching..."
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Reviewed-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      d19fb70d
  8. 28 1月, 2017 1 次提交
    • B
      xfs: prevent quotacheck from overloading inode lru · e0d76fa4
      Brian Foster 提交于
      Quotacheck runs at mount time in situations where quota accounting must
      be recalculated. In doing so, it uses bulkstat to visit every inode in
      the filesystem. Historically, every inode processed during quotacheck
      was released and immediately tagged for reclaim because quotacheck runs
      before the superblock is marked active by the VFS. In other words,
      the final iput() lead to an immediate ->destroy_inode() call, which
      allowed the XFS background reclaim worker to start reclaiming inodes.
      
      Commit 17c12bcd ("xfs: when replaying bmap operations, don't let
      unlinked inodes get reaped") marks the XFS superblock active sooner as
      part of the mount process to support caching inodes processed during log
      recovery. This occurs before quotacheck and thus means all inodes
      processed by quotacheck are inserted to the LRU on release.  The
      s_umount lock is held until the mount has completed and thus prevents
      the shrinkers from operating on the sb. This means that quotacheck can
      excessively populate the inode LRU and lead to OOM conditions on systems
      without sufficient RAM.
      
      Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
      inodes processed by quotacheck. This causes ->drop_inode() to return 1
      and in turn causes iput_final() to evict the inode. This preserves the
      original quotacheck behavior and prevents it from overloading the LRU
      and running out of memory.
      
      CC: stable@vger.kernel.org # v4.9
      Reported-by: NMartin Svec <martin.svec@zoner.cz>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      e0d76fa4
  9. 27 1月, 2017 6 次提交
  10. 26 1月, 2017 2 次提交
    • D
      xfs: clear _XBF_PAGES from buffers when readahead page · 2aa6ba7b
      Darrick J. Wong 提交于
      If we try to allocate memory pages to back an xfs_buf that we're trying
      to read, it's possible that we'll be so short on memory that the page
      allocation fails.  For a blocking read we'll just wait, but for
      readahead we simply dump all the pages we've collected so far.
      
      Unfortunately, after dumping the pages we neglect to clear the
      _XBF_PAGES state, which means that the subsequent call to xfs_buf_free
      thinks that b_pages still points to pages we own.  It then double-frees
      the b_pages pages.
      
      This results in screaming about negative page refcounts from the memory
      manager, which xfs oughtn't be triggering.  To reproduce this case,
      mount a filesystem where the size of the inodes far outweighs the
      availalble memory (a ~500M inode filesystem on a VM with 300MB memory
      did the trick here) and run bulkstat in parallel with other memory
      eating processes to put a huge load on the system.  The "check summary"
      phase of xfs_scrub also works for this purpose.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      2aa6ba7b
    • C
      xfs: extsize hints are not unlikely in xfs_bmap_btalloc · 493611eb
      Christoph Hellwig 提交于
      With COW files they are the hotpath, just like for files with the
      extent size hint attribute.  We really shouldn't micro-manage anything
      but failure cases with unlikely.
      
      Additionally Arnd Bergmann recently reported that one of these two
      unlikely annotations causes link failures together with an upcoming
      kernel instrumentation patch, so let's get rid of it ASAP.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      493611eb
  11. 25 1月, 2017 4 次提交
    • B
      xfs: remove racy hasattr check from attr ops · 5a93790d
      Brian Foster 提交于
      xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize
      away a lock cycle in cases where the fork does not exist or is otherwise
      empty. This check is not safe, however, because an attribute fork short
      form to extent format conversion includes a transient state that causes
      the xfs_inode_hasattr() check to fail. Specifically,
      xfs_attr_shortform_to_leaf() creates an empty extent format attribute
      fork and then adds the existing shortform attributes to it.
      
      This means that lookup of an existing xattr can spuriously return
      -ENOATTR when racing against a setxattr that causes the associated
      format conversion. This was originally reproduced by an untar on a
      particularly configured glusterfs volume, but can also be reproduced on
      demand with properly crafted xattr requests.
      
      The format conversion occurs under the exclusive ilock. xfs_attr_get()
      and xfs_attr_remove() already have the proper locking and checks further
      down in the functions to handle this situation correctly. Drop the
      unlocked checks to avoid the spurious failure and rely on the existing
      logic.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      5a93790d
    • C
      xfs: use per-AG reservations for the finobt · 76d771b4
      Christoph Hellwig 提交于
      Currently we try to rely on the global reserved block pool for block
      allocations for the free inode btree, but I have customer reports
      (fairly complex workload, need to find an easier reproducer) where that
      is not enough as the AG where we free an inode that requires a new
      finobt block is entirely full.  This causes us to cancel a dirty
      transaction and thus a file system shutdown.
      
      I think the right way to guard against this is to treat the finot the same
      way as the refcount btree and have a per-AG reservations for the possible
      worst case size of it, and the patch below implements that.
      
      Note that this could increase mount times with large finobt trees.  In
      an ideal world we would have added a field for the number of finobt
      fields to the AGI, similar to what we did for the refcount blocks.
      We should do add it next time we rev the AGI or AGF format by adding
      new fields.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      76d771b4
    • C
      xfs: only update mount/resv fields on success in __xfs_ag_resv_init · 4dfa2b84
      Christoph Hellwig 提交于
      Try to reserve the blocks first and only then update the fields in
      or hanging off the mount structure.  This way we can call __xfs_ag_resv_init
      again after a previous failure.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      4dfa2b84
    • C
      romfs: use different way to generate fsid for BLOCK or MTD · f598f82e
      Coly Li 提交于
      Commit 8a59f5d2 ("fs/romfs: return f_fsid for statfs(2)") generates
      a 64bit id from sb->s_bdev->bd_dev.  This is only correct when romfs is
      defined with CONFIG_ROMFS_ON_BLOCK.  If romfs is only defined with
      CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev
      will triger an oops.
      
      Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y,
      both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined.
      Therefore when calling huge_encode_dev() to generate a 64bit id, I use
      the follow order to choose parameter,
      
      - CONFIG_ROMFS_ON_BLOCK defined
        use sb->s_bdev->bd_dev
      - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined
        use sb->s_dev when,
      - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined
        leave id as 0
      
      When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev
      is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index,
      otherwise sb->s_dev is 0.
      
      This is a try-best effort to generate a uniq file system ID, if all the
      above conditions are not meet, f_fsid of this romfs instance will be 0.
      Generally only one romfs can be built on single MTD block device, this
      method is enough to identify multiple romfs instances in a computer.
      
      Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.deSigned-off-by: NColy Li <colyli@suse.de>
      Reported-by: NNong Li <nongli1031@gmail.com>
      Tested-by: NNong Li <nongli1031@gmail.com>
      Cc: Richard Weinberger <richard.weinberger@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f598f82e