1. 05 Mar, 2012 1 commit
    • B
      GFS2: Eliminate sd_rindex_mutex · 6aad1c3d
      Bob Peterson committed
      Over time, we've slowly eliminated the use of sd_rindex_mutex.
      Up to this point, it was only used in two places: function
      gfs2_ri_total (which totals the file system size by reading
      and parsing the rindex file) and function gfs2_rindex_update
      which updates the rgrps in memory. Both of these functions have
      the rindex glock to protect them, so the mutex is unnecessary.
      Since gfs2_grow writes to the rindex via the meta_fs, the mutex
      is in the wrong order according to the normal rules. This patch
      eliminates the mutex entirely to avoid the problem.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      6aad1c3d
  2. 01 Mar, 2012 1 commit
  3. 29 Feb, 2012 5 commits
    • S
      GFS2: Make bd_cmp() static · 08728f2d
      Steven Whitehouse committed
      Add missing static to bd_cmp()
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      08728f2d
    • B
      GFS2: Sort the ordered write list · 4a36d08d
      Bob Peterson committed
      This patch sorts the ordered write list for GFS2 writes.
      This increases the throughput for simultaneous writes.
      For example, if you have ten processes, all doing:
      dd if=/dev/zero of=/mnt/gfs2/fileX
      on different files, the throughput will be much better.
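The idea of sorting pending writes can be sketched in user space as follows. This is only an illustration, not the GFS2 code: `struct ordered_buf`, `ordered_buf_cmp()` and `sort_ordered_writes()` are hypothetical names, and the sketch assumes the sort key is the on-disk block number, which the commit message does not state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one buffer queued on the ordered write list. */
struct ordered_buf {
    uint64_t blocknr;   /* assumed sort key: on-disk block number */
};

/* Comparator: ascending block number. Compares rather than subtracts so
 * that large 64-bit values cannot overflow the int return value. */
static int ordered_buf_cmp(const void *a, const void *b)
{
    const struct ordered_buf *x = a;
    const struct ordered_buf *y = b;

    if (x->blocknr < y->blocknr)
        return -1;
    if (x->blocknr > y->blocknr)
        return 1;
    return 0;
}

/* Sort the pending writes so they can be submitted in disk order. */
static void sort_ordered_writes(struct ordered_buf *bufs, size_t n)
{
    qsort(bufs, n, sizeof(*bufs), ordered_buf_cmp);
}
```

With several writers interleaving their buffers, submitting in sorted order turns scattered writes into mostly sequential ones, which is the likely source of the throughput gain described above.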
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      4a36d08d
    • S
      GFS2: FITRIM ioctl support · 66fc061b
      Steven Whitehouse committed
      The FITRIM ioctl provides an alternative way to send discard requests to
      the underlying device. Using the discard mount option results in every
      freed block generating a discard request to the block device. This can
      be slow, since many block devices can only process discard requests of
      larger sizes, and also such operations can be time consuming.
      
      Rather than using the discard mount option, FITRIM allows a sweep of the
      filesystem on an occasional basis, and also to optionally avoid sending
      down discard requests for smaller regions.
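From user space, an occasional sweep of this kind might look like the sketch below. `FITRIM` and `struct fstrim_range` come from `<linux/fs.h>`; the wrapper name `trim_filesystem()` and its parameters are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>     /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the filesystem mounted at mount_point to discard its free space.
 * Free extents shorter than min_len bytes are skipped.
 * Returns 0 on success and, if trimmed is non-NULL, stores the number of
 * bytes the kernel reports as trimmed. Returns -1 on failure (errno set).
 */
static int trim_filesystem(const char *mount_point, uint64_t min_len,
                           uint64_t *trimmed)
{
    struct fstrim_range range = {
        .start = 0,
        .len = UINT64_MAX,      /* cover the whole filesystem */
        .minlen = min_len,
    };
    int fd = open(mount_point, O_RDONLY);
    if (fd < 0)
        return -1;

    int ret = ioctl(fd, FITRIM, &range);
    int saved_errno = errno;    /* close() must not clobber the error */
    close(fd);
    errno = saved_errno;

    if (ret < 0)
        return -1;
    if (trimmed)
        *trimmed = range.len;   /* kernel writes back the bytes trimmed */
    return 0;
}
```

A call such as `trim_filesystem("/mnt/gfs2", 1024 * 1024, &n)` would skip free regions under 1 MiB; the mount point and threshold here are example values only.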
      
      In GFS2 FITRIM will work at resource group granularity. There is a flag
      for each resource group which keeps track of which resource groups have
      been trimmed. This flag is reset whenever a deallocation occurs in the
      resource group, and set whenever a successful FITRIM of that resource
      group has taken place. This helps to reduce repeated discard requests
      for the same block ranges, again improving performance.
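The flag behaviour described above can be modelled with a small sketch. All names here (`struct rgrp_model`, `rgrp_dealloc()`, `rgrp_trim()`, `fitrim_sweep()`) are hypothetical, and the model assumes every discard succeeds; it is not the GFS2 implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one resource group. */
struct rgrp_model {
    bool trimmed;            /* set after a successful trim */
    unsigned discards_sent;  /* how many times a discard was issued */
};

/* Any deallocation invalidates the "already trimmed" state. */
static void rgrp_dealloc(struct rgrp_model *rg)
{
    rg->trimmed = false;
}

/* Trim one resource group unless nothing was freed since the last trim.
 * Returns true if a discard was actually issued. */
static bool rgrp_trim(struct rgrp_model *rg)
{
    if (rg->trimmed)
        return false;        /* skip: would repeat the same discards */
    rg->discards_sent++;     /* assumption: the discard always succeeds */
    rg->trimmed = true;
    return true;
}

/* Sweep all resource groups; returns how many were actually trimmed. */
static unsigned fitrim_sweep(struct rgrp_model *rgs, size_t n)
{
    unsigned count = 0;

    for (size_t i = 0; i < n; i++)
        if (rgrp_trim(&rgs[i]))
            count++;
    return count;
}
```

In this model a second sweep with no intervening deallocation issues no discards at all, which is the saving the commit message describes.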
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      66fc061b
    • S
      GFS2: Move two functions from log.c to lops.c · 47ac5537
      Steven Whitehouse committed
      gfs2_log_get_buf() and gfs2_log_fake_buf() are both used
      only in lops.c, so move them next to their callers and they
      can then become static.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      47ac5537
    • S
      GFS2: glock statistics gathering · a245769f
      Steven Whitehouse committed
      The stats are divided into two sets: those relating to the
      super block and those relating to an individual glock. The
      super block stats are done on a per cpu basis in order to
      try and reduce the overhead of gathering them. They are also
      further divided by glock type.
      
      In the case of both the super block and glock statistics,
      the same information is gathered in each case. The super
      block statistics are used to provide default values for
      most of the glock statistics, so that newly created glocks
      should have, as far as possible, a sensible starting point.
      
      The statistics are divided into three pairs of mean and
      variance, plus two counters. The mean/variance pairs are
      exponentially smoothed estimates, and the algorithm used
      will be very familiar to anyone who has calculated
      round-trip times in network code.
      
      The three pairs of mean/variance measure the following
      things:
      
       1. DLM lock time (non-blocking requests)
       2. DLM lock time (blocking requests)
       3. Inter-request time (again to the DLM)
      
      A non-blocking request is one which will complete right
      away, whatever the state of the DLM lock in question. That
      currently means any requests when (a) the current state of
      the lock is exclusive (b) the requested state is either null
      or unlocked or (c) the "try lock" flag is set. A blocking
      request covers all the other lock requests.
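The three conditions above translate directly into a predicate. The enum and function names below are hypothetical placeholders, not the identifiers used in the GFS2 source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock states for illustration only. */
enum lock_state {
    LOCK_UNLOCKED,
    LOCK_NULL,
    LOCK_SHARED,
    LOCK_EXCLUSIVE,
};

/* Returns true if the request is expected to complete right away,
 * following rules (a), (b) and (c) from the text. */
static bool request_is_nonblocking(enum lock_state current,
                                   enum lock_state requested,
                                   bool try_lock)
{
    if (current == LOCK_EXCLUSIVE)                 /* (a) */
        return true;
    if (requested == LOCK_NULL ||
        requested == LOCK_UNLOCKED)                /* (b) */
        return true;
    return try_lock;                               /* (c) */
}
```

Everything for which this returns false would be timed as a blocking request.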
      
      There are two counters. The first is there primarily to show
      how many lock requests have been made, and thus how much data
      has gone into the mean/variance calculations. The other counter
      is counting queueing of holders at the top layer of the glock
      code. Hopefully that number will be a lot larger than the number
      of dlm lock requests issued.
      
      So why gather these statistics? There are several reasons
      we'd like to get a better idea of these timings:
      
      1. To be able to better set the glock "min hold time"
      2. To spot performance issues more easily
      3. To improve the algorithm for selecting resource groups for
      allocation (to base it on lock wait time, rather than blindly
      using a "try lock")
      
      Due to the smoothing action of the updates, a step change in
      some input quantity being sampled will only fully be taken
      into account after 8 samples (or 4 for the variance) and this
      needs to be carefully considered when interpreting the
      results.
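A minimal sketch of such a smoothed update, modelled on the classic round-trip-time estimator, is shown below. The gains of 1/8 for the mean and 1/4 for the variance are inferred from the "8 samples (or 4 for the variance)" remark; the struct and function names are hypothetical, and the "variance" slot here tracks a smoothed absolute deviation, as that estimator does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of smoothed estimates, in nanoseconds. */
struct smoothed_stat {
    int64_t mean;
    int64_t var;
};

/* Fold one new sample into the estimates:
 *   mean += (sample - mean) / 8
 *   var  += (|sample - mean| - var) / 4
 * The deviation is measured against the mean before it is updated. */
static void smoothed_update(struct smoothed_stat *s, int64_t sample)
{
    int64_t delta = sample - s->mean;
    int64_t abs_delta = delta < 0 ? -delta : delta;

    s->mean += delta / 8;
    s->var += (abs_delta - s->var) / 4;
}
```

Because each sample moves the mean by only an eighth of the difference, a sudden change in lock latency shows up gradually, which is why the text warns about interpreting the numbers shortly after a step change.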
      
      Knowing both the time it takes a lock request to complete and
      the average time between lock requests for a glock means we
      can compute the total percentage of the time for which the
      node is able to use a glock vs. time that the rest of the
      cluster has its share. That will be very useful when setting
      the lock min hold time.
      
      The other point to remember is that all times are in
      nanoseconds. Great care has been taken to ensure that we
      measure exactly the quantities that we want, as accurately
      as possible. There are always inaccuracies in any
      measuring system, but I hope this is as accurate as we
      can reasonably make it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      a245769f
  4. 28 Feb, 2012 4 commits
  5. 26 Feb, 2012 1 commit
    • I
      autofs: work around unhappy compat problem on x86-64 · a32744d4
      Ian Kent committed
      When the autofs protocol version 5 packet type was added in commit
      5c0a32fc ("autofs4: add new packet type for v5 communications"), it
      obviously tried quite hard to be word-size agnostic, and uses explicitly
      sized fields that are all correctly aligned.
      
      However, with the final "char name[NAME_MAX+1]" array at the end, the
      actual size of the structure ends up being not very well defined:
      because the struct isn't marked 'packed', doing a "sizeof()" on it will
      align the size of the struct up to the biggest alignment of the members
      it has.
      
      And despite all the members being the same, the alignment of them is
      different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
      alignment on x86-64.  And while 'NAME_MAX+1' ends up being a nice round
      number (256), the name[] array starts out 4-byte aligned.
      
      End result: the "packed" size of the structure is 300 bytes: 4-byte, but
      not 8-byte aligned.
      
      As a result, despite all the fields being in the same place on all
      architectures, sizeof() will round up that size to 304 bytes on
      architectures that have 8-byte alignment for u64.
      
      Note that this is *not* a problem for 32-bit compat mode on POWER, since
      there __u64 is 8-byte aligned even in 32-bit mode.  But on x86, 32-bit
      and 64-bit alignment is different for 64-bit entities, and as a result
      the structure that has exactly the same layout has different sizes.
      
      So on x86-64, but no other architecture, we will just subtract 4 from
      the size of the structure when running in a compat task.  That way we
      will write the properly sized packet that user mode expects.
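The arithmetic can be checked with a small model struct. The layout below is an assumption chosen so that the fields add up to the 300 bytes quoted above (eleven 4-byte fields and one 8-byte field ahead of a 256-byte name); the struct and function names are hypothetical and this is not the kernel's header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN 256   /* NAME_MAX + 1 */

/* Illustrative packet layout: 44 bytes of fixed-size fields, then name[]. */
struct v5_packet_model {
    int32_t  proto_version;
    int32_t  type;
    uint32_t wait_queue_token;
    uint32_t dev;
    uint64_t ino;               /* the only 64-bit member */
    uint32_t uid, gid, pid, tgid;
    uint32_t len;
    char     name[NAME_LEN];    /* starts at offset 44: 4-byte aligned */
};

/* Bytes actually occupied by the members: the "packed" size. */
static size_t packet_payload_size(void)
{
    return offsetof(struct v5_packet_model, name) + NAME_LEN;
}

/* What sizeof() reports, including any tail padding the ABI adds. */
static size_t packet_padded_size(void)
{
    return sizeof(struct v5_packet_model);
}

/* Size to write for a reader built with the 32-bit x86 layout: drop the
 * tail padding (4 bytes where uint64_t is 8-byte aligned). */
static size_t packet_wire_size(int compat_task)
{
    return compat_task ? packet_payload_size() : packet_padded_size();
}
```

On an ABI where `uint64_t` is 8-byte aligned, `packet_padded_size()` is expected to be 304 while `packet_payload_size()` stays at 300, matching the difference the workaround subtracts.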
      
      Not pretty.  Sadly, this very subtle, and unnecessary, size difference
      has been encoded in user space that wants to read packets of *exactly*
      the right size, and will refuse to touch anything else.
      Reported-and-tested-by: Thomas Meyer <thomas@m3y3r.de>
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a32744d4
  6. 25 Feb, 2012 2 commits
    • O
      epoll: ep_unregister_pollwait() can use the freed pwq->whead · 971316f0
      Oleg Nesterov committed
      signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
      this is not enough. eppoll_entry->whead still points to the memory
      we are going to free, ep_unregister_pollwait()->remove_wait_queue()
      is obviously unsafe.
      
      Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
      change ep_unregister_pollwait() to check pwq->whead != NULL under
      rcu_read_lock() before remove_wait_queue(). We add the new helper,
      ep_remove_wait_queue(), for this.
      
      This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
      ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
      ep_unregister_pollwait()->remove_wait_queue() can play with already
      freed and potentially reused ->sighand, but this is fine. This memory
      must have the valid ->signalfd_wqh until rcu_read_unlock().
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      971316f0
    • O
      epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Oleg Nesterov committed
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in case
      of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This makes this poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it is used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns the different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, this seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d80e731e
  7. 24 Feb, 2012 3 commits
    • C
      Btrfs: fix compiler warnings on 32 bit systems · e77266e4
      Chris Mason committed
      The enospc tracing code added some interesting uses of
      u64 pointer casts.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      e77266e4
    • A
      NTFS: Correct two spelling errors "dealocate" to "deallocate" in mft.c. · 9b556248
      Anton Altaparmakov committed
      From: Masanari Iida <standby24x7@gmail.com>
      Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
      9b556248
    • A
      Restore direct_io / truncate locking API · 37fbf4bf
      Anton Altaparmakov committed
      With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
      calls (namely inode_dio_wait() and inode_dio_done()) which are
      EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
      further inode_dio_wait() was pushed from notify_change() into the file
      system ->setattr() method but no non-GPL file system can make this call.
      
      That means non-GPL file systems cannot exist any more unless they do not
      use any VFS functionality related to reading/writing as far as I can
      tell or at least as long as they want to implement direct i/o.
      
      Both Linus and Al (and others) have said on LKML that this breakage of
      the VFS API should not have happened and that the change was simply
      missed as it was not documented in the change logs of the patches that
      did those changes.
      
      This patch changes the two function exports in question to be
      EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
      for all modules.
      
      Christoph, who introduced the two functions and exported them GPL-only
      is CC-ed on this patch to give him the opportunity to object to the
      symbols being changed in this manner if he did indeed intend them to be
      GPL-only and does not want them to become available to all modules.
      Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
      CC: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37fbf4bf
  8. 23 Feb, 2012 5 commits
  9. 22 Feb, 2012 5 commits
  10. 21 Feb, 2012 1 commit
  11. 18 Feb, 2012 2 commits
  12. 17 Feb, 2012 9 commits
  13. 16 Feb, 2012 1 commit