1. 26 Feb, 2012 1 commit
    • autofs: work around unhappy compat problem on x86-64 · a32744d4
      Authored by Ian Kent
      When the autofs protocol version 5 packet type was added in commit
      5c0a32fc ("autofs4: add new packet type for v5 communications"), it
      obviously tried quite hard to be word-size agnostic, and uses explicitly
      sized fields that are all correctly aligned.
      
      However, with the final "char name[NAME_MAX+1]" array at the end, the
      actual size of the structure ends up being not very well defined:
      because the struct isn't marked 'packed', doing a "sizeof()" on it will
      align the size of the struct up to the biggest alignment of the members
      it has.
      
      And despite all the members being the same, the alignment of them is
      different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
      alignment on x86-64.  And while 'NAME_MAX+1' ends up being a nice round
      number (256), the name[] array starts out 4-byte aligned.
      
      End result: the "packed" size of the structure is 300 bytes: 4-byte, but
      not 8-byte aligned.
      
      As a result, despite all the fields being in the same place on all
      architectures, sizeof() will round up that size to 304 bytes on
      architectures that have 8-byte alignment for u64.
      
      Note that this is *not* a problem for 32-bit compat mode on POWER, since
      there __u64 is 8-byte aligned even in 32-bit mode.  But on x86, 32-bit
      and 64-bit alignment is different for 64-bit entities, and as a result
      the structure that has exactly the same layout has different sizes.
      
      So on x86-64, but no other architecture, we will just subtract 4 from
      the size of the structure when running in a compat task.  That way we
      will write the properly sized packet that user mode expects.
      
      Not pretty.  Sadly, this very subtle, and unnecessary, size difference
      has been encoded in user space that wants to read packets of *exactly*
      the right size, and will refuse to touch anything else.
      Reported-and-tested-by: Thomas Meyer <thomas@m3y3r.de>
      Signed-off-by: Ian Kent <raven@themaw.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
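The 300-versus-304 arithmetic in the autofs message can be checked with a small model. The sketch below uses Python's `ctypes`, which lays structures out according to the native C ABI. The field list is an assumption reconstructed to match the 300-byte packed size quoted in the message (it is not copied from the commit), and `packed_size` and `padded_size` are hypothetical helper names.

```python
import ctypes

NAME_MAX = 255  # Linux limit on a file name component, so name[] is 256 bytes


class AutofsV5Packet(ctypes.Structure):
    """Native-ABI model of the v5 packet; the field list is an assumption."""
    _fields_ = [
        ("proto_version", ctypes.c_int32),
        ("type", ctypes.c_int32),
        ("wait_queue_token", ctypes.c_uint32),
        ("dev", ctypes.c_uint32),
        ("ino", ctypes.c_uint64),   # the only 64-bit member
        ("uid", ctypes.c_uint32),
        ("gid", ctypes.c_uint32),
        ("pid", ctypes.c_uint32),
        ("tgid", ctypes.c_uint32),
        ("len", ctypes.c_uint32),
        ("name", ctypes.c_char * (NAME_MAX + 1)),
    ]


def packed_size(struct_type):
    """Bytes occupied by the fields themselves: the end of the last field."""
    last_field_name = struct_type._fields_[-1][0]
    last_field = getattr(struct_type, last_field_name)
    return last_field.offset + last_field.size


def padded_size(struct_type):
    """What sizeof() reports: packed size rounded up to the struct alignment."""
    return ctypes.sizeof(struct_type)


if __name__ == "__main__":
    print("u64 alignment:", ctypes.alignment(ctypes.c_uint64))
    print("packed size:  ", packed_size(AutofsV5Packet))
    print("sizeof():     ", padded_size(AutofsV5Packet))
```

On an ABI where a 64-bit integer is 8-byte aligned, the last line should report the packed size rounded up to a multiple of 8. On an ABI with 4-byte alignment for 64-bit integers, the two sizes should coincide.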
  2. 25 Feb, 2012 2 commits
    • epoll: ep_unregister_pollwait() can use the freed pwq->whead · 971316f0
      Authored by Oleg Nesterov
      signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
      this is not enough. eppoll_entry->whead still points to the memory
      we are going to free, ep_unregister_pollwait()->remove_wait_queue()
      is obviously unsafe.
      
      Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
      change ep_unregister_pollwait() to check pwq->whead != NULL under
      rcu_read_lock() before remove_wait_queue(). We add the new helper,
      ep_remove_wait_queue(), for this.
      
      This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
      ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
      ep_unregister_pollwait()->remove_wait_queue() can play with already
      freed and potentially reused ->sighand, but this is fine. This memory
      must have the valid ->signalfd_wqh until rcu_read_unlock().
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
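The control flow of the two epoll changes can be illustrated with a toy model. This is a single-threaded Python sketch, not kernel code: it leaves out RCU, locking, and event delivery entirely, the class names are hypothetical, and the numeric flag values are assumptions taken from memory of the kernel headers.

```python
POLLHUP = 0x0010
POLLFREE = 0x4000  # assumed value; only the bit test matters here


class WaitQueueHead:
    """Stand-in for a wait queue head such as ->signalfd_wqh."""

    def __init__(self):
        self.entries = []

    def wake_up_poll(self, key):
        # Iterate over a copy because callbacks may unlink themselves.
        for entry in list(self.entries):
            entry.callback(key)


class EppollEntry:
    """Stand-in for struct eppoll_entry, linked onto one wait queue head."""

    def __init__(self, whead):
        self.whead = whead
        whead.entries.append(self)

    def callback(self, key):
        if key & POLLFREE:
            # Unlink from the dying queue and forget its address, so that
            # later teardown never touches memory that is about to be freed.
            self.whead.entries.remove(self)
            self.whead = None


def ep_unregister_pollwait(entry):
    """Unlink the entry unless POLLFREE already did; True if unlinked here."""
    whead = entry.whead
    if whead is None:
        return False
    whead.entries.remove(entry)
    entry.whead = None
    return True


if __name__ == "__main__":
    wqh = WaitQueueHead()
    pwq = EppollEntry(wqh)
    wqh.wake_up_poll(POLLHUP | POLLFREE)  # what signalfd_cleanup() triggers
    print(ep_unregister_pollwait(pwq))    # the queue head is never touched again
```

In the real code the NULL check has to happen under rcu_read_lock(), as the message explains. The model only shows why clearing `whead` in the callback makes the later unregister safe.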
    • epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree() · d80e731e
      Authored by Oleg Nesterov
      This patch is intentionally incomplete to simplify the review.
      It ignores ep_unregister_pollwait() which plays with the same wqh.
      See the next change.
      
      epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
      f_op->poll() needs. In particular it assumes that the wait queue
      can't go away until eventpoll_release(). This is not true in case
      of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
      which is not connected to the file.
      
      This patch adds the special event, POLLFREE, currently only for
      epoll. It expects that init_poll_funcptr()'ed hook should do the
      necessary cleanup. Perhaps it should be defined as EPOLLFREE in
      eventpoll.
      
      __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
      ->signalfd_wqh is not empty, we add the new signalfd_cleanup()
      helper.
      
      ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
      This makes this poll entry inconsistent, but we don't care. If you
      share epoll fd which contains our sigfd with another process you
      should blame yourself. signalfd is "really special". I simply do
      not know how we can define the "right" semantics if it is used with
      epoll.
      
      The main problem is, epoll calls signalfd_poll() once to establish
      the connection with the wait queue, after that signalfd_poll(NULL)
      returns the different/inconsistent results depending on who does
      EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
      has nothing to do with the file, it works with the current thread.
      
      In short: this patch is the hack which tries to fix the symptoms.
      It also assumes that nobody can take tasklist_lock under epoll
      locks, this seems to be true.
      
      Note:
      
      	- we do not have wake_up_all_poll() but wake_up_poll()
      	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.
      
      	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
      	  we need a couple of simple changes in eventpoll.c to
      	  make sure it can't be "lost".
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Cc: <stable@kernel.org>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 24 Feb, 2012 2 commits
    • Btrfs: fix compiler warnings on 32 bit systems · e77266e4
      Authored by Chris Mason
      The enospc tracing code added some interesting uses of
      u64 pointer casts.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Restore direct_io / truncate locking API · 37fbf4bf
      Authored by Anton Altaparmakov
      With kernel 3.1, Christoph removed i_alloc_sem and replaced it with
      calls (namely inode_dio_wait() and inode_dio_done()) which are
      EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and
      further inode_dio_wait() was pushed from notify_change() into the file
      system ->setattr() method but no non-GPL file system can make this call.
      
      That means non-GPL file systems cannot exist any more unless they do not
      use any VFS functionality related to reading/writing as far as I can
      tell or at least as long as they want to implement direct i/o.
      
      Both Linus and Al (and others) have said on LKML that this breakage of
      the VFS API should not have happened and that the change was simply
      missed as it was not documented in the change logs of the patches that
      did those changes.
      
      This patch changes the two function exports in question to be
      EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible
      for all modules.
      
      Christoph, who introduced the two functions and exported them GPL-only
      is CC-ed on this patch to give him the opportunity to object to the
      symbols being changed in this manner if he did indeed intend them to be
      GPL-only and does not want them to become available to all modules.
      Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
      CC: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 Feb, 2012 5 commits
  5. 22 Feb, 2012 3 commits
    • sys_poll: fix incorrect type for 'timeout' parameter · faf30900
      Authored by Linus Torvalds
      The 'poll()' system call timeout parameter is supposed to be 'int', not
      'long'.
      
      Now, the reason this matters is that right now 32-bit compat mode is
      broken on at least x86-64, because the 32-bit code just calls
      'sys_poll()' directly on x86-64, and the 32-bit argument will have been
      zero-extended, turning a signed 'int' into a large unsigned 'long'
      value.
      
      We could just introduce a 'compat_sys_poll()' function for this, and
      that may eventually be what we have to do, but since the actual standard
      poll() semantics is *supposed* to be 'int', and since at least on x86-64
      glibc sign-extends the argument before invoking the system call (so
      nobody can actually use a 64-bit timeout value in user space _anyway_,
      even in 64-bit binaries), the simpler solution would seem to be to just
      fix the definition of the system call to match what it should have been
      from the very start.
      
      If it turns out that somebody somehow circumvents the user-level libc
      64-bit sign extension and actually uses a large unsigned 64-bit timeout
      despite that not being how poll() is supposed to work, we will need to
      do the compat_sys_poll() approach.
      Reported-by: Thomas Meyer <thomas@m3y3r.de>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
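The effect of zero-extension on a negative timeout can be worked through numerically. The sketch below is plain Python integer arithmetic that imitates 32-bit and 64-bit registers; the helper names are hypothetical and nothing here calls the real system call.

```python
def zero_extend_32(value):
    """Model a 32-bit argument arriving in a 64-bit register, upper bits zero."""
    return value & 0xFFFFFFFF


def as_signed(value, bits):
    """Reinterpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def timeout_seen_as_long(user_timeout):
    """Timeout observed when the parameter is declared as a 64-bit 'long'."""
    return as_signed(zero_extend_32(user_timeout), 64)


def timeout_seen_as_int(user_timeout):
    """Timeout observed when the parameter is declared as a 32-bit 'int'."""
    return as_signed(zero_extend_32(user_timeout), 32)


if __name__ == "__main__":
    # poll(..., -1) means "wait forever" for a 32-bit caller.
    print(timeout_seen_as_long(-1))  # a huge positive number of milliseconds
    print(timeout_seen_as_int(-1))   # -1 again
```

Non-negative timeouts are unaffected either way, which is why the breakage only shows up for callers that pass a negative value.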
    • xfs: make inode quota check more general · c922bbc8
      Authored by Mitsuo Hayasaka
      XFS checks quota when reserving disk blocks and inodes. For a block
      reservation, it checks whether the total number of blocks, including
      current usage and the new reservation, exceeds the quota. For an inode
      reservation, it checks the total number of inodes counting only current
      usage, without the new reservation. This inode quota check nevertheless
      works, because the caller of xfs_trans_dquot() always sets the number
      of newly reserved inodes to 1 or 0, and inodes are reserved one at a
      time in current XFS.
      
      To make it more general, this patch changes it to the same way as the
      block quota check.
      Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
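The difference between the two styles of inode check can be modeled in a few lines. This is a Python sketch of the comparison only, not the XFS code: the function names are hypothetical, and the exact comparison operators are assumptions chosen so that the old check agrees with the general one when exactly one inode is reserved, as the message describes.

```python
def inode_check_usage_only(limit, current_inodes, new_inodes):
    """Old style: look at current usage only; new_inodes is ignored.

    Returns True when the reservation should be refused. A limit of 0
    means "no limit set".
    """
    return limit > 0 and current_inodes >= limit


def inode_check_general(limit, current_inodes, new_inodes):
    """General style, as for blocks: current usage plus the new reservation."""
    return limit > 0 and current_inodes + new_inodes > limit


if __name__ == "__main__":
    # One inode at a time: both checks give the same answer.
    print(inode_check_usage_only(10, 10, 1), inode_check_general(10, 10, 1))
    # Three inodes at once: only the general check notices the overrun.
    print(inode_check_usage_only(10, 8, 3), inode_check_general(10, 8, 3))
```

The second call shows the case the old check cannot handle, which does not arise as long as callers reserve at most one inode at a time.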
    • xfs: change available ranges of softlimit and hardlimit in quota check · 20f12d8a
      Authored by Mitsuo Hayasaka
      In general, quota lets us use disk blocks and inodes up to each limit;
      that is, they remain available as long as usage does not exceed the limit.
      Current XFS makes the available range stop below the limit, except in the
      disk inode quota check. This patch changes the ranges so that usage up to
      and including the limit is allowed.
      Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
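The boundary change amounts to one comparison operator. As a hedged illustration, with hypothetical function names and assuming the check compares a usage total against a limit where 0 means "no limit":

```python
def over_limit_before(limit, total):
    """Range stops below the limit: reaching the limit is already refused."""
    return limit > 0 and total >= limit


def over_limit_after(limit, total):
    """Range includes the limit: only going beyond it is refused."""
    return limit > 0 and total > limit


if __name__ == "__main__":
    for total in (99, 100, 101):
        print(total, over_limit_before(100, total), over_limit_after(100, total))
```

The two functions differ only when the total lands exactly on the limit.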
  6. 21 Feb, 2012 1 commit
  7. 18 Feb, 2012 2 commits
  8. 17 Feb, 2012 9 commits
  9. 16 Feb, 2012 1 commit
  10. 15 Feb, 2012 9 commits
    • btrfs: silence warning in raid array setup · 8a334426
      Authored by David Sterba
      Raid array setup code creates an extent buffer in the usual way. When the
      PAGE_CACHE_SIZE is > super block size, the extent pages are not marked
      up-to-date, which triggers a WARN_ON in the following
      write_extent_buffer call. Add an explicit up-to-date call to silence the
      warning.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • btrfs: fix structs where bitfields and spinlock/atomic share 8B word · c08782da
      Authored by David Sterba
      On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
      gcc rewrites the adjacent 4B word, which in the case of a spinlock or atomic has a
      disastrous effect.
      
      https://lkml.org/lkml/2012/2/1/220
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • btrfs: delalloc for page dirtied out-of-band in fixup worker · 87826df0
      Authored by Jeff Mahoney
       We encountered an issue that was easily observable on s/390 systems but
       could really happen anywhere. The timing just seemed to hit reliably
       on s/390 with limited memory.
      
       The gist is that when an unexpected set_page_dirty() happened, we'd
       run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
       properly set up for delalloc.
      
       This patch does the following:
       - Performs the missing delalloc in the fixup worker
       - Allow the start hook to return -EBUSY which informs __extent_writepage
         that it should mark the page skipped and not to redirty it. This is
         required since the fixup worker can fail with -ENOSPC and the page
         will have already been redirtied. That causes an Oops in
         drop_outstanding_extents later. Retrying the fixup worker could
         lead to an infinite loop. Deferring the page redirty also saves us
         some cycles since the page would be stuck in a resubmit-redirty loop
         until the fixup worker completes. It's not harmful, just wasteful.
       - If the fixup worker fails, we mark the page and mapping as errored,
         and end the writeback, similar to what we would do had the page
         actually been submitted to writeback.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
    • Btrfs: fix memory leak in load_free_space_cache() · a7e221e9
      Authored by Tsutomu Itoh
      load_free_space_cache() forgot to free the path.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
    • btrfs: don't check DUP chunks twice · 859acaf1
      Authored by Arne Jansen
      Because scrub enumerates the dev extent tree to find the chunks to scrub,
      it currently finds each DUP chunk twice and also scrubs it twice. This
      patch makes sure that scrub_chunk only checks that part of the chunk the
      dev extent has been found for. This only changes the behaviour for DUP
      chunks.
      Reported-and-tested-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Arne Jansen <sensille@gmx.net>
    • Btrfs: fix trim 0 bytes after a device delete · 2cac13e4
      Authored by Liu Bo
      A user reported a bug in btrfs's trim: we trim 0 bytes after a
      device delete.
      
      The reproducer:
      
      $ mkfs.btrfs disk1
      $ mkfs.btrfs disk2
      $ mount disk1 /mnt
      $ fstrim -v /mnt
      $ btrfs device add disk2 /mnt
      $ btrfs device del disk1 /mnt
      $ fstrim -v /mnt
      
      This is because after we delete the device, the block group may start from
      a non-zero place, which will confuse trim to discard nothing.
      Reported-by: Lutz Euler <lutz.euler@freenet.de>
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
    • Btrfs: return the internal error unchanged if btrfs_get_extent_fiemap() call... · 6af021d8
      Authored by Jeff Liu
      Btrfs: return the internal error unchanged if btrfs_get_extent_fiemap() call failed for SEEK_DATA/SEEK_HOLE inquiry
      
      For a SEEK_DATA or SEEK_HOLE inquiry in a desired file range, ENXIO only means
      "offset beyond EOF", so we should return the internal error unchanged if the
      btrfs_get_extent_fiemap() call failed, rather than ENXIO.
      
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
    • Btrfs: avoid positive number with ERR_PTR · 8f24b496
      Authored by Jan Schmidt
      inode_ref_info() returns 1 when the element wasn't found and < 0 on error,
      just like btrfs_search_slot(). In iref_to_path() it's an error when the
      inode ref can't be found, thus we return ERR_PTR(ret) in that case. In order
      to avoid ERR_PTR(1), we now set ret to -ENOENT in that case.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
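Why ERR_PTR(1) is a problem can be seen from how error pointers are recognized. The sketch below models the kernel's convention in Python on a 64-bit word: errors are small negative numbers stored in the pointer value, and only the top MAX_ERRNO values of the address space count as errors. The helper `iref_to_path_result` and the placeholder address are hypothetical, and the constants are assumptions from memory of the kernel headers.

```python
MAX_ERRNO = 4095
ENOENT = 2
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def err_ptr(error):
    """Model of ERR_PTR(): store the error code in a pointer-sized word."""
    return error & WORD_MASK


def is_err(pointer):
    """Model of IS_ERR(): true only for the top MAX_ERRNO word values."""
    return pointer >= (-MAX_ERRNO & WORD_MASK)


def ptr_err(pointer):
    """Model of PTR_ERR(): read the word back as a signed number."""
    return pointer - (1 << WORD_BITS) if pointer >> (WORD_BITS - 1) else pointer


def iref_to_path_result(ret):
    """ret follows the inode_ref_info() convention: <0 error, 1 not found."""
    if ret > 0:
        ret = -ENOENT          # the fix: never hand a positive value to ERR_PTR
    if ret:
        return err_ptr(ret)
    return 0x1000              # placeholder for a valid pointer


if __name__ == "__main__":
    print(is_err(err_ptr(1)))              # a positive code is not seen as an error
    print(is_err(iref_to_path_result(1)))  # after the fix the caller sees an error
```

A caller that tests the result with IS_ERR() would treat ERR_PTR(1) as a valid pointer with address 1, which is why the not-found case is converted to -ENOENT first.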
    • btrfs: Sector Size check during Mount · 941b2ddf
      Authored by Keith Mannthey
      Gracefully fail when trying to mount a BTRFS file system that has a
      sectorsize smaller than PAGE_SIZE.
      
      On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
      then boot into a 64K PAGE_SIZE kernel.  Presently open_ctree fails in an
      endless loop and hangs the machine in this situation.
      
      My debugging has shown this sector size < page size case to be a non-trivial
      situation, and a graceful exit from it would be nice for the
      time being.
      Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
  11. 14 Feb, 2012 5 commits