1. 24 11月, 2013 2 次提交
    • K
      block: Convert various code to bio_for_each_segment() · 2c30c71b
      Kent Overstreet 提交于
      With immutable biovecs we don't want code accessing bi_io_vec directly -
      the uses this patch changes weren't incorrect since they all own the
      bio, but it makes the code harder to audit for no good reason - also,
      this will help with multipage bvecs later.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      2c30c71b
    • K
      block: submit_bio_wait() conversions · 33879d45
      Kent Overstreet 提交于
      It was being open coded in a few places.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Acked-by: NNeilBrown <neilb@suse.de>
      33879d45
  2. 22 11月, 2013 2 次提交
    • J
      configfs: fix race between dentry put and lookup · 76ae281f
      Junxiao Bi 提交于
      A race window in configfs, it starts from one dentry is UNHASHED and end
      before configfs_d_iput is called.  In this window, if a lookup happen,
      since the original dentry was UNHASHED, so a new dentry will be
      allocated, and then in configfs_attach_attr(), sd->s_dentry will be
      updated to the new dentry.  Then in configfs_d_iput(),
      BUG_ON(sd->s_dentry != dentry) will be triggered and system panic.
      
      sys_open:                     sys_close:
       ...                           fput
                                      dput
                                       dentry_kill
                                        __d_drop <--- dentry unhashed here,
                                                 but sd->dentry still point
                                                 to this dentry.
      
       lookup_real
        configfs_lookup
         configfs_attach_attr---> update sd->s_dentry
                                  to new allocated dentry here.
      
                                         d_kill
                                           configfs_d_iput <--- BUG_ON(sd->s_dentry != dentry)
                                                           triggered here.
      
      To fix it, change configfs_d_iput to not update sd->s_dentry if
      sd->s_count > 2, that means there are another dentry is using the sd
      beside the one that is going to be put.  Use configfs_dirent_lock in
      configfs_attach_attr to sync with configfs_d_iput.
      
      With the following steps, you can reproduce the bug.
      
      1. enable ocfs2, this will mount configfs at /sys/kernel/config and
         fill configure in it.
      
      2. run the following script.
      	while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
      	while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
      Signed-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76ae281f
    • S
      GFS2: Fix ref count bug relating to atomic_open · ea0341e0
      Steven Whitehouse 提交于
      In the case that atomic_open calls finish_no_open() with
      the dentry that was supplied to gfs2_atomic_open() an
      extra reference count is required. This patch fixes that
      issue preventing a bug trap triggering at umount time.
      Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>
      ea0341e0
  3. 21 11月, 2013 17 次提交
  4. 20 11月, 2013 14 次提交
    • P
      Squashfs: Check stream is not NULL in decompressor_multi.c · ed4f381e
      Phillip Lougher 提交于
      Fix static checker complaint that stream is not checked in
      squashfs_decompressor_destroy().
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      ed4f381e
    • P
      Squashfs: Directly decompress into the page cache for file data · 0d455c12
      Phillip Lougher 提交于
      This introduces an implementation of squashfs_readpage_block()
      that directly decompresses into the page cache.
      
      This uses the previously added page handler abstraction to push
      down the necessary kmap_atomic/kunmap_atomic operations on the
      page cache buffers into the decompressors.  This enables
      direct copying into the page cache without using the slow
      kmap/kunmap calls.
      
      The code detects when multiple threads are racing in
      squashfs_readpage() to decompress the same block, and avoids
      this regression by falling back to using an intermediate
      buffer.
      
      This patch enhances the performance of Squashfs significantly
      when multiple processes are accessing the filesystem simultaneously
      because it not only reduces memcopying, but it more importantly
      eliminates the lock contention on the intermediate buffer.
      
      Using single-thread decompression.
      
              dd if=file1 of=/dev/null bs=4096 &
              dd if=file2 of=/dev/null bs=4096 &
              dd if=file3 of=/dev/null bs=4096 &
              dd if=file4 of=/dev/null bs=4096
      
      Before:
      
      629145600 bytes (629 MB) copied, 45.8046 s, 13.7 MB/s
      
      After:
      
      629145600 bytes (629 MB) copied, 9.29414 s, 67.7 MB/s
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      0d455c12
    • P
      Squashfs: Restructure squashfs_readpage() · 5f55dbc0
      Phillip Lougher 提交于
      Restructure squashfs_readpage() splitting it into separate
      functions for datablocks, fragments and sparse blocks.
      
      Move the memcpying (from squashfs cache entry) implementation of
      squashfs_readpage_block into file_cache.c
      
      This allows different implementations to be supported.
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      5f55dbc0
    • P
      Squashfs: Generalise paging handling in the decompressors · 846b730e
      Phillip Lougher 提交于
      Further generalise the decompressors by adding a page handler
      abstraction.  This adds helpers to allow the decompressors
      to access and process the output buffers in an implementation
      independant manner.
      
      This allows different types of output buffer to be passed
      to the decompressors, with the implementation specific
      aspects handled at decompression time, but without the
      knowledge being held in the decompressor wrapper code.
      
      This will allow the decompressors to handle Squashfs
      cache buffers, and page cache pages.
      
      This patch adds the abstraction and an implementation for
      the caches.
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      846b730e
    • P
      Squashfs: add multi-threaded decompression using percpu variable · d208383d
      Phillip Lougher 提交于
      Add a multi-threaded decompression implementation which uses
      percpu variables.
      
      Using percpu variables has advantages and disadvantages over
      implementations which do not use percpu variables.
      
      Advantages:
        * the nature of percpu variables ensures decompression is
          load-balanced across the multiple cores.
        * simplicity.
      
      Disadvantages: it limits decompression to one thread per core.
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      d208383d
    • M
      squashfs: Enhance parallel I/O · cd59c2ec
      Minchan Kim 提交于
      Now squashfs have used for only one stream buffer for decompression
      so it hurts parallel read performance so this patch supports
      multiple decompressor to enhance performance parallel I/O.
      
      Four 1G file dd read on KVM machine which has 2 CPU and 4G memory.
      
      dd if=test/test1.dat of=/dev/null &
      dd if=test/test2.dat of=/dev/null &
      dd if=test/test3.dat of=/dev/null &
      dd if=test/test4.dat of=/dev/null &
      
      old : 1m39s -> new : 9s
      
      * From v1
        * Change comp_strm with decomp_strm - Phillip
        * Change/add comments - Phillip
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      cd59c2ec
    • P
      Squashfs: Refactor decompressor interface and code · 9508c6b9
      Phillip Lougher 提交于
      The decompressor interface and code was written from
      the point of view of single-threaded operation.  In doing
      so it mixed a lot of single-threaded implementation specific
      aspects into the decompressor code and elsewhere which makes it
      difficult to seamlessly support multiple different decompressor
      implementations.
      
      This patch does the following:
      
      1.  It removes compressor_options parsing from the decompressor
          init() function.  This allows the decompressor init() function
          to be dynamically called to instantiate multiple decompressors,
          without the compressor options needing to be read and parsed each
          time.
      
      2.  It moves threading and all sleeping operations out of the
          decompressors.  In doing so, it makes the decompressors
          non-blocking wrappers which only deal with interfacing with
          the decompressor implementation.
      
      3. It splits decompressor.[ch] into decompressor generic functions
         in decompressor.[ch], and moves the single threaded
         decompressor implementation into decompressor_single.c.
      
      The result of this patch is Squashfs should now be able to
      support multiple decompressors by adding new decompressor_xxx.c
      files with specialised implementations of the functions in
      decompressor_single.c
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      9508c6b9
    • J
      nfsd4: fix xdr decoding of large non-write compounds · 365da4ad
      J. Bruce Fields 提交于
      This fixes a regression from 24750082
      "nfsd4: fix decoding of compounds across page boundaries".  The previous
      code was correct: argp->pagelist is initialized in
      nfs4svc_deocde_compoundargs to rqstp->rq_arg.pages, and is therefore a
      pointer to the page *after* the page we are currently decoding.
      
      The reason that patch nevertheless fixed a problem with decoding
      compounds containing write was a bug in the write decoding introduced by
      5a80a54d "nfsd4: reorganize write
      decoding", after which write decoding no longer adhered to the rule that
      argp->pagelist point to the next page.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      365da4ad
    • S
      aio: nullify aio->ring_pages after freeing it · ddb8c45b
      Sasha Levin 提交于
      After freeing ring_pages we leave it as is causing a dangling pointer. This
      has already caused an issue so to help catching any issues in the future
      NULL it out.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      ddb8c45b
    • S
      aio: prevent double free in ioctx_alloc · d5580232
      Sasha Levin 提交于
      ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
      fails to do so it would call aio_free_ring() before returning, but
      ioctx_alloc() would call aio_free_ring() again causing a double free of
      the ring.
      
      This is easily reproducible from userspace.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      d5580232
    • J
      genetlink: make multicast groups const, prevent abuse · 2a94fe48
      Johannes Berg 提交于
      Register generic netlink multicast groups as an array with
      the family and give them contiguous group IDs. Then instead
      of passing the global group ID to the various functions that
      send messages, pass the ID relative to the family - for most
      families that's just 0 because the only have one group.
      
      This avoids the list_head and ID in each group, adding a new
      field for the mcast group ID offset to the family.
      
      At the same time, this allows us to prevent abusing groups
      again like the quota and dropmon code did, since we can now
      check that a family only uses a group it owns.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a94fe48
    • J
      genetlink: pass family to functions using groups · 68eb5503
      Johannes Berg 提交于
      This doesn't really change anything, but prepares for the
      next patch that will change the APIs to pass the group ID
      within the family, rather than the global group ID.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68eb5503
    • J
      quota/genetlink: use proper genetlink multicast APIs · 2ecf7536
      Johannes Berg 提交于
      The quota code is abusing the genetlink API and is using
      its family ID as the multicast group ID, which is invalid
      and may belong to somebody else (and likely will.)
      
      Make the quota code use the correct API, but since this
      is already used as-is by userspace, reserve a family ID
      for this code and also reserve that group ID to not break
      userspace assumptions.
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ecf7536
    • J
      genetlink: only pass array to genl_register_family_with_ops() · c53ed742
      Johannes Berg 提交于
      As suggested by David Miller, make genl_register_family_with_ops()
      a macro and pass only the array, evaluating ARRAY_SIZE() in the
      macro, this is a little safer.
      
      The openvswitch has some indirection, assing ops/n_ops directly in
      that code. This might ultimately just assign the pointers in the
      family initializations, saving the struct genl_family_and_ops and
      code (once mcast groups are handled differently.)
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c53ed742
  5. 19 11月, 2013 4 次提交
  6. 18 11月, 2013 1 次提交
    • D
      xfs: open code inc_inode_iversion when logging an inode · 2fe8c1c0
      Dave Chinner 提交于
      Michael L Semon reported that generic/069 runtime increased on v5
      superblocks by 100% compared to v4 superblocks. his perf-based
      analysis pointed directly at the timestamp updates being done by the
      write path in this workload. The append writers are doing 4-byte
      writes, so there are lots of timestamp updates occurring.
      
      The thing is, they aren't being triggered by timestamp changes -
      they are being triggered by the inode change counter needing to be
      updated. That is, every write(2) system call needs to bump the inode
      version count, and it does that through the timestamp update
      mechanism. Hence for v5 filesystems, test generic/069 is running 3
      orders of magnitude more timestmap update transactions on v5
      filesystems due to the fact it does a huge number of *4 byte*
      write(2) calls.
      
      This isn't a real world scenario we really need to address - anyone
      doing such sequential IO should be using fwrite(3), not write(2).
      i.e. fwrite(3) buffers the writes in userspace to minimise the
      number of write(2) syscalls, and the problem goes away.
      
      However, there is a small change we can make to improve the
      situation - removing the expensive lock operation on the change
      counter update.  All inode version counter changes in XFS occur
      under the ip->i_ilock during a transaction, and therefore we
      don't actually need the spin lock that provides exclusive access to
      it through inc_inode_iversion().
      
      Hence avoid the lock and just open code the increment ourselves when
      logging the inode.
      Reported-by: NMichael L. Semon <mlsemon35@gmail.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2fe8c1c0