1. 12 7月, 2011 1 次提交
    • J
      fixlet: Remove fs_excl from struct task. · 4aede84b
      Justin TerAvest 提交于
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Lets kill it.
      Signed-off-by: NJustin TerAvest <teravest@google.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4aede84b
  2. 08 7月, 2011 2 次提交
    • S
      block: document blk_plug list access · 316cc67d
      Shaohua Li 提交于
      I'm often confused why not disable preempt when changing blk_plug list. It
      would be better to add comments here in case others have the similar concerns.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      316cc67d
    • S
      block: avoid building too big plug list · 55c022bb
      Shaohua Li 提交于
      When I test fio script with big I/O depth, I found the total throughput drops
      compared to some relative small I/O depth. The reason is the thread accumulates
      big requests in its plug list and causes some delays (surely this depends
      on CPU speed).
      I thought we'd better have a threshold for requests. When a threshold reaches,
      this means there is no request merge and queue lock contention isn't severe
      when pushing per-task requests to queue, so the main advantages of blk plug
      don't exist. We can force a plug list flush in this case.
      With this, my test throughput actually increases and almost equals to small
      I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list()
      for big I/O depth.
      The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to
      reduce lock contention to me. But I'm open here, 32 is ok in my test too.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      55c022bb
  3. 07 7月, 2011 1 次提交
  4. 02 7月, 2011 1 次提交
    • J
      compat_ioctl: fix warning caused by qemu · 390192b3
      Johannes Stezenbach 提交于
      On Linux x86_64 host with 32bit userspace, running
      qemu or even just "qemu-img create -f qcow2 some.img 1G"
      causes a kernel warning:
      
      ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
      ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img
      
      ioctl 00005326 is CDROM_DRIVE_STATUS,
      ioctl 801c0204 is FDGETPRM.
      
      The warning appears because the Linux compat-ioctl handler for these
      ioctls only applies to block devices, while qemu also uses the ioctls on
      plain files.
      Signed-off-by: NJohannes Stezenbach <js@sig21.net>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      390192b3
  5. 01 7月, 2011 1 次提交
    • T
      block: flush MEDIA_CHANGE from drivers on close(2) · 85ef06d1
      Tejun Heo 提交于
      Currently, only open(2) is defined as the 'clearing' point.  It has
      two roles - first, it's an acknowledgement from userland indicating
      that the event has been received and kernel can clear pending states
      and proceed to generate more events.  Secondly, it's passed on to
      device drivers as a hint indicating that a synchronization point has
      been reached and it might want to take a deeper look at the device.
      
      The latter currently is only used by sr which uses two different
      mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
      to discover events, where the former is lighter weight and safe to be
      used repeatedly but may not provide full coverage.  Among other
      things, GET_EVENT can't detect media removal while TUR can.
      
      This patch makes close(2) - blkdev_put() - indicate clearing hint for
      MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
      disk_flush_events() and updated to take @mask for events to flush
      which is or'd to ev->clearing and will be passed to the driver on the
      next ->check_events() invocation.
      
      This change makes sr generate MEDIA_CHANGE when media is ejected from
      userland - e.g. with eject(1).
      
      Note: Given the current usage, it seems @clearing hint is needlessly
      complex.  disk_clear_events() can simply clear all events and the hint
      can be boolean @flush.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      85ef06d1
  6. 30 6月, 2011 1 次提交
  7. 22 6月, 2011 2 次提交
    • A
      PM: Fix async resume following suspend failure · 6d0e0e84
      Alan Stern 提交于
      The PM core doesn't handle suspend failures correctly when it comes to
      asynchronously suspended devices.  These devices are moved onto the
      dpm_suspended_list as soon as the corresponding async thread is
      started up, and they remain on the list even if they fail to suspend
      or the sleep transition is cancelled before they get suspended.  As a
      result, when the PM core unwinds the transition, it tries to resume
      the devices even though they were never suspended.
      
      This patch (as1474) fixes the problem by adding a new "is_suspended"
      flag to dev_pm_info.  Devices are resumed only if the flag is set.
      
      [rjw:
       * Moved the dev->power.is_suspended check into device_resume(),
         because we need to complete dev->power.completion and clear
         dev->power.is_prepared too for devices whose
         dev->power.is_suspended flags are unset.
       * Fixed __device_suspend() to avoid setting dev->power.is_suspended
         if async_error is different from zero.]
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@kernel.org
      6d0e0e84
    • A
      PM: Rename dev_pm_info.in_suspend to is_prepared · f76b168b
      Alan Stern 提交于
      This patch (as1473) renames the "in_suspend" field in struct
      dev_pm_info to "is_prepared", in preparation for an upcoming change.
      The new name is more descriptive of what the field really means.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@kernel.org
      f76b168b
  8. 21 6月, 2011 2 次提交
    • L
      vfs: i_state needs to be 'unsigned long' for now · 79568f5b
      Linus Torvalds 提交于
      Commit 13e12d14 ("vfs: reorganize 'struct inode' layout a bit")
      moved things around a bit changed i_state to be unsigned int instead of
      unsigned long.  That was to help structure layout for the 64-bit case,
      and shrink 'struct inode' a bit (admittedly that only happened when
      spinlock debugging was on and i_flags didn't pack with i_lock).
      
      However, Meelis Roos reports that this results in unaligned exceptions
      on sprc, and it turns out that the bit-locking primitives that we use
      for the I_NEW bit want to use the bitops.  Which want 'unsigned long',
      not 'unsigned int'.
      
      We really should fix the bit locking code to not have that kind of
      requirement, but that's a much bigger change.  So for now, revert that
      field back to 'unsigned long' (but keep the other re-ordering changes
      from the commit that caused this).
      
      Andi points out that we have played games with this in 'struct page', so
      it's solvable with other hacks too, but since right now the struct inode
      size advantage only happens with some rare config options, it's not
      worth fighting.
      
      It _would_ be worth fixing the bitlocking code, though.  Especially
      since there is no type safety in the bitlocking code (this never caused
      any warnings, and worked fine on x86-64, because the bitlocks take a
      'void *' and x86-64 doesn't care that deeply about alignment).  So it's
      currently a very easy problem to trigger by mistake and never notice.
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79568f5b
    • B
      NFSv4.1: file layout must consider pg_bsize for coalescing · 19345cb2
      Benny Halevy 提交于
      Otherwise we end up overflowing the rpc buffer size on the receive end.
      Signed-off-by: NBenny Halevy <benny@tonian.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      19345cb2
  9. 20 6月, 2011 2 次提交
  10. 18 6月, 2011 2 次提交
  11. 17 6月, 2011 2 次提交
  12. 16 6月, 2011 7 次提交
  13. 15 6月, 2011 3 次提交
    • B
      NFSv4.1: deprecate headerpadsz in CREATE_SESSION · c9c30dd5
      Benny Halevy 提交于
      We don't support header padding yet so better off ditching it
      Reported-by: NSid Moore <learnmost@gmail.com>
      Signed-off-by: NBenny Halevy <benny@tonian.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      c9c30dd5
    • T
      NLM: Don't hang forever on NLM unlock requests · 0b760113
      Trond Myklebust 提交于
      If the NLM daemon is killed on the NFS server, we can currently end up
      hanging forever on an 'unlock' request, instead of aborting. Basically,
      if the rpcbind request fails, or the server keeps returning garbage, we
      really want to quit instead of retrying.
      Tested-by: NVasily Averin <vvs@sw.ru>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@kernel.org
      0b760113
    • S
      rcu: Use softirq to address performance regression · 09223371
      Shaohua Li 提交于
      Commit a26ac245(rcu: move TREE_RCU from softirq to kthread)
      introduced performance regression. In an AIM7 test, this commit degraded
      performance by about 40%.
      
      The commit runs rcu callbacks in a kthread instead of softirq. We observed
      high rate of context switch which is caused by this. Out test system has
      64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
      which is caused by RCU's per-CPU kthread.  A trace showed that most of
      the time the RCU per-CPU kthread doesn't actually handle any callbacks,
      but instead just does a very small amount of work handling grace periods.
      This means that RCU's per-CPU kthreads are making the scheduler do quite
      a bit of work in order to allow a very small amount of RCU-related
      processing to be done.
      
      Alex Shi's analysis determined that this slowdown is due to lock
      contention within the scheduler.  Unfortunately, as Peter Zijlstra points
      out, the scheduler's real-time semantics require global action, which
      means that this contention is inherent in real-time scheduling.  (Yes,
      perhaps someone will come up with a workaround -- otherwise, -rt is not
      going to do well on large SMP systems -- but this patch will work around
      this issue in the meantime.  And "the meantime" might well be forever.)
      
      This patch therefore re-introduces softirq processing to RCU, but only
      for core RCU work.  RCU callbacks are still executed in kthread context,
      so that only a small amount of RCU work runs in softirq context in the
      common case.  This should minimize ksoftirqd execution, allowing us to
      skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Tested-by: N"Alex,Shi" <alex.shi@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      09223371
  14. 14 6月, 2011 2 次提交
    • J
      jbd2: Fix oops in jbd2_journal_remove_journal_head() · de1b7941
      Jan Kara 提交于
      jbd2_journal_remove_journal_head() can oops when trying to access
      journal_head returned by bh2jh(). This is caused for example by the
      following race:
      
      	TASK1					TASK2
        jbd2_journal_commit_transaction()
          ...
          processing t_forget list
            __jbd2_journal_refile_buffer(jh);
            if (!jh->b_transaction) {
              jbd_unlock_bh_state(bh);
      					jbd2_journal_try_to_free_buffers()
      					  jbd2_journal_grab_journal_head(bh)
      					  jbd_lock_bh_state(bh)
      					  __journal_try_to_free_buffer()
      					  jbd2_journal_put_journal_head(jh)
              jbd2_journal_remove_journal_head(bh);
      
      jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
      buffer is not part of any transaction and thus frees journal_head
      before TASK1 gets to doing so. Note that even buffer_head can be
      released by try_to_free_buffers() after
      jbd2_journal_put_journal_head() which adds even larger opportunity for
      oops (but I didn't see this happen in reality).
      
      Fix the problem by making transactions hold their own journal_head
      reference (in b_jcount). That way we don't have to remove journal_head
      explicitely via jbd2_journal_remove_journal_head() and instead just
      remove journal_head when b_jcount drops to zero. The result of this is
      that [__]jbd2_journal_refile_buffer(),
      [__]jbd2_journal_unfile_buffer(), and
      __jdb2_journal_remove_checkpoint() can free journal_head which needs
      modification of a few callers. Also we have to be careful because once
      journal_head is removed, buffer_head might be freed as well. So we
      have to get our own buffer_head reference where it matters.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      de1b7941
    • J
      block: Add __attribute__((format(printf...) and fix fallout · fd16d263
      Joe Perches 提交于
      Use the compiler to verify format strings and arguments.
      
      Fix fallout.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      fd16d263
  15. 13 6月, 2011 3 次提交
    • W
      block:fix the comment error in blkdev.h · 4d0d98b6
      Wanlong Gao 提交于
      There is not a function rq_init but blk_rq_init in block/blk-core.c.
      Signed-off-by: NWanlong Gao <wanlong.gao@gmail.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      4d0d98b6
    • J
      block: Add __attribute__((format(printf...) and fix fallout · 08e8138a
      Joe Perches 提交于
      Use the compiler to verify format strings and arguments.
      
      Fix fallout.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      08e8138a
    • A
      Delay struct net freeing while there's a sysfs instance refering to it · a685e089
      Al Viro 提交于
      	* new refcount in struct net, controlling actual freeing of the memory
      	* new method in kobj_ns_type_operations (->drop_ns())
      	* ->current_ns() semantics change - it's supposed to be followed by
      corresponding ->drop_ns().  For struct net in case of CONFIG_NET_NS it bumps
      the new refcount; net_drop_ns() decrements it and calls net_free() if the
      last reference has been dropped.  Method renamed to ->grab_current_ns().
      	* old net_free() callers call net_drop_ns() instead.
      	* sysfs_exit_ns() is gone, along with a large part of callchain
      leading to it; now that the references stored in ->ns[...] stay valid we
      do not need to hunt them down and replace them with NULL.  That fixes
      problems in sysfs_lookup() and sysfs_readdir(), along with getting rid
      of sb->s_instances abuse.
      
      	Note that struct net *shutdown* logics has not changed - net_cleanup()
      is called exactly when it used to be called.  The only thing postponed by
      having a sysfs instance refering to that struct net is actual freeing of
      memory occupied by struct net.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a685e089
  16. 12 6月, 2011 2 次提交
    • J
      vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check · 0b5c9db1
      Jiri Pirko 提交于
      Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
      but rather in vlan_do_receive.  Otherwise the vlan header
      will not be properly put on the packet in the case of
      vlan header accelleration.
      
      As we remove the check from vlan_check_reorder_header
      rename it vlan_reorder_header to keep the naming clean.
      
      Fix up the skb->pkt_type early so we don't look at the packet
      after adding the vlan tag, which guarantees we don't goof
      and look at the wrong field.
      
      Use a simple if statement instead of a complicated switch
      statement to decided that we need to increment rx_stats
      for a multicast packet.
      
      Hopefully at somepoint we will just declare the case where
      VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
      the code.  Until then this keeps it working correctly.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Acked-by: NChangli Gao <xiaosuo@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b5c9db1
    • D
      linux/seqlock.h should #include asm/processor.h for cpu_relax() · 56a21052
      David Howells 提交于
      It uses cpu_relax(), and so needs <asm/processor.h>
      
      Without this patch, I see:
      
         CC      arch/mn10300/kernel/asm-offsets.s
        In file included from include/linux/time.h:8,
                         from include/linux/timex.h:56,
                         from include/linux/sched.h:57,
                         from arch/mn10300/kernel/asm-offsets.c:7:
        include/linux/seqlock.h: In function 'read_seqbegin':
        include/linux/seqlock.h:91: error: implicit declaration of function 'cpu_relax'
      
      whilst building asb2364_defconfig on MN10300.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56a21052
  17. 10 6月, 2011 2 次提交
  18. 09 6月, 2011 1 次提交
    • L
      vfs: reorganize 'struct inode' layout a bit · 13e12d14
      Linus Torvalds 提交于
      This tries to make the 'struct inode' accesses denser in the data cache
      by moving a commonly accessed field (i_security) closer to other fields
      that are accessed often.
      
      It also makes 'i_state' just an 'unsigned int' rather than 'unsigned
      long', since we only use a few bits of that field, and moves it next to
      the existing 'i_flags' so that we potentially get better structure
      layout (although depending on config options, i_flags may already have
      packed in the same word as i_lock, so this improves packing only for the
      case of spinlock debugging)
      
      Out 'struct inode' is still way too big, and we should probably move
      some other fields around too (the acl fields in particular) for better
      data cache access density.  Other fields (like the inode hash) are
      likely to be entirely irrelevant under most loads.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13e12d14
  19. 08 6月, 2011 1 次提交
    • A
      usb-storage: redo incorrect reads · 21c13a4f
      Alan Stern 提交于
      Some USB mass-storage devices have bugs that cause them not to handle
      the first READ(10) command they receive correctly.  The Corsair
      Padlock v2 returns completely bogus data for its first read (possibly
      it returns the data in encrypted form even though the device is
      supposed to be unlocked).  The Feiya SD/SDHC card reader fails to
      complete the first READ(10) command after it is plugged in or after a
      new card is inserted, returning a status code that indicates it thinks
      the command was invalid, which prevents the kernel from retrying the
      read.
      
      Since the first read of a new device or a new medium is for the
      partition sector, the kernel is unable to retrieve the device's
      partition table.  Users have to manually issue an "hdparm -z" or
      "blockdev --rereadpt" command before they can access the device.
      
      This patch (as1470) works around the problem.  It adds a new quirk
      flag, US_FL_INVALID_READ10, indicating that the first READ(10) should
      always be retried immediately, as should any failing READ(10) commands
      (provided the preceding READ(10) command succeeded, to avoid getting
      stuck in a loop).  The patch also adds appropriate unusual_devs
      entries containing the new flag.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Tested-by: NSven Geggus <sven-usbst@geggus.net>
      Tested-by: NPaul Hartman <paul.hartman+linux@gmail.com>
      CC: Matthew Dharm <mdharm-usb@one-eyed-alien.net>
      CC: <stable@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      21c13a4f
  20. 07 6月, 2011 2 次提交