1. 28 August 2013, 1 commit
  2. 23 August 2013, 4 commits
  3. 21 August 2013, 1 commit
  4. 20 August 2013, 1 commit
  5. 16 August 2013, 2 commits
    • Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Committed by Linus Torvalds
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
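
      The shape of the change can be sketched as follows; this is a simplified
      illustration of the generic mmu_gather setup, not the literal patch:

      	void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
      			    unsigned long start, unsigned long end)
      	{
      		tlb->mm     = mm;
      		/* a full-mm flush is expressed as the whole address range */
      		tlb->fullmm = !(start | (end + 1));
      		tlb->start  = start;
      		tlb->end    = end;
      		/* ... the remaining mmu_gather fields are set up as before ... */
      	}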
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
      Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • net/mlx5_core: Support MANAGE_PAGES and QUERY_PAGES firmware command changes · 0a324f31
      Committed by Moshe Lazer
      In the previous QUERY_PAGES command version we used one command to get the
      required amount of boot, init and post init pages.  The new version uses the
      op_mod field to specify whether the query is for the required amount of boot,
      init or post init pages. In addition the output field size for the required
      amount of pages increased from 16 to 32 bits.
      
      In MANAGE_PAGES command the input_num_entries and output_num_entries fields
      sizes changed from 16 to 32 bits and the PAS tables offset changed to 0x10.
      
      In the pages request event the num_pages field also changed to 32 bits.
      
      In the HCA-capabilities layout, the size and location of the max_qp_mcg
      field have been changed to support 24 bits.
      
      This patch isn't compatible with firmware versions < 5; however, it turns
      out that the first GA firmware we will publish will not support previous
      versions, so this should be OK.
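
      For illustration, a command inbox with the widened count field might look
      roughly like this; the field names and reserved padding below are
      assumptions for the sketch, not the actual driver structures:

      	struct manage_pages_in_sketch {
      		__be16	opcode;
      		u8	rsvd0[2];
      		__be16	op_mod;		/* selects boot/init/post-init in QUERY_PAGES */
      		u8	rsvd1[2];
      		__be16	func_id;
      		u8	rsvd2[2];
      		__be32	num_entries;	/* widened from 16 to 32 bits */
      		__be64	pas[];		/* PAS table now starts at offset 0x10 */
      	};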
      Signed-off-by: Moshe Lazer <moshel@mellanox.com>
      Signed-off-by: Eli Cohen <eli@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 15 August 2013, 1 commit
    • net_sched: restore "linklayer atm" handling · 8a8e3d84
      Committed by Jesper Dangaard Brouer
      commit 56b765b7 ("htb: improved accuracy at high rates")
      broke the "linklayer atm" handling.
      
       tc class add ... htb rate X ceil Y linklayer atm
      
      The linklayer setting is implemented by modifying the rate table
      which is sent to the kernel.  No direct parameter was
      transferred to the kernel indicating the linklayer setting.
      
      The commit 56b765b7 ("htb: improved accuracy at high rates")
      removed the use of the rate table system.
      
      To remain compatible with older iproute2 utils, this patch detects
      the linklayer by parsing the rate table.  It also allows future
      versions of iproute2 to send this linklayer parameter to the
      kernel directly. This is done by using the __reserved field in
      struct tc_ratespec to convey the chosen linklayer option, but
      only using the lower 4 bits of this field.
      
      Linklayer detection is limited to speeds below 100Mbit/s, because
      at high rates the rtab gets too inaccurate, so bad that several
      fields contain the same values, which then resembles the ATM
      pattern.  Fields even start to contain a "0" time to send, e.g. at
      1000Mbit/s sending a 96 byte packet costs "0", so the rtab has
      been more broken than we first realized.
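
      The uapi side of the idea can be sketched as follows; the enum names,
      mask value and accessor shown here are illustrative assumptions of this
      sketch rather than the exact header contents:

      	enum {
      		TC_LINKLAYER_UNAWARE,	/* not set: fall back to rtab detection */
      		TC_LINKLAYER_ETHERNET,
      		TC_LINKLAYER_ATM,
      	};
      	#define TC_LINKLAYER_MASK 0x0F	/* only the lower 4 bits are used */

      	static inline unsigned char tc_ratespec_linklayer(const struct tc_ratespec *r)
      	{
      		return r->__reserved & TC_LINKLAYER_MASK;
      	}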
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 14 August 2013, 5 commits
  8. 13 August 2013, 1 commit
    • sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Committed by Oleg Nesterov
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock the code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is the special case of wait-for-condition:
      it relies on the try_to_wake_up/schedule interaction and thus does
      not need mb() between __set_current_state() and if (signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock; now that try_to_wake_up() takes
      another lock, we need to ensure that it can't be reordered with
      the "if (signal_pending(current))" check inside that section.
      
      The patch is actually a one-liner: it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does, for the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(), it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
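
      The default definition and the resulting pattern can be sketched like
      this; the helper function below is purely illustrative, not the literal
      scheduler code:

      	/* default implementation; architectures may override it */
      	#ifndef smp_mb__before_spinlock
      	#define smp_mb__before_spinlock()	smp_wmb()
      	#endif

      	static void schedule_lock_sketch(struct rq *rq)
      	{
      		/* keep the earlier ->state write ordered before the section */
      		smp_mb__before_spinlock();
      		raw_spin_lock_irq(&rq->lock);
      		/* ... the signal_pending() check sits inside this section ... */
      		raw_spin_unlock_irq(&rq->lock);
      	}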
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 10 August 2013, 1 commit
  10. 09 August 2013, 1 commit
  11. 08 August 2013, 2 commits
    • SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister · 786615bc
      Committed by Trond Myklebust
      If rpcbind causes our connection to the AF_LOCAL socket to close after
      we've registered a service, then we want to be careful about reconnecting
      since the mount namespace may have changed.
      
      By simply refusing to reconnect the AF_LOCAL socket in the case of
      unregister, we avoid the need to somehow save the mount namespace. While
      this may lead to some services not unregistering properly, it should
      be safe.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: stable@vger.kernel.org # 3.9.x
    • ACPI: Try harder to resolve _ADR collisions for bridges · 60f75b8e
      Committed by Rafael J. Wysocki
      In theory, under a given ACPI namespace node there should be only
      one child device object with _ADR whose value matches a given bus
      address exactly.  In practice, however, there are systems in which
      multiple child device objects under a given parent have _ADR matching
      exactly the same address.  In those cases we use _STA to determine
      which of the multiple matching devices is enabled, since some systems
      are known to indicate which ACPI device object to associate with the
      given physical (usually PCI) device this way.
      
      Unfortunately, as it turns out, there are systems in which many
      device objects under the same parent have _ADR matching exactly the
      same bus address and none of them has _STA, in which case they all
      should be regarded as enabled according to the spec.  Still, if
      those device objects are supposed to represent bridges (e.g. this
      is the case for device objects corresponding to PCIe ports), we can
      try harder and skip the ones that have no child device objects in the
      ACPI namespace.  With luck, we can avoid using device objects that we
      are not expected to use this way.
      
      Although this only works for bridges whose children also have ACPI
      namespace representation, it is sufficient to address graphics
      adapter detection issues on some systems, so rework the code finding
      a matching device ACPI handle for a given bus address to implement
      this idea.
      
      Introduce a new function, acpi_find_child(), taking three arguments:
      the ACPI handle of the device's parent, a bus address suitable for
      the device's bus type and a bool indicating if the device is a
      bridge and make it work as outlined above.  Reimplement the function
      currently used for this purpose, acpi_get_child(), as a call to
      acpi_find_child() with the last argument set to 'false' and make
      the PCI subsystem use acpi_find_child() with the bridge information
      passed as the last argument to it.  [Lan Tianyu notices that it is
      not sufficient to use pci_is_bridge() for that, because the device's
      subordinate pointer hasn't been set yet at this point, so use
      hdr_type instead.]
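
      The described interface can be sketched as follows; this is a simplified
      illustration, and the real lookup and tie-breaking logic lives in the
      ACPI glue code:

      	acpi_handle acpi_find_child(acpi_handle parent, u64 address, bool is_bridge);

      	static inline acpi_handle acpi_get_child(acpi_handle parent, u64 address)
      	{
      		/* the old helper becomes the non-bridge case of the new one */
      		return acpi_find_child(parent, address, false);
      	}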
      
      This change fixes a regression introduced inadvertently by commit
      33f767d7 (ACPI: Rework acpi_get_child() to be more efficient) which
      overlooked the fact that for acpi_walk_namespace() "post-order" means
      "after all children have been visited" rather than "on the way back",
      so for device objects without children and for namespace walks of
      depth 1, as in the acpi_get_child() case, the "post-order" callbacks
      ordering is actually the same as the ordering of "pre-order" ones.
      Since that commit changed the namespace walk in acpi_get_child() to
      terminate after finding the first matching object instead of going
      through all of them and returning the last one, it effectively
      changed the result returned by that function in some rare cases and
      that led to problems (the switch from a "pre-order" to a "post-order"
      callback was supposed to prevent that from happening, but it was
      ineffective).
      
      As it turns out, the systems where the change made by commit
      33f767d7 actually matters are those where there are multiple ACPI
      device objects representing the same PCIe port (which effectively
      is a bridge).  Moreover, only one of them, and the one we are
      expected to use, has child device objects in the ACPI namespace,
      so the regression can be addressed as described above.
      
      References: https://bugzilla.kernel.org/show_bug.cgi?id=60561
      Reported-by: Peter Wu <lekensteyn@gmail.com>
      Tested-by: Vladimir Lalov <mail@vlalov.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: 3.9+ <stable@vger.kernel.org> # 3.9+
  12. 07 August 2013, 1 commit
  13. 06 August 2013, 2 commits
  14. 05 August 2013, 1 commit
  15. 03 August 2013, 2 commits
  16. 02 August 2013, 3 commits
    • net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL · e0d1095a
      Committed by Cong Wang
      Eliezer renames several *ll_poll symbols to *busy_poll, but forgets
      CONFIG_NET_LL_RX_POLL; to avoid confusion, rename it too.
      
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix a compile error when CONFIG_NET_LL_RX_POLL is not set · dfcefb0b
      Committed by Cong Wang
      When CONFIG_NET_LL_RX_POLL is not set, I got:
      
      net/socket.c: In function ‘sock_poll’:
      net/socket.c:1165:4: error: implicit declaration of function ‘sk_busy_loop’ [-Werror=implicit-function-declaration]
      
      Fix this by adding a nop when !CONFIG_NET_LL_RX_POLL.
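
      A sketch of such a stub in the low-latency poll header; the exact
      signature shown here is an assumption based on the 3.11-era interface:

      	#ifdef CONFIG_NET_LL_RX_POLL
      	bool sk_busy_loop(struct sock *sk, int nonblock);
      	#else
      	static inline bool sk_busy_loop(struct sock *sk, int nonblock)
      	{
      		return false;	/* no-op when busy polling is compiled out */
      	}
      	#endif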
      
      Cc: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: prevent fib6_run_gc() contention · 2ac3ac8f
      Committed by Michal Kubeček
      On a high-traffic router with many processors and many IPv6 dst
      entries, soft lockup in fib6_run_gc() can occur when number of
      entries reaches gc_thresh.
      
      This happens because fib6_run_gc() uses fib6_gc_lock to allow
      only one thread to run the garbage collector but ip6_dst_gc()
      doesn't update net->ipv6.ip6_rt_last_gc until fib6_run_gc()
      returns. On a system with many entries, this can take some time
      so that in the meantime, other threads pass the tests in
      ip6_dst_gc() (ip6_rt_last_gc is still not updated) and wait for
      the lock. They then have to run the garbage collector one after
      another, which blocks them for quite a long time.
      
      Resolve this by replacing the special value ~0UL of the expire
      parameter to fib6_run_gc() with an explicit "force" parameter that
      chooses between spin_lock_bh() and spin_trylock_bh(), and call
      fib6_run_gc() with force=false if gc_thresh is reached but not max_size.
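
      A simplified sketch of the locking change, where fib6_gc_lock is the
      existing static spinlock and the elided body is indicated by comments:

      	void fib6_run_gc(unsigned long expires, struct net *net, bool force)
      	{
      		if (force) {
      			spin_lock_bh(&fib6_gc_lock);
      		} else if (!spin_trylock_bh(&fib6_gc_lock)) {
      			/* another CPU is already collecting; don't pile up behind it */
      			return;
      		}
      		/* ... age out entries, update ip6_rt_last_gc, rearm the timer ... */
      		spin_unlock_bh(&fib6_gc_lock);
      	}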
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 01 August 2013, 5 commits
  18. 31 July 2013, 2 commits
  19. 30 July 2013, 1 commit
    • freezer: set PF_SUSPEND_TASK flag on tasks that call freeze_processes · 2b44c4db
      Committed by Colin Cross
      Calling freeze_processes sets a global flag that will cause any
      process that calls try_to_freeze to enter the refrigerator.  It
      skips sending a signal to the current task, but if the current
      task ever hits try_to_freeze, all threads will be frozen and the
      system will deadlock.
      
      Set a new flag, PF_SUSPEND_TASK, on the task that calls
      freeze_processes.  The flag notifies the freezer that the thread
      is involved in suspend and should not be frozen.  Also add a
      WARN_ON in thaw_processes if the caller does not have the
      PF_SUSPEND_TASK flag set to catch if a different task calls
      thaw_processes than the one that called freeze_processes, leaving
      a task with PF_SUSPEND_TASK permanently set on it.
      
      Threads that spawn off a task with PF_SUSPEND_TASK set (which
      swsusp does) will also have PF_SUSPEND_TASK set, preventing them
      from freezing while they are helping with suspend, but they need
      to be dead by the time suspend is triggered, otherwise they may
      run when userspace is expected to be frozen.  Add a WARN_ON in
      thaw_processes if more than one thread has the PF_SUSPEND_TASK
      flag set.
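
      A simplified sketch of the flag handling described above (error handling
      and the per-task walk are omitted):

      	int freeze_processes(void)
      	{
      		current->flags |= PF_SUSPEND_TASK;	/* mark the suspending task */
      		/* ... send every other freezable task to the refrigerator ... */
      		return 0;
      	}

      	void thaw_processes(void)
      	{
      		/* warn if a different task thaws than the one that froze */
      		WARN_ON(!(current->flags & PF_SUSPEND_TASK));
      		/* ... thaw everyone ... */
      		current->flags &= ~PF_SUSPEND_TASK;
      	}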
      Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net>
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  20. 29 July 2013, 1 commit
    • Revert "cpuidle: Quickly notice prediction failure for repeat mode" · 14851912
      Committed by Rafael J. Wysocki
      Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
      repeat mode), because it has been identified as the source of a
      significant performance regression in v3.8 and later as explained by
      Jeremy Eder:
      
        We believe we've identified a particular commit to the cpuidle code
        that seems to be impacting performance of a variety of workloads.
        The simplest way to reproduce is using the netperf TCP_RR test, so
        we're using that, on a pair of Sandy Bridge based servers.  We also
        have data from a large database setup where performance is also
        measurably/positively impacted, though that test data isn't easily
        shareable.
      
        Included below are test results from 3 test kernels:
      
        kernel       reverts
        -----------------------------------------------------------
        1) vanilla   upstream (no reverts)
      
        2) perfteam2 reverts e11538d1
      
        3) test      reverts 69a37bea
                             e11538d1
      
        In summary, netperf TCP_RR numbers improve by approximately 4%
        after reverting 69a37bea.  When
        69a37bea is included, C0 residency
        never seems to get above 40%.  Taking that patch out gets C0 near
        100% quite often, and performance increases.
      
        The below data are histograms representing the %c0 residency @
        1-second sample rates (using turbostat), while under netperf test.
      
        - If you look at the first 4 histograms, you can see %c0 residency
          almost entirely in the 30,40% bin.
        - The last pair, which reverts 69a37bea,
          shows %c0 in the 80,90,100% bins.
      
        Below each kernel name are netperf TCP_RR trans/s numbers for the
        particular kernel that can be disclosed publicly, comparing the 3
        test kernels.  We ran a 4th test with the vanilla kernel where
        we've also set /dev/cpu_dma_latency=0 to show overall impact
        boosting single-threaded TCP_RR performance over 11% above
        baseline.
      
        3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
        TCP_RR trans/s 54323.78
      
        -----------------------------------------------------------
        3.10-rc2 vanilla RX (no reverts)
        TCP_RR trans/s 48192.47
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    59]: ***********************************************************
           40.0000 -    50.0000 [     1]: *
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        Sender %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    11]: ***********
           40.0000 -    50.0000 [    49]: *************************************************
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        -----------------------------------------------------------
        3.10-rc2 perfteam2 RX (reverts commit
        e11538d1)
        TCP_RR trans/s 49698.69
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     1]: *
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    59]: ***********************************************************
           40.0000 -    50.0000 [     0]:
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        Sender %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [     2]: **
           40.0000 -    50.0000 [    58]: **********************************************************
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [     0]:
      
        -----------------------------------------------------------
        3.10-rc2 test RX (reverts 69a37bea
        and e11538d1)
        TCP_RR trans/s 47766.95
      
        Receiver %c0
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     1]: *
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    27]: ***************************
           40.0000 -    50.0000 [     2]: **
           50.0000 -    60.0000 [     0]:
           60.0000 -    70.0000 [     2]: **
           70.0000 -    80.0000 [     0]:
           80.0000 -    90.0000 [     0]:
           90.0000 -   100.0000 [    28]: ****************************
      
        Sender:
            0.0000 -    10.0000 [     1]: *
           10.0000 -    20.0000 [     0]:
           20.0000 -    30.0000 [     0]:
           30.0000 -    40.0000 [    11]: ***********
           40.0000 -    50.0000 [     0]:
           50.0000 -    60.0000 [     1]: *
           60.0000 -    70.0000 [     0]:
           70.0000 -    80.0000 [     3]: ***
           80.0000 -    90.0000 [     7]: *******
           90.0000 -   100.0000 [    38]: **************************************
      
        These results demonstrate gaining back the tendency of the CPU to
        stay in more responsive, performant C-states (and thus yield
        measurably better performance), by reverting commit
        69a37bea.
      Requested-by: Jeremy Eder <jeder@redhat.com>
      Tested-by: Len Brown <len.brown@intel.com>
      Cc: 3.8+ <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  21. 28 July 2013, 1 commit
  22. 26 July 2013, 1 commit