1. 09 10月, 2014 3 次提交
  2. 27 9月, 2014 1 次提交
  3. 25 9月, 2014 4 次提交
  4. 24 9月, 2014 2 次提交
    • T
      blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe · 0a30288d
      Tejun Heo 提交于
      blk-mq uses percpu_ref for its usage counter which tracks the number
      of in-flight commands and used to synchronously drain the queue on
      freeze.  percpu_ref shutdown takes measureable wallclock time as it
      involves a sched RCU grace period.  This means that draining a blk-mq
      takes measureable wallclock time.  One would think that this shouldn't
      matter as queue shutdown should be a rare event which takes place
      asynchronously w.r.t. userland.
      
      Unfortunately, SCSI probing involves synchronously setting up and then
      tearing down a lot of request_queues back-to-back for non-existent
      LUNs.  This means that SCSI probing may take more than ten seconds
      when scsi-mq is used.
      
      This will be properly fixed by implementing a mechanism to keep
      q->mq_usage_counter in atomic mode till genhd registration; however,
      that involves rather big updates to percpu_ref which is difficult to
      apply late in the devel cycle (v3.17-rc6 at the moment).  As a
      stop-gap measure till the proper fix can be implemented in the next
      cycle, this patch introduces __percpu_ref_kill_expedited() and makes
      blk_mq_freeze_queue() use it.  This is heavy-handed but should work
      for testing the experimental SCSI blk-mq implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
      Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Tested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      0a30288d
    • T
      crypto: ccp - Check for CCP before registering crypto algs · c9f21cb6
      Tom Lendacky 提交于
      If the ccp is built as a built-in module, then ccp-crypto (whether
      built as a module or a built-in module) will be able to load and
      it will register its crypto algorithms.  If the system does not have
      a CCP this will result in -ENODEV being returned whenever a command
      is attempted to be queued by the registered crypto algorithms.
      
      Add an API, ccp_present(), that checks for the presence of a CCP
      on the system.  The ccp-crypto module can use this to determine if it
      should register it's crypto alogorithms.
      
      Cc: stable@vger.kernel.org
      Reported-by: NScot Doyle <lkml14@scotdoyle.com>
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Tested-by: NScot Doyle <lkml14@scotdoyle.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      c9f21cb6
  5. 23 9月, 2014 1 次提交
    • E
      net: sched: shrink struct qdisc_skb_cb to 28 bytes · 25711786
      Eric Dumazet 提交于
      We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
      or increasing skb->cb[] size.
      
      Commit e0f31d84 ("flow_keys: Record IP layer protocol in
      skb_flow_dissect()") broke IPoIB.
      
      Only current offender is sch_choke, and this one do not need an
      absolutely precise flow key.
      
      If we store 17 bytes of flow key, its more than enough. (Its the actual
      size of flow_keys if it was a packed structure, but we might add new
      fields at the end of it later)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: e0f31d84 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25711786
  6. 22 9月, 2014 2 次提交
  7. 21 9月, 2014 2 次提交
  8. 20 9月, 2014 2 次提交
    • N
      genetlink: add function genl_has_listeners() · 0d566379
      Nicolas Dichtel 提交于
      This function is the counterpart of the function netlink_has_listeners().
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d566379
    • S
      IB: ib_umem_release() should decrement mm->pinned_vm from ib_umem_get · 87773dd5
      Shawn Bohrer 提交于
      In debugging an application that receives -ENOMEM from ib_reg_mr(), I
      found that ib_umem_get() can fail because the pinned_vm count has
      wrapped causing it to always be larger than the lock limit even with
      RLIMIT_MEMLOCK set to RLIM_INFINITY.
      
      The wrapping of pinned_vm occurs because the process that calls
      ib_reg_mr() will have its mm->pinned_vm count incremented.  Later a
      different process with a different mm_struct than the one that
      allocated the ib_umem struct ends up releasing it which results in
      decrementing the new processes mm->pinned_vm count past zero and
      wrapping.
      
      I'm not entirely sure what circumstances cause a different process to
      release the ib_umem than the one that allocated it but the kernel
      stack trace of the freeing process from my situation looks like the
      following:
      
          Call Trace:
           [<ffffffff814d64b1>] dump_stack+0x19/0x1b
           [<ffffffffa0b522a5>] ib_umem_release+0x1f5/0x200 [ib_core]
           [<ffffffffa0b90681>] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
           [<ffffffffa0b4d93c>] ib_destroy_qp+0x12c/0x170 [ib_core]
           [<ffffffffa0cc7129>] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
           [<ffffffff81141cba>] __fput+0xba/0x240
           [<ffffffff81141e4e>] ____fput+0xe/0x10
           [<ffffffff81060894>] task_work_run+0xc4/0xe0
           [<ffffffff810029e5>] do_notify_resume+0x95/0xa0
           [<ffffffff814e3dd0>] int_signal+0x12/0x17
      
      The following patch fixes the issue by storing the pid struct of the
      process that calls ib_umem_get() so that ib_umem_release and/or
      ib_umem_account() can properly decrement the pinned_vm count of the
      correct mm_struct.
      Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com>
      Reviewed-by: NShachar Raindel <raindel@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      87773dd5
  9. 19 9月, 2014 2 次提交
  10. 17 9月, 2014 1 次提交
  11. 16 9月, 2014 3 次提交
  12. 15 9月, 2014 1 次提交
    • L
      vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds 提交于
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't, Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor store to load forwarding, causing
      an unnecessary hickup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9226b5b4
  13. 14 9月, 2014 1 次提交
    • L
      Make hash_64() use a 64-bit multiply when appropriate · 23d0db76
      Linus Torvalds 提交于
      The hash_64() function historically does the multiply by the
      GOLDEN_RATIO_PRIME_64 number with explicit shifts and adds, because
      unlike the 32-bit case, gcc seems unable to turn the constant multiply
      into the more appropriate shift and adds when required.
      
      However, that means that we generate those shifts and adds even when the
      architecture has a fast multiplier, and could just do it better in
      hardware.
      
      Use the now-cleaned-up CONFIG_ARCH_HAS_FAST_MULTIPLIER (together with
      "is it a 64-bit architecture") to decide whether to use an integer
      multiply or the explicit sequence of shift/add instructions.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23d0db76
  14. 13 9月, 2014 3 次提交
    • S
      ipv6: clean up anycast when an interface is destroyed · 381f4dca
      Sabrina Dubroca 提交于
      If we try to rmmod the driver for an interface while sockets with
      setsockopt(JOIN_ANYCAST) are alive, some refcounts aren't cleaned up
      and we get stuck on:
      
        unregister_netdevice: waiting for ens3 to become free. Usage count = 1
      
      If we LEAVE_ANYCAST/close everything before rmmod'ing, there is no
      problem.
      
      We need to perform a cleanup similar to the one for multicast in
      addrconf_ifdown(how == 1).
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      381f4dca
    • A
      jiffies: Fix timeval conversion to jiffies · d78c9300
      Andrew Hunter 提交于
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
      
      Tested: the following program:
      
      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
          return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: NPaul Turner <pjt@google.com>
      Reported-by: NAaron Jacobs <jacobsa@google.com>
      Signed-off-by: NAndrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      d78c9300
    • T
      workqueue: apply __WQ_ORDERED to create_singlethread_workqueue() · e09c2c29
      Tejun Heo 提交于
      create_singlethread_workqueue() is a compat interface for single
      threaded workqueue which maps to ordered workqueue w/ rescuer in the
      current implementation.  create_singlethread_workqueue() currently
      implemented by invoking alloc_workqueue() w/ appropriate parameters.
      
      8719dcea ("workqueue: reject adjusting max_active or applying
      attrs to ordered workqueues") introduced __WQ_ORDERED to protect
      ordered workqueues against dynamic attribute changes which can break
      ordering guarantees but forgot to apply it to
      create_singlethread_workqueue().  This in itself is okay as nobody
      currently uses dynamic attribute change on workqueues created with
      create_singlethread_workqueue().
      
      However, 4c16bd32 ("workqueue: implement NUMA affinity for unbound
      workqueues") broke singlethreaded guarantee for ordered workqueues
      through allocating a separate pool_workqueue on each NUMA node by
      default.  A later change 8a2b7538 ("workqueue: fix ordered
      workqueues in NUMA setups") fixed it by allocating only one global
      pool_workqueue if __WQ_ORDERED is set.
      
      Combined, the __WQ_ORDERED omission in create_singlethread_workqueue()
      became critical breaking its single threadedness and ordering
      guarantee.
      
      Let's make create_singlethread_workqueue() wrap
      alloc_ordered_workqueue() instead so that it inherits __WQ_ORDERED and
      can implicitly track future ordered_workqueue changes.
      
      v2: I missed that __WQ_ORDERED now protects against pwq splitting
          across NUMA nodes and incorrectly described the patch as a
          nice-to-have fix to protect against future dynamic attribute
          usages.  Oleg pointed out that this is actually a critical
          breakage due to 8a2b7538 ("workqueue: fix ordered workqueues
          in NUMA setups").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NMike Anderson <mike.anderson@us.ibm.com>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Gustavo Luiz Duarte <gduarte@redhat.com>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 4c16bd32 ("workqueue: implement NUMA affinity for unbound workqueues")
      e09c2c29
  15. 12 9月, 2014 1 次提交
  16. 11 9月, 2014 3 次提交
  17. 09 9月, 2014 1 次提交
  18. 06 9月, 2014 2 次提交
  19. 05 9月, 2014 3 次提交
    • H
      crypto: drbg - backport "fix maximum value checks on 32 bit systems" · fb38ab4c
      Herbert Xu 提交于
      This is a backport of commit b9347aff.
      This backport is needed as without it the code will crash on 32-bit
      systems.
          
      The maximum values for additional input string or generated blocks is
      larger than 1<<32. To ensure a sensible value on 32 bit systems, return
      SIZE_MAX on 32 bit systems. This value is lower than the maximum
      allowed values defined in SP800-90A. The standard allow lower maximum
      values, but not larger values.
      
      SIZE_MAX - 1 is used for drbg_max_addtl to allow
      drbg_healthcheck_sanity to check the enforcement of the variable
      without wrapping.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      fb38ab4c
    • P
      usb: usbip: fix usbip.h path in userspace tool · 6fa9e1be
      Piotr Król 提交于
      Fixes: 588b48ca ("usbip: move usbip userspace code out of staging")
      which introduced build failure by not changing uapi/usbip.h include path
      according to new location.
      Signed-off-by: NPiotr Król <piotr.krol@3mdeb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6fa9e1be
    • F
      nohz: Restore NMI safe local irq work for local nohz kick · 40bea039
      Frederic Weisbecker 提交于
      The local nohz kick is currently used by perf which needs it to be
      NMI-safe. Recent commit though (7d1311b9)
      changed its implementation to fire the local kick using the remote kick
      API. It was convenient to make the code more generic but the remote kick
      isn't NMI-safe.
      
      As a result:
      
      	WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
      	CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
      	0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
      	0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
      	0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
      	Call Trace:
      	<NMI>  [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a
      	[<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0
      	[<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20
      	[<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140
      	[<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90
      	[<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350
      	[<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	[<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	[<ffffffff9a187934>] perf_event_overflow+0x14/0x20
      	[<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410
      	[<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130
      	[<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50
      	[<ffffffff9a007b72>] nmi_handle+0xd2/0x390
      	[<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390
      	[<ffffffff9a0d131b>] ? lock_release+0xab/0x330
      	[<ffffffff9a008062>] default_do_nmi+0x72/0x1c0
      	[<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200
      	[<ffffffff9a008268>] do_nmi+0xb8/0x100
      
      Lets fix this by restoring the use of local irq work for the nohz local
      kick.
      Reported-by: NCatalin Iacob <iacobcatalin@gmail.com>
      Reported-and-tested-by: NDave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      40bea039
  20. 04 9月, 2014 1 次提交
  21. 03 9月, 2014 1 次提交