1. 07 Jun 2015, 1 commit
    • perf/x86/intel: Handle multiple records in the PEBS buffer · 21509084
      Committed by Yan, Zheng
      When the PEBS interrupt threshold is larger than one record and the
      machine supports multiple PEBS events, the records of these events are
      mixed up and we need to demultiplex them.
      
      Demuxing the records is hard because the hardware is deficient. The
      hardware has two issues that, when combined, create scenarios that
      are impossible to demux.
      
      The first issue is that the 'status' field of the PEBS record is a copy
      of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
      problem let us first describe the regular PEBS cycle:
      
      A) the CTRn value reaches 0:
        - the corresponding bit in GLOBAL_STATUS gets set
        - we start arming the hardware assist
        < some unspecified amount of time later -- this could cover multiple
          events of interest >
      
      B) the hardware assist is armed, any next event will trigger it
      
      C) a matching event happens:
        - the hardware assist triggers and generates a PEBS record
          this includes a copy of GLOBAL_STATUS at this moment
        - if we auto-reload we (re)set CTRn
        - we clear the relevant bit in GLOBAL_STATUS
      
      Now consider the following chain of events:
      
        A0, B0, A1, C0
      
      The event generated for counter 0 will include a status with counter 1
      set, even though it's not at all related to the record. A similar thing
      can happen with a !PEBS event if it just happens to overflow at the
      right moment.
      
      The second issue is that the hardware will emit only one record for two
      or more counters if the events that trigger the assists happen 'close'
      together. 'Close' can be several cycles; in some cases the window can
      even span the complete assist, if the event is something that doesn't
      need retirement.
      
      For instance, consider this chain of events:
      
        A0, B0, A1, B1, C01
      
      Where C01 is an event that triggers both hardware assists, we will
      generate only a single record, but again with both counters listed in
      the status field.
      
      This time the record pertains to both events.
      
      Note that these two cases are different but indistinguishable from the
      data as generated. Therefore, demuxing records with multiple PEBS bits
      set (we can safely ignore status bits for !PEBS counters) is impossible.
      
      Furthermore we cannot emit the record to both events because that might
      cause a data leak -- the events might not have the same privileges -- so
      what this patch does is discard such events.
      
      The assumption/hope is that such discards will be rare.
      
      Here are some possible ways you may get a high discard rate:
      
        - when you count the same thing multiple times. But that is not a
          useful configuration.
        - you can be unfortunate if you measure with a userspace only PEBS
          event along with either a kernel or unrestricted PEBS event. Imagine
          the event triggering and setting the overflow flag right before
          entering the kernel. Then all kernel side events will end up with
          multiple bits set.
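      
      To make the decision concrete, here is a minimal sketch of the discard
      logic described above, assuming a Haswell-style record layout;
      __handle_pebs_record() is a hypothetical helper standing in for the
      real output path:
      
        static void drain_pebs(struct cpu_hw_events *cpuc, void *base, void *top)
        {
            void *at;
        
            for (at = base; at < top; at += x86_pmu.pebs_record_size) {
                struct pebs_record_hsw *p = at;
                /* Status bits of !PEBS counters are stale GLOBAL_STATUS
                 * copies and can safely be masked out. */
                u64 pebs_status = p->status & cpuc->pebs_enabled;
        
                /* More than one PEBS bit set: the A0,B0,A1,C0 and
                 * A0,B0,A1,B1,C01 cases are indistinguishable -- discard. */
                if (hweight64(pebs_status) != 1)
                    continue;
        
                __handle_pebs_record(cpuc, __ffs64(pebs_status), p);
            }
        }
      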
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      [ Changelog improvements. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 27 May 2015, 2 commits
    • perf/x86/intel/cqm: Use proper data types · b3df4ec4
      Committed by Thomas Gleixner
      'int' is really not a proper data type for an MSR. Use u32 to make it
      clear that we are dealing with a 32-bit unsigned hardware value.
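      
      As an illustration, the RMID read path then carries the value as a u32
      end to end (a sketch, simplified from the CQM driver):
      
        static u64 __rmid_read(u32 rmid)
        {
            u64 val;
      
            /* The RMID is a 32-bit unsigned hardware value: program the
             * event selector with it and read back the occupancy counter. */
            wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
            rdmsrl(MSR_IA32_QM_CTR, val);
            return val;
        }
      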
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Matt Fleming <matt.fleming@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Cc: Will Auld <will.auld@intel.com>
      Link: http://lkml.kernel.org/r/20150518235149.919350144@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix event/group validation · b371b594
      Committed by Peter Zijlstra
      Commit 43b45780 ("perf/x86: Reduce stack usage of
      x86_schedule_events()") violated the rule that 'fake' scheduling, as
      used for event/group validation, should not change the event state.
      
      This went mostly unnoticed because repeated calls of
      x86_pmu::get_event_constraints() would give the same result, and
      x86_pmu::put_event_constraints() would mostly not do anything.
      
      Commit e979121b ("perf/x86/intel: Implement cross-HT corruption
      bug workaround") made the situation much worse by actually setting the
      event->hw.constraint value to NULL, so when validation and actual
      scheduling interact we get NULL ptr derefs.
      
      Fix it by removing the constraint pointer from the event and moving it
      back to an array, this time in cpuc instead of on the stack.
      
      validate_group()
        x86_schedule_events()
          event->hw.constraint = c; # store
      
            <context switch>
              perf_task_event_sched_in()
                ...
                  x86_schedule_events();
                    event->hw.constraint = c2; # store
      
                    ...
      
                    put_event_constraints(event); # assume failure to schedule
                      intel_put_event_constraints()
                        event->hw.constraint = NULL;
      
            <context switch end>
      
          c = event->hw.constraint; # read -> NULL
      
          if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
      
      This in particular is possible when the event in question is a
      cpu-wide event and group leader, where validate_group() tries to
      add an event to the group.
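      
      A minimal sketch of the resulting layout (simplified from the patch):
      the constraint lives in a per-cpu array indexed by event slot, not in
      the shared event itself, so fake and real scheduling can no longer
      clobber each other's state:
      
        struct cpu_hw_events {
            /* ... */
            /* One slot per scheduled event; validation runs against its
             * own fake cpuc, leaving the events themselves untouched. */
            struct event_constraint *event_constraint[X86_PMC_IDX_MAX];
        };
      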
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()")
      Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 23 May 2015, 1 commit
    • tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Committed by Eric Dumazet
      Taking the socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused, as all TCP sockets share the same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Let's use the u64_sync infrastructure instead. As a bonus, 64-bit
      arches get optimized, as these operations are a NOP for them.
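      
      The update/read pattern then looks roughly like this (a sketch of the
      u64_stats usage, not the literal patch):
      
        /* writer, under the socket lock */
        u64_stats_update_begin(&tp->syncp);
        tp->bytes_acked += delta;
        u64_stats_update_end(&tp->syncp);
      
        /* lockless reader in tcp_get_info() */
        do {
            start = u64_stats_fetch_begin_irq(&tp->syncp);
            info->tcpi_bytes_acked = tp->bytes_acked;
        } while (u64_stats_fetch_retry_irq(&tp->syncp, start));
      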
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 20 May 2015, 1 commit
  5. 17 May 2015, 1 commit
    • rhashtable: Add cap on number of elements in hash table · 07ee0722
      Committed by Herbert Xu
      We currently have no limit on the number of elements in a hash table.
      This is a problem because some users (tipc) set a ceiling on the
      maximum table size, and when that is reached the hash table may
      degenerate.  Others may encounter OOM when growing, and if we allow
      insertions when that happens, hash table performance may also
      suffer.
      
      This patch adds a new parameter, insecure_max_entries, which becomes
      the cap on the table.  If unset it defaults to max_size * 2.  If
      it is also zero it means that there is no cap on the number of
      elements in the table.  However, the table will grow whenever
      utilisation hits 100%, and if that growth fails, you will get ENOMEM
      on insertion.
      
      As allowing oversubscription is potentially dangerous, the name
      contains the word insecure.
      
      Note that the cap is not a hard limit.  This is done for performance
      reasons as enforcing a hard limit will result in use of atomic ops
      that are heavier than the ones we currently use.
      
      The reasoning is that we're only guarding against a gross over-
      subscription of the table, rather than a small breach of the limit.
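      
      A sketch of how a user would set the cap (the embedding struct and its
      field names are illustrative; insecure_max_entries is the new
      parameter):
      
        static const struct rhashtable_params my_params = {
            .head_offset          = offsetof(struct my_obj, node),
            .key_offset           = offsetof(struct my_obj, key),
            .key_len              = sizeof(u32),
            .max_size             = 1024,
            .insecure_max_entries = 2048, /* 0 would default to max_size * 2 */
        };
      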
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 15 May 2015, 2 commits
    • uidgid: make uid_valid and gid_valid work with !CONFIG_MULTIUSER · 929aa5b2
      Committed by Josh Triplett
      {u,g}id_valid call {u,g}id_eq, which calls __k{u,g}id_val on both
      arguments and compares.  With !CONFIG_MULTIUSER, __k{u,g}id_val return a
      constant 0, which makes {u,g}id_valid always return false.  Change
      {u,g}id_valid to compare their argument against -1 instead.  That produces
      identical results in the normal CONFIG_MULTIUSER=y case, but with
      !CONFIG_MULTIUSER will make {u,g}id_valid constant-fold into "return
      true;" rather than "return false;".
      
      This fixes uses of devpts without CONFIG_MULTIUSER.
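      
      The fix essentially amounts to (from the patch):
      
        static inline bool uid_valid(kuid_t uid)
        {
            /* Compare the raw value against -1 instead of going through
             * uid_eq(), which degenerates under !CONFIG_MULTIUSER. */
            return __kuid_val(uid) != (uid_t) -1;
        }
      
        static inline bool gid_valid(kgid_t gid)
        {
            return __kgid_val(gid) != (gid_t) -1;
        }
      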
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • gfp: add __GFP_NOACCOUNT · 8f4fc071
      Committed by Vladimir Davydov
      Not all kmem allocations should be accounted to memcg.  The following
      patch gives an example when accounting of a certain type of allocations to
      memcg can effectively result in a memory leak.  This patch adds the
      __GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
      allocation to go through the root cgroup.  It will be used by the next
      patch.
      
      Note: since with kmemleak enabled each kmalloc implies yet another
      allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
      gfp_kmemleak_mask.
      
      Alternatively, we could introduce a per kmem cache flag disabling
      accounting for all allocations of a particular kind, but (a) we would not
      be able to bypass accounting for kmalloc then and (b) a kmem cache with
      this flag set could not be merged with a kmem cache without this flag,
      which would increase the number of global caches and therefore
      fragmentation even if the memory cgroup controller is not used.
      
      Despite its generic name, currently __GFP_NOACCOUNT disables accounting
      only for kmem allocations, while user page allocations are always charged.
      To catch abuse of this flag, a warning is issued on an attempt to pass
      it to mem_cgroup_try_charge.
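      
      At a call site, usage is then a one-liner (illustrative, not from the
      patch itself):
      
        /* Allocation charged to the root cgroup, never the caller's memcg. */
        ptr = kmalloc(size, GFP_KERNEL | __GFP_NOACCOUNT);
      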
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 13 May 2015, 3 commits
  8. 12 May 2015, 1 commit
    • HID: hid-sensor-hub: Fix debug lock warning · 2d94e522
      Committed by Srinivas Pandruvada
      When CONFIG_DEBUG_LOCK_ALLOC is defined, the mutex magic is compared
      and a warning is issued when (l->magic != l), where l is the address
      of the mutex. In hid-sensor-hub, as part of hsdev creation, a
      per-hsdev mutex is initialized during MFD cell creation. This hsdev,
      which contains the mutex, is part of the platform data for the cell.
      But platform_data is copied in platform_device_add_data() in
      platform.c. This copy duplicates the whole hsdev structure, including
      the mutex, and once copied the magic no longer matches. So when a
      client driver calls sensor_hub_input_attr_get_raw_value(), the mutex
      warning is triggered. To avoid this, allocate the mutex dynamically;
      the pointer then stays valid even after the copy.
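      
      A sketch of the change (simplified from the patch; the driver uses a
      devm-managed allocation):
      
        struct hid_sensor_hub_device {
            /* ... */
            struct mutex *mutex_ptr;    /* was: struct mutex mutex; */
        };
      
        /* A copy of hsdev now copies only the pointer; the mutex itself
         * (and its debug magic) stays at its original address. */
        hsdev->mutex_ptr = devm_kzalloc(&hdev->dev, sizeof(struct mutex),
                                        GFP_KERNEL);
        if (!hsdev->mutex_ptr)
            return -ENOMEM;
        mutex_init(hsdev->mutex_ptr);
      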
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
  9. 11 May 2015, 1 commit
    • pty: Fix input race when closing · 1a48632f
      Committed by Peter Hurley
      A read() from a pty master may mistakenly indicate EOF (errno == EIO)
      after the pty slave has closed, even though input data remains to be read.
      For example,
      
             pty slave       |        input worker        |    pty master
                             |                            |
                             |                            |   n_tty_read()
      pty_write()            |                            |     input avail? no
        add data             |                            |     sleep
        schedule worker  --->|                            |     .
                             |---> flush_to_ldisc()       |     .
      pty_close()            |       fill read buffer     |     .
        wait for worker      |       wakeup reader    --->|     .
                             |       read buffer full?    |---> input avail ? yes
                             |<---   yes - exit worker    |     copy 4096 bytes to user
        TTY_OTHER_CLOSED <---|                            |<--- kick worker
                             |                            |
      
      		                **** New read() before worker starts ****
      
                             |                            |   n_tty_read()
                             |                            |     input avail? no
                             |                            |     TTY_OTHER_CLOSED? yes
                             |                            |     return -EIO
      
      Several conditions are required to trigger this race:
      1. the ldisc read buffer must become full so the input worker exits
      2. the read() count parameter must be >= 4096 so the ldisc read buffer
         is empty
      3. the subsequent read() occurs before the kicked worker has processed
         more input
      
      However, the underlying cause of the race is that data is pipelined, while
      tty state is not; ie., data already written by the pty slave end is not
      yet visible to the pty master end, but state changes by the pty slave end
      are visible to the pty master end immediately.
      
      Pipeline the TTY_OTHER_CLOSED state through input worker to the reader.
      1. Introduce TTY_OTHER_DONE which is set by the input worker when
         TTY_OTHER_CLOSED is set and either the input buffers are flushed or
         input processing has completed. Readers/polls are woken when
         TTY_OTHER_DONE is set.
      2. Reader/poll checks TTY_OTHER_DONE instead of TTY_OTHER_CLOSED.
      3. A new input worker is started from pty_close() after setting
         TTY_OTHER_CLOSED, which ensures the TTY_OTHER_DONE state will be
         set if the last input worker is already finished (or just about to
         exit).
      
      Remove tty_flush_to_ldisc(); no in-tree callers.
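      
      The reader-side check then becomes, roughly (a sketch, not the literal
      n_tty diff):
      
        if (!input_available_p(tty, 0)) {
            /* EOF only once the input worker has drained the pipeline,
             * not merely once the slave has closed. */
            if (test_bit(TTY_OTHER_DONE, &tty->flags)) {
                retval = -EIO;
                break;
            }
            /* otherwise sleep: data may still be in flight in the worker */
        }
      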
      
      Fixes: 52bce7f8 ("pty, n_tty: Simplify input processing on final close")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96311
      BugLink: http://bugs.launchpad.net/bugs/1429756
      Cc: <stable@vger.kernel.org> # 3.19+
      Reported-by: Andy Whitcroft <apw@canonical.com>
      Reported-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 09 May 2015, 1 commit
  11. 08 May 2015, 2 commits
    • perf: Fix software migrate events · ff303e66
      Committed by Peter Zijlstra
      Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
      was broken:
      
       > The problem is that the task isn't actually scheduled while it's being
       > migrated (obviously), and if it's not scheduled, the counters aren't
       > scheduled either, so there's no observing of the fact.
       >
       > A further problem with migrations is that many migrations happen from
       > softirq context, which is nested inside the 'random' task context of
       > whomever happens to run at that time, similarly for the wakeup
       > migrations triggered from (soft)irq context. All those end up being
       > accounted in the task that's currently running, e.g. your 'ls'.
      
      The below cures this by marking a task as migrated and accounting it
      on the subsequent sched_in().
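      
      The core of the patch is a pair of hooks (lightly simplified):
      
        static inline void perf_event_task_migrate(struct task_struct *task)
        {
            if (perf_sw_migrate_enabled())
                task->sched_migrated = 1;
        }
      
        static inline void perf_event_task_sched_in(struct task_struct *prev,
                                                    struct task_struct *task)
        {
            /* ... */
            /* Account the migration now that the task (and thus its
             * counters) is scheduled and able to observe the event. */
            if (perf_sw_migrate_enabled() && task->sched_migrated) {
                __perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
                task->sched_migrated = 0;
            }
        }
      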
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Handle priority boosted tasks proper in setscheduler() · 0782e63b
      Committed by Thomas Gleixner
      Ronny reported that the following scenario is not handled correctly:
      
      	T1 (prio = 10)
      	   lock(rtmutex);
      
      	T2 (prio = 20)
      	   lock(rtmutex)
      	      boost T1
      
      	T1 (prio = 20)
      	   sys_set_scheduler(prio = 30)
      	   T1 prio = 30
      	   ....
      	   sys_set_scheduler(prio = 10)
      	   T1 prio = 30
      
      The last step is wrong as T1 should now be back at prio 20.
      
      Commit c365c292 ("sched: Consider pi boosting in setscheduler()")
      only handles the case where a boosted task tries to lower its
      priority.
      
      Fix it by taking the new effective priority into account for the
      decision whether a change of the priority is required.
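      
      The decision in __sched_setscheduler() then looks roughly like this
      (sketch following the patch):
      
        /* If the new effective priority -- PI boosting included -- is
         * unchanged, just store the new normal parameters and leave the
         * runqueue untouched; requeueing happens when the task deboosts. */
        new_effective_prio = rt_mutex_get_effective_prio(p, newprio);
        if (new_effective_prio == oldprio) {
            __setscheduler_params(p, attr);
            task_rq_unlock(rq, p, &flags);
            return 0;
        }
      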
      Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Fixes: c365c292 ("sched: Consider pi boosting in setscheduler()")
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 07 May 2015, 1 commit
  13. 06 May 2015, 2 commits
  14. 05 May 2015, 1 commit
    • blk-mq: fix FUA request hang · b2387ddc
      Committed by Shaohua Li
      When a FUA request enters the DATA stage of the flush pipeline, the
      request is added to the mq requeue list and will then be added to
      ctx->rq_list. blk_mq_attempt_merge() might merge the request with a bio.
      Later, when the request has finished the flush pipeline, its
      request->__data_len is 0. Then only the bio gets its endio called; the
      original request never finishes.
      
      Adding REQ_FLUSH_SEQ into REQ_NOMERGE_FLAGS looks like an easy fix.
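      
      That is, along these lines:
      
        /* Flush-sequence requests must never be merge candidates. */
        #define REQ_NOMERGE_FLAGS \
            (REQ_NOMERGE | REQ_STARTED | REQ_SOFTBARRIER | REQ_FLUSH | \
             REQ_FUA | REQ_FLUSH_SEQ)
      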
      
      stable: 3.15+
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 04 May 2015, 1 commit
    • lib: make memzero_explicit more robust against dead store elimination · 7829fb09
      Committed by Daniel Borkmann
      In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
      of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
      case LTO would decide to inline memzero_explicit() and eventually
      find out it could be eliminated as a dead store.
      
      While using barrier() works well for the case of gcc, recent efforts
      from the LLVMLinux people suggest using llvm as an alternative to gcc,
      and there, Stephan found in a simple stand-alone user space example
      that llvm could nevertheless optimize and thus eliminate the memset().
      A similar issue has been observed in the referenced llvm bug report,
      which is regarded as not-a-bug.
      
      Based on some experiments, icc is a bit special on its own: while it
      doesn't seem to eliminate the memset(), it could do so with its own
      implementation, and then result in similar findings as with llvm.
      
      The fix in this patch now works for all three compilers (also tested
      with more aggressive optimization levels). Arguably, in the current
      kernel tree it's more of a theoretical issue, but imho, it's better
      to be pedantic about it.
      
      It's clearly visible with gcc/llvm though, with the below code: had we
      used barrier() only here, llvm would have omitted the clearing; not so
      with the barrier_data() variant:
      
        static inline void memzero_explicit(void *s, size_t count)
        {
          memset(s, 0, count);
          barrier_data(s);
        }
      
        int main(void)
        {
          char buff[20];
          memzero_explicit(buff, sizeof(buff));
          return 0;
        }
      
        $ gcc -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
         0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
         0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
         0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
         0x000000000040041f <+31>: xor   %eax,%eax
         0x0000000000400421 <+33>: retq
        End of assembler dump.
      
        $ clang -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
         0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
         0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
         0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
         0x0000000000400505 <+21>: xor    %eax,%eax
         0x0000000000400507 <+23>: retq
        End of assembler dump.
      
      As gcc and clang, but also icc, define __GNUC__, it's sufficient to
      define this in compiler-gcc.h only for it to be picked up. For a
      fallback or otherwise unsupported compiler, we define it as a barrier.
      Similarly for ecc, which does not support gcc inline asm.
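      
      The definition added for gcc-compatible compilers is:
      
        /* The "r"(ptr) input plus the "memory" clobber make the compiler
         * assume the buffer escapes, so the preceding memset() cannot be
         * eliminated as a dead store. */
        #define barrier_data(ptr) __asm__ __volatile__("": :"r"(ptr) :"memory")
      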
      
      Reference: https://llvm.org/bugs/show_bug.cgi?id=15495
      Reported-by: Stephan Mueller <smueller@chronox.de>
      Tested-by: Stephan Mueller <smueller@chronox.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: mancha security <mancha1@zoho.com>
      Cc: Mark Charlebois <charlebm@gmail.com>
      Cc: Behan Webster <behanw@converseincode.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  16. 30 Apr 2015, 3 commits
    • tcp: add tcpi_bytes_received to tcp_info · bdd1f9ed
      Committed by Eric Dumazet
      This patch tracks the total number of payload bytes received on a TCP
      socket. This is the sum of all changes done to tp->rcv_nxt.
      
      RFC 4898 names this tcpEStatsAppHCThruOctetsReceived.
      
      This is a 64-bit field, and can be fetched either from the TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from the inet_diag
      netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->bytes_received was placed near tp->rcv_nxt for
      best data locality and minimal performance impact.
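      
      The update point is a small helper on the receive path (essentially as
      in the patch); the tcpi_bytes_acked entry below does the symmetric
      thing for tp->snd_una via tcp_snd_una_update():
      
        static void tcp_rcv_nxt_update(struct tcp_sock *tp, u32 seq)
        {
            u32 delta = seq - tp->rcv_nxt;  /* payload newly received */
      
            tp->bytes_received += delta;
            tp->rcv_nxt = seq;
        }
      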
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add tcpi_bytes_acked to tcp_info · 0df48c26
      Committed by Eric Dumazet
      This patch tracks the total number of bytes acked for a TCP socket.
      This is the sum of all changes done to tp->snd_una, and allows
      for precise tracking of delivered data.
      
      RFC 4898 names this tcpEStatsAppHCThruOctetsAcked.
      
      This is a 64-bit field, and can be fetched either from the TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from the inet_diag
      netlink facility (an iproute2/ss patch will follow).
      
      Note that tp->bytes_acked was placed near tp->snd_una for
      best data locality and minimal performance impact.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Committed by Nicolas Dichtel
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
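      
      The rule in code (a sketch; the dump flag is illustrative):
      
        /* Only dump replies, which are terminated by NLMSG_DONE, may
         * carry NLM_F_MULTI; single replies must not. */
        int nlflags = is_dump ? NLM_F_MULTI : 0;
      
        nlh = nlmsg_put(skb, portid, seq, RTM_NEWLINK, sizeof(*ifm), nlflags);
      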
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 28 Apr 2015, 2 commits
  18. 27 Apr 2015, 1 commit
  19. 26 Apr 2015, 3 commits
    • libata: Ignore spurious PHY event on LPM policy change · 09c5b480
      Committed by Gabriele Mazzotta
      When the LPM policy is set to ATA_LPM_MAX_POWER, the device might
      generate a spurious PHY event that causes errors on the link.
      Ignore this event if it occurred within 10s after the policy change.
      
      The timeout was chosen observing that on a Dell XPS13 9333 these
      spurious events can occur up to roughly 6s after the policy change.
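      
      The check is, in essence (a sketch following the patch, using the
      constants it introduces):
      
        static bool sata_lpm_ignore_phy_events(struct ata_link *link)
        {
            unsigned long lpm_timeout = link->last_lpm_change +
                                        msecs_to_jiffies(ATA_TMOUT_SPURIOUS_PHY);
      
            /* Drop the event if the LPM policy changed recently. */
            if ((link->flags & ATA_LFLAG_CHANGED) &&
                time_before(jiffies, lpm_timeout))
                return true;
      
            return false;
        }
      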
      
      Link: http://lkml.kernel.org/g/3352987.ugV1Ipy7Z5@xps13
      Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
    • libata: Add helper to determine when PHY events should be ignored · 8393b811
      Committed by Gabriele Mazzotta
      This is a preparation commit that will allow adding other criteria
      according to which PHY events should be dropped.
      Signed-off-by: Gabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
    • net: fix crash in build_skb() · 2ea2f62c
      Committed by Eric Dumazet
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use, while
      build_skb() becomes a wrapper handling both skb->head_frag and
      skb->pfmemalloc.
      
      This means netlink no longer has to hack skb->head_frag.
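      
      The wrapper then reads (essentially as in the patch):
      
        struct sk_buff *build_skb(void *data, unsigned int frag_size)
        {
            struct sk_buff *skb = __build_skb(data, frag_size);
      
            /* Only page-frag heads may be marked; a vmalloc() head, as
             * used by netlink, goes through __build_skb() untouched. */
            if (skb && frag_size) {
                skb->head_frag = 1;
                if (virt_to_head_page(data)->pfmemalloc)
                    skb->pfmemalloc = 1;
            }
            return skb;
        }
      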
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 25 Apr 2015, 4 commits
    • fix I_DIO_WAKEUP definition · ac74d8d6
      Committed by Eric Sandeen
      I_DIO_WAKEUP is never directly used, but fix it up anyway.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Committed by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
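      
      The accounting sites in do_blockdev_direct_IO() then become
      conditional (sketch):
      
        /* Block devices pass DIO_SKIP_DIO_COUNT and skip the atomic
         * i_dio_count round trip entirely. */
        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            inode_dio_begin(inode);
        /* ... submit and complete the I/O ... */
        if (!(dio->flags & DIO_SKIP_DIO_COUNT))
            inode_dio_end(inode);
      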
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • irqchip: gic: Drop support for gic_arch_extn · 1dcc73d7
      Committed by Marc Zyngier
      Now that the users of gic_arch_extn have been fixed, drop the
      "feature" for good. This leads to the removal of some now useless
      locking.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Jason Cooper <jason@lakedaemon.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • netfilter: bridge: fix NULL deref in physin/out ifindex helpers · 547c4b54
      Committed by Florian Westphal
      The skb might not have an outdev yet. We'll oops when the iface goes
      down while skbs are still nfqueued:
      
      RIP: 0010:[<ffffffff81422a2f>]  [<ffffffff81422a2f>] dev_cmp+0x4f/0x80
      nfqnl_rcv_dev_event+0xe2/0x150
      notifier_call_chain+0x53/0xa0
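      
      The helpers simply need to tolerate a NULL physoutdev (sketch of the
      fix):
      
        static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
        {
            /* The skb may be bridged but not yet have an outdev. */
            return skb->nf_bridge && skb->nf_bridge->physoutdev ?
                   skb->nf_bridge->physoutdev->ifindex : 0;
        }
      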
      
      Fixes: c737b7c4 ("netfilter: bridge: add helpers for fetching physin/outdev")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  21. 24 Apr 2015, 5 commits
    • rhashtable: don't attempt to grow when at max_size · 1d8dc3d3
      Committed by Johannes Berg
      The conversion of mac80211's station table to rhashtable had a bug
      that I found by accident in code review; it hadn't been found before,
      as rhashtable apparently managed to have a maximum hash chain length
      of one (!) in all our testing.
      
      In order to test the bug and verify the fix I set my rhashtable's
      max_size very low (4) in order to force getting hash collisions.
      
      At that point, rhashtable WARNed in rhashtable_insert_rehash() but
      didn't actually reject the hash table insertion. This caused it to
      lose insertions - my master list of stations would have 9 entries,
      but the rhashtable only had 5. This may warrant a deeper look, but
      that WARN_ON() just shouldn't happen.
      
      Fix this by not returning true from rht_grow_above_100() when the
      rhashtable's max_size has been reached - in this case the user is
      explicitly configuring it to be at most that big, so even if it's
      now above 100% it shouldn't attempt to resize.
      
      This fixes the "lost insertion" issue and consequently allows my
      code to display its error (and verify my fix for it.)
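      
      The fix itself is a one-line condition change (as in the patch):
      
        static inline bool rht_grow_above_100(const struct rhashtable *ht,
                                              const struct bucket_table *tbl)
        {
            /* Above 100% utilisation, but never grow past max_size. */
            return atomic_read(&ht->nelems) > tbl->size &&
                   (!ht->p.max_size || tbl->size < ht->p.max_size);
        }
      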
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • NFS: Move nfs_idmap.h into fs/nfs/ · 40c64c26
      Committed by Anna Schumaker
      This file is only used internally to the NFS v4 module, so it doesn't
      need to be in the global include path.  I also renamed it from
      nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
      file.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h · f9ebd618
      Committed by Anna Schumaker
      The idmapper is completely internal to the NFS v4 module, so this macro
      will always evaluate to true.  This patch also removes unnecessary
      includes of this file from the generic NFS client.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • sunrpc: make debugfs file creation failure non-fatal · 3f940098
      Committed by Jeff Layton
      v2: gracefully handle the case where some dentry pointers end up NULL
          and be more diligent about zeroing out dentry pointers
      
      We currently have a problem that SELinux policy is being enforced when
      creating debugfs files. If a debugfs file is created as a side effect of
      doing some syscall, then that creation can fail if the SELinux policy
      for that process prevents it.
      
      This seems wrong. We don't do that for files under /proc, for instance,
      so Bruce has proposed a patch to fix that.
      
      While discussing that patch however, Greg K.H. stated:
      
          "No kernel code should care / fail if a debugfs function fails, so
           please fix up the sunrpc code first."
      
      This patch converts all of the sunrpc debugfs setup code to void
      return functions, and the callers to not look for errors from those
      functions.
      
      This should allow rpc_clnt and rpc_xprt creation to work, even if the
      kernel fails to create debugfs files for some reason.
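      
      The setup helpers now look roughly like this (a sketch; naming and
      error handling condensed from the actual conversion):
      
        void rpc_clnt_debugfs_register(struct rpc_clnt *clnt)
        {
            /* ... compute the directory name ... */
            clnt->cl_debugfs = debugfs_create_dir(name, rpc_clnt_debugfs_dir);
            if (!clnt->cl_debugfs)
                return;     /* non-fatal: we just lose the debugfs view */
      
            if (!debugfs_create_file("tasks", S_IFREG | S_IRUSR,
                                     clnt->cl_debugfs, clnt, &tasks_fops)) {
                debugfs_remove_recursive(clnt->cl_debugfs);
                clnt->cl_debugfs = NULL;    /* keep later teardown safe */
            }
        }
      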
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      3f940098
    • NFS: Don't zap caches on fallocate() · 9a51940b
      Committed by Anna Schumaker
      This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
      operations so we can set the updated inode size and change attribute
      directly.  DEALLOCATE will still need to release pagecache pages, so
      nfs42_proc_deallocate() now calls truncate_pagecache_range() before
      contacting the server.
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  22. 23 Apr 2015, 1 commit