1. 20 5月, 2015 5 次提交
    • D
      module: add core_param_unsafe · ec0ccc16
      Dmitry Torokhov 提交于
      Similarly to module_param_unsafe(), add the helper to be used by core
      code wishing to expose unsafe debugging or testing parameters that taint
      the kernel when set.
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec0ccc16
    • L
      driver-core: enable drivers to opt-out of async probe · d173a137
      Luis R. Rodriguez 提交于
      There are drivers that can not be probed asynchronously. One such group
      is platform drivers registered with platform_driver_probe(), which
      expects driver's probe routine be discarded after the driver has been
      registered and initial binding attempt executed. Also
      platform_driver_probe() an error when no devices were bound to the
      driver, allowing failing to load such driver module altogether.
      
      Other drivers do not work well with asynchronous probing because of
      driver bug or not optimal driver organization.
      
      To allow using such drivers even when user requests asynchronous probing
      as default boot strategy, let's allow them to opt out.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d173a137
    • L
      driver-core: add driver module asynchronous probe support · f2411da7
      Luis R. Rodriguez 提交于
      Some init systems may wish to express the desire to have device drivers
      run their probe() code asynchronously. This implements support for this
      and allows userspace to request async probe as a preference through a
      generic shared device driver module parameter, async_probe.
      
      Implementation for async probe is supported through a module parameter
      given that since synchronous probe has been prevalent for years some
      userspace might exist which relies on the fact that the device driver
      will probe synchronously and the assumption that devices it provides
      will be immediately available after this.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f2411da7
    • D
      driver-core: add asynchronous probing support for drivers · 765230b5
      Dmitry Torokhov 提交于
      Some devices take a long time when initializing, and not all drivers are
      suited to initialize their devices when they are open. For example,
      input drivers need to interrogate their devices in order to publish
      device's capabilities before userspace will open them. When such drivers
      are compiled into kernel they may stall entire kernel initialization.
      
      This change allows drivers request for their probe functions to be
      called asynchronously during driver and device registration (manual
      binding is still synchronous). Because async_schedule is used to perform
      asynchronous calls module loading will still wait for the probing to
      complete.
      
      Note that the end goal is to make the probing asynchronous by default,
      so annotating drivers with PROBE_PREFER_ASYNCHRONOUS is a temporary
      measure that allows us to speed up boot process while we validating and
      fixing the rest of the drivers and preparing userspace.
      
      This change is based on earlier patch by "Luis R. Rodriguez"
      <mcgrof@suse.com>
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      765230b5
    • L
      module: add extra argument for parse_params() callback · ecc86170
      Luis R. Rodriguez 提交于
      This adds an extra argument onto parse_params() to be used
      as a way to make the unused callback a bit more useful and
      generic by allowing the caller to pass on a data structure
      of its choice. An example use case is to allow us to easily
      make module parameters for every module which we will do
      next.
      
      @ parse @
      identifier name, args, params, num, level_min, level_max;
      identifier unknown, param, val, doing;
      type s16;
      @@
       extern char *parse_args(const char *name,
       			 char *args,
       			 const struct kernel_param *params,
       			 unsigned num,
       			 s16 level_min,
       			 s16 level_max,
      +			 void *arg,
       			 int (*unknown)(char *param, char *val,
      					const char *doing
      +					, void *arg
      					));
      
      @ parse_mod @
      identifier name, args, params, num, level_min, level_max;
      identifier unknown, param, val, doing;
      type s16;
      @@
       char *parse_args(const char *name,
       			 char *args,
       			 const struct kernel_param *params,
       			 unsigned num,
       			 s16 level_min,
       			 s16 level_max,
      +			 void *arg,
       			 int (*unknown)(char *param, char *val,
      					const char *doing
      +					, void *arg
      					))
      {
      	...
      }
      
      @ parse_args_found @
      expression R, E1, E2, E3, E4, E5, E6;
      identifier func;
      @@
      
      (
      	R =
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   func);
      |
      	R =
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   &func);
      |
      	R =
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   NULL);
      |
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   func);
      |
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   &func);
      |
      	parse_args(E1, E2, E3, E4, E5, E6,
      +		   NULL,
      		   NULL);
      )
      
      @ parse_args_unused depends on parse_args_found @
      identifier parse_args_found.func;
      @@
      
      int func(char *param, char *val, const char *unused
      +		 , void *arg
      		 )
      {
      	...
      }
      
      @ mod_unused depends on parse_args_found @
      identifier parse_args_found.func;
      expression A1, A2, A3;
      @@
      
      -	func(A1, A2, A3);
      +	func(A1, A2, A3, NULL);
      
      Generated-by: Coccinelle SmPL
      Cc: cocci@systeme.lip6.fr
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Felipe Contreras <felipe.contreras@gmail.com>
      Cc: Ewan Milne <emilne@redhat.com>
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecc86170
  2. 15 5月, 2015 2 次提交
    • J
      uidgid: make uid_valid and gid_valid work with !CONFIG_MULTIUSER · 929aa5b2
      Josh Triplett 提交于
      {u,g}id_valid call {u,g}id_eq, which calls __k{u,g}id_val on both
      arguments and compares.  With !CONFIG_MULTIUSER, __k{u,g}id_val return a
      constant 0, which makes {u,g}id_valid always return false.  Change
      {u,g}id_valid to compare their argument against -1 instead.  That produces
      identical results in the normal CONFIG_MULTIUSER=y case, but with
      !CONFIG_MULTIUSER will make {u,g}id_valid constant-fold into "return
      true;" rather than "return false;".
      
      This fixes uses of devpts without CONFIG_MULTIUSER.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>,
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      929aa5b2
    • V
      gfp: add __GFP_NOACCOUNT · 8f4fc071
      Vladimir Davydov 提交于
      Not all kmem allocations should be accounted to memcg.  The following
      patch gives an example when accounting of a certain type of allocations to
      memcg can effectively result in a memory leak.  This patch adds the
      __GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
      allocation to go through the root cgroup.  It will be used by the next
      patch.
      
      Note, since in case of kmemleak enabled each kmalloc implies yet another
      allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
      gfp_kmemleak_mask.
      
      Alternatively, we could introduce a per kmem cache flag disabling
      accounting for all allocations of a particular kind, but (a) we would not
      be able to bypass accounting for kmalloc then and (b) a kmem cache with
      this flag set could not be merged with a kmem cache without this flag,
      which would increase the number of global caches and therefore
      fragmentation even if the memory cgroup controller is not used.
      
      Despite its generic name, currently __GFP_NOACCOUNT disables accounting
      only for kmem allocations while user page allocations are always charged.
      To catch abusing of this flag, a warning is issued on an attempt of
      passing it to mem_cgroup_try_charge.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f4fc071
  3. 13 5月, 2015 1 次提交
  4. 11 5月, 2015 1 次提交
    • P
      pty: Fix input race when closing · 1a48632f
      Peter Hurley 提交于
      A read() from a pty master may mistakenly indicate EOF (errno == -EIO)
      after the pty slave has closed, even though input data remains to be read.
      For example,
      
             pty slave       |        input worker        |    pty master
                             |                            |
                             |                            |   n_tty_read()
      pty_write()            |                            |     input avail? no
        add data             |                            |     sleep
        schedule worker  --->|                            |     .
                             |---> flush_to_ldisc()       |     .
      pty_close()            |       fill read buffer     |     .
        wait for worker      |       wakeup reader    --->|     .
                             |       read buffer full?    |---> input avail ? yes
                             |<---   yes - exit worker    |     copy 4096 bytes to user
        TTY_OTHER_CLOSED <---|                            |<--- kick worker
                             |                            |
      
      		                **** New read() before worker starts ****
      
                             |                            |   n_tty_read()
                             |                            |     input avail? no
                             |                            |     TTY_OTHER_CLOSED? yes
                             |                            |     return -EIO
      
      Several conditions are required to trigger this race:
      1. the ldisc read buffer must become full so the input worker exits
      2. the read() count parameter must be >= 4096 so the ldisc read buffer
         is empty
      3. the subsequent read() occurs before the kicked worker has processed
         more input
      
      However, the underlying cause of the race is that data is pipelined, while
      tty state is not; ie., data already written by the pty slave end is not
      yet visible to the pty master end, but state changes by the pty slave end
      are visible to the pty master end immediately.
      
      Pipeline the TTY_OTHER_CLOSED state through input worker to the reader.
      1. Introduce TTY_OTHER_DONE which is set by the input worker when
         TTY_OTHER_CLOSED is set and either the input buffers are flushed or
         input processing has completed. Readers/polls are woken when
         TTY_OTHER_DONE is set.
      2. Reader/poll checks TTY_OTHER_DONE instead of TTY_OTHER_CLOSED.
      3. A new input worker is started from pty_close() after setting
         TTY_OTHER_CLOSED, which ensures the TTY_OTHER_DONE state will be
         set if the last input worker is already finished (or just about to
         exit).
      
      Remove tty_flush_to_ldisc(); no in-tree callers.
      
      Fixes: 52bce7f8 ("pty, n_tty: Simplify input processing on final close")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96311
      BugLink: http://bugs.launchpad.net/bugs/1429756
      Cc: <stable@vger.kernel.org> # 3.19+
      Reported-by: NAndy Whitcroft <apw@canonical.com>
      Reported-by: NH.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: NPeter Hurley <peter@hurleysoftware.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1a48632f
  5. 08 5月, 2015 1 次提交
  6. 07 5月, 2015 1 次提交
  7. 06 5月, 2015 2 次提交
  8. 05 5月, 2015 1 次提交
    • S
      blk-mq: fix FUA request hang · b2387ddc
      Shaohua Li 提交于
      When a FUA request enters its DATA stage of flush pipeline, the
      request is added to mq requeue list, the request will then be added to
      ctx->rq_list. blk_mq_attempt_merge() might merge the request with a bio.
      Later when the request is finished the flush pipeline, the
      request->__data_len is 0. Then I only saw the bio gets endio called, the
      original request never finish.
      
      Adding REQ_FLUSH_SEQ into REQ_NOMERGE_FLAGS looks an easy fix.
      
      stable: 3.15+
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b2387ddc
  9. 04 5月, 2015 1 次提交
    • D
      lib: make memzero_explicit more robust against dead store elimination · 7829fb09
      Daniel Borkmann 提交于
      In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
      of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
      case LTO would decide to inline memzero_explicit() and eventually
      find out it could be elimiated as dead store.
      
      While using barrier() works well for the case of gcc, recent efforts
      from LLVMLinux people suggest to use llvm as an alternative to gcc,
      and there, Stephan found in a simple stand-alone user space example
      that llvm could nevertheless optimize and thus elimitate the memset().
      A similar issue has been observed in the referenced llvm bug report,
      which is regarded as not-a-bug.
      
      Based on some experiments, icc is a bit special on its own, while it
      doesn't seem to eliminate the memset(), it could do so with an own
      implementation, and then result in similar findings as with llvm.
      
      The fix in this patch now works for all three compilers (also tested
      with more aggressive optimization levels). Arguably, in the current
      kernel tree it's more of a theoretical issue, but imho, it's better
      to be pedantic about it.
      
      It's clearly visible with gcc/llvm though, with the below code: if we
      would have used barrier() only here, llvm would have omitted clearing,
      not so with barrier_data() variant:
      
        static inline void memzero_explicit(void *s, size_t count)
        {
          memset(s, 0, count);
          barrier_data(s);
        }
      
        int main(void)
        {
          char buff[20];
          memzero_explicit(buff, sizeof(buff));
          return 0;
        }
      
        $ gcc -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
         0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
         0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
         0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
         0x000000000040041f <+31>: xor   %eax,%eax
         0x0000000000400421 <+33>: retq
        End of assembler dump.
      
        $ clang -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
         0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
         0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
         0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
         0x0000000000400505 <+21>: xor    %eax,%eax
         0x0000000000400507 <+23>: retq
        End of assembler dump.
      
      As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
      this in compiler-gcc.h only to be picked up. For a fallback or otherwise
      unsupported compiler, we define it as a barrier. Similarly, for ecc which
      does not support gcc inline asm.
      
      Reference: https://llvm.org/bugs/show_bug.cgi?id=15495Reported-by: NStephan Mueller <smueller@chronox.de>
      Tested-by: NStephan Mueller <smueller@chronox.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: mancha security <mancha1@zoho.com>
      Cc: Mark Charlebois <charlebm@gmail.com>
      Cc: Behan Webster <behanw@converseincode.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      7829fb09
  10. 30 4月, 2015 3 次提交
    • E
      tcp: add tcpi_bytes_received to tcp_info · bdd1f9ed
      Eric Dumazet 提交于
      This patch tracks total number of payload bytes received on a TCP socket.
      This is the sum of all changes done to tp->rcv_nxt
      
      RFC4898 named this : tcpEStatsAppHCThruOctetsReceived
      
      This is a 64bit field, and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->bytes_received was placed near tp->rcv_nxt for
      best data locality and minimal performance impact.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdd1f9ed
    • E
      tcp: add tcpi_bytes_acked to tcp_info · 0df48c26
      Eric Dumazet 提交于
      This patch tracks total number of bytes acked for a TCP socket.
      This is the sum of all changes done to tp->snd_una, and allows
      for precise tracking of delivered data.
      
      RFC4898 named this : tcpEStatsAppHCThruOctetsAcked
      
      This is a 64bit field, and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->bytes_acked was placed near tp->snd_una for
      best data locality and minimal performance impact.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0df48c26
    • N
      bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel 提交于
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c264da
  11. 28 4月, 2015 2 次提交
  12. 27 4月, 2015 1 次提交
  13. 26 4月, 2015 3 次提交
    • G
      libata: Ignore spurious PHY event on LPM policy change · 09c5b480
      Gabriele Mazzotta 提交于
      When the LPM policy is set to ATA_LPM_MAX_POWER, the device might
      generate a spurious PHY event that cuases errors on the link.
      Ignore this event if it occured within 10s after the policy change.
      
      The timeout was chosen observing that on a Dell XPS13 9333 these
      spurious events can occur up to roughly 6s after the policy change.
      
      Link: http://lkml.kernel.org/g/3352987.ugV1Ipy7Z5@xps13Signed-off-by: NGabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      09c5b480
    • G
      libata: Add helper to determine when PHY events should be ignored · 8393b811
      Gabriele Mazzotta 提交于
      This is a preparation commit that will allow to add other criteria
      according to which PHY events should be dropped.
      Signed-off-by: NGabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      8393b811
    • E
      net: fix crash in build_skb() · 2ea2f62c
      Eric Dumazet 提交于
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ea2f62c
  14. 25 4月, 2015 4 次提交
    • E
      fix I_DIO_WAKEUP definition · ac74d8d6
      Eric Sandeen 提交于
      I_DIO_WAKEUP is never directly used, but fix it up anyway.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ac74d8d6
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
    • M
      irqchip: gic: Drop support for gic_arch_extn · 1dcc73d7
      Marc Zyngier 提交于
      Now that the users of gic_arch_extn have been fixed, drop the
      "feature" for good. This leads to the removal of some now useless
      locking.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Jason Cooper <jason@lakedaemon.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1dcc73d7
    • F
      netfilter: bridge: fix NULL deref in physin/out ifindex helpers · 547c4b54
      Florian Westphal 提交于
      Might not have an outdev yet. We'll oops when iface goes down while skbs
      are still nfqueue'd:
      
      RIP: 0010:[<ffffffff81422a2f>]  [<ffffffff81422a2f>] dev_cmp+0x4f/0x80
      nfqnl_rcv_dev_event+0xe2/0x150
      notifier_call_chain+0x53/0xa0
      
      Fixes: c737b7c4 ("netfilter: bridge: add helpers for fetching physin/outdev")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      547c4b54
  15. 24 4月, 2015 5 次提交
    • J
      rhashtable: don't attempt to grow when at max_size · 1d8dc3d3
      Johannes Berg 提交于
      The conversion of mac80211's station table to rhashtable had a bug
      that I found by accident in code review, that hadn't been found as
      rhashtable apparently managed to have a maximum hash chain length
      of one (!) in all our testing.
      
      In order to test the bug and verify the fix I set my rhashtable's
      max_size very low (4) in order to force getting hash collisions.
      
      At that point, rhashtable WARNed in rhashtable_insert_rehash() but
      didn't actually reject the hash table insertion. This caused it to
      lose insertions - my master list of stations would have 9 entries,
      but the rhashtable only had 5. This may warrant a deeper look, but
      that WARN_ON() just shouldn't happen.
      
      Fix this by not returning true from rht_grow_above_100() when the
      rhashtable's max_size has been reached - in this case the user is
      explicitly configuring it to be at most that big, so even if it's
      now above 100% it shouldn't attempt to resize.
      
      This fixes the "lost insertion" issue and consequently allows my
      code to display its error (and verify my fix for it.)
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d8dc3d3
    • A
      NFS: Move nfs_idmap.h into fs/nfs/ · 40c64c26
      Anna Schumaker 提交于
      This file is only used internally to the NFS v4 module, so it doesn't
      need to be in the global include path.  I also renamed it from
      nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
      file.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      40c64c26
    • A
      NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h · f9ebd618
      Anna Schumaker 提交于
      The idmapper is completely internal to the NFS v4 module, so this macro
      will always evaluate to true.  This patch also removes unnecessary
      includes of this file from the generic NFS client.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      f9ebd618
    • J
      sunrpc: make debugfs file creation failure non-fatal · 3f940098
      Jeff Layton 提交于
      v2: gracefully handle the case where some dentry pointers end up NULL
          and be more dilligent about zeroing out dentry pointers
      
      We currently have a problem that SELinux policy is being enforced when
      creating debugfs files. If a debugfs file is created as a side effect of
      doing some syscall, then that creation can fail if the SELinux policy
      for that process prevents it.
      
      This seems wrong. We don't do that for files under /proc, for instance,
      so Bruce has proposed a patch to fix that.
      
      While discussing that patch however, Greg K.H. stated:
      
          "No kernel code should care / fail if a debugfs function fails, so
           please fix up the sunrpc code first."
      
      This patch converts all of the sunrpc debugfs setup code to be void
      return functins, and the callers to not look for errors from those
      functions.
      
      This should allow rpc_clnt and rpc_xprt creation to work, even if the
      kernel fails to create debugfs files for some reason.
      
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      3f940098
    • A
      NFS: Don't zap caches on fallocate() · 9a51940b
      Anna Schumaker 提交于
      This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
      operations so we can set the updated inode size and change attribute
      directly.  DEALLOCATE will still need to release pagecache pages, so
      nfs42_proc_deallocate() now calls truncate_pagecache_range() before
      contacting the server.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      9a51940b
  16. 23 4月, 2015 4 次提交
  17. 22 4月, 2015 3 次提交
    • I
      libceph: announce support for straw2 buckets · 7c1c4747
      Ilya Dryomov 提交于
      Sync up feature bits and enable CEPH_FEATURE_CRUSH_V4.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      7c1c4747
    • I
      crush: straw2 bucket type with an efficient 64-bit crush_ln() · 958a2765
      Ilya Dryomov 提交于
      This is an improved straw bucket that correctly avoids any data movement
      between items A and B when neither A nor B's weights are changed.  Said
      differently, if we adjust the weight of item C (including adding it anew
      or removing it completely), we will only see inputs move to or from C,
      never between other items in the bucket.
      
      Notably, there is not intermediate scaling factor that needs to be
      calculated.  The mapping function is a simple function of the item weights.
      
      The below commits were squashed together into this one (mostly to avoid
      adding and then yanking a ~6000 lines worth of crush_ln_table):
      
      - crush: add a straw2 bucket type
      - crush: add crush_ln to calculate nature log efficently
      - crush: improve straw2 adjustment slightly
      - crush: change crush_ln to provide 32 more digits
      - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2
      - crush/mapper: fix divide-by-0 in straw2
        (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need
         to create a proper compat.h in ceph.git)
      
      Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53,
                                32a1ead92efcd351822d22a5fc37d159c65c1338,
                                6289912418c4a3597a11778bcf29ed5415117ad9,
                                35fcb04e2945717cf5cfe150b9fa89cb3d2303a1,
                                6445d9ee7290938de1e4ee9563912a6ab6d8ee5f,
                                b5921d55d16796e12d66ad2c4add7305f9ce2353.
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      958a2765
    • Y
      ceph: rename snapshot support · 0ea611a3
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      0ea611a3