1. 19 5月, 2015 1 次提交
  2. 17 5月, 2015 1 次提交
    • H
      rhashtable: Add cap on number of elements in hash table · 07ee0722
      Herbert Xu 提交于
      We currently have no limit on the number of elements in a hash table.
      This is a problem because some users (tipc) set a ceiling on the
      maximum table size and when that is reached the hash table may
      degenerate.  Others may encounter OOM when growing and if we allow
      insertions when that happens the hash table perofrmance may also
      suffer.
      
      This patch adds a new paramater insecure_max_entries which becomes
      the cap on the table.  If unset it defaults to max_size * 2.  If
      it is also zero it means that there is no cap on the number of
      elements in the table.  However, the table will grow whenever the
      utilisation hits 100% and if that growth fails, you will get ENOMEM
      on insertion.
      
      As allowing oversubscription is potentially dangerous, the name
      contains the word insecure.
      
      Note that the cap is not a hard limit.  This is done for performance
      reasons as enforcing a hard limit will result in use of atomic ops
      that are heavier than the ones we currently use.
      
      The reasoning is that we're only guarding against a gross over-
      subscription of the table, rather than a small breach of the limit.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07ee0722
  3. 16 5月, 2015 1 次提交
  4. 15 5月, 2015 1 次提交
  5. 13 5月, 2015 1 次提交
  6. 10 5月, 2015 1 次提交
  7. 07 5月, 2015 1 次提交
  8. 06 5月, 2015 5 次提交
  9. 05 5月, 2015 2 次提交
    • T
      RDMA/core: Enable the iWarp Port Mapper to provide the actual address of the... · 6eec1774
      Tatyana Nikolova 提交于
      RDMA/core: Enable the iWarp Port Mapper to provide the actual address of the connecting peer to its clients
      
      Add functionality to enable the port mapper on the passive side to provide to its
      clients the actual (non-mapped) ip/tcp address information of the connecting peer
      
      1) Adding remote_info_cb() to process the address info of the connecting peer
         The address info is provided by the user space port mapper service when
         the connection is initiated by the peer
      2) Adding a hash list to store the remote address info
      3) Adding functionality to add/remove the remote address info
         After the info has been provided to the port mapper client,
         it is removed from the hash list
      Signed-off-by: NTatyana Nikolova <tatyana.e.nikolova@intel.com>
      Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      6eec1774
    • S
      blk-mq: fix FUA request hang · b2387ddc
      Shaohua Li 提交于
      When a FUA request enters its DATA stage of flush pipeline, the
      request is added to mq requeue list, the request will then be added to
      ctx->rq_list. blk_mq_attempt_merge() might merge the request with a bio.
      Later when the request is finished the flush pipeline, the
      request->__data_len is 0. Then I only saw the bio gets endio called, the
      original request never finish.
      
      Adding REQ_FLUSH_SEQ into REQ_NOMERGE_FLAGS looks an easy fix.
      
      stable: 3.15+
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b2387ddc
  10. 04 5月, 2015 3 次提交
    • R
      mac80211: fix 90 kernel-doc warnings · ff419b3f
      Randy Dunlap 提交于
      Eliminate 90 of these warnings:
      
      Warning(..//include/net/mac80211.h:1682): No description found for parameter 'drv_priv[0]'
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      ff419b3f
    • D
      lib: make memzero_explicit more robust against dead store elimination · 7829fb09
      Daniel Borkmann 提交于
      In commit 0b053c95 ("lib: memzero_explicit: use barrier instead
      of OPTIMIZER_HIDE_VAR"), we made memzero_explicit() more robust in
      case LTO would decide to inline memzero_explicit() and eventually
      find out it could be elimiated as dead store.
      
      While using barrier() works well for the case of gcc, recent efforts
      from LLVMLinux people suggest to use llvm as an alternative to gcc,
      and there, Stephan found in a simple stand-alone user space example
      that llvm could nevertheless optimize and thus elimitate the memset().
      A similar issue has been observed in the referenced llvm bug report,
      which is regarded as not-a-bug.
      
      Based on some experiments, icc is a bit special on its own, while it
      doesn't seem to eliminate the memset(), it could do so with an own
      implementation, and then result in similar findings as with llvm.
      
      The fix in this patch now works for all three compilers (also tested
      with more aggressive optimization levels). Arguably, in the current
      kernel tree it's more of a theoretical issue, but imho, it's better
      to be pedantic about it.
      
      It's clearly visible with gcc/llvm though, with the below code: if we
      would have used barrier() only here, llvm would have omitted clearing,
      not so with barrier_data() variant:
      
        static inline void memzero_explicit(void *s, size_t count)
        {
          memset(s, 0, count);
          barrier_data(s);
        }
      
        int main(void)
        {
          char buff[20];
          memzero_explicit(buff, sizeof(buff));
          return 0;
        }
      
        $ gcc -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x0000000000400400  <+0>: lea   -0x28(%rsp),%rax
         0x0000000000400405  <+5>: movq  $0x0,-0x28(%rsp)
         0x000000000040040e <+14>: movq  $0x0,-0x20(%rsp)
         0x0000000000400417 <+23>: movl  $0x0,-0x18(%rsp)
         0x000000000040041f <+31>: xor   %eax,%eax
         0x0000000000400421 <+33>: retq
        End of assembler dump.
      
        $ clang -O2 test.c
        $ gdb a.out
        (gdb) disassemble main
        Dump of assembler code for function main:
         0x00000000004004f0  <+0>: xorps  %xmm0,%xmm0
         0x00000000004004f3  <+3>: movaps %xmm0,-0x18(%rsp)
         0x00000000004004f8  <+8>: movl   $0x0,-0x8(%rsp)
         0x0000000000400500 <+16>: lea    -0x18(%rsp),%rax
         0x0000000000400505 <+21>: xor    %eax,%eax
         0x0000000000400507 <+23>: retq
        End of assembler dump.
      
      As gcc, clang, but also icc defines __GNUC__, it's sufficient to define
      this in compiler-gcc.h only to be picked up. For a fallback or otherwise
      unsupported compiler, we define it as a barrier. Similarly, for ecc which
      does not support gcc inline asm.
      
      Reference: https://llvm.org/bugs/show_bug.cgi?id=15495Reported-by: NStephan Mueller <smueller@chronox.de>
      Tested-by: NStephan Mueller <smueller@chronox.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: mancha security <mancha1@zoho.com>
      Cc: Mark Charlebois <charlebm@gmail.com>
      Cc: Behan Webster <behanw@converseincode.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      7829fb09
    • E
      codel: fix maxpacket/mtu confusion · a5d28090
      Eric Dumazet 提交于
      Under presence of TSO/GSO/GRO packets, codel at low rates can be quite
      useless. In following example, not a single packet was ever dropped,
      while average delay in codel queue is ~100 ms !
      
      qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
       Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 13626b 3p requeues 0
        count 0 lastcount 0 ldelay 96.9ms drop_next 0us
        maxpacket 9084 ecn_mark 0 drop_overlimit 0
      
      This comes from a confusion of what should be the minimal backlog. It is
      pretty clear it is not 64KB or whatever max GSO packet ever reached the
      qdisc.
      
      codel intent was to use MTU of the device.
      
      After the fix, we finally drop some packets, and rtt/cwnd of my single
      TCP flow are meeting our expectations.
      
      qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
       Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
       backlog 6056b 3p requeues 0
        count 1 lastcount 1 ldelay 36.3ms drop_next 0us
        maxpacket 10598 ecn_mark 0 drop_overlimit 0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Kathleen Nichols <nichols@pollere.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5d28090
  11. 02 5月, 2015 1 次提交
  12. 01 5月, 2015 2 次提交
  13. 30 4月, 2015 6 次提交
    • E
      tcp: add TCP_CC_INFO socket option · 6e9250f5
      Eric Dumazet 提交于
      Some Congestion Control modules can provide per flow information,
      but current way to get this information is to use netlink.
      
      Like TCP_INFO, let's add TCP_CC_INFO so that applications can
      issue a getsockopt() if they have a socket file descriptor,
      instead of playing complex netlink games.
      
      Sample usage would be :
      
        union tcp_cc_info info;
        socklen_t len = sizeof(info);
      
        if (getsockopt(fd, SOL_TCP, TCP_CC_INFO, &info, &len) == -1)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e9250f5
    • E
      tcp: prepare CC get_info() access from getsockopt() · 64f40ff5
      Eric Dumazet 提交于
      We would like that optional info provided by Congestion Control
      modules using netlink can also be read using getsockopt()
      
      This patch changes get_info() to put this information in a buffer,
      instead of skb, like tcp_get_info(), so that following patch
      can reuse this common infrastructure.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64f40ff5
    • E
      tcp: add tcpi_bytes_received to tcp_info · bdd1f9ed
      Eric Dumazet 提交于
      This patch tracks total number of payload bytes received on a TCP socket.
      This is the sum of all changes done to tp->rcv_nxt
      
      RFC4898 named this : tcpEStatsAppHCThruOctetsReceived
      
      This is a 64bit field, and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->bytes_received was placed near tp->rcv_nxt for
      best data locality and minimal performance impact.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdd1f9ed
    • E
      tcp: add tcpi_bytes_acked to tcp_info · 0df48c26
      Eric Dumazet 提交于
      This patch tracks total number of bytes acked for a TCP socket.
      This is the sum of all changes done to tp->snd_una, and allows
      for precise tracking of delivered data.
      
      RFC4898 named this : tcpEStatsAppHCThruOctetsAcked
      
      This is a 64bit field, and can be fetched both from TCP_INFO
      getsockopt() if one has a handle on a TCP socket, or from inet_diag
      netlink facility (iproute2/ss patch will follow)
      
      Note that tp->bytes_acked was placed near tp->snd_una for
      best data locality and minimal performance impact.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Eric Salo <salo@google.com>
      Cc: Martin Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0df48c26
    • N
      bridge/nl: remove wrong use of NLM_F_MULTI · 46c264da
      Nicolas Dichtel 提交于
      NLM_F_MULTI must be used only when a NLMSG_DONE message is sent. In fact,
      it is sent only at the end of a dump.
      
      Libraries like libnl will wait forever for NLMSG_DONE.
      
      Fixes: e5a55a89 ("net: create generic bridge ops")
      Fixes: 815cccbf ("ixgbe: add setlink, getlink support to ixgbe and ixgbevf")
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Sathya Perla <sathya.perla@emulex.com>
      CC: Subbu Seetharaman <subbu.seetharaman@emulex.com>
      CC: Ajit Khaparde <ajit.khaparde@emulex.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: intel-wired-lan@lists.osuosl.org
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: bridge@lists.linux-foundation.org
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c264da
    • B
      xen: Suspend ticks on all CPUs during suspend · 2b953a5e
      Boris Ostrovsky 提交于
      Commit 77e32c89 ("clockevents: Manage device's state separately for
      the core") decouples clockevent device's modes from states. With this
      change when a Xen guest tries to resume, it won't be calling its
      set_mode op which needs to be done on each VCPU in order to make the
      hypervisor aware that we are in oneshot mode.
      
      This happens because clockevents_tick_resume() (which is an intermediate
      step of resuming ticks on a processor) doesn't call clockevents_set_state()
      anymore and because during suspend clockevent devices on all VCPUs (except
      for the one doing the suspend) are left in ONESHOT state. As result, during
      resume the clockevents state machine will assume that device is already
      where it should be and doesn't need to be updated.
      
      To avoid this problem we should suspend ticks on all VCPUs during
      suspend.
      Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      2b953a5e
  14. 29 4月, 2015 2 次提交
  15. 28 4月, 2015 4 次提交
    • R
      ASoC: Update email-id of Rajeev Kumar · 9d7dd6cd
      Rajeev Kumar 提交于
      rajeev-dlh.kumar@st.com email-id doesn't exist anymore as I have left the
      company.  Replace ST's id with Rajeev Kumar <rajeevkumar.linux@gmail.com>
      Signed-off-by: NRajeev Kumar <rajeevkumar.linux@gmail.com>
      Signed-off-by: NMark Brown <broonie@kernel.org>
      9d7dd6cd
    • F
      tty: Re-add external interface for tty_set_termios() · b00f5c2d
      Frederic Danis 提交于
      This is needed by Bluetooth hci_uart module to be able to change speed
      of Bluetooth controller and local UART.
      Signed-off-by: NFrederic Danis <frederic.danis@linux.intel.com>
      Reviewed-by: NPeter Hurley <peter@hurleysoftware.com>
      Cc: Marcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b00f5c2d
    • H
      uas: Add US_FL_MAX_SECTORS_240 flag · ee136af4
      Hans de Goede 提交于
      The usb-storage driver sets max_sectors = 240 in its scsi-host template,
      for uas we do not want to do that for all devices, but testing has shown
      that some devices need it.
      
      This commit adds a US_FL_MAX_SECTORS_240 flag for such devices, and
      implements support for it in uas.c, while at it it also adds support
      for US_FL_MAX_SECTORS_64 to uas.c.
      
      Cc: stable@vger.kernel.org # 3.16
      Signed-off-by: NHans de Goede <hdegoede@redhat.com>
      Acked-by: NAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee136af4
    • M
      SCSI: add 1024 max sectors black list flag · 35e9a9f9
      Mike Christie 提交于
      This works around a issue with qnap iscsi targets not handling large IOs
      very well.
      
      The target returns:
      
      VPD INQUIRY: Block limits page (SBC)
        Maximum compare and write length: 1 blocks
        Optimal transfer length granularity: 1 blocks
        Maximum transfer length: 4294967295 blocks
        Optimal transfer length: 4294967295 blocks
        Maximum prefetch, xdread, xdwrite transfer length: 0 blocks
        Maximum unmap LBA count: 8388607
        Maximum unmap block descriptor count: 1
        Optimal unmap granularity: 16383
        Unmap granularity alignment valid: 0
        Unmap granularity alignment: 0
        Maximum write same length: 0xffffffff blocks
        Maximum atomic transfer length: 0
        Atomic alignment: 0
        Atomic transfer length granularity: 0
      
      and it is *sometimes* able to handle at least one IO of size up to 8 MB. We
      have seen in traces where it will sometimes work, but other times it
      looks like it fails and it looks like it returns failures if we send
      multiple large IOs sometimes. Also it looks like it can return 2 different
      errors. It will sometimes send iscsi reject errors indicating out of
      resources or it will send invalid cdb illegal requests check conditions.
      And then when it sends iscsi rejects it does not seem to handle retries
      when there are command sequence holes, so I could not just add code to
      try and gracefully handle that error code.
      
      The problem is that we do not have a good contact for the company,
      so we are not able to determine under what conditions it returns
      which error and why it sometimes works.
      
      So, this patch just adds a new black list flag to set targets like this to
      the old max safe sectors of 1024. The max_hw_sectors changes added in 3.19
      caused this regression, so I also ccing stable.
      Reported-by: NChristian Hesse <list@eworm.de>
      Signed-off-by: NMike Christie <michaelc@cs.wisc.edu>
      Cc: stable@vger.kernel.org
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJames Bottomley <JBottomley@Odin.com>
      35e9a9f9
  16. 27 4月, 2015 3 次提交
  17. 26 4月, 2015 3 次提交
    • G
      libata: Ignore spurious PHY event on LPM policy change · 09c5b480
      Gabriele Mazzotta 提交于
      When the LPM policy is set to ATA_LPM_MAX_POWER, the device might
      generate a spurious PHY event that cuases errors on the link.
      Ignore this event if it occured within 10s after the policy change.
      
      The timeout was chosen observing that on a Dell XPS13 9333 these
      spurious events can occur up to roughly 6s after the policy change.
      
      Link: http://lkml.kernel.org/g/3352987.ugV1Ipy7Z5@xps13Signed-off-by: NGabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      09c5b480
    • G
      libata: Add helper to determine when PHY events should be ignored · 8393b811
      Gabriele Mazzotta 提交于
      This is a preparation commit that will allow to add other criteria
      according to which PHY events should be dropped.
      Signed-off-by: NGabriele Mazzotta <gabriele.mzt@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      8393b811
    • E
      net: fix crash in build_skb() · 2ea2f62c
      Eric Dumazet 提交于
      When I added pfmemalloc support in build_skb(), I forgot netlink
      was using build_skb() with a vmalloc() area.
      
      In this patch I introduce __build_skb() for netlink use,
      and build_skb() is a wrapper handling both skb->head_frag and
      skb->pfmemalloc
      
      This means netlink no longer has to hack skb->head_frag
      
      [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
      [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      [ 1567.700067] Dumping ftrace buffer:
      [ 1567.700067]    (ftrace buffer empty)
      [ 1567.700067] Modules linked in:
      [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
      [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
      [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
      [ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
      [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
      [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
      [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
      [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
      [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
      [ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
      [ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
      [ 1567.700067] Stack:
      [ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
      [ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
      [ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
      [ 1567.700067] Call Trace:
      [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
      [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
      [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
      [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
      [ 1567.774369] sock_write_iter (net/socket.c:823)
      [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
      [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
      [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
      [ 1567.774369] ? default_llseek (fs/read_write.c:487)
      [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
      [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
      [ 1567.774369] vfs_write (fs/read_write.c:539)
      [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
      [ 1567.774369] ? SyS_read (fs/read_write.c:577)
      [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
      [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
      [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
      [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)
      
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2ea2f62c
  18. 25 4月, 2015 2 次提交
    • E
      fix I_DIO_WAKEUP definition · ac74d8d6
      Eric Sandeen 提交于
      I_DIO_WAKEUP is never directly used, but fix it up anyway.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ac74d8d6
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0