1. 18 5月, 2016 4 次提交
    • C
      xprtrdma: Prevent inline overflow · 302d3deb
      Chuck Lever 提交于
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      302d3deb
    • C
      xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers · 94931746
      Chuck Lever 提交于
      Send buffer space is shared between the RPC-over-RDMA header and
      an RPC message. A large RPC-over-RDMA header means less space is
      available for the associated RPC message, which then has to be
      moved via an RDMA Read or Write.
      
      As more segments are added to the chunk lists, the header increases
      in size.  Typical modern hardware needs only a few segments to
      convey the maximum payload size, but some devices and registration
      modes may need a lot of segments to convey data payload. Sometimes
      so many are needed that the remaining space in the Send buffer is
      not enough for the RPC message. Sending such a message usually
      fails.
      
      To ensure a transport can always make forward progress, cap the
      number of RDMA segments that are allowed in chunk lists. This
      prevents less-capable devices and memory registrations from
      consuming a large portion of the Send buffer by reducing the
      maximum data payload that can be conveyed with such devices.
      
      For now I choose an arbitrary maximum of 8 RDMA segments. This
      allows a maximum size RPC-over-RDMA header to fit nicely in the
      current 1024 byte inline threshold with over 700 bytes remaining
      for an inline RPC message.
      
      The current maximum data payload of NFS READ or WRITE requests is
      one megabyte. To convey that payload on a client with 4KB pages,
      each chunk segment would need to handle 32 or more data pages. This
      is well within the capabilities of FMR. For physical registration,
      the maximum payload size on platforms with 4KB pages is reduced to
      32KB.
      
      For FRWR, a device's maximum page list depth would need to be at
      least 34 to support the maximum 1MB payload. A device with a smaller
      maximum page list depth means the maximum data payload is reduced
      when using that device.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      94931746
    • C
      xprtrdma: Bound the inline threshold values · 29c55422
      Chuck Lever 提交于
      Currently the sysctls that allow setting the inline threshold allow
      any value to be set.
      
      Small values only make the transport run slower. The default 1KB
      setting is as low as is reasonable. And the logic that decides how
      to divide a Send buffer between RPC-over-RDMA header and RPC message
      assumes (but does not check) that the lower bound is not crazy (say,
      57 bytes).
      
      Send and receive buffers share a page with some control information.
      Values larger than about 3KB can't be supported, currently.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      29c55422
    • C
      sunrpc: Advertise maximum backchannel payload size · 6b26cc8c
      Chuck Lever 提交于
      RPC-over-RDMA transports have a limit on how large a backward
      direction (backchannel) RPC message can be. Ensure that the NFSv4.x
      CREATE_SESSION operation advertises this limit to servers.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6b26cc8c
  2. 09 5月, 2016 3 次提交
  3. 04 5月, 2016 4 次提交
  4. 03 5月, 2016 1 次提交
    • N
      netem: Segment GSO packets on enqueue · 6071bd1a
      Neil Horman 提交于
      This was recently reported to me, and reproduced on the latest net kernel,
      when attempting to run netperf from a host that had a netem qdisc attached
      to the egress interface:
      
      [  788.073771] ---------------------[ cut here ]---------------------------
      [  788.096716] WARNING: at net/core/dev.c:2253 skb_warn_bad_offload+0xcd/0xda()
      [  788.129521] bnx2: caps=(0x00000001801949b3, 0x0000000000000000) len=2962
      data_len=0 gso_size=1448 gso_type=1 ip_summed=3
      [  788.182150] Modules linked in: sch_netem kvm_amd kvm crc32_pclmul ipmi_ssif
      ghash_clmulni_intel sp5100_tco amd64_edac_mod aesni_intel lrw gf128mul
      glue_helper ablk_helper edac_mce_amd cryptd pcspkr sg edac_core hpilo ipmi_si
      i2c_piix4 k10temp fam15h_power hpwdt ipmi_msghandler shpchp acpi_power_meter
      pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c
      sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
      i2c_algo_bit drm_kms_helper ahci ata_generic pata_acpi ttm libahci
      crct10dif_pclmul pata_atiixp tg3 libata crct10dif_common drm crc32c_intel ptp
      serio_raw bnx2 r8169 hpsa pps_core i2c_core mii dm_mirror dm_region_hash dm_log
      dm_mod
      [  788.465294] CPU: 16 PID: 0 Comm: swapper/16 Tainted: G        W
      ------------   3.10.0-327.el7.x86_64 #1
      [  788.511521] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 12/17/2012
      [  788.542260]  ffff880437c036b8 f7afc56532a53db9 ffff880437c03670
      ffffffff816351f1
      [  788.576332]  ffff880437c036a8 ffffffff8107b200 ffff880633e74200
      ffff880231674000
      [  788.611943]  0000000000000001 0000000000000003 0000000000000000
      ffff880437c03710
      [  788.647241] Call Trace:
      [  788.658817]  <IRQ>  [<ffffffff816351f1>] dump_stack+0x19/0x1b
      [  788.686193]  [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
      [  788.713803]  [<ffffffff8107b29c>] warn_slowpath_fmt+0x5c/0x80
      [  788.741314]  [<ffffffff812f92f3>] ? ___ratelimit+0x93/0x100
      [  788.767018]  [<ffffffff81637f49>] skb_warn_bad_offload+0xcd/0xda
      [  788.796117]  [<ffffffff8152950c>] skb_checksum_help+0x17c/0x190
      [  788.823392]  [<ffffffffa01463a1>] netem_enqueue+0x741/0x7c0 [sch_netem]
      [  788.854487]  [<ffffffff8152cb58>] dev_queue_xmit+0x2a8/0x570
      [  788.880870]  [<ffffffff8156ae1d>] ip_finish_output+0x53d/0x7d0
      ...
      
      The problem occurs because netem is not prepared to handle GSO packets (as it
      uses skb_checksum_help in its enqueue path, which cannot manipulate these
      frames).
      
      The solution I think is to simply segment the skb in a simmilar fashion to the
      way we do in __dev_queue_xmit (via validate_xmit_skb), with some minor changes.
      When we decide to corrupt an skb, if the frame is GSO, we segment it, corrupt
      the first segment, and enqueue the remaining ones.
      
      tested successfully by myself on the latest net kernel, to which this applies
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: netem@lists.linux-foundation.org
      CC: eric.dumazet@gmail.com
      CC: stephen@networkplumber.org
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6071bd1a
  5. 02 5月, 2016 4 次提交
    • J
      gre: do not pull header in ICMP error processing · b7f8fe25
      Jiri Benc 提交于
      iptunnel_pull_header expects that IP header was already pulled; with this
      expectation, it pulls the tunnel header. This is not true in gre_err.
      Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
      to the IP header.
      
      We cannot pull the tunnel header in this path. It's just a matter of not
      calling iptunnel_pull_header - we don't need any of its effects.
      
      Fixes: bda7bb46 ("gre: Allow multiple protocol listener for gre protocol.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7f8fe25
    • H
      tipc: only process unicast on intended node · efe79050
      Hamish Martin 提交于
      We have observed complete lock up of broadcast-link transmission due to
      unacknowledged packets never being removed from the 'transmq' queue. This
      is traced to nodes having their ack field set beyond the sequence number
      of packets that have actually been transmitted to them.
      Consider an example where node 1 has sent 10 packets to node 2 on a
      link and node 3 has sent 20 packets to node 2 on another link. We
      see examples of an ack from node 2 destined for node 3 being treated as
      an ack from node 2 at node 1. This leads to the ack on the node 1 to node
      2 link being increased to 20 even though we have only sent 10 packets.
      When node 1 does get around to sending further packets, none of the
      packets with sequence numbers less than 21 are actually removed from the
      transmq.
      To resolve this we reinstate some code lost in commit d999297c ("tipc:
      reduce locking scope during packet reception") which ensures that only
      messages destined for the receiving node are processed by that node. This
      prevents the sequence numbers from getting out of sync and resolves the
      packet leakage, thereby resolving the broadcast-link transmission
      lock-ups we observed.
      
      While we are aware that this change only patches over a root problem that
      we still haven't identified, this is a sanity test that it is always
      legitimate to do. It will remain in the code even after we identify and
      fix the real problem.
      Reviewed-by: NChris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: NJohn Thompson <john.thompson@alliedtelesis.co.nz>
      Signed-off-by: NHamish Martin <hamish.martin@alliedtelesis.co.nz>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe79050
    • C
      soreuseport: Fix TCP listener hash collision · 90e5d0db
      Craig Gallek 提交于
      I forgot to include a check for listener port equality when deciding
      if two sockets should belong to the same reuseport group.  This was
      not caught previously because it's only necessary when two listening
      sockets for the same user happen to hash to the same listener bucket.
      The same error does not exist in the UDP path.
      
      Fixes: c125e80b("soreuseport: fast reuseport TCP socket selection")
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90e5d0db
    • W
      net: l2tp: fix reversed udp6 checksum flags · 018f8258
      Wang Shanker 提交于
      This patch fixes a bug which causes the behavior of whether to ignore
      udp6 checksum of udp6 encapsulated l2tp tunnel contrary to what
      userspace program requests.
      
      When the flag `L2TP_ATTR_UDP_ZERO_CSUM6_RX` is set by userspace, it is
      expected that udp6 checksums of received packets of the l2tp tunnel
      to create should be ignored. In `l2tp_netlink.c`:
      `l2tp_nl_cmd_tunnel_create()`, `cfg.udp6_zero_rx_checksums` is set
      according to the flag, and then passed to `l2tp_core.c`:
      `l2tp_tunnel_create()` and then `l2tp_tunnel_sock_create()`. In
      `l2tp_tunnel_sock_create()`, `udp_conf.use_udp6_rx_checksums` is set
      the same to `cfg.udp6_zero_rx_checksums`. However, if we want the
      checksum to be ignored, `udp_conf.use_udp6_rx_checksums` should be set
      to `false`, i.e. be set to the contrary. Similarly, the same should be
      done to `udp_conf.use_udp6_tx_checksums`.
      Signed-off-by: NMiao Wang <shankerwangmiao@gmail.com>
      Acked-by: NJames Chapman <jchapman@katalix.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      018f8258
  6. 30 4月, 2016 1 次提交
  7. 29 4月, 2016 7 次提交
    • S
      batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node · abe59c65
      Sven Eckelmann 提交于
      The batadv_neigh_node was specific to a batadv_hardif_neigh_node and held
      an implicit reference to it. But this reference was never stored in form of
      a pointer in the batadv_neigh_node itself. Instead
      batadv_neigh_node_release depends on a consistent state of
      hard_iface->neigh_list and that batadv_hardif_neigh_get always returns the
      batadv_hardif_neigh_node object which it has a reference for. But
      batadv_hardif_neigh_get cannot guarantee that because it is working only
      with rcu_read_lock on this list. It can therefore happen that a neigh_addr
      is in this list twice or that batadv_hardif_neigh_get cannot find the
      batadv_hardif_neigh_node for an neigh_addr due to some other list
      operations taking place at the same time.
      
      Instead add a batadv_hardif_neigh_node pointer directly in
      batadv_neigh_node which will be used for the reference counter decremented
      on release of batadv_neigh_node.
      
      Fixes: cef63419 ("batman-adv: add list of unique single hop neighbors per hard-interface")
      Signed-off-by: NSven Eckelmann <sven@narfation.org>
      Signed-off-by: NMarek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: NAntonio Quartulli <a@unstable.cc>
      abe59c65
    • S
      batman-adv: Fix reference counting of vlan object for tt_local_entry · a33d970d
      Sven Eckelmann 提交于
      The batadv_tt_local_entry was specific to a batadv_softif_vlan and held an
      implicit reference to it. But this reference was never stored in form of a
      pointer in the tt_local_entry itself. Instead batadv_tt_local_remove,
      batadv_tt_local_table_free and batadv_tt_local_purge_pending_clients depend
      on a consistent state of bat_priv->softif_vlan_list and that
      batadv_softif_vlan_get always returns the batadv_softif_vlan object which
      it has a reference for. But batadv_softif_vlan_get cannot guarantee that
      because it is working only with rcu_read_lock on this list. It can
      therefore happen that an vid is in this list twice or that
      batadv_softif_vlan_get cannot find the batadv_softif_vlan for an vid due to
      some other list operations taking place at the same time.
      
      Instead add a batadv_softif_vlan pointer directly in batadv_tt_local_entry
      which will be used for the reference counter decremented on release of
      batadv_tt_local_entry.
      
      Fixes: 35df3b29 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
      Signed-off-by: NSven Eckelmann <sven@narfation.org>
      Acked-by: NAntonio Quartulli <a@unstable.cc>
      Signed-off-by: NMarek Lindner <mareklindner@neomailbox.ch>
      Signed-off-by: NAntonio Quartulli <a@unstable.cc>
      a33d970d
    • A
      batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event · b6cf5d49
      Antonio Quartulli 提交于
      At the moment there is no explicit reactivation of an hard-interface
      upon NETDEV_UP event. In case of B.A.T.M.A.N. IV the interface is
      reactivated as soon as the next OGM is scheduled for sending, but this
      mechanism does not work with B.A.T.M.A.N. V. The latter does not rely
      on the same scheduling mechanism as its predecessor and for this reason
      the hard-interface remains deactivated forever after being brought down
      once.
      
      This patch fixes the reactivation mechanism by adding a new routing API
      which explicitly allows each algorithm to perform any needed operation
      upon interface re-activation.
      
      Such API is optional and is implemented by B.A.T.M.A.N. V only and it
      just takes care of setting the iface status to ACTIVE
      Signed-off-by: NAntonio Quartulli <a@unstable.cc>
      Signed-off-by: NMarek Lindner <mareklindner@neomailbox.ch>
      b6cf5d49
    • A
      batman-adv: fix DAT candidate selection (must use vid) · 2871734e
      Antonio Quartulli 提交于
      Now that DAT is VLAN aware, it must use the VID when
      computing the DHT address of the candidate nodes where
      an entry is going to be stored/retrieved.
      
      Fixes: be1db4f6 ("batman-adv: make the Distributed ARP Table vlan aware")
      Signed-off-by: NAntonio Quartulli <a@unstable.cc>
      [sven@narfation.org: fix conflicts with current version]
      Signed-off-by: NSven Eckelmann <sven@narfation.org>
      Signed-off-by: NMarek Lindner <mareklindner@neomailbox.ch>
      2871734e
    • J
      gre: reject GUE and FOU in collect metadata mode · 946b636f
      Jiri Benc 提交于
      The collect metadata mode does not support GUE nor FOU. This might be
      implemented later; until then, we should reject such config.
      
      I think this is okay to be changed. It's unlikely anyone has such
      configuration (as it doesn't work anyway) and we may need a way to
      distinguish whether it's supported or not by the kernel later.
      
      For backwards compatibility with iproute2, it's not possible to just check
      the attribute presence (iproute2 always includes the attribute), the actual
      value has to be checked, too.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946b636f
    • J
      gre: build header correctly for collect metadata tunnels · 2090714e
      Jiri Benc 提交于
      In ipgre (i.e. not gretap) + collect metadata mode, the skb was assumed to
      contain Ethernet header and was encapsulated as ETH_P_TEB. This is not the
      case, the interface is ARPHRD_IPGRE and the protocol to be used for
      encapsulation is skb->protocol.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2090714e
    • J
      gre: do not assign header_ops in collect metadata mode · a64b04d8
      Jiri Benc 提交于
      In ipgre mode (i.e. not gretap) with collect metadata flag set, the tunnel
      is incorrectly assumed to be mGRE in NBMA mode (see commit 6a5f44d7).
      This is not the case, we're controlling the encapsulation addresses by
      lwtunnel metadata. And anyway, assigning dev->header_ops in collect metadata
      mode does not make sense.
      
      Although it would be more user firendly to reject requests that specify
      both the collect metadata flag and a remote/local IP address, this would
      break current users of gretap or introduce ugly code and differences in
      handling ipgre and gretap configuration. Keep the current behavior of
      remote/local IP address being ignored in such case.
      
      v3: Back to v1, added explanation paragraph.
      v2: Reject configuration specifying both remote/local address and collect
          metadata flag.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Signed-off-by: NJiri Benc <jbenc@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a64b04d8
  8. 27 4月, 2016 1 次提交
  9. 26 4月, 2016 4 次提交
    • D
      net: ipv6: Delete host routes on an ifdown · 38bd10c4
      David Ahern 提交于
      It was a simple idea -- save IPv6 configured addresses on a link down
      so that IPv6 behaves similar to IPv4. As always the devil is in the
      details and the IPv6 stack as too many behavioral differences from IPv4
      making the simple idea more complicated than it needs to be.
      
      The current implementation for keeping IPv6 addresses can panic or spit
      out a warning in one of many paths:
      
      1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
         rt6_fill_node while handling a route dump request.
      
      2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
         fib6_del
      
      3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
      
      The root cause of all these is references related to the host route for
      an address that is retained.
      
      So, this patch deletes the host route every time the ifdown loop runs.
      Since the host route is deleted and will be re-generated an up there is
      no longer a need for the l3mdev fix up. On the 'admin up' side move
      addrconf_permanent_addr into the NETDEV_UP event handling so that it
      runs only once versus on UP and CHANGE events.
      
      All of the current panics and warnings appear to be related to
      addresses on the loopback device, but given the catastrophic nature when
      a bug is triggered this patch takes the conservative approach and evicts
      all host routes rather than trying to determine when it can be re-used
      and when it can not. That can be a later optimizaton if desired.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38bd10c4
    • D
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller 提交于
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a923934
    • D
      ipv6: Revert optional address flusing on ifdown. · 841645b5
      David S. Miller 提交于
      This reverts the following three commits:
      
      70af921d
      799977d9
      f1705ec1
      
      The feature was ill conceived, has terrible semantics, and has added
      nothing but regressions to the already fragile ipv6 stack.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      841645b5
    • I
      libceph: make authorizer destruction independent of ceph_auth_client · 6c1ea260
      Ilya Dryomov 提交于
      Starting the kernel client with cephx disabled and then enabling cephx
      and restarting userspace daemons can result in a crash:
      
          [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
          [262671.531460] IP: [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262671.584334] PGD 0
          [262671.635847] Oops: 0000 [#1] SMP
          [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
          [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
          [262672.268937] Workqueue: ceph-msgr con_work [libceph]
          [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
          [262672.428330] RIP: 0010:[<ffffffff811cd04a>]  [<ffffffff811cd04a>] kfree+0x5a/0x130
          [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
          [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
          [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
          [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
          [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
          [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
          [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
          [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
          [262673.417556] Stack:
          [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
          [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
          [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
          [262673.805230] Call Trace:
          [262673.859116]  [<ffffffffc035f70f>] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
          [262673.968705]  [<ffffffffc035c89e>] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
          [262674.078852]  [<ffffffffc0352805>] put_osd+0x45/0x80 [libceph]
          [262674.134249]  [<ffffffffc035290e>] remove_osd+0xae/0x140 [libceph]
          [262674.189124]  [<ffffffffc0352aa3>] __reset_osd+0x103/0x150 [libceph]
          [262674.243749]  [<ffffffffc0354703>] kick_requests+0x223/0x460 [libceph]
          [262674.297485]  [<ffffffffc03559e2>] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
          [262674.350813]  [<ffffffffc035022e>] dispatch+0x4e/0x720 [libceph]
          [262674.403312]  [<ffffffffc034bd91>] try_read+0x3d1/0x1090 [libceph]
          [262674.454712]  [<ffffffff810ab7c2>] ? dequeue_entity+0x152/0x690
          [262674.505096]  [<ffffffffc034cb1b>] con_work+0xcb/0x1300 [libceph]
          [262674.555104]  [<ffffffff8108fb3e>] process_one_work+0x14e/0x3d0
          [262674.604072]  [<ffffffff810901ea>] worker_thread+0x11a/0x470
          [262674.652187]  [<ffffffff810900d0>] ? rescuer_thread+0x310/0x310
          [262674.699022]  [<ffffffff810957a2>] kthread+0xd2/0xf0
          [262674.744494]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
          [262674.789543]  [<ffffffff817bd81f>] ret_from_fork+0x3f/0x70
          [262674.834094]  [<ffffffff810956d0>] ? kthread_create_on_node+0x1c0/0x1c0
      
      What happens is the following:
      
          (1) new MON session is established
          (2) old "none" ac is destroyed
          (3) new "cephx" ac is constructed
          ...
          (4) old OSD session (w/ "none" authorizer) is put
                ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)
      
      osd->o_auth.authorizer in the "none" case is just a bare pointer into
      ac, which contains a single static copy for all services.  By the time
      we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
      a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
      so we end up trying to destroy a "none" authorizer with a "cephx"
      destructor operating on invalid memory!
      
      To fix this, decouple authorizer destruction from ac and do away with
      a single static "none" authorizer by making a copy for each OSD or MDS
      session.  Authorizers themselves are independent of ac and so there is
      no reason for destroy_authorizer() to be an ac op.  Make it an op on
      the authorizer itself by turning ceph_authorizer into a real struct.
      
      Fixes: http://tracker.ceph.com/issues/15447Reported-by: NAlan Zhang <alan.zhang@linux.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      6c1ea260
  10. 25 4月, 2016 4 次提交
  11. 24 4月, 2016 5 次提交
  12. 22 4月, 2016 2 次提交
    • S
      openvswitch: use flow protocol when recalculating ipv6 checksums · b4f70527
      Simon Horman 提交于
      When using masked actions the ipv6_proto field of an action
      to set IPv6 fields may be zero rather than the prevailing protocol
      which will result in skipping checksum recalculation.
      
      This patch resolves the problem by relying on the protocol
      in the flow key rather than that in the set field action.
      
      Fixes: 83d2b9ba ("net: openvswitch: Support masked set actions.")
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Signed-off-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4f70527
    • M
      tcp: Merge tx_flags and tskey in tcp_shifted_skb · cfea5a68
      Martin KaFai Lau 提交于
      After receiving sacks, tcp_shifted_skb() will collapse
      skbs if possible.  tx_flags and tskey also have to be
      merged.
      
      This patch reuses the tcp_skb_collapse_tstamp() to handle
      them.
      
      BPF Output Before:
      ~~~~~
      <no-output-due-to-missing-tstamp-event>
      
      BPF Output After:
      ~~~~~
      <...>-2024  [007] d.s.    88.644374: : ee_data:14599
      
      Packetdrill Script:
      ~~~~~
      +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10`
      +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1`
      +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0 bind(3, ..., ...) = 0
      +0 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
      0.200 < . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
      
      0.200 write(4, ..., 1460) = 1460
      +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0
      0.200 write(4, ..., 13140) = 13140
      
      0.200 > P. 1:1461(1460) ack 1
      0.200 > . 1461:8761(7300) ack 1
      0.200 > P. 8761:14601(5840) ack 1
      
      0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop>
      0.300 > P. 1:1461(1460) ack 1
      0.400 < . 1:1(0) ack 14601 win 257
      
      0.400 close(4) = 0
      0.400 > F. 14601:14601(0) ack 1
      0.500 < F. 1:1(0) ack 14602 win 257
      0.500 > . 14602:14602(0) ack 2
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Tested-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfea5a68