1. 24 Aug 2016, 1 commit
  2. 19 Aug 2016, 6 commits
  3. 15 Aug 2016, 1 commit
    • netns: do not call pernet ops for not yet set up init_net namespace · f8c46cb3
      Committed by Dmitry Torokhov
      When CONFIG_NET_NS is disabled, registering pernet operations causes
      init() to be called immediately with init_net as an argument. Unfortunately
      this leads to some pernet ops, such as proc_net_ns_init(), being called too
      early, when the init_net namespace has not been fully initialized. This causes
      issues when we want to change pernet ops to use more data from the net
      namespace in question, for example to reference the user namespace that owns
      our network namespace.
      
      To fix this we could either play a game of musical chairs and rearrange the
      init order, or we could do the same as when CONFIG_NET_NS is enabled and
      postpone calling pernet ops->init() until the namespace is set up properly.
      
      Note that we cannot simply undo commit ed160e83 ("[NET]: Cleanup
      pernet operation without CONFIG_NET_NS") and use the same implementations
      for __register_pernet_operations() and __unregister_pernet_operations(),
      because many pernet ops are marked as __net_initdata and will be discarded,
      which wreaks havoc on our ops lists. Here we rely on the fact that we only
      use the lists until init_net is fully initialized, which happens much earlier
      than the discarding of __net_initdata sections.
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
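      A minimal sketch of a pernet registration, to illustrate what the fix changes (the subsystem name and bodies here are hypothetical, not from the patch): with CONFIG_NET_NS=n, .init() used to run right at register_pernet_subsys() time; after this change it is deferred until init_net is fully set up, just as in the CONFIG_NET_NS=y case.

      #include <linux/module.h>
      #include <net/net_namespace.h>

      /* Hypothetical pernet subsystem used only for illustration. */
      static int __net_init example_net_init(struct net *net)
      {
              /* After the fix it is safe to rely on a fully set up init_net
               * here, e.g. on the user namespace that owns this namespace. */
              return 0;
      }

      static void __net_exit example_net_exit(struct net *net)
      {
      }

      static struct pernet_operations example_net_ops = {
              .init = example_net_init,
              .exit = example_net_exit,
      };

      static int __init example_module_init(void)
      {
              /* With CONFIG_NET_NS=n this used to call example_net_init()
               * immediately; now the call is postponed until init_net is
               * fully initialized. */
              return register_pernet_subsys(&example_net_ops);
      }
      module_init(example_module_init);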
  4. 14 Aug 2016, 2 commits
    • net: remove type_check from dev_get_nest_level() · 952fcfd0
      Committed by Sabrina Dubroca
      The idea behind type_check in dev_get_nest_level() was to count the number
      of nested devices of the same type (currently only macvlan or vlan
      devices).
      This prevented false-positive lockdep warnings on configurations such
      as:
      
      eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
      
      However, this doesn't prevent a warning on a configuration such as:
      
      eth0 <--- macvlan0 <--- vlan0
      eth1 <--- vlan1 <--- macvlan1
      
      In this case, all the locks end up with a nesting subclass of 1, so
      lockdep still thinks there is a possible deadlock:
      
      - in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
        take (vlan_netdev_xmit_lock_key, 1)
      - in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
        take (macvlan_netdev_addr_lock_key, 1)
      
      By removing the type check in dev_get_nest_level() and always
      incrementing the nesting depth, we make lockdep consider this
      configuration valid.
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
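      A hedged sketch of how a stacked-device driver would consume the new behaviour (the function and key names are illustrative, not the exact macvlan/vlan code): the nesting depth returned by dev_get_nest_level() becomes the lockdep subclass, so each level in a mixed macvlan/vlan stack gets its own subclass and the two configurations above no longer collide on subclass 1.

      #include <linux/netdevice.h>

      /* Illustrative lock class key; the real drivers keep their own. */
      static struct lock_class_key example_addr_list_lock_key;

      static void example_set_lockdep_class(struct net_device *dev)
      {
              /* With the type check gone, every lower device counts towards
               * the nesting depth, regardless of whether it is a vlan or a
               * macvlan device. */
              lockdep_set_class_and_subclass(&dev->addr_list_lock,
                                             &example_addr_list_lock_key,
                                             dev_get_nest_level(dev));
      }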
    • bpf: fix write helpers with regards to non-linear parts · 0ed661d5
      Committed by Daniel Borkmann
      Fix the bpf_try_make_writable() helper and all call sites we have in BPF;
      it is currently defective with regard to skbs when the write_len spans into
      non-linear parts, whether cloned or not.
      
      There are multiple issues at once. First, using skb_store_bits() is not
      correct, since even if we have a cloned skb, page frags can still be shared.
      To really make them private, we need to pull them in via __pskb_pull_tail()
      first, which also gets us a private head via pskb_expand_head() implicitly.
      
      This applies to helpers like bpf_skb_store_bytes(), bpf_l3_csum_replace() and
      bpf_l4_csum_replace(). Really, the only reasonable and working approach here
      is to call skb_ensure_writable() before any write operation. Via
      pskb_may_pull() it makes sure that the parts we want to access are pulled in,
      and if not, it does so and unclones the skb implicitly. If our write_len still
      fits the headlen, and we are cloned and the header of the clone is not
      writable, then we need to make a private copy via pskb_expand_head().
      skb_store_bits() is a bit misleading and is only safe for storing into
      non-linear data in different contexts, such as in 357b40a1 ("[IPV6]:
      IPV6_CHECKSUM socket option can corrupt kernel memory").
      
      For the above BPF helper functions, this means that after the fixed
      bpf_try_make_writable() we have pulled in enough, so that we always operate
      on skb->data. Thus, the calls to skb_header_pointer() and skb_store_bits()
      become superfluous. In bpf_skb_store_bytes(), the len check is unnecessary
      too, since it can only pass in at most the BPF stack size, so adding the
      offset is guaranteed to never overflow. Also, the bpf_l3/4_csum_replace()
      helpers must test for proper offset alignment, since they use a __sum16
      pointer for writing the resulting csum.
      
      The remaining helpers that change skb data and are not discussed here yet
      are bpf_skb_vlan_push(), bpf_skb_vlan_pop() and bpf_skb_change_proto(). The
      vlan helpers internally call skb_ensure_writable() (pop case) or
      skb_cow_head() (push case, for head expansion), respectively. Similarly,
      bpf_skb_proto_xlat() takes care not to mangle page frags.
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Fixes: 91bc4822 ("tc: bpf: add checksum helpers")
      Fixes: 3697649f ("bpf: try harder on clones when writing into skb")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
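      A minimal sketch of the fixed write path described above (simplified; the in-tree helper differs in detail and the store function below is illustrative): ensure the region is linear and private before writing, after which plain skb->data access is safe and the skb_header_pointer()/skb_store_bits() round-trip is unnecessary.

      #include <linux/errno.h>
      #include <linux/skbuff.h>
      #include <linux/string.h>

      static int example_try_make_writable(struct sk_buff *skb,
                                           unsigned int write_len)
      {
              /* Pulls write_len bytes into the linear area if needed and
               * unclones / reallocates the head so it is private to us. */
              return skb_ensure_writable(skb, write_len);
      }

      static int example_store_bytes(struct sk_buff *skb, unsigned int offset,
                                     const void *from, unsigned int len)
      {
              if (example_try_make_writable(skb, offset + len))
                      return -EFAULT;
              /* Linear and private now, so a direct store is fine. */
              memcpy(skb->data + offset, from, len);
              return 0;
      }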
  5. 13 Aug 2016, 1 commit
  6. 11 Aug 2016, 2 commits
  7. 09 Aug 2016, 4 commits
    • neigh: allow admin to set NUD_STALE · 0e7bbcc1
      Committed by Julian Anastasov
      The admin should be able to set any state. Currently, this fails
      when lladdr is not changed and the state is changed from
      NUD_CONNECTED to NUD_STALE:
      
      ip neigh add 192.168.8.1 lladdr 00:11:22:33:44:55 nud perm dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      ip neigh change 192.168.8.1 lladdr 00:11:22:33:44:55 nud stale dev wlan0
      ip neigh show to 192.168.8.1
      192.168.8.1 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
      
      The problem likely dates back to the 2.1.X days.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
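      A hedged sketch of the decision this patch adjusts (simplified, not the verbatim neigh_update() code; the 27 Jul 2016 entry further down already dropped the NUD_CONNECTED part of this check): with an unchanged lladdr, a request for NUD_STALE keeps the old state only for non-administrative updates, so "ip neigh change ... nud stale" now takes effect.

      #include <net/neighbour.h>

      /* Illustrative helper, not the kernel function itself. */
      static u8 example_resolve_state(u8 old, u8 new, bool lladdr_unchanged,
                                      u32 flags)
      {
              if (lladdr_unchanged && new == NUD_STALE &&
                  !(flags & NEIGH_UPDATE_F_ADMIN))
                      return old;     /* non-admin callers keep the old state */
              return new;             /* admin requests now go through */
      }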
    • bpf: fix checksum for vlan push/pop helper · 8065694e
      Committed by Daniel Borkmann
      For skbs on ingress with CHECKSUM_COMPLETE, tc BPF programs do not push the
      rcsum of the mac header back in before the BPF run and pull it back out
      afterwards, as some other subsystems (ovs, for example) do.
      
      For cases like q-in-q, meaning when a vlan tag for offloading is already
      present and we're about to push another one, skb_vlan_push() pushes the
      inner one into the skb, increasing the mac header and skb_postpush_rcsum()'ing
      the 4-byte vlan header diff. Likewise, for the reverse operation in
      skb_vlan_pop(), for the case where the vlan header needs to be pulled out of
      the skb, we're decreasing the mac header and skb_postpull_rcsum()'ing the
      4-byte rcsum of the vlan header that was removed.
      
      However, mangling the rcsum here leads to a hw csum failure in the BPF case,
      since we're pulling or pushing data that was not part of the current rcsum.
      Changing tc BPF programs in general to push/pull the rcsum around
      BPF_PROG_RUN() is not really an option either, since the current behaviour is
      ABI by now; apart from that, it would also mean doing quite a bit of useless
      work, in the sense that usually 12 bytes would need to be rcsum pushed/pulled
      even when we don't touch this vlan-related corner case. One way to fix it is
      to push the necessary rcsum fixup down into the vlan helpers, which are
      (mostly) slow path anyway.
      
      Fixes: 4e10df9a ("bpf: introduce bpf_skb_vlan_push/pop() helpers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
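      A hedged sketch of the approach described above (the exact in-tree placement and offsets may differ): the vlan helpers themselves account for the 4 tag bytes they add or remove, so CHECKSUM_COMPLETE stays valid for callers like tc BPF that do not pull/push the mac header csum around the operation.

      #include <linux/if_ether.h>
      #include <linux/if_vlan.h>
      #include <linux/skbuff.h>

      /* Illustration only: after the mac header has been moved to make room
       * for a new tag, feed the freshly written VLAN_HLEN bytes (which sit
       * right behind the destination/source MAC addresses) back into
       * skb->csum. The pop path does the reverse with skb_postpull_rcsum(). */
      static void example_rcsum_after_vlan_push(struct sk_buff *skb)
      {
              skb_postpush_rcsum(skb, skb_mac_header(skb) + 2 * ETH_ALEN,
                                 VLAN_HLEN);
      }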
    • bpf: fix checksum fixups on bpf_skb_store_bytes · 479ffccc
      Committed by Daniel Borkmann
      bpf_skb_store_bytes() invocations above the L2 header need the
      BPF_F_RECOMPUTE_CSUM flag for updates, so that CHECKSUM_COMPLETE will be
      fixed up along the way. Where we ran into an issue with
      bpf_skb_store_bytes() is a single-byte update of the IPv6 hoplimit despite
      using the BPF_F_RECOMPUTE_CSUM flag; a simple ping via ICMPv6 triggered a
      hw csum failure as a result. The underlying problem has been tracked down
      to a buffer alignment issue.
      
      Meaning, the csum_partial() computations invoked via the skb_postpull_rcsum()
      and skb_postpush_rcsum() pair produced a wrong result, since they operated on
      an odd address for the hoplimit while other computations were done on an
      even address. This mix doesn't work as-is with the skb_postpull_rcsum() /
      skb_postpush_rcsum() pair, as they always expect at least half-word alignment
      of input buffers, which is normally the case. Thus, instead of these helpers
      using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub()
      and csum_block_add(), respectively. For unaligned offsets, they rotate the sum
      to align it to a half-word boundary again; otherwise they work the same as
      csum_sub() and csum_add().
      
      Adding __skb_postpull_rcsum() and __skb_postpush_rcsum() variants that take
      the offset as an input, and adapting bpf_skb_store_bytes() to use them, fixes
      the hw csum failures again. The skb_postpull_rcsum() and skb_postpush_rcsum()
      helpers use a constant 0 offset so that the compiler optimizes the offset & 1
      test away and generates the same code as with csum_sub()/csum_add().
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
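      A hedged sketch of the offset-aware variant this commit describes (simplified; the in-tree __skb_postpush_rcsum() may differ in detail): csum_block_add() rotates the partial sum when the offset is odd, so a store at an odd offset such as the IPv6 hoplimit no longer corrupts CHECKSUM_COMPLETE.

      #include <linux/skbuff.h>
      #include <net/checksum.h>

      static inline void example_postpush_rcsum(struct sk_buff *skb,
                                                const void *start,
                                                unsigned int len,
                                                unsigned int off)
      {
              /* csum_block_add() folds the partial sum of [start, start+len)
               * into skb->csum, rotating it first when off is odd. */
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                      skb->csum = csum_block_add(skb->csum,
                                                 csum_partial(start, len, 0),
                                                 off);
      }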
    • bpf: also call skb_postpush_rcsum on xmit occasions · a2bfe6bf
      Committed by Daniel Borkmann
      Follow-up to commit f8ffad69 ("bpf: add skb_postpush_rcsum and fix
      dev_forward_skb occasions") to fix an issue for dev_queue_xmit() redirect
      locations which need CHECKSUM_COMPLETE fixups on ingress.
      
      For the same reasons as described in f8ffad69 already, we of course
      also need this here, since dev_queue_xmit() on a veth device will let us
      end up in the dev_forward_skb() helper again to cross namespaces.
      
      The latter then calls into skb_postpull_rcsum() to pull out the L2 header,
      so that netif_rx_internal() sees CHECKSUM_COMPLETE as expected. That is,
      CHECKSUM_COMPLETE on ingress covering the L2 _payload_, not the L2 headers.
      
      Here, too, we have to address bpf_redirect() and bpf_clone_redirect().
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
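      A hedged sketch of the pattern applied to the redirect/clone-redirect transmit path (simplified; the exact header length used in the patch may differ): push the mac header bytes back into skb->csum before dev_queue_xmit(), so that a later dev_forward_skb()/skb_postpull_rcsum() on the peer device leaves CHECKSUM_COMPLETE covering exactly the L2 payload.

      #include <linux/netdevice.h>
      #include <linux/skbuff.h>

      static int example_bpf_tx(struct sk_buff *skb, struct net_device *dev)
      {
              /* Account for the L2 header we are about to hand down, so the
               * pull on the receiving side balances out. */
              skb_postpush_rcsum(skb, skb_mac_header(skb), skb->mac_len);
              skb->dev = dev;
              return dev_queue_xmit(skb);
      }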
  8. 27 Jul 2016, 1 commit
    • net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update() · d1c2b501
      Committed by He Chunhui
      NUD_STALE is used when the caller (e.g. arp_process()) can't guarantee
      neighbour reachability. If the entry was NUD_VALID and lladdr is unchanged,
      the entry state should not be changed.
      
      Currently the code adds an extra "NUD_CONNECTED" condition, so if the old
      state was NUD_DELAY or NUD_PROBE (which are NUD_VALID but not NUD_CONNECTED),
      the state can be changed to NUD_STALE.
      
      This may cause a problem. Because an NUD_STALE lladdr doesn't guarantee
      reachability, when we send traffic the state is changed to NUD_DELAY. In
      the normal case, if we get no confirmation (via dst_confirm()), we change
      the state to NUD_PROBE and send probe traffic. But now the state may be
      reset to NUD_STALE again (e.g. by broadcast ARP packets), so the probe
      traffic will not be sent. This situation can happen again and again, and
      packets will be sent to a non-reachable lladdr forever.
      
      The fix is to remove the "NUD_CONNECTED" condition. After that, the
      "NEIGH_UPDATE_F_WEAK_OVERRIDE" condition (used by IPv6) in that branch
      becomes redundant, so remove it too.
      
      This change may increase probe traffic, but that is essential since an
      NUD_STALE lladdr is unreliable. To ensure correctness, we prefer to resolve
      the lladdr when we can't get confirmation, even while remote packets try to
      set the NUD_STALE state.
      Signed-off-by: Chunhui He <hchunhui@mail.ustc.edu.cn>
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 26 Jul 2016, 1 commit
    • bpf, events: fix offset in skb copy handler · aa7145c1
      Committed by Daniel Borkmann
      This patch fixes the __output_custom() routine we currently use with
      bpf_skb_copy(). I missed that when len is larger than the size of the
      current handle, we can issue multiple invocations of copy_func, and
      __output_custom() advances not only the destination but also the source
      buffer by the number of bytes written. For __output_custom() this is
      actually wrong, since in that case the source buffer points to a non-linear
      object, in our case an skb, which the copy_func helper is supposed to walk.
      Since it is non-linear, we need to pass the offset into the helper instead,
      so that copy_func can use it for extracting the data from the source object.
      
      Therefore, adjust the callback signatures accordingly and pass the offset
      into the skb_header_pointer() invoked from the bpf_skb_copy() callback.
      __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate two things: i) to
      pass in whether we should advance the source buffer or not, which is a
      compile-time constant condition, and ii) to pass in the offset for
      __output_custom(), which we do with the help of __VA_ARGS__, so everything
      can stay inlined as it is currently. Both changes allow adapting the
      __output_* fast-path helpers without extra overhead.
      
      Fixes: 555c8a86 ("bpf: avoid stack copy and use skb ctx for event output")
      Fixes: 7e3f977e ("perf, events: add non-linear data support for raw records")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
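      A hedged sketch of the copy callback after this fix (close to, though not necessarily verbatim, the in-tree bpf_skb_copy()): the offset is now an explicit argument, so repeated invocations from the perf output path can walk a non-linear skb instead of relying on a blindly advanced source pointer.

      #include <linux/skbuff.h>
      #include <linux/string.h>

      static unsigned long example_skb_copy(void *dst_buff, const void *skb,
                                            unsigned long off,
                                            unsigned long len)
      {
              void *ptr = skb_header_pointer(skb, off, len, dst_buff);

              if (unlikely(!ptr))
                      return len;     /* report the bytes we could not copy */
              if (ptr != dst_buff)    /* range was linear; copy it over */
                      memcpy(dst_buff, ptr, len);
              return 0;
      }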
  10. 21 Jul 2016, 1 commit
  11. 20 Jul 2016, 3 commits
  12. 16 Jul 2016, 1 commit
    • bpf: avoid stack copy and use skb ctx for event output · 555c8a86
      Committed by Daniel Borkmann
      This work addresses a couple of issues the bpf_skb_event_output()
      helper currently has: i) We need two copies instead of just a
      single one for the skb data when it should be part of a sample.
      The data can be non-linear and thus needs to be extracted via the
      bpf_skb_load_bytes() helper first, and then copied once again
      into the ring buffer slot. ii) Since bpf_skb_load_bytes()
      currently needs to be used first, the helper needs to see a
      constant size on the passed stack buffer to make sure the BPF
      verifier can do sanity checks on it at verification time.
      Thus, just passing skb->len (or any other non-constant value)
      wouldn't work, but changing bpf_skb_load_bytes() is also not
      the proper solution, since the two copies are generally still
      needed. iii) bpf_skb_load_bytes() is just for rather small
      buffers like headers, since they need to sit on the limited
      BPF stack anyway. Instead of working around this in
      bpf_skb_load_bytes(), this work improves the bpf_skb_event_output()
      helper to address all three at once.
      
      We can make use of the passed-in skb context that we have in
      the helper anyway, and use some of the reserved flag bits as
      a length argument. The helper will use the new __output_custom()
      facility from the perf side with bpf_skb_copy() as the callback
      helper to walk and extract the data. It will pass the data for setup
      to bpf_event_output(), which generates and pushes the raw record
      with an additional frag part. The linear data used in the first
      frag of the record serves as programmatically defined meta data
      passed along with the appended sample.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
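      A hedged usage sketch from the BPF program side, assuming the upper flag bits carry the number of skb context bytes to append (the map definition and "bpf_helpers.h" conventions below are the usual samples-style assumptions, not part of the commit): the helper appends skb->len packet bytes straight from the skb context, so no bpf_skb_load_bytes() staging copy on the BPF stack is needed.

      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"

      struct bpf_map_def SEC("maps") events = {
              .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
              .key_size    = sizeof(int),
              .value_size  = sizeof(__u32),
              .max_entries = 64,
      };

      struct event_meta {
              __u32 ifindex;
              __u32 pkt_len;
      };

      SEC("classifier")
      int emit_sample(struct __sk_buff *skb)
      {
              struct event_meta meta = {
                      .ifindex = skb->ifindex,
                      .pkt_len = skb->len,
              };
              /* Upper flag bits: how many bytes of skb data to append. */
              __u64 flags = BPF_F_CURRENT_CPU | ((__u64)skb->len << 32);

              bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
              return 0;
      }

      char _license[] SEC("license") = "GPL";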
  13. 14 Jul 2016, 2 commits
    • dccp: limit sk_filter trim to payload · 4f0c40d9
      Committed by Willem de Bruijn
      DCCP verifies packet integrity, including length, at initial receive in
      dccp_invalid_packet(), and later pulls headers in dccp_enqueue_skb().
      
      A call to sk_filter() in between can cause __skb_pull() to wrap skb->len.
      skb_copy_datagram_msg() interprets this as a negative value, so it
      (correctly) fails with EFAULT. The negative length is reported in
      ioctl SIOCINQ or possibly in a DCCP_WARN in dccp_close().
      
      Introduce an sk_receive_skb variant that caps how small a filter
      program can trim packets, and call this in dccp with the header
      length. Excessively trimmed packets are now processed normally and
      queued for reception as zero-byte payloads.
      
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
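      A hedged sketch of the idea on the DCCP side (the actual patch routes this through a new sk_receive_skb variant rather than calling the trim primitive directly; the function here is illustrative): whatever the socket filter returns, the skb is never trimmed below the DCCP header length that will later be pulled.

      #include <linux/dccp.h>
      #include <linux/filter.h>

      static int example_dccp_filter(struct sock *sk, struct sk_buff *skb)
      {
              const struct dccp_hdr *dh = dccp_hdr(skb);

              /* dccph_doff is in 32-bit words; cap the trim at the header. */
              return sk_filter_trim_cap(sk, skb, dh->dccph_doff * 4);
      }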
    • rose: limit sk_filter trim to payload · f4979fce
      Committed by Willem de Bruijn
      Sockets can have a filter program attached that drops or trims
      incoming packets based on the filter program's return value.
      
      Rose requires data packets to have at least ROSE_MIN_LEN bytes. It
      verifies this on arrival in rose_route_frame() and unconditionally pulls
      those bytes in rose_recvmsg(). The filter can trim packets to below this
      value in between, causing the pull to fail and leaving a partial header
      by the time of skb_copy_datagram_msg().
      
      Place a lower bound on the size to which sk_filter may trim packets
      by introducing sk_filter_trim_cap() and call it for rose packets.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
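      A hedged sketch of the core of the new primitive (heavily simplified; the real sk_filter_trim_cap() also handles the no-filter case, RCU and security hooks): the filter's return value is the permitted length, zero means drop, and any trim is clamped so at least cap bytes remain.

      #include <linux/errno.h>
      #include <linux/filter.h>
      #include <linux/kernel.h>
      #include <linux/skbuff.h>

      static int example_filter_trim_cap(const struct bpf_prog *prog,
                                         struct sk_buff *skb,
                                         unsigned int cap)
      {
              unsigned int pkt_len = bpf_prog_run_save_cb(prog, skb);

              if (!pkt_len)
                      return -EPERM;                  /* filter says drop */
              return pskb_trim(skb, max(cap, pkt_len));
      }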
  14. 13 Jul 2016, 1 commit
  15. 12 Jul 2016, 1 commit
  16. 10 Jul 2016, 1 commit
  17. 06 Jul 2016, 3 commits
  18. 05 Jul 2016, 3 commits
  19. 03 Jul 2016, 1 commit
  20. 02 Jul 2016, 4 commits