1. 16 5月, 2017 1 次提交
    • G
      ebtables: arpreply: Add the standard target sanity check · c953d635
      Gao Feng 提交于
      The info->target comes from userspace and it would be used directly.
      So we need to add the sanity check to make sure it is a valid standard
      target, although the ebtables tool has already checked it. Kernel needs
      to validate anything coming from userspace.
      
      If the target is set as an evil value, it would break the ebtables
      and cause a panic. Because the non-standard target is treated as one
      offset.
      
      Now add one helper function ebt_invalid_target, and we would replace
      the macro INVALID_TARGET later.
      Signed-off-by: NGao Feng <gfree.wind@vip.163.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      c953d635
  2. 15 5月, 2017 10 次提交
    • P
      netfilter: nf_tables: revisit chain/object refcounting from elements · 59105446
      Pablo Neira Ayuso 提交于
      Andreas reports that the following incremental update using our commit
      protocol doesn't work.
      
       # nft -f incremental-update.nft
       delete element ip filter client_to_any { 10.180.86.22 : goto CIn_1 }
       delete chain ip filter CIn_1
       ... Error: Could not process rule: Device or resource busy
      
      The existing code is not well-integrated into the commit phase protocol,
      since element deletions do not result in refcount decrement from the
      preparation phase. This results in bogus EBUSY errors like the one
      above.
      
      Two new functions come with this patch:
      
      * nft_set_elem_activate() function is used from the abort path, to
        restore the set element refcounting on objects that occurred from
        the preparation phase.
      
      * nft_set_elem_deactivate() that is called from nft_del_setelem() to
        decrement set element refcounting on objects from the preparation
        phase in the commit protocol.
      
      The nft_data_uninit() has been renamed to nft_data_release() since this
      function does not uninitialize any data store in the data register,
      instead just releases the references to objects. Moreover, a new
      function nft_data_hold() has been introduced to be used from
      nft_set_elem_activate().
      Reported-by: NAndreas Schultz <aschultz@tpip.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      59105446
    • P
      netfilter: nf_tables: missing sanitization in data from userspace · 71df14b0
      Pablo Neira Ayuso 提交于
      Do not assume userspace always sends us NFT_DATA_VALUE for bitwise and
      cmp expressions. Although NFT_DATA_VERDICT does not make any sense, it
      is still possible to handcraft a netlink message using this incorrect
      data type.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      71df14b0
    • L
      netfilter: nf_tables: can't assume lock is acquired when dumping set elems · fa803605
      Liping Zhang 提交于
      When dumping the elements related to a specified set, we may invoke the
      nf_tables_dump_set with the NFNL_SUBSYS_NFTABLES lock not acquired. So
      we should use the proper rcu operation to avoid race condition, just
      like other nft dump operations.
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      fa803605
    • E
      netfilter: synproxy: fix conntrackd interaction · 87e94dbc
      Eric Leblond 提交于
      This patch fixes the creation of connection tracking entry from
      netlink when synproxy is used. It was missing the addition of
      the synproxy extension.
      
      This was causing kernel crashes when a conntrack entry created by
      conntrackd was used after the switch of traffic from active node
      to the passive node.
      Signed-off-by: NEric Leblond <eric@regit.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      87e94dbc
    • W
      netfilter: xtables: zero padding in data_to_user · 324318f0
      Willem de Bruijn 提交于
      When looking up an iptables rule, the iptables binary compares the
      aligned match and target data (XT_ALIGN). In some cases this can
      exceed the actual data size to include padding bytes.
      
      Before commit f77bc5b2 ("iptables: use match, target and data
      copy_to_user helpers") the malloc()ed bytes were overwritten by the
      kernel with kzalloced contents, zeroing the padding and making the
      comparison succeed. After this patch, the kernel copies and clears
      only data, leaving the padding bytes undefined.
      
      Extend the clear operation from data size to aligned data size to
      include the padding bytes, if any.
      
      Padding bytes can be observed in both match and target, and the bug
      triggered, by issuing a rule with match icmp and target ACCEPT:
      
        iptables -t mangle -A INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
        iptables -t mangle -D INPUT -i lo -p icmp --icmp-type 1 -j ACCEPT
      
      Fixes: f77bc5b2 ("iptables: use match, target and data copy_to_user helpers")
      Reported-by: NPaul Moore <pmoore@redhat.com>
      Reported-by: NRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      324318f0
    • P
      Merge tag 'ipvs-fixes-for-v4.12' of http://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs · ff1e4300
      Pablo Neira Ayuso 提交于
      Simon Horman says:
      
      ====================
      IPVS Fixes for v4.12
      
      please consider this fix to IPVS for v4.12.
      
      * It is a fix from Julian Anastasov to only SNAT SNAT packet replies only for
        NATed connections
      
      My understanding is that this fix is appropriate for 4.9.25, 4.10.13, 4.11
      as well as the nf tree. Julian has separately posted backports for other
      -stable kernels; please see:
      
      * [PATCH 3.2.88,3.4.113 -stable 1/3] ipvs: SNAT packet replies only for
              NATed connections
      * [PATCH 3.10.105,3.12.73,3.16.43,4.1.39 -stable 2/3] ipvs: SNAT packet
              replies only for NATed connections
      * [PATCH 4.4.65 -stable 3/3] ipvs: SNAT packet replies only for NATed
              connections
      ====================
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      ff1e4300
    • L
      netfilter: nfnl_cthelper: reject del request if helper obj is in use · 9338d7b4
      Liping Zhang 提交于
      We can still delete the ct helper even if it is in use, this will cause
      a use-after-free error. In more detail, I mean:
        # nfct helper add ssdp inet udp
        # iptables -t raw -A OUTPUT -p udp -j CT --helper ssdp
        # nfct helper delete ssdp //--> oops, succeed!
        BUG: unable to handle kernel paging request at 000026ca
        IP: 0x26ca
        [...]
        Call Trace:
         ? ipv4_helper+0x62/0x80 [nf_conntrack_ipv4]
         nf_hook_slow+0x21/0xb0
         ip_output+0xe9/0x100
         ? ip_fragment.constprop.54+0xc0/0xc0
         ip_local_out+0x33/0x40
         ip_send_skb+0x16/0x80
         udp_send_skb+0x84/0x240
         udp_sendmsg+0x35d/0xa50
      
      So add reference count to fix this issue, if ct helper is used by
      others, reject the delete request.
      
      Apply this patch:
        # nfct helper delete ssdp
        nfct v1.4.3: netlink error: Device or resource busy
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      9338d7b4
    • L
      netfilter: introduce nf_conntrack_helper_put helper function · d91fc59c
      Liping Zhang 提交于
      And convert module_put invocation to nf_conntrack_helper_put, this is
      prepared for the followup patch, which will add a refcnt for cthelper,
      so we can reject the deleting request when cthelper is in use.
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d91fc59c
    • L
      netfilter: don't setup nat info for confirmed ct · d110a394
      Liping Zhang 提交于
      We cannot setup nat info if the ct has been confirmed already, else,
      different cpu may race to handle the same ct. In extreme situation,
      we may hit the "BUG_ON(nf_nat_initialized(ct, maniptype))" in the
      nf_nat_setup_info.
      
      Also running the following commands will easily hit NF_CT_ASSERT in
      nf_conntrack_alter_reply:
        # nft flush ruleset
        # ping -c 2 -W 1 1.1.1.111 &
        # nft add table t
        # nft add chain t c {type nat hook postrouting priority 0 \;}
        # nft add rule t c snat to 4.5.6.7
        WARNING: CPU: 1 PID: 10065 at net/netfilter/nf_conntrack_core.c:1472
        nf_conntrack_alter_reply+0x9a/0x1a0 [nf_conntrack]
        [...]
        Call Trace:
         nf_nat_setup_info+0xad/0x840 [nf_nat]
         ? deactivate_slab+0x65d/0x6c0
         nft_nat_eval+0xcd/0x100 [nft_nat]
         nft_do_chain+0xff/0x5d0 [nf_tables]
         ? mark_held_locks+0x6f/0xa0
         ? __local_bh_enable_ip+0x70/0xa0
         ? trace_hardirqs_on_caller+0x11f/0x190
         ? ipt_do_table+0x310/0x610
         ? trace_hardirqs_on+0xd/0x10
         ? __local_bh_enable_ip+0x70/0xa0
         ? ipt_do_table+0x32b/0x610
         ? __lock_acquire+0x2ac/0x1580
         ? ipt_do_table+0x32b/0x610
         nft_nat_do_chain+0x65/0x80 [nft_chain_nat_ipv4]
         nf_nat_ipv4_fn+0x1ae/0x240 [nf_nat_ipv4]
         nf_nat_ipv4_out+0x4a/0xf0 [nf_nat_ipv4]
         nft_nat_ipv4_out+0x15/0x20 [nft_chain_nat_ipv4]
         nf_hook_slow+0x2c/0xf0
         ip_output+0x154/0x270
      
      So for the confirmed ct, just ignore it and return NF_ACCEPT.
      
      Fixes: 9a08ecfe ("netfilter: don't attach a nat extension by default")
      Signed-off-by: NLiping Zhang <zlpnobody@gmail.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      d110a394
    • M
      netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch · a2b7cbdd
      Matthias Kaehlcke 提交于
      Not all parameters passed to ctnetlink_parse_tuple() and
      ctnetlink_exp_dump_tuple() match the enum type in the signatures of these
      functions. Since this is intended change the argument type of to be an
      unsigned integer value.
      Signed-off-by: NMatthias Kaehlcke <mka@chromium.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a2b7cbdd
  3. 13 5月, 2017 5 次提交
  4. 12 5月, 2017 24 次提交
    • X
      sctp: fix src address selection if using secondary addresses for ipv6 · dbc2b5e9
      Xin Long 提交于
      Commit 0ca50d12 ("sctp: fix src address selection if using secondary
      addresses") has fixed a src address selection issue when using secondary
      addresses for ipv4.
      
      Now sctp ipv6 also has the similar issue. When using a secondary address,
      sctp_v6_get_dst tries to choose the saddr which has the most same bits
      with the daddr by sctp_v6_addr_match_len. It may make some cases not work
      as expected.
      
      hostA:
        [1] fd21:356b:459a:cf10::11 (eth1)
        [2] fd21:356b:459a:cf20::11 (eth2)
      
      hostB:
        [a] fd21:356b:459a:cf30::2  (eth1)
        [b] fd21:356b:459a:cf40::2  (eth2)
      
      route from hostA to hostB:
        fd21:356b:459a:cf30::/64 dev eth1  metric 1024  mtu 1500
      
      The expected path should be:
        fd21:356b:459a:cf10::11 <-> fd21:356b:459a:cf30::2
      But addr[2] matches addr[a] more bits than addr[1] does, according to
      sctp_v6_addr_match_len. It causes the path to be:
        fd21:356b:459a:cf20::11 <-> fd21:356b:459a:cf30::2
      
      This patch is to fix it with the same way as Marcelo's fix for sctp ipv4.
      As no ip_dev_find for ipv6, this patch is to use ipv6_chk_addr to check
      if the saddr is in a dev instead.
      
      Note that for backwards compatibility, it will still do the addr_match_len
      check here when no optimal is found.
      Reported-by: NPatrick Talbert <ptalbert@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbc2b5e9
    • F
      net: phy: Call bus->reset() after releasing PHYs from reset · df0c8d91
      Florian Fainelli 提交于
      The API convention makes it that a given MDIO bus reset should be able
      to access PHY devices in its reset() callback and perform additional
      MDIO accesses in order to bring the bus and PHYs in a working state.
      
      Commit 69226896 ("mdio_bus: Issue GPIO RESET to PHYs.") broke that
      contract by first calling bus->reset() and then release all PHYs from
      reset using their shared GPIO line, so restore the expected
      functionality here.
      
      Fixes: 69226896 ("mdio_bus: Issue GPIO RESET to PHYs.")
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df0c8d91
    • D
      bpf: Handle multiple variable additions into packet pointers in verifier. · 6832a333
      David S. Miller 提交于
      We must accumulate into reg->aux_off rather than use a plain assignment.
      
      Add a test for this situation to test_align.
      Reported-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6832a333
    • J
      tipc: make macro tipc_wait_for_cond() smp safe · 844cf763
      Jon Paul Maloy 提交于
      The macro tipc_wait_for_cond() is embedding the macro sk_wait_event()
      to fulfil its task. The latter, in turn, is evaluating the stated
      condition outside the socket lock context. This is problematic if
      the condition is accessing non-trivial data structures which may be
      altered by incoming interrupts, as is the case with the cong_links()
      linked list, used by socket to keep track of the current set of
      congested links. We sometimes see crashes when this list is accessed
      by a condition function at the same time as a SOCK_WAKEUP interrupt
      is removing an element from the list.
      
      We fix this by expanding selected parts of sk_wait_event() into the
      outer macro, while ensuring that all evaluations of a given condition
      are performed under socket lock protection.
      
      Fixes: commit 365ad353 ("tipc: reduce risk of user starvation during link congestion")
      Reviewed-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      844cf763
    • A
      samples/bpf: run cleanup routines when receiving SIGTERM · ad990dbe
      Andy Gospodarek 提交于
      Shahid Habib noticed that when xdp1 was killed from a different console the xdp
      program was not cleaned-up properly in the kernel and it continued to forward
      traffic.
      
      Most of the applications in samples/bpf cleanup properly, but only when getting
      SIGINT.  Since kill defaults to using SIGTERM, add support to cleanup when the
      application receives either SIGINT or SIGTERM.
      Signed-off-by: NAndy Gospodarek <andy@greyhouse.net>
      Reported-by: NShahid Habib <shahid.habib@broadcom.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad990dbe
    • C
      ethernet: aquantia: remove redundant checks on error status · d2be3667
      Colin Ian King 提交于
      The error status err is initialized as zero and then being checked
      several times to see if it is less than zero even when it has not
      been updated.  It may seem that the err should be assigned to the
      return code of the call to the various *offload_en_set calls and
      then we check for failure, however, these functions are void and
      never actually return any status.
      
      Since these error checks are redundant we can remove these
      as well as err and the error exit label err_exit.
      
      Detected by CoverityScan, CID#1398313 and CID#1398306 ("Logically
      dead code")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Reviewed-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Acked-by: NPavel Belous <pavel.belous@aquantia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2be3667
    • D
    • D
      Merge branch 'qlcnic-fixes' · 34e934b9
      David S. Miller 提交于
      Manish Chopra says:
      
      ====================
      qlcnic: Bug fix and update version
      
      This series has one fix and bumps up driver version.
      Please consider applying to "net"
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34e934b9
    • C
      qlcnic: Update version to 5.3.66 · 33c16bfd
      Chopra, Manish 提交于
      Bumping up the version as couple of fixes added after 5.3.65
      Signed-off-by: NManish Chopra <manish.chopra@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33c16bfd
    • C
      qlcnic: Fix link configuration with autoneg disabled · f9c3fe2f
      Chopra, Manish 提交于
      Currently driver returns error on speed configurations
      for 83xx adapter's non XGBE ports, due to this link doesn't
      come up on the ports using 1000Base-T as a connector with
      autoneg disabled. This patch fixes this with initializing
      appropriate port type based on queried module/connector
      types from hardware before any speed/autoneg configuration.
      Signed-off-by: NManish Chopra <manish.chopra@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9c3fe2f
    • V
      xen-netfront: avoid crashing on resume after a failure in talk_to_netback() · d86b5672
      Vitaly Kuznetsov 提交于
      Unavoidable crashes in netfront_resume() and netback_changed() after a
      previous fail in talk_to_netback() (e.g. when we fail to read MAC from
      xenstore) were discovered. The failure path in talk_to_netback() does
      unregister/free for netdev but we don't reset drvdata and we try accessing
      it after resume.
      
      Fix the bug by removing the whole xen device completely with
      device_unregister(), this guarantees we won't have any calls into netfront
      after a failure.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d86b5672
    • E
      net: sched: optimize class dumps · cb395b20
      Eric Dumazet 提交于
      In commit 59cc1f61 ("net: sched: convert qdisc linked list to
      hashtable") we missed the opportunity to considerably speed up
      tc_dump_tclass_root() if a qdisc handle is provided by user.
      
      Instead of iterating all the qdiscs, use qdisc_match_from_root()
      to directly get the one we look for.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb395b20
    • Y
      tcp: avoid fragmenting peculiar skbs in SACK · b451e5d2
      Yuchung Cheng 提交于
      This patch fixes a bug in splitting an SKB during SACK
      processing. Specifically if an skb contains multiple
      packets and is only partially sacked in the higher sequences,
      tcp_match_sack_to_skb() splits the skb and marks the second fragment
      as SACKed.
      
      The current code further attempts rounding up the first fragment
      to MSS boundaries. But it misses a boundary condition when the
      rounded-up fragment size (pkt_len) is exactly skb size.  Spliting
      such an skb is pointless and causses a kernel warning and aborts
      the SACK processing. This patch universally checks such over-split
      before calling tcp_fragment to prevent these unnecessary warnings.
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b451e5d2
    • E
      netem: fix skb_orphan_partial() · f6ba8d33
      Eric Dumazet 提交于
      I should have known that lowering skb->truesize was dangerous :/
      
      In case packets are not leaving the host via a standard Ethernet device,
      but looped back to local sockets, bad things can happen, as reported
      by Michael Madsen ( https://bugzilla.kernel.org/show_bug.cgi?id=195713 )
      
      So instead of tweaking skb->truesize, lets change skb->destructor
      and keep a reference on the owner socket via its sk_refcnt.
      
      Fixes: f2f872f9 ("netem: Introduce skb_orphan_partial() helper")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichael Madsen <mkm@nabto.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6ba8d33
    • D
      Merge branch 'generic-xdp-followups' · 4e3c60ed
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      Two generic xdp related follow-ups
      
      Two follow-ups for the generic XDP API, would be great if
      both could still be considered, since the XDP API is not
      frozen yet. For details please see individual patches.
      
      v1 -> v2:
        - Implemented feedback from Jakub Kicinski (reusing
          attribute on dump), thanks!
        - Rest as is.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4e3c60ed
    • D
      xdp: refine xdp api with regards to generic xdp · d67b9cd2
      Daniel Borkmann 提交于
      While working on the iproute2 generic XDP frontend, I noticed that
      as of right now it's possible to have native *and* generic XDP
      programs loaded both at the same time for the case when a driver
      supports native XDP.
      
      The intended model for generic XDP from b5cdae32 ("net: Generic
      XDP") is, however, that only one out of the two can be present at
      once which is also indicated as such in the XDP netlink dump part.
      The main rationale for generic XDP is to ease accessibility (in
      case a driver does not yet have XDP support) and to generically
      provide a semantical model as an example for driver developers
      wanting to add XDP support. The generic XDP option for an XDP
      aware driver can still be useful for comparing and testing both
      implementations.
      
      However, it is not intended to have a second XDP processing stage
      or layer with exactly the same functionality of the first native
      stage. Only reason could be to have a partial fallback for future
      XDP features that are not supported yet in the native implementation
      and we probably also shouldn't strive for such fallback and instead
      encourage native feature support in the first place. Given there's
      currently no such fallback issue or use case, lets not go there yet
      if we don't need to.
      
      Therefore, change semantics for loading XDP and bail out if the
      user tries to load a generic XDP program when a native one is
      present and vice versa. Another alternative to bailing out would
      be to handle the transition from one flavor to another gracefully,
      but that would require to bring the device down, exchange both
      types of programs, and bring it up again in order to avoid a tiny
      window where a packet could hit both hooks. Given this complicates
      the logic for just a debugging feature in the native case, I went
      with the simpler variant.
      
      For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae32
      and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
      or just a subset of flags that were used for loading the XDP prog
      is suboptimal in the long run since not all flags are useful for
      dumping and if we start to reuse the same flag definitions for
      load and dump, then we'll waste bit space. What we really just
      want is to dump the mode for now.
      
      Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
      a program is running at the native driver layer (1). Thus, add a
      mode that says that a program is running at generic XDP layer (2).
      Applications will handle this fine in that older binaries will
      just indicate that something is attached at XDP layer, effectively
      this is similar to IFLA_XDP_FLAGS attr that we would have had
      modulo the redundancy.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d67b9cd2
    • D
      xdp: add flag to enforce driver mode · 0489df9a
      Daniel Borkmann 提交于
      After commit b5cdae32 ("net: Generic XDP") we automatically fall
      back to a generic XDP variant if the driver does not support native
      XDP. Allow for an option where the user can specify that always the
      native XDP variant should be selected and in case it's not supported
      by a driver, just bail out.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0489df9a
    • D
      bpf: Provide a linux/types.h override for bpf selftests. · 0a5539f6
      David S. Miller 提交于
      We do not want to use the architecture's type.h header when
      building BPF programs which are always 64-bit.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a5539f6
    • D
      Merge branch 'bpf-pkt-ptr-align' · 228b0324
      David S. Miller 提交于
      David S. Miller says:
      
      ====================
      bpf: Add alignment tracker to verifier.
      
      First we add the alignment tracking logic to the verifier.
      
      Next, we work on building up infrastructure to facilitate regression
      testing of this facility.
      
      Finally, we add the "test_align" test case.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      228b0324
    • D
    • D
      bpf: Add bpf_verify_program() to the library. · 91045f5e
      David S. Miller 提交于
      This allows a test case to load a BPF program and unconditionally
      acquire the verifier log.
      
      It also allows specification of the strict alignment flag.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      91045f5e
    • D
      bpf: Add strict alignment flag for BPF_PROG_LOAD. · e07b98d9
      David S. Miller 提交于
      Add a new field, "prog_flags", and an initial flag value
      BPF_F_STRICT_ALIGNMENT.
      
      When set, the verifier will enforce strict pointer alignment
      regardless of the setting of CONFIG_EFFICIENT_UNALIGNED_ACCESS.
      
      The verifier, in this mode, will also use a fixed value of "2" in
      place of NET_IP_ALIGN.
      
      This facilitates test cases that will exercise and validate this part
      of the verifier even when run on architectures where alignment doesn't
      matter.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      e07b98d9
    • D
      bpf: Do per-instruction state dumping in verifier when log_level > 1. · c5fc9692
      David S. Miller 提交于
      If log_level > 1, do a state dump every instruction and emit it in
      a more compact way (without a leading newline).
      
      This will facilitate more sophisticated test cases which inspect the
      verifier log for register state.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      c5fc9692
    • D
      bpf: Track alignment of register values in the verifier. · d1174416
      David S. Miller 提交于
      Currently if we add only constant values to pointers we can fully
      validate the alignment, and properly check if we need to reject the
      program on !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS architectures.
      
      However, once an unknown value is introduced we only allow byte sized
      memory accesses which is too restrictive.
      
      Add logic to track the known minimum alignment of register values,
      and propagate this state into registers containing pointers.
      
      The most common paradigm that makes use of this new logic is computing
      the transport header using the IP header length field.  For example:
      
      	struct ethhdr *ep = skb->data;
      	struct iphdr *iph = (struct iphdr *) (ep + 1);
      	struct tcphdr *th;
       ...
      	n = iph->ihl;
      	th = ((void *)iph + (n * 4));
      	port = th->dest;
      
      The existing code will reject the load of th->dest because it cannot
      validate that the alignment is at least 2 once "n * 4" is added the
      the packet pointer.
      
      In the new code, the register holding "n * 4" will have a reg->min_align
      value of 4, because any value multiplied by 4 will be at least 4 byte
      aligned.  (actually, the eBPF code emitted by the compiler in this case
      is most likely to use a shift left by 2, but the end result is identical)
      
      At the critical addition:
      
      	th = ((void *)iph + (n * 4));
      
      The register holding 'th' will start with reg->off value of 14.  The
      pointer addition will transform that reg into something that looks like:
      
      	reg->aux_off = 14
      	reg->aux_off_align = 4
      
      Next, the verifier will look at the th->dest load, and it will see
      a load offset of 2, and first check:
      
      	if (reg->aux_off_align % size)
      
      which will pass because aux_off_align is 4.  reg_off will be computed:
      
      	reg_off = reg->off;
       ...
      		reg_off += reg->aux_off;
      
      plus we have off==2, and it will thus check:
      
      	if ((NET_IP_ALIGN + reg_off + off) % size != 0)
      
      which evaluates to:
      
      	if ((NET_IP_ALIGN + 14 + 2) % size != 0)
      
      On strict alignment architectures, NET_IP_ALIGN is 2, thus:
      
      	if ((2 + 14 + 2) % size != 0)
      
      which passes.
      
      These pointer transformations and checks work regardless of whether
      the constant offset or the variable with known alignment is added
      first to the pointer register.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      d1174416
新手
引导
客服 返回
顶部