1. 04 6月, 2019 1 次提交
  2. 03 6月, 2019 1 次提交
  3. 01 6月, 2019 4 次提交
    • R
      bpf: move memory size checks to bpf_map_charge_init() · c85d6913
      Roman Gushchin 提交于
      Most bpf map types doing similar checks and bytes to pages
      conversion during memory allocation and charging.
      
      Let's unify these checks by moving them into bpf_map_charge_init().
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c85d6913
    • R
      bpf: rework memlock-based memory accounting for maps · b936ca64
      Roman Gushchin 提交于
      In order to unify the existing memlock charging code with the
      memcg-based memory accounting, which will be added later, let's
      rework the current scheme.
      
      Currently the following design is used:
        1) .alloc() callback optionally checks if the allocation will likely
           succeed using bpf_map_precharge_memlock()
        2) .alloc() performs actual allocations
        3) .alloc() callback calculates map cost and sets map.memory.pages
        4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
           and performs actual charging; in case of failure the map is
           destroyed
        <map is in use>
        1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
           performs uncharge and releases the user
        2) .map_free() callback releases the memory
      
      The scheme can be simplified and made more robust:
        1) .alloc() calculates map cost and calls bpf_map_charge_init()
        2) bpf_map_charge_init() sets map.memory.user and performs actual
          charge
        3) .alloc() performs actual allocations
        <map is in use>
        1) .map_free() callback releases the memory
        2) bpf_map_charge_finish() performs uncharge and releases the user
      
      The new scheme also allows to reuse bpf_map_charge_init()/finish()
      functions for memcg-based accounting. Because charges are performed
      before actual allocations and uncharges after freeing the memory,
      no bogus memory pressure can be created.
      
      In cases when the map structure is not available (e.g. it's not
      created yet, or is already destroyed), on-stack bpf_map_memory
      structure is used. The charge can be transferred with the
      bpf_map_charge_move() function.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      b936ca64
    • R
      bpf: group memory related fields in struct bpf_map_memory · 3539b96e
      Roman Gushchin 提交于
      Group "user" and "pages" fields of bpf_map into the bpf_map_memory
      structure. Later it can be extended with "memcg" and other related
      information.
      
      The main reason for a such change (beside cosmetics) is to pass
      bpf_map_memory structure to charging functions before the actual
      allocation of bpf_map.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3539b96e
    • R
      bpf: add memlock precharge for socket local storage · d50836cd
      Roman Gushchin 提交于
      Socket local storage maps lack the memlock precharge check,
      which is performed before the memory allocation for
      most other bpf map types.
      
      Let's add it in order to unify all map types.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      d50836cd
  4. 31 5月, 2019 5 次提交
    • W
      net: correct zerocopy refcnt with udp MSG_MORE · 100f6d8e
      Willem de Bruijn 提交于
      TCP zerocopy takes a uarg reference for every skb, plus one for the
      tcp_sendmsg_locked datapath temporarily, to avoid reaching refcnt zero
      as it builds, sends and frees skbs inside its inner loop.
      
      UDP and RAW zerocopy do not send inside the inner loop so do not need
      the extra sock_zerocopy_get + sock_zerocopy_put pair. Commit
      52900d22288ed ("udp: elide zerocopy operation in hot path") introduced
      extra_uref to pass the initial reference taken in sock_zerocopy_alloc
      to the first generated skb.
      
      But, sock_zerocopy_realloc takes this extra reference at the start of
      every call. With MSG_MORE, no new skb may be generated to attach the
      extra_uref to, so refcnt is incorrectly 2 with only one skb.
      
      Do not take the extra ref if uarg && !tcp, which implies MSG_MORE.
      Update extra_uref accordingly.
      
      This conditional assignment triggers a false positive may be used
      uninitialized warning, so have to initialize extra_uref at define.
      
      Changes v1->v2: fix typo in Fixes SHA1
      
      Fixes: 52900d22 ("udp: elide zerocopy operation in hot path")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Diagnosed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      100f6d8e
    • M
      ethtool: Check for vlan etype or vlan tci when parsing flow_rule · b73484b2
      Maxime Chevallier 提交于
      When parsing an ethtool flow spec to build a flow_rule, the code checks
      if both the vlan etype and the vlan tci are specified by the user to add
      a FLOW_DISSECTOR_KEY_VLAN match.
      
      However, when the user only specified a vlan etype or a vlan tci, this
      check silently ignores these parameters.
      
      For example, the following rule :
      
      ethtool -N eth0 flow-type udp4 vlan 0x0010 action -1 loc 0
      
      will result in no error being issued, but the equivalent rule will be
      created and passed to the NIC driver :
      
      ethtool -N eth0 flow-type udp4 action -1 loc 0
      
      In the end, neither the NIC driver using the rule nor the end user have
      a way to know that these keys were dropped along the way, or that
      incorrect parameters were entered.
      
      This kind of check should be left to either the driver, or the ethtool
      flow spec layer.
      
      This commit makes so that ethtool parameters are forwarded as-is to the
      NIC driver.
      
      Since none of the users of ethtool_rx_flow_rule_create are using the
      VLAN dissector, I don't think this qualifies as a regression.
      
      Fixes: eca4205f ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
      Signed-off-by: NMaxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: NPablo Neira Ayuso <pablo@gnumonks.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b73484b2
    • E
      net-gro: fix use-after-free read in napi_gro_frags() · a4270d67
      Eric Dumazet 提交于
      If a network driver provides to napi_gro_frags() an
      skb with a page fragment of exactly 14 bytes, the call
      to gro_pull_from_frag0() will 'consume' the fragment
      by calling skb_frag_unref(skb, 0), and the page might
      be freed and reused.
      
      Reading eth->h_proto at the end of napi_frags_skb() might
      read mangled data, or crash under specific debugging features.
      
      BUG: KASAN: use-after-free in napi_frags_skb net/core/dev.c:5833 [inline]
      BUG: KASAN: use-after-free in napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
      Read of size 2 at addr ffff88809366840c by task syz-executor599/8957
      
      CPU: 1 PID: 8957 Comm: syz-executor599 Not tainted 5.2.0-rc1+ #32
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load_n_noabort+0xf/0x20 mm/kasan/generic_report.c:142
       napi_frags_skb net/core/dev.c:5833 [inline]
       napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
       tun_get_user+0x2f3c/0x3ff0 drivers/net/tun.c:1991
       tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
       call_write_iter include/linux/fs.h:1872 [inline]
       do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
       do_iter_write fs/read_write.c:970 [inline]
       do_iter_write+0x184/0x610 fs/read_write.c:951
       vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
       do_writev+0x15b/0x330 fs/read_write.c:1058
      
      Fixes: a50e233c ("net-gro: restore frag0 optimization")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4270d67
    • M
      net: avoid indirect calls in L4 checksum calculation · 2544af03
      Matteo Croce 提交于
      Commit 283c16a2 ("indirect call wrappers: helpers to speed-up
      indirect calls of builtin") introduces some macros to avoid doing
      indirect calls.
      
      Use these helpers to remove two indirect calls in the L4 checksum
      calculation for devices which don't have hardware support for it.
      
      As a test I generate packets with pktgen out to a dummy interface
      with HW checksumming disabled, to have the checksum calculated in
      every sent packet.
      The packet rate measured with an i7-6700K CPU and a single pktgen
      thread raised from 6143 to 6608 Kpps, an increase by 7.5%
      Suggested-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2544af03
    • S
      net: core: support XDP generic on stacked devices. · 458bf2f2
      Stephen Hemminger 提交于
      When a device is stacked like (team, bonding, failsafe or netvsc) the
      XDP generic program for the parent device was not called.
      
      Move the call to XDP generic inside __netif_receive_skb_core where
      it can be done multiple times for stacked case.
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      458bf2f2
  5. 26 5月, 2019 1 次提交
    • G
      flow_offload: use struct_size() in kzalloc() · 6dca9360
      Gustavo A. R. Silva 提交于
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
         int stuff;
         struct boo entry[];
      };
      
      instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6dca9360
  6. 24 5月, 2019 2 次提交
  7. 23 5月, 2019 1 次提交
  8. 21 5月, 2019 2 次提交
  9. 17 5月, 2019 2 次提交
  10. 15 5月, 2019 1 次提交
    • S
      rtnetlink: always put IFLA_LINK for links with a link-netnsid · feadc4b6
      Sabrina Dubroca 提交于
      Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when
      iflink == ifindex.
      
      In some cases, a device can be created in a different netns with the
      same ifindex as its parent. That device will not dump its IFLA_LINK
      attribute, which can confuse some userspace software that expects it.
      For example, if the last ifindex created in init_net and foo are both
      8, these commands will trigger the issue:
      
          ip link add parent type dummy                   # ifindex 9
          ip link add link parent netns foo type macvlan  # ifindex 9 in ns foo
      
      So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump,
      always put the IFLA_LINK attribute as well.
      
      Thanks to Dan Winship for analyzing the original OpenShift bug down to
      the missing netlink attribute.
      
      v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said
          add Nicolas' ack
      v3: change Fixes tag
          fix subject typo, spotted by Edward Cree
      Analyzed-by: NDan Winship <danw@redhat.com>
      Fixes: d8a5ec67 ("[NET]: netlink support for moving devices between network namespaces.")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      feadc4b6
  11. 14 5月, 2019 3 次提交
    • J
      bpf: sockmap fix msg->sg.size account on ingress skb · cabede8b
      John Fastabend 提交于
      When converting a skb to msg->sg we forget to set the size after the
      latest ktls/tls code conversion. This patch can be reached by doing
      a redir into ingress path from BPF skb sock recv hook. Then trying to
      read the size fails.
      
      Fix this by setting the size.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      cabede8b
    • J
      bpf: sockmap, only stop/flush strp if it was enabled at some point · 01489436
      John Fastabend 提交于
      If we try to call strp_done on a parser that has never been
      initialized, because the sockmap user is only using TX side for
      example we get the following error.
      
        [  883.422081] WARNING: CPU: 1 PID: 208 at kernel/workqueue.c:3030 __flush_work+0x1ca/0x1e0
        ...
        [  883.422095] Workqueue: events sk_psock_destroy_deferred
        [  883.422097] RIP: 0010:__flush_work+0x1ca/0x1e0
      
      This had been wrapped in a 'if (psock->parser.enabled)' logic which
      was broken because the strp_done() was never actually being called
      because we do a strp_stop() earlier in the tear down logic will
      set parser.enabled to false. This could result in a use after free
      if work was still in the queue and was resolved by the patch here,
      1d79895a ("sk_msg: Always cancel strp work before freeing the
      psock"). However, calling strp_stop(), done by the patch marked in
      the fixes tag, only is useful if we never initialized a strp parser
      program and never initialized the strp to start with. Because if
      we had initialized a stream parser strp_stop() would have been called
      by sk_psock_drop() earlier in the tear down process.  By forcing the
      strp to stop we get past the WARNING in strp_done that checks
      the stopped flag but calling cancel_work_sync on work that has never
      been initialized is also wrong and generates the warning above.
      
      To fix check if the parser program exists. If the program exists
      then the strp work has been initialized and must be sync'd and
      cancelled before free'ing any structures. If no program exists we
      never initialized the stream parser in the first place so skip the
      sync/cancel logic implemented by strp_done.
      
      Finally, remove the strp_done its not needed and in the case where we
      are using the stream parser has already been called.
      
      Fixes: e8e34377 ("bpf: Stop the psock parser before canceling its work")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      01489436
    • E
      flow_dissector: disable preemption around BPF calls · b1c17a9a
      Eric Dumazet 提交于
      Various things in eBPF really require us to disable preemption
      before running an eBPF program.
      
      syzbot reported :
      
      BUG: assuming atomic context at net/core/flow_dissector.c:737
      in_atomic(): 0, irqs_disabled(): 0, pid: 24710, name: syz-executor.3
      2 locks held by syz-executor.3/24710:
       #0: 00000000e81a4bf1 (&tfile->napi_mutex){+.+.}, at: tun_get_user+0x168e/0x3ff0 drivers/net/tun.c:1850
       #1: 00000000254afebd (rcu_read_lock){....}, at: __skb_flow_dissect+0x1e1/0x4bb0 net/core/flow_dissector.c:822
      CPU: 1 PID: 24710 Comm: syz-executor.3 Not tainted 5.1.0+ #6
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       __cant_sleep kernel/sched/core.c:6165 [inline]
       __cant_sleep.cold+0xa3/0xbb kernel/sched/core.c:6142
       bpf_flow_dissect+0xfe/0x390 net/core/flow_dissector.c:737
       __skb_flow_dissect+0x362/0x4bb0 net/core/flow_dissector.c:853
       skb_flow_dissect_flow_keys_basic include/linux/skbuff.h:1322 [inline]
       skb_probe_transport_header include/linux/skbuff.h:2500 [inline]
       skb_probe_transport_header include/linux/skbuff.h:2493 [inline]
       tun_get_user+0x2cfe/0x3ff0 drivers/net/tun.c:1940
       tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
       call_write_iter include/linux/fs.h:1872 [inline]
       do_iter_readv_writev+0x5fd/0x900 fs/read_write.c:693
       do_iter_write fs/read_write.c:970 [inline]
       do_iter_write+0x184/0x610 fs/read_write.c:951
       vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
       do_writev+0x15b/0x330 fs/read_write.c:1058
       __do_sys_writev fs/read_write.c:1131 [inline]
       __se_sys_writev fs/read_write.c:1128 [inline]
       __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
       do_syscall_64+0x103/0x670 arch/x86/entry/common.c:298
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Petar Penkov <ppenkov@google.com>
      Cc: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1c17a9a
  12. 09 5月, 2019 1 次提交
  13. 06 5月, 2019 1 次提交
  14. 05 5月, 2019 1 次提交
  15. 04 5月, 2019 2 次提交
  16. 01 5月, 2019 1 次提交
  17. 28 4月, 2019 4 次提交
    • J
      genetlink: optionally validate strictly/dumps · ef6243ac
      Johannes Berg 提交于
      Add options to strictly validate messages and dump messages,
      sometimes perhaps validating dump messages non-strictly may
      be required, so add an option for that as well.
      
      Since none of this can really be applied to existing commands,
      set the options everwhere using the following spatch:
      
          @@
          identifier ops;
          expression X;
          @@
          struct genl_ops ops[] = {
          ...,
           {
                  .cmd = X,
          +       .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
                  ...
           },
          ...
          };
      
      For new commands one should just not copy the .validate 'opt-out'
      flags and thus get strict validation.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ef6243ac
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg 提交于
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8cb08174
    • M
      netlink: make nla_nest_start() add NLA_F_NESTED flag · ae0be8de
      Michal Kubecek 提交于
      Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
      netlink based interfaces (including recently added ones) are still not
      setting it in kernel generated messages. Without the flag, message parsers
      not aware of attribute semantics (e.g. wireshark dissector or libmnl's
      mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
      the structure of their contents.
      
      Unfortunately we cannot just add the flag everywhere as there may be
      userspace applications which check nlattr::nla_type directly rather than
      through a helper masking out the flags. Therefore the patch renames
      nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
      as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
      are rewritten to use nla_nest_start().
      
      Except for changes in include/net/netlink.h, the patch was generated using
      this semantic patch:
      
      @@ expression E1, E2; @@
      -nla_nest_start(E1, E2)
      +nla_nest_start_noflag(E1, E2)
      
      @@ expression E1, E2; @@
      -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
      +nla_nest_start(E1, E2)
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae0be8de
    • M
      bpf: Introduce bpf sk local storage · 6ac99e8f
      Martin KaFai Lau 提交于
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock by "bpf_sk_fullsock()"
      - get the bpf_tcp_sock by "bpf_tcp_sock()"
      - get the listener sock by "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running context.
      
      this patch is another effort to make bpf's network programming
      more intuitive to do (together with memory and performance benefit).
      
      When bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuples (src/dst ip/port) as the key.
      If multiple bpf progs require to store different sk data, multiple maps
      have to be defined.  Hence, wasting memory to store the duplicated
      keys (i.e. 4 tuples here) in each of the bpf map.
      [ The smallest key could be the sk pointer itself which requires
        some enhancement in the verifier and it is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when sk is freed.
      Otherwise, the bpf map will become full and un-usable quickly.
      The sk-free tracking currently could be done during sk state
      transition (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined which then usually ended-up
      with an over-provisioned map in production.  Even the map was re-sizable,
      while the sk naturally come and go away already, this potential re-size
      operation is arguably redundant if the data can be directly connected
      to the sk itself instead of proxy-ing through a bpf map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at sk for bpf prog to use.  The space will be allocated when the first bpf
      prog has created data for this particular sk.
      
      The design optimizes the bpf prog's lookup (and then optionally followed by
      an inline update).  bpf_spin_lock should be used if the inline update needs
      to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wise
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
      a "sk-local-storage" data from a particular sk.
      Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are mostly:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable-offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "type" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[] which is a stable sized array.  Each
      "sk-local-storage" type has a stable offset to the cache[] array.
      In the future, a map's flag could be introduced to do cache
      opt-out/enforcement if it became necessary.
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-key-ed map usage
      to minimize the map lookup penalty.  16 has enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helper bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
      "create" new elem if one does not exist in the sk.  It is done by
      the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
      The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      it has eliminated the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective,  the map has the socket fd as the key while the map
         can be shared by pinned-file or map-id.
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel defined btf with 4 tuples as the return key could
         be explored later also.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations is used instead.
         e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for the details in
         synchronization cases and considerations.
      
      3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key. (verifier is modified to support sk ptr as the key
      That should have shortened the key lookup time.)
      
      Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb and then bump the cnt.  netperf is used to drive
      data with 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
      
      Here is a high-level picture on how are the objects organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage───── bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └──────────│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage───────bpf_sk_storage
        └──────┘
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      6ac99e8f
  18. 26 4月, 2019 3 次提交
    • D
      ipv6: Initialize fib6_result in bpf_ipv6_fib_lookup · e55449e7
      David Ahern 提交于
      fib6_result is not initialized in bpf_ipv6_fib_lookup and potentially
      passses garbage to the fib lookup which triggers a KASAN warning:
      
      [  262.055450] ==================================================================
      [  262.057640] BUG: KASAN: user-memory-access in fib6_rule_suppress+0x4b/0xce
      [  262.059488] Read of size 8 at addr 00000a20000000b0 by task swapper/1/0
      [  262.061238]
      [  262.061673] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.0-rc5+ #56
      [  262.063493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
      [  262.065593] Call Trace:
      [  262.066277]  <IRQ>
      [  262.066848]  dump_stack+0x7e/0xbb
      [  262.067764]  kasan_report+0x18b/0x1b5
      [  262.069921]  __asan_load8+0x7f/0x81
      [  262.070879]  fib6_rule_suppress+0x4b/0xce
      [  262.071980]  fib_rules_lookup+0x275/0x2cd
      [  262.073090]  fib6_lookup+0x119/0x218
      [  262.076457]  bpf_ipv6_fib_lookup+0x39d/0x664
      ...
      
      Initialize fib6_result to 0.
      
      Fixes: b1d40991 ("ipv6: Rename fib6_multipath_select and pass fib6_result")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e55449e7
    • S
      bpf: support BPF_PROG_QUERY for BPF_FLOW_DISSECTOR attach_type · 118c8e9a
      Stanislav Fomichev 提交于
      target_fd is target namespace. If there is a flow dissector BPF program
      attached to that namespace, its (single) id is returned.
      
      v5:
      * drop net ref right after rcu unlock (Daniel Borkmann)
      
      v4:
      * add missing put_net (Jann Horn)
      
      v3:
      * add missing inline to skb_flow_dissector_prog_query static def
        (kbuild test robot <lkp@intel.com>)
      
      v2:
      * don't sleep in rcu critical section (Jakub Kicinski)
      * check input prog_cnt (exit early)
      
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      118c8e9a
    • K
      net-sysfs: Replace ktype default_attrs field with groups · be0d6926
      Kimberly Brown 提交于
      The kobj_type default_attrs field is being replaced by the
      default_groups field. Replace the default_attrs fields in rx_queue_ktype
      and netdev_queue_ktype with default_groups. Use the ATTRIBUTE_GROUPS
      macro to create rx_queue_default_groups and netdev_queue_default_groups.
      
      This patch was tested by verifying that the sysfs files for the
      attributes in the default groups were created.
      Signed-off-by: NKimberly Brown <kimbrownkd@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      be0d6926
  19. 24 4月, 2019 4 次提交