1. 08 Apr 2019, 1 commit
  2. 22 Mar 2019, 4 commits
    • genetlink: make policy common to family · 3b0f31f2
      Committed by Johannes Berg
      Since maxattr is common, the policy can't really differ sanely,
      so make it common as well.
      
      The only user that did in fact manage to make a non-common policy
      is taskstats, which has to be really careful about it (since it's
      still using a common maxattr!). This is no longer supported, but
      we can fake it using pre_doit.
      
      This reduces the size of e.g. nl80211.o (which has lots of commands):
      
         text	   data	    bss	    dec	    hex	filename
       398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
       397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
      --------------------------------
         -832      +8       0    -824
      
      Which is obviously just 8 bytes for each command, and an added 8
      bytes for the new policy pointer. I'm not sure why the ops list is
      counted as .text though.
      
      Most of the code transformations were done using the following spatch:
          @ops@
          identifier OPS;
          expression POLICY;
          @@
          struct genl_ops OPS[] = {
          ...,
           {
          -	.policy = POLICY,
           },
          ...
          };
      
          @@
          identifier ops.OPS;
          expression ops.POLICY;
          identifier fam;
          expression M;
          @@
          struct genl_family fam = {
                  .ops = OPS,
                  .maxattr = M,
          +       .policy = POLICY,
                  ...
          };
      
      This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
      the cb->data as ops, which we want to change in a later genl patch.
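      
      For reference, a minimal sketch of how a family registration looks after
      this change; the FOO_* identifiers and callbacks below are hypothetical,
      not taken from the patch:
      
          static const struct nla_policy foo_policy[FOO_ATTR_MAX + 1] = {
                  [FOO_ATTR_ID] = { .type = NLA_U32 },
          };
      
          static int foo_cmd_get_doit(struct sk_buff *skb, struct genl_info *info)
          {
                  return 0;
          }
      
          static const struct genl_ops foo_ops[] = {
                  /* no per-op .policy anymore */
                  { .cmd = FOO_CMD_GET, .doit = foo_cmd_get_doit },
          };
      
          static struct genl_family foo_family = {
                  .name    = "foo",
                  .maxattr = FOO_ATTR_MAX,
                  .policy  = foo_policy,  /* common to all ops */
                  .ops     = foo_ops,
                  .n_ops   = ARRAY_SIZE(foo_ops),
          };
      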
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dst: remove gc leftovers · 02afc7ad
      Committed by Julian Wiedmann
      Get rid of some obsolete gc-related documentation and macros that were
      missed in commit 5b7c9a8f ("net: remove dst gc related code").
      
      CC: Wei Wang <weiwan@google.com>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      Committed by David Ahern
      The fib_trie implementation calls synchronize_rcu when a certain number of
      pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test, with an average 8 msec delay when adding a fib entry.
      The total impact is a significant slowdown when modifying the fib. This is
      seen in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable, from 64k, where
      synchronize_rcu is called often (small, low end systems that are memory
      sensitive), to 64M, where synchronize_rcu is called rarely during a large
      FIB change (for high end systems with lots of memory). The default is 512kB
      which corresponds to the current setting of 128 pages with a 4kB page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade off is
      allowing FIB entries to consume more memory in a given time window but
      with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
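      
      A minimal user-space sketch for bumping the tunable to 16MB, assuming the
      sysctl is exposed as /proc/sys/net/ipv4/fib_sync_mem (value in bytes):
      
          #include <stdio.h>
      
          int main(void)
          {
                  /* net.ipv4.fib_sync_mem takes a byte count (64k..64M) */
                  FILE *f = fopen("/proc/sys/net/ipv4/fib_sync_mem", "w");
      
                  if (!f) {
                          perror("fib_sync_mem");
                          return 1;
                  }
                  fprintf(f, "%d\n", 16 * 1024 * 1024);  /* 16MB */
                  fclose(f);
                  return 0;
          }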
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create · c7a1ce39
      Committed by David Ahern
      Change addrconf_f6i_alloc to generate a fib6_config and call
      ip6_route_info_create. addrconf_f6i_alloc is the last caller to
      fib6_info_alloc besides ip6_route_info_create, and there is no
      reason for it to do its own initialization on a fib6_info.
      
      Host routes need to be created even if the device is down, so add a
      new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
      to not error out if the device is not up.
      
      Notes on the conversion:
      - ip_fib_metrics_init is equivalent, since fib6_config has fc_mx set to NULL
        and fc_mx_len set to 0
      - dst_nocount is handled by the RTF_ADDRCONF flag
      - dst_host is handled by fc_dst_len = 128
      
      nh_gw does not get set after the conversion to ip6_route_info_create
      but it should not be set in addrconf_f6i_alloc since this is a host
      route not a gateway route.
      
      Everything else is a straightforward mapping between fib6_info and
      fib6_config.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 21 Mar 2019, 2 commits
  4. 20 Mar 2019, 2 commits
  5. 19 Mar 2019, 2 commits
    • sctp: get sctphdr by offset in sctp_compute_cksum · 273160ff
      Committed by Xin Long
      sctp_hdr(skb) only works when skb->transport_header is set properly.
      
      But in Netfilter, skb->transport_header for ipv6 is not guaranteed to
      point at the sctphdr, which causes the checksum check for sctp packets
      to fail.
      
      So fix it by using the offset, which is correct in all call sites.
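      
      A minimal sketch of the idea (not the exact kernel diff): resolve the
      header from the explicit offset instead of relying on
      skb->transport_header being set, e.g.
      
          /* sketch: locate the sctphdr from a caller-supplied offset */
          static struct sctphdr *sctp_hdr_at(const struct sk_buff *skb,
                                             unsigned int offset)
          {
                  return (struct sctphdr *)(skb->data + offset);
          }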
      
      v1->v2:
        - Fix the changelog.
      
      Fixes: e6d8b64b ("net: sctp: fix and consolidate SCTP checksumming code")
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packets: Always register packet sk in the same order · a4dc6a49
      Committed by Maxime Chevallier
      When using fanouts with AF_PACKET, the demux functions such as
      fanout_demux_cpu will return an index in the fanout socket array, which
      corresponds to the selected socket.
      
      The ordering of this array depends on the order the sockets were added
      to a given fanout group, so for FANOUT_CPU this means sockets are bound
      to cpus in the order they are configured, which is OK.
      
      However, when stopping then restarting the interface these sockets are
      bound to, the sockets are reassigned to the fanout group in the reverse
      order, due to the fact that they were inserted at the head of the
      interface's AF_PACKET socket list.
      
      This means that traffic that was directed to the first socket in the
      fanout group is now directed to the last one after an interface restart.
      
      In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
      socket that used to receive traffic from the last CPU after an interface
      restart.
      
      This commit introduces a helper to add a socket at the tail of a list,
      then uses it to register AF_PACKET sockets.
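      
      A sketch of what such a tail-insert helper can look like (assuming the
      usual RCU hlist primitives; the exact name in the tree may differ):
      
          /* add the socket at the tail of the RCU hlist instead of the head,
           * so registration order is preserved across interface restarts */
          static void sk_add_node_tail_rcu(struct sock *sk, struct hlist_head *list)
          {
                  sock_hold(sk);
                  hlist_add_tail_rcu(&sk->sk_node, list);
          }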
      
      Note that this changes the order in which sockets are listed in /proc and
      with sock_diag.
      
      Fixes: dc99f600 ("packet: Add fanout support")
      Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 16 Mar 2019, 1 commit
  7. 13 Mar 2019, 1 commit
  8. 11 Mar 2019, 1 commit
  9. 10 Mar 2019, 1 commit
  10. 08 Mar 2019, 1 commit
    • netfilter: nf_tables: fix set double-free in abort path · 40ba1d9b
      Committed by Pablo Neira Ayuso
      The abort path can cause a double-free of an anonymous set.
      The added-and-to-be-aborted rule looks like this:
      
      udp dport { 137, 138 } drop
      
      The to-be-aborted transaction list looks like this:
      
      newset
      newsetelem
      newsetelem
      rule
      
      This gets walked in reverse order, so first pass disables the rule, the
      set elements, then the set.
      
      After synchronize_rcu(), we then destroy those in same order: rule, set
      element, set element, newset.
      
      The problem is that the anonymous set has already been bound to the rule,
      so the rule (via the lookup expression destructor) already frees the set,
      which then causes a use-after-free when trying to delete the elements from
      this set, and then a second attempt to free the set when handling the
      newset expression.
      
      The rule releases the bound set first from the abort path; this causes
      the use-after-free on set element removal when undoing the new element
      transactions. To handle this, from the abort path, skip the new element
      transactions if the set is bound.
      
      This still causes the use-after-free on set element removal. To handle
      this, remove the transaction from the list when the set is already
      bound.
      
      Joint work with Florian Westphal.
      
      Fixes: f6ac8585 ("netfilter: nf_tables: unbind set in rule from commit path")
      Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  11. 05 Mar 2019, 1 commit
  12. 04 Mar 2019, 3 commits
    • tls: Fix write space handling · 7463d3a2
      Committed by Boris Pismenny
      TLS device cannot use the sw context. This patch restores the original
      tls device write space handler and moves the sw/device-specific portions
      to the relevant files.
      
      Also, we remove the write_space call for the tls_sw flow, because it
      handles partial records in its delayed tx work handler.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: Fix tls_device handling of partial records · 94850257
      Committed by Boris Pismenny
      Cleanup the handling of partial records while fixing a bug where the
      tls_push_pending_closed_record function is using the software tls
      context instead of the hardware context.
      
      The bug resulted in the following crash:
      [   88.791229] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [   88.793271] #PF error: [normal kernel read fault]
      [   88.794449] PGD 800000022a426067 P4D 800000022a426067 PUD 22a156067 PMD 0
      [   88.795958] Oops: 0000 [#1] SMP PTI
      [   88.796884] CPU: 2 PID: 4973 Comm: openssl Not tainted 5.0.0-rc4+ #3
      [   88.798314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [   88.800067] RIP: 0010:tls_tx_records+0xef/0x1d0 [tls]
      [   88.801256] Code: 00 02 48 89 43 08 e8 a0 0b 96 d9 48 89 df e8 48 dd
      4d d9 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39
      c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00
      [   88.805179] RSP: 0018:ffffbd888186fca8 EFLAGS: 00010213
      [   88.806458] RAX: ffff9af1ed657c98 RBX: ffff9af1e88a1980 RCX: 0000000000000000
      [   88.808050] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9af1e88a1980
      [   88.809724] RBP: ffff9af1e88a1980 R08: 0000000000000017 R09: ffff9af1ebeeb700
      [   88.811294] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [   88.812917] R13: ffff9af1e88a1980 R14: ffff9af1ec13f800 R15: 0000000000000000
      [   88.814506] FS:  00007fcad2240740(0000) GS:ffff9af1f7880000(0000) knlGS:0000000000000000
      [   88.816337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   88.817717] CR2: 0000000000000000 CR3: 0000000228b3e000 CR4: 00000000001406e0
      [   88.819328] Call Trace:
      [   88.820123]  tls_push_data+0x628/0x6a0 [tls]
      [   88.821283]  ? remove_wait_queue+0x20/0x60
      [   88.822383]  ? n_tty_read+0x683/0x910
      [   88.823363]  tls_device_sendmsg+0x53/0xa0 [tls]
      [   88.824505]  sock_sendmsg+0x36/0x50
      [   88.825492]  sock_write_iter+0x87/0x100
      [   88.826521]  __vfs_write+0x127/0x1b0
      [   88.827499]  vfs_write+0xad/0x1b0
      [   88.828454]  ksys_write+0x52/0xc0
      [   88.829378]  do_syscall_64+0x5b/0x180
      [   88.830369]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   88.831603] RIP: 0033:0x7fcad1451680
      
      [ 1248.470626] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [ 1248.472564] #PF error: [normal kernel read fault]
      [ 1248.473790] PGD 0 P4D 0
      [ 1248.474642] Oops: 0000 [#1] SMP PTI
      [ 1248.475651] CPU: 3 PID: 7197 Comm: openssl Tainted: G           OE 5.0.0-rc4+ #3
      [ 1248.477426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [ 1248.479310] RIP: 0010:tls_tx_records+0x110/0x1f0 [tls]
      [ 1248.480644] Code: 00 02 48 89 43 08 e8 4f cb 63 d7 48 89 df e8 f7 9c
      1b d7 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39
      c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00
      [ 1248.484825] RSP: 0018:ffffaa0a41543c08 EFLAGS: 00010213
      [ 1248.486154] RAX: ffff955a2755dc98 RBX: ffff955a36031980 RCX: 0000000000000006
      [ 1248.487855] RDX: 0000000000000000 RSI: 000000000000002b RDI: 0000000000000286
      [ 1248.489524] RBP: ffff955a36031980 R08: 0000000000000000 R09: 00000000000002b1
      [ 1248.491394] R10: 0000000000000003 R11: 00000000ad55ad55 R12: 0000000000000000
      [ 1248.493162] R13: 0000000000000000 R14: ffff955a2abe6c00 R15: 0000000000000000
      [ 1248.494923] FS:  0000000000000000(0000) GS:ffff955a378c0000(0000) knlGS:0000000000000000
      [ 1248.496847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1248.498357] CR2: 0000000000000000 CR3: 000000020c40e000 CR4: 00000000001406e0
      [ 1248.500136] Call Trace:
      [ 1248.500998]  ? tcp_check_oom+0xd0/0xd0
      [ 1248.502106]  tls_sk_proto_close+0x127/0x1e0 [tls]
      [ 1248.503411]  inet_release+0x3c/0x60
      [ 1248.504530]  __sock_release+0x3d/0xb0
      [ 1248.505611]  sock_close+0x11/0x20
      [ 1248.506612]  __fput+0xb4/0x220
      [ 1248.507559]  task_work_run+0x88/0xa0
      [ 1248.508617]  do_exit+0x2cb/0xbc0
      [ 1248.509597]  ? core_sys_select+0x17a/0x280
      [ 1248.510740]  do_group_exit+0x39/0xb0
      [ 1248.511789]  get_signal+0x1d0/0x630
      [ 1248.512823]  do_signal+0x36/0x620
      [ 1248.513822]  exit_to_usermode_loop+0x5c/0xc6
      [ 1248.515003]  do_syscall_64+0x157/0x180
      [ 1248.516094]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 1248.517456] RIP: 0033:0x7fb398bd3f53
      [ 1248.518537] Code: Bad RIP value.
      
      Fixes: a42055e8 ("net/tls: Add support for async encryption of records for performance")
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: add KSZ9893 switch tagging support · 88b573af
      Committed by Tristram Ha
      The KSZ9893 switch is similar to the KSZ9477 switch, except that the
      ingress tail tag is 1 byte instead of 2 bytes.  The size of the portmap
      is smaller, so the override and lookup bits are also moved.
      Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 03 Mar 2019, 1 commit
    • net: sched: put back q.qlen into a single location · 46b1c18f
      Committed by Eric Dumazet
      In the series fc8b81a5 ("Merge branch 'lockless-qdisc-series'")
      John made the assumption that the data path had no need to read
      the qdisc qlen (number of packets in the qdisc).
      
      It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
      children.
      
      But pfifo_fast can be used as a leaf in classful qdiscs, and existing
      logic needs to access the child qlen in an efficient way.
      
      HTB breaks badly, since it uses cl->leaf.q->q.qlen in:
        htb_activate() -> WARN_ON()
        htb_dequeue_tree() to decide if a class can be htb_deactivated
        when it has no more packets.
      
      HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
      qdisc_tree_reduce_backlog() also read q.qlen directly.
      
      Using qdisc_qlen_sum() (which iterates over all possible cpus)
      in the data path is a non-starter.
      
      It seems we have to put back qlen in a central location,
      at least for stable kernels.
      
      For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock,
      so the existing q.qlen{++|--} are correct.
      
      For 'lockless' qdiscs (pfifo_fast so far), we need to use atomic_{inc|dec}()
      because the spinlock might not be held (for example from
      pfifo_fast_enqueue() and pfifo_fast_dequeue()).
      
      This patch adds atomic_qlen (in the same location as qlen)
      and renames the following helpers, to express that they can be used
      without the qdisc lock and that qlen is no longer percpu (see the
      sketch below).
      
      - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
      - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()
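      
      A rough sketch of the idea, assuming atomic_qlen shares the slot that
      currently holds qlen (the exact layout in the tree may differ):
      
          struct qdisc_skb_head {
                  struct sk_buff  *head;
                  struct sk_buff  *tail;
                  union {
                          u32             qlen;
                          atomic_t        atomic_qlen;
                  };
                  spinlock_t      lock;
          };
      
          static inline void qdisc_qstats_atomic_qlen_inc(struct Qdisc *sch)
          {
                  atomic_inc(&sch->q.atomic_qlen);
          }
      
          static inline void qdisc_qstats_atomic_qlen_dec(struct Qdisc *sch)
          {
                  atomic_dec(&sch->q.atomic_qlen);
          }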
      
      Later (net-next) we might revert this patch by tracking all these
      qlen uses and replace them by a more efficient method (not having
      to access a precise qlen, but an empty/non_empty status that might
      be less expensive to maintain/track).
      
      Another possibility is to have a legacy pfifo_fast version that would
      be used when used as a child qdisc, since the parent qdisc needs
      a spinlock anyway. But then, future lockless qdiscs would also
      have the same problem.
      
      Fixes: 7e66016f ("net: sched: helpers to sum qlen and qlen for per cpu logic")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 02 Mar 2019, 2 commits
  15. 01 Mar 2019, 2 commits
  16. 28 Feb 2019, 4 commits
  17. 27 Feb 2019, 11 commits