1. 03 1月, 2020 3 次提交
    • D
      Merge branch 'page_pool-NUMA-node-handling-fixes' · c9a2069b
      David S. Miller 提交于
      Jesper Dangaard Brouer says:
      
      ====================
      page_pool: NUMA node handling fixes
      
      The recently added NUMA changes (merged for v5.5) to page_pool, it both
      contains a bug in handling NUMA_NO_NODE condition, and added code to
      the fast-path.
      
      This patchset fixes the bug and moves code out of fast-path. The first
      patch contains a fix that should be considered for 5.5. The second
      patch reduce code size and overhead in case CONFIG_NUMA is disabled.
      
      Currently the NUMA_NO_NODE setting bug only affects driver 'ti_cpsw'
      (drivers/net/ethernet/ti/), but after this patchset, we plan to move
      other drivers (netsec and mvneta) to use NUMA_NO_NODE setting.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9a2069b
    • J
      page_pool: help compiler remove code in case CONFIG_NUMA=n · f13fc107
      Jesper Dangaard Brouer 提交于
      When kernel is compiled without NUMA support, then page_pool NUMA
      config setting (pool->p.nid) doesn't make any practical sense. The
      compiler cannot see that it can remove the code paths.
      
      This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA,
      in allocation and numa check code, which helps compiler to see the
      optimisation potential. It leaves update code intact to keep API the
      same.
      
       $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
                                 net/core/page_pool.o-numa-disabled
       add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
       Function                                     old     new   delta
       page_pool_create                             401     398      -3
       __page_pool_alloc_pages_slow                 439     426     -13
       page_pool_refill_alloc_cache                 425     328     -97
       Total: Before=3611, After=3498, chg -3.13%
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f13fc107
    • J
      page_pool: handle page recycle for NUMA_NO_NODE condition · 44768dec
      Jesper Dangaard Brouer 提交于
      The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
      not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.
      
      The goal of the NUMA changes in commit d5394610 ("page_pool: Don't
      recycle non-reusable pages"), were to have RX-pages that belongs to the
      same NUMA node as the CPU processing RX-packet during softirq/NAPI. As
      illustrated by the performance measurements.
      
      This patch moves the NAPI checks out of fast-path, and at the same time
      solves the NUMA_NO_NODE issue.
      
      First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
      will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
      as the the preferred nid.  It is only in rare situations, where
      e.g. NUMA zone runs dry, that page gets doesn't get allocated from
      preferred nid.  The page_pool API allows drivers to control the nid
      themselves via controlling pool->p.nid.
      
      This patch moves the NAPI check to when alloc cache is refilled, via
      dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
      pages from remote NUMA into the ptr_ring, as the dequeue/consume step
      will check the NUMA node. All current drivers using page_pool will
      alloc/refill RX-ring from same CPU running softirq/NAPI process.
      
      Drivers that control the nid explicitly, also use page_pool_update_nid
      when changing nid runtime.  To speed up transision to new nid the alloc
      cache is now flushed on nid changes.  This force pages to come from
      ptr_ring, which does the appropate nid check.
      
      For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
      node, we accept that transitioning the alloc cache doesn't happen
      immediately. The preferred nid change runtime via consulting
      numa_mem_id() based on the CPU processing RX-packets.
      
      Notice, to avoid stressing the page buddy allocator and avoid doing too
      much work under softirq with preempt disabled, the NUMA check at
      ptr_ring dequeue will break the refill cycle, when detecting a NUMA
      mismatch. This will cause a slower transition, but its done on purpose.
      
      Fixes: d5394610 ("page_pool: Don't recycle non-reusable pages")
      Reported-by: NLi RongQing <lirongqing@baidu.com>
      Reported-by: NYunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44768dec
  2. 01 1月, 2020 15 次提交
  3. 31 12月, 2019 19 次提交
    • T
      hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename() · 04b69426
      Taehee Yoo 提交于
      hsr slave interfaces don't have debugfs directory.
      So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
      is changed.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add dummy1 type dummy
          ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
          ip link set dummy0 name ap
      
      Splat looks like:
      [21071.899367][T22666] ap: renamed from dummy0
      [21071.914005][T22666] ==================================================================
      [21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
      [21071.926941][T22666]
      [21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
      [21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [21071.935094][T22666] Call Trace:
      [21071.935867][T22666]  dump_stack+0x96/0xdb
      [21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
      [21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.940949][T22666]  __kasan_report+0x12a/0x16f
      [21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.942674][T22666]  kasan_report+0xe/0x20
      [21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
      [21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
      [21071.945052][T22666]  ? __module_text_address+0x13/0x140
      [21071.945897][T22666]  notifier_call_chain+0x90/0x160
      [21071.946743][T22666]  dev_change_name+0x419/0x840
      [21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
      [21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
      [21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
      [21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
      [21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
      [21071.951991][T22666]  do_setlink+0x811/0x2ef0
      [21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
      [ ... ]
      
      Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
      Fixes: 4c2d5e33 ("hsr: rename debugfs file when interface name is changed")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04b69426
    • D
      net/sched: add delete_empty() to filters and use it in cls_flower · a5b72a08
      Davide Caratti 提交于
      Revert "net/sched: cls_u32: fix refcount leak in the error path of
      u32_change()", and fix the u32 refcount leak in a more generic way that
      preserves the semantic of rule dumping.
      On tc filters that don't support lockless insertion/removal, there is no
      need to guard against concurrent insertion when a removal is in progress.
      Therefore, for most of them we can avoid a full walk() when deleting, and
      just decrease the refcount, like it was done on older Linux kernels.
      This fixes situations where walk() was wrongly detecting a non-empty
      filter, like it happened with cls_u32 in the error path of change(), thus
      leading to failures in the following tdc selftests:
      
       6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
       6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
       74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id
      
      On cls_flower, and on (future) lockless filters, this check is necessary:
      move all the check_empty() logic in a callback so that each filter
      can have its own implementation. For cls_flower, it's sufficient to check
      if no IDRs have been allocated.
      
      This reverts commit 275c44aa.
      
      Changes since v1:
       - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
         is used, thanks to Vlad Buslov
       - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
       - squash revert and new fix in a single patch, to be nice with bisect
         tests that run tdc on u32 filter, thanks to Dave Miller
      
      Fixes: 275c44aa ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
      Fixes: 6676d5e4 ("net: sched: set dedicated tcf_walker flag when tp is empty")
      Suggested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Suggested-by: NVlad Buslov <vladbu@mellanox.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: NVlad Buslov <vladbu@mellanox.com>
      Tested-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5b72a08
    • V
      net/ncsi: Fix gma flag setting after response · 9e860947
      Vijay Khemka 提交于
      gma_flag was set at the time of GMA command request but it should
      only be set after getting successful response. Movinng this flag
      setting in GMA response handler.
      
      This flag is used mainly for not repeating GMA command once
      received MAC address.
      Signed-off-by: NVijay Khemka <vijaykhemka@fb.com>
      Reviewed-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e860947
    • K
      sctp: add enabled check for path tracepoint loop. · f398efc1
      Kevin Kou 提交于
      sctp_outq_sack is the main function handles SACK, it is called very
      frequently. As the commit "move trace_sctp_probe_path into sctp_outq_sack"
      added below code to this function, sctp tracepoint is disabled most of time,
      but the loop of transport list will be always called even though the
      tracepoint is disabled, this is unnecessary.
      
      +	/* SCTP path tracepoint for congestion control debugging. */
      +	list_for_each_entry(transport, transport_list, transports) {
      +		trace_sctp_probe_path(transport, asoc);
      +	}
      
      This patch is to add tracepoint enabled check at outside of the loop of
      transport list, and avoid traversing the loop when trace is disabled,
      it is a small optimization.
      Signed-off-by: NKevin Kou <qdkevin.kou@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f398efc1
    • D
      Merge branch 'Improvements-to-SJA1105-DSA-RX-timestamping' · 9010ef57
      David S. Miller 提交于
      Vladimir Oltean says:
      
      ====================
      Improvements to SJA1105 DSA RX timestamping
      
      This series makes the sja1105 DSA driver use a dedicated kernel thread
      for RX timestamping, a process which is time-sensitive and otherwise a
      bit fragile. This allows users to customize their system (probabil an
      embedded PTP switch) fully and allocate the CPU bandwidth for the driver
      to expedite the RX timestamps as quickly as possible.
      
      While doing this conversion, add a function to the PTP core for
      cancelling this kernel thread (function which I found rather strange to
      be missing).
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9010ef57
    • V
      net: dsa: sja1105: Empty the RX timestamping queue on PTP settings change · 19d1f0ed
      Vladimir Oltean 提交于
      When disabling PTP timestamping, don't reset the switch with the new
      static config until all existing PTP frames have been timestamped on the
      RX path or dropped. There's nothing we can do with these afterwards.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19d1f0ed
    • V
      net: dsa: sja1105: Use PTP core's dedicated kernel thread for RX timestamping · 1e762bd2
      Vladimir Oltean 提交于
      And move the queue of skb's waiting for RX timestamps into the ptp_data
      structure, since it isn't needed if PTP is not compiled.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e762bd2
    • V
      ptp: introduce ptp_cancel_worker_sync · 544fed47
      Vladimir Oltean 提交于
      In order to effectively use the PTP kernel thread for tasks such as
      timestamping packets, allow the user control over stopping it, which is
      needed e.g. when the timestamping queues must be drained.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      544fed47
    • C
      tcp: Fix highest_sack and highest_sack_seq · 85369750
      Cambda Zhu 提交于
      >From commit 50895b9d ("tcp: highest_sack fix"), the logic about
      setting tp->highest_sack to the head of the send queue was removed.
      Of course the logic is error prone, but it is logical. Before we
      remove the pointer to the highest sack skb and use the seq instead,
      we need to set tp->highest_sack to NULL when there is no skb after
      the last sack, and then replace NULL with the real skb when new skb
      inserted into the rtx queue, because the NULL means the highest sack
      seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
      the next ACK with sack option will increase tp->reordering unexpectedly.
      
      This patch sets tp->highest_sack to the tail of the rtx queue if
      it's NULL and new data is sent. The patch keeps the rule that the
      highest_sack can only be maintained by sack processing, except for
      this only case.
      
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85369750
    • V
      ptp: fix the race between the release of ptp_clock and cdev · a33121e5
      Vladis Dronov 提交于
      In a case when a ptp chardev (like /dev/ptp0) is open but an underlying
      device is removed, closing this file leads to a race. This reproduces
      easily in a kvm virtual machine:
      
      ts# cat openptp0.c
      int main() { ... fp = fopen("/dev/ptp0", "r"); ... sleep(10); }
      ts# uname -r
      5.5.0-rc3-46cf053e
      ts# cat /proc/cmdline
      ... slub_debug=FZP
      ts# modprobe ptp_kvm
      ts# ./openptp0 &
      [1] 670
      opened /dev/ptp0, sleeping 10s...
      ts# rmmod ptp_kvm
      ts# ls /dev/ptp*
      ls: cannot access '/dev/ptp*': No such file or directory
      ts# ...woken up
      [   48.010809] general protection fault: 0000 [#1] SMP
      [   48.012502] CPU: 6 PID: 658 Comm: openptp0 Not tainted 5.5.0-rc3-46cf053e #25
      [   48.014624] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
      [   48.016270] RIP: 0010:module_put.part.0+0x7/0x80
      [   48.017939] RSP: 0018:ffffb3850073be00 EFLAGS: 00010202
      [   48.018339] RAX: 000000006b6b6b6b RBX: 6b6b6b6b6b6b6b6b RCX: ffff89a476c00ad0
      [   48.018936] RDX: fffff65a08d3ea08 RSI: 0000000000000247 RDI: 6b6b6b6b6b6b6b6b
      [   48.019470] ...                                              ^^^ a slub poison
      [   48.023854] Call Trace:
      [   48.024050]  __fput+0x21f/0x240
      [   48.024288]  task_work_run+0x79/0x90
      [   48.024555]  do_exit+0x2af/0xab0
      [   48.024799]  ? vfs_write+0x16a/0x190
      [   48.025082]  do_group_exit+0x35/0x90
      [   48.025387]  __x64_sys_exit_group+0xf/0x10
      [   48.025737]  do_syscall_64+0x3d/0x130
      [   48.026056]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [   48.026479] RIP: 0033:0x7f53b12082f6
      [   48.026792] ...
      [   48.030945] Modules linked in: ptp i6300esb watchdog [last unloaded: ptp_kvm]
      [   48.045001] Fixing recursive fault but reboot is needed!
      
      This happens in:
      
      static void __fput(struct file *file)
      {   ...
          if (file->f_op->release)
              file->f_op->release(inode, file); <<< cdev is kfree'd here
          if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
                   !(mode & FMODE_PATH))) {
              cdev_put(inode->i_cdev); <<< cdev fields are accessed here
      
      Namely:
      
      __fput()
        posix_clock_release()
          kref_put(&clk->kref, delete_clock) <<< the last reference
            delete_clock()
              delete_ptp_clock()
                kfree(ptp) <<< cdev is embedded in ptp
        cdev_put
          module_put(p->owner) <<< *p is kfree'd, bang!
      
      Here cdev is embedded in posix_clock which is embedded in ptp_clock.
      The race happens because ptp_clock's lifetime is controlled by two
      refcounts: kref and cdev.kobj in posix_clock. This is wrong.
      
      Make ptp_clock's sysfs device a parent of cdev with cdev_device_add()
      created especially for such cases. This way the parent device with its
      ptp_clock is not released until all references to the cdev are released.
      This adds a requirement that an initialized but not exposed struct
      device should be provided to posix_clock_register() by a caller instead
      of a simple dev_t.
      
      This approach was adopted from the commit 72139dfa ("watchdog: Fix
      the race between the release of watchdog_core_data and cdev"). See
      details of the implementation in the commit 233ed09d ("chardev: add
      helper function to register char devs with a struct device").
      
      Link: https://lore.kernel.org/linux-fsdevel/20191125125342.6189-1-vdronov@redhat.com/T/#uAnalyzed-by: NStephen Johnston <sjohnsto@redhat.com>
      Analyzed-by: NVern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: NVladis Dronov <vdronov@redhat.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a33121e5
    • V
      net: dsa: sja1105: Reconcile the meaning of TPID and TPID2 for E/T and P/Q/R/S · 54fa49ee
      Vladimir Oltean 提交于
      For first-generation switches (SJA1105E and SJA1105T):
      - TPID means C-Tag (typically 0x8100)
      - TPID2 means S-Tag (typically 0x88A8)
      
      While for the second generation switches (SJA1105P, SJA1105Q, SJA1105R,
      SJA1105S) it is the other way around:
      - TPID means S-Tag (typically 0x88A8)
      - TPID2 means C-Tag (typically 0x8100)
      
      In other words, E/T tags untagged traffic with TPID, and P/Q/R/S with
      TPID2.
      
      So the patch mentioned below fixed VLAN filtering for P/Q/R/S, but broke
      it for E/T.
      
      We strive for a common code path for all switches in the family, so just
      lie in the static config packing functions that TPID and TPID2 are at
      swapped bit offsets than they actually are, for P/Q/R/S. This will make
      both switches understand TPID to be ETH_P_8021Q and TPID2 to be
      ETH_P_8021AD. The meaning from the original E/T was chosen over P/Q/R/S
      because E/T is actually the one with public documentation available
      (UM10944.pdf).
      
      Fixes: f9a1a764 ("net: dsa: sja1105: Reverse TPID and TPID2")
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54fa49ee
    • V
      Documentation: net: dsa: sja1105: Remove text about taprio base-time limitation · 3a323ed7
      Vladimir Oltean 提交于
      Since commit 86db36a3 ("net: dsa: sja1105: Implement state machine
      for TAS with PTP clock source"), this paragraph is no longer true. So
      remove it.
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a323ed7
    • V
      net: dsa: sja1105: Remove restriction of zero base-time for taprio offload · d00bdc0a
      Vladimir Oltean 提交于
      The check originates from the initial implementation which was not based
      on PTP time but on a standalone clock source. In the meantime we can now
      program the PTPSCHTM register at runtime with the dynamic base time
      (actually with a value that is 200 ns smaller, to avoid writing DELTA=0
      in the Schedule Entry Points Parameters Table). And we also have logic
      for moving the actual base time in the future of the PHC's current time
      base, so the check for zero serves no purpose, since even if the user
      will specify zero, that's not what will end up in the static config
      table where the limitation is.
      
      Fixes: 86db36a3 ("net: dsa: sja1105: Implement state machine for TAS with PTP clock source")
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d00bdc0a
    • V
      net: dsa: sja1105: Really make the PTP command read-write · 5a47f588
      Vladimir Oltean 提交于
      When activating tc-taprio offload on the switch ports, the TAS state
      machine will try to check whether it is running or not, but will find
      both the STARTED and STOPPED bits as false in the
      sja1105_tas_check_running function. So the function will return -EINVAL
      (an abnormal situation) and the kernel will keep printing this from the
      TAS FSM workqueue:
      
      [   37.691971] sja1105 spi0.1: An operation returned -22
      
      The reason is that the underlying function that gets called,
      sja1105_ptp_commit, does not actually do a SPI_READ, but a SPI_WRITE. So
      the command buffer remains initialized with zeroes instead of retrieving
      the hardware state. Fix that.
      
      Fixes: 41603d78 ("net: dsa: sja1105: Make the PTP command read-write")
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a47f588
    • V
      net: dsa: sja1105: Take PTP egress timestamp by port, not mgmt slot · 9fcf024d
      Vladimir Oltean 提交于
      The PTP egress timestamp N must be captured from register PTPEGR_TS[n],
      where n = 2 * PORT + TSREG. There are 10 PTPEGR_TS registers, 2 per
      port. We are only using TSREG=0.
      
      As opposed to the management slots, which are 4 in number
      (SJA1105_NUM_PORTS, minus the CPU port). Any management frame (which
      includes PTP frames) can be sent to any non-CPU port through any
      management slot. When the CPU port is not the last port (#4), there will
      be a mismatch between the slot and the port number.
      
      Luckily, the only mainline occurrence with this switch
      (arch/arm/boot/dts/ls1021a-tsn.dts) does have the CPU port as #4, so the
      issue did not manifest itself thus far.
      
      Fixes: 47ed985e ("net: dsa: sja1105: Add logic for TX timestamping")
      Signed-off-by: NVladimir Oltean <olteanv@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9fcf024d
    • C
      sfc: avoid duplicate error handling code in 'efx_ef10_sriov_set_vf_mac()' · db99d512
      Christophe JAILLET 提交于
      'eth_zero_addr()' is already called in the error handling path. This is
      harmless, but there is no point in calling it twice, so remove one.
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db99d512
    • E
      tcp_cubic: refactor code to perform a divide only when needed · f278b99c
      Eric Dumazet 提交于
      Neal Cardwell suggested to not change ca->delay_min
      and apply the ack delay cushion only when Hystart ACK train
      is still under consideration. This should avoid a 64bit
      divide unless needed.
      
      Tested:
      
      40Gbit(mlx4) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
        14815
        16280
        15293
        15563
        11574
        15145
        14789
        18548
        16972
        12520
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       1396               0.0
      $ dmesg | tail -10
      [ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
      [ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
      [ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
      [ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
      [ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
      [ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
      [ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
      [ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
      [ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
      [ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152
      
      10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)
      
      $ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
      $ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
         7050
         7065
         7100
         6900
         7202
         7263
         7189
         6869
         7463
         7034
      TcpExtTCPHystartTrainDetect     10                 0.0
      TcpExtTCPHystartTrainCwnd       3199               0.0
      $ dmesg | tail -10
      [  176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
      [  179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
      [  181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
      [  183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
      [  185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
      [  187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
      [  190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
      [  192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
      [  194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
      [  196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399
      
      Fixes: 42f3a8aa ("tcp_cubic: tweak Hystart detection for short RTT flows")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NNeal Cardwell <ncardwell@google.com>
      Link: https://www.spinics.net/lists/netdev/msg621886.html
      Link: https://www.spinics.net/lists/netdev/msg621797.htmlAcked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f278b99c
    • R
      cxgb4/cxgb4vf: fix flow control display for auto negotiation · 0caeaf6a
      Rahul Lakkireddy 提交于
      As per 802.3-2005, Section Two, Annex 28B, Table 28B-2 [1], when
      _only_ Rx pause is enabled, both symmetric and asymmetric pause
      towards local device must be enabled. Also, firmware returns the local
      device's flow control pause params as part of advertised capabilities
      and negotiated params as part of current link attributes. So, fix up
      ethtool's flow control pause params fetch logic to read from acaps,
      instead of linkattr.
      
      [1] https://standards.ieee.org/standard/802_3-2005.html
      
      Fixes: c3168cab ("cxgb4/cxgbvf: Handle 32-bit fw port capabilities")
      Signed-off-by: NSurendra Mobiya <surendra@chelsio.com>
      Signed-off-by: NRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0caeaf6a
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · ba402810
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.
      
      2) Document ingress hook in netdevice, also from Lukas.
      
      3) Remove htons() in tunnel metadata port netlink attributes,
         from Xin Long.
      
      4) Missing erspan netlink attribute validation also from Xin Long.
      
      5) Missing erspan version in tunnel, from Xin Long.
      
      6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
         Patch from Xin Long.
      
      7) Missing nla_nest_cancel() in tunnel netlink dump path,
         from Xin Long.
      
      8) Remove two exported conntrack symbols with no clients,
         from Florian Westphal.
      
      9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.
      
      10) Add nft_meta_pkttype helper for loopback, also from Florian.
      
      11) Add nft_meta_socket uid helper, from Florian Westphal.
      
      12) Add nft_meta_cgroup helper, from Florian.
      
      13) Add nft_meta_ifkind helper, from Florian.
      
      14) Group all interface related meta selector, from Florian.
      
      15) Add nft_prandom_u32() helper, from Florian.
      
      16) Add nft_meta_rtclassid helper, from Florian.
      
      17) Add support for matching on the slave device index,
          from Florian.
      
      This batch, among other things, contains updates for the netfilter
      tunnel netlink interface: This extension is still incomplete and lacking
      proper userspace support which is actually my fault, I did not find the
      time to go back and finish this. This update is breaking tunnel UAPI in
      some aspects to fix it but do it better sooner than never.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba402810
  4. 30 12月, 2019 3 次提交
    • L
      Linux 5.5-rc4 · fd698849
      Linus Torvalds 提交于
      fd698849
    • D
      Merge branch 'mlxsw-fixes' · 3faf6eda
      David S. Miller 提交于
      Ido Schimmel says:
      
      ====================
      mlxsw: Couple of fixes
      
      This patch set contains two fixes for mlxsw. Please consider both for
      stable.
      
      Patch #1 from Amit fixes a wrong check during MAC validation when
      creating router interfaces (RIFs). Given a particular order of
      configuration this can result in the driver refusing to create new RIFs.
      
      Patch #2 fixes a wrong trap configuration in which VRRP packets and
      routing exceptions were policed by the same policer towards the CPU. In
      certain situations this can prevent VRRP packets from reaching the CPU.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3faf6eda
    • I
      mlxsw: spectrum: Use dedicated policer for VRRP packets · acca789a
      Ido Schimmel 提交于
      Currently, VRRP packets and packets that hit exceptions during routing
      (e.g., MTU error) are policed using the same policer towards the CPU.
      This means, for example, that misconfiguration of the MTU on a routed
      interface can prevent VRRP packets from reaching the CPU, which in turn
      can cause the VRRP daemon to assume it is the Master router.
      
      Fix this by using a dedicated policer for VRRP packets.
      
      Fixes: 11566d34 ("mlxsw: spectrum: Add VRRP traps")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reported-by: NAlex Veber <alexve@mellanox.com>
      Tested-by: NAlex Veber <alexve@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acca789a