1. 11 2月, 2021 13 次提交
    • P
      i40e: Add flow director support for IPv6 · efca91e8
      Przemyslaw Patynowski 提交于
      Flow director for IPv6 is not supported.
      1) Implementation of support for IPv6 flow director.
      2) Added handlers for addition of TCP6, UDP6, SCTP6, IPv6.
      3) Refactored legacy code to make it more generic.
      4) Added packet templates for TCP6, UDP6, SCTP6, IPv6.
      5) Added handling of IPv6 source and destination address for flow director.
      6) Improved argument passing for source and destination portin TCP6, UDP6
         and SCTP6.
      7) Added handling of ethtool -n for IPv6, TCP6,UDP6, SCTP6.
      8) Used correct bit flag regarding FLEXOFF field of flow director data
         descriptor.
      
      Without this patch, there would be no support for flow director on IPv6,
      TCP6, UDP6, SCTP6.
      Tested based on x710 datasheet by using:
      ethtool -N enp133s0f0 flow-type tcp4 src-port 13 dst-port 37 user-def 0x44142 action 1
      ethtool -N enp133s0f0 flow-type tcp6 src-port 13 dst-port 40 user-def 0x44142 action 2
      ethtool -N enp133s0f0 flow-type udp4 src-port 20 dst-port 40 user-def 0x44142 action 3
      ethtool -N enp133s0f0 flow-type udp6 src-port 25 dst-port 40 user-def 0x44142 action 4
      ethtool -N enp133s0f0 flow-type sctp4 src-port 55 dst-port 65 user-def 0x44142 action 5
      ethtool -N enp133s0f0 flow-type sctp6 src-port 60 dst-port 40 user-def 0x44142 action 6
      ethtool -N enp133s0f0 flow-type ip4 src-ip 1.1.1.1 dst-ip 1.1.1.4 user-def 0x44142 action 7
      ethtool -N enp133s0f0 flow-type ip6 src-ip fe80::3efd:feff:fe6f:bbbb dst-ip fe80::3efd:feff:fe6f:aaaa user-def 0x44142 action 8
      Then send traffic from client which matches the criteria provided to ethtool.
      Observe that packets are redirected to user set queues with ethtool -S <interface>
      Signed-off-by: NPrzemyslaw Patynowski <przemyslawx.patynowski@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      efca91e8
    • A
      i40e: Add EEE status getting & setting implementation · 95f352dc
      Aleksandr Loktionov 提交于
      Implement Energy Efficient Ethernet (EEE) status getting & setting.
      The i40e_get_eee() requesting PHY EEE capabilities from firmware.
      The i40e_set_eee() function requests PHY EEE capabilities
      from firmware and sets PHY EEE advertising to full abilities or 0
      depending whether EEE is to be enabled or disabled.
      Signed-off-by: NAleksandr Loktionov <aleksandr.loktionov@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      95f352dc
    • A
      i40e: Add netlink callbacks support for software based DCB · 5effa78e
      Arkadiusz Kubalewski 提交于
      Add callbacks used by software based LLDP agent, which allows to
      configure DCB feature from userspace.
      
      Update copyright dates as appropriate.
      
      If LLDP agent is turned off in BIOS, or after setting private flag
      ("disable-fw-lldp on"). The driver initialized DCB functionality with
      default values, one traffic class with 100% bandwidth allocated.
      
      The new netlink callbacks are required for software LLDP agent, it
      must be able to acquire current DCB configuration of a network port
      and apply DCB configuration changes, if required.
      Signed-off-by: NAleksandr Loktionov <aleksandr.loktionov@intel.com>
      Signed-off-by: NArkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      5effa78e
    • A
      i40e: Add init and default config of software based DCB · 4b208eaa
      Arkadiusz Kubalewski 提交于
      Add extra handling on changing the "disable-fw-lldp" private
      flag to properly initialize software based DCB feature.
      
      Add default configuration of DCB functionality when Firmware
      LLDP agent is turned off, in case of driver probe and device
      reset on reconfiguration.
      
      Update copyright dates as appropriate.
      
      Software based DCB is a brand-new feature in i40e driver.
      Before, DCB was implemented by Firmware LLDP agent only. The agent was
      responsible for handling incoming DCB-related LLDP frames and
      applying received DCB configuration to hardware.
      
      Default configuration and new initialization flow for software based
      DCB is required. If LLDP agent is turned off in BIOS, or after
      setting private flag ("disable-fw-lldp on"). The driver initializes
      DCB functionality with default values, one traffic class with 100%
      bandwidth allocated.
      Signed-off-by: NAleksandr Loktionov <aleksandr.loktionov@intel.com>
      Signed-off-by: NArkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      4b208eaa
    • A
      i40e: Add hardware configuration for software based DCB · 90bc8e00
      Arkadiusz Kubalewski 提交于
      Add registers and definitions required for applying
      DCB related hardware configuration.
      
      Add functions responsible for calculating and setting proper
      hardware configuration values for software based DCB functionality.
      
      Add function responsible for invoking Admin Queue command, which
      results in applying new DCB configuration to the hardware.
      
      Update copyright dates as appropriate.
      
      Software based DCB is a brand-new feature in i40e driver.
      Before, DCB was implemented by Firmware LLDP agent only. The agent was
      responsible for handling incoming DCB-related LLDP frames and
      applying received DCB configuration to hardware.
      
      New communication channel between software and hardware is required
      for software driver. It must be able to calculate and configure all
      the registers related for DCB feature.
      Signed-off-by: NAleksandr Loktionov <aleksandr.loktionov@intel.com>
      Signed-off-by: NArkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
      Tested-by: NTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: NTony Nguyen <anthony.l.nguyen@intel.com>
      90bc8e00
    • D
      dc9d8758
    • L
      Merge tag 'pm-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 291009f6
      Linus Torvalds 提交于
      Pull power management fixes from Rafael Wysocki:
       "Address a performance regression related to scale-invariance on x86
        that may prevent turbo CPU frequencies from being used in certain
        workloads on systems using acpi-cpufreq as the CPU performance scaling
        driver and schedutil as the scaling governor"
      
      * tag 'pm-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: ACPI: Update arch scale-invariance max perf ratio if CPPC is not there
        cpufreq: ACPI: Extend frequency tables to cover boost frequencies
      291009f6
    • L
      Merge tag 'acpi-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · a3961497
      Linus Torvalds 提交于
      Pull ACPI fix from Rafael Wysocki:
       "Revert a problematic ACPICA commit that changed the code to attempt to
        update memory regions which may be read-only on some systems (Ard
        Biesheuvel)"
      
      * tag 'acpi-5.11-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "ACPICA: Interpreter: fix memory leak by using existing buffer"
      a3961497
    • L
      Merge tag 'dmaengine-fix2-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · 708c2e41
      Linus Torvalds 提交于
      Pull dmaengine fixes from Vinod Koul:
       "Some late fixes for dmaengine:
      
        Core:
         - fix channel device_node deletion
      
        Driver fixes:
         - dw: revert of runtime pm enabling
         - idxd: device state fix, interrupt completion and list corruption
         - ti: resource leak
      
      * tag 'dmaengine-fix2-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
        dmaengine dw: Revert "dmaengine: dw: Enable runtime PM"
        dmaengine: idxd: check device state before issue command
        dmaengine: ti: k3-udma: Fix a resource leak in an error handling path
        dmaengine: move channel device_node deletion to driver
        dmaengine: idxd: fix misc interrupt completion
        dmaengine: idxd: Fix list corruption in description completion
      708c2e41
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 6016bf19
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
       "Another pile of networing fixes:
      
         1) ath9k build error fix from Arnd Bergmann
      
         2) dma memory leak fix in mediatec driver from Lorenzo Bianconi.
      
         3) bpf int3 kprobe fix from Alexei Starovoitov.
      
         4) bpf stackmap integer overflow fix from Bui Quang Minh.
      
         5) Add usb device ids for Cinterion MV31 to qmi_qwwan driver, from
            Christoph Schemmel.
      
         6) Don't update deleted entry in xt_recent netfilter module, from
            Jazsef Kadlecsik.
      
         7) Use after free in nftables, fix from Pablo Neira Ayuso.
      
         8) Header checksum fix in flowtable from Sven Auhagen.
      
         9) Validate user controlled length in qrtr code, from Sabyrzhan
            Tasbolatov.
      
        10) Fix race in xen/netback, from Juergen Gross,
      
        11) New device ID in cxgb4, from Raju Rangoju.
      
        12) Fix ring locking in rxrpc release call, from David Howells.
      
        13) Don't return LAPB error codes from x25_open(), from Xie He.
      
        14) Missing error returns in gsi_channel_setup() from Alex Elder.
      
        15) Get skb_copy_and_csum_datagram working properly with odd segment
            sizes, from Willem de Bruijn.
      
        16) Missing RFS/RSS table init in enetc driver, from Vladimir Oltean.
      
        17) Do teardown on probe failure in DSA, from Vladimir Oltean.
      
        18) Fix compilation failures of txtimestamp selftest, from Vadim
            Fedorenko.
      
        19) Limit rx per-napi gro queue size to fix latency regression, from
            Eric Dumazet.
      
        20) dpaa_eth xdp fixes from Camelia Groza.
      
        21) Missing txq mode update when switching CBS off, in stmmac driver,
            from Mohammad Athari Bin Ismail.
      
        22) Failover pending logic fix in ibmvnic driver, from Sukadev
            Bhattiprolu.
      
        23) Null deref fix in vmw_vsock, from Norbert Slusarek.
      
        24) Missing verdict update in xdp paths of ena driver, from Shay
            Agroskin.
      
        25) seq_file iteration fix in sctp from Neil Brown.
      
        26) bpf 32-bit src register truncation fix on div/mod, from Daniel
            Borkmann.
      
        27) Fix jmp32 pruning in bpf verifier, from Daniel Borkmann.
      
        28) Fix locking in vsock_shutdown(), from Stefano Garzarella.
      
        29) Various missing index bound checks in hns3 driver, from Yufeng Mo.
      
        30) Flush ports on .phylink_mac_link_down() in dsa felix driver, from
            Vladimir Oltean.
      
        31) Don't mix up stp and mrp port states in bridge layer, from Horatiu
            Vultur.
      
        32) Fix locking during netif_tx_disable(), from Edwin Peer"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
        bpf: Fix 32 bit src register truncation on div/mod
        bpf: Fix verifier jmp32 pruning decision logic
        bpf: Fix verifier jsgt branch analysis on max bound
        vsock: fix locking in vsock_shutdown()
        net: hns3: add a check for index in hclge_get_rss_key()
        net: hns3: add a check for tqp_index in hclge_get_ring_chain_from_mbx()
        net: hns3: add a check for queue_id in hclge_reset_vf_queue()
        net: dsa: felix: implement port flushing on .phylink_mac_link_down
        switchdev: mrp: Remove SWITCHDEV_ATTR_ID_MRP_PORT_STAT
        bridge: mrp: Fix the usage of br_mrp_port_switchdev_set_state
        net: watchdog: hold device global xmit lock during tx disable
        netfilter: nftables: relax check for stateful expressions in set definition
        netfilter: conntrack: skip identical origin tuple in same zone only
        vsock/virtio: update credit only if socket is not closed
        net: fix iteration for sctp transport seq_files
        net: ena: Update XDP verdict upon failure
        net/vmw_vsock: improve locking in vsock_connect_timeout()
        net/vmw_vsock: fix NULL pointer dereference
        ibmvnic: Clear failover_pending if unable to schedule
        net: stmmac: set TxQ mode back to DCB after disabling CBS
        ...
      6016bf19
    • L
      Merge branch 'akpm' (patches from Andrew) · 4b16b656
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "14 patches.
      
        Subsystems affected by this patch series: mm (kasan, mremap, tmpfs,
        selftests, memcg, and slub), MAINTAINERS, squashfs, nilfs2, and
        firmware"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        nilfs2: make splice write available again
        mm, slub: better heuristic for number of cpus when calculating slab order
        Revert "mm: memcontrol: avoid workload stalls when lowering memory.high"
        MAINTAINERS: update Andrey Ryabinin's email address
        selftests/vm: rename file run_vmtests to run_vmtests.sh
        tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha
        tmpfs: disallow CONFIG_TMPFS_INODE64 on s390
        mm/mremap: fix BUILD_BUG_ON() error in get_extent
        firmware_loader: align .builtin_fw to 8
        kasan: fix stack traces dependency for HW_TAGS
        squashfs: add more sanity checks in xattr id lookup
        squashfs: add more sanity checks in inode lookup
        squashfs: add more sanity checks in id lookup
        squashfs: avoid out of bounds writes in decompressors
      4b16b656
    • J
      nilfs2: make splice write available again · a35d8f01
      Joachim Henke 提交于
      Since 5.10, splice() or sendfile() to NILFS2 return EINVAL.  This was
      caused by commit 36e2c742 ("fs: don't allow splice read/write
      without explicit ops").
      
      This patch initializes the splice_write field in file_operations, like
      most file systems do, to restore the functionality.
      
      Link: https://lkml.kernel.org/r/1612784101-14353-1-git-send-email-konishi.ryusuke@gmail.comSigned-off-by: NJoachim Henke <joachim.henke@t-systems.com>
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@gmail.com>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a35d8f01
    • V
      mm, slub: better heuristic for number of cpus when calculating slab order · 3286222f
      Vlastimil Babka 提交于
      When creating a new kmem cache, SLUB determines how large the slab pages
      will based on number of inputs, including the number of CPUs in the
      system.  Larger slab pages mean that more objects can be allocated/free
      from per-cpu slabs before accessing shared structures, but also
      potentially more memory can be wasted due to low slab usage and
      fragmentation.  The rough idea of using number of CPUs is that larger
      systems will be more likely to benefit from reduced contention, and also
      should have enough memory to spare.
      
      Number of CPUs used to be determined as nr_cpu_ids, which is number of
      possible cpus, but on some systems many will never be onlined, thus
      commit 045ab8c9 ("mm/slub: let number of online CPUs determine the
      slub page order") changed it to nr_online_cpus().  However, for kmem
      caches created early before CPUs are onlined, this may lead to
      permamently low slab page sizes.
      
      Vincent reports a regression [1] of hackbench on arm64 systems:
      
        "I'm facing significant performances regression on a large arm64
         server system (224 CPUs). Regressions is also present on small arm64
         system (8 CPUs) but in a far smaller order of magnitude
      
         On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
         v5.11-rc4 : 9.135sec (+/- 0.45%)
         v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
         v5.10: 3.136sec (+/- 0.40%)"
      
      Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
      page allocator contention:
      
        "i.e. the patch incurs a 7% to 32% performance penalty. This bisected
         cleanly yesterday when I was looking for the regression and then
         found the thread.
      
         Numerous caches change size. For example, kmalloc-512 goes from
         order-0 (vanilla) to order-2 with the revert.
      
         So mostly this is down to the number of times SLUB calls into the
         page allocator which only caches order-0 pages on a per-cpu basis"
      
      Clearly num_online_cpus() doesn't work too early in bootup.  We could
      change the order dynamically in a memory hotplug callback, but runtime
      order changing for existing kmem caches has been already shown as
      dangerous, and removed in 32a6f409 ("mm, slub: remove runtime
      allocation order changes").
      
      It could be resurrected in a safe manner with some effort, but to fix
      the regression we need something simpler.
      
      We could use num_present_cpus() that should be the number of physically
      present CPUs even before they are onlined.  That would work for PowerPC
      [3], which triggered the original commit, but that still doesn't work on
      arm64 [4] as explained in [5].
      
      So this patch tries to determine the best available value without
      specific arch knowledge.
      
       - num_present_cpus() if the number is larger than 1, as that means the
         arch is likely setting it properly
      
       - nr_cpu_ids otherwise
      
      This should fix the reported regressions while also keeping the effect
      of 045ab8c9 for PowerPC systems.  It's possible there are
      configurations where num_present_cpus() is 1 during boot while
      nr_cpu_ids is at the same time bloated, so these (if they exist) would
      keep the large orders based on nr_cpu_ids as was before 045ab8c9.
      
      [1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
      [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
      [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
      [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
      [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/
      
      Link: https://lkml.kernel.org/r/20210208134108.22286-1-vbabka@suse.cz
      Fixes: 045ab8c9 ("mm/slub: let number of online CPUs determine the slub page order")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Reported-by: NMel Gorman <mgorman@techsingularity.net>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Tested-by: NVincent Guittot <vincent.guittot@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3286222f
  2. 10 2月, 2021 27 次提交
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · b8776f14
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2021-02-10
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 5 non-merge commits during the last 8 day(s) which contain
      a total of 3 files changed, 22 insertions(+), 21 deletions(-).
      
      The main changes are:
      
      1) Fix missed execution of kprobes BPF progs when kprobe is firing via
         int3, from Alexei Starovoitov.
      
      2) Fix potential integer overflow in map max_entries for stackmap on
         32 bit archs, from Bui Quang Minh.
      
      3) Fix a verifier pruning and a insn rewrite issue related to 32 bit ops,
         from Daniel Borkmann.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c# Please enter a commit message to explain why this merge is necessary,
      b8776f14
    • J
      Revert "mm: memcontrol: avoid workload stalls when lowering memory.high" · e82553c1
      Johannes Weiner 提交于
      This reverts commit 536d3bf2, as it can
      cause writers to memory.high to get stuck in the kernel forever,
      performing page reclaim and consuming excessive amounts of CPU cycles.
      
      Before the patch, a write to memory.high would first put the new limit
      in place for the workload, and then reclaim the requested delta.  After
      the patch, the kernel tries to reclaim the delta before putting the new
      limit into place, in order to not overwhelm the workload with a sudden,
      large excess over the limit.  However, if reclaim is actively racing
      with new allocations from the uncurbed workload, it can keep the write()
      working inside the kernel indefinitely.
      
      This is causing problems in Facebook production.  A privileged
      system-level daemon that adjusts memory.high for various workloads
      running on a host can get unexpectedly stuck in the kernel and
      essentially turn into a sort of involuntary kswapd for one of the
      workloads.  We've observed that daemon busy-spin in a write() for
      minutes at a time, neglecting its other duties on the system, and
      expending privileged system resources on behalf of a workload.
      
      To remedy this, we have first considered changing the reclaim logic to
      break out after a couple of loops - whether the workload has converged
      to the new limit or not - and bound the write() call this way.  However,
      the root cause that inspired the sequence change in the first place has
      been fixed through other means, and so a revert back to the proven
      limit-setting sequence, also used by memory.max, is preferable.
      
      The sequence was changed to avoid extreme latencies in the workload when
      the limit was lowered: the sudden, large excess created by the limit
      lowering would erroneously trigger the penalty sleeping code that is
      meant to throttle excessive growth from below.  Allocating threads could
      end up sleeping long after the write() had already reclaimed the delta
      for which they were being punished.
      
      However, erroneous throttling also caused problems in other scenarios at
      around the same time.  This resulted in commit b3ff9291 ("mm, memcg:
      reclaim more aggressively before high allocator throttling"), included
      in the same release as the offending commit.  When allocating threads
      now encounter large excess caused by a racing write() to memory.high,
      instead of entering punitive sleeps, they will simply be tasked with
      helping reclaim down the excess, and will be held no longer than it
      takes to accomplish that.  This is in line with regular limit
      enforcement - i.e.  if the workload allocates up against or over an
      otherwise unchanged limit from below.
      
      With the patch breaking userspace, and the root cause addressed by other
      means already, revert it again.
      
      Link: https://lkml.kernel.org/r/20210122184341.292461-1-hannes@cmpxchg.org
      Fixes: 536d3bf2 ("mm: memcontrol: avoid workload stalls when lowering memory.high")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChris Down <chris@chrisdown.name>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e82553c1
    • A
      a0c2eb0a
    • R
      selftests/vm: rename file run_vmtests to run_vmtests.sh · d52db800
      Rong Chen 提交于
      Commit c2aa8afc has renamed run_vmtests in Makefile, but the file
      still uses the old name.
      
      The kernel test robot reported the following issue:
      
        # selftests: vm: run_vmtests.sh
        # Warning: file run_vmtests.sh is missing!
        not ok 1 selftests: vm: run_vmtests.sh
      
      Link: https://lkml.kernel.org/r/20210205085507.1479894-1-rong.a.chen@intel.com
      Fixes: c2aa8afc (selftests/vm: rename run_vmtests --> run_vmtests.sh)
      Signed-off-by: NRong Chen <rong.a.chen@intel.com>
      Reported-by: Nkernel test robot <lkp@intel.com>
      Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d52db800
    • S
      tmpfs: disallow CONFIG_TMPFS_INODE64 on alpha · ad69c389
      Seth Forshee 提交于
      As with s390, alpha is a 64-bit architecture with a 32-bit ino_t.  With
      CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and
      display "inode64" in the mount options, whereas passing "inode64" in the
      mount options will fail.  This leads to erroneous behaviours such as
      this:
      
        # mkdir mnt
        # mount -t tmpfs nodev mnt
        # mount -o remount,rw mnt
        mount: /home/ubuntu/mnt: mount point not mounted or bad option.
      
      Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.
      
      Link: https://lkml.kernel.org/r/20210208215726.608197-1-seth.forshee@canonical.com
      Fixes: ea3271f7 ("tmpfs: support 64-bit inums per-sb")
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad69c389
    • S
      tmpfs: disallow CONFIG_TMPFS_INODE64 on s390 · b85a7a8b
      Seth Forshee 提交于
      Currently there is an assumption in tmpfs that 64-bit architectures also
      have a 64-bit ino_t.  This is not true on s390 which has a 32-bit ino_t.
      With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers
      and display "inode64" in the mount options, but passing the "inode64"
      mount option will fail.  This leads to the following behavior:
      
        # mkdir mnt
        # mount -t tmpfs nodev mnt
        # mount -o remount,rw mnt
        mount: /home/ubuntu/mnt: mount point not mounted or bad option.
      
      As mount sees "inode64" in the mount options and thus passes it in the
      options for the remount.
      
      So prevent CONFIG_TMPFS_INODE64 from being selected on s390.
      
      Link: https://lkml.kernel.org/r/20210205230620.518245-1-seth.forshee@canonical.com
      Fixes: ea3271f7 ("tmpfs: support 64-bit inums per-sb")
      Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Chris Down <chris@chrisdown.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b85a7a8b
    • A
      mm/mremap: fix BUILD_BUG_ON() error in get_extent · a30a2909
      Arnd Bergmann 提交于
      clang can't evaluate this function argument at compile time when the
      function is not inlined, which leads to a link time failure:
      
        ld.lld: error: undefined symbol: __compiletime_assert_414
        >>> referenced by mremap.c
        >>>               mremap.o:(get_extent) in archive mm/built-in.a
      
      Mark the function as __always_inline to avoid it.
      
      Link: https://lkml.kernel.org/r/20201230154104.522605-1-arnd@kernel.org
      Fixes: 9ad9718b ("mm/mremap: calculate extent in one place")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Tested-by: NNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: NNathan Chancellor <natechancellor@gmail.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Cc: Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a30a2909
    • F
      firmware_loader: align .builtin_fw to 8 · 793f49a8
      Fangrui Song 提交于
      arm64 references the start address of .builtin_fw (__start_builtin_fw)
      with a pair of R_AARCH64_ADR_PREL_PG_HI21/R_AARCH64_LDST64_ABS_LO12_NC
      relocations.  The compiler is allowed to emit the
      R_AARCH64_LDST64_ABS_LO12_NC relocation because struct builtin_fw in
      include/linux/firmware.h is 8-byte aligned.
      
      The R_AARCH64_LDST64_ABS_LO12_NC relocation requires the address to be a
      multiple of 8, which may not be the case if .builtin_fw is empty.
      Unconditionally align .builtin_fw to fix the linker error.  32-bit
      architectures could use ALIGN(4) but that would add unnecessary
      complexity, so just use ALIGN(8).
      
      Link: https://lkml.kernel.org/r/20201208054646.2913063-1-maskray@google.com
      Link: https://github.com/ClangBuiltLinux/linux/issues/1204
      Fixes: 5658c769 ("firmware: allow firmware files to be built into kernel image")
      Signed-off-by: NFangrui Song <maskray@google.com>
      Reported-by: Nkernel test robot <lkp@intel.com>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Tested-by: NNick Desaulniers <ndesaulniers@google.com>
      Tested-by: NDouglas Anderson <dianders@chromium.org>
      Acked-by: NNathan Chancellor <nathan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      793f49a8
    • A
      kasan: fix stack traces dependency for HW_TAGS · 1cc4cdb5
      Andrey Konovalov 提交于
      Currently, whether the alloc/free stack traces collection is enabled by
      default for hardware tag-based KASAN depends on CONFIG_DEBUG_KERNEL.
      The intention for this dependency was to only enable collection on slow
      debug kernels due to a significant perf and memory impact.
      
      As it turns out, CONFIG_DEBUG_KERNEL is not considered a debug option
      and is enabled on many productions kernels including Android and Ubuntu.
      As the result, this dependency is pointless and only complicates the
      code and documentation.
      
      Having stack traces collection disabled by default would make the
      hardware mode work differently to to the software ones, which is
      confusing.
      
      This change removes the dependency and enables stack traces collection
      by default.
      
      Looking into the future, this default might makes sense for production
      kernels, assuming we implement a fast stack trace collection approach.
      
      Link: https://lkml.kernel.org/r/6678d77ceffb71f1cff2cf61560e2ffe7bb6bfe9.1612808820.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NMarco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1cc4cdb5
    • P
      squashfs: add more sanity checks in xattr id lookup · 506220d2
      Phillip Lougher 提交于
      Sysbot has reported a warning where a kmalloc() attempt exceeds the
      maximum limit.  This has been identified as corruption of the xattr_ids
      count when reading the xattr id lookup table.
      
      This patch adds a number of additional sanity checks to detect this
      corruption and others.
      
      1. It checks for a corrupted xattr index read from the inode.  This could
         be because the metadata block is uncompressed, or because the
         "compression" bit has been corrupted (turning a compressed block
         into an uncompressed block).  This would cause an out of bounds read.
      
      2. It checks against corruption of the xattr_ids count.  This can either
         lead to the above kmalloc failure, or a smaller than expected
         table to be read.
      
      3. It checks the contents of the index table for corruption.
      
      [phillip@squashfs.org.uk: fix checkpatch issue]
        Link: https://lkml.kernel.org/r/270245655.754655.1612770082682@webmail.123-reg.co.uk
      
      Link: https://lkml.kernel.org/r/20210204130249.4495-5-phillip@squashfs.org.ukSigned-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reported-by: syzbot+2ccea6339d368360800d@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      506220d2
    • P
      squashfs: add more sanity checks in inode lookup · eabac19e
      Phillip Lougher 提交于
      Sysbot has reported an "slab-out-of-bounds read" error which has been
      identified as being caused by a corrupted "ino_num" value read from the
      inode.  This could be because the metadata block is uncompressed, or
      because the "compression" bit has been corrupted (turning a compressed
      block into an uncompressed block).
      
      This patch adds additional sanity checks to detect this, and the
      following corruption.
      
      1. It checks against corruption of the inodes count.  This can either
         lead to a larger table to be read, or a smaller than expected
         table to be read.
      
         In the case of a too large inodes count, this would often have been
         trapped by the existing sanity checks, but this patch introduces
         a more exact check, which can identify too small values.
      
      2. It checks the contents of the index table for corruption.
      
      [phillip@squashfs.org.uk: fix checkpatch issue]
        Link: https://lkml.kernel.org/r/527909353.754618.1612769948607@webmail.123-reg.co.uk
      
      Link: https://lkml.kernel.org/r/20210204130249.4495-4-phillip@squashfs.org.ukSigned-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reported-by: syzbot+04419e3ff19d2970ea28@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eabac19e
    • P
      squashfs: add more sanity checks in id lookup · f37aa4c7
      Phillip Lougher 提交于
      Sysbot has reported a number of "slab-out-of-bounds reads" and
      "use-after-free read" errors which has been identified as being caused
      by a corrupted index value read from the inode.  This could be because
      the metadata block is uncompressed, or because the "compression" bit has
      been corrupted (turning a compressed block into an uncompressed block).
      
      This patch adds additional sanity checks to detect this, and the
      following corruption.
      
      1. It checks against corruption of the ids count.  This can either
         lead to a larger table to be read, or a smaller than expected
         table to be read.
      
         In the case of a too large ids count, this would often have been
         trapped by the existing sanity checks, but this patch introduces
         a more exact check, which can identify too small values.
      
      2. It checks the contents of the index table for corruption.
      
      Link: https://lkml.kernel.org/r/20210204130249.4495-3-phillip@squashfs.org.ukSigned-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Reported-by: syzbot+b06d57ba83f604522af2@syzkaller.appspotmail.com
      Reported-by: syzbot+c021ba012da41ee9807c@syzkaller.appspotmail.com
      Reported-by: syzbot+5024636e8b5fd19f0f19@syzkaller.appspotmail.com
      Reported-by: syzbot+bcbc661df46657d0fa4f@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f37aa4c7
    • P
      squashfs: avoid out of bounds writes in decompressors · e812cbbb
      Phillip Lougher 提交于
      Patch series "Squashfs: fix BIO migration regression and add sanity checks".
      
      Patch [1/4] fixes a regression introduced by the "migrate from
      ll_rw_block usage to BIO" patch, which has produced a number of
      Sysbot/Syzkaller reports.
      
      Patches [2/4], [3/4], and [4/4] fix a number of filesystem corruption
      issues which have produced Sysbot reports in the id, inode and xattr
      lookup code.
      
      Each patch has been tested against the Sysbot reproducers using the
      given kernel configuration.  They have the appropriate "Reported-by:"
      lines added.
      
      Additionally, all of the reproducer filesystems are indirectly fixed by
      patch [4/4] due to the fact they all have xattr corruption which is now
      detected there.
      
      Additional testing with other configurations and architectures (32bit,
      big endian), and normal filesystems has also been done to trap any
      inadvertent regressions caused by the additional sanity checks.
      
      This patch (of 4):
      
      This is a regression introduced by the patch "migrate from ll_rw_block
      usage to BIO".
      
      Sysbot/Syskaller has reported a number of "out of bounds writes" and
      "unable to handle kernel paging request in squashfs_decompress" errors
      which have been identified as a regression introduced by the above
      patch.
      
      Specifically, the patch removed the following sanity check
      
              if (length < 0 || length > output->length ||
      		(index + length) > msblk->bytes_used)
      
      This check did two things:
      
      1. It ensured any reads were not beyond the end of the filesystem
      
      2. It ensured that the "length" field read from the filesystem
         was within the expected maximum length.  Without this any
         corrupted values can over-run allocated buffers.
      
      Link: https://lkml.kernel.org/r/20210204130249.4495-1-phillip@squashfs.org.uk
      Link: https://lkml.kernel.org/r/20210204130249.4495-2-phillip@squashfs.org.uk
      Fixes: 93e72b3c ("squashfs: migrate from ll_rw_block usage to BIO")
      Reported-by: syzbot+6fba78f99b9afd4b5634@syzkaller.appspotmail.com
      Signed-off-by: NPhillip Lougher <phillip@squashfs.org.uk>
      Cc: Philippe Liard <pliard@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e812cbbb
    • L
      Merge tag 'i3c/fixes-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux · ef7d0b59
      Linus Torvalds 提交于
      Pull i3c fix from Alexandre Belloni:
       "A single build warning fix"
      
      * tag 'i3c/fixes-for-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
        i3c/master/mipi-i3c-hci: Fix position of __maybe_unused in i3c_hci_of_match
      ef7d0b59
    • D
      bpf: Fix 32 bit src register truncation on div/mod · e88b2c6e
      Daniel Borkmann 提交于
      While reviewing a different fix, John and I noticed an oddity in one of the
      BPF program dumps that stood out, for example:
      
        # bpftool p d x i 13
         0: (b7) r0 = 808464450
         1: (b4) w4 = 808464432
         2: (bc) w0 = w0
         3: (15) if r0 == 0x0 goto pc+1
         4: (9c) w4 %= w0
        [...]
      
      In line 2 we noticed that the mov32 would 32 bit truncate the original src
      register for the div/mod operation. While for the two operations the dst
      register is typically marked unknown e.g. from adjust_scalar_min_max_vals()
      the src register is not, and thus verifier keeps tracking original bounds,
      simplified:
      
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (b7) r0 = -1
        1: R0_w=invP-1 R1=ctx(id=0,off=0,imm=0) R10=fp0
        1: (b7) r1 = -1
        2: R0_w=invP-1 R1_w=invP-1 R10=fp0
        2: (3c) w0 /= w1
        3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP-1 R10=fp0
        3: (77) r1 >>= 32
        4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP4294967295 R10=fp0
        4: (bf) r0 = r1
        5: R0_w=invP4294967295 R1_w=invP4294967295 R10=fp0
        5: (95) exit
        processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
      
      Runtime result of r0 at exit is 0 instead of expected -1. Remove the
      verifier mov32 src rewrite in div/mod and replace it with a jmp32 test
      instead. After the fix, we result in the following code generation when
      having dividend r1 and divisor r6:
      
        div, 64 bit:                             div, 32 bit:
      
         0: (b7) r6 = 8                           0: (b7) r6 = 8
         1: (b7) r1 = 8                           1: (b7) r1 = 8
         2: (55) if r6 != 0x0 goto pc+2           2: (56) if w6 != 0x0 goto pc+2
         3: (ac) w1 ^= w1                         3: (ac) w1 ^= w1
         4: (05) goto pc+1                        4: (05) goto pc+1
         5: (3f) r1 /= r6                         5: (3c) w1 /= w6
         6: (b7) r0 = 0                           6: (b7) r0 = 0
         7: (95) exit                             7: (95) exit
      
        mod, 64 bit:                             mod, 32 bit:
      
         0: (b7) r6 = 8                           0: (b7) r6 = 8
         1: (b7) r1 = 8                           1: (b7) r1 = 8
         2: (15) if r6 == 0x0 goto pc+1           2: (16) if w6 == 0x0 goto pc+1
         3: (9f) r1 %= r6                         3: (9c) w1 %= w6
         4: (b7) r0 = 0                           4: (b7) r0 = 0
         5: (95) exit                             5: (95) exit
      
      x86 in particular can throw a 'divide error' exception for div
      instruction not only for divisor being zero, but also for the case
      when the quotient is too large for the designated register. For the
      edx:eax and rdx:rax dividend pair it is not an issue in x86 BPF JIT
      since we always zero edx (rdx). Hence really the only protection
      needed is against divisor being zero.
      
      Fixes: 68fda450 ("bpf: fix 32-bit divide by zero")
      Co-developed-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      e88b2c6e
    • D
      bpf: Fix verifier jmp32 pruning decision logic · fd675184
      Daniel Borkmann 提交于
      Anatoly has been fuzzing with kBdysch harness and reported a hang in
      one of the outcomes:
      
        func#0 @0
        0: R1=ctx(id=0,off=0,imm=0) R10=fp0
        0: (b7) r0 = 808464450
        1: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R10=fp0
        1: (b4) w4 = 808464432
        2: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP808464432 R10=fp0
        2: (9c) w4 %= w0
        3: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
        3: (66) if w4 s> 0x30303030 goto pc+0
         R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
        4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
        4: (7f) r0 >>= r0
        5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff),s32_max_value=808464432) R10=fp0
        5: (9c) w4 %= w0
        6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        6: (66) if w0 s> 0x3030 goto pc+0
         R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
        7: (d6) if w0 s<= 0x303030 goto pc+1
        9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
        9: (95) exit
        propagating r0
      
        from 6 to 7: safe
        4: R0_w=invP808464450 R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
        4: (7f) r0 >>= r0
        5: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0,umin_value=808464433,umax_value=2147483647,var_off=(0x0; 0x7fffffff)) R10=fp0
        5: (9c) w4 %= w0
        6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        6: (66) if w0 s> 0x3030 goto pc+0
         R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        propagating r0
        7: safe
        propagating r0
      
        from 6 to 7: safe
        processed 15 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
      
      The underlying program was xlated as follows:
      
        # bpftool p d x i 10
         0: (b7) r0 = 808464450
         1: (b4) w4 = 808464432
         2: (bc) w0 = w0
         3: (15) if r0 == 0x0 goto pc+1
         4: (9c) w4 %= w0
         5: (66) if w4 s> 0x30303030 goto pc+0
         6: (7f) r0 >>= r0
         7: (bc) w0 = w0
         8: (15) if r0 == 0x0 goto pc+1
         9: (9c) w4 %= w0
        10: (66) if w0 s> 0x3030 goto pc+0
        11: (d6) if w0 s<= 0x303030 goto pc+1
        12: (05) goto pc-1
        13: (95) exit
      
      The verifier rewrote original instructions it recognized as dead code with
      'goto pc-1', but reality differs from verifier simulation in that we are
      actually able to trigger a hang due to hitting the 'goto pc-1' instructions.
      
      Taking a closer look at the verifier analysis, the reason is that it misjudges
      its pruning decision at the first 'from 6 to 7: safe' occasion. What happens
      is that while both old/cur registers are marked as precise, they get misjudged
      for the jmp32 case as range_within() yields true, meaning that the prior
      verification path with a wider register bound could be verified successfully
      and therefore the current path with a narrower register bound is deemed safe
      as well whereas in reality it's not. R0 old/cur path's bounds compare as
      follows:
      
        old: smin_value=0x8000000000000000,smax_value=0x7fffffffffffffff,umin_value=0x0,umax_value=0xffffffffffffffff,var_off=(0x0; 0xffffffffffffffff)
        cur: smin_value=0x8000000000000000,smax_value=0x7fffffff7fffffff,umin_value=0x0,umax_value=0xffffffff7fffffff,var_off=(0x0; 0xffffffff7fffffff)
      
        old: s32_min_value=0x80000000,s32_max_value=0x00003030,u32_min_value=0x00000000,u32_max_value=0xffffffff
        cur: s32_min_value=0x00003031,s32_max_value=0x7fffffff,u32_min_value=0x00003031,u32_max_value=0x7fffffff
      
      The 64 bit bounds generally look okay and while the information that got
      propagated from 32 to 64 bit looks correct as well, it's not precise enough
      for judging a conditional jmp32. Given the latter only operates on subregisters
      we also need to take these into account as well for a range_within() probe
      in order to be able to prune paths. Extending the range_within() constraint
      to both bounds will be able to tell us that the old signed 32 bit bounds are
      not wider than the cur signed 32 bit bounds.
      
      With the fix in place, the program will now verify the 'goto' branch case as
      it should have been:
      
        [...]
        6: R0_w=invP(id=0) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        6: (66) if w0 s> 0x3030 goto pc+0
         R0_w=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        7: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
        7: (d6) if w0 s<= 0x303030 goto pc+1
        9: R0=invP(id=0,s32_max_value=12336) R1=ctx(id=0,off=0,imm=0) R4=invP(id=0) R10=fp0
        9: (95) exit
      
        7: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=12337,u32_min_value=12337,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        7: (d6) if w0 s<= 0x303030 goto pc+1
         R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        8: R0_w=invP(id=0,smax_value=9223372034707292159,umax_value=18446744071562067967,var_off=(0x0; 0xffffffff7fffffff),s32_min_value=3158065,u32_min_value=3158065,u32_max_value=2147483647) R1=ctx(id=0,off=0,imm=0) R4_w=invP(id=0) R10=fp0
        8: (30) r0 = *(u8 *)skb[808464432]
        BPF_LD_[ABS|IND] uses reserved fields
        processed 11 insns (limit 1000000) max_states_per_insn 1 total_states 1 peak_states 1 mark_read 1
      
      The bug is quite subtle in the sense that when verifier would determine that
      a given branch is dead code, it would (here: wrongly) remove these instructions
      from the program and hard-wire the taken branch for privileged programs instead
      of the 'goto pc-1' rewrites which will cause hard to debug problems.
      
      Fixes: 3f50f132 ("bpf: Verifier, do explicit ALU32 bounds tracking")
      Reported-by: NAnatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      fd675184
    • D
      bpf: Fix verifier jsgt branch analysis on max bound · ee114dd6
      Daniel Borkmann 提交于
      Fix incorrect is_branch{32,64}_taken() analysis for the jsgt case. The return
      code for both will tell the caller whether a given conditional jump is taken
      or not, e.g. 1 means branch will be taken [for the involved registers] and the
      goto target will be executed, 0 means branch will not be taken and instead we
      fall-through to the next insn, and last but not least a -1 denotes that it is
      not known at verification time whether a branch will be taken or not. Now while
      the jsgt has the branch-taken case correct with reg->s32_min_value > sval, the
      branch-not-taken case is off-by-one when testing for reg->s32_max_value < sval
      since the branch will also be taken for reg->s32_max_value == sval. The jgt
      branch analysis, for example, gets this right.
      
      Fixes: 3f50f132 ("bpf: Verifier, do explicit ALU32 bounds tracking")
      Fixes: 4f7b3e82 ("bpf: improve verifier branch analysis")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      ee114dd6
    • D
      Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · de1db4a6
      David S. Miller 提交于
      Tony Nguyen says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2021-02-08
      
      This series contains updates to i40e driver only.
      
      Cristian makes improvements to driver XDP path. Avoids writing
      next-to-clean pointer on every update, removes redundant updates of
      cleaned_count and buffer info, creates a helper function to consolidate
      XDP actions and simplifies some of the behavior.
      
      Eryk adds messages to inform the user when MTU is larger than supported
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de1db4a6
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 450bbc33
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) nf_conntrack_tuple_taken() needs to recheck zone for
         NAT clash resolution, from Florian Westphal.
      
      2) Restore support for stateful expressions when set definition
         specifies no stateful expressions.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      450bbc33
    • D
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 74784ee0
      David S. Miller 提交于
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-02-08
      
      This series contains updates to the ice driver and documentation.
      
      Brett adds a log message when a trusted VF goes in and out of promiscuous
      for consistency with i40e driver.
      
      Dave implements a new LLDP command that allows adding VSI destinations to
      existing filters and adds support for netdev bonding events, current
      support is software based.
      
      Michal refactors code to move from VSI stored xsk_buff_pools to
      netdev-provided ones.
      
      Kiran implements the creation scheduler aggregator nodes and distributing
      VSIs within the nodes.
      
      Ben modifies rate limit calculations to use clock frequency from the
      hardware instead of using a hardcoded one.
      
      Jesse adds support for user to control writeback frequency.
      
      Chinh refactors DCB variables out of the ice_port_info struct.
      
      Bruce removes some unnecessary casting.
      
      Mitch fixes an error message that was reported as if_up instead of if_down.
      
      Tony adjusts fallback allocation for MSI-X to use all given vectors instead
      of using only the minimum configuration and updates documentation for
      the ice driver.
      ====================
      Reviewed-by: NAlexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74784ee0
    • D
      Merge branch 'hns3-cleanups' · 3e566dac
      David S. Miller 提交于
      Huazhong Tan says:
      
      ====================
      net: hns3: some cleanups for -next
      
      There are some cleanups for the HNS3 ethernet driver.
      
      change log:
      V2: remove previous #3 which should target net.
      
      previous version:
      V1: https://patchwork.kernel.org/project/netdevbpf/cover/1612784382-27262-1-git-send-email-tanhuazhong@huawei.com/
      ====================
      Acked-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e566dac
    • J
      net: hns3: cleanup for endian issue for VF RSS · 55ff3ed5
      Jian Shen 提交于
      Currently the RSS commands of VF are using host byte order.
      According to the user manual, it should use little endian in
      the command to firmware. For the host and firmware are both
      using little endian, so it can work well in this case.
      
      Do cleanup to make it more explicitly.
      Signed-off-by: NJian Shen <shenjian15@huawei.com>
      Signed-off-by: NHuazhong tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55ff3ed5
    • P
      net: hns3: remove unused macro definition · 7ceb40b8
      Peng Li 提交于
      Some macros are defined but unused, so remove them.
      Signed-off-by: NPeng Li <lipeng321@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7ceb40b8
    • H
      net: hns3: remove an unused parameter in hclge_vf_rate_param_check() · 11ef971f
      Huazhong Tan 提交于
      Parameter vf in hclge_vf_rate_param_check() is unused now,
      so remove it.
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11ef971f
    • H
      net: hns3: remove redundant return value of hns3_uninit_all_ring() · 64749c9c
      Huazhong Tan 提交于
      Since hns3_uninit_all_ring() only returns 0, so remove this
      redundant return value and function declaration in hns3_enet.h.
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64749c9c
    • P
      net: hns3: change hclge_query_bd_num() param type · cad8dfe8
      Peng Li 提交于
      The type of parameter mpf_bd_num and pf_bd_num in
      hclge_query_bd_num() should be u32* instead of int*,
      so change them.
      Signed-off-by: NPeng Li <lipeng321@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cad8dfe8
    • P
      net: hns3: change hclge_parse_speed() param type · 6e7f109e
      Peng Li 提交于
      The type of parameters in hclge_parse_speed() should be
      unsigned type, so change them.
      Signed-off-by: NPeng Li <lipeng321@huawei.com>
      Signed-off-by: NHuazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e7f109e