1. 29 5月, 2018 13 次提交
    • Y
      ptp_qoriq: move some definitions to header file · 6c50c1ed
      Yangbo Lu 提交于
      This patch is to move some definitions in ptp_qoriq.c
      to the header file.
      Signed-off-by: NYangbo Lu <yangbo.lu@nxp.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c50c1ed
    • Y
      ptp: rework gianfar_ptp as QorIQ common PTP driver · ceefc71d
      Yangbo Lu 提交于
      gianfar_ptp was the PTP clock driver for 1588 timer
      module of Freescale QorIQ eTSEC (Enhanced Three-Speed
      Ethernet Controllers) platforms. Actually QorIQ DPAA
      (Data Path Acceleration Architecture) platforms is
      also using the same 1588 timer module in hardware.
      
      This patch is to rework gianfar_ptp as QorIQ common
      PTP driver to support both DPAA and eTSEC. Moved
      gianfar_ptp.c to drivers/ptp/, renamed it as
      ptp_qoriq.c, and renamed many variables. There were
      not any function changes.
      Signed-off-by: NYangbo Lu <yangbo.lu@nxp.com>
      Acked-by: NRichard Cochran <richardcochran@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ceefc71d
    • J
      ifb: fix packets checksum · b1d2e4e0
      Jon Maxwell 提交于
      Fixup the checksum for CHECKSUM_COMPLETE when pulling skbs on RX path.
      Otherwise we get splats when tc mirred is used to redirect packets to ifb.
      
      Before fix:
      
      nic: hw csum failure
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1d2e4e0
    • H
      net: phy: realtek: add suspend/resume callbacks for RTL8211B · 049ff57a
      Heiner Kallweit 提交于
      Add RTL8211B suspend / resume callbacks.
      Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      049ff57a
    • D
      Merge branch 'Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device' · deeaf1f8
      David S. Miller 提交于
      Sridhar Samudrala says:
      
      ====================
      Enable virtio_net to act as a standby for a passthru device
      
      The main motivation for this patch is to enable cloud service providers
      to provide an accelerated datapath to virtio-net enabled VMs in a
      transparent manner with no/minimal guest userspace changes. This also
      enables hypervisor controlled live migration to be supported with VMs that
      have direct attached SR-IOV VF devices.
      
      Patch 1 introduces a failover module that provides a generic interface for
      paravirtual drivers to listen for netdev register/unregister/link change
      events from pci ethernet devices with the same MAC and takeover their
      datapath. The notifier and event handling code is based on the existing
      netvsc implementation.
      
      Patch 2 refactors netvsc to use the registration/notification framework
      introduced by failover module.
      
      Patch 3 introduces a net_failover driver that provides an automated
      failover mechanism to paravirtual drivers via APIs to create and destroy
      a failover master netdev and mananges a primary and standby slave netdevs
      that get registered via the generic failover infrastructure.
      
      Patch 4 introduces a new feature bit VIRTIO_NET_F_STANDBY to virtio-net
      that can be used by hypervisor to indicate that virtio_net interface
      should act as a standby for another device with the same MAC address.
      
      Patch 5 extends virtio_net to use alternate datapath when available and
      registered. When STANDBY feature is enabled, virtio_net driver uese the
      net_failover API to create an additional 'failover' netdev that acts as
      a master device and controls 2 slave devices.  The original virtio_net
      netdev is registered as 'standby' netdev and a passthru/vf device with
      the same MAC gets registered as 'primary' netdev. Both 'standby' and
      'failover' netdevs are associated with the same 'pci' device.  The user
      accesses the network interface via 'failover' netdev. The 'failover'
      netdev chooses 'primary' netdev as default for transmits when it is
      available with link up and running.
      
      As this patch series is initially focusing on usecases where hypervisor
      fully controls the VM networking and the guest is not expected to directly
      configure any hardware settings, it doesn't expose all the ndo/ethtool ops
      that are supported by virtio_net at this time. To support additional usecases,
      it should be possible to enable additional ops later by caching the state
      in failover netdev and replaying when the 'primary' netdev gets registered.
      
      At the time of live migration, the hypervisor needs to unplug the VF device
      from the guest on the source host and reset the MAC filter of the VF to
      initiate failover of datapath to virtio before starting the migration. After
      the migration is completed, the destination hypervisor sets the MAC filter
      on the VF and plugs it back to the guest to switch over to VF datapath.
      
      This patch is based on the discussion initiated by Jesse on this thread.
      https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
      
      v12:
      - Tested live migration with virtio-net/AVF(i40evf) configured in failover
        mode while running iperf in background. Tried static ip and dhcp
        configurations using 'network' scripts and Network Manager.
      - Build tested netvsc module.
      Updates:
      - Extended generic failover module to do common functions like setting
        FAILOVER_SLAVE flag, registering rx-handler and linking to upper dev in
        the generic register/unregister handlers.
        This required adding 3 additional failover ops pre_register, pre_unregister
        and handle_frame.  netvsc and net_failover drivers are updated to support
        these ops.
      
      v11:
      - Split net_failover module into 2 components.
        1. 'failover' module that provides generic failover infrastructure
        to register a failover instance and listen for slave events.
        2. 'net_failover' driver that provides APIs to create/destroy upper
        netdev and supports 3-netdev model used by virtio-net.
      - Added documentation
      
      v10:
      - fix net_failover_open() to update failover CARRIER correctly based on
        standby and primary states.
      - fix net_failover_handle_frame() to handle frames received on standby
        when primary is present.
      - replace netdev_upper_dev_link with netdev_master_upper_dev_link and
        handle lower dev state changes.
      - fix net_failver_create() and net_failover_register() interfaces to
        use ERR_PTR and avoid arg **
      - disable setting mac address when virtio-net in STANDBY mode
      - document exported symbols
      - added entry to MAINTAINERS file
      
      v9:
      Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET
      are enabled. (stephen)
      
      v8:
      - Made the failover managment routines more robust by updating the feature
        bits/other fields in the failover netdev when slave netdevs are
        registered/unregistered. (mst)
      - added support for handling vlans.
      - Limited the changes in netvsc to only use the notifier/event/lookups
        from the failover module. The slave register/unregister/link-change
        handlers are only updated to use the getbymac routine to get the
        upper netdev. There is no change in their functionality. (stephen)
      - renamed structs/function/file names to use net_failover prefix. (mst)
      
      v7
      - Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
        (jiri, mst)
      - re-arranged dev_open() and dev_set_mtu() calls in the register routines
        so that they don't get called for 2-netdev model. (stephen)
      - fixed select_queue() routine to do queue selection based on VF if it is
        registered as primary. (stephen)
      -  minor bugfixes
      
      v6 RFC:
        Simplified virtio_net changes by moving all the ndo_ops of the
        bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
        avoided 2 phase registration(driver + instances).
        introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags
        replaced mutex with a spinlock
      
      v5 RFC:
        Based on Jiri's comments, moved the common functionality to a 'bypass'
        module so that the same notifier and event handlers to handle child
        register/unregister/link change events can be shared between virtio_net
        and netvsc.
        Improved error handling based on Siwei's comments.
      v4:
      - Based on the review comments on the v3 version of the RFC patch and
        Jakub's suggestion for the naming issue with 3 netdev solution,
        proposed 3 netdev in-driver bonding solution for virtio-net.
      v3 RFC:
      - Introduced 3 netdev model and pointed out a couple of issues with
        that model and proposed 2 netdev model to avoid these issues.
      - Removed broadcast/multicast optimization and only use virtio as
        backup path when VF is unplugged.
      v2 RFC:
      - Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
      - made a small change to the virtio-net xmit path to only use VF datapath
        for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
        east-west broadcasts to go over the PCI link.
      - added suppport for the feature bit in qemu
      ====================
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      deeaf1f8
    • S
      virtio_net: Extend virtio to use VF datapath when available · ba5e4426
      Sridhar Samudrala 提交于
      This patch enables virtio_net to switch over to a VF datapath when STANDBY
      feature is enabled and a VF netdev is present with the same MAC address.
      It allows live migration of a VM with a direct attached VF without the need
      to setup a bond/team between a VF and virtio net device in the guest.
      
      It uses the API that is exported by the net_failover driver to create and
      and destroy a master failover netdev. When STANDBY feature is enabled, an
      additional netdev(failover netdev) is created that acts as a master device
      and tracks the state of the 2 lower netdevs. The original virtio_net netdev
      is marked as 'standby' netdev and a passthru device with the same MAC is
      registered as 'primary' netdev.
      
      The hypervisor needs to unplug the VF device from the guest on the source
      host and reset the MAC filter of the VF to initiate failover of datapath
      to virtio before starting the migration. After the migration is completed,
      the destination hypervisor sets the MAC filter on the VF and plugs it back
      to the guest to switch over to VF datapath.
      
      This patch is based on the discussion initiated by Jesse on this thread.
      https://marc.info/?l=linux-virtualization&m=151189725224231&w=2Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba5e4426
    • S
      virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit · 9805069d
      Sridhar Samudrala 提交于
      This feature bit can be used by hypervisor to indicate virtio_net device to
      act as a standby for another device with the same MAC address.
      
      VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9805069d
    • S
      net: Introduce net_failover driver · cfc80d9a
      Sridhar Samudrala 提交于
      The net_failover driver provides an automated failover mechanism via APIs
      to create and destroy a failover master netdev and manages a primary and
      standby slave netdevs that get registered via the generic failover
      infrastructure.
      
      The failover netdev acts a master device and controls 2 slave devices. The
      original paravirtual interface gets registered as 'standby' slave netdev and
      a passthru/vf device with the same MAC gets registered as 'primary' slave
      netdev. Both 'standby' and 'failover' netdevs are associated with the same
      'pci' device. The user accesses the network interface via 'failover' netdev.
      The 'failover' netdev chooses 'primary' netdev as default for transmits when
      it is available with link up and running.
      
      This can be used by paravirtual drivers to enable an alternate low latency
      datapath. It also enables hypervisor controlled live migration of a VM with
      direct attached VF by failing over to the paravirtual datapath when the VF
      is unplugged.
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfc80d9a
    • S
      netvsc: refactor notifier/event handling code to use the failover framework · 1ff78076
      Sridhar Samudrala 提交于
      Use the registration/notification framework supported by the generic
      failover infrastructure.
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ff78076
    • S
      net: Introduce generic failover module · 30c8bd5a
      Sridhar Samudrala 提交于
      The failover module provides a generic interface for paravirtual drivers
      to register a netdev and a set of ops with a failover instance. The ops
      are used as event handlers that get called to handle netdev register/
      unregister/link change/name change events on slave pci ethernet devices
      with the same mac address as the failover netdev.
      
      This enables paravirtual drivers to use a VF as an accelerated low latency
      datapath. It also allows migration of VMs with direct attached VFs by
      failing over to the paravirtual datapath when the VF is unplugged.
      Signed-off-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30c8bd5a
    • D
      vrf: add CRC32c offload to device features · cb160394
      Davide Caratti 提交于
      SCTP sockets originated in a VRF can improve their performance if CRC32c
      computation is delegated to underlying devices: update device features,
      setting NETIF_F_SCTP_CRC. Iterating the following command in the topology
      proposed with [1],
      
       # ip vrf exec vrf-h2 netperf -H 192.0.2.1 -t SCTP_STREAM -- -m 10K
      
      the measured throughput in Mbit/s improved from 2395 ± 1% to 2720 ± 1%.
      
      [1] https://www.spinics.net/lists/netdev/msg486007.htmlSigned-off-by: NDavide Caratti <dcaratti@redhat.com>
      Reviewed-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb160394
    • T
      net: stmmac: Use mutex instead of spinlock · 29555fa3
      Thierry Reding 提交于
      Some drivers, such as DWC EQOS on Tegra, need to perform operations that
      can sleep under this lock (clk_set_rate() in tegra_eqos_fix_speed()) for
      proper operation. Since there is no need for this lock to be a spinlock,
      convert it to a mutex instead.
      
      Fixes: e6ea2d16 ("net: stmmac: dwc-qos: Add Tegra186 support")
      Reported-by: NJon Hunter <jonathanh@nvidia.com>
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Tested-by: NBhadram Varka <vbhadram@nvidia.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29555fa3
    • S
      bnx2x: Collect the device debug information during Tx timeout. · 1b40428c
      Sudarsana Reddy Kalluru 提交于
      Tx-timeout mostly happens due to some issue in the device. In such cases,
      debug dump would be helpful for identifying the cause of the issue.
      This patch adds support to spill debug data during the Tx timeout. Here
      bnx2x_panic_dump() API is used instead of bnx2x_panic(), since we still
      want to allow the Tx-timeout recovery a chance to succeed.
      
      Changes from previous version:
      -------------------------------
      v2: Fixed a coding error.
      
      Please consider applying this to "net-next".
      Signed-off-by: NSudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b40428c
  2. 27 5月, 2018 1 次提交
  3. 26 5月, 2018 26 次提交
    • L
      Merge branch 'akpm' (patches from Andrew) · bc2dbc54
      Linus Torvalds 提交于
      Merge misc fixes from Andrew Morton:
       "16 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        kasan: fix memory hotplug during boot
        kasan: free allocated shadow memory on MEM_CANCEL_ONLINE
        checkpatch: fix macro argument precedence test
        init/main.c: include <linux/mem_encrypt.h>
        kernel/sys.c: fix potential Spectre v1 issue
        mm/memory_hotplug: fix leftover use of struct page during hotplug
        proc: fix smaps and meminfo alignment
        mm: do not warn on offline nodes unless the specific node is explicitly requested
        mm, memory_hotplug: make has_unmovable_pages more robust
        mm/kasan: don't vfree() nonexistent vm_area
        MAINTAINERS: change hugetlbfs maintainer and update files
        ipc/shm: fix shmat() nil address after round-down when remapping
        Revert "ipc/shm: Fix shmat mmap nil-page protection"
        idr: fix invalid ptr dereference on item delete
        ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio"
        mm: fix nr_rotate_swap leak in swapon() error case
      bc2dbc54
    • L
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 03250e10
      Linus Torvalds 提交于
      Pull networking fixes from David Miller:
       "Let's begin the holiday weekend with some networking fixes:
      
         1) Whoops need to restrict cfg80211 wiphy names even more to 64
            bytes. From Eric Biggers.
      
         2) Fix flags being ignored when using kernel_connect() with SCTP,
            from Xin Long.
      
         3) Use after free in DCCP, from Alexey Kodanev.
      
         4) Need to check rhltable_init() return value in ipmr code, from Eric
            Dumazet.
      
         5) XDP handling fixes in virtio_net from Jason Wang.
      
         6) Missing RTA_TABLE in rtm_ipv4_policy[], from Roopa Prabhu.
      
         7) Need to use IRQ disabling spinlocks in mlx4_qp_lookup(), from Jack
            Morgenstein.
      
         8) Prevent out-of-bounds speculation using indexes in BPF, from
            Daniel Borkmann.
      
         9) Fix regression added by AF_PACKET link layer cure, from Willem de
            Bruijn.
      
        10) Correct ENIC dma mask, from Govindarajulu Varadarajan.
      
        11) Missing config options for PMTU tests, from Stefano Brivio"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
        ibmvnic: Fix partial success login retries
        selftests/net: Add missing config options for PMTU tests
        mlx4_core: allocate ICM memory in page size chunks
        enic: set DMA mask to 47 bit
        ppp: remove the PPPIOCDETACH ioctl
        ipv4: remove warning in ip_recv_error
        net : sched: cls_api: deal with egdev path only if needed
        vhost: synchronize IOTLB message with dev cleanup
        packet: fix reserve calculation
        net/mlx5: IPSec, Fix a race between concurrent sandbox QP commands
        net/mlx5e: When RXFCS is set, add FCS data into checksum calculation
        bpf: properly enforce index mask to prevent out-of-bounds speculation
        net/mlx4: Fix irq-unsafe spinlock usage
        net: phy: broadcom: Fix bcm_write_exp()
        net: phy: broadcom: Fix auxiliary control register reads
        net: ipv4: add missing RTA_TABLE to rtm_ipv4_policy
        net/mlx4: fix spelling mistake: "Inrerface" -> "Interface" and rephrase message
        ibmvnic: Only do H_EOI for mobility events
        tuntap: correctly set SOCKWQ_ASYNC_NOSPACE
        virtio-net: fix leaking page for gso packet during mergeable XDP
        ...
      03250e10
    • D
      kasan: fix memory hotplug during boot · 3f195972
      David Hildenbrand 提交于
      Using module_init() is wrong.  E.g.  ACPI adds and onlines memory before
      our memory notifier gets registered.
      
      This makes sure that ACPI memory detected during boot up will not result
      in a kernel crash.
      
      Easily reproducible with QEMU, just specify a DIMM when starting up.
      
      Link: http://lkml.kernel.org/r/20180522100756.18478-3-david@redhat.com
      Fixes: 786a8959 ("kasan: disable memory hotplug")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f195972
    • D
      kasan: free allocated shadow memory on MEM_CANCEL_ONLINE · ed1596f9
      David Hildenbrand 提交于
      We have to free memory again when we cancel onlining, otherwise a later
      onlining attempt will fail.
      
      Link: http://lkml.kernel.org/r/20180522100756.18478-2-david@redhat.com
      Fixes: fa69b598 ("mm/kasan: add support for memory hotplug")
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed1596f9
    • J
      d41362ed
    • M
      init/main.c: include <linux/mem_encrypt.h> · ae67d58d
      Mathieu Malaterre 提交于
      In commit c7753208 ("x86, swiotlb: Add memory encryption support") a
      call to function `mem_encrypt_init' was added.  Include prototype
      defined in header <linux/mem_encrypt.h> to prevent a warning reported
      during compilation with W=1:
      
        init/main.c:494:20: warning: no previous prototype for `mem_encrypt_init' [-Wmissing-prototypes]
      
      Link: http://lkml.kernel.org/r/20180522195533.31415-1-malat@debian.orgSigned-off-by: NMathieu Malaterre <malat@debian.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Gargi Sharma <gs051095@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae67d58d
    • G
      kernel/sys.c: fix potential Spectre v1 issue · 23d6aef7
      Gustavo A. R. Silva 提交于
      `resource' can be controlled by user-space, hence leading to a potential
      exploitation of the Spectre variant 1 vulnerability.
      
      This issue was detected with the help of Smatch:
      
        kernel/sys.c:1474 __do_compat_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
        kernel/sys.c:1455 __do_sys_old_getrlimit() warn: potential spectre issue 'get_current()->signal->rlim' (local cap)
      
      Fix this by sanitizing *resource* before using it to index
      current->signal->rlim
      
      Notice that given that speculation windows are large, the policy is to
      kill the speculation on the first load and not worry if it can be
      completed with a dependent load/store [1].
      
      [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
      
      Link: http://lkml.kernel.org/r/20180515030038.GA11822@embeddedor.comSigned-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23d6aef7
    • J
      mm/memory_hotplug: fix leftover use of struct page during hotplug · a2155861
      Jonathan Cameron 提交于
      The case of a new numa node got missed in avoiding using the node info
      from page_struct during hotplug.  In this path we have a call to
      register_mem_sect_under_node (which allows us to specify it is hotplug
      so don't change the node), via link_mem_sections which unfortunately
      does not.
      
      Fix is to pass check_nid through link_mem_sections as well and disable
      it in the new numa node path.
      
      Note the bug only 'sometimes' manifests depending on what happens to be
      in the struct page structures - there are lots of them and it only needs
      to match one of them.
      
      The result of the bug is that (with a new memory only node) we never
      successfully call register_mem_sect_under_node so don't get the memory
      associated with the node in sysfs and meminfo for the node doesn't
      report it.
      
      It came up whilst testing some arm64 hotplug patches, but appears to be
      universal.  Whilst I'm triggering it by removing then reinserting memory
      to a node with no other elements (thus making the node disappear then
      appear again), it appears it would happen on hotplugging memory where
      there was none before and it doesn't seem to be related the arm64
      patches.
      
      These patches call __add_pages (where most of the issue was fixed by
      Pavel's patch).  If there is a node at the time of the __add_pages call
      then all is well as it calls register_mem_sect_under_node from there
      with check_nid set to false.  Without a node that function returns
      having not done the sysfs related stuff as there is no node to use.
      This is expected but it is the resulting path that fails...
      
      Exact path to the problem is as follows:
      
       mm/memory_hotplug.c: add_memory_resource()
      
         The node is not online so we enter the 'if (new_node)' twice, on the
         second such block there is a call to link_mem_sections which calls
         into
      
        drivers/node.c: link_mem_sections() which calls
      
        drivers/node.c: register_mem_sect_under_node() which calls
           get_nid_for_pfn and keeps trying until the output of that matches
           the expected node (passed all the way down from
           add_memory_resource)
      
      It is effectively the same fix as the one referred to in the fixes tag
      just in the code path for a new node where the comments point out we
      have to rerun the link creation because it will have failed in
      register_new_memory (as there was no node at the time).  (actually that
      comment is wrong now as we don't have register_new_memory any more it
      got renamed to hotplug_memory_register in Pavel's patch).
      
      Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
      Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
      Signed-off-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2155861
    • H
      proc: fix smaps and meminfo alignment · 6c04ab0e
      Hugh Dickins 提交于
      The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
      numbers (commonly 0) are misaligned.
      
      Remove seq_put_decimal_ull_width()'s leftover optimization for single
      digits: it's wrong now that num_to_str() takes care of the width.
      
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
      Fixes: d1be35cb ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c04ab0e
    • M
      mm: do not warn on offline nodes unless the specific node is explicitly requested · 8addc2d0
      Michal Hocko 提交于
      Oscar has noticed that we splat
      
         WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9
         [...]
         CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G        W   E     4.17.0-rc5-next-20180517-1-default+ #66
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
         Workqueue: kacpi_hotplug acpi_hotplug_work_fn
         Call Trace:
          vmemmap_populate+0xf2/0x2ae
          sparse_mem_map_populate+0x28/0x35
          sparse_add_one_section+0x4c/0x187
          __add_pages+0xe7/0x1a0
          add_pages+0x16/0x70
          add_memory_resource+0xa3/0x1d0
          add_memory+0xe4/0x110
          acpi_memory_device_add+0x134/0x2e0
          acpi_bus_attach+0xd9/0x190
          acpi_bus_scan+0x37/0x70
          acpi_device_hotplug+0x389/0x4e0
          acpi_hotplug_work_fn+0x1a/0x30
          process_one_work+0x146/0x340
          worker_thread+0x47/0x3e0
          kthread+0xf5/0x130
          ret_from_fork+0x35/0x40
      
      when adding memory to a node that is currently offline.
      
      The VM_WARN_ON is just too loud without a good reason.  In this
      particular case we are doing
      
      	alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order)
      
      so we do not insist on allocating from the given node (it is more a
      hint) so we can fall back to any other populated node and moreover we
      explicitly ask to not warn for the allocation failure.
      
      Soften the warning only to cases when somebody asks for the given node
      explicitly by __GFP_THISNODE.
      
      Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NOscar Salvador <osalvador@techadventures.net>
      Tested-by: NOscar Salvador <osalvador@techadventures.net>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8addc2d0
    • M
      mm, memory_hotplug: make has_unmovable_pages more robust · 15c30bc0
      Michal Hocko 提交于
      Oscar has reported:
      : Due to an unfortunate setting with movablecore, memblocks containing bootmem
      : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
      : So while trying to remove that memory, the system failed in do_migrate_range
      : and __offline_pages never returned.
      :
      : This can be reproduced by running
      : qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
      : and movablecore=4G kernel command line
      :
      : linux kernel: BIOS-provided physical RAM map:
      : linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
      : linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
      : linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
      : linux kernel: NX (Execute Disable) protection: active
      : linux kernel: SMBIOS 2.8 present.
      : linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
      : linux kernel: Hypervisor detected: KVM
      : linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
      : linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
      : linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
      :
      : linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
      : linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
      : linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
      : linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
      : linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
      :
      : zoneinfo shows that the zone movable is placed into both numa nodes:
      : Node 0, zone  Movable
      :   pages free     160140
      :         min      1823
      :         low      2278
      :         high     2733
      :         spanned  262144
      :         present  262144
      :         managed  245670
      : Node 1, zone  Movable
      :   pages free     448427
      :         min      3827
      :         low      4783
      :         high     5739
      :         spanned  524288
      :         present  524288
      :         managed  515766
      
      Note how only Node 0 has a hutplugable memory region which would rule it
      out from the early memblock allocations (most likely memmap).  Node1
      will surely contain memmaps on the same node and those would prevent
      offlining to succeed.  So this is arguably a configuration issue.
      Although one could argue that we should be more clever and rule early
      allocations from the zone movable.  This would be correct but probably
      not worth the effort considering what a hack movablecore is.
      
      Anyway, We could do better for those cases though.  We rely on
      start_isolate_page_range resp.  has_unmovable_pages to do their job.
      The first one isolates the whole range to be offlined so that we do not
      allocate from it anymore and the later makes sure we are not stumbling
      over non-migrateable pages.
      
      has_unmovable_pages is overly optimistic, however.  It doesn't check all
      the pages if we are withing zone_movable because we rely that those
      pages will be always migrateable.  As it turns out we are still not
      perfect there.  While bootmem pages in zonemovable sound like a clear
      bug which should be fixed let's remove the optimization for now and warn
      if we encounter unmovable pages in zone_movable in the meantime.  That
      should help for now at least.
      
      Btw.  this wasn't a real problem until commit 72b39cfc ("mm,
      memory_hotplug: do not fail offlining too early") because we used to
      have a small number of retries and then failed.  This turned out to be
      too fragile though.
      
      Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NOscar Salvador <osalvador@techadventures.net>
      Tested-by: NOscar Salvador <osalvador@techadventures.net>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15c30bc0
    • A
      mm/kasan: don't vfree() nonexistent vm_area · 0f901dcb
      Andrey Ryabinin 提交于
      KASAN uses different routines to map shadow for hot added memory and
      memory obtained in boot process.  Attempt to offline memory onlined by
      normal boot process leads to this:
      
          Trying to vfree() nonexistent vm area (000000005d3b34b9)
          WARNING: CPU: 2 PID: 13215 at mm/vmalloc.c:1525 __vunmap+0x147/0x190
      
          Call Trace:
           kasan_mem_notifier+0xad/0xb9
           notifier_call_chain+0x166/0x260
           __blocking_notifier_call_chain+0xdb/0x140
           __offline_pages+0x96a/0xb10
           memory_subsys_offline+0x76/0xc0
           device_offline+0xb8/0x120
           store_mem_state+0xfa/0x120
           kernfs_fop_write+0x1d5/0x320
           __vfs_write+0xd4/0x530
           vfs_write+0x105/0x340
           SyS_write+0xb0/0x140
      
      Obviously we can't call vfree() to free memory that wasn't allocated via
      vmalloc().  Use find_vm_area() to see if we can call vfree().
      
      Unfortunately it's a bit tricky to properly unmap and free shadow
      allocated during boot, so we'll have to keep it.  If memory will come
      online again that shadow will be reused.
      
      Matthew asked: how can you call vfree() on something that isn't a
      vmalloc address?
      
        vfree() is able to free any address returned by
        __vmalloc_node_range().  And __vmalloc_node_range() gives you any
        address you ask.  It doesn't have to be an address in [VMALLOC_START,
        VMALLOC_END] range.
      
        That's also how the module_alloc()/module_memfree() works on
        architectures that have designated area for modules.
      
      [aryabinin@virtuozzo.com: improve comments]
        Link: http://lkml.kernel.org/r/dabee6ab-3a7a-51cd-3b86-5468718e0390@virtuozzo.com
      [akpm@linux-foundation.org: fix typos, reflow comment]
      Link: http://lkml.kernel.org/r/20180201163349.8700-1-aryabinin@virtuozzo.com
      Fixes: fa69b598 ("mm/kasan: add support for memory hotplug")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NPaul Menzel <pmenzel+linux-kasan-dev@molgen.mpg.de>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f901dcb
    • M
      MAINTAINERS: change hugetlbfs maintainer and update files · b9ddff9b
      Mike Kravetz 提交于
      The current hugetlbfs maintainer has not been active for more than a few
      years.  I have been been active in this area for more than two years and
      plan to remain active in the foreseeable future.
      
      Also, update the hugetlbfs entry to include linux-mm mail list and
      additional hugetlbfs related files.  hugetlb.c and hugetlb.h are not
      100% hugetlbfs, but a majority of their content is hugetlbfs related.
      
      Link: http://lkml.kernel.org/r/20180518225236.19079-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9ddff9b
    • D
      ipc/shm: fix shmat() nil address after round-down when remapping · 8f89c007
      Davidlohr Bueso 提交于
      shmat()'s SHM_REMAP option forbids passing a nil address for; this is in
      fact the very first thing we check for.  Andrea reported that for
      SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check,
      but we need to check again if the address was rounded down to nil.  As
      of this patch, such cases will return -EINVAL.
      
      Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f89c007
    • D
      Revert "ipc/shm: Fix shmat mmap nil-page protection" · a73ab244
      Davidlohr Bueso 提交于
      Patch series "ipc/shm: shmat() fixes around nil-page".
      
      These patches fix two issues reported[1] a while back by Joe and Andrea
      around how shmat(2) behaves with nil-page.
      
      The first reverts a commit that it was incorrectly thought that mapping
      nil-page (address=0) was a no no with MAP_FIXED.  This is not the case,
      with the exception of SHM_REMAP; which is address in the second patch.
      
      I chose two patches because it is easier to backport and it explicitly
      reverts bogus behaviour.  Both patches ought to be in -stable and ltp
      testcases need updated (the added testcase around the cve can be
      modified to just test for SHM_RND|SHM_REMAP).
      
      [1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805
      
      This patch (of 2):
      
      Commit 95e91b83 ("ipc/shm: Fix shmat mmap nil-page protection")
      worked on the idea that we should not be mapping as root addr=0 and
      MAP_FIXED.  However, it was reported that this scenario is in fact
      valid, thus making the patch both bogus and breaks userspace as well.
      
      For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem
      initialization[1].
      
      [1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347
      Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net
      Fixes: 95e91b83 ("ipc/shm: Fix shmat mmap nil-page protection")
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Reported-by: NJoe Lawrence <joe.lawrence@redhat.com>
      Reported-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a73ab244
    • M
      idr: fix invalid ptr dereference on item delete · 7a4deea1
      Matthew Wilcox 提交于
      If the radix tree underlying the IDR happens to be full and we attempt
      to remove an id which is larger than any id in the IDR, we will call
      __radix_tree_delete() with an uninitialised 'slot' pointer, at which
      point anything could happen.  This was easiest to hit with a single
      entry at id 0 and attempting to remove a non-0 id, but it could have
      happened with 64 entries and attempting to remove an id >= 64.
      
      Roman said:
      
        The syzcaller test boils down to opening /dev/kvm, creating an
        eventfd, and calling a couple of KVM ioctls. None of this requires
        superuser. And the result is dereferencing an uninitialized pointer
        which is likely a crash. The specific path caught by syzbot is via
        KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
        other user-triggerable paths, so cc:stable is probably justified.
      
      Matthew added:
      
        We have around 250 calls to idr_remove() in the kernel today. Many of
        them pass an ID which is embedded in the object they're removing, so
        they're safe. Picking a few likely candidates:
      
        drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
        drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
        drivers/atm/nicstar.c could be taken down by a handcrafted packet
      
      Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
      Fixes: 0a835c4f ("Reimplement IDR and IDA using the radix tree")
      Reported-by: <syzbot+35666cba7f0a337e2e79@syzkaller.appspotmail.com>
      Debugged-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a4deea1
    • C
      ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" · 3373de20
      Changwei Ge 提交于
      This reverts commit ba16ddfb ("ocfs2/o2hb: check len for
      bio_add_page() to avoid getting incorrect bio").
      
      In my testing, this patch introduces a problem that mkfs can't have
      slots more than 16 with 4k block size.
      
      And the original logic is safe actually with the situation it mentions
      so revert this commit.
      
      Attach test log:
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0
        (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192
        (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
        (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
        (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5
      
      Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.comSigned-off-by: NChangwei Ge <ge.changwei@h3c.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3373de20
    • O
      mm: fix nr_rotate_swap leak in swapon() error case · 7cbf3192
      Omar Sandoval 提交于
      If swapon() fails after incrementing nr_rotate_swap, we don't decrement
      it and thus effectively leak it.  Make sure we decrement it if we
      incremented it.
      
      Link: http://lkml.kernel.org/r/b6fe6b879f17fa68eee6cbd876f459f6e5e33495.1526491581.git.osandov@fb.com
      Fixes: 81a0298b ("mm, swap: don't use VMA based swap readahead if HDD is used as swap")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Reviewed-by: N"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cbf3192
    • F
      net: dsa: dsa_loop: Make dynamic debugging helpful · e52cde71
      Florian Fainelli 提交于
      Remove redundant debug prints from phy_read/write since we can trace those
      calls through trace events. Enhance dynamic debug prints to print arguments
      which helps figuring how what is going on at the driver level with higher level
      configuration interfaces.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e52cde71
    • D
      Merge branch 'ovs-ct-zone' · 910714f1
      David S. Miller 提交于
      Yi-Hung Wei says:
      
      ====================
      openvswitch: Support conntrack zone limit
      
      Currently, nf_conntrack_max is used to limit the maximum number of
      conntrack entries in the conntrack table for every network namespace.
      For the VMs and containers that reside in the same namespace,
      they share the same conntrack table, and the total # of conntrack entries
      for all the VMs and containers are limited by nf_conntrack_max.  In this
      case, if one of the VM/container abuses the usage the conntrack entries,
      it blocks the others from committing valid conntrack entries into the
      conntrack table.  Even if we can possibly put the VM in different network
      namespace, the current nf_conntrack_max configuration is kind of rigid
      that we cannot limit different VM/container to have different # conntrack
      entries.
      
      To address the aforementioned issue, this patch proposes to have a
      fine-grained mechanism that could further limit the # of conntrack entries
      per-zone.  For example, we can designate different zone to different VM,
      and set conntrack limit to each zone.  By providing this isolation, a
      mis-behaved VM only consumes the conntrack entries in its own zone, and
      it will not influence other well-behaved VMs.  Moreover, the users can
      set various conntrack limit to different zone based on their preference.
      
      The proposed implementation utilizes Netfilter's nf_conncount backend
      to count the number of connections in a particular zone.  If the number of
      connection is above a configured limitation, OVS will return ENOMEM to the
      userspace.  If userspace does not configure the zone limit, the limit
      defaults to zero that is no limitation, which is backward compatible to
      the behavior without this patch.
      
      The first patch defines the conntrack limit netlink definition, and the
      second patch provides the implementation.
      
      v4->v5:
        - Addresses comments from Parvin that include log error msg in
          ovs_ct_limit_init(), handle deletion for default limit, and
          add a common helper for get zone limit.
        - Rebases to master.
      
      v3->v4:
        - Addresses comments from Parvin that include simplify netlink API,
          and remove unncessary RCU lockings.
        - Rebases to master.
      
      v2->v3:
        - Addresses comments from Parvin that include using static keys to check
          if ovs_ct_limit features is used, only check ct_limit when a ct entry
          is unconfirmed, and reports rate limited warning messages when the ct
          limit is reached.
        - Rebases to master.
      
      v1->v2:
        - Fixes commit log typos suggested by Greg.
        - Fixes memory free issue that Julia found.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      910714f1
    • Y
      openvswitch: Support conntrack zone limit · 11efd5cb
      Yi-Hung Wei 提交于
      Currently, nf_conntrack_max is used to limit the maximum number of
      conntrack entries in the conntrack table for every network namespace.
      For the VMs and containers that reside in the same namespace,
      they share the same conntrack table, and the total # of conntrack entries
      for all the VMs and containers are limited by nf_conntrack_max.  In this
      case, if one of the VM/container abuses the usage the conntrack entries,
      it blocks the others from committing valid conntrack entries into the
      conntrack table.  Even if we can possibly put the VM in different network
      namespace, the current nf_conntrack_max configuration is kind of rigid
      that we cannot limit different VM/container to have different # conntrack
      entries.
      
      To address the aforementioned issue, this patch proposes to have a
      fine-grained mechanism that could further limit the # of conntrack entries
      per-zone.  For example, we can designate different zone to different VM,
      and set conntrack limit to each zone.  By providing this isolation, a
      mis-behaved VM only consumes the conntrack entries in its own zone, and
      it will not influence other well-behaved VMs.  Moreover, the users can
      set various conntrack limit to different zone based on their preference.
      
      The proposed implementation utilizes Netfilter's nf_conncount backend
      to count the number of connections in a particular zone.  If the number of
      connection is above a configured limitation, ovs will return ENOMEM to the
      userspace.  If userspace does not configure the zone limit, the limit
      defaults to zero that is no limitation, which is backward compatible to
      the behavior without this patch.
      
      The following high leve APIs are provided to the userspace:
        - OVS_CT_LIMIT_CMD_SET:
          * set default connection limit for all zones
          * set the connection limit for a particular zone
        - OVS_CT_LIMIT_CMD_DEL:
          * remove the connection limit for a particular zone
        - OVS_CT_LIMIT_CMD_GET:
          * get the default connection limit for all zones
          * get the connection limit for a particular zone
      Signed-off-by: NYi-Hung Wei <yihung.wei@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11efd5cb
    • Y
      openvswitch: Add conntrack limit netlink definition · 5972be6b
      Yi-Hung Wei 提交于
      Define netlink messages and attributes to support user kernel
      communication that uses the conntrack limit feature.
      Signed-off-by: NYi-Hung Wei <yihung.wei@gmail.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5972be6b
    • D
      Merge tag 'mlx5e-updates-2018-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · d7c52fc8
      David S. Miller 提交于
      Saeed Mahameed says:
      
      ====================
      mlx5e-updates-2018-05-19
      
      This series contains updates for mlx5e netdevice driver with one subject,
      DSCP to priority mapping, in the first patch Huy adds the needed API in
      dcbnl, the second patch adds the needed mlx5 core capability bits for the
      feature, and all other patches are mlx5e (netdev) only changes to add
      support for the feature.
      
      From: Huy Nguyen
      
      Dscp to priority mapping for Ethernet packet:
      
      These patches enable differentiated services code point (dscp) to
      priority mapping for Ethernet packet. Once this feature is
      enabled, the packet is routed to the corresponding priority based on its
      dscp. User can combine this feature with priority flow control (pfc)
      feature to have priority flow control based on the dscp.
      
      Firmware interface:
      Mellanox firmware provides two control knobs for this feature:
        QPTS register allow changing the trust state between dscp and
        pcp mode. The default is pcp mode. Once in dscp mode, firmware will
        route the packet based on its dscp value if the dscp field exists.
      
        QPDPM register allow mapping a specific dscp (0 to 63) to a
        specific priority (0 to 7). By default, all the dscps are mapped to
        priority zero.
      
      Software interface:
      This feature is controlled via application priority TLV. IEEE
      specification P802.1Qcd/D2.1 defines priority selector id 5 for
      application priority TLV. This APP TLV selector defines DSCP to priority
      map. This APP TLV can be sent by the switch or can be set locally using
      software such as lldptool. In mlx5 drivers, we add the support for net
      dcb's getapp and setapp call back. Mlx5 driver only handles the selector
      id 5 application entry (dscp application priority application entry).
      If user sends multiple dscp to priority APP TLV entries on the same
      dscp, the last sent one will take effect. All the previous sent will be
      deleted.
      
      This attribute combined with pfc attribute allows advanced user to
      fine tune the qos setting for specific priority queue. For example,
      user can give dedicated buffer for one or more priorities or user
      can give large buffer to certain priorities.
      
      The dcb buffer configuration will be controlled by lldptool.
      >> lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
            maps priorities 0,1,2,3,4,5,6,7 to receive buffer 0,2,5,7,1,2,3,6
      >> lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
            sets receive buffer size for buffer 0,1,2,3,4,5,6,7 respectively
      
      After discussion on mailing list with Jakub, Jiri, Ido and John, we agreed to
      choose dcbnl over devlink interface since this feature is intended to set
      port attributes which are governed by the netdev instance of that port, where
      devlink API is more suitable for global ASIC configurations.
      
      The firmware trust state (in QPTS register) is changed based on the
      number of dscp to priority application entries. When the first dscp to
      priority application entry is added by the user, the trust state is
      changed to dscp. When the last dscp to priority application entry is
      deleted by the user, the trust state is changed to pcp.
      
      When the port is in DSCP trust state, the transmit queue is selected
      based on the dscp of the skb.
      
      When the port is in DSCP trust state and vport inline mode is not NONE,
      firmware requires mlx5 driver to copy the IP header to the
      wqe ethernet segment inline header if the skb has it.
      This is done by changing the transmit queue sq's min inline mode to L3.
      Note that the min inline mode of sqs that belong to other features
      such as xdpsq, icosq are not modified.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7c52fc8
    • B
      8139too: Remove unnecessary netif_napi_del() · a4567579
      Bo Chen 提交于
      The call to free_netdev() in __rtl8139_cleanup_dev() clears the network device
      napi list, and explicit calls to netif_napi_del() are unnecessary.
      Signed-off-by: NBo Chen <chenbo@pdx.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4567579
    • T
      ibmvnic: Fix partial success login retries · eb110410
      Thomas Falcon 提交于
      In its current state, the driver will handle backing device
      login in a loop for a certain number of retries while the
      device returns a partial success, indicating that the driver
      may need to try again using a smaller number of resources.
      
      The variable it checks to continue retrying may change
      over the course of operations, resulting in reallocation
      of resources but exits without sending the login attempt.
      Guard against this by introducing a boolean variable that
      will retain the state indicating that the driver needs to
      reattempt login with backing device firmware.
      Signed-off-by: NThomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb110410
    • D
      Merge branch 'qed-ethtool-rx-flow-classification-enhancements' · a527af9c
      David S. Miller 提交于
      Manish Chopra says:
      
      ====================
      qed*: ethtool rx flow classification enhancements.
      
      This series re-structures the driver's ethtool rx flow
      classification flow, following that it adds other flow
      profiles and rx flow classification enhancements
      via "ethtool -N/-U"
      
      Please consider applying this to "net-next"
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a527af9c