1. 25 Jan 2018, 1 commit
    • net: separate SIOCGIFCONF handling from dev_ioctl() · 36fd633e
      Committed by Al Viro
      Only two of dev_ioctl()'s callers may pass SIOCGIFCONF to it.
      Separating that codepath from the rest of dev_ioctl() allows both
      simplifying dev_ioctl() itself (all other cases work with struct ifreq *)
      *and* seriously simplifying the compat side of that beast: all it takes
      is passing inet_gifconf() an extra argument - the size of an individual
      record (sizeof(struct ifreq) or sizeof(struct compat_ifreq)).  With
      dev_ifconf() called directly from sock_do_ioctl()/compat_dev_ifconf(),
      that's easy to arrange.
      
      As a result, the compat side of SIOCGIFCONF no longer needs any
      allocations, copy_in_user() back and forth, etc. (A sketch of the
      record-size idea follows this entry.)
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      36fd633e
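      A minimal sketch of the record-size idea, modeled in plain userspace C
      rather than the actual net/core code; the struct and names here are
      illustrative only:

        #include <stddef.h>
        #include <string.h>

        /*
         * Hedged sketch: a gifconf-style handler fills one fixed-size record
         * per interface and advances by a record size chosen by the caller.
         * Passing sizeof(struct ifreq) from the native path and
         * sizeof(struct compat_ifreq) from the compat path is what lets both
         * share one implementation without copy_in_user() round trips.
         */
        struct record { char name[16]; };      /* stand-in for struct ifreq */

        static const char *ifnames[] = { "lo", "eth0", "wlan0" };

        /* Returns the number of bytes written (or needed, if buf is NULL). */
        static int example_gifconf(char *buf, size_t len, size_t size)
        {
            size_t done = 0;

            for (size_t i = 0; i < sizeof(ifnames) / sizeof(ifnames[0]); i++) {
                if (buf) {
                    struct record r;

                    if (done + size > len)
                        break;
                    memset(&r, 0, sizeof(r));
                    strncpy(r.name, ifnames[i], sizeof(r.name) - 1);
                    memset(buf + done, 0, size);
                    memcpy(buf + done, &r, size < sizeof(r) ? size : sizeof(r));
                }
                done += size;
            }
            return (int)done;
        }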
  2. 24 Jan 2018, 1 commit
  3. 23 Jan 2018, 1 commit
  4. 22 Jan 2018, 5 commits
    • device property: Allow iterating over available child fwnodes · 3395de96
      Committed by Marcin Wojtas
      Implement a new helper function, fwnode_get_next_available_child_node(),
      which obtains the next enabled child fwnode and works along the same
      lines as OF's of_get_next_available_child().
      
      This commit also introduces a macro that makes it possible to iterate
      over the available child fwnodes using the new function described above
      (see the usage sketch after this entry).
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3395de96
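      A hedged sketch of the iteration pattern this adds; the macro body below
      is illustrative, and the exact iterator name in a given tree should be
      checked in include/linux/property.h:

        #include <linux/printk.h>
        #include <linux/property.h>

        /* Illustrative iterator built on the new helper. */
        #define example_for_each_available_child_node(fwnode, child)             \
            for (child = fwnode_get_next_available_child_node(fwnode, NULL);     \
                 child;                                                           \
                 child = fwnode_get_next_available_child_node(fwnode, child))

        static void example_walk_children(struct fwnode_handle *fwnode)
        {
            struct fwnode_handle *child;

            example_for_each_available_child_node(fwnode, child)
                pr_info("found an available (enabled) child node\n");
        }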
    • device property: Introduce fwnode_irq_get() · 7c6c57f2
      Committed by Marcin Wojtas
      Until now there were two very similar functions for getting a Linux IRQ
      number, one from an ACPI handle (acpi_irq_get()) and one from an OF node
      (of_irq_get()). The former appeared to be used only as a subroutine of
      platform_get_irq(), which (in the generic code) limited obtaining IRQs
      from the _CRS method to nodes associated with the kernel's
      struct platform_device.
      
      This patch introduces a new helper routine, fwnode_irq_get(), which
      allows getting the IRQ number directly from the fwnode and serves as a
      common entry point for the OF and ACPI worlds. It is usable not only for
      parent fwnodes, but also for child nodes carrying their own _CRS methods
      with interrupt descriptions (see the usage sketch after this entry).
      
      In order to satisfy compilation with !CONFIG_ACPI and also to simplify
      the new code, introduce a helper macro (ACPI_HANDLE_FWNODE), with which
      it is possible to reach an ACPI handle directly from its fwnode.
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c6c57f2
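      A hedged usage sketch of the new helper on a child fwnode; the handler,
      flags and function names here are placeholders, not kernel APIs beyond
      the ones named in the commit:

        #include <linux/device.h>
        #include <linux/interrupt.h>
        #include <linux/property.h>

        /*
         * Request the first interrupt described by a child fwnode (an OF
         * subnode, or an ACPI child object with its own _CRS).
         */
        static int example_request_child_irq(struct device *dev,
                                             struct fwnode_handle *child,
                                             irq_handler_t handler, void *priv)
        {
            int irq = fwnode_irq_get(child, 0);   /* index 0: first interrupt */

            if (irq < 0)
                return irq;

            return devm_request_irq(dev, irq, handler, 0, dev_name(dev), priv);
        }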
    • device property: Introduce fwnode_get_phy_mode() · b28f263b
      Committed by Marcin Wojtas
      Until now there were two almost identical functions for obtaining the
      network PHY mode: of_get_phy_mode() and the more generic
      device_get_phy_mode(). However, it is not uncommon for the network
      interface to be represented as a child of the actual controller, and
      hence not associated directly with any struct device, which the latter
      routine requires.
      
      This commit allows getting the PHY mode for child nodes in the ACPI
      world by introducing a new function, fwnode_get_phy_mode(). It also
      turns the device_get_phy_mode() routine into a wrapper around it, in
      order to prevent unnecessary duplication (see the sketch of the wrapper
      pattern after this entry).
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b28f263b
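      A hedged sketch of the wrapper pattern described above (not the exact
      upstream code):

        #include <linux/device.h>
        #include <linux/property.h>

        /*
         * The struct-device variant just forwards its fwnode to the
         * fwnode-based helper; children without a struct device can call
         * fwnode_get_phy_mode() directly.  The return value is a
         * phy_interface_t on success or a negative errno, as with the
         * original device_get_phy_mode().
         */
        static int example_device_get_phy_mode(struct device *dev)
        {
            return fwnode_get_phy_mode(dev_fwnode(dev));
        }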
    • device property: Introduce fwnode_get_mac_address() · babe2dbb
      Committed by Marcin Wojtas
      Until now there were two almost identical functions for obtaining a MAC
      address: of_get_mac_address() and the more generic
      device_get_mac_address(). However, it is not uncommon for the network
      interface to be represented as a child of the actual controller, and
      hence not associated directly with any struct device, which the latter
      routine requires.
      
      This commit allows getting the MAC address for child nodes in the ACPI
      world by introducing a new function, fwnode_get_mac_address(). It also
      turns the device_get_mac_address() routine into a wrapper around it, in
      order to prevent unnecessary duplication (a usage sketch follows this
      entry).
      Signed-off-by: Marcin Wojtas <mw@semihalf.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      babe2dbb
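      A hedged usage sketch for a child node, using the helper's signature as
      introduced by this series (later kernels changed it); the surrounding
      function is illustrative:

        #include <linux/errno.h>
        #include <linux/etherdevice.h>
        #include <linux/property.h>

        /*
         * Fetch the MAC address from a child fwnode, e.g. a port node below
         * the controller.  fwnode_get_mac_address() returns the buffer on
         * success and NULL if no usable address property was found.
         */
        static int example_get_port_mac(struct fwnode_handle *port, char *addr)
        {
            if (!fwnode_get_mac_address(port, addr, ETH_ALEN))
                return -ENODEV;
            return 0;
        }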
    • mm, page_vma_mapped: Drop faulty pointer arithmetics in check_pte() · 0d665e7b
      Committed by Kirill A. Shutemov
      Tetsuo reported random crashes under memory pressure on a 32-bit x86
      system and tracked them down to the change that introduced
      page_vma_mapped_walk().
      
      The root cause of the issue is the faulty pointer math in check_pte().
      As ->pte may point to an arbitrary page, we have to check that the page
      belongs to the same section before doing the math; otherwise it may lead
      to weird results.
      
      It wasn't noticed until now because mem_map[] is virtually contiguous on
      flatmem or vmemmap sparsemem, so pointer arithmetic just works for all
      'struct page' pointers. But with classic sparsemem it doesn't, because
      each section's memmap is allocated separately, and consecutive pfns
      crossing two sections might have struct pages at completely unrelated
      addresses.
      
      Let's restructure the code a bit and replace the pointer arithmetic with
      operations on pfns (a sketch follows this entry).
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-and-tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Fixes: ace71a19 ("mm: introduce page_vma_mapped_walk()")
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d665e7b
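      A hedged sketch of the pfn-based check (not the actual check_pte()
      code); the function name and range semantics are illustrative:

        #include <linux/mm.h>

        /*
         * Instead of subtracting two struct page pointers, which is only
         * valid when both pages live in the same contiguous memmap chunk,
         * compare pfns; pfn arithmetic stays correct on classic sparsemem.
         */
        static bool example_pte_in_range(pte_t pte, struct page *first,
                                         unsigned long nr_pages)
        {
            unsigned long pfn;

            if (!pte_present(pte))
                return false;

            pfn = pte_pfn(pte);
            return pfn >= page_to_pfn(first) &&
                   pfn - page_to_pfn(first) < nr_pages;
        }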
  5. 20 Jan 2018, 4 commits
  6. 19 Jan 2018, 1 commit
  7. 18 Jan 2018, 2 commits
  8. 17 Jan 2018, 2 commits
  9. 16 Jan 2018, 5 commits
  10. 15 Jan 2018, 5 commits
  11. 14 Jan 2018, 1 commit
  12. 13 Jan 2018, 3 commits
    • kdump: Write the correct address of mem_section into vmcoreinfo · 9f15b912
      Committed by Kirill A. Shutemov
      Depending on the configuration, mem_section can now be either an array
      or a pointer to a dynamically allocated array. In most cases we can
      continue to refer to it as 'mem_section' regardless of which it is.
      
      But there's one exception: '&mem_section' means "address of the array" if
      mem_section is an array, but if mem_section is a pointer, it means
      "address of the pointer".
      
      We stumbled over this in the kdump code: VMCOREINFO_SYMBOL(mem_section)
      writes the address of the pointer into vmcoreinfo, not the address of
      the array as we wanted, breaking kdump.
      
      Let's introduce VMCOREINFO_SYMBOL_ARRAY(), which handles both cases
      correctly (see the sketch after this entry).
      
      Mike Galbraith <efault@gmx.de>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Acked-by: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: kexec@lists.infradead.org
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Fixes: 83e3c487 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
      Link: http://lkml.kernel.org/r/20180112162532.35896-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9f15b912
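      A hedged sketch of the distinction; the macro bodies below are
      paraphrased from include/linux/crash_core.h and may not match a given
      tree exactly:

        /*
         * VMCOREINFO_SYMBOL() records &name.  For an array this is
         * numerically the address of its first element, but for
         * 'struct mem_section **mem_section' it is the address of the
         * pointer variable itself.  The array-aware variant records the
         * value of 'name', which is what the kdump tools need.
         */
        #define VMCOREINFO_SYMBOL(name) \
            vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)&name)
        #define VMCOREINFO_SYMBOL_ARRAY(name) \
            vmcoreinfo_append_str("SYMBOL(%s)=%lx\n", #name, (unsigned long)name)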
    • error-injection: Add injectable error types · 663faf9f
      Committed by Masami Hiramatsu
      Add injectable error types for each error-injectable function.
      
      One motivation for error injection testing is to find software flaws,
      mistakes or mishandling of expectable errors. If the test finds such a
      flaw, that is a program bug, so we need to fix it.
      
      But if the tester injects a wrong error (e.g. just returning a success
      code without processing anything), it causes unexpected behavior even if
      the caller is correctly programmed to handle any errors. That is not
      what we want to test by error injection.
      
      To clarify what type of error the caller must expect for each injectable
      function, this introduces injectable error types:
      
       - EI_ETYPE_NULL : means the function will return NULL if it fails.
                         No ERR_PTR, just a NULL.
       - EI_ETYPE_ERRNO : means the function will return -ERRNO if it fails.
       - EI_ETYPE_ERRNO_NULL : means the function will return -ERRNO (ERR_PTR)
                         or NULL.
      
      The ALLOW_ERROR_INJECTION() macro is extended to take one of NULL,
      ERRNO, or ERRNO_NULL, recording the error type for each function, e.g.
      
       ALLOW_ERROR_INJECTION(open_ctree, ERRNO)
      
      These error types are shown in debugfs as below.
      
        ====
        / # cat /sys/kernel/debug/error_injection/list
        open_ctree [btrfs]	ERRNO
        io_ctl_init [btrfs]	ERRNO
        ====
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      663faf9f
    • error-injection: Separate error-injection from kprobe · 540adea3
      Committed by Masami Hiramatsu
      The error-injection framework is not limited to use by kprobes or bpf;
      other kernel subsystems, e.g. livepatch, ftrace etc., can use it freely
      for checking the safety of error injection. So separate the
      error-injection framework from kprobes.
      
      Some differences have been made:
      
      - The word "kprobe" is removed from all APIs/structures.
      - BPF_ALLOW_ERROR_INJECTION() is renamed to ALLOW_ERROR_INJECTION()
        since it is not limited to BPF either.
      - CONFIG_FUNCTION_ERROR_INJECTION is the config item for this feature.
        It is automatically enabled if the arch supports the error injection
        feature for kprobes, ftrace etc.
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      540adea3
  13. 12 1月, 2018 2 次提交
    • net/mlx5: Fix get vector affinity helper function · 05e0cc84
      Committed by Saeed Mahameed
      mlx5_get_vector_affinity() used to call pci_irq_get_affinity(); after
      reverting the patch that set the device affinity via the
      PCI_IRQ_AFFINITY API, calling pci_irq_get_affinity() became useless and
      it breaks RDMA mlx5 users.  To fix this, this patch provides an
      alternative way to retrieve the IRQ vector affinity using the legacy IRQ
      API, following the smp_affinity procfs read implementation (a sketch
      follows this entry).
      
      Fixes: 231243c8 ("Revert mlx5: move affinity hints assignments to generic code")
      Fixes: a435393a ("mlx5: move affinity hints assignments to generic code")
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      05e0cc84
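      A hedged sketch of the legacy-IRQ-API approach described above (not the
      exact mlx5 helper; the function name is illustrative):

        #include <linux/irq.h>
        #include <linux/pci.h>

        /*
         * Translate the completion vector into a Linux IRQ number and read
         * its affinity mask through the generic IRQ layer, much like the
         * /proc/irq/<n>/smp_affinity read path does.
         */
        static const struct cpumask *example_vector_affinity(struct pci_dev *pdev,
                                                             int vector)
        {
            int irq = pci_irq_vector(pdev, vector);

            if (irq < 0)
                return NULL;

            return irq_get_affinity_mask(irq);
        }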
    • {net,ib}/mlx5: Don't disable local loopback multicast traffic when needed · 8978cc92
      Committed by Eran Ben Elisha
      There are system platform information management interfaces (such as
      HOST2BMC) for which we cannot disable local loopback multicast traffic.
      
      Separate the disable_local_lb_mc and disable_local_lb_uc capability bits
      so the driver will not disable multicast loopback traffic if that is not
      supported. (It is expected that firmware will not set disable_local_lb_mc
      if HOST2BMC is running, for example.)
      
      The mlx5_nic_vport_update_local_lb() function makes a best effort to
      disable/enable UC/MC loopback traffic and returns success only if it
      managed to change everything that the firmware allows.
      
      Adapt mlx5_ib and mlx5e to support the new cap bits.
      
      Fixes: 2c43c5a0 ("net/mlx5e: Enable local loopback in loopback selftest")
      Fixes: c85023e1 ("IB/mlx5: Add raw ethernet local loopback support")
      Fixes: bded747b ("net/mlx5: Add raw ethernet local loopback firmware command")
      Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
      Cc: kernel-team@fb.com
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      8978cc92
  14. 11 Jan 2018, 4 commits
  15. 10 Jan 2018, 2 commits
    • bpf: export function to write into verifier log buffer · 430e68d1
      Committed by Quentin Monnet
      Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and
      export it, so that other components (in particular, drivers for BPF
      offload) can reuse the user buffer log to dump error messages at
      verification time.
      
      Renaming `verbose()` was necessary in order to avoid exporting such a
      generic name to the global namespace. However, to prevent too much pain
      for backports, the calls to `verbose()` in the kernel BPF verifier were
      not changed. Instead, use function aliasing to make `verbose` point to
      `bpf_verifier_log_write` (see the sketch of the technique after this
      entry). Another solution would be a wrapper around `verbose()`, but
      since it is a variadic function, I don't see a clean way to do that
      without creating two identical wrappers, one for the verifier and one
      to export.
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      430e68d1
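      A hedged, plain-C illustration of the aliasing technique (GCC/Clang
      attribute); the names here are placeholders, not the verifier's actual
      symbols:

        #include <stdarg.h>
        #include <stdio.h>

        /* The exported, descriptive name carries the definition... */
        void example_log_write(void *env, const char *fmt, ...)
        {
            va_list args;

            (void)env;                    /* stand-in for the verifier env */
            va_start(args, fmt);
            vprintf(fmt, args);
            va_end(args);
        }

        /*
         * ...and the short internal name is an alias for the same symbol, so
         * existing variadic call sites keep working without a wrapper that
         * would have to re-implement va_list handling.
         */
        void verbose(void *env, const char *fmt, ...)
            __attribute__((alias("example_log_write")));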
    • bpf: avoid false sharing of map refcount with max_entries · be95a845
      Committed by Daniel Borkmann
      In addition to commit b2157399 ("bpf: prevent out-of-bounds
      speculation"), also change the layout of struct bpf_map such that false
      sharing of fast-path members like max_entries is avoided when the map's
      reference counter is altered. Therefore force them into separate
      cachelines.
      
      pahole dump after change:
      
        struct bpf_map {
              const struct bpf_map_ops  * ops;                 /*     0     8 */
              struct bpf_map *           inner_map_meta;       /*     8     8 */
              void *                     security;             /*    16     8 */
              enum bpf_map_type          map_type;             /*    24     4 */
              u32                        key_size;             /*    28     4 */
              u32                        value_size;           /*    32     4 */
              u32                        max_entries;          /*    36     4 */
              u32                        map_flags;            /*    40     4 */
              u32                        pages;                /*    44     4 */
              u32                        id;                   /*    48     4 */
              int                        numa_node;            /*    52     4 */
              bool                       unpriv_array;         /*    56     1 */
      
              /* XXX 7 bytes hole, try to pack */
      
              /* --- cacheline 1 boundary (64 bytes) --- */
              struct user_struct *       user;                 /*    64     8 */
              atomic_t                   refcnt;               /*    72     4 */
              atomic_t                   usercnt;              /*    76     4 */
              struct work_struct         work;                 /*    80    32 */
              char                       name[16];             /*   112    16 */
              /* --- cacheline 2 boundary (128 bytes) --- */
      
              /* size: 128, cachelines: 2, members: 17 */
              /* sum members: 121, holes: 1, sum holes: 7 */
        };
      
      Now all entries in the first cacheline are read-only throughout the
      lifetime of the map, set up once during map creation. The overall struct
      size and number of cachelines don't change with the reordering.
      struct bpf_map is usually the first member, embedded in the map structs
      of specific map implementations, so also avoid letting those members sit
      at the end, where they could potentially share a cacheline with the
      first map values, e.g. in the array, since remote CPUs could trigger map
      updates just as well for those (easily and intentionally dirtying
      members like max_entries) while having subsequent values in cache.
      
      Quoting from Google's Project Zero blog [1]:
      
        Additionally, at least on the Intel machine on which this was
        tested, bouncing modified cache lines between cores is slow,
        apparently because the MESI protocol is used for cache coherence
        [8]. Changing the reference counter of an eBPF array on one
        physical CPU core causes the cache line containing the reference
        counter to be bounced over to that CPU core, making reads of the
        reference counter on all other CPU cores slow until the changed
        reference counter has been written back to memory. Because the
        length and the reference counter of an eBPF array are stored in
        the same cache line, this also means that changing the reference
        counter on one physical CPU core causes reads of the eBPF array's
        length to be slow on other physical CPU cores (intentional false
        sharing).
      
      While this doesn't 'control' the out-of-bounds speculation by masking
      the index as commit b2157399 does, triggering a manipulation of the
      map's reference counter is really trivial, so let's not allow
      max_entries to be affected that easily (a reduced layout sketch follows
      this entry).
      
      Splitting into separate cachelines also generally makes sense from a
      performance perspective, in that the fast path won't take a cache miss
      if the map gets pinned, reused in other progs, etc. from the control
      path, and it thus also avoids unintentional false sharing.
      
        [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      be95a845
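      A hedged, heavily reduced sketch of the layout idea (not the real
      struct bpf_map; field set and names are illustrative):

        #include <linux/cache.h>
        #include <linux/types.h>

        struct example_map {
            /* cacheline 0: written once at map creation, read on lookups */
            u32 key_size;
            u32 value_size;
            u32 max_entries;
            u32 map_flags;

            /* cacheline 1: mutated whenever the map is grabbed/released, so
             * bumping the refcount cannot dirty max_entries & co. on other
             * CPUs */
            atomic_t refcnt ____cacheline_aligned;
            atomic_t usercnt;
        };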
  16. 09 Jan 2018, 1 commit
    • tuntap: XDP transmission · fc72d1d5
      Committed by Jason Wang
      This patch implements XDP transmission for TAP. Since we can't create
      new queues for TAP during XDP setup, the existing ptr_ring is reused for
      queuing XDP buffers. To tell an xdp_buff apart from an sk_buff,
      TUN_XDP_FLAG (0x1UL) is encoded into the lowest bit of the xdp_buff
      pointer during ptr_ring_produce() and decoded when consuming (a sketch
      of the tagging scheme follows this entry). XDP metadata is stored in the
      headroom of the packet, which should work in most cases since drivers
      usually reserve enough headroom. Very minor changes were needed in
      vhost_net: it just needs to peek at the length depending on the type of
      pointer.
      
      Tests were done on two Intel E5-2630 2.40GHz machines connected back to
      back through two 82599ES NICs. Traffic was generated/received through
      MoonGen/testpmd (rxonly). This shows a ~20% improvement when
      xdp_redirect_map redirects from ixgbe to TAP (from 2.50 Mpps to
      3.05 Mpps).
      
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc72d1d5
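      A hedged, plain-C sketch of the low-bit pointer-tagging scheme (the
      names are illustrative, not the tun driver's helpers):

        #include <stdbool.h>
        #include <stdint.h>

        #define EXAMPLE_XDP_FLAG 0x1UL

        /* Both sk_buff and xdp_buff pointers are at least word aligned, so
         * bit 0 is free to mark ring entries that carry an XDP buffer. */
        static inline void *example_xdp_to_ptr(void *xdp)
        {
            return (void *)((uintptr_t)xdp | EXAMPLE_XDP_FLAG);
        }

        static inline bool example_ptr_is_xdp(void *ptr)
        {
            return (uintptr_t)ptr & EXAMPLE_XDP_FLAG;
        }

        static inline void *example_ptr_strip_xdp(void *ptr)
        {
            return (void *)((uintptr_t)ptr & ~EXAMPLE_XDP_FLAG);
        }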