1. 17 3月, 2017 2 次提交
    • I
      ipv4: fib_rules: Add notifier info to FIB rules notifications · 6a003a5f
      Ido Schimmel 提交于
      Whenever a FIB rule is added or removed, a notification is sent in the
      FIB notification chain. However, listeners don't have a way to tell
      which rule was added or removed.
      
      This is problematic as we would like to give listeners the ability to
      decide which action to execute based on the notified rule. Specifically,
      offloading drivers should be able to determine if they support the
      reflection of the notified FIB rule and flush their LPM tables in case
      they don't.
      
      Do that by adding a notifier info to these notifications and embed the
      common FIB rule struct in it.
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a003a5f
    • I
      ipv4: fib_rules: Check if rule is a default rule · 3c71006d
      Ido Schimmel 提交于
      Currently, when non-default (custom) FIB rules are used, devices capable
      of layer 3 offloading flush their tables and let the kernel do the
      forwarding instead.
      
      When these devices' drivers are loaded they register to the FIB
      notification chain, which lets them know about the existence of any
      custom FIB rules. This is done by sending a RULE_ADD notification based
      on the value of 'net->ipv4.fib_has_custom_rules'.
      
      This approach is problematic when VRF offload is taken into account, as
      upon the creation of the first VRF netdev, a l3mdev rule is programmed
      to direct skbs to the VRF's table.
      
      Instead of merely reading the above value and sending a single RULE_ADD
      notification, we should iterate over all the FIB rules and send a
      detailed notification for each, thereby allowing offloading drivers to
      sanitize the rules they don't support and potentially flush their
      tables.
      
      While l3mdev rules are uniquely marked, the default rules are not.
      Therefore, when they are being notified they might invoke offloading
      drivers to unnecessarily flush their tables.
      
      Solve this by adding an helper to check if a FIB rule is a default rule.
      Namely, its selector should match all packets and its action should
      point to the local, main or default tables.
      
      As noted by David Ahern, uniquely marking the default rules is
      insufficient. When using VRFs, it's common to avoid false hits by moving
      the rule for the local table to just before the main table:
      
      Default configuration:
      $ ip rule show
      0:      from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      
      Common configuration with VRFs:
      $ ip rule show
      1000:   from all lookup [l3mdev-table]
      32765:  from all lookup local
      32766:  from all lookup main
      32767:  from all lookup default
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c71006d
  2. 16 3月, 2017 3 次提交
  3. 15 3月, 2017 1 次提交
  4. 14 3月, 2017 4 次提交
  5. 13 3月, 2017 13 次提交
  6. 11 3月, 2017 3 次提交
  7. 10 3月, 2017 14 次提交
    • P
      mlxsw: spectrum: Add support for vlan modify TC action · a150201a
      Petr Machata 提交于
      Add VLAN action offloading. Invoke it from Spectrum flower handler for
      "vlan modify" actions.
      Signed-off-by: NPetr Machata <petrm@mellanox.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a150201a
    • A
      tcp: rename *_sequence_number() to *_seq_and_tsoff() · a30aad50
      Alexey Kodanev 提交于
      The functions that are returning tcp sequence number also setup
      TS offset value, so rename them to better describe their purpose.
      
      No functional changes in this patch.
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a30aad50
    • D
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells 提交于
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      
      The theory lockdep comes up with is as follows:
      
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      
      	mmap_sem must be taken before sk_lock-AF_RXRPC
      
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
      
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      
      	sk_lock-AF_INET must be taken before mmap_sem
      
      However, lockdep's theory is wrong in this instance because it deals only
      with lock classes and not individual locks.  The AF_INET lock in (2) isn't
      really equivalent to the AF_INET lock in (3) as the former deals with a
      socket entirely internal to the kernel that never sees userspace.  This is
      a limitation in the design of lockdep.
      
      Fix the general case by:
      
       (1) Double up all the locking keys used in sockets so that one set are
           used if the socket is created by userspace and the other set is used
           if the socket is created by the kernel.
      
       (2) Store the kern parameter passed to sk_alloc() in a variable in the
           sock struct (sk_kern_sock).  This informs sock_lock_init(),
           sock_init_data() and sk_clone_lock() as to the lock keys to be used.
      
           Note that the child created by sk_clone_lock() inherits the parent's
           kern setting.
      
       (3) Add a 'kern' parameter to ->accept() that is analogous to the one
           passed in to ->create() that distinguishes whether kernel_accept() or
           sys_accept4() was the caller and can be passed to sk_alloc().
      
           Note that a lot of accept functions merely dequeue an already
           allocated socket.  I haven't touched these as the new socket already
           exists before we get the parameter.
      
           Note also that there are a couple of places where I've made the accepted
           socket unconditionally kernel-based:
      
      	irda_accept()
      	rds_rcp_accept_one()
      	tcp_accept_from_sock()
      
           because they follow a sock_create_kern() and accept off of that.
      
      Whilst creating this, I noticed that lustre and ocfs don't create sockets
      through sock_create_kern() and thus they aren't marked as for-kernel,
      though they appear to be internal.  I wonder if these should do that so
      that they use the new set of lock keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdfbabfb
    • A
      userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED · 70ccb92f
      Andrea Arcangeli 提交于
      userfaultfd_remove() has to be execute before zapping the pagetables or
      UFFDIO_COPY could keep filling pages after zap_page_range returned,
      which would result in non zero data after a MADV_DONTNEED.
      
      However userfaultfd_remove() may have to release the mmap_sem.  This was
      handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
      potentially stale vma (the very vma passed to zap_page_range(vma, ...)).
      
      The fix consists in revalidating the vma in case userfaultfd_remove()
      had to release the mmap_sem.
      
      This also optimizes away an unnecessary down_read/up_read in the
      MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.
      
      It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
      userfaultfd_remove() will be defined as "true" at build time.
      
      Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ccb92f
    • Y
      mm/vmstats: add thp_split_pud event for clarity · ce9311cf
      Yisheng Xie 提交于
      We added support for PUD-sized transparent hugepages, however we count
      the event "thp split pud" into thp_split_pmd event.
      
      To separate the event count of thp split pud from pmd, add a new event
      named thp_split_pud.
      
      Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.comSigned-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Siewior <bigeasy@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce9311cf
    • A
      include/linux/fs.h: fix unsigned enum warning with gcc-4.2 · cbfd0c10
      Arnd Bergmann 提交于
      With arm-linux-gcc-4.2, almost every file we build in the kernel ends up
      with this warning:
      
        include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is always false
      
      Later versions don't have this problem, but it's easy enough to work
      around.
      
      Link: http://lkml.kernel.org/r/20161216105634.235457-12-arnd@arndb.deSigned-off-by: NArnd Bergmann <arnd@arndb.de>
      Cc: Russell King <rmk+kernel@armlinux.org.uk>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbfd0c10
    • A
      userfaultfd: non-cooperative: rollback userfaultfd_exit · dd0db88d
      Andrea Arcangeli 提交于
      Patch series "userfaultfd non-cooperative further update for 4.11 merge
      window".
      
      Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
      more testing.  I've been doing testing before and this was also tested
      by kbuild bot and exercised by the selftest, but this bug never
      reproduced before.
      
      I dropped userfaultfd_exit as result.  I dropped it because of
      implementation difficulty in receiving signals in __mmput and because I
      think -ENOSPC as result from the background UFFDIO_COPY should be enough
      already.
      
      Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
      wasn't exercised by the selftest and when I tried to exercise it, after
      moving it to a more correct place in __mmput where it would make more
      sense and where the vma list is stable, it resulted in the
      event_wait_completion in D state.  So then I added the second patch to
      be sure even if we call userfaultfd_event_wait_completion too late
      during task exit(), we won't risk to generate tasks in D state.  The
      same check exists in handle_userfault() for the same reason, except it
      makes a difference there, while here is just a robustness check and it's
      run under WARN_ON_ONCE.
      
      While looking at the userfaultfd_event_wait_completion() function I
      looked back at its callers too while at it and I think it's not ok to
      stop executing dup_fctx on the fcs list because we relay on
      userfaultfd_event_wait_completion to execute
      userfaultfd_ctx_put(fctx->orig) which is paired against
      userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
      list_add(fcs).  This change only takes care of fctx->orig but this area
      also needs further review looking for similar problems in fctx->new.
      
      The only patch that is urgent is the first because it's an use after
      free during a SMP race condition that affects all processes if
      CONFIG_USERFAULTFD=y.  Very hard to reproduce though and probably
      impossible without SLUB poisoning enabled.
      
      This patch (of 3):
      
      I once reproduced this oops with the userfaultfd selftest, it's not
      easily reproducible and it requires SLUB poisoning to reproduce.
      
          general protection fault: 0000 [#1] SMP
          Modules linked in:
          CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G               ------------ T 3.10.0+ #15
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
          task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
          RIP: 0010:[<ffffffff81451299>]  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
          RSP: 0018:ffff8801f833fe80  EFLAGS: 00010202
          RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
          RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
          RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
          R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
          FS:  0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Call Trace:
            do_exit+0x297/0xd10
            SyS_exit+0x17/0x20
            tracesys+0xdd/0xe2
          Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
          RIP  [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0
           RSP <ffff8801f833fe80>
          ---[ end trace 9fecd6dcb442846a ]---
      
      In the debugger I located the "mm" pointer in the stack and walking
      mm->mmap->vm_next through the end shows the vma->vm_next list is fully
      consistent and it is null terminated list as expected.  So this has to
      be an SMP race condition where userfaultfd_exit was running while the
      vma list was being modified by another CPU.
      
      When userfaultfd_exit() run one of the ->vm_next pointers pointed to
      SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).
      
      The reason is that it's not running in __mmput but while there are still
      other threads running and it's not holding the mmap_sem (it can't as it
      has to wait the even to be received by the manager).  So this is an use
      after free that was happening for all processes.
      
      One more implementation problem aside from the race condition:
      userfaultfd_exit has really to check a flag in mm->flags before walking
      the vma or it's going to slowdown the exit() path for regular tasks.
      
      One more implementation problem: at that point signals can't be
      delivered so it would also create a task in D state if the manager
      doesn't read the event.
      
      The major design issue: it overall looks superfluous as the manager can
      check for -ENOSPC in the background transfer:
      
      	if (mmget_not_zero(ctx->mm)) {
      [..]
      	} else {
      		return -ENOSPC;
      	}
      
      It's safer to roll it back and re-introduce it later if at all.
      
      [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
        Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd0db88d
    • M
      scripts/spelling.txt: add "overide" pattern and fix typo instances · 505d3085
      Masahiro Yamada 提交于
      Fix typos and add the following to the scripts/spelling.txt:
      
        overide||override
      
      While we are here, fix the doubled "address" in the touched line
      Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
      
      Also, fix the comment block style in the touched hunks in
      drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
      
      Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      505d3085
    • M
      scripts/spelling.txt: add "disble(d)" pattern and fix typo instances · 8a1115ff
      Masahiro Yamada 提交于
      Fix typos and add the following to the scripts/spelling.txt:
      
        disble||disable
        disbled||disabled
      
      I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
      untouched.  The macro is not referenced at all, but this commit is
      touching only comment blocks just in case.
      
      Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a1115ff
    • J
      ethtool: add CRC32 as an RSS hash function · abb521e3
      Jakub Kicinski 提交于
      CRC32 engines are usually easily available in hardware and generate
      OK spread for RSS hash.  Add CRC32 RSS hash function to ethtool API.
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      abb521e3
    • A
      bpf: convert htab map to hlist_nulls · 4fe84359
      Alexei Starovoitov 提交于
      when all map elements are pre-allocated one cpu can delete and reuse htab_elem
      while another cpu is still walking the hlist. In such case the lookup may
      miss the element. Convert hlist to hlist_nulls to avoid such scenario.
      When bucket lock is taken there is no need to take such precautions,
      so only convert map_lookup and map_get_next to nulls.
      The race window is extremely small and only reproducible with explicit
      udelay() inside lookup_nulls_elem_raw()
      
      Similar to hlist add hlist_nulls_for_each_entry_safe() and
      hlist_nulls_entry_safe() helpers.
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Reported-by: NJonathan Perry <jonperry@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fe84359
    • D
      uapi: fix linux/packet_diag.h userspace compilation error · 745cb7f8
      Dmitry V. Levin 提交于
      Replace MAX_ADDR_LEN with its numeric value to fix the following
      linux/packet_diag.h userspace compilation error:
      
      /usr/include/linux/packet_diag.h:67:17: error: 'MAX_ADDR_LEN' undeclared here (not in a function)
        __u8 pdmc_addr[MAX_ADDR_LEN];
      
      This is not the first case in the UAPI where the numeric value
      of MAX_ADDR_LEN is used instead of symbolic one, uapi/linux/if_link.h
      already does the same:
      
      $ grep MAX_ADDR_LEN include/uapi/linux/if_link.h
      	__u8 mac[32]; /* MAX_ADDR_LEN */
      
      There are no UAPI headers besides these two that use MAX_ADDR_LEN.
      Signed-off-by: NDmitry V. Levin <ldv@altlinux.org>
      Acked-by: NPavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      745cb7f8
    • K
      mm: convert generic code to 5-level paging · c2febafc
      Kirill A. Shutemov 提交于
      Convert all non-architecture-specific code to 5-level paging.
      
      It's mostly mechanical adding handling one more page table level in
      places where we deal with pud_t.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2febafc
    • K
      asm-generic: introduce <asm-generic/pgtable-nop4d.h> · 048456dc
      Kirill A. Shutemov 提交于
      Like with pgtable-nopud.h for 4-level paging, this new header is base
      for converting an architectures to properly folded p4d_t level.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      048456dc