1. 04 Nov 2019, 1 commit
  2. 01 Nov 2019, 1 commit
    • net: increase SOMAXCONN to 4096 · 19f92a03
      Eric Dumazet authored
      SOMAXCONN is the default value of /proc/sys/net/core/somaxconn.
      
      It has been defined as 128 more than 20 years ago.
      
      Since it caps the listen() backlog values, this very small value has
      caused numerous problems over the years, and many people had
      to raise it on their hosts after being hit by problems.
      
      Google has been using 1024 for at least 15 years, and we increased
      this to 4096 after the TCP listener rework was completed, more than
      4 years ago. We have received no complaints about this change breaking
      any legacy application.
      
      Many applications indeed set up a TCP listener with listen(fd, -1);
      meaning they let the system select the backlog (see the sketch below).
      
      Raising SOMAXCONN lowers the chance of the port being unavailable under
      even a small SYN flood attack, and reduces the possibility of side
      channel vulnerabilities.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 31 Oct 2019, 3 commits
  4. 30 Oct 2019, 1 commit
  5. 29 Oct 2019, 2 commits
    • net: fix sk_page_frag() recursion from memory reclaim · 20eb4f29
      Tejun Heo authored
      sk_page_frag() optimizes skb_frag allocations by using per-task
      skb_frag cache when it knows it's the only user.  The condition is
      determined by seeing whether the socket allocation mask allows
      blocking - if the allocation may block, it obviously owns the task's
      context and ergo exclusively owns current->task_frag.
      
      Unfortunately, this misses recursion through the memory reclaim path,
      as the following backtrace shows.
      
       [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
           ...
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           sock_xmit.isra.24+0xa1/0x170 [nbd]
           nbd_send_cmd+0x1d2/0x690 [nbd]
           nbd_queue_rq+0x1b5/0x3b0 [nbd]
           __blk_mq_try_issue_directly+0x108/0x1b0
           blk_mq_request_issue_directly+0xbd/0xe0
           blk_mq_try_issue_list_directly+0x41/0xb0
           blk_mq_sched_insert_requests+0xa2/0xe0
           blk_mq_flush_plug_list+0x205/0x2a0
           blk_flush_plug_list+0xc3/0xf0
       [1] blk_finish_plug+0x21/0x2e
           _xfs_buf_ioapply+0x313/0x460
           __xfs_buf_submit+0x67/0x220
           xfs_buf_read_map+0x113/0x1a0
           xfs_trans_read_buf_map+0xbf/0x330
           xfs_btree_read_buf_block.constprop.42+0x95/0xd0
           xfs_btree_lookup_get_block+0x95/0x170
           xfs_btree_lookup+0xcc/0x470
           xfs_bmap_del_extent_real+0x254/0x9a0
           __xfs_bunmapi+0x45c/0xab0
           xfs_bunmapi+0x15/0x30
           xfs_itruncate_extents_flags+0xca/0x250
           xfs_free_eofblocks+0x181/0x1e0
           xfs_fs_destroy_inode+0xa8/0x1b0
           destroy_inode+0x38/0x70
           dispose_list+0x35/0x50
           prune_icache_sb+0x52/0x70
           super_cache_scan+0x120/0x1a0
           do_shrink_slab+0x120/0x290
           shrink_slab+0x216/0x2b0
           shrink_node+0x1b6/0x4a0
           do_try_to_free_pages+0xc6/0x370
           try_to_free_mem_cgroup_pages+0xe3/0x1e0
           try_charge+0x29e/0x790
           mem_cgroup_charge_skmem+0x6a/0x100
           __sk_mem_raise_allocated+0x18e/0x390
           __sk_mem_schedule+0x2a/0x40
       [0] tcp_sendmsg_locked+0x8eb/0xe10
           tcp_sendmsg+0x27/0x40
           sock_sendmsg+0x30/0x40
           ___sys_sendmsg+0x26d/0x2b0
           __sys_sendmsg+0x57/0xa0
           do_syscall_64+0x42/0x100
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      In [0], tcp_sendmsg_locked() was using current->task_frag when it
      called sk_wmem_schedule().  It had already calculated how many bytes
      would fit into current->task_frag.  Due to memory pressure,
      sk_wmem_schedule() called into the memory reclaim path, which called
      into xfs and then the IO issue path.  Because the filesystem in
      question is backed by nbd, control goes back into the tcp layer -
      back into tcp_sendmsg_locked().
      
      nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes
      sense - it's in the process of freeing memory and wants to be able to,
      e.g., drop clean pages to make forward progress.  However, this
      confused sk_page_frag() called from [2].  Because it only tests
      whether the allocation allows blocking - which it does - it thinks
      current->task_frag can be used again, although it was already being
      used in [0].
      
      After [2] used current->task_frag, the offset would be increased by
      the used amount.  When control returns to [0],
      current->task_frag's offset has been increased, and the previously
      calculated number of bytes may now overrun the end of the allocated
      memory, leading to silent memory corruption.
      
      Fix it by adding gfpflags_normal_context(), which tests sleepable &&
      !reclaim, and using it to determine whether to use current->task_frag.
      
      v2: Eric didn't like gfp flags being tested twice.  Introduce a new
          helper gfpflags_normal_context() and combine the two tests.
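      
      A sketch of the resulting helpers (this matches the described
      semantics; treat the exact flag arithmetic as a reconstruction):
      
          /* include/linux/gfp.h - sketch */
          static inline bool gfpflags_normal_context(const gfp_t gfp_flags)
          {
                  /*
                   * Sleepable and not reclaim: __GFP_DIRECT_RECLAIM set,
                   * __GFP_MEMALLOC (reclaim's emergency access) clear.
                   */
                  return (gfp_flags & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC)) ==
                          __GFP_DIRECT_RECLAIM;
          }
          
          /* include/net/sock.h - sketch */
          static inline struct page_frag *sk_page_frag(struct sock *sk)
          {
                  if (gfpflags_normal_context(sk->sk_allocation))
                          return &current->task_frag;
          
                  return &sk->sk_frag;
          }
      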
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add skb_queue_empty_lockless() · d7d16a89
      Eric Dumazet authored
      Some paths call skb_queue_empty() without holding
      the queue lock.  We must use a barrier in order
      to keep the compiler from doing strange things, and to avoid
      KCSAN splats.
      
      Adding a barrier in skb_queue_empty() might be overkill;
      I prefer adding a new helper to clearly identify
      the points where callers might be lockless.  This might
      help us find real bugs.
      
      The corresponding WRITE_ONCE() should add zero cost
      for current compilers.
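      
      A sketch of the new helper (the READ_ONCE() placement is the whole
      point; the empty-check itself is unchanged from skb_queue_empty()):
      
          /* include/linux/skbuff.h - sketch */
          static inline bool skb_queue_empty_lockless(const struct sk_buff_head *list)
          {
                  /* pairs with WRITE_ONCE() in the queue mutators */
                  return READ_ONCE(list->next) == (const struct sk_buff *)list;
          }
      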
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 28 Oct 2019, 2 commits
  7. 25 Oct 2019, 5 commits
    • net: remove unnecessary variables and callback · f3b0a18b
      Taehee Yoo authored
      This patch removes the variables and the callback related to the nested
      device structure.
      Devices that can be nested have their own nest_level variable, which
      represents the depth of nested devices.
      The previous patch added new {lower/upper}_level variables, and they
      replace the old private nest_level variable.
      So this patch removes all 'nest_level' variables.
      
      The ->ndo_get_lock_subclass() callback was added to obtain the lockdep
      subclass value - which is really the lower nesting depth - in order to
      avoid lockdep warnings.
      But drivers now use dynamic lockdep keys instead of the subclass to
      avoid those warnings.
      So this patch removes the ->ndo_get_lock_subclass() callback.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add ignore flag to netdev_adjacent structure · 32b6d34f
      Taehee Yoo authored
      netdev_upper_dev_link() is used to link an adjacent node, and
      netdev_upper_dev_unlink() is used to unlink one.
      The unlink operation cannot fail, but the link operation can.
      
      In order to exchange adjacent nodes, we must unlink the old adjacent
      node first, then link the new one.
      If the link operation fails, we should re-link the old adjacent node,
      but that link operation can fail too.
      This eventually breaks the adjacency relationship.
      
      This patch adds an ignore flag to the netdev_adjacent structure.
      If this flag is set, netdev_upper_dev_link() ignores the old adjacent
      node for a moment.
      
      This patch also adds new functions for other modules:
      netdev_adjacent_change_prepare()
      netdev_adjacent_change_commit()
      netdev_adjacent_change_abort()
      
      netdev_adjacent_change_prepare() inserts a new device into the adjacent
      list, but the new device may not be used immediately.
      If netdev_adjacent_change_prepare() fails, it internally rolls back the
      adjacent list, so no further action is needed.
      netdev_adjacent_change_commit() deletes the old device from the
      adjacent list and allows the new device to be used.
      netdev_adjacent_change_abort() rolls back the adjacent list; a usage
      sketch follows.
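      
      A hedged usage sketch of the three-phase API (the signatures are
      inferred from this description, and driver_switch_active_slave() is a
      hypothetical stand-in for the caller's own work):
      
          /* sketch: swap old_dev for new_dev under dev without breaking
           * adjacency if the switch-over fails */
          err = netdev_adjacent_change_prepare(old_dev, new_dev, dev, extack);
          if (err)
                  return err;     /* prepare rolled the list back itself */
          
          err = driver_switch_active_slave(dev, new_dev);
          if (err) {
                  netdev_adjacent_change_abort(old_dev, new_dev, dev);
                  return err;
          }
          
          netdev_adjacent_change_commit(old_dev, new_dev, dev);
      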
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • team: fix nested locking lockdep warning · 369f61be
      Taehee Yoo authored
      A team interface can be nested, and its lock variable can therefore be
      taken recursively.
      But this lock uses a static lockdep key, and there is no nested-locking
      handling code such as mutex_lock_nested(),
      so lockdep warns about a circular locking scenario that cannot
      actually happen.
      To fix this, this patch makes the team module use a dynamic lock key
      instead of a static key, as sketched below.
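      
      A minimal sketch of the per-instance key approach (field name and
      init/uninit placement are assumptions based on this description):
      
          /* sketch: give each team instance its own lockdep class */
          struct team {
                  struct mutex lock;
                  struct lock_class_key lock_key;   /* assumed field name */
                  /* ... */
          };
          
          static int team_init(struct net_device *dev)
          {
                  struct team *team = netdev_priv(dev);
          
                  lockdep_register_key(&team->lock_key);
                  __mutex_init(&team->lock, "team->lock", &team->lock_key);
                  /* ... */
                  return 0;
          }
          
          static void team_uninit(struct net_device *dev)
          {
                  struct team *team = netdev_priv(dev);
          
                  mutex_destroy(&team->lock);
                  lockdep_unregister_key(&team->lock_key);
          }
      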
      
      Test commands:
          ip link add team0 type team
          ip link add team1 type team
          ip link set team0 master team1
          ip link set team0 nomaster
          ip link set team1 master team0
          ip link set team1 nomaster
      
      Splat looks like:
      [   40.364352] WARNING: possible recursive locking detected
      [   40.364964] 5.4.0-rc3+ #96 Not tainted
      [   40.365405] --------------------------------------------
      [   40.365973] ip/750 is trying to acquire lock:
      [   40.366542] ffff888060b34c40 (&team->lock){+.+.}, at: team_set_mac_address+0x151/0x290 [team]
      [   40.367689]
      	       but task is already holding lock:
      [   40.368729] ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.370280]
      	       other info that might help us debug this:
      [   40.371159]  Possible unsafe locking scenario:
      
      [   40.371942]        CPU0
      [   40.372338]        ----
      [   40.372673]   lock(&team->lock);
      [   40.373115]   lock(&team->lock);
      [   40.373549]
      	       *** DEADLOCK ***
      
      [   40.374432]  May be due to missing lock nesting notation
      
      [   40.375338] 2 locks held by ip/750:
      [   40.375851]  #0: ffffffffabcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [   40.376927]  #1: ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.377989]
      	       stack backtrace:
      [   40.378650] CPU: 0 PID: 750 Comm: ip Not tainted 5.4.0-rc3+ #96
      [   40.379368] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [   40.380574] Call Trace:
      [   40.381208]  dump_stack+0x7c/0xbb
      [   40.381959]  __lock_acquire+0x269d/0x3de0
      [   40.382817]  ? register_lock_class+0x14d0/0x14d0
      [   40.383784]  ? check_chain_key+0x236/0x5d0
      [   40.384518]  lock_acquire+0x164/0x3b0
      [   40.385074]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.385805]  __mutex_lock+0x14d/0x14c0
      [   40.386371]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387038]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387632]  ? mutex_lock_io_nested+0x1380/0x1380
      [   40.388245]  ? team_del_slave+0x60/0x60 [team]
      [   40.388752]  ? rcu_read_lock_sched_held+0x90/0xc0
      [   40.389304]  ? rcu_read_lock_bh_held+0xa0/0xa0
      [   40.389819]  ? lock_acquire+0x164/0x3b0
      [   40.390285]  ? lockdep_rtnl_is_held+0x16/0x20
      [   40.390797]  ? team_port_get_rtnl+0x90/0xe0 [team]
      [   40.391353]  ? __module_text_address+0x13/0x140
      [   40.391886]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.392547]  team_set_mac_address+0x151/0x290 [team]
      [   40.393111]  dev_set_mac_address+0x1f0/0x3f0
      [ ... ]
      
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: add generic lockdep keys · ab92d68f
      Taehee Yoo authored
      Some interface types can be nested
      (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc.).
      These interface types should set a lockdep class, because without a
      lockdep class key, lockdep always warns about a nonexistent circular
      locking scenario.
      
      In the current code, these interfaces each have their own lockdep class
      keys and manage them themselves, so there is a lot of duplicated code
      across drivers/net/ and net/.
      This patch adds new generic lockdep keys and some helper functions
      for them.
      
      This patch makes the following changes (see the sketch after this list):
      a) Add lockdep class keys in struct net_device
         - qdisc_running, xmit, addr_list, qdisc_busylock
         - these keys are used as dynamic lockdep keys.
      b) When a net_device is being allocated, the lockdep keys are registered.
         - alloc_netdev_mqs()
      c) When a net_device is being freed, the lockdep keys are unregistered.
         - free_netdev()
      d) Add generic lockdep key helper functions
         - netdev_register_lockdep_key()
         - netdev_unregister_lockdep_key()
         - netdev_update_lockdep_key()
      e) Remove the now-unnecessary generic lockdep macros and functions.
      f) Remove the now-unnecessary lockdep code from each interface driver.
      
      After this patch, individual interface modules no longer need to
      maintain their own lockdep keys.
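      
      A hedged sketch of the registration helpers implied by (b) and (d)
      (the exact key field names in struct net_device are assumptions):
      
          /* net/core/dev.c - sketch */
          static void netdev_register_lockdep_key(struct net_device *dev)
          {
                  lockdep_register_key(&dev->qdisc_tx_busylock_key);
                  lockdep_register_key(&dev->qdisc_running_key);
                  lockdep_register_key(&dev->qdisc_xmit_lock_key);
                  lockdep_register_key(&dev->addr_list_lock_key);
          }
          
          static void netdev_unregister_lockdep_key(struct net_device *dev)
          {
                  lockdep_unregister_key(&dev->qdisc_tx_busylock_key);
                  lockdep_unregister_key(&dev->qdisc_running_key);
                  lockdep_unregister_key(&dev->qdisc_xmit_lock_key);
                  lockdep_unregister_key(&dev->addr_list_lock_key);
          }
      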
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: core: limit nested device depth · 5343da4c
      Taehee Yoo authored
      The current code doesn't limit the number of nested devices.
      Nested devices are handled recursively, which needs a lot of stack
      memory, so unlimited nesting of devices could cause stack overflow.
      
      This patch adds upper_level and lower_level, common variables that
      represent the maximum lower/upper nesting depth.
      When an upper/lower device is attached or detached, the
      {lower/upper}_level values are updated, and if the maximum depth grows
      beyond 8, the attach routine fails and returns -EMLINK (see the
      depth-check sketch below).
      
      In addition, this patch converts the recursive
      netdev_walk_all_{lower/upper} routines into iterative ones.
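      
      A minimal sketch of the depth check described above (the constant name
      is an assumption):
      
          /* net/core/dev.c - sketch */
          #define MAX_NEST_DEV 8
          
          static int __netdev_upper_dev_link(struct net_device *dev,
                                             struct net_device *upper_dev)
          {
                  /* the combined chain through this new link must fit */
                  if ((dev->lower_level + upper_dev->upper_level) > MAX_NEST_DEV)
                          return -EMLINK;
          
                  /* ... link the devices and update {lower,upper}_level ... */
                  return 0;
          }
      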
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add link dummy0 name vlan1 type vlan id 1
          ip link set vlan1 up
      
          for i in {2..55}
          do
      	    let A=$i-1
      
      	    ip link add vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      Splat looks like:
      [  155.513226][  T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
      [  155.514162][  T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
      [  155.515048][  T908]
      [  155.515333][  T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  155.516147][  T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  155.517233][  T908] Call Trace:
      [  155.517627][  T908]
      [  155.517918][  T908] Allocated by task 0:
      [  155.518412][  T908] (stack is not available)
      [  155.518955][  T908]
      [  155.519228][  T908] Freed by task 0:
      [  155.519885][  T908] (stack is not available)
      [  155.520452][  T908]
      [  155.520729][  T908] The buggy address belongs to the object at ffff8880608a6ac0
      [  155.520729][  T908]  which belongs to the cache names_cache of size 4096
      [  155.522387][  T908] The buggy address is located 512 bytes inside of
      [  155.522387][  T908]  4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
      [  155.523920][  T908] The buggy address belongs to the page:
      [  155.524552][  T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
      [  155.525836][  T908] flags: 0x100000000010200(slab|head)
      [  155.526445][  T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
      [  155.527424][  T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  155.528429][  T908] page dumped because: kasan: bad access detected
      [  155.529158][  T908]
      [  155.529410][  T908] Memory state around the buggy address:
      [  155.530060][  T908]  ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.530971][  T908]  ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
      [  155.531889][  T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.532806][  T908]                                            ^
      [  155.533509][  T908]  ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
      [  155.534436][  T908]  ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
      [ ... ]
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 24 Oct 2019, 1 commit
    • net/flow_dissector: switch to siphash · 55667441
      Eric Dumazet authored
      Auto flow labels on UDP IPv6 packets use a 32-bit secret
      (static u32 hashrnd in net/core/flow_dissector.c) and
      apply jhash() over fields known to the receivers.
      
      Attackers can easily infer the 32bit secret and use this information
      to identify a device and/or user, since this 32bit secret is only
      set at boot time.
      
      Really, using jhash() to generate cookies sent on the wire
      is a serious security concern.
      
      Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
      a dead end. Trying to periodically change the secret (like in sch_sfq.c)
      could change paths taken in the network for long lived flows.
      
      Let's switch to siphash, as we did in commit df453700
      ("inet: switch IP ID generator to siphash").
      
      Using a cryptographically strong pseudo random function will solve this
      privacy issue and more generally remove other weak points in the stack.
      
      Packet schedulers using skb_get_hash_perturb() benefit from this change.
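      
      The shape of the change, as a hedged sketch (the real patch converts
      several hashing call sites; only the core is shown):
      
          /* net/core/flow_dissector.c - sketch */
          static siphash_key_t hashrnd __read_mostly; /* was: static u32 hashrnd */
          
          static inline u32 __flow_hash_words(const u32 *words, u32 length,
                                              const siphash_key_t *keyval)
          {
                  /* keyed 128-bit siphash instead of jhash2(words, length, seed) */
                  return siphash(words, length * sizeof(u32), keyval);
          }
      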
      
      Fixes: b5677416 ("ipv6: Enable auto flow labels by default")
      Fixes: 42240901 ("ipv6: Implement different admin modes for automatic flow labels")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Fixes: cb1ce2ef ("ipv6: Implement automatic flow label generation on transmit")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Jonathan Berger <jonathann1@walla.com>
      Reported-by: Amit Klein <aksecurity@gmail.com>
      Reported-by: Benny Pinkas <benny@pinkas.net>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 23 Oct 2019, 2 commits
    • dynamic_debug: provide dynamic_hex_dump stub · 011c7289
      Arnd Bergmann authored
      The ionic driver started using dynamic_hex_dump(), but
      that is not always defined:
      
      drivers/net/ethernet/pensando/ionic/ionic_main.c:229:2: error: implicit declaration of function 'dynamic_hex_dump' [-Werror,-Wimplicit-function-declaration]
      
      Add a dummy implementation, printing nothing, to use when
      CONFIG_DYNAMIC_DEBUG is disabled.
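      
      A sketch of the kind of stub this adds (the signature mirrors
      print_hex_dump(); treat it as an approximation of the actual patch):
      
          /* include/linux/dynamic_debug.h - sketch */
          #if !defined(CONFIG_DYNAMIC_DEBUG)
          static inline void dynamic_hex_dump(const char *prefix_str,
                                              int prefix_type, int rowsize,
                                              int groupsize, const void *buf,
                                              size_t len, bool ascii)
          {
                  /* dynamic debug disabled: print nothing */
          }
          #endif
      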
      
      Fixes: 938962d5 ("ionic: Add adminq action")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • bpf: Fix use after free in subprog's jited symbol removal · cd7455f1
      Daniel Borkmann authored
      syzkaller managed to trigger the following crash:
      
        [...]
        BUG: unable to handle page fault for address: ffffc90001923030
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD aa551067 P4D aa551067 PUD aa552067 PMD a572b067 PTE 80000000a1173163
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 0 PID: 7982 Comm: syz-executor912 Not tainted 5.4.0-rc3+ #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:bpf_jit_binary_hdr include/linux/filter.h:787 [inline]
        RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:531 [inline]
        RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline]
        RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline]
        RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline]
        RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline]
        RIP: 0010:is_bpf_text_address+0x184/0x3b0 kernel/bpf/core.c:709
        [...]
        Call Trace:
         kernel_text_address kernel/extable.c:147 [inline]
         __kernel_text_address+0x9a/0x110 kernel/extable.c:102
         unwind_get_return_address+0x4c/0x90 arch/x86/kernel/unwind_frame.c:19
         arch_stack_walk+0x98/0xe0 arch/x86/kernel/stacktrace.c:26
         stack_trace_save+0xb6/0x150 kernel/stacktrace.c:123
         save_stack mm/kasan/common.c:69 [inline]
         set_track mm/kasan/common.c:77 [inline]
         __kasan_kmalloc+0x11c/0x1b0 mm/kasan/common.c:510
         kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:518
         slab_post_alloc_hook mm/slab.h:584 [inline]
         slab_alloc mm/slab.c:3319 [inline]
         kmem_cache_alloc+0x1f5/0x2e0 mm/slab.c:3483
         getname_flags+0xba/0x640 fs/namei.c:138
         getname+0x19/0x20 fs/namei.c:209
         do_sys_open+0x261/0x560 fs/open.c:1091
         __do_sys_open fs/open.c:1115 [inline]
         __se_sys_open fs/open.c:1110 [inline]
         __x64_sys_open+0x87/0x90 fs/open.c:1110
         do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [...]
      
      Further debugging turned up that we walk kallsyms while, in parallel,
      we tear down a BPF program that contains subprograms which have been
      JITed, though the program itself has not been fully exposed and
      eventually bails out with an error.
      
      The bpf_prog_kallsyms_del_subprogs() in bpf_prog_load()'s error path
      removes the symbols; however, bpf_prog_free() tears down the JIT memory
      too early via scheduled work.  Instead, it needs to properly respect the
      RCU grace period, as the kallsyms walk for BPF is under RCU.
      
      Fix it by refactoring __bpf_prog_put()'s teardown and reusing it in our
      error path, where we defer final destruction when the program has
      subprogs (see the sketch below).
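      
      A hedged sketch of the refactored teardown (helper and field names are
      reconstructed from this description, not quoted from the patch):
      
          /* kernel/bpf/syscall.c - sketch */
          static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred)
          {
                  bpf_prog_kallsyms_del_all(prog);
          
                  if (deferred)
                          /* let RCU-protected kallsyms walkers finish before
                           * the JIT images behind the symbols are freed */
                          call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
                  else
                          __bpf_prog_put_rcu(&prog->aux->rcu);
          }
      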
      
      Fixes: 7d1982b4 ("bpf: fix panic in prog load calls cleanup")
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Reported-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/bpf/55f6367324c2d7e9583fa9ccf5385dcbba0d7a6e.1571752452.git.daniel@iogearbox.net
  10. 21 Oct 2019, 3 commits
  11. 18 Oct 2019, 2 commits
  12. 17 Oct 2019, 1 commit
    • arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled · 19c95f26
      Julien Thierry authored
      Preempting from IRQ-return means that the task has its PSTATE saved
      on the stack, which will get restored when the task is resumed and does
      the actual IRQ return.
      
      However, enabling some CPU features requires modifying the PSTATE. This
      means that, if a task was scheduled out during an IRQ-return before all
      CPU features are enabled, the task might restore a PSTATE that does not
      include the feature enablement changes once scheduled back in.
      
      * Task 1:
      
      PAN == 0 ---|                          |---------------
                  |                          |<- return from IRQ, PSTATE.PAN = 0
                  | <- IRQ                   |
                  +--------+ <- preempt()  +--
                                           ^
                                           |
                                           reschedule Task 1, PSTATE.PAN == 1
      * Init:
              --------------------+------------------------
                                  ^
                                  |
                                  enable_cpu_features
                                  set PSTATE.PAN on all CPUs
      
      Worse than this, since PSTATE is untouched when task switching is done,
      a task missing the new bits in PSTATE might affect another task, if both
      do direct calls to schedule() (outside of IRQ/exception contexts).
      
      Fix this by preventing preemption on IRQ-return until features are
      enabled on all CPUs.
      
      This way the only PSTATE values that are saved on the stack are from
      synchronous exceptions. These are expected to be fatal this early, the
      exception is BRK for WARN_ON(), but as this uses do_debug_exception()
      which keeps IRQs masked, it shouldn't call schedule().
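      
      A minimal sketch of the C-side gate (James's note below mentions the
      static key; the exact key name is an assumption from that era of the
      arm64 code):
      
          /* arch/arm64/kernel/process.c - sketch */
          asmlinkage void __sched arm64_preempt_schedule_irq(void)
          {
                  lockdep_assert_irqs_disabled();
          
                  /*
                   * Only preempt from IRQ return once cpufeatures are enabled
                   * everywhere, so a saved PSTATE cannot miss enablement bits
                   * such as PSTATE.PAN when the task is resumed.
                   */
                  if (static_branch_likely(&arm64_const_caps_ready))
                          preempt_schedule_irq();
          }
      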
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      [james: Replaced a really cool hack with an even simpler static key in C;
       expanded commit message with Julien's cover-letter ascii art]
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
  13. 16 Oct 2019, 1 commit
    • net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actions · fa4e0f88
      Davide Caratti authored
      the following script:
      
       # tc qdisc add dev eth0 clsact
       # tc filter add dev eth0 egress protocol ip matchall \
       > action mpls push protocol mpls_uc label 0x355aa bos 1
      
      causes corruption of all IP packets transmitted by eth0. On TC egress, we
      can't rely on the value of skb->mac_len, because it's 0 and an MPLS 'push'
      operation will therefore overwrite the first 4 octets of the packet's
      L2 header (e.g. the Destination Address if eth0 is an Ethernet device); the
      same error pattern is also present in the MPLS 'pop' operation. Fix this
      error in the act_mpls data plane by computing 'mac_len' as the difference
      between the network header and the mac header (when not at TC ingress), and
      using it in the MPLS 'push'/'pop' core functions, as sketched below.
      
      v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len'
          in skb_mpls_pop(), reported by kbuild test robot
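      
      A minimal sketch of the mac_len computation described above (the
      helper name is an assumption):
      
          /* net/sched/act_mpls.c - sketch */
          static unsigned int tcf_mpls_mac_len(const struct sk_buff *skb)
          {
                  /* at TC ingress skb->mac_len is valid; on egress derive
                   * the L2 header length from the header offsets instead */
                  if (skb_at_tc_ingress(skb))
                          return skb->mac_len;
          
                  return skb_network_header(skb) - skb_mac_header(skb);
          }
      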
      
      CC: Lorenzo Bianconi <lorenzo@kernel.org>
      Fixes: 2a2ea508 ("net: sched: add mpls manipulation actions to TC")
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Acked-by: John Hurley <john.hurley@netronome.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 15 Oct 2019, 5 commits
  15. 14 Oct 2019, 2 commits
  16. 12 Oct 2019, 1 commit
  17. 11 Oct 2019, 1 commit
    • SUNRPC: fix race to sk_err after xs_error_report · af84537d
      Benjamin Coddington authored
      Since commit 4f8943f8 ("SUNRPC: Replace direct task wakeups from
      softirq context") there has been a race on the value of sk_err if both
      XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set.  In that
      case, we may end up losing the sk_err value that existed when
      xs_error_report was called.
      
      Fix this by reverting to the previous behavior: instead of using SO_ERROR
      to retrieve the value at a later time (which might also return sk_err_soft),
      copy the sk_err value onto struct sock_xprt, and use that value to wake
      pending tasks.
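      
      A heavily simplified sketch of the snapshot-and-wake idea (the
      xprt_err field follows the description above; the socket-to-transport
      lookup and the locking of the real code are elided):
      
          /* net/sunrpc/xprtsock.c - sketch */
          static void xs_error_report(struct sock *sk)
          {
                  struct sock_xprt *transport = sk->sk_user_data; /* simplified */
          
                  if (!transport)
                          return;
          
                  /* snapshot sk_err now; an SO_ERROR read later could lose it */
                  transport->xprt_err = -sk->sk_err;
                  if (transport->xprt_err == 0)
                          return;
          
                  set_bit(XPRT_SOCK_WAKE_ERROR, &transport->sock_state);
          }
      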
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Fixes: 4f8943f8 ("SUNRPC: Replace direct task wakeups from softirq context")
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  18. 09 Oct 2019, 1 commit
  19. 08 Oct 2019, 5 commits
    • lib/string: Make memzero_explicit() inline instead of external · bec50077
      Arvind Sankar authored
      With the use of the barrier implied by barrier_data(), there is no need
      for memzero_explicit() to be extern. Making it inline saves the overhead
      of a function call, and allows the code to be reused in arch/*/purgatory
      without having to duplicate the implementation.
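      
      The resulting inline is small enough to sketch in full (this mirrors
      the long-standing lib/string.c body):
      
          /* include/linux/string.h - sketch */
          static inline void memzero_explicit(void *s, size_t count)
          {
                  memset(s, 0, count);
                  /* keep the compiler from eliding the "dead" store above */
                  barrier_data(s);
          }
      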
      Tested-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Reviewed-by: Hans de Goede <hdegoede@redhat.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H . Peter Anvin <hpa@zytor.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephan Mueller <smueller@chronox.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-crypto@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Fixes: 906a4bb9 ("crypto: sha256 - Use get/put_unaligned_be32 to get input, memzero_explicit")
      Link: https://lkml.kernel.org/r/20191007220000.GA408752@rani.riverdale.lan
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Vlastimil Babka authored
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e. aligned to the block size itself) blocks for power-of-two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      a specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocate a larger size and only
      give back part of it.  That would be wasteful, especially with a generic
      alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide mm users what they need without difficult
      workarounds or their own reimplementations, so let's make the kmalloc()
      alignment explicitly guaranteed for power-of-two sizes under all
      configurations.  What does this mean for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment, so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting an x86_64 kernel+userspace with virtme, around 16MB of
        memory was consumed by slab pages both before and after the patch, with
        the difference in the noise.
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
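      
      A hedged sketch of the core rule (a hypothetical helper; the real
      patch adjusts each allocator's cache-creation path separately):
      
          /* mm/slab_common.c - sketch of the kmalloc alignment rule */
          static unsigned int kmalloc_cache_align(unsigned int size)
          {
                  unsigned int align = ARCH_KMALLOC_MINALIGN;
          
                  /* power-of-two kmalloc sizes are now naturally aligned */
                  if (is_power_of_2(size))
                          align = max(align, size);
          
                  return align;
          }
      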
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Chris Down authored
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      This change doesn't change the experience for the user in the normal
      case too much.  One benefit is that it replaces the (somewhat arbitrary)
      100% cutoff with an indefinite slope, which makes it easier to ballpark
      a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are relative amounts to the total machine size,
      while the amount of memory we want to generally be willing to yield to
      system.slice is absolute (12G), we end up putting more pressure on
      system.slice just because we have a larger machine and a larger workload
      to fill it, which seems fairly unintuitive.  With this new behaviour, we
      don't end up with this unintended side effect.
      
      Previously the way that memory.low protection works is that if you are
      50% over a certain baseline, you get 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
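      
      A sketch of the proportional rule in roughly the shape it takes in
      get_scan_count() (variable names are approximations):
      
          /* mm/vmscan.c - sketch: protected memory is excluded from scan */
          unsigned long cgroup_size = mem_cgroup_size(memcg);
          unsigned long protection = mem_cgroup_protection(memcg,
                                                           sc->memcg_low_reclaim);
          
          /* guard against usage racing below the protection threshold */
          cgroup_size = max(cgroup_size, protection);
          
          /* e.g. usage 12G, protection 9G -> scan 25% of the lruvec */
          scan = lruvec_size - lruvec_size * protection / cgroup_size;
      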
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Chris Down authored
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
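      
      A small sketch of the rescaling idea with hypothetical names - the
      point is that the scan band now starts at emin rather than 0:
      
          /* sketch: scale low-reclaim pressure over the emin..elow band */
          unsigned long band = max(elow - emin, 1UL); /* avoid div by zero */
          
          if (usage <= emin)
                  scan = 0;       /* hard-protected: no scanning below emin */
          else
                  scan = lruvec_size * min(usage - emin, band) / band;
      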
      
      Rebase on top of, and apply the same idea as, what was applied to handle
      cgroup_disable=memory properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Suggested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down authored
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers being the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>