1. 01 2月, 2017 2 次提交
    • E
      fs: Better permission checking for submounts · 93faccbb
      Eric W. Biederman 提交于
      To support unprivileged users mounting filesystems two permission
      checks have to be performed: a test to see if the user allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems, to any user who
      happens to stumble across the their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately using override_creds messes up the filesystems
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take appropriate
      action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problemantic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never by able to specify it.
      
      autofs4 is modified to stop using current_real_cred that was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: NSeth Forshee <seth.forshee@canonical.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      93faccbb
    • O
      exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction · c6c70f44
      Oleg Nesterov 提交于
      find_new_reaper() checks same_thread_group(reaper, child_reaper) to
      prevent the cross-namespace reparenting but this is not enough if the
      exiting parent was injected by setns() + fork().
      
      Suppose we have a process P in the root namespace and some namespace X.
      P does setns() to enter the X namespace, and forks the child C.
      C forks a grandchild G and exits.
      
      The grandchild G should be re-parented to X->child_reaper, but in this
      case the ->real_parent chain does not lead to ->child_reaper, so it will
      be wrongly reparanted to P's sub-reaper or a global init.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      c6c70f44
  2. 24 1月, 2017 2 次提交
    • N
      inotify: Convert to using per-namespace limits · 1cce1eea
      Nikolay Borisov 提交于
      This patchset converts inotify to using the newly introduced
      per-userns sysctl infrastructure.
      
      Currently the inotify instances/watches are being accounted in the
      user_struct structure. This means that in setups where multiple
      users in unprivileged containers map to the same underlying
      real user (i.e. pointing to the same user_struct) the inotify limits
      are going to be shared as well, allowing one user(or application) to exhaust
      all others limits.
      
      Fix this by switching the inotify sysctls to using the
      per-namespace/per-user limits. This will allow the server admin to
      set sensible global limits, which can further be tuned inside every
      individual user namespace. Additionally, in order to preserve the
      sysctl ABI make the existing inotify instances/watches sysctls
      modify the values of the initial user namespace.
      Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      1cce1eea
    • N
      userns: Make ucounts lock irq-safe · 880a3854
      Nikolay Borisov 提交于
      The ucounts_lock is being used to protect various ucounts lifecycle
      management functionalities. However, those services can also be invoked
      when a pidns is being freed in an RCU callback (e.g. softirq context).
      This can lead to deadlocks. There were already efforts trying to
      prevent similar deadlocks in add7c65c ("pid: fix lockdep deadlock
      warning due to ucount_lock"), however they just moved the context
      from hardirq to softrq. Fix this issue once and for all by explictly
      making the lock disable irqs altogether.
      
      Dmitry Vyukov <dvyukov@google.com> reported:
      
      > I've got the following deadlock report while running syzkaller fuzzer
      > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
      > device if it matters):
      >
      > =================================
      > [ INFO: inconsistent lock state ]
      > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
      > ---------------------------------
      > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      >  (ucounts_lock){+.?...}, at: [<     inline     >] spin_lock
      > ./include/linux/spinlock.h:302
      >  (ucounts_lock){+.?...}, at: [<ffff2000081678c8>]
      > put_ucounts+0x60/0x138 kernel/ucount.c:162
      > {SOFTIRQ-ON-W} state was registered at:
      > [<ffff2000081c82d8>] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2941
      > [<ffff2000081c97a8>] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<     inline     >] get_ucounts kernel/ucount.c:131
      > [<ffff200008167c28>] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
      > [<     inline     >] inc_mnt_namespaces fs/namespace.c:2818
      > [<ffff200008481850>] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
      > [<ffff200008487298>] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
      > [<     inline     >] init_mount_tree fs/namespace.c:3199
      > [<ffff200009bd6674>] mnt_init+0x258/0x384 fs/namespace.c:3251
      > [<ffff200009bd60bc>] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
      > [<ffff200009bb1114>] start_kernel+0x414/0x460 init/main.c:648
      > [<ffff200009bb01e8>] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
      > irq event stamp: 2316924
      > hardirqs last  enabled at (2316924): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2911
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last  enabled at (2316924): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last  enabled at (2316924): [<ffff200008210414>]
      > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
      > hardirqs last disabled at (2316923): [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2900
      > hardirqs last disabled at (2316923): [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > hardirqs last disabled at (2316923): [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      > hardirqs last disabled at (2316923): [<ffff20000820fe80>]
      > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
      > softirqs last  enabled at (2316912): [<ffff20000811b4c4>]
      > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
      > softirqs last disabled at (2316913): [<     inline     >]
      > do_softirq_own_stack ./include/linux/interrupt.h:488
      > softirqs last disabled at (2316913): [<     inline     >]
      > invoke_softirq kernel/softirq.c:371
      > softirqs last disabled at (2316913): [<ffff20000811c994>]
      > irq_exit+0x264/0x308 kernel/softirq.c:405
      >
      > other info that might help us debug this:
      >  Possible unsafe locking scenario:
      >
      >        CPU0
      >        ----
      >   lock(ucounts_lock);
      >   <Interrupt>
      >     lock(ucounts_lock);
      >
      >  *** DEADLOCK ***
      >
      > 1 lock held by swapper/2/0:
      >  #0:  (rcu_callback){......}, at: [<     inline     >] __rcu_reclaim
      > kernel/rcu/rcu.h:108
      >  #0:  (rcu_callback){......}, at: [<     inline     >] rcu_do_batch
      > kernel/rcu/tree.c:2919
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > invoke_rcu_callbacks kernel/rcu/tree.c:3182
      >  #0:  (rcu_callback){......}, at: [<     inline     >]
      > __rcu_process_callbacks kernel/rcu/tree.c:3149
      >  #0:  (rcu_callback){......}, at: [<ffff200008210390>]
      > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
      >
      > stack backtrace:
      > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
      > Hardware name: Hardkernel ODROID-C2 (DT)
      > Call trace:
      > [<ffff20000808fa60>] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
      > [<ffff20000808fec0>] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
      > [<ffff2000088a99e0>] dump_stack+0x110/0x168
      > [<ffff2000082fa2b4>] print_usage_bug.part.27+0x49c/0x4bc
      > kernel/locking/lockdep.c:2387
      > [<     inline     >] print_usage_bug kernel/locking/lockdep.c:2357
      > [<     inline     >] valid_state kernel/locking/lockdep.c:2400
      > [<     inline     >] mark_lock_irq kernel/locking/lockdep.c:2617
      > [<ffff2000081c89ec>] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
      > [<     inline     >] mark_irqflags kernel/locking/lockdep.c:2923
      > [<ffff2000081c9a60>] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
      > [<ffff2000081cce24>] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
      > [<     inline     >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
      > [<ffff200009798128>] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
      > [<     inline     >] spin_lock ./include/linux/spinlock.h:302
      > [<ffff2000081678c8>] put_ucounts+0x60/0x138 kernel/ucount.c:162
      > [<ffff200008168364>] dec_ucount+0xf4/0x158 kernel/ucount.c:214
      > [<     inline     >] dec_pid_namespaces kernel/pid_namespace.c:89
      > [<ffff200008293dc8>] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
      > [<     inline     >] __rcu_reclaim kernel/rcu/rcu.h:118
      > [<     inline     >] rcu_do_batch kernel/rcu/tree.c:2919
      > [<     inline     >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
      > [<     inline     >] __rcu_process_callbacks kernel/rcu/tree.c:3149
      > [<ffff2000082103d8>] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
      > [<ffff2000080821dc>] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
      > [<     inline     >] do_softirq_own_stack ./include/linux/interrupt.h:488
      > [<     inline     >] invoke_softirq kernel/softirq.c:371
      > [<ffff20000811c994>] irq_exit+0x264/0x308 kernel/softirq.c:405
      > [<ffff2000081ecc28>] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
      > [<ffff200008081c80>] gic_handle_irq+0x68/0xd8
      > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
      > 7dc0:                                   ffff8000648d4b3c 0000000000000007
      > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
      > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
      > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
      > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
      > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
      > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
      > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
      > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
      > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
      > [<ffff2000080837f8>] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
      > [<     inline     >] arch_local_irq_restore
      > ./arch/arm64/include/asm/irqflags.h:81
      > [<ffff20000820a47c>] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
      > [<     inline     >] cpuidle_idle_call kernel/sched/idle.c:200
      > [<ffff2000081bcbfc>] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
      > [<ffff2000081bd1cc>] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
      > [<ffff200008099f8c>] secondary_start_kernel+0x2cc/0x358
      > arch/arm64/kernel/smp.c:276
      > [<000000000279f1a4>] 0x279f1a4
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: add7c65c ("pid: fix lockdep deadlock warning due to ucount_lock")
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Link: https://www.spinics.net/lists/kernel/msg2426637.htmlSigned-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      880a3854
  3. 10 1月, 2017 1 次提交
    • A
      pid: fix lockdep deadlock warning due to ucount_lock · add7c65c
      Andrei Vagin 提交于
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
      ---------------------------------------------------------
      swapper/1/0 just changed the state of lock:
       (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (ucounts_lock){+.+...}
      and interrupts could create inverse lock ordering between them.
      other info that might help us debug this:
      Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(ucounts_lock);
                                     local_irq_disable();
                                     lock(&(&sighand->siglock)->rlock);
                                     lock(&(&tty->ctrl_lock)->rlock);
        <Interrupt>
          lock(&(&sighand->siglock)->rlock);
      
       *** DEADLOCK ***
      
      This patch removes a dependency between rlock and ucount_lock.
      
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      add7c65c
  4. 27 12月, 2016 1 次提交
  5. 26 12月, 2016 2 次提交
    • T
      ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner 提交于
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8b0e1953
    • T
      ktime: Get rid of the union · 2456e855
      Thomas Gleixner 提交于
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
      variant for 32bit machines. The Y2038 cleanup removed the timespec variant
      and switched everything to scalar nanoseconds. The union remained, but
      become completely pointless.
      
      Get rid of the union and just keep ktime_t as simple typedef of type s64.
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2456e855
  6. 25 12月, 2016 4 次提交
  7. 23 12月, 2016 1 次提交
    • A
      move aio compat to fs/aio.c · c00d2c7e
      Al Viro 提交于
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  8. 21 12月, 2016 3 次提交
  9. 18 12月, 2016 4 次提交
    • M
      uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation · 297e765e
      Marcin Nowakowski 提交于
      Commit:
      
        72e6ae28 ('ARM: 8043/1: uprobes need icache flush after xol write'
      
      ... has introduced an arch-specific method to ensure all caches are
      flushed appropriately after an instruction is written to an XOL page.
      
      However, when the XOL area is created and the out-of-line breakpoint
      instruction is copied, caches are not flushed at all and stale data may
      be found in icache.
      
      Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
      the arch to ensure all caches are updated accordingly.
      
      This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
      Signed-off-by: NMarcin Nowakowski <marcin.nowakowski@imgtec.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Victor Kamensky <victor.kamensky@linaro.org>
      Cc: linux-mips@linux-mips.org
      Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      297e765e
    • D
      bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Daniel Borkmann 提交于
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before was, because as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, then this reg's id was cleared
      when calling mark_reg_unknown_value() thus that every subsequent
      register was probed against id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case for the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6760bf2d
    • D
      bpf: fix overflow in prog accounting · 5ccb071e
      Daniel Borkmann 提交于
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      its original value that we used to bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is freed eventually. I noticed this that despite having unlimited
      memlock, programs suddenly refused to load with EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) We want accounting
      rather to be more accurate instead of further fooling limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ccb071e
    • D
      bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Daniel Borkmann 提交于
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of
      variable 'raw' from bss section, and should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant during program creation path, which can be
      considered slow-path anyway, lets allocate that dynamically and be not
      implicitly dependent on verifier mutex. Move bpf_prog_calc_digest() at
      the beginning of replace_map_fd_with_map_ptr() and also error handling
      stays straight forward.
      Reported-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aafe6ae9
  10. 17 12月, 2016 1 次提交
    • D
      bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Daniel Borkmann 提交于
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that verifier needs to perform a lot more work for the
      same program with regards to involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      pointed already to 57a09bf0, where we really reach beyond complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers in type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, therefore also id was relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
      it's id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NThomas Graf <tgraf@suug.ch>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a08dd0da
  11. 16 12月, 2016 2 次提交
  12. 15 12月, 2016 17 次提交