1. 13 May, 2014 3 commits
    •
      cgroup: fix rcu_read_lock() leak in update_if_frozen() · 36e9d2eb
      Tejun Heo committed
      While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
      replace freezer->lock with freezer_mutex") introduced a bug in
      update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
      by adding rcu_read_unlock() before returning.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      36e9d2eb
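The bug class fixed in 36e9d2eb can be illustrated with a minimal userspace sketch. This is not the kernel code: `rcu_depth`, `model_rcu_read_lock()`, `model_rcu_read_unlock()` and both `check_frozen_*()` functions are hypothetical stand-ins, and a plain counter is assumed to model read-side nesting.

```c
#include <stdbool.h>

/* Hypothetical stand-in for RCU read-side nesting; not the kernel API. */
static int rcu_depth;

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/* Buggy shape: the early return skips the unlock. */
static bool check_frozen_buggy(const bool *child_frozen, int n)
{
	model_rcu_read_lock();
	for (int i = 0; i < n; i++) {
		if (!child_frozen[i])
			return false;	/* leaks the read-side lock */
	}
	model_rcu_read_unlock();
	return true;
}

/* Fixed shape: every exit path drops the lock before returning. */
static bool check_frozen_fixed(const bool *child_frozen, int n)
{
	bool all_frozen = true;

	model_rcu_read_lock();
	for (int i = 0; i < n; i++) {
		if (!child_frozen[i]) {
			all_frozen = false;
			break;
		}
	}
	model_rcu_read_unlock();
	return all_frozen;
}
```

With one unfrozen child, the buggy variant returns with `rcu_depth` still at 1, while the fixed variant returns with it back at 0.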
    •
      cgroup_freezer: replace freezer->lock with freezer_mutex · e5ced8eb
      Tejun Heo committed
      After 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it
      to css_set_rwsem"), css task iterators require sleepable context as
      it may block on css_set_rwsem.  I missed that cgroup_freezer was
      iterating tasks under IRQ-safe spinlock freezer->lock.  This leads to
      errors like the following on freezer state reads and transitions.
      
        BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
        in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
        5 locks held by bash/462:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
         #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
         #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
         #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
         #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
        Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
      
        CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
         ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
         0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
        Call Trace:
         [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
         [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
         [<ffffffff81d05974>] down_read+0x24/0x60
         [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
         [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
         [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
         [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
         [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
         [<ffffffff811f121d>] SyS_write+0x4d/0xc0
         [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
      
      freezer->lock used to be used in hot paths but that time is long gone
      and there's no reason for the lock to be IRQ-safe spinlock or even
      per-cgroup.  In fact, given the fact that a cgroup may contain large
      number of tasks, it's not a good idea to iterate over them while
      holding IRQ-safe spinlock.
      
      Let's simplify locking by replacing per-cgroup freezer->lock with
      global freezer_mutex.  This also makes the comments explaining the
      intricacies of policy inheritance and the locking around it
      unnecessary, as the states are protected by a common mutex.
      
      The conversion is mostly straightforward.  The following are worth
      mentioning.
      
      * freezer_css_online() no longer needs double locking.
      
      * freezer_attach() now performs propagation simply while holding
        freezer_mutex.  update_if_frozen() race no longer exists and the
        comment is removed.
      
      * freezer_fork() now tests whether the task is in root cgroup using
        the new task_css_is_root() without doing rcu_read_lock/unlock().  If
        not, it grabs freezer_mutex and performs the operation.
      
      * freezer_read() and freezer_change_state() grab freezer_mutex across
        the whole operation and pin the css while iterating so that each
        descendant processing happens in sleepable context.
      
      Fixes: 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      e5ced8eb
    •
      cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo committed
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      5024ae29
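The idea behind 5024ae29 is that a pointer can be compared against a never-changing root without being dereferenced. A rough userspace sketch follows. `struct css`, `struct task`, `root_css` and `model_task_css_is_root()` are simplified, hypothetical names chosen for illustration and do not reproduce the kernel's definitions.

```c
#include <stdbool.h>

struct css { int id; };			/* hypothetical, heavily simplified */
struct task { struct css *css; };	/* hypothetical */

/* Stand-in for the root css reachable from init_css_set; never freed. */
static struct css root_css = { 0 };

/*
 * Only the pointer value is compared; the css is never dereferenced,
 * so nothing has to keep it alive for the duration of the test.
 */
static bool model_task_css_is_root(const struct task *t)
{
	return t->css == &root_css;
}
```

Because the comparison target never changes, the result is meaningful in any context, which is what lets the freezer fork path described above skip rcu_read_lock/unlock() for tasks in the root cgroup.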
  2. 06 May, 2014 1 commit
    •
      blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Tejun Heo committed
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex,
      which nests above both the kernfs s_active protection and cgroup_mutex,
      was added to synchronize cgroup file type operations, as cgroup_mutex
      needed to be grabbed from some file operations and thus can't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(s_active#89);
                                       lock(cgroup_tree_mutex);
                                       lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
      36c38fb7
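The trylock-and-retry approach described in 36c38fb7 can be sketched in userspace as follows. This is only an illustration under stated assumptions: `pol_lock` is a C11 atomic flag standing in for blkcg_pol_mutex, `model_reset_stats()` is a hypothetical name, and returning `-EAGAIN` to ask the caller to retry is an assumption of this sketch, not necessarily how the kernel restarts the file operation.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for blkcg_pol_mutex, built on a C11 atomic flag. */
static atomic_flag pol_lock = ATOMIC_FLAG_INIT;
static int stats_resets;

static bool pol_trylock(void) { return !atomic_flag_test_and_set(&pol_lock); }
static void pol_unlock(void)  { atomic_flag_clear(&pol_lock); }

/*
 * Called with an outer lock (s_active in the report) already held.
 * Blocking on pol_lock here could close the dependency cycle, so back
 * off and let the caller drop the outer lock and retry the operation.
 */
static int model_reset_stats(void)
{
	if (!pol_trylock())
		return -EAGAIN;
	stats_resets++;
	pol_unlock();
	return 0;
}
```

The point of the design is that the inner lock is never waited on while the outer one is held, so the blkcg_pol_mutex --> cgroup_tree_mutex --> s_active chain cannot deadlock through this path.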
  3. 05 May, 2014 2 commits
    •
      device_cgroup: check if exception removal is allowed · d2c2b11c
      Aristeu Rozanski committed
      When the device cgroup hierarchy was introduced in
      	bd2953eb - devcg: propagate local changes down the hierarchy
      
      a specific case was overlooked. Consider the hierarchy below:
      
      	A	default policy: ALLOW, exceptions will deny access
      	 \
      	  B	default policy: ALLOW, exceptions will deny access
      
      There's no need to verify when a new exception is added to B because
      in this case exceptions will deny access to further devices, which is
      always fine. Hierarchy in device cgroup only makes sure B won't have
      more access than A.
      
      But when an exception is removed (by writing devices.allow), it isn't
      checked if the user is in fact removing an inherited exception from A,
      thus giving more access to B.
      
      Example:
      
      	# echo 'a' >A/devices.allow
      	# echo 'c 1:3 rw' >A/devices.deny
      	# echo $$ >A/B/tasks
      	# echo >/dev/null
      	-bash: /dev/null: Operation not permitted
      	# echo 'c 1:3 w' >A/B/devices.allow
      	# echo >/dev/null
      	#
      
      This shouldn't be allowed and this patch fixes it by making sure to never allow
      exceptions in this case to be removed if the exception is partially or fully
      present on the parent.
      
      v3: missing '*' in function description
      v2: improved log message and formatting fixes
      
      Cc: cgroups@vger.kernel.org
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d2c2b11c
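The removal check described in d2c2b11c can be approximated by the sketch below, assuming parent and child both default to ALLOW so that exceptions deny access. `struct dev_ex`, `ex_overlaps()` and `may_remove_exception()` are hypothetical, simplified names and not the kernel's implementation.

```c
#include <stdbool.h>

#define ACC_R 1
#define ACC_W 2
#define ACC_M 4
#define ANY   (~0u)		/* wildcard major/minor */

struct dev_ex {			/* hypothetical, simplified exception */
	char type;		/* 'b' or 'c' */
	unsigned int major, minor;
	unsigned int access;	/* ACC_* bits */
};

/* True if a and b share a device range and at least one access bit. */
static bool ex_overlaps(const struct dev_ex *a, const struct dev_ex *b)
{
	if (a->type != b->type)
		return false;
	if (a->major != ANY && b->major != ANY && a->major != b->major)
		return false;
	if (a->minor != ANY && b->minor != ANY && a->minor != b->minor)
		return false;
	return (a->access & b->access) != 0;
}

/*
 * Removing a child exception that is partially or fully present on the
 * parent would give the child more access than the parent, so refuse.
 */
static bool may_remove_exception(const struct dev_ex *parent, int n,
				 const struct dev_ex *ex)
{
	for (int i = 0; i < n; i++)
		if (ex_overlaps(&parent[i], ex))
			return false;
	return true;
}
```

In the example from the commit message, A denies 'c 1:3 rw', so the request to allow 'c 1:3 w' in B overlaps the parent's exception and is refused.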
    •
      device_cgroup: fix the comment format for recently added functions · f5f3cf6f
      Aristeu Rozanski committed
      Moving more extensive explanations to the end of the comment.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f5f3cf6f
  4. 22 Apr, 2014 1 commit
    •
      device_cgroup: rework device access check and exception checking · 79d71974
      Aristeu Rozanski committed
      Whenever a device file is opened and checked against current device
      cgroup rules, it uses the same function (may_access()) as when a new
      exception rule is added by writing devices.{allow,deny}. In both
      cases the algorithm is the same, regardless of the behavior.
      
      The first problem is that device access is treated the same as rule
      checking. Consider the following structure:
      
      	A	(default behavior: allow, exceptions disallow access)
      	 \
      	  B	(default behavior: allow, exceptions disallow access)
      
      A new exception is added to B by writing devices.deny:
      
      	c 12:34 rw
      
      When checking if that exception is allowed in may_access():
      
      	if (dev_cgroup->behavior == DEVCG_DEFAULT_ALLOW) {
      		if (behavior == DEVCG_DEFAULT_ALLOW) {
      			/* the exception will deny access to certain devices */
      			return true;
      
      This is fine: B is not getting more privileges than A, so the rule is
      accepted.
      
      Now, consider it's a device file open check and the process belongs to
      cgroup B. The access will be generated as:
      
      	behavior: allow
      	exception: c 12:34 rw
      
      The very same chunk of code will allow it, even if there's an explicit
      exception telling to do otherwise.
      
      A simple test case:
      
      	# mkdir new_group
      	# cd new_group
      	# echo $$ >tasks
      	# echo "c 1:3 w" >devices.deny
      	# echo >/dev/null
      	# echo $?
      	0
      
      This is a serious bug and was introduced in
      
      	c39a2a30 devcg: prepare may_access() for hierarchy support
      
      To solve this problem, the device file open function was split from the
      new exception check.
      
      The second problem is how exceptions are processed by may_access(). The
      first part of the said function tries to match fully with an existing
      exception:
      
      	list_for_each_entry_rcu(ex, &dev_cgroup->exceptions, list) {
      		if ((refex->type & DEV_BLOCK) && !(ex->type & DEV_BLOCK))
      			continue;
      		if ((refex->type & DEV_CHAR) && !(ex->type & DEV_CHAR))
      			continue;
      		if (ex->major != ~0 && ex->major != refex->major)
      			continue;
      		if (ex->minor != ~0 && ex->minor != refex->minor)
      			continue;
      		if (refex->access & (~ex->access))
      			continue;
      		match = true;
      		break;
      	}
      
      That means the new exception must be contained in an existing one to
      be considered a match:
      
      	New exception		Existing	match?	notes
      	b 12:34 rwm		b 12:34 rwm	yes
      	b 12:34 r		b *:34 rw	yes
      	b 12:34 rw		b 12:34 w	no	extra "r"
      	b *:34 rw		b 12:34 rw	no	too broad "*"
      	b *:34 rw		b *:34 rwm	yes
      
      Which is fine in some cases. Consider:
      
      	A	(default behavior: deny, exceptions allow access)
      	 \
      	  B	(default behavior: deny, exceptions allow access)
      
      In this case the full match makes sense: the new exception cannot add
      more access than the parent allows.
      
      But this doesn't always work. Consider:
      
      	A	(default behavior: allow, exceptions disallow access)
      	 \
      	  B	(default behavior: deny, exceptions allow access)
      
      In this case, a new exception in B shouldn't match any of the exceptions
      in A, after all you can't allow something that was forbidden by A. But
      consider this scenario:
      
      	New exception	Existing in A	match?	outcome
      	b 12:34 rw	b 12:34 r	no	exception is accepted
      
      Because the new exception has "w" as extra, it doesn't match, so it'll
      be added to B's exception list.
      
      The same problem can happen during a file access check. Consider a
      cgroup with allow as default behavior:
      
      	Access		Exception	match?
      	b 12:34 rw	b 12:34 r	no
      
      In this case, the access didn't match any of the exceptions in the
      cgroup, which is required since exceptions will disallow access.
      
      To solve this problem, two new functions were created to match an
      exception either fully or partially. In the example above, a partial
      check will be performed and it'll produce a match since at least
      "b 12:34 r" from "b 12:34 rw" access matches.
      
      Cc: cgroups@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Aristeu Rozanski <arozansk@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      79d71974
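The difference between full and partial matching discussed in 79d71974 can be sketched as below. This is a simplified illustration, not the kernel's code: `struct dev_ex`, `match_full()` and `match_partial()` are hypothetical names, device types are reduced to a single character, and `ANY` plays the role of the `~0` wildcard shown in the quoted loop.

```c
#include <stdbool.h>

#define ACC_R 1
#define ACC_W 2
#define ACC_M 4
#define ANY   (~0u)		/* wildcard major/minor */

struct dev_ex { char type; unsigned int major, minor, access; };

/* Full match: the request must be entirely contained in the exception. */
static bool match_full(const struct dev_ex *ex, const struct dev_ex *req)
{
	if (ex->type != req->type)
		return false;
	if (ex->major != ANY && ex->major != req->major)
		return false;
	if (ex->minor != ANY && ex->minor != req->minor)
		return false;
	return (req->access & ~ex->access) == 0;
}

/* Partial match: any overlap in device range and access bits counts. */
static bool match_partial(const struct dev_ex *ex, const struct dev_ex *req)
{
	if (ex->type != req->type)
		return false;
	if (ex->major != ANY && req->major != ANY && ex->major != req->major)
		return false;
	if (ex->minor != ANY && req->minor != ANY && ex->minor != req->minor)
		return false;
	return (req->access & ex->access) != 0;
}
```

For the access 'b 12:34 rw' against the exception 'b 12:34 r', `match_full()` reports no match because of the extra "w", while `match_partial()` reports a match because the "r" bit overlaps. That is the behavior a default-allow cgroup needs, since there the exceptions deny access.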
  5. 17 Apr, 2014 1 commit
    •
      cgroup: fix the retry path of cgroup_mount() · e37a06f1
      Li Zefan committed
      If we hit the retry path, we'll call parse_cgroupfs_options() again,
      but the string we pass to it has been modified by the previous call
      to this function.
      
      This bug can be observed by:
      
        # mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \
          mount -t cgroup -o name=foo,cpuset xxx /mnt
        mount: wrong fs type, bad option, bad superblock on xxx,
               missing codepage or helper program, or other error
        ...
      
      The second mount passed "name=foo,cpuset" to the parser, and then it
      hit the retry path and called the parser again, but this time the string
      passed to the parser was "name=foo".
      
      To fix this, we avoid calling parse_cgroupfs_options() again in this
      case.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e37a06f1
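The failure mode behind e37a06f1 is that a parser which tokenizes its input in place cannot safely be run twice on the same buffer. A minimal userspace sketch follows. `parse_options_in_place()` is a hypothetical helper that only counts comma-separated tokens, and is not parse_cgroupfs_options() itself.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical destructive parser: splits opts in place by overwriting
 * each ',' with '\0' and returns the number of tokens it saw.
 */
static int parse_options_in_place(char *opts)
{
	int count = 0;
	char *p = opts;

	while (p && *p) {
		char *comma = strchr(p, ',');

		if (comma)
			*comma = '\0';	/* modifies the caller's buffer */
		count++;
		p = comma ? comma + 1 : NULL;
	}
	return count;
}
```

Calling it twice on a buffer holding "name=foo,cpuset" yields two tokens the first time and only one the second time, because the buffer now reads as "name=foo". This is why the fix avoids re-parsing on the retry path.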
  6. 14 Apr, 2014 4 commits
  7. 13 Apr, 2014 20 commits
    •
      Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild · 321d03c8
      Linus Torvalds committed
      Pull misc kbuild changes from Michal Marek:
       "Here is the non-critical part of kbuild:
         - One bogus coccinelle check removed, one check fixed not to suggest
           the obsolete PTR_RET macro
         - scripts/tags.sh does not index the generated *.mod.c files
         - new objdiff tool to list differences between two versions of an
           object file
         - A fix for scripts/bootgraph.pl"
      
      * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
        scripts/coccinelle: Use PTR_ERR_OR_ZERO
        scripts/bootgraph.pl: Add graphic header
        scripts: objdiff: detect object code changes between two commits
        Coccicheck: Remove memcpy to struct assignment test
        scripts/tags.sh: Ignore *.mod.c
      321d03c8
    •
      sym53c8xx_2: Set DID_REQUEUE return code when aborting squeue · fd1232b2
      Mikulas Patocka committed
      This patch fixes I/O errors with the sym53c8xx_2 driver when the disk
      returns QUEUE FULL status.
      
      When the controller encounters an error (including QUEUE FULL or BUSY
      status), it aborts all not yet submitted requests in the function
      sym_dequeue_from_squeue.
      
      This function aborts them with DID_SOFT_ERROR.
      
      If the disk has a full tag queue, the request that caused the overflow is
      aborted with QUEUE FULL status (and the scsi midlayer properly retries
      it until it is accepted by the disk), but the sym53c8xx_2 driver aborts
      the following requests with DID_SOFT_ERROR --- for them, the midlayer
      does just a few retries and then signals the error up to sd.
      
      The result is that disk returning QUEUE FULL causes request failures.
      
      The error was reproduced on 53c895 with COMPAQ BD03685A24 disk
      (rebranded ST336607LC) with command queue 48 or 64 tags.  The disk has
      64 tags, but under some access patterns it returns QUEUE FULL when there
      are fewer than 64 pending tags.  The SCSI specification allows returning
      QUEUE FULL anytime and it is up to the host to retry.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd1232b2
    •
      powerpc: Don't try to set LPCR unless we're in hypervisor mode · 18aa0da3
      Paul Mackerras committed
      Commit 8f619b54 ("powerpc/ppc64: Do not turn AIL (reloc-on
      interrupts) too early") added code to set the AIL bit in the LPCR
      without checking whether the kernel is running in hypervisor mode.  The
      result is that when the kernel is running as a guest (i.e., under
      PowerKVM or PowerVM), the processor takes a privileged instruction
      interrupt at that point, causing a panic.  The visible result is that
      the kernel hangs after printing "returning from prom_init".
      
      This fixes it by checking for hypervisor mode being available before
      setting LPCR.  If we are not in hypervisor mode, we enable relocation-on
      interrupts later in pSeries_setup_arch using the H_SET_MODE hcall.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18aa0da3
    •
      futex: update documentation for ordering guarantees · d7e8af1a
      Davidlohr Bueso committed
      Commits 11d4616b ("futex: revert back to the explicit waiter
      counting code") and 69cd9eba ("futex: avoid race between requeue and
      wake") changed some of the finer details of how we think about futexes.
      One was a late fix and the other a consequence of overlooking the whole
      requeuing logic.
      
      The first change caused our documentation to be incorrect, and the
      second made us aware that we need to explicitly add more details to it.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7e8af1a
    •
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 454fd351
      Linus Torvalds committed
      Pull yet more networking updates from David Miller:
      
       1) Various fixes to the new Redpine Signals wireless driver, from
          Fariya Fatima.
      
       2) L2TP PPP connect code takes PMTU from the wrong socket, fix from
          Dmitry Petukhov.
      
       3) UFO and TSO packets differ in whether they include the protocol
          header in gso_size, account for that in skb_gso_transport_seglen().
          From Florian Westphal.
      
       4) If VLAN untagging fails, we double free the SKB in the bridging
          output path.  From Toshiaki Makita.
      
       5) Several call sites of sk->sk_data_ready() were referencing an SKB
          just added to the socket receive queue in order to calculate the
          second argument via skb->len.  This is dangerous because the moment
          the skb is added to the receive queue it can be consumed in another
          context and freed up.
      
          It turns out also that none of the sk->sk_data_ready()
          implementations even care about this second argument.
      
          So just kill it off and thus fix all these use-after-free bugs as a
          side effect.
      
       6) Fix inverted test in tcp_v6_send_response(), from Lorenzo Colitti.
      
       7) pktgen needs to do locking properly for LLTX devices, from Daniel
          Borkmann.
      
       8) xen-netfront driver initializes TX array entries in RX loop :-) From
          Vincenzo Maffione.
      
       9) After refactoring, some tunnel drivers allow a tunnel to be
          configured on top of itself.  Fix from Nicolas Dichtel.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
        vti: don't allow to add the same tunnel twice
        gre: don't allow to add the same tunnel twice
        drivers: net: xen-netfront: fix array initialization bug
        pktgen: be friendly to LLTX devices
        r8152: check RTL8152_UNPLUG
        net: sun4i-emac: add promiscuous support
        net/apne: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
        net: ipv6: Fix oif in TCP SYN+ACK route lookup.
        drivers: net: cpsw: enable interrupts after napi enable and clearing previous interrupts
        drivers: net: cpsw: discard all packets received when interface is down
        net: Fix use after free by removing length arg from sk_data_ready callbacks.
        Drivers: net: hyperv: Address UDP checksum issues
        Drivers: net: hyperv: Negotiate suitable ndis version for offload support
        Drivers: net: hyperv: Allocate memory for all possible per-pecket information
        bridge: Fix double free and memory leak around br_allowed_ingress
        bonding: Remove debug_fs files when module init fails
        i40evf: program RSS LUT correctly
        i40evf: remove open-coded skb_cow_head
        ixgb: remove open-coded skb_cow_head
        igbvf: remove open-coded skb_cow_head
        ...
      454fd351
    •
      Merge tag 'blackfin-for-linus' of... · fd18f00d
      Linus Torvalds committed
      Merge tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux
      
      Pull blackfin updates from Steven Miao:
       "Code cleanup, some previously ignored patches, and bug fixes"
      
      * tag 'blackfin-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/realmz6/blackfin-linux:
        blackfin: cleanup board files
        bf609: clock: drop unused clock bit set/clear functions
        Blackfin: bf537: rename "CONFIG_ADT75"
        Blackfin: bf537: rename "CONFIG_AD7314"
        Blackfin: bf537: rename ad2s120x ->ad2s1200
        blackfin: bf537: fix typo "CONFIG_SND_SOC_ADV80X_MODULE"
        blackfin: dma: current count mmr is read only
        bfin_crc: Move architecture independant crc header file out of the blackfin folder.
        bf54x: drop unuesd HOST status,control,timeout registers bit define macros
        blackfin: portmux: cleanup head file
        Blackfin: remove "config IP_CHECKSUM_L1"
        blackfin: Remove GENERIC_GPIO config option again
        blackfin:Use generic /proc/interrupts implementation
        blackfin: bf60x: fix typo "CONFIG_PM_BFIN_WAKE_PA15_POL"
      fd18f00d
    •
      Merge tag 'remoteproc-3.15-cleanups' of... · de0c9cf9
      Linus Torvalds committed
      Merge tag 'remoteproc-3.15-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/remoteproc
      
      Pull remoteproc cleanups from Ohad Ben-Cohen:
       "Several remoteproc cleanup patches coming from Jingoo Han, Julia
        Lawall and Uwe Kleine-König"
      
      * tag 'remoteproc-3.15-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/ohad/remoteproc:
        remoteproc/ste_modem: staticize local symbols
        remoteproc/davinci: simplify use of devm_ioremap_resource
        remoteproc/davinci: drop needless devm_clk_put
      de0c9cf9
    •
      Merge tag 'llvmlinux-for-v3.15' of git://git.linuxfoundation.org/llvmlinux/kernel · 09c9b61d
      Linus Torvalds committed
      Pull llvm patches from Behan Webster:
       "These are some initial updates to support compiling the kernel with
        clang.
      
        These patches have been through the proper reviews to the best of my
        ability, and have been soaking in linux-next for a few weeks.  These
        patches by themselves still do not completely allow clang to be used
        with the kernel code, but lay the foundation for other patches which
        are still under review.
      
        Several other of the LLVMLinux patches have been already added via
        maintainer trees"
      
      * tag 'llvmlinux-for-v3.15' of git://git.linuxfoundation.org/llvmlinux/kernel:
        x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id"
        x86 kbuild: LLVMLinux: More cc-options added for clang
        x86, acpi: LLVMLinux: Remove nested functions from Thinkpad ACPI
        LLVMLinux: Add support for clang to compiler.h and new compiler-clang.h
        LLVMLinux: Remove warning about returning an uninitialized variable
        kbuild: LLVMLinux: Fix LINUX_COMPILER definition script for compilation with clang
        Documentation: LLVMLinux: Update Documentation/dontdiff
        kbuild: LLVMLinux: Adapt warnings for compilation with clang
        kbuild: LLVMLinux: Add Kbuild support for building kernel with Clang
      09c9b61d
    •
      Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending · 141eaccd
      Linus Torvalds committed
      Pull SCSI target updates from Nicholas Bellinger:
       "Here are the target pending updates for v3.15-rc1.  Apologies in
        advance for waiting until the second to last day of the merge window
        to send these out.
      
        The highlights this round include:
      
         - iser-target support for T10 PI (DIF) offloads (Sagi + Or)
         - Fix Task Aborted Status (TAS) handling in target-core (Alex Leung)
         - Pass in transport supported PI at session initialization (Sagi + MKP + nab)
         - Add WRITE_INSERT + READ_STRIP T10 PI support in target-core (nab + Sagi)
         - Fix iscsi-target ERL=2 ASYNC_EVENT connection pointer bug (nab)
         - Fix tcm_fc use-after-free of ft_tpg (Andy Grover)
         - Use correct ib_sg_dma primitives in ib_isert (Mike Marciniszyn)
      
        Also, note the virtio-scsi + vhost-scsi changes to expose T10 PI
        metadata into KVM guest have been left out for now, as there were a
        few comments from MST + Paolo that were not able to be addressed in
        time for v3.15.  Please expect this feature for v3.16-rc1"
      
      * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (43 commits)
        ib_srpt: Use correct ib_sg_dma primitives
        target/tcm_fc: Rename ft_tport_create to ft_tport_get
        target/tcm_fc: Rename ft_{add,del}_lport to {add,del}_wwn
        target/tcm_fc: Rename structs and list members for clarity
        target/tcm_fc: Limit to 1 TPG per wwn
        target/tcm_fc: Don't export ft_lport_list
        target/tcm_fc: Fix use-after-free of ft_tpg
        target: Add check to prevent Abort Task from aborting itself
        target: Enable READ_STRIP emulation in target_complete_ok_work
        target/sbc: Add sbc_dif_read_strip software emulation
        target: Enable WRITE_INSERT emulation in target_execute_cmd
        target/sbc: Add sbc_dif_generate software emulation
        target/sbc: Only expose PI read_cap16 bits when supported by fabric
        target/spc: Only expose PI mode page bits when supported by fabric
        target/spc: Only expose PI inquiry bits when supported by fabric
        target: Pass in transport supported PI at session initialization
        target/iblock: Fix double bioset_integrity_free bug
        Target/sbc: Initialize COMPARE_AND_WRITE write_sg scatterlist
        target/rd: T10-Dif: RAM disk is allocating more space than required.
        iscsi-target: Fix ERL=2 ASYNC_EVENT connection pointer bug
        ...
      141eaccd
    • L
      Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 93094449
      Linus Torvalds committed
      Pull media fixes from Mauro Carvalho Chehab:
       "A series of bug fix patches for v3.15-rc1.  Most are just driver
        fixes.  There are some changes at remote controller core level, fixing
        some definitions on a new API added for Kernel v3.15.
      
        It also adds the missing include at include/uapi/linux/v4l2-common.h,
        to allow its compilation on userspace, as pointed by you"
      
      * 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (24 commits)
        [media] gpsca: remove the risk of a division by zero
        [media] stk1160: warrant a NUL terminated string
        [media] v4l: ti-vpe: retain v4l2_buffer flags for captured buffers
        [media] v4l: ti-vpe: Set correct field parameter for output and capture buffers
        [media] v4l: ti-vpe: zero out reserved fields in try_fmt
        [media] v4l: ti-vpe: Fix initial configuration queue data
        [media] v4l: ti-vpe: Use correct bus_info name for the device in querycap
        [media] v4l: ti-vpe: report correct capabilities in querycap
        [media] v4l: ti-vpe: Allow usage of smaller images
        [media] v4l: ti-vpe: Use video_device_release_empty
        [media] v4l: ti-vpe: Make sure in job_ready that we have the needed number of dst_bufs
        [media] lgdt3305: include sleep functionality in lgdt3304_ops
        [media] drx-j: use customise option correctly
        [media] m88rs2000: fix sparse static warnings
        [media] r820t: fix size and init values
        [media] rc-core: remove generic scancode filter
        [media] rc-core: split dev->s_filter
        [media] rc-core: do not change 32bit NEC scancode format for now
        [media] rtl28xxu: remove duplicate ID 0458:707f Genius TVGo DVB-T03
        [media] xc2028: add missing break to switch
        ...
      93094449
    • L
      Merge tag 'ntb-3.15' of git://github.com/jonmason/ntb · 07f5fef9
      Linus Torvalds committed
      Pull PCIe non-transparent bridge fixes and features from Jon Mason:
       "NTB driver bug fixes to address issues in list traversal, skb leak in
        ntb_netdev, a typo, and a leak of msix entries in the error path.
        Clean ups of the event handling logic, as well as a overall style
        cleanup.  Finally, the driver was converted to use the new
        pci_enable_msix_range logic (and the refactoring to go along with it)"
      
      * tag 'ntb-3.15' of git://github.com/jonmason/ntb:
        ntb: Use pci_enable_msix_range() instead of pci_enable_msix()
        ntb: Split ntb_setup_msix() into separate BWD/SNB routines
        ntb: Use pci_msix_vec_count() to obtain number of MSI-Xs
        NTB: Code Style Clean-up
        NTB: client event cleanup
        ntb: Fix leakage of ntb_device::msix_entries[] array
        NTB: Fix typo in setting one translation register
        ntb_netdev: Fix skb free issue in open
        ntb_netdev: Fix list_for_each_entry exit issue
      07f5fef9
    • L
      ceph: fix pr_fmt() redefinition · 96c57ade
      Linus Torvalds committed
      The vfs merge caused a latent bug to show up:
      
         In file included from fs/ceph/super.h:4:0,
                          from fs/ceph/ioctl.c:3:
         include/linux/ceph/ceph_debug.h:4:0: warning: "pr_fmt" redefined [enabled by default]
          #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
          ^
         In file included from include/linux/kernel.h:13:0,
                          from include/linux/uio.h:12,
                          from include/linux/socket.h:7,
                          from include/uapi/linux/in.h:22,
                          from include/linux/in.h:23,
                          from fs/ceph/ioctl.c:1:
         include/linux/printk.h:214:0: note: this is the location of the previous definition
          #define pr_fmt(fmt) fmt
          ^
      
      where the reason is that <linux/ceph_debug.h> is included much too late
      for the "pr_fmt()" define.
      
      The include of <linux/ceph_debug.h> needs to be the first include in the
      file, but fs/ceph/ioctl.c had for some reason missed that, and it wasn't
      noticeable until some unrelated header file changes brought in an
      indirect earlier include of <linux/kernel.h>.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96c57ade
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 5166701b
      Linus Torvalds committed
      Pull vfs updates from Al Viro:
       "The first vfs pile, with deep apologies for being very late in this
        window.
      
        Assorted cleanups and fixes, plus a large preparatory part of iov_iter
        work.  There's a lot more of that, but it'll probably go into the next
        merge window - it *does* shape up nicely, removes a lot of
        boilerplate, gets rid of locking inconsistencies between aio_write and
        splice_write and I hope to get Kent's direct-io rewrite merged into
        the same queue, but some of the stuff after this point is having
        (mostly trivial) conflicts with the things already merged into
        mainline and with some I want more testing.
      
        This one passes LTP and xfstests without regressions, in addition to
        usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
        giving failures since "mm/readahead.c: fix readahead failure for
        memoryless NUMA nodes and limit readahead pages" - might be a false
        positive, might be a real regression..."
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
        missing bits of "splice: fix racy pipe->buffers uses"
        cifs: fix the race in cifs_writev()
        ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
        kill generic_file_buffered_write()
        ocfs2_file_aio_write(): switch to generic_perform_write()
        ceph_aio_write(): switch to generic_perform_write()
        xfs_file_buffered_aio_write(): switch to generic_perform_write()
        export generic_perform_write(), start getting rid of generic_file_buffer_write()
        generic_file_direct_write(): get rid of ppos argument
        btrfs_file_aio_write(): get rid of ppos
        kill the 5th argument of generic_file_buffered_write()
        kill the 4th argument of __generic_file_aio_write()
        lustre: don't open-code kernel_recvmsg()
        ocfs2: don't open-code kernel_recvmsg()
        drbd: don't open-code kernel_recvmsg()
        constify blk_rq_map_user_iov() and friends
        lustre: switch to kernel_sendmsg()
        ocfs2: don't open-code kernel_sendmsg()
        take iov_iter stuff to mm/iov_iter.c
        process_vm_access: tidy up a bit
        ...
      5166701b
    • D
      Merge branch 'tunnels' · eda43ce0
      David S. Miller committed
      Nicolas Dichtel says:
      
      ====================
      tunnels: don't allow to add the same tunnel twice
      
      This series fixes the check of an existing tunnel with the same
      parameters when a new tunnel is added.  I've checked all users of
      ip_tunnel_newlink(): gre, gretap, ipip and vti. The bug exists only
      for gre and vti.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eda43ce0
    • N
      vti: don't allow to add the same tunnel twice · 8d89dcdf
      Nicolas Dichtel committed
      Before the patch, it was possible to add two times the same tunnel:
      ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit b9959fd3 ("vti: switch to new ip tunnel code").
      
      CC: Cong Wang <amwang@redhat.com>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d89dcdf
    • N
      gre: don't allow to add the same tunnel twice · 5a455275
      Nicolas Dichtel committed
      Before the patch, it was possible to add two times the same tunnel:
      ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
      ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249
      
      It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the
      argument dev->type, which was set only later (when calling ndo_init handler
      in register_netdevice()). Let's set this type in the setup handler, which is
      called before newlink handler.
      
      Introduced by commit c5441932 ("GRE: Refactor GRE tunneling code.").
      
      CC: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a455275
    • V
      drivers: net: xen-netfront: fix array initialization bug · 810d8ced
      Vincenzo Maffione committed
      This patch fixes the initialization of an array used in the TX
      datapath that was mistakenly initialized together with the
      RX datapath arrays. An out of range array access could happen
      when RX and TX rings had different sizes.
      Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      810d8ced
    • D
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net · dcfba949
      David S. Miller committed
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to e1000, e1000e, igb, igbvf, ixgb, ixgbe,
      ixgbevf and i40evf.
      
      Mark fixes an issue with ixgbe and ixgbevf by adding a bit to indicate
      when workqueues have been initialized.  This permits the register read
      error handling from attempting to use them prior to that, which also
      generates warnings.  Checking for a detected removal after initializing
      the work queues allows the probe function to return an error without
      getting the workqueue involved.  Further, if the error_detected
      callback is entered before the workqueues are initialized, exit without
      recovery since the device initialization was so truncated.
      
      Francois Romieu provides several patches to all the drivers to remove
      the open coded skb_cow_head.
      
      Jakub Kicinski provides a fix for igb where last_rx_timestamp should be
      updated only when Rx time stamp is read.
      
      Mitch provides a fix for i40evf where a recent change broke the RSS LUT
      programming causing it to be programmed with all 0's.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dcfba949
    • L
      Merge tag 'trace-3.15-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · 0a7418f5
      Linus Torvalds committed
      Pull more tracing updates from Steven Rostedt:
       "This includes the final patch to clean up and fix the issue with the
        design of tracepoints and how a user could register a tracepoint and
        have that tracepoint not be activated but no error was shown.
      
        The design was for an out of tree module but broke in tree users.  The
        clean up was to remove the saving of the hash table of tracepoint
        names such that they can be enabled before they exist (enabling a
        module tracepoint before that module is loaded).  This added more
        complexity than needed.  The clean up was to remove that code and just
        enable tracepoints that exist or fail if they do not.
      
        This removed a lot of code as well as the complexity that it brought.
        As a side effect, instead of registering a tracepoint by its name, the
        tracepoint needs to be registered with the tracepoint descriptor.
        This removes having to duplicate the tracepoint names that are
        enabled.
      
        The second patch was added that simplified the way modules were
        searched for.
      
        This cleanup required changes that were in the 3.15 queue as well as
        some changes that were added late in the 3.14-rc cycle.  This final
        change waited till the two were merged in upstream and then the change
        was added and full tests were run.  Unfortunately, the test found some
        errors, but after it was already submitted to the for-next branch and
        not to be rebased.  Sparse errors were detected by Fengguang Wu's bot
        tests, and my internal tests discovered that the anonymous union
        initialization triggered a bug in older gcc compilers.  Luckily, there
        was a bugzilla for the gcc bug which gave a work around to the
        problem.  The third and fourth patch handled the sparse error and the
        gcc bug respectively.
      
        A final patch was tagged along to fix a missing documentation for the
        README file"
      
      * tag 'trace-3.15-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        tracing: Add missing function triggers dump and cpudump to README
        tracing: Fix anonymous unions in struct ftrace_event_call
        tracepoint: Fix sparse warnings in tracepoint.c
        tracepoint: Simplify tracepoint module search
        tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints
      0a7418f5
    • L
      Merge git://git.infradead.org/users/eparis/audit · 0b747172
      Linus Torvalds committed
      Pull audit updates from Eric Paris.
      
      * git://git.infradead.org/users/eparis/audit: (28 commits)
        AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
        audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
        audit: do not cast audit_rule_data pointers pointlesly
        AUDIT: Allow login in non-init namespaces
        audit: define audit_is_compat in kernel internal header
        kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
        sched: declare pid_alive as inline
        audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
        syscall_get_arch: remove useless function arguments
        audit: remove stray newline from audit_log_execve_info() audit_panic() call
        audit: remove stray newlines from audit_log_lost messages
        audit: include subject in login records
        audit: remove superfluous new- prefix in AUDIT_LOGIN messages
        audit: allow user processes to log from another PID namespace
        audit: anchor all pid references in the initial pid namespace
        audit: convert PPIDs to the inital PID namespace.
        pid: get pid_t ppid of task in init_pid_ns
        audit: rename the misleading audit_get_context() to audit_take_context()
        audit: Add generic compat syscall support
        audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
        ...
      0b747172
  8. 12 Apr, 2014 8 commits