1. 23 Jun 2017 (7 commits)
    • genirq: Rename setup_affinity() to irq_setup_affinity() · 43564bd9
      Committed by Thomas Gleixner
      Rename it with a proper irq_ prefix and make it available for other files
      in the core code. Preparatory patch for moving the irq affinity setup
      around.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235444.928501004@linutronix.de
      43564bd9
    • genirq: Remove mask argument from setup_affinity() · cba4235e
      Committed by Thomas Gleixner
      There is no point in this alloc/free dance of cpumasks. Provide a static mask
      for setup_affinity() and protect it properly.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235444.851571573@linutronix.de
      cba4235e
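      A rough userspace sketch of the pattern described above, reusing one static
      scratch mask under a lock instead of allocating and freeing a cpumask per
      call; the names (pick_affinity, scratch_mask) are hypothetical and this is
      not the kernel code itself:

        #include <pthread.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One static scratch "cpumask" instead of an alloc/free per call. */
        static uint64_t scratch_mask;
        static pthread_mutex_t scratch_lock = PTHREAD_MUTEX_INITIALIZER;

        static uint64_t pick_affinity(uint64_t requested, uint64_t online)
        {
                uint64_t result;

                pthread_mutex_lock(&scratch_lock);   /* serializes use of the static mask */
                scratch_mask = requested & online;
                result = scratch_mask ? scratch_mask : online;
                pthread_mutex_unlock(&scratch_lock);
                return result;
        }

        int main(void)
        {
                printf("affinity: %#llx\n",
                       (unsigned long long)pick_affinity(0x5, 0xf));
                return 0;
        }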
    • genirq: Provide irq_fixup_move_pending() · cdd16365
      Committed by Thomas Gleixner
      If a CPU goes offline, the interrupts are migrated away, but an eventually
      pending interrupt move, which has not yet been made effective, is kept
      pending even if the outgoing CPU is the sole target of the pending affinity
      mask. What's worse, the pending affinity mask is discarded even if it
      would contain a valid subset of the online CPUs.

      Implement a helper function which allows these issues to be avoided.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235444.691345468@linutronix.de
      cdd16365
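      A minimal userspace sketch of the decision described above, with
      hypothetical names; it is not the actual kernel helper:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        static bool fixup_move_pending(uint64_t *pending, uint64_t online, int dying_cpu)
        {
                online &= ~(1ULL << dying_cpu);  /* outgoing CPU is no longer a target */
                if (*pending & online)
                        return true;             /* a valid subset survives, keep the move */
                *pending = 0;                    /* nothing valid left, discard it */
                return false;
        }

        int main(void)
        {
                uint64_t pending = 0x6;          /* CPUs 1 and 2 */
                bool keep;

                /* CPU 1 goes offline, CPU 2 is still online: the move stays pending */
                keep = fixup_move_pending(&pending, 0xf, 1);
                printf("keep=%d pending=%#llx\n", keep, (unsigned long long)pending);
                return 0;
        }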
    • genirq/debugfs: Add proper debugfs interface · 087cdfb6
      Committed by Thomas Gleixner
      Debugging (hierarchical) interrupt domains is tedious as there is no
      information about the hierarchy and no information about states of
      interrupts in the various domain levels.
      
      Add a debugfs directory 'irq' and subdirectories 'domains' and 'irqs'.
      
      The domains directory contains the domain files. The content is information
      about the domain. If the domain is part of a hierarchy then the parent
      domains are printed as well.
      
      # ls /sys/kernel/debug/irq/domains/
      default     INTEL-IR-2	    INTEL-IR-MSI-2  IO-APIC-IR-2  PCI-MSI
      DMAR-MSI    INTEL-IR-3	    INTEL-IR-MSI-3  IO-APIC-IR-3  unknown-1
      INTEL-IR-0  INTEL-IR-MSI-0  IO-APIC-IR-0    IO-APIC-IR-4  VECTOR
      INTEL-IR-1  INTEL-IR-MSI-1  IO-APIC-IR-1    PCI-HT
      
      # cat /sys/kernel/debug/irq/domains/VECTOR 
      name:   VECTOR
       size:   0
       mapped: 216
       flags:  0x00000041
      
      # cat /sys/kernel/debug/irq/domains/IO-APIC-IR-0 
      name:   IO-APIC-IR-0
       size:   24
       mapped: 19
       flags:  0x00000041
       parent: INTEL-IR-3
          name:   INTEL-IR-3
           size:   65536
           mapped: 167
           flags:  0x00000041
           parent: VECTOR
              name:   VECTOR
               size:   0
               mapped: 216
               flags:  0x00000041
      
      Unfortunately there is no per-CPU information about the VECTOR domain (yet).
      
      The irqs directory contains detailed information about mapped interrupts.
      
      # cat /sys/kernel/debug/irq/irqs/3
      handler:  handle_edge_irq
      status:   0x00004000
      istate:   0x00000000
      ddepth:   1
      wdepth:   0
      dstate:   0x01018000
                  IRQD_IRQ_DISABLED
                  IRQD_SINGLE_TARGET
                  IRQD_MOVE_PCNTXT
      node:     0
      affinity: 0-143
      effectiv: 0
      pending:  
      domain:  IO-APIC-IR-0
       hwirq:   0x3
       chip:    IR-IO-APIC
        flags:   0x10
                   IRQCHIP_SKIP_SET_WAKE
       parent:
          domain:  INTEL-IR-3
           hwirq:   0x20000
           chip:    INTEL-IR
            flags:   0x0
           parent:
              domain:  VECTOR
               hwirq:   0x3
               chip:    APIC
                flags:   0x0
      
      This was developed to simplify the debugging of the managed affinity
      changes.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235444.537566163@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      087cdfb6
    • genirq/irqdomain: Add map counter · 9dc6be3d
      Committed by Thomas Gleixner
      Add a map counter instead of counting radix tree entries for
      diagnosis. That also gives correct information for linear domains.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235444.459397746@linutronix.de
      9dc6be3d
    • genirq: Allow fwnode to carry name information only · d59f6617
      Committed by Thomas Gleixner
      In order to provide a proper debug interface it's required to have domain
      names available when the domain is added. Non-fwnode-based architectures
      like x86 have no way to do so.
      
      It's not possible to use domain ops or host data for this as domain ops
      might be the same for several instances, but the names have to be unique.
      
      Extend the irqchip fwnode to allow transporting the domain name. If no node
      is supplied, create an 'unknown-N' placeholder.
      
      Warn if an invalid node is supplied and treat it like no node. This happens
      e.g. with i2c devices on x86 which hand in an ACPI-type node that has no
      interface for retrieving the name.
      
      [ Folded a fix from Marc to make DT name parsing work ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235443.588784933@linutronix.de
      d59f6617
    • genirq/msi: Prevent overwriting domain name · 0165308a
      Committed by Thomas Gleixner
      Prevent overwriting an already assigned domain name. Remove the extra check
      for chip->name, because if domain->name is NULL overwriting it with NULL is
      not a problem.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: http://lkml.kernel.org/r/20170619235443.510684976@linutronix.de
      0165308a
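      A toy illustration of the guard described above, using hypothetical
      structures rather than the kernel's MSI code:

        #include <stdio.h>

        struct toy_chip   { const char *name; };
        struct toy_domain { const char *name; };

        static void toy_update_dom_name(struct toy_domain *domain, const struct toy_chip *chip)
        {
                if (!domain->name)              /* never overwrite an assigned name */
                        domain->name = chip->name;
        }

        int main(void)
        {
                struct toy_chip chip = { "PCI-MSI" };
                struct toy_domain dom = { "my-msi-domain" };

                toy_update_dom_name(&dom, &chip);
                printf("%s\n", dom.name);       /* still "my-msi-domain" */
                return 0;
        }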
  2. 21 Jun 2017 (3 commits)
  3. 13 Jun 2017 (2 commits)
  4. 12 Jun 2017 (1 commit)
  5. 11 Jun 2017 (2 commits)
  6. 08 Jun 2017 (4 commits)
    • srcu: Allow use of Classic SRCU from both process and interrupt context · 1123a604
      Committed by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq context
      as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Cc: stable@vger.kernel.org
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1123a604
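      A self-contained sketch of the lost-update race described above, with the
      interrupt simulated by a plain function call; it is an illustration, not
      SRCU code:

        #include <stdio.h>

        static unsigned long lock_count;        /* stands in for ->lock_count[idx] */

        static void simulated_irq_handler(void)
        {
                lock_count++;                   /* srcu_read_lock() from irq context */
        }

        int main(void)
        {
                unsigned long tmp = lock_count; /* read ... */
                simulated_irq_handler();        /* irq hits between read and write */
                lock_count = tmp + 1;           /* ... write: the irq's increment is lost */

                printf("lock_count = %lu, expected 2\n", lock_count);   /* prints 1 */
                return 0;
        }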
    • srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context · cdf7abc4
      Committed by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq context
      as long as process context users disable interrupts.  In addition,
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
      a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
      implementation already supports mixed-context use of srcu_read_lock()
      and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
      and srcu_read_unlock() in each handler are nested and paired properly.
      In other words, it is still illegal to (say) invoke srcu_read_lock()
      in an interrupt handler and to invoke the matching srcu_read_unlock()
      in a softirq handler.  Therefore, the only change required for Tiny SRCU
      is to its comments.
      
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      cdf7abc4
    • Revert "printk: fix double printing with earlycon" · dac8bbba
      Committed by Petr Mladek
      This reverts commit cf39bf58.
      
      The commit caused a regression for users that define both console=ttyS1
      and console=ttyS0 on the command line, see
      https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain
      
      The kernel log messages always appeared only on one serial port. It is
      even documented in Documentation/admin-guide/serial-console.rst:
      
      "Note that you can only define one console per device type (serial,
      video)."
      
      The above mentioned commit changed the order in which the command line
      parameters are searched. As a result, the kernel log messages go to
      the last mentioned ttyS* instead of the first one.
      
      We long thought that using two console=ttyS* on the command line
      did not make sense. But then we realized that console= parameters
      were handled also by systemd, see
      http://0pointer.de/blog/projects/serial-console.html
      
      "By default systemd will instantiate one serial-getty@.service on
      the main kernel console, if it is not a virtual terminal."
      
      where
      
      "[4] If multiple kernel consoles are used simultaneously, the main
      console is the one listed first in /sys/class/tty/console/active,
      which is the last one listed on the kernel command line."
      
      This puts the original report into another light. The system is running
      in qemu. The first serial port is used to store the messages into a file.
      The second one is used to login to the system via a socket. It depends
      on systemd and the historic kernel behavior.
      
      In other words, systemd makes it sensible to define both
      console=ttyS1 and console=ttyS0 on the command line. The kernel fix
      caused a regression related to userspace (systemd) and needs to be
      reverted.
      
      In addition, it turned out that the fix helped only partially.
      The messages were still duplicated when the boot console was
      removed early by late_initcall(printk_late_init). Then the entire
      log was replayed when the same console was registered as a normal one.
      
      Link: 20170606160339.GC7604@pathway.suse.cz
      Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Robin Murphy <robin.murphy@arm.com>,
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
      Cc: linux-serial@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      dac8bbba
    • perf/core: Drop kernel samples even though :u is specified · cc1582c2
      Committed by Jin Yao
      When doing sampling, for example:
      
        perf record -e cycles:u ...
      
      On workloads that do a lot of kernel entries/exits we see kernel
      samples, even though :u is specified. This is due to skid: the PMU
      interrupt can land after the CPU has already entered the kernel, so the
      sampled IP is a kernel address.
      
      This might be a security issue because it can leak kernel addresses even
      though kernel sampling support is disabled.
      
      The patch drops the kernel samples if exclude_kernel is specified.
      
      For example, test on Haswell desktop:
      
        perf record -e cycles:u <mgen>
        perf report --stdio
      
      Before patch applied:
      
          99.77%  mgen     mgen              [.] buf_read
           0.20%  mgen     mgen              [.] rand_buf_init
           0.01%  mgen     [kernel.vmlinux]  [k] apic_timer_interrupt
           0.00%  mgen     mgen              [.] last_free_elem
           0.00%  mgen     libc-2.23.so      [.] __random_r
           0.00%  mgen     libc-2.23.so      [.] _int_malloc
           0.00%  mgen     mgen              [.] rand_array_init
           0.00%  mgen     [kernel.vmlinux]  [k] page_fault
           0.00%  mgen     libc-2.23.so      [.] __random
           0.00%  mgen     libc-2.23.so      [.] __strcasestr
           0.00%  mgen     ld-2.23.so        [.] strcmp
           0.00%  mgen     ld-2.23.so        [.] _dl_start
           0.00%  mgen     libc-2.23.so      [.] sched_setaffinity@@GLIBC_2.3.4
           0.00%  mgen     ld-2.23.so        [.] _start
      
      We can see kernel symbols apic_timer_interrupt and page_fault.
      
      After patch applied:
      
          99.79%  mgen     mgen           [.] buf_read
           0.19%  mgen     mgen           [.] rand_buf_init
           0.00%  mgen     libc-2.23.so   [.] __random_r
           0.00%  mgen     mgen           [.] rand_array_init
           0.00%  mgen     mgen           [.] last_free_elem
           0.00%  mgen     libc-2.23.so   [.] vfprintf
           0.00%  mgen     libc-2.23.so   [.] rand
           0.00%  mgen     libc-2.23.so   [.] __random
           0.00%  mgen     libc-2.23.so   [.] _int_malloc
           0.00%  mgen     libc-2.23.so   [.] _IO_doallocbuf
           0.00%  mgen     ld-2.23.so     [.] do_lookup_x
           0.00%  mgen     ld-2.23.so     [.] open_verify.constprop.7
           0.00%  mgen     ld-2.23.so     [.] _dl_important_hwcaps
           0.00%  mgen     libc-2.23.so   [.] sched_setaffinity@@GLIBC_2.3.4
           0.00%  mgen     ld-2.23.so     [.] _start
      
      There are only userspace symbols.
      Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: acme@kernel.org
      Cc: jolsa@kernel.org
      Cc: kan.liang@intel.com
      Cc: mark.rutland@arm.com
      Cc: will.deacon@arm.com
      Cc: yao.jin@intel.com
      Link: http://lkml.kernel.org/r/1495706947-3744-1-git-send-email-yao.jin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc1582c2
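      A hedged userspace model of the filtering described above, not the kernel's
      perf code; the kernel/user boundary constant is an assumption for x86-64:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define TOY_KERNEL_START 0xffff800000000000ULL   /* assumed kernel/user split */

        static bool keep_sample(uint64_t ip, bool exclude_kernel)
        {
                if (exclude_kernel && ip >= TOY_KERNEL_START)
                        return false;            /* skid landed in the kernel: drop it */
                return true;
        }

        int main(void)
        {
                printf("user ip kept:   %d\n", keep_sample(0x0000000000401000ULL, true));
                printf("kernel ip kept: %d\n", keep_sample(0xffffffff81000000ULL, true));
                return 0;
        }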
  7. 07 Jun 2017 (1 commit)
  8. 04 Jun 2017 (4 commits)
    • alarmtimer: Rate limit periodic intervals · ff86bf0c
      Committed by Thomas Gleixner
      The alarmtimer code has another source of potentially rearming itself too
      fast. Interval timers with a very small interval have a similar CPU-hogging
      effect as the previously fixed overflow issue.
      
      The reason is that alarmtimers do not implement the normal protection
      against this kind of problem which the other posix timers use:
      
        timer expires -> queue signal -> deliver signal -> rearm timer
      
      This scheme brings the rearming under scheduler control and prevents
      permanently firing timers which hog the CPU.
      
      Bringing this scheme to the alarm timer code is a major overhaul because it
      lacks all the necessary mechanisms completely.
      
      So as a quick fix, limit the interval to one jiffy. This is not
      problematic in practice as alarmtimers are usually backed by an RTC for
      suspend, which has 1 second resolution. It could therefore be argued that
      the resolution of this clock should be set to 1 second in general, but
      that's outside the scope of this fix.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170530211655.896767100@linutronix.de
      ff86bf0c
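      A minimal sketch of the described rate limit, with an assumed tick length;
      it is not the kernel implementation:

        #include <stdint.h>
        #include <stdio.h>

        #define TOY_TICK_NSEC 4000000LL          /* assumed 250 Hz tick, 4 ms */

        static int64_t clamp_interval(int64_t interval_ns)
        {
                /* periodic intervals shorter than one tick are raised to one tick */
                return interval_ns < TOY_TICK_NSEC ? TOY_TICK_NSEC : interval_ns;
        }

        int main(void)
        {
                printf("%lld\n", (long long)clamp_interval(10));          /* 4000000 */
                printf("%lld\n", (long long)clamp_interval(5000000LL));   /* unchanged */
                return 0;
        }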
    • alarmtimer: Prevent overflow of relative timers · f4781e76
      Committed by Thomas Gleixner
      Andrey reported an alarmtimer-related RCU stall while fuzzing the kernel with
      syzkaller.
      
      The reason for this is an overflow in ktime_add() which brings the
      resulting time into negative space and causes immediate expiry of the
      timer. The following rearm with a small interval does not bring the timer
      back into positive space due to the same issue.
      
      This results in a permanent firing alarmtimer which hogs the CPU.
      
      Use ktime_add_safe() instead which detects the overflow and clamps the
      result to KTIME_SEC_MAX.
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170530211655.802921648@linutronix.de
      f4781e76
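      A self-contained demo of the overflow versus saturating-add behaviour
      described above, using toy functions rather than the kernel's ktime API:

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        #define TOY_TIME_MAX INT64_MAX

        static int64_t toy_time_add(int64_t a, int64_t b)
        {
                return (int64_t)((uint64_t)a + (uint64_t)b);   /* may wrap negative */
        }

        static int64_t toy_time_add_safe(int64_t a, int64_t b)
        {
                uint64_t sum = (uint64_t)a + (uint64_t)b;

                /* detect the wrap and clamp to the maximum representable time */
                return (int64_t)sum < 0 ? TOY_TIME_MAX : (int64_t)sum;
        }

        int main(void)
        {
                int64_t now = TOY_TIME_MAX - 5, delta = 100;

                printf("plain: %" PRId64 "\n", toy_time_add(now, delta));      /* negative */
                printf("safe:  %" PRId64 "\n", toy_time_add_safe(now, delta)); /* clamped */
                return 0;
        }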
    • genirq: Warn when IRQ_NOAUTOEN is used with shared interrupts · 04c848d3
      Committed by Thomas Gleixner
      Shared interrupts do not go well with disabling auto enable:
      
      1) The sharing interrupt might request it while it's still disabled and
         then wait for interrupts forever.
      
      2) The interrupt might have been requested by the driver sharing the line
         before IRQ_NOAUTOEN has been set. So the driver which expects that
         disabled state after calling request_irq() will not get what it wants.
         Even worse, when it calls enable_irq() later, it will trigger the
         unbalanced enable_irq() warning.
      Reported-by: Brian Norris <briannorris@chromium.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: dianders@chromium.org
      Cc: jeffy <jeffy.chen@rock-chips.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: tfiga@chromium.org
      Link: http://lkml.kernel.org/r/20170531100212.210682135@linutronix.de
      04c848d3
    • genirq: Handle NOAUTOEN interrupt setup proper · 201d7f47
      Committed by Thomas Gleixner
      If an interrupt is marked NOAUTOEN then request_irq() installs the action,
      but does not enable the interrupt via startup_irq().  The interrupt is
      enabled via enable_irq() later from the driver. enable_irq() calls
      irq_enable().
      
      That means that for interrupts which have an irq_startup() callback this
      callback is never invoked. Neither is irq_domain_activate_irq() invoked for
      such interrupts.
      
      If an interrupt depends on irq_startup() or irq_domain_activate_irq() then
      the enable via irq_enable() is not enough.
      
      Add a status flag IRQD_IRQ_STARTED_UP and use this to select the proper
      mechanism in enable_irq(). Use the flag also to avoid pointless calls into
      the low level functions.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: dianders@chromium.org
      Cc: jeffy <jeffy.chen@rock-chips.com>
      Cc: Brian Norris <briannorris@chromium.org>
      Cc: tfiga@chromium.org
      Link: http://lkml.kernel.org/r/20170531100212.130986205@linutronix.de
      201d7f47
  9. 03 Jun 2017 (1 commit)
  10. 27 May 2017 (3 commits)
  11. 26 May 2017 (4 commits)
    • genirq: Make early_irq_init() print out more informative · 5a29ef22
      Committed by Vincent Legoll
      The printk in early_irq_init() is cryptic and badly formatted:
      
        NR_IRQS:33024 nr_irqs:968 16
      
      The last number is the number of preallocated interrupts, so add a prefix
      to it:
      
        NR_IRQS: 33024, nr_irqs: 968, preallocated irqs: 16
      
      Clean up the formatting for better readability as well.
      Signed-off-by: Vincent Legoll <vincent.legoll@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1494318849-6733-1-git-send-email-vincent.legoll@gmail.com
      5a29ef22
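      A sketch of the message change described above as a plain C program, with
      printf standing in for printk and the example values taken from the commit
      message:

        #include <stdio.h>

        int main(void)
        {
                int nr_irqs_max = 33024, nr_irqs = 968, preallocated = 16;

                /* old, cryptic form */
                printf("NR_IRQS:%d nr_irqs:%d %d\n", nr_irqs_max, nr_irqs, preallocated);

                /* new, labelled form */
                printf("NR_IRQS: %d, nr_irqs: %d, preallocated irqs: %d\n",
                       nr_irqs_max, nr_irqs, preallocated);
                return 0;
        }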
    • bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Committed by Daniel Borkmann
      trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
      attr->map_flags, since it does not support preallocation yet. We
      check the flag, but we never copy the flag into trie->map.map_flags,
      which is later on exposed into fdinfo and used by loaders such as
      iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
      whether a pinned map has the same spec as the one from the BPF obj
      file and, if not, bails out. This is currently the case for lpm
      since it always exposes 0 as flags.
      
      Also copy over flags in array_map_alloc() and stack_map_alloc().
      They always have to be 0 right now, but we should make sure not to
      miss copying them over at a later point in time when we add actual
      flags for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: Jarno Rajahalme <jarno@covalent.io>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a316338c
    • bpf: properly reset caller saved regs after helper call and ld_abs/ind · a9789ef9
      Committed by Daniel Borkmann
      Currently, after performing helper calls, we clear all caller-saved
      registers, that is r0 - r5, and fill r0 depending on the struct bpf_func_proto
      specification. The way we reset these regs can affect pruning decisions
      in later paths, since we only reset the register's imm to 0 and its type to
      NOT_INIT. However, we leave other fields such as id, min_value and
      max_value uncleared, which can later lead to pruning mismatches due to
      stale data.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a9789ef9
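      A toy model of the difference described above, using a hypothetical register
      structure rather than the verifier's bpf_reg_state:

        #include <stdio.h>
        #include <string.h>

        enum toy_reg_type { NOT_INIT = 0, SCALAR = 1 };

        struct toy_reg {
                enum toy_reg_type type;
                long imm;
                long min_value;     /* stale if not cleared */
                long max_value;
        };

        static void mark_reg_unknown(struct toy_reg *reg)
        {
                memset(reg, 0, sizeof(*reg));   /* clears imm, bounds, everything */
                reg->type = NOT_INIT;
        }

        int main(void)
        {
                struct toy_reg r = { SCALAR, 42, -1, 100 };

                mark_reg_unknown(&r);
                printf("min=%ld max=%ld\n", r.min_value, r.max_value);   /* 0 0 */
                return 0;
        }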
    • bpf: fix incorrect pruning decision when alignment must be tracked · 1ad2f583
      Committed by Daniel Borkmann
      Currently, when we enforce alignment tracking on direct packet access,
      the verifier lets the following program pass despite doing a packet
      write with unaligned access:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (bf) r0 = r2
        4: (07) r0 += 14
        5: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        6: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        7: (63) *(u32 *)(r0 -4) = r0
        8: (b7) r0 = 0
        9: (95) exit
      
        from 6 to 8:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1 R10=fp
        8: (b7) r0 = 0
        9: (95) exit
      
        from 5 to 10:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2 R10=fp
        10: (07) r0 += 1
        11: (05) goto pc-6
        6: safe                           <----- here, wrongly found safe
        processed 15 insns
      
      However, if we enforce a pruning mismatch by adding state into r8
      which is then being mismatched in states_equal(), we find that for
      the otherwise same program, the verifier detects a misaligned packet
      access when actually walking that path:
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (61) r7 = *(u32 *)(r1 +8)
        3: (b7) r8 = 1
        4: (bf) r0 = r2
        5: (07) r0 += 14
        6: (25) if r7 > 0x1 goto pc+4
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=14,r=14) R1=ctx R2=pkt(id=0,off=0,r=14)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        9: (b7) r0 = 0
        10: (95) exit
      
        from 7 to 9:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=0,max_value=1
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        9: (b7) r0 = 0
        10: (95) exit
      
        from 6 to 11:
         R0=pkt(id=0,off=14,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
         R3=pkt_end R7=inv,min_value=2
         R8=imm1,min_value=1,max_value=1,min_align=1 R10=fp
        11: (07) r0 += 1
        12: (b7) r8 = 0
        13: (05) goto pc-7                <----- mismatch due to r8
        7: (2d) if r0 > r3 goto pc+1
         R0=pkt(id=0,off=15,r=15) R1=ctx R2=pkt(id=0,off=0,r=15)
         R3=pkt_end R7=inv,min_value=2
         R8=imm0,min_value=0,max_value=0,min_align=2147483648 R10=fp
        8: (63) *(u32 *)(r0 -4) = r0
        misaligned packet access off 2+15+-4 size 4
      
      The reason why we fail to see it in states_equal() is that the
      third test in compare_ptrs_to_packet() ...
      
        if (old->off <= cur->off &&
            old->off >= old->range && cur->off >= cur->range)
                return true;
      
      ... will let the above pass. The situation we run into is that
      old->off <= cur->off (14 <= 15), meaning that previously walked paths
      used a smaller offset, which was later used in the packet
      access after a successful packet range check and was already found to
      be safe.
      
      For example: Given is R0=pkt(id=0,off=0,r=0). Adding offset 14
      as in above program to it, results in R0=pkt(id=0,off=14,r=0)
      before the packet range test. Now, testing this against R3=pkt_end
      with 'if r0 > r3 goto out' will transform R0 into R0=pkt(id=0,off=14,r=14)
      for the case when we're within bounds. A write into the packet
      at offset *(u32 *)(r0 -4), that is, 2 + 14 -4, is valid and
      aligned (2 is for NET_IP_ALIGN). After processing this with
      all fall-through paths, we later on check paths from branches.
      When the above skb->mark test is true, then we jump near the
      end of the program, perform r0 += 1, and jump back to the
      'if r0 > r3 goto out' test we've visited earlier already. This
      time, R0 is of type R0=pkt(id=0,off=15,r=0), and we'll prune
      that part because this time we'll have a larger safe packet
      range, and we already found that with off=14 all further insn
      were already safe, so it's safe as well with a larger off.
      However, the problem is that the subsequent write into the packet
      with 2 + 15 -4 is then unaligned, and not caught by the alignment
      tracking. Note that min_align, aux_off, and aux_off_align were
      all 0 in this example.
      
      Since we cannot tell at this time what kind of packet access was
      performed in the prior walk and what minimal requirements it has
      (we might do so in the future, but that requires more complexity),
      fix it to disable this pruning case for strict alignment for now,
      and let the verifier check such paths instead. With that applied,
      the test cases pass and reject the program due to misalignment.
      
      Fixes: d1174416 ("bpf: Track alignment of register values in the verifier.")
      Reference: http://patchwork.ozlabs.org/patch/761909/
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ad2f583
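      A self-contained toy model of the pruning predicate discussed above, with
      hypothetical structures; the guard below expresses the described idea of
      skipping this shortcut under strict alignment, not necessarily the exact
      upstream diff:

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_pkt_reg { int off; int range; };

        static bool ptrs_compatible(const struct toy_pkt_reg *old,
                                    const struct toy_pkt_reg *cur,
                                    bool strict_alignment)
        {
                if (strict_alignment)
                        return false;   /* don't prune: alignment of the old walk is unknown */

                return old->off <= cur->off &&
                       old->off >= old->range && cur->off >= cur->range;
        }

        int main(void)
        {
                struct toy_pkt_reg old = { .off = 14, .range = 14 };
                struct toy_pkt_reg cur = { .off = 15, .range = 0 };

                printf("pruned without strict alignment: %d\n",
                       ptrs_compatible(&old, &cur, false));
                printf("pruned with strict alignment:    %d\n",
                       ptrs_compatible(&old, &cur, true));
                return 0;
        }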
  12. 25 May 2017 (1 commit)
    • cpuset: consider dying css as offline · 41c25707
      Committed by Tejun Heo
      In most cases, a cgroup controller doesn't care about the lifetimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      delays.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      41c25707
  13. 24 May 2017 (1 commit)
  14. 23 May 2017 (6 commits)
    • ptrace: Properly initialize ptracer_cred on fork · c70d9d80
      Committed by Eric W. Biederman
      When I introduced ptracer_cred I failed to consider the weirdness of
      fork where the task_struct copies the old value by default.  This
      winds up leaving ptracer_cred set even when a process forks and
      the child process does not wind up being ptraced.
      
      Because ptracer_cred is not set on non-ptraced processes whose
      parents were ptraced this has broken the ability of the enlightenment
      window manager to start setuid children.
      
      Fix this by properly initializing ptracer_cred in ptrace_init_task
      
      This must be done with a little bit of care to preserve the current value
      of ptracer_cred when ptrace carries through fork.  Re-reading the
      ptracer_cred from the ptracing process at this point is inconsistent
      with how PT_PTRACE_CAP has been maintained all of these years.
      Tested-by: Takashi Iwai <tiwai@suse.de>
      Fixes: 64b875f7 ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      c70d9d80
    • genirq/msi: Populate the domain name if provided by the irqchip · a97b852b
      Committed by Marc Zyngier
      In order to ease debugging, let's populate the domain name upfront, before any
      MSI gets requested. This allows the domain to appear in the
      irq_domain_mapping, and the user to easily find the expected data.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/20170512115538.10767-4-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a97b852b
    • irqdomain: Let irq_domain_mapping display ACPI fwnode attributes · 2370c00d
      Committed by Marc Zyngier
      If the system is using ACPI, there is no of_node to display. But ACPI can
      use a struct irqchip_fwid as a domain identifier, and it can be used to
      display the name contained in that structure.
      
      The output on such a system will look like this:
      
       pMSI      0           0           0  irqchip@00000000e1180000
       MSI      37           0           0  irqchip@00000000e1180000
       GICv2m   37           0           0  irqchip@00000000e1180000
       GICv2   448         448           0  irqchip@ffff000008003000
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/20170512115538.10767-3-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      2370c00d
    • irqdomain: Let irq_domain_mapping display hierarchical domains · fe17a42e
      Committed by Marc Zyngier
      Hierarchical domains seem to be hard to grasp, and a number of
      aspiring kernel hackers find them utterly discombobulating.
      
      In order to ease their pain, let's make them appear in
      /sys/kernel/debug/irq_domain_mapping, such as the following:
      
         96  0x81808  MSI    0x          (null) RADIX   MSI
         96+ 0x00063  GICv2m 0xffff8003ee116980 RADIX   GICv2m
         96+ 0x00063  GICv2  0xffff00000916bfd8 LINEAR  GICv2
      
      [output compressed to fit in a commit log]
      
      This shows that IRQ96 is implemented by a stack of three domains,
      the + sign indicating the stacking.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Link: http://lkml.kernel.org/r/20170512115538.10767-2-marc.zyngier@arm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      fe17a42e
    • kthread: Fix use-after-free if kthread fork fails · 4d6501dc
      Committed by Vegard Nossum
      If a kthread forks (e.g. usermodehelper since commit 1da5c46f) but
      fails in copy_process() between calling dup_task_struct() and setting
      p->set_child_tid, then the value of p->set_child_tid will be inherited
      from the parent and get prematurely freed by free_kthread_struct().
      
          kthread()
           - worker_thread()
              - process_one_work()
              |  - call_usermodehelper_exec_work()
              |     - kernel_thread()
              |        - _do_fork()
              |           - copy_process()
              |              - dup_task_struct()
              |                 - arch_dup_task_struct()
              |                    - tsk->set_child_tid = current->set_child_tid // implied
              |              - ...
              |              - goto bad_fork_*
              |              - ...
              |              - free_task(tsk)
              |                 - free_kthread_struct(tsk)
              |                    - kfree(tsk->set_child_tid)
              - ...
              - schedule()
                 - __schedule()
                    - wq_worker_sleeping()
                       - kthread_data(task)->flags // UAF
      
      The problem started showing up with commit 1da5c46f since it reused
      ->set_child_tid for the kthread worker data.
      
      A better long-term solution might be to get rid of the ->set_child_tid
      abuse. The comment in set_kthread_struct() also looks slightly wrong.
      Debugged-by: Jamie Iles <jamie.iles@oracle.com>
      Fixes: 1da5c46f ("kthread: Make struct kthread kmalloc'ed")
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jamie Iles <jamie.iles@oracle.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170509073959.17858-1-vegard.nossum@oracle.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      4d6501dc
    • futex,rt_mutex: Fix rt_mutex_cleanup_proxy_lock() · 04dc1b2f
      Committed by Peter Zijlstra
      Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
      commit:
      
        cfafcd11 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
      
      The following trace shows the problem:
      
       ld-linux-x86-64-2161  [019] ....   410.760971: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_LOCK_PI
       ld-linux-x86-64-2161  [019] ...1   410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760978: SyS_futex: 00007ffbeb76b028: 80000875  op=FUTEX_UNLOCK_PI
       ld-linux-x86-64-2165  [011] d..1   410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
       ld-linux-x86-64-2165  [011] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
       ld-linux-x86-64-2161  [019] ....   410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT
      
      Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
      which then returns with -ETIMEDOUT. That wrecks the lock state, because now
      the owner isn't aware it acquired the lock and removes the pending robust
      list entry.
      
      If 2161 is killed, the robust list will not clear out this futex and the
      subsequent acquire on this futex will then (correctly) result in -ESRCH
      which is unexpected by glibc, triggers an internal assertion and dies.
      
      Task 2161			Task 2165
      
      rt_mutex_wait_proxy_lock()
         timeout();
         /* T2161 is still queued in  the waiter list */
         return -ETIMEDOUT;
      
      				futex_unlock_pi()
      				spin_lock(hb->lock);
      				rtmutex_unlock()
      				  remove_rtmutex_waiter(T2161);
      				   mark_lock_available();
      				/* Make the next waiter owner of the user space side */
      				futex_uval = 2161;
      				spin_unlock(hb->lock);
      spin_lock(hb->lock);
      rt_mutex_cleanup_proxy_lock()
        if (rtmutex_owner() != current)
           ...
           return FAIL;
      ....
      return -ETIMEDOUT;
      
      This means that rt_mutex_cleanup_proxy_lock() needs to call
      try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
      assigned by the waker. If the rtmutex is owned by some other task then this
      call is harmless and just confirms that the waiter is not able to acquire
      it.
      
      While there, fix what looks like a merge error which resulted in
      rt_mutex_cleanup_proxy_lock() having two calls to
      fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
      Both should have one, since both potentially touch the waiter list.
      
      Fixes: 38d589f2 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Bug-Spotted-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      04dc1b2f