1. 17 11月, 2016 1 次提交
  2. 16 11月, 2016 5 次提交
    • M
      bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH · 8f844938
      Martin KaFai Lau 提交于
      Provide a LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f844938
    • M
      bpf: Add BPF_MAP_TYPE_LRU_HASH · 29ba732a
      Martin KaFai Lau 提交于
      Provide a LRU version of the existing BPF_MAP_TYPE_HASH.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29ba732a
    • M
      bpf: Refactor codes handling percpu map · fd91de7b
      Martin KaFai Lau 提交于
      Refactor the codes that populate the value
      of a htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
      typed bpf_map.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd91de7b
    • M
      bpf: Add percpu LRU list · 961578b6
      Martin KaFai Lau 提交于
      Instead of having a common LRU list, this patch allows a
      percpu LRU list which can be selected by specifying a map
      attribute.  The map attribute will be added in the later
      patch.
      
      While the common use case for LRU is #reads >> #updates,
      percpu LRU list allows bpf prog to absorb unusual #updates
      under pathological case (e.g. external traffic facing machine which
      could be under attack).
      
      Each percpu LRU is isolated from each other.  The LRU nodes (including
      free nodes) cannot be moved across different LRU Lists.
      
      Here are the update performance comparison between
      common LRU list and percpu LRU list (the test code is
      at the last patch):
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 16 $i | awk '{r += $3}END{print r " updates"}'; done
       1 cpus: 2934082 updates
       4 cpus: 7391434 updates
       8 cpus: 6500576 updates
      
      [root@kerneltest003.31.prn1 ~]# for i in 1 4 8; do echo -n "$i cpus: "; \
      ./map_perf_test 32 $i | awk '{r += $3}END{printr " updates"}'; done
        1 cpus: 2896553 updates
        4 cpus: 9766395 updates
        8 cpus: 17460553 updates
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      961578b6
    • M
      bpf: LRU List · 3a08c2fd
      Martin KaFai Lau 提交于
      Introduce bpf_lru_list which will provide LRU capability to
      the bpf_htab in the later patch.
      
      * General Thoughts:
      1. Target use case.  Read is more often than update.
         (i.e. bpf_lookup_elem() is more often than bpf_update_elem()).
         If bpf_prog does a bpf_lookup_elem() first and then an in-place
         update, it still counts as a read operation to the LRU list concern.
      2. It may be useful to think of it as a LRU cache
      3. Optimize the read case
         3.1 No lock in read case
         3.2 The LRU maintenance is only done during bpf_update_elem()
      4. If there is a percpu LRU list, it will lose the system-wise LRU
         property.  A completely isolated percpu LRU list has the best
         performance but the memory utilization is not ideal considering
         the work load may be imbalance.
      5. Hence, this patch starts the LRU implementation with a global LRU
         list with batched operations before accessing the global LRU list.
         As a LRU cache, #read >> #update/#insert operations, it will work well.
      6. There is a local list (for each cpu) which is named
         'struct bpf_lru_locallist'.  This local list is not used to sort
         the LRU property.  Instead, the local list is to batch enough
         operations before acquiring the lock of the global LRU list.  More
         details on this later.
      7. In the later patch, it allows a percpu LRU list by specifying a
         map-attribute for scalability reason and for use cases that need to
         prepare for the worst (and pathological) case like DoS attack.
         The percpu LRU list is completely isolated from each other and the
         LRU nodes (including free nodes) cannot be moved across the list.  The
         following description is for the global LRU list but mostly applicable
         to the percpu LRU list also.
      
      * Global LRU List:
      1. It has three sub-lists: active-list, inactive-list and free-list.
      2. The two list idea, active and inactive, is borrowed from the
         page cache.
      3. All nodes are pre-allocated and all sit at the free-list (of the
         global LRU list) at the beginning.  The pre-allocation reasoning
         is similar to the existing BPF_MAP_TYPE_HASH.  However,
         opting-out prealloc (BPF_F_NO_PREALLOC) is not supported in
         the LRU map.
      
      * Active/Inactive List (of the global LRU list):
      1. The active list, as its name says it, maintains the active set of
         the nodes.  We can think of it as the working set or more frequently
         accessed nodes.  The access frequency is approximated by a ref-bit.
         The ref-bit is set during the bpf_lookup_elem().
      2. The inactive list, as its name also says it, maintains a less
         active set of nodes.  They are the candidates to be removed
         from the bpf_htab when we are running out of free nodes.
      3. The ordering of these two lists is acting as a rough clock.
         The tail of the inactive list is the older nodes and
         should be released first if the bpf_htab needs free element.
      
      * Rotating the Active/Inactive List (of the global LRU list):
      1. It is the basic operation to maintain the LRU property of
         the global list.
      2. The active list is only rotated when the inactive list is running
         low.  This idea is similar to the current page cache.
         Inactive running low is currently defined as
         "# of inactive < # of active".
      3. The active list rotation always starts from the tail.  It moves
         node without ref-bit set to the head of the inactive list.
         It moves node with ref-bit set back to the head of the active
         list and then clears its ref-bit.
      4. The inactive rotation is pretty simply.
         It walks the inactive list and moves the nodes back to the head of
         active list if its ref-bit is set. The ref-bit is cleared after moving
         to the active list.
         If the node does not have ref-bit set, it just leave it as it is
         because it is already in the inactive list.
      
      * Shrinking the Inactive List (of the global LRU list):
      1. Shrinking is the operation to get free nodes when the bpf_htab is
         full.
      2. It usually only shrinks the inactive list to get free nodes.
      3. During shrinking, it will walk the inactive list from the tail,
         delete the nodes without ref-bit set from bpf_htab.
      4. If no free node found after step (3), it will forcefully get
         one node from the tail of inactive or active list.  Forcefully is
         in the sense that it ignores the ref-bit.
      
      * Local List:
      1. Each CPU has a 'struct bpf_lru_locallist'.  The purpose is to
         batch enough operations before acquiring the lock of the
         global LRU.
      2. A local list has two sub-lists, free-list and pending-list.
      3. During bpf_update_elem(), it will try to get from the free-list
         of (the current CPU local list).
      4. If the local free-list is empty, it will acquire from the
         global LRU list.  The global LRU list can either satisfy it
         by its global free-list or by shrinking the global inactive
         list.  Since we have acquired the global LRU list lock,
         it will try to get at most LOCAL_FREE_TARGET elements
         to the local free list.
      5. When a new element is added to the bpf_htab, it will
         first sit at the pending-list (of the local list) first.
         The pending-list will be flushed to the global LRU list
         when it needs to acquire free nodes from the global list
         next time.
      
      * Lock Consideration:
      The LRU list has a lock (lru_lock).  Each bucket of htab has a
      lock (buck_lock).  If both locks need to be acquired together,
      the lock order is always lru_lock -> buck_lock and this only
      happens in the bpf_lru_list.c logic.
      
      In hashtab.c, both locks are not acquired together (i.e. one
      lock is always released first before acquiring another lock).
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3a08c2fd
  3. 15 11月, 2016 2 次提交
  4. 13 11月, 2016 1 次提交
  5. 12 11月, 2016 1 次提交
    • H
      Revert "console: don't prefer first registered if DT specifies stdout-path" · c6c7d83b
      Hans de Goede 提交于
      This reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path").
      
      The reverted commit changes existing behavior on which many ARM boards
      rely.  Many ARM small-board-computers, like e.g.  the Raspberry Pi have
      both a video output and a serial console.  Depending on whether the user
      is using the device as a more regular computer; or as a headless device
      we need to have the console on either one or the other.
      
      Many users rely on the kernel behavior of the console being present on
      both outputs, before the reverted commit the console setup with no
      console= kernel arguments on an ARM board which sets stdout-path in dt
      would look like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
        tty0                 -WU (E  p  )    4:1
      
      Where as after the reverted commit, it looks like this:
      
        [root@localhost ~]# cat /proc/consoles
        ttyS0                -W- (EC p a)    4:64
      
      This commit reverts commit 05fd007e ("console: don't prefer first
      registered if DT specifies stdout-path") restoring the original
      behavior.
      
      Fixes: 05fd007e ("console: don't prefer first registered if DT specifies stdout-path")
      Link: http://lkml.kernel.org/r/20161104121135.4780-2-hdegoede@redhat.comSigned-off-by: NHans de Goede <hdegoede@redhat.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Thorsten Leemhuis <regressions@leemhuis.info>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6c7d83b
  6. 10 11月, 2016 1 次提交
  7. 08 11月, 2016 3 次提交
    • T
      genirq: Use irq type from irqdata instead of irqdesc · 7ee7e87d
      Thomas Gleixner 提交于
      The type flags in the irq descriptor are there for historical reasons and
      only updated via irq_modify_status() or irq_set_type(). Both functions also
      update the type flags in irqdata. __setup_irq() is the only left over user
      of the type flags in the irq descriptor.
      
      If __setup_irq() is called with empty irq type flags, then the type flags
      are retrieved from irqdata. If an interrupt is shared, then the type flags
      are compared with the type flags stored in the irq descriptor. 
      
      On x86 the ioapic does not have a irq_set_type() callback because the type
      is defined in the BIOS tables and cannot be changed. The type is stored in
      irqdata at setup time without updating the type data in the irq
      descriptor. As a result the comparison described above fails.
      
      There is no point in updating the irq descriptor flags because the only
      relevant storage is irqdata. Use the type flags from irqdata for both
      retrieval and comparison in __setup_irq() instead.
      
      Aside of that the print out in case of non matching type flags has the old
      and new type flags arguments flipped. Fix that as well.
      
      For correctness sake the flags stored in the irq descriptor should be
      removed, but this is beyond the scope of this bugfix and will be done in a
      later patch.
      
      Fixes: 4b357dae ("genirq: Look-up trigger type if not specified by caller")
      Reported-and-tested-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1611072020360.3501@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      7ee7e87d
    • D
      bpf: fix map not being uncharged during map creation failure · 20b2b24f
      Daniel Borkmann 提交于
      In map_create(), we first find and create the map, then once that
      suceeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
      a new anon fd through anon_inode_getfd(). The problem is, once the
      latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
      the map via map->ops->map_free(), but without uncharging the previously
      locked memory first. That means that the user_struct allocation is
      leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
      Make the label names in the fix consistent with bpf_prog_load().
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20b2b24f
    • D
      bpf: fix htab map destruction when extra reserve is in use · 483bed2b
      Daniel Borkmann 提交于
      Commit a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      added an extra per-cpu reserve to the hash table map to restore old
      behaviour from pre prealloc times. When non-prealloc is in use for a
      map, then problem is that once a hash table extra element has been
      linked into the hash-table, and the hash table is destroyed due to
      refcount dropping to zero, then htab_map_free() -> delete_all_elements()
      will walk the whole hash table and drop all elements via htab_elem_free().
      The problem is that the element from the extra reserve is first fed
      to the wrong backend allocator and eventually freed twice.
      
      Fixes: a6ed3ea6 ("bpf: restore behavior of bpf_map_update_elem")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      483bed2b
  8. 04 11月, 2016 1 次提交
  9. 03 11月, 2016 2 次提交
  10. 02 11月, 2016 1 次提交
  11. 01 11月, 2016 2 次提交
  12. 30 10月, 2016 1 次提交
  13. 28 10月, 2016 7 次提交
    • J
      perf/powerpc: Don't call perf_event_disable() from atomic context · 5aab90ce
      Jiri Olsa 提交于
      The trinity syscall fuzzer triggered following WARN() on powerpc:
      
        WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
        ...
        NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
        LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
        Call Trace:
        [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      Followed by a lockdep warning:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.8.0-rc5+ #7 Tainted: G        W
        -------------------------------
        ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 0
        2 locks held by ls/2998:
         #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
         #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
      
        stack backtrace:
        CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
        Call Trace:
        [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
        [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
        [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
        [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
        [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
        [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
        [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      While it looks like the first WARN() is probably valid, the other one is
      triggered by disabling event via perf_event_disable() from atomic context.
      
      The event is disabled here in case we were not able to emulate
      the instruction that hit the breakpoint. By disabling the event
      we unschedule the event and make sure it's not scheduled back.
      
      But we can't call perf_event_disable() from atomic context, instead
      we need to use the event's pending_disable irq_work method to disable it.
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161026094824.GA21397@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5aab90ce
    • J
      perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix... · 0933840a
      Jiri Olsa 提交于
      perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
      
      CAI Qian reported a crash in the PMU uncore device removal code,
      enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
      
        https://marc.info/?l=linux-kernel&m=147688837328451
      
      The reason for the crash is that perf_pmu_unregister() tries to remove
      a PMU device which is not added at this point. We add PMU devices
      only after pmu_bus is registered, which happens in the
      perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
      
      The fix is to get the 'pmu_bus_running' flag state at the point
      the PMU is taken out of the PMU list and remove the device
      later only if it's set.
      Reported-by: NCAI Qian <caiqian@redhat.com>
      Tested-by: NCAI Qian <caiqian@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161020111011.GA13361@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0933840a
    • A
      kcov: properly check if we are in an interrupt · b274c0bb
      Andrey Konovalov 提交于
      in_interrupt() returns a nonzero value when we are either in an
      interrupt or have bh disabled via local_bh_disable().  Since we are
      interested in only ignoring coverage from actual interrupts, do a proper
      check instead of just calling in_interrupt().
      
      As a result of this change, kcov will start to collect coverage from
      within local_bh_disable()/local_bh_enable() sections.
      
      Link: http://lkml.kernel.org/r/1476115803-20712-1-git-send-email-andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
      Acked-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Nicolai Stange <nicstange@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b274c0bb
    • J
      genetlink: mark families as __ro_after_init · 56989f6d
      Johannes Berg 提交于
      Now genl_register_family() is the only thing (other than the
      users themselves, perhaps, but I didn't find any doing that)
      writing to the family struct.
      
      In all families that I found, genl_register_family() is only
      called from __init functions (some indirectly, in which case
      I've add __init annotations to clarifly things), so all can
      actually be marked __ro_after_init.
      
      This protects the data structure from accidental corruption.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56989f6d
    • J
      genetlink: statically initialize families · 489111e5
      Johannes Berg 提交于
      Instead of providing macros/inline functions to initialize
      the families, make all users initialize them statically and
      get rid of the macros.
      
      This reduces the kernel code size by about 1.6k on x86-64
      (with allyesconfig).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      489111e5
    • J
      genetlink: no longer support using static family IDs · a07ea4d9
      Johannes Berg 提交于
      Static family IDs have never really been used, the only
      use case was the workaround I introduced for those users
      that assumed their family ID was also their multicast
      group ID.
      
      Additionally, because static family IDs would never be
      reserved by the generic netlink code, using a relatively
      low ID would only work for built-in families that can be
      registered immediately after generic netlink is started,
      which is basically only the control family (apart from
      the workaround code, which I also had to add code for so
      it would reserve those IDs)
      
      Thus, anything other than GENL_ID_GENERATE is flawed and
      luckily not used except in the cases I mentioned. Move
      those workarounds into a few lines of code, and then get
      rid of GENL_ID_GENERATE entirely, making it more robust.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a07ea4d9
    • L
      mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds 提交于
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Tested-by: NBob Peterson <rpeterso@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9dcb8b68
  14. 27 10月, 2016 1 次提交
  15. 25 10月, 2016 4 次提交
    • T
      timers: Prevent base clock corruption when forwarding · 6bad6bcc
      Thomas Gleixner 提交于
      When a timer is enqueued we try to forward the timer base clock. This
      mechanism has two issues:
      
      1) Forwarding a remote base unlocked
      
      The forwarding function is called from get_target_base() with the current
      timer base lock held. But if the new target base is a different base than
      the current base (can happen with NOHZ, sigh!) then the forwarding is done
      on an unlocked base. This can lead to corruption of base->clk.
      
      Solution is simple: Invoke the forwarding after the target base is locked.
      
      2) Possible corruption due to jiffies advancing
      
      This is similar to the issue in get_net_timer_interrupt() which was fixed
      in the previous patch. jiffies can advance between check and assignement
      and therefore advancing base->clk beyond the next expiry value.
      
      So we need to read jiffies into a local variable once and do the checks and
      assignment with the local copy.
      
      Fixes: a683f390("timers: Forward the wheel clock whenever possible")
      Reported-by: NAshton Holmes <scoopta@gmail.com>
      Reported-by: NMichael Thayer <michael.thayer@oracle.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      6bad6bcc
    • T
      timers: Prevent base clock rewind when forwarding clock · 041ad7bc
      Thomas Gleixner 提交于
      Ashton and Michael reported, that kernel versions 4.8 and later suffer from
      USB timeouts which are caused by the timer wheel rework.
      
      This is caused by a bug in the base clock forwarding mechanism, which leads
      to timers expiring early. The scenario which leads to this is:
      
      run_timers()
        while (jiffies >= base->clk) {
          collect_expired_timers();
          base->clk++;
          expire_timers();
        }          
      
      So base->clk = jiffies + 1. Now the cpu goes idle:
      
      idle()
        get_next_timer_interrupt()
          nextevt = __next_time_interrupt();
          if (time_after(nextevt, base->clk))
             	base->clk = jiffies;
      
      jiffies has not advanced since run_timers(), so this assignment effectively
      decrements base->clk by one.
      
      base->clk is the index into the timer wheel arrays. So let's assume the
      following state after the base->clk increment in run_timers():
      
       jiffies = 0
       base->clk = 1
      
      A timer gets enqueued with an expiry delta of 63 ticks (which is the case
      with the USB timeout and HZ=250) so the resulting bucket index is:
      
        base->clk + delta = 1 + 63 = 64
      
      The timer goes into the first wheel level. The array size is 64 so it ends
      up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
      to index into bucket 0 again.
      
      If the cpu goes idle before jiffies advance, then the bug in the forwarding
      mechanism sets base->clk back to 0, so the next invocation of run_timers()
      at the next tick will index into bucket 0 and therefore expire the timer 62
      ticks too early.
      
      Instead of blindly setting base->clk to jiffies we must make the forwarding
      conditional on jiffies > base->clk, but we cannot use jiffies for this as
      we might run into the following issue:
      
        if (time_after(jiffies, base->clk) {
          if (time_after(nextevt, base->clk))
             base->clk = jiffies;
      
      jiffies can increment between the check and the assigment far enough to
      advance beyond nextevt. So we need to use a stable value for checking.
      
      get_next_timer_interrupt() has the basej argument which is the jiffies
      value snapshot taken in the calling code. So we can just that.
      
      Thanks to Ashton for bisecting and providing trace data!
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Reported-by: NAshton Holmes <scoopta@gmail.com>
      Reported-by: NMichael Thayer <michael.thayer@oracle.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Michal Necasek <michal.necasek@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: knut.osmundsen@oracle.com
      Cc: stable@vger.kernel.org
      Cc: stern@rowland.harvard.edu
      Cc: rt@linutronix.de
      Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      041ad7bc
    • T
      timers: Lock base for same bucket optimization · 4da9152a
      Thomas Gleixner 提交于
      Linus stumbled over the unlocked modification of the timer expiry value in
      mod_timer() which is an optimization for timers which stay in the same
      bucket - due to the bucket granularity - despite their expiry time getting
      updated.
      
      The optimization itself still makes sense even if we take the lock, because
      in case that the bucket stays the same, we avoid the pointless
      queue/enqueue dance.
      
      Make the check and the modification of timer->expires protected by the base
      lock and shuffle the remaining code around so we can keep the lock held
      when we actually have to requeue the timer to a different bucket.
      
      Fixes: f00c0afd ("timers: Implement optimization for same expiry time in mod_timer()")
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      4da9152a
    • T
      timers: Plug locking race vs. timer migration · b831275a
      Thomas Gleixner 提交于
      Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
      timer flags. As a consequence the compiler is allowed to reload the flags
      between the initial check for TIMER_MIGRATION and the following timer base
      computation and the spin lock of the base.
      
      While this has not been observed (yet), we need to make sure that it never
      happens.
      
      Fixes: 0eeda71b ("timer: Replace timer base by a cpu index")
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      b831275a
  16. 24 10月, 2016 1 次提交
    • J
      PM / suspend: Fix missing KERN_CONT for suspend message · 1adb469b
      Jon Hunter 提交于
      Commit 4bcc595c (printk: reinstate KERN_CONT for printing
      continuation lines) exposed a missing KERN_CONT from one of the
      messages shown on entering suspend. With v4.9-rc1, the 'done.' shown
      after syncing the filesystems no longer appears as a continuation but
      a new message with its own timestamp.
      
      [    9.259566] PM: Syncing filesystems ... [    9.264119] done.
      
      Fix this by adding the KERN_CONT log level for the 'done.' part of the
      message seen after syncing filesystems. While we are at it, convert
      these suspend printks to pr_info and pr_cont, respectively.
      Signed-off-by: NJon Hunter <jonathanh@nvidia.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      1adb469b
  17. 23 10月, 2016 2 次提交
  18. 22 10月, 2016 1 次提交
  19. 21 10月, 2016 2 次提交
  20. 20 10月, 2016 1 次提交
    • L
      printk: suppress empty continuation lines · 8835ca59
      Linus Torvalds 提交于
      We have a fairly common pattern where you print several things as
      continuations on one single line in a loop, and then at the end you do
      
      	printk(KERN_CONT "\n");
      
      to flush the buffered output.
      
      But if the output was flushed by something else (concurrent printk
      activity, or just system logging), we don't want that final flushing to
      just print an empty line.
      
      So just suppress empty continuation lines when they couldn't be merged
      into the line they are a continuation of.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8835ca59