1. 01 11月, 2016 1 次提交
  2. 31 10月, 2016 1 次提交
    • A
      net: add an ioctl to get a socket network namespace · c62cce2c
      Andrey Vagin 提交于
      Each socket operates in a network namespace where it has been created,
      so if we want to dump and restore a socket, we have to know its network
      namespace.
      
      We have a socket_diag to get information about sockets, it doesn't
      report sockets which are not bound or connected.
      
      This patch introduces a new socket ioctl, which is called SIOCGSKNS
      and used to get a file descriptor for a socket network namespace.
      
      A task must have CAP_NET_ADMIN in a target network namespace to
      use this ioctl.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrei Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c62cce2c
  3. 30 10月, 2016 10 次提交
  4. 28 10月, 2016 6 次提交
    • J
      perf/powerpc: Don't call perf_event_disable() from atomic context · 5aab90ce
      Jiri Olsa 提交于
      The trinity syscall fuzzer triggered following WARN() on powerpc:
      
        WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
        ...
        NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
        LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
        Call Trace:
        [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      Followed by a lockdep warning:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.8.0-rc5+ #7 Tainted: G        W
        -------------------------------
        ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 0
        2 locks held by ls/2998:
         #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
         #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
      
        stack backtrace:
        CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
        Call Trace:
        [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
        [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
        [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
        [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
        [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
        [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
        [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      While it looks like the first WARN() is probably valid, the other one is
      triggered by disabling event via perf_event_disable() from atomic context.
      
      The event is disabled here in case we were not able to emulate
      the instruction that hit the breakpoint. By disabling the event
      we unschedule the event and make sure it's not scheduled back.
      
      But we can't call perf_event_disable() from atomic context, instead
      we need to use the event's pending_disable irq_work method to disable it.
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161026094824.GA21397@kravaSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5aab90ce
    • M
      kconfig.h: remove config_enabled() macro · c0a0aba8
      Masahiro Yamada 提交于
      The use of config_enabled() is ambiguous.  For config options,
      IS_ENABLED(), IS_REACHABLE(), etc.  will make intention clearer.
      Sometimes config_enabled() has been used for non-config options because
      it is useful to check whether the given symbol is defined or not.
      
      I have been tackling on deprecating config_enabled(), and now is the
      time to finish this work.
      
      Some new users have appeared for v4.9-rc1, but it is trivial to replace
      them:
      
       - arch/x86/mm/kaslr.c
        replace config_enabled() with IS_ENABLED() because
        CONFIG_X86_ESPFIX64 and CONFIG_EFI are boolean.
      
       - include/asm-generic/export.h
        replace config_enabled() with __is_defined().
      
      Then, config_enabled() can be removed now.
      
      Going forward, please use IS_ENABLED(), IS_REACHABLE(), etc. for config
      options, and __is_defined() for non-config symbols.
      
      Link: http://lkml.kernel.org/r/1476616078-32252-1-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NNicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a0aba8
    • J
      genetlink: mark families as __ro_after_init · 56989f6d
      Johannes Berg 提交于
      Now genl_register_family() is the only thing (other than the
      users themselves, perhaps, but I didn't find any doing that)
      writing to the family struct.
      
      In all families that I found, genl_register_family() is only
      called from __init functions (some indirectly, in which case
      I've add __init annotations to clarifly things), so all can
      actually be marked __ro_after_init.
      
      This protects the data structure from accidental corruption.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56989f6d
    • J
      genetlink: statically initialize families · 489111e5
      Johannes Berg 提交于
      Instead of providing macros/inline functions to initialize
      the families, make all users initialize them statically and
      get rid of the macros.
      
      This reduces the kernel code size by about 1.6k on x86-64
      (with allyesconfig).
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      489111e5
    • J
      genetlink: no longer support using static family IDs · a07ea4d9
      Johannes Berg 提交于
      Static family IDs have never really been used, the only
      use case was the workaround I introduced for those users
      that assumed their family ID was also their multicast
      group ID.
      
      Additionally, because static family IDs would never be
      reserved by the generic netlink code, using a relatively
      low ID would only work for built-in families that can be
      registered immediately after generic netlink is started,
      which is basically only the control family (apart from
      the workaround code, which I also had to add code for so
      it would reserve those IDs)
      
      Thus, anything other than GENL_ID_GENERATE is flawed and
      luckily not used except in the cases I mentioned. Move
      those workarounds into a few lines of code, and then get
      rid of GENL_ID_GENERATE entirely, making it more robust.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a07ea4d9
    • L
      mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds 提交于
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Tested-by: NBob Peterson <rpeterso@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9dcb8b68
  5. 27 10月, 2016 6 次提交
  6. 26 10月, 2016 1 次提交
    • D
      x86/io: add interface to reserve io memtype for a resource range. (v1.1) · 8ef42276
      Dave Airlie 提交于
      A recent change to the mm code in:
      87744ab3 mm: fix cache mode tracking in vm_insert_mixed()
      
      started enforcing checking the memory type against the registered list for
      amixed pfn insertion mappings. It happens that the drm drivers for a number
      of gpus relied on this being broken. Currently the driver only inserted
      VRAM mappings into the tracking table when they came from the kernel,
      and userspace mappings never landed in the table. This led to a regression
      where all the mapping end up as UC instead of WC now.
      
      I've considered a number of solutions but since this needs to be fixed
      in fixes and not next, and some of the solutions were going to introduce
      overhead that hadn't been there before I didn't consider them viable at
      this stage. These mainly concerned hooking into the TTM io reserve APIs,
      but these API have a bunch of fast paths I didn't want to unwind to add
      this to.
      
      The solution I've decided on is to add a new API like the arch_phys_wc
      APIs (these would have worked but wc_del didn't take a range), and
      use them from the drivers to add a WC compatible mapping to the table
      for all VRAM on those GPUs. This means we can then create userspace
      mapping that won't get degraded to UC.
      
      v1.1: use CONFIG_X86_PAT + add some comments in io.h
      
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: x86@kernel.org
      Cc: mcgrof@suse.com
      Cc: Dan Williams <dan.j.williams@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      8ef42276
  7. 25 10月, 2016 1 次提交
    • L
      mm: unexport __get_user_pages() · 0d731759
      Lorenzo Stoakes 提交于
      This patch unexports the low-level __get_user_pages() function.
      
      Recent refactoring of the get_user_pages* functions allow flags to be
      passed through get_user_pages() which eliminates the need for access to
      this function from its one user, kvm.
      
      We can see that the two calls to get_user_pages() which replace
      __get_user_pages() in kvm_main.c are equivalent by examining their call
      stacks:
      
        get_user_page_nowait():
          get_user_pages(start, 1, flags, page, NULL)
          __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
      		     flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
      
        check_user_page_hwpoison():
          get_user_pages(addr, 1, flags, NULL, NULL)
          __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
      		     NULL, NULL)
      Signed-off-by: NLorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d731759
  8. 24 10月, 2016 1 次提交
  9. 23 10月, 2016 2 次提交
  10. 21 10月, 2016 4 次提交
    • J
      net: use core MTU range checking in misc drivers · b3e3893e
      Jarod Wilson 提交于
      firewire-net:
      - set min/max_mtu
      - remove fwnet_change_mtu
      
      nes:
      - set max_mtu
      - clean up nes_netdev_change_mtu
      
      xpnet:
      - set min/max_mtu
      - remove xpnet_dev_change_mtu
      
      hippi:
      - set min/max_mtu
      - remove hippi_change_mtu
      
      batman-adv:
      - set max_mtu
      - remove batadv_interface_change_mtu
      - initialization is a little async, not 100% certain that max_mtu is set
        in the optimal place, don't have hardware to test with
      
      rionet:
      - set min/max_mtu
      - remove rionet_change_mtu
      
      slip:
      - set min/max_mtu
      - streamline sl_change_mtu
      
      um/net_kern:
      - remove pointless ndo_change_mtu
      
      hsi/clients/ssi_protocol:
      - use core MTU range checking
      - remove now redundant ssip_pn_set_mtu
      
      ipoib:
      - set a default max MTU value
      - Note: ipoib's actual max MTU can vary, depending on if the device is in
        connected mode or not, so we'll just set the max_mtu value to the max
        possible, and let the ndo_change_mtu function continue to validate any new
        MTU change requests with checks for CM or not. Note that ipoib has no
        min_mtu set, and thus, the network core's mtu > 0 check is the only lower
        bounds here.
      
      mptlan:
      - use net core MTU range checking
      - remove now redundant mpt_lan_change_mtu
      
      fddi:
      - min_mtu = 21, max_mtu = 4470
      - remove now redundant fddi_change_mtu (including export)
      
      fjes:
      - min_mtu = 8192, max_mtu = 65536
      - The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
        get past the core net MTU range checks so fjes_change_mtu can validate a
        new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)
      
      hsr:
      - min_mtu = 0 (calls ether_setup, max_mtu is 1500)
      
      f_phonet:
      - min_mtu = 6, max_mtu = 65541
      
      u_ether:
      - min_mtu = 14, max_mtu = 15412
      
      phonet/pep-gprs:
      - min_mtu = 576, max_mtu = 65530
      - remove redundant gprs_set_mtu
      
      CC: netdev@vger.kernel.org
      CC: linux-rdma@vger.kernel.org
      CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
      CC: Faisal Latif <faisal.latif@intel.com>
      CC: linux-rdma@vger.kernel.org
      CC: Cliff Whickman <cpw@sgi.com>
      CC: Robin Holt <robinmholt@gmail.com>
      CC: Jes Sorensen <jes@trained-monkey.org>
      CC: Marek Lindner <mareklindner@neomailbox.ch>
      CC: Simon Wunderlich <sw@simonwunderlich.de>
      CC: Antonio Quartulli <a@unstable.cc>
      CC: Sathya Prakash <sathya.prakash@broadcom.com>
      CC: Chaitra P B <chaitra.basappa@broadcom.com>
      CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
      CC: MPT-FusionLinux.pdl@broadcom.com
      CC: Sebastian Reichel <sre@kernel.org>
      CC: Felipe Balbi <balbi@kernel.org>
      CC: Arvid Brodin <arvid.brodin@alten.se>
      CC: Remi Denis-Courmont <courmisch@gmail.com>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3e3893e
    • J
      net: use core MTU range checking in WAN drivers · 8b6b4135
      Jarod Wilson 提交于
      - set min/max_mtu in all hdlc drivers, remove hdlc_change_mtu
      - sent max_mtu in lec driver, remove lec_change_mtu
      - set min/max_mtu in x25_asy driver
      
      CC: netdev@vger.kernel.org
      CC: Krzysztof Halasa <khc@pm.waw.pl>
      CC: Krzysztof Halasa <khalasa@piap.pl>
      CC: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
      CC: Francois Romieu <romieu@fr.zoreil.com>
      CC: Kevin Curtis <kevin.curtis@farsite.co.uk>
      CC: Zhao Qiang <qiang.zhao@nxp.com>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b6b4135
    • S
      net: add recursion limit to GRO · fcd91dd4
      Sabrina Dubroca 提交于
      Currently, GRO can do unlimited recursion through the gro_receive
      handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
      to one level with encap_mark, but both VLAN and TEB still have this
      problem.  Thus, the kernel is vulnerable to a stack overflow, if we
      receive a packet composed entirely of VLAN headers.
      
      This patch adds a recursion counter to the GRO layer to prevent stack
      overflow.  When a gro_receive function hits the recursion limit, GRO is
      aborted for this skb and it is processed normally.  This recursion
      counter is put in the GRO CB, but could be turned into a percpu counter
      if we run out of space in the CB.
      
      Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.
      
      Fixes: CVE-2016-7039
      Fixes: 9b174d88 ("net: Add Transparent Ethernet Bridging GRO support.")
      Fixes: 66e5133f ("vlan: Add GRO support for non hardware accelerated vlan")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NJiri Benc <jbenc@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcd91dd4
    • R
      clocksource: Add J-Core timer/clocksource driver · 9995f4f1
      Rich Felker 提交于
      At the hardware level, the J-Core PIT is integrated with the interrupt
      controller, but it is represented as its own device and has an
      independent programming interface. It provides a 12-bit countdown
      timer, which is not presently used, and a periodic timer. The interval
      length for the latter is programmable via a 32-bit throttle register
      whose units are determined by a bus-period register. The periodic
      timer is used to implement both periodic and oneshot clock event
      modes; in oneshot mode the interrupt handler simply disables the timer
      as soon as it fires.
      
      Despite its device tree node representing an interrupt for the PIT,
      the actual irq generated is programmable, not hard-wired. The driver
      is responsible for programming the PIT to generate the hardware irq
      number that the DT assigns to it.
      
      On SMP configurations, J-Core provides cpu-local instances of the PIT;
      no broadcast timer is needed. This driver supports the creation of the
      necessary per-cpu clock_event_device instances.
      
      A nanosecond-resolution clocksource is provided using the J-Core "RTC"
      registers, which give a 64-bit seconds count and 32-bit nanoseconds
      that wrap every second. The driver converts these to a full-range
      32-bit nanoseconds count.
      Signed-off-by: NRich Felker <dalias@libc.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: devicetree@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Link: http://lkml.kernel.org/r/b591ff12cc5ebf63d1edc98da26046f95a233814.1476393790.git.dalias@libc.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      9995f4f1
  11. 20 10月, 2016 7 次提交