1. 10 10月, 2017 1 次提交
    • S
      audit: Record fanotify access control decisions · de8cd83e
      Steve Grubb 提交于
      The fanotify interface allows user space daemons to make access
      control decisions. Under common criteria requirements, we need to
      optionally record decisions based on policy. This patch adds a bit mask,
      FAN_AUDIT, that a user space daemon can 'or' into the response decision
      which will tell the kernel that it made a decision and record it.
      
      It would be used something like this in user space code:
      
        response.response = FAN_DENY | FAN_AUDIT;
        write(fd, &response, sizeof(struct fanotify_response));
      
      When the syscall ends, the audit system will record the decision as a
      AUDIT_FANOTIFY auxiliary record to denote that the reason this event
      occurred is the result of an access control decision from fanotify
      rather than DAC or MAC policy.
      
      A sample event looks like this:
      
      type=PATH msg=audit(1504310584.332:290): item=0 name="./evil-ls"
      inode=1319561 dev=fc:03 mode=0100755 ouid=1000 ogid=1000 rdev=00:00
      obj=unconfined_u:object_r:user_home_t:s0 nametype=NORMAL
      type=CWD msg=audit(1504310584.332:290): cwd="/home/sgrubb"
      type=SYSCALL msg=audit(1504310584.332:290): arch=c000003e syscall=2
      success=no exit=-1 a0=32cb3fca90 a1=0 a2=43 a3=8 items=1 ppid=901
      pid=959 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000
      fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts1 ses=3 comm="bash"
      exe="/usr/bin/bash" subj=unconfined_u:unconfined_r:unconfined_t:
      s0-s0:c0.c1023 key=(null)
      type=FANOTIFY msg=audit(1504310584.332:290): resp=2
      
      Prior to using the audit flag, the developer needs to call
      fanotify_init or'ing in FAN_ENABLE_AUDIT to ensure that the kernel
      supports auditing. The calling process must also have the CAP_AUDIT_WRITE
      capability.
      Signed-off-by: Nsgrubb <sgrubb@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      de8cd83e
  2. 04 10月, 2017 11 次提交
  3. 03 10月, 2017 2 次提交
  4. 01 10月, 2017 2 次提交
    • P
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni 提交于
      The UDP early demux can leverate the rx dst cache even for
      multicast unconnected sockets.
      
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry  from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      addresses.
      
      Disabling the dst cache for unconnected multicast socket will
      cause large performace regression, nearly reducing by half the
      max ingress tput.
      
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successful fetching the related cached dst entry.
      
      This still gives a measurable, but limited performance
      regression:
      
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc044e8d
    • P
      IPv4: early demux can return an error code · 7487449c
      Paolo Abeni 提交于
      Currently no error is emitted, but this infrastructure will
      used by the next patch to allow source address validation
      for mcast sockets.
      Since early demux can do a route lookup and an ipv4 route
      lookup can return an error code this is consistent with the
      current ipv4 route infrastructure.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7487449c
  5. 29 9月, 2017 6 次提交
  6. 28 9月, 2017 3 次提交
    • K
      timer: Prepare to change timer callback argument type · 686fef92
      Kees Cook 提交于
      Modern kernel callback systems pass the structure associated with a
      given callback to the callback function. The timer callback remains one
      of the legacy cases where an arbitrary unsigned long argument continues
      to be passed as the callback argument. This has several problems:
      
      - This bloats the timer_list structure with a normally redundant
        .data field.
      
      - No type checking is being performed, forcing callbacks to do
        explicit type casts of the unsigned long argument into the object
        that was passed, rather than using container_of(), as done in most
        of the other callback infrastructure.
      
      - Neighboring buffer overflows can overwrite both the .function and
        the .data field, providing attackers with a way to elevate from a buffer
        overflow into a simplistic ROP-like mechanism that allows calling
        arbitrary functions with a controlled first argument.
      
      - For future Control Flow Integrity work, this creates a unique function
        prototype for timer callbacks, instead of allowing them to continue to
        be clustered with other void functions that take a single unsigned long
        argument.
      
      This adds a new timer initialization API, which will ultimately replace
      the existing setup_timer(), setup_{deferrable,pinned,etc}_timer() family,
      named timer_setup() (to mirror hrtimer_setup(), making instances of its
      use much easier to grep for).
      
      In order to support the migration of existing timers into the new
      callback arguments, timer_setup() casts its arguments to the existing
      legacy types, and explicitly passes the timer pointer as the legacy
      data argument. Once all setup_*timer() callers have been replaced with
      timer_setup(), the casts can be removed, and the data argument can be
      dropped with the timer expiration code changed to just pass the timer
      to the callback directly.
      
      Since the regular pattern of using container_of() during local variable
      declaration repeats the need for the variable type declaration
      to be included, this adds a helper modeled after other from_*()
      helpers that wrap container_of(), named from_timer(). This helper uses
      typeof(*variable), removing the type redundancy and minimizing the need
      for line wraps in forthcoming conversions from "unsigned data long" to
      "struct timer_list *" in the timer callbacks:
      
      -void callback(unsigned long data)
      +void callback(struct timer_list *t)
      {
      -   struct some_data_structure *local = (struct some_data_structure *)data;
      +   struct some_data_structure *local = from_timer(local, t, timer);
      
      Finally, in order to support the handful of timer users that perform
      open-coded assignments of the .function (and .data) fields, provide
      cast macros (TIMER_FUNC_TYPE and TIMER_DATA_TYPE) that can be used
      temporarily. Once conversion has been completed, these can be globally
      trivially removed.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20170928133817.GA113410@beast
      686fef92
    • R
      net/mlx5: Check device capability for maximum flow counters · 16f1c5bb
      Raed Salem 提交于
      Added check for the maximal number of flow counters attached
      to rule (FTE).
      
      Fixes: bd5251db ('net/mlx5_core: Introduce flow steering destination of type counter')
      Signed-off-by: NRaed Salem <raeds@mellanox.com>
      Reviewed-by: NMaor Gottlieb <maorg@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      16f1c5bb
    • I
      net/mlx5: Fix FPGA capability location · 99d3cd27
      Inbar Karmy 提交于
      Currently, FPGA capability is located in (mdev)->caps.hca_cur,
      change the location to be (mdev)->caps.fpga,
      since hca_cur is reserved for HCA device capabilities.
      
      Fixes: e29341fb ("net/mlx5: FPGA, Add basic support for Innova")
      Signed-off-by: NInbar Karmy <inbark@mellanox.com>
      Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>
      99d3cd27
  7. 27 9月, 2017 1 次提交
  8. 26 9月, 2017 6 次提交
    • M
      percpu: make this_cpu_generic_read() atomic w.r.t. interrupts · e88d62cd
      Mark Rutland 提交于
      As raw_cpu_generic_read() is a plain read from a raw_cpu_ptr() address,
      it's possible (albeit unlikely) that the compiler will split the access
      across multiple instructions.
      
      In this_cpu_generic_read() we disable preemption but not interrupts
      before calling raw_cpu_generic_read(). Thus, an interrupt could be taken
      in the middle of the split load instructions. If a this_cpu_write() or
      RMW this_cpu_*() op is made to the same variable in the interrupt
      handling path, this_cpu_read() will return a torn value.
      
      For native word types, we can avoid tearing using READ_ONCE(), but this
      won't work in all cases (e.g. 64-bit types on most 32-bit platforms).
      This patch reworks this_cpu_generic_read() to use READ_ONCE() where
      possible, otherwise falling back to disabling interrupts.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e88d62cd
    • A
      netlink: fix nla_put_{u8,u16,u32} for KASAN · b4391db4
      Arnd Bergmann 提交于
      When CONFIG_KASAN is enabled, the "--param asan-stack=1" causes rather large
      stack frames in some functions. This goes unnoticed normally because
      CONFIG_FRAME_WARN is disabled with CONFIG_KASAN by default as of commit
      3f181b4d ("lib/Kconfig.debug: disable -Wframe-larger-than warnings with
      KASAN=y").
      
      The kernelci.org build bot however has the warning enabled and that led
      me to investigate it a little further, as every build produces these warnings:
      
      net/wireless/nl80211.c:4389:1: warning: the frame size of 2240 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/wireless/nl80211.c:1895:1: warning: the frame size of 3776 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/wireless/nl80211.c:1410:1: warning: the frame size of 2208 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      net/bridge/br_netlink.c:1282:1: warning: the frame size of 2544 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      
      Most of this problem is now solved in gcc-8, which can consolidate
      the stack slots for the inline function arguments. On older compilers
      we can add a workaround by declaring a local variable in each function
      to pass the inline function argument.
      
      Cc: stable@vger.kernel.org
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4391db4
    • P
      smp/hotplug: Hotplug state fail injection · 1db49484
      Peter Zijlstra 提交于
      Add a sysfs file to one-time fail a specific state. This can be used
      to test the state rollback code paths.
      
      Something like this (hotplug-up.sh):
      
        #!/bin/bash
      
        echo 0 > /debug/sched_debug
        echo 1 > /debug/tracing/events/cpuhp/enable
      
        ALL_STATES=`cat /sys/devices/system/cpu/hotplug/states | cut -d':' -f1`
        STATES=${1:-$ALL_STATES}
      
        for state in $STATES
        do
      	  echo 0 > /sys/devices/system/cpu/cpu1/online
      	  echo 0 > /debug/tracing/trace
      	  echo Fail state: $state
      	  echo $state > /sys/devices/system/cpu/cpu1/hotplug/fail
      	  cat /sys/devices/system/cpu/cpu1/hotplug/fail
      	  echo 1 > /sys/devices/system/cpu/cpu1/online
      
      	  cat /debug/tracing/trace > hotfail-${state}.trace
      
      	  sleep 1
        done
      
      Can be used to test for all possible rollback (barring multi-instance)
      scenarios on CPU-up, CPU-down is a trivial modification of the above.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.972581715@infradead.org
      
      1db49484
    • P
      smp/hotplug: Add state diagram · fac1c204
      Peter Zijlstra 提交于
      Add a state diagram to clarify when which states are ran where.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: bigeasy@linutronix.de
      Cc: efault@gmx.de
      Cc: rostedt@goodmis.org
      Cc: max.byungchul.park@gmail.com
      Link: https://lkml.kernel.org/r/20170920170546.661598270@infradead.org
      
      fac1c204
    • J
      nvmet-fc: sync header templates with comments · 6b71f9e1
      James Smart 提交于
      Comments were incorrect:
      - defer_rcv was in host port template. moved to target port template
      - Added Mandatory statements for target port template items
      Signed-off-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6b71f9e1
    • G
      PCI: Add dummy pci_acs_enabled() for CONFIG_PCI=n build · fe594932
      Geert Uytterhoeven 提交于
      If CONFIG_PCI=n and gcc (e.g. 4.1.2) decides not to inline
      get_pci_function_alias_group(), the build fails with:
      
        drivers/iommu/iommu.o: In function `get_pci_function_alias_group':
        iommu.c:(.text+0xfdc): undefined reference to `pci_acs_enabled'
      
      Due to the various dummies for PCI calls in the CONFIG_PCI=n case,
      pci_acs_enabled() never called, but not all versions of gcc are smart
      enough to realize that.
      
      While explicitly marking get_pci_function_alias_group() inline would fix
      the build, this would inflate the code for the CONFIG_PCI=y case, as
      get_pci_function_alias_group() is a not-so-small function called from two
      places.
      
      Hence fix the issue by introducing a dummy for pci_acs_enabled() instead.
      
      Fixes: 0ae349a0 ("iommu/qcom: Add qcom_iommu")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NAlex Williamson <alex.williamson@redhat.com>
      fe594932
  9. 25 9月, 2017 7 次提交
    • P
      IB: Correct MR length field to be 64-bit · edd31551
      Parav Pandit 提交于
      The ib_mr->length represents the length of the MR in bytes as per
      the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION).
      
      Currently ib_mr->length field is defined as only 32-bits field.
      This might result into truncation and failed WRs of consumers who
      registers more than 4GB bytes memory regions and whose WRs accessing
      such MRs.
      
      This patch makes the length 64-bit to avoid such truncation.
      
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Faisal Latif <faisal.latif@intel.com>
      Fixes: 4c67e2bf ("IB/core: Introduce new fast registration API")
      Signed-off-by: NIlya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: NParav Pandit <parav@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      edd31551
    • L
      IB/core: Fix typo in the name of the tag-matching cap struct · 78b1beb0
      Leon Romanovsky 提交于
      The tag matching functionality is implemented by mlx5 driver
      by extending XRQ, however this internal kernel information was
      exposed to user space applications with *xrq* name instead of *tm*.
      
      This patch renames *xrq* to *tm* to handle that.
      
      Fixes: 8d50505a ("IB/uverbs: Expose XRQ capabilities")
      Signed-off-by: NLeon Romanovsky <leonro@mellanox.com>
      Reviewed-by: NYishai Hadas <yishaih@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      78b1beb0
    • M
      dm ioctl: fix alignment of event number in the device list · 62e08243
      Mikulas Patocka 提交于
      The size of struct dm_name_list is different on 32-bit and 64-bit
      kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).
      
      This mismatch caused some harmless difference in padding when using 32-bit
      or 64-bit kernel. Commit 23d70c5e ("dm ioctl: report event number in
      DM_LIST_DEVICES") added reporting event number in the output of
      DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
      userspace to determine the location of the event number (the location
      would be different when running on 32-bit and 64-bit kernels).
      
      Fix the padding by using offsetof(struct dm_name_list, name) instead of
      sizeof(struct dm_name_list) to determine the location of entries.
      
      Also, the ioctl version number is incremented to 37 so that userspace
      can use the version number to determine that the event number is present
      and correctly located.
      
      In addition, a global event is now raised when a DM device is created,
      removed, renamed or when table is swapped, so that the user can monitor
      for device changes.
      Reported-by: NEugene Syromiatnikov <esyr@redhat.com>
      Fixes: 23d70c5e ("dm ioctl: report event number in DM_LIST_DEVICES")
      Cc: stable@vger.kernel.org # 4.13
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      62e08243
    • J
      nvme: add transport SGL definitions · d85cf207
      James Smart 提交于
      Add transport SGL defintions from NVMe TP 4008, required for
      the final NVMe-FC standard.
      Signed-off-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d85cf207
    • J
      nvme.h: remove FC transport-specific error values · c98cb3bd
      James Smart 提交于
      The NVM express group recinded the reserved range for the transport.
      Remove the FC-centric values that had been defined.
      Signed-off-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c98cb3bd
    • W
      blktrace: Fix potential deadlock between delete & sysfs ops · 5acb3cc2
      Waiman Long 提交于
      The lockdep code had reported the following unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(s_active#228);
                                     lock(&bdev->bd_mutex/1);
                                     lock(s_active#228);
        lock(&bdev->bd_mutex);
      
       *** DEADLOCK ***
      
      The deadlock may happen when one task (CPU1) is trying to delete a
      partition in a block device and another task (CPU0) is accessing
      tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
      partition.
      
      The s_active isn't an actual lock. It is a reference count (kn->count)
      on the sysfs (kernfs) file. Removal of a sysfs file, however, require
      a wait until all the references are gone. The reference count is
      treated like a rwsem using lockdep instrumentation code.
      
      The fact that a thread is in the sysfs callback method or in the
      ioctl call means there is a reference to the opended sysfs or device
      file. That should prevent the underlying block structure from being
      removed.
      
      Instead of using bd_mutex in the block_device structure, a new
      blk_trace_mutex is now added to the request_queue structure to protect
      access to the blk_trace structure.
      Suggested-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      
      Fix typo in patch subject line, and prune a comment detailing how
      the code used to work.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5acb3cc2
    • E
      KEYS: prevent creating a different user's keyrings · 237bbd29
      Eric Biggers 提交于
      It was possible for an unprivileged user to create the user and user
      session keyrings for another user.  For example:
      
          sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
                                 keyctl add keyring _uid_ses.4000 "" @u
                                 sleep 15' &
          sleep 1
          sudo -u '#4000' keyctl describe @u
          sudo -u '#4000' keyctl describe @us
      
      This is problematic because these "fake" keyrings won't have the right
      permissions.  In particular, the user who created them first will own
      them and will have full access to them via the possessor permissions,
      which can be used to compromise the security of a user's keys:
      
          -4: alswrv-----v------------  3000     0 keyring: _uid.4000
          -5: alswrv-----v------------  3000     0 keyring: _uid_ses.4000
      
      Fix it by marking user and user session keyrings with a flag
      KEY_FLAG_UID_KEYRING.  Then, when searching for a user or user session
      keyring by name, skip all keyrings that don't have the flag set.
      
      Fixes: 69664cf1 ("keys: don't generate user and user session keyrings unless they're accessed")
      Cc: <stable@vger.kernel.org>	[v2.6.26+]
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      237bbd29
  10. 24 9月, 2017 1 次提交