1. 29 9月, 2017 5 次提交
  2. 26 9月, 2017 1 次提交
  3. 25 9月, 2017 5 次提交
  4. 22 9月, 2017 3 次提交
    • E
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet 提交于
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      222d7dbd
    • D
      Input: uinput - avoid FF flush when destroying device · e8b95728
      Dmitry Torokhov 提交于
      Normally, when input device supporting force feedback effects is being
      destroyed, we try to "flush" currently playing effects, so that the
      physical device does not continue vibrating (or executing other effects).
      Unfortunately this does not work well for uinput as flushing of the effects
      deadlocks with the destroy action:
      
      - if device is being destroyed because the file descriptor is being closed,
        then there is noone to even service FF requests;
      
      - if device is being destroyed because userspace sent UI_DEV_DESTROY,
        while theoretically it could be possible to service FF requests,
        userspace is unlikely to do so (they'd need to make sure FF handling
        happens on a separate thread) even if kernel solves the issue with FF
        ioctls deadlocking with UI_DEV_DESTROY ioctl on udev->mutex.
      
      To avoid lockups like the one below, let's install a custom input device
      flush handler, and avoid trying to flush force feedback effects when we
      destroying the device, and instead rely on uinput to shut off the device
      properly.
      
      NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
      ...
       <<EOE>>  [<ffffffff817a0307>] _raw_spin_lock_irqsave+0x37/0x40
       [<ffffffff810e633d>] complete+0x1d/0x50
       [<ffffffffa00ba08c>] uinput_request_done+0x3c/0x40 [uinput]
       [<ffffffffa00ba587>] uinput_request_submit.part.7+0x47/0xb0 [uinput]
       [<ffffffffa00bb62b>] uinput_dev_erase_effect+0x5b/0x76 [uinput]
       [<ffffffff815d91ad>] erase_effect+0xad/0xf0
       [<ffffffff815d929d>] flush_effects+0x4d/0x90
       [<ffffffff815d4cc0>] input_flush_device+0x40/0x60
       [<ffffffff815daf1c>] evdev_cleanup+0xac/0xc0
       [<ffffffff815daf5b>] evdev_disconnect+0x2b/0x60
       [<ffffffff815d74ac>] __input_unregister_device+0xac/0x150
       [<ffffffff815d75f7>] input_unregister_device+0x47/0x70
       [<ffffffffa00bac45>] uinput_destroy_device+0xb5/0xc0 [uinput]
       [<ffffffffa00bb2de>] uinput_ioctl_handler.isra.9+0x65e/0x740 [uinput]
       [<ffffffff811231ab>] ? do_futex+0x12b/0xad0
       [<ffffffffa00bb3f8>] uinput_ioctl+0x18/0x20 [uinput]
       [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
       [<ffffffff81337553>] ? security_file_ioctl+0x43/0x60
       [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
       [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      Reported-by: NRodrigo Rivas Costa <rodrigorivascosta@gmail.com>
      Reported-by: NClément VUCHENER <clement.vuchener@gmail.com>
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=193741Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      e8b95728
    • F
      net: ethtool: Add back transceiver type · 19cab887
      Florian Fainelli 提交于
      Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      deprecated the ethtool_cmd::transceiver field, which was fine in
      premise, except that the PHY library was actually using it to report the
      type of transceiver: internal or external.
      
      Use the first word of the reserved field to put this __u8 transceiver
      field back in. It is made read-only, and we don't expect the
      ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
      is mostly for the legacy path where we do:
      
      ethtool_get_settings()
      -> dev->ethtool_ops->get_link_ksettings()
         -> convert_link_ksettings_to_legacy_settings()
      
      to have no information loss compared to the legacy get_settings API.
      
      Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19cab887
  5. 21 9月, 2017 2 次提交
  6. 20 9月, 2017 2 次提交
    • J
      ACPI / bus: Make ACPI_HANDLE() work for non-GPL code again · 9e987b70
      John Hubbard 提交于
      Due to commit db3e50f3 (device property: Get rid of struct
      fwnode_handle type field), ACPI_HANDLE() inadvertently became
      a GPL-only call. The call path that led to that was:
      
      ACPI_HANDLE()
          ACPI_COMPANION()
              to_acpi_device_node()
                  is_acpi_device_node()
                      acpi_device_fwnode_ops
                          DECLARE_ACPI_FWNODE_OPS(acpi_device_fwnode_ops);
      
      ...and the new DECLARE_ACPI_FWNODE_OPS() includes
      EXPORT_SYMBOL_GPL, whereas previously it was a static struct.
      
      In order to avoid changing any of that, let's instead provide ever
      so slightly better encapsulation of those struct fwnode_operations
      instances. Those do not really need to be directly used in
      inline function calls in header files. Simply moving two small
      functions (is_acpi_device_node and is_acpi_data_node) out of
      acpi_bus.h, and into a .c file, does that.
      
      That leaves the internals of struct fwnode_operations as GPL-only
      (which I think was the intent all along), but un-breaks any driver
      code out there that relies on the ACPI subsystem's being (historically)
      an EXPORT_SYMBOL-usable system. By that, I mean, ACPI_HANDLE() and
      other basic ACPI calls were non-GPL-protected.
      
      Also, while I'm there, remove a tiny bit of redundancy that was missed
      in the earlier commit, by having is_acpi_node() use the other two
      routines, instead of checking fwnode directly.
      
      Fixes: db3e50f3 (device property: Get rid of struct fwnode_handle type field)
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: NSakari Ailus <sakari.ailus@linux.intel.com>
      Acked-by: NMika Westerberg <mika.westerberg@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      9e987b70
    • A
      of: provide inline helper for of_find_device_by_node · aa767cfb
      Arnd Bergmann 提交于
      The ipmmu-vmsa driver fails in compile-testing on non-OF platforms:
      
      drivers/iommu/ipmmu-vmsa.o: In function `ipmmu_of_xlate':
      ipmmu-vmsa.c:(.text+0x740): undefined reference to `of_find_device_by_node'
      
      It would be reasonable to assume that this interface works but
      returns failure on non-OF builds, like it does on machines that
      have been booted in another way, so this adds another inline
      function helper.
      
      Fixes: 7b2d5961 ("iommu/ipmmu-vmsa: Replace local utlb code with fwspec ids")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NRob Herring <robh@kernel.org>
      aa767cfb
  7. 19 9月, 2017 2 次提交
  8. 18 9月, 2017 2 次提交
  9. 16 9月, 2017 1 次提交
    • X
      sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb
      Xin Long 提交于
      Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
      dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep
      with holding sock and sock lock.
      
      But Paolo noticed that endpoint could be destroyed in sctp_rcv without
      sock lock protection. It means the use-after-free issue still could be
      triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks
      !ep, although it's pretty hard to reproduce.
      
      I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close
      and sctp_sock_dump long time.
      
      This patch is to add another param cb_done to sctp_for_each_transport
      and dump ep->assocs with holding tsp after jumping out of transport's
      traversal in it to avoid this issue.
      
      It can also improve sctp diag dump to make it run faster, as no need
      to save sk into cb->args[5] and keep calling sctp_for_each_transport
      any more.
      
      This patch is also to use int * instead of int for the pos argument
      in sctp_for_each_transport, which could make postion increment only
      in sctp_for_each_transport and no need to keep changing cb->args[2]
      in sctp_sock_filter and sctp_sock_dump any more.
      
      Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d25adbeb
  10. 15 9月, 2017 5 次提交
  11. 14 9月, 2017 2 次提交
    • M
      mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko 提交于
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  It's
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing caller provide a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replace all existing users to simply use GFP_KERNEL.  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect shorterm
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was been brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
    • D
      sctp: potential read out of bounds in sctp_ulpevent_type_enabled() · fa5f7b51
      Dan Carpenter 提交于
      This code causes a static checker warning because Smatch doesn't trust
      anything that comes from skb->data.  I've reviewed this code and I do
      think skb->data can be controlled by the user here.
      
      The sctp_event_subscribe struct has 13 __u8 fields and we want to see
      if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
      We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
      either before the start of the struct or after the end.
      
      This is a very old bug and it's surprising that it would go undetected
      for so long but my theory is that it just doesn't have a big impact so
      it would be hard to notice.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa5f7b51
  12. 13 9月, 2017 1 次提交
    • C
      net_sched: get rid of tcfa_rcu · d7fb60b9
      Cong Wang 提交于
      gen estimator has been rewritten in commit 1c0d32fd
      ("net_sched: gen_estimator: complete rewrite of rate estimators"),
      the caller is no longer needed to wait for a grace period.
      So this patch gets rid of it.
      
      This also completely closes a race condition between action free
      path and filter chain add/remove path for the following patch.
      Because otherwise the nested RCU callback can't be caught by
      rcu_barrier().
      
      Please see also the comments in code.
      
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7fb60b9
  13. 12 9月, 2017 4 次提交
  14. 11 9月, 2017 2 次提交
    • G
      block: tolerate tracing of NULL bio · f8e9ec16
      Greg Thelen 提交于
      __get_request() can call trace_block_getrq() with bio=NULL which causes
      block_get_rq::TP_fast_assign() to deref a NULL pointer and panic.
      
      Syzkaller fuzzer panics with
      linux-next (1d53d908b79d7870d89063062584eead4cf83448):
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Modules linked in:
        CPU: 0 PID: 2983 Comm: syzkaller401111 Not tainted 4.13.0-rc7-next-20170901+ #13
        task: ffff8801cf1da000 task.stack: ffff8801ce440000
        RIP: 0010:perf_trace_block_get_rq+0x697/0x970 include/trace/events/block.h:384
        RSP: 0018:ffff8801ce4473f0 EFLAGS: 00010246
        RAX: ffff8801cf1da000 RBX: 1ffff10039c88e84 RCX: 1ffffd1ffff84d27
        RDX: dffffc0000000001 RSI: 1ffff1003b643e7a RDI: ffffe8ffffc26938
        RBP: ffff8801ce447530 R08: 1ffff1003b643e6c R09: ffffe8ffffc26964
        R10: 0000000000000002 R11: fffff91ffff84d2d R12: ffffe8ffffc1f890
        R13: ffffe8ffffc26930 R14: ffffffff85cad9e0 R15: 0000000000000000
        FS:  0000000002641880(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 000000000043e670 CR3: 00000001d1d7a000 CR4: 00000000001406f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          trace_block_getrq include/trace/events/block.h:423 [inline]
          __get_request block/blk-core.c:1283 [inline]
          get_request+0x1518/0x23b0 block/blk-core.c:1355
          blk_old_get_request block/blk-core.c:1402 [inline]
          blk_get_request+0x1d8/0x3c0 block/blk-core.c:1427
          sg_scsi_ioctl+0x117/0x750 block/scsi_ioctl.c:451
          sg_ioctl+0x192d/0x2ed0 drivers/scsi/sg.c:1070
          vfs_ioctl fs/ioctl.c:45 [inline]
          do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
          SYSC_ioctl fs/ioctl.c:700 [inline]
          SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
          entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      block_get_rq::TP_fast_assign() has multiple redundant ->dev assignments.
      Only one of them is NULL tolerant.  Favor the NULL tolerant one.
      
      Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f8e9ec16
    • M
      dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka 提交于
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should be also pointed out that for some uses of persistent memory it
      is needed to flush only a very small amount of data (such as 1 cacheline),
      and it would be overkill if we go through that device mapper machinery for
      a single flushed cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c3ca015f
  15. 09 9月, 2017 3 次提交
    • D
      bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed
      Daniel Borkmann 提交于
      Differ between illegal XDP action code and just driver
      unsupported one to provide better feedback when we throw
      a one-time warning here. Reason is that with 814abfab
      ("xdp: add bpf_redirect helper function") not all drivers
      support the new XDP return code yet and thus they will
      fall into their 'default' case when checking for return
      codes after program return, which then triggers a
      bpf_warn_invalid_xdp_action() stating that the return
      code is illegal, but from XDP perspective it's not.
      
      I decided not to place something like a XDP_ACT_MAX define
      into uapi i) given we don't have this either for all other
      program types, ii) future action codes could have further
      encoding there, which would render such define unsuitable
      and we wouldn't be able to rip it out again, and iii) we
      rarely add new action codes.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9beb8bed
    • J
      bpf: add support for sockmap detach programs · 5a67da2a
      John Fastabend 提交于
      The bpf map sockmap supports adding programs via attach commands. This
      patch adds the detach command to keep the API symmetric and allow
      users to remove previously added programs. Otherwise the user would
      have to delete the map and re-add it to get in this state.
      
      This also adds a series of additional tests to capture detach operation
      and also attaching/detaching invalid prog types.
      
      API note: socks will run (or not run) programs depending on the state
      of the map at the time the sock is added. We do not for example walk
      the map and remove programs from previously attached socks.
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a67da2a
    • G
      ipc: optimize semget/shmget/msgget for lots of keys · 0cfb6aee
      Guillaume Knispel 提交于
      ipc_findkey() used to scan all objects to look for the wanted key.  This
      is slow when using a high number of keys.  This change adds an rhashtable
      of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n).
      
      This change gives a 865% improvement of benchmark reaim.jobs_per_min on a
      56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1]
      
      Other (more micro) benchmark results, by the author: On an i5 laptop, the
      following loop executed right after a reboot took, without and with this
      change:
      
          for (int i = 0, k=0x424242; i < KEYS; ++i)
              semget(k++, 1, IPC_CREAT | 0600);
      
                       total       total          max single  max single
         KEYS        without        with        call without   call with
      
            1            3.5         4.9   µs            3.5         4.9
           10            7.6         8.6   µs            3.7         4.7
           32           16.2        15.9   µs            4.3         5.3
          100           72.9        41.8   µs            3.7         4.7
         1000        5,630.0       502.0   µs             *           *
        10000    1,340,000.0     7,240.0   µs             *           *
        31900   17,600,000.0    22,200.0   µs             *           *
      
       *: unreliable measure: high variance
      
      The duration for a lookup-only usage was obtained by the same loop once
      the keys are present:
      
                       total       total          max single  max single
         KEYS        without        with        call without   call with
      
            1            2.1         2.5   µs            2.1         2.5
           10            4.5         4.8   µs            2.2         2.3
           32           13.0        10.8   µs            2.3         2.8
          100           82.9        25.1   µs             *          2.3
         1000        5,780.0       217.0   µs             *           *
        10000    1,470,000.0     2,520.0   µs             *           *
        31900   17,400,000.0     7,810.0   µs             *           *
      
      Finally, executing each semget() in a new process gave, when still
      summing only the durations of these syscalls:
      
      creation:
                       total       total
         KEYS        without        with
      
            1            3.7         5.0   µs
           10           32.9        36.7   µs
           32          125.0       109.0   µs
          100          523.0       353.0   µs
         1000       20,300.0     3,280.0   µs
        10000    2,470,000.0    46,700.0   µs
        31900   27,800,000.0   219,000.0   µs
      
      lookup-only:
                       total       total
         KEYS        without        with
      
            1            2.5         2.7   µs
           10           25.4        24.4   µs
           32          106.0        72.6   µs
          100          591.0       352.0   µs
         1000       22,400.0     2,250.0   µs
        10000    2,510,000.0    25,700.0   µs
        31900   28,200,000.0   115,000.0   µs
      
      [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop
      
      Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debixSigned-off-by: NGuillaume Knispel <guillaume.knispel@supersonicimagine.com>
      Reviewed-by: NMarc Pardo <marc.pardo@supersonicimagine.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com>
      Cc: Marc Pardo <marc.pardo@supersonicimagine.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cfb6aee