1. 06 Nov 2015, 7 commits
  2. 04 Nov 2015, 4 commits
    • atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl() · 105ff3cb
      Committed by Linus Torvalds
      This seems to be a mis-reading of how alpha memory ordering works, and
      is not backed up by the alpha architecture manual.  The helper functions
      don't do anything special on any other architectures, and the arguments
      that support them being safe on other architectures also argue that they
      are safe on alpha.
      
      Basically, the "control dependency" is between a previous read and a
      subsequent write that is dependent on the value read.  Even if the
      subsequent write is actually done speculatively, there is no way that
      such a speculative write could be made visible to other CPUs until it
      has been committed, which requires validating the speculation.
      
      Note that most weakly ordered architectures (very much including alpha)
      do not guarantee any ordering relationship between two loads that depend
      on each other on a control dependency:
      
          read A
          if (A == 1)
              read B
      
      because the conditional may be predicted, and the "read B" may be
      speculatively moved up to before reading the value A.  So we require the
      user to insert a smp_rmb() between the two accesses to be correct:
      
          read A;
          if (A == 1)
              smp_rmb()
              read B
      
      Alpha is further special in that it can break that ordering even if the
      *address* of B depends on the read of A, because the cacheline that is
      read later may be stale unless you have a memory barrier in between the
      pointer read and the read of the value behind a pointer:
      
          read ptr
          read offset(ptr)
      
      whereas all other weakly ordered architectures guarantee that the data
      dependency (as opposed to just a control dependency) will order the two
      accesses.  As a result, alpha needs a "smp_read_barrier_depends()" in
      between those two reads for them to be ordered.
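      
      A minimal sketch of the two load-load cases in kernel C, using the
      existing barrier primitives (variable names are illustrative only):
      
          /* Control dependency between two loads: the branch alone does
           * not order them, so an explicit smp_rmb() is required. */
          a = READ_ONCE(A);
          if (a == 1) {
                  smp_rmb();              /* order read of A before read of B */
                  b = READ_ONCE(B);
          }
      
          /* Address (data) dependency: ordered on every weakly ordered
           * architecture except alpha, which needs this extra barrier. */
          p = READ_ONCE(ptr);
          smp_read_barrier_depends();     /* no-op everywhere but alpha */
          val = p->offset;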
      
      The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
      had was a control dependency to a subsequent *write*, however, and
      nobody can finalize such a subsequent write without having actually done
      the read.  And were you to write such a value to a "stale" cacheline
      (the way the unordered reads came to be), that would seem to lose the
      write entirely.
      
      So the things that make alpha able to re-order reads even more
      aggressively than other weak architectures do not seem to be relevant
      for a subsequent write.  Alpha memory ordering may be strange, but
      there's no real indication that it is *that* strange.
      
      Also, the alpha architecture reference manual very explicitly talks
      about the definition of "Dependence Constraints" in section 5.6.1.7,
      where a preceding read dominates a subsequent write.
      
      Such a dependence constraint admittedly does not impose a BEFORE (alpha
      architecture term for globally visible ordering), but it does guarantee
      that there can be no "causal loop".  I don't see how you could avoid
      such a loop if another cpu could see the stored value and then impact
      the value of the first read.  Put another way: the read and the write
      could not be seen as being out of order wrt other cpus.
      
      So I do not see how these "x_ctrl()" functions can currently be necessary.
      
      I may have to eat my words at some point, but in the absence of clear
      proof that alpha actually needs this, or indeed even an explanation of
      how alpha could _possibly_ need it, I do not believe these functions are
      called for.
      
      And if it turns out that alpha really _does_ need a barrier for this
      case, that barrier still should not be "smp_read_barrier_depends()".
      We'd have to make up some new speciality barrier just for alpha, along
      with the documentation for why it really is necessary.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul E McKenney <paulmck@us.ibm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lightnvm: refactor phys addrs type to u64 · b7ceb7d5
      Committed by Matias Bjørling
      When CONFIG_LBDAF is not set, sector_t is only 32 bits wide, so struct
      ppa_addr exceeds its underlying type on 32-bit architectures. ppa_addr
      requires a 64-bit integer to hold the generic ppa format. We therefore
      refactor it to u64 and replace the sector_t usages with u64 for
      physical addresses.
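      
      A rough sketch of the resulting type, assuming the generic ppa format
      packs into a single 64-bit word (field names and bit widths here are
      illustrative, not the exact layout):
      
          struct ppa_addr {
                  union {
                          struct {
                                  u64 ch  : 8;    /* channel */
                                  u64 lun : 8;    /* LUN within the channel */
                                  u64 blk : 16;   /* block */
                                  u64 pg  : 16;   /* page */
                                  u64 sec : 16;   /* sector within the page */
                          } g;                    /* generic ppa format */
                          u64 ppa;                /* raw 64-bit address */
                  };
          };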
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • net/core: fix for_each_netdev_feature · 5ba3f7d6
      Committed by Jarod Wilson
      As pointed out by Nikolay and further explained by Geert, the initial
      for_each_netdev_feature macro was broken, as feature would get set outside
      of the block of code it was intended to run in, thus only ever working for
      the first feature bit in the mask. While less pretty this way, this is
      tested and confirmed functional with multiple feature bits set in
      NETIF_F_UPPER_DISABLES.
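      
      A sketch of the corrected shape (hedged; based on the description
      above — the broken variant expanded an assignment to "feature" as the
      loop body, so a caller's block ran outside the loop; "upper_disables"
      is an illustrative mask variable):
      
          #define for_each_netdev_feature(mask_addr, bit)                \
                  for_each_set_bit(bit, (unsigned long *)(mask_addr),    \
                                   NETDEV_FEATURE_COUNT)
      
          /* callers now derive the feature inside their own block */
          int feature_bit;
      
          for_each_netdev_feature(&upper_disables, feature_bit) {
                  netdev_features_t feature = __NETIF_F_BIT(feature_bit);
                  /* ... act on one feature bit per iteration ... */
          }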
      
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      ...
      [  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      [root@dell-per730-01 ~]# ethtool -K bond0 gro off
      ...
      [  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
      [  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
      [  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: Geert Uytterhoeven <geert@linux-m68k.org>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptp: Change ptp_class to a proper bitmask · 5f94c943
      Committed by Stefan Sørensen
      Change the definition of PTP_CLASS_L2 to not have any bits overlapping with
      the other defined protocol values, allowing the PTP_CLASS_* definitions to
      be used for simple filtering on packet type.
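      
      The idea, roughly (the values shown illustrate the one-bit-per-protocol
      scheme; the handler name is hypothetical):
      
          #define PTP_CLASS_IPV4  0x10    /* event in an IPv4 packet */
          #define PTP_CLASS_IPV6  0x20    /* event in an IPv6 packet */
          #define PTP_CLASS_L2    0x40    /* no longer aliases IPV4 | IPV6 */
          #define PTP_CLASS_PMASK 0x70    /* mask for the protocol bits */
      
          /* filtering on packet type becomes a plain bit test: */
          if (type & PTP_CLASS_L2)
                  process_l2_ptp(skb);    /* hypothetical handler */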
      Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 03 Nov 2015, 4 commits
    • leds: netxbig: add device tree binding · 2976b179
      Committed by Simon Guinot
      This patch adds device tree support for the netxbig LEDs.
      
      This also introduces an additional DT binding for the GPIO extension bus
      (netxbig-gpio-ext) used to configure the LEDs. Since this bus could also
      be used to control other devices, it seems more suitable to have it
      in a separate DT binding.
      Signed-off-by: Simon Guinot <simon.guinot@sequanux.org>
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jacek Anaszewski <j.anaszewski@samsung.com>
    • net/core: generic support for disabling netdev features down stack · fd867d51
      Committed by Jarod Wilson
      There are some netdev features, which when disabled on an upper device,
      such as a bonding master or a bridge, must be disabled and cannot be
      re-enabled on underlying devices.
      
      This is a rework of an earlier, more heavy-handed approach, which simply
      disables and prevents re-enabling of netdev features listed in a new
      define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. When any
      upper device disables a flag in that feature mask, the disabling
      propagates down the stack, and any lower device that has an upper device
      with one of those flags disabled should not be able to enable said flag.
      
      Initially, only LRO is included for proof of concept, and because this
      code effectively does the same thing as dev_disable_lro(), though it will
      also activate from the ethtool path, which was one of the goals here.
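      
      A condensed sketch of the downward propagation, assuming a helper along
      these lines (names modeled on the description above; illustrative, not
      the exact patch):
      
          static void netdev_sync_lower_features(struct net_device *upper,
                  struct net_device *lower, netdev_features_t features)
          {
                  netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
                  netdev_features_t feature;
                  int bit;
      
                  for_each_netdev_feature(&upper_disables, bit) {
                          feature = __NETIF_F_BIT(bit);
                          /* off on the upper but still on on the lower:
                           * force it off down the stack */
                          if (!(features & feature) && (lower->features & feature)) {
                                  netdev_info(upper, "Disabling feature %pNF on lower dev %s.\n",
                                              &feature, lower->name);
                                  lower->wanted_features &= ~feature;
                                  netdev_update_features(lower);
                          }
                  }
          }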
      
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: on
      [root@dell-per730-01 ~]# ethtool -K bond0 lro off
      [root@dell-per730-01 ~]# ethtool -k bond0 |grep large
      large-receive-offload: off
      [root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
      large-receive-offload: off
      
      dmesg dump:
      
      [ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
      [ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
      [ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
      [ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71
      
      This has been successfully tested with bnx2x, qlcnic and netxen network
      cards as slaves in a bond interface. Turning LRO on or off on the master
      also turns it on or off on each of the slaves, new slaves are added with
      LRO in the same state as the master, and LRO can't be toggled on the
      slaves.
      
      Also, this should largely remove the need for dev_disable_lro(), and most,
      if not all, of its call sites can be replaced by simply making sure
      NETIF_F_LRO isn't included in the relevant device's feature flags.
      
      Note that this patch is driven by bug reports from users saying it was
      confusing that bonds and slaves had different settings for the same
      features, and while it won't be 100% in sync if a lower device doesn't
      support a feature like LRO, I think this is a good step in the right
      direction.
      
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jay Vosburgh <j.vosburgh@gmail.com>
      CC: Veaceslav Falico <vfalico@gmail.com>
      CC: Andy Gospodarek <gospo@cumulusnetworks.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      CC: Nikolay Aleksandrov <razor@blackwall.org>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      CC: netdev@vger.kernel.org
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add support for persistent maps/progs · b2197755
      Committed by Daniel Borkmann
      This work adds support for "persistent" eBPF maps/programs. The term
      "persistent" here means that maps/programs have a facility that lets
      them survive process termination. This is desired by various eBPF
      subsystem users.
      eBPF subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how we could tackle this issue, and various
      approaches have been tried out so far; they can be read up on in the
      reference below.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a very minimum and can be further extended later on.
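      
      A hedged user-space sketch of the two new commands (the pin path is
      illustrative; attr field names follow the bpf(2) uapi):
      
          union bpf_attr attr;
      
          /* pin an existing map fd as a node in the bpf file system */
          memset(&attr, 0, sizeof(attr));
          attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_map";
          attr.bpf_fd   = map_fd;
          syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
      
          /* later, possibly from another process: get a new fd by path */
          memset(&attr, 0, sizeof(attr));
          attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_map";
          int fd = syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
      
          /* removal goes through normal VFS operations */
          unlink("/sys/fs/bpf/my_map");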
      
      The next step I'm working on is to add dump eBPF map/prog commands
      to bpf(2), so that a specification from a given file descriptor can
      be retrieved. This can be used by things like CRIU but also applications
      can inspect the meta data after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes, who contributed significantly to
      the design discussion that eventually led us to this architecture.
      
      Reference: https://lkml.org/lkml/2015/10/15/925
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: align and clean bpf_{map,prog}_get helpers · c2101297
      Committed by Daniel Borkmann
      Add a bpf_map_get() function that we're going to use later on, and
      align/clean the remaining helpers so that they are a bit more
      consistent:
      
        - __bpf_map_get() and __bpf_prog_get() that both work on the fd
          struct, check whether the descriptor is eBPF and return the
          pointer to the map/prog stored in the private data.
      
          Also, we can return f.file->private_data directly; the function
          signature is documentation enough already.
      
        - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
          call their respective __bpf_map_get()/__bpf_prog_get() variants,
          and take a reference.
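      
      Sketched shape of the resulting pair for maps (condensed and hedged;
      error paths abbreviated):
      
          /* works on the fd struct; returns the map from the private data */
          static struct bpf_map *__bpf_map_get(struct fd f)
          {
                  if (!f.file)
                          return ERR_PTR(-EBADF);
                  if (f.file->f_op != &bpf_map_fops) {
                          fdput(f);
                          return ERR_PTR(-EINVAL);
                  }
                  return f.file->private_data;
          }
      
          /* works on a u32 user fd; calls the __ variant, takes a reference */
          struct bpf_map *bpf_map_get(u32 ufd)
          {
                  struct fd f = fdget(ufd);
                  struct bpf_map *map = __bpf_map_get(f);
      
                  if (!IS_ERR(map)) {
                          atomic_inc(&map->refcnt);
                          fdput(f);
                  }
                  return map;
          }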
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 02 Nov 2015, 1 commit
    • mm: get rid of 'vmalloc_info' from /proc/meminfo · a5ad88ce
      Committed by Linus Torvalds
      It turns out that at least some versions of glibc end up reading
      /proc/meminfo at every single startup, because glibc wants to know the
      amount of memory the machine has.  And while that's arguably insane,
      it's just how things are.
      
      And it turns out that it's not all that expensive most of the time, but
      the vmalloc information statistics (amount of virtual memory used in the
      vmalloc space, and the biggest remaining chunk) can be rather expensive
      to compute.
      
      The 'get_vmalloc_info()' function actually showed up on my profiles as
      4% of the CPU usage of "make test" in the git source repository, because
      the git tests are lots of very short-lived shell-scripts etc.
      
      It turns out that apparently this same silly vmalloc info gathering
      shows up on the facebook servers too, according to Dave Jones.  So it's
      not just "make test" for git.
      
      We had two patches to just cache the information (one by me, one by
      Ingo) to mitigate this issue, but the whole vmalloc information is of
      rather dubious value to begin with, and people who *actually* want to
      know what the situation is wrt the vmalloc area should just look at the
      much more complete /proc/vmallocinfo instead.
      
      In fact, according to my testing - and perhaps more importantly,
      according to that big search engine in the sky: Google - there is
      nothing out there that actually cares about those two expensive fields:
      VmallocUsed and VmallocChunk.
      
      So let's try to just remove them entirely.  Actually, this just removes
      the computation and reports the numbers as zero for now, just to try to
      be minimally intrusive.
      
      If this breaks anything, we'll obviously have to re-introduce the code
      to compute this all and add the caching patches on top.  But if given
      the option, I'd really prefer to just remove this bad idea entirely
      rather than add even more code to work around our historical mistake
      that likely nobody really cares about.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 01 Nov 2015, 2 commits
    • vfs: Fix pathological performance case for __alloc_fd() · f3f86e33
      Committed by Linus Torvalds
      Al Viro points out that:
      > >     * [Linux-specific aside] our __alloc_fd() can degrade quite badly
      > > with some use patterns.  The cacheline pingpong in the bitmap is probably
      > > inevitable, unless we accept considerably heavier memory footprint,
      > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
      > > to trigger - close(3);open(...); will have the next open() after that
      > > scanning the entire in-use bitmap.
      
      And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
      that opens and closes a lot of sockets with minimal work per socket.
      
      This patch largely fixes it.  We keep a 2nd-level bitmap of the open
      file bitmaps, showing which words are already full.  So then we can
      traverse that second-level bitmap to efficiently skip already allocated
      file descriptors.
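      
      A sketch of the lookup this enables, assuming a second-level
      "full_fds_bits" bitmap with one bit per word of the open-fds bitmap
      (names modeled on the description; illustrative):
      
          static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
          {
                  unsigned int maxfd = fdt->max_fds;
                  unsigned int maxbit = maxfd / BITS_PER_LONG;
                  unsigned int bitbit = start / BITS_PER_LONG;
      
                  /* level 2: skip words of open_fds that are already full */
                  bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit)
                                  * BITS_PER_LONG;
                  if (bitbit > maxfd)
                          return maxfd;
                  if (bitbit > start)
                          start = bitbit;
                  /* level 1: scan only from the first not-full word onward */
                  return find_next_zero_bit(fdt->open_fds, maxfd, start);
          }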
      
      On his benchmark, this improves performance by up to an order of
      magnitude, by avoiding the excessive open file bitmap scanning.
      Tested-and-acked-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dm: refactor ioctl handling · e56f81e0
      Committed by Christoph Hellwig
      This moves the call to blkdev_ioctl and the argument checking to DM core
      code, and only leaves a callout to find the block device to operate on
      in the targets.  This simplifies the code and allows us to pass through
      ioctl-like commands using other methods in the next patch.
      
      Also split out a helper around calling the prepare_ioctl method that
      will be reused for persistent reservation handling.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  6. 30 Oct 2015, 1 commit
  7. 29 Oct 2015, 3 commits
    • Revert "Merge branch 'ipv6-overflow-arith'" · 1e0d69a9
      Committed by Hannes Frederic Sowa
      Linus dislikes these changes. To not hold up the net merge, let's revert
      them for now and fix the bug as Linus suggested.
      
      This reverts commit ec3661b4, reversing
      changes made to c80dbe04.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lightnvm: Support for Open-Channel SSDs · cd9e9808
      Committed by Matias Bjørling
      Open-channel SSDs are devices that share responsibilities with the host
      in order to implement and maintain features that typical SSDs keep
      strictly in firmware. These include (i) the Flash Translation Layer
      (FTL), (ii) bad block management, and (iii) hardware units such as the
      flash controller, the interface controller, and large numbers of flash
      chips. In this way, Open-channel SSDs expose direct access to their
      physical flash storage, while keeping a subset of the internal features
      of SSDs.
      
      LightNVM is a specification that gives support to Open-channel SSDs.
      LightNVM allows the host to manage data placement, garbage collection,
      and parallelism. Device specific responsibilities such as bad block
      management, FTL extensions to support atomic IOs, or metadata
      persistence are still handled by the device.
      
      The implementation of LightNVM consists of two parts: core and
      (multiple) targets. The core implements functionality shared across
      targets. This includes initialization, teardown, and statistics. The targets
      implement the interface that exposes physical flash to user-space
      applications. Examples of such targets include key-value store,
      object-store, as well as traditional block devices, which can be
      application-specific.
      
      Contributions in this patch from:
      
        Javier Gonzalez <jg@lightnvm.io>
        Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
        Jesper Madsen <jmad@itu.dk>
      Signed-off-by: Matias Bjørling <m@bjorling.me>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • ARM: OMAP1: fix incorrect INT_DMA_LCD · 1bd5dfe4
      Committed by Aaro Koskinen
      Commit 685e2d08 ("ARM: OMAP1: Change interrupt numbering for
      sparse IRQ") turned on SPARSE_IRQ on OMAP1, but forgot to change
      the number of INT_DMA_LCD. This broke the boot at least on Nokia 770,
      where the device hangs during framebuffer initialization.
      
      Fix by defining INT_DMA_LCD like the other interrupts.
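      
      The shape of the fix, hedged (the actual value and header live in the
      OMAP1 platform code; the number below is illustrative only):
      
          /* before: a bare interrupt number, missing the sparse-IRQ offset */
          #define INT_DMA_LCD     57
      
          /* after: defined like the other OMAP1 interrupts, relative to
           * the legacy IRQ base used by the sparse IRQ conversion */
          #define INT_DMA_LCD     (NR_IRQS_LEGACY + 57)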
      
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: 685e2d08 ("ARM: OMAP1: Change interrupt numbering for sparse IRQ")
      Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
      Signed-off-by: Tony Lindgren <tony@atomide.com>
  8. 28 Oct 2015, 11 commits
  9. 27 Oct 2015, 6 commits
  10. 26 Oct 2015, 1 commit