1. 27 Sep 2014 (7 commits)
    • bpf: verifier (add verifier core) · 17a52670
      Alexei Starovoitov committed
      This patch adds the verifier core, which simulates execution of every insn and
      records the state of registers and program stack. Every branch instruction seen
      during simulation is pushed onto the state stack. When the verifier reaches BPF_EXIT,
      it pops the state from the stack and continues until it reaches BPF_EXIT again.
      For program:
      1: bpf_mov r1, xxx
      2: if (r1 == 0) goto 5
      3: bpf_mov r0, 1
      4: goto 6
      5: bpf_mov r0, 2
      6: bpf_exit
      The verifier will walk insns: 1, 2, 3, 4, 6
      then it will pop the state recorded at insn#2 and will continue: 5, 6
      
      This way it walks all possible paths through the program and checks all
      possible values of registers. While doing so, it checks for:
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      - use of reserved fields in the instruction encoding
      
      The kernel subsystem configures the verifier with two callbacks (a sketch
      follows below):
      
      - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
        that tells the verifier which fields of 'ctx' are accessible
        (remember, 'ctx' is the first argument to an eBPF program)
      
      - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
        returns the argument constraints of kernel helper functions that the eBPF
        program may call, so that the verifier can check that R1-R5 types match
        the prototype
      
      More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c
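      
      As an illustration, a subsystem's is_valid_access callback might look
      like this minimal sketch (the 128-byte ctx size and the reads-only,
      4-byte-aligned policy are assumptions for the example, not taken from
      this patch):
      
      static bool sample_is_valid_access(int off, int size,
                                         enum bpf_access_type type)
      {
          if (type != BPF_READ)
              return false;                   /* this sketch allows reads only */
          if (off < 0 || off + size > 128)    /* 128 = assumed ctx size */
              return false;                   /* stay within 'ctx' */
          return size == 4 && (off % 4) == 0; /* aligned 4-byte loads only */
      }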
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: handle pseudo BPF_LD_IMM64 insn · 0246e64d
      Alexei Starovoitov committed
      eBPF programs passed from userspace use pseudo BPF_LD_IMM64 instructions
      to refer to process-local map_fds. Scan the program for such instructions and,
      if the FDs are valid, convert them to 'struct bpf_map' pointers which will be
      used by the verifier to check access to maps in bpf_map_lookup/update() calls.
      If the program passes the verifier, convert the pseudo BPF_LD_IMM64 into a
      generic one by dropping the BPF_PSEUDO_MAP_FD flag.
      
      Note that the eBPF interpreter is generic and knows nothing about pseudo insns.
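      
      For illustration, such a pseudo instruction could be emitted from user
      space roughly like this (a sketch: it assumes the BPF_PSEUDO_MAP_FD
      marker lives in src_reg, as the BPF_LD_IMM64 commit below suggests, and
      that map_fd came from BPF_MAP_CREATE):
      
      struct bpf_insn insns[2] = {
          { .code    = BPF_LD | BPF_DW | BPF_IMM,
            .dst_reg = BPF_REG_1,
            .src_reg = BPF_PSEUDO_MAP_FD,     /* marks imm as a map fd */
            .imm     = map_fd },              /* lower 32 bits: the fd  */
          { .imm     = 0 },                   /* upper 32 bits: unused  */
      };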
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: verifier (add docs) · 51580e79
      Alexei Starovoitov committed
      This patch adds the eBPF verifier documentation and an empty bpf_check() stub.
      
      The end goal for the verifier is to statically check safety of the program.
      
      The verifier will catch:
      - loops
      - out of range jumps
      - unreachable instructions
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      
      More details in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: expand BPF syscall with program load/unload · 09756af4
      Alexei Starovoitov committed
      eBPF programs are similar to kernel modules. They are loaded by the user
      process and automatically unloaded when the process exits. Each eBPF program is
      a safe run-to-completion set of instructions. The eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string
      that must be GPL-compatible for the program to call helper functions
      marked gpl_only.
      
      Upon successful load the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User space tests and examples follow in later patches.
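      
      As a minimal illustration, a "return 0" program could be loaded like so
      (the insn macros are the ones from linux/filter.h; the program type is
      a placeholder here, since real types arrive in later patches):
      
      struct bpf_insn prog[] = {
          BPF_MOV64_IMM(BPF_REG_0, 0),    /* r0 = 0    */
          BPF_EXIT_INSN(),                /* return r0 */
      };
      int prog_fd = bpf_prog_load(BPF_PROG_TYPE_UNSPEC, prog, 2, "GPL");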
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add lookup/update/delete/iterate methods to BPF maps · db20fd2b
      Alexei Starovoitov committed
      'maps' is a generic storage facility of different types for sharing data
      between the kernel and userspace.
      
      Maps are accessed from user space via the BPF syscall, which has the
      following commands (a usage sketch follows the list):
      
      - create a map with given type and attributes
        fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
        returns fd or negative error
      
      - lookup key in a given map referenced by fd
        err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero and stores found elem into value or negative error
      
      - create or update key/value pair in a given map
        err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero or negative error
      
      - find and delete element by key in a given map
        err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key
      
      - iterate map elements (based on input key return next_key)
        err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->next_key
      
      - close(fd) deletes the map
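      
      A usage sketch for the lookup command above (key, value and map_fd are
      the caller's variables; ptr_to_u64() is the usual cast of a pointer to
      __u64):
      
      union bpf_attr attr = {
          .map_fd = map_fd,
          .key    = ptr_to_u64(&key),
          .value  = ptr_to_u64(&value),
      };
      
      int err = bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
      /* on success (err == 0), 'value' holds the element found under 'key' */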
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: enable bpf syscall on x64 and i386 · 749730ce
      Alexei Starovoitov committed
      Done as a separate commit to ease conflict resolution.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: introduce BPF syscall and maps · 99c55f7d
      Alexei Starovoitov committed
      The BPF syscall is a multiplexor for a range of different operations on eBPF.
      This patch introduces the syscall with a single command to create a map.
      The next patch adds commands to access maps.
      
      'maps' is a generic storage facility of different types for sharing data
      between the kernel and userspace.
      
      Userspace example:
      /* this syscall wrapper creates a map with given type and attributes
       * and returns map_fd on success.
       * use close(map_fd) to delete the map
       */
      int bpf_create_map(enum bpf_map_type map_type, int key_size,
                         int value_size, int max_entries)
      {
          union bpf_attr attr = {
              .map_type = map_type,
              .key_size = key_size,
              .value_size = value_size,
              .max_entries = max_entries
          };
      
          return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      }
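      
      A call site might then look like this (BPF_MAP_TYPE_HASH is from a
      later patch; this commit introduces only map creation itself):
      
      int map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int),
                                  sizeof(long), 1024);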
      
      'union bpf_attr' is backwards compatible with future extensions.
      
      More details in Documentation/networking/filter.txt and in manpage
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 26 Sep 2014 (1 commit)
  3. 24 Sep 2014 (2 commits)
    • blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe · 0a30288d
      Tejun Heo committed
      blk-mq uses percpu_ref for its usage counter, which tracks the number
      of in-flight commands and is used to synchronously drain the queue on
      freeze.  percpu_ref shutdown takes measurable wallclock time as it
      involves a sched RCU grace period.  This means that draining a blk-mq
      queue takes measurable wallclock time.  One would think that this
      shouldn't matter as queue shutdown should be a rare event which takes
      place asynchronously w.r.t. userland.
      
      Unfortunately, SCSI probing involves synchronously setting up and then
      tearing down a lot of request_queues back-to-back for non-existent
      LUNs.  This means that SCSI probing may take more than ten seconds
      when scsi-mq is used.
      
      This will be properly fixed by implementing a mechanism to keep
      q->mq_usage_counter in atomic mode till genhd registration; however,
      that involves rather big updates to percpu_ref which is difficult to
      apply late in the devel cycle (v3.17-rc6 at the moment).  As a
      stop-gap measure till the proper fix can be implemented in the next
      cycle, this patch introduces __percpu_ref_kill_expedited() and makes
      blk_mq_freeze_queue() use it.  This is heavy-handed but should work
      for testing the experimental SCSI blk-mq implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de
      Fixes: add703fd ("blk-mq: use percpu_ref for mq usage count")
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • crypto: ccp - Check for CCP before registering crypto algs · c9f21cb6
      Tom Lendacky committed
      If the ccp is built in, then ccp-crypto (whether built as a module or
      built in) will be able to load, and it will register its crypto
      algorithms.  If the system does not have a CCP, this results in -ENODEV
      being returned whenever the registered crypto algorithms attempt to
      queue a command.
      
      Add an API, ccp_present(), that checks for the presence of a CCP
      on the system.  The ccp-crypto module can use this to determine if it
      should register its crypto algorithms.
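      
      The check then becomes a simple gate early in the ccp-crypto init path,
      roughly (a sketch, not the exact patch):
      
      if (ccp_present() != 0)
          return -ENODEV;    /* no CCP on this system: don't register algs */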
      
      Cc: stable@vger.kernel.org
      Reported-by: Scot Doyle <lkml14@scotdoyle.com>
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Tested-by: Scot Doyle <lkml14@scotdoyle.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  4. 20 Sep 2014 (5 commits)
    • net/mlx4_core: Enable CQE/EQE stride support · 77507aa2
      Ido Shamay committed
      This feature is intended for archs with a cache line larger than 64B.
      
      Since our CQE/EQEs are generally 64B in those systems, HW will write
      twice to the same cache line consecutively, causing pipe locks due to
      the hazard prevention mechanism. For elements in a cyclic buffer, writes
      are consecutive, so entries smaller than a cache line should be
      avoided, especially if they are written at a high rate.
      
      Reduce consecutive writes to same cache line in CQs/EQs, by allowing the
      driver to increase the distance between entries so that each will reside
      in a different cache line. Until the introduction of this feature, there
      were two types of CQE/EQE:
      
      1. 32B stride and context in the [0-31] segment
      2. 64B stride and context in the [32-63] segment
      
      This feature introduces two additional types:
      
      3. 128B stride and context in the [0-31] segment (128B cache line)
      4. 256B stride and context in the [0-31] segment (256B cache line)
      
      Modify the mlx4_core driver to query the device for the CQE/EQE cache
      line stride capability and to enable that capability when the host
      cache line size is larger than 64 bytes (supported cache lines are
      128B and 256B).
      
      The mlx4 IB driver and libmlx4 need not be aware of this change. The PF
      context behaviour is changed to require this change in VF drivers
      running on such archs.
      Signed-off-by: Ido Shamay <idos@mellanox.com>
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fou: Add GRO support · afe93325
      Tom Herbert committed
      Implement fou_gro_receive and fou_gro_complete, and populate these
      in the corresponding udp_offloads for the socket. Add an ipproto field to
      udp_offloads and pass it from UDP to the fou GRO routine in the proto
      field of the napi_gro_cb structure.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bcmgenet: remove PHY_BRCM_100MBPS_WAR · 80780a54
      Florian Fainelli committed
      Now that we have removed the need for the PHY_BRCM_100MBPS_WAR flag, we
      can remove it from the GENET driver and the broadcom shared header file.
      The PHY driver checks the PHY supported bitmask instead.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: broadcom: add helper for PHY revision and patch level · bb7d9349
      Florian Fainelli committed
      The Broadcom BCM7xxx internal PHYs do not contain any useful revision
      information in the low 4-bits of their MII_PHYSID2 (MII register 3)
      which could allow us to properly identify them.
      
      As a result, we need the actual hardware block integrating these PHYs
      (GENET or the SF2 switch) to tell us what revision they are built with. To
      assist with that, add two helper macros for fetching the PHY revision
      and patch level from struct phy_device::dev_flags.
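      
      The helpers plausibly take this shape (the exact bit positions inside
      dev_flags are an assumption here; include/linux/brcmphy.h has the
      authoritative definitions):
      
      #define PHY_BRCM_7XXX_REV(flags)      (((flags) >> 8) & 0xff)
      #define PHY_BRCM_7XXX_PATCH(flags)    ((flags) & 0xff)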
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add alloc_skb_with_frags() helper · 2e4e4410
      Eric Dumazet committed
      Extract from sock_alloc_send_pskb() the code that builds an skb with frags,
      so that we can reuse it in other contexts.
      
      The intent is to use it from tcp_send_rcvq(), tcp_collapse(), ...
      
      We also want to replace some skb_linearize() calls with a more reliable
      strategy in pathological cases where we need to reduce the number of frags.
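      
      The helper's call shape (argument names here are descriptive, matching
      the pieces extracted from sock_alloc_send_pskb()):
      
      int err;
      struct sk_buff *skb = alloc_skb_with_frags(header_len, data_len,
                                                 max_page_order, &err,
                                                 GFP_KERNEL);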
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 19 Sep 2014 (1 commit)
  6. 17 Sep 2014 (1 commit)
  7. 16 Sep 2014 (1 commit)
    • dsa: Split ops up, and avoid assigning tag_protocol and receive separately · 5075314e
      Alexander Duyck committed
      This change addresses several issues.
      
      First, it was possible to set tag_protocol without setting the ops pointer.
      To correct that I have reordered things so that rcv is now populated before
      we set tag_protocol.
      
      Second, it didn't make much sense to keep setting the device ops each time a
      new slave was registered.  Moving the receive portion out into root switch
      initialization addresses that.
      
      Third, I wanted to avoid sending tags if the rcv pointer was not registered,
      so I changed the tag check to verify whether the rcv function pointer is set
      on the root tree.  If it is, then we start sending DSA tagged frames.
      
      Finally I split the device ops pointer in the structures into two spots.  I
      placed the rcv function pointer in the root switch since this makes it
      easiest to access from there, and I placed the xmit function pointer in the
      slave for the same reason.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 15 Sep 2014 (1 commit)
    • vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4
      Linus Torvalds committed
      The performance regression that Josef Bacik reported in the pathname
      lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
      me look at performance stability of the dcache code, just to verify that
      the problem was actually fixed.  That turned up a few other problems in
      this area.
      
      There are a few cases where we exit RCU lookup mode and go to the slow
      serializing case when we shouldn't, Al has fixed those and they'll come
      in with the next VFS pull.
      
      But my performance verification also shows that link_path_walk() turns
      out to have a very unfortunate 32-bit store of the length and hash of
      the name we look up, followed by a 64-bit read of the combined hash_len
      field.  That screws up the processor's store-to-load forwarding, causing
      an unnecessary hiccup in this critical routine.
      
      It's caused by the ugly calling convention for the "hash_name()"
      function, and easily fixed by just making hash_name() fill in the whole
      'struct qstr' rather than passing it a pointer to just the hash value.
      
      With that, the profile for this function looks much smoother.
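      
      Schematically, the change looks like this (a simplified sketch, not the
      exact fs/namei.c code):
      
      /* before: two 32-bit stores that a 64-bit load then reads back */
      this.len  = len;                   /* 32-bit store */
      this.hash = hash;                  /* 32-bit store */
      u64 hash_len = this.hash_len;      /* 64-bit load: forwarding stalls */
      
      /* after: one 64-bit store, which the 64-bit load forwards from */
      this.hash_len = hashlen_create(hash, len);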
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 14 Sep 2014 (8 commits)
  10. 13 Sep 2014 (2 commits)
    • jiffies: Fix timeval conversion to jiffies · d78c9300
      Andrew Hunter committed
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping/starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec * (NSEC_PER_USEC / TICK_NSEC)
      
      by using scaling arithmetic, which took the best approximation of
      NSEC_PER_USEC/TICK_NSEC with a denominator of 2^USEC_JIFFIE_SC, i.e.
      x/(2^USEC_JIFFIE_SC), and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
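      
      Schematically, the fixed conversion does this (eliding seconds and
      overflow handling; a sketch of the approach, not the exact kernel code):
      
      u64 nsec = (u64)usec * NSEC_PER_USEC;
      /* round up in nanoseconds, where TICK_NSEC is exact, then divide */
      u64 jiffies = div_u64(nsec + TICK_NSEC - 1, TICK_NSEC);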
      
      Tested: the following program:
      
      #include <stdio.h>
      #include <sys/time.h>
      
      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
        return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • workqueue: apply __WQ_ORDERED to create_singlethread_workqueue() · e09c2c29
      Tejun Heo committed
      create_singlethread_workqueue() is a compat interface for single
      threaded workqueues, which map to ordered workqueues with a rescuer in
      the current implementation.  It is currently implemented by invoking
      alloc_workqueue() with appropriate parameters.
      
      8719dcea ("workqueue: reject adjusting max_active or applying
      attrs to ordered workqueues") introduced __WQ_ORDERED to protect
      ordered workqueues against dynamic attribute changes which can break
      ordering guarantees but forgot to apply it to
      create_singlethread_workqueue().  This in itself is okay as nobody
      currently uses dynamic attribute change on workqueues created with
      create_singlethread_workqueue().
      
      However, 4c16bd32 ("workqueue: implement NUMA affinity for unbound
      workqueues") broke singlethreaded guarantee for ordered workqueues
      through allocating a separate pool_workqueue on each NUMA node by
      default.  A later change 8a2b7538 ("workqueue: fix ordered
      workqueues in NUMA setups") fixed it by allocating only one global
      pool_workqueue if __WQ_ORDERED is set.
      
      Combined, the __WQ_ORDERED omission in create_singlethread_workqueue()
      became critical, breaking its single-threadedness and ordering
      guarantee.
      
      Let's make create_singlethread_workqueue() wrap
      alloc_ordered_workqueue() instead so that it inherits __WQ_ORDERED and
      can implicitly track future ordered_workqueue changes.
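      
      The resulting definition is roughly (modulo the exact flags used in
      include/linux/workqueue.h):
      
      #define create_singlethread_workqueue(name)                     \
          alloc_ordered_workqueue("%s", WQ_MEM_RECLAIM, name)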
      
      v2: I missed that __WQ_ORDERED now protects against pwq splitting
          across NUMA nodes and incorrectly described the patch as a
          nice-to-have fix to protect against future dynamic attribute
          usages.  Oleg pointed out that this is actually a critical
          breakage due to 8a2b7538 ("workqueue: fix ordered workqueues
          in NUMA setups").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Mike Anderson <mike.anderson@us.ibm.com>
      Cc: Oleg Nesterov <onestero@redhat.com>
      Cc: Gustavo Luiz Duarte <gduarte@redhat.com>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 4c16bd32 ("workqueue: implement NUMA affinity for unbound workqueues")
  11. 11 Sep 2014 (3 commits)
    • net/mlx4: Set vlan stripping policy by the right command · 09e05c3f
      Matan Barak committed
      Changing the vlan stripping policy of the QP isn't supported by older
      firmware versions for the INIT2RTR command. Nevertheless, we've used it.
      
      Fix that by doing this policy change using INIT2RTR only if the firmware
      supports it; otherwise, we call the UPDATE_QP command to do the task.
      
      Fixes: 7677fc96 ('net/mlx4: Strengthen VLAN tags/priorities enforcement in VST mode')
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selected · b954d834
      Daniel Borkmann committed
      Since the BPF JIT depends on the availability of the module_alloc() and
      module_free() helpers (HAVE_BPF_JIT and MODULES), we had better build
      that code only when BPF_JIT is enabled in the config, just like with
      other JIT code. Fixes builds for arm/marzen_defconfig and
      sh/rsk7269_defconfig.
      
      ====================
      kernel/built-in.o: In function `bpf_jit_binary_alloc':
      /home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc'
      kernel/built-in.o: In function `bpf_jit_binary_free':
      /home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free'
      make: *** [vmlinux] Error 1
      ====================
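      
      The fix simply compiles those helpers only when the JIT is configured,
      along these lines:
      
      #ifdef CONFIG_BPF_JIT
      /* bpf_jit_binary_alloc() / bpf_jit_binary_free() live here, so configs
       * without a JIT never reference module_alloc() / module_free()
       */
      #endif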
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Fixes: 738cbe72 ("net: bpf: consolidate JIT binary allocator")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • PCI: Add pci_ignore_hotplug() to ignore hotplug events for a device · b440bde7
      Bjorn Helgaas committed
      Powering off a hot-pluggable device, e.g., with pci_set_power_state(D3cold),
      normally generates a hot-remove event that unbinds the driver.
      
      Some drivers expect to remain bound to a device even while they power it
      off and back on again.  This can be dangerous, because if the device is
      removed or replaced while it is powered off, the driver doesn't know that
      anything changed.  But some drivers accept that risk.
      
      Add pci_ignore_hotplug() for use by drivers that know their device cannot
      be removed.  Using pci_ignore_hotplug() tells the PCI core that hot-plug
      events for the device should be ignored.
      
      The radeon and nouveau drivers use this to switch between a low-power,
      integrated GPU and a higher-power, higher-performance discrete GPU.  They
      power off the unused GPU, but they want to remain bound to it.
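      
      A sketch of the intended driver-side pattern (pdev being the GPU's
      struct pci_dev):
      
      pci_ignore_hotplug(pdev);               /* device can't really go away */
      pci_set_power_state(pdev, PCI_D3cold);  /* power off, stay bound */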
      
      This is a reimplementation of f244d8b6 ("ACPIPHP / radeon / nouveau:
      Fix VGA switcheroo problem related to hotplug") but extends it to work with
      both acpiphp and pciehp.
      
      This fixes a problem where systems with dual GPUs using the radeon
      driver become unusable, freezing every few seconds (see the bugzillas
      below).  Problems show up while suspending the device, as in bug 79701:
      
          [drm] radeon: finishing device.
          radeon 0000:01:00.0: Userspace still has active objects !
          radeon 0000:01:00.0: ffff8800cb4ec288 ffff8800cb4ec000 16384 4294967297 force free
          ...
          WARNING: CPU: 0 PID: 67 at /home/apw/COD/linux/drivers/gpu/drm/radeon/radeon_gart.c:234 radeon_gart_unbind+0xd2/0xe0 [radeon]()
          trying to unbind memory from uninitialized GART !
      
      or while resuming it, as in bug 77261:
      
          radeon 0000:01:00.0: ring 0 stalled for more than 10158msec
          radeon 0000:01:00.0: GPU lockup ...
          radeon 0000:01:00.0: GPU pci config reset
          pciehp 0000:00:01.0:pcie04: Card not present on Slot(1-1)
          radeon 0000:01:00.0: GPU reset succeeded, trying to resume
          *ERROR* radeon: dpm resume failed
          radeon 0000:01:00.0: Wait for MC idle timedout !
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=77261
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=79701
      Reported-by: Shawn Starr <shawn.starr@rogers.com>
      Reported-by: Jose P. <lbdkmjdf@sharklasers.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Acked-by: Rajat Jain <rajatxjain@gmail.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Dave Airlie <airlied@redhat.com>
      CC: stable@vger.kernel.org	# v3.15+
  12. 10 Sep 2014 (4 commits)
    • net: bpf: be friendly to kmemcheck · 286aad3c
      Daniel Borkmann committed
      Reported by Mikulas Patocka, kmemcheck currently barks out a
      false positive since we don't have a special kmemcheck annotation
      for bitfields used in the bpf_prog structure.
      
      We currently have jited:1, len:31, and thus when accessing len
      with CONFIG_KMEMCHECK enabled, kmemcheck throws a warning that
      we're reading uninitialized memory.
      
      As we don't need the whole bit universe for the pages member, we
      can just shrink it to a u16 and use a bool flag for jited instead
      of a bitfield.
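      
      The struct change, roughly (a sketch; the field comments are assumed
      from kernel sources of the time):
      
      /* before */
      u32   pages;        /* Number of allocated pages */
      u32   jited:1,      /* Is our filter JIT'ed? */
            len:31;       /* Number of filter blocks */
      
      /* after: no bitfield left for kmemcheck to trip over */
      u16   pages;        /* Number of allocated pages */
      bool  jited;        /* Is our filter JIT'ed? */
      u32   len;          /* Number of filter blocks */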
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bpf: consolidate JIT binary allocator · 738cbe72
      Daniel Borkmann committed
      Commit 314beb9b ("x86: bpf_jit_comp: secure bpf jit against
      spraying attacks"), later replicated in aa2d2c73 ("s390/bpf,jit:
      address randomize and write protect jit code") for the s390
      architecture, added write protection for BPF JIT images and a random
      start address for the JIT code, so that it's not on a page boundary
      anymore.
      
      Since both use a very similar allocator for the BPF binary header,
      we can consolidate this code into the BPF core, as it's mostly JIT
      independent anyway.
      
      This will also allow for future archs that support DEBUG_SET_MODULE_RONX
      to just reuse instead of reimplementing it.
      
      JIT tested on x86_64 and s390x with BPF test suite.
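      
      For reference, the consolidated entry points take roughly this shape (a
      sketch; kernel/bpf/core.c in this patch has the authoritative
      prototypes):
      
      struct bpf_binary_header *
      bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
                           unsigned int alignment,
                           bpf_jit_fill_hole_t bpf_fill_ill_insns);
      void bpf_jit_binary_free(struct bpf_binary_header *hdr);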
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: split filter.h and expose eBPF to user space · daedfb22
      Alexei Starovoitov committed
      Allow user space to generate eBPF programs.
      
      uapi/linux/bpf.h: eBPF instruction set definition
      
      linux/filter.h: the rest
      
      This patch only moves macro definitions, but practically it freezes the
      existing eBPF instruction set, though new instructions can still be added in
      the future.
      
      These eBPF definitions cannot go into uapi/linux/filter.h, since the names
      may conflict with existing applications.
      
      Full eBPF ISA description is in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: add "load 64-bit immediate" eBPF instruction · 02ab695b
      Alexei Starovoitov committed
      Add a BPF_LD_IMM64 instruction to load a 64-bit immediate value into a register.
      All previous instructions were 8 bytes; this is the first 16-byte instruction.
      Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
      insn[0].code = BPF_LD | BPF_DW | BPF_IMM
      insn[0].dst_reg = destination register
      insn[0].imm = lower 32-bit
      insn[1].code = 0
      insn[1].imm = upper 32-bit
      All unused fields must be zero.
      
      Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
      which loads 32-bit immediate value into a register.
      
      x64 JITs it as a single 'movabsq %rax, imm64';
      arm64 may JIT it as a sequence of four 'movk x0, #imm16, lsl #shift' insns.
      
      Note that old eBPF programs are binary compatible with new interpreter.
      
      It helps eBPF programs load a 64-bit constant into a register with one
      instruction instead of using two registers and four instructions:
      BPF_MOV32_IMM(R1, imm32)
      BPF_ALU64_IMM(BPF_LSH, R1, 32)
      BPF_MOV32_IMM(R2, imm32)
      BPF_ALU64_REG(BPF_OR, R1, R2)
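      
      With the new encoding the same load collapses to a single macro
      invocation (expanding to the two-slot form above):
      
      BPF_LD_IMM64(R1, imm64)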
      
      User space generated programs will use this instruction to load constants only.
      
      To tell kernel that user space needs a pointer the _pseudo_ variant of
      this instruction may be added later, which will use extra bits of encoding
      to indicate what type of pointer user space is asking kernel to provide.
      For example 'off' or 'src_reg' fields can be used for such purpose.
      src_reg = 1 could mean that user space is asking kernel to validate and
      load in-kernel map pointer.
      src_reg = 2 could mean that user space needs readonly data section pointer
      src_reg = 3 could mean that user space needs a pointer to per-cpu local data
      All such future pseudo instructions will not be carrying the actual pointer
      as part of the instruction, but rather will be treated as a request to kernel
      to provide one. The kernel will verify the request_for_a_pointer, then
      will drop _pseudo_ marking and will store actual internal pointer inside
      the instruction, so the end result is the interpreter and JITs never
      see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
      loads 64-bit immediate into a register. User space never operates on direct
      pointers and verifier can easily recognize request_for_pointer vs other
      instructions.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 06 Sep 2014 (4 commits)