1. 03 October 2014 (3 commits)
    • perf: fix perf bug in fork() · 6c72e350
      Committed by Peter Zijlstra
      Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
      calling perf_event_free_task() when failing sched_fork() we will not yet
      have done the memset() on ->perf_event_ctxp[] and will therefore try and
      'free' the inherited contexts, which are still in use by the parent
      process.  This is bad.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c72e350
    • ring-buffer: Fix infinite spin in reading buffer · 24607f11
      Committed by Steven Rostedt (Red Hat)
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues on its way
      as long as it is not interrupted by a consuming reader. If a consuming
      read occurs, the iterator starts over from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      If the iterator then calls rb_iter_reset(), it goes into an
      infinite loop: it sees that the cache doesn't match, does the reset,
      and retries, only to find that the cache still doesn't match. This
      should never happen, as the reset is supposed to set the cache to the
      current value, and locks keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2 "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      24607f11
    • aarch64: filter $x from kallsyms · 6c34f1f5
      Committed by Kyle McMartin
      Similar to ARM, AArch64 is generating $x and $d syms... which isn't
      terribly helpful when looking at %pF output and the like. Filter those
      out in kallsyms, modpost and when looking at module symbols.
      
      Since none of these check EM_ARM anyway, it seems simplest to just add
      it to the existing strchr() call, rather than trying to make things
      overly complicated.
      
      initcall_debug improves:
      dmesg_before.txt: initcall $x+0x0/0x154 [sg] returned 0 after 26331 usecs
      dmesg_after.txt: initcall init_sg+0x0/0x154 [sg] returned 0 after 15461 usecs
      Signed-off-by: Kyle McMartin <kyle@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      6c34f1f5
  2. 02 October 2014 (1 commit)
    • bpf: add search pruning optimization to verifier · f1bca824
      Committed by Alexei Starovoitov
      consider C program represented in eBPF:
      int filter(int arg)
      {
          int a, b, c, *ptr;
      
          if (arg == 1)
              ptr = &a;
          else if (arg == 2)
              ptr = &b;
          else
              ptr = &c;
      
          *ptr = 0;
          return 0;
      }
      eBPF verifier has to follow all possible paths through the program
      to recognize that '*ptr = 0' instruction would be safe to execute
      in all situations.
      It does this by picking a path towards the end and observing changes
      to registers and stack at every insn until it reaches bpf_exit.
      Then it comes back to one of the previous branches and goes towards
      the end again with potentially different values in registers.
      When a program has a lot of branches, the number of possible combinations
      of branches is huge, so the verifier has a hard limit of walking no more
      than 32k instructions. This limit can be reached, causing complex (but valid)
      programs to be rejected. Therefore it's important to recognize equivalent
      verifier states to prune this depth-first search.
      
      Basic idea can be illustrated by the program (where .. are some eBPF insns):
          1: ..
          2: if (rX == rY) goto 4
          3: ..
          4: ..
          5: ..
          6: bpf_exit
      In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
      Since insn#2 is a branch the verifier will remember its state in verifier stack
      to come back to it later.
      Since insn#4 is marked as 'branch target', the verifier will remember its state
      in explored_states[4] linked list.
      Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
      will continue.
      Without search pruning optimization verifier would have to walk 4, 5, 6 again,
      effectively simulating execution of insns 1, 2, 4, 5, 6
      With search pruning it will check whether state at #4 after jumping from #2
      is equivalent to one recorded in explored_states[4] during first pass.
      If there is an equivalent state, verifier can prune the search at #4 and declare
      this path to be safe as well.
      In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
      and 1, 2, 4 insns produces equivalent registers and stack.
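      A minimal sketch of the pruning check (helper and field names are
      illustrative; the in-kernel code differs in detail):
      
        /* called when the verifier reaches a marked branch target */
        static bool state_already_explored(struct verifier_env *env, int insn_idx)
        {
                struct verifier_state_list *sl;
        
                for (sl = env->explored_states[insn_idx]; sl; sl = sl->next)
                        if (states_equal(&sl->state, &env->cur_state))
                                return true;  /* equivalent state: prune this path */
                return false;
        }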
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1bca824
  3. 28 September 2014 (1 commit)
  4. 27 September 2014 (11 commits)
    • bpf: mini eBPF library, test stubs and verifier testsuite · 3c731eba
      Committed by Alexei Starovoitov
      1.
      the library includes a trivial set of BPF syscall wrappers:
      int bpf_create_map(int key_size, int value_size, int max_entries);
      int bpf_update_elem(int fd, void *key, void *value);
      int bpf_lookup_elem(int fd, void *key, void *value);
      int bpf_delete_elem(int fd, void *key);
      int bpf_get_next_key(int fd, void *key, void *next_key);
      int bpf_prog_load(enum bpf_prog_type prog_type,
      		  const struct sock_filter_int *insns, int insn_len,
      		  const char *license);
      bpf_prog_load() stores verifier log into global bpf_log_buf[] array
      
      and BPF_*() macros to build instructions
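      A hedged usage sketch of the wrappers from item 1 (assuming the
      semantics listed above; error handling abbreviated):
      
        int key = 1, value = 42, out = 0;
      
        int map_fd = bpf_create_map(sizeof(key), sizeof(value), 16);
        if (map_fd < 0)
                perror("bpf_create_map");
        if (bpf_update_elem(map_fd, &key, &value) == 0 &&
            bpf_lookup_elem(map_fd, &key, &out) == 0)
                printf("stored %d, read back %d\n", value, out);
        close(map_fd);   /* maps are released with close() */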
      
      2.
      test stubs configure the eBPF infra with 'unspec' map and program types.
      These are fake types used only by the user-space testsuite.
      
      3.
      the verifier testsuite runs valid and invalid programs and expects
      predefined error log messages from the kernel.
      40 tests so far.
      
      $ sudo ./test_verifier
       #0 add+sub+mul OK
       #1 unreachable OK
       #2 unreachable2 OK
       #3 out of range jump OK
       #4 out of range jump2 OK
       #5 test1 ld_imm64 OK
       ...
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c731eba
    • bpf: verifier (add verifier core) · 17a52670
      Committed by Alexei Starovoitov
      This patch adds the verifier core, which simulates execution of every insn and
      records the state of registers and program stack. Every branch instruction seen
      during simulation is pushed onto the state stack. When the verifier reaches BPF_EXIT,
      it pops the state from the stack and continues until it reaches BPF_EXIT again.
      For program:
      1: bpf_mov r1, xxx
      2: if (r1 == 0) goto 5
      3: bpf_mov r0, 1
      4: goto 6
      5: bpf_mov r0, 2
      6: bpf_exit
      The verifier will walk insns: 1, 2, 3, 4, 6
      then it will pop the state recorded at insn#2 and will continue: 5, 6
      
      This way it walks all possible paths through the program and checks all
      possible values of registers. While doing so, it checks for:
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      - instruction encoding is not using reserved fields
      
      A kernel subsystem configures the verifier with two callbacks:
      
      - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
        that tells the verifier which fields of 'ctx'
        are accessible (remember, 'ctx' is the first argument to the eBPF program)
      
      - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
        returns the argument constraints of kernel helper functions that an eBPF
        program may call, so that the verifier can check that R1-R5 types match the prototype
      
      More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c
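      A hedged sketch of what a subsystem's is_valid_access() callback could
      look like (the policy shown, read-only aligned 4-byte loads from the
      first 8 bytes of ctx, is purely hypothetical):
      
        static bool my_is_valid_access(int off, int size,
                                       enum bpf_access_type type)
        {
                if (type != BPF_READ)
                        return false;
                return off >= 0 && off % 4 == 0 && size == 4 &&
                       off + size <= 8;
        }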
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17a52670
    • bpf: verifier (add branch/goto checks) · 475fb78f
      Committed by Alexei Starovoitov
      check that the control flow graph of an eBPF program is a directed acyclic graph
      
      check_cfg() does:
      - detect loops
      - detect unreachable instructions
      - check that program terminates with BPF_EXIT insn
      - check that all branches are within program boundary
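      The loop detection is classic depth-first search with node coloring;
      a generic sketch of the idea (not the kernel's actual code):
      
        enum { WHITE, GREY, BLACK };  /* unvisited, on DFS stack, done */
      
        static bool has_back_edge(int v, int color[], int succ[][2], int nsucc[])
        {
                color[v] = GREY;
                for (int i = 0; i < nsucc[v]; i++) {
                        int w = succ[v][i];
                        if (color[w] == GREY)
                                return true;  /* back-edge => loop */
                        if (color[w] == WHITE &&
                            has_back_edge(w, color, succ, nsucc))
                                return true;
                }
                color[v] = BLACK;
                return false;
        }
      
      Every eBPF insn has at most two successors (fall-through and jump
      target), which is why two successor slots per node suffice.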
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      475fb78f
    • bpf: handle pseudo BPF_LD_IMM64 insn · 0246e64d
      Committed by Alexei Starovoitov
      eBPF programs passed from userspace use pseudo BPF_LD_IMM64 instructions
      to refer to process-local map_fds. Scan the program for such instructions and,
      if the FDs are valid, convert them to 'struct bpf_map' pointers which will be used
      by the verifier to check access to maps in bpf_map_lookup/update() calls.
      If the program passes the verifier, convert the pseudo BPF_LD_IMM64 into a generic
      one by dropping the BPF_PSEUDO_MAP_FD flag.
      
      Note that eBPF interpreter is generic and knows nothing about pseudo insns.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0246e64d
    • bpf: verifier (add ability to receive verification log) · cbd35700
      Committed by Alexei Starovoitov
      add optional attributes for BPF_PROG_LOAD syscall:
      union bpf_attr {
          struct {
      	...
      	__u32         log_level; /* verbosity level of eBPF verifier */
      	__u32         log_size;  /* size of user buffer */
      	__aligned_u64 log_buf;   /* user supplied 'char *buffer' */
          };
      };
      
      when log_level > 0 the verifier will return its verification log in the
      user-supplied buffer 'log_buf', which the program author can use to analyze
      why the verifier rejected a given program.
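      A hedged sketch of wiring up the log buffer (ptr_to_u64() and the attr
      layout as shown in the other examples in this series):
      
        char log_buf[65536];
      
        union bpf_attr attr = {
                .prog_type = prog_type,
                .insns     = ptr_to_u64(insns),
                .insn_cnt  = insn_cnt,
                .license   = ptr_to_u64("GPL"),
                .log_level = 1,
                .log_size  = sizeof(log_buf),
                .log_buf   = ptr_to_u64(log_buf),
        };
      
        int prog_fd = bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
        if (prog_fd < 0)
                fprintf(stderr, "rejected:\n%s\n", log_buf);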
      
      'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
      provides several examples of these messages, like the program:
      
        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
        BPF_LD_MAP_FD(BPF_REG_1, 0),
        BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
        BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
        BPF_EXIT_INSN(),
      
      will be rejected with the following multi-line message in log_buf:
      
        0: (7a) *(u64 *)(r10 -8) = 0
        1: (bf) r2 = r10
        2: (07) r2 += -8
        3: (b7) r1 = 0
        4: (85) call 1
        5: (15) if r0 == 0x0 goto pc+1
         R0=map_ptr R10=fp
        6: (7a) *(u64 *)(r0 +4) = 0
        misaligned access off 4 size 8
      
      The format of the output can change at any time as verifier evolves.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbd35700
    • bpf: verifier (add docs) · 51580e79
      Committed by Alexei Starovoitov
      this patch adds the eBPF verifier documentation and an empty bpf_check()
      
      The end goal for the verifier is to statically check safety of the program.
      
      Verifier will catch:
      - loops
      - out of range jumps
      - unreachable instructions
      - invalid instructions
      - uninitialized register access
      - uninitialized stack access
      - misaligned stack access
      - out of range stack access
      - invalid calling convention
      
      More details in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51580e79
    • bpf: handle pseudo BPF_CALL insn · 0a542a86
      Committed by Alexei Starovoitov
      in native eBPF programs userspace uses pseudo BPF_CALL instructions
      which encode one of 'enum bpf_func_id' inside the insn->imm field.
      The verifier checks that the program passes correct function arguments for
      the given func_id. If all checks pass, the kernel needs to fix up the
      BPF_CALL->imm fields by replacing func_id with an in-kernel function pointer.
      eBPF interpreter just calls the function.
      
      In-kernel eBPF users continue to use generic BPF_CALL.
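      Conceptually, the fixup is a single pass over the program after
      verification (a sketch only; resolve_helper() is a hypothetical
      stand-in for the kernel's actual patching):
      
        for (i = 0; i < insn_cnt; i++) {
                struct bpf_insn *insn = &insns[i];
      
                if (insn->code == (BPF_JMP | BPF_CALL))
                        /* replace the func_id the verifier checked with
                         * something the interpreter can call directly */
                        insn->imm = resolve_helper(insn->imm);
        }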
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a542a86
    • bpf: expand BPF syscall with program load/unload · 09756af4
      Committed by Alexei Starovoitov
      eBPF programs are similar to kernel modules. They are loaded by a user
      process and automatically unloaded when the process exits. Each eBPF program
      is a safe, run-to-completion set of instructions. The eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string
      that must be GPL-compatible in order to call helper functions marked gpl_only.
      
      Upon successful load the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User space tests and examples follow in the later patches
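      In the meantime, a hedged sketch of loading a trivial program with
      this wrapper (BPF_PROG_TYPE_UNSPEC is the test-stub type mentioned
      elsewhere in this series):
      
        struct bpf_insn prog[] = {
                BPF_MOV64_IMM(BPF_REG_0, 0),   /* r0 = 0 */
                BPF_EXIT_INSN(),               /* return r0 */
        };
      
        int prog_fd = bpf_prog_load(BPF_PROG_TYPE_UNSPEC, prog,
                                    sizeof(prog) / sizeof(prog[0]), "GPL");
        if (prog_fd < 0)
                perror("bpf_prog_load");
        ...
        close(prog_fd);   /* unloads the program */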
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09756af4
    • bpf: add lookup/update/delete/iterate methods to BPF maps · db20fd2b
      Committed by Alexei Starovoitov
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      The maps are accessed from user space via the BPF syscall, which has the
      following commands (a usage sketch follows the list):
      
      - create a map with given type and attributes
        fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
        returns fd or negative error
      
      - lookup key in a given map referenced by fd
        err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero and stores found elem into value or negative error
      
      - create or update key/value pair in a given map
        err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->value
        returns zero or negative error
      
      - find and delete element by key in a given map
        err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key
      
      - iterate map elements (based on input key return next_key)
        err = bpf(BPF_MAP_GET_NEXT_KEY, union bpf_attr *attr, u32 size)
        using attr->map_fd, attr->key, attr->next_key
      
      - close(fd) deletes the map
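      A hedged end-to-end sketch of these commands (ptr_to_u64() as in the
      other examples in this series; error handling omitted):
      
        union bpf_attr attr = {
                .map_type    = BPF_MAP_TYPE_UNSPEC,   /* test-stub type */
                .key_size    = sizeof(int),
                .value_size  = sizeof(long),
                .max_entries = 64,
        };
        int fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      
        int key = 1;
        long value = 1234, out = 0;
      
        attr = (union bpf_attr) { .map_fd = fd,
                                  .key    = ptr_to_u64(&key),
                                  .value  = ptr_to_u64(&value) };
        bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
      
        attr.value = ptr_to_u64(&out);
        bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));   /* out == 1234 */
      
        bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
        close(fd);                                       /* deletes the map */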
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      db20fd2b
    • bpf: enable bpf syscall on x64 and i386 · 749730ce
      Committed by Alexei Starovoitov
      Done as a separate commit to ease conflict resolution.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      749730ce
    • bpf: introduce BPF syscall and maps · 99c55f7d
      Committed by Alexei Starovoitov
      The BPF syscall is a multiplexor for a range of different operations on eBPF.
      This patch introduces the syscall with a single command to create a map.
      The next patch adds commands to access maps.
      
      'maps' is a generic storage of different types for sharing data between kernel
      and userspace.
      
      Userspace example:
      /* this syscall wrapper creates a map with given type and attributes
       * and returns map_fd on success.
       * use close(map_fd) to delete the map
       */
      int bpf_create_map(enum bpf_map_type map_type, int key_size,
                         int value_size, int max_entries)
      {
          union bpf_attr attr = {
              .map_type = map_type,
              .key_size = key_size,
              .value_size = value_size,
              .max_entries = max_entries
          };
      
          return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
      }
      
      'union bpf_attr' is backwards compatible with future extensions.
      
      More details in Documentation/networking/filter.txt and in the man page
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99c55f7d
  5. 26 September 2014 (1 commit)
  6. 25 September 2014 (4 commits)
    • irq: Export handle_fasteoi_irq · 4fdea267
      Committed by Vincent Stehlé
      Export handle_fasteoi_irq so that it can be used by modules such as the
      Zynq gpio driver, which needs it since commit 6dd85950 ("gpio: zynq:
      Fix IRQ handlers").
      
      This fixes the following link issue:
      
        ERROR: "handle_fasteoi_irq" [drivers/gpio/gpio-zynq.ko] undefined!
      Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Vincent Stehle <vincent.stehle@laposte.net>
      Cc: Lars-Peter Clausen <lars@metafoo.de>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Link: http://lkml.kernel.org/r/1408663880-29179-1-git-send-email-vincent.stehle@laposte.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      4fdea267
    • SCHED: add some "wait..on_bit...timeout()" interfaces. · cbbce822
      Committed by NeilBrown
      In commit c1221321
         sched: Allow wait_on_bit_action() functions to support a timeout
      
      I suggested that a "wait_on_bit_timeout()" interface would not meet my
      need.  This isn't true - I was just over-engineering.
      
      Including a 'private' field in wait_bit_key instead of a focused
      "timeout" field was just premature generalization.  If some other
      use is ever found, it can be generalized or added later.
      
      So this patch renames "private" to "timeout", with the meaning "stop
      waiting when jiffies reaches or passes timeout",
      and adds two of the many possible wait..bit..timeout() interfaces:
      
      wait_on_page_bit_killable_timeout(), which is the one I want to use,
      and out_of_line_wait_on_bit_timeout() which is a reasonably general
      example.  Others can be added as needed.
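      A hedged sketch of how the page-bit variant could be used (the timeout
      is in absolute jiffies per the renamed field; the return convention is
      assumed here, not taken from the patch):
      
        /* wait for PG_private to clear, but give up after ~1 second
         * or on a fatal signal; nonzero means we did not see it clear */
        if (wait_on_page_bit_killable_timeout(page, PG_private,
                                              jiffies + HZ) != 0)
                return 0;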
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      cbbce822
    • cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags · 2ad654bc
      Committed by Zefan Li
      When we change cpuset.memory_spread_{page,slab}, cpuset will flip the
      PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
      This should be done using atomic bitops, but currently it isn't,
      which is broken.
      
      Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
      when one thread tried to clear PF_USED_MATH while at the same time another
      thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
      the same task.
      
      Here's the full report:
      https://lkml.org/lkml/2014/9/19/230
      
      To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
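      The shape of the fix, sketched (the actual patch generates such
      accessors from a macro; the names here are abridged):
      
        /* task_struct gains a word reserved for atomic bitops */
        unsigned long atomic_flags;
      
        static inline bool task_spread_page(struct task_struct *p)
        {
                return test_bit(PFA_SPREAD_PAGE, &p->atomic_flags);
        }
      
        static inline void task_set_spread_page(struct task_struct *p)
        {
                set_bit(PFA_SPREAD_PAGE, &p->atomic_flags);  /* atomic RMW */
        }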
      
      v4:
      - updated mm/slab.c. (Fengguang Wu)
      - updated Documentation.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Kees Cook <keescook@chromium.org>
      Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time")
      Cc: <stable@vger.kernel.org> # 2.6.31+
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2ad654bc
    • Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()" · 5c4dd348
      Committed by Rafael J. Wysocki
      Revert commit 6efde38f (PM / Hibernate: Iterate over set bits
      instead of PFNs in swsusp_free()) that introduced a NULL pointer
      dereference during system resume from hibernation:
      
      BUG: unable to handle kernel NULL pointer dereference at (null)
      IP: [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
      PGD b39c2067 PUD b39c1067 PMD 0
      Oops: 0000 [#1] SMP
      Modules linked in: <irrelevant list of modules>
      CPU: 1 PID: 4898 Comm: s2disk Tainted: G         C     3.17-rc5-amd64 #1 Debian 3.17~rc5-1~exp1
      Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011
      task: ffff88023155ea40 ti: ffff8800b3b14000 task.ti: ffff8800b3b14000
      RIP: 0010:[<ffffffff810a8cc1>]  [<ffffffff810a8cc1>]
      swsusp_free+0x21/0x190
      RSP: 0018:ffff8800b3b17ea8  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8800b39bab00 RCX: 0000000000000001
      RDX: ffff8800b39bab10 RSI: ffff8800b39bab00 RDI: 0000000000000000
      RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8800b39bab10 R11: 0000000000000246 R12: ffffea0000000000
      R13: ffff880232f485a0 R14: ffff88023ac27cd8 R15: ffff880232927590
      FS:  00007f406d83b700(0000) GS:ffff88023bc80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000000 CR3: 00000000b3a62000 CR4: 00000000000007e0
      Stack:
       ffff8800b39bab00 0000000000000010 ffff880232927590 ffffffff810acb4a
       ffff8800b39bab00 ffffffff811a955a ffff8800b39bab10 0000000000000000
       ffff88023155f098 ffffffff81a6b8c0 ffff88023155ea40 0000000000000007
      Call Trace:
       [<ffffffff810acb4a>] ? snapshot_release+0x2a/0xb0
       [<ffffffff811a955a>] ? __fput+0xca/0x1d0
       [<ffffffff81080627>] ? task_work_run+0x97/0xd0
       [<ffffffff81012d89>] ? do_notify_resume+0x69/0xa0
       [<ffffffff8151452a>] ? int_signal+0x12/0x17
      Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 48 8b 05 ba 62 9c 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d a1 62 9c 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40
      RIP  [<ffffffff810a8cc1>] swsusp_free+0x21/0x190
       RSP <ffff8800b3b17ea8>
      CR2: 0000000000000000
      ---[ end trace f02be86a1ec0cccb ]---
      
      due to forbidden_pages_map being NULL in swsusp_free().
      
      Fixes: 6efde38f "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"
      Reported-by: Bjørn Mork <bjorn@mork.no>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5c4dd348
  7. 19 September 2014 (1 commit)
  8. 14 September 2014 (4 commits)
    • nohz: nohz full depends on irq work self IPI support · 9b01f5bf
      Committed by Frederic Weisbecker
      The nohz full functionality depends on IRQ work to trigger its own
      interrupts. As it's used to restart the tick, we can't rely on the tick
      fallback for irq work callbacks, i.e. we can't use the tick to restart
      the tick itself.
      
      Let's reject the full dynticks initialization if that arch support isn't
      available.
      
      As a side effect, this makes sure that the nohz kick is never called from
      the tick, which would otherwise result in illegal hrtimer self-cancellation
      and a lockup.
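      Roughly, the rejection amounts to a capability check at init time
      (hedged sketch):
      
        void __init tick_nohz_init(void)
        {
                if (!arch_irq_work_has_interrupt()) {
                        pr_warn("NO_HZ: full dynticks needs irq work self-IPIs, disabling\n");
                        return;
                }
                /* ... normal nohz full initialization ... */
        }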
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      9b01f5bf
    • nohz: Consolidate nohz full init code · 4327b15f
      Committed by Frederic Weisbecker
      Support for CONFIG_NO_HZ_FULL_ALL=y and for the nohz_full= kernel
      parameter each had their own way of doing the same thing: allocate
      the full dynticks cpumasks, fill them, and initialize some state variables.
      
      Let's consolidate all of that in one place.
      
      While at it, convert some regular printk messages to warnings when
      fundamental allocations fail.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      4327b15f
    • irq_work: Force raised irq work to run on irq work interrupt · 76a33061
      Committed by Frederic Weisbecker
      The nohz full kick, which restarts the tick when any resource depends
      on it, can't be executed just anywhere, given the operations it performs
      on timers. If it is called from the scheduler or timers code, chances are
      that we run into a deadlock.
      
      This is why we run the nohz full kick from an irq work: that way we make
      sure that the kick runs in a pristine context.
      
      That holds when irq work runs from its own dedicated self-IPI, but things
      are different for the big bunch of archs that don't support the
      self-triggered way. In order to support them, irq works are also handled
      by the timer interrupt as a fallback.
      
      Now when irq works run from the timer interrupt, the context isn't blank.
      More precisely, they can run in the context of the hrtimer that drives the
      tick. But the nohz kick cancels and restarts this hrtimer, and cancelling
      an hrtimer from within itself isn't allowed. This is why we ran into an
      endless loop:
      
      	Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
      	CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
      	Workqueue: btrfs-endio-write normal_work_helper [btrfs]
      	 ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
      	 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
      	 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
      	Call Trace:
      	 <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
      	 [<ffffffff8a7ef928>] panic+0xd4/0x207
      	 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
      	 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
      	 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
      	 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
      	 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
      	 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
      	 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
      	 [<ffffffff8a008268>] do_nmi+0xb8/0x100
      	 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
      	 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
      	 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
      	 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
      	 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
      	 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
      	 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
      	 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
      	 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
      	 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
      	 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
      	 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
      	 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
      	 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
      	 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
      	 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
      	 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
      	 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
      	 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80
      
      To fix this we force non-lazy irq works to run from irq work self-IPIs
      when available. Whether the arch can trigger irq work self-IPIs is
      reported by arch_irq_work_has_interrupt().
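      Conceptually, the tick fallback now skips the raised (non-lazy) list
      whenever the arch can self-IPI (a sketch close to, but not exactly,
      the kernel code):
      
        void irq_work_tick(void)
        {
                struct llist_head *raised = this_cpu_ptr(&raised_list);
      
                /* raised works are handled by the dedicated self-IPI
                 * when the arch provides one */
                if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
                        irq_work_run_list(raised);
                irq_work_run_list(this_cpu_ptr(&lazy_list));
        }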
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      76a33061
    • nohz: Move nohz full init call to tick init · a80e49e2
      Committed by Frederic Weisbecker
      This slims down main.c a bit and, more importantly, initializes nohz full
      after init_IRQ(). Further patches will need this ordering because nohz
      full needs irq work to raise its own IRQ. On ARM64, information about
      support for this ability is obtained in init_IRQ(), which initializes
      the pointer to __smp_call_function.
      
      Since tick_init() is called right after init_IRQ(), this is a good place
      to call tick_nohz_init() and prepare for that dependency.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      a80e49e2
  9. 13 September 2014 (5 commits)
    • alarmtimer: Lock k_itimer during timer callback · 474e941b
      Committed by Richard Larocque
      Locks the k_itimer's it_lock member when handling the alarm timer's
      expiry callback.
      
      The regular posix timers defined in posix-timers.c have this lock held
      during timeout processing because their callbacks are routed through
      posix_timer_fn().  The alarm timers follow a different path, so they
      ought to grab the lock somewhere else.
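      A hedged sketch of the resulting callback shape (field names per the
      posix-timers code; the expiry handling itself is elided):
      
        static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
                                                          ktime_t now)
        {
                struct k_itimer *ptr = container_of(alarm, struct k_itimer,
                                                    it.alarm.alarmtimer);
                unsigned long flags;
      
                spin_lock_irqsave(&ptr->it_lock, flags);
                /* ... expiry handling now runs under it_lock ... */
                spin_unlock_irqrestore(&ptr->it_lock, flags);
                return ALARMTIMER_NORESTART;
        }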
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      474e941b
    • alarmtimer: Do not signal SIGEV_NONE timers · 265b81d2
      Committed by Richard Larocque
      Avoids sending a signal to alarm timers created with sigev_notify set to
      SIGEV_NONE by checking for that special case in the timeout callback.
      
      The regular posix timers avoid sending signals to SIGEV_NONE timers by
      not scheduling any callbacks for them in the first place.  Although it
      would be possible to do something similar for alarm timers, it's simpler
      to handle this as a special case in the timeout.
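      The special case amounts to an early-out in the expiry callback
      (hedged sketch):
      
        /* only deliver a signal if the timer was not SIGEV_NONE */
        if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
                if (posix_timer_event(ptr, 0) != 0)
                        ptr->it_overrun++;
        }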
      
      Prior to this patch, the alarm timer would ignore the sigev_notify value
      and try to deliver signals to the process anyway.  Even worse, the
      sanity check for the value of sigev_signo is skipped when SIGEV_NONE is
      specified, so the signal number could be bogus.  If sigev_signo was an
      uninitialized value (as it often would be if SIGEV_NONE is used), then
      it's hard to predict which signal will be sent.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      265b81d2
    • alarmtimer: Return relative times in timer_gettime · e86fea76
      Committed by Richard Larocque
      Returns the time remaining for an alarm timer, rather than the time at
      which it is scheduled to expire.  If the timer has already expired or is
      not currently scheduled, the it_value members are set to zero.
      
      This new behavior matches that of the other posix-timers and the POSIX
      specifications.
      
      This is a change in user-visible behavior, and may break existing
      applications.  Hopefully, few users rely on the old incorrect behavior.
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sharvil Nanavati <sharvil@google.com>
      Signed-off-by: Richard Larocque <rlarocque@google.com>
      [jstultz: minor style tweak]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      e86fea76
    • jiffies: Fix timeval conversion to jiffies · d78c9300
      Committed by Andrew Hunter
      timeval_to_jiffies tried to round a timeval up to an integral number
      of jiffies, but the logic for doing so was incorrect: intervals
      corresponding to exactly N jiffies would become N+1. This manifested
      itself particularly when repeatedly stopping and starting an itimer:
      
      setitimer(ITIMER_PROF, &val, NULL);
      setitimer(ITIMER_PROF, NULL, &val);
      
      would add a full tick to val, _even if it was exactly representable in
      terms of jiffies_ (say, the result of a previous rounding.)  Doing
      this repeatedly would cause unbounded growth in val.  So fix the math.
      
      Here's what was wrong with the conversion: we essentially computed
      (eliding seconds)
      
      jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)
      
      by using scaling arithmetic, which replaced NSEC_PER_USEC/TICK_NSEC with
      its best approximation x/(2^USEC_JIFFIE_SC), i.e. with denominator
      2^USEC_JIFFIE_SC, and computed:
      
      jiffies = (usec * x) >> USEC_JIFFIE_SC
      
      and rounded this calculation up in the intermediate form (since we
      can't necessarily exactly represent TICK_NSEC in usec.) But the
      scaling arithmetic is a (very slight) *over*approximation of the true
      value; that is, instead of dividing by (1 usec/ 1 jiffie), we
      effectively divided by (1 usec/1 jiffie)-epsilon (rounding
      down). This would normally be fine, but we want to round timeouts up,
      and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
      would be fine if our division was exact, but dividing this by the
      slightly smaller factor was equivalent to adding just _over_ 1 to the
      final result (instead of just _under_ 1, as desired.)
      
      In particular, with HZ=1000, we consistently computed that 10000 usec
      was 11 jiffies; the same was true for any exact multiple of
      TICK_NSEC.
      
      We could possibly still round in the intermediate form, adding
      something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
      convert usec->nsec, round in nanoseconds, and then convert using
      time*spec*_to_jiffies.  This adds one constant multiplication, and is
      not observably slower in microbenchmarks on recent x86 hardware.
      
      Tested: the following program:
      
      int main() {
        struct itimerval zero = {{0, 0}, {0, 0}};
        /* Initially set to 10 ms. */
        struct itimerval initial = zero;
        initial.it_interval.tv_usec = 10000;
        setitimer(ITIMER_PROF, &initial, NULL);
        /* Save and restore several times. */
        for (size_t i = 0; i < 10; ++i) {
          struct itimerval prev;
          setitimer(ITIMER_PROF, &zero, &prev);
          /* on old kernels, this goes up by TICK_USEC every iteration */
          printf("previous value: %ld %ld %ld %ld\n",
                 prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                 prev.it_value.tv_sec, prev.it_value.tv_usec);
          setitimer(ITIMER_PROF, &prev, NULL);
        }
          return 0;
      }
      
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Reported-by: Aaron Jacobs <jacobsa@google.com>
      Signed-off-by: Andrew Hunter <ahh@google.com>
      [jstultz: Tweaked to apply to 3.17-rc]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      d78c9300
    • futex: Unlock hb->lock in futex_wait_requeue_pi() error path · 13c42c2f
      Committed by Thomas Gleixner
      futex_wait_requeue_pi() calls futex_wait_setup(). If
      futex_wait_setup() succeeds it returns with hb->lock held and
      preemption disabled. Now the sanity check after this does:
      
              if (match_futex(&q.key, &key2)) {
      	   	ret = -EINVAL;
      		goto out_put_keys;
      	}
      
      which releases the keys but does not release hb->lock.
      
      So we happily return to user space with hb->lock held and therefore
      preemption disabled.
      
      Unlock hb->lock before taking the exit route.
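      The fix, roughly (queue_unlock() is the futex helper that drops
      hb->lock):
      
              if (match_futex(&q.key, &key2)) {
                      queue_unlock(hb);   /* drop hb->lock before bailing out */
                      ret = -EINVAL;
                      goto out_put_keys;
              }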
      Reported-by: Dave "Trinity" Jones <davej@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      13c42c2f
  10. 11 September 2014 (3 commits)
    • kcmp: fix standard comparison bug · acbbe6fb
      Committed by Rasmus Villemoes
      The C operator <= defines a perfectly fine total ordering on the set of
      values representable in a long.  However, unlike its namesake in the
      integers, it is not translation invariant, meaning that we do not have
      "b <= c" iff "a+b <= a+c" for all a,b,c.
      
      This means that it is always wrong to try to boil down the relationship
      between two longs to a question about the sign of their difference,
      because the resulting relation [a LEQ b iff a-b <= 0] is neither
      anti-symmetric nor transitive.  The former is due to -LONG_MIN==LONG_MIN
      (take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a !=
      b).  The latter can be seen either by observing that x LEQ x+1 for all x,
      implying x LEQ x+1 LEQ x+2 ...  LEQ x-1 LEQ x; or more directly with the
      simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c >
      0.
      
      Note that it makes absolutely no difference that a transmogrifying bijection
      has been applied before the comparison is done.  In fact, had the
      obfuscation not been done, one could probably not observe the bug
      (assuming all values being compared always lie in one half of the address
      space, the mathematical value of a-b is always representable in a long).
      As it stands, one can easily obtain three file descriptors exhibiting the
      non-transitivity of kcmp().
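      The standard fix is to compare the (obfuscated) values directly instead
      of subtracting them; sketched here against kcmp's return convention of
      0 == equal, 1 == less, 2 == greater:
      
        t1 = kptr_obfuscate((long)v1, type);
        t2 = kptr_obfuscate((long)v2, type);
      
        return (t1 < t2) | ((t1 > t2) << 1);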
      
      Side note 1: I can't see that ensuring the MSB of the multiplier is
      set serves any purpose other than obfuscating the obfuscating code.
      
      Side note 2:
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <assert.h>
      #include <sys/syscall.h>
      
      enum kcmp_type {
              KCMP_FILE,
              KCMP_VM,
              KCMP_FILES,
              KCMP_FS,
              KCMP_SIGHAND,
              KCMP_IO,
              KCMP_SYSVSEM,
              KCMP_TYPES,
      };
      pid_t pid;
      
      int kcmp(pid_t pid1, pid_t pid2, int type,
      	 unsigned long idx1, unsigned long idx2)
      {
      	return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
      }
      int cmp_fd(int fd1, int fd2)
      {
      	int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2);
      	if (c < 0) {
      		perror("kcmp");
      		exit(1);
      	}
      	assert(0 <= c && c < 3);
      	return c;
      }
      int cmp_fdp(const void *a, const void *b)
      {
      	static const int normalize[] = {0, -1, 1};
      	return normalize[cmp_fd(*(int*)a, *(int*)b)];
      }
      #define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */
      int main(int argc, char *argv[])
      {
      	int r, s, count = 0;
      	int REL[3] = {0,0,0};
      	int fd[MAX];
      	pid = getpid();
      	while (count < MAX) {
      		r = open("/dev/null", O_RDONLY);
      		if (r < 0)
      			break;
      		fd[count++] = r;
      	}
      	printf("opened %d file descriptors\n", count);
      	for (r = 0; r < count; ++r) {
      		for (s = r+1; s < count; ++s) {
      			REL[cmp_fd(fd[r], fd[s])]++;
      		}
      	}
      	printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
      	qsort(fd, count, sizeof(fd[0]), cmp_fdp);
      	memset(REL, 0, sizeof(REL));
      
      	for (r = 0; r < count; ++r) {
      		for (s = r+1; s < count; ++s) {
      			REL[cmp_fd(fd[r], fd[s])]++;
      		}
      	}
      	printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
      	return (REL[0] + REL[2] != 0);
      }
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acbbe6fb
    • kernel/printk/printk.c: fix faulty logic in the case of recursive printk · 000a7d66
      Committed by Patrick Palka
      We shouldn't set text_len in the code path that detects printk recursion
      because text_len corresponds to the length of the string inside textbuf.
      A few lines down from the line
      
          text_len = strlen(recursion_msg);
      
      is the line
      
          text_len += vscnprintf(text + text_len, ...);
      
      So if printk detects recursion, it sets text_len to 29 (the length of
      recursion_msg) and logs an error.  Then the message supplied by the
      caller of printk is stored inside textbuf but offset by 29 bytes.  This
      means that the output of the recursive call to printk will contain 29
      bytes of garbage in front of it.
      
      This defect is caused by commit 458df9fd ("printk: remove separate
      printk_sched buffers and use printk buf instead") which turned the line
      
          text_len = vscnprintf(text, ...);
      
      into
      
          text_len += vscnprintf(text + text_len, ...);
      
      To fix this, this patch avoids setting text_len when logging the printk
      recursion error.  This patch also marks unlikely() the branch leading up
      to this code.
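      In effect, the corrected recursion path looks like this (hedged
      sketch; log_store() arguments as in printk.c of that era):
      
        if (unlikely(recursion_bug)) {
                recursion_bug = 0;
                /* log the warning with its own length; leave text_len
                 * untouched so the caller's message starts at offset 0 */
                printed_len += log_store(0, 2, LOG_PREFIX | LOG_NEWLINE, 0,
                                         NULL, 0, recursion_msg,
                                         strlen(recursion_msg));
        }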
      
      Fixes: 458df9fd ("printk: remove separate printk_sched buffers and use printk buf instead")
      Signed-off-by: Patrick Palka <patrick@parcs.ath.cx>
      Reviewed-by: Petr Mladek <pmladek@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      000a7d66
    • net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selected · b954d834
      Committed by Daniel Borkmann
      Since BPF JIT depends on the availability of the module_alloc() and
      module_free() helpers (HAVE_BPF_JIT and MODULES), we should build
      that code only when BPF_JIT is enabled in the config, just
      like with other JIT code. Fixes builds for arm/marzen_defconfig
      and sh/rsk7269_defconfig.
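      The guard is the usual config conditional; a hedged sketch of the
      shape (declarations shown, not the exact hunk):
      
        #ifdef CONFIG_BPF_JIT
        struct bpf_binary_header *
        bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
                             unsigned int alignment,
                             bpf_jit_fill_hole_t bpf_fill_ill_insns);
        void bpf_jit_binary_free(struct bpf_binary_header *hdr);
        #endif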
      
      ====================
      kernel/built-in.o: In function `bpf_jit_binary_alloc':
      /home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc'
      kernel/built-in.o: In function `bpf_jit_binary_free':
      /home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free'
      make: *** [vmlinux] Error 1
      ====================
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Fixes: 738cbe72 ("net: bpf: consolidate JIT binary allocator")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b954d834
  11. 10 September 2014 (2 commits)
    • net: bpf: consolidate JIT binary allocator · 738cbe72
      Committed by Daniel Borkmann
      Commit 314beb9b ("x86: bpf_jit_comp: secure bpf jit against
      spraying attacks") added write protection for BPF JIT images along with
      a random start address for the JIT code, so that it's not on a page
      boundary anymore; commit aa2d2c73 ("s390/bpf,jit: address randomize
      and write protect jit code") later replicated this for the s390
      architecture.
      
      Since both use a very similar allocator for the BPF binary header,
      we can consolidate this code into the BPF core, as it's mostly JIT
      independent anyway.
      
      This will also allow for future archs that support DEBUG_SET_MODULE_RONX
      to just reuse instead of reimplementing it.
      
      JIT tested on x86_64 and s390x with BPF test suite.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      738cbe72
    • net: filter: add "load 64-bit immediate" eBPF instruction · 02ab695b
      Committed by Alexei Starovoitov
      add BPF_LD_IMM64 instruction to load a 64-bit immediate value into a register.
      All previous instructions were 8 bytes; this is the first 16-byte instruction.
      Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:
      insn[0].code = BPF_LD | BPF_DW | BPF_IMM
      insn[0].dst_reg = destination register
      insn[0].imm = lower 32-bit
      insn[1].code = 0
      insn[1].imm = upper 32-bit
      All unused fields must be zero.
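      A hedged helper in the style of the kernel's instruction macros,
      emitting the insn pair described above:
      
        #define BPF_LD_IMM64(DST, IMM)                                 \
                ((struct bpf_insn) {                                   \
                        .code    = BPF_LD | BPF_DW | BPF_IMM,          \
                        .dst_reg = DST,                                \
                        .imm     = (__u32) (IMM) }),                   \
                ((struct bpf_insn) {                                   \
                        .code    = 0, /* second half of LD_IMM64 */    \
                        .imm     = ((__u64) (IMM)) >> 32 })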
      
      Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
      which loads 32-bit immediate value into a register.
      
      x64 JITs it as single 'movabsq %rax, imm64'
      arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn
      
      Note that old eBPF programs are binary compatible with the new interpreter.
      
      It helps eBPF programs load a 64-bit constant into a register with one
      instruction instead of using two registers and 4 instructions:
      BPF_MOV32_IMM(R1, imm32)
      BPF_ALU64_IMM(BPF_LSH, R1, 32)
      BPF_MOV32_IMM(R2, imm32)
      BPF_ALU64_REG(BPF_OR, R1, R2)
      
      User space generated programs will use this instruction to load constants only.
      
      To tell the kernel that user space needs a pointer, a _pseudo_ variant of
      this instruction may be added later, which will use extra bits of the
      encoding to indicate what type of pointer user space is asking the kernel
      to provide.
      For example 'off' or 'src_reg' fields can be used for such purpose.
      src_reg = 1 could mean that user space is asking kernel to validate and
      load in-kernel map pointer.
      src_reg = 2 could mean that user space needs readonly data section pointer
      src_reg = 3 could mean that user space needs a pointer to per-cpu local data
      All such future pseudo instructions will not carry the actual pointer
      as part of the instruction; rather, they will be treated as a request to
      the kernel to provide one. The kernel will verify the request for a
      pointer, then drop the _pseudo_ marking and store the actual internal
      pointer inside the instruction, so the end result is that the interpreter
      and JITs never see pseudo BPF_LD_IMM64 insns and only operate on a generic
      BPF_LD_IMM64 that loads a 64-bit immediate into a register. User space
      never operates on direct pointers, and the verifier can easily recognize
      a request-for-pointer vs other instructions.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02ab695b
  12. 09 September 2014 (2 commits)
  13. 06 September 2014 (2 commits)