1. 27 June 2017, 2 commits
    • Encryption at rest support · 51778612
      Committed by Ewout Prangsma
      Summary:
      This PR adds support for encrypting data stored by RocksDB when written to disk.
      
      It adds an `EncryptedEnv` override of the `Env` class, with matching overrides for sequential and random access files.
      The encryption itself is done through a configurable `EncryptionProvider`. This class is asked to create a `BlockAccessCipherStream` for a file; that stream is where the actual encryption/decryption is done.
      Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE: `ROT13` is for demo purposes only!).
      
      The Counter operation mode uses an initial counter and a random initialization vector (IV).
      Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation hides this prefix from clients of the `Env` class (it is visible neither in the data nor in the reported file size).
      The largest part of the prefix block is also encrypted, and there is room left in it for implementation-specific settings/values/keys.
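      
      For illustration, here is a minimal sketch of wiring the encrypted `Env` into a DB, assuming the classes this PR adds in `rocksdb/env_encryption.h` (`NewEncryptedEnv`, `CTREncryptionProvider`, and the demo-only `ROT13BlockCipher`); exact constructor signatures may differ.
      
      ```
      #include <memory>
      
      #include "rocksdb/db.h"
      #include "rocksdb/env_encryption.h"
      
      int main() {
        // Demo-only block cipher (ROT13, 32-byte blocks) wrapped in a counter-mode provider.
        rocksdb::ROT13BlockCipher cipher(32);
        rocksdb::CTREncryptionProvider provider(cipher);
      
        // Wrap the default Env; all file I/O going through this Env is encrypted on disk.
        std::unique_ptr<rocksdb::Env> encrypted_env(
            rocksdb::NewEncryptedEnv(rocksdb::Env::Default(), &provider));
      
        rocksdb::Options options;
        options.create_if_missing = true;
        options.env = encrypted_env.get();
      
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/encrypted_db", &db);
        if (s.ok()) {
          db->Put(rocksdb::WriteOptions(), "key", "value");
          delete db;
        }
        return 0;
      }
      ```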
      
      To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If it is set, the tests will set up an encrypted instance of the `Env` class to use for all tests.
      Typically you would run it like this:
      
      ```
      ENCRYPTED_ENV=1 make check_some
      ```
      
      There is also an added test that checks whether some data inserted into the database is "visible" on disk. With `ENCRYPTED_ENV` set it must not find the plain-text strings; with `ENCRYPTED_ENV` unset, it must find them.
      Closes https://github.com/facebook/rocksdb/pull/2424
      
      Differential Revision: D5322178
      
      Pulled By: sdwilsh
      
      fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
    • Unit Tests for sync, range sync and file close failures · 5c97a7c0
      Committed by Siying Dong
      Summary: Closes https://github.com/facebook/rocksdb/pull/2454
      
      Differential Revision: D5255320
      
      Pulled By: siying
      
      fbshipit-source-id: 0080830fa8eb5da6de25e17ba68aee91018c7913
  2. 25 June 2017, 1 commit
    • Optimize for serial commits in 2PC · 499ebb3a
      Committed by Maysam Yabandeh
      Summary:
      Throughput: 46k tps in our sysbench settings (filling the details later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to the WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes.
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers, which changes a major assumption in the code base. To
      accommodate that, i) allow only one WriteImpl at a time that intends to write to
      the memtable, via mem_mutex_ (which is fine since in 2PC almost all of the memtable writes
      come via the group commit phase, which is serial anyway), ii) make all the
      parts of the code base that assumed they were the only writer (via
      EnterUnbatched) also acquire mem_mutex_, and iii) protect stat updates
      via a stat_mutex_.
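      
      As a toy illustration of design change 2 (this is not RocksDB's actual code), the two counters can be kept as separate atomics: writers reserve a range of sequence numbers up front for the WAL/memtable and publish visibility to readers only after the write is applied.
      
      ```
      #include <atomic>
      #include <cstdint>
      
      // Toy sketch of the "last seq" vs. "last seq for write" split, so that
      // memtable inserts can proceed in parallel with WAL writes.
      class SequenceAllocator {
       public:
        // Reserve `count` sequence numbers for a write; returns the first one.
        uint64_t AllocateForWrite(uint64_t count) {
          return last_seq_for_write_.fetch_add(count) + 1;
        }
        // After the write is applied, make it visible to readers.
        void Publish(uint64_t last_used_seq) {
          last_visible_seq_.store(last_used_seq, std::memory_order_release);
        }
        // Readers take snapshots at the last *visible* sequence number only.
        uint64_t LastVisible() const {
          return last_visible_seq_.load(std::memory_order_acquire);
        }
      
       private:
        std::atomic<uint64_t> last_seq_for_write_{0};
        std::atomic<uint64_t> last_visible_seq_{0};
      };
      ```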
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
  3. 03 June 2017, 2 commits
    • Improve write buffer manager (and allow the size to be tracked in block cache) · 95b0e89b
      Committed by Siying Dong
      Summary:
      Improve write buffer manager in several ways:
      1. Size is tracked when an arena block is allocated, rather than on every allocation, so that it better tracks actual memory usage and the tracking overhead is slightly lower.
      2. We start to trigger memtable flushes when 7/8 of the memory cap is hit, instead of 100%, and make 100% much harder to hit.
      3. Allow a cache object to be passed into the write buffer manager so that the size allocated by memtables can be charged there. This can help users have one single memory cap across the block cache and memtables.
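      
      A minimal sketch of point 3, assuming the `WriteBufferManager` constructor that takes an optional cache (`rocksdb/write_buffer_manager.h`); the sizes are illustrative.
      
      ```
      #include <memory>
      
      #include "rocksdb/cache.h"
      #include "rocksdb/db.h"
      #include "rocksdb/write_buffer_manager.h"
      
      int main() {
        // Shared block cache; memtable memory will also be charged against it.
        std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(1 << 30 /* 1GB */);
      
        rocksdb::Options options;
        options.create_if_missing = true;
        // Cap total memtable memory at 256MB and cost it in the block cache.
        options.write_buffer_manager =
            std::make_shared<rocksdb::WriteBufferManager>(256 << 20, cache);
      
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/wbm_db", &db);
        if (s.ok()) {
          delete db;
        }
        return 0;
      }
      ```
      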
      Closes https://github.com/facebook/rocksdb/pull/2350
      
      Differential Revision: D5110648
      
      Pulled By: siying
      
      fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
    • Pass CF ID to MemTableRepFactory · a4d9c025
      Committed by Andrew Kryczka
      Summary:
      Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
      
      - adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
      - updates MemTable to create MemTableRep's using the new overload.
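      
      A hedged sketch of a custom factory picking up the new overload (the factory name is illustrative, and the exact parameter order of the new overload is assumed):
      
      ```
      #include <cstdint>
      
      #include "rocksdb/memtablerep.h"
      
      // Records the column family ID of each memtable it creates and delegates
      // the actual memtable to the built-in skip-list factory.
      class CfAwareFactory : public rocksdb::MemTableRepFactory {
       public:
        // Old overload (still required for compatibility).
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator, const rocksdb::SliceTransform* transform,
            rocksdb::Logger* logger) override {
          return skip_list_factory_.CreateMemTableRep(cmp, allocator, transform, logger);
        }
      
        // New overload from this diff: also receives the column family ID.
        rocksdb::MemTableRep* CreateMemTableRep(
            const rocksdb::MemTableRep::KeyComparator& cmp,
            rocksdb::Allocator* allocator, const rocksdb::SliceTransform* transform,
            rocksdb::Logger* logger, uint32_t column_family_id) override {
          last_cf_id_ = column_family_id;  // monitor per-CF memtable activity here
          return CreateMemTableRep(cmp, allocator, transform, logger);
        }
      
        const char* Name() const override { return "CfAwareFactory"; }
      
       private:
        rocksdb::SkipListFactory skip_list_factory_;
        uint32_t last_cf_id_ = 0;
      };
      ```
      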
      Closes https://github.com/facebook/rocksdb/pull/2346
      
      Differential Revision: D5108061
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
  4. 01 June 2017, 1 commit
    • Fixing blob db sequence number handling · ad19eb86
      Committed by Yi Wu
      Summary:
      Blob DB relies on the base DB returning the sequence number through the write batch after DB::Write(). However, after recent changes to the write path, DB::Write() no longer returns the sequence number in some cases. Fix it by having WriteBatchInternal::InsertInto() always encode the sequence number into the write batch.
      
      Stacking on #2375.
      Closes https://github.com/facebook/rocksdb/pull/2385
      
      Differential Revision: D5148358
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
  5. 31 May 2017, 1 commit
  6. 27 May 2017, 1 commit
  7. 23 May 2017, 1 commit
  8. 05 May 2017, 1 commit
    • Avoid calling fallocate with UINT64_MAX · 7c1c8ce5
      Committed by Andrew Kryczka
      Summary:
      When the user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
      
      Test setup:
      - command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
      - filesystem: xfs
      
      before this diff:
      `fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
      
      after this diff:
      `fallocate(10, 01, 0, 1977)              = 0`
      Closes https://github.com/facebook/rocksdb/pull/2252
      
      Differential Revision: D5007275
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
  9. 06 April 2017, 1 commit
  10. 04 April 2017, 1 commit
  11. 29 March 2017, 1 commit
    • Configure index partition size · e7731d11
      Committed by Maysam Yabandeh
      Summary:
      Allow the users to specify the target index partition size.
      
      With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut.
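      
      As a usage sketch (option names as referenced in this patch; the 4KB target is illustrative):
      
      ```
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"
      
      rocksdb::Options MakePartitionedIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        // Two-level (partitioned) index; a partition is cut when its estimated
        // in-memory size reaches metadata_block_size.
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
        table_options.metadata_block_size = 4 * 1024;  // target partition size in bytes
      
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```
      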
      Closes https://github.com/facebook/rocksdb/pull/2041
      
      Differential Revision: D4780216
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 95a0831
  12. 23 March 2017, 1 commit
  13. 14 March 2017, 1 commit
  14. 01 March 2017, 1 commit
  15. 24 February 2017, 1 commit
  16. 23 February 2017, 2 commits
  17. 07 February 2017, 1 commit
    • Two-level Indexes · 69d5262c
      Committed by Maysam Yabandeh
      Summary:
      Partition Index blocks and use a Partition-index as a 2nd level index.
      
      The two-level index can be used by setting
      BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
      configuring BlockBasedTableOptions::index_per_partition
      
      t15539501
      Closes https://github.com/facebook/rocksdb/pull/1814
      
      Differential Revision: D4473535
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bffb87e
  18. 04 January 2017, 1 commit
  19. 23 December 2016, 1 commit
    • direct io write support · 972f96b3
      Committed by Aaron Gao
      Summary:
      rocksdb direct io support
      
      ```
      [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      RocksDB:    version 5.0
      Date:       Wed Nov 23 13:17:43 2016
      CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
      CPUCache:   25600 KB
      Keys:       16 bytes each
      Values:     100 bytes each (50 bytes after compression)
      Entries:    1000000
      Prefix:    0 bytes
      Keys per prefix:    0
      RawSize:    110.6 MB (estimated)
      FileSize:   62.9 MB (estimated)
      Write rate: 0 bytes/second
      Compression: Snappy
      Memtablerep: skip_list
      Perf Level: 1
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      ------------------------------------------------
      Initializing RocksDB Options from the specified file
      Initializing RocksDB Options from command-line flags
      DB path: [/tmp/rocksdbtest-112628/dbbench]
      fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s
      
      [gzh@dev11575.prn2 ~/roc
      Closes https://github.com/facebook/rocksdb/pull/1564
      
      Differential Revision: D4241093
      
      Pulled By: lightmark
      
      fbshipit-source-id: 98c29e3
  20. 17 December 2016, 1 commit
  21. 22 November 2016, 1 commit
  22. 17 November 2016, 1 commit
  23. 21 October 2016, 1 commit
    • Support IngestExternalFile (remove AddFile restrictions) · 869ae5d7
      Committed by Islam AbdelRahman
      Summary:
      Changes in the diff
      
      API changes:
      - Introduce IngestExternalFile to replace AddFile (I think this makes the API clearer; see the usage sketch after the change list)
      - Introduce IngestExternalFileOptions (this struct will encapsulate the options for ingesting the external file)
      - Deprecate the AddFile() API
      
      Logic changes:
      - If our file overlaps with the memtable, we will flush the memtable
      - We will find the first level in the LSM tree whose keys overlap with our file's key range
      - We will find the lowest level in the LSM tree above the level found in step 2 that our file can fit in, and ingest our file into it
      - We will assign a global sequence number to our new file
      - Remove the AddFile restrictions by using global sequence numbers
      
      Other changes:
      - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
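      
      A minimal usage sketch of the new API (paths, keys, and the helper name are illustrative):
      
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/sst_file_writer.h"
      
      rocksdb::Status IngestOneFile(rocksdb::DB* db, const rocksdb::Options& options) {
        // 1. Build an external SST file with keys added in sorted order.
        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options, options.comparator);
        rocksdb::Status s = writer.Open("/tmp/file1.sst");
        if (!s.ok()) return s;
        s = writer.Add("key1", "value1");
        if (!s.ok()) return s;
        s = writer.Finish();
        if (!s.ok()) return s;
      
        // 2. Ingest it; IngestExternalFileOptions replaces the old AddFile() arguments.
        rocksdb::IngestExternalFileOptions ifo;
        ifo.move_files = false;  // copy the file instead of hard-linking it
        return db->IngestExternalFile({"/tmp/file1.sst"}, ifo);
      }
      ```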
      
      Test Plan:
      unit tests (still need to add more)
      addfile_stress (https://reviews.facebook.net/D65037)
      
      Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: jkedgar, hcz, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D65061
  24. 19 October 2016, 1 commit
    • Support SST files with Global sequence numbers [reland] · b88f8e87
      Committed by Islam AbdelRahman
      Summary:
      reland https://reviews.facebook.net/D62523
      
      - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
      - Update TableProperties to be aware of the offset of each property in the file
      - Update BlockBasedTableReader and Block to be able to honor the sequence number in the `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence numbers in the file
      
      Something worth mentioning is that we don't update the seqno in the index block when doing a binary search. The reason is that SST files with a global seqno are guaranteed to have only one user_key and each key has seqno=0 encoded in it, which means such a key is greater than any other key with seqno > 0. That means we can keep the current logic for these blocks.
      
      Test Plan: unit tests
      
      Reviewers: sdong, yhchiang
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D65211
  25. 20 September 2016, 1 commit
  26. 08 September 2016, 1 commit
  27. 12 August 2016, 1 commit
    • Eliminate memcpy from ForwardIterator · d11c09d9
      Committed by Islam AbdelRahman
      Summary:
      This diff updates ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators.
      This diff is stacked on D61305
      
      Test Plan:
      existing tests (updated them to test tailing iterator)
      new test
      
      Reviewers: andrewkr, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D60009
  28. 11 August 2016, 1 commit
    • read_options.background_purge_on_iterator_cleanup to cover forward iterator... · 56dd0341
      Committed by sdong
      read_options.background_purge_on_iterator_cleanup to cover forward iterator and log file closing too.
      
      Summary: With read_options.background_purge_on_iterator_cleanup=true, file deletion and closing can still happen in the forward iterator, or when closing a WAL file. Cover those cases too.
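      
      A short usage sketch (the helper and the scan are illustrative):
      
      ```
      #include <memory>
      
      #include "rocksdb/db.h"
      
      void ScanWithBackgroundPurge(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        // Defer file deletion/closing triggered by iterator cleanup (including the
        // forward iterator and WAL file closing) to the background purge path.
        read_options.background_purge_on_iterator_cleanup = true;
      
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
      }
      ```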
      
      Test Plan: I am adding unit tests.
      
      Reviewers: andrewkr, IslamAbdelRahman, yiwu
      
      Reviewed By: yiwu
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D61503
  29. 21 July 2016, 1 commit
    • Introduce FullMergeV2 (eliminate memcpy from merge operators) · 68a8e6b8
      Committed by Islam AbdelRahman
      Summary:
      This diff updates the code to pin the merge operator operands while the merge operation is performed, so that we can eliminate the memcpy cost. To do that we need a new public API for FullMerge that replaces std::deque<std::string> with std::vector<Slice>.
      
      This diff is stacked on top of D56493 and D56511
      
      In this diff we
      - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
      - Replace std::deque<std::string> with std::vector<Slice> to pass operands
      - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
      - Allow FullMergeV2 output to be an existing operand
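      
      For reference, a hedged sketch of a merge operator written against the new FullMergeV2 entry point (essentially the `max` operator used by the db_bench runs below; member names follow the MergeOperationInput/MergeOperationOutput description above):
      
      ```
      #include "rocksdb/merge_operator.h"
      #include "rocksdb/slice.h"
      
      // Byte-wise "max" merge operator. The winning value is returned by pointing
      // existing_operand at an already pinned Slice, so no copy is made.
      class MaxMergeOperator : public rocksdb::MergeOperator {
       public:
        bool FullMergeV2(const MergeOperationInput& merge_in,
                         MergeOperationOutput* merge_out) const override {
          rocksdb::Slice& max = merge_out->existing_operand;
          if (merge_in.existing_value != nullptr) {
            max = *merge_in.existing_value;
          }
          for (const rocksdb::Slice& operand : merge_in.operand_list) {
            if (max.compare(operand) < 0) {
              max = operand;
            }
          }
          return true;
        }
      
        const char* Name() const override { return "MaxMergeOperator"; }
      };
      ```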
      
      ```
      [Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
      
      [FullMergeV2]
      readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
      readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
      readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
      readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
      readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
      
      [master]
      readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
      readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
      readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
      readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
      readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
      ```
      
      ```
      [Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
      
      [FullMergeV2]
      readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
      readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
      readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
      readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
      readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s
      
      [master]
      readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
      readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
      readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
      readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
      readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s
      
      ```
      
      ```
      [Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
      
      [FullMergeV2]
      $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
      readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
      readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
      readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
      readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
      readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
      
      [master]
      readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
      readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
      readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
      readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
      readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
      ```
      
      ```
      [Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
      
      [FullMergeV2]
      readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
      readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
      readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
      readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
      readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s
      
      [master]
      readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
      readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
      readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
      readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
      readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
      ```
      
      Test Plan: COMPILE_WITH_ASAN=1 make check -j64
      
      Reviewers: yhchiang, andrewkr, sdong
      
      Reviewed By: sdong
      
      Subscribers: lovro, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D57075
  30. 22 June 2016, 1 commit
  31. 18 June 2016, 1 commit
    • Deprecate filter_deletes · 7b79238b
      Committed by sdong
      Summary: filter_deletes is not a frequently used feature. Remove it.
      
      Test Plan: Run all test suites.
      
      Reviewers: igor, yhchiang, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: leveldb, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59427
  32. 10 May 2016, 1 commit
    • Fix win build · 730f7e2e
      Committed by Yi Wu
      Summary: Fixing error with win build where we compare int64_t with size_t.
      
      Test Plan: make check
      
      Reviewers: andrewkr
      
      Reviewed By: andrewkr
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D57885
  33. 05 May 2016, 1 commit
    • Enable configurable readahead for iterators · 24a24f01
      Committed by Yi Wu
      Summary:
      Add an option `iterator_readahead_size` to `ReadOptions` to enable
      configurable readahead for iterators similar to the corresponding
      option for compaction.
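      
      A usage sketch, assuming the option name given in this summary (`iterator_readahead_size`) and an arbitrary 2MB value:
      
      ```
      #include <memory>
      
      #include "rocksdb/db.h"
      
      void ScanWithReadahead(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        // Read ahead 2MB at a time while iterating, similar to compaction readahead.
        read_options.iterator_readahead_size = 2 * 1024 * 1024;
      
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // sequential scans benefit most from readahead
        }
      }
      ```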
      
      Test Plan:
      ```
      make commit_prereq
      ```
      
      Reviewers: kumar.rangarajan, ott, igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: yiwu, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D55419
  34. 21 April 2016, 1 commit
    • Add per-level compression ratio property · 73a847ef
      Committed by Andrew Kryczka
      Summary:
      This is needed so we can measure compression ratio improvements
      achieved by D52287.
      
      The property compares raw data size against the total file size for a given
      level. If the level is empty it should return 0.0.
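      
      A usage sketch; the property name (`rocksdb.compression-ratio-at-levelN`) is an assumption, since the summary does not spell it out:
      
      ```
      #include <iostream>
      #include <string>
      
      #include "rocksdb/db.h"
      
      void PrintCompressionRatios(rocksdb::DB* db, int num_levels) {
        for (int level = 0; level < num_levels; ++level) {
          std::string ratio;
          // Assumed property name; per the summary, an empty level reports 0.0.
          if (db->GetProperty(
                  "rocksdb.compression-ratio-at-level" + std::to_string(level),
                  &ratio)) {
            std::cout << "L" << level << " compression ratio: " << ratio << "\n";
          }
        }
      }
      ```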
      
      Test Plan: new unit test
      
      Reviewers: IslamAbdelRahman, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D56967
  35. 19 April 2016, 2 commits
    • Fix lite build · 290883d9
      Committed by Yi Wu
      Summary: Fix rocksdb lite build after D56715.
      
      Test Plan:
        make -j40 'OPT=-g -DROCKSDB_LITE'
      
      Reviewers: sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D56895
    • Split db_test.cc · 792762c4
      Committed by Yi Wu
      Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase.
      
      Test Plan: make check
      
      Reviewers: sdong, yhchiang, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong
      
      Differential Revision: https://reviews.facebook.net/D56715
  36. 18 February 2016, 1 commit