1. 30 4月, 2019 1 次提交
  2. 27 3月, 2019 1 次提交
    • Y
      Support for single-primary, multi-secondary instances (#4899) · 9358178e
      Yanqin Jin 提交于
      Summary:
      This PR allows RocksDB to run in single-primary, multi-secondary process mode.
      The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
      Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
      
      This PR has several components:
      1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
      
      2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
      
      3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
      3.1 Tail the primary's MANIFEST during recovery.
      3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
      3.3 Tailing WAL will be in a future PR.
      
      4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
      
      Differential Revision: D14510945
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
      9358178e
  3. 05 3月, 2019 1 次提交
    • A
      Use `fallocate` even if hole-punching unsupported (#5023) · 186b3afa
      Andrew Kryczka 提交于
      Summary:
      The compiler flag `-DROCKSDB_FALLOCATE_PRESENT` was only set when
      `fallocate`, `FALLOC_FL_KEEP_SIZE`, and `FALLOC_FL_PUNCH_HOLE` were all
      present. However, the last of the three is not really necessary for the
      primary `fallocate` use case; furthermore, it was introduced only in later
      Linux kernel versions (2.6.38+).
      
      This PR changes the flag `-DROCKSDB_FALLOCATE_PRESENT` to only require
      `fallocate` and `FALLOC_FL_KEEP_SIZE` to be present. There is a separate
      check for `FALLOC_FL_PUNCH_HOLE` only in the place where it is used.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5023
      
      Differential Revision: D14248487
      
      Pulled By: siying
      
      fbshipit-source-id: a10ed0b902fa755988e957bd2dcec9081ec0502e
      186b3afa
  4. 21 2月, 2019 1 次提交
    • Z
      add GetStatsHistory to retrieve stats snapshots (#4748) · c4f5d0aa
      Zhongyi Xie 提交于
      Summary:
      This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
      
      (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
      
      Differential Revision: D13961471
      
      Pulled By: miasantreble
      
      fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
      c4f5d0aa
  5. 14 2月, 2019 1 次提交
  6. 08 2月, 2019 1 次提交
  7. 11 1月, 2019 1 次提交
  8. 03 1月, 2019 2 次提交
  9. 18 12月, 2018 1 次提交
  10. 14 12月, 2018 1 次提交
    • B
      Concurrent task limiter for compaction thread control (#4332) · a8b9891f
      Burton Li 提交于
      Summary:
      The PR is targeting to resolve the issue of:
      https://github.com/facebook/rocksdb/issues/3972#issue-330771918
      
      We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
      When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
      Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
      
      With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
      
      ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
      
      The usage is straight forward:
      e.g.:
      
      //
      // Enable compaction thread limiter thru ColumnFamilyOptions
      //
      std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
      Options options;
      ColumnFamilyOptions cf_opt(options);
      cf_opt.compaction_thread_limiter = ctl;
      ...
      
      //
      // Compaction thread limiter can be tuned or disabled on-the-fly
      //
      ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
      ...
      ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
      ctl->SetMaxOutstandingTask(-1); // Same as above
      ...
      ctl->SetMaxOutstandingTask(0);  // full throttle (0 task)
      
      //
      // Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
      //
      std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
      std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
      Options options;
      ColumnFamilyOptions cf_opt_ssd1(options);
      ColumnFamilyOptions cf_opt_ssd2(options);
      ColumnFamilyOptions cf_opt_hdd1(options);
      ColumnFamilyOptions cf_opt_hdd2(options);
      ColumnFamilyOptions cf_opt_hdd3(options);
      
      // SSD CFs
      cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
      cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
      
      // HDD CFs
      cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
      cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
      cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
      
      ...
      
      //
      // The limiter is disabled by default (or set to nullptr explicitly)
      //
      Options options;
      ColumnFamilyOptions cf_opt(options);
      cf_opt.compaction_thread_limiter = nullptr;
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
      
      Differential Revision: D13226590
      
      Pulled By: siying
      
      fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
      a8b9891f
  11. 30 11月, 2018 1 次提交
    • S
      Move FIFOCompactionPicker to a separate file (#4724) · 70645355
      Sagar Vemuri 提交于
      Summary:
      **Summary:**
      Simplified the code layout by moving FIFOCompactionPicker to a separate file.
      **Why?:**
      While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724
      
      Differential Revision: D13227914
      
      Pulled By: sagar0
      
      fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
      70645355
  12. 28 11月, 2018 1 次提交
    • H
      Add SstFileReader to read sst files (#4717) · 5e72bc11
      Huachao Huang 提交于
      Summary:
      A user friendly sst file reader is useful when we want to access sst
      files outside of RocksDB. For example, we can generate an sst file
      with SstFileWriter and send it to other places, then use SstFileReader
      to read the file and process the entries in other ways.
      
      Also rename the original SstFileReader to SstFileDumper because of
      name conflict, and seems SstFileDumper is more appropriate for tools.
      
      TODO: there is only a very simple test now, because I want to get some feedback first.
      If the changes look good, I will add more tests soon.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717
      
      Differential Revision: D13212686
      
      Pulled By: ajkr
      
      fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
      5e72bc11
  13. 22 11月, 2018 1 次提交
    • A
      Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Abhishek Madan 提交于
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
      457f77b9
  14. 08 11月, 2018 1 次提交
  15. 27 10月, 2018 1 次提交
    • Y
      port folly::JemallocNodumpAllocator (#4534) · 5f5fddab
      Yi Wu 提交于
      Summary:
      Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
      1. It only support jemalloc >= 5.0
      2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.
      
      Depending on #4502.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534
      
      Differential Revision: D10435474
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
      5f5fddab
  16. 25 10月, 2018 1 次提交
    • A
      Use only "local" range tombstones during Get (#4449) · 8c78348c
      Abhishek Madan 提交于
      Summary:
      Previously, range tombstones were accumulated from every level, which
      was necessary if a range tombstone in a higher level covered a key in a lower
      level. However, RangeDelAggregator::AddTombstones's complexity is based on
      the number of tombstones that are currently stored in it, which is wasteful in
      the Get case, where we only need to know the highest sequence number of range
      tombstones that cover the key from higher levels, and compute the highest covering
      sequence number at the current level. This change introduces this optimization, and
      removes the use of RangeDelAggregator from the Get path.
      
      In the benchmark results, the following command was used to initialize the database:
      ```
      ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
      ```
      
      ...and the following command was used to measure read throughput:
      ```
      ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
      ```
      
      The filluniquerandom command was only run once, and the resulting database was used
      to measure read performance before and after the PR. Both binaries were compiled with
      `DEBUG_LEVEL=0`.
      
      Readrandom results before PR:
      ```
      readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
      ```
      
      Readrandom results after PR:
      ```
      readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
      ```
      
      So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
      
      ----
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
      
      Differential Revision: D10370575
      
      Pulled By: abhimadan
      
      fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
      8c78348c
  17. 19 10月, 2018 1 次提交
  18. 12 10月, 2018 1 次提交
    • W
      Add compile time option to work with utf8 filename strings (#4469) · 5d809ece
      Wilfried Goesgens 提交于
      Summary:
      The default behaviour of rocksdb is to use the `*A(` windows API functions.
      These accept filenames in the currently configured system encoding,
      be it Latin 1, utf8 or whatever.
      If the Application intends to completely work with utf8 strings internally,
      converting these to that codepage properly isn't even always possible.
      Thus this patch adds a switch to use the `*W(` functions, which accept
      UTF-16 filenames, and uses C++11 features to translate the
      UTF8 containing std::string to an UTF16 containing std::wstring.
      
      This feature is a compile time options, that can be enabled by setting `WITH_WINDOWS_UTF8_FILENAMES` to true.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4469
      
      Differential Revision: D10356011
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 27b6ae9171f209085894cdf80069e8a896642044
      5d809ece
  19. 28 9月, 2018 1 次提交
    • Y
      Utility to run task periodically in a thread (#4423) · d6f2ecf4
      Yi Wu 提交于
      Summary:
      Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time.
      
      We can use this class to replace `TimerQueue` in #4382 and `BlobDB`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423
      
      Differential Revision: D10020932
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087
      d6f2ecf4
  20. 18 9月, 2018 1 次提交
    • A
      Add RangeDelAggregator microbenchmarks (#4363) · 1626f6ab
      Abhishek Madan 提交于
      Summary:
      To measure the results of upcoming DeleteRange v2 work, this commit adds
      simple benchmarks for RangeDelAggregator. It measures the average time
      for AddTombstones and ShouldDelete calls.
      
      Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:
      
      Before #4014:
      ```
      =======================
      Results:
      =======================
      AddTombstones:          1356.28 us
      ShouldDelete:           0.401732 us
      ```
      
      Latest master:
      ```
      =======================
      Results:
      =======================
      AddTombstones:          740.82 us
      ShouldDelete:           0.383271 us
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363
      
      Differential Revision: D9881676
      
      Pulled By: abhimadan
      
      fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
      1626f6ab
  21. 28 8月, 2018 2 次提交
    • J
      cmake: allow opting out debug runtime (#4317) · 6ed7f146
      Jay Lee 提交于
      Summary:
      Projects built in debug profile don't always link to debug runtime.
      Allowing opting out the debug runtime to make rocksdb get along well
      with other projects.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4317
      
      Differential Revision: D9518038
      
      Pulled By: sagar0
      
      fbshipit-source-id: 384901a0d12b8de20759756e8a19b4888a27c399
      6ed7f146
    • Y
      BlobDB: Implement DisableFileDeletions (#4314) · a6d3de4e
      Yi Wu 提交于
      Summary:
      `DB::DiableFileDeletions` and `DB::EnableFileDeletions` are used for applications to stop RocksDB background jobs to delete files while they are doing replication. Implement these methods for BlobDB. `DeleteObsolteFiles` now needs to check `disable_file_deletions_` before starting, and will hold `delete_file_mutex_` the whole time while it is running. `DisableFileDeletions` needs to wait on `delete_file_mutex_` for running `DeleteObsolteFiles` job and set `disable_file_deletions_` flag.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4314
      
      Differential Revision: D9501373
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 81064c1228f1724eff46da22b50ff765b16292cd
      a6d3de4e
  22. 16 8月, 2018 1 次提交
    • F
      Improve point-lookup performance using a data block hash index (#4174) · 19ec44fd
      Fenggang Wu 提交于
      Summary:
      Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
      
      The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
      
      Differential Revision: D8965914
      
      Pulled By: fgwu
      
      fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
      19ec44fd
  23. 14 8月, 2018 1 次提交
    • Z
      RocksDB Trace Analyzer (#4091) · 999d955e
      Zhichao Cao 提交于
      Summary:
      A framework of trace analyzing for RocksDB
      
      After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
      **Input:**
      1. trace file
      2. Whole keys space file
      
      **Statistics:**
      1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
      2. Key hotness (access count) of each one
      3. Key space separation based on given prefix
      4. Key size distribution
      5. Value size distribution if appliable
      6. Top K accessed keys
      7. QPS statistics including the average QPS and peak QPS
      8. Top K accessed prefix
      9. The query correlation analyzing, output the number of X after Y and the corresponding average time
          intervals
      
      **Output:**
      1. key access heat map (either in the accessed key space or whole key space)
      2. trace sequence file (interpret the raw trace file to line base text file for future use)
      3. Time serial (The key space ID and its access time)
      4. Key access count distritbution
      5. Key size distribution
      6. Value size distribution (in each intervals)
      7. whole key space separation by the prefix
      8. Accessed key space separation by the prefix
      9. QPS of each operation and each column family
      10. Top K QPS and their accessed prefix range
      
      **Test:**
      1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
      2. Generated the trace and analyze the trace
      
      **Implemented but not tested (due to the limitation of trace_replay):**
      1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
      2. Analyzing the number of Key found by Get
      
      **Future Work:**
      1.  Support execution time analyzing of each requests
      2.  Support cache hit situation and block read situation of Get
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
      
      Differential Revision: D9256157
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
      999d955e
  24. 07 8月, 2018 1 次提交
    • Y
      BlobDB: Cleanup TTLExtractor interface (#4229) · 140f256d
      Yi Wu 提交于
      Summary:
      Cleanup TTLExtractor interface. The original purpose of it is to allow our users keep using existing `Write()` interface but allow it to accept TTL via `TTLExtractor`. However the interface is confusing. Will replace it with something like `WriteWithTTL(batch, ttl)` in the future.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4229
      
      Differential Revision: D9174390
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 68201703d784408b851336ab4dd9b84188245b2d
      140f256d
  25. 01 8月, 2018 1 次提交
    • S
      Trace and Replay for RocksDB (#3837) · 12b6cdee
      Sagar Vemuri 提交于
      Summary:
      A framework for tracing and replaying RocksDB operations.
      
      A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.
      
      - Column-families are supported
      - Multi-threaded tracing is supported.
      - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
      - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.
      
      Currently supported DB operations:
      - Writes:
      -- Put
      -- Merge
      -- Delete
      -- SingleDelete
      -- DeleteRange
      -- Write
      - Reads:
      -- Get (point lookups)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837
      
      Differential Revision: D7974837
      
      Pulled By: sagar0
      
      fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
      12b6cdee
  26. 25 7月, 2018 1 次提交
    • F
      DataBlockHashIndex: Standalone Implementation with Unit Test (#4139) · 8805ec2f
      Fenggang Wu 提交于
      Summary:
      The first step of the `DataBlockHashIndex` implementation. A string based hash table is implemented and unit-tested.
      
      `DataBlockHashIndexBuilder`: `Add()` takes pairs of `<key, restart_index>`, and formats it into a string when `Finish()` is called.
      `DataBlockHashIndex`: initialized by the formatted string, and can interpret it as a hash table. Lookup for a key is supported by iterator operation.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4139
      
      Reviewed By: sagar0
      
      Differential Revision: D8866764
      
      Pulled By: fgwu
      
      fbshipit-source-id: 7f015f0098632c65979a22898a50424384730b10
      8805ec2f
  27. 24 7月, 2018 1 次提交
    • C
      move static msgs out of Status class (#4144) · 374c37da
      Chang Su 提交于
      Summary:
      The member msgs of class Status contains all types of status messages.
      When users dump a Status object, msgs will confuse users. So move it out
      of class Status by making it as file-local static variable.
      
      Closes #3831 .
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144
      
      Differential Revision: D8941419
      
      Pulled By: sagar0
      
      fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d
      374c37da
  28. 20 7月, 2018 1 次提交
  29. 18 7月, 2018 1 次提交
  30. 14 7月, 2018 1 次提交
  31. 29 6月, 2018 1 次提交
    • A
      Allow DB resume after background errors (#3997) · 52d4c9b7
      Anand Ananthabhotla 提交于
      Summary:
      Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
      1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
      2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
      3. Provide an API for the user to clear the error and resume the DB instance
      
      This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
      Closes https://github.com/facebook/rocksdb/pull/3997
      
      Differential Revision: D8653831
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
      52d4c9b7
  32. 05 6月, 2018 1 次提交
    • D
      Provide a way to override windows memory allocator with jemalloc for ZSTD · f4b72d70
      Dmitri Smirnov 提交于
      Summary:
      Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.
      
      For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.
      
      The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
      Closes https://github.com/facebook/rocksdb/pull/3838
      
      Differential Revision: D8229794
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
      f4b72d70
  33. 02 6月, 2018 1 次提交
  34. 01 6月, 2018 1 次提交
  35. 17 5月, 2018 1 次提交
    • M
      Change and clarify the relationship between Valid(), status() and Seek*() for... · 8bf555f4
      Mike Kolupaev 提交于
      Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
      
      Summary:
      Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
       * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
       * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
      
      However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
      
      This PR changes the convention to:
       * If status() is not ok, Valid() always returns false.
       * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
      
      This does sacrifice the two use cases listed above, but siying said it's ok.
      
      Overview of the changes:
       * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
       * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
       * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
      
      To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
      
      Iterators that didn't need changes:
       * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
       * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
      
      Iterators with changes (see inline comments for details):
       * DBIter - an overhaul:
          - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
          - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
          - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
          - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
          - It used to not reset status on seek for some types of errors.
          - Some simplifications and better comments.
          - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
       * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
       * ForwardIterator - changed to the new convention, also slightly simplified.
       * ForwardLevelIterator - fixed some bugs and simplified.
       * LevelIterator - simplified.
       * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
       * BlockBasedTableIterator - minor changes.
       * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
       * PlainTableIterator - some seeks used to not reset status.
       * CuckooTableIterator - tiny code cleanup.
       * ManagedIterator - fixed some bugs.
       * BaseDeltaIterator - changed to the new convention and fixed a bug.
       * BlobDBIterator - seeks used to not reset status.
       * KeyConvertingIterator - some small change.
      Closes https://github.com/facebook/rocksdb/pull/3810
      
      Differential Revision: D7888019
      
      Pulled By: al13n321
      
      fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
      8bf555f4
  36. 10 5月, 2018 1 次提交
  37. 08 5月, 2018 2 次提交