1. 11 Jun 2019 (2 commits)
    • Avoid deadlock between mutex_ and log_write_mutex_ (#5437) · c8c1a549
      Maysam Yabandeh committed
      Summary:
      To avoid deadlock, mutex_ should never be acquired before log_write_mutex_. The patch documents that and also fixes one case in ::FlushWAL that acquires mutex_ through ::WriteStatusCheck while it already holds a lock on log_write_mutex_.
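
      The rule above is the classic lock-ordering discipline. A minimal, generic sketch of why a fixed acquisition order prevents the deadlock (hypothetical mutex names, not the actual DBImpl members):

      ```cpp
      #include <mutex>

      // Generic illustration only: every thread that needs both locks must take
      // them in one agreed-upon order.
      std::mutex mu_a;  // the lock the convention says to take first
      std::mutex mu_b;  // the lock the convention says to take second

      void FollowsConvention() {
        std::lock_guard<std::mutex> l1(mu_a);
        std::lock_guard<std::mutex> l2(mu_b);
        // ... work that needs both locks ...
      }

      void ViolatesConvention() {
        std::lock_guard<std::mutex> l1(mu_b);
        std::lock_guard<std::mutex> l2(mu_a);  // can deadlock against FollowsConvention()
      }
      ```
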
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5437
      
      Differential Revision: D15749722
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f57b69c44b4b80cc6d7ddf3d3fdf4a9eb5a5a45a
    • Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421) · 5efa0d6b
      haoyuhuang committed
      Summary:
      BlockCacheLookupContext only contains the caller for now.
      We will trace block accesses at five places:
      1. BlockBasedTable::GetFilter.
      2. BlockBasedTable::GetUncompressedDict.
      3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace accesses to data, index, and range deletion blocks.)
      4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      
      We create the context at:
      1. BlockBasedTable::Get. (kUserGet)
      2. BlockBasedTable::MultiGet. (kUserMGet)
      3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
      4. BlockBasedTable::Open. (kPrefetch)
      5. Index/Filter::CacheDependencies. (kPrefetch)
      6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).
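
      A minimal sketch of what such a context might look like, assuming an enum built from the caller names in the lists above (the real definitions are not part of this message):

      ```cpp
      // Hypothetical sketch; names are taken from the lists above, not from the header.
      enum class BlockCacheLookupCaller {
        kUserGet,
        kUserMGet,
        kUserIterator,
        kPrefetch,
        kCompaction,
        kUserApproximateSize,
      };

      struct BlockCacheLookupContext {
        explicit BlockCacheLookupContext(BlockCacheLookupCaller c) : caller(c) {}
        BlockCacheLookupCaller caller;  // only the caller for now; more tracing fields later
      };

      // Created near the top of, e.g., BlockBasedTable::Get and passed down to the
      // five lookup sites listed above:
      // BlockCacheLookupContext lookup_context(BlockCacheLookupCaller::kUserGet);
      ```
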
      
      I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
      Throughput of this PR: 231334 ops/s.
      Throughput of the master branch: 238428 ops/s.
      
      Experiment setup:
      RocksDB:    version 6.2
      Date:       Mon Jun 10 10:42:51 2019
      CPU:        24 * Intel Core Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       20 bytes each
      Values:     100 bytes each (100 bytes after compression)
      Entries:    1000000
      Prefix:    20 bytes
      Keys per prefix:    0
      RawSize:    114.4 MB (estimated)
      FileSize:   114.4 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: NoCompression
      Compression sampling rate: 0
      Memtablerep: skip_list
      Perf Level: 1
      
      Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
      
      Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
      
      TODOs:
      1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
      2. Integrate tracer to trace block cache accesses.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421
      
      Differential Revision: D15704258
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
  2. 07 Jun 2019 (1 commit)
  3. 06 Jun 2019 (1 commit)
    • Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Yanqin Jin committed
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exists in DBBasicTestWithTimestampWithParam. Support for timestamps requires the user to provide the timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps within the same database must share the same length and format, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
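
      A hypothetical usage sketch based on the description above; the exact ReadOptions/WriteOptions member names and the comparator wiring are assumptions, not the committed API:

      ```cpp
      #include <string>
      #include "rocksdb/db.h"

      // Assumes the DB was opened with a Comparator that understands the fixed-size
      // timestamp, as described above.
      void PutAndGetWithTimestamp(rocksdb::DB* db) {
        std::string ts_buf(8, '\0');   // all timestamps share one length/format per DB
        rocksdb::Slice ts(ts_buf);

        rocksdb::WriteOptions wopts;
        wopts.timestamp = &ts;         // assumed member carrying the write timestamp
        db->Put(wopts, "key", "value");

        rocksdb::ReadOptions ropts;
        ropts.timestamp = &ts;         // assumed member: read as of this timestamp
        std::string value;
        db->Get(ropts, "key", &value);
      }
      ```
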
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also ran the following db_bench tests to verify whether there is a regression in Get/Put when timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeated 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
  4. 04 Jun 2019 (1 commit)
  5. 01 Jun 2019 (2 commits)
  6. 31 May 2019 (3 commits)
  7. 30 May 2019 (1 commit)
  8. 02 May 2019 (1 commit)
  9. 26 Apr 2019 (1 commit)
    • Close WAL files before deletion (#5233) · da96f2fe
      Yanqin Jin committed
      Summary:
      Currently one thread in RocksDB keeps a WAL file open while another thread
      deletes it. Although the first thread never writes to the WAL again, it still
      tries to close it in the end. This is fine on POSIX, but can be problematic on
      other platforms, e.g. HDFS. It will either cause a lot of warning messages or
      throw exceptions. The solution is to let the second thread close the WAL before deleting it.
      
      RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
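
      A generic sketch of the close-before-delete idea (stand-in types, not the actual DBImpl/PurgeObsoleteFiles code):

      ```cpp
      #include <memory>
      #include <string>
      #include <vector>

      struct LogWriter {                 // stand-in for the WAL writer type
        void Close() { /* flush and close the underlying file */ }
      };

      // Runs without the DB mutex, mirroring how PurgeObsoleteFiles is described:
      // close every writer handed over in logs_to_free before deleting its file.
      void PurgeObsoleteLogs(std::vector<std::unique_ptr<LogWriter>>* logs_to_free,
                             const std::vector<std::string>& wal_files_to_delete) {
        for (auto& writer : *logs_to_free) {
          writer->Close();               // the close now happens on the deleting thread
        }
        logs_to_free->clear();
        for (const auto& f : wal_files_to_delete) {
          (void)f;                       // DeleteFile(f) would follow here
        }
      }
      ```
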
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
      
      Differential Revision: D15032670
      
      Pulled By: riversand963
      
      fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
  10. 25 Apr 2019 (1 commit)
    • secondary instance: add support for WAL tailing on `OpenAsSecondary` · aa56b7e7
      Zhongyi Xie committed
      Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds support for WAL tailing in `OpenAsSecondary`, which means that after the `OpenAsSecondary` call, the secondary is able to see the primary's writes that are yet to be flushed. The secondary can see the primary's writes in the WAL up to the moment the `OpenAsSecondary` call starts.
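
      A sketch of opening a secondary instance after this change; the paths are hypothetical and the `max_open_files` note reflects the usual secondary-instance restriction rather than anything stated here:

      ```cpp
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::DB* OpenSecondary() {
        rocksdb::Options options;
        options.max_open_files = -1;   // secondaries typically keep all table files open

        rocksdb::DB* secondary = nullptr;
        // "primary_path" is the primary's DB directory; "secondary_path" is a private
        // directory for the secondary's own info log and OPTIONS files.
        rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
            options, "/path/to/primary_path", "/path/to/secondary_path", &secondary);
        if (!s.ok()) {
          return nullptr;
        }
        // The secondary now sees flushed data plus, per this change, WAL entries
        // written up to the moment the OpenAsSecondary call started.
        return secondary;
      }
      ```
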
      
      Differential Revision: D15059905
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10
  11. 16 Apr 2019 (1 commit)
    • Fix MultiGet ASSERT bug when passing unsorted result (#5195) · 3e63e553
      Yi Zhang committed
      Summary:
      Found this when test-driving the new MultiGet. If you pass an unsorted result with sorted_result = false, you'll incorrectly trigger the ASSERT even though we'll sort it down below.
      
      I've also added a simple test covering the sorted_result=true/false scenarios, copied from MultiGetSimple.
      
      anand1976
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195
      
      Differential Revision: D14935475
      
      Pulled By: yizhang82
      
      fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
  12. 13 Apr 2019 (1 commit)
    • WritePrepared: fix race condition in reading batch with duplicate keys (#5147) · fe642cbe
      Maysam Yabandeh committed
      Summary:
      When ReadOptions doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (which requires synchronization over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of a duplicate key is inserted with a different sequence number, and depending on the ordering, ::Get might skip the newer one and read the older one, which is obsolete.
      The patch fixes that by using the last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that max_evicted_seq has not advanced past the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actual snapshot that would let IsInSnapshot handle it properly when an overlapping commit is evicted from the commit cache.
      A unit test is added to reproduce the race condition with duplicate keys.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
      
      Differential Revision: D14758815
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
  13. 12 Apr 2019 (1 commit)
    • Introduce a new MultiGet batching implementation (#5011) · fefd4b98
      anand76 committed
      Summary:
      This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
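
      A usage sketch of the array-based form described above; the exact parameter order and signature are an assumption based on this description:

      ```cpp
      #include <array>
      #include "rocksdb/db.h"

      void BatchedLookup(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
        constexpr size_t kNumKeys = 4;
        std::array<rocksdb::Slice, kNumKeys> keys = {
            rocksdb::Slice("k1"), rocksdb::Slice("k2"),
            rocksdb::Slice("k3"), rocksdb::Slice("k4")};
        // Statuses and values live on the stack; PinnableSlice avoids copying when
        // the value can be pinned in the block cache.
        std::array<rocksdb::PinnableSlice, kNumKeys> values;
        std::array<rocksdb::Status, kNumKeys> statuses;

        db->MultiGet(rocksdb::ReadOptions(), cf, kNumKeys, keys.data(),
                     values.data(), statuses.data(), /*sorted_input=*/false);

        for (size_t i = 0; i < kNumKeys; ++i) {
          if (statuses[i].ok()) {
            // values[i] holds (or pins) the value for keys[i].
          }
        }
      }
      ```
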
      
      Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to -
      1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
      2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
      
      The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
      
      Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
      
      Batch sizes: 1 | 2 | 4 | 8 | 16 | 32 (micros/op)

      Random pattern (Stride length 0)
      Get                    : 4.158 | 4.109 | 4.026 | 4.05  | 4.1   | 4.074
      MultiGet (no batching) : 4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075
      MultiGet (w/ batching) : 4.461 | 4.256 | 4.277 | 4.11  | 4.182 | 4.14

      Good locality (Stride length 16)
      Get                    : 4.048 | 3.659 | 3.248 | 2.99  | 2.84  | 2.753
      MultiGet (no batching) : 4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
      MultiGet (w/ batching) : 4.452 | 3.45  | 2.833 | 2.451 | 2.233 | 2.135

      Good locality (Stride length 256)
      Get                    : 4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
      MultiGet (no batching) : 4.406 | 4.005 | 3.644 | 3.49  | 3.381 | 3.268
      MultiGet (w/ batching) : 4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62

      Medium locality (Stride length 4096)
      Get                    : 4.012 | 3.922 | 3.768 | 3.61  | 3.582 | 3.555
      MultiGet (no batching) : 4.364 | 4.057 | 3.791 | 3.65  | 3.57  | 3.465
      MultiGet (w/ batching) : 4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
      
      db_bench command used (on a DB with 4 levels, 12 million keys):
      TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
      
      Differential Revision: D14348703
      
      Pulled By: anand1976
      
      fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
  14. 06 Apr 2019 (1 commit)
    • Expose DB methods to lock and unlock the WAL (#5146) · 39c6c5fc
      Sergei Glushchenko committed
      Summary:
      Expose DB methods to lock and unlock the WAL.
      
      These methods are intended to be used by MyRocks in order to obtain WAL
      coordinates in a consistent way.
      
      The usage scenario is the following:
      
      MySQL has performance_schema.log_status, which provides information that
      enables a backup tool to copy the required log files without locking them for
      the duration of the copy. To populate this table, MySQL does the following:
      
      1. Lock the binary log. Transactions are not allowed to commit now
      2. Save the binary log coordinates
      3. Walk through the storage engines and lock writes on each engine. For
         InnoDB, redo log is locked. For MyRocks, WAL should be locked.
      4. Ask storage engines for their coordinates. InnoDB reports its current
         LSN and checkpoint LSN. MyRocks should report active WAL files names
         and sizes.
      5. Release storage engine's locks
      6. Unlock binary log
      
      The backup tool will then use this information to copy InnoDB, RocksDB and
      MySQL binary logs up to the specified positions, ending up with a consistent DB
      state after restore.
      
      Currently, RocksDB allows obtaining the list of WAL files. The only missing
      bit is a method to lock writes to the WAL files.

      The LockWAL method must flush the WAL in order for the reported size to be
      accurate (GetSortedWALFiles uses a file system stat call to return the
      file size). Also, since the backup tool is going to copy the WAL, it is
      better for it to be flushed.
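
      A sketch of the intended step 3-5 usage from the scenario above; `LockWAL`/`UnlockWAL` are the methods this PR exposes, and the surrounding calls are assumed standard usage rather than code from this commit:

      ```cpp
      #include "rocksdb/db.h"
      #include "rocksdb/transaction_log.h"

      rocksdb::Status ReportWalCoordinates(rocksdb::DB* db) {
        // Step 3/4: block WAL writes (this also flushes the WAL per the PR), then
        // read the active WAL file names and sizes.
        rocksdb::Status s = db->LockWAL();
        if (!s.ok()) {
          return s;
        }

        rocksdb::VectorLogPtr wal_files;
        s = db->GetSortedWalFiles(wal_files);
        // ... report wal_files[i]->PathName() and SizeFileBytes() to the backup tool ...

        // Step 5: allow WAL writes again.
        rocksdb::Status unlock_status = db->UnlockWAL();
        return s.ok() ? unlock_status : s;
      }
      ```
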
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146
      
      Differential Revision: D14815447
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
  15. 04 Apr 2019 (1 commit)
  16. 03 Apr 2019 (1 commit)
    • WriteUnPrepared: less virtual in iterator callback (#5049) · 14b3f683
      Maysam Yabandeh committed
      Summary:
      WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
      The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and makes use of that to statically initialize the read callback and thus avoid the virtual call.
      Furthermore, it increases the minimum value of min_uncommitted from 0 to 1, as seq 0 is used only for last-level keys that are committed in all snapshots.
      
      The following benchmark shows +0.26% higher throughput in the seekrandom benchmark.
      
      Benchmark:
      ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
      
      ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
      seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec
      
      ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
      seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
      
      Differential Revision: D14366459
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
  17. 02 Apr 2019 (1 commit)
    • Add DBOptions.avoid_unnecessary_blocking_io to defer file deletions (#5043) · 120bc471
      Mike Kolupaev committed
      Summary:
      Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
      
      In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
      
      (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
      I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
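
      A minimal sketch of opting in to the option named in the title (assuming it lives on DBOptions/Options as described):

      ```cpp
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      rocksdb::Status OpenWithDeferredDeletes(const std::string& path, rocksdb::DB** db) {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Push file deletions triggered by CF drops (and iterator destruction) to a
        // background thread instead of blocking the calling thread.
        options.avoid_unnecessary_blocking_io = true;
        return rocksdb::DB::Open(options, path, db);
      }
      ```
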
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
      
      Differential Revision: D14339233
      
      Pulled By: al13n321
      
      fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
  18. 29 Mar 2019 (1 commit)
    • Smooth the deletion of WAL files (#5116) · dae3b554
      anand76 committed
      Summary:
      WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768
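
      A sketch of attaching an SstFileManager with a delete rate limit, which after this change also throttles WAL deletions; the rate value is illustrative:

      ```cpp
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/env.h"
      #include "rocksdb/options.h"
      #include "rocksdb/sst_file_manager.h"

      rocksdb::Status OpenWithDeleteRateLimit(const std::string& path, rocksdb::DB** db) {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Cap file deletions at ~64 MB/s so bursts of obsolete SST/WAL files do not
        // saturate the SSD.
        options.sst_file_manager.reset(rocksdb::NewSstFileManager(
            rocksdb::Env::Default(), /*info_log=*/nullptr, /*trash_dir=*/"",
            /*rate_bytes_per_sec=*/64 << 20));
        return rocksdb::DB::Open(options, path, db);
      }
      ```
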
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116
      
      Differential Revision: D14669437
      
      Pulled By: anand1976
      
      fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0
  19. 28 Mar 2019 (1 commit)
    • Apply automatic formatting to some files (#5114) · 89ab1381
      Siying Dong committed
      Summary:
      The following files were run through the automatic formatter:
      db/db_impl.cc
      db/db_impl.h
      db/db_impl_compaction_flush.cc
      db/db_impl_debug.cc
      db/db_impl_files.cc
      db/db_impl_readonly.h
      db/db_impl_write.cc
      db/dbformat.cc
      db/dbformat.h
      table/block.cc
      table/block.h
      table/block_based_filter_block.cc
      table/block_based_filter_block.h
      table/block_based_filter_block_test.cc
      table/block_based_table_builder.cc
      table/block_based_table_reader.cc
      table/block_based_table_reader.h
      table/block_builder.cc
      table/block_builder.h
      table/block_fetcher.cc
      table/block_prefix_index.cc
      table/block_prefix_index.h
      table/block_test.cc
      table/format.cc
      table/format.h
      
      I could easily run all the files, but I don't want people to feel that
      I'm doing it for lines of code changes :)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114
      
      Differential Revision: D14633040
      
      Pulled By: siying
      
      fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
  20. 27 Mar 2019 (1 commit)
    • Support for single-primary, multi-secondary instances (#4899) · 9358178e
      Yanqin Jin committed
      Summary:
      This PR allows RocksDB to run in single-primary, multi-secondary process mode.
      The writer is a regular RocksDB instance (e.g. a `DBImpl`) playing the role of a primary.
      Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, and WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
      
      This PR has several components:
      1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
      
      2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
      
      3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
      3.1 Tail the primary's MANIFEST during recovery.
      3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
      3.3 Tailing WAL will be in a future PR.
      
      4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of a secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
      
      Differential Revision: D14510945
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
  21. 26 Mar 2019 (1 commit)
  22. 01 Mar 2019 (1 commit)
    • Add two more StatsLevel (#5027) · 5e298f86
      Siying Dong committed
      Summary:
      Statistics cost too much CPU for some use cases. Add two stats levels
      so that people can choose to skip two types of expensive stats, timers and
      histograms.
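
      A sketch of selecting a cheaper level; the exact enum value name is an assumption based on the description (skip timers, or skip both timers and histograms):

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      void ConfigureCheapStats(rocksdb::Options* options) {
        options->statistics = rocksdb::CreateDBStatistics();
        // Keep ticker counters but skip the expensive timer and histogram updates.
        options->statistics->set_stats_level(rocksdb::StatsLevel::kExceptHistogramOrTimers);
      }
      ```
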
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
      
      Differential Revision: D14252765
      
      Pulled By: siying
      
      fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
  23. 21 Feb 2019 (1 commit)
    • add GetStatsHistory to retrieve stats snapshots (#4748) · c4f5d0aa
      Zhongyi Xie committed
      Summary:
      This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of a std::map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std::map from stats name to stats value (stored as std::string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval at which snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
      
      (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
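
      A minimal configuration sketch using the option names given above; `max_stats_history_count` is shown as named in this description even though the shipped option name could differ:

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      void EnableStatsHistory(rocksdb::Options* options) {
        options->statistics = rocksdb::CreateDBStatistics();
        options->stats_persist_period_sec = 600;  // snapshot stats every 10 minutes
        // options->max_stats_history_count = 10; // cap on in-memory snapshots, per the text above
      }
      ```
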
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
      
      Differential Revision: D13961471
      
      Pulled By: miasantreble
      
      fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
  24. 14 Feb 2019 (1 commit)
  25. 13 Feb 2019 (2 commits)
    • Atomic ingest (#4895) · a69d4dee
      Yanqin Jin committed
      Summary:
      Make file ingestion atomic.
      
       as title.
      Ingesting external SST files into multiple column families should be atomic. If
      a crash occurs and the db reopens, either all column families have successfully
      ingested the files before the crash, or none of the ingestions have any effect
      on the state of the db.
      
      Also add unit tests for atomic ingestion.
      
      Note that the unit test here does not cover the case of incomplete atomic group
      in the MANIFEST, which is covered in VersionSetTest already.
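
      A sketch of the multi-CF ingestion call this PR makes atomic; the file paths are hypothetical and the argument-struct form shown here is an assumption based on this description:

      ```cpp
      #include <vector>
      #include "rocksdb/db.h"

      rocksdb::Status AtomicIngest(rocksdb::DB* db,
                                   rocksdb::ColumnFamilyHandle* cf1,
                                   rocksdb::ColumnFamilyHandle* cf2) {
        rocksdb::IngestExternalFileArg arg1;
        arg1.column_family = cf1;
        arg1.external_files = {"/tmp/cf1_data.sst"};  // hypothetical paths
        rocksdb::IngestExternalFileArg arg2;
        arg2.column_family = cf2;
        arg2.external_files = {"/tmp/cf2_data.sst"};

        // After a crash and reopen, either both column families see their files or
        // neither does.
        return db->IngestExternalFiles({arg1, arg2});
      }
      ```
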
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895
      
      Differential Revision: D13718245
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
    • Stats should be logged in INFO level (#4977) · 49ddd7ec
      Siying Dong committed
      Summary:
      Previously, stats were logged at the warning level. This was done that way because
      people reported that they weren't logged in MyRocks. However, we later learned that this turned
      out to be due to a bug in MyRocks, which is fixed in
      https://github.com/facebook/mysql-5.6/commit/79bb705e74b239d7030b724ea6bbd635eceec531
      
      Now we revert the stats logging to INFO level, so that it doesn't pollute the warning
      level logging.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977
      
      Differential Revision: D14058485
      
      Pulled By: siying
      
      fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6
  26. 06 Feb 2019 (1 commit)
    • BYTES_READ stats miscount for NotFound cases (#4938) · 8fe07332
      Siying Dong committed
      Summary:
      In NotFound cases, the BYTES_READ stat and perf_context.get_read_bytes are still increased. The amount of the increase is
      the size of whatever string or PinnableSlice the user passed in as the output data structure. This is wrong. Fix this by not
      increasing these two counters.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938
      
      Differential Revision: D13908963
      
      Pulled By: siying
      
      fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162
  27. 16 Jan 2019 (1 commit)
    • WritePrepared: Fix visible key compacted out by compaction (#4883) · 5d4fddfa
      Yi Wu committed
      Summary:
      With WritePrepared transactions, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grabs the list of snapshots at its beginning, and only considers those snapshots to dedup keys. Consider the case:
      ```
      seq = 1: put "foo" = "bar"
      seq = 2: transaction T: delete "foo", prepare
      seq = 3: compaction start
      seq = 4: take snapshot S
      seq = 5: transaction T: commit.
      ...
      seq = N: compaction iterator reached key "foo".
      ```
      When compaction starts, the list of snapshots is empty. Compaction doesn't take snapshot S into account. When it reaches "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible to any snapshot (which is wrong), and compact the value out.
      
      The fix is to explicitly take a snapshot before compaction grabs the list of snapshots. Compaction will then have to keep keys visible to this snapshot.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883
      
      Differential Revision: D13668775
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4
  28. 04 Jan 2019 (1 commit)
    • Refactor atomic flush result installation to MANIFEST (#4791) · a07175af
      Yanqin Jin committed
      Summary:
      as titled.
      Since different bg flush threads can flush different sets of column families
      (due to column family creation and drop), we decided not to let one thread
      perform atomic flush result installation for other threads. Bg flush threads
      will install their atomic flush results sequentially to the MANIFEST, using
      a condition variable, atomic_flush_install_cv_, to coordinate.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791
      
      Differential Revision: D13498930
      
      Pulled By: riversand963
      
      fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
  29. 03 Jan 2019 (3 commits)
    • Lock free MultiGet (#4754) · b9d6ecca
      Anand Ananthabhotla committed
      Summary:
      Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread-local cached SuperVersion for each column family in the list. This depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few attempts, we give up and lock the DB mutex.
      
      Tests:
      1. Unit tests
      2. make check
      3. db_bench -
      
      TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
      readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)
      
      Multireadrandom with batch size 1:
      multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
      
      Differential Revision: D13363550
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
    • Fix spelling errors (#4827) · 7d65bd5c
      Faustin Lammler committed
      Summary:
      Hi, Lintian, the Debian package checker, complains about a spelling error (spelling-error-in-binary).
      
      See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827
      
      Differential Revision: D13566362
      
      Pulled By: riversand963
      
      fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d
    • Remove an unused parameter (#4816) · ec68091d
      Yanqin Jin committed
      Summary:
      The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is
      not used. Remove it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816
      
      Differential Revision: D13543218
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153
  30. 21 Dec 2018 (1 commit)
    • Introduce a CPU time counter in perf_context (#4741) · da1c64b6
      Siying Dong committed
      Summary:
      Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
      Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
      Make PerfStepTimer take an Env as an argument, and sometimes pass it in. The direct reason is to let the unit tests use SpecialEnv, where we can inject logic. But in the long term, this is a good change.
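
      A sketch of reading the new counter around a Get; the perf-level plumbing (`SetPerfLevel`, `get_perf_context`) is the usual perf_context usage and is assumed here rather than quoted from this commit:

      ```cpp
      #include <cstdint>
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      uint64_t CpuNanosForGet(rocksdb::DB* db, const rocksdb::Slice& key) {
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeAndCPUTimeExceptForMutex);
        rocksdb::get_perf_context()->Reset();

        std::string value;
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &value);
        (void)s;

        // CPU nanoseconds spent in Get on this thread (Posix Env only for now).
        return rocksdb::get_perf_context()->get_cpu_nanos;
      }
      ```
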
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
      
      Differential Revision: D13287798
      
      Pulled By: siying
      
      fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
  31. 20 Dec 2018 (1 commit)
  32. 18 Dec 2018 (1 commit)
  33. 14 Dec 2018 (1 commit)
    • Fix race condition on options_file_number_ (#4780) · 34954233
      Maysam Yabandeh committed
      Summary:
      options_file_number_ must be written under db::mutex_ since its read is protected by mutex_ in ::GetLiveFiles(). However, currently it is written in ::RenameTempFileToOptionsFile(), which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by also acquiring mutex_ before writing options_file_number_. Also, it does that only if the rename of the options file is successful.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780
      
      Differential Revision: D13461411
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b