- 16 11月, 2019 1 次提交
-
-
由 Little-Wallace 提交于
Summary: When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock. The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable. Fixes https://github.com/facebook/rocksdb/issues/5974 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976 Differential Revision: D18535943 Pulled By: maysamyabandeh fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc
-
- 30 10月, 2019 1 次提交
-
-
由 sdong 提交于
Summary: In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush(). This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716 Test Plan: Run all tests with ASAN and TSAN. Differential Revision: D18217658 fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
-
- 13 9月, 2019 1 次提交
-
-
由 Lingjing You 提交于
Summary: Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it. Bench result (qps): `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4` master: | batch size \ thread num | 1 | 2 | 4 | 8 | | ----------------------- | ------- | ------- | ------- | ------- | | 1 | 387883 | 220790 | 308294 | 490998 | | 10 | 1397208 | 978911 | 1275684 | 1733395 | | 100 | 2045414 | 1589927 | 1798782 | 2681039 | | 1000 | 2228038 | 1698252 | 1839877 | 2863490 | fillseq with writebatch hint: | batch size \ thread num | 1 | 2 | 4 | 8 | | ----------------------- | ------- | ------- | ------- | ------- | | 1 | 286005 | 223570 | 300024 | 466981 | | 10 | 970374 | 813308 | 1399299 | 1753588 | | 100 | 1962768 | 1983023 | 2676577 | 3086426 | | 1000 | 2195853 | 2676782 | 3231048 | 3638143 | Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728 Differential Revision: D17297240 fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
-
- 07 9月, 2019 1 次提交
-
-
由 sdong 提交于
Summary: When building with clang 9, warning is reported for InternalDBStatsType type names shadowed the one for statistics. Rename them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5779 Test Plan: Build with clang 9 and see it passes. Differential Revision: D17239378 fbshipit-source-id: af28fb42066c738cd1b841f9fe21ab4671dafd18
-
- 24 8月, 2019 1 次提交
-
-
由 Zhongyi Xie 提交于
Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
-
- 26 7月, 2019 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`. We address these issues in this PR by doing the following. 1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc. 2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`. 3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` If the API extension looks good, I will add more unit tests. Some simple benchmark using db_bench. ``` $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true ``` Master is at a78503bd. ``` | | readrandom | fillrandom | | master | 15.53 MB/s | 25.97 MB/s | | PR5502 | 16.70 MB/s | 25.80 MB/s | ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502 Differential Revision: D16340894 Pulled By: riversand963 fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
-
- 23 7月, 2019 1 次提交
-
-
由 Manuel Ung 提交于
Summary: There are a number of fixes in this PR (with most bugs found via the added stress tests): 1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`. 2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration. 3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead. 4. Add stress test to test reading own writes. 5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_. 6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value. 7. Add some more comments everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573 Differential Revision: D16276089 Pulled By: lth fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
-
- 02 7月, 2019 1 次提交
-
-
由 Zhongyi Xie 提交于
Summary: WAL records RocksDB writes to all column families. When user flushes a a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation) Because of this, if there is a column family that receive very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced persistent stats column family which is a good example of such column families. Depending on the config, it may have long intervals between writes, and user is unaware of it which makes it difficult to call manual flush for it. This PR addresses the problem for persistent stats column family by forcing a flush for persistent stats column family when 1) another column family is flushed 2) persistent stats column family's log number is the smallest among all column families, this way persistent stats column family will keep advancing its log number when necessary, allowing RocksDB to delete old WAL files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509 Differential Revision: D16045896 Pulled By: miasantreble fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
-
- 11 6月, 2019 1 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: The patch reduces the contention over prepared_mutex_ using these techniques: 1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues. 2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop() 3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420 Differential Revision: D15741985 Pulled By: maysamyabandeh fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
-
- 07 6月, 2019 1 次提交
-
-
由 Zhongyi Xie 提交于
Summary: When using `PRIu64` type of printf specifier, current code base does the following: ``` #ifndef __STDC_FORMAT_MACROS #define __STDC_FORMAT_MACROS #endif #include <inttypes.h> ``` However, this can be simplified to ``` #include <cinttypes> ``` as long as flag `-std=c++11` is used. This should solve issues like https://github.com/facebook/rocksdb/issues/5159 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402 Differential Revision: D15701195 Pulled By: miasantreble fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
-
- 06 6月, 2019 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now). Test plan (on devserver): ``` $COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` All tests must pass. We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled. ``` $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 ``` Repeat for 6 times for both versions. Results are as follows: ``` | | readrandom | fillrandom | | master | 16.77 MB/s | 47.05 MB/s | | PR5079 | 16.44 MB/s | 47.03 MB/s | ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079 Differential Revision: D15132946 Pulled By: riversand963 fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
-
- 01 6月, 2019 1 次提交
-
-
由 Vijay Nadimpalli 提交于
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
-
- 31 5月, 2019 1 次提交
-
-
由 Siying Dong 提交于
Summary: There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo ve them to a new directory test_util/, so that util/ is cleaner. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377 Differential Revision: D15551366 Pulled By: siying fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
-
- 20 5月, 2019 1 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes. The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313 Differential Revision: D15379977 Pulled By: maysamyabandeh fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
-
- 16 5月, 2019 1 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
-
- 14 5月, 2019 1 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
-
- 16 4月, 2019 1 次提交
-
-
由 Vijay Nadimpalli 提交于
Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188) Summary: Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188 Differential Revision: D14942832 Pulled By: vjnadimpalli fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
-
- 05 4月, 2019 1 次提交
-
-
由 Adam Simpkins 提交于
Summary: Annotate all of the logging functions to inform the compiler that these use printf-style formatting arguments. This allows the compiler to emit warnings if the format arguments are incorrect. This also fixes many problems reported now that format string checking is enabled. Many of these are simply mix-ups in the argument type (e.g, int vs uint64_t), but in several cases the wrong number of arguments were being passed in which can cause the code to crash. The primary motivation for this was to fix the log message in `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s format parameter with no argument supplied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089 Differential Revision: D14574795 Pulled By: simpkins fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
-
- 03 4月, 2019 1 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121 Differential Revision: D14665210 Pulled By: maysamyabandeh fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
-
- 28 3月, 2019 1 次提交
-
-
由 Siying Dong 提交于
Summary: Following files were run through automatic formatter: db/db_impl.cc db/db_impl.h db/db_impl_compaction_flush.cc db/db_impl_debug.cc db/db_impl_files.cc db/db_impl_readonly.h db/db_impl_write.cc db/dbformat.cc db/dbformat.h table/block.cc table/block.h table/block_based_filter_block.cc table/block_based_filter_block.h table/block_based_filter_block_test.cc table/block_based_table_builder.cc table/block_based_table_reader.cc table/block_based_table_reader.h table/block_builder.cc table/block_builder.h table/block_fetcher.cc table/block_prefix_index.cc table/block_prefix_index.h table/block_test.cc table/format.cc table/format.h I could easily run all the files, but I don't want people to feel that I'm doing it for lines of code changes :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114 Differential Revision: D14633040 Pulled By: siying fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
-
- 16 3月, 2019 1 次提交
-
-
由 anand76 提交于
Summary: There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order set the background error since the WriteBuffer() failure may result in data loss. Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072 Differential Revision: D14461384 Pulled By: anand1976 fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
-
- 01 3月, 2019 2 次提交
-
-
由 Maysam Yabandeh 提交于
Summary: PreReleaseCallback meant to be called before the writes are visible to the readers. Since the sequence number is known after the WAL write, there is no reason to delay calling PreReleaseCallback to after the memtable write, which would complicates the reader's logic in presence of our memtable writes that are made visible by the other write thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5015 Differential Revision: D14221670 Pulled By: maysamyabandeh fbshipit-source-id: a504dd665cf923226d7af09cc8e9c7739a25edc6
-
由 Siying Dong 提交于
Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
-
- 30 1月, 2019 1 次提交
-
-
由 Alexander Zinoviev 提交于
Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
-
- 04 1月, 2019 1 次提交
-
-
由 DorianZheng 提交于
Summary: The original implementation has two problems: 1. https://github.com/facebook/rocksdb/blob/f0dda35d7de1fd56e0b7c96376ca8aff2a6364fd/db/db_impl_write.cc#L478 https://github.com/facebook/rocksdb/blob/f0dda35d7de1fd56e0b7c96376ca8aff2a6364fd/db/write_thread.h#L231 If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss. 2. https://github.com/facebook/rocksdb/blob/f0dda35d7de1fd56e0b7c96376ca8aff2a6364fd/db/write_thread.h#L130 The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838 Differential Revision: D13574070 Pulled By: yiwu-arbug fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b
-
- 03 1月, 2019 1 次提交
-
-
由 Faustin Lammler 提交于
Summary: Hi, Lintian, the Debian package checker complains about spelling error (spelling-error-in-binary). See https://salsa.debian.org/mariadb-team/mariadb-10.3/-/jobs/98380 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4827 Differential Revision: D13566362 Pulled By: riversand963 fbshipit-source-id: cd4e9212133c73b0591030de6cdedaa47575968d
-
- 13 11月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: In the past, both `DBImpl::atomic_flush_` and `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is set correctly. This does not lead to incorrect behavior, but is a duplicate of information. Since `immutable_db_options_` is always there and has `atomic_flush`, we should use it as source of truth and remove `DBImpl::atomic_flush_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631 Differential Revision: D12928371 Pulled By: riversand963 fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
-
- 10 11月, 2018 1 次提交
-
-
由 Sagar Vemuri 提交于
Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
-
- 30 10月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: For flush triggered by RocksDB due to memory usage approaching certain threshold (WriteBufferManager or Memtable full), we should cut the memtable only when the current active memtable is not empty, i.e. contains data. This is what we do for non-atomic flush. If we always cut memtable even when the active memtable is empty, we will generate extra, empty immutable memtable. This is not ideal since it may cause write stall. It also causes some DBAtomicFlushTest to fail because cfd->imm()->NumNotFlushed() is different from expectation. Test plan ``` $make clean && make J=1 -j32 all check $make clean && OPT="-DROCKSDB_LITE -g" make J=1 -j32 all check $make clean && TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make J=1 -j32 valgrind_test ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4595 Differential Revision: D12818520 Pulled By: riversand963 fbshipit-source-id: d867bdbeacf4199fdd642debb085f94703c41a18
-
- 27 10月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f
-
- 13 10月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
-
- 11 10月, 2018 1 次提交
-
-
由 UncP 提交于
Summary: in `WriteThread::LaunchParallelMemTableWriters`, there is ` write_group->running.store(write_group->size); ` https://github.com/facebook/rocksdb/blob/master/db/write_thread.cc#L510 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4450 Differential Revision: D10201900 Pulled By: yiwu-arbug fbshipit-source-id: 96c8fbbba5aff7ba8a6ceb3117a2bd7cc9b2f34b
-
- 10 10月, 2018 1 次提交
-
-
由 Anand Ananthabhotla 提交于
Summary: There is a bug when the write queue leader is blocked on a write delay/stop, and the queue has writers with WriteOptions::no_slowdown set to true. They are not woken up until the write stall is cleared. The fix introduces a dummy writer inserted at the tail to indicate a write stall and prevent further inserts into the queue, and a condition variable that writers who can tolerate slowdown wait on before adding themselves to the queue. The leader calls WriteThread::BeginWriteStall() to add the dummy writer and then walk the queue to fail any writers with no_slowdown set. Once the stall clears, the leader calls WriteThread::EndWriteStall() to remove the dummy writer and signal the condition variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475 Differential Revision: D10285827 Pulled By: anand1976 fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
-
- 16 9月, 2018 1 次提交
-
-
由 Anand Ananthabhotla 提交于
Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
-
- 25 8月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB. This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it. This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952 Differential Revision: D8529933 Pulled By: riversand963 fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
-
- 24 8月, 2018 1 次提交
-
-
由 Yanqin Jin 提交于
Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
-
- 07 8月, 2018 1 次提交
-
-
由 Jingguo Yao 提交于
Summary: HashMayMatch is related to AddKey() instead of CreateFilter(). Also applies some minor Fixes #4191 #4200 #3910 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202 Differential Revision: D9180945 Pulled By: maysamyabandeh fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
-
- 01 8月, 2018 1 次提交
-
-
由 Sagar Vemuri 提交于
Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
-
- 24 7月, 2018 1 次提交
-
-
由 Manuel Ung 提交于
Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
-
- 29 6月, 2018 1 次提交
-
-
由 Anand Ananthabhotla 提交于
Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
-