Commit 5a92c415, author: Steven Li

Enhanced crash_gen tool to run clusters, with a new README file

Parent f7a0b6b8
<center><h1>User's Guide to the Crash_Gen Tool</h1></center>
# Introduction
To effectively test and debug our TDengine product, we have developed a simple tool that
exercises various functions of the system in a randomized fashion, aiming to expose
as many problems as possible without relying on a pre-determined scenario.
# Preparation
To run this tool, please ensure the following preparation work is done first.
1. Fetch a copy of the TDengine source code, and build it successfully in the `build/`
directory
1. Ensure that the system has Python 3.8 or above properly installed. We use
Ubuntu 20.04 LTS as our own development environment, and suggest you also use such
an environment if possible.
# Simple Execution
To run the tool with the simplest method, follow the steps below:
1. Open a terminal window, start the `taosd` service in the `build/` directory
(or however you prefer to start the `taosd` service)
1. Open another terminal window, go into the `tests/pytest/` directory, and
run `./crash_gen.sh -p -t 3 -s 10` (change the two parameters here as you wish)
1. Watch the output to the end and see if you get a `SUCCESS` or `FAILURE`
That's it!
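The final check in step 3 can also be scripted. The following is a minimal sketch, not part of the tool: it assumes the script reports its verdict by printing the literal word `SUCCESS` or `FAILURE` somewhere in its output, as described above, and the helper names `classify_output` and `run_crash_gen` are hypothetical.

```python
import subprocess
from typing import Optional


def classify_output(output: str) -> Optional[bool]:
    """Return True for SUCCESS, False for FAILURE, None if neither word appears.

    Assumption: the verdict is one of these literal words in the output.
    FAILURE is checked first, so a run that prints both counts as failed.
    """
    if "FAILURE" in output:
        return False
    if "SUCCESS" in output:
        return True
    return None


def run_crash_gen(threads: int = 3, steps: int = 10) -> Optional[bool]:
    """Run the tool from the tests/pytest/ directory and classify its output."""
    proc = subprocess.run(
        ["./crash_gen.sh", "-p", "-t", str(threads), "-s", str(steps)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so no message is missed
        text=True,
    )
    return classify_output(proc.stdout)
```

Calling `run_crash_gen()` from `tests/pytest/` with a `taosd` service already running corresponds to the command in step 2.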
# Running Clusters
This tool also makes it easy to test/verify the clustering capabilities of TDengine. You
can start a cluster quite easily with the following command:
```
$ cd tests/pytest/
$ ./crash_gen.sh -e -o 3
```
The `-e` option above tells the tool to start the service without running any tests, while
the `-o 3` option tells the tool to start 3 DNodes and join them together in a cluster.
You can adjust the number of DNodes as needed.
## Behind the Scenes
When the tool runs a cluster, it uses a number of directories, each holding the information
for a single DNode:
```
$ ls build/cluster*
build/cluster_dnode_0:
cfg data log
build/cluster_dnode_1:
cfg data log
build/cluster_dnode_2:
cfg data log
```
Therefore, when something goes wrong and you want to reset the whole cluster, simply
erase all of these directories:
```
$ rm -rf build/cluster_dnode_*
```
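The same reset can be done from Python, for example inside a test harness. This is a minimal sketch under the assumption that the directories follow the `cluster_dnode_<N>` naming shown above; the function name `reset_cluster` is hypothetical and not part of the tool.

```python
import shutil
from pathlib import Path
from typing import List


def reset_cluster(build_dir: str = "build") -> List[str]:
    """Delete every cluster_dnode_* directory under build_dir.

    Returns the names of the directories that were removed, in sorted order.
    """
    removed = []
    for node_dir in sorted(Path(build_dir).glob("cluster_dnode_*")):
        if node_dir.is_dir():
            shutil.rmtree(node_dir)  # removes cfg, data and log as well
            removed.append(node_dir.name)
    return removed
```

Like the `rm -rf` command, this is destructive, so make sure no DNode is still running against these directories.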
## Addresses and Ports
The DNodes in the cluster all bind to the `127.0.0.1` IP address (for now anyway). The
first DNode uses port 6030, the second uses port 6130, and so on.
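The port scheme can be written down as a small formula. The sketch below assumes that "and so on" means a constant stride of 100 between consecutive DNodes, which is inferred from the two ports given above rather than confirmed; `dnode_endpoint` is a hypothetical helper.

```python
BASE_PORT = 6030    # port of the first DNode, as stated above
PORT_STRIDE = 100   # assumption: inferred from 6030 -> 6130


def dnode_endpoint(index: int, host: str = "127.0.0.1") -> str:
    """Return the host:port endpoint of the DNode with the given 0-based index."""
    if index < 0:
        raise ValueError("DNode index must be non-negative")
    return "{}:{}".format(host, BASE_PORT + index * PORT_STRIDE)


for i in range(3):
    print(i, dnode_endpoint(i))
```

Under that assumption, a 3-node cluster started with `-o 3` would listen on ports 6030, 6130 and 6230.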
## Testing Against a Cluster
In a separate terminal window, you can invoke the tool in client mode and test against
a cluster, such as:
```
$ ./crash_gen.sh -p -t 10 -s 100 -i 3
```
Here the `-i` option tells the tool to always create its databases with 3 replicas, and to
run all tests against the tables in them.
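In this commit, the replica count ends up in the `create database` statement issued by `TaskCreateDb`. The sketch below restates that logic as a standalone function so the effect of `-i` is easy to see; the function name `build_create_db_sql` and the database name `db_0` are hypothetical, and the real code reads the value from `gConfig.max_replicas`.

```python
def build_create_db_sql(db_name: str, max_replicas: int) -> str:
    """Build the SQL used to create a test database.

    Mirrors the commit's logic: the replica clause is only added when
    max_replicas differs from 1, and the count is fixed, not randomized.
    """
    rep_str = ""
    if max_replicas != 1:
        rep_str = "replica {}".format(max_replicas)
    return "create database {} {}".format(db_name, rep_str)


print(build_create_db_sql("db_0", 3))
```

With the default of 1 replica no clause is added, so single-node runs are unaffected.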
# Additional Features
The full list of the tool's features is available through the `-h` option:
```
$ ./crash_gen.sh -h
usage: crash_gen_bootstrap.py [-h] [-a] [-b MAX_DBS] [-c CONNECTOR_TYPE] [-d] [-e] [-g IGNORE_ERRORS] [-i MAX_REPLICAS] [-l] [-n] [-o NUM_DNODES] [-p] [-r]
[-s MAX_STEPS] [-t NUM_THREADS] [-v] [-x]
TDengine Auto Crash Generator (PLEASE NOTICE the Prerequisites Below)
---------------------------------------------------------------------
1. You build TDengine in the top level ./build directory, as described in offical docs
2. You run the server there before this script: ./build/bin/taosd -c test/cfg
optional arguments:
-h, --help show this help message and exit
-a, --auto-start-service
Automatically start/stop the TDengine service (default: false)
-b MAX_DBS, --max-dbs MAX_DBS
Maximum number of DBs to keep, set to disable dropping DB. (default: 0)
-c CONNECTOR_TYPE, --connector-type CONNECTOR_TYPE
Connector type to use: native, rest, or mixed (default: 10)
-d, --debug Turn on DEBUG mode for more logging (default: false)
-e, --run-tdengine Run TDengine service in foreground (default: false)
-g IGNORE_ERRORS, --ignore-errors IGNORE_ERRORS
Ignore error codes, comma separated, 0x supported (default: None)
-i MAX_REPLICAS, --max-replicas MAX_REPLICAS
Maximum number of replicas to use, when testing against clusters. (default: 1)
-l, --larger-data Write larger amount of data during write operations (default: false)
-n, --dynamic-db-table-names
Use non-fixed names for dbs/tables, useful for multi-instance executions (default: false)
-o NUM_DNODES, --num-dnodes NUM_DNODES
Number of Dnodes to initialize, used with -e option. (default: 1)
-p, --per-thread-db-connection
Use a single shared db connection (default: false)
-r, --record-ops Use a pair of always-fsynced fils to record operations performing + performed, for power-off tests (default: false)
-s MAX_STEPS, --max-steps MAX_STEPS
Maximum number of steps to run (default: 100)
-t NUM_THREADS, --num-threads NUM_THREADS
Number of threads to run (default: 10)
-v, --verify-data Verify data written in a number of places by reading back (default: false)
-x, --continue-on-exception
Continue execution after encountering unexpected/disallowed errors/exceptions (default: false)
```
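The `-o` option in the listing above is declared with `argparse` in this commit. The following self-contained sketch reproduces that declaration next to a `-e` flag to show how the two combine. The `-e` declaration is an assumption reconstructed from the help text, and `build_parser` is a hypothetical name; the real tool defines many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the two options used to start a cluster."""
    parser = argparse.ArgumentParser(description="Sketch of two crash_gen options")
    parser.add_argument(
        '-e',
        '--run-tdengine',
        action='store_true',  # assumption: a plain boolean flag
        help='Run TDengine service in foreground (default: false)')
    parser.add_argument(
        '-o',
        '--num-dnodes',
        action='store',
        default=1,
        type=int,
        help='Number of Dnodes to initialize, used with -e option. (default: 1)')
    return parser


# Equivalent of: ./crash_gen.sh -e -o 3
config = build_parser().parse_args(['-e', '-o', '3'])
print(config.run_tdengine, config.num_dnodes)
```

Because `type=int` is set, `argparse` rejects a non-numeric value for `-o` before any service is started.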
@@ -18,6 +18,7 @@ from __future__ import annotations
 from typing import Set
 from typing import Dict
 from typing import List
+from typing import Optional # Type hinting, ref: https://stackoverflow.com/questions/19202633/python-3-type-hinting-for-none
 import textwrap
 import time
@@ -62,9 +63,10 @@ gContainer: Container
 class WorkerThread:
-    def __init__(self, pool: ThreadPool, tid, tc: ThreadCoordinator,
-                 # te: TaskExecutor,
-                 ): # note: main thread context!
+    def __init__(self, pool: ThreadPool, tid, tc: ThreadCoordinator):
+        """
+            Note: this runs in the main thread context
+        """
         # self._curStep = -1
         self._pool = pool
         self._tid = tid
@@ -1007,6 +1009,8 @@ class Database:
     possibly in a cluster environment.
     For now we use it to manage state transitions in that database
+    TODO: consider moving, but keep in mind it contains "StateMachine"
     '''
     _clsLock = threading.Lock() # class wide lock
     _lastInt = 101 # next one is initial integer
@@ -1182,7 +1186,7 @@ class Task():
     def __init__(self, execStats: ExecutionStats, db: Database):
         self._workerThread = None
-        self._err = None # type: Exception
+        self._err: Optional[Exception] = None
         self._aborted = False
         self._curStep = None
         self._numRows = None # Number of rows affected
@@ -1318,10 +1322,11 @@ class Task():
             self._aborted = True
             traceback.print_exc()
         except BaseException: # TODO: what is this again??!!
-            self.logDebug(
-                "[=] Unexpected exception, SQL: {}".format(
-                    wt.getDbConn().getLastSql()))
-            raise
+            raise RuntimeError("Punt")
+            # self.logDebug(
+            # "[=] Unexpected exception, SQL: {}".format(
+            # wt.getDbConn().getLastSql()))
+            # raise
         self._execStats.endTaskType(self.__class__.__name__, self.isSuccess())
         self.logDebug("[X] task execution completed, {}, status: {}".format(
@@ -1498,7 +1503,8 @@ class TaskCreateDb(StateTransitionTask):
         # was: self.execWtSql(wt, "create database db")
         repStr = ""
         if gConfig.max_replicas != 1:
-            numReplica = Dice.throw(gConfig.max_replicas) + 1 # 1,2 ... N
+            # numReplica = Dice.throw(gConfig.max_replicas) + 1 # 1,2 ... N
+            numReplica = gConfig.max_replicas # fixed, always
             repStr = "replica {}".format(numReplica)
         self.execWtSql(wt, "create database {} {}"
                        .format(self._db.getName(), repStr) )
@@ -2050,7 +2056,7 @@ class ClientManager:
 class MainExec:
     def __init__(self):
         self._clientMgr = None
-        self._svcMgr = None
+        self._svcMgr = None # type: ServiceManager
         signal.signal(signal.SIGTERM, self.sigIntHandler)
         signal.signal(signal.SIGINT, self.sigIntHandler)
@@ -2063,17 +2069,16 @@ class MainExec:
             self._svcMgr.sigUsrHandler(signalNumber, frame)
     def sigIntHandler(self, signalNumber, frame):
-        if self._svcMgr:
+        if self._svcMgr:
             self._svcMgr.sigIntHandler(signalNumber, frame)
-        if self._clientMgr:
+        if self._clientMgr:
             self._clientMgr.sigIntHandler(signalNumber, frame)
     def runClient(self):
         global gSvcMgr
         if gConfig.auto_start_service:
-            self._svcMgr = ServiceManager()
-            gSvcMgr = self._svcMgr # hack alert
-            self._svcMgr.startTaosService() # we start, don't run
+            gSvcMgr = self._svcMgr = ServiceManager() # hack alert
+            gSvcMgr.startTaosService() # we start, don't run
         self._clientMgr = ClientManager()
         ret = None
@@ -2086,12 +2091,10 @@ class MainExec:
     def runService(self):
         global gSvcMgr
-        self._svcMgr = ServiceManager()
-        gSvcMgr = self._svcMgr # save it in a global variable TODO: hack alert
+        gSvcMgr = self._svcMgr = ServiceManager(gConfig.num_dnodes) # save it in a global variable TODO: hack alert
-        self._svcMgr.run() # run to some end state
-        self._svcMgr = None
-        gSvcMgr = None
+        gSvcMgr.run() # run to some end state
+        gSvcMgr = self._svcMgr = None
     def init(self): # TODO: refactor
         global gContainer
@@ -2165,6 +2168,13 @@ class MainExec:
             '--dynamic-db-table-names',
             action='store_true',
             help='Use non-fixed names for dbs/tables, useful for multi-instance executions (default: false)')
+        parser.add_argument(
+            '-o',
+            '--num-dnodes',
+            action='store',
+            default=1,
+            type=int,
+            help='Number of Dnodes to initialize, used with -e option. (default: 1)')
         parser.add_argument(
             '-p',
             '--per-thread-db-connection',
@@ -2209,7 +2219,12 @@ class MainExec:
     def run(self):
         if gConfig.run_tdengine: # run server
-            self.runService()
+            try:
+                self.runService()
+                return 0 # success
+            except ConnectionError as err:
+                Logging.error("Failed to make DB connection, please check DB instance manually")
+                return -1 # failure
         else:
             return self.runClient()
......
@@ -12,7 +12,9 @@ from util.cases import *
 from util.dnodes import *
 from util.log import *
-from .misc import Logging, CrashGenError, Helper
+from .misc import Logging, CrashGenError, Helper, Dice
+import os
+import datetime
 # from .service_manager import TdeInstance
 class DbConn:
@@ -44,6 +46,9 @@ class DbConn:
         self._lastSql = None
         self._dbTarget = dbTarget
+    def __repr__(self):
+        return "[DbConn: type={}, target={}]".format(self._type, self._dbTarget)
     def getLastSql(self):
         return self._lastSql
@@ -54,7 +59,7 @@ class DbConn:
         # below implemented by child classes
         self.openByType()
-        Logging.debug("[DB] data connection opened, type = {}".format(self._type))
+        Logging.debug("[DB] data connection opened: {}".format(self))
         self.isOpen = True
     def close(self):
@@ -277,15 +282,18 @@ class DbTarget:
         self.cfgPath = cfgPath
         self.hostAddr = hostAddr
         self.port = port
     def __repr__(self):
         return "[DbTarget: cfgPath={}, host={}:{}]".format(
-            self.cfgPath, self.hostAddr, self.port)
+            Helper.getFriendlyPath(self.cfgPath), self.hostAddr, self.port)
+    def getEp(self):
+        return "{}:{}".format(self.hostAddr, self.port)
 class DbConnNative(DbConn):
     # Class variables
     _lock = threading.Lock()
-    _connInfoDisplayed = False
+    # _connInfoDisplayed = False # TODO: find another way to display this
     totalConnections = 0 # Not private
     def __init__(self, dbTarget):
@@ -304,9 +312,9 @@ class DbConnNative(DbConn):
         cls = self.__class__ # Get the class, to access class variables
         with cls._lock: # force single threading for opening DB connections. # TODO: whaaat??!!!
             dbTarget = self._dbTarget
-            if not cls._connInfoDisplayed:
-                cls._connInfoDisplayed = True # updating CLASS variable
-                Logging.info("Initiating TAOS native connection to {}".format(dbTarget))
+            # if not cls._connInfoDisplayed:
+            # cls._connInfoDisplayed = True # updating CLASS variable
+            Logging.debug("Initiating TAOS native connection to {}".format(dbTarget))
             # Make the connection
             # self._conn = taos.connect(host=hostAddr, config=cfgPath) # TODO: make configurable
             # self._cursor = self._conn.cursor()
@@ -424,3 +432,4 @@ class DbManager():
     def cleanUp(self):
         self._dbConn.close()
 import threading
 import random
 import logging
+import os
 class CrashGenError(Exception):
@@ -26,7 +27,7 @@ class LoggingFilter(logging.Filter):
 class MyLoggingAdapter(logging.LoggerAdapter):
     def process(self, msg, kwargs):
-        return "[{}]{}".format(threading.get_ident() % 10000, msg), kwargs
+        return "[{}] {}".format(threading.get_ident() % 10000, msg), kwargs
         # return '[%s] %s' % (self.extra['connid'], msg), kwargs
@@ -71,12 +72,44 @@ class Logging:
     def warning(cls, msg):
         cls.logger.warning(msg)
+    @classmethod
+    def error(cls, msg):
+        cls.logger.error(msg)
+class Status:
+    STATUS_STARTING = 1
+    STATUS_RUNNING = 2
+    STATUS_STOPPING = 3
+    STATUS_STOPPED = 4
+    def __init__(self, status):
+        self.set(status)
+    def __repr__(self):
+        return "[Status: v={}]".format(self._status)
+    def set(self, status):
+        self._status = status
+    def get(self):
+        return self._status
+    def isStarting(self):
+        return self._status == Status.STATUS_STARTING
+    def isRunning(self):
+        # return self._thread and self._thread.is_alive()
+        return self._status == Status.STATUS_RUNNING
+    def isStopping(self):
+        return self._status == Status.STATUS_STOPPING
+    def isStopped(self):
+        return self._status == Status.STATUS_STOPPED
+    def isStable(self):
+        return self.isRunning() or self.isStopped()
 # Deterministic random number generator
 class Dice():
     seeded = False # static, uninitialized
@@ -118,14 +151,23 @@ class Helper:
     def convertErrno(cls, errno):
         return errno if (errno > 0) else 0x80000000 + errno
+    @classmethod
+    def getFriendlyPath(cls, path): # returns .../xxx/yyy
+        ht1 = os.path.split(path)
+        ht2 = os.path.split(ht1[0])
+        return ".../" + ht2[1] + '/' + ht1[1]
 class Progress:
     STEP_BOUNDARY = 0
     BEGIN_THREAD_STEP = 1
     END_THREAD_STEP = 2
+    SERVICE_HEART_BEAT= 3
     tokens = {
         STEP_BOUNDARY: '.',
         BEGIN_THREAD_STEP: '[',
-        END_THREAD_STEP: '] '
+        END_THREAD_STEP: '] ',
+        SERVICE_HEART_BEAT: '.Y.'
     }
     @classmethod
......