Symptoms and tests to confirm
The entire cluster becomes unstable, with the Cluster Master showing indexers flapping between Up and Down. The environment sits behind a farm of two-layer proxy servers in front of S3.
You will see intermittent HTTP errors when uploading to SmartStore:
10-07-2019 15:13:42.821 +0100 ERROR RetryableClientTransaction - transactionDone(): groupId=(nil) rTxnId=… transactionId=…. success=N HTTP-statusCode=502 HTTP-statusDescription="network error" retries=0 retry=N no_retry_reason="no retry policy" remainingTxns=0
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~….|", status=failed, unable to check if receipt exists at path=_internal/db/…/receipt.json(0,-1,), error="network error"
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, cache_id="bid|_internal~…|", status=failed, elapsed_ms=15016
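To gauge how widespread these failures are, you can count the failed-upload CacheManager errors in splunkd.log. A minimal sketch; it builds a self-contained sample file here, but on a real peer you would point the grep at $SPLUNK_HOME/var/log/splunk/splunkd.log:

```shell
# Create a small sample log (stand-in for splunkd.log) using the error shape above.
cat > sample_splunkd.log <<'EOF'
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, status=failed, error="network error"
10-07-2019 15:13:42.821 +0100 ERROR CacheManager - action=upload, status=failed, elapsed_ms=15016
10-07-2019 15:13:43.100 +0100 INFO  CacheManager - action=upload, status=succeeded
EOF
# Count only the failed uploads.
grep -c 'CacheManager.*action=upload.*status=failed' sample_splunkd.log   # prints 2
```

A sudden spike in this count lining up with 502s from the proxy is a strong hint the problem is the network path, not the indexer itself.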
Crash logs contain:
[build 7651b7244cf2] 2019-10-07 11:17:36
Received fatal signal 6 (Aborted).
Cause:
Signal sent by PID 2599 running under UID 0.
Crashing thread: cachemanagerUploadExecutorWorker-180
Testing: ./splunk cmd splunkd rfs -- ls --starts-with volume:XXXXXXX returns no results because the connection times out with 502 Bad Gateway.
Testing: wget against the AWS S3 endpoint also returns 502 Bad Gateway.
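The same probe can be scripted so peers report the failure class consistently. A minimal sketch; the classify_status helper and the endpoint shown are illustrative, not part of any Splunk tooling:

```shell
# Classify an HTTP status code the way the tests above interpret it.
classify_status() {
  case "$1" in
    2??|3??) echo "reachable" ;;
    502|504) echo "bad-gateway-or-timeout" ;;   # what the proxy returned in this case
    *)       echo "connection-problem" ;;
  esac
}

# Live probe (requires network access; commented out to keep the sketch self-contained):
# status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 https://s3-us-west-2.amazonaws.com)
classify_status 502   # prints bad-gateway-or-timeout
```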
To confirm the issue with a reproduction:
Step 1. Set the parameters below in server.conf to 200:
[cachemanager]
max_concurrent_downloads = 200
max_concurrent_uploads = 200
Step 2. Block the connection from peers to S3 using
echo "127.0.0.1 s3-us-west-2.amazonaws.com" >> /etc/hosts
What was observed:
1. Peers were unable to upload buckets to remote storage (expected, given the block).
2. Peers constantly retried uploading the buckets.
3. Peers were marked Down by the CM: they could not heartbeat to the CM because so many threads were busy retrying uploads in parallel, which put extra pressure on the CMSlave lock.
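Observation #3 is a classic starvation pattern: one thread holds a shared lock across a slow remote call, and the heartbeat misses its deadline waiting for it. A toy sketch of the pattern, using flock(1) as a stand-in for the CMSlave mutex (process names and timings are illustrative):

```shell
lockfile=$(mktemp)

# "Uploader": grabs the lock and holds it for the duration of a slow S3 call.
( flock 9; sleep 3 ) 9>"$lockfile" &
sleep 0.5   # let the uploader acquire the lock first

# "Heartbeat": needs the same lock but can only wait 1 second before its deadline.
if flock -w 1 9 9>"$lockfile"; then result="sent"; else result="missed"; fi
echo "heartbeat $result"   # prints: heartbeat missed
wait
```

Multiply the "uploader" by 200 concurrent workers, each blocking on a dead S3 endpoint, and the heartbeat thread rarely gets the lock in time, so the CM marks the peer Down.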
Below is the pstack I collected from one of the indexers.
Thread holding the CMSlave lock while making an S3 HEAD request to check whether a file is present on S3:
Thread 14 (Thread 0x7f8b04dff700 (LWP 8834)):
0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
1 0x00005639b0d24e27 in EventLoop::run() ()
2 0x00005639b0dece00 in TcpOutboundLoop::run() ()
3 0x00005639b08928e9 in RetryableClientTransaction::_run_sync(bool) ()
4 0x00005639b0930c44 in S3StorageInterface::fileExists(StorageObject const&, Str*, RemoteRetryPolicy*) ()
5 0x00005639b04eb4b0 in cachemanager::CacheManagerBackEnd::isRemoteBucketPresent(cachemanager::CacheId const&, Pathname const&, bool, ScopedPointer*) const ()
6 0x00005639b04f2bc1 in cachemanager::CacheManagerBackEnd::isBucketStable(cachemanager::CacheId const&, cachemanager::CacheManagerBackEnd::CheckScope, bool, ScopedPointer*) ()
7 0x00005639b03435c7 in DatabaseDirectoryManager::isBucketStable(cachemanager::CacheId const&, cachemanager::CacheManagerBackEnd::CheckScope, bool, bool, ScopedPointer*) ()
8 0x00005639b0f92f64 in CMSlave::manageReplicatedBucketsTimeoutS2_locked() ()
9 0x00005639b0f93c9d in CMSlave::service(bool) ()
10 0x00005639b00e09f3 in CallbackRunnerThread::main() ()
11 0x00005639b0dedfa9 in Thread::callMain(void*) ()
12 0x00007f8b0d9614a4 in start_thread (arg=0x7f8b04dff700) at pthread_create.c:456
13 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Meanwhile, other threads, such as the heartbeat thread, are waiting for this lock to be released.
Heartbeat thread waiting for the lock:
Thread 60 (Thread 0x7f8afa7ff700 (LWP 9053)):
0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
1 0x00007f8b0d963bb5 in GI_pthread_mutex_lock (mutex=0x7f8b0d0818f8) at ../nptl/pthread_mutex_lock.c:80
2 0x00005639b0dedcd9 in PthreadMutexImpl::lock() ()
3 0x00005639b0f71f55 in CMSlave::getHbInfo(Str&, Str&, unsigned int&, CMPeerStatus::ManualDetention&, bool&, long&, unsigned long&) ()
4 0x00005639b1005b8c in CMHeartbeatThread::when_expired(Interval*) ()
5 0x00005639b0df634c in TimeoutHeap::runExpiredTimeouts(MonotonicTime&) ()
6 0x00005639b0d24d86 in EventLoop::run() ()
7 0x00005639b01225da in CMServiceThread::main() ()
8 0x00005639b0dedfa9 in Thread::callMain(void*) ()
9 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afa7ff700) at pthread_create.c:456
10 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Even searches can be blocked on this lock:
Thread 81 (Thread 0x7f8afe1ff700 (LWP 10428)):
0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
1 0x00007f8b0d963bb5 in GI_pthread_mutex_lock (mutex=0x7f8b0d0818f8) at ../nptl/pthread_mutex_lock.c:80
2 0x00005639b0dedcd9 in PthreadMutexImpl::lock() ()
3 0x00005639b0f948ec in CMSlave::writeBucketsToSearch(unsigned long, Clustering::SiteId const&, Clustering::SummaryAction, Str&) ()
4 0x00005639b13a0822 in DispatchCommand::dumpClusterSlaveBuckets(SearchResultsInfo&) ()
5 0x00005639b1429152 in StreamedSearchDataProvider::handleStreamConnectionImpl(HttpCompressingServerTransaction&, SearchResultsInfo*, Str*) ()
6 0x00005639b142bbb5 in StreamedSearchDataProvider::handleStreamConnection(HttpCompressingServerTransaction&) ()
7 0x00005639b0c38d4d in MHTTPStreamDataProvider::streamBody() ()
8 0x00005639b07db115 in ServicesEndpointReplyDataProvider::produceBody() ()
9 0x00005639b07d28ff in RawRestHttpHandler::getBody(HttpServerTransaction*) ()
10 0x00005639b0d558fb in HttpThreadedCommunicationHandler::communicate(TcpSyncDataBuffer&) ()
11 0x00005639b0119e42 in TcpChannelThread::main() ()
12 0x00005639b0dedfa9 in Thread::callMain(void*) ()
13 0x00007f8b0d9614a4 in start_thread (arg=0x7f8afe1ff700) at pthread_create.c:456
14 0x00007f8b0d6a3d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
This explains why the cluster was so unstable while bucket uploads were failing, and accounts for observations #1 and #3.
This dependency on the CMSlave lock has already been fixed in 8.0.1.
As for #2: because the customer set max_concurrent_downloads/max_concurrent_uploads = 200, there were so many concurrent uploads to S3 through the proxy that the proxy became overloaded and started backing up. Eventually it closed the connections to the indexers, upload retries kicked in, and the timeouts appeared.
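This is why bounding concurrency matters: with a sane cap, a slow or failing proxy sees a few in-flight connections instead of 200. A toy sketch of bounded parallelism using xargs -P (bucket names, the worker command, and the cap of 2 are illustrative):

```shell
# Run at most 2 "uploads" at a time, instead of all 8 at once.
out=$(printf 'bucket_%s\n' 1 2 3 4 5 6 7 8 \
  | xargs -P 2 -I{} sh -c 'sleep 0.05; echo "uploaded {}"')
echo "$out" | wc -l   # prints 8: all uploads still complete, just 2 at a time
```

The cachemanager's max_concurrent_uploads plays the role of -P here; raising it to 200 trades backpressure for a thundering herd against the proxy.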