Delays in logs ingestion in GCP us-central-1
Incident Report for ESS (Public)
Resolved
As of 13:10 UTC, the Logstash queues have been drained, which means there should be no further delays or impact to customers. At this point, we consider this issue resolved. If you experience any further issues, please reach out to support: https://www.elastic.co/support/welcome.
Posted Aug 06, 2019 - 16:26 UTC
Monitoring
Update: The logging cluster upgrade has completed on all the Elasticsearch instances. We are observing a sustained indexing rate of 2x the pre-incident rate. Our Logstash queues are being processed, and we are steadily catching up on log ingestion.
Posted Aug 06, 2019 - 13:17 UTC
Update
Update: At around 6:30am UTC we started an upgrade of the Elasticsearch logging cluster, which is home to several indices containing customers' cluster and proxy logs. Unfortunately, the upgrade process has negatively impacted the performance of that cluster, which currently cannot cope with the ingest load and the rolling upgrade process at the same time. Around 80% of the instances in the affected cluster have been upgraded successfully. However, the remaining 20% are still in the process of relocating large shards to new instances, which consumes a significant amount of resources. We are monitoring the situation and will provide an update within the next couple of hours.
Posted Aug 06, 2019 - 11:05 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 06, 2019 - 09:54 UTC
Investigating
We have identified an issue with the logging cluster in GCP us-central-1 that is causing delays in the log ingestion pipelines. Customers may have delayed visibility into the logs displayed in the UI for their cluster. We are currently working on fixing the issue. An update will be provided within one hour.
Posted Aug 06, 2019 - 09:54 UTC