Cluster Logs and Metrics for customer clusters are currently delayed
Incident Report for ESS (Public)
Resolved
This incident has been resolved.
Posted Jun 20, 2019 - 03:54 UTC
Update
Cluster logging and metric data should now be up to date for all deployments. We'll continue to watch our ingestion pipelines and monitoring clusters as part of our normal operations.

There is no ongoing customer impact from this incident.
Posted Jun 20, 2019 - 03:28 UTC
Monitoring
Cluster logging and metric data for most clusters should now be up to date. Deployments in the AWS eu-central-1, GCP us-central1, and GCP europe-west1 regions are still delayed; however, those regions are working through their queues and should be up to date within the hour.

We will update this issue within the next 30 minutes.
Posted Jun 20, 2019 - 03:01 UTC
Identified
Customer logs and metrics appear to be flowing again, with queues trending downwards. We expect cluster logging and metric data to be up to date within the next hour.

One of our backend monitoring clusters had entered a degraded state, which was impacting indexing rates in our Logstash consumers. We're still investigating what caused the cluster to enter this state, but we believe two nodes were affected by our recent hot-warm incident, leaving them unable to keep up with cluster state changes. We've restored the cluster to an operational state, which in turn restored our log ingestion rates.
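
For readers who want to check for a similar condition in their own Elasticsearch clusters, the sketch below is illustrative only and is not our internal tooling; MONITORING_ES_URL is a placeholder for your own cluster endpoint. It queries the standard _cluster/health and _cluster/pending_tasks APIs to see whether a cluster is degraded or falling behind on cluster state updates.

```python
# Illustrative sketch only -- not Elastic's internal tooling.
# MONITORING_ES_URL is a placeholder for your own cluster endpoint.
import requests

MONITORING_ES_URL = "http://localhost:9200"


def check_cluster(base_url: str) -> None:
    # _cluster/health reports overall status (green/yellow/red) and task backlog.
    health = requests.get(f"{base_url}/_cluster/health", timeout=10).json()
    print(f"status={health['status']} "
          f"pending_tasks={health['number_of_pending_tasks']} "
          f"max_task_wait={health.get('task_max_waiting_in_queue_millis', 0)}ms")

    # _cluster/pending_tasks lists queued cluster state updates; a long queue
    # suggests nodes are struggling to keep up with cluster state changes.
    pending = requests.get(f"{base_url}/_cluster/pending_tasks", timeout=10).json()
    for task in pending.get("tasks", [])[:5]:
        print(f"  priority={task['priority']} waiting={task['time_in_queue']} "
              f"source={task['source']}")


if __name__ == "__main__":
    check_cluster(MONITORING_ES_URL)
```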

We'll post another update in 30 minutes, or as new information comes to hand.
Posted Jun 20, 2019 - 02:43 UTC
Investigating
Currently, cluster logs and metrics across all regions are delayed. At this stage you won't be able to see up-to-date logs or metrics for your clusters. You can still access historical logs and metrics. We're investigating the root cause and will post an update shortly.
Posted Jun 20, 2019 - 02:03 UTC