Increased error rate and logging delays in us-east-1
Incident Report for ESS (Public)
Resolved
We have been monitoring the cluster throughout the day and the queues have remained at normal levels.
Posted Jun 24, 2019 - 22:22 UTC
Update
We have caught up processing our logging and metrics queues. All cluster logs should now be processed as expected.

Unfortunately, we were unable to save all of the backlog queue files and have therefore lost some data for the period between June 23rd, 2019 15:00 UTC and June 24th, 2019 10:00 UTC. We are assessing the impact of the data ingestion issue and will provide an update in a few hours.
Posted Jun 24, 2019 - 13:39 UTC
Update
Queues for logging and metrics data in us-east-1 are now draining at a more acceptable rate. We believe we have identified all the logging and metrics clusters suffering contention and have increased their number of shards to improve ingest rates.

We will update this incident when we can confirm improvements, but we expect customer-facing logs will still be delayed for several hours.
Posted Jun 24, 2019 - 10:25 UTC
Update
Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We have increased the number of shards on the active indices in the delayed clusters, and we are monitoring for improvements in our ingestion rates.

We'll update this incident when we can confirm improvements, but we expect logs will still be delayed for several hours.
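For context, Elasticsearch fixes an index's primary shard count when the index is created, so raising the shard count for active logging indices generally means ensuring that newly created indices get more primary shards, for example by updating the index template and rolling the write alias over to a fresh index. The sketch below only illustrates that general approach and is not our actual runbook; the cluster endpoint, credentials, template name, alias, and shard count are hypothetical.

# Minimal sketch (assumptions noted above): raise the default primary shard
# count for future log indices, then roll over the write alias so ingestion
# moves to a new index that picks up the setting.
import requests

ES = "https://logging-cluster.example.com:9200"   # hypothetical cluster endpoint
AUTH = ("elastic", "changeme")                     # placeholder credentials

# Update the index template so indices matching the logging pattern are
# created with more primary shards.
requests.put(
    f"{ES}/_template/logs",
    json={
        "index_patterns": ["logs-*"],
        "settings": {"number_of_shards": 6},       # previously a lower value
    },
    auth=AUTH,
).raise_for_status()

# Roll the write alias over; existing indices keep their original shard
# layout, and new writes land on an index with the higher shard count.
requests.post(f"{ES}/logs-write/_rollover", auth=AUTH).raise_for_status()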
Posted Jun 24, 2019 - 07:43 UTC
Update
Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We're working on increasing the number of shards on the active indices within the delayed logging cluster, which should improve our ingestion rates.

We'll update this incident when these changes have been made, but we expect logs will still be delayed for several hours.
Posted Jun 24, 2019 - 06:19 UTC
Update
Queues for logging and metrics data in us-east-1 remain high, and cluster logs and metrics will be delayed as a result. We're continuing remediation work to speed up ingestion, but it will still take some time to clear these queues.

We'll continue to monitor the log and metrics pipelines and will update this incident with any new information as it comes to light.
Posted Jun 24, 2019 - 03:32 UTC
Update
The logging and metrics queues in us-east-1 are still high. We're continuing remediation work to speed up ingestion, but it will take some time to reduce the queues. We are continuing to monitor the situation.
Posted Jun 24, 2019 - 00:01 UTC
Monitoring
The logging delay in us-east-1 has improved, although the queues have not fully drained, so logs and metrics from earlier today may not yet appear. We are continuing to monitor the situation.
Posted Jun 23, 2019 - 20:27 UTC
Update
The proxy layer in this region has been stable for the past 30 minutes. We are still working on the logging delay; logging is currently about two hours behind.
Posted Jun 23, 2019 - 17:06 UTC
Update
The elevated rate of 5xx errors on our proxies in AWS us-east-1 has returned to normal levels and the proxy layer is more stable. We are still seeing a logging delay in the region.

The backend ZooKeeper ensemble is still under increased load and we are continuing to investigate. We'll have another update for you in 1 hour.
Posted Jun 23, 2019 - 16:47 UTC
Identified
A failure in a backend ZooKeeper node at 14:44 UTC caused increased proxy 5xx error rates starting at 15:11 UTC, which also disrupted customer intra-cluster connectivity and delayed logging in the us-east-1 region. The initial ZooKeeper failure has been corrected, and engineers are currently working to correct the impact to our regional logging and metrics clusters.
Posted Jun 23, 2019 - 16:08 UTC
This incident affected: AWS N. Virginia (us-east-1) (Elasticsearch connectivity: AWS us-east-1, Kibana connectivity: AWS us-east-1, APM connectivity: AWS us-east-1).