Keboola - Failing jobs on all stacks due to an outage in AWS US east region – Incident details

Failing jobs on all stacks due to an outage in AWS US east region

Resolved
Operational
Started 7 days ago. Lasted about 13 hours.

Affected

AWS EU (eu-central-1)

Operational from 7:10 AM to 7:38 AM, Major outage from 7:38 AM to 9:17 AM, Partial outage from 9:17 AM to 9:34 AM, Degraded performance from 9:34 AM to 10:26 AM, Operational from 10:26 AM to 8:13 PM

GCP EU (europe-west3)

Operational from 7:10 AM to 7:38 AM, Major outage from 7:38 AM to 9:17 AM, Partial outage from 9:17 AM to 9:34 AM, Degraded performance from 9:34 AM to 10:26 AM, Operational from 10:26 AM to 8:13 PM

AWS US (us-east-1)

Major outage from 7:10 AM to 7:14 AM, Partial outage from 7:14 AM to 7:38 AM, Major outage from 7:38 AM to 9:34 AM, Degraded performance from 9:34 AM to 2:00 PM, Major outage from 2:00 PM to 6:39 PM, Degraded performance from 6:39 PM to 8:13 PM

GCP US (us-east4)

Operational from 7:10 AM to 7:38 AM, Major outage from 7:38 AM to 9:17 AM, Partial outage from 9:17 AM to 9:34 AM, Degraded performance from 9:34 AM to 10:26 AM, Operational from 10:26 AM to 8:13 PM

Azure NE (north-europe)

Operational from 7:10 AM to 7:38 AM, Major outage from 7:38 AM to 9:17 AM, Partial outage from 9:17 AM to 9:34 AM, Degraded performance from 9:34 AM to 10:26 AM, Operational from 10:26 AM to 8:13 PM

Updates
  • Resolved

    AWS reports continued improvement in recovery. All previously affected stacks are now operational. This incident has been resolved.

  • Update

    AWS reports progress with the mitigation, and we are seeing signs of recovery. Jobs are starting and finishing successfully. We continue to monitor the issue.

  • Update

    The AWS outage is still ongoing and continues to affect our AWS US stack. We’re actively monitoring the situation and will share updates as soon as we have more details.

  • Monitoring

    AWS again reports "significant API errors and connectivity issues across multiple services in the US-EAST-1 Region", which is impacting the AWS US stack. Jobs are delayed.

  • Investigating

    We still see degraded performance on the AWS US stack; jobs are currently failing to start.

  • Update

    The AWS incident in the US region is still not fully resolved. AWS is reporting errors leading to limited capacity to schedule new workloads, which may continue to affect the AWS US stack performance.

  • Update

    All stacks are operational except AWS US. The AWS US stack shows degraded performance, primarily affecting job listing visibility in the UI. Job execution is not impacted and continues to run as expected.

  • Update

    We’re still seeing issues on the AWS US stack following the AWS incident. Some jobs are stuck in processing and may not complete. We’re working on mitigation. Other stacks are running normally, and we’re continuing to monitor their performance.

  • Update

    Jobs are now running successfully on all stacks. AWS has reported significant signs of recovery. Orchestration has been unpaused on the AWS US stack, and all orchestrations scheduled between 10:30 and 11:30 (CET) will be gradually triggered until synchronization is restored.

  • Update

    We’re seeing jobs complete successfully on all stacks except the AWS US stack, so we’re resuming orchestrations for the unaffected stacks and continuing to monitor the issue.

  • Monitoring

    We’ve paused job scheduling to prevent further impact while the AWS incident is being resolved.

  • Update

    We have identified that multiple stacks are experiencing degraded job execution performance. The degradation is linked to platform images hosted in the affected AWS US region.

  • Identified

    Based on the AWS report at https://health.aws.amazon.com/health/status, this issue is affecting only the AWS us-east-1 region.

    We will report back when we have more information.

  • Investigating

    We are currently investigating an issue with jobs not starting on the AWS US (us-east-1) stack.

    Our team has been notified and is actively investigating.

    Next update in 20 minutes or when new information is available.