Keboola - Jobs not starting on all stacks – Incident details

Jobs not starting on all stacks

Resolved
Major outage
Started 2 months ago
Lasted about 5 hours

Affected

AWS EU (eu-central-1)

Major outage from 2:24 PM to 3:29 PM, Operational from 3:29 PM to 7:02 PM

GCP EU (europe-west3)

Major outage from 2:24 PM to 3:35 PM, Operational from 3:35 PM to 7:02 PM

AWS US (us-east-1)

Major outage from 2:24 PM to 3:29 PM, Operational from 3:29 PM to 7:02 PM

GCP US (us-east4)

Major outage from 2:24 PM to 3:35 PM, Operational from 3:35 PM to 7:02 PM

Azure NE (north-europe)

Major outage from 2:24 PM to 3:45 PM, Operational from 3:45 PM to 7:02 PM

Updates
  • Resolved

    This incident has been completely resolved.
    We apologize for any inconvenience caused.

  • Update

    We have identified approximately ten stuck flows across multiple stacks and are processing them gradually. If you encounter a flow that is in a terminating state and does not complete, please contact our support team.

  • Update

    We are monitoring all stacks and can see that the backlog of accumulated jobs is gradually clearing, but you may still experience minor delays. Everything should be back to normal soon.

  • Monitoring

    The Azure NE (Northern Europe) stack has now also been fixed. We will now monitor the situation for at least another hour.

  • Update

    The GCP EU (europe-west3) and GCP US (us-east4) stacks have now been fixed, and all jobs will gradually start automatically.

  • Identified

    We have identified the cause of the problem and are gradually repairing the stacks. The AWS EU and AWS US stacks were repaired first, and tasks are now running automatically without user intervention. All stacks should be repaired within an hour at the latest. We will keep you informed.

  • Investigating

    We are aware of performance degradation affecting job starts.

    Symptoms: Slow task execution or tasks stuck in the create state.

    Our team is monitoring the situation and working to resolve this issue.

    We apologize for the disruption and will provide an update within 1 hour or when new information is available.