Keboola - Microsoft Azure Issue – Incident details

Microsoft Azure Issue

Resolved
Partial outage 30 %
Started about 1 month agoLasted about 5 hours

Affected

AWS EU (eu-central-1)

Operational from 5:02 PM to 5:19 PM, Degraded performance from 5:19 PM to 10:10 PM

GCP EU (europe-west3)

Operational from 5:02 PM to 5:19 PM, Degraded performance from 5:19 PM to 10:10 PM

AWS US (us-east-1)

Operational from 5:02 PM to 5:19 PM, Degraded performance from 5:19 PM to 10:10 PM

GCP US (us-east4)

Operational from 5:02 PM to 5:19 PM, Degraded performance from 5:19 PM to 10:10 PM

Azure NE (north-europe)

Partial outage from 5:02 PM to 9:30 PM, Degraded performance from 9:30 PM to 10:10 PM

Updates
  • Resolved
    Resolved

    Automatic scaling is turned back on and all platforms are fully operational and we'll continue monitoring all stacks. We're sorry for this inconvenience.

  • Update
    Update

    We're seeing full operations on all Azure stacks. We'll be enabling automatic scaling shortly and all jobs are currently processing as usual.

  • Monitoring
    Monitoring

    Azure is promising full mitigation within next four hours. We are closely monitoring the situation and once the issue is fixed we will resume all operations and turn autoscaling back on.

    At this stage, we anticipate full mitigation within the next four hours as we continue to recover nodes. This means we expect recovery to happen by 23:20 UTC on 29 October 2025. We will provide another update on our progress within two hours, or sooner if warranted.

  • Update
    Update

    We have limited autoscaling of all Azure stacks to limit cascading the issue to new nodes and thus limiting the number of waiting jobs. We have seen partial improvement in certain regions so our hopes are up.

  • Update
    Update

    Microsoft has updated the impact and states that network infrastructure in all Azure regions is affected.

  • Identified
    Identified

    These issues can propagate downstream to AWS and GCP stacks, where the symptoms can include:

    • Limited operations of some AI features (error explains, config description generator)

  • Investigating
    Investigating

    We are aware of performance degradation affecting our Azure infrastructure. This appears to be related to an upstream issue with Azure's announced Azure Portal Access Issues (https://azure.status.microsoft/en-us/status).

    Symptoms include:

    • Failures to prepare new nodes to accept workload, causing some jobs to be in the waiting state.

    Our team is monitoring the situation.

    We apologize for the disruption and will provide an update within 30 minutes or when new information is available.