Keboola - Azure Scaling Issue in All Regions – Incident details


Resolved
Partial outage
Started 3 months ago · Lasted about 5 hours

Affected

Azure NE (north-europe)

Operational from 7:00 PM to 11:31 PM

Updates
  • Resolved

    The issues affecting our Azure infrastructure have been resolved; all our Azure stacks are fully operational now.

    We apologize for the inconvenience and thank you for your patience.

  • Monitoring

    We're seeing performance improvements on all stacks. Everything appears to be running normally, and we're closely monitoring the situation.

  • Update

    From Azure Status:

    Current status: We have determined that these issues were caused by a recent configuration change that affected public access to certain Microsoft‑managed storage accounts, used to host extension packages. We are actively working to mitigate impact, by updating our configuration to restore relevant access permissions. After applying this update in one region, we validated that it mitigates the issues customers were experiencing - we are making good progress applying this same mitigation across all impacted regions, in parallel where possible. We still expect that this will be completed across all impacted regions by approximately 00:00 UTC, approximately one hour from now. Our next update will be provided by then, or sooner if we have progress to share.

  • Identified

    Azure has acknowledged the underlying issue and is working on a fix. See https://azure.status.microsoft/en-gb/status for more details.

    Active - Virtual Machine service management issues - Multiple regions

    Impact statement: We are aware of an ongoing issue causing customers to receive error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for Virtual Machines (VMs).

    We're monitoring the issue closely and will provide an update within 1 hour, or sooner when new information is available.

  • Investigating

    We are aware of performance degradation affecting our Azure infrastructure in all regions. This appears to be related to an upstream issue with Azure's global cluster autoscaling services.

    We're able to process jobs with the currently provisioned capacity, but we're unable to scale out new nodes for additional jobs.

    Symptoms:

    • Extended delays in job processing (jobs staying in the waiting state longer than usual), or jobs delayed indefinitely

    • Inability to start Data Apps or Workspaces

    Our team is monitoring the situation and waiting for Azure's response to our incident reports.

    We apologize for the disruption and will provide an update within 1 hour, or sooner when new information is available.