Haripriya Veluchamy

Canary Deployments in Azure Container Apps: A Complete Guide

Why Do We Need Canary Deployments?

Imagine you push a new update to production. Everything looked fine in testing, but once the release is live, your users start seeing errors. By the time you notice, thousands of users are affected. You scramble to roll back, but the damage is done.

This is exactly what canary deployments prevent.

The Problem with Traditional Deployments

In a traditional deployment, when you push new code, 100% of your traffic immediately goes to the new version. If something breaks, all your users are affected.

Traditional Deployment:

Before:  [Old Version] ████████████ 100% traffic
After:   [New Version] ████████████ 100% traffic  ← If broken, everyone is affected!

The Canary Solution

Canary deployment gets its name from the old mining practice of bringing canaries into coal mines. If dangerous gases were present, the canary would die first, warning miners to evacuate.

Similarly, in canary deployments, we send a small portion of traffic to the new version first. If something goes wrong, only a small percentage of users are affected.

Canary Deployment:

Step 1:  [Old Version] ██████████ 90%
         [New Version] ██ 10%        ← Test with small traffic

Step 2:  [Old Version] ██████ 50%
         [New Version] ██████ 50%    ← Gradually increase

Step 3:  [New Version] ████████████ 100%  ← Full rollout after validation

How Azure Container Apps Supports Canary Deployments

Azure Container Apps has built-in support for traffic splitting through its revision system. Every time you deploy, a new revision is created. You can then control how much traffic goes to each revision.
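
To see what this looks like in practice, you can list an app's revisions along with their traffic weights (myapp and my-rg below are placeholder names):

# Show each revision's name, active flag, and traffic weight
az containerapp revision list \
  --name myapp \
  --resource-group my-rg \
  --query "[].{name:name, active:properties.active, traffic:properties.trafficWeight}" \
  --output table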

Key Concepts

Revision: A snapshot of your container app at a specific point in time. Each deployment creates a new revision.

Traffic Weight: The percentage of traffic each revision receives. All weights must add up to 100%.

Active Revision: A revision that is running and can receive traffic.

Inactive Revision: A revision that exists but receives no traffic and consumes no resources.
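
One prerequisite worth calling out: traffic splitting only works when the app runs in multiple-revision mode. In the default single-revision mode, each new deployment automatically takes 100% of traffic. It's a one-time switch (app and resource group names are placeholders):

# Allow multiple active revisions so traffic can be split between them
az containerapp revision set-mode \
  --name myapp \
  --resource-group my-rg \
  --mode multiple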

Implementation: Single Region Canary Deployment

Let's build a complete canary deployment pipeline for a single-region setup.

The Deployment Flow

Push Code
    │
    ▼
Build Image (tagged with git SHA)
    │
    ▼
Save Current Stable Revision
    │
    ▼
Deploy New Revision (Canary)
    │
    ▼
Health Check (5 attempts)
    │
    ├── PASS ──▶ Split Traffic 50/50
    │
    └── FAIL ──▶ Auto Rollback + Alert

Step 1: Build and Deploy with Unique Tags

Never use the latest tag for production deployments. It's mutable, so the same tag can point to different images over time, which makes rollbacks and audits unreliable.

- name: Deploy Application
  run: |
    # Use git SHA for unique, immutable image tag
    IMAGE_TAG="${{ github.sha }}"
    REVISION_SUFFIX=$(echo "$IMAGE_TAG" | cut -c1-8)

    # Build and push with unique tag
    az acr build \
      --registry $REGISTRY_NAME \
      --image $APP_NAME:$IMAGE_TAG \
      --file Dockerfile .

    # Deploy with revision suffix for easy identification
    az containerapp update \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --image "$ACR_SERVER/$APP_NAME:$IMAGE_TAG" \
      --revision-suffix "$REVISION_SUFFIX"

Step 2: Health Check

Before routing traffic, verify that the new revision is healthy. Note that the check below probes the canary's revision-specific FQDN; hitting the app's main FQDN would test whichever revision currently receives the traffic.

- name: Health Check
  run: |
    # Probe the canary's revision-specific FQDN so the check hits the
    # new revision, not whichever revision currently holds the traffic
    FQDN=$(az containerapp revision show \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --revision "${{ env.CANARY_REVISION }}" \
      --query properties.fqdn \
      --output tsv)

    HEALTH_PASSED=false
    for i in {1..5}; do
      if curl -sf --max-time 5 "https://${FQDN}/health" > /dev/null 2>&1; then
        HEALTH_PASSED=true
        break
      fi
      sleep 10
    done

    echo "HEALTH_PASSED=${HEALTH_PASSED}" >> $GITHUB_ENV

This tries 5 times with 10-second intervals. Why 5 attempts? Because containers need time to start, connect to databases, and warm up.

Step 3: Traffic Splitting

If the health check passes, split traffic between the stable and canary revisions.

- name: Route 50/50 Traffic
  if: env.HEALTH_PASSED == 'true'
  run: |
    # STABLE_REVISION and CANARY_REVISION were exported to $GITHUB_ENV
    # by earlier steps (see the deployment flow above)
    STABLE="${{ env.STABLE_REVISION }}"
    CANARY="${{ env.CANARY_REVISION }}"

    if [[ -n "${STABLE}" && "${STABLE}" != "${CANARY}" ]]; then
      az containerapp ingress traffic set \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --traffic-weight ${STABLE}=50 ${CANARY}=50
    else
      # First deployment - no stable exists
      az containerapp ingress traffic set \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --traffic-weight ${CANARY}=100
    fi

Step 4: Auto Rollback on Failure

If the health check fails, immediately roll back to protect users.

- name: Auto Rollback
  if: env.HEALTH_PASSED == 'false'
  run: |
    # Route all traffic back to stable
    az containerapp ingress traffic set \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --traffic-weight ${{ env.STABLE_REVISION }}=100

    # Deactivate broken revision to free resources
    az containerapp revision deactivate \
      --name $APP_NAME \
      --resource-group $RESOURCE_GROUP \
      --revision "${{ env.CANARY_REVISION }}" || true

    # Send alert to team (assumes WEBHOOK_URL is a Slack/Teams-style incoming webhook)
    curl -X POST "$WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d "{\"text\":\"Auto rollback triggered for ${APP_NAME}\"}"

    exit 1

Implementation: Multi-Region Canary Deployment

For applications deployed across multiple regions, we need a sequential approach to prevent global outages.

Why Sequential Deployment?

If you deploy to all regions simultaneously and there's a bug, all regions go down together. Sequential deployment means:

  1. Deploy to Region 1
  2. Health check Region 1
  3. If healthy, deploy to Region 2
  4. Health check Region 2
  5. Continue...

If any region fails, stop the rollout immediately.

Multi-Region Deployment Flow

Deploy to US Region
    │
    ▼
Health Check US
    │
    ├── FAIL ──▶ Stop! Don't deploy to other regions
    │
    └── PASS ──▶ Deploy to EU Region
                    │
                    ▼
                Health Check EU
                    │
                    ├── FAIL ──▶ Rollback EU, keep US on canary
                    │
                    └── PASS ──▶ All regions on 50/50 split

Multi-Region Implementation

- name: Deploy to All Regions
  run: |
    REGIONS="us eu uk"
    REVISION_SUFFIX=$(echo "${{ github.sha }}" | cut -c1-8)

    for REGION in $REGIONS; do
      APP_NAME="myapp-${REGION}-prod"

      echo "Deploying to $REGION..."

      # Deploy
      az containerapp update \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --image $IMAGE_NAME \
        --revision-suffix "$REVISION_SUFFIX"

      # Health check this region's canary before proceeding: probe the
      # revision-specific FQDN so we test the new revision directly
      FQDN=$(az containerapp revision show \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --revision "${APP_NAME}--${REVISION_SUFFIX}" \
        --query properties.fqdn \
        --output tsv)

      HEALTHY=false
      for i in {1..5}; do
        if curl -sf --max-time 5 "https://$FQDN/health" > /dev/null 2>&1; then
          HEALTHY=true
          break
        fi
        sleep 10
      done

      if [[ "$HEALTHY" != "true" ]]; then
        echo "Region $REGION failed health check. Stopping deployment."
        exit 1
      fi

      echo "$REGION deployed and healthy"
    done

- name: Route Traffic All Regions
  run: |
    REGIONS="us eu uk"
    # Recompute the suffix: shell variables don't carry over between workflow steps
    REVISION_SUFFIX=$(echo "${{ github.sha }}" | cut -c1-8)

    for REGION in $REGIONS; do
      APP_NAME="myapp-${REGION}-prod"
      CANARY="${APP_NAME}--${REVISION_SUFFIX}"

      # Get stable revision for this region
      STABLE=$(az containerapp revision list \
        --name $APP_NAME \
        --resource-group $RESOURCE_GROUP \
        --query "[?properties.trafficWeight>\`0\` && name!='${CANARY}'] | [0].name" \
        -o tsv)

      if [[ -n "$STABLE" ]]; then
        az containerapp ingress traffic set \
          --name $APP_NAME \
          --resource-group $RESOURCE_GROUP \
          --traffic-weight ${STABLE}=50 ${CANARY}=50
      fi
    done

Manual Promotion and Rollback

After the canary has been running for a while, you either promote it to 100% once you've validated that it's working correctly, or roll back if issues are found.
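
Before deciding, it helps to watch how the canary behaves under real traffic. One lightweight option is streaming its logs (the revision name below is illustrative):

# Stream logs from the canary revision while it serves live traffic
az containerapp logs show \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --revision "myapp--a1b2c3d4" \
  --follow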

Promotion Workflow

name: 'Canary: Promote or Rollback'

on:
  workflow_dispatch:
    inputs:
      action:
        description: 'Action to perform'
        required: true
        type: choice
        options:
          - promote-100
          - rollback

jobs:
  canary-action:
    runs-on: ubuntu-latest
    steps:
      - name: Azure Login
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Get Current Revisions
        id: revisions
        run: |
          # Sort active revisions by creation time: ordering by traffic weight
          # is ambiguous on a 50/50 split, but the canary is always the newest
          REVISIONS=$(az containerapp revision list \
            --name $APP_NAME \
            --resource-group $RESOURCE_GROUP \
            --query "[?properties.trafficWeight>\`0\`] | sort_by(@, &properties.createdTime)" \
            -o json)

          # Oldest active revision = stable, newest = canary
          STABLE=$(echo "$REVISIONS" | jq -r 'first | .name')
          CANARY=$(echo "$REVISIONS" | jq -r 'last | .name')

          echo "STABLE=${STABLE}" >> $GITHUB_OUTPUT
          echo "CANARY=${CANARY}" >> $GITHUB_OUTPUT

      - name: Execute Action
        run: |
          case "${{ github.event.inputs.action }}" in
            promote-100)
              # Send 100% to canary
              az containerapp ingress traffic set \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --traffic-weight ${{ steps.revisions.outputs.CANARY }}=100

              # Deactivate old stable
              az containerapp revision deactivate \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --revision "${{ steps.revisions.outputs.STABLE }}" || true
              ;;

            rollback)
              # Send 100% back to stable
              az containerapp ingress traffic set \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --traffic-weight ${{ steps.revisions.outputs.STABLE }}=100

              # Deactivate broken canary
              az containerapp revision deactivate \
                --name $APP_NAME \
                --resource-group $RESOURCE_GROUP \
                --revision "${{ steps.revisions.outputs.CANARY }}" || true
              ;;
          esac

Cleaning Up Inactive Revisions

Over time, you'll accumulate many inactive revisions. While they don't consume compute resources, they clutter your revision list. Deactivating revisions after promotion or rollback keeps things clean.

# Deactivate a specific revision
az containerapp revision deactivate \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --revision "myapp--abc12345"

The || true after deactivation commands ensures the pipeline doesn't fail if the revision is already inactive.
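
To sweep up everything at once, a small loop works too. Here's a sketch that deactivates every active revision currently receiving no traffic (try it against a non-production app first):

# Deactivate every active revision that receives 0% of the traffic
for REV in $(az containerapp revision list \
  --name $APP_NAME \
  --resource-group $RESOURCE_GROUP \
  --query "[?properties.active && properties.trafficWeight==\`0\`].name" \
  -o tsv); do
  az containerapp revision deactivate \
    --name $APP_NAME \
    --resource-group $RESOURCE_GROUP \
    --revision "$REV" || true
done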

Best Practices

1. Use Immutable Image Tags

Never use latest. Always tag images with git SHA or build number.

❌ myapp:latest
✅ myapp:a1b2c3d4e5f6

2. Start with Higher Canary Percentage for Low Traffic

If you have few users, start with 50/50. You need enough traffic on the canary to detect issues.

Low traffic:  50/50 split (need volume to detect issues)
High traffic: 10/90 split (even 10% is thousands of users)

3. Implement Proper Health Checks

Your /health endpoint should verify all of the following (a matching pipeline-side check is sketched after this list):

  • Application is running
  • Database connections work
  • Critical dependencies are reachable
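
On the pipeline side, you can then assert on the response body rather than just the status code. A minimal sketch, assuming your /health endpoint returns JSON like {"status":"ok"} (that response shape is a convention you define in your app, not something Azure provides):

# Fail unless /health returns HTTP 200 AND a body reporting status "ok"
BODY=$(curl -sf --max-time 5 "https://${FQDN}/health") || exit 1
echo "$BODY" | jq -e '.status == "ok"' > /dev/null || exit 1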

4. Set Up Alerts

Always send alerts on rollback. Your team needs to know when deployments fail.

5. Use Revision Suffixes

Name revisions with git SHA prefix for easy identification.

myapp--a1b2c3d4  ← Easy to trace back to commit

Conclusion

Canary deployments are essential for safe production releases. With Azure Container Apps, you get native support for traffic splitting that makes implementation straightforward.

Key takeaways:

  1. Deploy new code as a separate revision
  2. Run health checks before routing traffic
  3. Split traffic gradually (50/50 for low traffic, 10/90 for high traffic)
  4. Auto-rollback on health check failure
  5. Use sequential deployment for multi-region setups
  6. Clean up inactive revisions after promotion/rollback

The initial setup takes effort, but the peace of mind knowing your deployments are safe is worth it. No more 2 AM panic calls because a bad deployment took down production.
