Benoit COUETIL πŸ’« for Zenika


🦊 GitLab CI: Deploy a Majestic Single Server Runner on AWS

Initial thoughts

In GitLab CI: The Majestic Single Server Runner, we found that a single server runner outperforms a Kubernetes cluster with equivalent node specifications up to approximately 200 simultaneous job requests! That is typically beyond the average daily usage of most software teams. Equally important, with 40 or fewer queued jobs, the single server runner is twice as fast, a scenario that is common even on the busiest days for most teams.

As demonstrated in GitLab Runners: Which Topology for Fastest Job Execution?, single-server executors deliver the fastest job execution times.

This article will help you deploy this no-compromise runner on AWS at a reasonable price, thanks to multiple optimizations. Parts of it apply to any cloud, public or private.

The deployment is automated and optimized as much as possible:

  • Infrastructure is provisioned with Terraform
  • A spot instance is used
  • The EC2 instance is stopped at night and on weekends
  • An EC2 boot script (re)installs everything and registers the runner with GitLab
  • The runner is tagged with a few interesting EC2 characteristics

1. The right EC2 instance at the right price

An AWS spot instance is a cost-effective option that lets you leverage spare EC2 capacity at a steep discount. Since our deployment is automated and occasional downtime is not critical, spot instances are an optimal choice for cost optimization.

To fully utilize the capabilities of a single server runner while keeping costs reasonable, it is essential to select an EC2 instance with a local NVMe SSD disk. These instances are identified by the 'd' in their name, indicating that they are disk-optimized.

When choosing an EC2 instance, the following conditions should be considered:

  • The instance name should contain the letter 'd', indicating local NVMe disk support.
  • It should be available in our usual region.
  • The CPU specifications should match our usage requirements. For Java/JavaScript application CI/CD, about one core per parallel job is a good rule of thumb. Here we choose 16 vCPUs for 20 parallel jobs.
  • The spot price should be reasonable (see the command sketch below).
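Both criteria can be checked directly from the AWS CLI. A minimal sketch (instance type, vCPU count and region are examples, adjust to your own context):

# list instance types with local (instance-store) NVMe disks and 16 vCPUs
aws ec2 describe-instance-types \
  --filters "Name=instance-storage-supported,Values=true" \
  --query 'InstanceTypes[?VCpuInfo.DefaultVCpus==`16`].[InstanceType,MemoryInfo.SizeInMiB,InstanceStorageInfo.TotalSizeInGB]' \
  --output table

# check the recent spot price of a candidate in the current region
aws ec2 describe-spot-price-history \
  --instance-types r5d.4xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
  --output table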

For the purpose of this article, we have selected the r5d.4xlarge instance type. At the time of writing, the spot price for this instance in us-east-1 is approximately $370/month. That might seem high at first.

Compared to the monthly cost of a development team, however, this price is relatively low. And we can further optimize costs by automatically stopping the EC2 instance outside of working hours, using daily scheduled CloudWatch (EventBridge) rules. Since it is a local disk instance, the state will be lost every day, but we have nothing to lose except some cache, which can be warmed up by a scheduled pipeline every morning.

Let's calculate the cost: $0.5045/hour x 12 working hours per day x 21 working days per month = $127/month. This brings the already acceptable price down even further. To put it into perspective, this represents an 85% discount compared to running the same instance full-time on-demand ($841/month).

(Illustration: mechanical humanoid orange fox, muscular, fur, cyberpunk background)

2. Scripting the GitLab runner installation and configuration

To streamline the process of deploying the EC2 instance, we will create a script that can be used as the user_data to bootstrap the server anytime it (re)boots. This script will handle the installation of Docker, the GitLab Runner, and the configuration required to connect to the GitLab instance.

The script is designed to handle reboots and stop/start actions, which may result in the deletion of local disk data on the NVMe EC2 instance.

Key features of this updated script:

  • AWS CloudWatch Integration: Automatic installation and configuration of the CloudWatch agent to send logs (/var/log/user-data.log, /var/log/syslog) and system metrics (CPU, memory, disk, network) to AWS CloudWatch for centralized monitoring (see the log-tailing sketch after this list)
  • Enhanced Logging: All script operations are logged to /var/log/user-data.log with timestamps and detailed execution traces (set -x)
  • Containerd Support: In addition to Docker, the script now manages containerd with proper bind mounts, ensuring better container runtime isolation
  • GitLab Runner Optimizations: Includes performance feature flags (FF_TIMESTAMPS, FF_USE_FASTZIP, ARTIFACT_COMPRESSION_LEVEL=fastest, CACHE_COMPRESSION_LEVEL=fastest) to speed up job execution
  • Improved Error Handling: Validates NVME disk presence and verifies GitLab Runner registration success with explicit error messages
  • Fast Node Manager (fnm): Pre-installs fnm for efficient Node.js version management in CI/CD pipelines
  • Robust Mount Architecture: Uses bind mounts from /mnt/nvme-* to standard locations (/var/lib/docker, /var/lib/containerd, /gitlab), providing better disk organization and maintainability
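With the CloudWatch agent in place, you can follow the whole bootstrap from your workstation instead of SSHing into the instance. A minimal sketch, assuming AWS CLI v2 and the log group configured above:

# stream the bootstrap and syslog entries sent by the CloudWatch agent
aws logs tail /aws/ec2/gitlab-runner --follow --since 1h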

Make sure to modify the following variables at the start of the script according to your specific requirements:

aws-ec2-init-nvme-and-gitlab-runner.sh

#!/bin/bash
#
### Script to initialize a GitLab runner on an existing AWS EC2 instance with NVME disk(s)
#
# - script is not interactive (can be run as user_data)
# - will reboot at the end to perform NVME mounting
# - first NVME disk will be used for GitLab cache
# - last NVME disk will be used for Docker and containerd (if only one NVME, the same will be used without problem)
# - robust: on each reboot and stop/start, disks are mounted again (but data may be lost on a stop followed by a start a few minutes later)
# - runner is tagged with multiple instance data (public dns, IP, instance type...)
# - works with a single spot instance
# - should work even with multiple ones in a fleet, with same user_data (not tested for now)
#
# /!\ There are no prerequisites, except these needed variables:
MAINTAINER=zenika
GITLAB_URL=https://gitlab.com/
GITLAB_TOKEN=XXXX # https://gitlab.com/groups/ZenikaIT/-/runners
RUNNER_NAME=majestic-runner-v2026

# Enable verbose logging (set -x shows all executed commands)
set -x
exec > >(tee -a /var/log/user-data.log)
exec 2>&1

echo "========================================"
echo "GitLab Runner EC2 Initialization - $(date)"
echo "Instance: $(ec2-metadata --instance-type | cut -d ' ' -f 2) / $(ec2-metadata --instance-id | cut -d ' ' -f 2)"
echo "========================================"

echo "\n=== Installing CloudWatch Agent ==="
wget -q https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
rm amazon-cloudwatch-agent.deb

# Configure CloudWatch Agent to send logs and metrics (compact JSON)
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json >/dev/null <<'CWCONFIG'
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {"file_path": "/var/log/user-data.log", "log_group_name": "/aws/ec2/gitlab-runner", "log_stream_name": "{instance_id}/user-data", "timestamp_format": "%Y-%m-%d %H:%M:%S"},
          {"file_path": "/var/log/syslog", "log_group_name": "/aws/ec2/gitlab-runner", "log_stream_name": "{instance_id}/syslog", "timestamp_format": "%b %d %H:%M:%S"}
        ]
      }
    }
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {"measurement": ["cpu_usage_idle","cpu_usage_iowait","cpu_usage_user","cpu_usage_system","cpu_usage_steal"], "metrics_collection_interval": 60, "resources": ["*"]},
      "mem": {"measurement": ["mem_used_percent","mem_available","mem_total","mem_used"], "metrics_collection_interval": 60},
      "swap": {"measurement": ["swap_used_percent","swap_used","swap_free"], "metrics_collection_interval": 60},
      "disk": {"measurement": ["disk_used_percent","disk_free","disk_total","disk_used"], "resources": ["*"], "metrics_collection_interval": 60},
      "diskio": {"measurement": ["diskio_read_bytes","diskio_write_bytes","diskio_reads","diskio_writes"], "resources": ["*"], "metrics_collection_interval": 60},
      "net": {"measurement": ["bytes_sent","bytes_recv","packets_sent","packets_recv"], "resources": ["*"], "metrics_collection_interval": 60}
    }
  }
}
CWCONFIG

# Start CloudWatch Agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
echo "βœ… CloudWatch logs: /aws/ec2/gitlab-runner"

# prepare docker (re)install
echo "\n=== Installing Docker & GitLab Runner ==="
sudo apt-get update -qq
sudo apt-get -y install apt-transport-https ca-certificates curl gnupg lsb-release sysstat
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt-get update -qq

# Install Fast Node Manager (fnm)
curl -fsSL https://fnm.vercel.app/install | bash

# install gitlab runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get -y install gitlab-runner
echo "βœ… GitLab Runner: $(gitlab-runner --version | head -n1)"

# create NVME initializer script
echo "\n=== Creating NVME Initializer ==="
cat <<EOF >/home/ubuntu/nvme-initializer.sh
#!/bin/bash
#
# To be run on each fresh start, since NVME disks are ephemeral:
# data is wiped on first start and on start-after-stop, but survives a simple reboot
# inspired by https://stackoverflow.com/questions/45167717/mounting-a-nvme-disk-on-aws-ec2
#
set -x
exec >> /var/log/user-data.log 2>&1

echo "=== NVME Initializer - \$(date) ==="
lsblk -b --output=NAME,SIZE,TYPE,MOUNTPOINT

# get NVME disks bigger than 100 GB (a small root disk may also be present, depending on the server type)
NVME_DISK_LIST=\$(lsblk -b --output=NAME,SIZE | grep "^nvme" | awk '{if(\$2>100000000000)print\$1}' | sort)
echo "Found NVME disks: \$NVME_DISK_LIST"

# there may be 1 or 2 NVME disks, then we split (or not) the mounts between GitLab cache and Docker/containerd runtime
export NVME_GITLAB=\$(echo "\$NVME_DISK_LIST" | head -n 1)
export NVME_CONTAINER=\$(echo "\$NVME_DISK_LIST" | tail -n 1)
echo "NVME_GITLAB=/dev/\$NVME_GITLAB NVME_CONTAINER=/dev/\$NVME_CONTAINER"

if [ -z "\$NVME_GITLAB" ]; then
  echo "❌ ERROR: No NVME disk found!"
  exit 1
fi

# format disks if not
sudo mkfs -t xfs /dev/\$NVME_GITLAB || echo "Already formatted"
if [ "\$NVME_GITLAB" != "\$NVME_CONTAINER" ]; then
  sudo mkfs -t xfs /dev/\$NVME_CONTAINER || echo "Already formatted"
fi

# Mount NVME disks on /mnt/nvme-*
# - If 1 disk: everything goes on the same disk (NVME_GITLAB == NVME_CONTAINER)
# - If 2 disks: gitlab cache on first, docker+containerd runtime on second
sudo mkdir -p /mnt/nvme-gitlab /mnt/nvme-runtime
sudo mount /dev/\$NVME_GITLAB /mnt/nvme-gitlab
sudo mount /dev/\$NVME_CONTAINER /mnt/nvme-runtime

# Create service directories and bind mount to standard locations
sudo mkdir -p /mnt/nvme-gitlab/gitlab-cache /gitlab
sudo mount --bind /mnt/nvme-gitlab/gitlab-cache /gitlab

sudo mkdir -p /mnt/nvme-runtime/docker /var/lib/docker
sudo mount --bind /mnt/nvme-runtime/docker /var/lib/docker

sudo mkdir -p /mnt/nvme-runtime/containerd /var/lib/containerd
sudo mount --bind /mnt/nvme-runtime/containerd /var/lib/containerd

# reinstall Docker and containerd (whose data may have been wiped out)
sudo apt-get -y reinstall docker-ce docker-ce-cli containerd.io docker-compose-plugin

echo "\n=== Mounted volumes ==="
df -h | grep -E '(Filesystem|nvme|gitlab|docker|containerd)'
echo "βœ… NVME initialization successful"

EOF

# set NVME initializer script as startup script
sudo tee /etc/systemd/system/nvme-initializer.service >/dev/null <<EOS

[Unit]
Description=NVME Initializer
After=network.target

[Service]
ExecStart=/home/ubuntu/nvme-initializer.sh
Type=oneshot
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

EOS

sudo chmod 744 /home/ubuntu/nvme-initializer.sh
sudo chmod 664 /etc/systemd/system/nvme-initializer.service
sudo systemctl daemon-reload
sudo systemctl enable nvme-initializer.service

sudo systemctl start nvme-initializer.service
sudo systemctl status nvme-initializer.service

# tail -f /var/log/syslog

### Runner registration at the end, to get feedback on the GitLab side once the whole process is done

echo "\n=== Registering GitLab Runner ==="
echo "gitlab-runner ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
echo "Runner: $RUNNER_NAME"

# FF_NETWORK_PER_BUILD to fix a DinD error, from https://forum.gitlab.com/t/since-docker-update-docker-ce-5-29-0-0-1-debian-12-bookworm-cicd-dind-errors-fatal-no-host-or-port-found/131377/4
sudo gitlab-runner register --name "$RUNNER_NAME" --url "$GITLAB_URL" --token "$GITLAB_TOKEN" --executor "docker" --docker-image "ubuntu:22.04" --docker-volumes "/gitlab/:/host/" --custom_build_dir-enabled=true --docker-privileged --docker-pull-policy "if-not-present" --env "FF_NETWORK_PER_BUILD=true" --non-interactive --env "FF_TIMESTAMPS=true" --env "FF_USE_FASTZIP=true" --env "ARTIFACT_COMPRESSION_LEVEL=fastest" --env "CACHE_COMPRESSION_LEVEL=fastest"

if [ $? -eq 0 ]; then
  echo "βœ… GitLab Runner registered successfully!"
else
  echo "❌ GitLab Runner registration FAILED!"
  exit 1
fi

# bind docker socket (to avoid docker-in-docker service)
# sudo gitlab-runner register --name "$RUNNER_NAME" --url "$GITLAB_URL" --token "$GITLAB_TOKEN" --executor "docker" --docker-image "ubuntu:22.04" --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" --docker-volumes "/gitlab/custom-cache/:/host/" --custom_build_dir-enabled=true --docker-privileged --docker-pull-policy "if-not-present" --non-interactive

# to unregister :
# sudo gitlab-runner unregister --name "$(curl --silent http://169.254.169.254/latest/meta-data/public-hostname)"

# replace "concurrent = 1" with "concurrent = 20"
sudo sed -i '/^concurrent /s/=.*$/= 20/' /etc/gitlab-runner/config.toml
# replace "check_interval = 0" with "check_interval = 2"
sudo sed -i '/^check_interval /s/=.*$/= 2/' /etc/gitlab-runner/config.toml
### from https://gitlab.com/gitlab-org/gitlab-runner/-/issues/4036#note_1083142570
# replace "/cache" technical volume with one mounted on disk to avoid cache failure when several jobs in parallel
# this could also have been a mounted docker volume: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/1151#note_1019634818 but it would not be faster with 2 different NVME disks (gitlab + docker)
sudo sed -i 's#"/cache"#"/gitlab/cache:/cache"#' /etc/gitlab-runner/config.toml

sudo systemctl restart gitlab-runner
sudo systemctl status gitlab-runner --no-pager

echo "\n========================================"
echo "πŸŽ‰ GitLab Runner initialization COMPLETED!"
echo "========================================"
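Once the instance has finished bootstrapping, a quick sanity check over SSH confirms that registration and the NVME mounts went through. A sketch of the commands I would run (nothing here is specific to this setup except the paths configured above):

# the runner should be registered and the service active
sudo gitlab-runner list
sudo systemctl status gitlab-runner --no-pager

# the sed tweaks should be visible in the configuration
sudo grep -E 'concurrent|check_interval|/cache' /etc/gitlab-runner/config.toml

# the NVME bind mounts should be in place
df -h | grep -E 'nvme|gitlab|docker'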

(Illustration: mechanical humanoid orange fox, muscular, fur, cyberpunk background)

3. Deploying the auto-stopping architecture with Terraform

To deploy the architecture quickly, we will use Terraform: the deployment process is automated, and the infrastructure is up and running in minutes.

Before we proceed, please ensure that you have an existing VPC created as a prerequisite. You can refer to the examples provided in the official GitHub repo for guidance on creating the VPC.
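If you already have a VPC, you can quickly verify its ID and public subnets before referencing them in the configuration below (this sketch assumes the AWS CLI is configured for the target region):

# list available VPCs
aws ec2 describe-vpcs --query 'Vpcs[*].[VpcId,CidrBlock]' --output table

# list subnets that assign public IPs on launch (candidates for the runner)
aws ec2 describe-subnets \
  --filters "Name=map-public-ip-on-launch,Values=true" \
  --query 'Subnets[*].[SubnetId,AvailabilityZone]' \
  --output table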

Key improvements in this updated Terraform configuration:

  • EC2 Fleet instead of single spot instance: Uses aws_ec2_fleet with multiple instance types and availability zones for maximum availability and cost optimization
  • CloudWatch integration: Creates a dedicated log group (/aws/ec2/gitlab-runner) that works with the CloudWatch agent installed by the bootstrap script
  • IAM instance profile: Allows the EC2 instance to send logs and metrics to CloudWatch without hardcoded credentials
  • Launch template architecture: Separates instance configuration from fleet management, making updates easier
  • Multi-instance-type strategy: Tries multiple NVMe-equipped instance types (c5ad.4xlarge, c6id.4xlarge, g4ad.4xlarge, c5d.4xlarge) across 3 AZs for better spot availability
  • Pure Terraform scheduler: Replaces external module with inline Python Lambda function for stop/start scheduling, reducing dependencies and improving maintainability
  • Cost optimization: Uses lowestPrice allocation strategy to always select the cheapest available spot instance

Here is the gitlab-runner.tf file that contains the Terraform configuration:

################################################################################
# Gitlab Runner EC2 Fleet (multi-AZ, multi-instance-type for better availability)
################################################################################

# CloudWatch Log Group for runner logs
resource "aws_cloudwatch_log_group" "runner_logs" {
  name              = "/aws/ec2/gitlab-runner"
  retention_in_days = 7
}

resource "aws_security_group" "in-ssh-out-all" {
  name   = "in-ssh-out-all"
  vpc_id = module.vpc.vpc_id
  ingress {
    cidr_blocks = [
      "0.0.0.0/0"
    ]
    from_port = 22
    to_port   = 22
    protocol  = "tcp"
  } // Terraform removes the default rule
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Launch template with common configuration for all instances
resource "aws_launch_template" "gitlab-runner" {
  name_prefix = "gitlab-runner-"
  image_id    = "ami-03446a3af42c5e74e" # Ubuntu 24.04 LTS amd64, built on 2025-12-12. From https://cloud-images.ubuntu.com/locator/ec2/
  key_name    = "my-key"
  user_data   = filebase64("aws-ec2-init-nvme-and-gitlab-runner.sh")

  iam_instance_profile {
    name = aws_iam_instance_profile.runner_instance.name
  }

  network_interfaces {
    associate_public_ip_address = true
    security_groups             = [aws_security_group.in-ssh-out-all.id]
    delete_on_termination       = true
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(
      local.tags,
      {
        Name      = "steroid-runner"
        Scheduled = "working-hours"
      }
    )
  }

  tag_specifications {
    resource_type = "volume"
    tags = merge(
      local.tags,
      {
        Name = "steroid-runner-volume"
      }
    )
  }
}

# EC2 Fleet with multiple instance types and AZs for maximum availability
resource "aws_ec2_fleet" "gitlab-runner" {

  target_capacity_specification {
    default_target_capacity_type = "spot"
    total_target_capacity        = 1 # Only 1 instance needed
    spot_target_capacity         = 1
  }

  # Try all instance types across all AZs (cheapest first)
  # Priority 01/2026 x86_64 (spot average €/month, % discount vs on-demand):
  # c5ad.4xlarge: 249€/mo avg (-48%) < c6id.4xlarge: 266€/mo avg (-52%) < g4ad.4xlarge: 269€/mo avg (-54%) < c5d.4xlarge: 288€/mo avg (-46%)
  launch_template_config {
    launch_template_specification {
      launch_template_id = aws_launch_template.gitlab-runner.id
      version            = "$Latest"
    }

    # c5ad.4xlarge: 249€/mo avg spot (32GB RAM, 600GB NVMe SSD, -48% vs on-demand 478€)
    override {
      instance_type = "c5ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[0]
      priority      = 1.0
    }
    override {
      instance_type = "c5ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[1]
      priority      = 1.1
    }
    override {
      instance_type = "c5ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[2]
      priority      = 1.2
    }

    # c6id.4xlarge: 266€/mo avg spot (32GB RAM, 950GB NVMe SSD, -52% vs on-demand 558€)
    override {
      instance_type = "c6id.4xlarge"
      subnet_id     = module.vpc.public_subnets[0]
      priority      = 1.3
    }
    override {
      instance_type = "c6id.4xlarge"
      subnet_id     = module.vpc.public_subnets[1]
      priority      = 1.4
    }
    override {
      instance_type = "c6id.4xlarge"
      subnet_id     = module.vpc.public_subnets[2]
      priority      = 1.5
    }

    # g4ad.4xlarge: 269€/mo avg spot (64GB RAM, 600GB NVMe SSD, -54% vs on-demand 590€)
    override {
      instance_type = "g4ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[0]
      priority      = 1.6
    }
    override {
      instance_type = "g4ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[1]
      priority      = 1.7
    }
    override {
      instance_type = "g4ad.4xlarge"
      subnet_id     = module.vpc.public_subnets[2]
      priority      = 1.8
    }

    # c5d.4xlarge: 288€/mo avg spot (32GB RAM, 400GB NVMe SSD, -46% vs on-demand 532€)
    override {
      instance_type = "c5d.4xlarge"
      subnet_id     = module.vpc.public_subnets[0]
      priority      = 1.9
    }
    override {
      instance_type = "c5d.4xlarge"
      subnet_id     = module.vpc.public_subnets[1]
      priority      = 2.0
    }
    override {
      instance_type = "c5d.4xlarge"
      subnet_id     = module.vpc.public_subnets[2]
      priority      = 2.1
    }
  }

  spot_options {
    allocation_strategy            = "lowestPrice" # Strict price priority (respects override order)
    instance_interruption_behavior = "terminate"
    instance_pools_to_use_count    = 1 # Only use the cheapest pool at a time
  }

  terminate_instances                 = true
  terminate_instances_with_expiration = false
  valid_until                         = "2030-01-01T00:00:00Z"
  replace_unhealthy_instances         = true
  type                                = "maintain" # Maintain target capacity

  tags = merge(
    local.tags,
    {
      Name      = "steroid-runner-fleet"
      Scheduled = "working-hours"
    }
  )
}

################################################################################
# Stop/Start scheduler with pure Terraform (EventBridge + Lambda)
################################################################################

# Lambda function to stop/start the fleet by modifying its capacity
resource "aws_lambda_function" "scheduler" {
  function_name = "runner-scheduler"
  role          = aws_iam_role.scheduler_lambda.arn
  handler       = "index.handler"
  runtime       = "python3.12"
  timeout       = 60

  filename         = data.archive_file.scheduler_lambda.output_path
  source_code_hash = data.archive_file.scheduler_lambda.output_base64sha256

  environment {
    variables = {
      FLEET_ID = aws_ec2_fleet.gitlab-runner.id
    }
  }
}

# Create Lambda deployment package
data "archive_file" "scheduler_lambda" {
  type        = "zip"
  output_path = "${path.module}/.terraform/scheduler-lambda.zip"

  source {
    content  = <<-EOF
import boto3
import os

ec2 = boto3.client('ec2')

def handler(event, context):
    action = event.get('action', 'stop')
    fleet_id = os.environ.get('FLEET_ID')

    print(f"Fleet ID: {fleet_id}")
    print(f"Action: {action}")

    if not fleet_id:
        return {'statusCode': 400, 'body': 'FLEET_ID environment variable not set'}

    # Get current fleet status
    try:
        fleet_response = ec2.describe_fleets(FleetIds=[fleet_id])
        if not fleet_response['Fleets']:
            return {'statusCode': 404, 'body': f'Fleet {fleet_id} not found'}

        fleet = fleet_response['Fleets'][0]
        current_target = fleet['TargetCapacitySpecification']['TotalTargetCapacity']
        print(f"Current target capacity: {current_target}")
    except Exception as e:
        print(f"Error describing fleet: {e}")
        return {'statusCode': 500, 'body': f'Error: {str(e)}'}

    # Modify fleet capacity based on action
    try:
        if action == 'stop':
            # Set capacity to 0 to terminate instances
            print(f"Setting fleet capacity to 0")
            ec2.modify_fleet(
                FleetId=fleet_id,
                TargetCapacitySpecification={'TotalTargetCapacity': 0}
            )
            return {'statusCode': 200, 'body': f'Fleet capacity set to 0 (instances will terminate)'}
        elif action == 'start':
            # Set capacity to 1 to launch instance
            print(f"Setting fleet capacity to 1")
            ec2.modify_fleet(
                FleetId=fleet_id,
                TargetCapacitySpecification={'TotalTargetCapacity': 1}
            )
            return {'statusCode': 200, 'body': f'Fleet capacity set to 1 (instance will launch)'}
        else:
            return {'statusCode': 400, 'body': f'Unknown action: {action}'}
    except Exception as e:
        print(f"Error modifying fleet: {e}")
        return {'statusCode': 500, 'body': f'Error: {str(e)}'}
EOF
    filename = "index.py"
  }
}

# CloudWatch Log Group for Lambda
resource "aws_cloudwatch_log_group" "scheduler_lambda" {
  name              = "/aws/lambda/${aws_lambda_function.scheduler.function_name}"
  retention_in_days = 7
}

# EventBridge rule to stop runner nightly
resource "aws_cloudwatch_event_rule" "stop_runner" {
  name                = "stop-runner-nightly"
  description         = "Stop runner daily (see var.stop_schedule)"
  schedule_expression = var.stop_schedule
}

resource "aws_cloudwatch_event_target" "stop_runner" {
  rule      = aws_cloudwatch_event_rule.stop_runner.name
  target_id = "StopRunnerLambda"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}

resource "aws_lambda_permission" "allow_eventbridge_stop" {
  statement_id  = "AllowExecutionFromEventBridgeStop"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_runner.arn
}

# EventBridge rule to start runner daily on working days
resource "aws_cloudwatch_event_rule" "start_runner" {
  name                = "start-runner-daily"
  description         = "Start runner on working days (see var.start_schedule)"
  schedule_expression = var.start_schedule
}

resource "aws_cloudwatch_event_target" "start_runner" {
  rule      = aws_cloudwatch_event_rule.start_runner.name
  target_id = "StartRunnerLambda"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "start"
  })
}

resource "aws_lambda_permission" "allow_eventbridge_start" {
  statement_id  = "AllowExecutionFromEventBridgeStart"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_runner.arn
}

By default, the runner starts at 06:00 UTC on weekdays and stops at 18:00 UTC every day. Feel free to adjust the cron expressions (stop_schedule and start_schedule in variables.tf) according to your requirements.
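You can also exercise the scheduler without waiting for the cron triggers, then confirm that the fleet capacity changed. A sketch (the function name comes from the Terraform above; the fleet ID is a placeholder, retrieve yours from the Terraform state or the console):

# manually trigger a stop (AWS CLI v2 requires the raw payload format)
aws lambda invoke \
  --function-name runner-scheduler \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action":"stop"}' \
  /dev/stdout

# verify the fleet target capacity (replace with your fleet ID)
aws ec2 describe-fleets \
  --fleet-ids fleet-0123456789abcdef0 \
  --query 'Fleets[0].TargetCapacitySpecification'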

Here is the iam.tf file that defines the IAM roles and policies:

################################################################################
# IAM Roles and Policies for GitLab Runner Infrastructure
################################################################################

# Attach AWS managed policy for CloudWatch Agent metrics
resource "aws_iam_role_policy_attachment" "runner_cloudwatch_agent" {
  role       = aws_iam_role.runner_instance.name
  policy_arn = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}

# IAM role for EC2 instances to send logs to CloudWatch
resource "aws_iam_role" "runner_instance" {
  name = "gitlab-runner-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

# IAM policy for CloudWatch Logs
resource "aws_iam_role_policy" "runner_cloudwatch_logs" {
  name = "gitlab-runner-cloudwatch-logs"
  role = aws_iam_role.runner_instance.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogStreams"
        ]
        Resource = "arn:aws:logs:*:*:log-group:/aws/ec2/gitlab-runner:*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:DescribeTags"
        ]
        Resource = "*"
      }
    ]
  })
}

# Instance profile to attach IAM role to EC2
resource "aws_iam_instance_profile" "runner_instance" {
  name = "gitlab-runner-instance-profile"
  role = aws_iam_role.runner_instance.name
}

################################################################################
# IAM for Lambda Scheduler
################################################################################

# IAM role for Lambda execution
resource "aws_iam_role" "scheduler_lambda" {
  name = "runner-scheduler-lambda-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "lambda.amazonaws.com"
      }
    }]
  })
}

# IAM policy for Lambda to manage EC2 instances
resource "aws_iam_role_policy" "scheduler_lambda" {
  name = "runner-scheduler-lambda-policy"
  role = aws_iam_role.scheduler_lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "arn:aws:logs:*:*:*"
      },
      {
        Effect = "Allow"
        Action = [
          "ec2:DescribeFleets",
          "ec2:ModifyFleet"
        ]
        Resource = "*"
      }
    ]
  })
}
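To check that the instance profile is effective, run a CloudWatch Logs call from the EC2 instance itself: it should succeed without any static credentials. A minimal sketch, assuming the AWS CLI is installed on the instance:

# from the runner instance: authorized through the instance role, no access keys needed
aws logs describe-log-streams \
  --log-group-name /aws/ec2/gitlab-runner \
  --query 'logStreams[*].logStreamName'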

And the variables.tf file to customize your deployment:

variable "region" {
  description = "Cluster region"
  default     = "eu-west-1" # Ireland - adjust to your preferred region
}

variable "local_aws_profile" {
  description = "local AWS profile used for provisioning"
  default     = "zenika"
}

variable "client" {
  description = "Client"
  default     = "gitlab-runner"
}

variable "stop_schedule" {
  description = "Cron expression to stop the runner (UTC timezone)"
  type        = string
  default     = "cron(0 18 ? * MON-SUN *)" # 18:00 UTC = 19h/20h Paris time, every day
}

variable "start_schedule" {
  description = "Cron expression to start the runner (UTC timezone)"
  type        = string
  default     = "cron(0 06 ? * MON-FRI *)" # 06:00 UTC = 7h/8h Paris time, weekdays only
}

Once you have created and adapted the configuration, follow these steps:

  1. Run terraform init to initialize the Terraform configuration.
  2. Run terraform apply to apply the configuration and deploy the infrastructure.

With these commands, Terraform will handle the deployment process, and your autonomous architecture will be up and running in no time.
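For reference, here is the complete workflow with example variable overrides (any variable declared in variables.tf can be overridden the same way):

terraform init
terraform plan -var 'region=eu-west-1'
terraform apply -var 'region=eu-west-1' -var 'stop_schedule=cron(0 17 ? * MON-SUN *)'

# tear everything down when the runner is no longer needed
terraform destroy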

(Illustration: mechanical humanoid orange fox, muscular, fur, cyberpunk background)

Illustrations generated locally by DiffusionBee using FLUX.1-schnell model



This article was enhanced with the assistance of an AI language model to ensure clarity and accuracy in the content, as English is not my native language.
