Teldar

Scale-to-Zero AI/ML Infrastructure on Google Cloud

Industry:

AI & Machine Learning Consulting

Core Technologies:

Google Cloud Platform
NVIDIA H100 GPUs
Terraform IaC
Google Cloud Functions
Spot Virtual Machines

Background

Teldar is a Swiss AI consultancy that handles demanding machine learning workloads. Because its projects require high-end hardware such as NVIDIA H100 GPUs, keeping that infrastructure running 24/7 is financially impractical. Teldar needed a way to access peak computing power exactly when a model needs to run, without paying for the hardware while it sits idle.

Challenges

Teldar faced several challenges managing GPU-intensive workloads in the cloud:

Idle GPU Costs: Premium A3 instances, the Compute Engine machine family that carries NVIDIA H100 GPUs, kept accruing charges during downtime between workloads.

Manual Provisioning: Spinning up specialized GPU VMs manually slowed down project execution.

Operational Overhead: Developers were spending time managing infrastructure instead of building models.

Deployment Consistency: Ensuring identical environments across runs was difficult without automation.

Solutions

Zazmic designed and implemented an automated, event-driven scale-to-zero architecture on Google Cloud.

Automated GPU Lifecycle Management: Google Cloud Functions automatically provision and terminate GPU-enabled VMs based on workload demand, ensuring resources exist only when needed.

Spot GPU Optimization: The platform runs on GCP Spot VMs, which are priced below on-demand capacity in exchange for the possibility of preemption, giving access to NVIDIA H100 performance at a significantly reduced cost.

Infrastructure as Code: Terraform defines all cloud resources, providing repeatable, reliable deployments with minimal manual effort.

Production Handover: Zazmic delivered detailed runbooks and documentation, enabling Teldar to manage the platform independently.
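The lifecycle and Spot items above can be illustrated with a minimal sketch. It is not Zazmic's implementation: the function names, the FakeCompute class, the "pending jobs" trigger, and the a3-highgpu-8g shape are all assumptions made for illustration. The request body uses field names from the Compute Engine REST instance resource as we understand them, so check them against the current API reference before relying on them. An in-memory stand-in replaces the real API so the example runs without cloud credentials.

```python
from dataclasses import dataclass, field


def spot_gpu_instance_body(name, zone, source_image, machine_type="a3-highgpu-8g"):
    """Build an instances.insert-style request body for a Spot GPU VM.

    machine_type is an assumed A3 shape; source_image is supplied by the caller.
    """
    return {
        "name": name,
        "machineType": f"zones/{zone}/machineTypes/{machine_type}",
        "scheduling": {
            "provisioningModel": "SPOT",            # request Spot capacity
            "instanceTerminationAction": "DELETE",  # remove the VM on preemption
            "onHostMaintenance": "TERMINATE",       # GPU VMs cannot live-migrate
            "automaticRestart": False,
        },
        "disks": [{
            "boot": True,
            "autoDelete": True,
            "initializeParams": {"sourceImage": source_image},
        }],
        "networkInterfaces": [{"network": "global/networks/default"}],
    }


@dataclass
class FakeCompute:
    """Hypothetical in-memory stand-in for a Compute Engine client."""
    instances: dict = field(default_factory=dict)

    def exists(self, name):
        return name in self.instances

    def create(self, body):
        self.instances[body["name"]] = body

    def delete(self, name):
        self.instances.pop(name, None)


def reconcile(client, body, pending_jobs):
    """Create the VM when work is waiting; delete it when the queue is empty."""
    exists = client.exists(body["name"])
    if pending_jobs > 0 and not exists:
        client.create(body)
        return "created"
    if pending_jobs == 0 and exists:
        client.delete(body["name"])
        return "deleted"
    return "unchanged"


if __name__ == "__main__":
    compute = FakeCompute()
    # Zone and image path below are placeholder values.
    body = spot_gpu_instance_body("gpu-worker", "example-zone", "projects/example/global/images/example")
    for jobs in (2, 2, 0):
        print(jobs, reconcile(compute, body, jobs))
```

In a deployed version, a function like reconcile would be the body of the event handler, and FakeCompute would be swapped for a real client. Keeping the decision logic free of cloud calls makes it testable offline, and because reconcile is idempotent, a repeated or retried event does not create a second VM.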

Outcomes

The new platform transformed how Teldar runs high-demand AI workloads:

Significant Cost Reduction

GPU compute costs are incurred only while models are actively running.

Faster Time to Execution

Automated provisioning removes delays associated with manual setup.

Operational Autonomy

Teldar’s team manages the infrastructure without ongoing external support.

Production-Ready Stability

Consistent, predictable environments reduce operational risk and technical debt.
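The cost outcome above can be made concrete with a small back-of-the-envelope model. The function names and the 730-hour month are assumptions for illustration, and no figures here are Teldar's. The model also ignores persistent disk charges, VM startup time, and the additional Spot discount, so it understates some costs and some savings.

```python
def monthly_gpu_cost(hourly_rate, billed_hours):
    """Compute cost for one VM billed for the given number of hours."""
    return hourly_rate * billed_hours


def savings_fraction(active_hours, hours_in_month=730.0):
    """Share of always-on spend avoided by billing only for active hours.

    The hourly rate cancels out, so the result depends only on utilization.
    """
    if not 0 <= active_hours <= hours_in_month:
        raise ValueError("active_hours must be between 0 and hours_in_month")
    return 1.0 - active_hours / hours_in_month


if __name__ == "__main__":
    # Placeholder inputs: a made-up hourly rate and 73 active hours per month.
    rate, active = 10.0, 73.0
    print("always on:    ", monthly_gpu_cost(rate, 730.0))
    print("scale to zero:", monthly_gpu_cost(rate, active))
    print("fraction saved:", savings_fraction(active))
```

The useful property is that the saving tracks utilization directly: a GPU VM that is busy a tenth of the month avoids roughly nine tenths of the always-on bill, whatever the hourly rate.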

Conclusion

By shifting to an automated scale-to-zero architecture, Teldar can run demanding AI workloads without the burden of fixed GPU costs. The solution turns high-end infrastructure into an on-demand utility, allowing the team to scale efficiently, control spend, and focus fully on delivering AI innovation.
