
Artificial Intelligence

The AI revolution is here, and with it comes unprecedented demands on network infrastructure. From training massive language models to serving real-time inference, AI workloads require high-performance networking, robust security, and deep observability. Kubernetes has evolved from just a platform for running workloads like web services and microservices to the ideal platform for supporting the end-to-end lifecycle of large artificial intelligence (AI) and machine learning (ML) workloads.

Cilium is the cloud native solution for providing, securing, and observing network connectivity between workloads, empowering AI workloads to thrive in cloud native environments. Organizations ranging from artificial intelligence research institutions that build sophisticated training models to financial institutions and startups rely on Cilium to support their distributed AI/ML workloads on Kubernetes.


Fulfilling the Networking Demands of AI/ML Workloads

AI workloads typically involve massive data transfers and require ultra-low latency along with high-throughput, high-bandwidth networking. Traditional networking solutions struggle to keep up, leading to bottlenecks and inefficiencies.

Cilium leverages eBPF to deliver kernel-level networking performance, eliminating much of the overhead of traditional Linux networking. With features like eXpress Data Path (XDP) and BIG TCP, Cilium delivers the high-throughput, low-latency networking that AI/ML workloads need. Cilium also provides a comprehensive networking toolset for deploying AI/ML models, including load balancers, ingress controllers, network policies, egress gateway, service mesh, and more. These features facilitate the seamless deployment of AI/ML workloads and their integration into services and applications.
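As a rough illustration of how the XDP and BIG TCP features mentioned above might be switched on, the sketch below builds a small Helm values overlay for the Cilium chart and prints it as JSON (which Helm can read as a values file, since JSON is valid YAML). The key names `enableIPv4BIGTCP`, `enableIPv6BIGTCP`, and `loadBalancer.acceleration` are taken from the Cilium Helm chart as we understand it; the helper name `bigtcp_xdp_values` is hypothetical. Prerequisites such as kernel version, NIC driver support, native routing, and kube-proxy replacement are assumed to be satisfied and should be checked against the documentation for the installed Cilium version.

```python
import json


def bigtcp_xdp_values(ipv4: bool = True, ipv6: bool = True) -> dict:
    """Build a Helm values overlay (hypothetical helper, not part of Cilium).

    Assumes the cluster already meets the prerequisites for BIG TCP and
    XDP acceleration; this function only assembles the settings.
    """
    return {
        # BIG TCP lets the stack handle larger aggregated packets per flow.
        "enableIPv4BIGTCP": ipv4,
        "enableIPv6BIGTCP": ipv6,
        # "native" requests XDP-based load-balancer acceleration in the NIC driver.
        "loadBalancer": {"acceleration": "native"},
    }


values = bigtcp_xdp_values()
print(json.dumps(values, indent=2))
```

The printed document could be saved to a file and passed to `helm upgrade` with `-f`, alongside whatever other values the installation already uses.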

"Managing connections across different cloud storage providers was cumbersome. We needed a solution that could simplify this complexity and reduce costs. By leveraging Cilium’s eBPF capabilities, we maintained high-performance networking while completely removing unpredictable traffic costs. Speed is everything in our business; if we can move goods efficiently and forecast demand accurately, we win. Cilium has become a cornerstone of our strategy. It’s not just a tool; it’s a game-changer."
George Zubrienko, Data & AI Lead Platform Engineer, Ecco

Ecco is a global leader in shoe production and retail. The company’s IT infrastructure is almost entirely Kubernetes-based and designed to facilitate machine learning (ML) workflows that enable intelligent decision-making and supply chain management.


Robust Security for AI/ML Models and Data

AI models are the result of significant investment in research and infrastructure. Protecting these models and the sensitive data they process is non-negotiable for enterprises. Traditional security solutions often lack the granularity and scalability for dynamic, cloud native environments.

Cilium provides robust security features that enhance Kubernetes security. These features include zero-trust security with identity-aware security policies, mutual authentication, and advanced network policies. Cilium supports native HTTP and DNS protocol enforcement, ensuring only authorized services can access endpoints. Cilium’s Transparent Encryption (using IPsec or WireGuard) effortlessly encrypts data in transit, safeguarding intellectual property and easing compliance.
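To make the identity-aware, HTTP-level enforcement described above more concrete, here is a minimal sketch that assembles a `CiliumNetworkPolicy` manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The workload labels (`model-server`, `api-gateway`), the port, and the `/v1/predict` path are hypothetical placeholders, and `inference_policy` is an illustrative helper rather than anything Cilium ships. The policy allows only pods labeled as the client to send `POST` requests to the given path on the model server.

```python
import json


def inference_policy(app: str, client: str, port: int, path: str) -> dict:
    """Build a CiliumNetworkPolicy allowing one client to POST to one path.

    All label values passed in are hypothetical examples.
    """
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": f"{app}-allow-{client}"},
        "spec": {
            # The policy applies to pods carrying this label.
            "endpointSelector": {"matchLabels": {"app": app}},
            "ingress": [
                {
                    # Only pods with the client's label identity may connect.
                    "fromEndpoints": [{"matchLabels": {"app": client}}],
                    "toPorts": [
                        {
                            # Port values are strings in the policy schema.
                            "ports": [{"port": str(port), "protocol": "TCP"}],
                            # L7 rule: restrict traffic to this method and path.
                            "rules": {"http": [{"method": "POST", "path": path}]},
                        }
                    ],
                }
            ],
        },
    }


policy = inference_policy("model-server", "api-gateway", 8080, "/v1/predict")
print(json.dumps(policy, indent=2))
```

Because selection is by label rather than IP address, the rule keeps applying as pods are rescheduled or scaled.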


Deep Observability for AI/ML Infrastructure and Workloads

AI/ML workflows can be incredibly complex, with data flowing across multiple services and clusters. Monitoring performance, debugging issues, or optimizing resource usage without deep observability becomes a hassle. Cilium’s Hubble observability platform provides granular insights into network traffic, API calls, and service dependencies. You can monitor DNS performance, HTTP latency, and error rates with real-time metrics, ensuring AI/ML workloads run smoothly.

"Spark was the other reason that Cilium became a killer feature that we needed to roll out across every cloud. Spark is a great tool, but sometimes their built in encryption will fail at random. Statistically, at some point it will crash so if you’re dealing with a 12 hour job, it’s gonna fail on hour 11 and that is a terrible thing to try to explain to the customer. Cilium with IPsec doesn’t have that problem. Why have Spark be doing encryption when what we really want Spark to be doing is processing data. We chose to have a reasonable isolation of priorities and responsibilities and have Spark be focused on data processing and have the network layer that is responsible for encryption."
Joe Stevens, Member of the Technical Staff, Ascend.io

Ascend.io uses Cilium Transparent Encryption to support their use of Apache Spark, the analytics engine for large-scale data processing.


Scaling AI/ML Infrastructure Efficiently

AI workloads are resource intensive and often experience fluctuating demands. Scaling infrastructure to meet these demands without overspending or inefficiently allocating resources is a significant challenge for operators. The ability to scale up and down depending on the resource demand is one of the most significant advantages Kubernetes brings to AI/ML. Cilium further enhances this advantage, empowering you to scale AI workloads efficiently. Cilium’s advanced load balancing and traffic management capabilities ensure your AI applications can scale dynamically without disruption. By optimizing resource allocation and reducing overhead, Cilium helps you maximize the ROI of your AI/ML investments.


Seamless Management of Multi-Cluster and Hybrid Cloud Environments

AI workloads often span multiple clusters, clouds, and on-premises environments. Managing networking, observability, and security across these heterogeneous environments can be an operational nightmare. Cilium integrates seamlessly across environments, providing a unified networking, observability, and security layer. By abstracting away the underlying infrastructure, it provides a consistent and reliable experience for your entire AI infrastructure. Cilium Cluster Mesh allows multiple clusters to be joined into one large unified network, regardless of the Kubernetes distribution or location where each is running. Cilium host firewall extends Kubernetes' declarative, policy-driven security model to the nodes hosting your workloads, delivering seamless, consistent protection across your entire environment.
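One way Cluster Mesh is commonly used is to share a service across clusters. The sketch below builds a Kubernetes `Service` manifest carrying the `service.cilium.io/global` annotation, which, to our understanding, marks the service as global when a Service with the same name and namespace exists in each meshed cluster. The service name, namespace, and port are hypothetical, `global_service` is an illustrative helper, and the annotation name should be confirmed against the Cluster Mesh documentation for the Cilium version in use.

```python
import json


def global_service(name: str, namespace: str, port: int) -> dict:
    """Build a Service manifest annotated as a Cluster Mesh global service.

    The name, namespace, and port are hypothetical example values.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Annotation assumed to enable cross-cluster load balancing.
            "annotations": {"service.cilium.io/global": "true"},
        },
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": port, "protocol": "TCP"}],
        },
    }


service = global_service("model-server", "inference", 8080)
print(json.dumps(service, indent=2))
```

Applying the same manifest in every participating cluster is what lets backends from all of them serve requests for the shared name.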


See Real World Stories of Companies using Cilium for AI/ML


Building the core fabric of accelerated hybrid AI clusters using Cilium

Backend.ai switched from Docker's own overlay network driver to Cilium. As a result, they observed significant throughput and latency improvements in specific inter-container networking scenarios, including an application proxy that auto-scales and load balances ML inference traffic at large scale.


Meltwater's Live Migration to Cilium for Richer Features

Meltwater is a global leader in media, social and consumer intelligence. They have been building machine learning models for nearly 20 years and use AI at the heart of their operations for use cases such as natural language processing, speech processing, clustering and summarization, and more.
