Dyna Robotics Turbocharges Foundation Model Training with Alluxio Across Multi-Cloud Infrastructure
Dyna Robotics, a cutting-edge robotics company, improved its foundation model training performance by deploying Alluxio as a distributed caching and data access layer across GCP and Together AI. Facing I/O bottlenecks caused by high-volume video ingestion, limited NFS throughput, and the need to scale GPU infrastructure beyond a single cloud provider, DYNA adopted Alluxio to eliminate PyTorch DataLoader starvation and unlock compute-bound, fault-tolerant model training across their distributed GPU clusters. This significantly accelerated model iteration, a core business driver for DYNA, directly enabling faster commercial deployment and greater market agility.
About Dyna Robotics
Dyna Robotics is pioneering the next generation of embodied artificial intelligence, dedicated to making AI-powered robotic automation accessible to businesses of all sizes. The company’s mission is to empower businesses by automating repetitive tasks with intelligent robotic arms. With their breakthrough robotics foundation model, Dynamism v1 (DYNA-1), they enable robots to perform complex, dexterous tasks with commercial-grade precision and reliability.
(Source: Dynamism v1 (DYNA-1) Model: A Breakthrough in Performance and Production-Ready Embodied AI)
Dyna Robotics' initial product has already achieved commercialization with live customer deployments. Dyna Robotics continues to improve its foundational model by building efficient data collection infrastructure to collect data from both human demonstrations and real-world deployments.
The Challenge
Dyna Robotics' training pipeline is data-intensive, relying heavily on video demonstration clips captured during human teleoperation and robot deployments. Like many robotics companies, DYNA generates tens of terabytes of new video data daily, which is first ingested into Google Cloud Storage (GCS).
To make this data accessible to training jobs, a cron-based process synchronized files hourly from GCS to an on-premises NFS server. This architecture introduced significant limitations:
Severe I/O Starvation: The NFS backend was unable to sustain the high-throughput, low-latency demands of the PyTorch DataLoader when processing massive volumes of small video and image files. Prefetch queues frequently ran empty, starving the training pipeline and leaving GPUs underutilized.
Wasted GPU Cycles: With training jobs blocked on I/O, workloads became I/O-bound rather than compute-bound. Training slowdowns of 30% or more were common when NFS became overloaded. This led to inefficient GPU utilization and inflated cloud costs, undermining the economics of scaled model development.
Operational Complexity: The hourly GCS-to-NFS synchronization introduced a duplicated data path and created a chokepoint. The NFS layer itself added unnecessary operational overhead, and DYNA was hitting GCP's 64TB persistent disk limit, forcing the team to manually shard data across multiple NFS servers — a time-consuming and error-prone process. Meanwhile, fast SSD resources were underutilized, sitting idle on the GPU servers.
Multi-Cloud Scalability Needs: As DYNA's training demands grew, they needed to expand beyond a single cloud provider to access more GPUs at competitive pricing and avoid vendor lock-in. However, accessing training data from GCS across cloud providers would incur prohibitive egress fees, creating an additional barrier to multi-cloud deployment.
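The I/O starvation described above can be illustrated with a toy producer/consumer model of a DataLoader prefetch queue (illustrative only; the timings are hypothetical, not DYNA's measurements):

```python
import queue
import threading
import time

def simulate(io_time, step_time, n_batches=20, prefetch=4):
    """Toy model of a DataLoader prefetch queue: one I/O worker
    fills the queue, the training loop drains it."""
    q = queue.Queue(maxsize=prefetch)

    def loader():
        for i in range(n_batches):
            time.sleep(io_time)      # time to read one batch from storage
            q.put(i)

    threading.Thread(target=loader, daemon=True).start()

    idle = 0.0
    for _ in range(n_batches):
        t0 = time.time()
        q.get()                      # blocks whenever the queue ran empty
        idle += time.time() - t0     # "GPU" time spent waiting on data
        time.sleep(step_time)        # compute time for one training step
    return idle

# When per-batch I/O is slower than compute, the prefetch queue drains
# and the training loop spends part of every step waiting on data.
starved = simulate(io_time=0.02, step_time=0.01)
healthy = simulate(io_time=0.005, step_time=0.01)
```

In the first run the loop is I/O-bound (idle time accumulates every step); in the second, the queue stays full and the loop is compute-bound, which is the condition a faster caching layer restores.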
These infrastructure limitations triggered a cascading impact on business velocity. Training bottlenecks delayed model iteration, which slowed product refinement, impeded downstream application development, and ultimately pushed out commercial rollout timelines and revenue realization. In a competitive robotics landscape, this latency in iteration risked eroding DYNA's first-mover advantage.
The Solution: Alluxio as a Multi-Cloud Distributed Caching Layer
To address critical I/O bottlenecks, eliminate operational overhead, and enable cost-effective multi-cloud scaling, Dyna Robotics adopted Alluxio as a distributed caching and unified data access layer across its GPU training infrastructure on both Google Cloud Platform (GCP) and Together AI.

High-Throughput Cluster-Wide Caching: DYNA deployed Alluxio directly on their NVIDIA H100 GPU training clusters to form a shared, distributed cache using existing but underutilized local SSDs (~88TB total cache capacity per cluster). This approach avoided additional hardware spend while significantly boosting I/O throughput, resolving DataLoader starvation, and ensuring training pipelines are always compute-bound.
Multi-Cloud Architecture with Zero-Egress-Fee Storage: To enable efficient training across multiple cloud GPU providers (GCP and Together AI) without incurring expensive data transfer costs, DYNA implemented a multi-cloud architecture built on Alluxio:
- Alluxio is deployed alongside the training clusters in both GCP and Together AI
- The GCP training cluster uses Alluxio to cache and read datasets from GCS
- Training data is copied once from GCS to Cloudflare R2, which serves as an intermediary storage layer, providing zero-egress-fee storage
- The Together AI training cluster uses Alluxio to read and cache datasets from Cloudflare R2
- This setup allows DYNA to efficiently share and access the same training dataset across two different cloud environments without incurring high data transfer costs
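As a sketch, the one-time GCS-to-R2 copy in this data path could be driven by a tool such as rclone, since R2 exposes an S3-compatible API (the remote and bucket names below are placeholders; the source does not specify DYNA's tooling):

```shell
# Hypothetical rclone remotes: "gcs" (Google Cloud Storage, authenticated
# with a service-account key) and "r2" (Cloudflare R2, configured as an
# S3-compatible remote pointing at the account-specific R2 endpoint).
rclone copy gcs:training-data r2:training-data --transfers 32 --checksum
```

Because the copy happens once, subsequent reads from the Together AI cluster hit R2 (zero egress fees) or the local Alluxio cache, never GCS.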
Optimized for Small File Access: Alluxio's architecture is well-suited for workloads with massive numbers of small files and high metadata pressure, such as DYNA's training datasets composed of short video clips stored in HDF5 format. This delivered much higher and more stable read performance than traditional NFS.
Seamless PyTorch Integration: With a POSIX-compliant FUSE interface, Alluxio dropped in as a replacement for NFS without requiring code changes. DYNA's team saw immediate performance gains without modifying their PyTorch training pipelines.
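To illustrate the drop-in property (the paths and class below are hypothetical, not DYNA's code), a PyTorch-style dataset that reads through plain POSIX calls needs only its root path changed when the NFS mount is swapped for an Alluxio FUSE mount:

```python
from pathlib import Path

# Hypothetical mount points: swapping NFS for Alluxio FUSE changes only
# the dataset root -- the read code itself is untouched.
NFS_ROOT = "/mnt/nfs/datasets"          # before
ALLUXIO_ROOT = "/mnt/alluxio/datasets"  # after (POSIX view of GCS/R2)

class ClipDataset:
    """Minimal stand-in for a PyTorch-style Dataset over video clips."""

    def __init__(self, root):
        # Recursively index clip files under the mount point.
        self.paths = sorted(Path(root).glob("**/*.h5"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # Plain POSIX read; behaves identically on NFS or a FUSE mount.
        return self.paths[i].read_bytes()
```

Instantiating `ClipDataset(ALLUXIO_ROOT)` instead of `ClipDataset(NFS_ROOT)` is the entire migration from the application's point of view.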
Operational Simplification: Alluxio's centralized credential management meant DYNA only needed to configure access to object stores on the Alluxio master, not on every individual GPU worker node. This eliminated the complex manual management of multiple NFS servers and data sharding that previously consumed significant engineering time.
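As a sketch of what this centralized setup can look like (mount paths, bucket names, and credential values below are placeholders, and exact option names vary by Alluxio version), under-stores are mounted once on the Alluxio master, and every worker and FUSE client inherits access without per-node credentials:

```shell
# Mount GCS and R2 once on the Alluxio master; no credentials are
# configured on individual GPU worker nodes. All names are placeholders.
alluxio fs mount \
  --option fs.gcs.credential.path=/etc/alluxio/gcs-key.json \
  /gcs gs://training-bucket/

alluxio fs mount \
  --option s3a.accessKeyId=<R2_ACCESS_KEY> \
  --option s3a.secretKey=<R2_SECRET_KEY> \
  --option alluxio.underfs.s3.endpoint=<ACCOUNT_ID>.r2.cloudflarestorage.com \
  /r2 s3://training-bucket/
```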
Operational Robustness and Fallback Handling: Because Alluxio is co-located with GPU servers, node failures could take cache workers offline. However, Alluxio's graceful fallback to object storage (GCS or Cloudflare R2) ensures uninterrupted training, continuing at baseline throughput even during hardware faults.
Key Metrics and Values
Technical Values
- Training Acceleration: Eliminated training slowdowns of 30% or more by resolving I/O bottlenecks and ensuring stable, predictable I/O performance
- Operational Simplification: Transparent caching of hot data eliminates the operational complexity of managing multiple NFS servers, manual data sharding across GCP's 64TB persistent disk limits, and per-node credential configuration
- Multi-Cloud Architecture: Enabled cost-effective GPU access across multiple cloud providers (GCP and Together AI) without incurring prohibitive data egress fees
- Fault Tolerance: Demonstrated resilience when a GPU machine failed unexpectedly, with training continuing without slowdown or data loss
Business Impact
- Multi-Cloud Flexibility: Alluxio enabled DYNA to deploy GPU training across GCP and Together AI, avoiding vendor lock-in and maintaining pricing negotiation power for GPU compute, one of their largest operational expenses, without incurring prohibitive data egress costs
- Improved Model Development Speed: Stable, predictable I/O performance enables more frequent model updates and faster quality improvements
- Commercial Rollout Acceleration: Enhanced development velocity and infrastructure scalability support faster scaling toward commercial deployment targets
- Engineering Productivity: Eliminated time-consuming manual infrastructure management, freeing engineering resources to focus on core AI development
Summary
By implementing Alluxio across their multi-cloud infrastructure, Dyna Robotics has accelerated model iteration cycles while maintaining operational flexibility and cost efficiency. “In the highly competitive landscape of embodied AI, having a high-performance, scalable, and reliable AI infrastructure is absolutely critical,” explained Lindon Gao, CEO of Dyna Robotics. “Alluxio, as the data acceleration layer for our foundation model training infrastructure, has proven to be an extremely valuable partner in our journey to commercial success.”