- Python 3.10 or higher
- pip

For Tier 2/3 (edge devices) or basic usage:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
```

DeepSpeed requires PyTorch to be installed first. Install in this order:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install PyTorch first (required for DeepSpeed build)
pip install "torch>=2.0"

# Then install the package with Tier 1 dependencies
pip install -e ".[tier1]"
```

Alternatively, install everything in one step (PyTorch will be installed first automatically):

```bash
pip install "torch>=2.0"
pip install -e ".[tier1]"
```

For development with testing and linting tools:

```bash
pip install -e ".[tier1,dev]"
```

This document describes the design of a federated, nested-learning framework that:
- Trains Mixture-of-Experts (MoE) models across heterogeneous compute tiers:
  - Tier 1: HPC clusters / data centers.
  - Tier 2/3: Edge devices (IoT, vehicles, mobiles).
- Uses Nested Learning concepts to enable continual learning without catastrophic forgetting.
- Ensures strict data privacy and governance:
  - Global governance in data centers (DataHub, Apache Ranger, Flyte, MLflow).
  - Local, airgapped governance on edge devices (sidecar).
- Is orchestrated with Flyte and coordinated via Flower (or similar FL framework).
- Uses Kafka for event streaming and HAProxy/VPN for secure connectivity.
- Exposes a CLI-first, modular Python framework that can later be wrapped as a service/app.
This document is intended for project planning, implementation, and infra design.
- Federated Training Across Multiple Clusters
  - Training must occur in a federated fashion across multiple clusters (Tier 1) and devices (Tier 2/3).
  - No raw data leaves any cluster or device by default.
- Heterogeneous Node Types
  - Nodes can be:
    - Servers / HPC clusters (Tier 1).
    - IoT devices, vehicles, mobile devices (Tier 2/3).
    - Mixed hardware and network conditions.
- Tier-Specific Training Capabilities
  - Tier 1:
    - Train core foundational model.
    - Train MoE experts (feed-forward networks).
    - Can also train LoRA/QLoRA adapters if desired.
  - Tier 2/3:
    - Train only local adapters (LoRA/QLoRA-like for experts).
    - Core weights stay frozen.
- Hierarchical Asynchronous Co-Training
  - Multiple nested loops:
    - Fast, local training at edge.
    - Slower, heavy training at Tier 1.
    - Event-driven aggregation and model versioning.
  - No strict synchronous rounds required; system must tolerate stale updates.
- Nested Learning Principle
  - Treat global training as an outer optimization.
  - Treat per-node training (especially adapters) as nested inner optimizations.
  - Design to mitigate catastrophic forgetting.
- Data & Privacy
  - Data should not leave its cluster or device unless explicitly allowed.
  - Common format: JSONL for LLM fine-tuning.
  - PII detection / anonymization where any text leaves local boundaries.
- Governance & Compliance
  - Global governance: DataHub, Apache Ranger, MLflow, Flyte.
  - Local governance: Sidecar per edge node, airgapped, no central metadata leakage.
- Model Versioning & Observability
  - Automated model versioning (MLflow).
  - Monitoring & observability for training, infra, and telemetry.
- Scalable Eventing & Networking
  - Kafka for control and update streams.
  - HAProxy and VPN for secure connectivity from edge to central infra.
- Developer Experience
  - Primary interface: CLI + Python library.
  - Later: optional API service / UI.
- Tier 1 – HPC / Data Center
  - Runs full MoE model: core + experts.
  - Heavy distributed training (DeepSpeed + PyTorch).
  - Full integration with global control plane:
    - Flyte (orchestration).
    - DataHub (metadata).
    - Apache Ranger (data access control).
    - MLflow (runs & model registry).
    - Observability stack (Prometheus, Grafana, logs, tracing).
- Tier 2/3 – Edge Nodes (IoT / Vehicles / Mobiles)
  - Receive frozen global model snapshot (quantized if necessary).
  - Train only expert adapters (LoRA/QLoRA).
  - Use local governance sidecar:
    - Enforce “model-delta-only” export.
    - Maintain local audit logs, never shipped centrally.
Three nested loops:
- Inner Loop (Edge / Local Adapters)
  - Each device trains adapters on its own data.
  - Produces `ΔW_experts_local` (adapters-only deltas).
- Middle Loop (Tier 1 / Core & Experts)
  - Clusters train:
    - Core `W_core` on global data.
    - Experts `W_experts_global` on large-scale datasets.
  - Integrate `ΔW_experts_local` from edge nodes.
- Outer Loop (Global Control Plane)
  - Flyte orchestrates:
    - When to do core updates.
    - When to aggregate expert updates.
    - When to evaluate & promote new model versions.
  - Data Engine-like feedback to generate new tasks.
We explicitly partition model parameters:
- Core (`W_core`)
  - Token & positional embeddings.
  - Self-attention weights (Q, K, V, O).
  - Normalization layers (LayerNorm, RMSNorm).
  - Optionally gating network weights (if we don’t want devices to alter routing).
- Global Experts (`W_experts_global`)
  - Set of feed-forward experts: `{W_E1, W_E2, ..., W_En}`.
  - Hosted and trained primarily at Tier 1.
  - Sharded and parallelized using DeepSpeed MoE.
- Adapters / LoRA Layers
  - Tier 1 Adapters (optional): Domain-specific adapters for specific clusters or datasets.
  - Tier 2/3 Adapters (mandatory): Low-rank matrices `(A_i, B_i)` on top of expert weights: `W_Ei_eff = W_Ei + A_i · B_iᵀ` (sketched below).
    - Only `(A_i, B_i)` trainable on edge nodes.
- Global gating network `g(x; W_gate)`:
  - Trained at Tier 1 on large-scale data.
  - On Tier 2/3:
    - Option 1: Gating fully frozen, only expert functions adapted via adapters.
    - Option 2: Minimal gating adapters (e.g. per-expert bias) that stay local only and are not aggregated centrally.
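To make the adapter math concrete, here is a minimal PyTorch sketch of `W_Ei_eff = W_Ei + A_i · B_iᵀ` around a single expert's linear layer. The class and parameter names are illustrative, not part of the framework's actual API.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One expert FFN layer with a frozen base weight W_Ei and a trainable
    low-rank adapter: W_eff = W + A @ B.T (names illustrative)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W_Ei (and bias)
            p.requires_grad = False
        out_f, in_f = base.out_features, base.in_features
        # A starts at zero so the adapter contributes nothing initially.
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(in_f, rank) * 0.01)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path x @ W.T plus the low-rank correction x @ (A @ B.T).T.
        return self.base(x) + self.scale * (x @ self.B) @ self.A.T
```

Only `A` and `B` carry gradients, so an edge node's trainable state (and its exported delta) is a few low-rank matrices per adapted expert.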
Steps on each edge node:
- Receive Model Snapshot
  - Pull `W_core`, `W_experts_global`, and model version `v` via FedServer or Kafka control events.
  - Optionally receive quantized weights (QLoRA / 4-bit for memory efficiency).
- Initialize Local Model
  - Load frozen `W_core`, `W_experts_global`.
  - Attach local adapters `(A_i, B_i)` for selected experts.
- Local Training
  - Train with local private dataset (JSONL or equivalent).
  - Objective: minimize local loss `L_local`, updating only adapters (see the sketch after this list).
  - Use resource-aware training: mixed precision, tiny batch sizes, small learning rates.
- Governance Filtering
  - Sidecar validates outbound payload:
    - Must be strictly `ΔW_experts_local` (or new adapter weights).
    - No raw data, no detailed metadata, no PII.
  - Sidecar writes local audit logs.
- Send Update
  - `ΔW_experts_local` published to Kafka topic `updates.experts.local` via VPN + HAProxy.
- Telemetry
  - Minimal training metrics & statuses (loss, success/failure) published to `telemetry.edge` topic (optionally DP-noised).

This loop is asynchronous: a node trains when resources are available, and the central system ingests deltas continuously.
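A minimal sketch of the adapter-only local training step, assuming the `LoRAExpert`-style parameter naming from the earlier sketch; the loss function, data loader, and export format are placeholders for whatever `tier2_trainer.py` actually implements.

```python
import torch

def train_local_adapters(model, loader, epochs=1, lr=1e-4):
    """Edge training loop: core and expert weights stay frozen;
    only adapter parameters (named '...A' / '...B' here) are updated."""
    adapter_params = []
    for name, param in model.named_parameters():
        is_adapter = name.endswith(".A") or name.endswith(".B")
        param.requires_grad = is_adapter  # freeze everything else
        if is_adapter:
            adapter_params.append(param)

    optimizer = torch.optim.AdamW(adapter_params, lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # ΔW_experts_local: export only the trainable adapter tensors.
    return {name: param.detach().cpu()
            for name, param in model.named_parameters() if param.requires_grad}
```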
Inputs: Global datasets, streams of `ΔW_experts_local`, configs/tasks.
Steps:
- Core & Expert Training (CoreUpdate Workflow)
  - Flyte triggers a “core update” workflow.
  - Ranger authorizes training services to access cluster data.
  - DeepSpeed + PyTorch load `W_core`, `W_experts_global` and train on large-scale data.
  - Compute `ΔW_core_T1`, `ΔW_experts_T1`.
- Integrate Edge Expert Updates
  - Aggregation worker (FedServer or dedicated process) consumes `updates.experts.local`.
  - Apply aggregation strategy (weight by recency, data size, reliability; downweight stale updates; see the sketch after this list).
- Global Parameter Update
  - Combine updates:

    ```
    W_core    ← W_core    + η_core · ΔW_core_T1
    W_experts ← W_experts + η_T1 · ΔW_experts_T1 + η_T2 · ΔW_experts_agg_from_edges
    ```
- Evaluation & Nested Learning Safeguards
  - Evaluate updated model on global validation sets and regressions.
  - Apply anti-forgetting strategies: regularization (EWC), selective freezing, or distillation.
- Versioning & Promotion
  - Log run to MLflow; register new model version `v+1`.
  - Update DataHub with lineage.
  - Governance/SecOps approve or reject new version.
- Broadcast New Version
  - FedServer / Kafka emits `ModelAvailable` / `RoundStart` for version `v+1`.
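To illustrate the aggregation step and the update rule above, here is a sketch of one possible weighting scheme; the half-life downweighting and the learning-rate values are assumptions for illustration, not the framework's mandated strategy.

```python
import torch

def aggregate_edge_deltas(updates, current_version, halflife=2.0):
    """Staleness- and size-weighted average of edge adapter deltas.
    `updates` is a list of (delta_state_dict, num_samples, trained_on_version)."""
    acc, total_weight = {}, 0.0
    for delta, num_samples, version in updates:
        staleness = max(0, current_version - version)
        weight = num_samples * 0.5 ** (staleness / halflife)  # downweight stale updates
        total_weight += weight
        for key, tensor in delta.items():
            acc[key] = acc.get(key, 0) + weight * tensor
    return {key: tensor / total_weight for key, tensor in acc.items()}

def apply_global_update(w_experts, delta_t1, delta_edge_agg, eta_t1=1.0, eta_t2=0.5):
    """W_experts ← W_experts + η_T1·ΔW_experts_T1 + η_T2·ΔW_experts_agg_from_edges."""
    with torch.no_grad():
        for key, tensor in w_experts.items():
            tensor += eta_t1 * delta_t1[key] + eta_t2 * delta_edge_agg.get(key, 0)
    return w_experts
```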
- Apache Ranger: Defines and enforces data access policies. Audits data access.
- DataHub: Central catalog of datasets, models, features. Maintains lineage (Dataset → Run → Model).
- MLflow: Tracks experiments, metrics, artifacts. Model registry with stages.
- Flyte: Orchestrates workflows (training, aggregation, eval). Encodes governance in pipeline definitions.
- Presidio: PII detection/anonymization for exceptional cross-silo sharing.
Sidecar responsibilities:
- Policy Enforcement: Local static config (YAML/JSON) or OPA-like rules defining allowed endpoints and payload fields.
- Outbound Filter: Validates payloads (no raw text, no identifiers).
- Local Audit Logs: Tracks local training/export events; stored on device only.
- Optional Local Presidio: Lightweight PII detection for logs.
This implements dual governance: Full-featured governance in Tier 1; Minimal, strictly enforced governance in Tier 2/3 with airgapped metadata.
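A minimal sketch of what the sidecar's outbound filter could look like; the allow-list, the forbidden keys, and the 256-character raw-text heuristic are all assumptions for illustration.

```python
# Hypothetical outbound filter: only adapter-delta payloads with an
# allow-listed set of top-level fields may leave the device.
ALLOWED_TOP_LEVEL = {"model_version", "client_id", "adapter_deltas"}
FORBIDDEN_KEYS = {"raw_text", "samples", "dataset_path", "user_id", "gps"}

def validate_outbound(payload: dict) -> bool:
    """Return True only if the payload is a pure model-delta export."""
    if not set(payload) <= ALLOWED_TOP_LEVEL:
        return False

    def scan(obj) -> bool:
        # Reject anything that looks like raw data or identifiers, at any depth.
        if isinstance(obj, dict):
            return all(k not in FORBIDDEN_KEYS and scan(v) for k, v in obj.items())
        if isinstance(obj, (list, tuple)):
            return all(scan(v) for v in obj)
        return not isinstance(obj, str) or len(obj) < 256  # crude raw-text guard

    return scan(payload)
```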
- VPN (WireGuard / IPSec): Secure overlay for edge nodes to reach central cluster.
- HAProxy / Ingress: Public ingress/load-balancer for FedServer/Kafka. Terminates TLS.
- mTLS & Auth: Mutual TLS certificates/tokens binding identity (`client_id`) to connections.
- Edge Node: Edge Runtime → Sidecar → VPN Tunnel → VPN Gateway → HAProxy → FedServer / Kafka
- Tier 1: Internal communication within cluster (FedServer, Kafka, Flyte) behind firewall.

Kafka topics:
- `control.federation_rounds`: Events: `RoundStart`, `ModelAvailable`, config updates.
- `updates.experts.local`: Payload: `ΔW_experts_local` + minimal metadata.
- `telemetry.edge`: Payload: Aggregated loss metrics, training outcome (DP/noise optional).
- `tasks.training`: Payload: Structured `TrainingTask` objects (core_update, expert_aggregation).
- Loose coupling: Decouples training, aggregation, and orchestration.
- Replayability: Allows reconstruction of training history/debugging.
- Data Engine: Telemetry feeds analytics which drive new tasks.
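As an example of the update path, a sketch of an edge client publishing `ΔW_experts_local` to `updates.experts.local` using the kafka-python client (an assumption; any Kafka client works). Hostnames, certificate paths, and header names are placeholders.

```python
import io
import torch
from kafka import KafkaProducer  # assumes the kafka-python client

# Endpoint and cert paths are placeholders; mTLS material comes from enrollment.
producer = KafkaProducer(
    bootstrap_servers="haproxy.example.internal:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/fednested/ca.pem",
    ssl_certfile="/etc/fednested/client.pem",
    ssl_keyfile="/etc/fednested/client.key",
)

def publish_adapter_delta(delta_state_dict, model_version, client_id):
    """Serialize ΔW_experts_local and publish it to updates.experts.local."""
    buffer = io.BytesIO()
    torch.save(delta_state_dict, buffer)  # adapter tensors only, never raw data
    producer.send(
        "updates.experts.local",
        value=buffer.getvalue(),
        headers=[("client_id", client_id.encode()),
                 ("model_version", str(model_version).encode())],
    )
    producer.flush()
```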
| Category | Tools |
|---|---|
| Training & Modeling | PyTorch (Core DL), DeepSpeed (Distributed MoE), PEFT/LoRA/QLoRA (Adapters), ONNX (Export). |
| Federation | Flower (FL Server/Client abstractions), Flyte (K8s-native orchestration). |
| Governance | Apache Ranger (Access control), DataHub (Catalog/Lineage), MLflow (Registry), Presidio (PII). |
| Observability | Prometheus (Metrics), Grafana (Dashboards), Loki (Logs), OpenTelemetry (Tracing). |
| Networking | Kafka (Streaming), HAProxy (Ingress), WireGuard/IPSec (VPN). |
Primary project: CLI-first Python package (can later host a service).
```
federated-nested-moe/
├── pyproject.toml        # packaging, dependencies, entrypoints
├── README.md             # high-level docs
├── docs/                 # detailed docs for architecture, governance, etc.
├── examples/             # sample configs, Flyte workflows, Dockerfiles
├── src/
│   └── fednested/
│       ├── cli.py            # main CLI (typer/click)
│       ├── config/           # Pydantic models and config loader
│       ├── model/            # MoE model, adapters, checkpoints, quantization
│       ├── training/         # tier1_trainer, tier2_trainer, aggregation
│       ├── federation/       # Flower-based server/client wrappers
│       ├── governance/       # ranger, datahub, local_sidecar, presidio integration
│       ├── orchestration/    # Flyte helpers, reference workflows
│       ├── observability/    # logging, metrics, tracing
│       ├── messaging/        # kafka_client, topic management, events
│       ├── networking/       # vpn tools, haproxy templates, security helpers
│       └── utils/            # generic utilities
└── tests/
    └── ...                   # unit/integration tests
```
- `fednested.model`
  - `moe_model.py`: Definition of MoE architecture and parameter partitioning.
  - `adapters.py`: LoRA/QLoRA layers and APIs.
  - `checkpointing.py`: Load/save state_dicts, split/merge core/experts/adapters.
- `fednested.training`
  - `tier1_trainer.py`: Orchestrates deep MoE training on HPC (DeepSpeed).
  - `tier2_trainer.py`: Edge training loops (adapters only).
  - `aggregation.py`: Logic for combining `ΔW_experts`.
- `fednested.federation`
  - `server.py`: Custom FedServer hooks into Kafka & aggregation.
  - `client.py`: Edge client runtime.
- `fednested.governance`
  - `local_sidecar.py`: Local enforcement engine.
  - `global_metadata.py`: DataHub + MLflow hooks.
- `fednested.orchestration`
  - `flyte_integration.py`: Tasks wrapping CLI calls.
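As a sketch of how `flyte_integration.py` might wrap CLI calls, a minimal flytekit task and workflow; the task and workflow names are hypothetical.

```python
import subprocess
from flytekit import task, workflow

@task
def core_update(config_path: str) -> str:
    # Wrap the CLI so Flyte owns scheduling, retries, and lineage.
    subprocess.run(
        ["fednested", "tier1", "core-update", "--config", config_path],
        check=True,
    )
    return config_path

@workflow
def core_update_wf(config_path: str = "config/tier1.yaml") -> str:
    return core_update(config_path=config_path)
```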
We use Typer or Click for a clean CLI.
- `fednested init-config tier1`
  - Generate sample Tier 1 config (model, data, Ranger, MLflow settings).
- `fednested tier1 core-update --config config/tier1.yaml`
  - Run a Tier 1 core + experts training cycle.
- `fednested tier1 aggregate-experts --config config/tier1.yaml`
  - Run aggregator: consume `updates.experts.local`.
- `fednested tier2 run-client --config config/tier2.yaml`
  - Start edge client: join federation, train local adapters, send deltas.
- `fednested messaging bootstrap-topics`
  - Create required Kafka topics.
- `fednested infra generate-vpn-config --profile edge`
  - Generate VPN client config for edge node enrollment.
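A skeleton of how the commands above could be laid out with Typer (one of the two options the design names); the command bodies are stubs for illustration.

```python
import typer

app = typer.Typer(help="Federated nested-MoE CLI (skeleton).")
tier1 = typer.Typer(help="Tier 1 (HPC) operations.")
tier2 = typer.Typer(help="Tier 2/3 (edge) operations.")
app.add_typer(tier1, name="tier1")
app.add_typer(tier2, name="tier2")

@app.command("init-config")
def init_config(tier: str):
    """Generate a sample config for the given tier."""
    typer.echo(f"Writing sample {tier} config...")

@tier1.command("core-update")
def core_update(config: str = typer.Option(..., "--config")):
    """Run a Tier 1 core + experts training cycle."""
    typer.echo(f"Running core update with {config}")

@tier2.command("run-client")
def run_client(config: str = typer.Option(..., "--config")):
    """Join the federation and train local adapters."""
    typer.echo(f"Starting edge client with {config}")

if __name__ == "__main__":
    app()
```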
- Environment: Kubernetes or HPC cluster with GPU nodes.
- Components: Flyte, Kafka, FedServer, HAProxy, DataHub, MLflow, Ranger.
- Execution: Run as K8s Jobs using `tier1_trainer.py` with DeepSpeed.
- Environment: IoT devices, vehicles, mobiles (container or native).
- Components: `fednested` client library, Local Sidecar, VPN client.
- Constraints: Model quantization (QLoRA), energy-aware scheduling.
- Phase 1 – Core Framework & Tier 1 Only
  - Implement MoE model + adapters.
  - Implement Tier 1 training with DeepSpeed.
  - Integrate MLflow + DataHub + Ranger.
- Phase 2 – Federated Edge Training + Dual Governance
  - Implement edge client + adapters-only training.
  - Implement sidecar governance.
  - Configure VPN + HAProxy + Kafka.
- Phase 3 – Nested Learning Optimizations & Data Engine
  - Add anti-forgetting strategies.
  - Implement smarter aggregation schemes.
  - Implement telemetry consumer & Data Engine.
- Phase 4 – Optional Control Plane Service / UI
  - FastAPI service for listing models/tasks.
  - Web UI for monitoring and governance approvals.
- Complexity: Managing Flyte, Flower, Kafka, Ranger, DataHub, etc. requires strong automation/IaC.
- Edge Resources: Aggressive quantization is likely required.
- Security: Adversarial clients could poison training; robust mTLS and sidecars are essential.
- Governance Consistency: Dual pipelines must encode consistent policies despite using different tools.