Reinforcement learning deployment optimizer and 100K+ unit scalability
Background
The RQ-1-43 ML trajectory deployment optimizer uses a trained MLP for delta-V estimation combined with greedy/NN-guided heuristics to sequence swarm unit deployment. The NN has been retrained on the deployment regime (0.9-1.1 AU, val MSE 0.0005) and produces accurate transfer cost estimates, but the NN-guided strategy only matches sequential performance because all swarm slots share the same orbital radius: the NN receives identical (r1, r2) inputs for every candidate and cannot differentiate between them. This structural limitation, together with the scalability challenges below, motivates exploration of RL-based approaches.
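The degeneracy is easy to see in miniature. In the sketch below (all names hypothetical, not the project's API), a stand-in per-hop cost model that depends only on (r1, r2) scores every same-radius candidate identically, so greedy selection falls back to tie-breaking by index, i.e., sequential order:

```python
# Hypothetical illustration: when every candidate slot shares one orbital
# radius, a cost model that sees only (r1, r2) cannot rank candidates, and
# greedy selection degenerates to sequential order.

def transfer_cost(r1, r2):
    # Stand-in for the trained MLP: any function of (r1, r2) alone.
    return abs(r2 - r1) + 0.1 * (r1 + r2)

def greedy_order(r_start, slot_radii):
    remaining = list(range(len(slot_radii)))
    order, r = [], r_start
    while remaining:
        # All costs are equal when slot_radii are identical, so min() keeps
        # the first (i.e., sequential) candidate.
        best = min(remaining, key=lambda i: transfer_cost(r, slot_radii[i]))
        order.append(best)
        remaining.remove(best)
        r = slot_radii[best]
    return order

slots = [1.0] * 5                   # five slots at the same radius (AU)
print(greedy_order(0.95, slots))    # -> [0, 1, 2, 3, 4]
```

Any feature that distinguishes candidates (angular position, phasing window, tug assignment) would break the tie, which is exactly what the per-hop (r1, r2) input cannot provide.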
Why This Matters
The current simulator achieves ~95% accuracy for deployment cost estimation at Phase 1 scale but cannot address:
Reinforcement learning gaps:
- The current NN-guided strategy uses a trained estimator with greedy optimization, not true RL policy learning. Even with a retrained deployment-regime NN, the strategy cannot outperform sequential because it evaluates individual hops, not multi-hop chains.
- An RL agent (e.g., PPO or SAC) could learn deployment policies that discover batch-like clustering automatically by optimizing over sequences of transfers.
- Policy transfer from small (1K unit) training environments to large (100K+) deployment scenarios requires curriculum learning or hierarchical decomposition.
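The hop-vs-chain distinction above can be demonstrated on a toy instance. The sketch below (illustrative only, not the project's environment) runs tabular Q-learning on a tiny "deploy all slots" MDP where per-hop greedy is provably suboptimal; because RL optimizes the return over the whole transfer chain, it recovers the cheaper multi-hop ordering:

```python
# Toy sketch: tabular Q-learning on a 3-slot deployment MDP. Per-hop greedy
# from position 0.0 costs 6.5 on this instance; the chain optimum is 5.5.
import random

START = 0.0
SLOTS = [1.0, -1.5, 2.5]                 # slot positions; hop cost = |distance|

def step(pos, visited, a):
    cost = abs(SLOTS[a] - pos)
    return SLOTS[a], visited | {a}, -cost  # reward = negative transfer cost

def q_learning(episodes=3000, eps=0.3, seed=0):
    rng = random.Random(seed)
    Q = {}                               # (last_slot, frozenset visited) -> {action: value}
    for _ in range(episodes):
        last, pos, visited = None, START, frozenset()
        while len(visited) < len(SLOTS):
            acts = [a for a in range(len(SLOTS)) if a not in visited]
            q = Q.setdefault((last, visited), {a: 0.0 for a in acts})
            a = rng.choice(acts) if rng.random() < eps else max(acts, key=q.get)
            pos2, visited2, r = step(pos, visited, a)
            if len(visited2) < len(SLOTS):
                nxt = Q.setdefault((a, visited2),
                                   {b: 0.0 for b in range(len(SLOTS)) if b not in visited2})
                q[a] = r + max(nxt.values())  # deterministic env: learning rate 1
            else:
                q[a] = r
            last, pos, visited = a, pos2, visited2
    return Q

def rollout(Q):
    """Follow the learned greedy-by-Q policy; return total transfer cost."""
    last, pos, visited, total = None, START, frozenset(), 0.0
    while len(visited) < len(SLOTS):
        a = max(Q[(last, visited)], key=Q[(last, visited)].get)
        pos2, visited2, r = step(pos, visited, a)
        total -= r
        last, pos, visited = a, pos2, visited2
    return total

Q = q_learning()
print(rollout(Q))
```

A deep-RL version (PPO/SAC) replaces the Q table with a network and the enumeration with sampling, but the objective is the same: return over the whole chain rather than the cost of the next hop.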
Scalability challenges:
- Combinatorial explosion: 100K units with 50 tugs creates an action space that greedy methods cannot effectively search
- The current O(N^2) nearest-neighbor search in the greedy strategy becomes prohibitive above ~10K units
- Hierarchical decomposition (cluster-level planning + intra-cluster sequencing) is needed but not implemented
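Since the hierarchical decomposition is not yet implemented, the following is only a sketch of the intended shape (function and parameter names are assumptions): bucket slots into angular clusters on the shared deployment ring, plan at cluster level by walking the ring once, then sequence within each cluster. The whole pipeline is O(N log N), replacing the O(N^2) pairwise nearest-neighbor scan:

```python
# Hedged sketch of cluster-level planning + intra-cluster sequencing for
# same-radius slots distinguished by angular position on the ring.
import math

def hierarchical_sequence(slot_angles, n_clusters=16):
    """slot_angles: angle (rad) of each slot on the shared deployment ring."""
    width = 2 * math.pi / n_clusters
    clusters = [[] for _ in range(n_clusters)]
    for i, theta in enumerate(slot_angles):
        # % n_clusters guards against float rounding at the 2*pi boundary.
        clusters[int(theta % (2 * math.pi) // width) % n_clusters].append(i)
    order = []
    for c in clusters:                        # cluster-level plan: walk the ring once
        c.sort(key=lambda i: slot_angles[i])  # intra-cluster sequencing
        order.extend(c)
    return order

# Golden-angle spread as a stand-in for 1K slot positions.
angles = [(i * 2.399963) % (2 * math.pi) for i in range(1000)]
seq = hierarchical_sequence(angles)
assert sorted(seq) == list(range(1000))       # every slot scheduled exactly once
```

At 100K units the same structure holds; only the clustering step (e.g., k-means over full orbital elements instead of angular binning) and the intra-cluster solver would need upgrading.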
Real-time replanning:
- Anomaly response: tug failures, missed insertion windows, and collision-avoidance maneuvers all require online replanning
- The current strategies assume deterministic execution without contingency handling
- Real-time NN inference for replanning requires model distillation to run on flight-grade processors
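One inexpensive form of contingency handling is local repair rather than full replanning. The sketch below (names and data shapes are assumptions, not the current strategies' API) reassigns a failed tug's undeployed slots to the survivor whose planned route ends nearest each orphaned slot:

```python
# Hypothetical local-repair sketch for tug-failure contingencies.

def replan_on_failure(plans, failed, slot_pos):
    """plans: {tug_id: [slot_id, ...]}; failed: tug_id; slot_pos: {slot_id: angle (rad)}.
    Mutates and returns plans with the failed tug's slots redistributed."""
    orphans = plans.pop(failed)
    for slot in orphans:
        def end_gap(tug):
            # Distance from a survivor's last planned slot to the orphan.
            # Empty plans score 0.0 here - a simplification, not a policy.
            tail = plans[tug][-1] if plans[tug] else None
            return 0.0 if tail is None else abs(slot_pos[tail] - slot_pos[slot])
        best = min(plans, key=end_gap)
        plans[best].append(slot)
    return plans

plans = {"T1": [0, 1], "T2": [2], "T3": [3, 4]}
slot_pos = {0: 0.10, 1: 0.20, 2: 0.25, 3: 1.00, 4: 1.10}
new_plans = replan_on_failure(plans, "T2", slot_pos)
```

Local repair keeps the response cheap enough for onboard execution; a distilled NN could then re-score only the affected routes instead of the full deployment.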
N-body trajectory propagation:
- Current Hohmann/NN approximation ignores gravitational perturbations from planets during multi-month transfers
- High-fidelity N-body propagation for 100K+ concurrent transfers requires GPU-accelerated integration
- Low-thrust trajectory optimization (for electric propulsion tugs) differs fundamentally from impulsive Hohmann assumptions
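For reference, the impulsive two-burn Hohmann baseline that the current simulator approximates is just the vis-viva difference at each burn; everything above (perturbations, N-body propagation, low-thrust arcs) is what this closed form leaves out:

```python
# Impulsive Hohmann transfer between circular coplanar heliocentric orbits.
import math

MU_SUN = 1.32712440018e20    # heliocentric gravitational parameter, m^3/s^2
AU = 1.495978707e11          # metres

def hohmann_dv(r1_au, r2_au, mu=MU_SUN):
    """Total delta-V (m/s) for the two burns of a Hohmann transfer
    from a circular orbit of radius r1_au to one of radius r2_au."""
    r1, r2 = r1_au * AU, r2_au * AU
    dv1 = math.sqrt(mu / r1) * abs(math.sqrt(2 * r2 / (r1 + r2)) - 1)
    dv2 = math.sqrt(mu / r2) * abs(1 - math.sqrt(2 * r1 / (r1 + r2)))
    return dv1 + dv2

print(hohmann_dv(1.0, 1.524))   # ~5.6 km/s: classic Earth->Mars sanity check
print(hohmann_dv(0.95, 1.05))   # a hop within the 0.9-1.1 AU deployment regime
```

A low-thrust tug instead accumulates delta-V continuously over the transfer arc, so its cost depends on thrust profile and flight time rather than on two impulses, which is why it needs a fundamentally different optimization formulation.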
Simulation Approach
This question requires GPU-accelerated RL training infrastructure (PPO/SAC with vectorized environments) and high-fidelity orbital mechanics propagation. The recommended approach is offline training with policy export for browser-based visualization of learned deployment strategies.
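The vectorized-environment pattern that PPO/SAC training would rely on is sketched below; observation layout, field names, and reward are assumptions for illustration, not the project's environment. The point is that N environments step in lockstep so the policy network is queried in one batch per step:

```python
# Minimal vectorized-environment sketch (hypothetical shapes and names).
import numpy as np

class VecDeploymentEnv:
    def __init__(self, n_envs, n_slots, seed=0):
        self.n_envs, self.n_slots = n_envs, n_slots
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Per-env slot angles on the shared ring, plus each tug's position.
        self.slots = self.rng.uniform(0, 2 * np.pi, (self.n_envs, self.n_slots))
        self.deployed = np.zeros((self.n_envs, self.n_slots), dtype=bool)
        self.pos = np.zeros(self.n_envs)
        return self._obs()

    def _obs(self):
        # (n_envs, 2 * n_slots + 1): angles, deployed mask, tug position.
        return np.concatenate([self.slots, self.deployed.astype(float),
                               self.pos[:, None]], axis=1)

    def step(self, actions):
        # actions: (n_envs,) int array, one chosen slot index per environment.
        target = np.take_along_axis(self.slots, actions[:, None], axis=1)[:, 0]
        reward = -np.abs(target - self.pos)   # negative transfer cost
        self.pos = target
        self.deployed[np.arange(self.n_envs), actions] = True
        done = self.deployed.all(axis=1)
        return self._obs(), reward, done

env = VecDeploymentEnv(n_envs=4, n_slots=8)
obs, rew, done = env.step(np.zeros(4, dtype=int))
```

Once trained offline this way, the policy network can be exported (e.g., to a weights file the browser visualization loads) so the learned deployment strategy replays without the training stack.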
Question Details
- Source Phase
- Phase 1 - Initial Swarm Deployment
- Source BOM Item
- Swarm Control System
- Question ID
- rq-1-48
- Created
- 2026-02-10
- Related BOM Items
- bom-1-7, bom-1-1, bom-1-6