Status: Open

Reinforcement learning deployment optimizer and 100K+ unit scalability

Simulation · Low
reinforcement-learning · trajectory-optimization · scalability · deployment-sequencing

Background

The RQ-1-43 ML trajectory deployment optimizer uses a trained MLP for delta-V estimation combined with greedy/NN-guided heuristics to sequence swarm unit deployment. The NN has been retrained on the deployment regime (0.9-1.1 AU, validation MSE 0.0005) and produces accurate transfer cost estimates, but the NN-guided strategy only matches sequential performance because all swarm slots share the same orbital radius: the NN receives identical (r1, r2) inputs for every candidate and cannot differentiate them. This structural limitation, together with the scalability challenges below, motivates exploring RL-based approaches.
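To make the degeneracy concrete, here is a minimal sketch of the failure mode. The function `nn_delta_v_estimate` is a hypothetical stand-in for the trained MLP, not the real model; the point is that any estimator fed only (r1, r2) scores every slot identically, so greedy argmin collapses to first-available (i.e., sequential) ordering:

```python
import numpy as np

# Hypothetical stand-in for the trained delta-V estimator: it sees only
# departure and arrival radii, nothing slot-specific.
def nn_delta_v_estimate(r1_au: float, r2_au: float) -> float:
    return 0.3 + 0.1 * abs(r2_au - r1_au)  # placeholder, not the real MLP

tug_radius = 1.0                  # AU, current tug orbit
slot_radii = np.full(8, 1.0)      # all swarm slots share one orbital radius

scores = [nn_delta_v_estimate(tug_radius, r) for r in slot_radii]
print(scores)                     # identical score for every candidate slot

# argmin over identical scores returns index 0, so the "NN-guided" pick
# degenerates to sequential (first-available) ordering.
best = int(np.argmin(scores))
```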

Why This Matters

The current simulator achieves ~95% accuracy for deployment cost estimation at Phase 1 scale but cannot address:

Reinforcement learning gaps:

  • The current NN-guided strategy uses a trained estimator with greedy optimization, not true RL policy learning. Even with the retrained deployment-regime NN, the strategy cannot outperform the sequential strategy because it evaluates individual hops rather than multi-hop chains.
  • An RL agent (e.g., PPO or SAC) could learn deployment policies that discover batch-like clustering automatically by optimizing over sequences of transfers (see the environment sketch after this list).
  • Policy transfer from small (1K unit) training environments to large (100K+ unit) deployment scenarios requires curriculum learning or hierarchical decomposition.
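As a concrete starting point, the sketch below frames deployment sequencing as a Gymnasium environment that a PPO or SAC agent could train against. Everything here is an illustrative assumption: the class name, the toy phasing cost standing in for the NN delta-V estimate, and the observation layout (remaining-slot mask plus tug phase cosines).

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class DeploymentSequencingEnv(gym.Env):
    """Toy deployment-sequencing environment (illustrative sketch).

    Observation: remaining-slot mask plus each tug's phase cosine.
    Action: index of the slot the active tug deploys next.
    Reward: negative transfer cost, so the return is minus total delta-V.
    """

    def __init__(self, n_slots: int = 64, n_tugs: int = 4):
        super().__init__()
        self.n_slots, self.n_tugs = n_slots, n_tugs
        self.action_space = spaces.Discrete(n_slots)
        self.observation_space = spaces.Box(
            low=-1.0, high=1.0, shape=(n_slots + n_tugs,), dtype=np.float32)

    def _obs(self):
        return np.concatenate(
            [self.remaining, np.cos(self.tug_angles)]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.remaining = np.ones(self.n_slots, dtype=np.float32)
        # All slots share one orbital radius; they differ only in phase angle.
        self.slot_angles = np.linspace(0, 2 * np.pi, self.n_slots, endpoint=False)
        self.tug_angles = self.np_random.uniform(0, 2 * np.pi, self.n_tugs)
        self.active_tug = 0
        return self._obs(), {}

    def step(self, action):
        if self.remaining[action] == 0:       # slot already filled: penalize
            return self._obs(), -10.0, False, False, {}
        # Toy phasing cost: wrapped angular distance stands in for the
        # NN delta-V estimate between tug phase and target slot phase.
        diff = self.slot_angles[action] - self.tug_angles[self.active_tug]
        cost = abs(np.angle(np.exp(1j * diff)))
        self.remaining[action] = 0.0
        self.tug_angles[self.active_tug] = self.slot_angles[action]
        self.active_tug = (self.active_tug + 1) % self.n_tugs
        terminated = bool(self.remaining.sum() == 0)
        return self._obs(), -float(cost), terminated, False, {}
```

A baseline agent could then be trained with, e.g., stable-baselines3's `PPO("MlpPolicy", DeploymentSequencingEnv()).learn(total_timesteps=200_000)`; curriculum learning would grow `n_slots` across training stages toward deployment scale.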

Scalability challenges:

  • Combinatorial explosion: 100K units with 50 tugs creates an action space that greedy methods cannot effectively search.
  • The current O(N^2) nearest-neighbor search in the greedy strategy becomes prohibitive above ~10K units; a spatial index reduces this to O(N log N) (see the sketch after this list).
  • Hierarchical decomposition (cluster-level planning plus intra-cluster sequencing) is needed but not implemented.
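One concrete mitigation, sketched below under the assumption that slot positions are available as 3D coordinates (the random shell here is a placeholder for the real deployment plan), is a k-d tree for nearest-neighbor queries plus mini-batch k-means for the cluster level of a hierarchical planner:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_units = 100_000
# Illustrative slot positions on a unit (~1 AU) shell; real positions
# would come from the deployment plan, not random sampling.
pts = rng.normal(size=(n_units, 3))
slots = pts / np.linalg.norm(pts, axis=1, keepdims=True)

tree = cKDTree(slots)                    # O(N log N) build, done once
tug_pos = np.array([[1.0, 0.0, 0.0]])
dist, idx = tree.query(tug_pos, k=8)     # O(log N) per query vs O(N) scan

# Cluster-level planning: partition slots into ~50 spatial clusters (one
# per tug), then run intra-cluster sequencing on ~N/50 units each.
labels = MiniBatchKMeans(n_clusters=50, random_state=0).fit_predict(slots)
```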

Real-time replanning:

  • Anomaly response: tug failures, missed insertion windows, and collision-avoidance maneuvers all require online replanning.
  • The current strategies assume deterministic execution with no contingency handling.
  • Real-time NN inference for replanning likely requires model distillation to run on flight-grade processors (a distillation sketch follows this list).
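A minimal distillation sketch, assuming a PyTorch teacher that stands in for the retrained delta-V MLP (the 6-feature input, layer widths, and the regime sampler are all illustrative), trains a much smaller student on the teacher's outputs:

```python
import torch
import torch.nn as nn

# Hypothetical teacher: stands in for the retrained delta-V MLP.
# The student is sized for a flight-grade processor; widths are illustrative.
teacher = nn.Sequential(nn.Linear(6, 256), nn.ReLU(),
                        nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))
student = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1_000):
    # Placeholder sampler over the deployment regime (features in 0.9-1.1).
    x = torch.rand(512, 6) * 0.2 + 0.9
    with torch.no_grad():
        target = teacher(x)                  # teacher's delta-V estimate
    loss = loss_fn(student(x), target)       # student mimics the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```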

N-body trajectory propagation:

  • The current Hohmann/NN approximation ignores gravitational perturbations from planets during multi-month transfers.
  • High-fidelity N-body propagation for 100K+ concurrent transfers requires GPU-accelerated integration.
  • Low-thrust trajectory optimization (for electric-propulsion tugs) differs fundamentally from the impulsive Hohmann assumption (both cost models are sketched below).
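For reference, the impulsive cost the current approximation relies on, and the low-thrust analogue it does not capture, can both be computed in closed form. This sketch uses the standard Hohmann total delta-V and the Edelbaum result for a coplanar circle-to-circle low-thrust spiral:

```python
import numpy as np

MU_SUN = 1.32712440018e11   # km^3/s^2, solar gravitational parameter
AU_KM = 1.495978707e8       # km per astronomical unit

def hohmann_delta_v(r1_km: float, r2_km: float) -> float:
    """Total impulsive delta-V (km/s) for a coplanar Hohmann transfer."""
    dv1 = np.sqrt(MU_SUN / r1_km) * (np.sqrt(2 * r2_km / (r1_km + r2_km)) - 1)
    dv2 = np.sqrt(MU_SUN / r2_km) * (1 - np.sqrt(2 * r1_km / (r1_km + r2_km)))
    return abs(dv1) + abs(dv2)

def edelbaum_delta_v(r1_km: float, r2_km: float) -> float:
    """Coplanar circle-to-circle low-thrust delta-V: |v_c1 - v_c2|."""
    return abs(np.sqrt(MU_SUN / r1_km) - np.sqrt(MU_SUN / r2_km))

print(hohmann_delta_v(0.9 * AU_KM, 1.1 * AU_KM))   # ~3.0 km/s
print(edelbaum_delta_v(0.9 * AU_KM, 1.1 * AU_KM))  # ~3.0 km/s in this regime
```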

Simulation Approach

This question requires GPU-accelerated RL training infrastructure (PPO/SAC with vectorized environments) and high-fidelity orbital mechanics propagation. The recommended approach is offline training with policy export for browser-based visualization of learned deployment strategies.
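For the policy-export step, one plausible path is ONNX export, which onnxruntime-web can load in the browser. The sketch below uses illustrative network shapes matching the toy environment above (68 observation features, 64 action logits); the real policy would be the trained PPO/SAC actor:

```python
import torch

# Hypothetical trained policy network (e.g., the PPO actor MLP); the
# 68-in / 64-out shapes match the toy environment sketched earlier.
policy = torch.nn.Sequential(torch.nn.Linear(68, 64), torch.nn.Tanh(),
                             torch.nn.Linear(64, 64))
dummy_obs = torch.zeros(1, 68)

# Export to ONNX so the learned policy can run in-browser via
# onnxruntime-web for visualization of deployment sequences.
torch.onnx.export(policy, dummy_obs, "deployment_policy.onnx",
                  input_names=["obs"], output_names=["action_logits"],
                  dynamic_axes={"obs": {0: "batch"}})
```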

Question Details

Source BOM Item
Swarm Control System
Question ID
rq-1-48
Created
2026-02-10
Related BOM Items
bom-1-7, bom-1-1, bom-1-6
