✓ Method
1) Skill distillation from trajectories. A base policy collects trajectories for each task. A strong teacher model then summarizes them into skills. Successes become concise decision principles and applicability conditions. Failures are turned into counterfactual lessons that explicitly name the failure point, flawed reasoning, corrective action, and a preventive principle, converting noisy negative traces into actionable guidance.
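The distillation step above can be sketched as prompt construction for the teacher. This is a minimal illustration; the `Skill` schema mirrors the (name, principle, applicability) triples described below, but the function name and prompt wording are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    principle: str
    applicability: str

def build_distillation_prompt(trajectory, success):
    """Format a trajectory for the teacher model.

    Successes are summarized into a concise decision principle with
    applicability conditions; failures become counterfactual lessons
    naming the failure point, flawed reasoning, corrective action,
    and a preventive principle.
    """
    steps = "\n".join(f"{i}. obs: {o} | act: {a}"
                      for i, (o, a) in enumerate(trajectory))
    if success:
        task = ("Summarize this successful trajectory into a concise "
                "decision principle and its applicability conditions.")
    else:
        task = ("Analyze this failed trajectory. Name the failure point, "
                "the flawed reasoning, a corrective action, and a "
                "preventive principle.")
    return f"{task}\n\nTrajectory:\n{steps}"
```

The teacher's response would then be parsed back into a `Skill` triple and added to the library.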
2) SKILLBANK: a hierarchical skill library. Skills are stored as triples (name, principle, applicability). The library is split into:
– General skills that apply across tasks (e.g., systematic exploration, pre-action sanity checks).
– Task-specific skills that apply to task families (e.g., ALFWorld or WebShop sub-tasks).
At inference time the agent always conditions on general skills and retrieves up to K=6 task-specific skills whose embedding similarity to the task description exceeds a threshold (δ=0.4). This yields a compact context (10–20× shorter than raw trajectories) while preserving reasoning-relevant guidance.
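The retrieval rule (all general skills, plus up to K=6 task-specific skills above the δ=0.4 similarity threshold) can be sketched as follows. Cosine similarity over embeddings is a plausible reading of "embedding similarity"; the function and argument names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_skills(task_emb, task_skills, skill_embs,
                    general_skills, k=6, delta=0.4):
    """Return the skill context for one task: every general skill,
    plus up to k task-specific skills whose embedding similarity
    to the task description exceeds delta."""
    scored = [(cosine(task_emb, e), s)
              for s, e in zip(task_skills, skill_embs)]
    picked = [s for sim, s in sorted(scored, key=lambda t: -t[0])
              if sim > delta][:k]
    return general_skills + picked
```

Because only a handful of short (name, principle, applicability) triples enter the prompt, the resulting context stays far smaller than the raw trajectories it was distilled from.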
3) Cold-start and RL training. Before RL, the policy is supervised fine-tuned on teacher-generated, skill-augmented trajectories so it learns to retrieve and apply skills. This SFT model then serves as the reference policy for KL regularization during RL, which stabilizes training. RL uses GRPO with a clipped objective and group-normalized advantages; the key idea is to improve task performance while keeping the policy close to the skill-aware behavior learned during SFT.
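The GRPO update described above can be sketched per token as a clipped surrogate plus a KL penalty toward the SFT reference. This is a schematic scalar version, not the paper's implementation; the clip range `eps`, KL weight `beta`, and the k3 KL estimator are common GRPO defaults assumed here for illustration.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: normalize episode rewards within a
    group of rollouts sampled for the same task prompt."""
    mu = sum(rewards) / len(rewards)
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / sd for r in rewards]

def grpo_loss_term(logp_new, logp_old, logp_ref, adv, eps=0.2, beta=0.01):
    """Per-token GRPO loss: clipped importance-weighted surrogate,
    minus a KL penalty keeping the policy near the SFT reference."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    surrogate = min(ratio * adv, clipped * adv)
    # k3 estimator of KL(policy || reference), assumed here
    kl = math.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl)
```

The KL term is what ties RL back to the cold start: drifting away from the skill-aware SFT behavior is penalized in proportion to `beta`.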
4) Recursive evolution. Every few epochs (e.g., every 5), validation failures are sampled and analyzed by the teacher to add or refine skills. Task categories with low accuracy receive more attention, but growth is capped per checkpoint to prevent the library from exploding. The skill bank grows from 55 initial skills (12 general + 43 task-specific) to roughly 100 skills late in training, and the updated library feeds back into subsequent RL updates.
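One way to picture the capped, accuracy-prioritized update is the loop below. The cap value, the `propose_skill` callback, and the data shapes are all hypothetical; only the policy (low-accuracy categories first, bounded growth per checkpoint) comes from the text.

```python
def evolve_skill_bank(skill_bank, failures_by_category,
                      accuracy_by_category, propose_skill, cap=5):
    """One evolution checkpoint: analyze sampled validation failures,
    visiting low-accuracy categories first, and add at most `cap`
    teacher-proposed skills so the library grows slowly."""
    # Lowest-accuracy categories get attention first.
    order = sorted(failures_by_category,
                   key=lambda c: accuracy_by_category.get(c, 0.0))
    added = 0
    for category in order:
        for failure in failures_by_category[category]:
            if added >= cap:
                return skill_bank
            # propose_skill stands in for the teacher's analysis.
            skill_bank.append(propose_skill(category, failure))
            added += 1
    return skill_bank
```

Run at every checkpoint, a loop like this yields the gradual growth the text reports (55 skills initially, about 100 by late training), with each revision feeding the next round of RL.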