SKILLRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning 2/4
Method
1) Skill distillation from trajectories. A base policy collects trajectories for each task. A strong teacher model then summarizes them into skills. Successes become concise decision principles and applicability conditions. Failures are turned into counterfactual lessons that explicitly name the failure point, flawed reasoning, corrective action, and a preventive principle, converting noisy negative traces into actionable guidance.
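The distillation step above can be sketched as a small pipeline: trajectories go in, (name, principle, applicability) skills come out, with successes and failures prompted differently. This is a minimal sketch; the `Skill` dataclass, the prompt wording, and the `stub_teacher` function are illustrative assumptions, not the paper's actual implementation (which uses a strong LLM as the teacher).

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Skill:
    name: str
    principle: str
    applicability: str

def distill(trajectory: List[str], success: bool,
            teacher: Callable[[str], Skill]) -> Skill:
    # Successes are summarized into concise decision principles;
    # failures into counterfactual lessons naming the failure point,
    # flawed reasoning, corrective action, and a preventive principle.
    mode = "decision principle" if success else "counterfactual lesson"
    prompt = f"Extract a {mode} from this trace:\n" + "\n".join(trajectory)
    return teacher(prompt)

# Stub standing in for the strong teacher LLM (hypothetical output).
def stub_teacher(prompt: str) -> Skill:
    return Skill(name="heat-check",
                 principle="Hold the target object before using the appliance.",
                 applicability="heating tasks")

skill = distill(["find apple", "take apple", "heat apple with microwave"],
                success=True, teacher=stub_teacher)
```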
2) SKILLBANK: a hierarchical skill library. Skills are stored as triples (name, principle, applicability). The library is split into:
General skills that apply across tasks (e.g., systematic exploration, pre-action sanity checks).
Task-specific skills that apply to task families (e.g., ALFWorld or WebShop sub-tasks).
At inference time the agent always conditions on general skills and retrieves up to K=6 task-specific skills whose embedding similarity to the task description exceeds a threshold (δ=0.4). This yields a compact context (10–20× shorter than raw trajectories) while preserving reasoning-relevant guidance.
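The retrieval rule above (all general skills, plus at most K=6 task-specific skills with similarity over δ=0.4) is straightforward to sketch with cosine similarity over precomputed embeddings. The embedding model and data here are placeholders; only K and δ come from the source.

```python
import numpy as np

def retrieve(task_emb: np.ndarray, skill_embs: np.ndarray,
             k: int = 6, delta: float = 0.4) -> list:
    """Return indices of up to k task-specific skills whose cosine
    similarity to the task description exceeds delta."""
    task = task_emb / np.linalg.norm(task_emb)
    skills = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)
    sims = skills @ task
    # Sort by similarity (descending), keep those above threshold, cap at k.
    return [i for i in np.argsort(-sims) if sims[i] > delta][:k]

# Toy data: the first "skill" is the task embedding itself, so it must match.
rng = np.random.default_rng(0)
task_emb = rng.normal(size=8)
skill_embs = np.vstack([task_emb, rng.normal(size=(19, 8))])
picked = retrieve(task_emb, skill_embs)
```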
3) Cold-start and RL training. Before RL, the policy is supervised-fine-tuned on teacher-generated, skill-augmented trajectories so it learns to retrieve and apply skills. This SFT model becomes a reference policy for KL regularization during RL, stabilizing training. RL uses GRPO with clipped objectives and normalized advantages; the key idea is to optimize actions while keeping the policy close to skill-aware behavior learned during SFT.
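The training objective described above combines GRPO's group-normalized advantages, a PPO-style clipped ratio, and a KL penalty toward the frozen SFT reference. The sketch below shows the per-group loss on log-probabilities; the clip range, KL weight `beta`, and the k3 KL estimator are common defaults assumed here, not values stated in the source.

```python
import numpy as np

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, beta=0.01):
    # Group-relative advantages: z-score rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = np.exp(logp_new - logp_old)
    # Clipped surrogate keeps each policy update conservative.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = np.minimum(ratio * adv, clipped * adv)
    # k3 KL estimate toward the SFT reference anchors skill-aware behavior.
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return float(-(policy_term - beta * kl).mean())

rng = np.random.default_rng(0)
logp_old = rng.normal(-1.0, 0.1, size=8)
logp_new = logp_old + rng.normal(0.0, 0.05, size=8)
logp_ref = logp_old  # SFT reference policy
rewards = np.array([1., 0., 1., 1., 0., 0., 1., 0.])
loss = grpo_loss(logp_new, logp_old, logp_ref, rewards)
```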
4) Recursive evolution. Every few epochs (e.g., every 5), validation failures are sampled and analyzed by the teacher to add or refine skills. Categories with low accuracy get more attention, but growth is capped per checkpoint to avoid explosion. The skill bank grows from 55 initial skills (12 general + 43 task-specific) to ~100 skills by later training, and the updated library feeds back into subsequent RL updates.
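The evolution loop above can be sketched as follows: failures from the lowest-accuracy categories are analyzed first, and new skills are appended under a per-checkpoint cap. The cap value, the dict schema, and the dedup-by-name rule are assumptions; the source only states that growth is capped.

```python
def evolve_skillbank(skillbank: list, failures: list, teacher,
                     max_new: int = 10) -> list:
    """Analyze sampled validation failures and grow the skill bank,
    capping additions per checkpoint to avoid library explosion."""
    # Low-accuracy failure categories get attention first.
    ordered = sorted(failures, key=lambda f: f["category_accuracy"])
    existing = {s["name"] for s in skillbank}
    added = 0
    for failure in ordered:
        if added >= max_new:
            break  # per-checkpoint growth cap
        skill = teacher(failure["trace"])
        if skill["name"] not in existing:
            skillbank.append(skill)
            existing.add(skill["name"])
            added += 1
    return skillbank

bank = [{"name": "pre-action sanity check"}]
fails = [{"trace": "open+insert+close failure", "category_accuracy": 0.2}]
teacher = lambda trace: {"name": "container-interaction check"}  # stub LLM
bank = evolve_skillbank(bank, fails, teacher, max_new=2)
```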
Concrete example (simple chain): ALFWorld "heat the apple".
4a) Skill distillation: The base agent runs a few trajectories. A success trace shows the command sequence "Find apple > open microwave > put apple in it > close it," from which the teacher extracts a concise principle: "For heating tasks, ensure the target object is in hand before interacting with the appliance." A failure trace shows the agent trying to move the apple into the microwave when it is already inside, which surfaces a counterfactual skill: "Before action 'use an unknown' ensure object isn't already contained in target g..."; the teacher distills this into a preventive principle.
4b) SKILLBANK: retrieval. The general skills include items like "Do a pre-action sanity check." Among the task-specific entries is an "Is Appliance Before Move?" skill for heating/cooling tasks. When the task description is "heat the apple," retrieval computes embedding similarity and pulls the heating-relevant task-specific skills (those above δ=0.4) along with all general skills.
4c) Cold-start + RL. The model is SFT-trained on skill-augmented traces so it learns to invoke these skills in context. During RL, KL regularization keeps the policy anchored to this skill-aware behavior while rewards improve task performance. The clipped objective prevents the policy from drifting too far from the SFT reference, so learned skills remain active during training.
4d) Recursive evolution. Suppose validation shows repeated failures at tasks combining container interactions (e.g., open+insert+close): a teacher model creates/updates a new task-specific skill like "For heating tasks, verify the object isn't already inside the target appliance before attempting to move it in." This skill enters the SKILLBANK for subsequent RL rounds.