SKILLRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning (Part 2/4)
Method
1) Skill distillation from trajectories. A base policy collects trajectories for each task, and a strong teacher model then summarizes them into skills. Successful trajectories are distilled into concise decision principles paired with applicability conditions. Failures are turned into counterfactual lessons that explicitly name the failure point, the flawed reasoning, the corrective action, and a preventive principle, converting noisy negative traces into actionable guidance.
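A minimal sketch of the data shape this step produces. The `Skill` record and `distill_failure` helper are hypothetical names, not the paper's API; in the real pipeline a teacher model extracts the four lesson fields from the raw trace, whereas here they are passed in directly for illustration.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str           # short handle for the skill
    principle: str      # concise decision rule
    applicability: str  # conditions under which the skill should fire

def distill_failure(failure_point: str, flawed_reasoning: str,
                    correction: str, preventive_principle: str,
                    applies_when: str) -> Skill:
    """Package one failed trajectory as a counterfactual lesson.

    The four lesson fields mirror the structure described in the text:
    failure point, flawed reasoning, corrective action, preventive principle.
    """
    principle = (f"{preventive_principle} (failure: {failure_point}; "
                 f"flawed reasoning: {flawed_reasoning}; fix: {correction})")
    return Skill(name=f"avoid:{failure_point}",
                 principle=principle,
                 applicability=applies_when)

lesson = distill_failure(
    failure_point="acted on a closed container",
    flawed_reasoning="assumed the target item was already visible",
    correction="open the container before taking the item",
    preventive_principle="verify an object is accessible before acting on it",
    applies_when="household tasks that involve containers (e.g., ALFWorld)",
)
```

Successes follow the same triple shape, just without the counterfactual wrapper around the principle.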
2) SKILLBANK: a hierarchical skill library. Skills are stored as triples (name, principle, applicability). The library is split into:
General skills that apply across tasks (e.g., systematic exploration, pre-action sanity checks).
Task-specific skills that apply to task families (e.g., ALFWorld or WebShop sub-tasks).
At inference time the agent always conditions on general skills and retrieves up to K=6 task-specific skills whose embedding similarity to the task description exceeds a threshold (δ=0.4). This yields a compact context (10–20× shorter than raw trajectories) while preserving reasoning-relevant guidance.
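The retrieval rule above can be sketched in a few lines. This is a toy version: real systems would use a learned text embedder, while here the embeddings are plain float vectors, and the function name `retrieve_skills` is an assumption, not the paper's API. The defaults mirror the stated K=6 and δ=0.4.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_skills(task_emb, general, task_specific, k=6, delta=0.4):
    """Build the skill context for one task.

    general:       list of skill strings, always included.
    task_specific: list of (embedding, skill string) pairs; keep those whose
                   similarity to the task exceeds delta, then take the top-k.
    """
    scored = [(cosine(task_emb, emb), skill) for emb, skill in task_specific]
    kept = sorted((p for p in scored if p[0] > delta), reverse=True)[:k]
    return list(general) + [skill for _, skill in kept]
```

Because only the general skills plus at most K short triples enter the prompt, the context stays small, which is where the 10-20x reduction over raw trajectories comes from.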
3) Cold-start and RL training. Before RL, the policy is supervised-fine-tuned on teacher-generated, skill-augmented trajectories so it learns to retrieve and apply skills. This SFT model becomes a reference policy for KL regularization during RL, stabilizing training. RL uses GRPO with clipped objectives and normalized advantages; the key idea is to optimize actions while keeping the policy close to skill-aware behavior learned during SFT.
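The objective can be sketched as follows, assuming the standard GRPO recipe: z-score the binary rewards within each group of rollouts to get advantages, apply a PPO-style clipped ratio, and penalize divergence from the frozen SFT reference. The exact coefficients and the per-sequence (rather than per-token) granularity are simplifications for illustration.

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages: z-score rewards within one rollout group."""
    mu = sum(rewards) / len(rewards)
    sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / sd for r in rewards]

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate plus a KL penalty toward the SFT reference policy.

    logp_new/logp_old/logp_ref: per-sequence log-probs under the current,
    rollout-time, and frozen SFT reference policies; rewards are binary
    task outcomes for one group of rollouts.
    """
    adv = grpo_advantages(rewards)
    total = 0.0
    for ln, lo, lr, a in zip(logp_new, logp_old, logp_ref, adv):
        ratio = math.exp(ln - lo)
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        surrogate = min(ratio * a, clipped * a)
        # k3-style KL estimator against the reference policy
        kl = math.exp(lr - ln) - (lr - ln) - 1
        total += -(surrogate - kl_coef * kl)
    return total / len(rewards)
```

When the policy has not moved from the SFT model (all three log-probs equal), the ratio is 1, the KL term vanishes, and the loss reduces to the negative mean advantage, which is zero by construction of the z-scoring.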

GRPO is a practical fit here because the objective is sequence-level, sparse-reward RL with a need for stable updates around a strong reference policy. GRPO (like PPO-style methods) gives a clipped objective plus explicit KL control, which makes it easier to preserve the skill-aware behavior learned in SFT while still improving task success. That stability matters when the policy input includes retrieved skills and the reward signal is binary.

The paper does not claim GRPO is uniquely required; it is a reasonable, well-understood choice for LLM-policy RL and provides a clean baseline to isolate the impact of skill distillation and recursive evolution. Other algorithms (e.g., PPO variants, DPO-style offline methods, or actor-critic approaches) could likely be swapped in, but would introduce extra variables or require different reward formulations. The core contribution is the skill-augmented loop, not a new RL objective.

4) Recursive evolution. Every few epochs (e.g., every 5), validation failures are sampled and analyzed by the teacher to add or refine skills. Categories with low accuracy get more attention, but growth is capped per checkpoint to avoid explosion. The skill bank grows from 55 initial skills (12 general + 43 task-specific) to ~100 skills by later training, and the updated library feeds back into subsequent RL updates.
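The checkpoint-time update can be sketched as a prioritized, capped loop, assuming failures are bucketed by task category and validation accuracy is tracked per category. The function name, the cap value, and the `lesson_from:` placeholder (which stands in for the teacher's distillation call) are all illustrative.

```python
def evolve_skillbank(skillbank, failures_by_category, accuracy_by_category,
                     max_new_per_checkpoint=10):
    """Refine the skill bank from sampled validation failures.

    Categories with the lowest validation accuracy are processed first,
    and total growth is capped per checkpoint so the bank cannot explode.
    """
    order = sorted(failures_by_category,
                   key=lambda c: accuracy_by_category.get(c, 0.0))
    added = 0
    for category in order:
        for failure in failures_by_category[category]:
            if added >= max_new_per_checkpoint:
                return skillbank
            # Stand-in for the teacher call that distills this failure
            # into a (name, principle, applicability) skill.
            skillbank.setdefault(category, []).append(f"lesson_from:{failure}")
            added += 1
    return skillbank
```

Running this every few epochs with a modest cap is consistent with the reported growth from 55 skills to roughly 100 over training.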