Andrew Thomas

Public Documents 2
Multi-Agent Reinforcement Learning for Cyber Defence Transferability and Scalability
Andrew Thomas, Matthew Yates, and 2 more

March 17, 2025
Reinforcement learning (RL) has been shown to be effective for simple automated cyber defence (ACD) tasks. However, these approaches have limitations that prevent them from being deployed onto real-world hardware. Trained policies often have limited transferability across even small changes to the environment setup, and instability during training can prevent optimal learning, a problem that only grows as the environment scales and increases in complexity. In this work we address these limitations with a zero-shot transfer approach based on multi-agent reinforcement learning. Our approach partitions the task into smaller per-machine subtasks, with each agent learning the solution to its local problem. These local agents are trained on a small-scale network and then transferred to larger networks by mapping the agents to machines in the new network. We have found that this transfer method is effective for direct application to a number of ACD tasks. We show that its performance is robust to changes in network activity and attack scenario, and that it reduces the effect of network scale on performance.
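
The zero-shot transfer step described above can be pictured with a minimal sketch. The LocalAgent class, the machine "roles", and the network dictionaries below are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): zero-shot transfer of per-machine
# defence agents from a small training network onto a larger network.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LocalAgent:
    """Policy trained on a single-machine defence subtask."""
    role: str                                    # e.g. "server", "workstation"
    policy: Dict[tuple, int] = field(default_factory=dict)

    def act(self, local_observation: tuple) -> int:
        # Fall back to a no-op action for unseen local states.
        return self.policy.get(local_observation, 0)


def transfer_agents(trained: Dict[str, LocalAgent],
                    large_network: List[dict]) -> Dict[str, LocalAgent]:
    """Map agents trained on a small network onto machines of a larger one.

    Each machine in the larger network is assigned the agent whose training
    role matches the machine's role; no further training is performed
    (zero-shot transfer).
    """
    return {machine["name"]: trained[machine["role"]] for machine in large_network}


# Example: agents trained on a small network, deployed on a 10-machine one.
trained_agents = {"server": LocalAgent("server"),
                  "workstation": LocalAgent("workstation")}
large_network = ([{"name": f"host{i}", "role": "workstation"} for i in range(8)] +
                 [{"name": f"srv{i}", "role": "server"} for i in range(2)])
agents_on_large = transfer_agents(trained_agents, large_network)
```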
Meta Reinforcement Learning for Automated Cyber Defence
Andrew Thomas, Nick Tillyer, and 1 more

April 07, 2025
Reinforcement learning (RL) solutions have shown considerable promise for automating the defence of networks against cyber attacks. However, a limitation to their real-world deployment is the sample inefficiency and inflexibility of RL agents: even small changes to attack types require a new agent to be trained from scratch. Meta-learning for RL aims to improve the sample efficiency of training agents by encoding pre-training information that assists fast adaptation. This work focuses on two key meta-learning methods, MAML and ML3, which represent differing approaches to encoding meta-learning knowledge. Both are limited to sets of environments that share the same action and observation space. To overcome this, we also present an extension to ML3, Gen ML3, which removes this requirement by training the learned loss on reward information only. Experiments have been conducted on a distribution of network setups based on the PrimAITE environment. All approaches demonstrated improvements in sample efficiency over a PPO baseline for a range of automated cyber defence (ACD) tasks. We also show effective meta-learning across network topologies with Gen ML3.
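
The idea of a learned loss that does not depend on the observation or action space can be sketched as follows. The LearnedLoss network, the inner_update helper, and the choice of (reward, log-probability) inputs are assumptions made for illustration only, not the paper's implementation of Gen ML3.

```python
# Minimal sketch (my own illustration) of a learned loss in the spirit of
# ML3/Gen ML3: the loss network consumes only quantities that are independent
# of observation and action dimensionality (per-step reward and the policy's
# log-probability of the action it took), so the same learned loss can in
# principle be reused across different network topologies.
import torch
import torch.nn as nn


class LearnedLoss(nn.Module):
    """Maps (reward, log-prob) pairs from one trajectory to a scalar loss."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, rewards: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
        # rewards, log_probs: shape (T,) for a single trajectory.
        per_step = self.net(torch.stack([rewards, log_probs], dim=-1))
        return per_step.mean()


def inner_update(policy: nn.Module, learned_loss: LearnedLoss,
                 rewards: torch.Tensor, log_probs: torch.Tensor,
                 lr: float = 1e-2) -> None:
    """One inner-loop policy update using the learned loss.

    The outer meta-training loop (not shown) would adjust LearnedLoss so that
    this update improves the true return. log_probs must be computed from the
    policy so gradients can flow back to its parameters.
    """
    loss = learned_loss(rewards, log_probs)
    grads = torch.autograd.grad(loss, list(policy.parameters()))
    with torch.no_grad():
        for p, g in zip(policy.parameters(), grads):
            p -= lr * g


# Example: a tiny policy whose observation/action sizes could differ between
# tasks; only its log-probs feed the learned loss.
policy = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
obs = torch.randn(5, 4)
dist = torch.distributions.Categorical(logits=policy(obs))
actions = dist.sample()
log_probs = dist.log_prob(actions)      # shape (5,), depends on policy params
rewards = torch.randn(5)                # placeholder per-step rewards
inner_update(policy, LearnedLoss(), rewards, log_probs)
```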
