Whilst the prey and predator have opposing objectives, they share similar characteristics. Both agents operate
within a continuous action space with two degrees of freedom: rotation about the \(y\)-axis and forward/backward movement.
Each degree of freedom accepts a value in the interval \([-1, 1]\), determining the direction and intensity of
movement. The agents' behavioural parameters consist of maximum movement and rotation speeds, which scale this intensity
value. Notably, the predator possesses an additional box collider located at its face, resulting in the prey's death
upon collision. Conversely, the prey lacks this collider as its primary goal involves evading the predator.
Box colliders for the predator and prey agents
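To make the action scaling concrete, the following is a minimal Python sketch of how a 2-DOF action could be mapped to motion. The function name, the simple kinematics, and the time step `dt` are illustrative assumptions; in the actual environment the agents are driven by Unity's physics engine.

```python
import numpy as np

def apply_action(position, heading, action, max_move_speed, max_turn_speed, dt):
    """Apply one 2-DOF continuous action (illustrative, not the Unity implementation).

    action[0]: rotation about the y-axis, in [-1, 1]
    action[1]: forward/backward movement, in [-1, 1]
    The behavioural parameters max_turn_speed and max_move_speed scale the intensities.
    """
    action = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    heading = heading + action[0] * max_turn_speed * dt              # rotate about y
    forward = np.array([np.sin(heading), np.cos(heading)])           # unit forward vector on the ground plane
    position = position + action[1] * max_move_speed * dt * forward  # move forward/backward
    return position, heading
```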
In order to simulate various vision fields, we employ a modified version of the ML-Agents Ray Perception sensor. This
sensor uses a collection of ray casts originating from a central point. Several parameters are available to manipulate
the vision, including ray angles (to establish the field of view), ray length (to determine the depth of field), number
of rays (to regulate ray density/resolution), and a list of tags for filtering detected objects. Each ray conveys the following
information: a flag indicating whether it collided with a tagged object, a one-hot encoded vector identifying the object
type, and the normalized distance to the hit object relative to the ray length. Consequently, the sensor's output
comprises a flattened vector encapsulating the information encoded within each ray.
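To illustrate the observation layout described above, the sketch below assembles the flattened vector from hypothetical ray-cast results. The tag list and the exact ordering of fields within each ray are assumptions and may differ from the actual ML-Agents encoding.

```python
import numpy as np

DETECTABLE_TAGS = ["predator", "prey", "wall"]  # hypothetical list of tags used for filtering

def encode_ray(hit_tag, hit_distance, ray_length):
    """Encode one ray as [one-hot object type, hit flag, normalized distance]."""
    one_hot = [1.0 if hit_tag == tag else 0.0 for tag in DETECTABLE_TAGS]
    hit_flag = 1.0 if hit_tag is not None else 0.0
    norm_dist = hit_distance / ray_length if hit_tag is not None else 1.0
    return one_hot + [hit_flag, norm_dist]

def sensor_output(ray_results, ray_length):
    """Concatenate the per-ray encodings into one flattened observation vector."""
    return np.concatenate([encode_ray(tag, dist, ray_length) for tag, dist in ray_results])

# Example: three rays, the middle one hitting a prey at half the ray length.
obs = sensor_output([(None, 0.0), ("prey", 5.0), (None, 0.0)], ray_length=10.0)
```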
To control the region of binocular vision, we have extended this sensor to incorporate a parameter representing
the number of depth rays. We make the simple assumption that agents can see the presence and type
of an object across their entire field of view, but only have depth perception in their binocular region.
Consequently, we exclude depth information for any rays falling outside this region by
assigning a distance of \(-1\). Using this modified Ray Perception sensor, we define two distinct sensor types:
predator-style and prey-style, which draw inspiration from the characteristic vision traits observed in real-world
predators and prey [16], respectively.
Predator- and prey-style ray perception sensors
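The depth masking itself can be expressed in a few lines. The function name and the convention of counting the binocular rays outwards from the forward axis are assumptions made for illustration.

```python
import numpy as np

def mask_monocular_depth(ray_angles_deg, norm_distances, num_depth_rays):
    """Discard depth information for rays outside the binocular region.

    ray_angles_deg: each ray's angle relative to the forward direction (degrees)
    norm_distances: normalized hit distances reported by the rays
    num_depth_rays: how many central rays retain their depth value
    """
    central = np.argsort(np.abs(np.asarray(ray_angles_deg)))[:num_depth_rays]
    keep = set(central.tolist())
    return np.array([d if i in keep else -1.0 for i, d in enumerate(norm_distances)])
```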
Utilizing the modified Ray Perception sensor offers distinct advantages for this paper by providing precise
control over the simulated vision fields. Customizable parameters such as field of view and depth of field enable
alignment with the specific requirements of the prey and predator agents in the simulated environment. This level of
control ensures that the sensory input accurately reflects each agent's intended vision characteristics.
Furthermore, using the modified Ray Perception sensor instead of raw camera inputs enhances training efficiency by
reducing the computational burden of processing high-dimensional image data.
The setting of predator-prey dynamics can be framed as a two-player dynamic game [6],
[7]. This formulation captures the interactions and influence of the agents on the state of the game, providing a framework
for studying predator-prey dynamics based on a common utility function. The game is played for at most a fixed number of steps \(T\), after which it terminates.
In our formulation, the state and action spaces align with the descriptions provided in the previous sections,
allowing for a consistent representation of the game dynamics. The transitions between states are determined by
the underlying physics engine, ensuring a realistic simulation of the predator-prey environment.
$$
G_\text{pred} = \sum_{t=1}^{T}\left[\frac{\mathbb{1}(\text{prey caught at } t)}{N_\text{prey}} - \frac{1}{T}\right]
$$
$$
G_\text{prey} = - G_\text{pred}
$$
The reward structure employed in our proposed framework, as formalized in the equations above, incorporates a constant time
penalty applied to the predator at each time step, and a positive reward if a prey is caught. This deliberate design
choice serves to incentivize the predator to capture prey at the earliest opportunity. Conversely, the reward for the
prey is formulated in the opposite manner, establishing a zero-sum game. The range of rewards is bounded within the
interval \([-1, 1]\). This reward specification enables a natural distinction between favorable and unfavorable
outcomes for both agents.
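A per-step view of this specification, assuming the indicator fires once per captured prey and using illustrative names, might look as follows; summed over an episode of at most `horizon` steps, the returns stay within the \([-1, 1]\) interval stated above.

```python
def step_rewards(prey_caught, n_prey, horizon):
    """Per-step rewards mirroring G_pred and G_prey from the equations above.

    prey_caught: indicator for a prey being caught at this time step
    n_prey:      total number of prey, N_prey
    horizon:     maximum episode length, T
    """
    r_pred = (1.0 / n_prey if prey_caught else 0.0) - 1.0 / horizon
    return r_pred, -r_pred  # zero-sum: the prey receives the negated reward
```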
Furthermore, since the predator-prey game exhibits an inherent asymmetry, it can be expressed as an asymmetric
zero-sum game. Unlike games such as soccer, where both sides play the same role, the predator and prey agents
pursue conflicting goals with different capabilities. The asymmetry arises from the distinct policies used by the two agents, resulting
in strategic dynamics that differ from those observed in symmetric games.
Overall, the nature of the game poses additional challenges, as it suffers from many of the known problems of
competitive multi-agent reinforcement learning [8], [9]. In this paper, we employ a prominent technique called "self-play"
[10], [11]
to address these difficulties. The diagram below provides a high-level overview of this approach.
Diagram showing the self-play mechanism
Initially, either the predator or prey agent is trained, while the model of the other agent remains frozen. After
a predetermined number of iterations, the roles swap: the previously frozen model is trained while the other agent's model is frozen.
However, a potential issue arises from the bias introduced by repeatedly playing against the most recent model.
This bias can lead to overfitting and poor generalization. To mitigate this problem, we incorporate an Elo rating
system as done in MuZero [12], enabling models to compete against earlier versions of their opponents.
By doing so, we reduce the impact of bias, leading to improved generalization and robustness in the learned policies.
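As a rough sketch of the bookkeeping involved, the snippet below shows a standard Elo update and an opponent-sampling rule that occasionally picks earlier snapshots. The K-factor, the sampling probability, and the function names are assumptions; ML-Agents provides its own self-play and Elo implementation.

```python
import random

def elo_update(rating_learner, rating_opponent, score, k=16.0):
    """Standard Elo update; score is 1.0 for a learner win, 0.5 for a draw, 0.0 for a loss."""
    expected = 1.0 / (1.0 + 10.0 ** ((rating_opponent - rating_learner) / 400.0))
    delta = k * (score - expected)
    return rating_learner + delta, rating_opponent - delta

def sample_opponent(snapshot_pool, p_latest=0.5):
    """Mostly play the latest frozen snapshot, but sometimes an earlier one
    to reduce the bias of always facing the most recent opponent."""
    if len(snapshot_pool) == 1 or random.random() < p_latest:
        return snapshot_pool[-1]
    return random.choice(snapshot_pool[:-1])
```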