With the proliferation of edge devices, the data continuously generated at the network edge has given rise to growing privacy-disclosure concerns. Federated edge learning (FEL) in mobile edge computing (MEC) systems, as a distributed machine learning architecture, can efficiently respond to these challenges by sharing only model parameters instead of raw data. In realistic scenarios, however, the global model suffers from bias, largely because participating nodes have heterogeneous data distributions and resource constraints. Moreover, nodes may behave lazily during training; coupling node behavior with learning performance through a farsighted incentive design that copes with data imbalance offers great potential for low-latency and energy-efficient FEL. Toward this end, this paper addresses participant laziness and effective incentives by optimizing the local training process and global aggregation simultaneously. Specifically, a two-stage leader-follower Stackelberg game is combined with an incentive mechanism to model the interaction between an aggregator and nodes for energy-efficient resource management, and we systematically analyze the existence of the Nash equilibrium. We further propose an efficient off-policy group relative policy optimization with hierarchical mean-field theory (OGRPO-HMF) algorithm to optimize the local training process and global aggregation simultaneously. To evaluate the effectiveness of OGRPO-HMF, we compare its overall performance with state-of-the-art counterparts. Experimental results on different datasets demonstrate the superiority of OGRPO-HMF in reducing both the average loss over all training samples and the total training time while preserving model accuracy.
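To make the two-stage leader-follower structure concrete, the sketch below shows a minimal Stackelberg interaction between an aggregator (leader) and training nodes (followers). All utility forms, cost parameters, and the grid search are illustrative assumptions for exposition, not the paper's actual model.

```python
import math

def follower_best_response(p, c):
    # Assumed node utility U_i(x) = p*x - c*x^2 (reward minus quadratic
    # energy cost), maximized in closed form at x* = p / (2c).
    return p / (2.0 * c)

def leader_utility(p, costs, a=10.0):
    # The leader anticipates each node's best response, values the
    # aggregate training effort with diminishing returns, and pays
    # p per unit of contributed effort.
    total = sum(follower_best_response(p, c) for c in costs)
    return a * math.log(1.0 + total) - p * total

def stackelberg_equilibrium(costs, prices):
    # Stage 1: the leader picks the reward price maximizing its utility
    # (grid search for simplicity); Stage 2: followers best-respond.
    p_star = max(prices, key=lambda p: leader_utility(p, costs))
    efforts = [follower_best_response(p_star, c) for c in costs]
    return p_star, efforts

if __name__ == "__main__":
    costs = [0.5, 1.0, 2.0]                    # heterogeneous unit costs
    prices = [0.1 * k for k in range(1, 101)]  # candidate reward prices
    p_star, efforts = stackelberg_equilibrium(costs, prices)
    print(p_star, efforts)
```

Because each follower's best response is a closed-form function of the price, backward induction reduces the game to a single-variable leader problem, which is the standard route to establishing the equilibrium's existence.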