Decentralization and the high penetration of smart devices in IoE-enabled smart grids confront the power system with complex scheduling problems. The big data produced by interconnected infrastructures, together with a high-dimensional and uncertain environment, render traditional methods incapable of addressing these problems, since exact modeling of the environment under uncertainty is impracticable. Learning-based methods, in turn, suffer from excessive complexity and the curse of dimensionality. This research proposes Probabilistic Delayed Double Deep Q-Learning (P3DQL), which combines a tuned version of Double Deep Q-Learning (DDQL) with Delayed Q-Learning (DQL). By applying a delay to the update rule, the proposed algorithm balances overestimation and underestimation biases while guaranteeing efficiency in terms of sample complexity and learning performance. Finally, the proposed algorithm is tested on real-world data from Pecan Street Inc., and the performance of P3DQL is assessed in terms of peak clipping, Peak-to-Average Power Ratio (PAPR) reduction, and cost reduction. The results indicate that the developed algorithm outperforms the other methods evaluated, with 28.2% peak clipping, a 12.9% PAPR decrease, and a 29.4% cost reduction.
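The evaluation metrics named above can be made concrete with a short calculation. The following is a minimal sketch, assuming the standard definition of PAPR as the peak of a load profile divided by its mean, and assuming that peak clipping and PAPR decrease are expressed as relative reductions against an unscheduled baseline. The function names `papr` and `percent_reduction` and the load profiles are hypothetical illustrations; they are not taken from the paper or from Pecan Street data.

```python
from typing import Sequence


def papr(load_kw: Sequence[float]) -> float:
    """Peak-to-Average Power Ratio: maximum load divided by mean load."""
    if not load_kw:
        raise ValueError("load profile must be non-empty")
    average = sum(load_kw) / len(load_kw)
    if average <= 0:
        raise ValueError("average load must be positive")
    return max(load_kw) / average


def percent_reduction(baseline: float, scheduled: float) -> float:
    """Relative reduction of a metric versus its baseline value, in percent.

    Assumption: the reported percentages are relative to an unscheduled baseline.
    """
    return 100.0 * (baseline - scheduled) / baseline


if __name__ == "__main__":
    # Hypothetical hourly load profiles (kW) before and after scheduling.
    baseline = [1.0, 1.0, 4.0, 2.0]
    scheduled = [1.5, 1.5, 2.5, 2.5]
    print(papr(baseline))   # 4.0 / 2.0 = 2.0
    print(papr(scheduled))  # 2.5 / 2.0 = 1.25
    # Peak clipping: relative drop of the maximum load.
    print(percent_reduction(max(baseline), max(scheduled)))  # 37.5
    # PAPR decrease: relative drop of the ratio itself.
    print(percent_reduction(papr(baseline), papr(scheduled)))  # 37.5
```

Because the hypothetical scheduled profile here conserves total energy (both profiles average 2.0 kW), peak clipping and PAPR decrease coincide; when scheduling also changes the average load, the two metrics generally differ.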