Driving while drowsy can lead to traffic accidents and other dangerous situations. Since yawning is an obvious sign of drowsiness, a high-accuracy, real-time approach to yawning detection is important. However, existing research on facial-keypoint-based segment-level models remains scarce and incomplete. This paper therefore proposes an approach in which the facial keypoints in video clips are first extracted by OpenPose and standardized, and yawning and other mouth behaviors are then detected by our graph-temporal convolutional network (GTCN) model. Extensive experiments on the public yawning detection dataset YawDD not only demonstrate the advantages of OpenPose as the facial keypoint extractor and of the graph convolutional network (GCN) as the spatial feature extractor, but also show that the GTCN model achieves state-of-the-art performance: 91.73% accuracy on the three-class problem of normal, talking, and yawning, and 99.25% accuracy on the binary yawning detection problem on the test set. Experiments also show that the GTCN model offers good real-time performance in practice.
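To make the spatial-then-temporal idea behind a graph-temporal convolutional network concrete, the following is a minimal NumPy sketch, not the paper's actual GTCN: it applies one GCN layer (symmetrically normalized adjacency) to the keypoints of each frame, then a simple 1-D convolution across frames. All shapes, the toy keypoint graph, the kernel, and the function names (`normalize_adjacency`, `gcn_layer`, `temporal_conv`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of a graph-temporal forward pass; the paper's GTCN
# architecture is not specified here, so layer choices are assumptions.

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(X, A_norm, W):
    """One spatial GCN layer: ReLU(A_norm @ X @ W). X: (nodes, features)."""
    return np.maximum(A_norm @ X @ W, 0.0)

def temporal_conv(H, K):
    """1-D convolution over the time axis with a kernel shared by all channels.
    H: (frames, nodes, features); K: (k,)."""
    T, k = H.shape[0], K.shape[0]
    out = np.zeros((T - k + 1,) + H.shape[1:])
    for t in range(T - k + 1):
        # Contract the kernel against a window of k consecutive frames
        out[t] = np.tensordot(K, H[t:t + k], axes=(0, 0))
    return out

rng = np.random.default_rng(0)
n_frames, n_keypoints, in_feat, hid = 16, 5, 2, 4
A = (rng.random((n_keypoints, n_keypoints)) > 0.5).astype(float)
A = np.maximum(A, A.T)          # toy undirected keypoint graph
A_norm = normalize_adjacency(A)
W = rng.standard_normal((in_feat, hid))
X = rng.standard_normal((n_frames, n_keypoints, in_feat))  # keypoint coords per frame

# Spatial GCN on each frame, then temporal convolution across frames
H = np.stack([gcn_layer(X[t], A_norm, W) for t in range(n_frames)])
out = temporal_conv(H, np.array([0.25, 0.5, 0.25]))
print(out.shape)  # (14, 5, 4)
```

In a real model the temporal stage would use learned multi-channel kernels and the output would feed a classifier over the behavior classes, but the sketch shows the core factorization: per-frame spatial aggregation over the keypoint graph followed by convolution along time.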