While performing an ultrasound (US) scan, sonographers direct their gaze at regions of interest to verify that the correct plane is acquired and to interpret the acquisition frame. Predicting sonographer gaze on US videos is useful for identification of spatio-temporal patterns that are important for US scanning. This paper investigates utilizing sonographer gaze, in the form of gaze-tracking data, in a multi-modal imaging deep learning framework to assist the analysis of the first trimester fetal ultrasound scan. Specifically, we propose an encoder-decoder convolutional neural network with skip connections to predict the visual gaze for each frame using 115 first trimester ultrasound videos; 29,250 video frames for training, 7,290 for validation and 9,126 for testing. We find that the dataset of our size benefits from automated data augmentation, which in turn, alleviates model overfitting and reduces structural variation imbalance of US anatomical views between the training and test datasets. Specifically, we employ a stochastic augmentation policy search method to improve segmentation performance. Using the learnt policies, our models outperform the baseline: KLD, SIM, NSS and CC (2.16, 0.27, 4.34 and 0.39 versus 3.17, 0.21, 2.92 and 0.28).