When situational demands exceed available cognitive resources, people experience cognitive overload, which often leads to stress, exhaustion, fatigue, and, consequently, erroneous behavior. This is particularly problematic in safety-critical contexts, where people are confronted with various, potentially distracting, demands that may deteriorate goal-directed behavior. Therefore, robust measures of experienced cognitive load are needed that not only account for task-induced demands but also consider situational-environmental influences. To this end, we need to be able to correctly classify (high) cognitive load using a variety of continuously and unobtrusively measured variables. Here we present a multimodal study with 18 participants (nine female, mean age=25.9±3.8 years) whose ocular, cardiac, respiratory, and brain activity (using fNIRS) was recorded during the execution of an adapted warship commander task with concurrent emotional speech distraction. These emotional speech stimuli have a high salience and are thus perceived as especially distracting. Our cross-subject multilevel classification approach comprises feature engineering, model optimization and selection, as well as sensor fusion methods, with the goal of reliably identifying the currently experienced cognitive load. We used a leave-one-subject-out strategy to test the generalizability of the final proposed classifier. Because the architecture combines information from different modalities, the final cognitive load prediction can be considered robust against noise, artifacts, and temporal sensor dropouts. Our approach contributes to the ecologically valid identification of cognitive overload and paves the way towards state monitoring in realistic applications and towards systems that can adapt flexibly to the current cognitive resources of their users.
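The leave-one-subject-out evaluation mentioned above can be sketched as follows. This is a minimal illustration only: the synthetic features, the number of trials, and the simple nearest-centroid classifier are hypothetical stand-ins, not the classifiers or fusion architecture used in the study; only the 18-subject fold structure mirrors the described setup.

```python
# Hypothetical sketch of leave-one-subject-out (LOSO) evaluation for
# cross-subject generalizability: train on 17 subjects, test on the held-out one.
import random

random.seed(0)

def make_subject(n_trials=20):
    """Synthetic (feature_vector, load_label) pairs; NOT the study's real data."""
    trials = []
    for _ in range(n_trials):
        high = random.random() < 0.5
        base = 1.0 if high else 0.0
        # Four illustrative channels, e.g. ocular, cardiac, respiratory, fNIRS.
        feats = [base + random.gauss(0, 0.3) for _ in range(4)]
        trials.append((feats, int(high)))
    return trials

subjects = {f"S{i:02d}": make_subject() for i in range(18)}

def centroid(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def classify(x, c0, c1):
    # Nearest-centroid decision: predict 1 (high load) if closer to c1.
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return int(d1 < d0)

accuracies = []
for held_out in subjects:
    train = [t for s, trials in subjects.items() if s != held_out for t in trials]
    c0 = centroid([f for f, y in train if y == 0])
    c1 = centroid([f for f, y in train if y == 1])
    test = subjects[held_out]
    acc = sum(classify(f, c0, c1) == y for f, y in test) / len(test)
    accuracies.append(acc)

mean_acc = sum(accuracies) / len(accuracies)
print(f"LOSO mean accuracy over {len(accuracies)} folds: {mean_acc:.2f}")
```

Because each fold withholds an entire participant, the reported accuracy reflects performance on data from a person the model has never seen, which is the relevant criterion for cross-subject deployment.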