In conversation, one sees the interlocutor as much as one hears them. Compelling demonstrations of auditory-visual (AV) integration in speech perception are the classic McGurk effects: in McGurk fusion, an auditory [p] dubbed onto a face articulating [k] is perceived as a single fused percept, [t]; in McGurk combination, an auditory [k] dubbed onto a visual [p] is heard as a combination of [k] and [p]. The brain likely exploits the spatiotemporal co-occurrence of AV speech signals. AV integration poses interesting challenges for neuroscience and speech science alike: how, when, where, and in what format do auditory and visual speech signals integrate? Several studies are described suggesting that multisensory speech integration relies on a dynamic set of predictive computations carried out by large-scale cortical sensorimotor networks. Within an analysis-by-synthesis framework, it is proposed that speech perception entails a predictive brain network operating on abstract speech units.
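To make the fusion intuition concrete, the following minimal sketch (not a model drawn from the studies described here) treats the fused percept as the outcome of multiplying noisy auditory and visual likelihoods over place-of-articulation categories and renormalizing; the category set and probability values are illustrative assumptions only.

```python
# Toy fusion over place-of-articulation categories (illustrative only; not the
# framework proposed in the text). Assumes conditionally independent auditory
# and visual likelihoods, combined by multiplication and renormalization.

CATEGORIES = ["p", "t", "k"]  # bilabial, alveolar, velar

def fuse(auditory, visual):
    """Combine unisensory likelihoods into a normalized posterior over categories."""
    joint = {c: auditory[c] * visual[c] for c in CATEGORIES}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

# Auditory [p] paired with a face articulating [k]: each cue is noisy and
# spreads some probability to the adjacent place of articulation (assumed values).
auditory_p = {"p": 0.6, "t": 0.3, "k": 0.1}
visual_k   = {"p": 0.1, "t": 0.3, "k": 0.6}

posterior = fuse(auditory_p, visual_k)
print(max(posterior, key=posterior.get), posterior)
# The intermediate category [t] wins, mirroring the McGurk fusion percept.
```

Under these toy assumptions, neither unisensory cue favors [t], yet [t] dominates the combined estimate, which is one simple way to see why spatiotemporally co-occurring AV signals can yield a percept present in neither modality alone.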