Accelerando: once you teach a computer to see, it can teach itself to hear
In "SoundNet: Learning Sound Representations from Unlabeled Video," researchers from MIT's computer science department describe how they used image-recognition software to automate sound recognition. Once software can analyze the video in a clip to decide what is going on, it can use that understanding to label the sounds in the clip and so build up a model for understanding sound, without a human first having to label videos for training.
Source: Yusuf Aytar, Carl Vondrick, Antonio Torr
We propose to train deep sound networks (SoundNet) by transferring knowledge from established vision networks and large amounts of unlabeled video. The synchronous nature of videos (sound + vision) allows us to perform such a transfer, which results in semantically rich audio representations for natural sounds. Our results show that transfer with unlabeled video is a powerful paradigm for learning sound representations. All of our experiments suggest that one may obtain better performance simply by downloading more videos, creating deeper networks, and leveraging richer vision models.
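To make the idea of "transferring knowledge" from a vision network to a sound network a little more concrete, here is a minimal sketch of a teacher-student loss. It assumes the vision network (the teacher) outputs a probability distribution over visual categories for each clip, and that the sound network (the student) is trained to match that distribution from the audio alone by minimizing a KL divergence. The function names, the toy numbers, and the use of plain NumPy are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def transfer_loss(teacher_probs, student_logits, eps=1e-12):
    """Mean KL divergence KL(teacher || student) over a batch of clips.

    teacher_probs:  (batch, classes) distributions from a vision model (assumed given)
    student_logits: (batch, classes) raw scores from a hypothetical sound model
    """
    student_probs = softmax(student_logits)
    kl = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=-1,
    )
    return float(kl.mean())

# Toy batch: two clips, three visual categories (made-up numbers).
teacher_probs = np.array([[0.7, 0.2, 0.1],
                          [0.1, 0.1, 0.8]])

# A student that already agrees with the teacher, and one that guesses uniformly.
matching_logits = np.log(teacher_probs)
uniform_logits = np.zeros((2, 3))

loss_match = transfer_loss(teacher_probs, matching_logits)
loss_uniform = transfer_loss(teacher_probs, uniform_logits)
print(loss_match, loss_uniform)
```

The student that reproduces the teacher's distribution gets a loss near zero, while the uniform guesser is penalized. No human-written labels appear anywhere: the only supervision is what the vision model says about the frames that accompany each sound.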