Vision has become the primary sensory input for many systems and robots in the past few years. Efforts in camera development, dataset annotation, and model design have greatly advanced the frontiers of computer vision. However, many other sensory modalities, such as sound, thermal imaging, RF signals, and point clouds, remain under-explored. In this talk, I will introduce a cross-modal self-supervised learning paradigm and show how we can use our achievements in computer vision to assist the development of other sensory modalities. Such a learning paradigm could be a solution for problems that suffer from a scarcity of annotations. We envision it becoming the major learning scheme for future robots, which will be equipped with an increasing number of sensors.