Attention Is All You Need.
With these simple words, the Deep Learning industry was forever changed. Transformers were initially introduced in the field of Natural Language Processing to improve machine translation, but they have demonstrated astonishing results well beyond language processing. In particular, they have recently spread through the Computer Vision community, advancing the state of the art on many vision tasks. But what are Transformers? What is the mechanism of self-attention, and do we really need it? How did they revolutionize Computer Vision? Will they ever replace convolutional neural networks?
These and many other questions will be answered during the talk.
In this tech talk, we will discuss:
- A piece of history: Why did we need a new architecture?
- What is self-attention, and where does this concept come from?
- The Transformer architecture and its mechanisms
- Vision Transformers: An Image is Worth 16x16 Words
- Video Understanding using Transformers: the space–time approach
- The scale and data problem: Is Attention what we really need?
- The future of Computer Vision through Transformers
Davide Coccomini is a young Computer Vision enthusiast. He graduated in Computer Engineering and, more recently, in Artificial Intelligence and Data Engineering at the University of Pisa, and he is about to begin a Ph.D. in collaboration with the Institute of Information Science and Technologies (ISTI) at the Italian National Research Council (CNR). He has worked on topics such as video deepfake detection, anomaly detection for sensors and videos, and climate change mitigation.
Nicola Messina is a Ph.D. student in Information Engineering at the University of Pisa. He completed his M.Sc. in Computer Engineering in 2018 and is currently collaborating with the Institute of Information Science and Technologies (ISTI) at the National Research Council (CNR) in Pisa. He works on deep learning methods for relational understanding in multimedia data, with particular emphasis on transformer-based architectures for effective and efficient cross-modal retrieval.