Mandela Patrick

Computer Vision | Machine Learning

I am a Machine Learning Scientist at Piñata Farms, focused on machine learning for video understanding and recommendations. I completed my PhD with Andrea Vedaldi and João Henriques in the Visual Geometry Group (VGG) at the University of Oxford. I was fortunate to be funded by a Rhodes Scholarship and AIMS.

Before that, I did my undergrad at Harvard College, where I obtained a B.A. with honors in computer science.

Email  /  Google Scholar  /  Github /  Twitter /  LinkedIn /  Thesis

News

  • Started as a Machine Learning Scientist at Piñata Farms!
  • Invited as a speaker at the Share Stories and Lessons Learned (SSLL) ICCV 2021 workshop. See slides here
  • My PhD thesis, "Learning and interpreting deep representations from multi-modal data", is publicly available on the Oxford Research Archive (ORA)!
  • Our work Motionformer accepted as an Oral at NeurIPS 2021! Code
  • Passed my PhD! My examiners were Andrew Zisserman and Andrew Owens.
  • Two papers (GDT and STiCA) accepted to ICCV'21! Code
  • Started an internship at Facebook AI, mentored by Florian Metze, Christoph Feichtenhofer, and Ishan Misra.
  • Our paper on multilingual multimodal video-text pretraining (MMP) got accepted at NAACL 2021! Code
  • Our paper on video-text representation learning (SSB) got accepted as a Spotlight into ICLR 2021!
  • Our paper on Self-Labelling Videos (SeLaVi) got accepted at NeurIPS 2020! Code

Research

I'm interested in computer vision, self-supervised learning and multi-modal learning.

Keeping Your Eye On the Ball: Trajectory Attention in Video Transformers
Mandela Patrick*, Dylan Campbell*, Yuki M. Asano*, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, João F. Henriques
NeurIPS, 2021   (Oral)
code | slides | poster | bibtex | talk

We present trajectory attention, a drop-in self-attention block for video transformers that implicitly tracks space-time patches along motion paths. We set SOTA results on a number of action recognition datasets: Kinetics-400, Something-Something V2, and Epic-Kitchens.

Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick*, Yuki M. Asano*, Bernie Huang*, Ishan Misra, Florian Metze, João F. Henriques, Andrea Vedaldi
ICCV, 2021  
code | slides | poster | bibtex

We better leverage latent time and space for video representation learning by computing efficient multi-crops in embedding space and using a shallow transformer to model time. This yields SOTA performance and allows for training with longer videos.

On Compositions of Transformations in Contrastive Self-Supervised Learning
Mandela Patrick*, Yuki M. Asano*, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
ICCV, 2021  
code | slides | poster | bibtex | blog

We give transformations the prominence they deserve by introducing a systematic framework for composing them in contrastive learning. Learning (in)variances systematically yields SOTA video representations.

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang*, Mandela Patrick*, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann
NAACL, 2021  
code | slides | poster | bibtex

We develop a transformer model to learn contextualized multilingual multimodal embeddings and also release a new multilingual instructional video dataset (MultiHowTo100M) for pre-training. We apply this model in a zero-shot setting to retrieve videos with non-English queries, and outperform recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K.

Support-set bottlenecks for video-text representation learning
Mandela Patrick*, Po-Yao Huang*, Yuki M. Asano*, Florian Metze, Alexander Hauptmann, João F. Henriques, Andrea Vedaldi
ICLR, 2021   (Spotlight)
slides | poster | bibtex | talk

We use a generative objective to overcome the instance-discrimination limitations of contrastive learning, setting new state-of-the-art results in text-to-video retrieval.

Labelling unlabelled videos from scratch with multi-modal self-supervision
Yuki M. Asano*, Mandela Patrick*, Christian Rupprecht, Andrea Vedaldi
NeurIPS, 2020
code | slides | poster | bibtex | talk

We cluster videos without labels via multi-modal self-supervision. We show that clustering videos well does not come for free from good representations; instead, we learn a multi-modal clustering function that treats the audio and visual streams as augmentations of each other.

Understanding Deep Networks via Extremal Perturbations and Smooth Masks
Ruth Fong*, Mandela Patrick*, Andrea Vedaldi
ICCV, 2019   (Oral)
code | slides | poster | bibtex | talk

We introduce extremal perturbations, a novel attribution method that highlights "where" a model is "looking." We improve upon Fong and Vedaldi, 2017 by separating the regularization on the size and smoothness of a perturbation mask from the attribution objective of learning a mask that maximally affects a model's output; we also extend our work to intermediate channel representations.


Great template from Jon Barron