上一条: Human-centric Image Cropping with Partition-aware and Content-preserving Features
下一条: From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering