Iterative Scene Graph Generation with Generative Transformers


Sanjoy Kundu and Sathyanarayanan Aakur


Oklahoma State University



Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection. Current approaches follow a generation-by-classification paradigm, in which the scene graph is produced by labeling every possible edge between objects in a scene, which adds considerable computational overhead. This work introduces a generative transformer-based approach to scene graph generation that moves beyond link prediction. Using two transformer-based components, we first sample a plausible scene graph structure from the detected objects and their visual features, and then perform predicate classification on the sampled edges to produce the final scene graph. This design allows us to generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance with unbiased SGG approaches.
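
To make the two-stage design concrete, below is a minimal sketch assuming a PyTorch-style implementation. The module and function names (StructureSampler, PredicateClassifier, generate_scene_graph) and all hyperparameters are hypothetical; this is an illustration of the idea, not the authors' released code.

# Minimal sketch of the two-stage pipeline described above.
import torch
import torch.nn as nn


class StructureSampler(nn.Module):
    """Transformer that scores candidate edges between detected objects."""

    def __init__(self, dim=256, heads=8, layers=3):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, layers)
        self.edge_score = nn.Bilinear(dim, dim, 1)  # score for (subject, object) pairs

    def forward(self, node_feats):                  # node_feats: (N, dim)
        ctx = self.encoder(node_feats.unsqueeze(0)).squeeze(0)  # contextualized nodes
        n = ctx.size(0)
        subj = ctx.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1)
        obj = ctx.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)
        logits = self.edge_score(subj, obj).view(n, n)           # (N, N) edge logits
        return torch.sigmoid(logits)                             # edge probabilities


class PredicateClassifier(nn.Module):
    """Labels only the edges kept by the structure sampler."""

    def __init__(self, dim=256, num_predicates=50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_predicates))

    def forward(self, node_feats, edges):           # edges: (E, 2) index pairs
        pair = torch.cat([node_feats[edges[:, 0]], node_feats[edges[:, 1]]], dim=-1)
        return self.mlp(pair)                       # (E, num_predicates) logits


@torch.no_grad()
def generate_scene_graph(node_feats, sampler, classifier, edge_threshold=0.5):
    """Sample a sparse graph structure, then classify only the sampled edges."""
    edge_probs = sampler(node_feats)
    edge_probs.fill_diagonal_(0)                    # no self-relations
    edges = (edge_probs > edge_threshold).nonzero() # keep confident edges only
    if edges.numel() == 0:
        return edges, torch.empty(0, dtype=torch.long)
    predicates = classifier(node_feats, edges).argmax(dim=-1)
    return edges, predicates


# Example usage with 5 hypothetical detected objects and random features:
feats = torch.randn(5, 256)
edges, preds = generate_scene_graph(feats, StructureSampler(), PredicateClassifier())

Classifying predicates only on the sampled edges, rather than on all N x N object pairs, is what keeps the inference overhead low in this sketch, mirroring the motivation stated in the abstract.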


Video Walkthrough




Qualitative Visualizations


We present additional qualitative results for all three tasks: PredCls, SGCls, and SGDet. In PredCls, the goal is to generate the scene graph given ground-truth entities and their localization. In SGCls, the goal is to generate the scene graph given only the entity localization. In SGDet, only the input image is provided, and the goal is to generate the scene graph along with the entity localization. Edges (predicates) shown in red are correctly identified zero-shot relationships, i.e., the subject-predicate-object triplet does not appear in the training set (see the sketch below). Note that the SGDet results are produced from object detection bounding box proposals, while the other two settings use ground-truth bounding boxes, as outlined above.
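
As a concrete illustration of the zero-shot criterion above, the following small sketch flags a predicted triplet as zero-shot when its exact label combination never appears in the training annotations. The function name zero_shot_flags and the example triplets are hypothetical and not part of the released code.

# Hypothetical helper for flagging zero-shot triplets: a predicted
# (subject, predicate, object) is "zero-shot" if that exact label
# triplet never appears in the training annotations.

def zero_shot_flags(predicted_triplets, training_triplets):
    """Return one True/False flag per predicted triplet."""
    seen = set(training_triplets)   # e.g. {("man", "riding", "horse"), ...}
    return [t not in seen for t in predicted_triplets]


preds = [("dog", "on", "surfboard"), ("man", "wearing", "shirt")]
train = {("man", "wearing", "shirt"), ("man", "riding", "horse")}
print(zero_shot_flags(preds, train))   # [True, False] -> first triplet is zero-shot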

Example 1


Input Image with Groundtruth Bounding Boxes

Predicted Scene Graph under the PredCls setting

Predicted Scene Graph under the SGCls setting

Predicted Scene Graph under the SGDet setting

Example 2


Input Image with Groundtruth Bounding Boxes

Predicted Scene Graph under the PredCls setting

Predicted Scene Graph under the SGCls setting

Predicted Scene Graph under the SGDet setting

Example 3


Input Image with Groundtruth Bounding Boxes

Predicted Scene Graph under the PredCls setting

Predicted Scene Graph under the SGCls setting

Predicted Scene Graph under the SGDet setting

Example 4


Input Image with Groundtruth Bounding Boxes

Predicted Scene Graph under the PredCls setting

Predicted Scene Graph under the SGCls setting

Predicted Scene Graph under the SGDet setting

Example 5


Input Image with Groundtruth Bounding Boxes

Predicted Scene Graph under the PredCls setting

Predicted Scene Graph under the SGCls setting

Predicted Scene Graph under the SGDet setting



Code, Paper and Extras

  • Preprint can be found here
  • Supplementary material with additional information can be found here
  • Find training/evaluation code here [Coming Soon!]

Bibtex

@inproceedings{kundu2023iterative,
  title={Iterative Scene Graph Generation with Generative Transformers},
  author={Kundu, Sanjoy and Aakur, Sathyanarayanan N},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}