GrokNet: Unified Computer Vision Model Trunk and Embeddings For Commerce

Sean Bell, Yiqun Liu, Sami Alsheikh, Yina Tang, Ed Pizzi, M. Henning, Karun Singh, Omkar Parkhi, Fedor Borisyuk, KDD 2020


The paper presents GrokNet, a unified computer vision model that combines a diverse set of loss functions to jointly optimize exact product recognition and several classification tasks. GrokNet predicts a wide variety of properties for an image, such as its category, attributes, and likely search queries. It also predicts an embedding (like a “fingerprint”) that can be used for product recognition, visual search, visually similar product recommendations, ranking, personalization, price suggestions, and canonicalization.
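To make the "fingerprint" idea concrete, here is a minimal sketch of how an embedding could drive visual search: rank catalog items by cosine similarity to a query embedding. The product names, the 3-dimensional vectors, and the helper functions are hypothetical and chosen only for illustration; the summary does not describe the actual retrieval system.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_products(query, catalog, k=2):
    """Return the k catalog ids whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query, emb), pid) for pid, emb in catalog.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]

# Hypothetical toy embeddings; real embeddings have many more dimensions.
catalog = {
    "red_sneaker": [0.9, 0.1, 0.0],
    "blue_sneaker": [0.8, 0.3, 0.1],
    "wooden_chair": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(nearest_products(query, catalog))
```

With these toy vectors the two sneakers rank above the chair, which is the behaviour a visually-similar-products feature would rely on.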

GrokNet is a single unified model with full coverage across all products.

Main contributions


The model is trained with SGD for 100,000 iterations on 96 NVIDIA V100 GPUs with a batch size of 46 per GPU, a learning rate of 0.0012 × 12 = 0.0144, momentum 0.9, and weight decay 1e-4.
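As a rough sanity check on these numbers, the sketch below works out the global batch size and the total number of images processed, and shows one scalar SGD update with momentum and weight decay. The update rule follows one common convention (the one used by PyTorch's SGD with zero dampening); that choice and the function name are assumptions, since the summary does not specify the exact formulation.

```python
# Values quoted in the text above.
num_gpus = 96
batch_per_gpu = 46
iterations = 100_000
momentum = 0.9
weight_decay = 1e-4
learning_rate = 0.0144

global_batch = num_gpus * batch_per_gpu   # images per optimizer step
images_seen = global_batch * iterations   # total images processed during training

def sgd_momentum_step(weight, grad, velocity):
    """One scalar SGD step with momentum and L2 weight decay (assumed convention)."""
    grad = grad + weight_decay * weight        # fold weight decay into the gradient
    velocity = momentum * velocity + grad      # accumulate momentum
    weight = weight - learning_rate * velocity
    return weight, velocity

print(global_batch, images_seen)
```

The global batch comes out at 4,416 images per step, so 100,000 iterations correspond to roughly 442 million image presentations.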


Training data collection

GrokNet requires almost 100 million images as training data, so data collection is a major task.

Trunk Architecture

Training images are fed into a shared trunk network, and every loss is computed from the features that trunk produces.

Loss Functions

GrokNet unifies several distinct tasks in a single architecture by combining multiple loss functions, of several different types, in a weighted sum.
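A weighted sum of per-task losses can be sketched as follows. The task names, logits, and weights are hypothetical, and softmax cross-entropy stands in for whichever loss types the model actually uses; this only illustrates the combination step.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over logits against one integer label."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

def weighted_total_loss(losses, weights):
    """Weighted sum over named per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical task names, logits, and weights.
losses = {
    "category": softmax_cross_entropy([2.0, 0.5, -1.0], 0),
    "attribute": softmax_cross_entropy([0.1, 0.1], 1),
}
weights = {"category": 1.0, "attribute": 0.5}
total = weighted_total_loss(losses, weights)
print(total)
```

Because the total is a single scalar, one backward pass through the shared trunk updates it for all tasks at once; the weights control how much each task influences that update.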

Our two cents