FID Assessment of Generative Text-Image Foundation Models

Throughout the last years it has been common practice for generative text-to-image models to assess FID- [12] and CLIP-scores [34, 36] in a zero-shot setting on complex, small-scale text-image datasets of natural images such as COCO [26]. However, with the advent of foundational text-to-image models [40, 37, 38, 1], which are not only targeting visual compositionality, but also at other difficult tasks such as deep text understanding, fine-grained distinction between unique artistic styles and especially a pronounced sense of visual aesthetics, this particular form of model evaluation has become more and more questionable. Kirstain et al. [23] demonstrates that COCO zero-shot FID is negatively correlated with visual aesthetics, and such measuring the generative performance of such models should be rather done by human evaluators. We investigate this for SDXL and visualize FID-vs-CLIP curves in Fig. 12 for 10k text-image pairs from COCO [26]. Despite its drastically improved performance as measured quantitatively by asking human assessors (see Fig. 1) as well as qualitatively (see Fig. 4 and Fig. 14), SDXL does not achieve better FID scores than the previous SD versions. Contrarily, FID for SDXL is the worst of all three compared models while only showing slightly improved CLIP-scores (measured with OpenClip ViT g-14). Thus, our results back the findings of Kirstain et al. [23] and further emphasize the need for additional quantitative performance scores, specifically for text-to-image foundation models. All scores have been evaluated based on 10k generated examples.

文章来源: https://hackernoon.com/fid-assessment-of-generative-text-image-foundation-models?source=rss
如有侵权请联系:admin#unsafe.sh