Taking a Second Look
An Unprecedentedly Granular Look Into the Performances of Text-to-Image AI Models
Since AI image models first burst onto the scene, the sheer spectacle of AI-driven creativity has been breathtaking, and social networks are flooding with stunning visuals. But as these tools are moving from novelty to practical applications, be it storyboarding for films or generating marketing content, the questions are shifting. The choices professionals need to make while making these technology decisions are no longer just about how beautiful the images are, but how accurate, trustable, and safe the AI models are.
At 2nd Set AI, we’re dedicated to fostering trustable and responsible media generation for our enterprise customers. And our work steered us to conducting a comprehensive, multi-faceted evaluation of 18 leading text-to-image models, pushing beyond subjective aesthetics to analyze performance across several dimensions. These dimensions are important for enterprises that value their visual assets:
Fidelity to complex user prompts,
Blocking of toxic content,
Mitigating common and harmful stereotypes, and
Limiting Intellectual Property (IP) infringing content.
The models evaluated include commercial and flagship models from tech industry giants like Google, OpenAI, ByteDance and X AI to niche AI model providers like Black Forest Labs (BFL), Stability AI, Luma Labs etc. Models like Sana from NVIDIA, Bagel from ByteDance are included to represent the state-of-the-art of open weights models. Distilled and fast versions of several models are also studied.
One thing that stands out from our research is that there is not one single model that is good at every desirable outcome, and is well behaved in every possible way. For example, for superior spatial layout and scene coherence, Open AI’s GPT-4o, with its powerful native LLM reasoning, seems to be a good choice. But it collapses with an increasing number of object attributions in the prompt. Conversely, Google’s recent Imagen models are more robust to increasing count of things, but often fails to comprehend spatial relationships. The flagship Ultra version of Imagen 4 performs better than Imagen 4 in some ways, for example mitigating stereotype bias, but of course it costs more. And both of them struggle generating crowds. There are other models, like Seedream from ByteDance or Flux models from Black Forest Labs, that may provide a balanced performance, depending on your task and budget. The newest Flux Kontext models however are heavily guard-railed and aggressively block prompts. That’s a good thing for playground use, but can be limiting for many enterprise use cases. It quickly becomes apparent that choosing the right set of AI tools for the task could require an organization to have a team of AI engineers constantly onboarding and testing new image and video generation models, which by the way come out every other week!
At 2nd Set AI, our team knows this pain because we lived it everyday. In fact, to sufficiently test image models across all the dimensions, we had to spend time building our own benchmarks, synthesize new datasets, and often having to innovate on measurement methods.
While the volume of our learnings is enormous, and more suited for other scopes like research papers, the remainder of this blog post is an attempt to summarize the current state of matters.
Are They Listening? The Challenge of Compositional Quality
We have seen AI image models producing spectacular visuals. But imagine asking it for something simple like "four cameras and three horses," and receiving an image with five horses and no cameras. This isn't just a minor glitch; it highlights a fundamental challenge for text-to-image models: compositional intelligence. Our study rigorously tested models on their ability to handle
numeracy (counting objects, as in the above example with horses and cameras),
attribute binding (“the table is circular, and the plate is square” ),
spatial relationships ("a dog on the right of a wallet"), and finally
complex scene coherence (extended scene descriptions).
Our research reveals a dynamic and varied landscape. Here are some of the most impactful insights:
Numeracy and Spatial Reasoning: Flagship models like OpenAI's GPT-4o and Google's Imagen 4 Ultra lead the pack here. Imagen 4 Ultra scored highest (64%) in numeracy, with GPT-4o and Luma's Photon (61%) and X AI’s Grok 2 (59%) close behind. For spatial accuracy GPT-4o emerged as the leader with a 45% accuracy score.
Attribute Binding Challenges: This remains a significant hurdle for all models. Accurately associating colors, shapes, or textures with specific objects in complex prompts proved difficult, especially with growing number of object count. While Imagen 4 Ultra and Seedream 3.0 showed the best overall average attribute binding scores, the real test came with increased complexity. Both Imagen 4 and Imagen 4 Ultra demonstrated resistance to performance drop-off as the number of objects increased, while others (like Ideogram V3 Quality and GPT-4o) saw accuracy plummet dramatically, as we moved close to 10 or more objects.
Complex Scene Coherence: Here again, GPT-4o consistently delivered the highest accuracy levels. Google's Imagen 4 and Imagen 4 Ultra also performed strongly. However, nearly all models struggled significantly with prompts including "crowds," revealing a systemic weakness in generating populated scenes.
We use novel orchestration of AI models including scene graphs, open-vocabulary object detection models, computer vision based geometry, large language models (LLMs) and vision models (VLMs) to thoroughly evaluate these generations.
Toxic Content Generation and the Hierarchy of Harms
For AI developers, policymakers and enterprises trying to adopt generative technologies, it seems imperative to understand the potential risk of AI models generating harmful, toxic, or otherwise inappropriate content. We curated prompts from the Safe Latent Diffusion dataset and further validated (using LLMs) designed to elicit harmful content across seven distinct categories: Harassment, Hate, Illegal Activity, Self-Harm, Sexual, Shocking, and Violence.
The analysis was centered around two key performance indicators: Toxic Prompt Block Rate (how often a model refuses a toxic prompt) and Toxic Content Generation Rate (how often it produces harmful content). OpenAI's GPT-4o emerged as the market leader in safety, achieving a best-in-class low toxic generation rate of 24% by aggressively blocking 49% of toxic prompts. This "paternalistic guardian" approach prioritizes harm prevention, albeit at the cost of creative permissiveness. Conversely, models like Luma's Photon (46% toxic generation, 12% block rate) offer greater creative freedom but higher risk. X AI’s Grok 2, has one of the lowest prompt block rates at only 6% overall - notably 0% block rate for Hate and Harassment categories.
We observed very non-uniform safety policies, not only across different model providers and their safety philosophies, but also across what types of content are perceived by them to be more permissible than others. Sexual content is, on average, the most aggressively policed category across providers. Although even here the previously mentioned Grok 2 behavior applies as its 10% block rate of sexually suggestive prompts is abysmal (compare it to GPT-4o’s 72%). However, content depicting Violence and Illegal Activity is generated at alarmingly high rates by nearly all models. For example, Photon had a 70% toxic generation rate for Violence and 76% for Illegal Activity. Even GPT-4o, the safety leader, allowed toxic images 26% of the time for violence and 50% for illegal activity despite significant blocking. X AI’s Grok 2 showed split behavior with 30% violent generation, but 60% for illegal activity. We had to perform granular evaluation, as a model deemed "safe" in one domain may be highly risky in another.
Unmasking Bias: AI’s Reflection of Our Mind
What happens when we ask AI to dream up a "CEO"? The result, across the 18 models in this report, is a predictable sea of faces: overwhelmingly male, overwhelmingly white. If we then ask for a "housekeeper," the dream changes. The faces are now almost exclusively female, and the representation of people of color increases significantly. These images are not random. They are the calculated, logical output learned from a colossal library of human text and images. They are a reflection. The central question in this section therefore is not "Is AI biased?" but rather what are the patterns of biased behavior by these models, and which ones take steps to counter it.
Inspired by the open source Holistic Evaluation of Language Models (HELM) framework, each of these models was given a series of prompts designed to probe for common societal stereotypes. These prompts fall into two categories: descriptors and occupations. The occupations range from high-status professions like "pilot" and "scientist" to service roles like "cook" and "taxi driver." The descriptors are more subjective and loaded, including terms like "attractive," "ugly," "strong," "weak," and "terrorist".
For every image generated, a sophisticated face analysis tool was used to identify and categorize key demographic attributes. The analysis was conducted across five dimensions: gender, skin tone, race/ethnic look, eye color, hair color.
Our internal AI Bias Leaderboard (not shown here), based on a "Stereotype Index" that quantifies a model's tendency to default to a single dominant demographic, revealed significant differences. While all models exhibit some bias, OpenAI's GPT-4o and Google's Imagen 4 Ultra demonstrated a measurably lower tendency to stereotype. For Imagen 4 Ultra, for instance, unusually diversified results for "attractive person" and deliberately inverted the "CEO" stereotype by generating 90% female subjects, likely indicates active bias mitigation measures.
X AI’s Grok 2 showed sudden deviation from common behavior, for example producing 100% white individuals for “racist” and producing white individuals at a higher rate (50%) than Middle Eastern / North African individuals (40%) for “terrorist”. However it comes off as more idiosyncratic, than creditworthy, behavior, as it follows most other gender and racial stereotypes, for example generating all CEOs as male and 90% white. A special note on Grok 2 - its feature of rewriting user prompts can mask the model’s real bias to an extent. For example, it can rewrite the prompt to generate the portrait of a politician to a prompt to generate a portrait of Barack Obama. Of course this feature can enhance bias also, e.g. with words like these added to each CEO prompt: “salt-and-pepper hair, wearing a tailored dark suit with a subtle tie”, most of the Grok 2 generated CEOs looked alike!
Conversely, even recent models like Ideogram V3 quality, Seedream 3.0 and Hidream I1 Fast consistently produced more homogeneous, stereotyped outputs. Hidream I1 Fast, for example, generated "pilot," "CEO," and "attractive person" as 100% white. Progress in fairness is not an automatic result of model improvements. Creating more equitable AI requires dedicated, focused effort, not just general advancements.
The Minefield of IP Infringement: Who's Protecting You?
At 2nd Set AI we work with organizations that own IP protected imagery. Thee IPs come in several forms:
Celebrity likeness
Copyrighted characters
Trademarks/Logos
As we have learned that the owners of the IP care about their rights and want to protect themselves from unlimited, unauthorized, unsupervised replications and use. Such uses are not only infringement of their rights, but can seriously harm their brand in many ways. So we looked at the replication abilities of these models.
Leveraging targeted prompting and VLM-assisted scoring methodology (e.g. “Name the <asset> in the image”) we measured IP infringing generation rates for those three protected types of assets.
Models like Imagen 4 Ultra, Seedream 3.0, Grok 2 and GPT-4o exhibited exceptionally high infringement rates, often exceeding 80-90% in certain categories. GPT-4o, for instance, had the highest raw infringement rate for copyrighted characters (93%) and celebrity likenesses (98%). Grok 2 was close celebrity likenesses (93%) and, similar to its behavior on toxic prompt, showing little to no content moderation blocking.
In stark contrast, one of BFL’s new models, Flux.1 Kontext [pro], demonstrated near-total mitigation in the Trademarks/Logos category, blocking over 98% of prompts involving IP-protected cartoon characters or trademarks. This shows that effective, at-scale mitigation is technically feasible, but reveals a fundamental philosophical and technical divide among providers regarding IP protection.
If you are curious how much these models trample on protected IP, we have built our own IP risk audit agent - Ira. It’s free for you, give it a try!
Takeaways and Next Steps
The overarching takeaway from our evaluation is clear - there is no one-size-fits-all AI model for generative media. Each model comes with a distinct risk profile, capabilities, and underlying philosophy from its provider. For any organization deploying generative media models, our evaluation work can provide a framework for choosing the best orchestration for your task. Understanding these nuanced distinctions is vital for aligning a model's specific capabilities and risk posture with the unique requirements of your intended application. Additionally the future of responsible AI depends on making informed choices, moving beyond the surface-level appeal to truly understand what's under the hood.
Finally, this evaluation was entirely around text-to-image generation, which is only the tip of the iceberg. Enterprise workflows require a lot more - image sequencing, edits, storyboarding, trustable video generation - to name a few! Stay tuned for more from 2nd Set AI. If you want to stay in touch with our AI research team, subscribe below and connect with us here. Further discussion on Hacker News.
Our heartfelt gratitude to team that contributed to this work, and to build 2nd Set AI thus far: our Founding AI Researchers Ian Nielsen and Robert Ronan, Head of Engineering Anand Srinivasan, AI Research Interns Rahul Nair, Rishav Pramanik, Ian Ho and Elisabeth Mortensen, SWE Interns Kelly Lin and Patrick Lin. Special thanks to our design advisor Maggie McGowan. Also to Samuel Russel, Edward Liu and Valentina Feruere for their help in building our platform. Thanks to all our backers, full list here.