It’s incredibly hard to keep up with the pace of progress in generative media these days. Perhaps the biggest recent news has been OpenAI’s mind-blowing work on 4o Native Image Generation. But it can be hard to truly understand what this progress really means and some of its broader implications. So, if you too want to understand more about the latest in generative imagery, enjoy this multipart experiment bringing the latest in genAI images together with some fascinating (and occasionally whimsical) test cases drawn from the history of English dress.
Historical Fashion
Our jumping off point for this exercise comes from Bernadette Banner, a UK-based historical dress expert who focuses “primarily on English dress predating the widespread use of the electric sewing machine.” She has an amazing and tremendously popular YouTube channel documenting her various reconstructions of historical garments, not to mention an excellent book, various courses, and more.
In April of last year, she put up a video where she attempted to generate historically accurate oil portraits and evaluate the accuracy of the fashion in them. In the main part of the video, she tested Tudor, 1560s, 1630s, 1760s, and Edwardian historical dress, across men and women.
There are a number of reasons why she got such bad results, including not using the best models available at the time. Some of her bad generations are just older Stable Diffusion issues. But others are about grounding, which is her real concern.
In this context, what she (and her substantial audience) were interested in was whether the AI could accurately understand fashion concepts like Tudor, Regency, and Edwardian. And last spring, the most commonly used models couldn’t. They were insufficiently grounded. But are they still so bad?
The Banner Test
With the latest advances in image generation, including OpenAI’s mind-blowing work on 4o Native Image Generation, my suspicion was that generative imagery could now satisfy this expectation of grounding.
So, I chose to frame this intuition in a Turing test-style formulation:
Given a period of historical dress, can a generative model produce an image so accurately grounded that an expert can identify the correct period from the image alone?
I call this formulation The Banner Test. If Bernadette can tell what period the picture is supposed to be from, then we have evidence of a human level of grounding in the generation.
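To make the pass criterion concrete, here’s a minimal sketch of how that “within a few decades” check can be scored. The function name and the 30-year tolerance are my own illustrative choices, not anything formal from the test.

```python
def passes_banner_test(guess_start: int, guess_end: int,
                       target_year: int, tolerance: int = 30) -> bool:
    """The generation passes if the expert's guessed date range lands
    within a few decades of the intended target year."""
    return (guess_start - tolerance) <= target_year <= (guess_end + tolerance)

# Example: a guess of "1750-1775" for a 1760s target passes.
print(passes_banner_test(1750, 1775, target_year=1765))  # True
```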
Experimental Results
Much to my amusement, the generative model actually started out with something of a cheat: it put the date literally in the painting. Note that you may need to tap or click on the images to see the full detail, including the dates.


But this was easily fixed with some manual masking of those generated dates. And 3 of the 5 didn’t show that cheating behavior at all.
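I did the masking by hand, but the same fix is easy to script. Here’s a minimal Pillow sketch; the file name and box coordinates are placeholders that would need adjusting per image.

```python
from PIL import Image, ImageDraw

def mask_painted_date(path: str, box: tuple[int, int, int, int]) -> Image.Image:
    """Cover the region where the model painted a literal date.
    box = (left, upper, right, lower) in pixels."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, fill=(40, 30, 20))  # a dark tone that blends into an oil-portrait background
    return img

# Placeholder file name and coordinates, for illustration only.
mask_painted_date("tudor_portrait.png", box=(820, 40, 1000, 110)).save("tudor_portrait_masked.png")
```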



To me, these look pretty good, but I’m no Bernadette Banner, so what do I know? The whole point of this exercise is about domain specialized knowledge being required to evaluate the quality of the grounding.
Instead, I tapped a real fashion expert, NYC-based fashion designer Maggie McGowan. Maggie is a womenswear designer who has led teams at thoroughly modern brands, such as Michael Kors’ Runway team, and is thus not specifically a historical dress expert. But her fashion expertise is a good proxy for what a fashion professional is likely to know.
When she evaluated the set, like any expert, she leaned on knowledge of specific details.
For the 1760s woman:
Based on shape of headpiece and the jewelry details I'm thinking... 1750-1775.
And for the 1810s man:
[I]t's giving me a Mr. Darcy vibe from Pride and Prejudice so I'm going with early 1800's, 1810-1820?
Interestingly, when I surveyed individuals without fashion expertise, many folks explicitly named Mr. Darcy, which speaks to how clearly this fashion concept is understood throughout broader society.
In her overall evaluation, Ms. McGowan accurately identified all but one of these images’ time periods within a few decades. For the outlier image, we concluded that the miss was more due to the difference in expertise between a present-day working fashion designer and a fashion historian. Presumably Amber Butchart would’ve gotten a perfect score.
Based on this, I’m going to conclude that the Banner Test is now passed.
Round Trip Testing
Of course, you may not have a Maggie McGowan, Amber Butchart, or Bernadette Banner on call. Expert human evaluation is a rare and valuable commodity. So, one way we generally validate progress in models is through what’s called round-trip testing (or sometimes back translation).
The core idea is that if you provide an input to a model, then the model should be able to perform the inverse operation and recover the original input. So, in translation, if I provide the word, “Hi,” to a model to translate into French, then it should produce, “Salut.” And then I should be able to provide that to the English translation model, to recover the input, “Hi.”
So, to do the round-trip testing version of the Banner Test, all you need is a vision model. For testing the new 4o Native Image Generation capabilities, I decided to use a completely independent model as a judge: Google’s Gemini 2.0 Flash. It’s been my experience, and that of many colleagues, that AI models from different organizations have different strengths and weaknesses, so it seemed more informative to evaluate using a non-OpenAI vision model.
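Here’s a minimal sketch of that round-trip harness. The generate_portrait and estimate_period helpers are hypothetical stand-ins for whichever image model and vision model you use (4o image generation and Gemini 2.0 Flash in my case); the point is the harness logic, including the same “within a few decades” tolerance used in the human evaluation.

```python
import re

# Stand-ins for the two halves of the round trip: plug in your own
# image-generation model and independent vision-model judge.
def generate_portrait(period_prompt: str) -> bytes:
    raise NotImplementedError("call your image-generation model here")

def estimate_period(image: bytes) -> str:
    raise NotImplementedError("call your vision-model judge here")

def round_trip(period_prompt: str, target_year: int, tolerance: int = 30) -> bool:
    """Generate a portrait for a period, ask an independent vision model to date it,
    and check whether the recovered date range covers the original target year."""
    image = generate_portrait(period_prompt)
    answer = estimate_period(image)  # e.g. "... most likely painted in the 1620s or 1630s."
    years = [int(y) for y in re.findall(r"1[5-9]\d\d", answer)]
    if not years:
        return False  # the judge gave no usable date, so the round trip fails
    return min(years) - tolerance <= target_year <= max(years) + tolerance

# Example: round_trip("an oil portrait of an English gentleman in 1630s dress", target_year=1635)
```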
Results
So, does the round trip testing work? Yes!
First, we do have to account for the generative model’s tendency to literally paint the date into the painting. If you don’t, the vision models will just read it:
The year this man was painted is clearly visible in the top right corner of the image: 1630.


With the date masking fixes in place, you can see that the vision model is more than capable of recovering the original dates and explaining its reasoning, just as a human expert would.
Without any identifying information on the painting itself, such as a visible date or artist signature, I can only estimate the year based on the style of the portrait:
Clothing: The man's dark, relatively plain doublet with a simple white collar (though with some lace detailing) is typical of the early to mid-17th century. The less elaborate style compared to later periods of the century is a key indicator.
Hair and Facial Hair: The longer hair, mustache, and short beard were fashionable during the 1620s and 1630s.
Painting Style: The realistic depiction and the use of light and shadow are consistent with portraiture of that era.
Based on these stylistic clues, this portrait was most likely painted in the 1620s or 1630s.
Like Ms. Banner, the model focuses on the structure of the collar and the specific details of the hair and facial hair to definitively identify the period. The same level of reasoning and accuracy continued throughout all of the experiments.
Tudor
Based on the style of dress, particularly the gable hood, the high neckline with a ruff-like element, and the overall composition, this portrait most likely depicts a woman from the late 1530s or 1540s. This places it firmly within the Tudor period.
As with Ms. Banner, the model focuses on the gable hood and ruff to establish Tudor dress.
1760s
Based on the style of her clothing, particularly the wide panniers under the skirt, the fitted bodice with a relatively low neckline, the lace trim, and the overall silhouette, this portrait is typical of the 1760s or 1770s.
Note that it was specifically the absence of panniers that Ms. Banner called out in the original video. That issue is now addressed by the generative model and understood accurately by the vision model.
Regency
Based on the style of the man's clothing, particularly the high-collared cravat, the tailored coat with a nipped waist and slightly flared skirt, the fitted trousers, and the overall formal but relatively understated appearance, this portrait is typical of the Regency era, which spans roughly from 1811 to 1820.
Therefore, it is highly likely that this man was painted sometime within that decade, around 1810-1820.
Like Ms. Banner, the model focuses in on several clear, tightly datable period details, like the fitted trousers.
Edwardian
Based on the style of the woman's clothing, particularly:
The "S" bend silhouette: The corseted waist, the pigeon breast, and the skirt that flares out at the back are characteristic of the Edwardian era.
The high neckline with lace detailing: This was a common feature of Edwardian blouses and dresses.
The puffed "leg-of-mutton" or gigot sleeves (though more subdued): While the most extreme versions were earlier, a softer, more rounded fullness at the shoulder was still present in the Edwardian period.
The large, wide-brimmed hat adorned with feathers and flowers: This is a quintessential element of Edwardian fashion.
The long, flowing skirt: While not excessively wide, it has the characteristic shape of the time.
The pearl necklace: A popular accessory.
Considering these elements, this portrait was most likely painted during the Edwardian era, which spanned from roughly 1901 to 1910.
And as in Ms. Banner’s previous test, the Edwardian output is the most successful. It’s really fascinating to see so many period-specific details be understood by both the generative model and the vision model.
There are literally no errors in the above, as far as I’m able to establish. So, the round-trip version of the Banner Test is also passed!
Imaginative Generative Fashion
Of course, replication of imagery is just one possible use of generative images. Another really popular use case is imagining images that haven’t existed before. So, further building on the work of Ms. Banner, I decided to include the other key element of her channel’s content: her guinea pigs, Lyra and Minerva. Redoing the experiment with guinea pigs actually provides an interesting testbed for further modeling questions.





Since this is primarily an entertainment type use case, it’s best to start with a subjective evaluation. To me, these are excellent interpretations of the creative brief. I would not at all be surprised to see these for sale on Etsy as hipster home decor.
Historical guinea pig fashion, ‘tis peak!
Of course, as an AI researcher, it’s more interesting to evaluate their accuracy. We’ve already validated that vision models can understand historical human fashion, but can they understand the historical fashion of guinea pigs?
Again, very much yes. They even seem to understand the context.
This is a whimsical portrait of a guinea pig dressed in the style of a Tudor-era woman.
At worst, they simply expand their range of dates for the estimated “painting.”
This style is most strongly associated with the Georgian and Regency eras, roughly spanning from the late 1700s to the early 1830s.
However, as the subject is a guinea pig, the "painting date" is fictional and used for comedic effect to place the animal in that historical context.
But, again, there were no true errors. So, the guinea pig variant of the round-trip version of the Banner Test is also passed! I’ll clear a space on my shelf for the now-obligatory Test of Time Award for this research breakthrough.
Reasoning
But we’re not done yet! No discussion of generative models is complete these days without a discussion of reasoning. This topic is all the rage in the language domain, but I’d argue that it’s much more interesting in the visual domains of images and videos. Reasoning is one of the best ways a model can address truly hard problems.
One of the classic hard problems in generative imagery is hands. This proved true in this experiment.
Hands are hard. But paws are harder.
In the guinea pig versions of the historical dress generation, the 4o image generation model chose to largely put guinea pig heads on human bodies, which means that the figures have human hands. But guinea pigs have entirely different appendages, with four toes on their front paws and three toes on their back paws.
While this choice is not wrong in an artistic sense, it does diverge from accurate guinea pig anatomy, so a possible goal could be to ensure that furry generations are more respectful of species-specific anatomy.
To test this, I reran the experiment, this time deploying an advanced multimodal reasoning model developed by me and Saurav Pandit. One of the key abilities of this model is to introduce additional context-specific grounding into visual reasoning during generation. In this case, it was enriched with a (generative) understanding of guinea pig anatomy. Again, note that you may need to tap or click on the images to see such details as whiskers and the number of fingers and toes.
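I can’t share that model here, but the general pattern of injecting context-specific grounding into the generation step can be sketched roughly like this. The constraint dictionary and prompt wording below are purely illustrative, not the actual system.

```python
# Illustrative only: folding species-specific anatomical grounding into a
# generation prompt, roughly the kind of context the reasoning step was enriched with.
GUINEA_PIG_ANATOMY = {
    "front paws": "four toes",
    "back paws": "three toes",
    "whiskers": "prominent, radiating from the muzzle",
    "body": "a compact, tailless rodent body rather than a human torso",
}

def grounded_prompt(period: str, anatomy: dict[str, str]) -> str:
    """Render the anatomical constraints into the generation prompt."""
    constraints = "; ".join(f"{part}: {detail}" for part, detail in anatomy.items())
    return (
        f"An oil portrait of a guinea pig in historically accurate {period} dress. "
        f"Respect guinea pig anatomy ({constraints}) and do not paint human hands."
    )

print(grounded_prompt("Edwardian", GUINEA_PIG_ANATOMY))
```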





Again, we have accurate historical dress, but now we also have more obviously guinea pig anatomical details. Guinea pig fashion, ’tis still peak!
The anatomical accuracy of guinea pigs in historical fashion may not seem like a tremendously important problem, but coherence in imagined visuals can be a huge business concern. The Godzilla vs. Kong MonsterVerse franchise successfully solved such speculative visual reasoning problems and has grossed billions of dollars at the global box office. It seems unlikely that would’ve happened if Godzilla and Kong didn’t have consistent and coherent anatomy, relative size, and other specific details conveying the verisimilitude of an obviously fantastical world.
The Times, They Are a-Changing
Generative media is moving incredibly quickly. The original Bernadette Banner video was made less than a year ago! And as described in exhaustive detail above, the Banner Test has now been passed, validated via round-trip testing, and even extended to imaginative domains like furries (with or without anatomical accuracy). GenAI’s pace of progress is similarly peak.
What should anyone using or building these models take away from this experiment? Here are some suggestions:
Generative media is advancing incredibly quickly. Tests of these examples done the month before 4o Native Image Generation’s release produced worse results. This line was just crossed.
Models will cheat if you let them. It’s a common experience to find models literally passing elements of the prompt text through into the resulting generation. Supervision of generative media is necessary for most real-world use cases; see the OCR sketch after this list for one cheap form of that supervision.
Advances in foundation model capabilities are symbiotic with advances in context-specific reasoning. Do your furry Edwardians have five-fingered hands or four-toed paws? Where did you record that decision, and does the model have an understanding of that constraint?
Grounding is rapidly improving, but it is still highly context dependent. These tests work incredibly well because they’re pulling from a rich archive of historical dress imagery. Most enterprises don’t have the same volume of imagery documenting their own assets. And in some cases, the grounding may be coming from unlicensed data, creating issues like the Studio Ghibli controversy, which I discussed more here.
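As one cheap, concrete example of the supervision mentioned above, here’s a sketch that assumes the Tesseract OCR binary and the pytesseract wrapper are installed: run OCR over each generation and flag any image where the model has painted a recognizable year onto the canvas, so it can be masked or regenerated before evaluation.

```python
import re
from PIL import Image
import pytesseract  # requires the Tesseract OCR binary to be installed

def flag_painted_years(path: str) -> list[str]:
    """Return any year-like strings the model painted into the image,
    so a human can mask the date or regenerate before evaluation."""
    text = pytesseract.image_to_string(Image.open(path))
    return re.findall(r"1[5-9]\d\d", text)

# Example: flag_painted_years("1630s_man.png") might return ["1630"] if the date was painted in.
```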
Finally, if you’re interested in future experiments on the state of the art in generative media, model evaluations, and reasoning, please subscribe to this Substack. There are many more interesting questions still to ask.
Post written by Jeff Smith, based on work done with Saurav Pandit and Maggie McGowan.