NIST's previous age estimation report was produced 10 years ago, and as the new report concludes, progress in that decade has been substantial. The 2014 testing was performed using a dataset built from visa application photographs collected in consular offices in Mexico. The dataset comprises 5,738,091 subjects (people), with a total of 6,249,294 images of those subjects captured at a resolution of 252x300 pixels.
Using that exact same dataset in 2024, NIST found that five of the six algorithms under test outperformed the most accurate algorithm tested in 2014, and that the best mean absolute error (MAE) on that dataset has fallen from 4.3 to 3.1 years.
This represents real progress and we can expect these technologies to continue to improve as algorithms and data are continuously enhanced.
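For readers less familiar with the metric, MAE is simply the average absolute difference between estimated and true ages, so lower is better. A minimal sketch, using invented ages rather than NIST data:

```python
# Minimal illustration of mean absolute error (MAE) for age estimation.
# Ages below are invented for illustration only; they are not NIST data.
true_ages = [21, 34, 17, 45, 29]
estimated_ages = [24, 31, 20, 44, 33]

# MAE is the average of the absolute differences between estimate and truth.
mae = sum(abs(est - true) for est, true in zip(estimated_ages, true_ages)) / len(true_ages)
print(f"MAE: {mae:.1f} years")  # 2.8 years for this toy sample
```

An MAE of 3.1 years on the visa dataset means that, on average, an estimate was 3.1 years away from the subject's documented age.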
NIST used four additional datasets in the 2024 test:
1. FBI mugshots - 1,482,667 subjects captured using a standardised photographic setup, with the majority of images being 480x600 pixels
2. Border crossings - 632,520 subjects captured on webcams oriented by immigration officers
3. Immigration application photos - 802,332 subjects captured using a standardised photographic setup at an attended interview at immigration offices in the US. The majority of photos have a uniform, white background, eyeglasses are absent, subjects are in a frontal pose and images are 300x300 pixels.
4. Kalina Everyday - 1,991 self-portrait photos taken daily for longitudinal study purposes
An important finding is that age estimation accuracy was often lowest on the border crossings dataset and highest on the immigration application photos. We can only speculate, but this is likely to be correlated with the respective image quality of those datasets: border crossing images are captured in a non-standardised way on cheap webcams, with cluttered backgrounds and variable lighting, whereas the immigration application photos are higher quality thanks to the standardised capture process.
This finding tells us that deployers of such technologies should consider how their deployment context might affect operational performance. Retailers, for example, could find that a checkout equipped with a higher-quality camera, positioned for good lighting and minimal background clutter, performs more accurately than an inferior deployment and therefore inconveniences fewer shoppers with unnecessary ID checks.
To get the most out of this technology, deployers, whether they are offline retailers or online services, should verify that their performance in the field matches that achieved in lab-based testing.
Another finding is that wearing eyeglasses affects estimation error, with four of the six algorithms under test showing higher errors for both males and females when glasses are worn. Over time we should expect such performance differentials to narrow as tech providers gain access to more training and testing data covering such appearances.
However, this finding indicates that testing should also consider other presentation factors, such as cosmetics, piercings, false eyelashes and even tattoos.
It is very welcome to see NIST move beyond the Fitzpatrick skin type scale in their treatment of demographic bias. Instead, they use country of birth as a proxy for ethnicity, since this data point is available on the immigration applications they have access to, but they are very transparent about the imperfections of this proxy:
1. It ignores local ethnic variation
2. Part of the population will have transnational ancestry
Across age groups and algorithms, the report tends to find that false positive rates, where someone whose true age is below the legal age limit is estimated to be above it, are highest for West African females and lowest for East European males. But what we cannot say, without further work, is how this performance differential compares with the equivalent judgement made by a human, or at what threshold we should deem performance differentials to be discriminatory.
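To make the false positive notion concrete, here is a minimal sketch of how such a rate could be computed against a legal age limit; the threshold of 18 and the sample records are hypothetical, not taken from the report:

```python
# Hypothetical illustration of a false positive rate against a legal age
# limit. A false positive here means the system estimates someone to be at
# or above the limit when their true age is below it. Records are invented,
# not taken from the NIST report.
LEGAL_AGE_LIMIT = 18

records = [
    {"true_age": 16, "estimated_age": 19},  # under-age, estimated over: false positive
    {"true_age": 17, "estimated_age": 16},  # under-age, estimated under: correct
    {"true_age": 17, "estimated_age": 18},  # under-age, estimated over: false positive
    {"true_age": 15, "estimated_age": 14},  # under-age, estimated under: correct
]

under_age = [r for r in records if r["true_age"] < LEGAL_AGE_LIMIT]
false_positives = [r for r in under_age if r["estimated_age"] >= LEGAL_AGE_LIMIT]
fpr = len(false_positives) / len(under_age)
print(f"False positive rate: {fpr:.0%}")  # 50% in this toy sample
```

Comparing such rates across demographic groups, as NIST does, is what exposes the differentials discussed above.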
NIST is a globally recognised science laboratory tasked with evaluating and mitigating the risks of AI systems under President Biden's executive order. Their report will play a vital role in helping the industry build credibility and trust by providing independent verification of system performance.
Serve Legal can help age estimation tech providers and deployers go a step further by curating purpose-built datasets that facilitate testing using either pre-collected images or live presentations, with ground truth across multiple demographic characteristics, decorative presentation differences and environmental factors.