There is little doubt that human observers can readily extract visual information from faces to estimate a person’s age1. Cues from wrinkling, pigmentation, hair color, and facial structure all contribute to these estimates2. Yet, human age estimates are imprecise, limited by the observer’s perceptual resolution, by top-down influences3, and by genetic and environmental factors that cause people’s faces to age differently4.

In the current study, we compared the performance of human observers and several popular Artificial Intelligence (AI) programs in estimating people’s ages from photos of their faces. As it turns out, AI platforms make many of the same errors as humans, although to a larger extent. Before exploring what this might mean, we briefly review the kinds of errors that humans typically exhibit when estimating someone’s age and the potential sources of those errors.

Previous research has identified several well-known biases and inaccuracies in our ability to estimate age from facial appearance2. For example, there is a significant decrease in accuracy when estimating age from the faces of middle-aged adults (40–60) compared to young adults (20–40) and of older adults (ages 60–80) compared to middle-aged adults2,5. This decrease in accuracy could be accounted for by the fact that genetic and environmental factors have a larger impact on the appearance of the face as people grow older, and that there is considerable variance in these effects on the apparent age of the face2. The inaccuracies could also stem, at least in part, from biases in evaluating the age of people of different ages: the faces of young adults (ages 20–40) are typically perceived as older than their actual age, whereas the faces of older adults are typically perceived as younger2,5.

These biases are likely due to people’s tendency to use the estimated mean of a given dimensional distribution as a reference point for their judgments6, a regression to the mean effect. In the case of age evaluation, assuming that the perceived mean age of the population is around 40–45 years, it has been suggested that age evaluations of younger and older adults tend to drift in the direction of this perceived mean, creating a bias in which the age of younger adults is overestimated while age evaluations of older adults are biased in the opposite direction2. We note, however, that although bias and accuracy measures in age perception are partially correlated, as we discuss below, the regression to the mean effect cannot fully account for the general decrease in accuracy with the age of the presented faces. In particular, the bias in age estimations for faces of younger (and older) adults is significantly larger than the bias for faces of middle-aged adults2,5, but at the same time, the accuracy of age estimations for faces of middle-aged adults is significantly better than the accuracy for faces of young adults2,5.

Perhaps one of the most intriguing biases in age evaluation can be seen in the way smiling influences perceived age. Recent research from our lab and others strongly suggests that, contrary to common belief7, smiling faces are perceived as older than the faces of the same people when they have a neutral expression5,7,10. This “ageing effect of smiling” (AES) is assumed to be driven by the formation of smile-related wrinkles in the region of the eyes. Remarkably, we observed a sharp dissociation between what the participants believed they had reported and how they actually performed. Thus, in the same experimental session, participants estimated smiling faces as looking older than neutral faces, but at the same time, erroneously assumed that they had rated the smiling faces as younger7,9.

More recent data show that the AES is reduced with the age of the faces presented. The AES is most prominent for faces of young adults of either gender, is evident only in male faces of middle-aged adults, and completely vanishes for the faces of older adults5. This modulation of the AES by the age of the presented faces does not depend on the age of the observer, and is attributed to the relative contribution of information from smile-related wrinkles compared to that provided by other cues to aging. In neutral faces of young adults, which contain few age-related signs, the addition of smile-related wrinkling can easily shift perceived age; the same is not true for faces of older adults, which already contain a number of other prominent cues to aging5.

The AES represents a basic bias in age evaluations for smiling faces compared to neutral ones. But beyond this bias, age evaluations of smiling faces (as well as faces with any expression) are generally less accurate than age evaluations of neutral faces. In developing this idea, it is important to distinguish between two classic aspects of age evaluation. The first is a bias in perceived age; the second is the absolute accuracy of age evaluation. The bias, by definition, is directional (signed), and is measured by subtracting the actual age from the estimated age of a face. In contrast, the accuracy of age evaluations is measured by computing the absolute differences between the actual and the perceived age. In psychophysical terms, the accuracy in age evaluations is related (but not identical) to the concept of the JND (just noticeable difference) measured by the variability of the response11,12, which, in this case, represents the resolution of perceived estimates of age. Bias, however, is related to the psychophysical concept of Constant Error (CE)13,14, and represents the directional difference between the perceived and actual age of a given face or subset of faces. It should be noted that while the two aspects of age evaluations are sometimes related, they can also be partly independent of one another. For example, a bias in age evaluations necessarily results in inaccurate performance, but at the same time, a zero bias in age evaluation does not mean that age estimates are accurate. Consider the case in which there is symmetrical, yet substantial variability in age evaluations of a particular face (e.g., the person’s actual age is 40 years old, but she/he is perceived to be 20 by some people and 60 by others). In this case, the bias in age estimation is zero, yet accuracy is extremely poor, with an absolute error of 20 years!
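The distinction between the two measures can be made concrete with a small computation (a minimal sketch; the function names and example ratings are illustrative, not taken from the study data):

```python
# Bias is the signed mean difference between estimated and actual age;
# accuracy (absolute error) is the mean unsigned difference.

def bias(estimates, actual_age):
    """Signed bias: positive values indicate overestimation."""
    return sum(e - actual_age for e in estimates) / len(estimates)

def absolute_error(estimates, actual_age):
    """Mean absolute error: lower values indicate better accuracy."""
    return sum(abs(e - actual_age) for e in estimates) / len(estimates)

# The example from the text: a 40-year-old face judged as 20 by half
# the observers and as 60 by the other half.
estimates = [20, 60, 20, 60]
print(bias(estimates, 40))            # 0.0  -> no directional bias
print(absolute_error(estimates, 40))  # 20.0 -> yet very poor accuracy
```

The example shows why a zero bias is compatible with extremely poor accuracy: the signed errors cancel while the unsigned errors do not.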

It is important to note that the decrease in accuracy for smiling faces is not solely the result of the AES, but rather represents a more general effect of facial expression on accuracy2,5. This is illustrated by the fact that the decrease in accuracy for smiling compared to neutral faces is found even for faces of older adults, for which there is no AES2,5. The decrease in performance for smiling faces likely represents difficulty in estimating the age of expressive faces due to temporary variations in the appearance of the face that are part of emotional expression2. Previous research has also shown that accuracy in age estimations is higher for male than for female faces2,5,15. Yet, unlike the case of expression, the gender-based accuracy effect can be accounted for, at least in part, by biases in age estimation. In particular, the age of female faces is underestimated to a larger degree than the age of male faces, an effect that is more prominent with faces of older adults2,5. Given that gender differences in accuracy are also more prominent with faces of older adults5, it could be argued that the poorer age-estimation performance for female faces results from this bias.

Recently, there has been a growing interest in automated age estimation using artificial intelligence (AI) technology16. The current platforms use machine-learning algorithms based on training with a large set of photos to achieve the most accurate performance in age estimations. The current interest in age estimation by AI is part of an overall attempt to extract various visual features automatically from faces, features that include identity, expression, and gender as well as other information that can be gleaned from the face. The specific interest in age estimation is also boosted by recent commercial incorporation of automatic age estimation technology for different uses, including age verification in retail outlets that is now being implemented in different countries17. Beyond dozens of commercial companies that offer age estimation technologies, there are also many non-commercial apps and webpages that offer age estimation technology based on photos uploaded by the users. Despite this growth in the technology, it is presently unclear how AI compares to human performance and whether it suffers from the same biases and errors. Here, we provide the most comprehensive attempt to date comparing age estimation performance between humans and the most popular AI technology available today.

Comparing human performance with the performance of AI could provide a better understanding of the processes that underlie age perception in humans. In particular, it allows one to tease apart possible top-down cognitively driven effects from those driven by the image itself. For example, although the ageing effect of smiling (AES) has been attributed to wrinkling around the eyes, it is also possible that this effect is modulated by people’s general belief that smiling makes one look younger, and without this modulation, the AES would be even larger7,8. In a related manner, although the decrease in the accuracy of age estimation for older adults has been traditionally attributed to the visual properties of faces, it could also be due to the common belief that accuracy in age estimations for faces of older adults is a less critical issue than accuracy for young adults, and therefore should be taken more lightly2. In sum, to the extent that AI suffers from the well-established biases and inaccuracies that are part of human performance, it could be argued that the biases commonly exhibited by human observers are largely driven by the visual properties of faces.

Beyond the focus on the factors driving human perception of age, the comparison between human and AI abilities could provide a better understanding of current AI technology and suggest ways to improve this technology. There is strong evidence, for example, that, for other aspects of face recognition such as face identity and gender classifications, current AI technology shows discrimination biases based on ethnicity, gender, and age18,19. It is unclear whether and to what extent, compared to humans, automatic age estimation technology suffers from similar biases. Here, we used a comprehensive sample of 21 different AIs, including commercial programs such as Microsoft Face API, Amazon Rekognition, Everypixel API, as well as leading non-commercial websites and applications (for a full list, see Table 1). We compared AI performance to the performance of human observers using a large set of neutral and smiling female and male faces from different age groups that we have previously reported5.

Table 1 Description of the different AIs tested for age evaluations.

Methods

Participants

The AI performance data were collected over the years 2020–2022, providing a representative set of 21 current commercial and non-commercial AI age estimation technologies (see Table 1). We did our best to include the most prominent players in the field as well as the most popular available websites and apps. Although it is possible that we missed some, we believe that our survey includes a comprehensive sample of current AI technology. AI performance was compared with the performance of 30 undergraduate students from Ben-Gurion University of the Negev (12 males, mean age = 23.4 years, SD = 1.63 years), originally reported in Ganel and Goodale (2021), and reanalyzed for the purpose of the current study. The experimental protocol was approved by the ethics committee of the Department of Psychology at Ben-Gurion University of the Negev. The study adhered to the ethical standards of the Declaration of Helsinki. All participants signed an informed consent form prior to their participation in the experiment. The manuscript contains no information or images that could lead to identification of a study participant.

Stimulus set and design

The stimulus set was identical to the one recently used in our lab5. The set comprised 480 photos of women and men, each photographed with neutral and smiling expressions. The set was based on three databases that included the real ages of the photographed people: the FACES database20, the PAL face database21, and a set of faces photographed by members of the Ganel lab. The photos were equally divided into three age groups: young adults (20–40 years), middle-aged adults (40–60 years), and old adults (60–80 years). Photos were in .bmp format (240–357 × 300 pixels, depending on the source set, 24-bit depth). Figure 1 shows examples of photos used in the set. For a full description of the set, see Ref.5.

Figure 1
figure 1

Examples of stimuli used in the study. The images were of neutral and smiling faces of young adults, middle-aged adults, and old adults. Adapted from Ebner et al.20, all rights reserved.

Each photo was uploaded separately for age evaluation by the different AIs, and the output was recorded in years. For two of the AIs (AI11, AI16) that provided an age range rather than a single estimate, the midpoint of the range was used. Photos were uploaded in .bmp format. In cases in which .bmp was not supported by the AI, high-quality .jpeg versions of the photos with similar dimensions were uploaded. Note that 5 of the AIs (AI3, AI9, AI14, AI18, AI19) did not provide age outputs larger than 69 years. While this limitation could be seen as a first indication of “ageism” in AI, a trend that also appeared in the main analysis and will be discussed in the following sections, the general pattern of results and biases was maintained even when the data of these specific AIs were excluded in a separate analysis.

The full experimental procedure for the human participants is described in Ref.5. As in the case of the AIs, each photo was presented separately, and participants were asked to type their two-digit response on the keyboard. The only substantial difference in the design was that, in order to prevent top-down memory influences for human participants, the stimulus set was counterbalanced so that each participant was presented with only one photo (smiling or neutral) of the same person5,7,8,9,10. This concern was irrelevant in the case of AIs, because none of them stored previously presented data or relied on previous responses for age estimations, which were based entirely on the presented image. Therefore, the entire set of 480 faces was presented to each of the AIs. To ensure reliability, we also uploaded a random subset of the photos twice, on different occasions, for a sample of the AIs. In all these cases, the outputs were identical.

Analysis

For each human participant/AI, accuracy scores were computed for each photo as the absolute difference between the estimated and real age. Bias scores were computed by subtracting the real age from the estimated age. For each participant and for each AI, we computed the average estimated age, the mean accuracy score, and the mean bias score in each combination of age group X gender X expression. The descriptive data in all tables and figures were based on this analysis. To compare human and AI performance statistically, we reanalyzed the data in an item-based manner in which each presented person (photo) was treated as a separate item; 240 items were therefore included in this analysis. Expression (neutral, smiling) for each item was treated as a within-item variable. Data were then averaged separately across all human participants and across all AIs for each expression and for each photo. A mixed ANOVA design was used to analyze the data, in which the expression of the photographed person (smiling, neutral, with 240 photos in each group) and the experimental group (humans, AIs) were the within-subject (item) variables, and the gender of the photographed person and his or her age group were the between-subjects (items) variables. During the item analysis, we found that one of the 480 photos depicted a different person than intended. This photo was removed from all subsequent analyses.
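The scoring step of this pipeline can be sketched as follows (a rough illustration only; the record layout and item names are our own assumptions, not the authors' actual analysis code):

```python
from collections import defaultdict

# Per-response records: (item_id, group, expression, estimated_age, real_age)
records = [
    ("item01", "human", "neutral", 34, 30),
    ("item01", "human", "smiling", 37, 30),
    ("item01", "ai",    "neutral", 38, 30),
    ("item01", "ai",    "smiling", 42, 30),
]

# Collect (signed bias, absolute error) per item X group X expression cell.
cells = defaultdict(list)
for item, group, expr, est, real in records:
    cells[(item, group, expr)].append((est - real, abs(est - real)))

# Average across raters within each cell, yielding one (bias, accuracy)
# pair per item for the item-based analysis.
item_means = {
    key: (
        sum(b for b, _ in vals) / len(vals),   # mean bias
        sum(a for _, a in vals) / len(vals),   # mean absolute error
    )
    for key, vals in cells.items()
}
print(item_means[("item01", "ai", "smiling")])  # (12.0, 12.0)
```

With real data, each cell would contain one response per human observer (or per AI), and the resulting per-item means would feed the mixed ANOVA.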

Results

Table 2 presents the accuracy scores for each combination of gender, expression, and age group for each of the AIs. Average human performance in each combination is presented for reference. Table 3 presents average estimated ages in each combination for each of the AIs, the average perceived age for humans, and the average real age.

Table 2 Absolute age estimation accuracy (in years) of the different AIs compared to average human performance.
Table 3 Average estimated age (in years) for the different AIs compared to average human performance.

As can be seen in the tables, errors and biases across the AIs were similar to the well-established errors and biases found for human perception. First, average AI performance sharply decreased for faces of older adults compared to faces of young and middle-aged adults. As illustrated in Fig. 2a (average accuracy scores in each age group), this decrease with age was larger than the decrease observed in human performance. Second, and perhaps more interesting, is the fact that, just as in human perception, AI estimated smiling faces as older than the neutral faces of the same people (Fig. 2b). As was the case for human observers, this effect was most pronounced for faces of younger and middle-aged adults, and decreased for faces of older adults. However, the AES for AI was larger than the AES seen in human observers for faces of young adults, and the decrease in the size of the AES for older faces was also larger in AI and, in fact, went in the opposite direction. It is important to note, however, that there was considerable variability in the performance of the 21 AI programs we tested. Nevertheless, as the inset in Fig. 2 shows, variability was also evident in our human participants, although to a lesser degree.

Figure 2
figure 2

(a) Age estimation accuracy for the different AIs compared to average human performance. As was the case for human observers, AI performance showed a large decrease in accuracy for faces of older adults compared to faces of younger adults. This effect was larger for AI compared to human performance. (b) The ageing effect of smiling for the different age groups—smiling faces were estimated as older than neutral faces of the same people. Overall, this effect was larger for AI (younger adults), and showed a sharper decrease with age. The inset shows the mean accuracy data (across age group, gender and facial expression) for each AI and each of the human observers. Group averages are indicated by the horizontal lines. Although there was variability in age estimation accuracy in both groups, humans were less variable and more accurate.

A similar trend of human-like biases in AI age estimation (although to a higher degree) was found for the effect of expression and gender on accuracy (Fig. 3a,b). As was the case for human observers, AI performance decreased for smiling compared to neutral faces, an effect that was evident across all age groups. Again, this decrease in accuracy was larger for AI than it was for human observers. A similar trend was found for the effect of gender. Performance accuracy for male faces was higher than for female faces, an effect found for all age groups in AI but only for old adults in the human observers (Table 3). Not surprisingly, therefore, the overall decrease in age accuracy for female compared to male faces was larger in AI (Fig. 3b). Finally, as in the case of human performance, AI showed age-group dependent biases. In particular, faces of young adults were overestimated compared to their real age while faces of older adults were underestimated compared to their real age. Again, this modulation of the bias with age group was larger for AIs compared to human age estimation (Fig. 3c).

Figure 3
figure 3

(a) Average age estimation accuracy for neutral and smiling faces across age group and gender. As in human performance, AI accuracy decreased for smiling compared to neutral faces. This decrease was larger for AI compared to humans. (b) Average age estimation accuracy for male and female faces across age group and expression. As for humans, AI age estimation accuracy decreased for female compared to male faces. This decrease was larger in AI. (c) Age estimation bias for the different AIs across expression and gender. As in human age estimations, AI showed overestimation of age for faces of younger adults and underestimation of age for faces of older adults. Again, this effect was more pronounced in AI compared to humans.

Statistical comparison between AI and human age estimations

To compare AI and human age estimations, we used an item-based analysis in a mixed ANOVA design with expression (smiling, neutral), and experimental group (humans, AIs) as the within-subject (item) variables and with gender and age group as between-subjects (items) variables. The dependent variables were the absolute accuracy score, the mean estimated age, and the directional bias in age estimations.

For accuracy, there was a main effect of group [F(1,233) = 8.27, p < 0.001, ηp2 = 0.07], indicating that the performance of the human observers was more accurate overall than the performance of AI (see Fig. 2). We will discuss this effect in the next section. A main effect of age group [F(1,233) = 18.27, p < 0.001, ηp2 = 0.07] indicated an overall drop in performance with the faces of older compared to young and middle-aged adults [Fig. 2a; F(2,233) = 68.02, p < 0.001, ηp2 = 0.37]. This main effect was qualified by a significant interaction between group (AI and human observers) and age group [F(1,233) = 50.78, p < 0.001, ηp2 = 0.3], which indicates that the decrease in accuracy with age group was larger for AIs than it was for human observers. A main effect was also found for the gender of the photo [F(1,233) = 13.28, p < 0.001, ηp2 = 0.05], indicating a decrease in accuracy for female faces. Again, this main effect was qualified by a significant gender X group (AI and human observers) interaction [F(1,233) = 16.57, p < 0.001, ηp2 = 0.06], showing that the decrease in performance for female faces was larger in AI (see Fig. 3b). A main effect was also found for expression [F(1,233) = 122.32, p < 0.001, ηp2 = 0.34], indicating reduced accuracy for smiling compared to neutral faces. Again, this main effect was qualified by an expression X group (AI and human observer) interaction [F(1,233) = 60.59, p < 0.001, ηp2 = 0.16], again showing that the effect of expression on accuracy was larger in AI (Fig. 3a).

An interaction between expression and age group [F(2,233) = 5.59, p < 0.01, ηp2 = 0.05] indicated that the effect of expression was more pronounced in the faces of young adults compared to the other age groups. A gender X age-group interaction [F(2,233) = 4, p < 0.05, ηp2 = 0.03] indicated that the effect of gender was larger in the old adults group compared to the other groups. All other interactions were not significant (p > 0.05).

For the dependent variable of average estimated age, there was a main effect of expression [F(1,233) = 13.96, p < 0.001, ηp2 = 0.06], indicating that overall, smiling faces were estimated as older than neutral faces (the AES). This main effect was qualified by an expression X age group interaction [F(2,233) = 20.13, p < 0.001, ηp2 = 0.15], indicating a decrease in the AES with age group. This decrease was larger for AIs compared to human observers, as indicated by a significant group (AI and human observers) X expression X age-group interaction [F(1,233) = 12.74, p < 0.001, ηp2 = 0.1].

A main effect of group (AI and human observer) [F(1,233) = 16.78, p < 0.001, ηp2 = 0.07] indicated that, overall, age was underestimated by the AIs compared to human observers. A main effect of age group [F(2,233) = 1552.46, p < 0.001, ηp2 = 0.93] indicated older age estimations for older adults. This effect was qualified by group X age group interaction [F(2,233) = 47, p < 0.001, ηp2 = 0.29], indicating that faces of older adults were more strongly underestimated by AIs compared to humans than in the other age groups. A marginally significant 3-way interaction between group (AI and human observer), expression, and gender [F(1,233) = 4.28, p = 0.04, ηp2 = 0.02], probably resulted from the larger decrease in performance for male smiling faces compared to female smiling faces in human, but not AI, estimations of age. All other main effects and interactions were not significant (p > 0.05).

Finally, we analyzed the dependent variable of bias in age estimation using a similar design. We note that there is an overlap between this analysis and the analysis of average perceived age. However, because the differences between the estimated and the real ages within each combination of age group and gender were not identical, we decided to include this analysis as well. There was a main effect of group [F(1,233) = 18.94, p < 0.001, ηp2 = 0.08], indicating that, overall, AI underestimated the age of faces more than human observers did. A main effect of age group [F(2,233) = 101.2, p < 0.001, ηp2 = 0.47] indicated that the bias in age estimations was modulated by age. A group (AI vs. human observer) X age-group interaction [F(2,233) = 47.73, p < 0.001, ηp2 = 0.29] showed that age overestimation of faces of younger adults and underestimation of faces of older adults were larger in AI (Fig. 3c).

A main effect of expression [F(1,233) = 13.96, p < 0.001, ηp2 = 0.06] indicated that smiling faces were estimated to be older than neutral faces (the AES). This effect was qualified by an expression X age-group interaction [F(2,233) = 20.1, p < 0.001, ηp2 = 0.15], suggesting that there was a decrease in AES as people got older. A 3-way group X expression X age-group interaction [F(2,233) = 12.7, p < 0.001, ηp2 = 0.09] indicated that the decrease in AES with age group was larger in AI. A 3-way group (AI vs. human observer) X expression X gender interaction [F(2,233) = 4.3, p < 0.05, ηp2 = 0.02] probably resulted from the fact that the interaction between gender and expression was larger for human observers than it was for AIs. All other main effects and interactions were not significant (p > 0.05).

Discussion

The general pattern of results was robust: AI showed human-like age estimation biases and inaccuracies across all aspects of performance tested in the current study. Moreover, all biases and inaccuracies were significantly larger in AI than in human observers.

Because AI programs, unlike human observers, have no top-down preconceptions or opinions about the effects of age range, gender, or facial expression on apparent age, the results suggest that the well-established biases in human age perception are heavily driven by the visual properties of the faces. In the case of the ageing effect of smiling (AES), the results reinforce the notion that smiling faces are estimated as looking older than neutral faces of the same individuals because of the formation of smile-related wrinkles around the eyes8. As in human perception, the AES in AI was demonstrated for faces of young and middle-aged adults. Still, for faces of young adults, the AES was on average more than two-fold larger in AI compared to human observers (2.4 vs. 1.1 years) and was also moderately larger for faces of middle-aged adults (0.7 vs. 0.6 years).

Interestingly, for faces of older adults, for which AES is absent in human perception5, AI showed an opposite pattern of results: smiling faces were perceived as younger than neutral faces. We can only speculate as to the source of this effect. Recently, we observed a similar pattern of results in human observers who were asked to estimate the age of photos of faces of old adults wearing face masks22. The similarity between patterns of results may imply that machine-learning technology relies more heavily on partial facial information in the upper region of the face that is unobscured by masks23. The use of such information has also been associated with reduced holistic processing of faces23,24,25.

The accuracy of age estimation by AIs showed a significant decrease for faces of older adults compared to faces of young and middle-aged adults. Such a decrease in performance has been previously observed in human age perception, and results, at least in part, from an objective difficulty in estimating the age of older adults2. Yet, this decrease is larger in AIs. The large difference in accuracy between human observers and AI for faces of older adults compared to faces of other age groups cannot be easily accounted for by the objective difficulty of extracting age from face photos. After all, a few of the AIs we tested performed particularly well, even for faces of older adults (Fig. 2a), which suggests that, in principle, visual information can be used effectively to achieve reasonable performance. In short, there is an indication of “ageism” in some AIs, which could be due to inequalities along the age range of the sets of faces used during training. In particular, it could have resulted from an under-representation of faces of old adults in the training sets used in machine learning19. This decline in performance for faces of older adults in AI could also have been boosted by the fact that the age estimates produced by some of the AIs in our sample could not exceed 70 years of age. Finally, the inaccuracy in age estimation for older adults in AI compared to humans could also have been due to a larger bias in age estimation for the faces in this age group (Fig. 3c). As we noted in the Introduction, however, biases by themselves, and in particular biases due to a regression to the mean effect, cannot fully account for the pattern of errors we observed.

Still, at least in the case of human observers, it is likely that the overestimation of the ages of faces of younger adults and the underestimation of the ages of faces of older adults are due to a regression to the mean effect. In particular, as in many other cases of uncertainty within a given modality6,26, people’s average quantitative judgments tend to be biased in the direction of the perceived, or assumed, mean. The regression to the mean effect can also account for the fact that little or no bias was found in age estimations of faces of middle-aged adults, for which the actual age is closer to the perceived mean of the population2,5. Interestingly, the reduced accuracy in age estimations for female compared to male faces could also be accounted for, at least in part, by effects of regression to the mean. In particular, as shown in Table 3, the ages of female faces were underestimated to a larger degree than those of male faces. Therefore, it is possible that the decrease in accuracy for female faces resulted from a larger regression to the mean effect for female compared to male faces. As for AI performance, it could be the case that at least part of the larger biases found in AI age estimation compared to human observers could be attributed to larger effects of “regression to the mean” in AI technology. Still, to the best of our knowledge, previous literature has not discussed regression to the mean in the domain of machine-learning performance, although this basic statistical phenomenon would almost certainly be operating. Nevertheless, the exact manner in which regression to the mean operates in AI remains to be explored.
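One simple way to see how noisy estimates combined with an assumed population mean can produce this pattern of bias is a toy shrinkage model (entirely illustrative; the assumed mean, the shrinkage weight, and the example ages are hypothetical choices, not parameters estimated from the data):

```python
# Toy model: each estimate is a weighted average of the face's true age
# and an assumed population mean of ~42 years. The shrinkage weight
# stands in for the pull toward the mean under perceptual uncertainty.
POPULATION_MEAN = 42.0
SHRINKAGE = 0.25  # weight given to the assumed mean (hypothetical value)

def estimated_age(true_age, shrinkage=SHRINKAGE):
    return (1 - shrinkage) * true_age + shrinkage * POPULATION_MEAN

for true_age in (20, 42, 70):
    est = estimated_age(true_age)
    print(true_age, est, est - true_age)
# 20 -> 25.5 (bias +5.5): young faces overestimated
# 42 -> 42.0 (bias  0.0): near the mean, no bias
# 70 -> 63.0 (bias -7.0): older faces underestimated
```

The model reproduces the qualitative pattern reported above: overestimation below the assumed mean, underestimation above it, and little bias near it; a larger shrinkage weight would mimic the stronger biases observed in the AIs.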

The human observers in our study were young adults, which might have led to reduced accuracy in estimating the ages of older adults because of an “own-age bias”27,28. Previous studies, however, have shown no effects of own-age biases on the accuracy of age estimation2,5. For example, in a recent study, we found no advantage in age-estimation accuracy for middle-aged adult observers over young observers with faces of middle-aged adults5. In addition, in a comprehensive study that examined the accuracy of age estimation by observers of different ages for faces of different ages, older adult participants did not show any advantage for faces of older adults. Instead, there was an overall decrease in their performance compared to participants from other age groups, and they were equally inaccurate in age estimations of faces from all age groups2. We can therefore conclude that the larger decrease in performance with faces of older adults in AI compared to humans in the current study did not benefit from the fact that our human participants were young adults. Indeed, it is likely that the differences between AI and human performance would be exaggerated even more for older human observers.

Another possible cause of inaccuracies in age estimation is the congruency between the ethnicity of the observer and that of the faces presented for evaluation15. Our human participants and the facial database we used were both Caucasian. In the case of AI, however, it is entirely possible that some AIs were trained on faces drawn from a variety of ethnic groups, whereas the training sets of others might have been less diverse. This could account for some of the differences we observed among the 21 different AIs we tested, as well as for the overall difference between the performance of the AIs and that of our human observers. One finding that cannot easily be explained by possible differences in the training sets, however, is that AIs show a larger AES than human observers do. Recent evidence shows that the AES in humans is not affected by the ethnicity of the observers or the ethnicity of the faces presented9. Still, the so-called ‘own-race’ effect on other biases and inaccuracies in age estimation needs to be addressed seriously, but doing so is well beyond the scope of the present study.

Two other facial attributes that affected the accuracy of age estimation were the expression and the gender of the face. Both humans and AIs showed an average decrease in performance for smiling compared to neutral faces, and for female compared to male faces. This decrease in performance was again larger for the AIs we tested. It is entirely possible that the larger decrease in AI performance was again a consequence of the training regimes, and in particular of an underrepresentation of smiling faces and female faces in the sets used for machine-learning training. We note that neither effect can be accounted for by differences in age-perception biases. This is indicated by the fact that differences in AI accuracy between smiling and neutral faces were observed even in conditions in which there was no difference in bias between the two expressions (i.e., for faces of older adults). In a similar vein, differences in AI accuracy between female and male faces were observed in conditions in which there were no bias differences with respect to gender (i.e., for faces of young and middle-aged adults). As in other domains of automatic face recognition19, these results reinforce the need to adjust AI training protocols to avoid potential age-estimation inequities based on age, gender, and facial expression.
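The dissociation between bias and accuracy invoked above can be made concrete with a short sketch. The data below are hypothetical, chosen only to show that two sets of estimates can be equally unbiased (mean signed error near zero) yet differ sharply in accuracy (mean absolute error).

```python
# Illustration of the bias/accuracy distinction with hypothetical data.
# Bias: mean signed error (positive = overestimation on average).
# Accuracy: measured here as mean absolute error (lower = more accurate).

def signed_bias(estimates, actual_ages):
    """Mean signed error across a set of age estimates."""
    return sum(e - a for e, a in zip(estimates, actual_ages)) / len(estimates)

def mean_abs_error(estimates, actual_ages):
    """Mean absolute error across a set of age estimates."""
    return sum(abs(e - a) for e, a in zip(estimates, actual_ages)) / len(estimates)

actual = [30, 30, 30, 30]
tight = [29, 31, 30, 30]  # hypothetical observer: small scattered errors
loose = [20, 40, 25, 35]  # hypothetical observer: large scattered errors

# Both observers are unbiased on average, yet their accuracy differs.
print(signed_bias(tight, actual), mean_abs_error(tight, actual))  # 0.0 0.5
print(signed_bias(loose, actual), mean_abs_error(loose, actual))  # 0.0 7.5
```

This is why equal bias across two conditions (e.g., smiling vs. neutral faces) does not imply equal accuracy: accuracy also reflects the spread of the errors, not just their average direction.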

Overall, across the age groups, genders, and facial expressions in the set of photos we presented, average accuracy in age estimation was significantly higher in human observers than in AI. This average difference in performance was a consequence of the larger inaccuracies of AI for faces of older adults, smiling faces, and female faces. It is important to note that our human observers were undergraduate students with no explicit training (or expertise) in age estimation, whereas the AIs were a representative sample of the most prominent players in the field, trained to estimate age from huge sets of face photos. In this light, the average performance of the undergraduates we tested is impressive. The overall advantage for human observers is also evident in the individual scores of the human and AI participants: across all human observers and AIs, the most accurate average performance was recorded for one of the human participants (see inset in Fig. 2). The larger biases and inaccuracies of AI compared to human observers suggest that current AI age-estimation technology still has some way to go before it equals human performance.

Finally, it is important to note that this study was not focused on the particular architectures or training sets that individual AIs might use when tasked with estimating age. We acknowledge that architecture and training sets will almost certainly affect the biases and inaccuracies of age estimation. The question we addressed here is whether the output of these AI platforms is susceptible to the same biases, and shows the same inaccuracies, as humans when confronted with faces that differ in age, facial expression, and gender. Our hope for this approach was twofold: first, that by documenting the performance of the range of AIs currently available, we would gain some insight into how humans perceive age; and second, that this exercise would provide some new directions for the development of more accurate and less biased AI technology. We believe that we have made some headway on both fronts.