Reviewing Google, Amazon, Microsoft, Affectiva, Kairos, and Clarifai
AI expert Andrew Ng has a rule of thumb for what you can train AI to do:
If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.
Humans can generally judge what emotion someone is feeling from their facial expression with a quick glance (although there are questions about whether the emotions we detect are universal across cultures). So it seems like a good idea to teach computers to detect emotions from facial expressions. You can do it with tools from the big guys: Amazon, Google, and Microsoft. There are also startups working on the same thing (Affectiva, Kairos, and Clarifai are covered here).
One obvious way to go would be to try to detect Ekman’s six universal emotions (Joy a.k.a. Happiness, Sadness, Surprise, Fear, Disgust, Anger), but this isn’t the only approach. Let’s survey the landscape.
Where I could, I tried a demo of each service with this photo of tennis player Adam Davidson from Wikimedia Commons. It’s categorized as “tennis victory pose” so presumably he is joyful not angry! I picked this because it’s a good example of why emotion detection using facial expressions can fail sometimes. I used both the full version and the cropped version to see if the results differed.
This photo is an edge case: seeing his face alone, a human would probably say he’s angry, not joyful. That makes it interesting for checking how the tools behave.
Google’s Cloud Vision API detects joy, sorrow, anger, and surprise and estimates their presence with the qualitative values Very Unlikely, Unlikely, Possible, Likely, and Very Likely.
The Cloud Vision API estimated Anger as Very Likely and every other emotion as Very Unlikely. It gave the same result for the cropped photo.
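To make the ordinal scale concrete, here is a minimal sketch of how you might rank the four emotions from one face annotation. It assumes the field names used in the Vision REST response (joyLikelihood, sorrowLikelihood, angerLikelihood, surpriseLikelihood) and the label spellings VERY_UNLIKELY through VERY_LIKELY. The helper rank_emotions and the sample dict are my own illustration: the dict is hand-written to mirror the result described above, not captured API output.

```python
# Ordinal order of the likelihood labels, from lowest to highest.
LIKELIHOOD_ORDER = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]

# Emotion name -> field name in a face annotation (assumed REST-style names).
EMOTION_FIELDS = {
    "joy": "joyLikelihood",
    "sorrow": "sorrowLikelihood",
    "anger": "angerLikelihood",
    "surprise": "surpriseLikelihood",
}


def rank_emotions(face_annotation):
    """Return (emotion, label) pairs sorted from most to least likely.

    Hypothetical helper: labels outside the known scale (e.g. "UNKNOWN")
    are ranked below VERY_UNLIKELY.
    """
    def rank(label):
        return LIKELIHOOD_ORDER.index(label) if label in LIKELIHOOD_ORDER else -1

    pairs = [
        (emotion, face_annotation.get(field, "UNKNOWN"))
        for emotion, field in EMOTION_FIELDS.items()
    ]
    return sorted(pairs, key=lambda pair: rank(pair[1]), reverse=True)


# Hand-written sample mirroring the result reported for the tennis photo.
sample = {
    "joyLikelihood": "VERY_UNLIKELY",
    "sorrowLikelihood": "VERY_UNLIKELY",
    "angerLikelihood": "VERY_LIKELY",
    "surpriseLikelihood": "VERY_UNLIKELY",
}

ranked = rank_emotions(sample)
print(ranked[0])
```

Because the scale is ordinal, the ranking only says which emotion is judged more likely. It does not say how much more likely, which is one cost of not exposing numeric scores.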
I suspect Google designed their approach the way they did (with four emotions rather than Ekman’s six and with a qualitative, ordinal scale rather than numeric probabilities) because they had trouble trying to distinguish between all six.
Amazon’s Rekognition service rates the following options on a 0 to 100 scale:
These are close to Ekman’s but not the same. I didn’t find a free version of Amazon’s service, but I can probably create an AWS account and try it, which I will do at some point.
Microsoft Azure’s Emotion API detects anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise, saying “these emotions are understood to be cross-culturally and universally communicated with particular facial expressions.” The scores look to be probabilities, but the documentation doesn’t say. The documentation says that “the emotions contempt and disgust are experimental.” They apparently feel confident in their ability to distinguish anger, fear, happiness, sadness, and surprise. This is close to the list that Google classifies (joy, sorrow, anger, surprise). Contempt is not one of Ekman’s basic emotions but was included on an expanded list he created in the 1990s.
Anger was rated as the most likely alternative, with a score of .85. Surprise came in second, with a score of .13. Using the cropped photo the result was similar, anger .81 and surprise .17.
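A quick way to check whether cropping changed the verdict is to compare the two score sets directly. The sketch below is illustrative only: top_emotion and max_shift are hypothetical helpers, and the dicts contain just the two scores quoted above, not the full set of eight values the API returns.

```python
def top_emotion(scores):
    """Return the emotion name with the highest score."""
    return max(scores, key=scores.get)


def max_shift(a, b):
    """Largest absolute score change across emotions present in both results."""
    shared = a.keys() & b.keys()
    return max(abs(a[name] - b[name]) for name in shared)


# Only the scores quoted in the text; the remaining emotions are omitted.
full_photo = {"anger": 0.85, "surprise": 0.13}
cropped_photo = {"anger": 0.81, "surprise": 0.17}

same_verdict = top_emotion(full_photo) == top_emotion(cropped_photo)
shift = max_shift(full_photo, cropped_photo)
print(top_emotion(full_photo), same_verdict, round(shift, 2))
```

Here the top emotion is anger in both cases and no score moves by more than a few hundredths, so the surrounding context in the full photo barely affected the result.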
Affectiva’s emotion detection – face (Affdex?) measures seven emotion metrics (anger, contempt, disgust, fear, joy, sadness, and surprise) – Ekman’s basic six plus contempt. They also provide 20 facial expression metrics. Alternatively you can estimate valence (-100 to 100) and engagement (subject’s expressiveness in a range 0 to 100).
Affectiva explains how it maps expressions to emotions: the mapping is based on EMFACS, developed by Ekman with Friesen. A purer machine learning approach would be to have humans label photos and then train a model on those labels, rather than starting from a coding system like this one, whose assumptions have been repeatedly challenged.
One criticism of Ekman’s research is that he used posed expressions. Check out some of these posed expressions from the Affectiva website.
Kairos seems to have evolved to using Ekman’s basic emotions. In a blog post from 2015, they say that their internal research discovered that “not all of Ekman’s universal emotions provide consistently distinctive facial expressions.” At that point their emotional analysis API apparently detected negative valence, positive valence, surprise, and attentiveness.
They are also estimating valence (positive or negative emotionality).
Clarifai says “our core model identifies 11,000+ general concepts like objects, ideas, and emotions” but they do not say what emotions they detect. Sounds like their approach is not specific to emotions but detects emotion concepts in the same way that other intangible concepts are detected. So I imagine they have humans label photos with concepts and then they train machine learning models to detect those concepts too. This seems a better approach than trying to detect emotions via Ekman’s facial expression coding system. Why go through that when the computer might do a better job figuring it out directly?
When I used Clarifai’s demo on the full picture, it didn’t show any emotion concepts in its list of return values, but this may be because I’m not seeing the full set of what it detected.
Using the cropped face-only picture, Clarifai produced the following:
It identified that there was a facial expression in the image, and detected “furious” and “smile.” So apparently they’re doing more than just detecting anger. Clarifai ranks highly on emotional granularity even if they got the actual emotion wrong!
There are at least three ways to attack the issue of emotion detection using facial expression recognition:
- Use Ekman’s Facial Action Coding System with his basic emotions (Affectiva)
- Identify basic emotions directly via emotion-specific machine learning models (Google)
- Identify emotions just like you identify other intangible concepts in images using some general-purpose concept identification model (Clarifai – although they are detecting “facial expression” and faces so maybe there is some emotion-detection specific model in there).
It isn’t clear whether Microsoft, Amazon, and Kairos use a direct machine learning model or go via Ekman’s FACS, but I’d guess the former.
I bet Clarifai’s model has the best chance of getting photos like the one I analyzed right, because if they see enough tennis victory poses they will be able to detect that the emotion shown is actually joy not fury. This is closest to what humans do when detecting emotions. We use context along with facial expression to figure out what someone is thinking.