Responsibility means knowing the limitations of your dataset and algorithm

I woke up this morning to heavy rain, so no luck leaving the Dutch weather behind. The journey to the AI for Good Global Summit was quick: the Swiss trams ran every 3 minutes, and I sailed through registration. The conference has all the digital extras you'd imagine from a gathering of artificial intelligence engineers: a dedicated app for scheduling, registering, viewing live streams of talks, and connecting with presenters, plus instant AI-generated captions (they work surprisingly well, and even got Saskatchewan right!).

The day was packed with talks, but I want to share my two favorites. Both focused on Responsible AI, which covers subjects like bias, fairness, and interpretability.

In the first talk, Google's Timnit Gebru argued that we need more transparent descriptions of what particular models and datasets can and cannot do, especially when those models are used in high-stakes scenarios such as crime prediction. Gebru caused a stir last year when she published an article with Joy Buolamwini showing that gender classification algorithms deployed by Microsoft, IBM, and Face++ all perform significantly worse when classifying dark-skinned people and women. As a remedy, she discussed two solutions she has published articles about: datasheets for datasets, and model cards for model reporting. Datasheets are traditionally used in the electronics industry to describe how a chip or other component will perform. Gebru proposes a similar norm for public datasets, commercial APIs, and pretrained models: a document describing when, where, and how the training data was gathered, its recommended use cases, and, for human-centric datasets, the subjects' demographics and consent where applicable. Model cards are the counterpart to datasheets, describing the circumstances in which a model was trained and what it should be used for.
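
To make the datasheet idea a bit more concrete, here is a minimal sketch of what such a record might look like in code. The class name, field names, and example values are all hypothetical and chosen only to mirror the items listed above (when, where, and how the data was gathered, recommended uses, demographics, consent); they are not the schema from Gebru's papers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Datasheet:
    """Hypothetical datasheet record; field names are illustrative only."""
    name: str
    collected_when: str
    collected_where: str
    collection_method: str
    recommended_uses: List[str]
    discouraged_uses: List[str] = field(default_factory=list)
    # Only relevant for human-centric datasets; None means "not documented".
    subject_demographics: Optional[Dict[str, float]] = None
    consent_obtained: Optional[bool] = None

    def missing_human_fields(self) -> List[str]:
        """Return the human-centric fields that were left undocumented."""
        missing = []
        if self.subject_demographics is None:
            missing.append("subject_demographics")
        if self.consent_obtained is None:
            missing.append("consent_obtained")
        return missing


# Made-up example dataset, purely for illustration.
sheet = Datasheet(
    name="example-faces",
    collected_when="2018",
    collected_where="public web pages",
    collection_method="automated crawl",
    recommended_uses=["research on face detection"],
    discouraged_uses=["crime prediction"],
    subject_demographics={"lighter-skinned": 0.8, "darker-skinned": 0.2},
)
print(sheet.missing_human_fields())
```

A check like `missing_human_fields` is one way a team could refuse to ship a human-centric dataset until consent and demographics have been written down.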

The second great talk was by Klaus-Robert Müller on Spectral Relevance Analysis, a tool for understanding what is actually going on inside the black box of an algorithm. Machine learning and deep learning models are often hard to interpret. They yield impressive results by conventional performance metrics, but Müller argues that we need to dig deeper for a complete view. Recent inventions such as SHAP are very helpful in showing which factors play a role in a model's decision. Müller et al. take this a step further and provide a method to evaluate which models learn the right concept and which models simply cheat. They do this by constructing SHAP-like heatmaps of which pixels the model finds important for each image, then clustering those heatmaps. If this results in multiple clusters, it indicates that the model relies on different strategies to reach its predictions. And some of these strategies may not be so legit. For instance, one model learned a class by just looking for a watermark!

These talks were incredibly insightful. They are also deeply relevant to the work we do: many of our Big Data analysis clients work with huge amounts of human data. It's our duty to incorporate these Responsible AI themes into our work practices.
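
To give a feel for the heatmap-clustering idea from Müller's talk, here is a toy sketch using only the Python standard library. It is not the authors' method: the heatmaps are synthetic, and the spectral clustering step is swapped for a simpler stand-in (connected components of a thresholded cosine-similarity graph). All function names, the image size, and the threshold are assumptions made for illustration.

```python
import random
from math import sqrt


def make_heatmap(hot_pixels, size=8, noise=0.05, rng=random):
    """Flattened size*size toy relevance map with `hot_pixels` lit up."""
    heat = [abs(rng.gauss(0.0, noise)) for _ in range(size * size)]
    for row, col in hot_pixels:
        heat[row * size + col] += 1.0
    return heat


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_heatmaps(heatmaps, threshold=0.5):
    """Label each heatmap with the connected component it belongs to in a
    graph that links two heatmaps when their similarity >= threshold."""
    labels = [-1] * len(heatmaps)
    current = 0
    for start in range(len(heatmaps)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(heatmaps)):
                if labels[j] == -1 and cosine_similarity(
                    heatmaps[i], heatmaps[j]
                ) >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


rng = random.Random(0)
object_pixels = [(3, 3), (3, 4), (4, 3), (4, 4)]  # relevance on the object
watermark_pixels = [(7, 0), (7, 1)]               # relevance on a corner mark

# Six images "explained" by the object, four by the watermark.
heatmaps = [make_heatmap(object_pixels, rng=rng) for _ in range(6)]
heatmaps += [make_heatmap(watermark_pixels, rng=rng) for _ in range(4)]

labels = group_heatmaps(heatmaps)
print(labels, "-> number of strategies:", len(set(labels)))
```

In this toy setup the heatmaps fall into two groups, and the second group is the cue to go and look at what those images have in common, which in the talk's example turned out to be a watermark.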

I’m looking forward to sharing the details in the Responsible AI training I give at Xomnia. Tomorrow, I’ll be attending the Scaling AI For Good breakthrough session. With some ambitious goals, it’s sure to be interesting. More updates will follow!

Cheers,
Jorren