Mon Feb 21 2022

Machine learning with limited labels: How to get the most out of your domain expertise?

By Samantha Biegel

The success of a machine learning project often relies on the availability of good-quality labeled data. Machine learning models learn by seeing examples of the data, and both the number of examples and their quality determine how well the model learns. Putting effort into obtaining a large number of labeled examples can therefore greatly improve your machine learning model. In practice, however, this can be quite difficult.

For instance, have a look at the following satellite image:

Let’s assume that our goal is to identify individual tree species. The labeling task then requires a domain expert to identify every tree species in the image, each marked here with a different color.

This can be quite a tedious task, and similar situations arise in many other domains. This brings us to a common problem: how do we turn an unsupervised problem into a supervised one when labeling is difficult? The key is to use the available business knowledge as efficiently as possible.

How to use business knowledge when labeling data for supervised machine learning?

The most straightforward data labeling approach is to have your business experts hand-label individual data points. However, as we saw in the example above, this can take a significant amount of time. It might also be too expensive to label the whole dataset this way, and sometimes privacy constraints limit the possibilities for outsourcing the labeling task.

There are also alternatives to this more traditional, fully supervised learning. We’ll highlight two options:

  • Learning with fewer labels: For example, with an approach called active learning, you use a strategy to pick the most useful data points to label. Overall, this lets you learn from fewer labels.
  • Learning with imprecise labels: With weakly supervised learning (or weak supervision), you can create labels much faster than by labeling points individually. These techniques help you learn from noisy label sources.

What is weak supervision?

Weak supervision is a technique that allows you to quickly generate labels for a large number of data points at once. Instead of labeling individual points, you write a set of functions, which we call Labeling Functions. These can be applied to the unlabeled data, and their outputs are weak labels, i.e. less accurate or noisier labels. This programmatic approach to labeling makes it much easier and cheaper to obtain large labeled datasets.
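To make this concrete, here is a minimal sketch of what labeling functions can look like, written against the Snorkel library’s labeling API. The spam-detection task and the heuristics themselves are illustrative assumptions, not examples from this post:

```python
# Minimal sketch of labeling functions for a hypothetical spam-detection
# task, using the Snorkel labeling API. The heuristics are illustrative.
from snorkel.labeling import labeling_function

# By Snorkel convention, -1 means the function abstains on a data point.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_link(x):
    # Business knowledge: messages containing URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Business knowledge: very short messages are usually legitimate.
    return NOT_SPAM if len(x.text.split()) < 5 else ABSTAIN
```

Each function votes on some of the points and abstains on the rest; no single function needs to be accurate or complete on its own.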

Once a set of labeling functions has been defined, the next step is to learn how to combine their weak labels into a single label per data point. This is done by learning a Label Model: a generative model that estimates the true label (which we don’t know, as we’re not hand-labeling) by making use of the correlation structure between the labeling functions.

The outcome is a set of Probabilistic Labels: estimates of the conditional distribution of the true label, given the values of the weak labels.
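Continuing the sketch above: applying the labeling functions to a dataset and fitting Snorkel’s LabelModel yields exactly these probabilistic labels. The DataFrame `df_train` with a `text` column is an assumed input:

```python
# Applying the labeling functions and learning the generative label model
# (Snorkel API). `df_train` is an assumed pandas DataFrame with a `text`
# column, and the labeling functions come from the sketch above.
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

lfs = [lf_contains_link, lf_short_message]

# Build the (n_points x n_functions) matrix of weak labels; -1 = abstain.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# The label model estimates the accuracies and correlations of the labeling
# functions without ever seeing a ground-truth label.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=42)

# Probabilistic labels: the estimated distribution of the true label for
# each data point, given its weak labels.
probs_train = label_model.predict_proba(L=L_train)
```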

The labeling functions can be based on a variety of sources, which makes weak supervision a flexible method for combining all the information relevant to your problem into one label. A common source is a piece of business knowledge captured in a heuristic defined by the expert. Other sources include the outputs of models trained for similar problems (as in a transfer learning setting) or external knowledge bases, which can be used to create imprecise or partial labels.
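Labeling functions built on other models follow the same pattern as the hand-written rules above. As an illustration, the sketch below wraps a hypothetical `pretrained_model` from a related task and only votes when that model is confident:

```python
# A labeling function does not have to be a hand-written rule: it can wrap
# the output of another model. `pretrained_model` is a hypothetical
# scikit-learn-style classifier from a related task, used for illustration.
from snorkel.labeling import labeling_function

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_transfer_model(x):
    # Only cast a vote when the borrowed model is reasonably confident.
    p_spam = pretrained_model.predict_proba([x.text])[0][SPAM]
    if p_spam > 0.8:
        return SPAM
    if p_spam < 0.2:
        return NOT_SPAM
    return ABSTAIN
```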

To explain how the weak supervision pipeline works, we’ll use a simple artificial dataset where each point belongs to one of two classes:

The pipeline then looks like this:

A) We start out with an entirely unlabeled dataset. The first step is to let the domain expert define a set of labeling functions; in our simple example, these are functions that divide up the problem space.

B) Then, the resulting weak labels are combined by learning the (generative) label model. This model gives us probabilistic labels as output.

C) Finally, the probabilistic labels are used as supervision for a discriminative model. This can be your typical classification model, although the loss function may need to be adapted to deal with the confidence levels encoded in the probabilistic labels.
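One common way to adapt the loss is to use a cross-entropy against the soft, probabilistic targets instead of hard class indices. A minimal PyTorch sketch (the shapes and values are made up for illustration):

```python
# Cross-entropy against soft (probabilistic) targets instead of hard class
# indices: a minimal PyTorch sketch for training the discriminative model
# on the label model's output.
import torch
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy where the targets are probability vectors."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()

# Illustration: logits from any classifier, targets from the label model.
logits = torch.randn(4, 2)                   # batch of 4 points, 2 classes
target_probs = torch.tensor([[0.9, 0.1],     # confident weak label
                             [0.2, 0.8],
                             [0.6, 0.4],
                             [0.5, 0.5]])    # completely uncertain point
loss = soft_cross_entropy(logits, target_probs)
```

Points the label model is confident about behave almost like hard labels, while uncertain points pull the classifier less strongly toward either class.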

The idea of programmatic weak supervision was introduced by the Snorkel project (see the paper and library).

Examples of weak supervision applications

Since its introduction, weak supervision has been applied successfully in industry several times. In natural language processing, for example, it has been used to learn structure in dialogues by capturing knowledge in weak labels that is not available to the discriminative model.

In medical machine learning applications, such as automatic seizure detection, weak supervision has proven useful in a cross-modal setup: the label model is then defined over a different data modality than the discriminative model.

A third type of application is in the vision domain, where it can be more difficult to define high-level rules. Other sources, such as models used for transfer learning or object detection, can help extract more information to use for weak labels.

What is the difference between weakly supervised learning and semi-supervised learning?

Semi-supervised learning lies between unsupervised and supervised learning. The general idea is that you have a small labeled dataset to learn from, as well as a larger unlabeled one from which you want to extract additional information. This method is generally based on the idea that points that are close together in some representation space are more likely to belong to the same class. That sounds similar to weakly supervised learning: you make use of some structure in the data to group points together.

In weak supervision, however, we do not need individually labeled points: the labeling functions that separate the data points are created from domain knowledge instead. The method also doesn’t necessarily make use of the data density to assign the same weak labels to similar points.

What is active learning?

Active learning is an approach that helps you find the most informative points in your dataset in an iterative manner. It makes use of a query strategy: a function that computes, for each point in a pool of unlabeled data points, the potential value of labeling it.

This strategy is often based on the model’s uncertainty. For points that are close to the decision boundary of a classification model, the model is less certain about the class. Labeling these points gives you more information than labeling points whose class you are already quite certain about.
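A minimal sketch of such an uncertainty-based query strategy for a binary classifier (any fitted model with a scikit-learn-style `predict_proba` will do; the names are assumptions):

```python
# Uncertainty sampling for binary classification: query the unlabeled point
# whose predicted probability is closest to 0.5, i.e. closest to the
# decision boundary. `model` is any fitted classifier with predict_proba.
import numpy as np

def uncertainty_query(model, X_pool):
    """Return the index of the most uncertain point in the unlabeled pool."""
    p_class1 = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p_class1 - 0.5)   # largest near the boundary
    return int(np.argmax(uncertainty))
```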

Overall, this approach helps limit the total number of data points that need to be labeled. However, active learning might still require many labeled points to reach sufficient performance, so on its own it might not be a suitable way to build a good dataset for your problem.

Challenge: You need good labeling functions to apply weak supervision to your problem

The success of label creation with weak supervision relies on the ability of domain experts to define good heuristics. Since the starting point is usually an unsupervised problem, it can be difficult to assess the accuracy and quality of the weak labels being created.

As a result, the probabilistic labels produced by the weak supervision process may be inaccurate, which limits the performance of the final classifier. To see how, let’s revisit the final step of the weak supervision pipeline, on the left in the picture below: because of the suboptimal labels from the generative model, the classification boundary is slightly rotated compared to the optimal one on the right.

Active WeaSuL: Overcoming the potential challenges of applying weak supervision with active learning

To overcome the challenges of applying weak supervision, it can help to add another type of domain knowledge: a few labeled data points that push the label model toward labels that better represent the data distribution. This idea is implemented in Active WeaSuL, which incorporates active learning into the weak supervision framework. The labels acquired through active learning provide feedback on how well the probabilistic labels from weak supervision perform.

The method works as a loop in which you iteratively pick a data point and use it in the label model, as shown in the figure below. Starting from the initial probabilistic labels, you first choose the point for which it would be most useful to know the true label, using a sampling strategy (1. MaxKL sampling). The domain expert then provides the label, and this new label is used to adjust the generative model (2. Modified loss function).
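Schematically, the loop looks as follows. This is a control-flow sketch only: `fit_penalized_label_model`, `maxkl_query`, and `ask_expert` are hypothetical stand-ins for the components described in this post, passed in by the caller:

```python
# Schematic sketch of the Active WeaSuL loop. The three callables are
# hypothetical stand-ins for the method's components, not a real API.
def active_weasul_loop(L_train, fit_penalized_label_model, maxkl_query,
                       ask_expert, num_iterations=30):
    expert_labels = {}  # data point index -> label from the domain expert

    # Start from plain weak supervision: no expert feedback yet.
    probs = fit_penalized_label_model(L_train, expert_labels)
    for _ in range(num_iterations):
        # 1. MaxKL sampling: pick the point whose true label is most useful.
        idx = maxkl_query(L_train, probs, expert_labels)
        # The domain expert labels this single point.
        expert_labels[idx] = ask_expert(idx)
        # 2. Modified loss: refit the generative model with a penalty for
        #    disagreeing with the expert-labeled points.
        probs = fit_penalized_label_model(L_train, expert_labels)
    return probs, expert_labels
```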

Within several iterations, the generative model adjusts, leading to a better classification boundary. See the following image:

To adjust the generative model in step 2, the weak supervision loss function is modified. The new objective looks like this:

The first part comes from the objective of weak supervision itself, which is defined as a matrix completion problem. You can read more about this by clicking here. The second part adds a penalty to the objective for any points that turn out to be mislabeled when compared with the label obtained through active learning. This is implemented as the quadratic difference between the probabilistic label (a function of the label model parameters) and the expert label, as can be seen below:
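In symbols, a sketch of this combined objective (the notation here is assumed, following the description above) could be written as:

$$
\min_{\theta} \; \underbrace{L_{\mathrm{WS}}(\theta)}_{\text{matrix completion objective}} \;+\; \lambda \sum_{i \in \mathcal{A}} \underbrace{\bigl(\hat{p}_{\theta}(x_i) - y_i\bigr)^2}_{\text{penalty on expert-labeled points}}
$$

where $\hat{p}_{\theta}(x_i)$ is the probabilistic label for point $x_i$ under label model parameters $\theta$, $y_i$ is the expert label obtained through active learning, $\mathcal{A}$ is the set of actively labeled points, and $\lambda$ weighs the penalty.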

This modification allows us to improve the probabilistic labels, but it does not tell us which points to label. For this, the MaxKL query strategy was proposed. To see what this strategy does, first note that the data points are divided into buckets, based on the configuration of weak labels they are assigned (see part A in the image below). Within each bucket, the probabilistic labels give one estimate of the true class distribution; the labels sampled through active learning give another.

To decide which bucket to sample from, the Kullback-Leibler divergence between these two estimates is used (see part B below). If this value is relatively large for a particular bucket, it is worth sampling more points from it: doing so provides more feedback to adjust the generative model and improves the estimate of the class distribution.
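A small numpy sketch of this bucket comparison (the variable names, the direction of the KL divergence, and the smoothing constant are assumptions):

```python
# Sketch of the MaxKL idea: per bucket of identical weak-label
# configurations, compare the label model's class distribution with the
# empirical distribution of expert labels sampled from that bucket, and
# pick the bucket where the divergence is largest.
import numpy as np

def maxkl_bucket(L, probs, expert_labels, eps=1e-6):
    """L: (n, m) weak label matrix; probs: (n,) P(y=1) from the label model;
    expert_labels: dict {point index: 0 or 1} from active learning."""
    buckets = {}
    for i, config in enumerate(map(tuple, L)):
        buckets.setdefault(config, []).append(i)

    best_config, best_kl = None, -np.inf
    for config, idxs in buckets.items():
        sampled = [expert_labels[i] for i in idxs if i in expert_labels]
        if not sampled:
            continue  # no expert feedback for this bucket yet
        # Estimate 1: the label model's class distribution in this bucket.
        p1 = probs[idxs].mean()
        p = np.array([1.0 - p1, p1])
        # Estimate 2: the empirical distribution of sampled expert labels.
        q1 = float(np.mean(sampled))
        q = np.array([1.0 - q1, q1])
        # KL divergence between the two estimates (smoothed for zeros).
        kl = float(np.sum(q * np.log((q + eps) / (p + eps))))
        if kl > best_kl:
            best_config, best_kl = config, kl
    return best_config, best_kl
```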

Now that we’ve seen how the entire Active WeaSuL cycle works, let’s look at some results, starting with the simple artificial dataset. The accuracy of Active WeaSuL is given by the dark blue lines in the figures below. It outperforms baselines such as weak supervision by itself and active learning by itself, and it also improves much faster than a competing method (in light blue).

Similar insights were found on real-world problems such as visual relation detection and spam detection. The first task is about identifying the relationship between two objects in an image: for instance, in the image on the left below, the task is to label whether the relationship between ‘giraffe’ and ‘grass’ is ‘sitting on’ or not. The results on this task show that with a somewhat larger labeling budget, it might be better to use active learning by itself, as it can eventually reach better performance.

The results on these different problems show that, overall, the combination of active learning and weak supervision helps you get better performance with a small number of labeled points. That makes this approach ideal for situations where you have a very small labeling budget.