Deep Metric Learning: a (Long) Survey
11 Dec 2020
Note: This post is still a Draft! If you’re seeing this, then there’s probably a bug in my site (since I’ve been changing a lot of stuff lately).
One of the most amazing aspects of the human visual system is the ability to recognize similar objects and scenes. We don’t need hundreds of photos of the same face to be able to differentiate it among thousands of other faces that we’ve seen. We don’t need thousands of images of the Eiffel Tower to recognize that unique architectural landmark when we visit Paris. Is it possible to design a Deep Neural Network with a similar ability to tell which objects are visually similar and which ones are not? That’s essentially what Deep Metric Learning attempts to solve.
Although this blog post is mainly about Supervised Deep Metric Learning and is self-contained, it would be beneficial to get familiar with traditional Metric Learning methods (i.e. without Neural Networks) to develop a broader understanding of this topic. I highly recommend the introductory guides on Metric Learning as a starter. If you want to get into the formal mathematical side of things, I recommend the tutorial by Diaz et al. (2020). More advanced Metric Learning methods include the popular t-SNE (van der Maaten & Hinton, 2008) and the new shiny UMAP (McInnes et al., 2018) that everybody uses nowadays for data clustering and visualization.
This article is organized as follows. In the “Direct Approaches” section, I will quickly glance through the methods that were commonly used for Deep Metric Learning before the rise of angular margin methods in 2017. Then, in “Moving Away from Direct Approaches”, I will describe the transition to the current angular margin SOTA models and the reasons why we ditched the direct approaches. Then, in the “State-of-the-Art Approaches” section, I will describe in more detail the advances in Metric Learning in recent years.
The most useful section for both beginners and more experienced readers will be the “Getting Practical” section, in which I will do a case study of how Deep Metric Learning is used to achieve State-of-the-Art results in various practical problems (mostly from Kaggle and large-scale benchmarks), as well as the tricks that were used to make things work.
 Problem Setting of Supervised Metric Learning
 Direct Approaches
 Moving Away from Direct Approaches
 State-of-the-Art Approaches
 Getting Practical
 Conclusion
 References
Problem Setting of Supervised Metric Learning
Generally speaking, Deep Metric Learning is a group of techniques that aims to measure the similarity between data samples. More specifically, for a set of data points \(\mathcal{X}\) and their corresponding labels \(\mathcal{Y}\), the goal is to train an embedding neural model (also referred to as feature extractor) \(f_{\theta}(\cdot)\, \colon \mathcal{X} \to \mathbb{R}^n\) (where \(\theta\) are learned weights) together with a distance \(\mathcal{D}\, \colon \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}\) (which is usually fixed beforehand), so that the combination \(\mathcal{D}\left(f_{\theta}(x_1), f_{\theta}(x_2)\right)\) produces small values if the labels \(y_1, y_2 \in \mathcal{Y}\) of the samples \(x_1, x_2 \in \mathcal{X}\) are equal, and larger values if they aren’t.
Thus, the Deep Metric Learning problem boils down to just choosing the architecture for \(f_{\theta}\) and choosing the loss function \(\mathcal{L}(\theta)\) to train it with. One might wonder why we cannot just use the classification objective for the metric learning problem. In fact, the Softmax loss is also a valid objective for metric learning, albeit inferior to other objectives, as we will see later in this article.
Direct Approaches
I will glance through the most common approaches in this section very quickly, without getting too much into details, for two reasons:
 The methods described here are already covered in other tutorials, videos, and blog posts online in great detail. I highly recommend the great survey by Kaya & Bilge (2019).
 The methods that I will describe in the next section outperform these approaches in most cases, so I have no motivation to delve too deep into the details in this section.
The distance function for these approaches is usually fixed as the \(l_2\) metric:
\[\begin{equation*} \mathcal{D}\left(p, q\right) = \| p - q \|_2 = \left(\sum_{i=1}^n \left(p_i - q_i\right)^2\right)^{1/2} \end{equation*}\]For the ease of notation, let’s denote \(\mathcal{D}_{f_\theta}(x_1, x_2)\) as a shortcut for \(\mathcal{D} \left( f_\theta(x_1), f_\theta(x_2) \right)\), where \(x_1, x_2 \in \mathcal{X}\) are samples from the dataset. Also, for some condition \(A\), let’s denote \(\unicode{x1D7D9}_A\) as the indicator function that is equal to \(1\) if \(A\) is true, and \(0\) otherwise.
Contrastive Loss
This is a classic loss function for metric learning. Contrastive Loss is one of the simplest and most intuitive training objectives. Let \(x_1, x_2\) be some samples in the dataset, and \(y_1, y_2\) be their corresponding labels. The loss function is then defined as follows:
\[\begin{equation*} \mathcal{L}_\text{contrast} = \unicode{x1D7D9}_{y_1 = y_2} \mathcal{D}^2_{f_\theta}(x_1, x_2) + \unicode{x1D7D9}_{y_1 \ne y_2} \max\left(0, \alpha - \mathcal{D}^2_{f_\theta}(x_1, x_2)\right) \end{equation*}\]where \(\alpha\) is the margin. We need a margin value because otherwise, our network \(f_\theta\) will learn to “cheat” by mapping all of \(\mathcal{X}\) to the same point, making the distance between any two samples equal to zero. Here and here are great in-depth explanations of this loss function.
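To make the formula above a bit more tangible, here is a minimal NumPy sketch of it. It works on already-computed embeddings rather than on a real network, and the function name `contrastive_loss`, the batch averaging, and the toy values are my own assumptions for illustration only.

```python
import numpy as np

def contrastive_loss(z1, z2, same_label, alpha=1.0):
    """Contrastive loss for a batch of embedding pairs (illustrative sketch).

    z1, z2:     arrays of shape (batch, n) holding f_theta(x_1) and f_theta(x_2).
    same_label: boolean array of shape (batch,), True where y_1 == y_2.
    alpha:      margin applied to the squared distance of negative pairs.
    """
    sq_dist = np.sum((z1 - z2) ** 2, axis=1)      # squared l2 distance per pair
    pos_term = sq_dist                            # positives: pull together
    neg_term = np.maximum(0.0, alpha - sq_dist)   # negatives: push past the margin
    # Averaging over the batch is an assumption; the formula is stated per pair.
    return np.where(same_label, pos_term, neg_term).mean()

# Toy batch: one positive pair and two negative pairs (made-up numbers).
z1 = np.zeros((3, 2))
z2 = np.array([[0.6, 0.8], [0.6, 0.8], [3.0, 4.0]])
same = np.array([True, False, False])
loss = contrastive_loss(z1, z2, same, alpha=2.0)
```

Note how the third pair contributes nothing: it is a negative pair that already sits further apart than the margin.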
Triplet Loss
Triplet Loss (Schroff et al. 2015) is by far the most popular and widely used loss function for metric learning. It is also featured in Andrew Ng’s deep learning course.
Let \(x_a, x_p, x_n\) be some samples from the dataset and \(y_a, y_p, y_n\) be their corresponding labels, so that \(y_a = y_p\) and \(y_a \ne y_n\). Usually, \(x_a\) is called the anchor sample, \(x_p\) is called the positive sample because it has the same label as \(x_a\), and \(x_n\) is called the negative sample because it has a different label. The Triplet Loss is defined as:
\[\begin{equation*} \mathcal{L}_\text{triplet} = \max\left(0, \mathcal{D}^2_{f_\theta}(x_a, x_p) - \mathcal{D}^2_{f_\theta}(x_a, x_n) + \alpha\right) \end{equation*}\]where \(\alpha\) is the margin that discourages our network \(f_\theta\) from mapping the whole dataset \(\mathcal{X}\) to the same point. The key ingredient to make Triplet Loss work in practice is Negative Samples Mining: on each training step, we sample triplets \(x_a, x_p, x_n\) that satisfy \(\mathcal{D}_{f_\theta}(x_a, x_n) < \mathcal{D}_{f_\theta}(x_a, x_p) + \alpha\), i.e. the samples that our network \(f_\theta\) fails to discriminate or is not able to discriminate with high confidence. You can find an in-depth description and analysis of Triplet Loss in this awesome blog post.
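Here is a small NumPy sketch of the Triplet Loss together with the mining condition described above. It is only an illustration on precomputed embeddings: the names `triplet_loss` and `violates_margin` and the toy numbers are assumptions of mine, and real pipelines usually mine triplets inside each batch in a more elaborate way.

```python
import numpy as np

def triplet_loss(za, zp, zn, alpha=0.2):
    """Per-triplet loss: max(0, D^2(a, p) - D^2(a, n) + alpha)."""
    d_ap = np.sum((za - zp) ** 2, axis=1)   # squared distance anchor-positive
    d_an = np.sum((za - zn) ** 2, axis=1)   # squared distance anchor-negative
    return np.maximum(0.0, d_ap - d_an + alpha)

def violates_margin(za, zp, zn, alpha=0.2):
    """Mining condition from the text: D(a, n) < D(a, p) + alpha.

    Note that this condition uses plain (non-squared) distances, as written above.
    """
    d_ap = np.linalg.norm(za - zp, axis=1)
    d_an = np.linalg.norm(za - zn, axis=1)
    return d_an < d_ap + alpha

# Two toy triplets: an easy one (negative far away) and a hard one.
za = np.zeros((2, 2))
zp = np.array([[1.0, 0.0], [1.0, 0.0]])
zn = np.array([[3.0, 0.0], [0.5, 0.0]])

losses = triplet_loss(za, zp, zn)          # the easy triplet contributes zero loss
hard_mask = violates_margin(za, zp, zn)    # only the hard triplet is worth sampling
```

The easy triplet gets zero loss and therefore zero gradient, which is exactly why mining matters.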
Triplet Loss is still being widely used despite being inferior to the recent advances in Metric Learning (which we will learn about in the next section) due to its relative effectiveness, simplicity, and the wide availability of code samples online for all deep learning frameworks.
Improving the Triplet Loss
Despite its popularity, Triplet Loss has a lot of limitations. Over the past years, there have been a lot of efforts to improve the Triplet Loss objective, building on the same idea of sampling a bunch of data points, then pulling together similar samples and pushing away dissimilar ones in \(l_2\) metric space.
Quadruplet Loss (Chen et al. 2017) is an attempt to make inter-class variation of the features \(f_\theta(x)\) larger and intra-class variation smaller, contrary to the Triplet Loss that doesn’t care about class variation of the features. For samples \(x_a, x_p, x_n, x_s\) and their corresponding labels \(y_a = y_p = y_s\), \(y_a \ne y_n\), the Quadruplet Loss is defined as:
\[\begin{eqnarray*} \mathcal{L}_\text{quadruplet} = & \max\left(0, \mathcal{D}^2_{f_\theta}(x_a, x_p) - \mathcal{D}^2_{f_\theta}(x_a, x_s) + \alpha_1\right) \\ + & \max\left(0, \mathcal{D}^2_{f_\theta}(x_a, x_s) - \mathcal{D}^2_{f_\theta}(x_a, x_n) + \alpha_2\right) \end{eqnarray*}\]Structured Loss (Song et al. 2016) was proposed to improve the sample effectiveness of Triplet Loss and make full use of the samples in each batch of training data. Here, I will describe the generalized version of it by Hermans et al. (2017).
Let \(\mathcal{B} = (x_1, \ldots, x_b)\) be one batch of data, \(\mathcal{P}\) be the set of all positive pairs in the batch (\((i, j) \in \mathcal{P}\) if the corresponding labels satisfy \(y_i = y_j\)) and \(\mathcal{N}\) be the set of all negative pairs (\((i, j) \in \mathcal{N}\) if the corresponding labels satisfy \(y_i \ne y_j\)). The Structured Loss is then defined as:
\[\begin{eqnarray*} \widehat{\mathcal{J}}_{i,j} =&& \max\left( \max_{(i,k) \in \mathcal{N}} \left\{\alpha - \mathcal{D}_{f_\theta}(x_i, x_k)\right\}, \max_{(l,j) \in \mathcal{N}} \left\{\alpha - \mathcal{D}_{f_\theta}(x_l, x_j)\right\} \right) + \mathcal{D}_{f_\theta}(x_i, x_j) \\ \widehat{\mathcal{L}}_\text{structured} =&& \frac{1}{2|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \max\left( 0, \widehat{\mathcal{J}}_{i,j} \right)^2 \end{eqnarray*}\]Intuitively, the formula above means that for each pair of positive samples, we compute the distance to the closest negative sample to that pair, and we try to maximize it for every positive pair in the batch. To make it differentiable, the authors proposed to optimize an upper bound instead:
\[\begin{eqnarray*} \mathcal{J}_{i,j} =&& \log\left( \sum_{(i,k) \in \mathcal{N}} \exp\left\{\alpha - \mathcal{D}_{f_\theta}(x_i, x_k)\right\} + \sum_{(l,j) \in \mathcal{N}} \exp\left\{\alpha - \mathcal{D}_{f_\theta}(x_l, x_j)\right\} \right) + \mathcal{D}_{f_\theta}(x_i, x_j) \\ \mathcal{L}_\text{structured} =&& \frac{1}{2|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \max\left( 0, \mathcal{J}_{i,j} \right)^2 \end{eqnarray*}\]The N-Pair Loss (Sohn, 2016) paper discusses in great detail one of the main limitations of the Triplet Loss, while proposing an idea similar to Structured Loss of using positive and negative pairs:
During one update, the triplet loss only compares an example with one negative example while ignoring negative examples from the rest of the classes. As consequence, the embedding vector for an example is only guaranteed to be far from the selected negative class but not necessarily the others. Thus we can end up only differentiating an example from a limited selection of negative classes yet still maintain a small distance from many other classes.
In practice, the hope is that, after looping over sufficiently many randomly sampled triplets, the final distance metric can be balanced correctly; but individual update can still be unstable and the convergence would be slow. Specifically, towards the end of training, most randomly selected negative examples can no longer yield nonzero triplet loss error.
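To contrast with the single-negative behaviour described in the quote, here is a rough NumPy sketch of the smooth Structured Loss defined earlier, which uses every negative pair in the batch at once. The function name `structured_loss`, the choice to count each unordered positive pair once, and the toy batches are my assumptions. The sketch also assumes the batch contains at least two classes, so that every sample has at least one negative.

```python
import numpy as np

def structured_loss(z, y, alpha=1.0):
    """Smooth Structured Loss over one batch (illustrative sketch).

    z: (b, n) array of embeddings f_theta(x_i); y: (b,) array of integer labels.
    """
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))            # (b, b) pairwise l2 distances
    neg = y[:, None] != y[None, :]                         # mask of negative pairs
    # For each sample i: sum over its negatives k of exp(alpha - D(x_i, x_k)).
    neg_exp = np.where(neg, np.exp(alpha - dist), 0.0).sum(axis=1)

    total, num_pos = 0.0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):                     # each positive pair once
            if y[i] == y[j]:
                j_ij = np.log(neg_exp[i] + neg_exp[j]) + dist[i, j]
                total += max(0.0, j_ij) ** 2
                num_pos += 1
    return total / (2 * num_pos) if num_pos else 0.0

labels = np.array([0, 0, 1, 1])
far_apart = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
overlapping = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 0.0], [2.5, 0.0]])

loss_far = structured_loss(far_apart, labels)      # classes well separated
loss_near = structured_loss(overlapping, labels)   # classes too close to each other
```

On the well-separated batch every \(\mathcal{J}_{i,j}\) is negative, so the loss vanishes; on the overlapping batch all negatives of a positive pair push on it simultaneously.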
Other attempts to design a better metric learning objective based on the core idea of the Triplet Loss objective include Magnet Loss (Rippel et al. 2015) and Clustering Loss (Song et al. 2017). Both objectives are defined on the dataset distribution as a whole, not only on single elements. However, they didn’t receive much traction due to the scaling difficulties, and simply because of their complexity. There have been some attempts to compare these approaches, notably by Horiguchi et al. (2017), but they performed experiments on very small datasets and were unable to achieve meaningful results.
Moving Away from Direct Approaches
After countless research papers attempting to solve the problems and limitations of Triplet Loss, it became clear that learning to directly minimize/maximize the Euclidean (\(l_2\)) distance between samples with the same/different labels may not be the way to go. There are two main issues with such approaches:

Expansion Issue: it is very hard to ensure that samples with the same label will be pulled together to a common region in space, as noted by Sohn (2016) (mentioned in the previous section). Quadruplet Loss only improves the variability, and Structured Loss can only enforce the structure locally for the samples in the batch, not globally. Attempts to solve this problem directly with a global objective (Magnet Loss, Rippel et al. 2015 and Clustering Loss, Song et al. 2017) did not gain much traction due to scalability issues.

Sampling Issue: all of the Deep Metric Learning approaches that try to directly minimize/maximize the \(l_2\) distance between samples rely heavily on sophisticated sample mining techniques that choose the “most useful” samples for learning in each training batch. This is inconvenient enough in the local setting (think about GPU utilization), and can become quite problematic in a distributed training setting (e.g. when you train on 10s of cloud TPUs and pull the samples from a remote GCS bucket).
Center Loss
Center Loss (Wen et al. 2016) is one of the first successful attempts to solve both of the above-mentioned issues. Before getting into the details of it, let’s talk about the Softmax Loss.
Let \(z = f_\theta(x)\) be the feature vector of the sample \(x\) after propagating through the neural network \(f_\theta\). In the classification setting of \(m\) classes, on top of the backbone neural network \(f_\theta\) we usually have a linear classification layer \(\hat{y} = W^\intercal z + b\), where \(W \in \mathbb{R}^{n \times m}\) and \(b \in \mathbb{R}^m\). The Softmax Loss (that we’re all familiar with and know by heart) for a batch of \(N\) samples is then presented as follows:
\[\begin{equation*} \mathcal{L}_\text{softmax} = - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{W^\intercal_{y_i} z_i + b_{y_i}\right\} }{ \sum_{j=1}^{m} \exp\left\{W^\intercal_{j} z_i + b_{j}\right\} }} \end{equation*}\]Let’s have a look at the training dynamics of the Softmax objective and how the resulting feature vectors are distributed relative to each other:
As illustrated above, the Softmax objective is not discriminative enough: there’s still a significant intra-class variation even on such a simple dataset as MNIST. So, the idea of Center Loss is to add a new regularization term to the Softmax Loss to pull the features to the corresponding class centers:
\[\begin{equation*} \mathcal{L}_\text{center} = \mathcal{L}_\text{softmax} + \frac{\lambda}{2} \sum_{i=1}^N \| z_i - c_{y_i} \|_2^2 \end{equation*}\]where \(c_j\) is also updated using gradient descent with \(\mathcal{L}_\text{center}\) and can be thought of as a moving mean vector of the set of feature vectors of class \(j\). If we now visualize the training dynamics and the resulting distribution of feature vectors of Center Loss on MNIST, we will see that it is much more discriminative compared to Softmax Loss.
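As a quick sanity check of the two formulas above, here is a minimal NumPy sketch of the Softmax Loss and the Center Loss evaluated on fixed features. The function names, the value of `lam`, and the toy inputs are assumptions for illustration; in real training the centers \(c_j\) would be learned jointly with the network instead of being passed in as constants.

```python
import numpy as np

def softmax_loss(z, y, W, b):
    """Mean cross-entropy of the linear classifier W^T z + b.

    z: (N, n) features, y: (N,) integer labels, W: (n, m) weights, b: (m,) biases.
    """
    logits = z @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

def center_loss(z, y, W, b, centers, lam=0.1):
    """Softmax loss plus (lam / 2) * sum_i ||z_i - c_{y_i}||^2."""
    pull = np.sum((z - centers[y]) ** 2)                   # distance to own class center
    return softmax_loss(z, y, W, b) + 0.5 * lam * pull

# Toy setup: two classes in a 2-D feature space (made-up numbers).
z = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W, b = np.eye(2), np.zeros(2)

base = softmax_loss(z, y, W, b)
on_centers = center_loss(z, y, W, b, centers=z.copy())          # features sit on their centers
off_centers = center_loss(z, y, W, b, centers=np.zeros((2, 2)))  # centers collapsed to the origin
```

When every feature already sits on its class center, the extra term disappears and the two losses coincide.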
The Center Loss solves the Expansion Issue by providing the class centers \(c_j\), thus forcing the samples to cluster together to the corresponding class center; it also solves the Sampling issue because we don’t need to perform hard sample mining anymore. Despite having its own problems and limitations (which I will describe in the next subsection), Center Loss is still a pioneering work that helped to steer the direction of Deep Metric Learning to its current form.
SphereFace
The obvious problem of the formulation of Center Loss is, ironically, the choice of centers. First, there’s still no guarantee that you will have a large inter-class variability, since the clusters closer to zero will benefit less from the regularization term. To make it “fair” for each class, why don’t we just force the class centers to be at the same distance from the origin? Let’s map them to a hypersphere!
That’s basically the main idea behind SphereFace (Liu et al. 2017). The setting of SphereFace is very simple. We start from the Softmax loss with the following modifications:
 Fix the bias vector \(b = 0\) to make the future analysis easier (the whole heavy lifting is performed by our neural network anyway).
 Normalize the weights so that \(\smash{\| W_j \| = 1}\). This way, when we rewrite the product \(\smash{W_j^\intercal z}\) as \(\smash{\| W_j \| \| z \| \cos\theta_j}\), where \(\smash{\theta_j}\) is the angle between the feature vector \(z\) and the weight vector \(\smash{W_j}\), it becomes just \(\smash{\| z \| \cos\theta_j}\). So, the final classification output for class \(j\) can be thought of as projecting the feature vector \(z\) onto the vector \(\smash{W_j}\), which in this case, geometrically, is the class center.
Let’s denote \(\smash{\theta_{j,i}}\) (\(\smash{0 \le \theta_{j,i} \le \pi}\)) as the angle between the feature vector \(z_i\) and class center vector \(\smash{W_j}\). The Modified Softmax objective is thus:
\[\begin{eqnarray*} \mathcal{L}_\text{mod. softmax} =&& - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{W^\intercal_{y_i} z_i + b_{y_i}\right\} }{ \sum_{j=1}^{m} \exp\left\{W^\intercal_{j} z_i + b_{j}\right\} }} \\ =&& - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{\| W_{y_i} \| \|z_i\| \cos (\theta_{y_i, i}) + b_{y_i}\right\} }{ \sum_{j=1}^{m} \exp\left\{\| W_j \| \|z_i\| \cos (\theta_{j,i}) + b_{j}\right\} }} \\ =&& - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{\|z_i\| \cos (\theta_{y_i, i}) \right\} }{ \sum_{j=1}^{m} \exp\left\{\|z_i\| \cos (\theta_{j,i})\right\} }} \end{eqnarray*}\]Geometrically, it means that we assign the sample to class \(j\) if the projection of the feature vector \(z\) onto the class center vector \(\smash{W_j}\) is the largest, i.e. if the angle between \(\smash{W_j}\) and \(z\) is the smallest among all class center vectors.
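The geometric claim above is easy to check numerically. The following NumPy sketch (the names `modified_softmax_logits` and `angles`, and the toy weights, are assumptions of mine) normalizes the weight vectors, drops the bias, and confirms that the largest logit belongs to the class center with the smallest angle.

```python
import numpy as np

def modified_softmax_logits(z, W):
    """Logits with b = 0 and unit-norm weight vectors: ||z|| * cos(theta_j)."""
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)   # ||W_j|| = 1 for every class j
    return z @ W_unit

def angles(z, W):
    """Angle theta_j (in radians) between each feature vector and each W_j."""
    cos = (z @ W) / (
        np.linalg.norm(z, axis=1, keepdims=True) * np.linalg.norm(W, axis=0, keepdims=True)
    )
    return np.arccos(np.clip(cos, -1.0, 1.0))

# Two class centers with different norms (columns of W) and one feature vector.
W = np.array([[2.0, 0.0],
              [0.0, 3.0]])
z = np.array([[3.0, 1.0]])

logits = modified_softmax_logits(z, W)
theta = angles(z, W)
predicted = int(np.argmax(logits, axis=1)[0])   # class with the largest projection
closest = int(np.argmin(theta, axis=1)[0])      # class with the smallest angle
```

Without the normalization, the second class would get an unfair boost simply because its weight vector is longer.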
It is important to always keep in mind the decision boundary. At which point will you consider a sample as belonging to a certain class?
For Modified Softmax, the decision boundary between classes \(i\) and \(j\) is actually the bisector between the two class center vectors \(\smash{W_i}\) and \(\smash{W_j}\). Having such a thin decision boundary will not make our features discriminative enough: the inter-class variation is too small. Hence the second part of SphereFace: introducing the margins.
The idea is, instead of requiring \(\smash{\cos(\theta_i) > \cos(\theta_j)}\) for all \(\smash{j = 1, \ldots, m\, (j \ne i)}\) to classify a sample as belonging to the \(i\)-th class as in Modified Softmax, we additionally enforce a margin \(\mu\), so that a sample will only be classified as belonging to the \(i\)-th class if \(\smash{\cos(\mu \theta_i) > \cos(\theta_j)}\) for all \(\smash{j = 1, \ldots, m\, (j \ne i)}\), with the requirement that \(\smash{\theta_i \in [0, \frac{\pi}{\mu}]}\). The SphereFace objective can then be expressed as:
\[\begin{equation*} \mathcal{L}_\text{SphereFace} = - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{\|z_i\| \cos (\mu \theta_{y_i, i}) \right\} }{ \exp\left\{\|z_i\| \cos (\mu \theta_{y_i,i})\right\} + \sum_{j \ne y_i} \exp\left\{\|z_i\| \cos (\theta_{j,i})\right\} }} \end{equation*}\]The restriction \(\smash{\theta_i \in [0, \frac{\pi}{\mu}]}\) is really annoying. We can get rid of it by replacing \(\smash{\cos(\mu \theta)}\) with a monotonically decreasing angle function \(\smash{\psi(\theta)}\), which we define as \(\smash{\psi(\theta) = (-1)^k \cos(\mu \theta) - 2k}\) for \(\smash{\theta \in [k\pi/\mu, (k+1)\pi/\mu]}\) and integer \(k \in [0, \mu - 1]\). Thus the final form of SphereFace is:
\[\begin{equation*} \mathcal{L}_\text{SphereFace} = - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{\|z_i\| \psi (\theta_{y_i, i}) \right\} }{ \exp\left\{\|z_i\| \psi (\theta_{y_i,i})\right\} + \sum_{j \ne y_i} \exp\left\{\|z_i\| \cos (\theta_{j,i})\right\} }} \end{equation*}\]The differences between Softmax, Modified Softmax, and SphereFace are schematically shown below.
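Separately from the schematic, the piecewise angle function \(\psi(\theta) = (-1)^k \cos(\mu \theta) - 2k\) is easier to trust once you see it evaluated. Below is a small NumPy sketch of it; the function name `psi`, the default \(\mu = 4\), and the evaluation grid are my assumptions for illustration.

```python
import numpy as np

def psi(theta, mu=4):
    """Piecewise angle function: (-1)^k * cos(mu * theta) - 2k.

    Here k is the index of the segment [k*pi/mu, (k+1)*pi/mu] that contains
    theta, for theta in [0, pi] and k in {0, ..., mu - 1}.
    """
    theta = np.asarray(theta, dtype=float)
    k = np.minimum(np.floor(theta * mu / np.pi), mu - 1)   # keep theta = pi in the last segment
    sign = np.where(k % 2 == 0, 1.0, -1.0)                 # (-1)^k
    return sign * np.cos(mu * theta) - 2.0 * k

thetas = np.linspace(0.0, np.pi, 1001)
values = psi(thetas, mu=4)
```

On the first segment \(\psi\) coincides with \(\cos(\mu \theta)\), and after that it keeps decreasing instead of oscillating, ending at \(1 - 2\mu\) for \(\theta = \pi\).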
State-of-the-Art Approaches
The success of SphereFace resulted in an avalanche of new methods that are based on the idea of employing angular distance with angular margin.
For many who have explored just a small part of Metric Learning, it might be confusing why angular distance approaches perform so much better than the direct ones. I want to emphasise that there’s nothing special about angular distance itself. The reason why these methods work is that projecting the features onto a hypersphere was an easy way to achieve good intra- and inter-class variation.
CosFace
Wang et al. (2018) discussed the limitations of SphereFace in great detail:
The decision boundary of the SphereFace is defined over the angular space by \(\,\smash{\cos(\mu \theta_1) = \cos(\theta_2)}\), which has a difficulty in optimization due to the nonmonotonicity of the cosine function. To overcome such a difficulty, one has to employ an extra trick with an ad-hoc piecewise function for SphereFace. More importantly, the decision margin of SphereFace depends on \(\,\smash{\theta}\), which leads to different margins for different classes. As a result, in the decision space, some inter-class features have a larger margin while others have a smaller margin, which reduces the discriminating power.
CosFace (Wang et al. 2018) proposes a simpler yet more effective way to define the margin. The setting is similar to SphereFace: we normalize the weight vectors of the matrix \(\smash{W}\), i.e. \(\smash{\| W_j \| = 1}\), and zero the biases, \(b = 0\). Additionally, we normalize the features \(z\) (extracted by a neural network) as well, so \(\| z \| = 1\). The CosFace objective is then defined as:
\[\begin{equation*} \mathcal{L}_\text{CosFace} = - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{s \left(\cos (\theta_{y_i, i}) - m\right) \right\} }{ \exp\left\{s \left(\cos (\theta_{y_i, i}) - m\right) \right\} + \sum_{j \ne y_i} \exp\left\{s \cos (\theta_{j,i})\right\} }} \end{equation*}\]where \(s\) is referred to as the scaling parameter, and \(m\) is referred to as the margin parameter. As in SphereFace, \(\smash{\theta_{j,i}}\) denotes the angle between the \(i\)-th feature vector \(z_i\) and \(\smash{W_j}\), and \(\smash{W_j^\intercal z_i = \cos \theta_{j,i}}\), because \(\smash{\| W_j \| = \| z_i \| = 1}\). Visually, it looks as follows:
Choosing the right scale value \(s\) and margin value \(m\) is very important. In the CosFace paper (Wang et al. 2018), it is shown that \(s\) should have a lower bound to at least obtain the expected classification performance. Let \(\smash{C}\) be the number of classes. Suppose that the learned feature vectors separately lie on the surface of the hypersphere and center around the corresponding weight vector. Let \(\smash{P_{W}}\) denote the expected minimum posterior probability of class center (i.e. \(\smash{W}\)). The lower bound of \(s\) is given by:
\[\begin{equation*} s \ge \frac{C-1}{C} \log \frac{\left(C-1\right) P_W}{1 - P_W} \end{equation*}\]Supposing that all features are well-separated, the theoretical variable scope of \(m\) is supposed to be:
\[\begin{equation*} 0 \le m \le \left( 1 - \max\left( W_i^\intercal W_j \right) \right) \end{equation*}\]where \(i, j \le C\) and \(i \ne j\). Assuming that the optimal solution for the Modified Softmax loss should uniformly distribute the weight vectors on a unit hypersphere, the variable scope of margin \(m\) can be inferred as follows:
\[\begin{align*} 0 \le m \le & \,1 - \cos\frac{2\pi}{C}\,, & (K=2) \\ 0 \le m \le & \,\frac{C}{C-1}\,, & (C \le K + 1) \\ 0 \le m \ll & \,\frac{C}{C-1}\,, & (C > K + 1) \end{align*}\]where \(K\) is the dimension of the learned features. The inequalities indicate that as the number of classes increases, the upper bound of the cosine margin between classes decreases correspondingly. In particular, if the number of classes is much larger than the feature dimension, the upper bound of the cosine margin will get even smaller.
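To get a feeling for the numbers, here is a tiny Python sketch that simply evaluates the bounds on \(s\) and \(m\) quoted above. The function names `min_scale` and `max_margin` and the example inputs are my own assumptions; the formulas are just transcribed from the inequalities, and for \(C > K + 1\) the returned value is only the quantity that \(m\) has to stay well below.

```python
import math

def min_scale(num_classes, p_w):
    """Lower bound on the scale s for C classes and target posterior P_W."""
    c = num_classes
    return (c - 1) / c * math.log((c - 1) * p_w / (1.0 - p_w))

def max_margin(num_classes, feat_dim):
    """Upper bound on the cosine margin m for C classes and feature dimension K."""
    c, k = num_classes, feat_dim
    if k == 2:
        return 1.0 - math.cos(2.0 * math.pi / c)
    # Reached only when C <= K + 1; for C > K + 1 the margin must be much smaller.
    return c / (c - 1)

s_small = min_scale(num_classes=10, p_w=0.9)       # few classes
s_large = min_scale(num_classes=10_000, p_w=0.9)   # many classes need a larger scale
m_plane = max_margin(num_classes=4, feat_dim=2)    # 4 classes on a circle
m_high = max_margin(num_classes=3, feat_dim=128)   # C <= K + 1
```

The takeaway matches the text: more classes push the minimum scale up and the admissible margin down.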
ArcFace
ArcFace (Deng et al. 2019) is very similar to CosFace and addresses the same limitations of SphereFace as mentioned in the CosFace description. However, instead of defining the margin in the cosine space, it defines the margin directly in the angle space.
The setting is identical to CosFace, with the requirements that the last layer weights and feature vectors should be normalized, i.e. \(\smash{\| W_j \| = 1}\) and \(\smash{\| z \| = 1}\), and the last layer biases should be equal to zero (\(b = 0\)). The ArcFace objective is then defined as:
\[\begin{equation*} \mathcal{L}_\text{ArcFace} = - \frac{1}{N} \sum_{i=1}^{N}{ \log \frac{ \exp\left\{s \cos \left(\theta_{y_i, i} + m\right) \right\} }{ \exp\left\{s \cos \left(\theta_{y_i, i} + m\right) \right\} + \sum_{j \ne y_i} \exp\left\{s \cos (\theta_{j,i})\right\} }} \end{equation*}\]where \(s\) is the scaling parameter and \(m\) is the margin parameter. While the difference from CosFace is very minor, the results on various benchmarks show that ArcFace is better than CosFace in most cases.
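Since the CosFace and ArcFace objectives differ only in where the margin is applied, both fit into one short NumPy sketch. The function name `margin_softmax_loss`, the `kind` switch, and the default values of `s` and `margin` are assumptions for illustration and not the reference implementations of either paper. In particular, this sketch does not treat the case \(\theta + m > \pi\) specially.

```python
import numpy as np

def margin_softmax_loss(z, y, W, s=30.0, margin=0.35, kind="cosface"):
    """CosFace / ArcFace loss on a batch (illustrative sketch).

    z: (N, n) features, y: (N,) integer labels, W: (n, num_classes) weights.
    """
    z_unit = z / np.linalg.norm(z, axis=1, keepdims=True)   # ||z_i|| = 1
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)   # ||W_j|| = 1
    cos = np.clip(z_unit @ W_unit, -1.0, 1.0)               # cos(theta_{j,i})
    rows = np.arange(len(y))
    target = cos[rows, y]
    if kind == "cosface":
        target = target - margin                       # margin in cosine space
    elif kind == "arcface":
        target = np.cos(np.arccos(target) + margin)    # margin in angle space
    else:
        raise ValueError(f"unknown kind: {kind}")
    logits = s * cos
    logits[rows, y] = s * target                       # only the true class gets the margin
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, y].mean()

# Toy batch where each feature is perfectly aligned with its class center.
z = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
W = np.eye(2)

no_margin = margin_softmax_loss(z, y, W, s=4.0, margin=0.0, kind="cosface")
cosface = margin_softmax_loss(z, y, W, s=4.0, margin=0.35, kind="cosface")
arcface = margin_softmax_loss(z, y, W, s=4.0, margin=0.5, kind="arcface")
```

Even for perfectly aligned features, both margins make the loss strictly larger than plain normalized softmax, which is what keeps pushing the classes apart.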
AdaCos
For both CosFace and ArcFace, the choice of scaling parameter \(s\) and margin \(m\) is crucial. Both papers did very little analysis of the effect of these parameters. Luckily, Zhang et al. (2019) performed an awesome analysis of the hyperparameters of cosine-based losses.