## 1 Introduction

Maintaining civil discussions online is a persistent challenge for online platforms. Due to the sheer scale of user-generated text, modern content moderation systems often employ machine learning algorithms to automatically classify user comments based on their toxicity, with the goal of flagging a collection of likely policy-violating content for human experts to review (etim2017times). However, modern deep learning models have been shown to suffer from reliability and robustness issues, especially in the face of the rich and complex sociolinguistic phenomena in real-world online conversations. Examples include generating confidently wrong predictions based on spurious lexical features (wang-culotta-2020-identifying), or exhibiting undesired biases toward particular social subgroups (46743). This has raised questions about how current toxicity detection models will perform in realistic online environments, as well as the potential consequences for moderation systems (rainie_future_2017).

In this work, we study an approach to address these questions by incorporating model uncertainty into the collaborative model-moderator system's decision-making process. The intuition is that, by using uncertainty as a signal for the likelihood of model error, we can improve the efficiency and performance of the collaborative moderation system by prioritizing the least confident examples from the model for human review. Despite a plethora of uncertainty methods in the literature, there has been limited work studying their effectiveness in improving the performance of human-AI collaborative systems with respect to application-specific metrics and criteria (awaysheh_review_2019; dusenberry_analyzing_2020; jesson_identifying_2020). This is especially important for the content moderation task: real-world practice has unique challenges and constraints, including label imbalance, distributional shift, and the limited resources of human experts, and how these factors impact the collaborative system's effectiveness is not well understood.

In this work, we lay the foundation for the study of the uncertainty-aware collaborative content moderation problem. We first (1) propose rigorous metrics, Oracle-Model Collaborative Accuracy (OC-Acc) and AUC (OC-AUC), to measure the performance of the overall collaborative system under capacity constraints on a simulated human moderator. We also propose Review Efficiency, an intrinsic metric to measure a model's ability to improve collaboration efficiency by selecting examples that need further review. Then, (2) we introduce a challenging data benchmark, Collaborative Toxicity Moderation in the Wild (CoToMoD), for evaluating the effectiveness of a collaborative toxic comment moderation system. CoToMoD emulates the realistic train-deployment environment of a moderation system, in which the deployment environment contains richer linguistic phenomena and a more diverse range of topics than the training data, such that effective collaboration is crucial for good system performance (amodei2016concrete). Finally, (3) we present a large benchmark study to evaluate the performance of five classic and state-of-the-art uncertainty approaches on CoToMoD under two different moderation review strategies (based on the uncertainty score and on the toxicity score, respectively). We find that both the model's predictive and uncertainty quality contribute to the performance of the final system, and that the uncertainty-based review strategy outperforms the toxicity strategy across a variety of models and a range of human review capacities.

## 2 Related Work

Our collaborative metrics draw on the idea of classification with a reject option, or learning with abstention (JMLR:v9:bartlett08a; cortes2016learning; pmlr-v80-cortes18a; kompa2021second). In this classification scenario, the model has the option to reject an example instead of predicting its label. The challenge in connecting learning with abstention to OC-Acc or OC-AUC is to account for how many examples have already been rejected. Specifically, the difficulty is that the metrics we present are all dataset-level metrics, i.e. the “reject” option is not at the level of individual examples, but rather a set capacity over the entire dataset. Moreover, this means OC-Acc and OC-AUC can be compared directly with traditional accuracy or AUC measures. This difference in focus enables us to consider human time as the limiting resource in the overall model-moderator system’s performance.

One key point for our work is that the best model (in isolation) may not yield the best performance in collaboration with a human (bansal2021is). Our work demonstrates this for a case where the collaboration procedure is decided over the full dataset rather than per example: because of this, the expected team utility of bansal2021is does not easily generalize to our setting. In particular, the user chooses which classifier predictions to accept after receiving all of them rather than per example.

Robustness to distribution shift has been applied to toxicity classification in other works (adragna2020fairness; koh2020wilds), emphasizing the connection between fairness and robustness. Our work focuses on how these methods connect to the human review process, and how uncertainty can lead to better decision-making for a model collaborating with a human. Along these lines, dusenberry_analyzing_2020 analyzed how uncertainty affects optimal decisions in a medical context, though again at the level of individual examples rather than over the dataset.

## 3 Background: Uncertainty Quantification for Deep Toxicity Classification

**Types of Uncertainty** Consider modeling a toxicity dataset D = {(x_i, y_i)} using a deep classifier p(y | x, θ). Here the x_i are example comments, the y_i are toxicity labels drawn from a data-generating process p*(y | x) (e.g., the human annotation process), and θ are the parameters of the deep neural network. There are two distinct types of uncertainty in this modeling process: data uncertainty and model uncertainty (sullivan2015introduction; liu_accurate_2019). Data uncertainty arises from the stochastic variability inherent in the data-generating process p*(y | x). For example, the toxicity label for a comment can vary between 0 and 1 depending on raters' different understandings of the comment or of the annotation guidelines. On the other hand, model uncertainty arises from the model's lack of knowledge about the world, commonly caused by insufficient coverage of the training data. For example, at evaluation time, the toxicity classifier may encounter neologisms or misspellings that did not appear in the training data, making it more likely to make a mistake (van-aken-etal-2018-challenges). While model uncertainty can be reduced by training on more data, data uncertainty is inherent to the data-generating process and is irreducible.

**Estimating Uncertainty** A model that quantifies its uncertainty well should properly capture both the data and the model uncertainties. To this end, a learned deep classifier describes the data uncertainty via its predictive probability, e.g.:

p(y = 1 | x, θ),

which is conditioned on the model parameters θ, and is commonly learned by minimizing the Kullback-Leibler (KL) divergence between the model distribution and the empirical distribution of the data (e.g., by minimizing the cross-entropy loss (Goodfellow-et-al-2016)). On the other hand, a deep classifier can quantify model uncertainty by using probabilistic methods to learn the posterior distribution of the model parameters:

p(θ | D).

This distribution over θ leads to a distribution over the predictive probabilities p(y | x, θ). As a result, at inference time, the model can sample model weights θ_m from the posterior distribution p(θ | D), and then compute the posterior sample of predictive probabilities p(y | x, θ_m). This allows the model to express its model uncertainty through the variance of the posterior distribution of p(y | x, θ). Section 5 surveys popular probabilistic deep learning methods.

In practice, it is convenient to compute a single uncertainty score capturing both types of uncertainty. To this end, we can first compute the marginalized predictive probability:

p(y | x) = ∫ p(y | x, θ) p(θ | D) dθ,

which captures both types of uncertainty by marginalizing the data uncertainty p(y | x, θ) over the model uncertainty p(θ | D). We can thus quantify the overall uncertainty of the model by computing the predictive variance of this binary distribution:

u(x) = p(y = 1 | x) × (1 − p(y = 1 | x)).   (1)
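As a minimal sketch of this computation (assuming posterior samples of the predictive probability are available, e.g., from MC Dropout or an ensemble; the toy values below are illustrative):

```python
def uncertainty_score(prob_samples):
    """Uncertainty score u(x) = p(1 - p), where p is the predictive probability
    p(y=1 | x) marginalized (averaged) over posterior samples of the model."""
    p = sum(prob_samples) / len(prob_samples)  # marginalized predictive probability
    return p * (1.0 - p)                       # variance of the binary prediction

# A confidently toxic comment vs. an ambiguous one (toy posterior samples):
u_confident = uncertainty_score([0.90, 0.80, 0.85])  # p = 0.85 -> u ~ 0.1275
u_ambiguous = uncertainty_score([0.45, 0.55, 0.50])  # p = 0.50 -> u = 0.25 (maximal)
```

Note that the score is maximized when the marginalized probability is 0.5 and vanishes as the model becomes certain in either direction.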

Evaluating Uncertainty Quality A common approach to evaluate a model’s uncertainty quality is to measure its calibration performance, i.e., whether the model’s predictive uncertainty is indicative of the predictive error (pmlr-v70-guo17a). As we shall see in experiments, traditional calibration metrics like the Brier score (NEURIPS2019_8558cb40) do not correlate well with the model performance in collaborative prediction. One notable reason is that the collaborative systems use uncertainty as a ranking score (to identify possibly wrong predictions), while metrics like Brier score only measure the uncertainty’s ranking performance indirectly.

*Figure 1: Confusion matrix recasting uncertainty calibration as binary prediction of the model's errors from its uncertainty.*

| | Uncertain | Certain |
| --- | --- | --- |
| Inaccurate | TP | FN |
| Accurate | FP | TN |
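This error-prediction framing can be made concrete with a small sketch (hypothetical data; the AUROC is computed via the rank-based Mann–Whitney formulation rather than a library call):

```python
def calibration_auroc(y_true, y_pred, uncertainty):
    """Calibration AUROC: treat the model's errors as binary labels and its
    uncertainty as the score. Equals the probability that a randomly chosen
    inaccurate example is ranked as more uncertain than an accurate one."""
    errors = [int(p != t) for p, t in zip(y_pred, y_true)]
    unc_wrong = [u for u, e in zip(uncertainty, errors) if e == 1]  # inaccurate
    unc_right = [u for u, e in zip(uncertainty, errors) if e == 0]  # accurate
    wins = sum(1.0 if uw > ur else 0.5 if uw == ur else 0.0
               for uw in unc_wrong for ur in unc_right)
    return wins / (len(unc_wrong) * len(unc_right))

# A well-calibrated toy model: its two mistakes carry the highest uncertainty.
auc = calibration_auroc(y_true=[1, 0, 0, 1], y_pred=[1, 0, 1, 0],
                        uncertainty=[0.10, 0.05, 0.24, 0.21])  # -> 1.0
```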

This motivates us to consider Calibration AUC, a new class of calibration metrics that focus on the ranking performance of the uncertainty score. This metric evaluates uncertainty estimation by recasting it as a binary prediction problem, where the binary label is the model's prediction error and the predictive score is the model uncertainty. This formulation leads to the confusion matrix shown in Figure 1 (krishnan2020improving). Here, the four confusion matrix variables take on new meanings: (1) True Positive (TP) corresponds to the case where the prediction is inaccurate and the model is uncertain, (2) True Negative (TN) to the accurate and certain case, (3) False Negative (FN) to the inaccurate and certain case (i.e., over-confidence), and finally (4) False Positive (FP) to the accurate and uncertain case (i.e., under-confidence). Now, consider having the model predict its testing error using model uncertainty. The precision (TP/(TP+FP)) measures the fraction of uncertain examples where the model is inaccurate, the recall (TP/(TP+FN)) measures the fraction of inaccurate examples where the model is uncertain, and the false positive rate (FP/(FP+TN)) measures the fraction of under-confident examples among the correct predictions. Thus, the model's calibration performance can be measured by the area under the precision-recall curve (Calibration AUPRC) and under the receiver operating characteristic curve (Calibration AUROC) for this problem. It is worth noting that the Calibration AUPRC is closely related to the intrinsic metrics for the model's collaborative effectiveness (we discuss this in greater detail for Review Efficiency in Section 4.1 and Appendix A.2). This renders it especially suitable for evaluating model uncertainty in the context of collaborative content moderation.

## 4 The Collaborative Content Moderation Task

Online content moderation is a *collaborative* process, performed by humans working in conjunction with machine learning models. For example, the model can select a set of likely policy-violating posts for further review by human moderators. In this work, we consider a setting where a neural model interacts with an "oracle" human moderator with limited capacity to moderate online comments.
Given a large number of examples, the model first generates a predictive probability and a review score for each example. Then, the model sends a pre-specified number of these examples to human moderators according to the ranking of the review scores, and relies on its own predictions for the remaining examples. We make the simplifying assumption that the human experts act like an oracle, correctly labeling all comments sent by the model.

### 4.1 Measuring the Performance of the Collaborative Moderation System

Machine learning systems for online content moderation are typically evaluated using metrics like accuracy or area under the receiver operating characteristic curve (AUROC). These metrics reflect the origins of these systems in classification problems, such as detecting or classifying online abuse, harassment, or toxicity (yin2009detection; Dinakar_Reichart_Lieberman_2011; Cheng_Danescu-Niculescu-Mizil_Leskovec_2015; 10.1145/3038912.3052591). However, they do not capture the model's ability to effectively collaborate with human moderators, or the performance of the resultant collaborative system.

New metrics, both extrinsic and intrinsic (10.5555/1641396.1641403), are one of the core contributions of this work. We introduce extrinsic metrics describing the performance of the overall model-moderator collaborative system (Oracle-Model Collaborative Accuracy and AUC, analogous to the classic accuracy and AUC), and an intrinsic metric focusing on the model's ability to effectively collaborate with human moderators (Review Efficiency), i.e., how well the model selects the examples in need of further review.

#### Extrinsic Metrics: Oracle-model Collaborative Accuracy and AUC

To capture the collaborative interaction between human moderators and machine learning models, we first propose Oracle-Model Collaborative Accuracy (OC-Acc).
OC-Acc measures the combined accuracy of this collaborative process, subject to a limited review capacity α for the human oracle (i.e., the oracle can review at most an α fraction of the total examples). Formally, given a dataset of examples x_i with labels y_i, and a predictive model that generates a prediction ŷ_i and a review score η(x_i) for each example, the Oracle-Model Collaborative Accuracy for example x_i is

OC-Acc(x_i) = 1 if η(x_i) > η_{1−α}, and I(ŷ_i = y_i) otherwise,

where η_{1−α} is the (1−α)-quantile of the model's review scores over the entire dataset. Over the whole dataset, OC-Acc is the average of OC-Acc(x_i) over all examples. OC-Acc thus describes the performance of a collaborative system which defers to a human oracle when the review score is high, and relies on the model prediction otherwise, capturing the real-world usage and performance of the underlying model in a way that traditional metrics fail to.
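A minimal simulation consistent with this definition (toy data; the oracle is assumed perfect, and any review score can be plugged in):

```python
def oracle_collaborative_accuracy(y_true, y_pred, review_score, review_fraction):
    """OC-Acc: the top review_fraction of examples (ranked by review score) are
    corrected by a perfect oracle; the model's predictions count for the rest."""
    n = len(y_true)
    num_reviewed = int(review_fraction * n)
    by_score = sorted(range(n), key=lambda i: review_score[i], reverse=True)
    reviewed = set(by_score[:num_reviewed])
    correct = sum(1 if i in reviewed else int(y_pred[i] == y_true[i])
                  for i in range(n))
    return correct / n

# The model errs on exactly the two examples it scores highest for review:
y_true, y_pred = [1, 0, 1, 0], [0, 0, 0, 0]
scores = [0.9, 0.1, 0.8, 0.2]
acc_no_review = oracle_collaborative_accuracy(y_true, y_pred, scores, 0.0)  # 0.5
acc_half = oracle_collaborative_accuracy(y_true, y_pred, scores, 0.5)       # 1.0
```

With zero capacity OC-Acc reduces to ordinary accuracy; as capacity grows, the oracle absorbs the model's highest-scored (and here, incorrect) predictions.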

However, as an accuracy-like metric, OC-Acc relies on a set threshold on the prediction score. This limits the metric's ability to describe model performance when compared to threshold-agnostic metrics like AUC. Moreover, OC-Acc can be sensitive to the intrinsic class imbalance in toxicity datasets, appearing overly optimistic for model predictions that are biased toward the negative class, similar to traditional accuracy metrics (DBLP:journals/corr/abs-1903-04561). Therefore, in practice, we prefer the AUC analogue of Oracle-Model Collaborative Accuracy, which we term the Oracle-Model Collaborative AUC (OC-AUC). OC-AUC measures the same collaborative process as OC-Acc, where the model sends the predictions with the top α fraction of review scores to the oracle. Then, similar to the standard AUC computation, OC-AUC sets up a collection of classifiers with varying predictive score thresholds, each of which has access to the oracle exactly as for OC-Acc (10.1145/1143844.1143874). Each of these classifiers sends the same set of examples to the oracle (since the review score is threshold-independent), and the oracle corrects model predictions when they are incorrect given the threshold. The OC-AUC (both OC-AUROC and OC-AUPRC) can then be calculated over this set of classifiers following the standard AUC algorithms (10.1145/1143844.1143874).

#### Intrinsic Metric: Review Efficiency

The metrics so far measure the performance of the overall collaborative system, which combines both the model’s predictive accuracy and the model’s effectiveness in collaboration. To understand the source of the improvement, we also introduce Review Efficiency, an intrinsic metric focusing solely on the model’s effectiveness in collaboration. Specifically, Review Efficiency is the proportion of examples sent to the oracle for which the model prediction would otherwise have been incorrect. This can be thought of as the model’s precision in selecting inaccurate examples for further review (TP/(TP+FP) in Figure 1).

Note that the system’s overall performance (measured by the oracle-model collaborative accuracy) can be rewritten as a weighted sum of the model’s original predictive accuracy and the Review Efficiency (RE):

(2) |

where is the model’s review efficiency among all the examples whose review score are greater than (i.e., those sent to human moderators). Thus, a model with better predictive performance and higher review efficiency yields better performance in the overall system. The benefits of review efficiency become more pronounced as the review fraction increases. We derive Eq. (2) in Appendix B.

### 4.2 CoToMoD: An Evaluation Benchmark for Real-world Collaborative Moderation

In a realistic industrial setting, toxicity detection models are often trained on a well-curated dataset with clean annotations, and then deployed to an environment that contains a more diverse range of sociolinguistic phenomena, and additionally exhibits systematic shifts in the lexical and topical distributions when compared to the training corpus.

To this end, we introduce a challenging data benchmark, Collaborative Toxicity Moderation in the Wild (CoToMoD), to evaluate the performance of collaborative moderation systems in a realistic environment. CoToMoD consists of a set of train, test, and deployment environments: the train and test environments consist of 200k Wikipedia discussion comments from 2004–2015 (the Wikipedia Talk Corpus (10.1145/3038912.3052591)), and the deployment environment consists of one million public comments that appeared on approximately 50 English-language news sites across the world from 2015–2017 (the CivilComments dataset (DBLP:journals/corr/abs-1903-04561)). This setup mirrors the real-world implementation of these methods, where robust performance under changing data is essential for proper deployment (amodei2016concrete).

Notably, CoToMoD contains two data challenges often encountered in practice: (1) Distributional Shift, i.e., the comments in the training and deployment environments cover different time periods and surround different topics of interest (Wikipedia pages vs. news articles). As the CivilComments corpus is much larger in size, it contains a considerable collection of long-tail phenomena (e.g., neologisms, obfuscation, etc.) that appear less frequently in the training data. (2) Class Imbalance, i.e., the fact that most online content is not toxic (10.1145/2998181.2998213; 10.1145/3038912.3052591). This manifests in the datasets we use: roughly 2.5% (50,350 / 1,999,514) of the examples in the CivilComments dataset and 9.6% (21,384 / 223,549) of the examples in the Wikipedia Talk Corpus are toxic (10.1145/3038912.3052591; DBLP:journals/corr/abs-1903-04561). As we will show, failing to account for class imbalance can severely bias model predictions toward the majority (non-toxic) class, reducing the effectiveness of the collaborative system.

## 5 Methods

#### Moderation Review Strategy

In measuring model-moderator collaborative performance, we consider two review strategies (i.e., two different choices of review score). First, we experiment with a common toxicity-based review strategy (latin_america_moderates; NYT-case-study). Specifically, the model sends comments for review in decreasing order of the predicted toxicity score (i.e., the predictive probability p), equivalent to using p itself as the review score. The second strategy is uncertainty-based: we use the uncertainty u = p × (1 − p) as the review score (recall Eq. (1)), so that the review score is maximized at p = 0.5 and decreases toward 0 as p approaches 0 or 1. Which strategy performs best depends on the toxicity distribution in the dataset and the available review capacity.
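The two review scores can be sketched as follows (a toy illustration of the strategies, not the paper's implementation):

```python
def review_score(p, strategy):
    """Review score for a predictive probability p = p(y=1 | x)."""
    if strategy == "toxicity":
        return p                  # prioritize the most toxic-looking comments
    if strategy == "uncertainty":
        return p * (1.0 - p)      # Eq. (1): peaks at p = 0.5, -> 0 at p = 0 or 1
    raise ValueError(f"unknown strategy: {strategy}")

probs = [0.05, 0.50, 0.95]
tox = [review_score(p, "toxicity") for p in probs]     # ranks the 0.95 comment first
unc = [review_score(p, "uncertainty") for p in probs]  # ranks the 0.50 comment first
```

The two strategies thus prioritize different comments: the toxicity score surfaces confident positives, while the uncertainty score surfaces ambiguous cases near the decision boundary.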

#### Uncertainty Models

We evaluate the performance of classic and state-of-the-art probabilistic deep learning methods on the CoToMoD benchmark. We use BERT as the base model (devlin-etal-2019-bert), and select five methods based on their practical applicability to transformer models. Specifically, we consider (1) Deterministic, which computes the sigmoid probability of a vanilla BERT model (DBLP:conf/iclr/HendrycksG17); (2) Monte Carlo Dropout (MC Dropout), which estimates uncertainty using the Monte Carlo average of the predictive probabilities from 10 dropout samples (pmlr-v48-gal16); (3) Deep Ensemble, which estimates uncertainty using the mean predictive probability of 10 BERT models trained in parallel (NIPS2017_9ef2ed4b); (4) Spectral-normalized Neural Gaussian Process (SNGP), a recent state-of-the-art approach which improves a BERT model's uncertainty quality by transforming it into an approximate Gaussian process model (NEURIPS2020_543e8374); and (5) SNGP Ensemble, which is the Deep Ensemble using SNGP as the base model.

#### Learning Objective

To address class imbalance, we consider combining the uncertainty methods with Focal Loss (Lin_2017_ICCV). Focal loss reshapes the loss function to down-weight "easy" negatives (i.e., non-toxic examples), thereby focusing training on a smaller set of more difficult examples, and empirically leading to improved predictive and uncertainty calibration performance on class-imbalanced datasets (Lin_2017_ICCV; NEURIPS2020_aeb7b30e). We focus our attention on focal loss (rather than other approaches to class imbalance) because of how this impact on calibration interacts with our moderation review strategies.

*Table 1: Predictive performance (AUROC, AUPRC, Accuracy) and uncertainty quality (Brier score, Calibration AUPRC) of the five uncertainty methods trained with cross-entropy (XENT) and focal loss, evaluated in the testing environment (Wikipedia Talk) and the deployment environment (CivilComments).*

## 6 Benchmark Experiments

We first examine the prediction and calibration performance of the uncertainty models alone (Section 6.1). For prediction, we compute the predictive accuracy (Acc) and the predictive AUC (both AUROC and AUPRC). For uncertainty, we compute the Brier score (i.e., the mean squared error between true labels and predictive probabilities, a standard uncertainty metric), and also the Calibration AUPRC (Section 3).

We then evaluate the models' collaboration performance under both the uncertainty- and the toxicity-based review strategies (Section 6.2). For each model-strategy combination, we measure the model's collaboration ability by computing Review Efficiency, and evaluate the performance of the overall collaborative system using the Oracle-Model Collaborative AUROC (OC-AUROC). We evaluate all collaborative metrics over a range of human moderator review capacities, i.e., over a grid of review fractions (the fraction of total examples the model sends to the moderator for further review).

Results on further uncertainty and collaboration metrics (Calibration AUROC, OC-Acc, OC-AUPRC, etc.) are in Appendix D.

### 6.1 Prediction and Calibration

Table 1 shows the performance of all uncertainty methods evaluated in the testing environment (the Wikipedia Talk corpus) and the deployment environment (the CivilComments corpus).

First, we compare the uncertainty methods based on the predictive and calibration AUC. As shown, for prediction, the ensemble models (both SNGP Ensemble and Deep Ensemble) provide the best performance, while the SNGP Ensemble and MC Dropout perform best for uncertainty calibration. Training with focal loss systematically improves the model prediction under class imbalance (improving the predictive AUC), while incurring a trade-off with the model’s calibration quality (i.e. decreasing the calibration AUC).

Next, we turn to the model performance between the test and deployment environments. Across all methods, we observe a significant drop in predictive performance (in both AUROC and AUPRC), and a less pronounced, but still noticeable, drop in uncertainty calibration (Calibration AUPRC). Interestingly, focal loss seems to mitigate the drop in predictive performance, but also slightly exacerbates the drop in uncertainty calibration.

Lastly, we observe a counter-intuitive improvement in the non-AUC metrics (i.e., accuracy and Brier score) in the out-of-domain deployment environment. This is likely due to their sensitivity to class imbalance (recall that toxic examples are even rarer in CivilComments than in the Wikipedia Talk Corpus). As a result, these classic metrics tend to favor model predictions biased toward the negative class, and are therefore less suitable for evaluating model performance in the context of toxic comment moderation.

### 6.2 Collaboration Performance

Figures 2 and 3 show the Oracle-Model Collaborative AUROC (OC-AUROC) of the overall collaborative system, and Figure 4 shows the Review Efficiency of the uncertainty models. Both the toxicity-based (dashed lines) and uncertainty-based (solid lines) review strategies are included.

#### Effect of Review Strategy

For the AUC performance of the collaborative system, the uncertainty-based review strategy consistently outperforms the toxicity-based review strategy. For example, in the in-domain environment (Wikipedia Talk corpus), using the uncertainty- rather than toxicity-based review strategy yields larger OC-AUROC improvements than any modeling change; this holds across all measured review fractions. We see a similar trend for OC-AUPRC (Appendix Figure 7-8).

The trend in Review Efficiency (Figure 4) provides a more nuanced view of this picture. As shown, the efficiency of the toxicity-based strategy starts to improve as the review fraction increases, leading to a cross-over with the uncertainty-based strategy at high fractions. This is likely caused by the fact that in toxicity classification, the false positive rate exceeds the false negative rate. Therefore, sending a large number of positive predictions eventually leads the collaborative system to capture more errors, at the cost of a higher review load on human moderators. We notice that this transition occurs much earlier out-of-domain on CivilComments (Figure 4, right). This highlights the impact of the data's toxicity distribution on the best review strategy: because the proportion of toxic examples is much lower in CivilComments than in the Wikipedia Talk Corpus, the cross-over between the uncertainty and toxicity review strategies correspondingly occurs at lower review fractions. Finally, it is important to note that this advantage in review efficiency does not directly translate to improvements for the overall system. For example, the OC-AUCs using the toxicity strategy are still lower than those with the uncertainty strategy even for high review fractions.

#### Effect of Modeling Approach

Recall that the performance of the overall collaborative system is the result of the model performance in both prediction and calibration (cf. Eq. (2)). As a result, the model performance in Section 6.1 translates to performance on the collaborative metrics. For example, the ensemble methods (SNGP Ensemble and Deep Ensemble) consistently outperform on the OC-AUC metrics due to their high predictive AUC and decent calibration performance (Table 1). On the other hand, MC Dropout has good calibration performance but sub-optimal predictive AUC. As a result, it sometimes attains the best Review Efficiency (e.g., Figure 4, right), but never achieves the best overall OC-AUC. Finally, comparing training objectives, the focal-loss-trained models tend to outperform their cross-entropy-trained counterparts in OC-AUC, due to the fact that focal loss tends to bring significant benefits to the predictive AUC (albeit at a small cost to the calibration performance).

## 7 Conclusion

In this work, we presented the problem of collaborative content moderation, and introduced CoToMoD, a challenging benchmark for evaluating the practical effectiveness of collaborative (model-moderator) content moderation systems. We proposed principled metrics to quantify how effectively a machine learning model and a human (e.g., a moderator) can collaborate. These include Oracle-Model Collaborative Accuracy (OC-Acc) and AUC (OC-AUC), which measure analogues of the usual accuracy and AUC for interacting human-AI systems subject to limited human review capacity. We also proposed Review Efficiency, which quantifies how effectively a model utilizes human decisions. These metrics are distinct from classic measures of predictive performance or uncertainty calibration, and enable us to evaluate the performance of the full collaborative system as a function of human attention, as well as to understand how efficiently the collaborative system utilizes human decision-making. Moreover, though we focused here on measuring the combined system's performance through metrics analogous to accuracy and AUC, it is straightforward to extend these to other classic metrics like precision and recall.

Using these new metrics, we evaluated the performance of a variety of models on the collaborative content moderation task. We considered two canonical strategies for collaborative review: one based on the toxicity scores, and a new one using model uncertainty. We found that the uncertainty-based review strategy outperforms the toxicity strategy across a variety of models and a range of human review capacities, yielding absolute increases in how efficiently the model uses human decisions and in the collaborative system's AUROC and AUPRC. This merits further study and consideration of this strategy's use in content moderation. The interaction between the data distribution and the best review strategy (demonstrated by the crossover between the two strategies' performance out-of-domain) emphasizes the implicit trade-off between false positives and false negatives in the two review strategies: because toxicity is rare, prioritizing comments for review in order of toxicity reduces the false positive rate while potentially increasing the false negative rate. By comparison, the uncertainty-based review strategy treats false positives and negatives more evenly. Further study is needed to clarify this interaction. Our work shows that the choice of review strategy drastically changes the collaborative system performance: evaluating and striving to optimize only the model yields much smaller improvements than changing the review strategy, and misses major opportunities to improve the overall system.

Though the results presented in the current paper are encouraging, there remain important challenges for uncertainty modeling in the domain of toxic content moderation. In particular, dataset bias remains a significant issue: statistical correlation between the annotated toxicity labels and various surface-level cues may lead models to learn to overly rely on, e.g., lexical or dialectal patterns (zhou-etal-2021-challenges). This could cause the model to produce high-confidence mispredictions for comments containing these cues (e.g., reclaimed words or counter-speech), resulting in a degradation in calibration performance in the deployment environment (cf. Table 1). Surprisingly, the standard debiasing techniques we experimented with in this work (specifically, focal loss (karimi-mahabadi-etal-2020-end)) only exacerbated this decline in calibration performance. This suggests that naively applying debiasing techniques may incur unexpected negative impacts on other aspects of the moderation system. Further research is needed into modeling approaches that can achieve robust performance both in prediction and in uncertainty calibration under data bias and distributional shift (NEURIPS2020_eddc3427; utama-etal-2020-towards; DBLP:journals/corr/abs-2103-06922; yaghoobzadeh-etal-2021-increasing; bao2021predict; karimi-mahabadi-etal-2020-end).

There exist several important directions for future work. One key direction is to develop better review strategies than the ones discussed here: though the uncertainty-based strategy outperforms the toxicity-based one, there may be room for further improvement. Furthermore, constraints on the moderation process may necessitate different review strategies: for example, if content can only be removed with moderator approval, we could experiment with a hybrid strategy which sends a mixture of high toxicity and high uncertainty content for human review. A second direction is to study how these methods perform with real moderators: the experiments in this work are computational and there may exist further challenges in practice. For example, the difficulty of rating a comment can depend on the text itself in unexpected ways. Finally, a linked question is how to communicate uncertainty and different review strategies to moderators: simpler communicable strategies may be preferable to more complex ones with better theoretical performance.

## Acknowledgements

The authors would like to thank Jeffrey Sorensen for extensive feedback on the manuscript, and Nitesh Goyal, Aditya Gupta, Luheng He, Balaji Lakshminarayanan, Alyssa Lees, and Jie Ren for helpful comments and discussions.

## References

## Appendix A Details on Metrics

### a.1 Expected Calibration Error

For completeness, we include the definition of the expected calibration error (ECE) (10.5555/2888116.2888120) here. We report the ECE alongside the Brier score as a measure of uncertainty calibration performance in the tables in Appendix D.

ECE is computed by discretizing the probability range $[0,1]$ into a set of $B$ equally-spaced bins, and computing the weighted average of the difference between the confidence (the mean predicted probability within each bin) and the accuracy (the fraction of predictions within each bin that are correct):

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \mathrm{acc}(b) - \mathrm{conf}(b) \right|, \tag{3}$$

where $\mathrm{acc}(b)$ and $\mathrm{conf}(b)$ denote the accuracy and confidence for bin $b$, respectively, $n_b$ is the number of examples in bin $b$, and $N$ is the total number of examples.
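As a concrete illustration, the binning procedure above can be implemented in a few lines. This is a minimal sketch rather than the authors' implementation; it uses equal-width bins, following the standard definition:

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin predictions by confidence, then take the weighted
    average of |accuracy - mean confidence| over the bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; clamp 1.0 into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        n_b = len(members)
        avg_conf = sum(c for c, _ in members) / n_b
        acc = sum(1 for _, ok in members if ok) / n_b
        ece += (n_b / total) * abs(acc - avg_conf)
    return ece
```

Note that, as a population-level average, this quantity is insensitive to how the predictions are ranked, which motivates the calibration AUPRC discussed next.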

### a.2 Connection between Calibration AUPRC and Collaboration Metrics

As discussed in Section 3, Calibration AUPRC is an especially suitable metric for measuring model uncertainty in the context of collaborative content moderation, due to its close connection with the intrinsic metrics for the model’s collaboration effectiveness.

Specifically, the Review Efficiency metric (introduced in Section 4.1) can be understood as the analog of precision for the calibration task. To see this, recall the four confusion matrix variables introduced in Figure 1: (1) True Positive (TP) corresponds to the case where the prediction is inaccurate and the model is uncertain, (2) True Negative (TN) to the accurate and certain case, (3) False Negative (FN) to the inaccurate and certain case (i.e., over-confidence), and finally (4) False Positive (FP) to the accurate and uncertain case (i.e., under-confidence).

Then, given a review capacity constraint $\alpha$ (the fraction of examples sent for review), we see that

$$\mathrm{Review\ Efficiency}(\alpha) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$

which measures the proportion of examples sent to the human moderator that would otherwise have been classified incorrectly.

Similarly, we can define the analog of recall for the calibration task, which we term Review Effectiveness:

$$\mathrm{Review\ Effectiveness}(\alpha) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$

Review Effectiveness is also a valid intrinsic metric for the model's collaboration effectiveness: it measures the proportion of incorrect model predictions that are successfully corrected by the review strategy. (We visualize model performance in Review Effectiveness in Appendix D.)

Given these definitions, the calibration AUPRC can be understood as the area under the Review Efficiency vs. Review Effectiveness curve, with the usual classification threshold replaced by the review capacity $\alpha$. Calibration AUPRC therefore serves as a threshold-agnostic metric that captures the model's intrinsic collaboration effectiveness.
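As an illustration, both quantities can be computed directly from per-example uncertainty scores and correctness indicators. The sketch below is our own (not from the paper): it sends the most uncertain fraction of examples for review and reports the precision- and recall-style metrics defined above:

```python
def review_metrics(uncertainty, model_correct, capacity_fraction):
    """Compute Review Efficiency (precision analog) and Review
    Effectiveness (recall analog) when the `capacity_fraction` most
    uncertain examples are sent for human review."""
    n = len(uncertainty)
    n_review = int(round(capacity_fraction * n))
    # Review the most uncertain examples first.
    order = sorted(range(n), key=lambda i: -uncertainty[i])
    reviewed = set(order[:n_review])
    tp = sum(1 for i in reviewed if not model_correct[i])    # uncertain & wrong
    fn = sum(1 for i in range(n)                             # confident & wrong
             if i not in reviewed and not model_correct[i])
    efficiency = tp / n_review if n_review else 0.0
    effectiveness = tp / (tp + fn) if (tp + fn) else 1.0
    return efficiency, effectiveness
```

Sweeping `capacity_fraction` over $[0, 1]$ traces out the efficiency vs. effectiveness curve whose area is the calibration AUPRC.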

### a.3 Further Discussion

For the uncertainty-based review, an important question is whether classic uncertainty metrics like the Brier score capture good model-moderator collaborative efficiency. The SNGP Ensemble's good collaborative performance contrasts with its poorer Brier score (Table 1). By comparison, the calibration AUPRC successfully captures this good performance, and is highest for that model. More generally, for the cross-entropy models, the ordering of review efficiencies at low review fractions exactly matches the ordering by calibration AUPRC. This correspondence is not perfect: though the SNGP Ensemble with focal loss has the highest review efficiency overall, its calibration AUPRC is lower than that of the MC Dropout or SNGP models (the models with the next-highest review efficiencies). This may reflect the reshaping effect of focal loss on SNGP's calibration (explored in Appendix C). Overall, calibration AUPRC captures the relationship between collaborative ability and calibration much better than classic calibration metrics like the Brier score (or ECE; see Appendix D). This is because classic calibration metrics are population-level averages, whereas calibration AUPRC measures the ranking of the predictions and is thus more closely linked to the review-ordering problem.

## Appendix B Connecting Review Efficiency and Collaborative Accuracy

In this appendix, we derive Eq. (2) from the main paper, which connects the Review Efficiency and Oracle-Collaborative Accuracy.

Given a trained toxicity model, a review policy, and a dataset, let $R$ denote the event that an example gets reviewed, and $C$ the event that the model prediction is correct. Now, assuming the model sends a fraction $\alpha$ of examples for human review, we have:

$$P(R) = \alpha, \qquad \mathrm{Acc} = P(C).$$

Also, we can write:

$$\mathrm{Review\ Efficiency} = P(\bar{C} \mid R),$$

i.e., Review Efficiency is the percentage of incorrect predictions among the reviewed examples. Finally:

$$\text{OC-Acc} = P(C \cup R),$$

i.e., an example is predicted correctly by the collaborative system if either the model prediction itself is accurate ($C$), or it was sent for human review ($R$) and corrected by the oracle.

The above expression for OC-Acc leads to two different decompositions. First,

$$\text{OC-Acc} = P(R) + P(C \cap \bar{R}) = \alpha + (1 - \alpha)\,\mathrm{Acc}_{\bar{R}},$$

where $\mathrm{Acc}_{\bar{R}} = P(C \mid \bar{R})$ is the accuracy among the examples that are not sent to a human for review. Second,

$$\text{OC-Acc} = P(C) + P(\bar{C} \cap R) = \mathrm{Acc} + \alpha \cdot \mathrm{Review\ Efficiency},$$

which recovers Eq. (2) from the main paper.
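The derivation above can be checked numerically. Below is a minimal sketch (ours, not the paper's code) that verifies both decompositions on binary correctness and review indicators, assuming the human oracle labels every reviewed example correctly:

```python
def oc_accuracy_decompositions(model_correct, reviewed):
    """Verify the two decompositions of oracle-collaborative accuracy:
    OC-Acc = Acc + alpha * Efficiency
           = alpha + (1 - alpha) * Acc_unreviewed,
    where alpha is the reviewed fraction."""
    n = len(model_correct)
    # OC-Acc: correct if the model is right OR the example was reviewed.
    oc_acc = sum(1 for c, r in zip(model_correct, reviewed) if c or r) / n
    acc = sum(model_correct) / n
    n_rev = sum(reviewed)
    alpha = n_rev / n
    efficiency = (sum(1 for c, r in zip(model_correct, reviewed)
                      if r and not c) / n_rev) if n_rev else 0.0
    unrev = [c for c, r in zip(model_correct, reviewed) if not r]
    acc_unrev = sum(unrev) / len(unrev) if unrev else 1.0
    assert abs(oc_acc - (acc + alpha * efficiency)) < 1e-9
    assert abs(oc_acc - (alpha + (1 - alpha) * acc_unrev)) < 1e-9
    return oc_acc
```

Both identities hold for any review policy, since they are just two ways of partitioning the event $C \cup R$.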

## Appendix C Reliability Diagrams for Deterministic and SNGP models

We study the effect of focal loss on calibration quality for SNGP in further detail, plotting reliability diagrams for the deterministic and SNGP models trained with cross-entropy and focal cross-entropy. Figure 5 shows the reliability diagrams in-domain and Figure 6 shows them out-of-domain. We see that focal loss fundamentally changes the models' uncertainty behavior, systematically shifting the uncertainty curves away from overconfidence (the lower right, below the diagonal) and toward the calibration line (the diagonal). However, the exact pattern of change is model-dependent: the deterministic model with focal loss is over-confident below a certain predicted probability and under-confident above it, while the SNGP models remain over-confident throughout, although to a lesser degree than with cross-entropy loss.

Table 2: In-domain (test set) results. Rows are models trained with cross-entropy (XENT) and with focal loss (Deterministic, SNGP, MC Dropout, Deep Ensemble, and SNGP Ensemble); columns are AUROC, AUPRC, Acc., ECE, Brier, Calibration AUROC, and Calibration AUPRC. (Numeric entries were not recoverable from the source.)

Table 3: Deployment (out-of-domain) results, with the same layout as Table 2: models trained with cross-entropy (XENT) and with focal loss, evaluated on AUROC, AUPRC, Acc., ECE, Brier, Calibration AUROC, and Calibration AUPRC. (Numeric entries were not recoverable from the source.)

## Appendix D Complete metric results

We give the results for the remaining collaborative metrics not included in the main paper in this appendix. These give a comprehensive summary of the collaborative performance of the models evaluated in the paper. Table 2 and Table 3 give values for all review fraction-independent metrics, both in- and out-of-domain, respectively. We did not include the ECE and calibration AUROC in the corresponding table in the main paper (Table 1) for simplicity. Similarly, Figures 7 and 9 show the in-domain results (the OC-AUPRC and OC-Acc), and the out-of-domain plots (in the same order, followed by Review Efficiency) are Figure 8, Figure 10, and Figure 12.

The in- and out-of-domain OC-AUROC figures are included in the main paper as Figure 2 and Figure 3, respectively; the in-domain Review Efficiency is Figure 4. Additionally, we report results on the Review Effectiveness metric (introduced in Section A.2) in Figures 13-14. Similar to Review Efficiency, we find little difference in performance between the uncertainty models, and that the uncertainty-based policy outperforms the toxicity-based policy, especially in the low review capacity setting.
