F1 is a well-established metric for evaluating the performance of a classifier
or predictor. Its limitation is that it evaluates only the ranking or
classification capability of a predictor, not the quality of the probability
values assigned to predicted contacts. Intuitively, a probability assignment is
good if it has the following properties: 1) it helps users separate a good
prediction from a bad one. That is, for a hard target most predicted
probability values should be small, while for an easy target there should be
many more large probability values among the top predictions; and 2) it helps
users separate contacts from non-contacts. That is, on average a true contact
(especially a top-ranked one) should have a much larger predicted probability
value than a non-contact. These properties are desirable because they help
users judge whether a prediction is of low or high quality and select top
predicted contacts for folding simulation (in the absence of native structures).

To evaluate the quality of the predicted probability values, the CASP12 contact
prediction assessor introduced a new performance metric called
probability-weighted F1, abbreviated as F1(prob). Generalized from F1, F1(prob)
is defined as
$F1(prob) = \frac{2 \cdot Precision(prob) \cdot Recall(prob)}{Precision(prob) + Recall(prob)}$,
where $Precision(prob) = \frac{TP(prob)}{TP + FP}$,
$Recall(prob) = \frac{TP(prob)}{TP + FN}$, and $TP(prob)$ is the sum of the
probability values assigned to all true contacts among the top predictions.
Intuitively, a probability assignment with a larger F1(prob) should be better
than one with a smaller F1(prob) at separating a good prediction from a bad one
and at separating contacts from non-contacts. Although F1(prob) seems
appealing, it turns out to be a wrong metric, mainly because it does not
penalize a large probability value assigned to a residue pair that does not
form a contact. A more detailed argument follows.
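
As a rough illustration of the definition, the Python sketch below computes both F1 and F1(prob) for one target. It assumes that precision and recall are obtained by replacing the true-positive count with TP(prob) while keeping count-based denominators (number of selected pairs and number of true contacts). The function name `f1_and_f1_prob` and its parameters are hypothetical and are not part of any CASP tooling.

```python
from typing import Sequence, Tuple


def f1_and_f1_prob(probs: Sequence[float],
                   labels: Sequence[int],
                   top_k: int) -> Tuple[float, float]:
    """Return (F1, F1(prob)) for the top_k highest-probability residue pairs.

    Assumption: F1(prob) replaces the true-positive count by the sum of the
    probabilities of the true contacts among the selected pairs and leaves
    the denominators as plain counts.
    """
    # Rank pairs by predicted probability, highest first (stable for ties).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_k]

    n_selected = len(top)   # TP + FP
    n_true = sum(labels)    # TP + FN
    tp = sum(labels[i] for i in top)
    tp_prob = sum(probs[i] for i in top if labels[i])

    def f1_from(tp_value: float) -> float:
        if tp_value == 0 or n_selected == 0 or n_true == 0:
            return 0.0
        precision = tp_value / n_selected
        recall = tp_value / n_true
        return 2 * precision * recall / (precision + recall)

    return f1_from(tp), f1_from(tp_prob)


# Toy target: 4 residue pairs, 2 true contacts, top 2 pairs selected.
f1, f1_prob = f1_and_f1_prob([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0], top_k=2)
```

In this toy example F1 is 0.5 and F1(prob) is 0.45. The false positive scored 0.8 does not enter the computation at all, so under this reading nothing in the score depends on how confident the predictor was about it.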

Let $L$ denote the length of a protein under prediction and $N$ the number of
residue pairs. Let $P = (p_1, p_2, \ldots, p_N)$ denote the probability values
assigned by one predictor to the $N$ residue pairs, where
$p_1 \ge p_2 \ge \cdots \ge p_N$. Let $F1_P(prob)$ denote the F1(prob) of this
prediction. Suppose that this protein is a hard target with very few sequence
homologs, so that the prediction is of low quality and most of the predicted
probability values are smaller than 0.5. That is, many of the top predicted
contacts are actually wrong.

We now show that there exist numerous probability assignments, denoted by Q,
such that 1) Q has the same ranking order as P, i.e., P and Q have the same F1
score; 2) Q has a better F1(prob), i.e., $F1_Q(prob) > F1_P(prob)$; 3) Q may
mislead a user into believing that the underlying prediction is of high quality
although it is in fact of low quality; and 4) Q may be worse than P at
separating contacts from non-contacts.

**A trivial case.** We can obtain Q by setting the probability values of the
top L residue pairs to values close to 1 without changing their ranking order,
and setting the remaining N-L probability values to 0 or a small constant
(e.g., 0.01). Since the ranking order of the residue pairs is unchanged, Q has
the same F1 as P but a larger F1(prob). However, since the top L probability
values are close to 1 and much larger than the remaining ones, a user presented
with Q instead of P may think that the underlying prediction is really good and
that the top L predicted contacts are very likely to be true positives. This
contradicts the fact that the underlying prediction is actually bad and many of
the top L predicted contacts are false positives.
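
The trivial case can be checked numerically. The sketch below uses made-up probabilities and labels for eight residue pairs, and it assumes one reading of F1(prob): probability-weighted true positives over count-based denominators. `f1_pair` and `inflate_top` are hypothetical helper names introduced only for this illustration.

```python
def f1_pair(probs, labels, top_k):
    """Return (F1, F1(prob)) over the top_k highest-probability pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_k]
    n_true = sum(labels)
    tp = sum(labels[i] for i in top)
    tp_prob = sum(probs[i] for i in top if labels[i])
    # With count-based denominators, 2PR/(P+R) simplifies to 2*TP/(k + n_true).
    denom = len(top) + n_true
    if denom == 0:
        return 0.0, 0.0
    return 2 * tp / denom, 2 * tp_prob / denom


def inflate_top(probs, top_l, eps=1e-3, floor=0.01):
    """Build Q: the top_l pairs get values near 1 in the same order, the rest get floor."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    q = [floor] * len(probs)
    for rank, i in enumerate(order[:top_l]):
        q[i] = 1.0 - eps * rank  # strictly decreasing, so the ranking is kept
    return q


# Made-up hard target: every probability is below 0.5, half of the top 4 are wrong.
p = [0.45, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 0, 0, 1, 1, 0, 0, 0]
top_l = 4

q = inflate_top(p, top_l)
f1_p, f1_prob_p = f1_pair(p, labels, top_l)
f1_q, f1_prob_q = f1_pair(q, labels, top_l)
print(f"P: F1={f1_p:.3f}  F1(prob)={f1_prob_p:.3f}")
print(f"Q: F1={f1_q:.3f}  F1(prob)={f1_prob_q:.3f}")
```

On this toy data F1 stays at about 0.571 for both assignments, while F1(prob) rises from about 0.214 for P to about 0.571 for Q, even though two of the four pairs now scored near 1 are false positives.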

**A nontrivial case.** We may set Q to $\sqrt{P}$, i.e., $q_i = \sqrt{p_i}$. In
fact, for any positive $\alpha < 1$, we can set Q to $P^{\alpha}$ to have the
same effect. In addition, we can change only the top L probability values and
keep the remaining N-L probability values unchanged. We can also generate Q by
linear interpolation between P and $\sqrt{P}$. It is easy to show that Q has
the same F1 as P but a larger F1(prob). However, since many small probability
values in P become large in Q (e.g., 0.25 becomes 0.5), a user presented with Q
instead of P may think that the underlying prediction is of high quality and
that many top predicted contacts are correct, which again contradicts the fact
that the underlying prediction is actually of low quality.
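
A numerical check of the nontrivial case follows, again on made-up data. It assumes the transformation is an element-wise power $q_i = p_i^{\alpha}$ with $0 < \alpha < 1$ (the square root, $\alpha = 0.5$, is consistent with the example of 0.25 becoming 0.5), and it assumes one reading of F1(prob): probability-weighted true positives over count-based denominators. All helper names are hypothetical.

```python
import math


def f1_pair(probs, labels, top_k):
    """Return (F1, F1(prob)) over the top_k highest-probability pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:top_k]
    n_true = sum(labels)
    tp = sum(labels[i] for i in top)
    tp_prob = sum(probs[i] for i in top if labels[i])
    denom = len(top) + n_true  # (TP + FP) + (TP + FN)
    if denom == 0:
        return 0.0, 0.0
    return 2 * tp / denom, 2 * tp_prob / denom


def power_transform(probs, alpha=0.5):
    """Raise each probability to alpha in (0, 1): monotone, and p**alpha >= p on [0, 1]."""
    return [p ** alpha for p in probs]


def interpolate(probs, t):
    """Linear interpolation between P (t=0) and sqrt(P) (t=1)."""
    return [(1 - t) * p + t * math.sqrt(p) for p in probs]


# Made-up hard target: every probability is below 0.5, half of the top 4 are wrong.
p = [0.45, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
labels = [1, 0, 0, 1, 1, 0, 0, 0]
top_l = 4

f1_p, f1_prob_p = f1_pair(p, labels, top_l)
candidates = {
    "sqrt(P)": power_transform(p, 0.5),
    "P**0.25": power_transform(p, 0.25),
    "midpoint of P and sqrt(P)": interpolate(p, 0.5),
}
results = {name: f1_pair(q, labels, top_l) for name, q in candidates.items()}

print(f"P: F1={f1_p:.3f}  F1(prob)={f1_prob_p:.3f}")
for name, (f1_q, f1_prob_q) in results.items():
    print(f"{name}: F1={f1_q:.3f}  F1(prob)={f1_prob_q:.3f}")
```

Each transformation is strictly increasing, so the ranking and therefore F1 are unchanged, while every probability in (0, 1) grows and F1(prob) grows with it.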

In summary, F1(prob) favors predictors that assign a larger probability value
to a predicted contact regardless of whether it is correct. F1(prob) cannot be
used to evaluate the quality of the predicted probability values, since a
probability assignment with a larger F1(prob) is no better (and sometimes even
worse) than one with a smaller F1(prob) at separating contacts from
non-contacts and at separating good predictions from bad ones. If F1(prob) is
enforced, every server will simply assign probability 1.0 to all of its
predicted contacts to maximize its F1(prob), which renders the probability
values useless.
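
The last point can be made explicit with a short derivation. It assumes, as one reading of the definition, that F1(prob) uses TP(prob) in the numerators of precision and recall and keeps count-based denominators; $S$ (the set of true contacts among the top predictions) and $p_i$ (the probability assigned to pair $i$) are notation introduced here for illustration.

```latex
\[
F1(prob) \;=\; \frac{2\,TP(prob)}{2\,TP + FP + FN},
\qquad
TP(prob) \;=\; \sum_{i \in S} p_i \;\le\; |S| \;=\; TP ,
\]
\[
\text{so}\quad F1(prob) \;\le\; \frac{2\,TP}{2\,TP + FP + FN} \;=\; F1,
\quad\text{with equality if and only if } p_i = 1 \text{ for all } i \in S .
\]
```

Under this reading the probabilities assigned to false positives never appear in the expression, so setting every selected probability to 1.0 reaches the upper bound F1(prob) = F1 at no cost.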

P.S. A similar argument shows that the assessor's current method of filtering
predictions by probability > 0.5 is also wrong.