Coherence description
Using CV is not recommended because of known issues associated with this measure.
CV is based
on a sliding window, a one-set segmentation of the top words,
and an indirect confirmation measure that uses normalized
pointwise mutual information (NPMI) and the cosine similarity.
This coherence measure retrieves cooccurrence counts for
the given words using a sliding window with a window size of 110.
The counts are used to calculate the NPMI of every top word with
every other top word, thus resulting in a set of
vectors, one for every top word. The one-set segmentation
of the top words leads to the calculation of the similarity
between every top word vector and the sum of all top word
vectors. The cosine is used as the similarity measure. The
coherence is the arithmetic mean of these similarities. (Note
that this was the best coherence measure in our evaluation.)
Proposed in
M. Röder, A. Both, and A. Hinneburg:
Exploring the Space of Topic Coherence Measures. In
Proceedings of the Eighth International Conference on Web
Search and Data Mining, 2015.
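The CV pipeline can be sketched in plain Python. This is an illustrative sketch, not the library's actual implementation: the probability estimates (`p_single`, `p_pair`) are assumed to come from sliding-window cooccurrence counts, and the function names and the smoothing term `eps` are invented for the example.

```python
import math

def npmi(p_joint, p_a, p_b, eps=1e-12):
    """Normalized pointwise mutual information of a word pair."""
    pmi = math.log((p_joint + eps) / (p_a * p_b))
    return pmi / -math.log(p_joint + eps)

def c_v(top_words, p_single, p_pair):
    """Sketch of CV: NPMI vectors, one-set segmentation, cosine
    similarity of every vector against the sum of all vectors."""
    # One NPMI vector per top word, with one entry per top word.
    vectors = []
    for w in top_words:
        vectors.append([
            npmi(p_pair[frozenset((w, u))] if w != u else p_single[w],
                 p_single[w], p_single[u])
            for u in top_words
        ])
    # One-set segmentation: compare each vector with the sum vector.
    sum_vec = [sum(col) for col in zip(*vectors)]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    sims = [cosine(v, sum_vec) for v in vectors]
    return sum(sims) / len(sims)  # arithmetic mean of the similarities
```

With toy probabilities for two top words, the result lies in (0, 1], close to 1 when the words frequently cooccur.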
CP is
based on a sliding window, a one-preceding segmentation of the
top words, and the confirmation measure of Fitelson's coherence.
Word cooccurrence counts for the given top words are
derived using a sliding window with a window size of 70. For
every top word, the confirmation with its preceding top word is
calculated using the confirmation measure of Fitelson's
coherence. The coherence is the arithmetic mean of the
confirmation measure results.
Proposed in
M. Röder, A. Both, and A. Hinneburg:
Exploring the Space of Topic Coherence Measures. In
Proceedings of the Eighth International Conference on Web
Search and Data Mining, 2015.
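A minimal sketch of the CP computation, assuming Fitelson's confirmation measure in the form (P(W'|W*) − P(W'|¬W*)) / (P(W'|W*) + P(W'|¬W*)) and probabilities estimated from sliding-window counts; the names and toy inputs are illustrative only.

```python
import math

def fitelson(p_joint, p_w, p_cond, eps=1e-12):
    """Fitelson's confirmation measure for W' given W*:
    (P(W'|W*) - P(W'|not W*)) / (P(W'|W*) + P(W'|not W*))."""
    p_given = p_joint / p_cond                       # P(W'|W*)
    p_given_not = (p_w - p_joint) / (1.0 - p_cond)   # P(W'|not W*)
    return (p_given - p_given_not) / (p_given + p_given_not + eps)

def c_p(top_words, p_single, p_pair):
    """One-preceding segmentation: confirm each top word against its
    immediate predecessor; the coherence is the mean of the results."""
    scores = []
    for i in range(1, len(top_words)):
        w, prev = top_words[i], top_words[i - 1]
        scores.append(fitelson(p_pair[frozenset((w, prev))],
                               p_single[w], p_single[prev]))
    return sum(scores) / len(scores)
```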
CUCI is a
coherence that is based on a sliding window and the pointwise
mutual information (PMI) of all word pairs of the given top
words.
The word cooccurrence counts are derived using a sliding
window of size 10. For every word pair, the PMI is
calculated. The arithmetic mean of the PMI values is the result
of this coherence. (Note that in the original publication only
the sum of these values is calculated.)
Proposed in
D. Newman, J. H. Lau, K. Grieser, and T.
Baldwin: Automatic evaluation of topic coherence. In
Human Language Technologies: The 2010 Annual Conference of the
North American Chapter of the Association for Computational
Linguistics, pages 100-108. Association for Computational
Linguistics, 2010.
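The CUCI calculation reduces to a few lines. A sketch, assuming probabilities estimated from sliding-window counts; `eps` is an invented smoothing term to avoid log(0), and the mean (rather than Newman et al.'s sum) is taken as described above.

```python
import math
from itertools import combinations

def pmi(p_joint, p_a, p_b, eps=1e-12):
    """Pointwise mutual information of a word pair."""
    return math.log((p_joint + eps) / (p_a * p_b))

def c_uci(top_words, p_single, p_pair):
    """Mean PMI over all pairs of top words."""
    vals = [pmi(p_pair[frozenset(pair)], p_single[pair[0]], p_single[pair[1]])
            for pair in combinations(top_words, 2)]
    return sum(vals) / len(vals)
```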
CUMass is
based on document cooccurrence counts, a one-preceding
segmentation, and a logarithmic conditional probability as
confirmation measure.
The main idea of this coherence is that the occurrence of
every top word should be supported by every preceding top
word. Thus, the probability of a top word occurring should be
higher if a document already contains a higher-ranked top word
of the same topic. Therefore, for every word, the logarithm of
its conditional probability is calculated using every other top
word that has a higher rank in the ranking of top words as
condition. The probabilities are derived using document
cooccurrence counts. The single conditional probabilities are
aggregated using the arithmetic mean. (Note that in the
original publication only the sum of these values is
calculated.)
Proposed in
D. Mimno, H. M. Wallach, E. Talley, M.
Leenders, and A. McCallum: Optimizing semantic coherence
in topic models. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages 262-272.
Association for Computational Linguistics, 2011.
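A sketch of the CUMass computation from raw document counts. The +1 smoothing count follows Mimno et al.; taking the mean rather than their sum matches the description above. The dictionary inputs and function name are illustrative assumptions.

```python
import math

def c_umass(top_words, doc_count, pair_count, smoothing=1.0):
    """For each top word, take the log conditional probability given
    every higher-ranked top word, estimated from document cooccurrence
    counts, and average the results:
        log P(w_i | w_j) ~= log((D(w_i, w_j) + 1) / D(w_j))
    where D(.) counts documents and j precedes i in the ranking."""
    vals = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            joint = pair_count.get(frozenset((w_i, w_j)), 0)
            vals.append(math.log((joint + smoothing) / doc_count[w_j]))
    return sum(vals) / len(vals)
```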
CNPMI is an
enhanced version of the CUCI coherence using the normalized
pointwise mutual information (NPMI) instead of the pointwise
mutual information (PMI).
Proposed in
N. Aletras and M. Stevenson: Evaluating
topic coherence using distributional semantics. In Proceedings
of the 10th International Conference on Computational Semantics
(IWCS'13) Long Papers, pages 13-22, 2013.
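The only change from CUCI is the normalization of each pair's PMI by −log P(a, b), which bounds the value to [−1, 1]; a word pair that always cooccurs scores exactly 1. A sketch (the `eps` smoothing term is an assumption of this example):

```python
import math

def pmi(p_joint, p_a, p_b, eps=1e-12):
    """Pointwise mutual information of a word pair."""
    return math.log((p_joint + eps) / (p_a * p_b))

def npmi(p_joint, p_a, p_b, eps=1e-12):
    """NPMI rescales PMI into [-1, 1] by dividing by -log P(a, b)."""
    return pmi(p_joint, p_a, p_b, eps) / -math.log(p_joint + eps)
```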
CA is based
on a context window, a pairwise comparison of the top words, and
an indirect confirmation measure that uses normalized pointwise
mutual information (NPMI) and the cosine similarity.
This coherence measure retrieves cooccurrence counts for
the given words using a context window with a window size of 5.
The counts are used to calculate the NPMI of every top word with
every other top word, thus resulting in a single vector for
every top word. After that, the cosine similarity between all
word pairs is calculated. The coherence is the arithmetic mean
of these similarities. (Note that in the original publication
several other coherence measures have been described. We have
chosen this one because it was the best of these measures in
our evaluation.)
Proposed in
N. Aletras and M. Stevenson: Evaluating
topic coherence using distributional semantics. In Proceedings
of the 10th International Conference on Computational Semantics
(IWCS'13) Long Papers, pages 13-22, 2013.
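The final aggregation step of CA can be sketched as follows, assuming the per-word NPMI vectors have already been built from context-window counts (as for CV above); the function name and toy vectors are illustrative.

```python
import math
from itertools import combinations

def c_a(vectors):
    """Mean cosine similarity over all pairs of per-word NPMI vectors
    (pairwise comparison of the top words)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims)
```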