Package com.optimaize.langdetect
Class LanguageDetectorImpl
java.lang.Object
com.optimaize.langdetect.LanguageDetectorImpl
- All Implemented Interfaces:
LanguageDetector
This class is immutable and thus thread-safe.
-
Field Summary
Fields
Modifier and Type | Field | Description
private final double | alpha
private static final double | ALPHA_WIDTH | TODO document what this is for, and why that value is chosen.
private static final int | BASE_FREQ | TODO document what this is for, and why that value is chosen.
private static final double | CONV_THRESHOLD | TODO document what this is for, and why that value is chosen.
private static final long | DEFAULT_SEED | This is used when no custom seed was passed in.
private static final int | ITERATION_LIMIT | TODO document what this is for, and why that value is chosen.
private static final org.slf4j.Logger | logger
private final double | minimalConfidence
private static final int | N_TRIAL | TODO document what this is for, and why that value is chosen.
private final NgramExtractor | ngramExtractor
private final @NotNull NgramFrequencyData | ngramFrequencyData
private final double | prefixFactor
private final @org.jetbrains.annotations.Nullable double[] | priorMap | User-defined language priorities, in the same order as langlist.
private static final Comparator<DetectedLanguage> | PROBABILITY_SORTING_COMPARATOR
private final double | probabilityThreshold
private final com.google.common.base.Optional<Long> | seed
private final int | shortTextAlgorithm
private final double | suffixFactor
-
Constructor Summary
Constructors
Constructor | Description
LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable Map<LdLocale, Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor) | Use the LanguageDetectorBuilder.
-
Method Summary
Modifier and Type | Method | Description
com.google.common.base.Optional<LdLocale> | detect(CharSequence text) | Returns the best detected language if the algorithm is very confident.
private @org.jetbrains.annotations.Nullable double[] | detectBlock(CharSequence text)
private double[] | detectBlockLongText(List<String> ngrams) | This is the original algorithm, which was used for all text lengths.
private double[] | detectBlockShortText(Map<String, Integer> ngrams)
List<DetectedLanguage> | getProbabilities(CharSequence text) | Returns all languages with at least some likelihood.
private double[] | initProbability() | Initialize the map of language probabilities.
private @NotNull List<DetectedLanguage> | sortProbability(double[] prob) | Returns the detected languages sorted by probability, descending.
private boolean | updateLangProb(@org.jetbrains.annotations.NotNull double[] prob, @NotNull String ngram, int count, double alpha) | Update language probabilities with an n-gram string (N=1,2,3).
-
Field Details
-
logger
private static final org.slf4j.Logger logger
-
ALPHA_WIDTH
private static final double ALPHA_WIDTH
TODO document what this is for, and why that value is chosen.
-
ITERATION_LIMIT
private static final int ITERATION_LIMIT
TODO document what this is for, and why that value is chosen.
-
CONV_THRESHOLD
private static final double CONV_THRESHOLD
TODO document what this is for, and why that value is chosen.
-
BASE_FREQ
private static final int BASE_FREQ
TODO document what this is for, and why that value is chosen.
-
N_TRIAL
private static final int N_TRIAL
TODO document what this is for, and why that value is chosen.
-
DEFAULT_SEED
private static final long DEFAULT_SEED
This is used when no custom seed was passed in. Because the same seed is used for every call, the results are consistent across calls. Changing this number means that users of the library might suddenly see different results after updating, so don't change it hastily. The value is a prime number, picked without any deeper reason. See https://github.com/optimaize/language-detector/issues/14
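The effect of a fixed seed can be illustrated with a minimal sketch based on java.util.Random. The class name SeedSketch, the constant EXAMPLE_SEED, its value, and the method sample are hypothetical and not taken from the library; the actual DEFAULT_SEED value is not shown on this page, and how the detector consumes its random numbers is not documented here.

```java
import java.util.Arrays;
import java.util.Random;

final class SeedSketch {
    // Hypothetical stand-in; NOT the library's DEFAULT_SEED value.
    static final long EXAMPLE_SEED = 41L;

    /** Draws n Gaussian samples from a generator initialised with the given seed. */
    static double[] sample(long seed, int n) {
        Random random = new Random(seed);
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextGaussian();
        }
        return values;
    }

    public static void main(String[] args) {
        double[] first = sample(EXAMPLE_SEED, 5);
        double[] second = sample(EXAMPLE_SEED, 5);
        // Two generators created with the same seed produce the same sequence.
        System.out.println(Arrays.equals(first, second));
    }
}
```

Any randomised step that draws from a generator seeded this way repeats exactly from call to call, which is why changing the default would change results for existing users.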
-
PROBABILITY_SORTING_COMPARATOR
private static final Comparator<DetectedLanguage> PROBABILITY_SORTING_COMPARATOR
-
ngramFrequencyData
@NotNull private final NgramFrequencyData ngramFrequencyData
-
priorMap
@Nullable private final double[] priorMap
User-defined language priorities, in the same order as langlist.
-
alpha
private final double alpha
-
seed
private final com.google.common.base.Optional<Long> seed
-
shortTextAlgorithm
private final int shortTextAlgorithm
-
prefixFactor
private final double prefixFactor
-
suffixFactor
private final double suffixFactor
-
probabilityThreshold
private final double probabilityThreshold
-
minimalConfidence
private final double minimalConfidence
-
ngramExtractor
private final NgramExtractor ngramExtractor
-
-
Constructor Details
-
LanguageDetectorImpl
LanguageDetectorImpl(@NotNull NgramFrequencyData ngramFrequencyData, double alpha, com.google.common.base.Optional<Long> seed, int shortTextAlgorithm, double prefixFactor, double suffixFactor, double probabilityThreshold, double minimalConfidence, @Nullable Map<LdLocale, Double> langWeightingMap, @NotNull NgramExtractor ngramExtractor)
Use the LanguageDetectorBuilder.
-
-
Method Details
-
detect
public com.google.common.base.Optional<LdLocale> detect(CharSequence text)
Description copied from interface: LanguageDetector
Returns the best detected language if the algorithm is very confident.
Note: you may want to use getProbabilities() instead. This method is very strict and sometimes returns absent even though the first choice in getProbabilities() is correct.
- Specified by:
detect in interface LanguageDetector
- Parameters:
text - You probably want a TextObject.
- Returns:
The language if confident, absent if unknown or not confident enough.
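The relationship between the strict detect() and the more permissive getProbabilities() can be sketched as follows. This is an illustration under assumptions, not the library's code: the class StrictDetectSketch is hypothetical, java.util.Optional stands in for Guava's Optional, plain map entries stand in for DetectedLanguage, and the rule "accept the top candidate only if its probability reaches minimalConfidence" is an assumed reading of the description above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class StrictDetectSketch {
    /**
     * Returns the top candidate only if it is confident enough.
     * The input is assumed to be sorted by probability, best first.
     */
    static Optional<String> detect(List<Map.Entry<String, Double>> sorted,
                                   double minimalConfidence) {
        if (sorted.isEmpty()) {
            return Optional.empty(); // nothing was detected at all
        }
        Map.Entry<String, Double> best = sorted.get(0);
        if (best.getValue() >= minimalConfidence) {
            return Optional.of(best.getKey());
        }
        return Optional.empty(); // a first choice exists but is not confident enough
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> sorted =
                List.of(Map.entry("da", 0.62), Map.entry("no", 0.38));
        System.out.println(detect(sorted, 0.9)); // strict threshold: absent
        System.out.println(detect(sorted, 0.5)); // relaxed threshold: "da" is returned
    }
}
```

A caller who finds the strict result too often absent can read the first element of the sorted list directly and apply its own acceptance rule.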
-
getProbabilities
public List<DetectedLanguage> getProbabilities(CharSequence text)
Description copied from interface: LanguageDetector
Returns all languages with at least some likelihood.
There is a configurable cutoff applied for languages with very low probability.
As the algorithm currently works, this method may return, for example, 0.99 for Danish and less than 0.01 for Norwegian even though both languages are almost equally likely. It would be nice if this could be improved in future versions.
- Specified by:
getProbabilities in interface LanguageDetector
- Parameters:
text - You probably want a TextObject.
- Returns:
Sorted from better to worse. May be empty. It's empty if the program failed to detect any language, or if the input text did not contain any usable text (just noise).
-
detectBlock
@Nullable private double[] detectBlock(CharSequence text)
- Returns:
null if there are no "features" in the text (just noise).
-
detectBlockShortText
private double[] detectBlockShortText(Map<String, Integer> ngrams)
-
detectBlockLongText
private double[] detectBlockLongText(List<String> ngrams)
This is the original algorithm, which was used for all text lengths. It is inappropriate for short text.
-
initProbability
private double[] initProbability()
Initialize the map of language probabilities. If a prior map is specified, use it as the initial map.
- Returns:
initialized map of language probabilities
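A minimal sketch of this initialization, under stated assumptions: the class InitProbabilitySketch and the languageCount parameter are hypothetical, and the uniform fallback used when no prior is given is an assumption, since this page only documents the prior-map case.

```java
import java.util.Arrays;

final class InitProbabilitySketch {
    /**
     * Returns a copy of the prior if one is given. Otherwise returns a uniform
     * distribution over languageCount languages (an assumption of this sketch).
     */
    static double[] initProbability(double[] priorMap, int languageCount) {
        if (priorMap != null) {
            // Copy so that later updates do not modify the caller's prior.
            return priorMap.clone();
        }
        double[] prob = new double[languageCount];
        Arrays.fill(prob, 1.0 / languageCount);
        return prob;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(initProbability(null, 4)));
        System.out.println(Arrays.toString(initProbability(new double[] {0.7, 0.3}, 2)));
    }
}
```

Copying the prior rather than returning it directly keeps the stored array untouched, which fits a class documented as immutable.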
-
updateLangProb
private boolean updateLangProb(@NotNull double[] prob, @NotNull String ngram, int count, double alpha)
Update language probabilities with an n-gram string (N=1,2,3).
- Parameters:
count - 1-n: how often the gram occurred.
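One way such an update could work is sketched below. Everything here is an assumption made for illustration: the class UpdateLangProbSketch, the ngramProfiles map (a stand-in for NgramFrequencyData), the additive smoothing term, and the meaning of the boolean result (whether the n-gram was known) are hypothetical and not confirmed by this page.

```java
import java.util.Arrays;
import java.util.Map;

final class UpdateLangProbSketch {
    /**
     * Multiplies each language's probability by (smoothing + P(ngram | language)),
     * once per occurrence of the n-gram. Returns false if the n-gram is unknown.
     */
    static boolean updateLangProb(double[] prob, Map<String, double[]> ngramProfiles,
                                  String ngram, int count, double smoothing) {
        double[] perLanguage = ngramProfiles.get(ngram);
        if (perLanguage == null) {
            return false; // unknown n-gram: leave the probabilities untouched
        }
        for (int i = 0; i < prob.length; i++) {
            // Raising to the power of count applies the factor once per occurrence.
            prob[i] *= Math.pow(smoothing + perLanguage[i], count);
        }
        return true;
    }

    public static void main(String[] args) {
        double[] prob = {0.5, 0.5};
        Map<String, double[]> profiles = Map.of("ab", new double[] {0.5, 0.0});
        updateLangProb(prob, profiles, "ab", 2, 0.5);
        System.out.println(Arrays.toString(prob));
    }
}
```

The smoothing term keeps a language from dropping to exactly zero because of a single n-gram it has never seen. A real implementation would presumably also normalize the array afterwards, which this sketch leaves out.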
-
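sortProbability
@NotNull private List<DetectedLanguage> sortProbability(double[] prob)
Returns the detected languages sorted by probability, descending. Languages with a probability below PROB_THRESHOLD (presumably the probabilityThreshold field) are ignored.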
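The filter-then-sort step described here can be sketched as follows. The class SortProbabilitySketch, the nested Scored type (a stand-in for DetectedLanguage), and the explicit languages and threshold parameters are hypothetical; whether the real cutoff is inclusive or exclusive is not stated on this page, and this sketch keeps values equal to the threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortProbabilitySketch {
    /** Hypothetical stand-in for DetectedLanguage. */
    static final class Scored {
        final String language;
        final double probability;

        Scored(String language, double probability) {
            this.language = language;
            this.probability = probability;
        }
    }

    /** Drops entries below the threshold, then sorts the rest best first. */
    static List<Scored> sortProbability(String[] languages, double[] prob, double threshold) {
        List<Scored> result = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) {
            if (prob[i] >= threshold) {
                result.add(new Scored(languages[i], prob[i]));
            }
        }
        result.sort(Comparator.comparingDouble((Scored s) -> s.probability).reversed());
        return result;
    }

    public static void main(String[] args) {
        String[] languages = {"da", "no", "en"};
        double[] prob = {0.2, 0.75, 0.05};
        for (Scored s : sortProbability(languages, prob, 0.1)) {
            System.out.println(s.language + " " + s.probability);
        }
    }
}
```

With these inputs "en" falls below the cutoff and is dropped, and "no" is listed before "da".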
-