Class LangProfile
java.lang.Object
com.optimaize.langdetect.cybozu.util.LangProfile
All Implemented Interfaces: Serializable
Deprecated. Replaced by LanguageProfile.
LangProfile is a language profile class. Users do not use this class directly.
TODO: split into a builder and an immutable class.
TODO: currently this only makes n-grams that include the space before a word; no n-gram includes the space after the word. Example: "foo" creates " fo" as a 3-gram, but not "oo ". Either this is a bug or, if intended, it needs documentation.
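The padding behaviour described in the TODO above can be illustrated with a minimal sketch. It is not the library's n-gram extraction code: the class `LeadingSpaceNGrams` and the method `grams` are hypothetical names, and the sketch simply assumes that a single space is prepended to the word and none is appended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of com.optimaize.langdetect.
final class LeadingSpaceNGrams {

    /** Returns every n-gram of length n from the word, padded with one leading space only. */
    static List<String> grams(String word, int n) {
        String padded = " " + word; // space before the word; no trailing space is appended
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            result.add(padded.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        // "foo" yields " fo" and "foo", but never "oo ".
        System.out.println(grams("foo", 3)); // prints [ fo, foo]
    }
}
```

Under this assumption, appending a trailing space to `padded` would be the one-line change that also produces "oo ".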
Field Summary
Fields (Modifier and Type, Field, Description):
freq: Deprecated. Key = ngram, value = count.
private static final int LESS_FREQ_RATIO: Deprecated. Explanation by example: If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10.
private static final int MINIMUM_FREQ: Deprecated. n-grams that occur fewer times than this can be removed using omitLessFreq().
private String name: Deprecated. The language name (identifier).
private int[] nWords: Deprecated. Tells how many occurrences of n-grams exist per gram length.
private static final long serialVersionUID: Deprecated.
Constructor Summary
Constructors (Constructor, Description):
LangProfile(): Deprecated. Constructor for JSONIC.
LangProfile(String name): Deprecated. Normal constructor.
Method Summary
Methods (Modifier and Type, Method, Description):
void add: Deprecated. Adds an n-gram to the profile.
getFreq(): Deprecated.
getName(): Deprecated.
int[] getNWords(): Deprecated.
void omitLessFreq(): Deprecated. Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams.
void setFreq: Deprecated.
void setName: Deprecated.
void setNWords(int[] nWords): Deprecated.
Field Details
serialVersionUID
private static final long serialVersionUID
Deprecated.
MINIMUM_FREQ
private static final int MINIMUM_FREQ
Deprecated. n-grams that occur fewer times than this can be removed using omitLessFreq(). This number can change, see LESS_FREQ_RATIO.
LESS_FREQ_RATIO
private static final int LESS_FREQ_RATIO
Deprecated. Explanation by example:
If the most frequent n-gram occurs 1 million times, then 1'000'000 / this (100'000) = 10. 10 is larger than MINIMUM_FREQ (2), thus MINIMUM_FREQ remains at 2. All n-grams that occur fewer than 2 times can be removed as noise using omitLessFreq().
If the most frequent n-gram occurs 5000 times, then 5'000 / this (100'000) = 0.05. 0.05 is smaller than MINIMUM_FREQ (2), thus MINIMUM_FREQ becomes 0. No n-grams are removed because of insignificance when calling omitLessFreq().
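The two worked cases above can be reproduced with a short sketch. It follows only the arithmetic stated in this description and is not taken from the class's source: the class name `FreqThresholdSketch` and the method `effectiveMinimumFreq` are hypothetical, the constant values 2 and 100'000 are the ones quoted above, and the use of integer division and of "the smaller of the two values" is an assumption made to match the examples.

```java
// Hypothetical illustration of the documented arithmetic; not the library's code.
final class FreqThresholdSketch {
    static final int MINIMUM_FREQ = 2;          // value quoted in the description
    static final int LESS_FREQ_RATIO = 100_000; // value quoted in the description

    /**
     * Effective cut-off as the description states it: the quotient replaces
     * MINIMUM_FREQ only when it is smaller (assumption: integer division).
     */
    static int effectiveMinimumFreq(int mostFrequentCount) {
        int quotient = mostFrequentCount / LESS_FREQ_RATIO; // 5'000 / 100'000 truncates to 0
        return Math.min(MINIMUM_FREQ, quotient);
    }

    public static void main(String[] args) {
        System.out.println(effectiveMinimumFreq(1_000_000)); // prints 2
        System.out.println(effectiveMinimumFreq(5_000));     // prints 0
    }
}
```

With a cut-off of 0 no count can fall below it, which matches the statement that no n-grams are removed in the second case.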
name
Deprecated. The language name (identifier).
freq
Deprecated. Key = ngram, value = count. All n-grams are in here (1-gram, 2-gram, 3-gram).
nWords
private int[] nWords
Deprecated. Tells how many occurrences of n-grams exist per gram length. When making 1-grams, 2-grams and 3-grams (as is currently done), this contains 3 entries:
element 0 = number of occurrences of 1-grams
element 1 = number of occurrences of 2-grams
element 2 = number of occurrences of 3-grams
Example: if there are 57 distinct 1-grams (the English language has about that many) and the training text is fairly long, then element 0 is in the millions.
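How freq and nWords relate can be shown with a small sketch of the bookkeeping. This is an assumption-based illustration, not the class's implementation: `ProfileCountsSketch` and `MAX_GRAM_LENGTH` are hypothetical names, and the sketch assumes that adding a gram increments both its entry in freq and the nWords slot for its length.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the LangProfile implementation.
final class ProfileCountsSketch {
    static final int MAX_GRAM_LENGTH = 3; // assumption: 1-grams, 2-grams and 3-grams

    final Map<String, Integer> freq = new HashMap<>(); // key = n-gram, value = count
    final int[] nWords = new int[MAX_GRAM_LENGTH];     // occurrences per gram length

    void add(String gram) {
        if (gram == null || gram.isEmpty() || gram.length() > MAX_GRAM_LENGTH) {
            return; // ignore anything that is not a 1-, 2- or 3-gram
        }
        nWords[gram.length() - 1]++;       // element 0 counts 1-grams, element 1 counts 2-grams, ...
        freq.merge(gram, 1, Integer::sum); // insert with count 1 or increment the existing count
    }

    public static void main(String[] args) {
        ProfileCountsSketch profile = new ProfileCountsSketch();
        for (String gram : new String[] {"a", "a", "b", "ab", " ab"}) {
            profile.add(gram);
        }
        System.out.println(Arrays.toString(profile.nWords)); // prints [3, 1, 1]
        System.out.println(profile.freq.get("a"));           // prints 2
    }
}
```

Note that nWords counts occurrences, not distinct grams: the two distinct 1-grams "a" and "b" contribute 3 to element 0.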
Constructor Details
LangProfile
public LangProfile()
Deprecated. Constructor for JSONIC.
LangProfile
Deprecated. Normal constructor.
Parameters:
name - language name
Method Details
add
Deprecated. Adds an n-gram to the profile.
Parameters:
gram - the n-gram to add
omitLessFreq
public void omitLessFreq()
Deprecated. Removes n-grams that occur fewer times than MINIMUM_FREQ to get rid of rare n-grams. Also removes ASCII n-grams if the total number of ASCII n-grams is less than one third of the total. This is done because non-Latin text (such as Chinese) often has some Latin noise in between.
TODO: split the two cleaning steps into separate methods.
TODO: distinguish ASCII from Latin. Currently it looks for Latin only; it should include characters with diacritics, e.g. Vietnamese.
TODO: the current code counts ASCII but removes any Latin. Is that desired? If so, this needs documentation.
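The two cleaning steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method's actual code: `OmitLessFreqSketch` and `clean` are hypothetical names, the threshold is passed in rather than derived from MINIMUM_FREQ, "the total" is taken to mean all remaining n-gram occurrences, ASCII is approximated by the letters A-Z and a-z, and the sketch does not update nWords.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the LangProfile implementation.
final class OmitLessFreqSketch {

    /** Drops rare n-grams, then drops Latin noise if ASCII-letter n-grams are under a third of the total. */
    static void clean(Map<String, Integer> freq, int threshold) {
        // Step 1: remove n-grams that occur fewer times than the threshold.
        freq.values().removeIf(count -> count < threshold);

        // Step 2: compare occurrences of pure ASCII-letter n-grams with all occurrences.
        long total = 0;
        long ascii = 0;
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            total += entry.getValue();
            if (entry.getKey().matches("[A-Za-z]+")) {
                ascii += entry.getValue();
            }
        }
        if (ascii * 3 < total) {
            // Assumption: any n-gram containing an ASCII letter is treated as noise.
            freq.keySet().removeIf(key -> key.matches(".*[A-Za-z].*"));
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("\u4e2d", 10);      // Chinese 1-gram
        freq.put("\u56fd", 8);       // Chinese 1-gram
        freq.put("\u4e2d\u56fd", 6); // Chinese 2-gram
        freq.put("a", 2);            // Latin noise that survives step 1
        freq.put("ab", 1);           // rare, removed in step 1
        clean(freq, 2);
        System.out.println(freq.size()); // prints 3
    }
}
```

Keeping the two steps in one method mirrors the current description; the first TODO suggests they would be clearer as separate methods.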
getName
Deprecated.
setName
Deprecated.
getFreq
Deprecated.
setFreq
Deprecated.
getNWords
public int[] getNWords()
Deprecated.
setNWords
public void setNWords(int[] nWords)
Deprecated.