distances¶
_bag¶
abydos.distance._bag.
Bag similarity & distance
- class distances._bag.Bag(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Bag distance.
Bag distance is proposed in :cite:`Bartolini:2002`. It is defined as
\[dist_{bag}(src, tar) = \max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\]
Added in version 0.3.6.
Methods
dist(src, tar) Return the normalized bag distance between two strings.
dist_abs(src, tar[, normalized]) Return the bag distance between two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the token similarity of two strings.
- __init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize Bag instance.
- Parameters:
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_type str
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric _Distance
A string distance measure class for use in the soft and fuzzy variants.
- threshold float
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized bag distance between two strings.
Bag distance is normalized by dividing by \(max( |src|, |tar| )\).
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
Normalized bag distance
Examples
>>> cmp = Bag()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.0
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) float¶
Return the bag distance between two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- normalized bool
Normalizes to [0, 1] if True
- Returns:
- int or float
Bag distance
Examples
>>> cmp = Bag()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
0
>>> cmp.dist_abs('abcdefg', 'hijklm')
7
>>> cmp.dist_abs('abcdefg', 'hijklmno')
8
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
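The bag distance definition can be sketched with collections.Counter, treating each string as a multiset of characters. This is an illustrative sketch rather than the abydos implementation: the function names bag_distance and normalized_bag_distance are hypothetical, and character-level tokens are assumed because they reproduce the documented example values.

```python
from collections import Counter


def bag_distance(src: str, tar: str) -> int:
    """Return max(|src - tar|, |tar - src|) over character multisets."""
    src_bag = Counter(src)
    tar_bag = Counter(tar)
    # Counter subtraction keeps only positive counts (multiset difference).
    src_only = sum((src_bag - tar_bag).values())
    tar_only = sum((tar_bag - src_bag).values())
    return max(src_only, tar_only)


def normalized_bag_distance(src: str, tar: str) -> float:
    """Divide the bag distance by max(|src|, |tar|)."""
    longest = max(len(src), len(tar))
    # Assumption: two empty strings are treated as identical.
    if longest == 0:
        return 0.0
    return bag_distance(src, tar) / longest
```

Because only character counts matter, any two anagrams such as 'ATCG' and 'TAGC' are at distance 0.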
_baulieu_xiii¶
abydos.distance._baulieu_xiii.
Baulieu XIII distance
- class distances._baulieu_xiii.BaulieuXIII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Baulieu XIII distance.
For two sets X and Y and a population N, Baulieu XIII distance :cite:`Baulieu:1997` is
\[dist_{BaulieuXIII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]
This is Baulieu’s 31st dissimilarity coefficient. This coefficient fails Baulieu’s (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\) with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXIII} = \frac{b+c}{a+b+c+a \cdot (a-4)^2}\]
Added in version 0.4.0.
Methods
dist(src, tar) Return the Baulieu XIII distance of two strings.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the token similarity of two strings.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize BaulieuXIII instance.
- Parameters:
- alphabet Counter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_type str
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric _Distance
A string distance measure class for use in the soft and fuzzy variants.
- threshold float
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the Baulieu XIII distance of two strings.
- Parameters:
- src str
Source string (or QGrams/Counter objects) for comparison
- tar str
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Baulieu XIII distance
Examples
>>> cmp = BaulieuXIII()
>>> cmp.dist('cat', 'hat')
0.2857142857142857
>>> cmp.dist('Niall', 'Neil')
0.4117647058823529
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
1.0
Added in version 0.4.0.
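The confusion-table form of Baulieu XIII can be evaluated directly from the counts a, b, and c. The sketch below is not the abydos implementation: the function name baulieu_xiii is hypothetical, and returning 0.0 for an all-zero table is an assumption.

```python
def baulieu_xiii(a: float, b: float, c: float) -> float:
    """Compute (b + c) / (a + b + c + a * (a - 4)^2) from confusion-table counts.

    a: tokens shared by both strings
    b: tokens only in the source
    c: tokens only in the target
    """
    numerator = b + c
    denominator = a + b + c + a * (a - 4) ** 2
    # Assumption: an empty comparison (a = b = c = 0) yields distance 0.0.
    if denominator == 0:
        return 0.0
    return numerator / denominator
```

Assuming padded bigram tokens, 'cat' and 'hat' give a = b = c = 2, so the distance is 4 / 14, which agrees with the first example above.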
_character¶
abydos.tokenizer._character.
Character tokenizer
- class distances._character.CharacterTokenizer(scaler: str | Callable[[float], float] | None = None)¶
Bases: _Tokenizer
A character tokenizer.
Added in version 0.4.0.
Methods
count() Return token count.
count_unique() Return the number of unique elements.
get_counter() Return the tokens as a Counter object.
get_list() Return the tokens as an ordered list.
get_set() Return the unique tokens as a set.
tokenize(string) Tokenize the term and store it.
- __init__(scaler: str | Callable[[float], float] | None = None) None¶
Initialize tokenizer.
- Parameters:
- scaler None, str, or function
A scaling function for the Counter:
None : no scaling
‘set’ : All non-zero values are set to 1.
a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
Added in version 0.4.0.
- tokenize(string: str) CharacterTokenizer¶
Tokenize the term and store it.
The tokenized term is stored as an ordered list and as a Counter object.
- Parameters:
- string str
The string to tokenize
Examples
>>> CharacterTokenizer().tokenize('AACTAGAAC')
CharacterTokenizer({'A': 5, 'C': 2, 'T': 1, 'G': 1})
Added in version 0.4.0.
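The effect of the scaler options on character counts can be illustrated with a plain Counter. This is a sketch under stated assumptions, not the CharacterTokenizer class itself: the helper character_counts and the Scaler alias are hypothetical names, and raising ValueError for an unrecognized scaler is an assumption.

```python
import math
from collections import Counter
from typing import Callable, Dict, Optional, Union

Scaler = Optional[Union[str, Callable[[float], float]]]


def character_counts(string: str, scaler: Scaler = None) -> Dict[str, float]:
    """Count characters, then optionally rescale each count."""
    counts = Counter(string)
    if scaler is None:
        # No scaling: raw occurrence counts.
        return dict(counts)
    if scaler == 'set':
        # Every character that occurs is given the value 1.
        return {token: 1 for token in counts}
    if callable(scaler):
        # Apply the function to each count, e.g. math.sqrt or math.log1p.
        return {token: scaler(count) for token, count in counts.items()}
    raise ValueError('scaler must be None, "set", or a callable')


print(character_counts('AACTAGAAC'))
print(character_counts('AACTAGAAC', scaler='set'))
print(character_counts('AACTAGAAC', scaler=math.log1p))
```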
_clement¶
abydos.distance._clement.
Clement similarity
- class distances._clement.Clement(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Clement similarity.
For two sets X and Y and a population N, Clement similarity :cite:`Clement:1976` is defined as
\[sim_{Clement}(X, Y) = \frac{|X \cap Y|}{|X|}\Big(1-\frac{|X|}{|N|}\Big) + \frac{|(N \setminus X) \setminus Y|}{|N \setminus X|} \Big(1-\frac{|N \setminus X|}{|N|}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Clement} = \frac{a}{a+b}\Big(1 - \frac{a+b}{n}\Big) + \frac{d}{c+d}\Big(1 - \frac{c+d}{n}\Big)\]
Added in version 0.4.0.
Methods
dist(src, tar) Return distance.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the Clement similarity of two strings.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize Clement instance.
- Parameters:
- alphabet Counter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_type str
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric _Distance
A string distance measure class for use in the soft and fuzzy variants.
- threshold float
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Clement similarity of two strings.
- Parameters:
- src str
Source string (or QGrams/Counter objects) for comparison
- tar str
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Clement similarity
Examples
>>> cmp = Clement()
>>> cmp.sim('cat', 'hat')
0.5025379382522239
>>> cmp.sim('Niall', 'Neil')
0.33840586363079933
>>> cmp.sim('aluminum', 'Catalan')
0.12119877280918714
>>> cmp.sim('ATCG', 'TAGC')
0.006336616803332366
Added in version 0.4.0.
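The confusion-table form of Clement similarity can be computed from the four counts a, b, c, and d. The sketch below is not the abydos implementation: the function name clement_similarity is hypothetical, and skipping a term whose denominator is zero is an assumption.

```python
def clement_similarity(a: float, b: float, c: float, d: float) -> float:
    """Compute a/(a+b) * (1 - (a+b)/n) + d/(c+d) * (1 - (c+d)/n), n = a+b+c+d."""
    n = a + b + c + d
    if n == 0:
        # Assumption: an empty population contributes no similarity.
        return 0.0
    score = 0.0
    if a + b > 0:
        # Share of source tokens found in the target, weighted by 1 - |X|/|N|.
        score += a / (a + b) * (1 - (a + b) / n)
    if c + d > 0:
        # Share of non-source tokens also absent from the target.
        score += d / (c + d) * (1 - (c + d) / n)
    return score
```

With a = b = c = 2 and an assumed population of n = 784 (so d = 778), the result is approximately 0.5025, in line with the first example above.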
_cormode_lz¶
abydos.distance._cormode_lz.
Cormode’s LZ distance
- class distances._cormode_lz.CormodeLZ(**kwargs: Any)¶
Bases: _Distance
Cormode’s LZ distance.
Cormode’s LZ distance :cite:`Cormode:2000,Cormode:2003`
Added in version 0.4.0.
Methods
dist(src, tar) Return the normalized Cormode's LZ distance of two strings.
dist_abs(src, tar) Return the Cormode's LZ distance of two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return similarity.
- __init__(**kwargs: Any) None¶
Initialize CormodeLZ instance.
- Parameters:
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized Cormode’s LZ distance of two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
Normalized Cormode’s LZ distance
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.75
Added in version 0.4.0.
- dist_abs(src: str, tar: str) float¶
Return the Cormode’s LZ distance of two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
Cormode’s LZ distance
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
5
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
Added in version 0.4.0.
_damerau_levenshtein¶
abydos.distance._damerau_levenshtein.
Damerau-Levenshtein distance
- class distances._damerau_levenshtein.DamerauLevenshtein(cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶
Bases: _Distance
Damerau-Levenshtein distance.
This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`. Damerau-Levenshtein code is based on Java code by Kevin L. Stern :cite:`Stern:2014`, under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
Methods
dist(src, tar) Return the normalized Damerau-Levenshtein distance between two strings.
dist_abs(src, tar) Return the Damerau-Levenshtein distance between two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return similarity.
- __init__(cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶
Initialize DamerauLevenshtein instance.
- Parameters:
- cost tuple
A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- normalizer function
A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized Damerau-Levenshtein distance between two strings.
Damerau-Levenshtein distance normalized to the interval [0, 1].
The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
The normalized Damerau-Levenshtein distance
Examples
>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float¶
Return the Damerau-Levenshtein distance between two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- int (may return a float if cost has float values)
The Damerau-Levenshtein distance between src & tar
- Raises:
- ValueError
Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.
Examples
>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
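For orientation, the unrestricted Damerau-Levenshtein distance with unit costs can be written as the textbook dynamic program below. It is a sketch, not the abydos code or a port of Stern's Java code: the function name damerau_levenshtein is hypothetical, and the cost tuple is fixed at the default (1, 1, 1, 1), so neither the configurable costs nor the ValueError check is reproduced.

```python
from typing import Dict


def damerau_levenshtein(src: str, tar: str) -> int:
    """Unit-cost Damerau-Levenshtein distance with unrestricted transpositions."""
    last_row: Dict[str, int] = {}  # last row of src in which each character occurred
    max_dist = len(src) + len(tar)
    # The table carries an extra sentinel row and column filled with max_dist.
    d = [[0] * (len(tar) + 2) for _ in range(len(src) + 2)]
    d[0][0] = max_dist
    for i in range(len(src) + 1):
        d[i + 1][0] = max_dist
        d[i + 1][1] = i
    for j in range(len(tar) + 1):
        d[0][j + 1] = max_dist
        d[1][j + 1] = j
    for i in range(1, len(src) + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, len(tar) + 1):
            k = last_row.get(tar[j - 1], 0)
            m = last_match_col
            if src[i - 1] == tar[j - 1]:
                cost = 0
                last_match_col = j
            else:
                cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + cost,   # match or substitution
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                d[k][m] + (i - k - 1) + 1 + (j - m - 1),  # transposition
            )
        last_row[src[i - 1]] = i
    return d[len(src) + 1][len(tar) + 1]
```

Dividing the result by max(len(src), len(tar)) corresponds to the default max normalizer when all costs are 1.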
_dice_asymmetric_i¶
abydos.distance._dice_asymmetric_i.
Dice’s Asymmetric I similarity
- class distances._dice_asymmetric_i.DiceAsymmetricI(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Dice’s Asymmetric I similarity.
For two sets X and Y and a population N, Dice’s Asymmetric I similarity :cite:`Dice:1945` is
\[sim_{DiceAsymmetricI}(X, Y) = \frac{|X \cap Y|}{|X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{DiceAsymmetricI} = \frac{a}{a+b}\]
Methods
dist(src, tar) Return distance.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the Dice's Asymmetric I similarity of two strings.
Notes
In terms of a confusion matrix, this is equivalent to precision or positive predictive value (ConfusionTable.precision()).
Added in version 0.4.0.
- __init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize DiceAsymmetricI instance.
- Parameters:
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_type str
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric _Distance
A string distance measure class for use in the soft and fuzzy variants.
- threshold float
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Dice’s Asymmetric I similarity of two strings.
- Parameters:
- src str
Source string (or QGrams/Counter objects) for comparison
- tar str
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Dice’s Asymmetric I similarity
Examples
>>> cmp = DiceAsymmetricI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.0.
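The set formula can be made concrete with a small q-gram sketch. Everything here is illustrative rather than the abydos implementation: padded_qgrams and dice_asymmetric_i are hypothetical names, and the '$' and '#' padding symbols with q = 2 are assumptions chosen because they reproduce the documented example values.

```python
from collections import Counter


def padded_qgrams(word: str, qval: int = 2, start: str = '$', stop: str = '#') -> Counter:
    """Return the multiset of q-grams of word, padded with start/stop symbols."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def dice_asymmetric_i(src: str, tar: str, qval: int = 2) -> float:
    """Compute |X ∩ Y| / |X|, where X and Y are the q-gram multisets of src and tar."""
    src_tokens = padded_qgrams(src, qval)
    tar_tokens = padded_qgrams(tar, qval)
    total = sum(src_tokens.values())
    if total == 0:
        return 0.0
    # Counter '&' takes the minimum count per token (multiset intersection).
    shared = sum((src_tokens & tar_tokens).values())
    return shared / total
```

The measure is asymmetric because only the source's token count appears in the denominator, so swapping src and tar generally changes the result.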
_discounted_levenshtein¶
abydos.distance._discounted_levenshtein.
Discounted Levenshtein edit distance
- class distances._discounted_levenshtein.DiscountedLevenshtein(mode: str = 'lev', normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, discount_from: int | str = 1, discount_func: str | ~typing.Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: ~typing.Any)¶
Bases: Levenshtein
Discounted Levenshtein distance.
This is a variant of Levenshtein distance for which edits later in a string have discounted cost, on the theory that earlier edits are less likely than later ones.
Added in version 0.4.1.
Methods
alignment(src, tar) Return the Levenshtein alignment of two strings.
dist(src, tar) Return the normalized Levenshtein distance between two strings.
dist_abs(src, tar) Return the Levenshtein distance between two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return similarity.
- __init__(mode: str = 'lev', normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, discount_from: int | str = 1, discount_func: str | ~typing.Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: ~typing.Any) None¶
Initialize DiscountedLevenshtein instance.
- Parameters:
- mode str
Specifies a mode for computing the discounted Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
- normalizer function
A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
- discount_from int or str
If an int is supplied, this is the first character whose edit cost will be discounted. If the str coda is supplied, discounting will start with the first non-vowel after the first vowel (the first syllable coda).
- discount_func str or function
The two supported str arguments are log, for a logarithmic discount function, and exp, for an exponential discount function. See notes below for information on how to supply your own discount function.
- vowels str
These are the letters to consider as vowels when discount_from is set to coda. It defaults to the English vowels ‘aeiou’, but it would be reasonable to localize this to other languages or to add orthographic semi-vowels like ‘y’, ‘w’, and even ‘h’.
- **kwargs
Arbitrary keyword arguments
Notes
This class is highly experimental and will need additional tuning.
The discount function can be passed as a callable function. It should expect an integer as its only argument and return a float, ideally less than or equal to 1.0. The argument represents the degree of discounting to apply.
Added in version 0.4.1.
- dist(src: str, tar: str) float¶
Return the normalized Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated in either of the supported modes) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
The normalized Levenshtein distance between src & tar
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3513958291799864
>>> cmp.dist('Niall', 'Neil')
0.5909885886270658
>>> cmp.dist('aluminum', 'Catalan')
0.8348163322045603
>>> cmp.dist('ATCG', 'TAGC')
0.7217609721523955
Added in version 0.4.1.
- dist_abs(src: str, tar: str) float¶
Return the Levenshtein distance between two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
The Levenshtein distance between src & tar
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2.526064024369237
>>> cmp.dist_abs('aluminum', 'Catalan')
5.053867269967515
>>> cmp.dist_abs('ATCG', 'TAGC')
2.594032108779918
>>> cmp = DiscountedLevenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
1.7482385137517997
>>> cmp.dist_abs('ACTG', 'TAGC')
3.342270622531718
Added in version 0.4.1.
_distance¶
abydos.distance._distance.
The distance._distance module implements abstract class _Distance.
_double_metaphone¶
abydos.phonetic._double_metaphone.
Double Metaphone
- class distances._double_metaphone.DoubleMetaphone(max_length: int = -1)¶
Bases: _Phonetic
Double Metaphone.
Based on Lawrence Philips’ (Visual) C++ code from 1999 :cite:`Philips:2000`.
Added in version 0.3.6.
Methods
encode(word) Return the Double Metaphone code for a word.
encode_alpha(word) Return the alphabetic Double Metaphone code for a word.
- __init__(max_length: int = -1) None¶
Initialize DoubleMetaphone instance.
- Parameters:
- max_length int
Maximum length of the returned Double Metaphone code; a value greater than 0 also activates the fixed-length code mode
Added in version 0.4.0.
- encode(word: str) str¶
Return the Double Metaphone code for a word.
- Parameters:
- word str
The word to transform
- Returns:
- str
The Double Metaphone value(s)
Examples
>>> pe = DoubleMetaphone()
>>> pe.encode('Christopher')
'KRSTFR,'
>>> pe.encode('Niall')
'NL,'
>>> pe.encode('Smith')
'SM0,XMT'
>>> pe.encode('Schmidt')
'XMT,SMT'
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Made return a str only (comma-separated)
- encode_alpha(word: str) str¶
Return the alphabetic Double Metaphone code for a word.
- Parameters:
- word str
The word to transform
- Returns:
- str
The alphabetic Double Metaphone value(s)
Examples
>>> pe = DoubleMetaphone()
>>> pe.encode_alpha('Christopher')
'KRSTFR,'
>>> pe.encode_alpha('Niall')
'NL,'
>>> pe.encode_alpha('Smith')
'SMÞ,XMT'
>>> pe.encode_alpha('Schmidt')
'XMT,SMT'
Added in version 0.4.0.
Changed in version 0.6.0: Made return a str only (comma-separated)
_editex¶
abydos.distance._editex.
editex
- class distances._editex.Editex(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any)¶
Bases: _Distance
Editex.
As described on pages 3 & 4 of :cite:`Zobel:1996`.
The local variant is based on :cite:`Ring:2009`.
Added in version 0.3.6.
Changed in version 0.4.0: Added taper option
Methods
dist(src, tar) Return the normalized Editex distance between two strings.
dist_abs(src, tar) Return the Editex distance between two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return similarity.
- __init__(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any) None¶
Initialize Editex instance.
- Parameters:
- cost tuple
A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2))
- local bool
If True, the local variant of Editex is used
- taper bool
Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized Editex distance between two strings.
The Editex distance is normalized by dividing the Editex distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
Normalized Editex distance
Examples
>>> cmp = Editex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.2
>>> cmp.dist('aluminum', 'Catalan')
0.75
>>> cmp.dist('ATCG', 'TAGC')
0.75
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float¶
Return the Editex distance between two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- int
Editex distance
Examples
>>> cmp = Editex()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
12
>>> cmp.dist_abs('ATCG', 'TAGC')
6
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
_fuzzywuzzy_partial_string¶
abydos.distance._fuzzywuzzy_partial_string.
FuzzyWuzzy Partial String similarity
- class distances._fuzzywuzzy_partial_string.FuzzyWuzzyPartialString(**kwargs: Any)¶
Bases: _Distance
FuzzyWuzzy Partial String similarity.
This follows the FuzzyWuzzy Partial String similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].
Added in version 0.4.0.
Methods
dist(src, tar) Return distance.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the FuzzyWuzzy Partial String similarity of two strings.
- sim(src: str, tar: str) float¶
Return the FuzzyWuzzy Partial String similarity of two strings.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
FuzzyWuzzy Partial String similarity
Examples
>>> cmp = FuzzyWuzzyPartialString()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.75
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.428571428571
>>> cmp.sim('ATCG', 'TAGC')
0.5
Added in version 0.4.0.
_fuzzywuzzy_token_set¶
abydos.distance._fuzzywuzzy_token_set.
FuzzyWuzzy Token Set similarity
- class distances._fuzzywuzzy_token_set.FuzzyWuzzyTokenSet(tokenizer: _Tokenizer | None = None, **kwargs: Any)¶
Bases: _TokenDistance
FuzzyWuzzy Token Set similarity.
This follows the FuzzyWuzzy Token Set similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0]. Distinct from the
Added in version 0.4.0.
Methods
dist(src, tar) Return distance.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the FuzzyWuzzy Token Set similarity of two strings.
- __init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) None¶
Initialize FuzzyWuzzyTokenSet instance.
- Parameters:
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the FuzzyWuzzy Token Set similarity of two strings.
- Parameters:
- src str
Source string (or QGrams/Counter objects) for comparison
- tar str
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
FuzzyWuzzy Token Set similarity
Examples
>>> cmp = FuzzyWuzzyTokenSet()
>>> cmp.sim('cat', 'hat')
0.75
>>> cmp.sim('Niall', 'Neil')
0.7272727272727273
>>> cmp.sim('aluminum', 'Catalan')
0.47058823529411764
>>> cmp.sim('ATCG', 'TAGC')
0.6
Added in version 0.4.0.
_fuzzywuzzy_token_sort¶
abydos.distance._fuzzywuzzy_token_sort.
FuzzyWuzzy Token Sort similarity
- class distances._fuzzywuzzy_token_sort.FuzzyWuzzyTokenSort(tokenizer: _Tokenizer | None = None, **kwargs: Any)¶
Bases: _TokenDistance
FuzzyWuzzy Token Sort similarity.
This follows the FuzzyWuzzy Token Sort similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].
Added in version 0.4.0.
Methods
dist(src, tar) Return distance.
dist_abs(src, tar) Return absolute distance.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return the FuzzyWuzzy Token Sort similarity of two strings.
- __init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) None¶
Initialize FuzzyWuzzyTokenSort instance.
- Parameters:
- tokenizer _Tokenizer
A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qval int
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the FuzzyWuzzy Token Sort similarity of two strings.
- Parameters:
- src str
Source string (or QGrams/Counter objects) for comparison
- tar str
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
FuzzyWuzzy Token Sort similarity
Examples
>>> cmp = FuzzyWuzzyTokenSort()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6666666666666666
>>> cmp.sim('aluminum', 'Catalan')
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5
Added in version 0.4.0.
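The idea behind token-sort matching (tokenize, sort the tokens, rejoin, then compare the resulting strings) can be sketched with the standard library. This is an approximation under stated assumptions, not the abydos implementation: token_sort_similarity is a hypothetical name, the letters-only regular expression stands in for the default regexp tokenizer, and difflib.SequenceMatcher is swapped in as the string comparison.

```python
import re
from difflib import SequenceMatcher


def token_sort_similarity(src: str, tar: str) -> float:
    """Compare two strings after sorting their letter-only tokens."""
    # Assumption: tokens are maximal runs of ASCII letters.
    src_sorted = ' '.join(sorted(re.findall(r'[A-Za-z]+', src)))
    tar_sorted = ' '.join(sorted(re.findall(r'[A-Za-z]+', tar)))
    # ratio() returns 2 * matches / (len(a) + len(b)), a float in [0.0, 1.0].
    return SequenceMatcher(None, src_sorted, tar_sorted).ratio()


print(token_sort_similarity('new york mets', 'mets new york'))
print(token_sort_similarity('cat', 'hat'))
```

Sorting makes the comparison insensitive to word order, which is why the two orderings of 'new york mets' compare as identical.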
_hamming¶
abydos.distance._hamming.
Hamming distance
- class distances._hamming.Hamming(diff_lens: bool = True, **kwargs: Any)¶
Bases: _Distance
Hamming distance.
Hamming distance :cite:`Hamming:1950` equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings’ lengths and adds to this the difference in string lengths.
Added in version 0.3.6.
Methods
dist(src, tar) Return the normalized Hamming distance between two strings.
dist_abs(src, tar) Return the Hamming distance between two strings.
set_params(**kwargs) Store params in the params dict.
sim(src, tar) Return similarity.
- __init__(diff_lens: bool = True, **kwargs: Any) None¶
Initialize Hamming instance.
- Parameters:
- diff_lens bool
If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized Hamming distance between two strings.
Hamming distance normalized to the interval [0, 1].
The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).
The arguments are identical to those of the hamming() function.
- Parameters:
- src str
Source string for comparison
- tar str
Target string for comparison
- Returns:
- float
Normalized Hamming distance
Examples
>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float¶
Return the Hamming distance between two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- int
The Hamming distance between src & tar
- Raises:
- ValueError
Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.
Examples
>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
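As a rough illustration of the diff_lens behavior described for this class, the following standalone sketch counts mismatches over the first n characters and then adds the difference in lengths. It is not the library's implementation; the names hamming_sketch and hamming_norm_sketch are hypothetical.

```python
def hamming_sketch(src, tar, diff_lens=True):
    """Illustrative Hamming distance (hypothetical helper, not abydos code)."""
    if len(src) != len(tar) and not diff_lens:
        raise ValueError('Undefined for sequences of unequal length')
    # Compare only the first n characters, n being the shorter length.
    mismatches = sum(1 for s, t in zip(src, tar) if s != t)
    # Count every unmatched trailing character as a mismatch.
    return mismatches + abs(len(src) - len(tar))


def hamming_norm_sketch(src, tar):
    """Divide by the longer string's length to land in [0, 1]."""
    if not src and not tar:
        return 0.0
    return hamming_sketch(src, tar) / max(len(src), len(tar))
```

With these definitions, hamming_sketch('Niall', 'Neil') gives 3: two mismatches in the first four positions plus one for the extra character.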
_indel¶
abydos.distance._indel.
Indel distance
- class distances._indel.Indel(**kwargs: Any)¶
Bases: Levenshtein
Indel distance.
This is equivalent to Levenshtein distance, when only inserts and deletes are possible.
Added in version 0.3.6.
Methods
alignment(src, tar)Return the Levenshtein alignment of two strings.
dist(src, tar)Return the normalized indel distance between two strings.
dist_abs(src, tar)Return the Levenshtein distance between two strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
- __init__(**kwargs: Any) None¶
Initialize Indel instance.
- Parameters:
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized indel distance between two strings.
This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Normalized indel distance
Examples
>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5
Added in version 0.3.6.
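When only inserts and deletes are allowed, the edit distance equals |src| + |tar| - 2·|LCS(src, tar)|, since every character outside a longest common subsequence must be deleted or inserted. The sketch below uses that identity; the function names are hypothetical, and dividing by the sum of the two lengths is an assumption chosen because it agrees with the example values shown above, not a statement about the library's internals.

```python
def lcs_length(src, tar):
    """Length of a longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(tar) + 1)
    for s in src:
        cur = [0]
        for j, t in enumerate(tar, 1):
            cur.append(prev[j - 1] + 1 if s == t else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_abs_sketch(src, tar):
    """Insert/delete-only edit distance (hypothetical helper)."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_norm_sketch(src, tar):
    """Normalize by the combined length (assumed normalization)."""
    total = len(src) + len(tar)
    return indel_abs_sketch(src, tar) / total if total else 0.0
```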
_iterative_substring¶
abydos.distance._iterative_substring.
Iterative-SubString (I-Sub) correlation
- class distances._iterative_substring.IterativeSubString(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any)¶
Bases: _Distance
Iterative-SubString correlation.
Iterative-SubString (I-Sub) correlation :cite:`Stoilos:2005`
This is a straightforward port of the primary author’s Java implementation: http://www.image.ece.ntua.gr/~gstoil/software/I_Sub.java
Added in version 0.4.0.
Methods
corr(src, tar)Return the Iterative-SubString correlation of two strings.
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the Iterative-SubString similarity of two strings.
- __init__(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any) None¶
Initialize IterativeSubString instance.
- Parameters:
- hamacherfloat
The constant factor for the Hamacher product
- normalize_stringsbool
Normalize the strings by removing the characters in '._ ' and lowercasing
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- corr(src: str, tar: str) float¶
Return the Iterative-SubString correlation of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Iterative-SubString correlation
Examples
>>> cmp = IterativeSubString()
>>> cmp.corr('cat', 'hat')
-1.0
>>> cmp.corr('Niall', 'Neil')
-0.9
>>> cmp.corr('aluminum', 'Catalan')
-1.0
>>> cmp.corr('ATCG', 'TAGC')
-1.0
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Iterative-SubString similarity of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Iterative-SubString similarity
Examples
>>> cmp = IterativeSubString()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.04999999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.0.
_kuhns_iii¶
abydos.distance._kuhns_iii.
Kuhns III correlation
- class distances._kuhns_iii.KuhnsIII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Kuhns III correlation.
For two sets X and Y and a population N, Kuhns III correlation :cite:`Kuhns:1965`, the excess of proportion of overlap over its independence value (P), is
\[corr_{KuhnsIII}(X, Y) = \frac{\delta(X, Y)}{\big(1-\frac{|X \cap Y|}{|X|+|Y|}\big) \big(|X|+|Y|-\frac{|X|\cdot|Y|}{|N|}\big)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsIII} = \frac{\delta(a+b, a+c)}{\big(1-\frac{a}{2a+b+c}\big) \big(2a+b+c-\frac{(a+b)(a+c)}{n}\big)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
Methods
corr(src, tar)Return the Kuhns III correlation of two strings.
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the Kuhns III similarity of two strings.
Notes
The coefficient presented in :cite:`Eidenberger:2014,Morris:2012` as Kuhns’ “Proportion of overlap above independence” is a significantly different coefficient, not evidenced in :cite:`Kuhns:1965`.
Added in version 0.4.0.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize KuhnsIII instance.
- Parameters:
- alphabetCounter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- corr(src: str, tar: str) float¶
Return the Kuhns III correlation of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Kuhns III correlation
Examples
>>> cmp = KuhnsIII()
>>> cmp.corr('cat', 'hat')
0.3307757885763001
>>> cmp.corr('Niall', 'Neil')
0.21873141468207793
>>> cmp.corr('aluminum', 'Catalan')
0.05707545392902886
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Kuhns III similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Kuhns III similarity
Examples
>>> cmp = KuhnsIII()
>>> cmp.sim('cat', 'hat')
0.498081841432225
>>> cmp.sim('Niall', 'Neil')
0.41404856101155846
>>> cmp.sim('aluminum', 'Catalan')
0.29280659044677165
>>> cmp.sim('ATCG', 'TAGC')
0.24760076775431863
Added in version 0.4.0.
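The 2x2 confusion table form of the Kuhns III formula can be evaluated directly from the four cell counts. The sketch below is a plain transcription of that formula, not the library's code; the function name is hypothetical. The counts used in the usage note (a=2, b=2, c=2, d=778) are an assumption: they correspond to padded bigrams of 'cat' and 'hat' with a population of 784, which happens to agree with the documented corr example.

```python
def kuhns_iii_corr_sketch(a, b, c, d):
    """Kuhns III correlation from 2x2 confusion table counts (illustrative)."""
    n = a + b + c + d
    if n == 0 or (2 * a + b + c) == 0:
        return 0.0
    # Expected overlap under independence.
    expected = (a + b) * (a + c) / n
    delta = a - expected
    denominator = (1 - a / (2 * a + b + c)) * (2 * a + b + c - expected)
    return delta / denominator if denominator else 0.0
```

Calling kuhns_iii_corr_sketch(2, 2, 2, 778) yields approximately 0.3308.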
_lcprefix¶
abydos.distance._lcprefix.
Longest common prefix
- class distances._lcprefix.LCPrefix(**kwargs: Any)¶
Bases: _Distance
Longest common prefix.
Added in version 0.4.0.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar, *args)Return the length of the longest common prefix of the strings.
lcprefix(strings)Return the longest common prefix of a list of strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar, *args)Return the longest common prefix similarity of two or more strings.
- dist_abs(src: str, tar: str, *args: str) int¶
Return the length of the longest common prefix of the strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- *argsstrs
Additional strings for comparison
- Returns:
- int
The length of the longest common prefix
- Raises:
- ValueError
All arguments must be of type str
Examples
>>> pfx = LCPrefix()
>>> pfx.dist_abs('cat', 'hat')
0
>>> pfx.dist_abs('Niall', 'Neil')
1
>>> pfx.dist_abs('aluminum', 'Catalan')
0
>>> pfx.dist_abs('ATCG', 'TAGC')
0
Added in version 0.4.0.
- lcprefix(strings: List[str]) str¶
Return the longest common prefix of a list of strings.
Longest common prefix (LCPrefix).
- Parameters:
- stringslist of strings
Strings for comparison
- Returns:
- str
The longest common prefix
Examples
>>> pfx = LCPrefix()
>>> pfx.lcprefix(['cat', 'hat'])
''
>>> pfx.lcprefix(['Niall', 'Neil'])
'N'
>>> pfx.lcprefix(['aluminum', 'Catalan'])
''
>>> pfx.lcprefix(['ATCG', 'TAGC'])
''
Added in version 0.4.0.
- sim(src: str, tar: str, *args: str) float¶
Return the longest common prefix similarity of two or more strings.
Longest common prefix similarity (\(sim_{LCPrefix}\)).
This employs the LCPrefix function to derive a similarity metric: \(sim_{LCPrefix}(s,t) = \frac{|LCPrefix(s,t)|}{max(|s|, |t|)}\)
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- *argsstrs
Additional strings for comparison
- Returns:
- float
LCPrefix similarity
Examples
>>> pfx = LCPrefix()
>>> pfx.sim('cat', 'hat')
0.0
>>> pfx.sim('Niall', 'Neil')
0.2
>>> pfx.sim('aluminum', 'Catalan')
0.0
>>> pfx.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.0.
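The prefix similarity formula given for this class is simple enough to sketch in a few lines. The code below is a standalone illustration with hypothetical function names, not the library's implementation; normalizing by the longest of all supplied strings when more than two are given is an assumption.

```python
def lcprefix_sketch(strings):
    """Return the longest prefix shared by every string in the list."""
    if not strings:
        return ''
    prefix = []
    # zip stops at the shortest string, so no index can run past an end.
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return ''.join(prefix)


def lcprefix_sim_sketch(src, tar, *args):
    """Prefix length divided by the longest input length."""
    strings = [src, tar, *args]
    longest = max(len(s) for s in strings)
    if longest == 0:
        return 1.0
    return len(lcprefix_sketch(strings)) / longest
```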
_lcsseq¶
abydos.distance._lcsseq.
Longest common subsequence
- class distances._lcsseq.LCSseq(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶
Bases: _Distance
Longest common subsequence.
Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.
Added in version 0.3.6.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
lcsseq(src, tar)Return the longest common subsequence of two strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the longest common subsequence similarity of two strings.
- __init__(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any) None¶
Initialize LCSseq.
- Parameters:
- normalizerfunction
A normalization function for the normalized similarity & distance. By default, the max of the lengths of the input strings. If lambda x: sum(x)/2.0 is supplied, the normalization proposed in :cite:`Radev:2001` is used, i.e. \(\frac{2 \cdot |LCS(src, tar)|}{|src| + |tar|}\).
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- lcsseq(src: str, tar: str) str¶
Return the longest common subsequence of two strings.
Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence :cite:`rosettacode:2018b`. This is licensed GFDL 1.2.
- Modifications include:
conversion to a numpy array in place of a list of lists
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- str
The longest common subsequence
Examples
>>> sseq = LCSseq()
>>> sseq.lcsseq('cat', 'hat')
'at'
>>> sseq.lcsseq('Niall', 'Neil')
'Nil'
>>> sseq.lcsseq('aluminum', 'Catalan')
'aln'
>>> sseq.lcsseq('ATCG', 'TAGC')
'AC'
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim(src: str, tar: str) float¶
Return the longest common subsequence similarity of two strings.
Longest common subsequence similarity (\(sim_{LCSseq}\)).
This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
LCSseq similarity
Examples
>>> sseq = LCSseq()
>>> sseq.sim('cat', 'hat')
0.6666666666666666
>>> sseq.sim('Niall', 'Neil')
0.6
>>> sseq.sim('aluminum', 'Catalan')
0.375
>>> sseq.sim('ATCG', 'TAGC')
0.5
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.4.0: Added normalization option
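To make the role of the normalizer concrete, here is a minimal pure-Python sketch of the dynamic programming approach with a pluggable normalization function. It uses a list of lists rather than a numpy array, the function names are hypothetical, and when several subsequences of equal length exist it may return a different one than the library does.

```python
def lcsseq_sketch(src, tar):
    """Return one longest common subsequence of two strings."""
    rows, cols = len(src), len(tar)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if src[i] == tar[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i + 1][j], table[i][j + 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = rows, cols
    while i and j:
        if src[i - 1] == tar[j - 1]:
            out.append(src[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))


def lcsseq_sim_sketch(src, tar, normalizer=max):
    """LCS length divided by normalizer([len(src), len(tar)])."""
    if not src and not tar:
        return 1.0
    return len(lcsseq_sketch(src, tar)) / normalizer([len(src), len(tar)])
```

Passing normalizer=lambda x: sum(x) / 2.0 switches to the Radev-style normalization mentioned in the parameter description; for 'Niall' and 'Neil' this changes the result from 3/5 to 3/4.5.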
_levenshtein¶
abydos.distance._levenshtein.
The distance._levenshtein module implements string edit distance functions based on Levenshtein distance, including:
Levenshtein distance
Optimal String Alignment distance
- class distances._levenshtein.Levenshtein(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any)¶
Bases: _Distance
Levenshtein distance.
This is the standard edit distance measure. Cf. :cite:`Levenshtein:1965,Levenshtein:1966`.
Optimal string alignment (aka restricted Damerau-Levenshtein distance) :cite:`Boytsov:2011` is also supported.
The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm :cite:`Wagner:1974`.
Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.
Added in version 0.3.6.
Changed in version 0.4.0: Added taper option
Methods
alignment(src, tar)Return the Levenshtein alignment of two strings.
dist(src, tar)Return the normalized Levenshtein distance between two strings.
dist_abs(src, tar)Return the Levenshtein distance between two strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
- __init__(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any) None¶
Initialize Levenshtein instance.
- Parameters:
- modestr
Specifies a mode for computing the Levenshtein distance:
- lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
- osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
- costtuple
A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- normalizerfunction
A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
- taperbool
Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- alignment(src: str, tar: str) Tuple[float, str, str]¶
Return the Levenshtein alignment of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- tuple
A tuple containing the Levenshtein distance and the two strings, aligned.
Examples
>>> cmp = Levenshtein()
>>> cmp.alignment('cat', 'hat')
(1.0, 'cat', 'hat')
>>> cmp.alignment('Niall', 'Neil')
(3.0, 'N-iall', 'Nei-l-')
>>> cmp.alignment('aluminum', 'Catalan')
(7.0, '-aluminum', 'Catalan--')
>>> cmp.alignment('ATCG', 'TAGC')
(3.0, 'ATCG-', '-TAGC')
>>> cmp = Levenshtein(mode='osa')
>>> cmp.alignment('ATCG', 'TAGC')
(2.0, 'ATCG', 'TAGC')
>>> cmp.alignment('ACTG', 'TAGC')
(4.0, 'ACT-G-', '--TAGC')
Added in version 0.4.1.
- dist(src: str, tar: str) float¶
Return the normalized Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
The normalized Levenshtein distance between src & tar
Examples
>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float¶
Return the Levenshtein distance between two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- int (may return a float if cost has float values)
The Levenshtein distance between src & tar
Examples
>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3
>>> cmp = Levenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('ACTG', 'TAGC')
4
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim(src: str, tar: str)¶
Return similarity.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Similarity
Added in version 0.3.6.
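The two modes and the four-element cost tuple can be illustrated with a compact Wagner-Fischer table. This is a simplified sketch with hypothetical function names, not the library's code: it omits the taper option and the alignment traceback, and its normalization follows the description of dist given above.

```python
def edit_distance_sketch(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Levenshtein ('lev') or Optimal String Alignment ('osa') distance."""
    ins, dele, sub, trans = cost
    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele          # delete every character of src
    for j in range(1, cols):
        d[0][j] = j * ins           # insert every character of tar
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + dele,
                d[i][j - 1] + ins,
                d[i - 1][j - 1] + (0 if src[i - 1] == tar[j - 1] else sub),
            )
            # OSA additionally allows swapping two adjacent characters.
            if (mode == 'osa' and i > 1 and j > 1
                    and src[i - 1] == tar[j - 2] and src[i - 2] == tar[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans)
    return d[-1][-1]


def edit_distance_norm_sketch(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Divide by max(len(src) * delete cost, len(tar) * insert cost)."""
    if src == tar:
        return 0.0
    ins, dele = cost[0], cost[1]
    return edit_distance_sketch(src, tar, mode, cost) / max(
        len(src) * dele, len(tar) * ins
    )
```

For 'ATCG' and 'TAGC', the lev mode needs three edits while the osa mode needs only two, because each adjacent pair can be transposed in a single step.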
_lig3¶
abydos.distance._lig3.
LIG3 similarity
- class distances._lig3.LIG3(**kwargs: Any)¶
Bases: _Distance
LIG3 similarity.
:cite:`Snae:2002` proposes three Levenshtein-ISG-Guth hybrid similarity measures: LIG1, LIG2, and LIG3. Of these, LIG1 is identical to ISG and LIG2 is identical to normalized Levenshtein similarity. Only LIG3 is a novel measure, defined as:
\[sim_{LIG3}(X, Y) = \frac{2I}{2I+C}\]
Here, I is the number of exact matches between the two words, truncated to the length of the shorter word, and C is the Levenshtein distance between the two words.
Added in version 0.4.1.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the LIG3 similarity of two words.
- sim(src: str, tar: str) float¶
Return the LIG3 similarity of two words.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
The LIG3 similarity
Examples
>>> cmp = LIG3()
>>> cmp.sim('cat', 'hat')
0.8
>>> cmp.sim('Niall', 'Neil')
0.5714285714285714
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.1.
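The LIG3 formula combines a positional match count with an edit distance, so a small worked sketch may help. The code below reads "exact matches" as characters that agree at the same position, which is an interpretation consistent with the examples above; the function names are hypothetical and unit edit costs are assumed.

```python
def levenshtein_sketch(src, tar):
    """Unit-cost Levenshtein distance using two rolling rows."""
    prev = list(range(len(tar) + 1))
    for i, s in enumerate(src, 1):
        cur = [i]
        for j, t in enumerate(tar, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (s != t)))
        prev = cur
    return prev[-1]


def lig3_sim_sketch(src, tar):
    """2I / (2I + C), with I positional matches and C the edit distance."""
    if src == tar:
        return 1.0
    # zip truncates to the shorter word, as the definition requires.
    matches = sum(1 for s, t in zip(src, tar) if s == t)
    edits = levenshtein_sketch(src, tar)
    return 2 * matches / (2 * matches + edits)
```

For 'cat' and 'hat', I = 2 and C = 1, so the similarity is 4/5.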
_ncd_bz2¶
abydos.distance._ncd_bz2.
NCD using bzip2
- class distances._ncd_bz2.NCDbz2(level: int = 9, **kwargs: Any)¶
Bases: _Distance
Normalized Compression Distance using bzip2 compression.
Cf. https://en.wikipedia.org/wiki/Bzip2
Normalized compression distance (NCD) :cite:`Cilibrasi:2005`.
Added in version 0.3.6.
Methods
dist(src, tar)Return the NCD between two strings using bzip2 compression.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
- __init__(level: int = 9, **kwargs: Any) None¶
Initialize bzip2 compressor.
- Parameters:
- levelint
The compression level (1 to 9)
Added in version 0.3.6.
Changed in version 0.3.6: Encapsulated in class
- dist(src: str, tar: str) float¶
Return the NCD between two strings using bzip2 compression.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Compression distance
Examples
>>> cmp = NCDbz2()
>>> cmp.dist('cat', 'hat')
0.06666666666666667
>>> cmp.dist('Niall', 'Neil')
0.03125
>>> cmp.dist('aluminum', 'Catalan')
0.17647058823529413
>>> cmp.dist('ATCG', 'TAGC')
0.03125
Added in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
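For orientation, normalized compression distance is commonly written as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The sketch below applies that general formula with Python's standard bz2 module. It is not the library's implementation, the function name is hypothetical, and its results are not expected to match the example values above, since the library may treat the compressed output differently.

```python
import bz2


def ncd_bz2_sketch(src, tar, level=9):
    """Generic NCD using bzip2 compressed lengths (illustrative only)."""
    if src == tar:
        return 0.0
    src_bytes = src.encode('utf-8')
    tar_bytes = tar.encode('utf-8')
    c_src = len(bz2.compress(src_bytes, level))
    c_tar = len(bz2.compress(tar_bytes, level))
    c_both = len(bz2.compress(src_bytes + tar_bytes, level))
    return (c_both - min(c_src, c_tar)) / max(c_src, c_tar)
```

On very short strings the fixed overhead of the bzip2 format dominates, so the values are mainly useful for longer inputs.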
_overlap¶
abydos.distance._overlap.
Overlap similarity & distance
- class distances._overlap.Overlap(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Overlap coefficient.
For two sets X and Y, the overlap coefficient :cite:`Szymkiewicz:1934,Simpson:1949`, also called the Szymkiewicz-Simpson coefficient and Simpson’s ecological coexistence coefficient, is
\[sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{overlap} = \frac{a}{min(a+b, a+c)}\]
Added in version 0.3.6.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the overlap coefficient of two strings.
- __init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize Overlap instance.
- Parameters:
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the overlap coefficient of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Overlap similarity
Examples
>>> cmp = Overlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
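To show how the overlap coefficient arises from token multisets, here is a small sketch using collections.Counter. It assumes character bigrams padded with '$' at the start and '#' at the end, which is consistent with the example values above; the function names are hypothetical and this is not the library's tokenizer.

```python
from collections import Counter


def qgram_bag_sketch(word, qval=2, start_stop='$#'):
    """Return the multiset of padded q-grams of a word."""
    if start_stop:
        pad = qval - 1
        word = start_stop[0] * pad + word + start_stop[-1] * pad
    return Counter(word[i:i + qval] for i in range(len(word) - qval + 1))


def overlap_sim_sketch(src, tar, qval=2):
    """|X & Y| / min(|X|, |Y|) over q-gram multisets."""
    if src == tar:
        return 1.0
    x, y = qgram_bag_sketch(src, qval), qgram_bag_sketch(tar, qval)
    smaller = min(sum(x.values()), sum(y.values()))
    if smaller == 0:
        return 0.0
    # Counter's & operator keeps the minimum count of each shared token.
    return sum((x & y).values()) / smaller
```

For 'cat' and 'hat' the bags are {$c, ca, at, t#} and {$h, ha, at, t#}; they share two tokens out of four, giving 0.5.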
_pearson_chi_squared¶
abydos.distance._pearson_chi_squared.
Pearson’s Chi-Squared similarity
- class distances._pearson_chi_squared.PearsonChiSquared(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Pearson’s Chi-Squared similarity.
For two sets X and Y and a population N, the Pearson’s \(\chi^2\) similarity :cite:`Pearson:1913` is
\[sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
This is also Pearson I similarity.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{PearsonChiSquared} = \frac{n(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
Added in version 0.4.0.
Methods
corr(src, tar)Return Pearson's Chi-Squared correlation of two strings.
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return Pearson's normalized Chi-Squared similarity of two strings.
sim_score(src, tar)Return Pearson's Chi-Squared similarity of two strings.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize PearsonChiSquared instance.
- Parameters:
- alphabetCounter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- corr(src: str, tar: str) float¶
Return Pearson’s Chi-Squared correlation of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Pearson’s Chi-Squared correlation
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return Pearson’s normalized Chi-Squared similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Normalized Pearson’s Chi-Squared similarity
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05
Added in version 0.4.0.
- sim_score(src: str, tar: str) float¶
Return Pearson’s Chi-Squared similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Pearson’s Chi-Squared similarity
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.sim_score('cat', 'hat')
193.99489809335964
>>> cmp.sim_score('Niall', 'Neil')
101.99771068526542
>>> cmp.sim_score('aluminum', 'Catalan')
9.19249664336649
>>> cmp.sim_score('ATCG', 'TAGC')
0.032298410951138765
Added in version 0.4.0.
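The 2x2 confusion table form of the chi-squared score is easy to check by hand. The sketch below transcribes that formula; the function name is hypothetical and it is not the library's code. The counts in the usage note (a=2, b=2, c=2, d=778) are an assumption corresponding to padded bigrams of 'cat' and 'hat' with a population of 784, which agrees with the sim_score example above.

```python
def chi_squared_sketch(a, b, c, d):
    """Pearson's chi-squared statistic for a 2x2 confusion table."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # At least one margin is empty, so the statistic is undefined;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

With those counts, chi_squared_sketch(2, 2, 2, 778) evaluates to roughly 193.995.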
_pearson_ii¶
abydos.distance._pearson_ii.
Pearson II similarity
- class distances._pearson_ii.PearsonII(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: PearsonChiSquared
Pearson II similarity.
For two sets X and Y and a population N, the Pearson II similarity :cite:`Pearson:1913`, Pearson’s coefficient of mean square contingency, is
\[corr_{PearsonII} = \sqrt{\frac{\chi^2}{|N|+\chi^2}}\]
where
\[\chi^2 = sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\chi^2 = sim_{PearsonChiSquared} = \frac{n \cdot (ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
Added in version 0.4.0.
Methods
corr(src, tar)Return Pearson's Chi-Squared correlation of two strings.
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the normalized Pearson II similarity of two strings.
sim_score(src, tar)Return the Pearson II similarity of two strings.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize PearsonII instance.
- Parameters:
- alphabetCounter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the normalized Pearson II similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Normalized Pearson II similarity
Examples
>>> cmp = PearsonII()
>>> cmp.sim('cat', 'hat')
0.6298568508557214
>>> cmp.sim('Niall', 'Neil')
0.47983719547968123
>>> cmp.sim('aluminum', 'Catalan')
0.15214891090821628
>>> cmp.sim('ATCG', 'TAGC')
0.009076921903905551
Added in version 0.4.0.
- sim_score(src: str, tar: str) float¶
Return the Pearson II similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Pearson II similarity
Examples
>>> cmp = PearsonII()
>>> cmp.sim_score('cat', 'hat')
0.44537605041688455
>>> cmp.sim_score('Niall', 'Neil')
0.3392961347892176
>>> cmp.sim_score('aluminum', 'Catalan')
0.10758552665334761
>>> cmp.sim_score('ATCG', 'TAGC')
0.006418353030552324
Added in version 0.4.0.
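Pearson's coefficient of mean square contingency follows from the chi-squared statistic in one more step. The sketch below is a direct transcription of the two formulas given for this class, with hypothetical function names; it is not the library's code. As before, the counts a=2, b=2, c=2, d=778 in the usage note are an assumption that agrees with the 'cat'/'hat' sim_score example.

```python
from math import sqrt


def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic from 2x2 confusion table counts."""
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def pearson_ii_sketch(a, b, c, d):
    """sqrt(chi2 / (n + chi2)), the mean square contingency coefficient."""
    n = a + b + c + d
    chi2 = chi_squared_2x2(a, b, c, d)
    if n + chi2 == 0:
        return 0.0
    return sqrt(chi2 / (n + chi2))
```

With those counts, pearson_ii_sketch(2, 2, 2, 778) evaluates to roughly 0.4454.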
_phonetic¶
abydos.phonetic._phonetic.
The phonetic._phonetic module implements abstract class Phonetic.
_phonetic_distance¶
abydos.distance._phonetic_distance.
Phonetic distance.
- class distances._phonetic_distance.PhoneticDistance(transforms: Type[_Phonetic] | _Phonetic | Callable[[str], str] | Sequence[Type[_Phonetic] | _Phonetic | Callable[[str], str]] | None = None, metric: Type[_Distance] | _Distance | None = None, encode_alpha: bool = False, **kwargs: Any)¶
Bases: _Distance
Phonetic distance.
Phonetic distance applies one or more supplied string transformations to words and compares the resulting transformed strings using a supplied distance measure.
A simple example would be to create a ‘Soundex distance’:
>>> from abydos.phonetic import Soundex
>>> soundex = PhoneticDistance(transforms=Soundex())
>>> soundex.dist('Ashcraft', 'Ashcroft')
0.0
>>> soundex.dist('Robert', 'Ashcraft')
1.0
Added in version 0.4.1.
Methods
dist(src, tar)Return the normalized Phonetic distance.
dist_abs(src, tar)Return the Phonetic distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
- __init__(transforms: Type[_Phonetic] | _Phonetic | Callable[[str], str] | Sequence[Type[_Phonetic] | _Phonetic | Callable[[str], str]] | None = None, metric: Type[_Distance] | _Distance | None = None, encode_alpha: bool = False, **kwargs: Any) None¶
Initialize PhoneticDistance instance.
- Parameters:
- transformslist or _Phonetic or _Stemmer or _Fingerprint or type
An instance of a subclass of _Phonetic, _Stemmer, or _Fingerprint, or a list (or other iterable) of such instances to apply to each input word before computing their distance or similarity. If omitted, no transformations will be performed.
- metric_Distance or type
An instance of a subclass of _Distance, used for computing the inputs’ distance or similarity after being transformed. If omitted, the strings will be compared for identity (returning 0.0 if identical, otherwise 1.0, when distance is computed).
- encode_alphabool
Set to True to use the encode_alpha method of phonetic algorithms whenever possible.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.1.
- dist(src: str, tar: str) float¶
Return the normalized Phonetic distance.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
The normalized Phonetic distance
Examples
>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist('cat', 'hat')
0.25
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
0.75
Added in version 0.4.1.
- dist_abs(src: str, tar: str) float¶
Return the Phonetic distance.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float or int
The Phonetic distance
Examples
>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
1
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
3
Added in version 0.4.1.
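The transform-then-compare behaviour described for PhoneticDistance can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `phonetic_distance` and `first_letter_upper` are hypothetical names, and the toy transform stands in for a real phonetic encoder such as Soundex.

```python
from typing import Callable, Optional, Sequence


def phonetic_distance(
    src: str,
    tar: str,
    transforms: Sequence[Callable[[str], str]] = (),
    metric: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Apply each transform in order to both inputs, then compare them."""
    for transform in transforms:
        src, tar = transform(src), transform(tar)
    if metric is None:
        # No metric supplied: fall back to an identity test,
        # 0.0 for identical strings and 1.0 otherwise.
        return 0.0 if src == tar else 1.0
    return metric(src, tar)


def first_letter_upper(word: str) -> str:
    """Toy stand-in for a phonetic encoder (hypothetical): keep the first letter."""
    return word[:1].upper()


# 'Niall' and 'Neil' both reduce to 'N', so the identity test reports 0.0.
print(phonetic_distance("Niall", "Neil", [first_letter_upper]))
```

Passing a callable as `metric` replaces the identity test, which mirrors the role the metric parameter plays above.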
_q_grams¶
abydos.tokenizer._q_grams.
QGrams multi-set class
- class distances._q_grams.QGrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None)¶
Bases: _Tokenizer
A q-gram class, which functions like a bag/multiset.
A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.
Added in version 0.1.0.
Methods
count()Return token count.
count_unique()Return the number of unique elements.
get_counter()Return the tokens as a Counter object.
get_list()Return the tokens as an ordered list.
get_set()Return the unique tokens as a set.
tokenize(string)Tokenize the term and store it.
- __init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None) None¶
Initialize QGrams.
- Parameters:
- qvalint or Iterable
The q-gram length (defaults to 2), can be an integer, range object, or list
- start_stopstr
A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)
- skipint or Iterable
The number of characters to skip, can be an integer, range object, or list
- scalerNone, str, or function
A scaling function for the Counter:
None : no scaling
‘set’ : All non-zero values are set to 1.
‘length’ : Each token has weight equal to its length.
‘length-log’ : Each token has weight equal to the log of its length + 1.
‘length-exp’ : Each token has weight equal to e raised to its length.
a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
- Raises:
- ValueError
Use WhitespaceTokenizer instead of qval=0.
Examples
>>> qg = QGrams().tokenize('AATTATAT')
>>> qg
QGrams({'$A': 1, 'AA': 1, 'AT': 3, 'TT': 1, 'TA': 2, 'T#': 1})

>>> qg = QGrams(qval=1, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'A': 4, 'T': 4})

>>> qg = QGrams(qval=3, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'AAT': 1, 'ATT': 1, 'TTA': 1, 'TAT': 2, 'ATA': 1})

>>> QGrams(qval=2, start_stop='$#').tokenize('interning')
QGrams({'$i': 1, 'in': 2, 'nt': 1, 'te': 1, 'er': 1, 'rn': 1, 'ni': 1, 'ng': 1, 'g#': 1})

>>> QGrams(start_stop='', skip=1).tokenize('AACTAGAAC')
QGrams({'AC': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'AA': 1, 'GA': 1, 'A': 1})

>>> QGrams(start_stop='', skip=[0, 1]).tokenize('AACTAGAAC')
QGrams({'AA': 3, 'AC': 4, 'CT': 1, 'TA': 1, 'AG': 1, 'GA': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'A': 1})

>>> QGrams(qval=range(3), skip=[0, 1]).tokenize('interdisciplinarian')
QGrams({'i': 10, 'n': 7, 't': 2, 'e': 2, 'r': 4, 'd': 2, 's': 2, 'c': 2, 'p': 2, 'l': 2, 'a': 4, '$i': 1, 'in': 3, 'nt': 1, 'te': 1, 'er': 1, 'rd': 1, 'di': 1, 'is': 1, 'sc': 1, 'ci': 1, 'ip': 1, 'pl': 1, 'li': 1, 'na': 1, 'ar': 1, 'ri': 2, 'ia': 2, 'an': 1, 'n#': 1, '$n': 1, 'it': 1, 'ne': 1, 'tr': 1, 'ed': 1, 'ds': 1, 'ic': 1, 'si': 1, 'cp': 1, 'il': 1, 'pi': 1, 'ln': 1, 'nr': 1, 'ai': 1, 'ra': 1, 'a#': 1})
Added in version 0.1.0.
Changed in version 0.4.0: Broke tokenization functions out into tokenize method
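The padding and windowing rules for q-grams can be illustrated with a short standard-library sketch. It is not the QGrams class itself: `qgrams` is a hypothetical helper, it handles a single integer qval only, and it leaves out the skip and scaler options. It assumes each end is padded with qval - 1 copies of the start or stop symbol, which is consistent with the examples above.

```python
from collections import Counter


def qgrams(word: str, qval: int = 2, start_stop: str = "$#") -> Counter:
    """Count all contiguous substrings of length qval in a padded word."""
    if qval < 1:
        raise ValueError("qval must be at least 1")
    if start_stop:
        # First character pads the start, last character pads the end.
        word = start_stop[0] * (qval - 1) + word + start_stop[-1] * (qval - 1)
    return Counter(word[i:i + qval] for i in range(len(word) - qval + 1))


print(qgrams("AATTATAT"))
print(qgrams("AATTATAT", qval=3, start_stop=""))
```

Returning a Counter keeps the multiset behaviour: repeated q-grams such as 'AT' accumulate counts rather than collapsing to one entry.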
_q_skipgrams¶
abydos.tokenizer._q_skipgrams.
Q-Skipgrams multi-set class
- class distances._q_skipgrams.QSkipgrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9)¶
Bases: _Tokenizer
A q-skipgram class, which functions like a bag/multiset.
A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.
Added in version 0.4.0.
Methods
count()Return token count.
count_unique()Return the number of unique elements.
get_counter()Return the tokens as a Counter object.
get_list()Return the tokens as an ordered list.
get_set()Return the unique tokens as a set.
tokenize(string)Tokenize the term and store it.
- __init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9) None¶
Initialize QSkipgrams.
- Parameters:
- qvalint or Iterable
The q-gram length (defaults to 2), can be an integer, range object, or list
- start_stopstr
A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)
- scalerNone, str, or function
A scaling function for the Counter:
None : no scaling
‘set’ : All non-zero values are set to 1.
‘length’ : Each token has weight equal to its length.
‘length-log’ : Each token has weight equal to the log of its length + 1.
‘length-exp’ : Each token has weight equal to e raised to its length.
a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
‘SSK’ : Applies weighting according to the substring kernel rules of :cite:`Lodhi:2002`.
- ssk_lambdafloat or Iterable
A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05)
- Raises:
- ValueError
Use WhitespaceTokenizer instead of qval=0.
Examples
>>> QSkipgrams().tokenize('AATTAT')
QSkipgrams({'$A': 3, '$T': 3, '$#': 1, 'AA': 3, 'AT': 7, 'A#': 3, 'TT': 3, 'TA': 2, 'T#': 3})

>>> QSkipgrams(qval=1, start_stop='').tokenize('AATTAT')
QSkipgrams({'A': 3, 'T': 3})

>>> QSkipgrams(qval=3, start_stop='').tokenize('AATTAT')
QSkipgrams({'AAT': 5, 'AAA': 1, 'ATT': 6, 'ATA': 4, 'TTA': 1, 'TTT': 1, 'TAT': 2})

>>> QSkipgrams(start_stop='').tokenize('ABCD')
QSkipgrams({'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 1})

>>> QSkipgrams().tokenize('Colin')
QSkipgrams({'$C': 1, '$o': 1, '$l': 1, '$i': 1, '$n': 1, '$#': 1, 'Co': 1, 'Cl': 1, 'Ci': 1, 'Cn': 1, 'C#': 1, 'ol': 1, 'oi': 1, 'on': 1, 'o#': 1, 'li': 1, 'ln': 1, 'l#': 1, 'in': 1, 'i#': 1, 'n#': 1})

>>> QSkipgrams(qval=3).tokenize('AACTAGAAC')
QSkipgrams({'$$A': 5, '$$C': 2, '$$T': 1, '$$G': 1, '$$#': 2, '$AA': 20, '$AC': 14, '$AT': 4, '$AG': 6, '$A#': 20, '$CT': 2, '$CA': 6, '$CG': 2, '$CC': 2, '$C#': 8, '$TA': 6, '$TG': 2, '$TC': 2, '$T#': 4, '$GA': 4, '$GC': 2, '$G#': 4, '$##': 2, 'AAC': 11, 'AAT': 1, 'AAA': 10, 'AAG': 3, 'AA#': 20, 'ACT': 2, 'ACA': 6, 'ACG': 2, 'ACC': 2, 'AC#': 14, 'ATA': 6, 'ATG': 2, 'ATC': 2, 'AT#': 4, 'AGA': 6, 'AGC': 3, 'AG#': 6, 'A##': 5, 'CTA': 3, 'CTG': 1, 'CTC': 1, 'CT#': 2, 'CAG': 1, 'CAA': 3, 'CAC': 3, 'CA#': 6, 'CGA': 2, 'CGC': 1, 'CG#': 2, 'CC#': 2, 'C##': 2, 'TAG': 1, 'TAA': 3, 'TAC': 3, 'TA#': 6, 'TGA': 2, 'TGC': 1, 'TG#': 2, 'TC#': 2, 'T##': 1, 'GAA': 1, 'GAC': 2, 'GA#': 4, 'GC#': 2, 'G##': 1})

QSkipgrams may also be used to produce weights in accordance with the substring kernel rules of :cite:`Lodhi:2002` by passing the scaler value 'SSK':

>>> QSkipgrams(scaler='SSK').tokenize('AACTAGAAC')
QSkipgrams(, {'$A': 2.8883286990000006, '$C': 1.0047784401000002, '$T': 0.5904900000000001, '$G': 0.4782969000000001, '$#': 0.31381059609000006, 'AA': 6.170192010000001, 'AC': 4.486377699, 'AT': 1.3851, 'AG': 1.931931, 'A#': 2.6526399291000002, 'CT': 0.81, 'CA': 1.850931, 'CG': 0.6561, 'CC': 0.4782969000000001, 'C#': 1.2404672100000003, 'TA': 2.05659, 'TG': 0.7290000000000001, 'TC': 0.531441, 'T#': 0.4782969000000001, 'GA': 1.5390000000000001, 'GC': 0.6561, 'G#': 0.5904900000000001})
Added in version 0.4.0.
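A q-skipgram is any ordered selection of q characters from the padded string, contiguous or not, so the counts above can be reproduced with itertools.combinations. The sketch below is illustrative only: `qskipgrams` is a hypothetical helper, and the SSK weighting rule it uses (lambda raised to the length of the span the selected characters cover) is an assumption inferred from the example values shown above, not taken from the library source.

```python
from collections import Counter
from itertools import combinations
from typing import Optional


def qskipgrams(
    word: str,
    qval: int = 2,
    start_stop: str = "$#",
    ssk_lambda: Optional[float] = None,
) -> Counter:
    """Count (or SSK-weight) every q-character subsequence of a padded word."""
    if start_stop:
        word = start_stop[0] * (qval - 1) + word + start_stop[-1] * (qval - 1)
    weights: Counter = Counter()
    for positions in combinations(range(len(word)), qval):
        token = "".join(word[i] for i in positions)
        if ssk_lambda is None:
            weights[token] += 1
        else:
            # Assumed rule: discount by the width of the covered span.
            span = positions[-1] - positions[0] + 1
            weights[token] += ssk_lambda ** span
    return weights


print(qskipgrams("ABCD", start_stop=""))
print(qskipgrams("AACTAGAAC", ssk_lambda=0.9)["CT"])
```

Because every combination of positions is enumerated, the number of tokens grows combinatorially with string length and qval; this sketch is meant for short inputs.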
- tokenize(string: str) QSkipgrams¶
Tokenize the term and store it.
The tokenized term is stored as an ordered list and as a Counter object.
- Parameters:
- stringstr
The string to tokenize
Added in version 0.4.0.
_ratcliff_obershelp¶
abydos.distance._ratcliff_obershelp.
Ratcliff-Obershelp similarity
- class distances._ratcliff_obershelp.RatcliffObershelp(**kwargs: Any)¶
Bases: _Distance
Ratcliff-Obershelp similarity.
This follows the Ratcliff-Obershelp algorithm :cite:`Ratcliff:1988` to derive a similarity measure:
Find the length of the longest common substring in src & tar.
Recurse on the strings to the left & right of this substring in src & tar. The base case is a 0 length common substring, in which case, return 0. Otherwise, return the sum of the current longest common substring and the left & right recursed sums.
Multiply this length by 2 and divide by the sum of the lengths of src & tar.
Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
Added in version 0.3.6.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the Ratcliff-Obershelp similarity of two strings.
- sim(src: str, tar: str) float¶
Return the Ratcliff-Obershelp similarity of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Ratcliff-Obershelp similarity
Examples
>>> cmp = RatcliffObershelp()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5
Added in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
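The three steps of the Ratcliff-Obershelp procedure described above can be sketched in standard-library Python. This is a minimal illustration rather than the library's code: the helper names are hypothetical, and when several longest common substrings tie, this sketch simply takes the first one found.

```python
from typing import Tuple


def _longest_common_substring(a: str, b: str) -> Tuple[int, int, int]:
    """Return (start_a, start_b, length) of the first longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def _matched_chars(a: str, b: str) -> int:
    """Length of the common substring plus the recursed left & right sums."""
    i, j, n = _longest_common_substring(a, b)
    if n == 0:
        return 0  # base case: nothing in common
    left = _matched_chars(a[:i], b[:j])
    right = _matched_chars(a[i + n:], b[j + n:])
    return n + left + right


def ratcliff_obershelp_sim(src: str, tar: str) -> float:
    """Twice the matched length divided by the combined string length."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return 2 * _matched_chars(src, tar) / (len(src) + len(tar))


print(ratcliff_obershelp_sim("aluminum", "Catalan"))
```

For 'aluminum' and 'Catalan' the longest common substring is 'al'; recursing to its right finds 'n', giving 3 matched characters out of 15 in total.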
_refined_soundex¶
abydos.phonetic._refined_soundex.
Refined Soundex
- class distances._refined_soundex.RefinedSoundex(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False)¶
Bases: _Phonetic
Refined Soundex.
This is Soundex, but with more character classes. It was defined in :cite:`Boyce:1998`.
Added in version 0.3.6.
Methods
encode(word)Return the Refined Soundex code for a word.
encode_alpha(word)Return the alphabetic Refined Soundex code for a word.
- __init__(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False) None¶
Initialize RefinedSoundex instance.
- Parameters:
- max_lengthint
The length of the code returned (defaults to unlimited)
- zero_padbool
Pad the end of the return value with 0s to achieve a max_length string
- retain_vowelsbool
Retain vowels (as 0) in the resulting code
Added in version 0.4.0.
- encode(word: str) str¶
Return the Refined Soundex code for a word.
- Parameters:
- wordstr
The word to transform
- Returns:
- str
The Refined Soundex value
Examples
>>> pe = RefinedSoundex()
>>> pe.encode('Christopher')
'C93619'
>>> pe.encode('Niall')
'N7'
>>> pe.encode('Smith')
'S86'
>>> pe.encode('Schmidt')
'S386'
Added in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- encode_alpha(word: str) str¶
Return the alphabetic Refined Soundex code for a word.
- Parameters:
- wordstr
The word to transform
- Returns:
- str
The alphabetic Refined Soundex value
Examples
>>> pe = RefinedSoundex()
>>> pe.encode_alpha('Christopher')
'CRKTPR'
>>> pe.encode_alpha('Niall')
'NL'
>>> pe.encode_alpha('Smith')
'SNT'
>>> pe.encode_alpha('Schmidt')
'SKNT'
Added in version 0.4.0.
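A rough sketch of how a Refined Soundex code with default settings might be derived is given here. Everything in it is an assumption for illustration: the function name is hypothetical, the letter-to-digit table is the commonly cited Refined Soundex grouping rather than something stated in this documentation, and the procedure (keep the first letter, encode the rest, collapse consecutive repeats, drop vowel zeros) was chosen because it reproduces the four encode examples above.

```python
# Commonly cited Refined Soundex character classes (assumed, not from this page).
_GROUPS = {
    "0": "AEIOUHWY", "1": "BP", "2": "FV", "3": "CKS", "4": "GJ",
    "5": "QXZ", "6": "DT", "7": "L", "8": "MN", "9": "R",
}
_CODE = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def refined_soundex(word: str) -> str:
    """Encode a word roughly as encode() does with default parameters."""
    letters = [ch for ch in word.upper() if ch in _CODE]
    if not letters:
        return ""
    digits = []
    for ch in letters[1:]:
        digit = _CODE[ch]
        # Collapse runs of the same digit into a single digit.
        if not digits or digits[-1] != digit:
            digits.append(digit)
    # Vowels (code 0) are dropped, as when retain_vowels is False.
    return letters[0] + "".join(d for d in digits if d != "0")


print(refined_soundex("Christopher"))
```

The max_length and zero_pad options are not modelled; the sketch corresponds to an unlimited-length, unpadded code.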
_regexp¶
abydos.tokenizer._regexp.
Regexp tokenizer
- class distances._regexp.RegexpTokenizer(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0)¶
Bases: _Tokenizer
A regexp tokenizer.
Added in version 0.4.0.
Methods
count()Return token count.
count_unique()Return the number of unique elements.
get_counter()Return the tokens as a Counter object.
get_list()Return the tokens as an ordered list.
get_set()Return the unique tokens as a set.
tokenize(string)Tokenize the term and store it.
- __init__(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0) None¶
Initialize tokenizer.
- Parameters:
- scalerNone, str, or function
A scaling function for the Counter:
None : no scaling
‘set’ : All non-zero values are set to 1.
‘length’ : Each token has weight equal to its length.
‘length-log’ : Each token has weight equal to the log of its length + 1.
‘length-exp’ : Each token has weight equal to e raised to its length.
a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
- regexpstr
A regular expression used to match tokens in the input text.
- flagsint
Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.
Added in version 0.4.0.
- tokenize(string: str) RegexpTokenizer¶
Tokenize the term and store it.
The tokenized term is stored as an ordered list and as a Counter object.
- Parameters:
- stringstr
The string to tokenize
Examples
>>> RegexpTokenizer(regexp=r'[^-]+').tokenize('AA-CT-AG-AA-CD')
RegexpTokenizer({'AA': 2, 'CT': 1, 'AG': 1, 'CD': 1})
Added in version 0.4.0.
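The effect of regexp tokenization can be approximated with Python's re module and a Counter. This is a sketch only: `regexp_tokens` is a hypothetical helper, not part of the package, and it ignores the scaler parameter.

```python
import re
from collections import Counter


def regexp_tokens(text: str, regexp: str = r"\w+", flags: int = 0) -> Counter:
    """Count every non-overlapping match of regexp in text."""
    # finditer + group(0) returns whole matches even if regexp has groups.
    return Counter(m.group(0) for m in re.finditer(regexp, text, flags))


print(regexp_tokens("AA-CT-AG-AA-CD", regexp=r"[^-]+"))
```

With the default pattern the same call splits on anything that is not a word character, so 'AA-CT-AG-AA-CD' yields the same five tokens.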
_rouge_l¶
abydos.distance._rouge_l.
Rouge-L similarity
- class distances._rouge_l.RougeL(**kwargs: Any)¶
Bases: _Distance
Rouge-L similarity.
Rouge-L similarity :cite:`Lin:2004`
Added in version 0.4.0.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar[, beta])Return the Rouge-L similarity of two strings.
- __init__(**kwargs: Any) None¶
Initialize RougeL instance.
- Parameters:
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- sim(src: str, tar: str, beta: float = 8) float¶
Return the Rouge-L similarity of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- betaint or float
A weighting factor to prejudice similarity towards src
- Returns:
- float
Rouge-L similarity
Examples
>>> cmp = RougeL()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6018518518518519
>>> cmp.sim('aluminum', 'Catalan')
0.3757225433526012
>>> cmp.sim('ATCG', 'TAGC')
0.5
Added in version 0.4.0.
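Rouge-L is an F-measure built on the longest common subsequence (LCS). The sketch below is an assumption-laden illustration, not the library's code: the function names are hypothetical, and the choice of dividing the LCS length by len(src) for recall and by len(tar) for precision was made because it reproduces the example values above with beta = 8.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(src: str, tar: str, beta: float = 8) -> float:
    """F-measure over LCS-based recall and precision, weighted by beta."""
    if src == tar:
        return 1.0  # assumed convention for identical (even empty) inputs
    lcs = lcs_length(src, tar)
    if lcs == 0:
        return 0.0
    recall = lcs / len(src)
    precision = lcs / len(tar)
    beta_sq = beta * beta
    return (1 + beta_sq) * recall * precision / (recall + beta_sq * precision)


print(rouge_l("Niall", "Neil"))
```

A large beta pushes the score towards the recall term, which is the sense in which beta prejudices the similarity towards src.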
_ssk¶
abydos.distance._ssk.
String subsequence kernel (SSK) similarity
- class distances._ssk.SSK(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any)¶
Bases: _TokenDistance
String subsequence kernel (SSK) similarity.
This is based on :cite:`Lodhi:2002`.
Added in version 0.4.1.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the normalized SSK similarity of two strings.
sim_score(src, tar)Return the SSK similarity of two strings.
- __init__(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any) None¶
Initialize SSK instance.
- Parameters:
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- ssk_lambdafloat or Iterable
A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05)
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-skipgram. Using this parameter and tokenizer=None will cause the instance to use the QSkipgrams tokenizer with this q value.
Added in version 0.4.1.
- sim(src: str, tar: str) float¶
Return the normalized SSK similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Normalized string subsequence kernel similarity
Examples
>>> cmp = SSK()
>>> cmp.sim('cat', 'hat')
0.3558718861209964
>>> cmp.sim('Niall', 'Neil')
0.4709007822130597
>>> cmp.sim('aluminum', 'Catalan')
0.13760157193822603
>>> cmp.sim('ATCG', 'TAGC')
0.6140899528060498
Added in version 0.4.1.
- sim_score(src: str, tar: str) float¶
Return the SSK similarity of two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
String subsequence kernel similarity
Examples
>>> cmp = SSK()
>>> cmp.dist_abs('cat', 'hat')
0.6441281138790036
>>> cmp.dist_abs('Niall', 'Neil')
0.5290992177869402
>>> cmp.dist_abs('aluminum', 'Catalan')
0.862398428061774
>>> cmp.dist_abs('ATCG', 'TAGC')
0.38591004719395017
Added in version 0.4.1.
_tichy¶
abydos.distance._tichy.
Tichy edit distance
- class distances._tichy.Tichy(cost: Tuple[int, int] = (1, 1), **kwargs: Any)¶
Bases: _Distance
Tichy edit distance.
Tichy described an algorithm, implemented below, in :cite:`Tichy:1984`. Following this, :cite:`Cormode:2003` identifies an interpretation of this algorithm’s output as a distance measure, which is largely followed by the methods below.
Tichy’s algorithm locates substrings of a string S to be copied in order to create a string T. The only other operation used by his algorithms for string reconstruction are add operations.
Methods
dist(src, tar)Return the normalized Tichy edit distance between two strings.
dist_abs(src, tar)Return the Tichy distance between two strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
Notes
While :cite:`Cormode:2003` counts only move operations to calculate distance, I give the option (enabled by default) of counting add operations as part of the distance measure. To ignore the cost of add operations, set the cost value to (1, 0), for example, when initializing the object. Further, in the case that S and T are identical, a distance of 0 will be returned, even though this would still be counted as a single move operation spanning the whole of string S.
Added in version 0.4.0.
- __init__(cost: Tuple[int, int] = (1, 1), **kwargs: Any) None¶
Initialize Tichy instance.
- Parameters:
- costtuple
A 2-tuple representing the cost of the two possible edits: block moves and adds (by default: (1, 1))
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized Tichy edit distance between two strings.
The Tichy distance is normalized by dividing the distance by the length of the tar string.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
The normalized Tichy distance between src & tar
Examples
>>> cmp = Tichy()
>>> round(cmp.dist('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.dist('Niall', 'Neil'), 12)
1.0
>>> cmp.dist('aluminum', 'Catalan')
0.8571428571428571
>>> cmp.dist('ATCG', 'TAGC')
1.0
Added in version 0.4.0.
- dist_abs(src: str, tar: str) float¶
Return the Tichy distance between two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- int (may return a float if cost has float values)
The Tichy distance between src & tar
Examples
>>> cmp = Tichy()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
4
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
Added in version 0.4.0.
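The block-move and add operations described for Tichy's algorithm can be illustrated with a greedy sketch: starting at the left of the target, copy the longest prefix of the remaining target that occurs anywhere in the source (one move), or add a single character when no such prefix exists (one add). The function name is hypothetical and the code is a simplified reading of the description above, not the library's implementation, though it reproduces the dist_abs examples.

```python
from typing import Tuple


def tichy_dist_abs(src: str, tar: str, cost: Tuple[float, float] = (1, 1)) -> float:
    """Sum the costs of the block moves and adds needed to build tar from src."""
    if src == tar:
        return 0  # identical strings are treated as distance 0
    move_cost, add_cost = cost
    total = 0
    pos = 0
    while pos < len(tar):
        # Grow the block while it still occurs somewhere in src.
        length = 0
        while pos + length < len(tar) and tar[pos:pos + length + 1] in src:
            length += 1
        if length:
            total += move_cost  # copy a block of `length` characters
            pos += length
        else:
            total += add_cost  # character absent from src: add it
            pos += 1
    return total


print(tichy_dist_abs("aluminum", "Catalan"))
print(tichy_dist_abs("cat", "hat", cost=(1, 0)))
```

Dividing the result by len(tar) gives the normalized form described for dist; with cost=(1, 0) only block moves are counted.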
_tokenizer¶
abydos.tokenizer._tokenizer.
_Tokenizer base class
_token_distance¶
abydos.distance._token_distance.
The distance._token_distance._TokenDistance module implements abstract class _TokenDistance.
_typo¶
abydos.distance._typo.
Typo edit distance functions.
- class distances._typo.Typo(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)¶
Bases: _Distance
Typo distance.
This is inspired by Typo-Distance :cite:`Song:2011`, and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.
Added in version 0.3.6.
Methods
dist(src, tar)Return the normalized typo distance between two strings.
dist_abs(src, tar)Return the typo distance between two strings.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return similarity.
- __init__(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)¶
Initialize Typo instance.
- Parameters:
- metricstr
Supported values include: euclidean, manhattan, log-euclidean, and log-manhattan
- costtuple
A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used.
- layoutstr
Name of the keyboard layout to use (currently supported: QWERTY, Dvorak, AZERTY, QWERTZ, auto). If auto is selected, the class will attempt to determine an appropriate keyboard based on the supplied words.
- failsafebool
If True, substitution of an unknown character (one not present on the selected keyboard) will incur a cost equal to an insertion plus a deletion.
- **kwargs
Arbitrary keyword arguments
Added in version 0.4.0.
- dist(src: str, tar: str) float¶
Return the normalized typo distance between two strings.
This is typo distance, normalized to [0, 1].
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Normalized typo distance
Examples
>>> cmp = Typo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.527046276695
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.565028153987
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.569035593729
>>> cmp.dist('ATCG', 'TAGC')
0.625
Added in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float¶
Return the typo distance between two strings.
- Parameters:
- srcstr
Source string for comparison
- tarstr
Target string for comparison
- Returns:
- float
Typo distance
- Raises:
- ValueError
char not found in any keyboard layouts
Examples
>>> cmp = Typo()
>>> cmp.dist_abs('cat', 'hat')
1.5811388300841898
>>> cmp.dist_abs('Niall', 'Neil')
2.8251407699364424
>>> cmp.dist_abs('Colin', 'Cuilen')
3.414213562373095
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5

>>> cmp = Typo(metric='manhattan')
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.0
>>> cmp.dist_abs('Colin', 'Cuilen')
3.5
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5

>>> cmp = Typo(metric='log-manhattan')
>>> cmp.dist_abs('cat', 'hat')
0.8047189562170501
>>> cmp.dist_abs('Niall', 'Neil')
2.2424533248940004
>>> cmp.dist_abs('Colin', 'Cuilen')
2.242453324894
>>> cmp.dist_abs('ATCG', 'TAGC')
2.3465735902799727
Added in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
_warrens_iv¶
abydos.distance._warrens_iv.
Warrens IV similarity
- class distances._warrens_iv.WarrensIV(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶
Bases: _TokenDistance
Warrens IV similarity.
For two sets X and Y and a population N, Warrens IV similarity :cite:`Warrens:2008` is
\[sim_{WarrensIV}(X, Y) = \frac{4|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {4|X \cap Y| \cdot |(N \setminus X) \setminus Y| + (|X \cap Y| + |(N \setminus X) \setminus Y|) (|X \setminus Y| + |Y \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{WarrensIV} = \frac{4ad}{4ad + (a+d)(b+c)}\]
Added in version 0.4.0.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the Warrens IV similarity of two strings.
- __init__(alphabet: Counter[str] | Sequence[str] | Set[str] | int | None = None, tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) None¶
Initialize WarrensIV instance.
- Parameters:
- alphabetCounter, collection, int, or None
This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Warrens IV similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Warrens IV similarity
Examples
>>> cmp = WarrensIV()
>>> cmp.sim('cat', 'hat')
0.666095890410959
>>> cmp.sim('Niall', 'Neil')
0.5326918120113412
>>> cmp.sim('aluminum', 'Catalan')
0.21031040612607685
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.0.
_weighted_jaccard¶
abydos.distance._weighted_jaccard.
Weighted Jaccard similarity
- class distances._weighted_jaccard.WeightedJaccard(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any)¶
Bases: _TokenDistance
Weighted Jaccard similarity.
For two sets X and Y and a weight w, the Weighted Jaccard similarity :cite:`Legendre:1998` is
\[sim_{Jaccard_w}(X, Y) = \frac{w \cdot |X \cap Y|} {w \cdot |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
Here, the intersection between the two sets is weighted by w. Compare to Jaccard similarity (\(w = 1\)), and to Dice similarity (\(w = 2\)). In the default case, the weight of the intersection is 3, following :cite:`Legendre:1998`.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Jaccard_w} = \frac{w\cdot a}{w\cdot a+b+c}\]
Added in version 0.4.0.
Methods
dist(src, tar)Return distance.
dist_abs(src, tar)Return absolute distance.
set_params(**kwargs)Store params in the params dict.
sim(src, tar)Return the Weighted Jaccard similarity of two strings.
- __init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any) None¶
Initialize WeightedJaccard instance.
- Parameters:
- tokenizer_Tokenizer
A tokenizer instance from the abydos.tokenizer package
- intersection_typestr
Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
- weightint
The weight to apply to the intersection cardinality. (3, by default.)
- **kwargs
Arbitrary keyword arguments
- Other Parameters:
- qvalint
The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- metric_Distance
A string distance measure class for use in the soft and fuzzy variants.
- thresholdfloat
A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
Added in version 0.4.0.
- sim(src: str, tar: str) float¶
Return the Weighted Jaccard similarity of two strings.
- Parameters:
- srcstr
Source string (or QGrams/Counter objects) for comparison
- tarstr
Target string (or QGrams/Counter objects) for comparison
- Returns:
- float
Weighted Jaccard similarity
Examples
>>> cmp = WeightedJaccard()
>>> cmp.sim('cat', 'hat')
0.6
>>> cmp.sim('Niall', 'Neil')
0.46153846153846156
>>> cmp.sim('aluminum', 'Catalan')
0.16666666666666666
>>> cmp.sim('ATCG', 'TAGC')
0.0
Added in version 0.4.0.
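The weighted formula can be checked by hand on padded bigram multisets. The following is a sketch, not the library's implementation: the helper names are hypothetical, and it assumes q-grams with q = 2 and '$'/'#' padding, which is consistent with the example values above.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Bigram multiset of a word padded with '$' and '#'."""
    padded = f"${word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def weighted_jaccard(src: str, tar: str, weight: float = 3) -> float:
    """Compute w*a / (w*a + b + c) over the two bigram multisets."""
    if src == tar:
        return 1.0  # assumed convention for identical inputs
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # shared tokens
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    return weight * a / (weight * a + b + c)


print(weighted_jaccard("cat", "hat"))            # default weight of 3
print(weighted_jaccard("cat", "hat", weight=1))  # plain Jaccard
```

With 'cat' and 'hat', a = b = c = 2, so the default weight gives 6 / 10, while weight=1 gives the ordinary Jaccard value of 2 / 6.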
_whitespace¶
abydos.tokenizer._whitespace.
Whitespace tokenizer
- class distances._whitespace.WhitespaceTokenizer(scaler: str | Callable[[float], float] | None = None, flags: int = 0)¶
Bases: RegexpTokenizer
A whitespace tokenizer.
Methods
count()Return token count.
count_unique()Return the number of unique elements.
get_counter()Return the tokens as a Counter object.
get_list()Return the tokens as an ordered list.
get_set()Return the unique tokens as a set.
tokenize(string)Tokenize the term and store it.
Examples
>>> WhitespaceTokenizer().tokenize('a b c f a c g e a b')
WhitespaceTokenizer({'a': 3, 'b': 2, 'c': 2, 'f': 1, 'g': 1, 'e': 1})
Added in version 0.4.0.
- __init__(scaler: str | Callable[[float], float] | None = None, flags: int = 0) None¶
Initialize tokenizer.
- Parameters:
- scalerNone, str, or function
A scaling function for the Counter:
None : no scaling
‘set’ : All non-zero values are set to 1.
‘length’ : Each token has weight equal to its length.
‘length-log’ : Each token has weight equal to the log of its length + 1.
‘length-exp’ : Each token has weight equal to e raised to its length.
a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
- flagsint
Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.
- .. versionadded:: 0.4.0
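Whitespace tokenization combined with two of the scaler options described above can be sketched as follows. `whitespace_tokens` is a hypothetical helper rather than the library class, and only the None, 'set', and callable scalers are modelled; the length-based scalers are left out because their exact weighting is not spelled out here.

```python
import math
from collections import Counter
from typing import Callable, Union


def whitespace_tokens(
    text: str,
    scaler: Union[None, str, Callable[[float], float]] = None,
) -> Counter:
    """Count whitespace-delimited tokens, optionally rescaling the counts."""
    counts = Counter(text.split())
    if scaler is None:
        return counts
    if scaler == "set":
        # All non-zero values are set to 1.
        return Counter({token: 1 for token in counts})
    if callable(scaler):
        # The function is applied to each value in the Counter.
        return Counter({token: scaler(n) for token, n in counts.items()})
    raise ValueError(f"unsupported scaler in this sketch: {scaler!r}")


print(whitespace_tokens("a b c f a c g e a b"))
print(whitespace_tokens("a b c f a c g e a b", scaler=math.sqrt)["a"])
```

A callable scaler such as math.sqrt or math.log1p dampens the influence of very frequent tokens without discarding frequency information entirely, whereas 'set' discards it.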