`distances`¶

`_bag`¶

abydos.distance._bag.

Bag similarity & distance

class distances._bag.Bag(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶

Bases: _TokenDistance

Bag distance.

Bag distance is proposed in :cite:`Bartolini:2002`. It is defined as

\[dist_{bag}(src, tar) = max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\]

Added in version 0.3.6.
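As a rough illustration of the definition above, the following sketch computes bag distance over single characters with collections.Counter. The helper names bag_dist_abs and bag_dist are illustrative only and are not part of the library; the class itself may tokenize differently depending on its tokenizer argument.

```python
from collections import Counter


def bag_dist_abs(src: str, tar: str) -> int:
    """Return max(|bag(src) - bag(tar)|, |bag(tar) - bag(src)|) over characters."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    # Counter subtraction keeps only positive counts, i.e. multiset difference.
    src_only = sum((src_bag - tar_bag).values())
    tar_only = sum((tar_bag - src_bag).values())
    return max(src_only, tar_only)


def bag_dist(src: str, tar: str) -> float:
    """Normalize by the longer string's length (0.0 for two empty strings)."""
    longest = max(len(src), len(tar))
    return bag_dist_abs(src, tar) / longest if longest else 0.0


print(bag_dist_abs('aluminum', 'Catalan'), bag_dist('aluminum', 'Catalan'))
```

Returning 0.0 for two empty strings is a convention chosen for this sketch.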

Methods

`dist`(src, tar)	Return the normalized bag distance between two strings.
`dist_abs`(src, tar[, normalized])	Return the bag distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the token similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) → None¶

Initialize Bag instance.

Parameters:

tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package
intersection_type (str): Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance): A string distance measure class for use in the soft and fuzzy variants.
threshold (float): A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized bag distance between two strings.

Bag distance is normalized by dividing by $max( |src|, |tar| )$.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: Normalized bag distance

Examples

>>> cmp = Bag()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str, normalized: bool = False) → float¶

Return the bag distance between two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison
normalized (bool): Normalizes to [0, 1] if True

Returns:

int or float: Bag distance

Examples

>>> cmp = Bag()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
0
>>> cmp.dist_abs('abcdefg', 'hijklm')
7
>>> cmp.dist_abs('abcdefg', 'hijklmno')
8

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_baulieu_xiii`¶

abydos.distance._baulieu_xiii.

Baulieu XIII distance

Bases: _TokenDistance

Baulieu XIII distance.

For two sets X and Y and a population N, Baulieu XIII distance :cite:`Baulieu:1997` is

\[dist_{BaulieuXIII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]

This is Baulieu’s 31st dissimilarity coefficient. This coefficient fails Baulieu’s (P4) property, that $D(a+1,b,c,d) \leq D(a,b,c,d)$ with equality holding iff $D(a,b,c,d) = 0$.

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[dist_{BaulieuXIII} = \frac{b+c}{a+b+c+a \cdot (a-4)^2}\]
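A minimal sketch of the confusion-table form, taking the counts a, b, and c directly. The function name baulieu_xiii and the zero-distance convention for b + c = 0 are assumptions made for this sketch. The counts used for 'cat' and 'hat' assume bigrams padded with start and stop symbols (2 shared, 2 unique to each string), which reproduces the value shown in the examples for dist.

```python
def baulieu_xiii(a: int, b: int, c: int) -> float:
    """Return (b + c) / (a + b + c + a * (a - 4)**2)."""
    if b + c == 0:
        return 0.0  # assumed convention: identical token sets are at distance 0
    return (b + c) / (a + b + c + a * (a - 4) ** 2)


# 'cat' vs. 'hat' with padded bigrams: a = 2, b = 2, c = 2.
print(baulieu_xiii(2, 2, 2))
# Growing the intersection from a = 3 to a = 4 raises the distance,
# which is the kind of behaviour that violates property (P4).
print(baulieu_xiii(3, 1, 1), baulieu_xiii(4, 1, 1))
```

The second print shows why the $(a-4)^2$ term matters: it vanishes at a = 4, so the denominator shrinks even though the overlap grew.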

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return the Baulieu XIII distance of two strings.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the token similarity of two strings.

Initialize BaulieuXIII instance.

Parameters:

alphabet (Counter, collection, int, or None): This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package
intersection_type (str): Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance): A string distance measure class for use in the soft and fuzzy variants.
threshold (float): A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the Baulieu XIII distance of two strings.

Parameters:

src (str): Source string (or QGrams/Counter objects) for comparison
tar (str): Target string (or QGrams/Counter objects) for comparison

Returns:

float: Baulieu XIII distance

Examples

>>> cmp = BaulieuXIII()
>>> cmp.dist('cat', 'hat')
0.2857142857142857
>>> cmp.dist('Niall', 'Neil')
0.4117647058823529
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.4.0.

`_character`¶

abydos.tokenizer._character.

Character tokenizer

class distances._character.CharacterTokenizer(scaler: str | Callable[[float], float] | None = None)¶

Bases: _Tokenizer

A character tokenizer.

Added in version 0.4.0.

Methods

`count`()	Return token count.
`count_unique`()	Return the number of unique elements.
`get_counter`()	Return the tokens as a Counter object.
`get_list`()	Return the tokens as an ordered list.
`get_set`()	Return the unique tokens as a set.
`tokenize`(string)	Tokenize the term and store it.

__init__(scaler: str | Callable[[float], float] | None = None) → None¶

Initialize tokenizer.

Parameters:

scaler (None, str, or function)

A scaling function for the Counter:

None : no scaling

‘set’ : All non-zero values are set to 1.

a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
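The three scaler options can be sketched with a plain Counter. The function char_counts below is a hypothetical stand-in and not the tokenizer's actual implementation; in particular, raising ValueError for an unrecognized scaler is an assumption of this sketch.

```python
import math
from collections import Counter
from typing import Callable, Dict, Optional, Union

Scaler = Optional[Union[str, Callable[[float], float]]]


def char_counts(string: str, scaler: Scaler = None) -> Dict[str, float]:
    """Count characters, then apply one of the scaler options described above."""
    counts = Counter(string)
    if scaler is None:
        return dict(counts)  # no scaling
    if scaler == 'set':
        return {tok: 1 for tok in counts}  # every non-zero count becomes 1
    if callable(scaler):
        return {tok: scaler(val) for tok, val in counts.items()}
    # Assumed behaviour for this sketch only.
    raise ValueError('unsupported scaler: %r' % (scaler,))


print(char_counts('AACTAGAAC'))
print(char_counts('AACTAGAAC', 'set'))
print(char_counts('AACTAGAAC', math.sqrt))
```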

Added in version 0.4.0.

tokenize(string: str) → CharacterTokenizer¶

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:

string (str): The string to tokenize

Examples

>>> CharacterTokenizer().tokenize('AACTAGAAC')
CharacterTokenizer({'A': 5, 'C': 2, 'T': 1, 'G': 1})

Added in version 0.4.0.

`_clement`¶

abydos.distance._clement.

Clement similarity

Bases: _TokenDistance

Clement similarity.

For two sets X and Y and a population N, Clement similarity :cite:`Clement:1976` is defined as

\[sim_{Clement}(X, Y) = \frac{|X \cap Y|}{|X|}\Big(1-\frac{|X|}{|N|}\Big) + \frac{|(N \setminus X) \setminus Y|}{|N \setminus X|} \Big(1-\frac{|N \setminus X|}{|N|}\Big)\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{Clement} = \frac{a}{a+b}\Big(1 - \frac{a+b}{n}\Big) + \frac{d}{c+d}\Big(1 - \frac{c+d}{n}\Big)\]
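The confusion-table form translates directly into code. In this sketch the name clement_sim and the handling of empty margins (a term is treated as 0.0 when its denominator is zero) are assumptions. The sample call uses a = b = c = 2 for 'cat' vs. 'hat' (padded bigrams) and assumes a population of n = 784, chosen because it reproduces the value shown in the examples for sim; the library's default population may be derived differently.

```python
def clement_sim(a: int, b: int, c: int, d: int) -> float:
    """Return a/(a+b) * (1 - (a+b)/n) + d/(c+d) * (1 - (c+d)/n), with n = a+b+c+d."""
    n = a + b + c + d
    # Each term is skipped when its margin is empty (assumed convention).
    first = a / (a + b) * (1 - (a + b) / n) if a + b else 0.0
    second = d / (c + d) * (1 - (c + d) / n) if c + d else 0.0
    return first + second


# 'cat' vs. 'hat': a = b = c = 2, so d = 784 - 6 = 778 under the assumed n.
print(clement_sim(2, 2, 2, 778))
```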

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Clement similarity of two strings.

Initialize Clement instance.

Parameters:

alphabet (Counter, collection, int, or None): This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package
intersection_type (str): Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance): A string distance measure class for use in the soft and fuzzy variants.
threshold (float): A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Clement similarity of two strings.

Parameters:

src (str): Source string (or QGrams/Counter objects) for comparison
tar (str): Target string (or QGrams/Counter objects) for comparison

Returns:

float: Clement similarity

Examples

>>> cmp = Clement()
>>> cmp.sim('cat', 'hat')
0.5025379382522239
>>> cmp.sim('Niall', 'Neil')
0.33840586363079933
>>> cmp.sim('aluminum', 'Catalan')
0.12119877280918714
>>> cmp.sim('ATCG', 'TAGC')
0.006336616803332366

Added in version 0.4.0.

`_cormode_lz`¶

abydos.distance._cormode_lz.

Cormode’s LZ distance

class distances._cormode_lz.CormodeLZ(**kwargs: Any)¶

Bases: _Distance

Cormode’s LZ distance.

Cormode’s LZ distance :cite:`Cormode:2000,Cormode:2003`

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return the normalized Cormode's LZ distance of two strings.
`dist_abs`(src, tar)	Return the Cormode's LZ distance of two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(**kwargs: Any) → None¶

Initialize CormodeLZ instance.

Parameters:

**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized Cormode’s LZ distance of two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: Cormode’s LZ distance

Examples

>>> cmp = CormodeLZ()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.4.0.

dist_abs(src: str, tar: str) → float¶

Return the Cormode’s LZ distance of two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: Cormode’s LZ distance

Examples

>>> cmp = CormodeLZ()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
5
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.4.0.

`_damerau_levenshtein`¶

abydos.distance._damerau_levenshtein.

Damerau-Levenshtein distance

class distances._damerau_levenshtein.DamerauLevenshtein(cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶

Bases: _Distance

Damerau-Levenshtein distance.

This computes the Damerau-Levenshtein distance :cite:`Damerau:1964`. Damerau-Levenshtein code is based on Java code by Kevin L. Stern :cite:`Stern:2014`, under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

Methods

`dist`(src, tar)	Return the normalized Damerau-Levenshtein distance of two strings.
`dist_abs`(src, tar)	Return the Damerau-Levenshtein distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶

Initialize DamerauLevenshtein instance.

Parameters:

cost (tuple): A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
normalizer (function): A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized Damerau-Levenshtein distance of two strings.

Damerau-Levenshtein distance normalized to the interval [0, 1].

The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have $cost = 1$, this is equivalent to the greater of the length of the two strings src & tar.
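For the unit-cost case, the distance and its normalization can be sketched as follows. This is an independent illustration of the textbook algorithm with unrestricted transpositions, not the library's code; the function names are hypothetical, and only the default cost (1, 1, 1, 1) and the max normalizer are covered.

```python
from typing import Dict


def damerau_levenshtein(src: str, tar: str) -> int:
    """Unit-cost Damerau-Levenshtein distance (unrestricted transpositions)."""
    max_dist = len(src) + len(tar)
    # The table is offset by one; row 0 and column 0 hold a sentinel value.
    d = [[max_dist] * (len(tar) + 2) for _ in range(len(src) + 2)]
    for i in range(len(src) + 1):
        d[i + 1][1] = i
    for j in range(len(tar) + 1):
        d[1][j + 1] = j

    last_row: Dict[str, int] = {}  # last row in which each src character occurred
    for i in range(1, len(src) + 1):
        last_match_col = 0  # last column in this row where the characters matched
        for j in range(1, len(tar) + 1):
            k = last_row.get(tar[j - 1], 0)
            m = last_match_col
            cost = 0 if src[i - 1] == tar[j - 1] else 1
            if cost == 0:
                last_match_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,                          # match or substitution
                d[i + 1][j] + 1,                         # insertion
                d[i][j + 1] + 1,                         # deletion
                d[k][m] + (i - k - 1) + 1 + (j - m - 1)  # transposition
            )
        last_row[src[i - 1]] = i
    return d[len(src) + 1][len(tar) + 1]


def damerau_levenshtein_norm(src: str, tar: str) -> float:
    """Divide by the longer string's length, as in the unit-cost case above."""
    longest = max(len(src), len(tar))
    return damerau_levenshtein(src, tar) / longest if longest else 0.0
```

With unit costs, 'ATCG' and 'TAGC' differ by two adjacent transpositions, so the normalized distance is 2 / 4.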

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: The normalized Damerau-Levenshtein distance

Examples

>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) → float¶

Return the Damerau-Levenshtein distance between two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

int (may return a float if cost has float values): The Damerau-Levenshtein distance between src & tar

Raises:

ValueError: Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.

Examples

>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_dice_asymmetric_i`¶

abydos.distance._dice_asymmetric_i.

Dice’s Asymmetric I similarity

class distances._dice_asymmetric_i.DiceAsymmetricI(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶

Bases: _TokenDistance

Dice’s Asymmetric I similarity.

For two sets X and Y and a population N, Dice’s Asymmetric I similarity :cite:`Dice:1945` is

\[sim_{DiceAsymmetricI}(X, Y) = \frac{|X \cap Y|}{|X|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{DiceAsymmetricI} = \frac{a}{a+b}\]
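The set form can be sketched over q-gram multisets. The helpers below are hypothetical; padding each word with '$' and '#' before extracting bigrams is an assumption that happens to reproduce the values in the examples for sim.

```python
from collections import Counter


def padded_qgrams(word: str, q: int = 2) -> Counter:
    """Return a multiset of q-grams, padding with '$' and '#' (assumed convention)."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def dice_asymmetric_i(src: str, tar: str) -> float:
    """Return |X & Y| / |X|, where X and Y are the q-gram multisets of src and tar."""
    x, y = padded_qgrams(src), padded_qgrams(tar)
    size_x = sum(x.values())
    return sum((x & y).values()) / size_x if size_x else 0.0


# The measure is asymmetric: only the source's size appears in the denominator.
print(dice_asymmetric_i('Niall', 'Neil'), dice_asymmetric_i('Neil', 'Niall'))
```

Swapping the arguments changes the denominator from 6 bigrams ('Niall') to 5 ('Neil'), which is why the two calls differ.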

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Dice's Asymmetric I similarity of two strings.

Notes

In terms of a confusion matrix, this is equivalent to precision or positive predictive value ConfusionTable.precision().

Added in version 0.4.0.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) → None¶

Initialize DiceAsymmetricI instance.

Parameters:

tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package
intersection_type (str): Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance): A string distance measure class for use in the soft and fuzzy variants.
threshold (float): A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Dice’s Asymmetric I similarity of two strings.

Parameters:

src (str): Source string (or QGrams/Counter objects) for comparison
tar (str): Target string (or QGrams/Counter objects) for comparison

Returns:

float: Dice’s Asymmetric I similarity

Examples

>>> cmp = DiceAsymmetricI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

`_discounted_levenshtein`¶

abydos.distance._discounted_levenshtein.

Discounted Levenshtein edit distance

class distances._discounted_levenshtein.DiscountedLevenshtein(mode: str = 'lev', normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, discount_from: int | str = 1, discount_func: str | ~typing.Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: ~typing.Any)¶

Bases: Levenshtein

Discounted Levenshtein distance.

This is a variant of Levenshtein distance for which edits later in a string have discounted cost, on the theory that earlier edits are less likely than later ones.

Added in version 0.4.1.

Methods

`alignment`(src, tar)	Return the Levenshtein alignment of two strings.
`dist`(src, tar)	Return the normalized Levenshtein distance between two strings.
`dist_abs`(src, tar)	Return the Levenshtein distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(mode: str = 'lev', normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, discount_from: int | str = 1, discount_func: str | ~typing.Callable[[float], float] = 'log', vowels: str = 'aeiou', **kwargs: ~typing.Any) → None¶

Initialize DiscountedLevenshtein instance.

Parameters:

mode (str)

Specifies a mode for computing the discounted Levenshtein distance:

lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions

osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once

normalizer (function)

A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.

discount_from (int or str)

If an int is supplied, this is the first character whose edit cost will be discounted. If the str coda is supplied, discounting will start with the first non-vowel after the first vowel (the first syllable coda).

discount_func (str or function)

The two supported str arguments are log, for a logarithmic discount function, and exp, for an exponential discount function. See notes below for information on how to supply your own discount function.

vowels (str)

These are the letters to consider as vowels when discount_from is set to coda. It defaults to the English vowels ‘aeiou’, but it would be reasonable to localize this to other languages or to add orthographic semi-vowels like ‘y’, ‘w’, and even ‘h’.

**kwargs

Arbitrary keyword arguments

Notes

This class is highly experimental and will need additional tuning.

The discount function can be passed as a callable function. It should expect an integer as its only argument and return a float, ideally less than or equal to 1.0. The argument represents the degree of discounting to apply.
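To make the two customization points concrete, here is a small sketch. Both functions are hypothetical: inverse_log_discount is one possible callable that fits the contract just described (integer in, float no greater than 1.0 out), and coda_start illustrates the coda rule from the discount_from description using a 0-based index and case-insensitive matching, which may not match the library's internal convention.

```python
import math
from typing import Optional


def inverse_log_discount(degree: int) -> float:
    """Hypothetical discount: 1.0 at degree 0, shrinking slowly as degree grows."""
    return 1.0 / (1.0 + math.log1p(degree))


def coda_start(word: str, vowels: str = 'aeiou') -> Optional[int]:
    """Return the 0-based index of the first non-vowel after the first vowel."""
    seen_vowel = False
    for index, char in enumerate(word.lower()):
        if char in vowels:
            seen_vowel = True
        elif seen_vowel:
            return index
    return None  # the word has no coda under this rule


print([round(inverse_log_discount(n), 3) for n in range(4)])
print(coda_start('Niall'), coda_start('cat'))

# Per the signature above, a callable would be supplied as:
# cmp = DiscountedLevenshtein(discount_func=inverse_log_discount)
```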

Added in version 0.4.1.

dist(src: str, tar: str) → float¶

Return the normalized Levenshtein distance between two strings.

The discounted Levenshtein distance (computed in whichever mode is selected) is normalized by dividing it by the term computed by the normalizer function (max by default). Because edit costs are discounted, this term is generally smaller than the length of the longer string: in the examples for this class, 'cat' and 'hat' have an absolute distance of 1 but a normalized distance of about 0.351 rather than 1/3.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: The normalized Levenshtein distance between src & tar

Examples

>>> cmp = DiscountedLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3513958291799864
>>> cmp.dist('Niall', 'Neil')
0.5909885886270658
>>> cmp.dist('aluminum', 'Catalan')
0.8348163322045603
>>> cmp.dist('ATCG', 'TAGC')
0.7217609721523955

Added in version 0.4.1.

dist_abs(src: str, tar: str) → float¶

Return the Levenshtein distance between two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: The discounted Levenshtein distance between src & tar

Examples

>>> cmp = DiscountedLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2.526064024369237
>>> cmp.dist_abs('aluminum', 'Catalan')
5.053867269967515
>>> cmp.dist_abs('ATCG', 'TAGC')
2.594032108779918

>>> cmp = DiscountedLevenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
1.7482385137517997
>>> cmp.dist_abs('ACTG', 'TAGC')
3.342270622531718

Added in version 0.4.1.

`_distance`¶

abydos.distance._distance.

The distance._distance module implements abstract class _Distance.

`_double_metaphone`¶

abydos.phonetic._double_metaphone.

Double Metaphone

class distances._double_metaphone.DoubleMetaphone(max_length: int = -1)¶

Bases: _Phonetic

Double Metaphone.

Based on Lawrence Philips’ (Visual) C++ code from 1999 :cite:`Philips:2000`.

Added in version 0.3.6.

Methods

`encode`(word)	Return the Double Metaphone code for a word.
`encode_alpha`(word)	Return the alphabetic Double Metaphone code for a word.

__init__(max_length: int = -1) → None¶

Initialize DoubleMetaphone instance.

Parameters:

max_length (int): Maximum length of the returned Double Metaphone code; this also activates the fixed-length code mode if it is greater than 0

Added in version 0.4.0.

encode(word: str) → str¶

Return the Double Metaphone code for a word.

Parameters:

word (str): The word to transform

Returns:

str: The Double Metaphone value(s)

Examples

>>> pe = DoubleMetaphone()
>>> pe.encode('Christopher')
'KRSTFR,'
>>> pe.encode('Niall')
'NL,'
>>> pe.encode('Smith')
'SM0,XMT'
>>> pe.encode('Schmidt')
'XMT,SMT'

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Made return a str only (comma-separated)
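Since the result is a single comma-separated string, callers that want the primary and secondary codes separately need to split it. A minimal sketch, using only encoded values copied from the examples above; the helper names are hypothetical.

```python
from typing import Set


def split_codes(encoded: str) -> Set[str]:
    """Split a 'primary,secondary' string and drop an empty secondary code."""
    return {code for code in encoded.split(',') if code}


def codes_overlap(encoded_a: str, encoded_b: str) -> bool:
    """Return True if the two encodings share at least one code."""
    return bool(split_codes(encoded_a) & split_codes(encoded_b))


print(codes_overlap('SM0,XMT', 'XMT,SMT'))  # Smith vs. Schmidt: True
print(codes_overlap('KRSTFR,', 'NL,'))      # Christopher vs. Niall: False
```

Treating any shared code as a match is one common way to use the two encodings; stricter policies (primary against primary only) are equally possible.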

encode_alpha(word: str) → str¶

Return the alphabetic Double Metaphone code for a word.

Parameters:

word (str): The word to transform

Returns:

str: The alphabetic Double Metaphone value(s)

Examples

>>> pe = DoubleMetaphone()
>>> pe.encode_alpha('Christopher')
'KRSTFR,'
>>> pe.encode_alpha('Niall')
'NL,'
>>> pe.encode_alpha('Smith')
'SMÞ,XMT'
>>> pe.encode_alpha('Schmidt')
'XMT,SMT'

Added in version 0.4.0.

Changed in version 0.6.0: Made return a str only (comma-separated)

`_editex`¶

abydos.distance._editex.

Editex

class distances._editex.Editex(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any)¶

Bases: _Distance

Editex.

As described on pages 3 & 4 of :cite:`Zobel:1996`.

The local variant is based on :cite:`Ring:2009`.

Added in version 0.3.6.

Changed in version 0.4.0: Added taper option

Methods

`dist`(src, tar)	Return the normalized Editex distance between two strings.
`dist_abs`(src, tar)	Return the Editex distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any) → None¶

Initialize Editex instance.

Parameters:

cost (tuple): A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch, respectively (by default: (0, 1, 2))
local (bool): If True, the local variant of Editex is used
taper (bool): Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized Editex distance between two strings.

The Editex distance is normalized by dividing it by the length of the longer of src & tar multiplied by the mismatch cost. With the default costs (0, 1, 2), the divisor is twice the length of the longer string; for example, 'cat' and 'hat' have an Editex distance of 2, giving 2 / 6.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: Normalized Editex distance

Examples

>>> cmp = Editex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.2
>>> cmp.dist('aluminum', 'Catalan')
0.75
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) → float¶

Return the Editex distance between two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

int: Editex distance

Examples

>>> cmp = Editex()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
12
>>> cmp.dist_abs('ATCG', 'TAGC')
6

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_fuzzywuzzy_partial_string`¶

abydos.distance._fuzzywuzzy_partial_string.

FuzzyWuzzy Partial String similarity

class distances._fuzzywuzzy_partial_string.FuzzyWuzzyPartialString(**kwargs: Any)¶

Bases: _Distance

FuzzyWuzzy Partial String similarity.

This follows the FuzzyWuzzy Partial String similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the FuzzyWuzzy Partial String similarity of two strings.

sim(src: str, tar: str) → float¶

Return the FuzzyWuzzy Partial String similarity of two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: FuzzyWuzzy Partial String similarity

Examples

>>> cmp = FuzzyWuzzyPartialString()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.75
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.428571428571
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.

`_fuzzywuzzy_token_set`¶

abydos.distance._fuzzywuzzy_token_set.

FuzzyWuzzy Token Set similarity

class distances._fuzzywuzzy_token_set.FuzzyWuzzyTokenSet(tokenizer: _Tokenizer | None = None, **kwargs: Any)¶

Bases: _TokenDistance

FuzzyWuzzy Token Set similarity.

This follows the FuzzyWuzzy Token Set similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the FuzzyWuzzy Token Set similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) → None¶

Initialize FuzzyWuzzyTokenSet instance.

Parameters:

tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the FuzzyWuzzy Token Set similarity of two strings.

Parameters:

src (str): Source string (or QGrams/Counter objects) for comparison
tar (str): Target string (or QGrams/Counter objects) for comparison

Returns:

float: FuzzyWuzzy Token Set similarity

Examples

>>> cmp = FuzzyWuzzyTokenSet()
>>> cmp.sim('cat', 'hat')
0.75
>>> cmp.sim('Niall', 'Neil')
0.7272727272727273
>>> cmp.sim('aluminum', 'Catalan')
0.47058823529411764
>>> cmp.sim('ATCG', 'TAGC')
0.6

Added in version 0.4.0.

`_fuzzywuzzy_token_sort`¶

abydos.distance._fuzzywuzzy_token_sort.

FuzzyWuzzy Token Sort similarity

class distances._fuzzywuzzy_token_sort.FuzzyWuzzyTokenSort(tokenizer: _Tokenizer | None = None, **kwargs: Any)¶

Bases: _TokenDistance

FuzzyWuzzy Token Sort similarity.

This follows the FuzzyWuzzy Token Sort similarity algorithm :cite:`Cohen:2011`. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].
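The token-sort idea can be sketched with the standard library alone: split each string into tokens, sort them, rejoin, and compare the results. This sketch assumes letter-only tokens (mirroring the default tokenizer described for this class) and uses difflib.SequenceMatcher.ratio() as the string comparison, which may differ from what the library uses internally. The name token_sort_sim is hypothetical.

```python
import re
from difflib import SequenceMatcher


def token_sort_sim(src: str, tar: str) -> float:
    """Sort each string's letter-only tokens, rejoin them, and compare the results."""
    src_sorted = ' '.join(sorted(re.findall(r'[A-Za-z]+', src)))
    tar_sorted = ' '.join(sorted(re.findall(r'[A-Za-z]+', tar)))
    # ratio() is 2 * matches / total length, a float in [0.0, 1.0].
    return SequenceMatcher(None, src_sorted, tar_sorted).ratio()


# Word order no longer matters once the tokens are sorted.
print(token_sort_sim('new york mets', 'mets new york'))
print(token_sort_sim('cat', 'hat'))
```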

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the FuzzyWuzzy Token Sort similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, **kwargs: Any) → None¶

Initialize FuzzyWuzzyTokenSort instance.

Parameters:

tokenizer (_Tokenizer): A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qval (int): The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the FuzzyWuzzy Token Sort similarity of two strings.

Parameters:

src (str): Source string (or QGrams/Counter objects) for comparison
tar (str): Target string (or QGrams/Counter objects) for comparison

Returns:

float: FuzzyWuzzy Token Sort similarity

Examples

>>> cmp = FuzzyWuzzyTokenSort()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6666666666666666
>>> cmp.sim('aluminum', 'Catalan')
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.

`_hamming`¶

abydos.distance._hamming.

Hamming distance

class distances._hamming.Hamming(diff_lens: bool = True, **kwargs: Any)¶

Bases: _Distance

Hamming distance.

Hamming distance :cite:`Hamming:1950` equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings’ lengths and adds to this the difference in string lengths.
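The behaviour described above, including the handling of unequal lengths, can be sketched in a few lines. The function names are illustrative and the error message is abbreviated; this is not the library's code.

```python
def hamming(src: str, tar: str, diff_lens: bool = True) -> int:
    """Count differing positions, plus the length difference when diff_lens is True."""
    if not diff_lens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length')
    # zip stops at the shorter string, so only the first n positions are compared.
    mismatches = sum(c1 != c2 for c1, c2 in zip(src, tar))
    return mismatches + abs(len(src) - len(tar))


def hamming_norm(src: str, tar: str, diff_lens: bool = True) -> float:
    """Divide by the longer string's length (0.0 for two empty strings)."""
    longest = max(len(src), len(tar))
    return hamming(src, tar, diff_lens) / longest if longest else 0.0


print(hamming('Niall', 'Neil'), hamming_norm('Niall', 'Neil'))
```

For 'Niall' and 'Neil', two of the first four positions differ and the lengths differ by one, giving 3.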

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return the normalized Hamming distance between two strings.
`dist_abs`(src, tar)	Return the Hamming distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(diff_lens: bool = True, **kwargs: Any) → None¶

Initialize Hamming instance.

Parameters:

diff_lens (bool): If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized Hamming distance between two strings.

Hamming distance normalized to the interval [0, 1].

The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).

The arguments are identical to those of the hamming() function.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

float: Normalized Hamming distance

Examples

>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) → float¶

Return the Hamming distance between two strings.

Parameters:

src (str): Source string for comparison
tar (str): Target string for comparison

Returns:

int: The Hamming distance between src & tar

Raises:

ValueError: Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.

Examples

>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_indel`¶

abydos.distance._indel.

Indel distance

class distances._indel.Indel(**kwargs: Any)¶

Bases: Levenshtein

Indel distance.

This is equivalent to Levenshtein distance, when only inserts and deletes are possible.

Added in version 0.3.6.

Methods

`alignment`(src, tar)	Return the Levenshtein alignment of two strings.
`dist`(src, tar)	Return the normalized indel distance between two strings.
`dist_abs`(src, tar)	Return the Levenshtein distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(**kwargs: Any) → None¶

Initialize Indel instance.

Parameters:

**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized indel distance between two strings.

This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.
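When only inserts and deletes are allowed, the distance can be derived from the longest common subsequence: every character outside it must be deleted from one string or inserted from the other. The sketch below is a standalone illustration, not the library's code. The names `lcs_length`, `indel_abs`, and `indel_norm` are hypothetical, and dividing by the sum of the two lengths is an assumption that appears to reproduce the example values in this section.

```python
def lcs_length(src: str, tar: str) -> int:
    """Length of the longest common subsequence (rolling-row DP)."""
    prev = [0] * (len(tar) + 1)
    for s_char in src:
        curr = [0]
        for j, t_char in enumerate(tar, start=1):
            if s_char == t_char:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_abs(src: str, tar: str) -> int:
    """Characters outside the LCS are each deleted or inserted once."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_norm(src: str, tar: str) -> float:
    """Assumed normalization: divide by the combined length."""
    total = len(src) + len(tar)
    return indel_abs(src, tar) / total if total else 0.0
```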

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Normalized indel distance

Examples

>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5

Added in version 0.3.6.

`_iterative_substring`¶

abydos.distance._iterative_substring.

Iterative-SubString (I-Sub) correlation

class distances._iterative_substring.IterativeSubString(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any)¶

Bases: _Distance

Iterative-SubString correlation.

Iterative-SubString (I-Sub) correlation :cite:`Stoilos:2005`

This is a straightforward port of the primary author’s Java implementation: http://www.image.ece.ntua.gr/~gstoil/software/I_Sub.java

Added in version 0.4.0.

Methods

`corr`(src, tar)	Return the Iterative-SubString correlation of two strings.
`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Iterative-SubString similarity of two strings.

__init__(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any) → None¶

Initialize IterativeSubString instance.

Parameters:

hamacherfloat: The constant factor for the Hamacher product
normalize_stringsbool: Normalize the strings by removing the characters in ‘._ ’ and lowercasing
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

corr(src: str, tar: str) → float¶

Return the Iterative-SubString correlation of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Iterative-SubString correlation

Examples

>>> cmp = IterativeSubString()
>>> cmp.corr('cat', 'hat')
-1.0
>>> cmp.corr('Niall', 'Neil')
-0.9
>>> cmp.corr('aluminum', 'Catalan')
-1.0
>>> cmp.corr('ATCG', 'TAGC')
-1.0

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Iterative-SubString similarity of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Iterative-SubString similarity

Examples

>>> cmp = IterativeSubString()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.04999999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

`_kuhns_iii`¶

abydos.distance._kuhns_iii.

Kuhns III correlation

Bases: _TokenDistance

Kuhns III correlation.

For two sets X and Y and a population N, Kuhns III correlation :cite:`Kuhns:1965`, the excess of proportion of overlap over its independence value (P), is

\[corr_{KuhnsIII}(X, Y) = \frac{\delta(X, Y)}{\big(1-\frac{|X \cap Y|}{|X|+|Y|}\big) \big(|X|+|Y|-\frac{|X|\cdot|Y|}{|N|}\big)}\]

where

\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[corr_{KuhnsIII} = \frac{\delta(a+b, a+c)}{\big(1-\frac{a}{2a+b+c}\big) \big(2a+b+c-\frac{(a+b)(a+c)}{n}\big)}\]

where

\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
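The confusion-table form translates directly into code. The following is a minimal sketch, not the library's implementation: the function name `kuhns_iii_corr` is hypothetical, the zero-denominator guard is an assumption, and the counts used in the check (a = b = c = 2, n = 784) are illustrative values that appear consistent with the `corr('cat', 'hat')` example in this section. The population size the library uses by default is not stated here.

```python
def kuhns_iii_corr(a: float, b: float, c: float, n: float) -> float:
    """Kuhns III correlation from 2x2 table counts.

    a = tokens in both X and Y, b = tokens only in X,
    c = tokens only in Y, n = population size.
    """
    x_size, y_size = a + b, a + c
    expected = x_size * y_size / n   # overlap expected under independence
    delta = a - expected             # excess of observed overlap
    denom = (1 - a / (x_size + y_size)) * (x_size + y_size - expected)
    return delta / denom if denom else 0.0  # assumption: 0.0 when undefined


# Illustrative counts (assumed, see above): roughly 0.3308
print(kuhns_iii_corr(2, 2, 2, 784))
```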

Methods

`corr`(src, tar)	Return the Kuhns III correlation of two strings.
`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Kuhns III similarity of two strings.

Notes

The coefficient presented in :cite:`Eidenberger:2014,Morris:2012` as Kuhns’ “Proportion of overlap above independence” is a significantly different coefficient, not evidenced in :cite:`Kuhns:1965`.

Added in version 0.4.0.

Initialize KuhnsIII instance.

Parameters:

alphabetCounter, collection, int, or None: This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

corr(src: str, tar: str) → float¶

Return the Kuhns III correlation of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Kuhns III correlation

Examples

>>> cmp = KuhnsIII()
>>> cmp.corr('cat', 'hat')
0.3307757885763001
>>> cmp.corr('Niall', 'Neil')
0.21873141468207793
>>> cmp.corr('aluminum', 'Catalan')
0.05707545392902886
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Kuhns III similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Kuhns III similarity

Examples

>>> cmp = KuhnsIII()
>>> cmp.sim('cat', 'hat')
0.498081841432225
>>> cmp.sim('Niall', 'Neil')
0.41404856101155846
>>> cmp.sim('aluminum', 'Catalan')
0.29280659044677165
>>> cmp.sim('ATCG', 'TAGC')
0.24760076775431863

Added in version 0.4.0.

`_lcprefix`¶

abydos.distance._lcprefix.

Longest common prefix

class distances._lcprefix.LCPrefix(**kwargs: Any)¶

Bases: _Distance

Longest common prefix.

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar, *args)	Return the length of the longest common prefix of the strings.
`lcprefix`(strings)	Return the longest common prefix of a list of strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar, *args)	Return the longest common prefix similarity of two or more strings.

dist_abs(src: str, tar: str, *args: str) → int¶

Return the length of the longest common prefix of the strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison
*argsstrs: Additional strings for comparison

Returns:

int: The length of the longest common prefix

Raises:

ValueError: All arguments must be of type str

Examples

>>> pfx = LCPrefix()
>>> pfx.dist_abs('cat', 'hat')
0
>>> pfx.dist_abs('Niall', 'Neil')
1
>>> pfx.dist_abs('aluminum', 'Catalan')
0
>>> pfx.dist_abs('ATCG', 'TAGC')
0

Added in version 0.4.0.

lcprefix(strings: List[str]) → str¶

Return the longest common prefix of a list of strings.

Longest common prefix (LCPrefix).

Parameters:

stringslist of strings: Strings for comparison

Returns:

str: The longest common prefix

Examples

>>> pfx = LCPrefix()
>>> pfx.lcprefix(['cat', 'hat'])
''
>>> pfx.lcprefix(['Niall', 'Neil'])
'N'
>>> pfx.lcprefix(['aluminum', 'Catalan'])
''
>>> pfx.lcprefix(['ATCG', 'TAGC'])
''

Added in version 0.4.0.

sim(src: str, tar: str, *args: str) → float¶

Return the longest common prefix similarity of two or more strings.

Longest common prefix similarity ($sim_{LCPrefix}$).

This employs the LCPrefix function to derive a similarity metric: $sim_{LCPrefix}(s,t) = \frac{|LCPrefix(s,t)|}{max(|s|, |t|)}$
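A minimal standalone sketch of the prefix search and the similarity derived from it follows. It is not the library's code: the function names are hypothetical, and both the use of the longest of all supplied strings as the normalizer (for more than two inputs) and the value 1.0 for all-empty input are assumptions.

```python
from typing import List


def lcprefix(strings: List[str]) -> str:
    """Return the longest prefix shared by every string in the list."""
    if not strings:
        return ""
    prefix = []
    # zip(*strings) yields one tuple of characters per shared position
    for chars in zip(*strings):
        if len(set(chars)) != 1:
            break
        prefix.append(chars[0])
    return "".join(prefix)


def lcprefix_sim(src: str, tar: str, *args: str) -> float:
    """Prefix length divided by the length of the longest input string."""
    strings = [src, tar, *args]
    longest = max(len(s) for s in strings)
    if longest == 0:
        return 1.0  # assumption: empty strings are treated as identical
    return len(lcprefix(strings)) / longest
```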

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison
*argsstrs: Additional strings for comparison

Returns:

float: LCPrefix similarity

Examples

>>> pfx = LCPrefix()
>>> pfx.sim('cat', 'hat')
0.0
>>> pfx.sim('Niall', 'Neil')
0.2
>>> pfx.sim('aluminum', 'Catalan')
0.0
>>> pfx.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.

`_lcsseq`¶

abydos.distance._lcsseq.

Longest common subsequence

class distances._lcsseq.LCSseq(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)¶

Bases: _Distance

Longest common subsequence.

Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`lcsseq`(src, tar)	Return the longest common subsequence of two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the longest common subsequence similarity of two strings.

__init__(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any) → None¶

Initialize LCSseq.

Parameters:

normalizerfunction: A normalization function for the normalized similarity & distance. By default, the max of the lengths of the input strings. If lambda x: sum(x)/2.0 is supplied, the normalization proposed in :cite:`Radev:2001` is used, i.e. $\frac{2 \cdot |LCS(src, tar)|}{|src| + |tar|}$.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.
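To make the effect of the normalizer concrete, the sketch below divides a known subsequence length by whatever the normalizer returns for the two string lengths. It is an illustration only: the helper name `lcsseq_sim_from_length` is hypothetical, the LCS length is passed in rather than computed, and returning 1.0 when the normalizer yields zero is an assumption.

```python
from typing import Callable, List


def lcsseq_sim_from_length(
    lcs_len: int,
    src: str,
    tar: str,
    normalizer: Callable[[List[float]], float] = max,
) -> float:
    """Divide an LCS length by normalizer([len(src), len(tar)])."""
    norm = normalizer([len(src), len(tar)])
    return lcs_len / norm if norm else 1.0  # assumption for empty input


# 'Niall' and 'Neil' share the subsequence 'Nil' (length 3)
print(lcsseq_sim_from_length(3, 'Niall', 'Neil'))  # default: 3 / max(5, 4)
print(lcsseq_sim_from_length(3, 'Niall', 'Neil', lambda x: sum(x) / 2.0))
```

With the second normalizer the result equals 2 * 3 / (5 + 4), which is the form attributed to Radev above.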

lcsseq(src: str, tar: str) → str¶

Return the longest common subsequence of two strings.

Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence :cite:`rosettacode:2018b`. This is licensed GFDL 1.2.

Modifications include: conversion to a numpy array in place of a list of lists
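For orientation, the dynamic programming approach can be sketched as follows. This version is a simplified illustration that uses a plain list of lists rather than a numpy array, and the function name `lcsseq` is reused here only for readability. When several subsequences of maximal length exist, the one returned depends on tie-breaking during the walk back, so it may differ from the library's result.

```python
def lcsseq(src: str, tar: str) -> str:
    """Return one longest common subsequence of src and tar."""
    rows, cols = len(src) + 1, len(tar) + 1
    # lengths[i][j] = LCS length of src[:i] and tar[:j]
    lengths = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if src[i - 1] == tar[j - 1]:
                lengths[i][j] = lengths[i - 1][j - 1] + 1
            else:
                lengths[i][j] = max(lengths[i - 1][j], lengths[i][j - 1])

    # Walk back from the bottom-right corner, collecting matched characters
    result = []
    i, j = len(src), len(tar)
    while i > 0 and j > 0:
        if src[i - 1] == tar[j - 1]:
            result.append(src[i - 1])
            i -= 1
            j -= 1
        elif lengths[i - 1][j] >= lengths[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))
```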

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

str: The longest common subsequence

Examples

>>> sseq = LCSseq()
>>> sseq.lcsseq('cat', 'hat')
'at'
>>> sseq.lcsseq('Niall', 'Neil')
'Nil'
>>> sseq.lcsseq('aluminum', 'Catalan')
'aln'
>>> sseq.lcsseq('ATCG', 'TAGC')
'AC'

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

sim(src: str, tar: str) → float¶

Return the longest common subsequence similarity of two strings.

Longest common subsequence similarity ($sim_{LCSseq}$).

This employs the LCSseq function to derive a similarity metric: $sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}$

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: LCSseq similarity

Examples

>>> sseq = LCSseq()
>>> sseq.sim('cat', 'hat')
0.6666666666666666
>>> sseq.sim('Niall', 'Neil')
0.6
>>> sseq.sim('aluminum', 'Catalan')
0.375
>>> sseq.sim('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.4.0: Added normalization option

`_levenshtein`¶

abydos.distance._levenshtein.

The distance._levenshtein module implements string edit distance functions based on Levenshtein distance, including:

Levenshtein distance

Optimal String Alignment distance

class distances._levenshtein.Levenshtein(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any)¶

Bases: _Distance

Levenshtein distance.

This is the standard edit distance measure. Cf. :cite:`Levenshtein:1965,Levenshtein:1966`.

Optimal string alignment (aka restricted Damerau-Levenshtein distance) :cite:`Boytsov:2011` is also supported.

The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm :cite:`Wagner:1974`.

Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.
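The Wagner-Fischer table fill, with the extra transposition step used for Optimal String Alignment, can be sketched as below. This is a plain-Python illustration under stated assumptions rather than the library's code: the function name `levenshtein` is hypothetical, the cost tuple is read in the order (insert, delete, substitute, transpose), and tapering is not modelled.

```python
from typing import List, Tuple


def levenshtein(
    src: str,
    tar: str,
    mode: str = "lev",
    cost: Tuple[float, float, float, float] = (1, 1, 1, 1),
) -> float:
    """Edit distance via the Wagner-Fischer table; mode is 'lev' or 'osa'."""
    ins_cost, del_cost, sub_cost, trans_cost = cost
    rows, cols = len(src) + 1, len(tar) + 1
    # d[i][j] = cheapest way to turn src[:i] into tar[:j]
    d: List[List[float]] = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + (0 if src[i - 1] == tar[j - 1] else sub_cost),
            )
            # OSA only: allow swapping two adjacent characters in one step
            if (
                mode == "osa"
                and i > 1
                and j > 1
                and src[i - 1] == tar[j - 2]
                and src[i - 2] == tar[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans_cost)
    return d[-1][-1]
```

Because the OSA step only looks back two cells, a substring that has already been transposed cannot be edited again, which is why `'ACTG'` to `'TAGC'` still costs 4 in that mode.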

Added in version 0.3.6.

Changed in version 0.4.0: Added taper option

Methods

`alignment`(src, tar)	Return the Levenshtein alignment of two strings.
`dist`(src, tar)	Return the normalized Levenshtein distance between two strings.
`dist_abs`(src, tar)	Return the Levenshtein distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any) → None¶

Initialize Levenshtein instance.

Parameters:

modestr

Specifies a mode for computing the Levenshtein distance:

lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions

osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once

costtuple

A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))

normalizerfunction

A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.

taperbool

Enables cost tapering. Following :cite:`Zobel:1996`, it causes edits at the start of the string to “just [exceed] twice the minimum penalty for replacement or deletion at the end of the string”.

**kwargs

Arbitrary keyword arguments

Added in version 0.4.0.

alignment(src: str, tar: str) → Tuple[float, str, str]¶

Return the Levenshtein alignment of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

tuple: A tuple containing the Levenshtein distance and the two strings, aligned.

Examples

>>> cmp = Levenshtein()
>>> cmp.alignment('cat', 'hat')
(1.0, 'cat', 'hat')
>>> cmp.alignment('Niall', 'Neil')
(3.0, 'N-iall', 'Nei-l-')
>>> cmp.alignment('aluminum', 'Catalan')
(7.0, '-aluminum', 'Catalan--')
>>> cmp.alignment('ATCG', 'TAGC')
(3.0, 'ATCG-', '-TAGC')

>>> cmp = Levenshtein(mode='osa')
>>> cmp.alignment('ATCG', 'TAGC')
(2.0, 'ATCG', 'TAGC')
>>> cmp.alignment('ACTG', 'TAGC')
(4.0, 'ACT-G-', '--TAGC')

Added in version 0.4.1.

dist(src: str, tar: str) → float¶

Return the normalized Levenshtein distance between two strings.

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have $cost = 1$, this is equivalent to the greater of the length of the two strings src & tar.
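The normalization term can be written out explicitly. The helper below is a hypothetical sketch (the name `normalize_levenshtein` is not part of the library): it takes an already computed absolute distance and assumes the cost tuple is ordered (insert, delete, substitute, transpose), as described for the `cost` parameter. Returning 0.0 for two empty strings is an assumption.

```python
from typing import Tuple


def normalize_levenshtein(
    dist_abs: float,
    src: str,
    tar: str,
    cost: Tuple[float, float, float, float] = (1, 1, 1, 1),
) -> float:
    """Scale an absolute edit distance into the interval [0, 1]."""
    ins_cost, del_cost, _sub_cost, _trans_cost = cost
    # Worst case: delete all of src or insert all of tar, whichever costs more
    norm = max(len(src) * del_cost, len(tar) * ins_cost)
    return dist_abs / norm if norm else 0.0  # assumption for empty input


print(normalize_levenshtein(7, 'aluminum', 'Catalan'))  # 7 / 8
```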

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: The normalized Levenshtein distance between src & tar

Examples

>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) → float¶

Return the Levenshtein distance between two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

int (may return a float if cost has float values): The Levenshtein distance between src & tar

Examples

>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3

>>> cmp = Levenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('ACTG', 'TAGC')
4

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

sim(src: str, tar: str)¶

Return similarity.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Similarity

Added in version 0.3.6.

`_lig3`¶

abydos.distance._lig3.

LIG3 similarity

class distances._lig3.LIG3(**kwargs: Any)¶

Bases: _Distance

LIG3 similarity.

:cite:`Snae:2002` proposes three Levenshtein-ISG-Guth hybrid similarity measures: LIG1, LIG2, and LIG3. Of these, LIG1 is identical to ISG and LIG2 is identical to normalized Levenshtein similarity. Only LIG3 is a novel measure, defined as:

\[sim_{LIG3}(X, Y) = \frac{2I}{2I+C}\]

Here, I is the number of exact matches between the two words, truncated to the length of the shorter word, and C is the Levenshtein distance between the two words.
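A small standalone sketch of the formula, assuming unit-cost Levenshtein distance for C and position-by-position matches for I. The function names are hypothetical, and returning 1.0 for identical strings (which also avoids 0/0 for empty input) is an assumption.

```python
def levenshtein(src: str, tar: str) -> int:
    """Unit-cost Levenshtein distance using two rolling rows."""
    prev = list(range(len(tar) + 1))
    for i, s_char in enumerate(src, start=1):
        curr = [i]
        for j, t_char in enumerate(tar, start=1):
            curr.append(
                min(
                    prev[j] + 1,                        # delete from src
                    curr[j - 1] + 1,                    # insert into src
                    prev[j - 1] + (s_char != t_char),   # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def lig3_sim(src: str, tar: str) -> float:
    """LIG3 similarity: 2I / (2I + C)."""
    if src == tar:
        return 1.0  # assumption: identical strings are fully similar
    # zip() truncates to the shorter word, as the definition of I requires
    exact = sum(a == b for a, b in zip(src, tar))
    edits = levenshtein(src, tar)
    return 2 * exact / (2 * exact + edits)
```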

Added in version 0.4.1.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the LIG3 similarity of two words.

sim(src: str, tar: str) → float¶

Return the LIG3 similarity of two words.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: The LIG3 similarity

Examples

>>> cmp = LIG3()
>>> cmp.sim('cat', 'hat')
0.8
>>> cmp.sim('Niall', 'Neil')
0.5714285714285714
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.1.

`_ncd_bz2`¶

abydos.distance._ncd_bz2.

NCD using bzip2

class distances._ncd_bz2.NCDbz2(level: int = 9, **kwargs: Any)¶

Bases: _Distance

Normalized Compression Distance using bzip2 compression.

Cf. https://en.wikipedia.org/wiki/Bzip2

Normalized compression distance (NCD) :cite:`Cilibrasi:2005`.
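As a rough illustration, NCD is commonly written as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The sketch below applies that formula with Python's standard `bz2` module. It is not the library's implementation: the function name `ncd_bz2` is hypothetical, UTF-8 encoding and taking the smaller of the two concatenation orders are assumptions, and the values it produces need not match the examples in this section, since details such as header handling may differ.

```python
import bz2


def ncd_bz2(src: str, tar: str, level: int = 9) -> float:
    """Normalized compression distance using bzip2 compressed lengths."""

    def clen(text: str) -> int:
        # Length in bytes of the bzip2-compressed UTF-8 encoding
        return len(bz2.compress(text.encode("utf-8"), level))

    c_src, c_tar = clen(src), clen(tar)
    # Try both concatenation orders so the measure is symmetric
    c_concat = min(clen(src + tar), clen(tar + src))
    return (c_concat - min(c_src, c_tar)) / max(c_src, c_tar)
```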

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return the NCD between two strings using bzip2 compression.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(level: int = 9, **kwargs: Any) → None¶

Initialize bzip2 compressor.

Parameters:

levelint: The compression level (1 to 9)

Added in version 0.3.6.

Changed in version 0.3.6: Encapsulated in class

dist(src: str, tar: str) → float¶

Return the NCD between two strings using bzip2 compression.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Compression distance

Examples

>>> cmp = NCDbz2()
>>> cmp.dist('cat', 'hat')
0.06666666666666667
>>> cmp.dist('Niall', 'Neil')
0.03125
>>> cmp.dist('aluminum', 'Catalan')
0.17647058823529413
>>> cmp.dist('ATCG', 'TAGC')
0.03125

Added in version 0.3.5.

Changed in version 0.3.6: Encapsulated in class

`_overlap`¶

abydos.distance._overlap.

Overlap similarity & distance

class distances._overlap.Overlap(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any)¶

Bases: _TokenDistance

Overlap coefficient.

For two sets X and Y, the overlap coefficient :cite:`Szymkiewicz:1934,Simpson:1949`, also called the Szymkiewicz-Simpson coefficient and Simpson’s ecological coexistence coefficient, is

\[sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{overlap} = \frac{a}{min(a+b, a+c)}\]
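The coefficient can be sketched over bigram multisets with `collections.Counter`. This is an illustration only: the helper names are hypothetical, and tokenizing into bigrams padded with `$` and `#` is an assumption based on the q-gram defaults described elsewhere in this reference.

```python
from collections import Counter


def bigrams(word: str, start_stop: str = "$#") -> Counter:
    """Multiset of character bigrams, padded with start and stop symbols."""
    padded = start_stop[0] + word + start_stop[-1]
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def overlap_sim(src: str, tar: str) -> float:
    """Size of the multiset intersection over the size of the smaller multiset."""
    x, y = bigrams(src), bigrams(tar)
    intersection = sum((x & y).values())  # & keeps the minimum count per token
    smaller = min(sum(x.values()), sum(y.values()))
    return intersection / smaller if smaller else 0.0
```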

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the overlap coefficient of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', **kwargs: Any) → None¶

Initialize Overlap instance.

Parameters:

tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the overlap coefficient of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Overlap similarity

Examples

>>> cmp = Overlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_pearson_chi_squared`¶

abydos.distance._pearson_chi_squared.

Pearson’s Chi-Squared similarity

Bases: _TokenDistance

Pearson’s Chi-Squared similarity.

For two sets X and Y and a population N, Pearson’s $\chi^2$ similarity :cite:`Pearson:1913` is

\[sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]

This is also Pearson I similarity.

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{PearsonChiSquared} = \frac{n(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
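The confusion-table form is straightforward to evaluate directly. In the sketch below the function name `pearson_chi_squared` is hypothetical, the zero-denominator guard is an assumption, and the counts in the usage line (a = b = c = 2, d = 778) are illustrative values that appear consistent with the `sim_score('cat', 'hat')` example in this section. The population size the library uses by default is not stated here.

```python
def pearson_chi_squared(a: float, b: float, c: float, d: float) -> float:
    """Pearson's chi-squared statistic for a 2x2 confusion table."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # assumption: undefined tables score 0
    return n * (a * d - b * c) ** 2 / denom


# Illustrative counts (assumed, see above): roughly 193.995
print(pearson_chi_squared(2, 2, 2, 778))
```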

Added in version 0.4.0.

Methods

`corr`(src, tar)	Return Pearson's Chi-Squared correlation of two strings.
`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return Pearson's normalized Chi-Squared similarity of two strings.
`sim_score`(src, tar)	Return Pearson's Chi-Squared similarity of two strings.

Initialize PearsonChiSquared instance.

Parameters:

alphabetCounter, collection, int, or None: This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

corr(src: str, tar: str) → float¶

Return Pearson’s Chi-Squared correlation of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Pearson’s Chi-Squared correlation

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return Pearson’s normalized Chi-Squared similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Normalized Pearson’s Chi-Squared similarity

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05

Added in version 0.4.0.

sim_score(src: str, tar: str) → float¶

Return Pearson’s Chi-Squared similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Pearson’s Chi-Squared similarity

Examples

>>> cmp = PearsonChiSquared()
>>> cmp.sim_score('cat', 'hat')
193.99489809335964
>>> cmp.sim_score('Niall', 'Neil')
101.99771068526542
>>> cmp.sim_score('aluminum', 'Catalan')
9.19249664336649
>>> cmp.sim_score('ATCG', 'TAGC')
0.032298410951138765

Added in version 0.4.0.

`_pearson_ii`¶

abydos.distance._pearson_ii.

Pearson II similarity

Bases: PearsonChiSquared

Pearson II similarity.

For two sets X and Y and a population N, the Pearson II similarity :cite:`Pearson:1913`, Pearson’s coefficient of mean square contingency, is

\[corr_{PearsonII} = \sqrt{\frac{\chi^2}{|N|+\chi^2}}\]

where

\[\chi^2 = sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[\chi^2 = sim_{PearsonChiSquared} = \frac{n \cdot (ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
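Combining the two formulas gives a short computation from the four table counts. As before, this is a sketch under assumptions: the function name `pearson_ii` is hypothetical, degenerate tables are mapped to 0.0, and the counts in the usage line (a = b = c = 2, d = 778) are illustrative values that appear consistent with the `sim_score('cat', 'hat')` example in this section.

```python
import math


def pearson_ii(a: float, b: float, c: float, d: float) -> float:
    """Pearson's coefficient of mean square contingency for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0  # assumption: undefined tables score 0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.sqrt(chi2 / (n + chi2))


# Illustrative counts (assumed, see above): roughly 0.4454
print(pearson_ii(2, 2, 2, 778))
```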

Added in version 0.4.0.

Methods

`corr`(src, tar)	Return Pearson's Chi-Squared correlation of two strings.
`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the normalized Pearson II similarity of two strings.
`sim_score`(src, tar)	Return the Pearson II similarity of two strings.

Initialize PearsonII instance.

Parameters:

alphabetCounter, collection, int, or None: This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the normalized Pearson II similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Normalized Pearson II similarity

Examples

>>> cmp = PearsonII()
>>> cmp.sim('cat', 'hat')
0.6298568508557214
>>> cmp.sim('Niall', 'Neil')
0.47983719547968123
>>> cmp.sim('aluminum', 'Catalan')
0.15214891090821628
>>> cmp.sim('ATCG', 'TAGC')
0.009076921903905551

Added in version 0.4.0.

sim_score(src: str, tar: str) → float¶

Return the Pearson II similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Pearson II similarity

Examples

>>> cmp = PearsonII()
>>> cmp.sim_score('cat', 'hat')
0.44537605041688455
>>> cmp.sim_score('Niall', 'Neil')
0.3392961347892176
>>> cmp.sim_score('aluminum', 'Catalan')
0.10758552665334761
>>> cmp.sim_score('ATCG', 'TAGC')
0.006418353030552324

Added in version 0.4.0.

`_phonetic`¶

abydos.phonetic._phonetic.

The phonetic._phonetic module implements abstract class Phonetic.

`_phonetic_distance`¶

abydos.distance._phonetic_distance.

Phonetic distance.

Bases: _Distance

Phonetic distance.

Phonetic distance applies one or more supplied string transformations to words and compares the resulting transformed strings using a supplied distance measure.

A simple example would be to create a ‘Soundex distance’:

>>> from abydos.phonetic import Soundex
>>> soundex = PhoneticDistance(transforms=Soundex())
>>> soundex.dist('Ashcraft', 'Ashcroft')
0.0
>>> soundex.dist('Robert', 'Ashcraft')
1.0
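The same transform-then-compare idea can be sketched without the library. Everything in the block below is hypothetical: `phonetic_dist_abs` is an illustrative stand-in for the class, and `drop_vowels` is a toy transform used in place of a real phonetic algorithm such as Soundex.

```python
from typing import Callable, Iterable, Optional

Transform = Callable[[str], str]


def drop_vowels(word: str) -> str:
    """Toy transform: keep the first letter, drop later vowels, uppercase."""
    head, tail = word[:1], word[1:]
    return (head + "".join(ch for ch in tail if ch.lower() not in "aeiou")).upper()


def phonetic_dist_abs(
    src: str,
    tar: str,
    transforms: Iterable[Transform] = (),
    metric: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Apply each transform to both words in turn, then compare the results."""
    for transform in transforms:
        src, tar = transform(src), transform(tar)
    if metric is None:
        # No metric supplied: compare for identity
        return 0 if src == tar else 1
    return metric(src, tar)


print(phonetic_dist_abs('Ashcraft', 'Ashcroft', [drop_vowels]))  # both become 'ASHCRFT'
```

Passing a `metric` replaces the identity test with a graded comparison of the transformed strings.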

Added in version 0.4.1.

Methods

`dist`(src, tar)	Return the normalized Phonetic distance.
`dist_abs`(src, tar)	Return the Phonetic distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

Initialize PhoneticDistance instance.

Parameters:

transformslist or _Phonetic or _Stemmer or _Fingerprint or type: An instance of a subclass of _Phonetic, _Stemmer, or _Fingerprint, or a list (or other iterable) of such instances to apply to each input word before computing their distance or similarity. If omitted, no transformations will be performed.
metric_Distance or type: An instance of a subclass of _Distance, used for computing the inputs’ distance or similarity after being transformed. If omitted, the strings will be compared for identity (returning 0.0 if identical, otherwise 1.0, when distance is computed).
encode_alphabool: Set to True to use the encode_alpha method of phonetic algorithms whenever possible.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.1.

dist(src: str, tar: str) → float¶

Return the normalized Phonetic distance.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: The normalized Phonetic distance

Examples

>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
1.0

>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist('cat', 'hat')
0.25
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
0.75

Added in version 0.4.1.

dist_abs(src: str, tar: str) → float¶

Return the Phonetic distance.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float or int: The Phonetic distance

Examples

>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
1

>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
3

Added in version 0.4.1.

`_q_grams`¶

abydos.tokenizer._q_grams.

QGrams multi-set class

class distances._q_grams.QGrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None)¶

Bases: _Tokenizer

A q-gram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.

Added in version 0.1.0.

Methods

`count`()	Return token count.
`count_unique`()	Return the number of unique elements.
`get_counter`()	Return the tokens as a Counter object.
`get_list`()	Return the tokens as an ordered list.
`get_set`()	Return the unique tokens as a set.
`tokenize`(string)	Tokenize the term and store it.

__init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', skip: int | Iterable[int] = 0, scaler: str | Callable[[float], float] | None = None) → None¶

Initialize QGrams.

Parameters:

qvalint or Iterable

The q-gram length (defaults to 2), can be an integer, range object, or list

start_stopstr

A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

skipint or Iterable

The number of characters to skip, can be an integer, range object, or list

scalerNone, str, or function

A scaling function for the Counter:

None : no scaling

‘set’ : All non-zero values are set to 1.

‘length’ : Each token has weight equal to its length.

‘length-log’ : Each token has weight equal to the log of its length + 1.

‘length-exp’ : Each token has weight equal to e raised to its length.

a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.
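
The two least ambiguous scaler options, 'set' and a callable, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name `apply_scaler` is hypothetical and not part of abydos, and the length-based options are deliberately left out because their exact weighting is not spelled out here.

```python
import math
from collections import Counter


def apply_scaler(counts, scaler=None):
    """Rescale token counts (hypothetical helper, not the abydos code)."""
    if scaler is None:
        # No scaling: keep the raw counts.
        return Counter(counts)
    if scaler == "set":
        # Every token that occurs at all gets the value 1.
        return Counter({tok: 1 for tok, n in counts.items() if n})
    if callable(scaler):
        # Apply the function to each value in the Counter.
        return Counter({tok: scaler(n) for tok, n in counts.items()})
    raise ValueError("scaler option not covered by this sketch")


counts = Counter({"AT": 4, "TA": 9, "AA": 1})
as_set = apply_scaler(counts, "set")
as_sqrt = apply_scaler(counts, math.sqrt)
```

Here `as_set` maps every token to 1, while `as_sqrt` holds the square root of each original count.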

Raises:

ValueError: Use WhitespaceTokenizer instead of qval=0.

Examples

>>> qg = QGrams().tokenize('AATTATAT')
>>> qg
QGrams({'$A': 1, 'AA': 1, 'AT': 3, 'TT': 1, 'TA': 2, 'T#': 1})

>>> qg = QGrams(qval=1, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'A': 4, 'T': 4})

>>> qg = QGrams(qval=3, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'AAT': 1, 'ATT': 1, 'TTA': 1, 'TAT': 2, 'ATA': 1})

>>> QGrams(qval=2, start_stop='$#').tokenize('interning')
QGrams({'$i': 1, 'in': 2, 'nt': 1, 'te': 1, 'er': 1, 'rn': 1, 'ni': 1,
'ng': 1, 'g#': 1})

>>> QGrams(start_stop='', skip=1).tokenize('AACTAGAAC')
QGrams({'AC': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'AA': 1, 'GA': 1, 'A': 1})

>>> QGrams(start_stop='', skip=[0, 1]).tokenize('AACTAGAAC')
QGrams({'AA': 3, 'AC': 4, 'CT': 1, 'TA': 1, 'AG': 1, 'GA': 2, 'AT': 1,
'CA': 1, 'TG': 1, 'A': 1})

>>> QGrams(qval=range(3), skip=[0, 1]).tokenize('interdisciplinarian')
QGrams({'i': 10, 'n': 7, 't': 2, 'e': 2, 'r': 4, 'd': 2, 's': 2,
'c': 2, 'p': 2, 'l': 2, 'a': 4, '$i': 1, 'in': 3, 'nt': 1, 'te': 1,
'er': 1, 'rd': 1, 'di': 1, 'is': 1, 'sc': 1, 'ci': 1, 'ip': 1, 'pl': 1,
'li': 1, 'na': 1, 'ar': 1, 'ri': 2, 'ia': 2, 'an': 1, 'n#': 1, '$n': 1,
'it': 1, 'ne': 1, 'tr': 1, 'ed': 1, 'ds': 1, 'ic': 1, 'si': 1, 'cp': 1,
'il': 1, 'pi': 1, 'ln': 1, 'nr': 1, 'ai': 1, 'ra': 1, 'a#': 1})

Added in version 0.1.0.

Changed in version 0.4.0: Broke tokenization functions out into tokenize method
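
The padding and counting behaviour shown in the examples above can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the abydos implementation: the function name `qgram_counts` is hypothetical, it assumes the string is padded with `qval - 1` copies of each start/stop symbol, and it ignores the `skip` and `scaler` options.

```python
from collections import Counter


def qgram_counts(text, qval=2, start_stop="$#"):
    """Count contiguous q-grams of text (hypothetical helper)."""
    if qval < 1:
        raise ValueError("qval must be at least 1")
    if start_stop and qval > 1:
        # Assumption: pad with qval - 1 copies of the start and stop symbols.
        # A one-character start_stop supplies both symbols.
        text = start_stop[0] * (qval - 1) + text + start_stop[-1] * (qval - 1)
    return Counter(text[i:i + qval] for i in range(len(text) - qval + 1))


bigrams = qgram_counts("AATTATAT")
trigrams = qgram_counts("AATTATAT", qval=3, start_stop="")
```

Under these assumptions, `bigrams` and `trigrams` hold the same counts as the first and third examples above.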

tokenize(string: str) → QGrams¶

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:

stringstr: The string to tokenize

Added in version 0.4.0.

`_q_skipgrams`¶

abydos.tokenizer._q_skipgrams.

Q-Skipgrams multi-set class

class distances._q_skipgrams.QSkipgrams(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9)¶

Bases: _Tokenizer

A q-skipgram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, where q-gram refers to sequences of characters in a word or string.

Added in version 0.4.0.

Methods

`count`()	Return token count.
`count_unique`()	Return the number of unique elements.
`get_counter`()	Return the tokens as a Counter object.
`get_list`()	Return the tokens as an ordered list.
`get_set`()	Return the unique tokens as a set.
`tokenize`(string)	Tokenize the term and store it.

__init__(qval: int | Iterable[int] = 2, start_stop: str = '$#', scaler: str | Callable[[float], float] | None = None, ssk_lambda: float | Iterable[float] = 0.9) → None¶

Initialize QSkipgrams.

Parameters:

qvalint or Iterable

The q-gram length (defaults to 2), can be an integer, range object, or list

start_stopstr

A string of length >= 0 indicating start & stop symbols. If the string is ‘’, q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

scalerNone, str, or function

A scaling function for the Counter:

None : no scaling

‘set’ : All non-zero values are set to 1.

‘length’ : Each token has weight equal to its length.

‘length-log’ : Each token has weight equal to the log of its length + 1.

‘length-exp’ : Each token has weight equal to e raised to its length.

a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

‘SSK’ : Applies weighting according to the substring kernel rules of :cite:`Lodhi:2002`.

ssk_lambdafloat or Iterable

A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05).

Raises:

ValueError: Use WhitespaceTokenizer instead of qval=0.

Examples

>>> QSkipgrams().tokenize('AATTAT')
QSkipgrams({'$A': 3, '$T': 3, '$#': 1, 'AA': 3, 'AT': 7, 'A#': 3,
'TT': 3, 'TA': 2, 'T#': 3})

>>> QSkipgrams(qval=1, start_stop='').tokenize('AATTAT')
QSkipgrams({'A': 3, 'T': 3})

>>> QSkipgrams(qval=3, start_stop='').tokenize('AATTAT')
QSkipgrams({'AAT': 5, 'AAA': 1, 'ATT': 6, 'ATA': 4, 'TTA': 1, 'TTT': 1,
'TAT': 2})

>>> QSkipgrams(start_stop='').tokenize('ABCD')
QSkipgrams({'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 1})

>>> QSkipgrams().tokenize('Colin')
QSkipgrams({'$C': 1, '$o': 1, '$l': 1, '$i': 1, '$n': 1, '$#': 1,
'Co': 1, 'Cl': 1, 'Ci': 1, 'Cn': 1, 'C#': 1, 'ol': 1, 'oi': 1, 'on': 1,
'o#': 1, 'li': 1, 'ln': 1, 'l#': 1, 'in': 1, 'i#': 1, 'n#': 1})

>>> QSkipgrams(qval=3).tokenize('AACTAGAAC')
QSkipgrams({'$$A': 5, '$$C': 2, '$$T': 1, '$$G': 1, '$$#': 2,
'$AA': 20, '$AC': 14, '$AT': 4, '$AG': 6, '$A#': 20, '$CT': 2,
'$CA': 6, '$CG': 2, '$CC': 2, '$C#': 8, '$TA': 6, '$TG': 2, '$TC': 2,
'$T#': 4, '$GA': 4, '$GC': 2, '$G#': 4, '$##': 2, 'AAC': 11, 'AAT': 1,
'AAA': 10, 'AAG': 3, 'AA#': 20, 'ACT': 2, 'ACA': 6, 'ACG': 2, 'ACC': 2,
'AC#': 14, 'ATA': 6, 'ATG': 2, 'ATC': 2, 'AT#': 4, 'AGA': 6, 'AGC': 3,
'AG#': 6, 'A##': 5, 'CTA': 3, 'CTG': 1, 'CTC': 1, 'CT#': 2, 'CAG': 1,
'CAA': 3, 'CAC': 3, 'CA#': 6, 'CGA': 2, 'CGC': 1, 'CG#': 2, 'CC#': 2,
'C##': 2, 'TAG': 1, 'TAA': 3, 'TAC': 3, 'TA#': 6, 'TGA': 2, 'TGC': 1,
'TG#': 2, 'TC#': 2, 'T##': 1, 'GAA': 1, 'GAC': 2, 'GA#': 4, 'GC#': 2,
'G##': 1})
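
The unweighted counts in the examples above are consistent with counting every order-preserving selection of q characters from the padded string. The sketch below illustrates that reading with `itertools.combinations`; the name `skipgram_counts` is hypothetical, the padding with `qval - 1` copies of each symbol is an assumption, and the `scaler` and `ssk_lambda` options are not covered.

```python
from collections import Counter
from itertools import combinations


def skipgram_counts(text, qval=2, start_stop="$#"):
    """Count q-skipgrams of text (hypothetical helper, not the abydos code)."""
    if start_stop and qval > 1:
        # Assumption: pad with qval - 1 copies of the start and stop symbols.
        text = start_stop[0] * (qval - 1) + text + start_stop[-1] * (qval - 1)
    # combinations() keeps the characters in their original order,
    # so each tuple is one skipgram occurrence.
    return Counter("".join(chars) for chars in combinations(text, qval))


plain = skipgram_counts("ABCD", start_stop="")
padded = skipgram_counts("AATTAT")
```

With these assumptions, `plain` matches the 'ABCD' example and `padded` matches the first 'AATTAT' example.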

QSkipgrams may also be used to produce weights in accordance with the substring kernel rules of :cite:`Lodhi:2002` by passing the scaler value 'SSK':

>>> QSkipgrams(scaler='SSK').tokenize('AACTAGAAC')
QSkipgrams(, {'$A': 2.8883286990000006, '$C': 1.0047784401000002,
'$T': 0.5904900000000001, '$G': 0.4782969000000001,
'$#': 0.31381059609000006, 'AA': 6.170192010000001, 'AC': 4.486377699,
'AT': 1.3851, 'AG': 1.931931, 'A#': 2.6526399291000002, 'CT': 0.81,
'CA': 1.850931, 'CG': 0.6561, 'CC': 0.4782969000000001,
'C#': 1.2404672100000003, 'TA': 2.05659, 'TG': 0.7290000000000001,
'TC': 0.531441, 'T#': 0.4782969000000001, 'GA': 1.5390000000000001,
'GC': 0.6561, 'G#': 0.5904900000000001})

Added in version 0.4.0.
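
The 'SSK' weights in the example above are consistent with each skipgram occurrence contributing lambda raised to the length of the span it covers in the padded string. The following sketch illustrates that interpretation only; `ssk_weights` is a hypothetical name, and both the padding rule and the span-based weighting are assumptions inferred from the documented output rather than taken from the abydos source.

```python
from collections import defaultdict
from itertools import combinations


def ssk_weights(text, qval=2, start_stop="$#", ssk_lambda=0.9):
    """Return span-discounted skipgram weights (hypothetical helper)."""
    if start_stop and qval > 1:
        # Assumption: pad with qval - 1 copies of the start and stop symbols.
        text = start_stop[0] * (qval - 1) + text + start_stop[-1] * (qval - 1)
    weights = defaultdict(float)
    for idx in combinations(range(len(text)), qval):
        gram = "".join(text[i] for i in idx)
        # The span runs from the first to the last selected character.
        span = idx[-1] - idx[0] + 1
        weights[gram] += ssk_lambda ** span
    return dict(weights)


weights = ssk_weights("AACTAGAAC")
```

For instance, 'CT' occurs once as adjacent characters, so its weight is 0.9 squared, while '$#' spans the whole padded string of 11 characters.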

tokenize(string: str) → QSkipgrams¶

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:

stringstr: The string to tokenize

Added in version 0.4.0.

`_ratcliff_obershelp`¶

abydos.distance._ratcliff_obershelp.

Ratcliff-Obershelp similarity

class distances._ratcliff_obershelp.RatcliffObershelp(**kwargs: Any)¶

Bases: _Distance

Ratcliff-Obershelp similarity.

This follows the Ratcliff-Obershelp algorithm :cite:`Ratcliff:1988` to derive a similarity measure:

Find the length of the longest common substring in src & tar.

Recurse on the strings to the left & right of this substring in src & tar. The base case is a zero-length common substring, in which case, return 0. Otherwise, return the sum of the length of the current longest common substring and the left & right recursed sums.

Multiply this length by 2 and divide by the sum of the lengths of src & tar.

Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
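
The three steps above can be sketched in plain Python as follows. This is an illustrative reading of the description, not the abydos code: all function names are hypothetical, and ties between equally long common substrings are broken by taking the first one found, which is an assumption.

```python
def longest_common_substring(src, tar):
    """Return (src_start, tar_start, length) of the first longest match."""
    best = (0, 0, 0)
    prev = [0] * (len(tar) + 1)
    for i in range(1, len(src) + 1):
        cur = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                # Extend the common suffix ending at src[i-1] and tar[j-1].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def matched_length(src, tar):
    """Sum the lengths of recursively matched common substrings."""
    s_start, t_start, length = longest_common_substring(src, tar)
    if length == 0:
        return 0
    left = matched_length(src[:s_start], tar[:t_start])
    right = matched_length(src[s_start + length:], tar[t_start + length:])
    return length + left + right


def ratcliff_obershelp_sim(src, tar):
    """Return 2 * matched length / total length (hypothetical helper)."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return 2 * matched_length(src, tar) / (len(src) + len(tar))
```

For 'cat' and 'hat', the longest common substring is 'at', nothing matches to its left or right, and the result is 2 * 2 / 6.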

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Ratcliff-Obershelp similarity of two strings.

sim(src: str, tar: str) → float¶

Return the Ratcliff-Obershelp similarity of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Ratcliff-Obershelp similarity

Examples

>>> cmp = RatcliffObershelp()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

`_refined_soundex`¶

abydos.phonetic._refined_soundex.

Refined Soundex

class distances._refined_soundex.RefinedSoundex(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False)¶

Bases: _Phonetic

Refined Soundex.

This is Soundex, but with more character classes. It is defined in :cite:`Boyce:1998`.

Added in version 0.3.6.

Methods

`encode`(word)	Return the Refined Soundex code for a word.
`encode_alpha`(word)	Return the alphabetic Refined Soundex code for a word.

__init__(max_length: int = -1, zero_pad: bool = False, retain_vowels: bool = False) → None¶

Initialize RefinedSoundex instance.

Parameters:

max_lengthint: The length of the code returned (defaults to unlimited)
zero_padbool: Pad the end of the return value with 0s to achieve a max_length string
retain_vowelsbool: Retain vowels (as 0) in the resulting code

Added in version 0.4.0.

encode(word: str) → str¶

Return the Refined Soundex code for a word.

Parameters:

wordstr: The word to transform

Returns:

str: The Refined Soundex value

Examples

>>> pe = RefinedSoundex()
>>> pe.encode('Christopher')
'C93619'
>>> pe.encode('Niall')
'N7'
>>> pe.encode('Smith')
'S86'
>>> pe.encode('Schmidt')
'S386'

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

encode_alpha(word: str) → str¶

Return the alphabetic Refined Soundex code for a word.

Parameters:

wordstr: The word to transform

Returns:

str: The alphabetic Refined Soundex value

Examples

>>> pe = RefinedSoundex()
>>> pe.encode_alpha('Christopher')
'CRKTPR'
>>> pe.encode_alpha('Niall')
'NL'
>>> pe.encode_alpha('Smith')
'SNT'
>>> pe.encode_alpha('Schmidt')
'SKNT'

Added in version 0.4.0.

`_regexp`¶

abydos.tokenizer._regexp.

Regexp tokenizer

class distances._regexp.RegexpTokenizer(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0)¶

Bases: _Tokenizer

A regexp tokenizer.

Added in version 0.4.0.

Methods

`count`()	Return token count.
`count_unique`()	Return the number of unique elements.
`get_counter`()	Return the tokens as a Counter object.
`get_list`()	Return the tokens as an ordered list.
`get_set`()	Return the unique tokens as a set.
`tokenize`(string)	Tokenize the term and store it.

__init__(scaler: str | Callable[[float], float] | None = None, regexp: str = '\\w+', flags: int = 0) → None¶

Initialize tokenizer.

Parameters:

scalerNone, str, or function

A scaling function for the Counter:

None : no scaling

‘set’ : All non-zero values are set to 1.

‘length’ : Each token has weight equal to its length.

‘length-log’ : Each token has weight equal to the log of its length + 1.

‘length-exp’ : Each token has weight equal to e raised to its length.

a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

regexpstr

A regular expression used to match tokens in the input text.

flagsint

Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.

Added in version 0.4.0.

tokenize(string: str) → RegexpTokenizer¶

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters:

stringstr: The string to tokenize

Examples

>>> RegexpTokenizer(regexp=r'[^-]+').tokenize('AA-CT-AG-AA-CD')
RegexpTokenizer({'AA': 2, 'CT': 1, 'AG': 1, 'CD': 1})

Added in version 0.4.0.
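
The matching behaviour in the example above can be reproduced with Python's `re` module and a `Counter`. This is a minimal sketch under stated assumptions: `regexp_token_counts` is a hypothetical name, scaling is not covered, and the `\S+` pattern used to imitate whitespace tokenization is an assumption rather than a pattern taken from abydos.

```python
import re
from collections import Counter


def regexp_token_counts(text, regexp=r"\w+", flags=0):
    """Count the non-overlapping matches of regexp in text (hypothetical)."""
    # finditer with group(0) returns whole matches even if the
    # pattern contains capture groups.
    return Counter(m.group(0) for m in re.finditer(regexp, text, flags))


dashed = regexp_token_counts("AA-CT-AG-AA-CD", regexp=r"[^-]+")
spaced = regexp_token_counts("a b c f a c g e a b", regexp=r"\S+")
```

Here `dashed` holds the same counts as the RegexpTokenizer example above.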

`_rouge_l`¶

abydos.distance._rouge_l.

Rouge-L similarity

class distances._rouge_l.RougeL(**kwargs: Any)¶

Bases: _Distance

Rouge-L similarity.

Rouge-L similarity :cite:`Lin:2004`

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar[, beta])	Return the Rouge-L similarity of two strings.

__init__(**kwargs: Any) → None¶

Initialize RougeL instance.

Parameters:

**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

sim(src: str, tar: str, beta: float = 8) → float¶

Return the Rouge-L similarity of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison
betaint or float: A weighting factor to prejudice similarity towards src

Returns:

float: Rouge-L similarity

Examples

>>> cmp = RougeL()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6018518518518519
>>> cmp.sim('aluminum', 'Catalan')
0.3757225433526012
>>> cmp.sim('ATCG', 'TAGC')
0.5

Added in version 0.4.0.
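
The documented values above are consistent with the usual LCS-based F-measure, with src treated as the reference. The sketch below illustrates that reading; the function names are hypothetical, and the assignment of recall to src and precision to tar is an assumption inferred from the example outputs rather than taken from the abydos source.

```python
def lcs_length(src, tar):
    """Return the length of the longest common subsequence."""
    prev = [0] * (len(tar) + 1)
    for s_char in src:
        cur = [0]
        for j, t_char in enumerate(tar, 1):
            if s_char == t_char:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_sim(src, tar, beta=8):
    """Return an LCS-based F-measure weighted by beta (hypothetical helper)."""
    lcs = lcs_length(src, tar)
    if lcs == 0:
        return 0.0
    recall = lcs / len(src)      # assumption: src plays the reference role
    precision = lcs / len(tar)
    beta_sq = beta * beta
    return (1 + beta_sq) * recall * precision / (recall + beta_sq * precision)
```

For 'Niall' and 'Neil' the longest common subsequence is 'Nil', giving a recall of 3/5 and a precision of 3/4.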

`_ssk`¶

abydos.distance._ssk.

String subsequence kernel (SSK) similarity

class distances._ssk.SSK(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any)¶

Bases: _TokenDistance

String subsequence kernel (SSK) similarity.

This is based on :cite:`Lodhi:2002`.

Added in version 0.4.1.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the normalized SSK similarity of two strings.
`sim_score`(src, tar)	Return the SSK similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, ssk_lambda: float = 0.9, **kwargs: Any) → None¶

Initialize SSK instance.

Parameters:

tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
ssk_lambdafloat or Iterable: A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in :cite:`Lodhi:2002`. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05).
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-skipgram. Using this parameter and tokenizer=None will cause the instance to use the QSkipgrams tokenizer with this q value.

Added in version 0.4.1.

sim(src: str, tar: str) → float¶

Return the normalized SSK similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Normalized string subsequence kernel similarity

Examples

>>> cmp = SSK()
>>> cmp.sim('cat', 'hat')
0.3558718861209964
>>> cmp.sim('Niall', 'Neil')
0.4709007822130597
>>> cmp.sim('aluminum', 'Catalan')
0.13760157193822603
>>> cmp.sim('ATCG', 'TAGC')
0.6140899528060498

Added in version 0.4.1.
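
The normalized values above are consistent with a kernel over span-discounted 2-skipgrams of the unpadded strings, divided by the geometric mean of the two self-kernels. The sketch below illustrates that reading; all names are hypothetical, and the absence of padding, the span-based discount, and the normalization are assumptions inferred from the documented outputs.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def ssk_weights(text, qval=2, ssk_lambda=0.9):
    """Return skipgram weights, discounting each occurrence by its span."""
    weights = defaultdict(float)
    for idx in combinations(range(len(text)), qval):
        gram = "".join(text[i] for i in idx)
        weights[gram] += ssk_lambda ** (idx[-1] - idx[0] + 1)
    return weights


def ssk_kernel(src, tar, qval=2, ssk_lambda=0.9):
    """Return the dot product of the two weight vectors."""
    src_w = ssk_weights(src, qval, ssk_lambda)
    tar_w = ssk_weights(tar, qval, ssk_lambda)
    return sum(w * tar_w[gram] for gram, w in src_w.items() if gram in tar_w)


def ssk_sim(src, tar, qval=2, ssk_lambda=0.9):
    """Return the kernel normalized to [0, 1] (hypothetical helper)."""
    if src == tar:
        return 1.0
    norm = sqrt(ssk_kernel(src, src, qval, ssk_lambda)
                * ssk_kernel(tar, tar, qval, ssk_lambda))
    return ssk_kernel(src, tar, qval, ssk_lambda) / norm if norm else 0.0
```

For 'cat' and 'hat', the only shared skipgram is 'at', so the kernel is 0.9 to the fourth power and each self-kernel is 2 * 0.9^4 + 0.9^6.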

sim_score(src: str, tar: str) → float¶

Return the SSK similarity of two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: String subsequence kernel similarity

Examples

>>> cmp = SSK()
>>> cmp.dist_abs('cat', 'hat')
0.6441281138790036
>>> cmp.dist_abs('Niall', 'Neil')
0.5290992177869402
>>> cmp.dist_abs('aluminum', 'Catalan')
0.862398428061774
>>> cmp.dist_abs('ATCG', 'TAGC')
0.38591004719395017

Added in version 0.4.1.

`_tichy`¶

abydos.distance._tichy.

Tichy edit distance

class distances._tichy.Tichy(cost: Tuple[int, int] = (1, 1), **kwargs: Any)¶

Bases: _Distance

Tichy edit distance.

Tichy described an algorithm, implemented below, in :cite:`Tichy:1984`. Following this, :cite:`Cormode:2003` identifies an interpretation of this algorithm’s output as a distance measure, which is largely followed by the methods below.

Tichy’s algorithm locates substrings of a string S to be copied in order to create a string T. The only other operations his algorithm uses for string reconstruction are add operations.

Methods

`dist`(src, tar)	Return the normalized Tichy edit distance between two strings.
`dist_abs`(src, tar)	Return the Tichy distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

Notes

While :cite:`Cormode:2003` counts only move operations to calculate distance, I give the option (enabled by default) of counting add operations as part of the distance measure. To ignore the cost of add operations, set the cost value to (1, 0), for example, when initializing the object. Further, in the case that S and T are identical, a distance of 0 will be returned, even though this would still be counted as a single move operation spanning the whole of string S.

Added in version 0.4.0.

__init__(cost: Tuple[int, int] = (1, 1), **kwargs: Any) → None¶

Initialize Tichy instance.

Parameters:

costtuple: A 2-tuple representing the cost of the two possible edits: block moves and adds (by default: (1, 1))
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized Tichy edit distance between two strings.

The Tichy distance is normalized by dividing the distance by the length of the tar string.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: The normalized Tichy distance between src & tar

Examples

>>> cmp = Tichy()
>>> round(cmp.dist('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.dist('Niall', 'Neil'), 12)
1.0
>>> cmp.dist('aluminum', 'Catalan')
0.8571428571428571
>>> cmp.dist('ATCG', 'TAGC')
1.0

Added in version 0.4.0.

dist_abs(src: str, tar: str) → float¶

Return the Tichy distance between two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

int (may return a float if cost has float values): The Tichy distance between src & tar

Examples

>>> cmp = Tichy()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
4
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4

Added in version 0.4.0.
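
The distances in the examples above can be reproduced by a greedy scan of the target string: at each position, copy the longest run that occurs anywhere in the source (one block move), or add a single character when nothing matches. The sketch below illustrates this; the function names are hypothetical, and the greedy longest-match strategy is an assumption about how the block moves are chosen.

```python
def tichy_distance(src, tar, cost=(1, 1)):
    """Count block moves and adds needed to build tar from src (sketch)."""
    if src == tar:
        # Identical strings are defined to have distance 0.
        return 0
    move_cost, add_cost = cost
    total = 0
    pos = 0
    while pos < len(tar):
        # Grow the longest prefix of tar[pos:] that is a substring of src.
        length = 0
        while pos + length < len(tar) and tar[pos:pos + length + 1] in src:
            length += 1
        if length:
            total += move_cost   # one block move copies the whole run
            pos += length
        else:
            total += add_cost    # character absent from src: add it
            pos += 1
    return total


def tichy_normalized(src, tar, cost=(1, 1)):
    """Divide the distance by the length of tar (0.0 for an empty tar)."""
    return tichy_distance(src, tar, cost) / len(tar) if tar else 0.0
```

For 'cat' and 'hat', 'h' must be added and 'at' is copied in one move, giving a distance of 2 and a normalized distance of 2/3. Passing `cost=(1, 0)` ignores add operations, as described in the Notes for this class.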

`_tokenizer`¶

abydos.tokenizer._tokenizer.

_Tokenizer base class

`_token_distance`¶

abydos.distance._token_distance.

The distance._token_distance._TokenDistance module implements abstract class _TokenDistance.

`_typo`¶

abydos.distance._typo.

Typo edit distance functions.

class distances._typo.Typo(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)¶

Bases: _Distance

Typo distance.

This is inspired by Typo-Distance :cite:`Song:2011`, and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.

Added in version 0.3.6.

Methods

`dist`(src, tar)	Return the normalized typo distance between two strings.
`dist_abs`(src, tar)	Return the typo distance between two strings.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return similarity.

__init__(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)¶

Initialize Typo instance.

Parameters:

metricstr: Supported values include: euclidean, manhattan, log-euclidean, and log-manhattan
costtuple: A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1.0, 1.0, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used.
layoutstr: Name of the keyboard layout to use (Currently supported: QWERTY, Dvorak, AZERTY, QWERTZ, auto). If auto is selected, the class will attempt to determine an appropriate keyboard based on the supplied words.
failsafebool: If True, substitution of an unknown character (one not present on the selected keyboard) will incur a cost equal to an insertion plus a deletion.
**kwargs: Arbitrary keyword arguments

Added in version 0.4.0.

dist(src: str, tar: str) → float¶

Return the normalized typo distance between two strings.

This is typo distance, normalized to [0, 1].

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Normalized typo distance

Examples

>>> cmp = Typo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.527046276695
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.565028153987
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.569035593729
>>> cmp.dist('ATCG', 'TAGC')
0.625

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

dist_abs(src: str, tar: str) → float¶

Return the typo distance between two strings.

Parameters:

srcstr: Source string for comparison
tarstr: Target string for comparison

Returns:

float: Typo distance

Raises:

ValueError: char not found in any keyboard layouts

Examples

>>> cmp = Typo()
>>> cmp.dist_abs('cat', 'hat')
1.5811388300841898
>>> cmp.dist_abs('Niall', 'Neil')
2.8251407699364424
>>> cmp.dist_abs('Colin', 'Cuilen')
3.414213562373095
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5

>>> cmp = Typo(metric='manhattan')
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.0
>>> cmp.dist_abs('Colin', 'Cuilen')
3.5
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5

>>> cmp = Typo(metric='log-manhattan')
>>> cmp.dist_abs('cat', 'hat')
0.8047189562170501
>>> cmp.dist_abs('Niall', 'Neil')
2.2424533248940004
>>> cmp.dist_abs('Colin', 'Cuilen')
2.242453324894
>>> cmp.dist_abs('ATCG', 'TAGC')
2.3465735902799727

Added in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

`_warrens_iv`¶

abydos.distance._warrens_iv.

Warrens IV similarity

Bases: _TokenDistance

Warrens IV similarity.

For two sets X and Y and a population N, Warrens IV similarity :cite:`Warrens:2008` is

\[sim_{WarrensIV}(X, Y) = \frac{4|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {4|X \cap Y| \cdot |(N \setminus X) \setminus Y| + (|X \cap Y| + |(N \setminus X) \setminus Y|) (|X \setminus Y| + |Y \setminus X|)}\]

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{WarrensIV} = \frac{4ad}{4ad + (a+d)(b+c)}\]

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Warrens IV similarity of two strings.

Initialize WarrensIV instance.

Parameters:

alphabetCounter, collection, int, or None: This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Warrens IV similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Warrens IV similarity

Examples

>>> cmp = WarrensIV()
>>> cmp.sim('cat', 'hat')
0.666095890410959
>>> cmp.sim('Niall', 'Neil')
0.5326918120113412
>>> cmp.sim('aluminum', 'Catalan')
0.21031040612607685
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.
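
The confusion-table form of the formula can be evaluated directly once a, b, c, and d are known. The sketch below is illustrative only: `warrens_iv_sim` is a hypothetical name, the return value for a zero denominator is a convention chosen for this sketch, and the population of 784 possible bigrams used in the usage lines is an assumption chosen because it reproduces the documented 'cat'/'hat' value.

```python
def warrens_iv_sim(a, b, c, d):
    """Evaluate 4ad / (4ad + (a + d)(b + c)) (hypothetical helper)."""
    numerator = 4 * a * d
    denominator = numerator + (a + d) * (b + c)
    # Returning 0.0 for an empty table is a convention of this sketch.
    return numerator / denominator if denominator else 0.0


# 'cat' vs 'hat' as padded bigrams: {$c, ca, at, t#} and {$h, ha, at, t#}.
a, b, c = 2, 2, 2          # shared, only in src, only in tar
d = 784 - (a + b + c)      # assumed population of 784 possible bigrams
cat_hat = warrens_iv_sim(a, b, c, d)
```

When the two strings share no tokens, a is 0 and the similarity is 0, which agrees with the 'ATCG'/'TAGC' example.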

`_weighted_jaccard`¶

abydos.distance._weighted_jaccard.

Weighted Jaccard similarity

class distances._weighted_jaccard.WeightedJaccard(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any)¶

Bases: _TokenDistance

Weighted Jaccard similarity.

For two sets X and Y and a weight w, the Weighted Jaccard similarity :cite:`Legendre:1998` is

\[sim_{Jaccard_w}(X, Y) = \frac{w \cdot |X \cap Y|} {w \cdot |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]

Here, the intersection between the two sets is weighted by w. Compare to Jaccard similarity ($w = 1$), and to Dice similarity ($w = 2$). In the default case, the weight of the intersection is 3, following :cite:`Legendre:1998`.

In 2x2 confusion table terms, where a+b+c+d=n, this is

\[sim_{Jaccard_w} = \frac{w\cdot a}{w\cdot a+b+c}\]

Added in version 0.4.0.

Methods

`dist`(src, tar)	Return distance.
`dist_abs`(src, tar)	Return absolute distance.
`set_params`(**kwargs)	Store params in the params dict.
`sim`(src, tar)	Return the Weighted Jaccard similarity of two strings.

__init__(tokenizer: _Tokenizer | None = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any) → None¶

Initialize WeightedJaccard instance.

Parameters:

tokenizer_Tokenizer: A tokenizer instance from the abydos.tokenizer package
intersection_typestr: Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
weightint: The weight to apply to the intersection cardinality. (3, by default.)
**kwargs: Arbitrary keyword arguments

Other Parameters:

qvalint: The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric_Distance: A string distance measure class for use in the soft and fuzzy variants.
thresholdfloat: A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.

Added in version 0.4.0.

sim(src: str, tar: str) → float¶

Return the Weighted Jaccard similarity of two strings.

Parameters:

srcstr: Source string (or QGrams/Counter objects) for comparison
tarstr: Target string (or QGrams/Counter objects) for comparison

Returns:

float: Weighted Jaccard similarity

Examples

>>> cmp = WeightedJaccard()
>>> cmp.sim('cat', 'hat')
0.6
>>> cmp.sim('Niall', 'Neil')
0.46153846153846156
>>> cmp.sim('aluminum', 'Catalan')
0.16666666666666666
>>> cmp.sim('ATCG', 'TAGC')
0.0

Added in version 0.4.0.
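
The documented values above can be reproduced with multisets of padded bigrams and the formula given for this class. This is a minimal sketch rather than the abydos implementation: the function names are hypothetical, and the use of '$' and '#' padding with q = 2 and crisp multiset intersection is an assumption about the defaults.

```python
from collections import Counter


def padded_bigrams(text, start_stop="$#"):
    """Return the multiset of bigrams of the padded string."""
    text = start_stop[0] + text + start_stop[-1]
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def weighted_jaccard_sim(src, tar, weight=3):
    """Evaluate w*a / (w*a + b + c) on bigram multisets (sketch)."""
    src_tok, tar_tok = padded_bigrams(src), padded_bigrams(tar)
    a = sum((src_tok & tar_tok).values())   # |X ∩ Y|
    b = sum((src_tok - tar_tok).values())   # |X \ Y|
    c = sum((tar_tok - src_tok).values())   # |Y \ X|
    return weight * a / (weight * a + b + c)
```

Setting `weight=1` in this sketch gives plain Jaccard similarity, and `weight=2` gives Dice similarity, as noted in the class description.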

`_whitespace`¶

abydos.tokenizer._whitespace.

Whitespace tokenizer

class distances._whitespace.WhitespaceTokenizer(scaler: str | Callable[[float], float] | None = None, flags: int = 0)¶

Bases: RegexpTokenizer

A whitespace tokenizer.

Methods

`count`()	Return token count.
`count_unique`()	Return the number of unique elements.
`get_counter`()	Return the tokens as a Counter object.
`get_list`()	Return the tokens as an ordered list.
`get_set`()	Return the unique tokens as a set.
`tokenize`(string)	Tokenize the term and store it.

Examples

>>> WhitespaceTokenizer().tokenize('a b c f a c g e a b')
WhitespaceTokenizer({'a': 3, 'b': 2, 'c': 2, 'f': 1, 'g': 1, 'e': 1})

Added in version 0.4.0.

__init__(scaler: str | Callable[[float], float] | None = None, flags: int = 0) → None¶

Initialize tokenizer.

Parameters:

scalerNone, str, or function

A scaling function for the Counter:

None : no scaling

‘set’ : All non-zero values are set to 1.

‘length’ : Each token has weight equal to its length.

‘length-log’ : Each token has weight equal to the log of its length + 1.

‘length-exp’ : Each token has weight equal to e raised to its length.

a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

flagsint

Flags to pass to the regular expression matcher. See the documentation on Python’s re module for details.

Added in version 0.4.0.

